Annotation Agreement
Annotation agreement measures the extent to which multiple annotators assign the same labels to the same data instances. In text classification tasks, agreement is one of the most important indicators of dataset quality because it reflects the clarity of the annotation guidelines, the complexity of the task, and the consistency of the annotators. High agreement suggests that the labels are reliable and reproducible, while low agreement may indicate ambiguous definitions, insufficient annotator training, or inherently subjective phenomena.
Annotation agreement should be reported because it provides evidence that the guidelines are understandable and that the labels are reproducible. Agreement values should always be interpreted together with the task difficulty and the degree of subjectivity involved.
Why Annotation Agreement Matters
Annotation agreement serves various purposes:
- Evaluates the reliability of the annotated dataset.
- Identifies ambiguities in the annotation guidelines.
- Detects inconsistencies among annotators.
- Provides evidence of dataset quality for publications and benchmark releases.
- Helps determine whether a task is objectively measurable or highly subjective.
Agreement should be calculated and reported for every human-annotated dataset in which instances are labeled by two or more annotators.
Percentage Agreement
The simplest measure of agreement is percentage agreement, which calculates the proportion of instances for which annotators assigned the same label.
Agreement % = (Number of Agreed Instances / Total Number of Instances) * 100
*For example, if two annotators label 1,000 texts and agree on 850 of them:*
Agreement % = (850 / 1000) * 100 = 85%
Although easy to understand, percentage agreement does not account for agreement occurring by chance and should not be the only metric reported.
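The formula above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `percentage_agreement` and the sample labels are assumptions made for the example, not part of any library.

```python
def percentage_agreement(labels_a, labels_b):
    """Return the percentage of instances on which two annotators agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same instances.")
    if len(labels_a) == 0:
        raise ValueError("At least one instance is required.")
    # Count the positions where both annotators chose the same label.
    agreed = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100 * agreed / len(labels_a)


# Hypothetical labels: the annotators agree on 4 of the 5 instances.
annotator1 = [0, 1, 1, 0, 2]
annotator2 = [0, 1, 0, 0, 2]
print(percentage_agreement(annotator1, annotator2))  # 80.0
```

Because the function compares labels position by position, both lists must list the instances in the same order.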
Agreement Between Two Annotators
When exactly two annotators label each instance, Cohen's Kappa is the most commonly used agreement metric. Cohen's Kappa adjusts for the amount of agreement that could occur purely by chance.
Kappa = (Observed Agreement - Expected Agreement) / (1 - Expected Agreement)
Or
Kappa = (Po - Pe) / (1 - Pe)
Where:
Observed agreement (Po) is the proportion of instances where the annotators actually agreed.
Po = Number of agreements / Total number of items
Expected Agreement (Pe) represents the level of agreement that would be expected to occur purely by chance, given the distribution of labels assigned by each annotator. It is calculated by determining the probability that both annotators independently select the same category and then summing these probabilities across all categories.
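To make the two quantities concrete, the following sketch computes Po, Pe, and Kappa by hand using only the Python standard library. The function name `cohen_kappa_manual` and the sample labels are illustrative assumptions; in practice a tested library implementation is usually preferable.

```python
from collections import Counter


def cohen_kappa_manual(labels_a, labels_b):
    """Compute Cohen's Kappa for two annotators from first principles."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("Both annotators must label the same non-empty set of instances.")

    # Observed agreement (Po): share of instances with identical labels.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement (Pe): for each category, the probability that both
    # annotators pick it independently, summed over all categories.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    pe = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

    if pe == 1:
        # Both annotators used a single identical label everywhere.
        raise ValueError("Kappa is undefined when expected agreement is 1.")
    return (po - pe) / (1 - pe)


# Hypothetical labels for five instances.
annotator1 = [0, 1, 1, 0, 2]
annotator2 = [0, 1, 0, 0, 2]
# Here Po = 4/5 = 0.8 and Pe = (2*3 + 2*1 + 1*1) / 25 = 0.36.
print(round(cohen_kappa_manual(annotator1, annotator2), 4))  # 0.6875
```

In this example the raw agreement of 80% drops to a Kappa of about 0.69 once chance agreement is removed.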
Cohen's Kappa is widely used in sentiment analysis, hate speech detection, topic classification, emotion classification, and many other NLP tasks involving two annotators.
Python Example
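from sklearn.metrics import cohen_kappa_score
# Labels assigned by each annotator to the same five instances, in the same order
annotator1 = [0, 1, 1, 0, 2]
annotator2 = [0, 1, 0, 0, 2]
# Chance-corrected agreement between the two annotators
kappa = cohen_kappa_score(annotator1, annotator2)
print(kappa)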
Agreement Among Three or More Annotators
Many NLP datasets use three or more annotators per instance to improve reliability and reduce the influence of individual biases.
When more than two annotators are involved, commonly used agreement measures include:
Fleiss' Kappa
Fleiss' Kappa extends Cohen's Kappa to multiple annotators and is one of the most widely reported agreement measures in NLP datasets.
It is appropriate when:
- Three or more annotators label each instance.
- Every instance receives the same number of annotations.
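Fleiss' Kappa (κ) = (P - Pe) / (1 - Pe)
Where:
P is the mean observed agreement: the proportion of agreeing annotator pairs for each instance, averaged over all instances.
Pe is the agreement expected by chance, calculated by summing the squared proportions of all annotations assigned to each category.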
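As a rough illustration of how these quantities fit together, the sketch below computes Fleiss' Kappa with the Python standard library. The function name `fleiss_kappa_manual`, the input layout (one list of labels per instance), and the sample ratings are assumptions made for this example.

```python
from collections import Counter


def fleiss_kappa_manual(ratings):
    """Compute Fleiss' Kappa.

    ratings: list of per-instance label lists; every instance must have
    the same number of annotations (at least two).
    """
    if not ratings:
        raise ValueError("At least one instance is required.")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("Every instance needs the same number (>= 2) of annotations.")

    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this instance.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    # P: mean observed agreement over all instances.
    p_bar = sum(per_item_agreement) / len(ratings)

    # Pe: sum of squared overall category proportions.
    total = len(ratings) * n_raters
    p_e = sum((t / total) ** 2 for t in category_totals.values())

    if p_e == 1:
        raise ValueError("Kappa is undefined when only one category is ever used.")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: four instances, each labeled by three annotators.
ratings = [
    ["pos", "pos", "pos"],
    ["pos", "neg", "pos"],
    ["neg", "neg", "neg"],
    ["neg", "neg", "pos"],
]
# Here P = (1 + 1/3 + 1 + 1/3) / 4 = 2/3 and Pe = 0.5**2 + 0.5**2 = 0.5.
print(round(fleiss_kappa_manual(ratings), 4))  # 0.3333
```

The function rejects inputs where instances have different numbers of annotations, reflecting the requirement that every instance receives the same number of annotations.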