Following on from Sheng et al.’s Get Another Label? paper, Panos and crew have a Human Computation Workshop paper,

- P. Ipeirotis, F. Provost, J. Wang. 2010. Quality Management on Amazon Mechanical Turk. *KDD-HCOMP*.

The motivation for the new paper is to try to separate bias from accuracy. That is, some annotators would try hard, but the bias in their answers would give them an overall low exact accuracy. But their responses can still be useful if we correct for their bias. Luckily, the Dawid-and-Skene-type models for estimating and adjusting for annotators’ accuracies do just that.

### G, PG, R, or X?

As an example, Ipeirotis et al. consider having workers on Amazon Mechanical Turk classify images into G, PG, R, and X ratings following MPAA ratings guidelines.

This is really an ordinal classification problem where the approach of Uebersax and Grove (1993) seems most appropriate, but let’s not worry about that for right now. We can imagine a purely categorical example such as classifying tokens in context based on part-of-speech or classifying documents based on a flat topic list.

### Bias or Inaccuracy?

Uebersax and Grove discuss the bias versus accuracy issue. It’s easy to see in a two-category case that sensitivity and specificity (label response for 1 and 0 category items) may be reparameterized as accuracy and bias. Biased annotators have sensitivities that are lower (0 bias) or higher (1 bias) than their specificities.

What Ipeirotis et al. point out is that you can derive information from a biased annotator if their biases can be estimated. As they show, a model like Dawid and Skene’s (1979) performs just such a calculation in its “weighting” of annotations in a generative probability model. The key is that it uses the information about the likely response of an annotator given the true category of an item to estimate the category of that item.

### Decision-Theoretic Framework

Ipeirotis et al. set up a decision-theoretic framework where there is a loss (aka cost, which may be negative, and thus a gain) for each decision based on the true category of the item and the classification.

One nice thing about the image classification task is that it makes it easy to think about the misclassification costs. For instance, classifying an X-rated image as G and having it land on a children’s site is probably a more costly error than rating a G image as PG and banning it from the same site. In the ordinal setting, there’s a natural scale, where rating an X-rated image as R is closer to the true result than PG or G.

### Getting Another Label

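Consider a binary setting where the true category of item $i$ is $c_i$, the prevalence (overall population proportion) of true items is $\pi$, and annotator $j$ has sensitivity (accuracy on 1 items; response to positive items) of $\theta_{1,j}$ and a specificity (accuracy on 0 items; 1 – response to negative items) of $\theta_{0,j}$. If $y_{i,j}$ is the response of annotators $j \in 1{:}J$, then we can calculate probabilities for the category $c_i$ by

$$\mathrm{Pr}[c_i = 1 \mid y_i] \propto \pi \times \prod_{j=1}^{J} \theta_{1,j}^{y_{i,j}} \times (1 - \theta_{1,j})^{1 - y_{i,j}}$$

and

$$\mathrm{Pr}[c_i = 0 \mid y_i] \propto (1 - \pi) \times \prod_{j=1}^{J} \theta_{0,j}^{1 - y_{i,j}} \times (1 - \theta_{0,j})^{y_{i,j}}.$$

The thing to take away from the above formula is that it reduces to an odds ratio. Before annotation, the odds ratio is $\pi / (1 - \pi)$. For each annotator, we multiply the odds ratio by the annotator’s ratio, which is $\theta_{1,j} / (1 - \theta_{0,j})$ if the label is $y_{i,j} = 1$, and is $(1 - \theta_{1,j}) / \theta_{0,j}$ if the label is negative.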
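As a minimal sketch of this odds-based update, the following Python function multiplies the prior odds by one ratio per annotation. It assumes the prevalence, sensitivities, and specificities are already known and lie strictly between 0 and 1; the function and argument names are illustrative and do not come from the paper.

```python
def posterior_positive(prevalence, labels, sensitivities, specificities):
    """Return Pr[category = 1 | labels] for one item in the binary model.

    Assumes annotators are conditionally independent given the true
    category and that all rates lie strictly between 0 and 1.
    """
    odds = prevalence / (1.0 - prevalence)  # odds before any annotation
    for y, sens, spec in zip(labels, sensitivities, specificities):
        if y == 1:
            # true-positive rate over false-positive rate
            odds *= sens / (1.0 - spec)
        else:
            # false-negative rate over true-negative rate
            odds *= (1.0 - sens) / spec
    return odds / (1.0 + odds)  # convert odds back to a probability


p = posterior_positive(0.2, [1, 1, 0], [0.9, 0.8, 0.7], [0.9, 0.8, 0.7])
```

With a prevalence of 0.2 and three annotators of decreasing quality who answer 1, 1, and 0, the odds are 0.25 × 9 × 4 × 3/7 = 27/7, so `p` is 27/34, or about 0.79.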

### Random Annotators Don’t Affect Category Estimates

Spammers in these settings can be characterized by having responses that do not depend on input items. That is, no matter what the true category, the response profile will be identical. For example, an annotator could always return a given label (say 1), or always choose randomly among the possible labels with the same distribution (say 1 with 20% chance and 0 with 80% chance, corresponding, say, to just clicking some random checkboxes on an interface).

In the binary classification setting, if annotations don’t depend on the item being classified, we have specificity = 1 – sensitivity, or in symbols, $\theta_{0,j} = 1 - \theta_{1,j}$ for annotator $j$. That is, there’s always a $\theta_{1,j}$ chance of returning the label 1 no matter what the input.

Plugging this into the odds ratio formulation above, it’s clear that there’s no effect of adding such a spammy annotator. The update to the odds ratio has no effect because $\theta_{1,j} / (1 - \theta_{0,j}) = 1$ and $(1 - \theta_{1,j}) / \theta_{0,j} = 1$.
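A quick numerical check of this claim, as a sketch: the helper `odds_multiplier` below is a hypothetical name for the per-annotation factor applied to the odds, and the spammer is the 20%/80% random clicker from the example above.

```python
def odds_multiplier(label, sensitivity, specificity):
    """Factor by which one annotation multiplies the odds of category 1."""
    if label == 1:
        return sensitivity / (1.0 - specificity)
    return (1.0 - sensitivity) / specificity


# A spammer who answers 1 with 20% chance regardless of the item:
# sensitivity = 0.2 and specificity = 1 - 0.2 = 0.8.
spam_sens, spam_spec = 0.2, 0.8
spam_multipliers = [odds_multiplier(y, spam_sens, spam_spec) for y in (0, 1)]

# For contrast, an annotator who is right 90% of the time on both categories.
good_multipliers = [odds_multiplier(y, 0.9, 0.9) for y in (0, 1)]
```

Both spammer multipliers equal 1 up to floating-point rounding, so the odds are left unchanged whichever label the spammer returns, while the accurate annotator's multipliers move the odds down for a 0 label and up for a 1 label.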

### Cooperative, Noisy and Adversarial Annotators

In the binary case, as long as the sum of sensitivity and specificity is greater than one, $\theta_{1,j} + \theta_{0,j} > 1$, there is positive information to be gained from an annotator’s response.

If the sum is 1, the annotator’s responses are pure noise and there is no information to be gained by their response.

If the sum is less than 1, the annotator is being adversarial. That is, they know the answers and are intentionally returning the wrong answers. In this case, their annotations will bias the category estimates.

### Information Gain

The expected information gain from having an annotator label an item is easily calculated. We need to calculate the probability of true category and then probability of response and figure out the odds of each and the contribution to our overall estimates.

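The random variable in question is $c_i$, the category of item $i$. We will assume a current odds ratio after zero or more annotations of $\phi / (1 - \phi)$, where $\phi$ is the current estimate of the probability that $c_i = 1$, and consider the so-called *information gain* from observing $y_{i,j}$, the label provided for item $i$ by annotator $j$,

$$\mathrm{H}[c_i] - \mathbb{E}[\mathrm{H}[c_i \mid y_{i,j}]],$$

where the expectation is with respect to our model’s posterior.

Expanding the expectation in the second term gives us

$$\mathbb{E}[\mathrm{H}[c_i \mid y_{i,j}]] = \mathrm{Pr}[y_{i,j} = 1] \times \mathrm{H}[c_i \mid y_{i,j} = 1] + \mathrm{Pr}[y_{i,j} = 0] \times \mathrm{H}[c_i \mid y_{i,j} = 0].$$

The formulas for the terms inside the entropy are given above. As before, we’ll calculate the probabilities of responses using our model posteriors. For instance, carrying this through normalization, the probabilities on which the expectations are based are

$$\mathrm{Pr}[y_{i,j} = 1] = \frac{Q_1}{Q_1 + Q_0},$$ and

$$\mathrm{Pr}[y_{i,j} = 0] = \frac{Q_0}{Q_1 + Q_0},$$ where

$$Q_1 = \phi \times \theta_{1,j} + (1 - \phi) \times (1 - \theta_{0,j}),$$ and

$$Q_0 = (1 - \phi) \times \theta_{0,j} + \phi \times (1 - \theta_{1,j}),$$

with $\theta_{1,j}$ and $\theta_{0,j}$ the sensitivity and specificity of annotator $j$, so that the probability the annotation is 1 is proportional to the sum of the probability that the true category is 1 (here $\phi$) and the response was correct ($\theta_{1,j}$) and the probability that the true category is 0 ($1 - \phi$) and the response was incorrect ($1 - \theta_{0,j}$).

It’s easy to see that spam annotators who have $\theta_{0,j} = 1 - \theta_{1,j}$ provide zero information gain because, as we showed above, if annotator $j$ provides random responses, then $\mathrm{Pr}[c_i \mid y_{i,j}] = \mathrm{Pr}[c_i]$, and hence the entropy after observing the label equals the entropy before.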
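The calculation can be sketched in a few lines of Python. This is only an illustration under the binary model described above: the names `entropy` and `expected_information_gain` are made up for the example, the current probability of category 1 and the annotator's sensitivity and specificity are taken as known point values, and entropy is measured in bits.

```python
import math


def entropy(p):
    """Entropy in bits of a binary variable that is 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def expected_information_gain(prob_positive, sensitivity, specificity):
    """Expected entropy reduction for the category from one more label.

    prob_positive is the current probability that the item's category is 1.
    """
    # Probability the annotator answers 1: right on a 1 item or wrong on a 0 item.
    p_label1 = prob_positive * sensitivity + (1.0 - prob_positive) * (1.0 - specificity)
    p_label0 = 1.0 - p_label1
    gain = entropy(prob_positive)
    if p_label1 > 0.0:
        post_given_1 = prob_positive * sensitivity / p_label1  # Pr[c = 1 | y = 1]
        gain -= p_label1 * entropy(post_given_1)
    if p_label0 > 0.0:
        post_given_0 = prob_positive * (1.0 - sensitivity) / p_label0  # Pr[c = 1 | y = 0]
        gain -= p_label0 * entropy(post_given_0)
    return gain


spam_gain = expected_information_gain(0.3, 0.2, 0.8)     # specificity = 1 - sensitivity
good_gain = expected_information_gain(0.5, 0.9, 0.9)
perfect_gain = expected_information_gain(0.5, 1.0, 1.0)
```

The spammer's gain is zero up to rounding, the 90%-accurate annotator's gain is positive, and a perfect annotator on a 50/50 item recovers the full bit of uncertainty.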

### Decision-Theoretic Framework

Ipeirotis et al. go one step further and consider a decision-theoretic context in which the penalties for misclassifications may be arbitrary numbers and the goal is minimizing expected loss (equivalently maximizing expected gain).

Rather than pure information gain, the computation would proceed through the calculation of expected true positives, true negatives, false positives, and false negatives, each with a weight.

The core of Bayesian decision theory is that expected rewards are always improved by improving posterior estimates. As long as an annotator isn’t spammy, their contribution is expected to tighten up our posterior estimate and hence improve our decision-making ability.

Suppose we have weights $L_{k,k'}$, which are the losses for classifying an item of category $k$ as being of category $k'$. Returning to the binary case, consider an item whose current estimated chance of being positive is $\phi$. Our expected loss is

$$L_{1,1} \phi^2 + L_{1,0} \phi (1 - \phi) + L_{0,1} (1 - \phi) \phi + L_{0,0} (1 - \phi)^2.$$

We are implicitly assuming that the system operates by sampling the category from its estimated distribution. This is why $\phi$ shows up twice after $L_{1,1}$, once for the probability that the category is 1 and once for the probability that’s the label chosen.

In practice, we often quantize answers to a single category. The site that wants a porn filter on an image presumably doesn’t want a soft decision; it needs to either display an image or not. In this case, the decision criterion is to return the result that minimizes expected loss. For instance, assigning category 1 if the probability the category is 1 is $\phi$ leads to expected loss

$$L_{1,1} \phi + L_{0,1} (1 - \phi)$$

and the loss for assigning category 0 is expected to be

$$L_{1,0} \phi + L_{0,0} (1 - \phi).$$

The decision is simple: return the result corresponding to the smallest loss.
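A minimal sketch of this decision rule in Python follows. The indexing convention `loss[k][d]` (loss for deciding `d` when the true category is `k`), the function names, and the numbers in the loss table are all assumptions made for illustration; they are not taken from the paper.

```python
def expected_losses(prob_positive, loss):
    """Expected loss of hard-assigning category 0 and category 1.

    loss[k][d] is the loss for deciding d when the true category is k;
    prob_positive is the current probability that the category is 1.
    """
    return [
        (1.0 - prob_positive) * loss[0][d] + prob_positive * loss[1][d]
        for d in (0, 1)
    ]


def decide(prob_positive, loss):
    """Return the category (0 or 1) with the smallest expected loss."""
    losses = expected_losses(prob_positive, loss)
    return min((0, 1), key=lambda d: losses[d])


# Hypothetical costs: missing a positive item (deciding 0 when the truth is 1)
# is ten times worse than a false alarm; correct decisions cost nothing.
loss = [[0.0, 1.0],
        [10.0, 0.0]]

likely_enough = decide(0.2, loss)   # expected losses are 2.0 for 0 and 0.8 for 1
too_unlikely = decide(0.05, loss)   # expected losses are 0.5 for 0 and 0.95 for 1
```

With these asymmetric costs the rule assigns category 1 once the probability of a positive exceeds 1/11, so a 20% chance is already enough to flag the item while a 5% chance is not.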

After the annotation by annotator $j$, the positive and negative probabilities get updated and plugged in to produce a new estimate $\phi'$, which we plug back in.

I’m running out of steam on the derivation front, so I’ll leave the rest as an exercise. It’s a hairy formula, especially when unfolded to the expectation. But it’s absolutely what you want to use as the basis of the decision as to which items to label.

### Bayesian Posteriors and Expectations

In practice, we work with estimates of the prevalence $\pi$, sensitivities $\theta_{1,j}$, and specificities $\theta_{0,j}$. For full Bayesian inference, Gibbs sampling lets us easily compute the integrals required to use our uncertainty in the parameter estimates in calculating our estimate of the category $c_i$ and its uncertainty.
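One way to picture this, as a rough sketch rather than a description of any particular sampler: given a set of posterior draws of the prevalence, sensitivities, and specificities, the category probability for an item can be computed under each draw and then averaged. The draws below are made-up numbers standing in for sampler output, and both function names are hypothetical.

```python
def posterior_positive(prevalence, labels, sensitivities, specificities):
    """Pr[category = 1 | labels] for fixed parameter values (binary model)."""
    odds = prevalence / (1.0 - prevalence)
    for y, sens, spec in zip(labels, sensitivities, specificities):
        odds *= sens / (1.0 - spec) if y == 1 else (1.0 - sens) / spec
    return odds / (1.0 + odds)


def averaged_posterior(labels, draws):
    """Average the category probability over parameter draws.

    draws is a sequence of (prevalence, sensitivities, specificities)
    tuples, assumed to come from the posterior over the parameters.
    Returns the mean and the per-draw values, whose spread reflects
    the parameter uncertainty.
    """
    values = [posterior_positive(pi, labels, sens, spec) for pi, sens, spec in draws]
    return sum(values) / len(values), values


# Two made-up draws for two annotators, standing in for real sampler output.
draws = [
    (0.2, [0.9, 0.8], [0.9, 0.8]),
    (0.3, [0.8, 0.7], [0.85, 0.75]),
]
mean_prob, per_draw = averaged_posterior([1, 1], draws)
```

In a real analysis there would be hundreds or thousands of draws, and the per-draw values would give an interval around the averaged probability.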

### Confused by Section 4

I don’t understand why Ipeirotis et al. say, in section 4,

> The main reason for this failure [in adjusting for spam] is the inability of the EM algorithm to identify the “strategic” spammers; these sophisticated spammers identify the class with the highest class prior (in our case the “G” class) and label all the pages as such.

Huh? It does in my calculations, as shown above, and in my experience with the Snow et al. NLP data and our own NE data.

One problem in practice may be that if a spammer annotates very few items, we don’t have a good estimate of their accuracies, and can’t adjust for their inaccuracy. Otherwise, I don’t see a problem.

November 30, 2010 at 12:48 am |

I presume in all of this that we’re assuming independence between annotators. What happens when annotators are correlated? E.g. when annotators tend to make the same mistakes on the same documents?

November 30, 2010 at 2:39 pm |

Here I’m assuming conditional independence of the annotators given the true category of the item being annotated. If you actually look at the annotations, they’re highly correlated! In addition, the sensitivity and specificity terms for modeling annotator accuracy provide a bit more power to model shared biases (e.g. higher specificity than sensitivity).

There are two extensions, only one of which I’ve pursued.

The first is to use a fixed-effects type model and try to estimate the correlations among annotators. That’s a pretty big covariance matrix with 150 annotators, and most pairs of annotators have no data in common, making direct estimation pretty much impossible.

The second is to use a random-effects model, and I’ve done this and written it up in the tech report (linked from the white papers section of the blog) and sketched it in the talk and tutorial.

In the simplest random-effects model, each item being annotated gets an effect, which is essentially modeling difficulty to annotate. This is like the approach taken by Uebersax and Grove (1993) in their ordinal rating model, and has been widely adopted within epidemiology, where the random effect models something like size of tumor and the ordinal rating something like stage of cancer.

The data from your (co-authored) CIKM paper, Assessor Error in Stratified Evaluation, is much richer. What I’d like to do is model accuracy based on strata effects. In the simplest case here, each stratum gets a difficulty parameter. In the richer model, you’d use those as priors for the individual item effects.

The problem I had with individual item difficulty effects is the usual one: with only a handful of annotators (10 or fewer) per item, it’s hard to get a read on item difficulty. There are too many “degrees of freedom” in a possible explanation of annotators’ labels: you can either say an item is hard or that annotators are inaccurate.