
Precision, Recall and F1 measure


This is a brief explanation of how Precision, Recall and the F1-measure are implemented in GERBIL, with a focus on their special cases.

Dividing by 0

In some rare cases, the calculation of Precision or Recall can cause a division by 0. For precision, this can happen if the answer of an annotator contains no results and, thus, both the true positives and the false positives are 0. For these special cases, we have defined that if the true positives, false positives and false negatives are all 0, the precision, recall and F1-measure are 1. This can occur when the gold standard contains a document without any annotations and the annotator (correctly) returns no annotations. If the true positives are 0 and at least one of the two other counters is larger than 0, the precision, recall and F1-measure are 0.
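As a minimal sketch (not GERBIL's actual Java code), this rule could be expressed like this for a single document, where `tp`, `fp` and `fn` are the respective counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-measure with the special cases for zero counts."""
    if tp == 0:
        if fp == 0 and fn == 0:
            # Empty gold standard and empty annotator answer: perfect result.
            return 1.0, 1.0, 1.0
        # No true positives, but false positives and/or false negatives exist.
        return 0.0, 0.0, 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```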

Micro and Macro

Since every dataset contains many single documents, we implemented Micro and Macro versions of Precision, Recall and F1-measure. Here, we explain the difference between micro and macro precision very briefly. For the complete equations, take a look at [1].

For computing the micro precision, the true positives and false positives of all documents are summed up. These sums are used to calculate a single micro precision value.

In contrast, a single precision value can be calculated for every document. The macro precision is the average of these per-document precisions. Likewise, the macro F1-measure is calculated as the average of the per-document F1-measures.
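Both aggregation strategies can be sketched as follows, reusing the hypothetical `precision_recall_f1` helper from above (again, only an illustration, not GERBIL's implementation):

```python
def micro_scores(doc_counts):
    """doc_counts is a list of (tp, fp, fn) tuples, one per document.
    Micro: sum the counts over all documents, then compute the measures once."""
    tp = sum(c[0] for c in doc_counts)
    fp = sum(c[1] for c in doc_counts)
    fn = sum(c[2] for c in doc_counts)
    return precision_recall_f1(tp, fp, fn)


def macro_scores(doc_counts):
    """Macro: compute the measures per document, then average them."""
    per_doc = [precision_recall_f1(tp, fp, fn) for tp, fp, fn in doc_counts]
    n = len(per_doc)
    return tuple(sum(values) / n for values in zip(*per_doc))
```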

We can summarize that micro measures show the performance over the set of all annotations inside the dataset, while macro measures show the average performance per document. Thus, in some cases these measures can differ considerably. Let's assume that a dataset comprises three documents. Two of these documents have exactly 1 annotation that should be found by an annotator, while the third document does not have any annotations. Let's further assume that the annotator we are evaluating does not work well and always returns an empty result. The following table contains the counts for this example (tp = true positive, fp = false positive, fn = false negative, p = precision, r = recall, f1 = F1-measure).

|              | annotations | tp | fp | fn | p   | r   | f1  |
|--------------|-------------|----|----|----|-----|-----|-----|
| doc 1        | 1           | 0  | 0  | 1  | 0   | 0   | 0   |
| doc 2        | 1           | 0  | 0  | 1  | 0   | 0   | 0   |
| doc 3        | 0           | 0  | 0  | 0  | 1   | 1   | 1   |
| sums (micro) | 2           | 0  | 0  | 2  | 0   | 0   | 0   |
| avg (macro)  | -           | -  | -  | -  | 1/3 | 1/3 | 1/3 |

It can be seen that while the micro measures are all 0, the macro measures are 1/3.
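With the sketches from above, this example can be reproduced like this:

```python
# (tp, fp, fn) per document for the three example documents
doc_counts = [(0, 0, 1), (0, 0, 1), (0, 0, 0)]

print(micro_scores(doc_counts))  # (0.0, 0.0, 0.0)
print(macro_scores(doc_counts))  # (1/3, 1/3, 1/3) for precision, recall and F1
```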

InKB, EE and GSInKB

The general precision, recall and F1-measure scores are calculated using all true positive, false positive and false negative counts. Beside these general scores, there are some special scores that focus on a certain part of the gold standard and the annotator response.

The InKB scores consider only those entities that have at least one URI that is part of a well-known knowledge base (KB). The EE (emerging entities) scores are the opposite of the InKB scores: they consider only those entities that have no URI in the KB.

The GSInKB scores focus on those entities that are linked to the KB inside the gold standard. Entities marked by the annotator are only considered if they overlap with the InKB entities of the gold standard. The aim of these scores is to show the performance of an annotator in a D2KB experiment without taking emerging entities into account.
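As a rough illustration of the InKB/EE distinction (the KB prefix and the function names below are assumptions for this sketch, not GERBIL's actual configuration), the annotations can be thought of as being partitioned before the counts are computed:

```python
# Assumed example prefix of a well-known knowledge base.
KB_PREFIXES = ("http://dbpedia.org/resource/",)


def is_in_kb(annotation_uris):
    """True if at least one of the annotation's URIs belongs to the KB."""
    return any(uri.startswith(prefix) for uri in annotation_uris for prefix in KB_PREFIXES)


def split_in_kb_and_ee(annotations):
    """Partition annotations (each given as a list of URIs) into InKB and EE subsets.
    The InKB and EE scores are then computed on the respective subset only."""
    in_kb = [uris for uris in annotations if is_in_kb(uris)]
    ee = [uris for uris in annotations if not is_in_kb(uris)]
    return in_kb, ee
```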

References

[1] Ricardo Usbeck, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, Paolo Ferragina, Christiane Lemke, Andrea Moro, Roberto Navigli, Francesco Piccinno, Giuseppe Rizzo, Harald Sack, René Speck, Raphaël Troncy, Jörg Waitelonis and Lars Wesemann. GERBIL -- General Entity Annotation Benchmark Framework. In Proceedings of the International World Wide Web Conference (WWW) (Practice & Experience Track), ACM (2015).