Annotation Data Format

ninjin edited this page May 24, 2011 · 1 revision

This page describes the annotation format used by brat.

History and Motivation

For the BioNLP 2009 Shared Task on Event Extraction, a plain-text stand-off format was introduced instead of an XML-based format. brat derives its own format from it and aims to remain backwards compatible.

Motivation

This specification exists to avoid the case where the format is defined by the implementation (I am looking at you, MediaWiki). When an implementation differs from what is described in this specification, the implementation is wrong, not the specification. Suggestions for additions are of course welcome, provided they are motivated. Try to weave such changes into this page and keep the version section up-to-date.

Specification

This section specifies the format.

Identifiers

Identifiers are used for references between annotation lines and have the following format:

([A-Za-z]|#)([0-9]+)(.*)

The id consists of three groups:

  • Type specifier: Has semantic implications for the interpretation of the id
  • Number: A running number used to differentiate between ids of the same type
  • Tail: A free-text tail that is to have no semantic implications for the id
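
As an illustration, the pattern above can be applied with Python's `re` module. This is a minimal sketch and not part of the specification: the names `parse_id` and `ParsedId` are hypothetical, and matching the pattern against the whole string is an assumption.

```python
import re
from typing import NamedTuple, Optional

# The identifier pattern from this page. Requiring it to cover the whole
# string (fullmatch) is an assumption of this sketch.
ID_PATTERN = re.compile(r"([A-Za-z]|#)([0-9]+)(.*)")


class ParsedId(NamedTuple):
    type_specifier: str  # a single letter or '#'
    number: int          # running number within the type
    tail: str            # free text without semantic meaning


def parse_id(raw: str) -> Optional[ParsedId]:
    """Split an identifier into its three groups, or return None."""
    match = ID_PATTERN.fullmatch(raw)
    if match is None:
        return None
    specifier, number, tail = match.groups()
    return ParsedId(specifier, int(number), tail)
```

Strings that do not fit the pattern, such as `12` with no type specifier, yield `None`.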

NOTE: Why do we even allow the tail in the first place? Shouldn't we discourage the use of the identifier as a place to store information? The same goes for the hash: why was it necessary to have ids for comments?

Not all annotations have identifiers. For this special case the wild-card identifier * is used (see Equivalence for an example of such an annotation).

Text-bound

A text-bound annotation marks a span of text and assigns a type to it.

${ID}\t${TYPE} ${START} ${END}\t${TEXT}${COMMENT}

The following restrictions apply to a text-bound annotation:

  • Must have an id (${ID}) with a leading T
  • Must have a type (${TYPE}) which may contain any non-space character
  • Must have both the marked span (${START} and ${END}) and the text contained in that span (${TEXT})
  • May have a comment (${COMMENT}) trailing after the ${TEXT} segment that

A difference from the BioNLP'09 ST format is that the ${TEXT} component is mandatory for brat to function properly. It also enables more extensive sanity checking than would be possible if it were left out.
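
The line format and the sanity check mentioned above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: `parse_text_bound`, `span_matches` and `TextBound` are hypothetical names, the offsets are assumed to be character offsets with an exclusive end, and anything following the first `END - START` characters of the last field is assumed to be the trailing comment.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    id: str
    type: str
    start: int
    end: int
    text: str
    comment: str


def parse_text_bound(line: str) -> TextBound:
    """Parse one text-bound annotation line (hypothetical helper)."""
    # Three tab-separated fields; the last one holds TEXT plus any COMMENT.
    ann_id, head, rest = line.rstrip("\n").split("\t", 2)
    if not ann_id.startswith("T"):
        raise ValueError(f"text-bound id must start with T: {ann_id!r}")
    ann_type, start_s, end_s = head.split(" ")
    start, end = int(start_s), int(end_s)
    # Assumption: TEXT is exactly end - start characters long, and whatever
    # follows it in the same field is the trailing COMMENT.
    length = end - start
    return TextBound(ann_id, ann_type, start, end, rest[:length], rest[length:])


def span_matches(ann: TextBound, document: str) -> bool:
    """Sanity check: TEXT must equal the document slice [start, end)."""
    return document[ann.start:ann.end] == ann.text
```

For example, with the document text `The cat sat.`, the line `T1<TAB>Animal 4 7<TAB>cat` passes the check, whereas the same line with the span `0 3` does not.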

Event

TBD

${ID}\t${TYPE} [${ARGUMENT}:${PARTICIPANT}...]

Equivalence

An equivalence signifies logical equivalence between annotations or entities.

*\tEquiv [${MEMBER}...]
  • Must have at least one member ${MEMBER} which is an id of another annotation
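
A minimal Python sketch of reading such a line is given here for illustration. The helper name `parse_equiv` is hypothetical, and the sketch assumes that the `Equiv` keyword and the members are separated by spaces.

```python
from typing import List


def parse_equiv(line: str) -> List[str]:
    """Return the member ids of an equivalence line (hypothetical helper)."""
    ann_id, body = line.rstrip("\n").split("\t", 1)
    if ann_id != "*":
        raise ValueError("an equivalence uses the wild-card id '*'")
    # Assumption: the keyword and the members are space-separated.
    keyword, *members = body.split()
    if keyword != "Equiv":
        raise ValueError(f"expected 'Equiv', got {keyword!r}")
    if not members:
        raise ValueError("an equivalence needs at least one member")
    return members
```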

Modifier

A modifier annotation applies a binary modification, for example speculation or negation, to another annotation that has an id.

${ID}\t${TYPE} ${TARGET}
  • Must have an id (${ID}) with a leading M
  • Must have a type (${TYPE}) which may contain any non-space character
  • Must have a target (${TARGET}) which is a valid id for another annotation
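
The restrictions above could be checked along the following lines. This is a sketch only: `parse_modifier`, `dangling_targets` and `Modifier` are hypothetical names, and treating a target as valid when it appears in a set of already known ids is an assumption about what "valid" means here.

```python
from typing import Iterable, List, NamedTuple, Set


class Modifier(NamedTuple):
    id: str
    type: str
    target: str


def parse_modifier(line: str) -> Modifier:
    """Parse one modifier annotation line (hypothetical helper)."""
    ann_id, body = line.rstrip("\n").split("\t", 1)
    if not ann_id.startswith("M"):
        raise ValueError(f"modifier id must start with M: {ann_id!r}")
    # TYPE contains no spaces, so a single space separates it from TARGET.
    ann_type, target = body.split(" ")
    return Modifier(ann_id, ann_type, target)


def dangling_targets(modifiers: Iterable[Modifier], known_ids: Set[str]) -> List[Modifier]:
    """Return the modifiers whose target is not among the known ids."""
    return [m for m in modifiers if m.target not in known_ids]
```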

Version

TBD