Data format
Ben Hachey edited this page Mar 8, 2014 · 2 revisions
To evaluate your system output against the gold standard, your system needs to write its output in the tab-separated format described here.
- Each document starts with a `-DOCSTART- (some_doc_id)` line, where `some_doc_id` might be something like `1163testb SOCCER`
- Each sentence is separated by a blank line
- Each document is separated by a blank line
- Each token is on its own line (we re-use the gold-standard tokenisation)
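The layout rules above can be sketched in a few lines of Python. This is only an illustration, not the evaluator's own code: the function names `format_document` and `format_corpus` and the input shape (a list of sentences, each a list of ready-made token lines) are assumptions made for the example. It also assumes that the blank line closing the last sentence of a document doubles as the blank line between documents.

```python
def format_document(doc_id, sentences):
    """Return one document as text (hypothetical helper for illustration).

    `sentences` is a list of sentences, each a list of already formatted
    token lines (one string per token).
    """
    lines = ["-DOCSTART- ({})".format(doc_id)]
    for sentence in sentences:
        lines.extend(sentence)
        lines.append("")  # a blank line closes each sentence
    return "\n".join(lines)


def format_corpus(documents):
    """Join (doc_id, sentences) pairs into one output string.

    Assumption: the blank line after a document's last sentence also
    serves as the separator between documents.
    """
    parts = [format_document(doc_id, sents) for doc_id, sents in documents]
    return "\n".join(parts) + "\n"
```

Calling `format_corpus([("1163testb SOCCER", [["Some", "headline"]])])` yields the `-DOCSTART-` line, the two token lines and a trailing blank line.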
The column ordering for token lines is:
- Token
- Mention span: `B` for mention begin, `I` for inside mention, empty column for outside
- Mention text: this is a bit redundant, but a sanity check when reading the output (`text == ' '.join(mention_tokens)`)
- Entity identifier: where a mention is linked to the KB, this will be the id/title (e.g., a Wikipedia title). Where the mention is a `NIL`, this column should be blank
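As a rough sketch of how the four columns fit together, the function below builds the token lines for one sentence. The name `token_lines` and the mention representation, `(start, end, entity_id)` with `end` exclusive and `entity_id` set to `None` for a NIL, are assumptions made for this example. Dropping trailing empty columns is also an assumption; check what the evaluator expects before relying on it.

```python
def token_lines(tokens, mentions):
    """Build tab-separated token lines for one sentence (illustrative sketch).

    `tokens` is a list of token strings. `mentions` is a list of
    (start, end, entity_id) tuples, `end` exclusive, entity_id None for NIL.
    """
    # Columns: token, mention span tag, mention text, entity identifier.
    rows = [[tok, "", "", ""] for tok in tokens]
    for start, end, entity_id in mentions:
        text = " ".join(tokens[start:end])  # same join as the sanity check
        for i in range(start, end):
            rows[i][1] = "B" if i == start else "I"
            rows[i][2] = text
            rows[i][3] = entity_id or ""  # NIL mentions leave this blank
    # Assumption: trailing empty columns are dropped rather than written.
    return ["\t".join(row).rstrip("\t") for row in rows]
```

For instance, `token_lines(["By", "John", "Smith"], [(1, 3, None)])` gives a bare `By` line followed by a `B` line and an `I` line that both carry the mention text `John Smith` and no entity identifier.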
-DOCSTART- (some_doc_id)
Some
headline
about
two
Named	B	Named Entities	Named_Entity
Entities	I	Named Entities	Named_Entity
.
By
John	B	John Smith
Smith	I	John Smith
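To check a file in this format before submitting it for evaluation, a small reader along these lines may help. It is a sketch under stated assumptions, not the evaluator's parser: `read_documents` and `check_mention_text` are hypothetical names, and lines with fewer than four columns are padded with empty strings.

```python
def read_documents(text):
    """Parse the format into a list of (doc_id, sentences) pairs.

    Each sentence is a list of (token, span_tag, mention_text, entity_id)
    tuples. Assumption: missing trailing columns are treated as empty.
    """
    docs = []
    current = None  # tokens of the sentence currently being read
    for line in text.splitlines():
        if line.startswith("-DOCSTART-"):
            doc_id = line[len("-DOCSTART-"):].strip()
            if doc_id.startswith("(") and doc_id.endswith(")"):
                doc_id = doc_id[1:-1]
            docs.append((doc_id, []))
            current = None
        elif not line.strip():
            current = None  # a blank line ends the sentence
        else:
            if not docs:
                raise ValueError("token line before any -DOCSTART- line")
            cols = line.split("\t")
            cols += [""] * (4 - len(cols))  # pad short lines to four columns
            if current is None:
                current = []
                docs[-1][1].append(current)
            current.append(tuple(cols[:4]))
    return docs


def check_mention_text(sentence):
    """Return True if every mention's text equals its tokens joined by spaces."""
    i = 0
    while i < len(sentence):
        if sentence[i][1] == "B":
            j = i + 1
            while j < len(sentence) and sentence[j][1] == "I":
                j += 1
            expected = " ".join(tok[0] for tok in sentence[i:j])
            if any(tok[2] != expected for tok in sentence[i:j]):
                return False
            i = j
        else:
            i += 1
    return True
```

Running `check_mention_text` over every sentence returned by `read_documents` applies the `' '.join` sanity check from the column description to a whole file.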