A German dataset for few-shot relation extraction.
The sentences in WissenschaftsSTANDARD are drawn from the 10kGNAD dataset. 10kGNAD is a topic classification dataset with journalistic articles from the Austrian newspaper "Der STANDARD". WissenschaftsSTANDARD only uses sentences from the "Wissenschaft" (science) topic category in 10kGNAD. Therefore, the sentences mainly treat scientific discoveries and events in academia.
This dataset was created for the master thesis "Few-Shot Relation Extraction for German" by Anna Sauer.
The sentences were annotated by a marvellous crowdsourcing team including Chris, Christl, Florian, Hans, Jean-Francois, Julia, Kristin, Marco, Marina, Marius, MG and Sandra. Thanks to you all! 💘
The minimalist annotation tool Locksley was used in the creation of WissenschaftsSTANDARD.
WissenschaftsSTANDARD contains the following eleven relations:
id | German name | English name | description | number of instances |
---|---|---|---|---|
0 | andere | OTHER | 1,496 | |
1 | veröffentlicht in | publishes in | a person publishes a text in a scientific journal | 61 |
2 | Leiter:in | director | a person is at the head of an organisation | 76 |
3 | gehört an | affiliation | a person is affiliated with an organisation (workplace, membership etc.) | 519 |
4 | Kollegin, Kollege | colleague | two people work together | 208 |
5 | nachgeordnete Organisation | subsidiary | an organisation belongs to a larger parent organisation | 145 |
6 | Partnerorganisation | partner organisation | two organisations work together | 216 |
7 | Standort in | located in | an organisation is situated in a location | 260 |
8 | Teilort | geographical part of | a location is part of a larger geographical entity | 116 |
9 | verleiht Preis an | awards prize to | an organisation awards a prize to a person | 13 |
10 | Gründer:in | founder | a person has founded an organisation | 6 |
In the dataset file, they are all referred to by their id. The total of labeled relation instances amounts to 3,116. Note that one sentence from 10kGNAD can be labeled with several relations and can therefore appear several times in WissenschaftsSTANDARD.
The relations are assumed to have a fixed direction from one entity to another one. Nevertheless, there are also two symmetric relations, colleague and partner organisation, that go both ways. For example, consider the sentence "Robin Hood and Friar Tuck are colleagues.". There is a colleague relation from Robin Hood to Friar Tuck. In addition, there is also a colleague relation from Friar Tuck. In cases like this, WissenschaftsSTANDARD contains an separate instance for both directions.
The format of the JSON file is modeled after the data format of the FewRel benchmark for few-shot relation extraction. Each file contains a dictionary whose keys are the names of the relations in the dataset. For each relation key, the corresponding value is a list of the labeled instances of that relation. This list contains a dictionary for each individual instance with
- "tokens": a list with the token string sequence in the sentence
- "h": information on the head entity in a list with
- a string with the entity mention in lower case
- a string with the Wikidata id of the entity (cf. wikidata.org). In WissenschaftsSTANDARD, this string is left empty because no entity linking between the head and tail entity and their Wikidata equivalent is performed.
- a list with a nested list that contains the indices of the entity mention tokens in the sentence
- "t": information on the tail entity in a list with the same structure as "h"
- "ner": a list with information obtained from the task of named entity recognition (NER). The list contains the BIOES entity type tag for each token in the sentence. The BIOES tagging has been created using the German NER model in Stanza with the CoNLL 2003 tag set. This tag set contains the entity types PER (person), ORG (organisation) and LOC (location) (cf. https://stanfordnlp.github.io/stanza/available_models.html#available-ner-models).
Consider the following made-up example:
{
"tokens": ["Robin", "Hood", "lebt", "in", "Sherwood", "Forest", "."],
"h": ["sherwood forest", "", [[4, 5]]],
"t": ["robin hood", "", [[0, 1]]],
"ner": ["B-PER", "E-PER", "O", "O", "B-PER", "E-PER", "O"]
}
10kGNAD is licensed under a Creative Commons BY-NC-SA 4.0 license (cf. https://github.com/tblock/10kGNAD). Therefore, WissenschaftsSTANDARD is also released under the Creative Commons BY-NC-SA 4.0 license.