Handle external Annotations/Annotation Layers to DraCor Corpora #153
E-mail communication from Martin Janssen on the subject of enriching DraCor corpora with their TEITOK pipeline:
Just as a quick follow-up on the comment you made this morning: the idea behind the TEITOK toolchain is that it does not modify the XML, it merely enriches it. By default its output is TEI-compliant, but for several reasons it is not default TEI. In the demo I set up, it clones the Git repository, copies the TEI files to a TEITOK project, tokenizes them inline, and then runs UDPipe on the files, and probably also NameTag2. The end result is an annotated version of the input file. So where in https://dracor.org/api/corpora/cal/play/a-dios-por-razon-de-estado/tei you have this line: y conmovido a mi ruego, After the pipeline it looks like this: y conmovido a mi ruego,
You can see the annotated file in action here: https://quest.ms.mff.cuni.cz/teitok-dev/teitok/dracor/cal/index.php?action=file&id=a-dios-por-razon-de-estado.xml (it is a first draft with only CAL annotated; I could soon add the rest, as well as NER and other improvements).
Now it is my understanding of ExistDB that those extra annotations would simply be ignored if you were to load the annotated file instead of the original file. That is the sense in which we mentioned that you could hook it up with the existing toolchain: if, instead of copying the Git repository, TEITOK wrote directly to it, you would get a branch containing the annotated version, which would be equally usable as the original, with of course the option to enrich the encoding in ExistDB to also make the annotations accessible. I have no experience with ExistDB, so I don't know how hard that would be, but the demo could be fully completed soon, so there would be time to check whether that would be possible. And if there is a problem with the TEITOK format not being strictly TEI: there is a script that converts this format to ISO TEI, using instead of , @ana + standoff for the UD; so what is put back in the Git could be the converted format.
TEITOK is explicitly an editing environment, so you can (as an admin) also correct the annotations, both the UD and the NER output, to create a more reliable annotation. You can also normalize the text without throwing away the original, making for a more reliable and readable source. Especially for NER, the annotated result (if verified) might give a lot more information than the "plain" TEI files, so that could potentially then be exploited by the tools already in place for DraCor. I am of course not certain there would be interest in any of this: if the corpus authors are not involved, other people are unlikely to be interested in manually improving somebody else's data. But especially for the LRL, those options tend to be very welcome in my experience (and also necessary, since the tools would be either less accurate or completely missing, as in the case of Bashkir).
So the question is: would that kind of annotation pipeline be usable in the current set-up of DraCor? And, whether the answer is positive or negative, could something be done or added to make it more valuable for that set-up?
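The claim that inline annotations can be "simply ignored" by a consumer that only reads the text can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the `<w>` and `<pc>` elements and the `lemma`/`pos` values below are hypothetical stand-ins, not the actual TEITOK output, and the sketch says nothing about how the DraCor extractor functions in ExistDB would behave.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Original verse line as plain TEI (text taken from the example above).
ORIGINAL = f'<l xmlns="{TEI_NS}">y conmovido a mi ruego,</l>'

# Hypothetical annotated version: the <w>/<pc> elements and their attribute
# values are illustrative assumptions, NOT the real TEITOK pipeline output.
ANNOTATED = (
    f'<l xmlns="{TEI_NS}">'
    '<w lemma="y" pos="CCONJ">y</w> '
    '<w lemma="conmover" pos="VERB">conmovido</w> '
    '<w lemma="a" pos="ADP">a</w> '
    '<w lemma="mi" pos="DET">mi</w> '
    '<w lemma="ruego" pos="NOUN">ruego</w>'
    '<pc>,</pc>'
    '</l>'
)


def plain_text(xml_fragment: str) -> str:
    """Return the text content of a fragment, ignoring all inline markup."""
    root = ET.fromstring(xml_fragment)
    return "".join(root.itertext())


if __name__ == "__main__":
    # Both versions yield the same text for a consumer that only reads text.
    print(plain_text(ORIGINAL) == plain_text(ANNOTATED))
```

A consumer that selects on specific child elements or on exact text nodes, rather than on the concatenated text, could still be affected by the added markup, so this would need checking against the real extractor functions.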
When DraCor ("vanilla") corpora are used in a research context, they may get modified, enriched and/or annotated in one way or another. There won't be a central system that stores all these derivatives and synchronizes them with the latest version, but we could still come up with a system that allows users to find these derivatives and work with them.
Example: In CLS INFRA, the team at Charles University Prague is thinking about using DraCor (API/corpora/platform) in their Training School. They will probably run NLP tools on the corpora and add annotations to them. Currently, we could not re-ingest the annotated files into the DraCor Exist-DB (it would not make much sense to do so anyway, because the extractor functions expect a certain encoding that would differ from the output of their pipelines).
But what if we did not try to re-ingest the enriched data, but only generated metadata records that would allow users to find the annotated files stored somewhere else?
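To make the idea more concrete, here is a minimal Python sketch of what such a metadata record could look like. Everything about its shape is an assumption for the sake of discussion: the `AnnotationRecord` class, its field names and the `find_records` helper are hypothetical and not part of the DraCor API. The two URLs are the ones quoted in the e-mail above, and the version string is a placeholder.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class AnnotationRecord:
    """Hypothetical pointer from a DraCor play to an externally stored derivative."""
    corpus: str           # DraCor corpus name, e.g. "cal"
    play: str             # play name as used in the DraCor API
    source_url: str       # TEI of the original ("vanilla") play
    annotated_url: str    # where the annotated derivative is stored
    source_version: str   # corpus version the derivative was built from
    tools: List[str]      # tools that produced the annotation layers


def find_records(records: List[AnnotationRecord], corpus: str, play: str) -> List[AnnotationRecord]:
    """Return all known derivatives registered for one play."""
    return [r for r in records if r.corpus == corpus and r.play == play]


def to_json(record: AnnotationRecord) -> str:
    """Serialize a record so it could be stored or served alongside the corpus."""
    return json.dumps(asdict(record), indent=2)


EXAMPLE = AnnotationRecord(
    corpus="cal",
    play="a-dios-por-razon-de-estado",
    source_url="https://dracor.org/api/corpora/cal/play/a-dios-por-razon-de-estado/tei",
    annotated_url=(
        "https://quest.ms.mff.cuni.cz/teitok-dev/teitok/dracor/cal/"
        "index.php?action=file&id=a-dios-por-razon-de-estado.xml"
    ),
    source_version="<commit-of-the-corpus-repository>",  # placeholder, not a real value
    tools=["UDPipe"],
)

if __name__ == "__main__":
    print(to_json(EXAMPLE))
```

Recording the source version in each record would let users see whether a derivative is out of date relative to the latest corpus version, without DraCor having to synchronize anything itself.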