
Lack of language recognition for Sámi languages #379

Open
joncto opened this issue Jun 27, 2023 · 4 comments


joncto commented Jun 27, 2023

We are testing indexing our archive with this bundle in order to build a service for researchers. In the initial test, we found that texts in Sámi languages are identified as Estonian (content_language:"et").

We have stopword lists for five Sámi languages that we would like to use, but we are not sure how this is best implemented. Would you be interested in adding support for Sámi languages to the bundle?

All resources are provided by Giellatekno at the University of Tromsø and licensed under the GNU General Public License, version 3:

Northern Sámi ("sme")
Lule Sámi ("smj")
Southern Sámi ("sma")
Skolt Sámi ("sms"
Inari Sámi ("smn")

@thomasegense

Language detection is done in the warc-indexer. See config3.xml next to the warc-indexer. Add the language code if it is not there, but I do not know how good the support for Sámi is.
Also, moving the languages you expect to find most often to the top will speed up indexing a little, since they can be matched faster.


tokee commented Jun 28, 2023

@joncto language detection is handled by optimaize/language-detector 0.6 and unfortunately it does not seem to support Sámi languages. If you only index content in Sámi languages, you might want to turn off detection. If you have some content in other languages, e.g. English, trimming the langdetectprofiles down to those other languages should reduce the number of false positives.

Stopwords are not supported directly in SolrWayback or webarchive-discovery but can be added to the Solr schema, which should have the intended effect. See the Solr documentation on Stop Filter for details.

The schema.xml provided by the webarchive-discovery project, as well as its mirror in the SolrWayback bundle, uses stopwords for the path field. This can be used as a sample for applying a similar filter to the text_general field type in the schema, using the lists you link to.
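For illustration, a minimal sketch of what such a filter could look like in schema.xml, assuming the Northern Sámi stopword list has been saved as lang/stopwords_sme.txt in the Solr config directory (the field type name and file path are just examples, not part of the shipped schema):

```xml
<!-- Hypothetical field type with a Sámi stop filter; adjust names and paths to your schema -->
<fieldType name="text_sme" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- StopFilterFactory drops tokens listed in the referenced stopword file -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_sme.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```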


tokee commented Jun 28, 2023

Addendum: @anjackson and I have talked about making webarchive-discovery more easily extensible for special processing. If you are aware of an external language detector that supports Sámi languages and have time to implement it, this could be a fine task for introducing such an extension mechanism.

If such a detector is available in Java, it should be possible to write a custom webarchive-discovery plugin, although I don't have experience with the process - Andy might be able to give hints here?


anjackson commented Jun 28, 2023

FWIW, when trying to detect Scots Gaelic, I made some notes here: ukwa/webarchive-discovery#94

As noted there, the Optimaize project appears to be dead. There is a new option called Lingua (https://github.com/pemistahl/lingua) that seems to be active, so one option would be to add Sámi support there and then add a Tika or webarchive-discovery module that uses it. If anyone has working Java code, I can try making a suitable module. (/CCing @tballison in case he has any plans for Tika in this regard...)
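To make the Lingua route more concrete, here is a minimal, self-contained sketch of how the library is called from Java, based on its README. Lingua does not currently ship Sámi profiles, so the language set below is only a placeholder for whatever the detector ends up supporting, and the class name is mine:

```java
import com.github.pemistahl.lingua.api.Language;
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;

public class LinguaSketch {
    public static void main(String[] args) {
        // Restrict the detector to the languages expected in the archive.
        // Sámi languages would have to be added to Lingua itself before
        // they could be listed here.
        LanguageDetector detector = LanguageDetectorBuilder
                .fromLanguages(Language.ENGLISH, Language.ESTONIAN, Language.FINNISH)
                .build();

        Language detected = detector.detectLanguageOf("some extracted text");

        // Lowercase ISO 639-1 code, roughly the form stored in content_language.
        System.out.println(detected.getIsoCode639_1().toString().toLowerCase());
    }
}
```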

Here's an example text analyser class. This could be updated to use something else instead of Tika's wrapped version of Optimaize: https://github.com/ukwa/webarchive-discovery/blob/master/warc-indexer/src/main/java/uk/bl/wa/analyser/text/LanguageAnalyser.java

Addendum: The only issue with adding modules to Tika or webarchive-discovery is that if they require complex dependencies, that can become a bit of a nightmare. That's part of the reason why I'm interested in supporting an additional workflow that generates JSONL and/or CDXJ files containing the extracted text in one pass, and then writing more modular tools that consume and enrich those files.
