
Lack of language recognition for Sámi languages #379

Open
joncto opened this issue Jun 27, 2023 · 4 comments


joncto commented Jun 27, 2023

We are testing indexing our archive with this bundle in order to build a service for researchers. In the initial test, we found that texts in Sámi languages are identified as Estonian (content_language:"et").

We have stopword lists for five Sámi languages that we would like to use, but we are not sure how this is best implemented. Would you be interested in adding support for Sámi languages to the bundle?

All resources are provided by Giellatekno at the University of Tromsø and licensed under the GNU General Public License, version 3:

Northern Sámi ("sme")
Lule Sámi ("smj")
Southern Sámi ("sma")
Skolt Sámi ("sms"
Inari Sámi ("smn")

@thomasegense

Language detection is done in the warc-indexer. See config3.xml next to the warc-indexer. Add the language code if it is not there, but I do not know how good the support for Sámi is.
Also, moving the languages you expect to find most often to the top will speed up indexing a little, since they can be matched faster.


tokee commented Jun 28, 2023

@joncto language detection is handled by optimaize/language-detector 0.6 and unfortunately it does not seem to support Sámi languages. If you only index content in Sámi languages, you might want to turn off detection. If you have some content in other languages, e.g. English, trimming the langdetectprofiles down to those other languages should reduce the number of false positives.

Stopwords are not supported directly in SolrWayback or webarchive-discovery but can be added to the Solr schema, which should have the intended effect. See the Solr documentation on Stop Filter for details.

The schema.xml provided by the webarchive-discovery project, as well as its mirror in the SolrWayback bundle, uses stopwords for the path field. This can be used as a sample for applying a similar filter to the text_general field type in the schema, using the lists you link to.
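For illustration, a minimal sketch of what such a filter could look like in schema.xml, assuming the Northern Sámi stopword list has been saved as lang/stopwords_sme.txt in the Solr config directory (the field type name and file path are just examples, not part of the shipped schema):

```xml
<!-- Hypothetical field type with a Sámi stop filter; adjust names and paths to your schema -->
<fieldType name="text_sme" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- StopFilterFactory drops tokens listed in the referenced stopword file -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_sme.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```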


tokee commented Jun 28, 2023

Addendum: @anjackson and I have talked about making webarchive-discovery more easily extensible for special processing. If you are aware of an external language detector that supports Sámi languages and have time to implement it, this could be a fine task for introducing such an extension mechanism.

If such a detector is available in Java, it should be possible to write a custom webarchive-discovery plugin, although I don't have experience with the process - Andy might be able to give hints here?


anjackson commented Jun 28, 2023

FWIW, when trying to detect Scots Gaelic, I made some notes here: ukwa/webarchive-discovery#94

As noted there, the Optimaize project appears to be dead. There is a new option called Lingua (https://github.com/pemistahl/lingua) that seems to be active, so one option would be to add Sámi support there and then add a Tika or webarchive-discovery module that uses it. If anyone has working Java code, I can try making a suitable module. (/CCing @tballison in case he has any plans for Tika in this regard...)
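To make the Lingua route more concrete, here is a minimal, self-contained sketch of how the library is called from Java, based on its README. Lingua does not currently ship Sámi profiles, so the language set below is only a placeholder for whatever the detector ends up supporting, and the class name is mine:

```java
import com.github.pemistahl.lingua.api.Language;
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;

public class LinguaSketch {
    public static void main(String[] args) {
        // Restrict the detector to the languages expected in the archive.
        // Sámi languages would have to be added to Lingua itself before
        // they could be listed here.
        LanguageDetector detector = LanguageDetectorBuilder
                .fromLanguages(Language.ENGLISH, Language.ESTONIAN, Language.FINNISH)
                .build();

        Language detected = detector.detectLanguageOf("some extracted text");

        // Lowercase ISO 639-1 code, roughly the form stored in content_language.
        System.out.println(detected.getIsoCode639_1().toString().toLowerCase());
    }
}
```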

Here's an example text analyser class. This could be updated to use something else instead of Tika's wrapped version of Optimaize: https://github.com/ukwa/webarchive-discovery/blob/master/warc-indexer/src/main/java/uk/bl/wa/analyser/text/LanguageAnalyser.java

Addendum: The only issue with adding modules to Tika or webarchive-discovery is that if they require complex dependencies, that can become a bit of a nightmare. That's part of the reason why I'm interested in supporting an additional workflow that generates JSONL and/or CDXJ files containing the extracted text in one pass, and then writing more modular tools that consume and enrich those files.
