Lack of language recognition for Sámi languages #379
Comments
Language detection is done in the warc-indexer. See config3.xml next to the warc-indexer. Add the language code if it is not there. But I don't know how good the support for Sámi is.
@joncto Language detection is handled by optimaize/language-detector 0.6 and unfortunately it does not seem to support Sámi languages. If you only index content using Sámi languages, you might want to turn off detection. If you have some content in other languages, e.g. English, trimming the

Stopwords are not supported directly in SolrWayback or webarchive-discovery but can be added to the Solr schema, which should have the intended effect. See the Solr documentation on Stop Filter for details.
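To make the intended effect of stopword filtering concrete, here is a minimal Java sketch of the idea: parse a stopword list and drop matching tokens from a text. It is not the Solr Stop Filter itself and is not part of SolrWayback or webarchive-discovery. The class name `StopwordFilterSketch`, the one-word-per-line format with `#` comments, and the sample words are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Solr,
// SolrWayback or webarchive-discovery.
class StopwordFilterSketch {

    // Parses a stopword list with one word per line. Blank lines and lines
    // starting with '#' are skipped (an assumed file format).
    static Set<String> parseStopwords(List<String> lines) {
        Set<String> stopwords = new HashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && !word.startsWith("#")) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    // Splits text on whitespace, lowercases each token and drops stopwords.
    static List<String> removeStopwords(String text, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String normalised = token.toLowerCase(Locale.ROOT);
            if (!normalised.isEmpty() && !stopwords.contains(normalised)) {
                kept.add(normalised);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder entries, not real Sámi stopwords.
        Set<String> stopwords =
                parseStopwords(Arrays.asList("# demo list", "foo", "Bar", ""));
        System.out.println(removeStopwords("Foo alpha bar beta", stopwords));
    }
}
```

In an actual deployment the filtering would presumably happen inside the Solr analysis chain rather than in application code; the sketch only shows what such a filter does to the token stream.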
Addendum: @anjackson and I talked about making webarchive-discovery more easily extensible for special processing. If you are aware of an external language detector that supports Sámi languages and have time for implementation, this could be a fine task for implementing such an extension mechanism. If such a detector is available in Java, it should be possible to write a custom webarchive-discovery plugin, although I don't have experience with the process - Andy might give hints here?
FWIW, when trying to detect Scots Gaelic, I made some notes here: ukwa/webarchive-discovery#94

As noted there, the Optimaize project appears to be dead. There is a new option called Lingua (https://github.com/pemistahl/lingua) that seems to be active, so one option would be to add Sámi support there and then add a Tika or webarchive-discovery module that uses it. If anyone has working Java code, I can try making a suitable module. (/CCing @tballison in case he has any plans for Tika in this regard...)

Here's an example text analyser class. This could be updated to use something else instead of Tika's wrapped version of Optimaize: https://github.com/ukwa/webarchive-discovery/blob/master/warc-indexer/src/main/java/uk/bl/wa/analyser/text/LanguageAnalyser.java

Addendum: The only issue with adding modules to Tika or webarchive-discovery is that if they require complex dependencies, that can become a bit of a nightmare. That's part of the reason why I'm interested in supporting an additional workflow that generates JSONL and/or CDXJ files containing the extracted text in one pass, and then writing more modular tools that consume and enrich those files.
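As a rough illustration of what a replacement detector could look like before a proper library is available, the sketch below guesses a language from stopword overlap. This is a deliberately simple swapped-in heuristic, not the Lingua, Tika or Optimaize API and not the existing LanguageAnalyser class. The class name `StopwordLanguageGuesser`, its methods, the threshold and the sample words are hypothetical, and closely related languages may share too many function words for this approach to separate them reliably.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a pluggable detector; not the Lingua,
// Tika or Optimaize API.
class StopwordLanguageGuesser {

    // Language code (e.g. "sme") mapped to its lowercased stopwords.
    private final Map<String, Set<String>> stopwordsByLanguage = new LinkedHashMap<>();

    // Registers a stopword list under a language code.
    void addLanguage(String code, String... stopwords) {
        Set<String> set = new HashSet<>();
        for (String word : stopwords) {
            set.add(word.toLowerCase(Locale.ROOT));
        }
        stopwordsByLanguage.put(code, set);
    }

    // Returns the code whose stopwords cover the largest share of tokens.
    // Falls back if no language reaches minShare or has any hit at all;
    // on a tie, the language registered first wins.
    String guess(String text, double minShare, String fallback) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        String best = fallback;
        double bestShare = 0.0;
        for (Map.Entry<String, Set<String>> entry : stopwordsByLanguage.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            double share = (double) hits / tokens.length;
            if (share >= minShare && share > bestShare) {
                best = entry.getKey();
                bestShare = share;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        StopwordLanguageGuesser guesser = new StopwordLanguageGuesser();
        // Placeholder tokens, not real Sámi stopwords.
        guesser.addLanguage("sme", "aa", "bb");
        guesser.addLanguage("smj", "cc", "dd");
        System.out.println(guesser.guess("aa xx bb yy", 0.25, "und"));
    }
}
```

A module along these lines could, in principle, be wrapped in a text analyser, but how it would plug into webarchive-discovery is left open here.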
We are testing indexing our archive with this bundle, with the aim of building a service for researchers. In the initial test, we found that texts in Sámi languages are identified as Estonian (content_language:"et").
We have stopword lists for five Sámi languages that we would like to use, but we are not sure how best to implement this. Would you be interested in implementing support for Sámi languages in the bundle?
All resources are provided by Giellatekno at the University of Tromsø and are licensed under the GNU General Public License, version 3:
Northern Sámi ("sme")
Lule Sámi ("smj")
Southern Sámi ("sma")
Skolt Sámi ("sms")
Inari Sámi ("smn")