Teach Tika to spot Scots Gaelic #94
Comments
I would maybe recommend using CLD2 instead of Tika for language recognition. It has a lot more languages, is more accurate and orders of magnitude faster. Gaelic is included:
Thanks for this. I'm aware of CLD2 but chose not to use it, as it requires bundling a native library rather than being something I can trivially reuse from Java. Of course this is not insurmountable, but I'm not currently sure how best to package these kinds of dependencies, especially when running map-reduce jobs.
Noting also that there appears to be a new set of libraries in multiple programming languages, with clearer support for adding new natural languages:
Adding that CommonCrawl have a Java wrapper for CLD2, but it's a bit of a pain to work with as it has to be built locally and doesn't bundle binaries etc. https://github.com/commoncrawl/language-detection-cld2
To summarize what happened with Scots Gaelic: I did work up a contribution to Optimaize, but that project appears to be dead (optimaize/language-detector#81). However, IIRC, the detector did not appear to be good at distinguishing Scots Gaelic from Irish Gaelic. The new tricks used by Lingua might pay off.
The library used by Tika already spots Welsh, but needs to be taught to spot Scots Gaelic (gd). Detailed instructions here.
If we can get a reasonable chunk of text from our NLS colleagues, we should be able to add this easily enough. We might also be able to improve the Welsh language detection by providing data from a larger corpus.
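To make the "chunk of text" idea concrete, here is a minimal sketch of profile-based detection with character trigrams. It is not Tika's or Optimaize's actual API or algorithm; the function names, the cosine-similarity scoring and the two tiny sample sentences are all assumptions chosen for illustration. A real profile would need a far larger corpus.

```python
from collections import Counter
import math


def ngram_profile(text, n=3):
    """Return relative frequencies of character n-grams in `text`."""
    # Lowercase, collapse whitespace, and pad so word boundaries count.
    padded = " " + " ".join(text.lower().split()) + " "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}


def cosine(p, q):
    """Cosine similarity between two sparse n-gram profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def detect(text, profiles):
    """Return the language code whose profile is most similar to `text`."""
    sample = ngram_profile(text)
    return max(profiles, key=lambda code: cosine(sample, profiles[code]))


# Hypothetical, deliberately tiny training samples (illustration only).
TRAINING = {
    "gd": "Tha mi gu math, tapadh leat. Ciamar a tha thu an-diugh?",
    "cy": "Rydw i'n dda iawn, diolch. Sut wyt ti heddiw?",
}

# "Teaching" a language here just means building a profile from its text.
PROFILES = {code: ngram_profile(text) for code, text in TRAINING.items()}

print(detect("Ciamar a tha thu?", PROFILES))
print(detect("Sut wyt ti?", PROFILES))
```

With samples this small the profiles only match text that shares trigrams with them, which is why a larger corpus should help both the new Gaelic profile and the existing Welsh one.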
EDIT: It would be interesting to teach it Scots too, but there might be technical barriers as the language detector appears to use two-character ISO 639-1 codes and Scots doesn't have one of those! The same applies to Scottish English, but that would probably be rather hard to spot anyway.
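The code problem can be sketched as a small lookup. The ISO codes below are the standard ones (Scots has only the three-letter code `sco`), but the table and the two helper functions are hypothetical and not part of Tika or its language detector. The fallback to a three-letter code follows the BCP 47 convention of using the shortest available ISO 639 code, and is only one possible workaround.

```python
# (ISO 639-1, ISO 639-3) codes; None means no two-letter code exists.
ISO_CODES = {
    "Welsh": ("cy", "cym"),
    "Scottish Gaelic": ("gd", "gla"),
    "Irish": ("ga", "gle"),
    "Scots": (None, "sco"),
}


def two_letter_key(language):
    """Return the ISO 639-1 code, as a two-letter-only detector would need."""
    code = ISO_CODES[language][0]
    if code is None:
        raise ValueError(f"{language} has no ISO 639-1 code")
    return code


def shortest_tag(language):
    """Return the two-letter code if one exists, else the three-letter code."""
    short, long_code = ISO_CODES[language]
    return short if short is not None else long_code


print(two_letter_key("Scottish Gaelic"))
print(shortest_tag("Scots"))
```

Whether the detector could accept a three-letter key like this is exactly the open question raised above.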