
Teach Tika to spot Scots Gaelic #94

Open
anjackson opened this issue Jul 5, 2017 · 5 comments
anjackson commented Jul 5, 2017

The language-detection library used by Tika already spots Welsh, but needs to be taught to spot Scots Gaelic (gd). Detailed instructions here.

The training text should be rather clean; it is a good idea to remove parts written in other languages (such as English phrases, or Latin-script content in a Cyrillic text). Some also like to remove proper nouns, such as (international) place names, in case there are too many. It's up to you how far you go. As a general rule, the cleaner the text, the better its profile. If you scrape text from Wikipedia, please only use the main content, without the left-side navigation etc.
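The cleaning step described above could be sketched roughly as follows. This is a minimal illustration in Python, not part of the Tika tooling (which is Java); the function name and the thresholds are hypothetical heuristics for dropping navigation fragments, number-heavy lines, and stray non-Latin-script content:

```python
import unicodedata

def clean_training_text(raw, min_words=4, min_letter_ratio=0.6):
    """Filter scraped text down to lines usable as training data.

    Hypothetical heuristics: drop short navigation fragments, lines
    dominated by digits or punctuation, and lines containing letters
    from a non-Latin script (which would pollute a Gaelic profile).
    """
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if len(line.split()) < min_words:
            continue  # likely a menu item or heading fragment
        letters = [c for c in line if c.isalpha()]
        if not letters or len(letters) / len(line) < min_letter_ratio:
            continue  # mostly digits/punctuation, e.g. tables or timestamps
        if any(not unicodedata.name(c, "").startswith("LATIN") for c in letters):
            continue  # stray non-Latin script (e.g. Cyrillic)
        kept.append(line)
    return "\n".join(kept)
```

Real cleaning would go further (e.g. spotting embedded English phrases), which is much harder to automate; this only covers the mechanical part.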

If we can get a reasonable chunk of text from our NLS colleagues, we should be able to add this easily enough. We might also be able to improve the Welsh language detection by providing data from a larger corpus.

EDIT: It would be interesting to teach it Scots too, but there may be a technical barrier: the language detector appears to use two-character ISO 639-1 codes, and Scots doesn't have one (it only has the three-letter ISO 639-3 code, sco). The same applies to Scottish English, though that would probably be rather hard to spot anyway.
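The code-length constraint above can be made concrete with a small sketch. The mapping and function here are hypothetical, not the detector's actual API; they just show why a detector whose profiles are keyed on ISO 639-1 codes has a slot for Scots Gaelic (gd) but none for Scots (sco):

```python
# Hypothetical stand-in for a detector keyed on two-letter ISO 639-1 codes.
ISO_639_1 = {"gd": "Scots Gaelic", "cy": "Welsh", "ga": "Irish"}

def register_profile(code):
    """Mimic a detector that validates profile names as ISO 639-1 codes."""
    if len(code) != 2 or code not in ISO_639_1:
        # Scots only has the three-letter ISO 639-3 code "sco",
        # so it can never pass this check.
        raise ValueError(f"not a known ISO 639-1 code: {code!r}")
    return ISO_639_1[code]
```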

@anjackson anjackson added this to the 3.1.0 Release milestone Jul 5, 2017
@anjackson anjackson self-assigned this Jul 5, 2017
@ymaurer

ymaurer commented Aug 22, 2017

I would recommend using CLD2 instead of Tika for language recognition: it supports many more languages, is more accurate, and is orders of magnitude faster. Scots Gaelic is included:
{"scots_gaelic", "gd", SCOTS_GAELIC + W10, 0}
as is Welsh:
{"welsh", "cy", WELSH + W10, 0}
https://github.com/CLD2Owners/cld2
For a (quite old) comparison between the different systems, I can offer this article:
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

@anjackson
Contributor Author

Thanks for this. I'm aware of CLD2 but chose not to use it as it requires bundling a native library rather than being something I can trivially reuse from Java.

Of course this is not insurmountable, but I'm not currently sure how best to package these kinds of dependencies, especially when running map-reduce jobs.

@anjackson
Contributor Author

Noting also that there appears to be a new set of libraries in multiple programming languages, with clearer support for adding new natural languages:

@anjackson
Contributor Author

Adding that CommonCrawl has a Java wrapper for CLD2, but it's a bit of a pain to work with: it has to be built locally and doesn't bundle binaries etc. https://github.com/commoncrawl/language-detection-cld2

@anjackson
Contributor Author

To summarize what happened with Scots Gaelic: I worked up a contribution to Optimaize, but that project appears to be dead: optimaize/language-detector#81

However, IIRC, the detector did not appear to be good at distinguishing Scots Gaelic from Irish. The new techniques used by Lingua might pay off.
