
Teach Tika to spot Scots Gaelic #94

Open
anjackson opened this issue Jul 5, 2017 · 5 comments
anjackson commented Jul 5, 2017

The language-detection library used by Tika already spots Welsh, but needs to be taught to spot Scots Gaelic (gd). Detailed instructions here.

The training text should be rather clean; it is a good idea to remove parts written in other languages (such as English phrases, or Latin-script content in a Cyrillic text). Some also like to remove proper nouns, such as (international) place names, in case there are too many. It's up to you how far you go. As a general rule, the cleaner the text, the better its profile. If you scrape text from Wikipedia, please only use the main content, without the left-side navigation etc.
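The cleaning step described above could be sketched roughly as follows. This is a minimal illustration in Python, not part of the Tika tooling (which is Java); the function name and the thresholds are hypothetical heuristics for dropping navigation fragments, number-heavy lines, and stray non-Latin-script content:

```python
import unicodedata

def clean_training_text(raw, min_words=4, min_letter_ratio=0.6):
    """Filter scraped text down to lines usable as training data.

    Hypothetical heuristics: drop short navigation fragments, lines
    dominated by digits or punctuation, and lines containing letters
    from a non-Latin script (which would pollute a Gaelic profile).
    """
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if len(line.split()) < min_words:
            continue  # likely a menu item or heading fragment
        letters = [c for c in line if c.isalpha()]
        if not letters or len(letters) / len(line) < min_letter_ratio:
            continue  # mostly digits/punctuation, e.g. tables or timestamps
        if any(not unicodedata.name(c, "").startswith("LATIN") for c in letters):
            continue  # stray non-Latin script (e.g. Cyrillic)
        kept.append(line)
    return "\n".join(kept)
```

Real cleaning would go further (e.g. spotting embedded English phrases), which is much harder to automate; this only covers the mechanical part.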

If we can get a reasonable chunk of text from our NLS colleagues, we should be able to add this easily enough. We might also be able to improve the Welsh language detection by providing data from a larger corpus.

EDIT: It would be interesting to teach it Scots too, but there may be a technical barrier: the language detector appears to use two-character ISO 639-1 codes, and Scots doesn't have one (it only has the three-letter ISO 639-3 code, sco). The same applies to Scottish English, though that would probably be rather hard to spot anyway.
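The code-length constraint above can be made concrete with a small sketch. The mapping and function here are hypothetical, not the detector's actual API; they just show why a detector whose profiles are keyed on ISO 639-1 codes has a slot for Scots Gaelic (gd) but none for Scots (sco):

```python
# Hypothetical stand-in for a detector keyed on two-letter ISO 639-1 codes.
ISO_639_1 = {"gd": "Scots Gaelic", "cy": "Welsh", "ga": "Irish"}

def register_profile(code):
    """Mimic a detector that validates profile names as ISO 639-1 codes."""
    if len(code) != 2 or code not in ISO_639_1:
        # Scots only has the three-letter ISO 639-3 code "sco",
        # so it can never pass this check.
        raise ValueError(f"not a known ISO 639-1 code: {code!r}")
    return ISO_639_1[code]
```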

@anjackson anjackson added this to the 3.1.0 Release milestone Jul 5, 2017
@anjackson anjackson self-assigned this Jul 5, 2017
@ymaurer

ymaurer commented Aug 22, 2017

I would recommend using CLD2 instead of Tika for language recognition: it supports many more languages, is more accurate, and is orders of magnitude faster. Scots Gaelic is included:
{"scots_gaelic", "gd", SCOTS_GAELIC + W10, 0}
as is Welsh:
{"welsh", "cy", WELSH + W10, 0}
https://github.com/CLD2Owners/cld2
For a (quite old) comparison between the different systems, I can offer this article:
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

@anjackson
Contributor Author

Thanks for this. I'm aware of CLD2 but chose not to use it as it requires bundling a native library rather than being something I can trivially reuse from Java.

Of course this is not insurmountable, but I'm not currently sure how best to package these kinds of dependencies, especially when running map-reduce jobs.

@anjackson
Contributor Author

Noting also that there appears to be a new set of libraries in multiple programming languages, with clearer support for adding new natural languages:

@anjackson
Contributor Author

Adding that CommonCrawl has a Java wrapper for CLD2, but it's a bit of a pain to work with: it has to be built locally and doesn't bundle binaries etc. https://github.com/commoncrawl/language-detection-cld2

@anjackson
Contributor Author

To summarize what happened with Scots Gaelic: I worked up a contribution to Optimaize, but that project appears to be dead: optimaize/language-detector#81

However, IIRC, the detector did not appear to be good at distinguishing Scots Gaelic from Irish. The new techniques used by Lingua might pay off.
