AI Language Detection with LanguageChecker

This project provides a Python module for detecting the language of a given text with the fastText language identification model from Meta (formerly Facebook). It includes utilities for mapping between ISO 639-3 language codes and their corresponding language names, as well as a wrapper that makes the model easier to use.

Overview

The module consists of two main classes:

LanguageCodes: Handles mapping between ISO 639-3 language codes and their corresponding language names.
LanguageChecker: Uses the FastText language identification model to detect the language of a given text.
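
The kind of lookup LanguageCodes performs can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the helper name load_code_table and the two sample rows are hypothetical, and the sketch assumes the tab-separated layout of the SIL iso-639-3.tab table, of which only the Id and Ref_Name columns are read.

```python
import csv
import io

# Two sample rows in the assumed tab-separated layout of iso-639-3.tab.
# Only the Id and Ref_Name columns are used below.
SAMPLE_TAB = (
    "Id\tPart2B\tPart2T\tPart1\tScope\tLanguage_Type\tRef_Name\tComment\n"
    "eng\teng\teng\ten\tI\tL\tEnglish\t\n"
    "pol\tpol\tpol\tpl\tI\tL\tPolish\t\n"
)


def load_code_table(tab_text):
    """Return (code -> name, lowercase name -> code) dictionaries."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    code_to_name = {}
    name_to_code = {}
    for row in reader:
        code_to_name[row["Id"]] = row["Ref_Name"]
        # Lowercase keys make the name lookup case-insensitive.
        name_to_code[row["Ref_Name"].lower()] = row["Id"]
    return code_to_name, name_to_code


code_to_name, name_to_code = load_code_table(SAMPLE_TAB)
print(code_to_name["eng"])     # English
print(name_to_code["polish"])  # pol
```

Keeping both directions in plain dictionaries makes each lookup a single hash access, at the cost of holding the table in memory twice.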

Features

Language Detection: Detects the language of input text with high accuracy.
Confidence Scores: Provides confidence scores for predictions.
Language Mapping: Converts between ISO 639-3 language codes and language names.
Top K Predictions: Retrieves the top K language candidates for a given text.
Exception Handling: Raises exceptions for low-confidence predictions when a certainty threshold is specified.
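
The certainty threshold described above could work roughly like this. The function name require_certainty is hypothetical and the sketch is independent of the real class; only the ValueError type and the message wording are taken from the usage examples in this README.

```python
def require_certainty(language, confidence, certainty=None):
    """Return language, or raise ValueError if confidence is below certainty."""
    if certainty is not None and confidence < certainty:
        raise ValueError(
            f"Language detection confidence {confidence:.2f} "
            f"is below the threshold of {certainty:.2f}."
        )
    return language


# Without a threshold, any prediction is accepted.
print(require_certainty("English", 0.86))

# With a threshold, a low-confidence prediction raises instead.
try:
    require_certainty("English", 0.86, certainty=0.999)
except ValueError as e:
    print(f"Error: {e}")
```

Note that a threshold of 0.999 is printed as 1.00 because the message rounds to two decimal places.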

Installation

If you have git installed, you can install directly from GitHub with pip:

pip install git+https://github.com/laelhalawani/language_checker.git

Otherwise, download the repository, cd into its top-level folder, and install with pip:

pip install .

The package is not yet available on PyPI; it will be published there with the first release.

Usage

The LanguageChecker class provides various methods for language detection and comparison. Here are some examples of how to use it:

Basic Usage

from language_checker import LanguageChecker

checker = LanguageChecker()

# Simple language prediction
text_en = "Hello, how are you?"
language_name = checker.predict_language(text_en)
print(f"Predicted language: {language_name}")
# Output: Predicted language: English

Language Prediction with Certainty Threshold

# Language prediction with defined minimum acceptable certainty
text_mixed = "Gut morning, jak are du?"
try:
    language_name = checker.predict_language(text_mixed, certainty=0.999)
except ValueError as e:
    print(f"Error: {e}")
# Output: Error: Language detection confidence 0.86 is below the threshold of 1.00.

Language Prediction with Confidence Score

# Language prediction that also returns the model's confidence
language_name, confidence = checker.predict_language_and_certainty(text_en)
print(f"Predicted language: {language_name} with confidence: {confidence:.2f}")
# Output: Predicted language: English with confidence: 1.00

Getting Language Candidates

# Getting language candidates with confidence scores
candidates = checker.predict_language_candidates(text_en, k=3)
print("Language candidates:")
for name, confidence in candidates:
    print(f"\t{name}: {confidence:.6f}")
# Output:
# Language candidates:
#     English: 1.000006
#     Italian: 0.000011
#     Romanian: 0.000011
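
For readers curious how raw model output might become the (name, confidence) pairs above, here is a rough sketch. fastText prefixes predicted labels with __label__ and returns them alongside a parallel sequence of scores. Everything else here is an assumption: the helper name to_candidates is hypothetical, the labels are assumed to carry an ISO 639-3 code optionally followed by a script suffix such as _Latn, and the small code table is sample data.

```python
LABEL_PREFIX = "__label__"


def to_candidates(labels, scores, code_to_name):
    """Pair each label's language name with its score, highest score first."""
    candidates = []
    for label, score in zip(labels, scores):
        if label.startswith(LABEL_PREFIX):
            label = label[len(LABEL_PREFIX):]
        # Assumed label shape: "eng" or "eng_Latn"; keep only the code part.
        code = label.split("_")[0]
        # Fall back to the bare code when the table has no name for it.
        candidates.append((code_to_name.get(code, code), float(score)))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Sample table and scores; the scores mirror the example output above.
names = {"eng": "English", "ita": "Italian", "ron": "Romanian"}
labels = ("__label__eng_Latn", "__label__ita_Latn", "__label__ron_Latn")
scores = [1.000006, 0.000011, 0.000011]

for name, confidence in to_candidates(labels, scores, names):
    print(f"\t{name}: {confidence:.6f}")
```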

Checking Specific Language

# Checking if a text is in a specific language
is_en = checker.is_language("english", text_en)
print(f"Is text confirmed to be in English: {is_en}")
# Output: Is text confirmed to be in English: True

# Checking with a high certainty threshold for a mixed language text
is_en = checker.is_language("english", text_mixed, certainty=0.999)
print(f"Is text confirmed to be in English: {is_en}")
# Output:
# WARNING:root:Failed to predict language for text: Gut morning, jak are du?, returning False.
# Is text confirmed to be in English: False
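
The warning-and-False behaviour shown above can be approximated as follows. This is a sketch under stated assumptions rather than the class's real code: it is written as a standalone function that takes a predict callable, and fake_predict is a hypothetical stand-in that treats every text as English with confidence 0.86.

```python
import logging


def is_language(expected_name, text, predict, certainty=None):
    """Return True if predict() names the expected language for text.

    predict is any callable (text, certainty) -> language name that raises
    ValueError when the prediction confidence is below certainty.
    """
    try:
        predicted = predict(text, certainty)
    except ValueError:
        # A low-confidence prediction is reported and treated as "no match".
        logging.warning(
            "Failed to predict language for text: %s, returning False.", text
        )
        return False
    return predicted.lower() == expected_name.lower()


def fake_predict(text, certainty=None):
    # Hypothetical stand-in: every text is "English" with confidence 0.86.
    if certainty is not None and 0.86 < certainty:
        raise ValueError("confidence below threshold")
    return "English"


print(is_language("english", "Hello, how are you?", fake_predict))
print(is_language("english", "Gut morning, jak are du?", fake_predict, certainty=0.999))
```

Catching the exception inside the check lets callers use the result directly in a condition without their own try/except.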

Comparing Multiple Texts

# Checking if multiple texts are in the same language
text_en_2 = "Hi, I'm fine. How are you?"
text_pl = "Cześć, jak się masz?"

is_same_language = checker.is_same_language(text_en, text_en_2)
print(f"Are the texts in the same language: {is_same_language}")
# Output: Are the texts in the same language: True

# Checking with a high certainty threshold for an English text and a mixed-language text
is_same_language = checker.is_same_language(text_en, text_mixed, certainty=0.999)
print(f"Are the texts in the same language: {is_same_language}")
# Output:
# WARNING:root:Failed to predict language for text: Gut morning, jak are du?, returning False.
# Are the texts in the same language: False

# Checking texts in different languages
is_same_language = checker.is_same_language(text_en, text_pl, certainty=0.8)
print(f"Are the texts in the same language: {is_same_language}")
# Output: Are the texts in the same language: False
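
One plausible way to implement such a comparison is to collect the predicted language of every text into a set and check that exactly one language remains. The sketch below is illustrative only: the standalone function signature, the fake_predict stand-in, and the canned confidence values are all hypothetical.

```python
import logging


def is_same_language(*texts, predict, certainty=None):
    """Return True only if every text is predicted as the same language."""
    languages = set()
    for text in texts:
        try:
            languages.add(predict(text, certainty))
        except ValueError:
            logging.warning(
                "Failed to predict language for text: %s, returning False.", text
            )
            return False
    return len(languages) == 1


# Hypothetical canned predictions standing in for the real model.
CANNED = {
    "Hello, how are you?": ("English", 1.0),
    "Hi, I'm fine. How are you?": ("English", 1.0),
    "Cześć, jak się masz?": ("Polish", 1.0),
    "Gut morning, jak are du?": ("English", 0.86),
}


def fake_predict(text, certainty=None):
    language, confidence = CANNED[text]
    if certainty is not None and confidence < certainty:
        raise ValueError("confidence below threshold")
    return language


print(is_same_language("Hello, how are you?", "Hi, I'm fine. How are you?",
                       predict=fake_predict))
print(is_same_language("Hello, how are you?", "Cześć, jak się masz?",
                       predict=fake_predict, certainty=0.8))
```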

These examples demonstrate the main functionality of the LanguageChecker class. You can use these methods in your projects to detect languages, compare the languages of several texts, and get predictions with confidence scores. The examples also show what happens when the detection confidence falls below the specified threshold: predict_language raises a ValueError, while is_language and is_same_language log a warning and return False.

References

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}

FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

(* These authors contributed equally.)
