Release Description - Version 2.2.0
We are pleased to announce a new release of our language model, which adds a dataset to extend its vocabulary coverage. Here are the key details of this release:
- Dataset Addition: This release integrates an additional dataset sourced from TamilCorpus. It increases the volume of available text data and provides a larger, more diverse collection of Tamil words.
- Summary of Total Dataset: With the new dataset included, the combined dataset in this release contains:
  - Total Count of Words: 4,591,656
  - Words with Frequency > 5: 699,092
  - Words with Frequency > 100: 88,350
  - Words with Frequency > 1,000: 14,743
These statistics reflect the breadth of the dataset: of the 4,591,656 words counted, about 15% have a frequency above 5 and about 2% have a frequency above 100. The broader coverage is intended to help the language model handle a wider range of topics and contexts.
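For readers who want to produce comparable figures for their own corpus, the following is a minimal sketch of how frequency-threshold counts like those above could be computed. It is not the script used for this release. It assumes that "Total Count of Words" means the number of distinct words and that the input is an already tokenized list; the function name `summarize_frequencies` and the toy data are hypothetical.

```python
from collections import Counter


def summarize_frequencies(tokens, thresholds=(5, 100, 1000)):
    """Return the number of distinct words and, for each threshold,
    how many distinct words occur strictly more often than it."""
    freq = Counter(tokens)  # word -> number of occurrences
    summary = {"total": len(freq)}
    for threshold in thresholds:
        summary[threshold] = sum(
            1 for count in freq.values() if count > threshold
        )
    return summary


# Toy corpus (hypothetical data, not taken from the release dataset).
tokens = ["அம்மா"] * 7 + ["வீடு"] * 6 + ["மரம்"] * 2 + ["நீர்"]
summary = summarize_frequencies(tokens, thresholds=(1, 5))
print(summary)
```

The comparison is strict (`count > threshold`) to match the "Frequency > N" wording in the summary. If the release totals count tokens rather than distinct words, `len(freq)` would need to be replaced with `sum(freq.values())`.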