How can I use the SentencePieceTokenizer? #21

Allakazan · 2024-05-30T04:30:14Z

Could you provide a description of how to use this tokenizer? I tried on my own but couldn't figure out how to make it work.

Thanks :)

keyvank · 2024-05-30T20:16:18Z

Feed your dataset.txt to the spm_train command (docs: https://github.com/google/sentencepiece).

This will generate a "vocab file". Use it to initialize the SentencePieceTokenizer:

SentencePieceTokenizer::load("model.vocab")
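
For readers wondering what the vocab file contains: spm_train writes a plain-text `.vocab` file next to the model, with one piece per line followed by a tab and a score, and the line position is the token id. The sketch below parses that format. It is only an illustration, not the tokenizer's actual loading code; `VocabEntry` and `parse_vocab` are hypothetical names introduced here.

```rust
/// One line of a SentencePiece `.vocab` file (hypothetical helper type).
#[derive(Debug, Clone, PartialEq)]
pub struct VocabEntry {
    pub piece: String,
    pub score: f32,
}

/// Parses the text of a `.vocab` file, assuming each line is `<piece>\t<score>`.
/// The index of an entry in the returned vector is its token id.
/// A malformed line is reported as an error so that ids never shift silently.
pub fn parse_vocab(contents: &str) -> Result<Vec<VocabEntry>, String> {
    contents
        .lines()
        .enumerate()
        .map(|(id, line)| -> Result<VocabEntry, String> {
            // Split on the last tab so the piece text is kept intact.
            let (piece, score) = line
                .rsplit_once('\t')
                .ok_or_else(|| format!("line {}: missing tab separator", id + 1))?;
            let score = score
                .trim()
                .parse::<f32>()
                .map_err(|e| format!("line {}: bad score: {}", id + 1, e))?;
            Ok(VocabEntry {
                piece: piece.to_string(),
                score,
            })
        })
        .collect()
}
```

To try it on a real file, read the file with `std::fs::read_to_string("model.vocab")` and pass the resulting string to `parse_vocab`.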

Allakazan · 2024-05-31T20:32:46Z

Thanks, it works :)

Now I will try to implement some sort of EOS_TOKEN in the tokenizer, to build a question/answer model.
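
A rough sketch of the EOS idea, working on token ids only: mark the end of the question and of the answer in the training data, then cut generated output at the first EOS. `EOS_ID`, `build_example`, and `truncate_at_eos` are hypothetical names, and the value 2 is an assumption based on SentencePiece's default id for `</s>`; check the id in your own vocab file before relying on it.

```rust
/// Assumed id of the end-of-sequence piece (`</s>` is id 2 with SentencePiece
/// defaults). Hypothetical constant: verify it against your own vocab file.
pub const EOS_ID: usize = 2;

/// Builds one training sequence: question ids, EOS, answer ids, EOS.
pub fn build_example(question: &[usize], answer: &[usize]) -> Vec<usize> {
    let mut ids = Vec::with_capacity(question.len() + answer.len() + 2);
    ids.extend_from_slice(question);
    ids.push(EOS_ID);
    ids.extend_from_slice(answer);
    ids.push(EOS_ID);
    ids
}

/// Returns the generated ids up to, but not including, the first EOS.
/// If no EOS was generated, the whole slice is returned unchanged.
pub fn truncate_at_eos(generated: &[usize]) -> &[usize] {
    match generated.iter().position(|&id| id == EOS_ID) {
        Some(end) => &generated[..end],
        None => generated,
    }
}
```

At inference time, the prompt would be the question ids followed by `EOS_ID`, and `truncate_at_eos` would be applied to whatever the model generates after it.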

keyvank · 2024-05-31T20:35:55Z

Sounds good! Let me know if you get good results!