AssertionError appears when trying to align embeddings #12
Hi! I think the issue is with this line:

which should be

since BERT uses subword tokenization (and hence the subwords need to be mapped to corpus tokens). This is an annoying aspect of the config structure as of now, since you might expect that the …
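For readers hitting the same error: a minimal sketch of what "mapping subwords to corpus tokens" means in practice, assuming WordPiece-style `##` continuation markers. The names and the averaging strategy here are illustrative, not taken from this repository:

```python
# Hypothetical sketch: align WordPiece subword vectors back to corpus
# tokens by grouping "##"-prefixed continuation pieces with their head
# piece and averaging the vectors in each group.

def group_subwords(subtokens):
    """Return index groups, one group per original word."""
    groups = []
    for i, tok in enumerate(subtokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)   # continuation piece joins previous word
        else:
            groups.append([i])     # a new word starts here
    return groups

def align(subtokens, vectors):
    """Average subword vectors within each group: one vector per word."""
    aligned = []
    for group in group_subwords(subtokens):
        dim = len(vectors[0])
        avg = [sum(vectors[i][d] for i in group) / len(group)
               for d in range(dim)]
        aligned.append(avg)
    return aligned

subtoks = ["em", "##bed", "##ding", "align", "##ment"]
vecs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [2.0, 2.0], [4.0, 4.0]]
print(align(subtoks, vecs))  # [[3.0, 4.0], [3.0, 3.0]]
```

With this grouping the number of aligned vectors matches the number of corpus words, which is what the failing assertion checks for.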
Hi,

Or maybe I need to disable assertions to proceed?
Was this ever resolved? I'm experiencing the same issue.
Not sure, but it seems like the process by which vectors are written to disk (which may happen independently of this codebase) and the tokenization performed when loading text from disk are producing differing numbers of tokens for the same sequence. This could be because different tokenizers are used, because the data isn't ordered the same way, or because of some preprocessing step; I'm not 100% sure. Added uncertainty: I'm not sure how the huggingface …
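To make the suspected failure mode concrete: two tokenizers run over the same sentence can simply disagree on the number of tokens, which is exactly what an assertion of the form `assert len(tokens) == len(vectors)` would catch. The tokenizations below are illustrative:

```python
# Illustrative only: two tokenizations of the same sentence can disagree
# in length, which would trip a length-equality assertion downstream.
sentence = "embedding alignment doesn't work"

whitespace_tokens = sentence.split()    # splits on spaces: 4 tokens
# a punctuation-aware tokenizer (as many NLP pipelines use) might give:
punct_tokens = ["embedding", "alignment", "does", "n't", "work"]  # 5 tokens

print(len(whitespace_tokens), len(punct_tokens))  # 4 5
```

So if the vectors on disk were produced with one tokenizer and the text is re-tokenized with another at load time, the counts will not line up.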
In case someone finds it useful, I wrote a version compatible with the new …
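A hedged sketch of the alignment pattern that newer huggingface `transformers` *fast* tokenizers enable via `encoding.word_ids()`: each subword position maps to a word index (or `None` for special tokens). The `word_ids` list below is hard-coded so the example runs without downloading a model; real code would obtain it from the tokenizer output, and the vectors here are made up:

```python
# Group subword vectors by word index and average them; None entries
# (special tokens like [CLS]/[SEP]) are dropped.
from collections import defaultdict

word_ids = [None, 0, 0, 1, 2, 2, None]   # as word_ids() might return
subword_vectors = [[0.0], [1.0], [3.0], [5.0], [2.0], [4.0], [0.0]]

buckets = defaultdict(list)
for wid, vec in zip(word_ids, subword_vectors):
    if wid is not None:                  # skip special tokens
        buckets[wid].append(vec)

# average the subword vectors belonging to each word
word_vectors = [
    [sum(col) / len(buckets[wid]) for col in zip(*buckets[wid])]
    for wid in sorted(buckets)
]
print(word_vectors)  # [[2.0], [5.0], [3.0]]
```

This yields one vector per word, regardless of how many pieces the tokenizer split each word into.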
I am trying to run an experiment on a new dataset. I followed your instructions, but every time I run the code an AssertionError appears. I don't know whether it is an issue or not; nevertheless, could you check my yaml file? I am working in Google Colab.
bert_exam (my yaml file) is located here: https://github.com/Awkwafina/files/blob/master/bert_exam