AssertionError appears when trying to align embeddings #12

Open
Awkwafina opened this issue Aug 27, 2020 · 6 comments

Comments

@Awkwafina

I am trying to run an experiment on a new dataset. I followed your instructions, but every time I run this code, an 'AssertionError' appears. I don't know whether this is a bug or not; nevertheless, could you check my yaml file? I am working in Google Colab.
bert_exam (my yaml file) is located here: https://github.com/Awkwafina/files/blob/master/bert_exam
bert_train

@john-hewitt
Owner

Hi! I think the issue is with this line:

    type: token #{token,subword}

which should be

    type: subword #{token,subword}

since BERT uses subword tokenization (and hence the subwords need to be mapped to corpus tokens).
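To illustrate what mapping subwords to corpus tokens involves, here is a minimal sketch, not the code in this repository. It assumes WordPiece-style pieces where a leading `##` marks a continuation of the previous word, and it averages the subword vectors of each word. The function names `group_wordpieces` and `average_by_group` are hypothetical, and plain Python lists stand in for the real embedding arrays.

```python
from typing import List, Sequence


def group_wordpieces(subwords: Sequence[str]) -> List[List[int]]:
    """Group subword indices into word-level groups.

    Assumption: a piece starting with '##' continues the previous word.
    Special tokens such as [CLS] and [SEP] end up in their own groups.
    """
    groups: List[List[int]] = []
    for i, piece in enumerate(subwords):
        if piece.startswith("##") and groups:
            groups[-1].append(i)  # continuation of the current word
        else:
            groups.append([i])    # start of a new word
    return groups


def average_by_group(vectors: Sequence[Sequence[float]],
                     groups: Sequence[Sequence[int]]) -> List[List[float]]:
    """Average the subword vectors belonging to each word-level group."""
    averaged: List[List[float]] = []
    for group in groups:
        dim = len(vectors[group[0]])
        averaged.append(
            [sum(vectors[i][d] for i in group) / len(group) for d in range(dim)]
        )
    return averaged


subwords = ["[CLS]", "un", "##aff", "##able", "dog", "[SEP]"]
groups = group_wordpieces(subwords)
vectors = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 9.0]]
word_vectors = average_by_group(vectors, groups)
```

Grouping by `##` alone will not reproduce the corpus tokens whenever the corpus and the tokenizer split text differently (for example around punctuation), so a real alignment also has to match against the original token strings.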

This is an annoying aspect of the config structure as of now, since you might expect that the BERT-disk flag alone would specify this, but alas.

@Awkwafina
Author

Hi,
I am still facing this issue. Maybe something is wrong with the HDF5 file?
At the failing assertion, single_layer_features.shape[0] is 68 and len(tokenized_sent) is 74:

    68 74
    [aligning embeddings]:  25% 3173/12543 [00:09<00:27, 339.59it/s]
    Traceback (most recent call last):
      File "/content/structural-probes/structural-probes/run_experiment.py", line 242, in <module>
        execute_experiment(yaml_args, train_probe=cli_args.train_probe, report_results=cli_args.report_results)
      File "/content/structural-probes/structural-probes/run_experiment.py", line 170, in execute_experiment
        expt_dataset = dataset_class(args, task)
      File "/content/structural-probes/structural-probes/data.py", line 34, in __init__
        self.train_obs, self.dev_obs, self.test_obs = self.read_from_disk()
      File "/content/structural-probes/structural-probes/data.py", line 65, in read_from_disk
        train_observations = self.optionally_add_embeddings(train_observations, train_embeddings_path)
      File "/content/structural-probes/structural-probes/data.py", line 408, in optionally_add_embeddings
        embeddings = self.generate_subword_embeddings_from_hdf5(observations, pretrained_embeddings_path, layer_index)
      File "/content/structural-probes/structural-probes/data.py", line 398, in generate_subword_embeddings_from_hdf5
        assert single_layer_features.shape[0] == len(tokenized_sent)
    AssertionError

@Awkwafina
Author

Or maybe I need to disable assertions to proceed?

@joebartusek

Was this ever resolved? I'm experiencing the same issue.

@john-hewitt
Owner

Not sure, but it seems that the process by which vectors are written to disk (which may happen independently of this codebase) and the tokenization performed when loading text from disk are producing different numbers of tokens for the same sequence. This could be because different tokenizers are used, because the data isn't ordered the same way, or because of some preprocessing difference; I'm not 100% sure.
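One way to narrow this down is to compare, sentence by sentence, how many vectors were stored against how many subwords the loader produces. The sketch below is only an illustration under assumptions: `find_length_mismatches` is a hypothetical helper, `vector_counts` stands for the first dimension of each sentence's stored features, and `tokenized_sents` stands for the tokenizer output in the same order.

```python
from typing import List, Sequence, Tuple


def find_length_mismatches(
    vector_counts: Sequence[int],
    tokenized_sents: Sequence[Sequence[str]],
) -> List[Tuple[int, int, int]]:
    """Return (sentence_index, n_vectors, n_subwords) for every mismatch."""
    if len(vector_counts) != len(tokenized_sents):
        # A different number of sentences already points to an ordering
        # or preprocessing problem rather than a tokenizer difference.
        raise ValueError(
            f"{len(vector_counts)} stored sentences vs "
            f"{len(tokenized_sents)} tokenized sentences"
        )
    mismatches: List[Tuple[int, int, int]] = []
    for idx, (n_vectors, sent) in enumerate(zip(vector_counts, tokenized_sents)):
        if n_vectors != len(sent):
            mismatches.append((idx, n_vectors, len(sent)))
    return mismatches


# Toy data reproducing the 68 vs 74 case reported above at one sentence.
vector_counts = [3, 68, 2]
tokenized_sents = [["a", "b", "c"], ["x"] * 74, ["y", "z"]]
mismatches = find_length_mismatches(vector_counts, tokenized_sents)
```

As a rough heuristic, mismatches on nearly every sentence after some index suggest the two files are ordered differently, while isolated mismatches suggest the vectors were written with a different tokenizer or preprocessing step.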

Also, an added uncertainty: I'm not sure how the tokenizer API of the huggingface transformers module has changed since I wrote this code, back when it was still pytorch-pretrained-BERT, not transformers.

@caspillaga

In case someone finds it useful, I wrote a version compatible with the new transformers library.
I posted the main changes in issue #13.

4 participants