AssertionError appears when trying to align embeddings #12
Hi! I think the issue is with this line:

which should be

since BERT uses subword tokenization (and hence the subwords need to be mapped to corpus tokens). This is an annoying aspect of the config structure as of now, since you might expect that the …
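For readers hitting the same error: a minimal sketch of what "mapping subwords to corpus tokens" means in practice, assuming WordPiece-style `##` continuation markers. The names and the averaging strategy here are illustrative, not taken from this repository:

```python
# Hypothetical sketch: align WordPiece subword vectors back to corpus
# tokens by grouping "##"-prefixed continuation pieces with their head
# piece and averaging the vectors in each group.

def group_subwords(subtokens):
    """Return index groups, one group per original word."""
    groups = []
    for i, tok in enumerate(subtokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)   # continuation piece joins previous word
        else:
            groups.append([i])     # a new word starts here
    return groups

def align(subtokens, vectors):
    """Average subword vectors within each group: one vector per word."""
    aligned = []
    for group in group_subwords(subtokens):
        dim = len(vectors[0])
        avg = [sum(vectors[i][d] for i in group) / len(group)
               for d in range(dim)]
        aligned.append(avg)
    return aligned

subtoks = ["em", "##bed", "##ding", "align", "##ment"]
vecs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [2.0, 2.0], [4.0, 4.0]]
print(align(subtoks, vecs))  # [[3.0, 4.0], [3.0, 3.0]]
```

With this grouping the number of aligned vectors matches the number of corpus words, which is what the failing assertion checks for.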
Hi,

Or maybe I need to disable assertions to proceed?
Was this ever resolved? I'm experiencing the same issue.
Not sure, but it seems like the process by which vectors are written to disk (which may happen independently of this codebase) and the tokenization performed when loading text from disk are producing differing numbers of tokens for the same sequence. This could be because different tokenizers are used, because the data isn't ordered the same way, or because of some preprocessing step; I'm not 100% sure. Added uncertainty: I'm not sure how the huggingface …
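To make the suspected failure mode concrete: two tokenizers run over the same sentence can simply disagree on the number of tokens, which is exactly what an assertion of the form `assert len(tokens) == len(vectors)` would catch. The tokenizations below are illustrative:

```python
# Illustrative only: two tokenizations of the same sentence can disagree
# in length, which would trip a length-equality assertion downstream.
sentence = "embedding alignment doesn't work"

whitespace_tokens = sentence.split()    # splits on spaces: 4 tokens
# a punctuation-aware tokenizer (as many NLP pipelines use) might give:
punct_tokens = ["embedding", "alignment", "does", "n't", "work"]  # 5 tokens

print(len(whitespace_tokens), len(punct_tokens))  # 4 5
```

So if the vectors on disk were produced with one tokenizer and the text is re-tokenized with another at load time, the counts will not line up.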
In case someone finds it useful, I wrote a version compatible with the new …
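A hedged sketch of the alignment pattern that newer huggingface `transformers` *fast* tokenizers enable via `encoding.word_ids()`: each subword position maps to a word index (or `None` for special tokens). The `word_ids` list below is hard-coded so the example runs without downloading a model; real code would obtain it from the tokenizer output, and the vectors here are made up:

```python
# Group subword vectors by word index and average them; None entries
# (special tokens like [CLS]/[SEP]) are dropped.
from collections import defaultdict

word_ids = [None, 0, 0, 1, 2, 2, None]   # as word_ids() might return
subword_vectors = [[0.0], [1.0], [3.0], [5.0], [2.0], [4.0], [0.0]]

buckets = defaultdict(list)
for wid, vec in zip(word_ids, subword_vectors):
    if wid is not None:                  # skip special tokens
        buckets[wid].append(vec)

# average the subword vectors belonging to each word
word_vectors = [
    [sum(col) / len(buckets[wid]) for col in zip(*buckets[wid])]
    for wid in sorted(buckets)
]
print(word_vectors)  # [[2.0], [5.0], [3.0]]
```

This yields one vector per word, regardless of how many pieces the tokenizer split each word into.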
I am trying to run an experiment on a new dataset. I followed your instructions, but every time I run the code an AssertionError appears. I don't know whether it is an issue or not; nevertheless, could you check my yaml file? I am working in Google Colab.
bert_exam (my yaml file) is located here: https://github.com/Awkwafina/files/blob/master/bert_exam