Fix models like phi-3 which have a mismatched tokenizer definition and model output tensor size #792

Harsha-Nori · 2024-05-03T05:09:34Z

The Phi-3 Mini models (e.g. https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) have a total tokenizer vocab size of 32011 (base 32000 plus 11 special tokens), but the model outputs logit vectors of size 32064 (possibly padded for CUDA performance reasons; 32064 is a multiple of 64).

I think models that use this trick will always place the "padded" model outputs at the end of the logit vector, which makes this PR a clean and simple change that guarantees tokenizer <> model output consistency. This fix will NOT work if the padded outputs are at the beginning or interspersed throughout the logit vector, but I haven't seen a model do that yet. If models do start doing that, we'll need to make significantly heavier changes across the guidance codebase, since we currently rely heavily on len(tokenizer) being usable for iteration.
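As a rough illustration of the idea described above (not the actual code in this PR), here is a minimal sketch that drops trailing padded entries from a logit vector so its length matches the tokenizer vocab. The function name `trim_padded_logits` and the constants are hypothetical and chosen only for this example; the sizes are the Phi-3 Mini numbers quoted above, and the sketch assumes the padding sits at the end.

```python
def trim_padded_logits(logits, tokenizer_vocab_size):
    """Return only the first `tokenizer_vocab_size` logits.

    Hypothetical helper for illustration. It assumes any padded
    entries sit at the END of the vector; it cannot help if padding
    is at the start or interspersed among real token positions.
    """
    if len(logits) < tokenizer_vocab_size:
        raise ValueError(
            f"model output has {len(logits)} entries, fewer than the "
            f"tokenizer vocab size {tokenizer_vocab_size}"
        )
    # Slicing works the same way for lists, NumPy arrays, and 1-D tensors.
    return logits[:tokenizer_vocab_size]


# Sizes quoted in the PR description for Phi-3 Mini.
TOKENIZER_VOCAB_SIZE = 32011   # 32000 base tokens + 11 special tokens
MODEL_OUTPUT_SIZE = 32064      # size of the logit vector the model emits

# A stand-in logit vector; a real one would come from the model.
padded_logits = [0.0] * MODEL_OUTPUT_SIZE
trimmed = trim_padded_logits(padded_logits, TOKENIZER_VOCAB_SIZE)

# Number of trailing padded entries that were dropped.
num_padded = MODEL_OUTPUT_SIZE - TOKENIZER_VOCAB_SIZE
```

After trimming, every remaining index corresponds to a real token id, so code that iterates over `range(len(tokenizer))` can index the logits safely.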

… shape.

Harsha-Nori · 2024-05-03T06:06:43Z

Build failure seems to be unrelated to this change @riedgar-ms

riedgar-ms · 2024-05-03T13:47:01Z

Doesn't this need to be combined with the Phi3-enabling PR?

Harsha-Nori · 2024-05-03T14:46:12Z

> Doesn't this need to be combined with the Phi3-enabling PR?

Yes, oops, meant to merge that in (assuming you mean the one that adds phi-3 to the tests). Just did.

riedgar-ms

Gates looking good so far...

Harsha-Nori · 2024-05-04T05:18:52Z

Build failures on Mac seem to be due to build machine size issues, not this change. I can run all tests on my Mac.

Harsha-Nori added 2 commits May 2, 2024 22:01

Bugfix for models that have a mismatched vocab size and output tensor shape.

8e264c2

black

0f8dbec

Harsha-Nori requested review from slundberg, paulbkoch and riedgar-ms May 3, 2024 05:09

Merge branch 'main' into phi3fix

49eb378

riedgar-ms approved these changes May 3, 2024

Harsha-Nori merged commit 7229656 into main May 4, 2024
111 of 116 checks passed
