Question about Llama-3.1-405b-instruct's results #43
Hi, thanks for the great work! Would you mind sharing how you collected the 405b-instruct number? We measured locally but got 62-63%, not the 69% on the leaderboard. Do you get the number with a specific system prompt / style control?
Thanks a lot!
Comments
Hi! We use the same endpoints as Chatbot Arena. Notably, when we tested what happens if we don't add this system prompt, we observed a degradation in performance.
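For concreteness, here is a minimal sketch of what attaching a system prompt to an OpenAI-compatible endpoint looks like. The base URL, API key, model name, and prompt text below are placeholders, not the actual values used for the leaderboard runs:

```python
from openai import OpenAI

# Placeholder endpoint and key; the real serving endpoint is not given in this thread.
client = OpenAI(base_url="https://example-endpoint/v1", api_key="YOUR_KEY")

SYSTEM_PROMPT = "..."  # the system prompt discussed above (exact text not shown here)

response = client.chat.completions.create(
    model="llama-3.1-405b-instruct-fp8",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "An Arena-Hard question goes here."},
    ],
)
print(response.choices[0].message.content)
```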
The leaderboard presented in the README.md is not style controlled.
Thanks @CodingWithTim! One more question: is there documentation explaining how you get the ~60% win-rate from the GPT-4 judges? Is it computed by sampling 100 problems and taking the win-rate, or by some more sophisticated equation? Thanks a lot again!
No problem! The number 69.3% is the win-rate against gpt-4-0314 (the default baseline), produced with the same code that is on the repo. We did not subsample. I just pushed the llama-3.1-405b-instruct-fp8 generation and judgment files to the repo, so feel free to check them out. If you compute the win-rate against gpt-4-0314 (the default baseline), you should get 69.3%. I reproduced this number on my end; feel free to try it on your end.
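For reference, the win-rate snippet in the next comment implies that each line of the judgment file carries a `games` list with two position-swapped comparisons, each holding a `score` verdict. A minimal sketch of that inferred record structure (the values are made up for illustration):

```python
# Inferred from the win-rate snippet below; field values are illustrative only.
example_judgment_record = {
    "games": [
        {"score": "A>B"},   # game 1: baseline answer is "A", model answer is "B"
        {"score": "B>>A"},  # game 2: positions swapped, model answer is "A"
    ],
}
```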
Also, here is the code to calculate the win-rate given any judgment file:

```python
import pandas as pd

# Placeholder: set judgment_file to the path of a judgment .jsonl file.
judgment = pd.read_json(judgment_file, lines=True)

# Game 1: the baseline answer is position "A" and the model's answer is position "B".
win_map_1 = {"B>A": ["model"],
             "B>>A": ["model"] * 3,
             "A>B": ["baseline"],
             "A>>B": ["baseline"] * 3,
             "A=B": ["tie"]}

# Game 2: positions are swapped, so the model's answer is position "A".
win_map_2 = {"B>A": ["baseline"],
             "B>>A": ["baseline"] * 3,
             "A>B": ["model"],
             "A>>B": ["model"] * 3,
             "A=B": ["tie"]}

# Tally outcomes across both games; strong verdicts (">>") count three times.
outcomes = pd.concat([judgment.games.map(lambda x: win_map_1[x[0]["score"]]).explode(),
                      judgment.games.map(lambda x: win_map_2[x[1]["score"]]).explode()])
outcomes.value_counts()
```

If you try this on the judgment file I pushed to Hugging Face, you should also get 69.3%. Note, this does not use the Bradley-Terry coefficient, which indicates the Bradley-Terry step leaves the raw win-rate unchanged.
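If it helps, here is one way to turn those outcome counts into a single percentage. This is a sketch under the assumption that a tie counts as half a win (a common convention); I have not verified that it matches the leaderboard's exact aggregation:

```python
counts = outcomes.value_counts()
total = counts.sum()

# Assumed aggregation: each "model" outcome is a win, each "tie" is half a win.
win_rate = (counts.get("model", 0) + 0.5 * counts.get("tie", 0)) / total
print(f"Win-rate vs. gpt-4-0314: {win_rate:.1%}")
```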