Question about Llama-3.1-405b-instruct's results #43

Closed
snova-bol opened this issue Sep 3, 2024 · 4 comments


snova-bol commented Sep 3, 2024

Hi, thanks for the great work! Would you mind sharing how you collected the 405b-instruct number? We measured it locally and got 62-63%, not the 69% on the leaderboard. Did you get that number with a specific system prompt or style control?

Thanks a lot!

CodingWithTim (Collaborator) commented:

Hi! We use the same endpoints as Chatbot Arena: llama-3.1-405b-instruct-fp8 from Anyscale. We also add the following system prompt when using Llama-3.1-405b-instruct on both Chatbot Arena and Arena-Hard-Auto: "Cutting Knowledge Date: December 2023\nToday Date: 31 Aug 2024". You can add a system instruction when generating model answers by adding system_prompt to api_config.yaml:

gpt-3.5-turbo-0125:
    model_name: gpt-3.5-turbo-0125
    endpoints: null
    api_type: openai
    parallel: 8
    system_prompt: [insert system instruction]
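
For example, an entry for the 405B endpoint above could look like the sketch below. Only the model name and system prompt come from our setup; the endpoints, api_type, and parallel values are placeholders you would adapt to your own deployment:

llama-3.1-405b-instruct-fp8:
    model_name: llama-3.1-405b-instruct-fp8
    endpoints: null
    api_type: openai
    parallel: 8
    system_prompt: "Cutting Knowledge Date: December 2023\nToday Date: 31 Aug 2024"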

Notably, when we tested what happens if we don't add this system prompt, we observed a degradation in performance:

llama-3.1-405b-instruct-fp8               | score: 69.3  | 95% CI: (-2.2, 2.7)  | average #tokens: 658                      
llama-3.1-405b-instruct-fp8-no-sys-prompt | score: 64.2  | 95% CI: (-2.2, 2.4)  | average #tokens: 635                     

The leaderboard presented in the README.md is not style-controlled.

CodingWithTim self-assigned this on Sep 3, 2024
CodingWithTim pinned this issue on Sep 3, 2024
snova-bol (Author) commented:

Thanks @CodingWithTim! One more question: is there documentation explaining how you get the ~60% win-rate from the GPT-4 judges? Is it computed by sampling 100 problems and taking the win-rate, or with a more sophisticated equation? Thanks a lot again!

CodingWithTim (Collaborator) commented Sep 3, 2024

No problem! The 69.3% is the win-rate against gpt-4-0314 (the default baseline), which is produced by running python show_result.py.

We did not subsample; we used the same code that is in the repo. So if you are running python show_result.py, then we are using the same setup.

The code in show_result.py first computes the Bradley-Terry coefficients for each model and then recomputes the win-rate against the gpt-4-0314 baseline. Since every model is only compared against a single baseline, the win-rate is invariant to the number of models. Further, we count a significant win (e.g. A>>B) as 3 wins, a significant loss as 3 losses, a small win (e.g. A>B) as 1 win, a small loss as 1 loss, and a tie (e.g. A=B) as a single tie. The win-rate is then computed as (total wins + 0.5 * total ties) / (total wins + total ties + total losses).
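
As a quick sanity check of that formula, here is a tiny sketch with made-up counts (the numbers are purely illustrative, not taken from any real judgment file):

# Hypothetical outcome counts after the 3x weighting of A>>B / B>>A verdicts
wins, ties, losses = 500, 100, 300

# win-rate = (wins + 0.5 * ties) / (wins + ties + losses)
win_rate = (wins + 0.5 * ties) / (wins + ties + losses)
print(f"win-rate: {win_rate:.1%}")  # -> 61.1%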

I just pushed the llama-3.1-405b-instruct-fp8 generation and judgment files to the repo, feel free to check them out. If you compute the win-rate against gpt-4-0314 (the default baseline), you should get 69.3%. I reproduced this number on my end; feel free to try it on yours.

CodingWithTim (Collaborator) commented:

Also, here is code to compute just the win-rate from any judgment file:

import pandas as pd

# judgment_file: path to the model's judgment .jsonl file
judgment = pd.read_json(judgment_file, lines=True)

# Game 1: the baseline is assistant A and the model is assistant B.
# Significant verdicts (A>>B / B>>A) are counted 3x, per the weighting described above.
win_map_1 = {"B>A": ["model"],
             "B>>A": ["model"] * 3,
             "A>B": ["baseline"],
             "A>>B": ["baseline"] * 3,
             "A=B": ["tie"]}
# Game 2: positions are swapped, so the mapping is mirrored.
win_map_2 = {"B>A": ["baseline"],
             "B>>A": ["baseline"] * 3,
             "A>B": ["model"],
             "A>>B": ["model"] * 3,
             "A=B": ["tie"]}

# Collect the weighted outcomes from both games of every question.
outcomes = pd.concat([judgment.games.map(lambda x: win_map_1[x[0]["score"]]).explode(),
                      judgment.games.map(lambda x: win_map_2[x[1]["score"]]).explode()])

counts = outcomes.value_counts()
win_rate = (counts.get("model", 0) + 0.5 * counts.get("tie", 0)) / counts.sum()
print(counts)
print(f"win-rate vs. baseline: {win_rate:.1%}")

If you try this on the judgment file I pushed to Hugging Face, you should also get 69.3%. Note that this does not use the Bradley-Terry coefficients at all, which shows that the Bradley-Terry step leaves the win-rate against the single baseline essentially unchanged.

CodingWithTim changed the title from "405B number" to "Question about Llama-3.1-405b-instruct's results" on Sep 3, 2024
CodingWithTim added the question label on Nov 11, 2024