Question about Llama-3.1-405b-instruct's results #43
Hi, thanks for the great work! Would you mind sharing how you collected the 405b-instruct number? We measured locally but got 62-63%, not the 69% on the leaderboard. Do you get the number with a specific system prompt / style control?
Thanks a lot!
Comments
Hi! We use the same endpoints as Chatbot Arena. Notably, when we tested what happens if we don't add this system prompt, we observed a degradation in performance.
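For concreteness, here is a minimal sketch of what attaching a system prompt to an OpenAI-compatible endpoint looks like. The base URL, API key, model name, and prompt text below are placeholders, not the actual values used for the leaderboard runs:

```python
from openai import OpenAI

# Placeholder endpoint and key; the real serving endpoint is not given in this thread.
client = OpenAI(base_url="https://example-endpoint/v1", api_key="YOUR_KEY")

SYSTEM_PROMPT = "..."  # the system prompt discussed above (exact text not shown here)

response = client.chat.completions.create(
    model="llama-3.1-405b-instruct-fp8",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "An Arena-Hard question goes here."},
    ],
)
print(response.choices[0].message.content)
```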
The leaderboard presented in the README.md is not style controlled.
Thanks @CodingWithTim! One more question: is there documentation explaining how you get the ~60% win-rate from the GPT-4 judges? Is it computed by sampling 100 problems and taking the win-rate, or by some more sophisticated equation? Thanks a lot again!
No problem! The number 69.3% is the win-rate against gpt-4-0314 (the default baseline), produced with the same code that is on the repo. We did not subsample. I just pushed the llama-3.1-405b-instruct-fp8 generation and judgment files to the repo, so feel free to check them out. If you compute the win-rate against gpt-4-0314 (the default baseline), you should get 69.3%. I reproduced this number on my end; feel free to try it on your end.
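For reference, the win-rate snippet in the next comment implies that each line of the judgment file carries a `games` list with two position-swapped comparisons, each holding a `score` verdict. A minimal sketch of that inferred record structure (the values are made up for illustration):

```python
# Inferred from the win-rate snippet below; field values are illustrative only.
example_judgment_record = {
    "games": [
        {"score": "A>B"},   # game 1: baseline answer is "A", model answer is "B"
        {"score": "B>>A"},  # game 2: positions swapped, model answer is "A"
    ],
}
```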
Also, here is the code to calculate the win-rate given any judgment file:

```python
import pandas as pd

# Placeholder: set judgment_file to the path of a judgment .jsonl file.
judgment = pd.read_json(judgment_file, lines=True)

# Game 1: the baseline answer is position "A" and the model's answer is position "B".
win_map_1 = {"B>A": ["model"],
             "B>>A": ["model"] * 3,
             "A>B": ["baseline"],
             "A>>B": ["baseline"] * 3,
             "A=B": ["tie"]}

# Game 2: positions are swapped, so the model's answer is position "A".
win_map_2 = {"B>A": ["baseline"],
             "B>>A": ["baseline"] * 3,
             "A>B": ["model"],
             "A>>B": ["model"] * 3,
             "A=B": ["tie"]}

# Tally outcomes across both games; strong verdicts (">>") count three times.
outcomes = pd.concat([judgment.games.map(lambda x: win_map_1[x[0]["score"]]).explode(),
                      judgment.games.map(lambda x: win_map_2[x[1]["score"]]).explode()])
outcomes.value_counts()
```

If you try this on the judgment file I pushed to Hugging Face, you should also get 69.3%. Note, this does not use the Bradley-Terry coefficient, which indicates the Bradley-Terry step leaves the raw win-rate unchanged.
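If it helps, here is one way to turn those outcome counts into a single percentage. This is a sketch under the assumption that a tie counts as half a win (a common convention); I have not verified that it matches the leaderboard's exact aggregation:

```python
counts = outcomes.value_counts()
total = counts.sum()

# Assumed aggregation: each "model" outcome is a win, each "tie" is half a win.
win_rate = (counts.get("model", 0) + 0.5 * counts.get("tie", 0)) / total
print(f"Win-rate vs. gpt-4-0314: {win_rate:.1%}")
```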