About the style control leaderboard #50

Closed
yangzy39 opened this issue Nov 6, 2024 · 3 comments

@yangzy39

yangzy39 commented Nov 6, 2024

Hello, and thank you for the recent release of the style control leaderboard. I have two questions:

  1. The latest model responses and judgment files seem to be missing from this Hugging Face repository, which prevents us from fully reproducing the leaderboard. Could you clarify if these files will be made available?

  2. When evaluating custom models, we’ve noticed that adding a new model impacts the Style Control Score of all models, leading to inconsistent results across evaluations. Is there a recommended approach for obtaining stable scores when assessing new models?

Thank you for your assistance!

@CodingWithTim CodingWithTim self-assigned this Nov 6, 2024
@CodingWithTim
Collaborator

Hello there,

  1. Sorry about that. I have just uploaded the newest model answers and judgments to the Hugging Face repo.

  2. Because of how style control works, adding a new model affects the style control score of every model. Style Control is a statistical model that seeks to learn the effect of response length and style on the judge's decision, conditioned on the dataset in question, so its output depends on that dataset. The effects of response length and style are estimated as coefficients of a logistic model. One way to obtain a more stable score is to add only one model at a time and take the (n + 1)-th model's score as your official score. You could also lock in the coefficients, but this method is not currently implemented. Additionally, the style control scores should, in theory, improve as you add more models to the dataset, since you are giving the statistical model more data from which to learn the judge's biases.
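
To make the point above concrete, here is a minimal sketch of a Bradley-Terry style logistic model with one extra style covariate, fitted on simulated battles. It is not the repository's implementation: the names `fit_style_control`, `simulate_battles`, and `frozen_gamma`, the single style feature, and all the numbers are assumptions made for illustration. It shows why the style coefficient is re-estimated whenever the battle set changes, and what "locking in" the coefficients could look like.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_battles(true_beta, true_gamma, n, rng):
    """Simulate n pairwise battles (hypothetical data, not real judgments)."""
    k = len(true_beta)
    a = rng.integers(0, k, size=n)
    b = (a + rng.integers(1, k, size=n)) % k      # offset in 1..k-1, so b != a
    X_model = np.zeros((n, k))
    X_model[np.arange(n), a] = 1.0                # +1 for model A
    X_model[np.arange(n), b] = -1.0               # -1 for model B
    # Style feature: e.g. a normalized length difference between A and B.
    X_style = rng.normal(size=(n, len(true_gamma)))
    logits = X_model @ true_beta + X_style @ true_gamma
    y = (rng.random(n) < sigmoid(logits)).astype(float)   # 1.0 if A wins
    return X_model, X_style, y

def fit_style_control(X_model, X_style, y, frozen_gamma=None,
                      l2=1e-3, lr=0.5, steps=2000):
    """Gradient descent on the L2-regularized mean logistic loss.

    beta holds per-model strengths, gamma holds style coefficients.
    If frozen_gamma is given, gamma is held fixed and only beta is fitted.
    """
    n, k = X_model.shape
    beta = np.zeros(k)
    if frozen_gamma is None:
        gamma = np.zeros(X_style.shape[1])
    else:
        gamma = np.array(frozen_gamma, dtype=float)
    for _ in range(steps):
        err = sigmoid(X_model @ beta + X_style @ gamma) - y
        beta = beta - lr * (X_model.T @ err / n + l2 * beta)
        if frozen_gamma is None:
            gamma = gamma - lr * (X_style.T @ err / n + l2 * gamma)
    return beta, gamma

rng = np.random.default_rng(0)
true_beta = np.array([0.0, 0.5, 1.0, -0.5])       # assumed model strengths
true_gamma = np.array([0.8])                      # assumed judge bias toward style
X_model, X_style, y = simulate_battles(true_beta, true_gamma, 4000, rng)

# "Old" leaderboard: battles that do not involve model 3 (the new model).
old = X_model[:, 3] == 0
beta_old, gamma_old = fit_style_control(X_model[old][:, :3], X_style[old], y[old])

# Refit with the new model included: gamma is re-estimated, so every score moves.
beta_free, gamma_free = fit_style_control(X_model, X_style, y)

# Locked variant: reuse gamma_old and fit only the model strengths.
beta_locked, gamma_locked = fit_style_control(X_model, X_style, y,
                                              frozen_gamma=gamma_old)

print("gamma (old battles):", gamma_old)
print("gamma (refit with new model):", gamma_free)
print("gamma (locked):", gamma_locked)
```

In this sketch, freezing `gamma` keeps the style correction identical across evaluations, but the model strengths are still refitted on the enlarged battle set, so it stabilizes the style adjustment rather than guaranteeing identical scores.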

Hopefully this helps.

@yangzy39
Author

Thank you for your response. However, we found that the answers and judgments for yi-lightning, gpt-4o-2024-08-06, qwen2.5-72b-instruct, and gemma-2-9b-it are still missing.

@CodingWithTim
Collaborator

Thank you very much for bringing this up. I just uploaded all the data to Hugging Face.

@CodingWithTim CodingWithTim added the documentation Improvements or additions to documentation label Dec 14, 2024
@CodingWithTim CodingWithTim pinned this issue Dec 14, 2024