S3Eval Dataset #916
Conversation
Thanks for your contribution. Please add a README.md in the config folder, for example: https://github.com/open-compass/opencompass/blob/main/configs/datasets/IFEval/IFEval.md
Also, please give some instructions on dataset preparation and evaluation in the README. Thanks a lot.
Hi, I have added the README file: https://github.com/lfy79001/opencompass/blob/s3eval_branch/configs/datasets/s3eval/s3eval.md
@lfy79001 Hi, please update the README with data preparation instructions.
Hi, this benchmark can be used directly by loading the dataset from Hugging Face; no data preparation is needed.
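For reference, a minimal sketch of what "no data preparation" could look like, assuming the data is hosted on the Hugging Face Hub; the dataset id `FangyuLei/s3eval`, the split name, and the column layout below are assumptions for illustration, not confirmed in this thread:

```python
# Minimal sketch: load S3Eval directly from the Hugging Face Hub.
# The dataset id, split, and column names are assumptions for illustration.
from datasets import load_dataset

ds = load_dataset("FangyuLei/s3eval", split="test")  # assumed id/split

# Inspect one example; the actual field names may differ.
example = ds[0]
print(example.keys())
```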
LGTM
* s3eval_branch
* update s3eval
S3Eval
Introduction
The following introduction comes from the abstract of S3Eval: A Synthetic, Scalable and Systematic Evaluation Suite for Large Language Models
S3Eval, our latest contribution to the field, addresses the critical need for comprehensive evaluation resources for Large Language Models (LLMs). In the pursuit of understanding long-context comprehension and enhancing reasoning capabilities, we present a benchmarking suite that is both synthetic and scalable.
Operating on SQL execution tasks, S3Eval challenges LLMs with randomly generated tables and SQL queries, evaluating their ability to produce accurate execution results. This benchmark stands out for its versatility and scalability, providing unlimited evaluation resources for a robust assessment of LLM capabilities.
In this latest submission, we have generated a batch of high-quality data, encompassing nearly all types of queries with strong diversity. Moreover, the length of the tables spans from 200 to 200K, enabling a systematic evaluation of the long-context capabilities of the models.
For researchers and practitioners alike, S3Eval holds the promise of uncovering deeper insights into LLM performance. Explore the paper for detailed information on its design, experiments, and implications. We invite you to leverage S3Eval for your research endeavors and contribute to the evolving landscape of synthetic benchmark construction. 😊
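To make the task format concrete, here is a hypothetical sketch (not taken from the dataset) of how an SQL-execution example can be constructed and its gold answer derived by actually running the query, using Python's standard `sqlite3` module. The table, column names, and query are invented for illustration.

```python
# Hypothetical illustration of the S3Eval task format: given a table and a SQL
# query, the model must predict the execution result. The gold answer can be
# obtained by executing the query against the table.
import sqlite3

# An invented toy table; real S3Eval tables are randomly generated and can be
# much longer.
rows = [("alpha", 3), ("beta", 7), ("alpha", 5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = "SELECT name, SUM(score) FROM t GROUP BY name ORDER BY name"
gold = conn.execute(query).fetchall()  # execution result the model is scored against
print(gold)  # [('alpha', 8), ('beta', 7)]
```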
Official link
Paper
S3Eval: A Synthetic, Scalable and Systematic Evaluation Suite for Large Language Models
Repository
S3Eval
Examples
Input example I:
Output example I (from GPT-4):
Reference