S3Eval Dataset #916
Conversation
Thanks for your contribution. Please add a README.md in the config folder, for example: https://github.com/open-compass/opencompass/blob/main/configs/datasets/IFEval/IFEval.md
Also, please give some instructions on dataset preparation and evaluation in the README. Thanks a lot.
Hi, I have added the README file: https://github.com/lfy79001/opencompass/blob/s3eval_branch/configs/datasets/s3eval/s3eval.md
@lfy79001 Hi, please update the README with data preparation instructions.
Hi, this benchmark can be used directly by loading the dataset from Hugging Face; no data preparation is needed.
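For reference, a minimal sketch of what "no data preparation" could look like, assuming the data is hosted on the Hugging Face Hub; the dataset id `FangyuLei/s3eval`, the split name, and the column layout below are assumptions for illustration, not confirmed in this thread:

```python
# Minimal sketch: load S3Eval directly from the Hugging Face Hub.
# The dataset id, split, and column names are assumptions for illustration.
from datasets import load_dataset

ds = load_dataset("FangyuLei/s3eval", split="test")  # assumed id/split

# Inspect one example; the actual field names may differ.
example = ds[0]
print(example.keys())
```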
LGTM
* s3eval_branch
* update s3eval
S3Eval
Introduction
The following introduction comes from the abstract of S3Eval: A Synthetic, Scalable and Systematic Evaluation Suite for Large Language Models
S3Eval, our latest contribution to the field, addresses the critical need for comprehensive evaluation resources for Large Language Models (LLMs). In the pursuit of understanding long-context comprehension and enhancing reasoning capabilities, we present a benchmarking suite that is both synthetic and scalable.
Operating on SQL execution tasks, S3Eval challenges LLMs with randomly generated tables and SQL queries, evaluating their ability to produce accurate execution results. This benchmark stands out for its versatility and scalability, providing unlimited evaluation resources for a robust assessment of LLM capabilities.
In this latest submission, we have generated a batch of high-quality data, encompassing nearly all types of queries with strong diversity. Moreover, the length of the tables spans from 200 to 200K, enabling a systematic evaluation of the long-context capabilities of the models.
For researchers and practitioners alike, S3Eval holds the promise of uncovering deeper insights into LLM performance. Explore the paper for detailed information on its design, experiments, and implications. We invite you to leverage S3Eval for your research endeavors and contribute to the evolving landscape of synthetic benchmark construction. 😊
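To make the task format concrete, here is a hypothetical sketch (not taken from the dataset) of how an SQL-execution example can be constructed and its gold answer derived by actually running the query, using Python's standard `sqlite3` module. The table, column names, and query are invented for illustration.

```python
# Hypothetical illustration of the S3Eval task format: given a table and a SQL
# query, the model must predict the execution result. The gold answer can be
# obtained by executing the query against the table.
import sqlite3

# An invented toy table; real S3Eval tables are randomly generated and can be
# much longer.
rows = [("alpha", 3), ("beta", 7), ("alpha", 5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = "SELECT name, SUM(score) FROM t GROUP BY name ORDER BY name"
gold = conn.execute(query).fetchall()  # execution result the model is scored against
print(gold)  # [('alpha', 8), ('beta', 7)]
```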
Official link
Paper
S3Eval: A Synthetic, Scalable and Systematic Evaluation Suite for Large Language Models
Repository
S3Eval
Examples
Input example I:
Output example I (from GPT-4):
Reference