-
+
- InternLM
- LLaMA
- Vicuna
diff --git a/README_zh-CN.md b/README_zh-CN.md
index 580fbc6..d42740e 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -40,10 +40,8 @@ OpenCompass 是面向大模型评测的一站式平台。其主要特点如下
我们将陆续提供开源模型和API模型的具体性能榜单,请见 [OpenCompass Leaderboard](https://opencompass.org.cn/rank) 。如需加入评测,请提供模型仓库地址或标准的 API 接口至邮箱 `opencompass@pjlab.org.cn`。
-
![image](https://github.com/InternLM/OpenCompass/assets/7881589/fddc8ab4-d2bd-429d-89f0-4ca90606599a)
-
## 数据集支持
diff --git a/docs/en/advanced_guides/new_dataset.md b/docs/en/advanced_guides/new_dataset.md
index d1f05d2..f7bb2c5 100644
--- a/docs/en/advanced_guides/new_dataset.md
+++ b/docs/en/advanced_guides/new_dataset.md
@@ -1,3 +1,57 @@
-# New Dataset
+# Add a dataset
-Coming soon.
+Although OpenCompass already includes most commonly used datasets, you need to follow the steps below to support a new dataset:
+
+1. Add a dataset script `mydataset.py` to the `opencompass/datasets` folder. This script should include:
+
+ - The dataset and its loading method. Define a `MyDataset` class that implements the data loading method `load` as a static method. This method should return data of type `datasets.Dataset`. We use the Hugging Face dataset as the unified interface for datasets to avoid introducing additional logic. Here's an example:
+
+ ```python
+ import datasets
+ from .base import BaseDataset
+
+ class MyDataset(BaseDataset):
+
+ @staticmethod
+ def load(**kwargs) -> datasets.Dataset:
+ pass
+ ```
+
+ - (Optional) If the existing evaluators in OpenCompass do not meet your needs, you need to define a `MyDatasetEvaluator` class that implements the scoring method `score`. This method should take `predictions` and `references` as input and return the desired dictionary. Since a dataset may have multiple metrics, the method should return a dictionary containing the metrics and their corresponding scores. Here's an example:
+
+ ```python
+    from typing import List
+
+    from opencompass.openicl.icl_evaluator import BaseEvaluator
+
+ class MyDatasetEvaluator(BaseEvaluator):
+
+ def score(self, predictions: List, references: List) -> dict:
+ pass
+ ```
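To make the contract concrete, here is a minimal, self-contained sketch of a `score` method computing exact-match accuracy (the class below deliberately does not inherit `BaseEvaluator` so it can run standalone, and the `accuracy` metric name is illustrative, not an OpenCompass convention):

```python
from typing import List


class AccuracyEvaluator:
    """Minimal sketch: exact-match accuracy over prediction/reference pairs."""

    def score(self, predictions: List[str], references: List[str]) -> dict:
        # Guard against mismatched or empty inputs before computing the metric.
        if len(predictions) != len(references) or not references:
            return {'error': 'predictions and references have different lengths'}
        correct = sum(p.strip() == r.strip()
                      for p, r in zip(predictions, references))
        return {'accuracy': 100 * correct / len(references)}
```

A real evaluator would inherit `BaseEvaluator` and may return several metrics in the same dictionary, e.g. `{'accuracy': ..., 'f1': ...}`.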
+
+ - (Optional) If the existing postprocessors in OpenCompass do not meet your needs, you need to define the `mydataset_postprocess` method. This method takes an input string and returns the corresponding postprocessed result string. Here's an example:
+
+ ```python
+ def mydataset_postprocess(text: str) -> str:
+ pass
+ ```
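As a concrete illustration, a postprocessor for a multiple-choice dataset might extract the first standalone option letter from the model's raw output (a hypothetical sketch; the `A`-`D` option pattern is an assumption, not something every dataset uses):

```python
import re


def mydataset_postprocess(text: str) -> str:
    """Sketch: extract the first standalone option letter (A-D) from raw output."""
    match = re.search(r'\b([A-D])\b', text)
    # Return an empty string when no option letter is found, so the
    # evaluator can count the prediction as incorrect rather than crash.
    return match.group(1) if match else ''
```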
+
+2. After defining the dataset loading, data postprocessing, and evaluator methods, you need to add the following configurations to the configuration file:
+
+ ```python
+ from opencompass.datasets import MyDataset, MyDatasetEvaluator, mydataset_postprocess
+
+ mydataset_eval_cfg = dict(
+ evaluator=dict(type=MyDatasetEvaluator),
+ pred_postprocessor=dict(type=mydataset_postprocess))
+
+ mydataset_datasets = [
+ dict(
+ type=MyDataset,
+ ...,
+ reader_cfg=...,
+ infer_cfg=...,
+ eval_cfg=mydataset_eval_cfg)
+ ]
+ ```
+
+   Once the dataset is configured, refer to the instructions in [Get started](../get_started.md) for the remaining setup.
diff --git a/docs/en/advanced_guides/new_model.md b/docs/en/advanced_guides/new_model.md
index a93e773..ce7bc6f 100644
--- a/docs/en/advanced_guides/new_model.md
+++ b/docs/en/advanced_guides/new_model.md
@@ -1,3 +1,73 @@
-# New A Model
+# Add a Model
-Coming soon.
+Currently, we support HF models, some model APIs, and some third-party models.
+
+## Adding API Models
+
+To add a new API-based model, create a new file named `mymodel_api.py` under the `opencompass/models` directory. In this file, inherit from `BaseAPIModel` and implement the `generate` method for inference and the `get_token_len` method to calculate token lengths. Once the model is defined, modify the corresponding configuration file.
+
+```python
+from typing import Dict, List, Optional
+
+from ..base_api import BaseAPIModel
+
+class MyModelAPI(BaseAPIModel):
+
+ is_api: bool = True
+
+ def __init__(self,
+ path: str,
+ max_seq_len: int = 2048,
+                 query_per_second: int = 1,
+                 meta_template: Optional[Dict] = None,
+                 retry: int = 2,
+                 **kwargs):
+ super().__init__(path=path,
+ max_seq_len=max_seq_len,
+ meta_template=meta_template,
+ query_per_second=query_per_second,
+ retry=retry)
+ ...
+
+ def generate(
+ self,
+ inputs,
+ max_out_len: int = 512,
+ temperature: float = 0.7,
+ ) -> List[str]:
+ """Generate results given a list of inputs."""
+ pass
+
+ def get_token_len(self, prompt: str) -> int:
+ """Get lengths of the tokenized string."""
+ pass
+```
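For instance, `get_token_len` lets OpenCompass budget prompt lengths before calling the API. When the provider exposes no local tokenizer, a character-ratio heuristic is one possible fallback; this is only a hedged sketch (the 4-characters-per-token ratio is an assumption, not an OpenCompass default, and a real implementation should use the provider's tokenizer):

```python
def get_token_len(prompt: str) -> int:
    """Rough sketch: estimate token count with no tokenizer available.

    Assumes roughly 4 characters per token for English text; replace
    with the provider's actual tokenizer in practice.
    """
    return max(1, len(prompt) // 4)
```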
+
+## Adding Third-Party Models
+
+To add a new third-party model, create a new file named `mymodel.py` under the `opencompass/models` directory. In this file, inherit from `BaseModel` and implement the `generate` method for generative inference, the `get_ppl` method for discriminative inference, and the `get_token_len` method to calculate token lengths. Once the model is defined, modify the corresponding configuration file.
+
+```python
+from typing import Dict, List, Optional
+
+from ..base import BaseModel
+
+class MyModel(BaseModel):
+
+ def __init__(self,
+ pkg_root: str,
+ ckpt_path: str,
+ tokenizer_only: bool = False,
+ meta_template: Optional[Dict] = None,
+ **kwargs):
+ ...
+
+ def get_token_len(self, prompt: str) -> int:
+ """Get lengths of the tokenized strings."""
+ pass
+
+ def generate(self, inputs: List[str], max_out_len: int) -> List[str]:
+ """Generate results given a list of inputs. """
+ pass
+
+ def get_ppl(self,
+ inputs: List[str],
+ mask_length: Optional[List[int]] = None) -> List[float]:
+ """Get perplexity scores given a list of inputs."""
+ pass
+```
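The `get_ppl` contract returns one perplexity-style score per input. Independent of any concrete model, the arithmetic behind it can be sketched as the exponential of the mean negative log-likelihood over the tokens (a standalone illustration of the formula, not OpenCompass's implementation):

```python
import math
from typing import List


def perplexity(token_logprobs: List[float]) -> float:
    """Sketch: perplexity = exp(mean negative log-likelihood).

    `token_logprobs` are per-token natural-log probabilities, as a
    model's forward pass would produce them.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```

For example, a sequence whose tokens each have probability 0.5 has a perplexity of exactly 2.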
diff --git a/docs/en/get_started.md b/docs/en/get_started.md
index 4cdca44..e773ff3 100644
--- a/docs/en/get_started.md
+++ b/docs/en/get_started.md
@@ -107,7 +107,7 @@ models = [llama_7b]
-Lauch Evalution
+Launch Evaluation
First, we can start the task in **debug mode** to check for any exceptions in model loading, dataset reading, or incorrect cache usage.
diff --git a/docs/en/index.rst b/docs/en/index.rst
index bcd7f43..0b46b75 100644
--- a/docs/en/index.rst
+++ b/docs/en/index.rst
@@ -79,4 +79,4 @@ Indexes & Tables
==================
* :ref:`genindex`
-* :ref:`search`
\ No newline at end of file
+* :ref:`search`
diff --git a/docs/en/user_guides/dataset_prepare.md b/docs/en/user_guides/dataset_prepare.md
index 2ff2ef4..faca61b 100644
--- a/docs/en/user_guides/dataset_prepare.md
+++ b/docs/en/user_guides/dataset_prepare.md
@@ -8,34 +8,26 @@ First, let's introduce the structure under the `configs/datasets` directory in O
```
configs/datasets/
-├── ChineseUniversal # Ability dimension
-│ ├── CLUE_afqmc # Dataset under this dimension
-│ │ ├── CLUE_afqmc_gen_db509b.py # Different configuration files for this dataset
-│ │ ├── CLUE_afqmc_gen.py
-│ │ ├── CLUE_afqmc_ppl_00b348.py
-│ │ ├── CLUE_afqmc_ppl_2313cf.py
-│ │ └── CLUE_afqmc_ppl.py
-│ ├── CLUE_C3
-│ │ ├── ...
-│ ├── ...
-├── Coding
-├── collections
-├── Completion
-├── EnglishUniversal
-├── Exam
-├── glm
-├── LongText
-├── MISC
-├── NLG
-├── QA
-├── Reasoning
-├── Security
-└── Translation
+├── agieval
+├── apps
+├── ARC_c
+├── ...
+├── CLUE_afqmc # dataset
+│ ├── CLUE_afqmc_gen_901306.py # different version of config
+│ ├── CLUE_afqmc_gen.py
+│ ├── CLUE_afqmc_ppl_378c5b.py
+│ ├── CLUE_afqmc_ppl_6507d7.py
+│ ├── CLUE_afqmc_ppl_7b0c1e.py
+│ └── CLUE_afqmc_ppl.py
+├── ...
+├── XLSum
+├── Xsum
+└── z_bench
```
-In the `configs/datasets` directory structure, we have divided the datasets into over ten dimensions based on ability dimensions, such as: Chinese and English Universal, Exam, QA, Reasoning, Security, etc. Each dimension contains a series of datasets, and there are multiple dataset configurations in the corresponding folder of each dataset.
+In the `configs/datasets` directory, all datasets are organized in a flat layout, and each dataset's folder contains multiple dataset configurations.
-The naming of the dataset configuration file is made up of `{dataset name}_{evaluation method}_{prompt version number}.py`. For example, `ChineseUniversal/CLUE_afqmc/CLUE_afqmc_gen_db509b.py`, this configuration file is the `CLUE_afqmc` dataset under the Chinese universal ability, the corresponding evaluation method is `gen`, i.e., generative evaluation, and the corresponding prompt version number is `db509b`; similarly, `CLUE_afqmc_ppl_00b348.py` indicates that the evaluation method is `ppl`, i.e., discriminative evaluation, and the prompt version number is `00b348`.
+The naming of the dataset configuration file is made up of `{dataset name}_{evaluation method}_{prompt version number}.py`. For example, `CLUE_afqmc/CLUE_afqmc_gen_db509b.py` is a configuration for the `CLUE_afqmc` dataset, whose evaluation method is `gen`, i.e., generative evaluation, with prompt version number `db509b`; similarly, `CLUE_afqmc_ppl_00b348.py` indicates that the evaluation method is `ppl`, i.e., discriminative evaluation, with prompt version number `00b348`.
In addition, files without a version number, such as: `CLUE_afqmc_gen.py`, point to the latest prompt configuration file of that evaluation method, which is usually the most accurate prompt.
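The naming scheme can be illustrated with a small parser (a hypothetical helper written for this document, not part of OpenCompass; the six-hex-digit version pattern is an assumption based on the examples above):

```python
import re


def parse_config_name(filename: str) -> dict:
    """Sketch: split '{dataset}_{gen|ppl}[_{version}].py' into its parts."""
    match = re.match(
        r'(?P<dataset>.+)_(?P<method>gen|ppl)'
        r'(?:_(?P<version>[0-9a-f]{6}))?\.py$', filename)
    # Files without a version number (e.g. CLUE_afqmc_gen.py) yield
    # version=None; non-matching names yield an empty dict.
    return match.groupdict() if match else {}
```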
@@ -49,13 +41,13 @@ The datasets supported by OpenCompass mainly include two parts:
2. OpenCompass Self-built Datasets
-In addition to supporting Huggingface's existing datasets, OpenCompass also provides some self-built CN datasets. In the future, a dataset-related Repo will be provided for users to download and use. Following the instructions in the document to place the datasets uniformly in the `./data` directory can complete dataset preparation.
+In addition to supporting Huggingface's existing datasets, OpenCompass also provides some self-built CN datasets. In the future, a dataset-related link will be provided for users to download and use. Follow the instructions in the documentation to place the datasets uniformly in the `./data` directory to complete dataset preparation.
It is important to note that the Repo not only contains self-built datasets, but also includes some HF-supported datasets for testing convenience.
## Dataset Selection
-In each dataset configuration file, the dataset will be defined in the `{}_datasets` variable, such as `afqmc_datasets` in `ChineseUniversal/CLUE_afqmc/CLUE_afqmc_gen_db509b.py`.
+In each dataset configuration file, the dataset will be defined in the `{}_datasets` variable, such as `afqmc_datasets` in `CLUE_afqmc/CLUE_afqmc_gen_db509b.py`.
```python
afqmc_datasets = [
@@ -70,7 +62,7 @@ afqmc_datasets = [
]
```
-And `afqmc_datasets` in `ChineseUniversal/CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py`.
+And `cmnli_datasets` in `CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py`.
```python
cmnli_datasets = [
diff --git a/docs/zh_cn/advanced_guides/new_dataset.md b/docs/zh_cn/advanced_guides/new_dataset.md
index 44f2f33..2b631cd 100644
--- a/docs/zh_cn/advanced_guides/new_dataset.md
+++ b/docs/zh_cn/advanced_guides/new_dataset.md
@@ -4,7 +4,7 @@
1. 在 `opencompass/datasets` 文件夹新增数据集脚本 `mydataset.py`, 该脚本需要包含:
- - 数据集及其加载方式,需要定义一个 `MyDataset` 类,实现数据集加载方法 `load` ,该方法为静态方法,需要返回 `datasets.Dataset` 类型的数据。这里我们使用 huggingface dataset 作为数据集的统一接口,避免引入额外的逻辑。具体示例如下:
+ - 数据集及其加载方式,需要定义一个 `MyDataset` 类,实现数据集加载方法 `load`,该方法为静态方法,需要返回 `datasets.Dataset` 类型的数据。这里我们使用 huggingface dataset 作为数据集的统一接口,避免引入额外的逻辑。具体示例如下:
```python
import datasets
@@ -17,10 +17,9 @@
pass
```
- - (可选)如果OpenCompass已有的evaluator不能满足需要,需要用户定义 `MyDatasetlEvaluator` 类,实现评分方法 `score` ,需要根据输入的 `predictions` 和 `references` 列表,得到需要的字典。由于一个数据集可能存在多种metric,需要返回一个 metrics 以及对应 scores 的相关字典。具体示例如下:
+   - (可选)如果 OpenCompass 已有的评测器不能满足需要,需要用户定义 `MyDatasetEvaluator` 类,实现评分方法 `score`,需要根据输入的 `predictions` 和 `references` 列表,得到需要的字典。由于一个数据集可能存在多种 metric,需要返回一个 metrics 以及对应 scores 的相关字典。具体示例如下:
```python
-
from opencompass.openicl.icl_evaluator import BaseEvaluator
class MyDatasetEvaluator(BaseEvaluator):
@@ -30,14 +29,14 @@
```
- - (可选)如果 OpenCompass 已有的 postprocesser 不能满足需要,需要用户定义 `mydataset_postprocess` 方法,根据输入的字符串得到相应后处理的结果。具体示例如下:
+ - (可选)如果 OpenCompass 已有的后处理方法不能满足需要,需要用户定义 `mydataset_postprocess` 方法,根据输入的字符串得到相应后处理的结果。具体示例如下:
```python
def mydataset_postprocess(text: str) -> str:
pass
```
-2. 在定义好数据集加载,数据后处理以及 `evaluator` 等方法之后,需要在配置文件中新增以下配置:
+2. 在定义好数据集加载、评测以及数据后处理等方法之后,需要在配置文件中新增以下配置:
```python
from opencompass.datasets import MyDataset, MyDatasetEvaluator, mydataset_postprocess
@@ -56,5 +55,4 @@
]
```
- 配置好数据集之后,其他需要的配置文件直接参考如何启动评测任务教程即可。
-
\ No newline at end of file
+ 配置好数据集之后,其他需要的配置文件直接参考[快速上手](../get_started.md)教程即可。
diff --git a/docs/zh_cn/advanced_guides/new_model.md b/docs/zh_cn/advanced_guides/new_model.md
index 258dec8..db228ea 100644
--- a/docs/zh_cn/advanced_guides/new_model.md
+++ b/docs/zh_cn/advanced_guides/new_model.md
@@ -1,6 +1,6 @@
# 支持新模型
-目前我们已经支持的模型有 HF 模型、部分模型 API 、自建模型和部分第三方模型。
+目前我们已经支持的模型有 HF 模型、部分模型 API 和部分第三方模型。
## 新增API模型
diff --git a/docs/zh_cn/index.rst b/docs/zh_cn/index.rst
index 00bf6c2..ce59451 100644
--- a/docs/zh_cn/index.rst
+++ b/docs/zh_cn/index.rst
@@ -79,4 +79,4 @@ OpenCompass 上手路线
==================
* :ref:`genindex`
-* :ref:`search`
\ No newline at end of file
+* :ref:`search`
diff --git a/docs/zh_cn/prompt/prompt_template.md b/docs/zh_cn/prompt/prompt_template.md
index 667140e..7a814bd 100644
--- a/docs/zh_cn/prompt/prompt_template.md
+++ b/docs/zh_cn/prompt/prompt_template.md
@@ -1,3 +1,3 @@
# Prompt 模板
-Coming soon.
\ No newline at end of file
+Coming soon.
diff --git a/docs/zh_cn/user_guides/dataset_prepare.md b/docs/zh_cn/user_guides/dataset_prepare.md
index e0637bc..989fdb7 100644
--- a/docs/zh_cn/user_guides/dataset_prepare.md
+++ b/docs/zh_cn/user_guides/dataset_prepare.md
@@ -8,34 +8,26 @@
```
configs/datasets/
-├── ChineseUniversal # 能力维度
-│ ├── CLUE_afqmc # 该维度下的数据集
-│ │ ├── CLUE_afqmc_gen_db509b.py # 该数据集的不同配置文件
-│ │ ├── CLUE_afqmc_gen.py
-│ │ ├── CLUE_afqmc_ppl_00b348.py
-│ │ ├── CLUE_afqmc_ppl_2313cf.py
-│ │ └── CLUE_afqmc_ppl.py
-│ ├── CLUE_C3
-│ │ ├── ...
-│ ├── ...
-├── Coding
-├── collections
-├── Completion
-├── EnglishUniversal
-├── Exam
-├── glm
-├── LongText
-├── MISC
-├── NLG
-├── QA
-├── Reasoning
-├── Security
-└── Translation
+├── agieval
+├── apps
+├── ARC_c
+├── ...
+├── CLUE_afqmc # 数据集
+│ ├── CLUE_afqmc_gen_901306.py # 不同版本数据集配置文件
+│ ├── CLUE_afqmc_gen.py
+│ ├── CLUE_afqmc_ppl_378c5b.py
+│ ├── CLUE_afqmc_ppl_6507d7.py
+│ ├── CLUE_afqmc_ppl_7b0c1e.py
+│ └── CLUE_afqmc_ppl.py
+├── ...
+├── XLSum
+├── Xsum
+└── z_bench
```
-在 `configs/datasets` 目录结构下,我们主要以能力维度对数据集划分了十余项维度,例如:中英文通用、考试、问答、推理、安全等等。每一项维度又包含了一系列数据集,在各个数据集对应的文件夹下存在多个数据集配置。
+在 `configs/datasets` 目录结构下,我们直接展平所有数据集,在各个数据集对应的文件夹下存在多个数据集配置。
-数据集配置文件名由以下命名方式构成 `{数据集名称}_{评测方式}_{prompt版本号}.py`,以 `ChineseUniversal/CLUE_afqmc/CLUE_afqmc_gen_db509b.py` 为例,该配置文件则为中文通用能力下的 `CLUE_afqmc` 数据集,对应的评测方式为 `gen`,即生成式评测,对应的prompt版本号为 `db509b`;同样的, `CLUE_afqmc_ppl_00b348.py` 指评测方式为`ppl`即判别式评测,prompt版本号为 `00b348` 。
+数据集配置文件名由以下命名方式构成 `{数据集名称}_{评测方式}_{prompt版本号}.py`,以 `CLUE_afqmc/CLUE_afqmc_gen_db509b.py` 为例,该配置文件对应 `CLUE_afqmc` 数据集,评测方式为 `gen`,即生成式评测,prompt 版本号为 `db509b`;同样的,`CLUE_afqmc_ppl_00b348.py` 指评测方式为 `ppl`,即判别式评测,prompt 版本号为 `00b348`。
除此之外,不带版本号的文件,例如: `CLUE_afqmc_gen.py` 则指向该评测方式最新的prompt配置文件,通常来说会是精度最高的prompt。
@@ -49,13 +41,13 @@ OpenCompass 支持的数据集主要包括两个部分:
2. OpenCompass 自建数据集
-除了支持 Huggingface 已有的数据集, OpenCompass 还提供了一些自建CN数据集,未来将会提供一个数据集相关的Repo供用户下载使用。按照文档指示将数据集统一放置在`./data`目录下即可完成数据集准备。
+除了支持 Huggingface 已有的数据集, OpenCompass 还提供了一些自建CN数据集,未来将会提供一个数据集相关的链接供用户下载使用。按照文档指示将数据集统一放置在`./data`目录下即可完成数据集准备。
需要注意的是,Repo中不仅包含自建的数据集,为了方便也加入了部分HF已支持的数据集方便测试。
## 数据集选择
-在各个数据集配置文件中,数据集将会被定义在 `{}_datasets` 变量当中,例如下面 `ChineseUniversal/CLUE_afqmc/CLUE_afqmc_gen_db509b.py` 中的 `afqmc_datasets`。
+在各个数据集配置文件中,数据集将会被定义在 `{}_datasets` 变量当中,例如下面 `CLUE_afqmc/CLUE_afqmc_gen_db509b.py` 中的 `afqmc_datasets`。
```python
afqmc_datasets = [
@@ -70,7 +62,7 @@ afqmc_datasets = [
]
```
-以及 `ChineseUniversal/CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py` 中的 `afqmc_datasets`。
+以及 `CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py` 中的 `cmnli_datasets`。
```python
cmnli_datasets = [
diff --git a/docs/zh_cn/user_guides/experimentation.md b/docs/zh_cn/user_guides/experimentation.md
index 9578900..8f87de0 100644
--- a/docs/zh_cn/user_guides/experimentation.md
+++ b/docs/zh_cn/user_guides/experimentation.md
@@ -39,27 +39,27 @@ run.py {--slurm | --dlc | None} $Config [-p PARTITION] [-q QUOTATYPE] [--debug]
1. 打开 `configs/lark.py` 文件,并在文件中加入以下行:
- ```python
- lark_bot_url = 'YOUR_WEBHOOK_URL'
- ```
+```python
+lark_bot_url = 'YOUR_WEBHOOK_URL'
+```
- 通常, Webhook URL 格式如 https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxx 。
+通常, Webhook URL 格式如 https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxx 。
2. 在完整的评测配置中继承该文件:
- ```python
- from mmengine.config import read_base
+```python
+from mmengine.config import read_base
- with read_base():
- from .lark import lark_bot_url
+with read_base():
+    from .lark import lark_bot_url
- ```
+```
3. 为了避免机器人频繁发消息形成骚扰,默认运行时状态不会自动上报。有需要时,可以通过 `-l` 或 `--lark` 启动状态上报:
- ```bash
- python run.py configs/eval_demo.py -p {PARTITION} -l
- ```
+```bash
+python run.py configs/eval_demo.py -p {PARTITION} -l
+```
## Summarizer 介绍
diff --git a/opencompass/utils/__init__.py b/opencompass/utils/__init__.py
index 2960423..d9fdeb4 100644
--- a/opencompass/utils/__init__.py
+++ b/opencompass/utils/__init__.py
@@ -1,6 +1,6 @@
from .abbr import * # noqa
from .build import * # noqa
-from .collect_env import * #noqa
+from .collect_env import * # noqa
from .fileio import * # noqa
from .git import * # noqa
from .lark import * # noqa
|