diff --git a/LICENSE b/LICENSE
index 7973cfa..2921c35 100644
--- a/LICENSE
+++ b/LICENSE
@@ -200,4 +200,4 @@ Copyright 2020 OpenCompass Authors. All rights reserved.
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
-   limitations under the License.
\ No newline at end of file
+   limitations under the License.
diff --git a/README.md b/README.md
index 0b03b29..3a724b3 100644
--- a/README.md
+++ b/README.md
@@ -39,8 +39,6 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for commun
 [![image](https://github.com/InternLM/OpenCompass/assets/7881589/6b56c297-77c0-4e1a-9acc-24a45c5a734a)](https://opencompass.org.cn/rank)
-
-
 ## Dataset Support
@@ -245,7 +243,7 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for commun
-
+
 - InternLM
 - LLaMA
 - Vicuna
diff --git a/README_zh-CN.md b/README_zh-CN.md
index 580fbc6..d42740e 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -40,10 +40,8 @@ OpenCompass 是面向大模型评测的一站式平台。其主要特点如下
 我们将陆续提供开源模型和API模型的具体性能榜单,请见 [OpenCompass Leaderboard](https://opencompass.org.cn/rank) 。如需加入评测,请提供模型仓库地址或标准的 API 接口至邮箱 `opencompass@pjlab.org.cn`。
-
 ![image](https://github.com/InternLM/OpenCompass/assets/7881589/fddc8ab4-d2bd-429d-89f0-4ca90606599a)
-
 ## 数据集支持
diff --git a/docs/en/advanced_guides/new_dataset.md b/docs/en/advanced_guides/new_dataset.md
index d1f05d2..f7bb2c5 100644
--- a/docs/en/advanced_guides/new_dataset.md
+++ b/docs/en/advanced_guides/new_dataset.md
@@ -1,3 +1,57 @@
-# New Dataset
+# Add a Dataset

-Coming soon.
+Although OpenCompass already includes most commonly used datasets, you need to follow the steps below to add support for a new dataset:
+
+1. Add a dataset script `mydataset.py` to the `opencompass/datasets` folder. This script should include:
+
+   - The dataset and its loading method. Define a `MyDataset` class that implements the data loading method `load` as a static method. This method should return data of type `datasets.Dataset`. We use the Hugging Face dataset as the unified interface for datasets to avoid introducing additional logic. Here's an example:
+
+     ```python
+     import datasets
+     from .base import BaseDataset
+
+     class MyDataset(BaseDataset):
+
+         @staticmethod
+         def load(**kwargs) -> datasets.Dataset:
+             pass
+     ```
+
+   - (Optional) If the existing evaluators in OpenCompass do not meet your needs, define a `MyDatasetEvaluator` class that implements the scoring method `score`. This method takes `predictions` and `references` as input and returns a dictionary. Since a dataset may have multiple metrics, the dictionary should map each metric to its corresponding score. Here's an example:
+
+     ```python
+     from typing import List
+
+     from opencompass.openicl.icl_evaluator import BaseEvaluator
+
+     class MyDatasetEvaluator(BaseEvaluator):
+
+         def score(self, predictions: List, references: List) -> dict:
+             pass
+     ```
+
+   - (Optional) If the existing postprocessors in OpenCompass do not meet your needs, define a `mydataset_postprocess` method. This method takes an input string and returns the corresponding postprocessed result string. Here's an example:
+
+     ```python
+     def mydataset_postprocess(text: str) -> str:
+         pass
+     ```
+
+2. After defining the dataset loading, postprocessing, and evaluator methods, add the following configuration to the configuration file:
+
+   ```python
+   from opencompass.datasets import MyDataset, MyDatasetEvaluator, mydataset_postprocess
+
+   mydataset_eval_cfg = dict(
+       evaluator=dict(type=MyDatasetEvaluator),
+       pred_postprocessor=dict(type=mydataset_postprocess))
+
+   mydataset_datasets = [
+       dict(
+           type=MyDataset,
+           ...,
+           reader_cfg=...,
+           infer_cfg=...,
+           eval_cfg=mydataset_eval_cfg)
+   ]
+   ```
+
+   Once the dataset is configured, refer to [Get started](../get_started.md) for the remaining setup.
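For concreteness, a complete `mydataset.py` might look like the following minimal sketch; the JSON Lines input, the `question`/`answer` field names, and the exact-match metric are illustrative assumptions rather than an existing OpenCompass dataset:

```python
import json
from typing import List

import datasets

from opencompass.openicl.icl_evaluator import BaseEvaluator
from .base import BaseDataset


class MyDataset(BaseDataset):

    @staticmethod
    def load(path: str) -> datasets.Dataset:
        # Assumes one JSON object per line, e.g. {"question": ..., "answer": ...}.
        with open(path, 'r', encoding='utf-8') as f:
            rows = [json.loads(line) for line in f]
        return datasets.Dataset.from_list(rows)


class MyDatasetEvaluator(BaseEvaluator):

    def score(self, predictions: List, references: List) -> dict:
        # Report a single exact-match accuracy metric.
        correct = sum(p.strip() == r.strip()
                      for p, r in zip(predictions, references))
        return {'accuracy': 100 * correct / len(references)}


def mydataset_postprocess(text: str) -> str:
    # Keep only the first non-empty line of the raw model output.
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ''
```
diff --git a/docs/en/advanced_guides/new_model.md b/docs/en/advanced_guides/new_model.md
index a93e773..ce7bc6f 100644
--- a/docs/en/advanced_guides/new_model.md
+++ b/docs/en/advanced_guides/new_model.md
@@ -1,3 +1,73 @@
-# New A Model
+# Add a Model

-Coming soon.
+Currently, we support HF models, some model APIs, and some third-party models.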
+
+## Adding API Models
+
+To add a new API-based model, create a new file named `mymodel_api.py` under the `opencompass/models` directory. In this file, inherit from `BaseAPIModel` and implement the `generate` method for inference and the `get_token_len` method to calculate the length of tokens. Once you have defined the model, modify the corresponding configuration file.
+
+```python
+from typing import Dict, List, Optional
+
+from ..base_api import BaseAPIModel
+
+class MyModelAPI(BaseAPIModel):
+
+    is_api: bool = True
+
+    def __init__(self,
+                 path: str,
+                 max_seq_len: int = 2048,
+                 query_per_second: int = 1,
+                 meta_template: Optional[Dict] = None,
+                 retry: int = 2,
+                 **kwargs):
+        super().__init__(path=path,
+                         max_seq_len=max_seq_len,
+                         meta_template=meta_template,
+                         query_per_second=query_per_second,
+                         retry=retry)
+        ...
+
+    def generate(
+        self,
+        inputs: List[str],
+        max_out_len: int = 512,
+        temperature: float = 0.7,
+    ) -> List[str]:
+        """Generate results given a list of inputs."""
+        pass
+
+    def get_token_len(self, prompt: str) -> int:
+        """Get lengths of the tokenized string."""
+        pass
+```
+
+## Adding Third-Party Models
+
+To add a new third-party model, create a new file named `mymodel.py` under the `opencompass/models` directory. In this file, inherit from `BaseModel` and implement the `generate` method for generative inference, the `get_ppl` method for discriminative inference, and the `get_token_len` method to calculate the length of tokens. Once you have defined the model, modify the corresponding configuration file.
+
+```python
+from typing import Dict, List, Optional
+
+from ..base import BaseModel
+
+class MyModel(BaseModel):
+
+    def __init__(self,
+                 pkg_root: str,
+                 ckpt_path: str,
+                 tokenizer_only: bool = False,
+                 meta_template: Optional[Dict] = None,
+                 **kwargs):
+        ...
+
+    def get_token_len(self, prompt: str) -> int:
+        """Get lengths of the tokenized strings."""
+        pass
+
+    def generate(self, inputs: List[str], max_out_len: int) -> List[str]:
+        """Generate results given a list of inputs."""
+        pass
+
+    def get_ppl(self,
+                inputs: List[str],
+                mask_length: Optional[List[int]] = None) -> List[float]:
+        """Get perplexity scores given a list of inputs."""
+        pass
+```
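To make the `generate` contract concrete, here is a minimal sketch of the request loop an API model might use. The endpoint URL, payload fields, and response shape are hypothetical, and a real implementation would add the provider's authentication, retries, and rate limiting:

```python
from typing import List

import requests

API_URL = 'https://api.example.com/v1/completions'  # hypothetical endpoint


def generate(inputs: List[str], max_out_len: int = 512,
             temperature: float = 0.7) -> List[str]:
    """Query the (hypothetical) completion endpoint once per prompt."""
    results = []
    for prompt in inputs:
        resp = requests.post(API_URL,
                             json={'prompt': prompt,
                                   'max_tokens': max_out_len,
                                   'temperature': temperature},
                             timeout=60)
        resp.raise_for_status()
        results.append(resp.json()['text'])  # assumed response schema
    return results


def get_token_len(prompt: str) -> int:
    """Whitespace splitting is a crude stand-in; use the provider's
    tokenizer when one is available."""
    return len(prompt.split())
```

Inside a real `MyModelAPI`, these would be methods on the class, with the endpoint and credentials taken from the configuration.
diff --git a/docs/en/get_started.md b/docs/en/get_started.md
index 4cdca44..e773ff3 100644
--- a/docs/en/get_started.md
+++ b/docs/en/get_started.md
@@ -107,7 +107,7 @@ models = [llama_7b]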
-Lauch Evalution
+Launch Evaluation
 First, we can start the task in **debug mode** to check for any exceptions in model loading, dataset reading, or incorrect cache usage.
diff --git a/docs/en/index.rst b/docs/en/index.rst
index bcd7f43..0b46b75 100644
--- a/docs/en/index.rst
+++ b/docs/en/index.rst
@@ -79,4 +79,4 @@ Indexes & Tables
 ==================
 * :ref:`genindex`
-* :ref:`search`
\ No newline at end of file
+* :ref:`search`
diff --git a/docs/en/user_guides/dataset_prepare.md b/docs/en/user_guides/dataset_prepare.md
index 2ff2ef4..faca61b 100644
--- a/docs/en/user_guides/dataset_prepare.md
+++ b/docs/en/user_guides/dataset_prepare.md
@@ -8,34 +8,26 @@ First, let's introduce the structure under the `configs/datasets` directory in O
 ```
 configs/datasets/
-├── ChineseUniversal # Ability dimension
-│   ├── CLUE_afqmc # Dataset under this dimension
-│   │   ├── CLUE_afqmc_gen_db509b.py # Different configuration files for this dataset
-│   │   ├── CLUE_afqmc_gen.py
-│   │   ├── CLUE_afqmc_ppl_00b348.py
-│   │   ├── CLUE_afqmc_ppl_2313cf.py
-│   │   └── CLUE_afqmc_ppl.py
-│   ├── CLUE_C3
-│   │   ├── ...
-│   ├── ...
-├── Coding
-├── collections
-├── Completion
-├── EnglishUniversal
-├── Exam
-├── glm
-├── LongText
-├── MISC
-├── NLG
-├── QA
-├── Reasoning
-├── Security
-└── Translation
+├── agieval
+├── apps
+├── ARC_c
+├── ...
+├── CLUE_afqmc # dataset
+│   ├── CLUE_afqmc_gen_901306.py # different versions of the config
+│   ├── CLUE_afqmc_gen.py
+│   ├── CLUE_afqmc_ppl_378c5b.py
+│   ├── CLUE_afqmc_ppl_6507d7.py
+│   ├── CLUE_afqmc_ppl_7b0c1e.py
+│   └── CLUE_afqmc_ppl.py
+├── ...
+├── XLSum
+├── Xsum
+└── z_bench
 ```
-In the `configs/datasets` directory structure, we have divided the datasets into over ten dimensions based on ability dimensions, such as: Chinese and English Universal, Exam, QA, Reasoning, Security, etc. Each dimension contains a series of datasets, and there are multiple dataset configurations in the corresponding folder of each dataset.
+In the `configs/datasets` directory, all datasets are laid out flat, and each dataset's folder contains multiple configuration files for that dataset.
-The naming of the dataset configuration file is made up of `{dataset name}_{evaluation method}_{prompt version number}.py`. For example, `ChineseUniversal/CLUE_afqmc/CLUE_afqmc_gen_db509b.py`, this configuration file is the `CLUE_afqmc` dataset under the Chinese universal ability, the corresponding evaluation method is `gen`, i.e., generative evaluation, and the corresponding prompt version number is `db509b`; similarly, `CLUE_afqmc_ppl_00b348.py` indicates that the evaluation method is `ppl`, i.e., discriminative evaluation, and the prompt version number is `00b348`.
+The name of a dataset configuration file takes the form `{dataset name}_{evaluation method}_{prompt version number}.py`. For example, `CLUE_afqmc/CLUE_afqmc_gen_901306.py` is a configuration for the `CLUE_afqmc` dataset whose evaluation method is `gen`, i.e., generative evaluation, with prompt version `901306`; similarly, `CLUE_afqmc_ppl_378c5b.py` uses the evaluation method `ppl`, i.e., discriminative evaluation, with prompt version `378c5b`.
 In addition, files without a version number, such as `CLUE_afqmc_gen.py`, point to the latest prompt configuration file of that evaluation method, which is usually the most accurate prompt.
@@ -49,13 +41,13 @@ The datasets supported by OpenCompass mainly include two parts:
 2. OpenCompass Self-built Datasets
-In addition to supporting Huggingface's existing datasets, OpenCompass also provides some self-built CN datasets. In the future, a dataset-related Repo will be provided for users to download and use. Following the instructions in the document to place the datasets uniformly in the `./data` directory can complete dataset preparation.
+In addition to supporting Huggingface's existing datasets, OpenCompass also provides some self-built Chinese datasets. A download link for these datasets will be provided in the future. Follow the instructions in the documentation to place them under the `./data` directory to complete dataset preparation.
 It is important to note that the Repo not only contains self-built datasets, but also includes some HF-supported datasets for testing convenience.
 ## Dataset Selection
-In each dataset configuration file, the dataset will be defined in the `{}_datasets` variable, such as `afqmc_datasets` in `ChineseUniversal/CLUE_afqmc/CLUE_afqmc_gen_db509b.py`.
+In each dataset configuration file, the dataset is defined in the `{}_datasets` variable, such as `afqmc_datasets` in `CLUE_afqmc/CLUE_afqmc_gen_901306.py`.
 ```python
 afqmc_datasets = [
@@ -70,7 +62,7 @@ afqmc_datasets = [
 ]
 ```
-And `afqmc_datasets` in `ChineseUniversal/CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py`.
+And `cmnli_datasets` in `CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py`.
 ```python
 cmnli_datasets = [
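The selected datasets are then pulled into a top-level evaluation config via `read_base`, as in this minimal sketch (the config file name is hypothetical; the import paths assume the flattened `configs/datasets` layout shown above):

```python
# configs/eval_clue_demo.py -- a hypothetical top-level config
from mmengine.config import read_base

with read_base():
    # Import only the dataset configs you want to evaluate.
    from .datasets.CLUE_afqmc.CLUE_afqmc_gen_901306 import afqmc_datasets
    from .datasets.CLUE_cmnli.CLUE_cmnli_ppl_b78ad4 import cmnli_datasets

# Concatenate the per-dataset lists into the final selection.
datasets = [*afqmc_datasets, *cmnli_datasets]
```
diff --git a/docs/zh_cn/advanced_guides/new_dataset.md b/docs/zh_cn/advanced_guides/new_dataset.md
index 44f2f33..2b631cd 100644
--- a/docs/zh_cn/advanced_guides/new_dataset.md
+++ b/docs/zh_cn/advanced_guides/new_dataset.md
@@ -4,7 +4,7 @@
 1. 在 `opencompass/datasets` 文件夹新增数据集脚本 `mydataset.py`, 该脚本需要包含:

-   - 数据集及其加载方式,需要定义一个 `MyDataset` 类,实现数据集加载方法 `load` ,该方法为静态方法,需要返回 `datasets.Dataset` 类型的数据。这里我们使用 huggingface dataset 作为数据集的统一接口,避免引入额外的逻辑。具体示例如下:
+   - 数据集及其加载方式,需要定义一个 `MyDataset` 类,实现数据集加载方法 `load`,该方法为静态方法,需要返回 `datasets.Dataset` 类型的数据。这里我们使用 huggingface dataset 作为数据集的统一接口,避免引入额外的逻辑。具体示例如下:

     ```python
     import datasets
     from .base import BaseDataset

     class MyDataset(BaseDataset):

         @staticmethod
         def load(**kwargs) -> datasets.Dataset:
             pass
@@ -17,10 +17,9 @@
     ```

-   - (可选)如果OpenCompass已有的evaluator不能满足需要,需要用户定义 `MyDatasetlEvaluator` 类,实现评分方法 `score` ,需要根据输入的 `predictions` 和 `references` 列表,得到需要的字典。由于一个数据集可能存在多种metric,需要返回一个 metrics 以及对应 scores 的相关字典。具体示例如下:
+   - (可选)如果 OpenCompass 已有的评测器不能满足需要,需要用户定义 `MyDatasetEvaluator` 类,实现评分方法 `score`,需要根据输入的 `predictions` 和 `references` 列表,得到需要的字典。由于一个数据集可能存在多种 metric,需要返回一个 metrics 以及对应 scores 的相关字典。具体示例如下:

     ```python
-
     from opencompass.openicl.icl_evaluator import BaseEvaluator

     class MyDatasetEvaluator(BaseEvaluator):

         def score(self, predictions: List, references: List) -> dict:
             pass
@@ -30,14 +29,14 @@
     ```

-   - (可选)如果 OpenCompass 已有的 postprocesser 不能满足需要,需要用户定义 `mydataset_postprocess` 方法,根据输入的字符串得到相应后处理的结果。具体示例如下:
+   - (可选)如果 OpenCompass 已有的后处理方法不能满足需要,需要用户定义 `mydataset_postprocess` 方法,根据输入的字符串得到相应后处理的结果。具体示例如下:

     ```python
     def mydataset_postprocess(text: str) -> str:
         pass
     ```

-2. 在定义好数据集加载,数据后处理以及 `evaluator` 等方法之后,需要在配置文件中新增以下配置: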
+2. 在定义好数据集加载、评测以及数据后处理等方法之后,需要在配置文件中新增以下配置:

   ```python
   from opencompass.datasets import MyDataset, MyDatasetEvaluator, mydataset_postprocess

   mydataset_eval_cfg = dict(
       evaluator=dict(type=MyDatasetEvaluator),
       pred_postprocessor=dict(type=mydataset_postprocess))

   mydataset_datasets = [
       dict(
           type=MyDataset,
           ...,
           reader_cfg=...,
           infer_cfg=...,
           eval_cfg=mydataset_eval_cfg)
   ]
   ```
@@ -56,5 +55,4 @@
-   配置好数据集之后,其他需要的配置文件直接参考如何启动评测任务教程即可。
-
\ No newline at end of file
+   配置好数据集之后,其他需要的配置文件直接参考[快速上手](../get_started.md)教程即可。
diff --git a/docs/zh_cn/advanced_guides/new_model.md b/docs/zh_cn/advanced_guides/new_model.md
index 258dec8..db228ea 100644
--- a/docs/zh_cn/advanced_guides/new_model.md
+++ b/docs/zh_cn/advanced_guides/new_model.md
@@ -1,6 +1,6 @@
 # 支持新模型

-目前我们已经支持的模型有 HF 模型、部分模型 API 、自建模型和部分第三方模型。
+目前我们已经支持的模型有 HF 模型、部分模型 API 和部分第三方模型。

 ## 新增API模型
diff --git a/docs/zh_cn/index.rst b/docs/zh_cn/index.rst
index 00bf6c2..ce59451 100644
--- a/docs/zh_cn/index.rst
+++ b/docs/zh_cn/index.rst
@@ -79,4 +79,4 @@ OpenCompass 上手路线
 ==================
 * :ref:`genindex`
-* :ref:`search`
\ No newline at end of file
+* :ref:`search`
diff --git a/docs/zh_cn/prompt/prompt_template.md b/docs/zh_cn/prompt/prompt_template.md
index 667140e..7a814bd 100644
--- a/docs/zh_cn/prompt/prompt_template.md
+++ b/docs/zh_cn/prompt/prompt_template.md
@@ -1,3 +1,3 @@
 # Prompt 模板

-Coming soon.
\ No newline at end of file
+Coming soon.
diff --git a/docs/zh_cn/user_guides/dataset_prepare.md b/docs/zh_cn/user_guides/dataset_prepare.md
index e0637bc..989fdb7 100644
--- a/docs/zh_cn/user_guides/dataset_prepare.md
+++ b/docs/zh_cn/user_guides/dataset_prepare.md
@@ -8,34 +8,26 @@
 ```
 configs/datasets/
-├── ChineseUniversal # 能力维度
-│   ├── CLUE_afqmc # 该维度下的数据集
-│   │   ├── CLUE_afqmc_gen_db509b.py # 该数据集的不同配置文件
-│   │   ├── CLUE_afqmc_gen.py
-│   │   ├── CLUE_afqmc_ppl_00b348.py
-│   │   ├── CLUE_afqmc_ppl_2313cf.py
-│   │   └── CLUE_afqmc_ppl.py
-│   ├── CLUE_C3
-│   │   ├── ...
-│   ├── ...
-├── Coding
-├── collections
-├── Completion
-├── EnglishUniversal
-├── Exam
-├── glm
-├── LongText
-├── MISC
-├── NLG
-├── QA
-├── Reasoning
-├── Security
-└── Translation
+├── agieval
+├── apps
+├── ARC_c
+├── ...
+├── CLUE_afqmc # 数据集
+│   ├── CLUE_afqmc_gen_901306.py # 不同版本数据集配置文件
+│   ├── CLUE_afqmc_gen.py
+│   ├── CLUE_afqmc_ppl_378c5b.py
+│   ├── CLUE_afqmc_ppl_6507d7.py
+│   ├── CLUE_afqmc_ppl_7b0c1e.py
+│   └── CLUE_afqmc_ppl.py
+├── ...
+├── XLSum
+├── Xsum
+└── z_bench
 ```
-在 `configs/datasets` 目录结构下,我们主要以能力维度对数据集划分了十余项维度,例如:中英文通用、考试、问答、推理、安全等等。每一项维度又包含了一系列数据集,在各个数据集对应的文件夹下存在多个数据集配置。
+在 `configs/datasets` 目录下,所有数据集直接平铺存放,各个数据集对应的文件夹下存在多个数据集配置文件。
-数据集配置文件名由以下命名方式构成 `{数据集名称}_{评测方式}_{prompt版本号}.py`,以 `ChineseUniversal/CLUE_afqmc/CLUE_afqmc_gen_db509b.py` 为例,该配置文件则为中文通用能力下的 `CLUE_afqmc` 数据集,对应的评测方式为 `gen`,即生成式评测,对应的prompt版本号为 `db509b`;同样的, `CLUE_afqmc_ppl_00b348.py` 指评测方式为`ppl`即判别式评测,prompt版本号为 `00b348` 。
+数据集配置文件名由 `{数据集名称}_{评测方式}_{prompt版本号}.py` 的方式构成。以 `CLUE_afqmc/CLUE_afqmc_gen_901306.py` 为例,该配置文件为 `CLUE_afqmc` 数据集的配置,对应的评测方式为 `gen`,即生成式评测,对应的 prompt 版本号为 `901306`;同样的,`CLUE_afqmc_ppl_378c5b.py` 指评测方式为 `ppl`,即判别式评测,prompt 版本号为 `378c5b`。
 除此之外,不带版本号的文件,例如:`CLUE_afqmc_gen.py` 则指向该评测方式最新的 prompt 配置文件,通常来说会是精度最高的 prompt。
@@ -49,13 +41,13 @@ OpenCompass 支持的数据集主要包括两个部分:
 2. OpenCompass 自建数据集

-除了支持 Huggingface 已有的数据集, OpenCompass 还提供了一些自建CN数据集,未来将会提供一个数据集相关的Repo供用户下载使用。按照文档指示将数据集统一放置在`./data`目录下即可完成数据集准备。
+除了支持 Huggingface 已有的数据集,OpenCompass 还提供了一些自建中文数据集,未来将会提供一个数据集相关的链接供用户下载使用。按照文档指示将数据集统一放置在 `./data` 目录下即可完成数据集准备。

 需要注意的是,Repo 中不仅包含自建的数据集,为了方便测试也加入了部分 HF 已支持的数据集。

 ## 数据集选择

-在各个数据集配置文件中,数据集将会被定义在 `{}_datasets` 变量当中,例如下面 `ChineseUniversal/CLUE_afqmc/CLUE_afqmc_gen_db509b.py` 中的 `afqmc_datasets`。
+在各个数据集配置文件中,数据集将会被定义在 `{}_datasets` 变量当中,例如下面 `CLUE_afqmc/CLUE_afqmc_gen_901306.py` 中的 `afqmc_datasets`。

 ```python
 afqmc_datasets = [
@@ -70,7 +62,7 @@ afqmc_datasets = [
 ]
 ```

-以及 `ChineseUniversal/CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py` 中的 `afqmc_datasets`。
+以及 `CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py` 中的 `cmnli_datasets`。

 ```python
 cmnli_datasets = [
diff --git a/docs/zh_cn/user_guides/experimentation.md b/docs/zh_cn/user_guides/experimentation.md
index 9578900..8f87de0 100644
--- a/docs/zh_cn/user_guides/experimentation.md
+++ b/docs/zh_cn/user_guides/experimentation.md
@@ -39,27 +39,27 @@ run.py {--slurm | --dlc | None} $Config [-p PARTITION] [-q QUOTATYPE] [--debug]
 1. 打开 `configs/lark.py` 文件,并在文件中加入以下行:

-   ```python
-   lark_bot_url = 'YOUR_WEBHOOK_URL'
-   ```
+```python
+lark_bot_url = 'YOUR_WEBHOOK_URL'
+```

-   通常, Webhook URL 格式如 https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxx 。
+通常,Webhook URL 格式如 https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxx 。

 2. 在完整的评测配置中继承该文件:

-   ```python
-   from mmengine.config import read_base
-
-   with read_base():
-       from .lark import lark_bot_url
-
-   ```
+```python
+from mmengine.config import read_base
+
+with read_base():
+    from .lark import lark_bot_url
+```

 3. 为了避免机器人频繁发消息形成骚扰,默认运行时状态不会自动上报。有需要时,可以通过 `-l` 或 `--lark` 启动状态上报:

-   ```bash
-   python run.py configs/eval_demo.py -p {PARTITION} -l
-   ```
+```bash
+python run.py configs/eval_demo.py -p {PARTITION} -l
+```
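作为补充,下面给出一个继承 `lark.py` 的完整评测配置草图;文件名以及数据集、模型的导入路径均为示意用的假设值,并非仓库中已有文件:

```python
# configs/eval_demo_lark.py —— 仅为示意草图,导入路径为假设值
from mmengine.config import read_base

with read_base():
    from .lark import lark_bot_url  # 状态上报所需的 Webhook 配置
    from .datasets.CLUE_afqmc.CLUE_afqmc_gen_901306 import afqmc_datasets  # 假设选用的数据集
    from .models.hf_llama_7b import models  # 假设的模型配置

datasets = [*afqmc_datasets]
```

随后按上文方式启动并开启上报:`python run.py configs/eval_demo_lark.py -p {PARTITION} -l`。

 ## Summarizer 介绍
diff --git a/opencompass/utils/__init__.py b/opencompass/utils/__init__.py
index 2960423..d9fdeb4 100644
--- a/opencompass/utils/__init__.py
+++ b/opencompass/utils/__init__.py
@@ -1,6 +1,6 @@
 from .abbr import * # noqa
 from .build import * # noqa
-from .collect_env import * #noqa
+from .collect_env import * # noqa
 from .fileio import * # noqa
 from .git import * # noqa
 from .lark import * # noqa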