Pairwise comparison GPT evaluation #34

Open · wants to merge 4 commits into base: main
4 changes: 2 additions & 2 deletions README.md
@@ -61,13 +61,13 @@ To fully configure BERGEN, please read our [configuration guide](documentation/c
Run the evaluation script to calculate LLMEval metrics and print the results:

```bash
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm vllm_SOLAR-107B
python3 evaluate.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm vllm_SOLAR-107B

# parse all the experiment files into a pandas dataframe
python print_results.py --folder experiments/ --format=tiny
```

For more evaluation options and details, refer to the [Evaluation section](documentation/evaluations.md) in the complete documentation.
Bergen also offers the possibility to run pairwise comparisons using an LLM as judge. For more evaluation options and details, refer to the [Evaluation section](documentation/evaluations.md) in the complete documentation.

## RAG Baselines
Bergen provides results for several models and many datasets, aiming to **provide strong baselines**. On the most important datasets for RAG, the match metric is given by the table below (see more in our paper):
9 changes: 8 additions & 1 deletion config/evaluator/default_multi_qa.yaml
@@ -7,4 +7,11 @@ output_options:
prompt:
system: f"You are an evaluation tool. Answer with one of \n {self.rubrik_section}."
user: f"Here is a question, a golden answer and an AI-generated answer. Can you judge whether the AI-generated answer is correct according to the question and golden answer, simply answer with one of {self.rubrik_section}.\n Question:\ {question}. \nGolden answer:\ {answer} \n Generated answer:\ {prediction}"
user_without_system: f"You are an evaluation tool. Just answer as following {self.rubrik_section}. Here is a question, a golden answer and an AI-generated answer. Judge whether the AI-generated answer is correct according to the question and golden answer, answer with {self.rubrik_section}.\nQuestion:\ {question}.\nGolden answer:\ {answer}\nGenerated answer:\ {prediction}"
user_without_system: f"You are an evaluation tool. Just answer as following {self.rubrik_section}. Here is a question, a golden answer and an AI-generated answer. Judge whether the AI-generated answer is correct according to the question and golden answer, answer with {self.rubrik_section}.\nQuestion:\ {question}.\nGolden answer:\ {answer}\nGenerated answer:\ {prediction}"
output_options_pairwise:
'1': 1.
'2': 0.
'3': 0.5
prompt_pairwise:
system: f"You are a helpful assistant, that ranks models by the quality of their answers. Please act as an impartial judge. Do not allow the length of the responses to influence your evaluation. Be as objective as possible."
user: f"Here is a question, a ground truth answer, an AI-generated answer 1 and an AI-generated answer 2. Which answer is the most correct one ? Simply answer {{1}} if the first is better, {{2}} if the second is better and {{3}} if it's a tie. \n Question:\ {question}.\n Ground truth answer:\ {ref_answer}.\n Answer 1:\ {answer_1}.\n Answer 2:\ {answer_2}."
8 changes: 7 additions & 1 deletion config/evaluator/default_qa.yaml
@@ -6,5 +6,11 @@ output_options:
prompt:
system: f"You are an evaluation tool. Answer with one of {self.rubrik_section}."
user: f"Here is a question, a golden answer and an AI-generated answer. Can you judge whether the AI-generated answer is correct according to the question and golden answer, simply answer with one of {self.rubrik_section}.\n Question:\ {question}. \nGolden answer:\ {answer} \n Generated answer:\ {prediction}"
assistant: f"Response:\ {{"
user_without_system: f"You are an evaluation tool. Just answer by {self.rubrik_section}. Here is a question, a golden answer and an AI-generated answer. Judge whether the AI-generated answer is correct according to the question and golden answer, answer with {self.rubrik_section}.\nQuestion:\ {question}.\nGolden answer:\ {answer}\nGenerated answer:\ {prediction}"
output_options_pairwise:
'1': 1.
'2': 0.
'3': 0.5
prompt_pairwise:
system: f"You are a helpful assistant, that ranks models by the quality of their answers. Please act as an impartial judge. Do not allow the length of the responses to influence your evaluation. Be as objective as possible."
user: f"Here is a question, a ground truth answer, an AI-generated answer 1 and an AI-generated answer 2. Which answer is the most correct one ? Simply answer 1 if the first is better, 2 if the second is better and 3 if it's a tie. \n Question:\ {question}.\n Ground truth answer:\ {answer}.\n Answer 1:\ {prediction_1}.\n Answer 2:\ {prediction_2}."
22 changes: 18 additions & 4 deletions documentation/evaluations.md
@@ -14,7 +14,7 @@ Example files generated for split `dev` using `naver_splade-cocondenser-selfdist

Non-neural metrics will be calculated automatically. Neural metrics such as `BEM` and `LLM` need to be invoked separately.

By default `eval.py` will scan all folders in `experiments/` and evaluate them sequentially. To evaluate a single folder pass the folder using `--folder`. To avoid running out of memory either run `BEM` using `--bem` or run `LLM` using `--llm` . A csv file will automatically be saved to `results/` containing the table in `csv` format.
By default `evaluate.py` will scan all folders in `experiments/` and evaluate them sequentially. To evaluate a single folder, pass it using `--folder`. To avoid running out of memory, run either `BEM` using `--bem` or `LLM` using `--llm`. A CSV file containing the results table will automatically be saved to `results/`.

When using `--llm` you can choose how LLM predictions are transformed into the final score:
- directly check the generated answer for the expected label occurrence (Yes/No by default) and assign the corresponding score (1/0 by default); when no expected label is found, or when more than one expected label is matched, the sample is assigned a score of -100 and excluded from the mean score computation (see the sketch after the command below)
@@ -23,17 +23,17 @@ The choice of score interpretation is done via the `use_logits` parameter specified


```bash
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm
python3 evaluate.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm
```
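
To make the default, logit-free score interpretation described above concrete, here is a minimal sketch of the label-matching step; the function names, the word-boundary matching and the example strings are illustrative assumptions rather than the actual `evaluate.py` implementation.

```python
import re

# Default expected labels mapped to scores (Yes/No -> 1/0), as described above.
LABEL_SCORES = {"yes": 1.0, "no": 0.0}
INVALID = -100.0  # sentinel for samples where zero or several labels are matched

def score_judge_output(generated: str) -> float:
    """Map the judge LLM's generated answer to a score, or -100 if ambiguous."""
    text = generated.lower()
    matched = [label for label in LABEL_SCORES if re.search(rf"\b{label}\b", text)]
    if len(matched) != 1:  # no expected label found, or both labels present
        return INVALID
    return LABEL_SCORES[matched[0]]

def mean_valid_score(scores: list[float]) -> float:
    """Mean over samples, excluding those flagged with -100."""
    valid = [s for s in scores if s != INVALID]
    return sum(valid) / len(valid) if valid else float("nan")

scores = [score_judge_output(o) for o in ["Yes, it is correct.", "No.", "Unsure"]]
print(scores, mean_valid_score(scores))  # [1.0, 0.0, -100.0] 0.5
```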
Similarly to `--generator`, you can pass which LLM to use as the first argument of `--llm` (use the name of its configuration file), and a short name, used in the metric name, as the second argument.


```bash
# use llama2-7b-chat to run the evaluation; the output metric will be named VLLMeval_l2_7b
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm "vllm_llama-2-7b-chat" "l2_7b"
python3 evaluate.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm "vllm_llama-2-7b-chat" "l2_7b"

# use tinyllama to run the evaluation; the output metric will be named LLMeval_tinyllama
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm "tinyllama-chat" "tinyllama"
python3 evaluate.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm "tinyllama-chat" "tinyllama"

# in the default settings (no arguments specified) SOLAR-107B is used for evaluation and the output metric is named LLMeval
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm
@@ -53,3 +53,17 @@ If you have local ollama server running, you can call models installed on this s
python3 eval.py --experiments_folder experiments/ --llm_ollama "phi3:latest" --ollama_url "http://localhost:11434" --llm_prompt default_multi_qa
```

### Pairwise comparisons

Instead of computing an LLM eval score for a single run, you can compare the outputs of two runs using the same script and a few additional arguments, e.g.
```bash
python3 evaluate.py --llm --folder mistral_preds --opponent_folder llama_preds --opponent_name llama
```
where both `mistral_preds` and `llama_preds` are output folders of Bergen inference runs.
This script uses an LLM as judge (this can be any LLM supported in Bergen, or gpt-4o) to compare the two sets of predictions and computes win/tie/lose rates against the opponent. The results are stored in the metrics file of the folder. The prompt used is the pairwise prompt in `config/evaluator/default_qa.yaml`.

This approach does not use logits but rather the raw prediction of the judge LLM (win, tie, or lose).

In this setup, note that:
- a single experiment folder must be specified for both `--folder` and `--opponent_folder`
- the `--opponent_name` argument is required
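
As a rough sketch of how the per-sample pairwise scores could be aggregated into the win/tie/lose rates stored in the metrics file, consider the snippet below; the function name and the assumption that 1.0/0.5/0.0 mean win/tie/loss for the run passed via `--folder` are mine, not taken from the BERGEN code.

```python
from collections import Counter

def pairwise_rates(scores: list[float]) -> dict[str, float]:
    """Turn per-sample pairwise scores into win/tie/lose rates vs. the opponent.

    Assumes 1.0 = win, 0.5 = tie and 0.0 = loss for the run passed via --folder.
    """
    counts = Counter(scores)
    n = len(scores) or 1  # guard against division by zero on empty input
    return {
        "win_rate": counts[1.0] / n,
        "tie_rate": counts[0.5] / n,
        "lose_rate": counts[0.0] / n,
    }

print(pairwise_rates([1.0, 1.0, 0.5, 0.0]))
# {'win_rate': 0.5, 'tie_rate': 0.25, 'lose_rate': 0.25}
```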
204 changes: 0 additions & 204 deletions eval.py

This file was deleted.
