Pairwise comparison GPT evaluation #34

Open · wants to merge 4 commits into base: main
4 changes: 2 additions & 2 deletions README.md
@@ -61,13 +61,13 @@ To fully configure BERGEN, please read our [configuration guide](documentation/c
Run the evaluation script to calculate LLMEval metrics and print the results:

```bash
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm vllm_SOLAR-107B
python3 evaluate.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm vllm_SOLAR-107B

# parse all the experiment files into a pandas dataframe
python print_results.py --folder experiments/ --format=tiny
```

For more evaluation options and details, refer to the [Evaluation section](documentation/evaluations.md) in the complete documentation.
Bergen also offers the possibility to run pairwise comparisons using an LLM as judge. For more evaluation options and details, refer to the [Evaluation section](documentation/evaluations.md) in the complete documentation.

## RAG Baselines
Bergen provides results for several models and many datasets, aiming to **provide strong baselines**. On the most important datasets for RAG, the match metric is given by the table below (see more in our paper):
9 changes: 8 additions & 1 deletion config/evaluator/default_multi_qa.yaml
@@ -7,4 +7,11 @@ output_options:
prompt:
system: f"You are an evaluation tool. Answer with one of \n {self.rubrik_section}."
user: f"Here is a question, a golden answer and an AI-generated answer. Can you judge whether the AI-generated answer is correct according to the question and golden answer, simply answer with one of {self.rubrik_section}.\n Question:\ {question}. \nGolden answer:\ {answer} \n Generated answer:\ {prediction}"
user_without_system: f"You are an evaluation tool. Just answer as following {self.rubrik_section}. Here is a question, a golden answer and an AI-generated answer. Judge whether the AI-generated answer is correct according to the question and golden answer, answer with {self.rubrik_section}.\nQuestion:\ {question}.\nGolden answer:\ {answer}\nGenerated answer:\ {prediction}"
user_without_system: f"You are an evaluation tool. Just answer as following {self.rubrik_section}. Here is a question, a golden answer and an AI-generated answer. Judge whether the AI-generated answer is correct according to the question and golden answer, answer with {self.rubrik_section}.\nQuestion:\ {question}.\nGolden answer:\ {answer}\nGenerated answer:\ {prediction}"
output_options_pairwise:
'1': 1.
'2': 0.
'3': 0.5
prompt_pairwise:
system: f"You are a helpful assistant, that ranks models by the quality of their answers. Please act as an impartial judge. Do not allow the length of the responses to influence your evaluation. Be as objective as possible."
user: f"Here is a question, a ground truth answer, an AI-generated answer 1 and an AI-generated answer 2. Which answer is the most correct one ? Simply answer {{1}} if the first is better, {{2}} if the second is better and {{3}} if it's a tie. \n Question:\ {question}.\n Ground truth answer:\ {ref_answer}.\n Answer 1:\ {answer_1}.\n Answer 2:\ {answer_2}."
8 changes: 7 additions & 1 deletion config/evaluator/default_qa.yaml
@@ -6,5 +6,11 @@ output_options:
prompt:
system: f"You are an evaluation tool. Answer with one of {self.rubrik_section}."
user: f"Here is a question, a golden answer and an AI-generated answer. Can you judge whether the AI-generated answer is correct according to the question and golden answer, simply answer with one of {self.rubrik_section}.\n Question:\ {question}. \nGolden answer:\ {answer} \n Generated answer:\ {prediction}"
assistant: f"Response:\ {{"
user_without_system: f"You are an evaluation tool. Just answer by {self.rubrik_section}. Here is a question, a golden answer and an AI-generated answer. Judge whether the AI-generated answer is correct according to the question and golden answer, answer with {self.rubrik_section}.\nQuestion:\ {question}.\nGolden answer:\ {answer}\nGenerated answer:\ {prediction}"
output_options_pairwise:
'1': 1.
'2': 0.
'3': 0.5
prompt_pairwise:
system: f"You are a helpful assistant, that ranks models by the quality of their answers. Please act as an impartial judge. Do not allow the length of the responses to influence your evaluation. Be as objective as possible."
user: f"Here is a question, a ground truth answer, an AI-generated answer 1 and an AI-generated answer 2. Which answer is the most correct one ? Simply answer 1 if the first is better, 2 if the second is better and 3 if it's a tie. \n Question:\ {question}.\n Ground truth answer:\ {answer}.\n Answer 1:\ {prediction_1}.\n Answer 2:\ {prediction_2}."
22 changes: 18 additions & 4 deletions documentation/evaluations.md
@@ -14,7 +14,7 @@ Example files generated for split `dev` using `naver_splade-cocondenser-selfdist

Non-neural metrics will be calculated automatically. Neural metrics such as `BEM` and `LLM` need to be invoked separately.

By default `eval.py` will scan all folders in `experiments/` and evaluate them sequentially. To evaluate a single folder pass the folder using `--folder`. To avoid running out of memory either run `BEM` using `--bem` or run `LLM` using `--llm` . A csv file will automatically be saved to `results/` containing the table in `csv` format.
By default `evaluate.py` will scan all folders in `experiments/` and evaluate them sequentially. To evaluate a single folder, pass it using `--folder`. To avoid running out of memory, run either `BEM` using `--bem` or `LLM` using `--llm`. A CSV file containing the results table will automatically be saved to `results/`.

When using `--llm` you can choose how LLM predictions are transformed into the final score:
- directly check the generated answer for the expected label occurrence (Yes/No by default) and assign the corresponding score (1/0 by default); when no expected label is found, or when more than one expected label is matched, the sample is assigned a score of -100 and excluded from the mean score computation (see the sketch after the command below)
@@ -23,17 +23,17 @@ The choice of score interpretation is done via the `use_logits` parameter specified


```bash
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm
python3 evaluate.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm
```
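
To make the default, logit-free score interpretation described above concrete, here is a minimal sketch of the label-matching step; the function names, the word-boundary matching and the example strings are illustrative assumptions rather than the actual `evaluate.py` implementation.

```python
import re

# Default expected labels mapped to scores (Yes/No -> 1/0), as described above.
LABEL_SCORES = {"yes": 1.0, "no": 0.0}
INVALID = -100.0  # sentinel for samples where zero or several labels are matched

def score_judge_output(generated: str) -> float:
    """Map the judge LLM's generated answer to a score, or -100 if ambiguous."""
    text = generated.lower()
    matched = [label for label in LABEL_SCORES if re.search(rf"\b{label}\b", text)]
    if len(matched) != 1:  # no expected label found, or both labels present
        return INVALID
    return LABEL_SCORES[matched[0]]

def mean_valid_score(scores: list[float]) -> float:
    """Mean over samples, excluding those flagged with -100."""
    valid = [s for s in scores if s != INVALID]
    return sum(valid) / len(valid) if valid else float("nan")

scores = [score_judge_output(o) for o in ["Yes, it is correct.", "No.", "Unsure"]]
print(scores, mean_valid_score(scores))  # [1.0, 0.0, -100.0] 0.5
```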
Similarly to `--generator`, you can pass which LLM to use as the first argument of `--llm` (use the name of its configuration file), and a short name, used in the metric name, as the second argument.


```bash
# use llama2-7b-chat to run the evaluation; the output metric will be named VLLMeval_l2_7b
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm "vllm_llama-2-7b-chat" "l2_7b"
python3 evaluate.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm "vllm_llama-2-7b-chat" "l2_7b"

# use tinyllama to run the evaluation; the output metric will be named LLMeval_tinyllama
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm "tinyllama-chat" "tinyllama"
python3 evaluate.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm "tinyllama-chat" "tinyllama"

# in the default settings (no arguments specified) SOLAR-107B is used for evaluation and the output metric is named LLMeval
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm
@@ -53,3 +53,17 @@ If you have local ollama server running, you can call models installed on this s
python3 eval.py --experiments_folder experiments/ --llm_ollama "phi3:latest" --ollama_url "http://localhost:11434" --llm_prompt default_multi_qa
```

### Pairwise comparisons

Instead of computing an LLM eval score for a single run, you can compare the outputs of two runs using the same script and a few additional arguments, e.g.
```bash
python3 evaluate.py --llm --folder mistral_preds --opponent_folder llama_preds --opponent_name llama
```
where both `mistral_preds` and `llama_preds` are output folders of Bergen inference runs.
This script uses an LLM as judge (this can be any LLM supported in Bergen, or gpt-4o) to compare the two sets of predictions and computes win/tie/lose rates against the opponent. The results are stored in the metrics file of the folder. The prompt used is the pairwise prompt in `config/evaluator/default_qa.yaml`.

This approach does not use logits but rather the raw prediction of the judge LLM (win, tie, or lose).

In this setup, note that:
- a single experiment folder must be specified for both `--folder` and `--opponent_folder`
- the `--opponent_name` argument is required
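
As a rough sketch of how the per-sample pairwise scores could be aggregated into the win/tie/lose rates stored in the metrics file, consider the snippet below; the function name and the assumption that 1.0/0.5/0.0 mean win/tie/loss for the run passed via `--folder` are mine, not taken from the BERGEN code.

```python
from collections import Counter

def pairwise_rates(scores: list[float]) -> dict[str, float]:
    """Turn per-sample pairwise scores into win/tie/lose rates vs. the opponent.

    Assumes 1.0 = win, 0.5 = tie and 0.0 = loss for the run passed via --folder.
    """
    counts = Counter(scores)
    n = len(scores) or 1  # guard against division by zero on empty input
    return {
        "win_rate": counts[1.0] / n,
        "tie_rate": counts[0.5] / n,
        "lose_rate": counts[0.0] / n,
    }

print(pairwise_rates([1.0, 1.0, 0.5, 0.0]))
# {'win_rate': 0.5, 'tie_rate': 0.25, 'lose_rate': 0.25}
```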
204 changes: 0 additions & 204 deletions eval.py

This file was deleted.
