
pytrec_eval unable to perform evaluation #37

Open

bhattg opened this issue Dec 19, 2024 · 5 comments

Comments


bhattg commented Dec 19, 2024

Hi!

I am trying to run the framework and wondering whether anyone else has encountered this issue. I launched the following command, and everything works fine until the evaluation phase.

CUDA_VISIBLE_DEVICES=0 HYDRA_FULL_ERROR=1 python3 bergen.py retriever=splade-v3 reranker=debertav3 dataset=kilt_hotpotqa

The following is the error stack trace:

Evaluating retrieval...
Error executing job with overrides: ['retriever=splade-v3', 'reranker=debertav3', 'dataset=kilt_hotpotqa']
Traceback (most recent call last):
  File "/home/gbhatt2/bergen/bergen.py", line 37, in <module>
    main()
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/gbhatt2/bergen/bergen.py", line 31, in main
    rag.eval(dataset_split=dataset_split)
  File "/home/gbhatt2/bergen/modules/rag.py", line 223, in eval
    query_ids, doc_ids, _ = self.rerank(
  File "/home/gbhatt2/bergen/modules/rag.py", line 374, in rerank
    eval_retrieval_kilt(
  File "/home/gbhatt2/bergen/utils.py", line 288, in eval_retrieval_kilt
    metrics_out = evaluator.evaluate(run)
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/pytrec_eval/__init__.py", line 64, in evaluate
    return super().evaluate(scores)
TypeError: Unable to extract query/object scores.

Could the issue be caused by an unfinished experiment folder, i.e. /data/gbhatt2/RAG_index_experiment/experiments/tmp_e2ca04a13ada8612?
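
For context on the error itself (not from the original log): pytrec_eval's evaluate() expects the run to be a plain dict mapping query IDs to dicts of document IDs to native Python numeric scores. In my experience this TypeError typically appears when a query's entry is not such a dict, or when the scores are non-native types such as numpy.float32. A minimal sketch of the expected shapes (the IDs and scores below are made up):

import pytrec_eval

# Qrels: query id -> {doc id: graded relevance (int)}
qrel = {"q1": {"d1": 1, "d2": 0}}

# Run: query id -> {doc id: retrieval score}. Scores should be native
# Python floats/ints; numpy scalar types, for example, can trigger
# "TypeError: Unable to extract query/object scores."
run = {"q1": {"d1": 12.3, "d2": 7.1}}

evaluator = pytrec_eval.RelevanceEvaluator(qrel, {"map", "ndcg"})
print(evaluator.evaluate(run))  # {'q1': {'map': ..., 'ndcg': ...}}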

The following is the full log:

experiment_folder /data/gbhatt2/RAG_index_experiment/experiments/e2ca04a13ada8612
run_name: null
dataset_folder: /data/gbhatt2/RAG_dataset/datasets/
index_folder: /data/gbhatt2/RAG_index_experiment/indexes/
runs_folder: /data/gbhatt2/RAG_index_experiment/runs/
generated_query_folder: /data/gbhatt2/RAG_index_experiment/generated_queries/
processed_context_folder: /data/gbhatt2/RAG_index_experiment/processed_contexts/
experiments_folder: /data/gbhatt2/RAG_index_experiment/experiments/
retrieve_top_k: 50
rerank_top_k: 50
generation_top_k: 5
pyserini_num_threads: 20
processing_num_proc: 40
retriever:
  init_args:
    _target_: models.retrievers.splade.Splade
    model_name: naver/splade-v3
    max_len: 128
  batch_size: 1024
  batch_size_sim: 512
reranker:
  init_args:
    _target_: models.rerankers.crossencoder.CrossEncoder
    model_name: naver/trecdl22-crossencoder-debertav3
    max_len: 256
  batch_size: 64
dataset:
  train:
    doc: null
    query: null
  dev:
    doc:
      init_args:
        _target_: modules.dataset_processor.KILT100w
        split: full
    query:
      init_args:
        _target_: modules.processors.kilt_dataset_processor.KILTHotpotqa
        split: validation
  test:
    doc: null
    query: null
prompt:
  system: You are a helpful assistant. Your task is to extract relevant information
    from provided documents and to answer to questions as briefly as possible.
  user: f"Background:\n{docs}\n\nQuestion:\ {question}"
  system_without_docs: You are a helpful assistant. Answer the questions as briefly
    as possible.
  user_without_docs: f"Question:\ {question}"
Processing dataset kilt-100w in full split
Loading dataset from disk: 100%|██████████| 30/30 [02:36<00:00,  5.20s/it]
Processing dataset kilt_hotpotqa in validation split
README.md: 100%|██████████| 35.5k/35.5k [00:00<00:00, 64.0MB/s]
validation-00000-of-00001.parquet: 100%|██████████| 1.05M/1.05M [00:00<00:00, 6.26MB/s]
test-00000-of-00001.parquet: 100%|██████████| 517k/517k [00:00<00:00, 6.36MB/s]
train-00000-of-00001.parquet: 100%|██████████| 16.3M/16.3M [00:01<00:00, 10.6MB/s]
[2024-12-19 22:32:57,938][datasets.builder][WARNING] - Setting num_proc from 40 back to 1 for the train split to disable multiprocessing as it only contains one shard.
Generating train split: 100%|██████████| 88869/88869 [00:00<00:00, 191266.11 examples/s]
[2024-12-19 22:32:58,405][datasets.builder][WARNING] - Setting num_proc from 40 back to 1 for the validation split to disable multiprocessing as it only contains one shard.
Generating validation split: 100%|██████████| 5600/5600 [00:00<00:00, 179557.55 examples/s]
[2024-12-19 22:32:58,437][datasets.builder][WARNING] - Setting num_proc from 40 back to 1 for the test split to disable multiprocessing as it only contains one shard.
Generating test split: 100%|██████████| 5569/5569 [00:00<00:00, 301087.65 examples/s]
Map: 100%|██████████| 5600/5600 [00:01<00:00, 4671.06 examples/s]
Map: 100%|██████████| 5600/5600 [00:01<00:00, 4988.45 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 5600/5600 [00:00<00:00, 786344.24 examples/s]
Checking dataset..: 100%|██████████| 5600/5600 [00:00<00:00, 21021.90it/s]
/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py:561: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
  warnings.warn(


::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
RAG Model:
Retriever: naver/splade-v3
Reranker: naver/trecdl22-crossencoder-debertav3
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::



Run /data/gbhatt2/RAG_index_experiment/runs//run.retrieve.top_50.kilt_hotpotqa.kilt-100w.dev.naver_splade-v3.trec does not exists, running retrieve...
Encoding: naver/splade-v3: 100%|██████████| 6/6 [00:07<00:00,  1.21s/it]
/home/gbhatt2/bergen/utils.py:55: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  emb_chunk = torch.load(emb_file)
Load embeddings...: 100%|██████████| 1/1 [00:00<00:00, 16.14it/s]
/home/gbhatt2/bergen/modules/retrieve.py:89: FutureWarning: [same torch.load weights_only=False warning as above]
  emb_chunk = torch.load(emb_file)
Load embeddings and retrieve...: 100%|██████████| 167/167 [08:49<00:00,  3.17s/it]
Retrieving docs...: 100%|██████████| 11/11 [13:48<00:00, 75.36s/it]
Evaluating retrieval...
Getting wiki ids...: 100%|██████████| 5600/5600 [00:07<00:00, 732.34it/s]
Fetching data from dataset...: 100%|██████████| 5600/5600 [00:10<00:00, 542.31it/s]
Generating train split: 280000 examples [00:10, 27113.11 examples/s]
Reranking: naver/trecdl22-crossencoder-debertav3: 100%|██████████| 4375/4375 [18:54<00:00,  3.86it/s]
DRRV (Contributor) commented Dec 20, 2024

Hi!

As long as the experiment is not finished, everything is stored in the tmp folder.

Can you check the folder '/data/gbhatt2/RAG_index_experiment/runs/'? You should have the retrieval output (a TREC run file) corresponding to the hotpotqa dataset.
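
A standard TREC run file has one whitespace-separated line per query/document pair: qid, "Q0", docid, rank, score, tag. If the file is there, a quick sketch like the following can check that it parses into the {qid: {docid: float}} shape pytrec_eval expects; the path comes from the log above, and the exact column layout of BERGEN's run files is an assumption:

from collections import defaultdict

def load_trec_run(path):
    # Parse a TREC run file into {qid: {docid: float score}}.
    run = defaultdict(dict)
    with open(path) as f:
        for line in f:
            qid, _q0, docid, _rank, score, _tag = line.split()
            run[qid][docid] = float(score)
    return dict(run)

run = load_trec_run("/data/gbhatt2/RAG_index_experiment/runs/"
                    "run.retrieve.top_50.kilt_hotpotqa.kilt-100w.dev.naver_splade-v3.trec")
print(len(run), "queries,", sum(len(d) for d in run.values()), "scored documents")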

bhattg (Author) commented Dec 20, 2024

Hi!

What led to the following issue, however?

Evaluating retrieval...
Error executing job with overrides: ['retriever=splade-v3', 'reranker=debertav3', 'dataset=kilt_hotpotqa']
Traceback (most recent call last):
  File "/home/gbhatt2/bergen/bergen.py", line 37, in <module>
    main()
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/gbhatt2/bergen/bergen.py", line 31, in main
    rag.eval(dataset_split=dataset_split)
  File "/home/gbhatt2/bergen/modules/rag.py", line 223, in eval
    query_ids, doc_ids, _ = self.rerank(
  File "/home/gbhatt2/bergen/modules/rag.py", line 374, in rerank
    eval_retrieval_kilt(
  File "/home/gbhatt2/bergen/utils.py", line 288, in eval_retrieval_kilt
    metrics_out = evaluator.evaluate(run)
  File "/data/gbhatt2/.conda/envs/rag_latest/lib/python3.10/site-packages/pytrec_eval/__init__.py", line 64, in evaluate
    return super().evaluate(scores)
TypeError: Unable to extract query/object scores.

This resulted from the following command:

CUDA_VISIBLE_DEVICES=0 HYDRA_FULL_ERROR=1 python3 bergen.py  retriever=splade-v3 reranker=debertav3 dataset=kilt_hotpotqa

DRRV (Contributor) commented Dec 23, 2024

Can you just try:

python3 bergen.py retriever=splade-v3 dataset=kilt_hotpotqa

This skips the reranker stage, and it seems that the error comes from the reranker results (see in your log: File "/home/gbhatt2/bergen/modules/rag.py", line 374, in rerank).

bhattg (Author) commented Dec 25, 2024

That works, but it is strange that whenever I start any new run with the reranker, it crashes with the error above. However, it runs when I repeat the same command (at which point only reranking and evaluation remain). Did you face similar issues?

DRRV (Contributor) commented Dec 26, 2024

Can you delete (or move) the runs corresponding to the reranker and regenerate them? (If a run matching the current config already exists, it is reused.)

HotpotQA has a lot of queries (if I remember correctly), so you may want to use a smaller dataset for testing.
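
A hedged sketch of the "delete or move" step, assuming the reranker runs live next to the retrieval run shown in the log and contain "rerank" in their filenames (the glob pattern is a guess; check the actual filenames in the runs folder first):

import glob
import os
import shutil

runs_folder = "/data/gbhatt2/RAG_index_experiment/runs/"
backup = os.path.join(runs_folder, "stale_reranker_runs")
os.makedirs(backup, exist_ok=True)

# Hypothetical pattern; adjust it to match the actual run filenames.
for path in glob.glob(os.path.join(runs_folder, "*rerank*kilt_hotpotqa*")):
    shutil.move(path, backup)
    print("moved:", path)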
