Unable to download kilt_wikipedia dataset #35

Open
bhattg opened this issue Nov 26, 2024 · 5 comments
Comments

@bhattg

bhattg commented Nov 26, 2024

Hi!

I've been trying to download the kilt_wikipedia dataset, but every time I try, the download gets stuck at exactly 3.39G on the progress bar. Any suggestions?

Traceback (most recent call last):                                                                                                                                                                                                     
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 344, in _wait                                                                                                                           
   await waiter                                                                                                                                                                                                                      
asyncio.exceptions.CancelledError                                                                                                                                                                                                     
                                                                                                                                                                                                                                     
The above exception was the direct cause of the following exception:                                                                                                                                                                  
                                                                                                                                                                                                                                     
Traceback (most recent call last):                                                                                                                                                                                                    
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner                                                                                                                              
   result[0] = await coro                                                                                                                                                                                                            
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/implementations/http.py", line 262, in _get_file                                                                                                           
   chunk = await r.content.read(chunk_size)                                                                                                                                                                                          
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 425, in read                                                                                                                            
   await self._wait("read")                                                                                                                                                                                                          
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 343, in _wait                                                                                                                           
   with self._timer:                                                                                                                                                                                                                 
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/helpers.py", line 671, in __exit__                                                                                                                        
   raise asyncio.TimeoutError from exc_val                                                                                                                                                                                           
asyncio.exceptions.TimeoutError                                                                                                                                                                                                       
                                                                                                                                                                                                                                     
The above exception was the direct cause of the following exception:                                                                                                                                                                  
                                                                                                                                                                                                                                     
Traceback (most recent call last):                                                                                                                                                                                                    
 File "<stdin>", line 1, in <module>                                                                                                                                                                                                 
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/load.py", line 2154, in load_dataset                                                                                                                     
   builder_instance.download_and_prepare(                                                                                                                                                                                            
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare                                                                                                           
   self._download_and_prepare(                                                                                                                                                                                                       
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 1648, in _download_and_prepare                                                                                                         
   super()._download_and_prepare(                                                                                                                                                                                                    
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 978, in _download_and_prepare                                                                                                          
   split_generators = self._split_generators(dl_manager, **split_generators_kwargs)                               
 File "/home/gbhatt2/.cache/huggingface/modules/datasets_modules/datasets/kilt_wikipedia/2538d1b7191d2e7570a1e928e50d7d7751d24f2b2292f0e91ee566af5ebf0183/kilt_wikipedia.py", line 129, in _split_generators
   downloaded_path = dl_manager.download_and_extract(
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 326, in download_and_extract                                                                                         
   return self.extract(self.download(url_or_urls))
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 159, in download                                                                                                     
   downloaded_path_or_paths = map_nested(
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 484, in map_nested                                                                                                              
   mapped = function(data_struct)
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 219, in _download_batched                                                                                            
   return [
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 220, in <listcomp>                                                                                                   
   self._download_single(url_or_filename, download_config=download_config)                                        
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 229, in _download_single                                                                                             
   out = cached_path(url_or_filename, download_config=download_config)                                            
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 205, in cached_path                                                                                                           
   output_path = get_from_cache(
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 411, in get_from_cache                                                                                                        
   fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm)     
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 330, in fsspec_get                                                                                                            
   fs.get_file(path, temp_file.name, callback=callback)
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper          
   return sync(self.loop, func, *args, **kwargs)
 File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 101, in sync             
   raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError

I am using transformers v4.46.3 and torch v2.3.0+cu121.

@DRRV
Contributor

DRRV commented Nov 26, 2024

Hi,
Can you try the plain HF datasets command on its own? (I don't see any lines from Bergen in the traceback.)

import datasets

dataset = datasets.load_dataset('kilt_wikipedia')

@bhattg
Author

bhattg commented Nov 26, 2024

Thank you for the fast response. The issue still persists when loading with the HF datasets command. Here is the full stack trace:

python3 bergen.py  retriever=splade-v3 reranker=debertav3 dataset=popqa  generator=vllm_SOLAR-107B                                                                                                   
                                                                                                                                                                                                                                      
[2024-11-25 07:42:40,480][datasets][INFO] - PyTorch version 2.3.0 available.                                                                                                                                                          
[2024-11-25 07:42:40,481][datasets][INFO] - TensorFlow version 2.8.0 available.                                                                                                                                                       
Unfinished experiment_folder: experiments/tmp_460be916ccb7601b                                                                                                                                                                        
experiment_folder experiments/460be916ccb7601b                                                                                                                                                                                        
run_name: null                                                                                                                                                                                                                        
dataset_folder: datasets/                                                                                                                                                                                                             
index_folder: indexes/                                                                                                                                                                                                                
runs_folder: runs/                                                                                                                                                                                                                    
generated_query_folder: generated_queries/                                                                                                                                                                                            
processed_context_folder: processed_contexts/                                                                                                                                                                                         
experiments_folder: experiments/                                                                                                                                                                                                      
retrieve_top_k: 50                                                                                                                                                                                                                    
rerank_top_k: 50                                                                                                                                                                                                                      
generation_top_k: 5                                                                                                                                                                                                                   
pyserini_num_threads: 20                                                                                                                                                                                                              
processing_num_proc: 40                                                                                                                                                                                                               
retriever:                                                                                                                                                                                                                            
  init_args:                                                                                                                                                                                                                          
    _target_: models.retrievers.splade.Splade                                                                                                                                                                                         
    model_name: naver/splade-v3                                                                                                                                                                                                       
    max_len: 128                                                                                                                                                                                                                      
  batch_size: 64                                                                                                                                                                                                                      
  batch_size_sim: 512                                                                                                                                                                                                                 
reranker:                                                                                                                                                                                                                             
  init_args:                                                                                                                                                                                                                          
    _target_: models.rerankers.crossencoder.CrossEncoder                                                                                                                                                                              
    model_name: naver/trecdl22-crossencoder-debertav3                                                                                                                                                                                 
    max_len: 256                                                                                                                                                                                                                      
  batch_size: 64                                                                                                                                                                                                                      
generator:                                                                                                                                                                                                                            
  init_args:                                                                                                                                                                                                                          
    _target_: models.generators.vllm.VLLM       
    model_name: Upstage/SOLAR-10.7B-Instruct-v1.0
    max_new_tokens: 128
    max_length: 4096
    batch_size: 256
dataset:
  train:
    doc: null
    query: null
  dev:
    doc:
      init_args:
        _target_: modules.dataset_processor.KILT100w
        split: full
    query:
      init_args:
        _target_: modules.processors.qa_dataset_processor.POPQA
        split: test
  test:
    doc: null
    query: null
prompt:
  system: You are a helpful assistant. Your task is to extract relevant information
    from provided documents and to answer to questions as briefly as possible.
  user: f"Background:\n{docs}\n\nQuestion:\ {question}"
  system_without_docs: You are a helpful assistant. Answer the questions as briefly
    as possible.
  user_without_docs: f"Question:\ {question}"

Processing dataset kilt-100w in full split 
Downloading data:   9%|███████████████▌ | 3.38G/37.3G [05:00<49:22, 11.5MB/s]
Error executing job with overrides: ['retriever=splade-v3', 'reranker=debertav3', 'dataset=popqa', 'generator=vllm_SOLAR-107B']
Traceback (most recent call last):
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 344, in _wait
    await waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/implementations/http.py", line 262, in _get_file
    chunk = await r.content.read(chunk_size)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 425, in read
    await self._wait("read")
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 343, in _wait
    with self._timer:
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/helpers.py", line 671, in __exit__
    raise asyncio.TimeoutError from exc_val
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gbhatt2/bergen/bergen.py", line 24, in main 
    rag = RAG(**config, config=config)
  File "/home/gbhatt2/bergen/modules/rag.py", line 159, in __init__
    self.datasets = ProcessDatasets.process(
  File "/home/gbhatt2/bergen/modules/dataset_processor.py", line 624, in process
    dataset = processor.get_dataset()
  File "/home/gbhatt2/bergen/modules/dataset_processor.py", line 92, in get_dataset
    dataset = self.process()
  File "/home/gbhatt2/bergen/modules/dataset_processor.py", line 282, in process
    dataset = datasets.load_dataset(hf_name, num_proc=self.num_proc)[self.split]
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/load.py", line 2154, in load_dataset
    builder_instance.download_and_prepare(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 1648, in _download_and_prepare
    super()._download_and_prepare(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 978, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/data/gbhatt2/HF_HOME/modules/datasets_modules/datasets/kilt_wikipedia/2538d1b7191d2e7570a1e928e50d7d7751d24f2b2292f0e91ee566af5ebf0183/kilt_wikipedia.py", line 129, in _split_generators
    downloaded_path = dl_manager.download_and_extract(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 326, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 159, in download
    downloaded_path_or_paths = map_nested(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 484, in map_nested
    mapped = function(data_struct)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 219, in _download_batched
    return [
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 220, in <listcomp>
    self._download_single(url_or_filename, download_config=download_config)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 229, in _download_single
    out = cached_path(url_or_filename, download_config=download_config)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 205, in cached_path
    output_path = get_from_cache(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 411, in get_from_cache
    fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 330, in fsspec_get
    fs.get_file(path, temp_file.name, callback=callback)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 101, in sync
    raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
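
A note on the numbers in the log above: the progress bar stops at 05:00 elapsed, and aiohttp's default client timeout is a total of 300 seconds per request. That would explain why the download dies at the same byte count on every attempt, although nothing in this thread confirms it as the root cause. The sketch below checks the arithmetic and shows one possible way to raise the timeout. The helper name `storage_options_with_timeout` is hypothetical, and it assumes `load_dataset` forwards `storage_options` to fsspec's HTTP filesystem, whose `client_kwargs` are passed to `aiohttp.ClientSession`.

```python
# Figures read off the progress bar in the log above.
downloaded_gb = 3.38      # "3.38G" reached when the error fired
elapsed_s = 5 * 60        # "[05:00" elapsed
reported_mb_per_s = 11.5  # "11.5MB/s" (a smoothed rate, so only approximate)

# Average rate implied by size / time, in MB/s (decimal units assumed).
implied_mb_per_s = downloaded_gb * 1000 / elapsed_s

# aiohttp's default ClientTimeout uses total=300 seconds.
AIOHTTP_DEFAULT_TOTAL_S = 300

print(f"implied rate: {implied_mb_per_s:.1f} MB/s (bar reported {reported_mb_per_s})")
print(f"elapsed matches aiohttp default timeout: {elapsed_s == AIOHTTP_DEFAULT_TOTAL_S}")


def storage_options_with_timeout(total_seconds):
    """Build fsspec HTTP options with a longer aiohttp total timeout.

    Assumption: these options reach fsspec's HTTPFileSystem, which hands
    client_kwargs to aiohttp.ClientSession.
    """
    import aiohttp  # imported lazily; only needed when actually downloading

    return {"client_kwargs": {"timeout": aiohttp.ClientTimeout(total=total_seconds)}}


# Hypothetical usage (not verified against this issue):
# import datasets
# ds = datasets.load_dataset(
#     "kilt_wikipedia", storage_options=storage_options_with_timeout(3600)
# )
```

If the timeout is indeed the cause, a one-hour budget should cover the 37.3G file at the observed rate (about 55 minutes), though a slower connection would need more.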

I can download the .json file independently by performing

wget http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json
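
For anyone who wants to sanity-check the manually downloaded file, here is a minimal sketch that streams it without loading all of it into memory. It assumes the file is in JSON Lines format (one JSON object per line) and that each record has a `wikipedia_title` field; both are assumptions about the KILT knowledge source, not something shown in this thread. The function names are hypothetical, and this does not make Bergen or `datasets` use the local copy.

```python
import json


def iter_kilt_pages(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def peek_titles(path, limit=3):
    """Return the first `limit` titles without reading the whole file.

    Assumption: each record carries a "wikipedia_title" field; records
    without it contribute None.
    """
    titles = []
    for page in iter_kilt_pages(path):
        titles.append(page.get("wikipedia_title"))
        if len(titles) >= limit:
            break
    return titles


# Example: peek_titles("kilt_knowledgesource.json")
```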

Can you tell me which version of the datasets library you are using? I'd like to see whether I can reproduce this with exactly that version. My datasets version is v3.1.0.

@DRRV
Contributor

DRRV commented Nov 26, 2024

I was facing the same issue as you, but now I can download the dataset. Could it be a network issue? Can you try again?

@sclincha
Contributor

Tried today with datasets version '2.19.1' and it worked. @bhattg, any update?

@bhattg
Author

bhattg commented Nov 26, 2024

Thanks for the fast responses. With 3.1.0 I am still unable to download the dataset; it gets stuck at exactly the same spot. I switched to 2.19.1 as recommended, and it no longer gets stuck there. I am still downloading the full dataset and will post here if there are any issues.

Edit:

I was able to get it working with 2.19.1, but it still throws an error with the latest version, 3.1.0.
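
Since the outcome here depends on which `datasets` release is installed (2.19.1 worked, 3.1.0 did not), a small sketch for reporting the relevant versions may help when comparing environments. The helper names are hypothetical. `fsspec` and `aiohttp` are listed only because they appear in the traceback; whether their versions matter is not established in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_version(version_string):
    """Extract the leading integer of a version string such as '3.1.0'."""
    return int(version_string.split(".")[0])


# Print the versions of the packages seen in the traceback.
for name in ("datasets", "fsspec", "aiohttp"):
    print(f"{name}: {installed_version(name)}")
```

To reproduce the working setup reported above, pinning the library with `pip install datasets==2.19.1` is the straightforward route.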
