
Dataset librispeech_asr fails to load #4179

Closed
albertz opened this issue Apr 19, 2022 · 21 comments
Labels: bug (Something isn't working)

Comments

@albertz commented Apr 19, 2022

Describe the bug

The dataset librispeech_asr (standard Librispeech) fails to load.

Steps to reproduce the bug

datasets.load_dataset("librispeech_asr")

Expected results

It should download and prepare the whole dataset (all subsets).

The doc says it has two configurations (clean and other). However, it also says that not specifying a split should just load the whole dataset, which is what I want.

Also, in the case of this specific dataset, this is the standard the community uses: when you look at any publication with results on Librispeech, they always use the whole train dataset for training.

Actual results

...
  File "/home/az/.cache/huggingface/modules/datasets_modules/datasets/librispeech_asr/1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c/librispeech_asr.py", line 119, in LibrispeechASR._split_generators
    line: archive_path = dl_manager.download(_DL_URLS[self.config.name])
    locals:
      archive_path = <not found>
      dl_manager = <local> <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>
      dl_manager.download = <local> <bound method DownloadManager.download of <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>>
      _DL_URLS = <global> {'clean': {'dev': 'http://www.openslr.org/resources/12/dev-clean.tar.gz', 'test': 'http://www.openslr.org/resources/12/test-clean.tar.gz', 'train.100': 'http://www.openslr.org/resources/12/train-clean-100.tar.gz', 'train.360': 'http://www.openslr.org/resources/12/train-clean-360.tar.gz'}, 'other'...
      self = <local> <datasets_modules.datasets.librispeech_asr.1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c.librispeech_asr.LibrispeechASR object at 0x7fc12a633310>
      self.config = <local> BuilderConfig(name='default', version=0.0.0, data_dir='/home/az/i6/setups/2022-03-20--sis/work/i6_core/datasets/huggingface/DownloadAndPrepareHuggingFaceDatasetJob.TV6Nwm6dFReF/output/data_dir', data_files=None, description=None)
      self.config.name = <local> 'default', len = 7
KeyError: 'default'

Environment info

  • datasets version: 2.1.0
  • Platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
  • Python version: 3.9.9
  • PyArrow version: 6.0.1
  • Pandas version: 1.4.2
@albertz added the bug label Apr 19, 2022
@albertz (Author) commented Apr 19, 2022

@patrickvonplaten Hi! I saw that you prepared this? :)

@albertz (Author) commented Apr 19, 2022

Another thing, but maybe this should be a separate issue: As I see from the code, it would try to use up to 16 simultaneous downloads? This is problematic for Librispeech or anything on OpenSLR. On the homepage, it says:

If you want to download things from this site, please download them one at a time, and please don't use any fancy software-- just download things from your browser or use 'wget'. We have a firewall rule to drop connections from hosts with more than 5 simultaneous connections, and certain types of download software may activate this rule.

Related: tensorflow/datasets#3885

@patrickvonplaten (Contributor)

Hey @albertz,

Nice to see you here! It's been a while ;-)

@patrickvonplaten (Contributor)

Sorry, maybe the docs haven't been super clear here. By split we mean one of train.500, train.360, train.100, validation, test. For Librispeech, you'll have to specify a config (either other or clean) though:

datasets.load_dataset("librispeech_asr", "clean")

should work and give you all splits (being "train", "test", ...) for the clean config of the dataset.
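For illustration, a minimal sketch of what that call returns (the split names follow the clean config's download URLs; the comment below is indicative, not verbatim output):

from datasets import load_dataset

clean = load_dataset("librispeech_asr", "clean")
print(clean)
# a DatasetDict with the splits "train.100", "train.360", "validation" and "test"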

@patrickvonplaten (Contributor)

If you need both "clean" and "other", I think you'll have to concatenate them split by split, e.g. for the train data as follows:

from datasets import concatenate_datasets, load_dataset

# load_dataset without split= returns a DatasetDict, so load the individual
# train splits and concatenate those (here: the full 960h of training data)
other = load_dataset("librispeech_asr", "other", split="train.500")
clean = load_dataset("librispeech_asr", "clean", split="train.100+train.360")

librispeech = concatenate_datasets([other, clean])

See https://huggingface.co/docs/datasets/v2.1.0/en/process#concatenate

@patrickvonplaten (Contributor)

Downloading one split would be:

from datasets import load_dataset

other = load_dataset("librispeech_asr", "other", split="train.500")

@patrickvonplaten (Contributor)

cc @lhoestq FYI maybe the docs can be improved here

@albertz (Author) commented Apr 19, 2022

Ah, thanks. But wouldn't it be easier/nicer (and more canonical) to make it so that simply load_dataset("librispeech_asr") works?

@patrickvonplaten (Contributor)

Pinging @lhoestq here, think this could make sense! Not sure, however, what the dictionary would then look like.

@lhoestq (Member) commented Apr 19, 2022

Would it make sense to have clean as the default config?

Also I think load_dataset("librispeech_asr") should have raised an error telling you that you need to specify a config.

I also opened a PR to improve the doc: #4183

@albertz (Author) commented Apr 19, 2022

Would it make sense to have clean as the default config?

I think a user would expect that the default would give you the full dataset.

Also I think load_dataset("librispeech_asr") should have raised an error telling you that you need to specify a config.

It does raise an error, but this error confused me because I did not understand why I needed a config, or why I could not simply download the whole dataset, which is what people usually do with Librispeech.

@patrickvonplaten (Contributor)

+1 for @albertz. Also think lots of people download the whole dataset ("clean" + "other") for Librispeech.

Think there are also some people though who:

  • a) Don't have the memory to store the whole dataset
  • b) Just want to evaluate on one of the two configs

@lhoestq (Member) commented Apr 19, 2022

Ok! Adding the "all" configuration would do the job then, thanks! In the "all" configuration we can merge all the train.xxx splits into one "train" split, or keep them separate, depending on what's most practical to use (probably put everything in "train", no?)
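A sketch of what loading the proposed "all" configuration could then look like, assuming the train.xxx splits are merged into a single "train" split (hypothetical until the config actually lands):

from datasets import load_dataset

# hypothetical "all" config with a merged train split (proposed above)
librispeech = load_dataset("librispeech_asr", "all")
train = librispeech["train"]  # clean 100h + 360h plus other 500h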

@albertz (Author) commented Apr 19, 2022

I'm not too familiar with how to work with HuggingFace datasets, but people often do some curriculum learning scheme where they start with train.100, later move on to train.100 + train.360, and then later use the whole train set (960h). It would be good if this were easily possible.
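A minimal sketch of such a schedule with the existing configs, using the datasets split-combination syntax ("a+b" concatenates splits at load time):

from datasets import load_dataset

# stage 1: 100h; stage 2: 460h (both from the "clean" config)
stage1 = load_dataset("librispeech_asr", "clean", split="train.100")
stage2 = load_dataset("librispeech_asr", "clean", split="train.100+train.360")

The final 960h stage would additionally need train.500 from the "other" config, concatenated as shown earlier in the thread.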

@patrickvonplaten (Contributor)

Hey @albertz,

Opened a PR here: #4184. By adding the "subdataset" field to each split ("train", "dev", "other"), as shown in https://github.com/huggingface/datasets/pull/4184/files#r853272727, this should be easily possible (e.g. with the filter function: https://huggingface.co/docs/datasets/v2.1.0/en/package_reference/main_classes#datasets.Dataset.filter).
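A hedged sketch of that filtering, assuming the "all" config and a per-example "subdataset" column as discussed in the PR (both names may differ in the merged version):

from datasets import load_dataset

# hypothetical: "all" config and "subdataset" column from PR #4184
full = load_dataset("librispeech_asr", "all", split="train")
clean_only = full.filter(lambda example: example["subdataset"] == "clean")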

@patrickvonplaten (Contributor)

But also, since everything is cached, one could just do:

load_dataset("librispeech_asr", "clean", split="train.100")
load_dataset("librispeech_asr", "clean", split="train.100+train.360")
load_dataset("librispeech_asr", "all", split="train")

@SreyanG-NVIDIA

Hi @patrickvonplaten,

load_dataset("librispeech_asr", "clean", "train.100") actually downloads the whole dataset and not just the 100h split. Is this a bug?

@patrickvonplaten (Contributor)

Hmm, I don't really see how that's possible:

"train.100": _DL_URL + "train-clean-100.tar.gz",

Note that all datasets related to "clean" are downloaded, but only "train.100" should be used.

cc @lhoestq @albertvillanova @mariosasko: can we do anything to avoid downloading archives that are not related to the "split" the user actually needs? E.g. why should the split "train.360" be downloaded if the user executes the above command:

load_dataset("librispeech_asr", "clean", "train.100")

@mariosasko (Collaborator)

@patrickvonplaten This problem is a bit harder than it may seem, and it has to do with how our scripts are structured - _split_generators downloads data for a split before its definition. There was an attempt to fix this in #2249, but it wasn't flexible enough. Luckily, I have a plan of attack, and this issue is on our short-term roadmap, so I'll work on it soon.

In the meantime, one can use streaming or manually download a dataset script, remove unwanted splits and load a dataset via load_dataset.
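For the streaming route, a minimal sketch (streaming reads examples lazily from the remote archives instead of downloading and preparing everything up front):

from datasets import load_dataset

stream = load_dataset("librispeech_asr", "clean", split="train.100", streaming=True)
first = next(iter(stream))  # examples are fetched on demand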

@andrea-gasparini commented Jul 7, 2022

load_dataset("librispeech_asr", "clean", "train.100") actually downloads the whole dataset and not just the 100h split. Is this a bug?

Since this bug is still there and Google led me here when I was searching for a solution, I am writing down how to quickly work around it (as suggested by @mariosasko) for whoever else is not familiar with how the HF Hub works.

Download the librispeech_asr.py script and remove the unwanted splits both from the _DL_URLS dictionary and from the _split_generators function.
Here I made an example with only the test sets.
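For illustration, a sketch of the trimmed _DL_URLS for a test-only setup (URLs as in the original script; the matching SplitGenerator entries in _split_generators have to be removed as well):

_DL_URLS = {
    "clean": {
        "test": "http://www.openslr.org/resources/12/test-clean.tar.gz",
    },
    "other": {
        "test": "http://www.openslr.org/resources/12/test-other.tar.gz",
    },
}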

Then either save the script locally and load the dataset via

load_dataset("${local_path}/librispeech_asr.py")

or create a new dataset repo on the hub named "librispeech_asr" and upload the script there, then you can just run

load_dataset("${hugging_face_username}/librispeech_asr")

@lhoestq (Member) commented Jul 27, 2022

Fixed by #4184

@lhoestq closed this as completed Jul 27, 2022