
Dataset librispeech_asr fails to load #4179

Closed
albertz opened this issue Apr 19, 2022 · 21 comments
Labels: bug (Something isn't working)

Comments

@albertz commented Apr 19, 2022

Describe the bug

The dataset librispeech_asr (standard Librispeech) fails to load.

Steps to reproduce the bug

datasets.load_dataset("librispeech_asr")

Expected results

It should download and prepare the whole dataset (all subsets).

The doc says it has two configurations (clean and other). However, it also says that not specifying a split should just load the whole dataset, which is what I want.

Also, in the case of this specific dataset, this is the standard the community uses: when you look at any publication with results on Librispeech, they always use the whole train dataset for training.

Actual results

...
  File "/home/az/.cache/huggingface/modules/datasets_modules/datasets/librispeech_asr/1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c/librispeech_asr.py", line 119, in LibrispeechASR._split_generators
    line: archive_path = dl_manager.download(_DL_URLS[self.config.name])
    locals:
      archive_path = <not found>
      dl_manager = <local> <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>
      dl_manager.download = <local> <bound method DownloadManager.download of <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>>
      _DL_URLS = <global> {'clean': {'dev': 'http://www.openslr.org/resources/12/dev-clean.tar.gz', 'test': 'http://www.openslr.org/resources/12/test-clean.tar.gz', 'train.100': 'http://www.openslr.org/resources/12/train-clean-100.tar.gz', 'train.360': 'http://www.openslr.org/resources/12/train-clean-360.tar.gz'}, 'other'...
      self = <local> <datasets_modules.datasets.librispeech_asr.1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c.librispeech_asr.LibrispeechASR object at 0x7fc12a633310>
      self.config = <local> BuilderConfig(name='default', version=0.0.0, data_dir='/home/az/i6/setups/2022-03-20--sis/work/i6_core/datasets/huggingface/DownloadAndPrepareHuggingFaceDatasetJob.TV6Nwm6dFReF/output/data_dir', data_files=None, description=None)
      self.config.name = <local> 'default', len = 7
KeyError: 'default'

Environment info

  • datasets version: 2.1.0
  • Platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
  • Python version: 3.9.9
  • PyArrow version: 6.0.1
  • Pandas version: 1.4.2
@albertz added the bug label Apr 19, 2022
@albertz (Author) commented Apr 19, 2022

@patrickvonplaten Hi! I saw that you prepared this? :)

@albertz (Author) commented Apr 19, 2022

Another thing, but maybe this should be a separate issue: As I see from the code, it would try to use up to 16 simultaneous downloads? This is problematic for Librispeech or anything on OpenSLR. On the homepage, it says:

If you want to download things from this site, please download them one at a time, and please don't use any fancy software-- just download things from your browser or use 'wget'. We have a firewall rule to drop connections from hosts with more than 5 simultaneous connections, and certain types of download software may activate this rule.

Related: tensorflow/datasets#3885

@patrickvonplaten (Contributor)

Hey @albertz,

Nice to see you here! It's been a while ;-)

@patrickvonplaten (Contributor)

Sorry, maybe the docs haven't been super clear here. By split we mean one of train.500, train.360, train.100, validation, test. For Librispeech, you'll have to specify a config (either other or clean) though:

datasets.load_dataset("librispeech_asr", "clean")

should work and give you all splits (being "train", "test", ...) for the clean config of the dataset.
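For illustration, a minimal sketch of what that call returns (the split names follow the clean config's download URLs; the comment below is indicative, not verbatim output):

from datasets import load_dataset

clean = load_dataset("librispeech_asr", "clean")
print(clean)
# a DatasetDict with the splits "train.100", "train.360", "validation" and "test"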

@patrickvonplaten (Contributor)

If you need both "clean" and "other", I think you'll have to concatenate them split by split, e.g. for the train data as follows:

from datasets import concatenate_datasets, load_dataset

# load_dataset without split= returns a DatasetDict, so load the individual
# train splits and concatenate those (here: the full 960h of training data)
other = load_dataset("librispeech_asr", "other", split="train.500")
clean = load_dataset("librispeech_asr", "clean", split="train.100+train.360")

librispeech = concatenate_datasets([other, clean])

See https://huggingface.co/docs/datasets/v2.1.0/en/process#concatenate

@patrickvonplaten (Contributor)

Downloading one split would be:

from datasets import load_dataset

other = load_dataset("librispeech_asr", "other", split="train.500")

@patrickvonplaten (Contributor)

cc @lhoestq FYI maybe the docs can be improved here

@albertz (Author) commented Apr 19, 2022

Ah, thanks. But wouldn't it be easier/nicer (and more canonical) to make it so that simply load_dataset("librispeech_asr") works?

@patrickvonplaten (Contributor)

Pinging @lhoestq here, think this could make sense! Not sure, however, what the dictionary would then look like.

@lhoestq (Member) commented Apr 19, 2022

Would it make sense to have clean as the default config?

Also I think load_dataset("librispeech_asr") should have raised an error telling you that you need to specify a config.

I also opened a PR to improve the doc: #4183

@albertz (Author) commented Apr 19, 2022

Would it make sense to have clean as the default config?

I think a user would expect that the default would give you the full dataset.

Also I think load_dataset("librispeech_asr") should have raised an error telling you that you need to specify a config.

It does raise an error, but this error confused me because I did not understand why I needed a config, or why I could not simply download the whole dataset, which is what people usually do with Librispeech.

@patrickvonplaten (Contributor)

+1 for @albertz. Also think lots of people download the whole dataset ("clean" + "other") for Librispeech.

Think there are also some people though who:

  • a) Don't have the memory to store the whole dataset
  • b) Just want to evaluate on one of the two configs

@lhoestq (Member) commented Apr 19, 2022

Ok! Adding the "all" configuration would do the job then, thanks! In the "all" configuration we can merge all the train.xxx splits into one "train" split, or keep them separate, depending on what's most practical to use (probably put everything in "train", no?)
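A sketch of what loading the proposed "all" configuration could then look like, assuming the train.xxx splits are merged into a single "train" split (hypothetical until the config actually lands):

from datasets import load_dataset

# hypothetical "all" config with a merged train split (proposed above)
librispeech = load_dataset("librispeech_asr", "all")
train = librispeech["train"]  # clean 100h + 360h plus other 500h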

@albertz (Author) commented Apr 19, 2022

I'm not too familiar with how to work with HuggingFace datasets, but people often do some curriculum learning scheme where they start with train.100, later move on to train.100 + train.360, and then later use the whole train set (960h). It would be good if this were easily possible.
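A minimal sketch of such a schedule with the existing configs, using the datasets split-combination syntax ("a+b" concatenates splits at load time):

from datasets import load_dataset

# stage 1: 100h; stage 2: 460h (both from the "clean" config)
stage1 = load_dataset("librispeech_asr", "clean", split="train.100")
stage2 = load_dataset("librispeech_asr", "clean", split="train.100+train.360")

The final 960h stage would additionally need train.500 from the "other" config, concatenated as shown earlier in the thread.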

@patrickvonplaten (Contributor)

Hey @albertz,

Opened a PR here: #4184. By adding the "subdataset" field to each split ("train", "dev", "other"), as shown in https://github.com/huggingface/datasets/pull/4184/files#r853272727, this should be easily possible (e.g. with the filter function: https://huggingface.co/docs/datasets/v2.1.0/en/package_reference/main_classes#datasets.Dataset.filter).
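A hedged sketch of that filtering, assuming the "all" config and a per-example "subdataset" column as discussed in the PR (both names may differ in the merged version):

from datasets import load_dataset

# hypothetical: "all" config and "subdataset" column from PR #4184
full = load_dataset("librispeech_asr", "all", split="train")
clean_only = full.filter(lambda example: example["subdataset"] == "clean")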

@patrickvonplaten (Contributor)

But also, since everything is cached, one could just do:

load_dataset("librispeech_asr", "clean", split="train.100")
load_dataset("librispeech_asr", "clean", split="train.100+train.360")
load_dataset("librispeech_asr", "all", split="train")

@SreyanG-NVIDIA

Hi @patrickvonplaten,

load_dataset("librispeech_asr", "clean", "train.100") actually downloads the whole dataset and not just the 100h split. Is this a bug?

@patrickvonplaten (Contributor)

Hmm, I don't really see how that's possible:

"train.100": _DL_URL + "train-clean-100.tar.gz",

Note that all datasets related to "clean" are downloaded, but only "train.100" should be used.

cc @lhoestq @albertvillanova @mariosasko: can we do anything to avoid downloading archives that are not related to the "split" the user actually needs? E.g. why should the split "train.360" be downloaded if the user executes the above command:

load_dataset("librispeech_asr", "clean", "train.100")

@mariosasko (Collaborator)

@patrickvonplaten This problem is a bit harder than it may seem, and it has to do with how our scripts are structured - _split_generators downloads data for a split before its definition. There was an attempt to fix this in #2249, but it wasn't flexible enough. Luckily, I have a plan of attack, and this issue is on our short-term roadmap, so I'll work on it soon.

In the meantime, one can use streaming or manually download a dataset script, remove unwanted splits and load a dataset via load_dataset.
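For the streaming route, a minimal sketch (streaming reads examples lazily from the remote archives instead of downloading and preparing everything up front):

from datasets import load_dataset

stream = load_dataset("librispeech_asr", "clean", split="train.100", streaming=True)
first = next(iter(stream))  # examples are fetched on demand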

@andrea-gasparini commented Jul 7, 2022

load_dataset("librispeech_asr", "clean", "train.100") actually downloads the whole dataset and not just the 100h split. Is this a bug?

Since this bug is still there and Google led me here when I was searching for a solution, I am writing down how to quickly work around it (as suggested by @mariosasko) for whoever else is not familiar with how the HF Hub works.

Download the librispeech_asr.py script and remove the unwanted splits both from the _DL_URLS dictionary and from the _split_generators function.
Here I made an example with only the test sets.
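For illustration, a sketch of the trimmed _DL_URLS for a test-only setup (URLs as in the original script; the matching SplitGenerator entries in _split_generators have to be removed as well):

_DL_URLS = {
    "clean": {
        "test": "http://www.openslr.org/resources/12/test-clean.tar.gz",
    },
    "other": {
        "test": "http://www.openslr.org/resources/12/test-other.tar.gz",
    },
}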

Then either save the script locally and load the dataset via

load_dataset("${local_path}/librispeech_asr.py")

or create a new dataset repo on the hub named "librispeech_asr" and upload the script there, then you can just run

load_dataset("${hugging_face_username}/librispeech_asr")

@lhoestq (Member) commented Jul 27, 2022

Fixed by #4184

@lhoestq closed this as completed Jul 27, 2022