Dataset librispeech_asr fails to load #4179
@patrickvonplaten Hi! I saw that you prepared this? :)
Another thing, but maybe this should be a separate issue: as I see from the code, it would try to use up to 16 simultaneous downloads? This is problematic for Librispeech or anything on OpenSLR, whose homepage asks users to limit the number of simultaneous downloads.
Related: tensorflow/datasets#3885
Hey @albertz, nice to see you here! It's been a while ;-)
Sorry, maybe the docs haven't been super clear here. `datasets.load_dataset("librispeech_asr", "clean")` should work and give you all splits (being "train", "test", ...) for the clean config of the dataset.
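For example (a sketch; the split names follow the ones mentioned later in this thread):

```python
from datasets import load_dataset

# returns a DatasetDict keyed by split name
clean = load_dataset("librispeech_asr", "clean")
print(clean)  # splits such as "train.100", "train.360", "validation", "test"
```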
If you need both:

```python
from datasets import concatenate_datasets, load_dataset

# load_dataset without `split` returns a DatasetDict, so pick the
# training splits explicitly before concatenating
other = load_dataset("librispeech_asr", "other", split="train.500")
clean = load_dataset("librispeech_asr", "clean", split="train.100+train.360")
librispeech = concatenate_datasets([other, clean])
```

See https://huggingface.co/docs/datasets/v2.1.0/en/process#concatenate
Downloading one split would be:

```python
from datasets import load_dataset

other = load_dataset("librispeech_asr", "other", split="train.500")
```
cc @lhoestq FYI, maybe the docs can be improved here
Ah thanks. But wouldn't it be easier/nicer (and more canonical) to just make it in a way that simply `datasets.load_dataset("librispeech_asr")` downloads and prepares the whole dataset?
Pinging @lhoestq here, think this could make sense! Not sure however how the resulting dictionary would then look.
Would it make sense to have an "all" config that loads everything? Also, I think I opened a PR to improve the doc: #4183
I think a user would expect that the default would give you the full dataset.
It does raise an error, but this error confused me because I did not understand why I needed a config, or why I could not simply download the whole dataset, which is what people usually do with Librispeech.
+1 for @albertz. Also think lots of people download the whole dataset ("clean" + "other") for Librispeech. Think there are also some people though who are only interested in one of the configs or in the smaller training splits.
Ok! Adding the "all" configuration would do the job then, thanks! In the "all" configuration we can merge all the train.xxx splits into one "train" split, or keep them separate depending on what's most practical to use (probably put everything in "train", no?)
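From the user side, that could look like this (a sketch assuming the merged "train" split lands as proposed):

```python
from datasets import load_dataset

# assumes the proposed "all" config merges every train.xxx split
librispeech = load_dataset("librispeech_asr", "all")
train = librispeech["train"]  # the full 960h of training data
```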
I'm not too familiar with how to work with HuggingFace datasets, but people often do some curriculum learning scheme where they start with train.100, later move on to train.100 + train.360, and then later use the whole train set (960h). It would be good if this is easily possible.
Hey @albertz, opened a PR here. Think by adding the "subdataset" class to each split "train", "dev", "other" as shown here: https://github.com/huggingface/datasets/pull/4184/files#r853272727 it should be easily possible (e.g. with the filter function: https://huggingface.co/docs/datasets/v2.1.0/en/package_reference/main_classes#datasets.Dataset.filter).
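A minimal sketch of that filter-based approach (the column name "subdataset" and the merged "train" split are assumptions taken from the PR discussion, not confirmed API):

```python
from datasets import load_dataset

# assumes the "all" config tags each example with the subset it came
# from via a "subdataset" column, as floated in PR #4184
ds = load_dataset("librispeech_asr", "all", split="train")
train_100 = ds.filter(lambda example: example["subdataset"] == "train.100")
```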
But also, since everything is cached, one could also just do:

```python
load_dataset("librispeech_asr", "clean", split="train.100")
load_dataset("librispeech_asr", "clean", split="train.100+train.360")
load_dataset("librispeech_asr", "all", split="train")
```
Hi @patrickvonplaten, `load_dataset("librispeech_asr", "clean", "train.100")` actually downloads the whole dataset and not the 100h split. Is this a bug?
Hmm, I don't really see how that's possible. Note that all download links related to the "clean" config are fetched, though, not the whole dataset.

cc @lhoestq @albertvillanova @mariosasko: can we do anything against downloading dataset links that are not related to the "split" that one actually needs? E.g. why should the "train.360" archive be downloaded when one asks for:

```python
load_dataset("librispeech_asr", "clean", "train.100")
```
@patrickvonplaten This problem is a bit harder than it may seem, and it has to do with how our scripts are structured. In the meantime, one can use streaming, or manually download a dataset script, remove the unwanted splits, and load the dataset via the path to the modified script.
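The streaming route, for instance, avoids downloading the archives up front (a sketch using the split name from earlier in the thread):

```python
from datasets import load_dataset

# streaming=True fetches examples on the fly instead of downloading
# and extracting the full archives first
ds = load_dataset("librispeech_asr", "clean", split="train.100", streaming=True)
first = next(iter(ds))
```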
Since this bug is still there and Google led me here when I was searching for a solution, I am writing down how to quickly fix it (as suggested by @mariosasko) for whoever else is not familiar with how the HF Hub works.

Download the librispeech_asr.py script and remove the unwanted splits, both from the download-URL definitions and from the `_split_generators` method. Then either save the script locally and load the dataset via

```python
load_dataset("${local_path}/librispeech_asr.py")
```

or create a new dataset repo on the Hub named "librispeech_asr" and upload the script there; then you can just run

```python
load_dataset("${hugging_face_username}/librispeech_asr")
```
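One way to grab the script programmatically (a sketch; fetching the file through the Hub web UI works just as well):

```python
from huggingface_hub import hf_hub_download

# fetch the dataset script so it can be edited locally
script_path = hf_hub_download(
    repo_id="librispeech_asr",
    filename="librispeech_asr.py",
    repo_type="dataset",
)
print(script_path)  # edit this file, then point load_dataset at it
```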
Fixed by #4184
Describe the bug
The dataset librispeech_asr (standard Librispeech) fails to load.
Steps to reproduce the bug
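A minimal reproduction, calling `load_dataset` without a config name:

```python
import datasets

# fails instead of downloading and preparing the whole dataset
datasets.load_dataset("librispeech_asr")
```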
Expected results
It should download and prepare the whole dataset (all subsets).
In the doc, it says it has two configurations (clean and other).
However, the dataset doc says that not specifying `split` should just load the whole dataset, which is what I want. Also, in the case of this specific dataset, this is the standard that the community uses: when you look at any publication with results on Librispeech, they always use the whole train dataset for training.
Actual results

A ValueError is raised, asking to pick one of the available configs ("clean" or "other") instead of downloading and preparing the whole dataset.
Environment info

datasets version: 2.1.0