How to get the original dataset name with username? #7311

Open
npuichigo opened this issue Dec 8, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

Contributor

npuichigo commented Dec 8, 2024

Feature request

This issue is related to Ray Data (ray-project/ray#49008), which needs to check whether a dataset is the original one, exactly as returned by load_dataset, and whether its Parquet files are already available on the HF Hub.

The current workaround is to read the dataset name, config, and split, call load_dataset again, and compare fingerprints. However, the recovered dataset name is wrong when the repository id includes a username, because the username prefix is missing. Is there a way to get the dataset name with its username prefix, or another way to check whether a dataset is the original one with Parquet files available?
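
For illustration, a minimal sketch of the workaround described above. The helper name `is_original_dataset` and the injected `load_fn` parameter are hypothetical; `load_fn` is assumed to behave like `datasets.load_dataset`, and the attributes read from the dataset (`info.dataset_name`, `info.config_name`, `split`, and the private `_fingerprint`) are assumed to be the ones the workaround relies on.

```python
from types import SimpleNamespace
from typing import Any, Callable


def is_original_dataset(ds: Any, load_fn: Callable[..., Any]) -> bool:
    """Reload a dataset by name/config/split and compare fingerprints.

    Hypothetical helper: `load_fn` is assumed to behave like
    `datasets.load_dataset`, and `ds` like a `datasets.Dataset`.
    """
    # As reported in this issue, this name has no "username/" prefix,
    # so it cannot identify a repository such as "some-user/some-dataset".
    name = ds.info.dataset_name
    config = ds.info.config_name
    split = str(ds.split)
    try:
        reloaded = load_fn(name, config, split=split)
    except Exception:
        # The bare name may not resolve to any dataset at all.
        return False
    # `_fingerprint` is a private attribute; relying on it is an assumption.
    return reloaded._fingerprint == ds._fingerprint
```

With a dataset that lives under a user namespace, the reload step is where this breaks: the bare name either fails to resolve or resolves to a different dataset, so the fingerprint comparison cannot succeed.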

@lhoestq

Motivation

ray-project/ray#49008

Your contribution

Would like to fix that.

npuichigo added the enhancement (New feature or request) label Dec 8, 2024
Member

lhoestq commented Jan 9, 2025

Hi! Why not pass the dataset id to Ray and let it check the Parquet files? Or pass the list of Parquet files directly?
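
A rough sketch of what checking the Parquet files from a dataset id could look like, using only the standard library. It assumes the dataset viewer's `/parquet` endpoint and a response containing a `parquet_files` list whose entries have `config`, `split`, and `url` fields; the function names are hypothetical and not part of any library.

```python
import json
from typing import Any, Dict, List, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Hugging Face dataset viewer API.
PARQUET_API = "https://datasets-server.huggingface.co/parquet"


def parquet_api_url(dataset_id: str) -> str:
    """Build the query URL for a full dataset id such as 'user/name'."""
    return f"{PARQUET_API}?{urlencode({'dataset': dataset_id})}"


def select_parquet_urls(
    payload: Dict[str, Any],
    config: Optional[str] = None,
    split: Optional[str] = None,
) -> List[str]:
    """Pick Parquet file URLs out of a response, optionally filtered."""
    return [
        entry["url"]
        for entry in payload.get("parquet_files", [])
        if (config is None or entry.get("config") == config)
        and (split is None or entry.get("split") == split)
    ]


def list_parquet_urls(
    dataset_id: str,
    config: Optional[str] = None,
    split: Optional[str] = None,
) -> List[str]:
    """Fetch the Parquet file list for a dataset id (needs network access)."""
    with urlopen(parquet_api_url(dataset_id), timeout=30) as resp:
        payload = json.load(resp)
    return select_parquet_urls(payload, config=config, split=split)
```

Because the full dataset id is passed in, the username prefix never has to be recovered from a Dataset object. The returned list is also what the second option, passing the Parquet file list directly, would hand over.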

@npuichigo
Contributor Author

I'm not sure why Ray designed the API to accept a Dataset object; because of that, they need to verify that the Dataset is the original one and use its DatasetInfo to query the Hugging Face Hub. I'll advise the Ray Data team to have the HuggingFaceDatasource API accept a dataset id instead of a Dataset.
