One or several metadata.jsonl were found, but not in the same directory or in a parent directory of #7337

Open
mst272 opened this issue Dec 17, 2024 · 1 comment

mst272 commented Dec 17, 2024

Describe the bug

Loading an ImageFolder dataset with a metadata.jsonl file fails. I downloaded liuhaotian/LLaVA-CC3M-Pretrain-595K from Hugging Face to a local directory. Following the tutorial at https://huggingface.co/docs/datasets/image_dataset#image-captioning, I put only images.zip and the metadata.jsonl file describing the images in the same folder. However, loading the dataset raises an error: One or several metadata.jsonl were found, but not in the same directory or in a parent directory of.

The data in my jsonl file is as follows:

{"id": "GCC_train_002448550", "file_name": "GCC_train_002448550.jpg", "conversations": [{"from": "human", "value": "\nProvide a brief description of the given image."}, {"from": "gpt", "value": "a view of a city , where the flyover was proposed to reduce the increasing traffic on thursday ."}]}

Steps to reproduce the bug

from datasets import load_dataset
image = load_dataset("imagefolder", data_dir="data/opensource_data")

Expected behavior

The dataset loads successfully, with each image paired with its entry from metadata.jsonl.

Environment info

datasets==3.2.0

Member

lhoestq commented Jan 3, 2025

Hmm, I double-checked the source code and found a contradiction with the documentation: in the current implementation, the metadata file is ignored if it is not inside the same zip archive as the images:

if metadata_file_candidate is None  # ignore metadata_files that are not inside archives

In the test suite, the metadata file is placed inside the archive:

image_metadata_filename = archive_dir / "metadata.jsonl"
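Mirroring what the test suite does, one possible workaround is to put metadata.jsonl inside the archive itself. The following is a minimal sketch using only the standard library; the helper name `bundle_metadata_into_zip` and the output path are hypothetical, and it assumes the images sit at the root of images.zip so that metadata.jsonl ends up in the same directory as the files it describes.

```python
import zipfile
from pathlib import Path


def bundle_metadata_into_zip(images_zip: str, metadata_jsonl: str, out_zip: str) -> list[str]:
    """Copy every entry of images_zip into out_zip and add metadata.jsonl at the archive root.

    Hypothetical helper: out_zip must be a different path from images_zip.
    """
    Path(out_zip).parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(images_zip) as src, zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename == "metadata.jsonl":
                continue  # avoid a duplicate entry if the archive already has one
            dst.writestr(info, src.read(info.filename))
        # Assumption: image paths in metadata.jsonl are relative to the archive root.
        dst.write(metadata_jsonl, arcname="metadata.jsonl")
        return dst.namelist()
```

The new archive would then go in a directory of its own (without the original images.zip or a loose metadata.jsonl next to it), and that directory is what gets passed as `data_dir` to `load_dataset("imagefolder", ...)`. Whether this loads correctly depends on the installed `datasets` version, so it is worth checking on a small subset first.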

Thanks for reporting this issue. It seems the documentation is wrong and we never implemented support for a zip archive with the metadata file outside of it. We might rewrite part of this code soon to make it more flexible, though, which would be a good occasion to fix this. In the meantime, feel free to open a PR to fix the documentation if you'd like.
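Until that is fixed, another option is to avoid the archive entirely: extract the images and place metadata.jsonl next to them as plain files, which is the layout ImageFolder resolves `file_name` against. The sketch below is an illustration under that assumption; the helper names `extract_next_to_metadata` and `missing_files` are hypothetical, and extracting a dataset of this size needs enough disk space.

```python
import json
import shutil
import zipfile
from pathlib import Path


def extract_next_to_metadata(images_zip: str, metadata_jsonl: str, out_dir: str) -> Path:
    """Extract images_zip into out_dir and copy metadata.jsonl alongside the images."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(images_zip) as zf:
        zf.extractall(out)
    shutil.copy(metadata_jsonl, out / "metadata.jsonl")
    return out


def missing_files(out_dir: str) -> list[str]:
    """Return every file_name listed in metadata.jsonl that is not present in out_dir."""
    out = Path(out_dir)
    missing = []
    with open(out / "metadata.jsonl", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # file_name is taken relative to the directory holding metadata.jsonl
            if not (out / record["file_name"]).is_file():
                missing.append(record["file_name"])
    return missing
```

If `missing_files` returns an empty list, the extracted directory can be passed as `data_dir` to `load_dataset("imagefolder", ...)` in place of the folder that holds the zip.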
