Hello from fsspec #579
There's not much in the way of docs over at fsspec yet. It would be great to have a way to specify a file or range of files from a location via a (pseudo) glob, and just have them stream into a local instance (ML training). Supporting a shared pseudo glob format (e.g. s3://some_bucket/*.csv) would be very useful. Right now, we're rolling our own on that front.
The great majority of users never see fsspec directly, so the docs, such as they are, are more dev-oriented. I wasn't entirely sure what your feature request is, but you have the following two options:
- to copy files to local (this works in batch/concurrently), or
- to create …
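A minimal sketch of both approaches, assuming an s3fs-backed filesystem; the bucket, paths, and the `process` helper are placeholders, not anything from this thread:

```python
import fsspec

# Option 1: copy remote files matching a glob to a local directory.
# fsspec dispatches to s3fs here; credentials come from the environment.
fs = fsspec.filesystem("s3")
fs.get("some_bucket/data/*.csv", "local_dir/")  # batch copy

# Option 2: open remote files as streaming file-like objects
# without materialising them on disk first.
with fsspec.open("s3://some_bucket/data/part-0001.csv", "rb") as f:
    header = f.readline()

# open_files expands a whole glob and returns lazy OpenFile objects.
files = fsspec.open_files("s3://some_bucket/data/*.csv", "rb")
for of in files:
    with of as f:
        process(f)  # hypothetical per-file processing function
```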
Ok, this is pretty nice. Thanks for the tips here! I might humbly suggest selling this library a bit better via docs. My impression was that fsspec was more of an internal library (and as such likely to change without much notice).
Yes, indeed, that is how it has mostly been; but actually that makes it less likely to change, as it would break APIs in those other libraries. I am the worst at keeping docs up to date and complete. I'll put it onto my list...
Well, many people understand internal libraries as "this code could change/break suddenly since it is subservient to another project". This really isn't a "big" doc issue; there are already good API docs for fsspec. Just add a few blurbs to the README talking about the basic use cases you're handling with fsspec. Talk about where the library came from and where it's going. If you're nice, add some links back to smart_open under a list of alternative libraries, etc.
You've already done most of this; it's just buried under a subheader in the API docs: https://filesystem-spec.readthedocs.io/en/latest/features.html
I opened an issue to reformat the docs, as many use cases ended up on the features page, which has grown steadily with time.
The initial reason for this issue was to see if there was a chance to work together and reduce duplication or confusion for users wishing to pick a library. It should be pointed out that fsspec's origins were explicitly as a layer for Dask, so we have some unique concerns, particularly around serialising file-system objects and open files (or …).
Sorry for not having tackled this yet - I was reminded to do so by the discussion at https://news.ycombinator.com/item?id=27523377#27523893
@isidentical, it occurs to me that your team might be in a better position than me to write the brief overview text discussed above, if you have the time and appetite.
Thanks, I've mentioned this library to other colleagues, and they're put off by the lack of a "friendly" landing page that everyone seems to have these days.
Hi there, I wanted to give a quick greeting without opening a new issue because I feel like there is also some overlap with ratarmount and/or rather the library backend ratarmountcore. I started it motivated by the lack of performance of archivemount with TAR files. Therefore, I'd see the focus and advantages in access performance, which is enabled by the persisting SQLite index and also custom backends such as indexed_bzip2 and rapidgzip to enable parallelized decompression and constant-time random access to compressed streams.

In contrast to fsspec, ratarmount is, or rather was, narrowly focused on tar support (hence the name), but over time support for other archives was added based on user requests. I'm also very close to finishing a libarchive backend. This further increases the overlap with fsspec. It even has some kind of filesystem interface called MountSource, but because of its focus on read-only access, it is much more terse. The focus is also more on random access as opposed to streaming access as is the case for smart_open, and for now, there are no backends for web protocols. However, multiple users have tried to stack ratarmount on top of S3 or HTTP in some way or another to achieve similar things. Another project that might be comparable with all three mentioned projects is NVIDIA's aistore.
Thanks @mxmlnkn, a lot to read and think about there. In fact, I have been tracking indexed_gzip (and zstd, bzip2) as a target for kerchunk, which would indeed allow parallel decompression access within archives (including zip and tar.*); it sounds like ratarmount has something similar (or perhaps more general-purpose). I hadn't thought of that also in combination with a filesystem. Many moving parts! So we need to think about how to make these things work together, and ideally come together to build a better get-my-bytes story for all.
@mxmlnkn I've used indexed_gzip and indexed_bzip2 with great success. Small world! First time I've heard of ratarmount, looks cool, let me check it out :) And thanks for your continued work in this ecosystem.
Yes, it has something similar.
But to be honest, the parallelization is probably not worth it for anything residing on a disk because gzip decompression is sufficiently fast already, especially with implementations such as ISA-L, which comes close in performance to zstd. (The comparison benchmarks on the zstd site don't show that though; they show comparisons with the much slower zlib ;) ) It might be worth it for data on good SSDs or cloud access via very fast networks though. I benchmarked it in-memory on HPC systems to show scaling up to 20 GB/s gzip decompression bandwidth. There is another caveat: the intricate algorithm for parallelization adds overhead, i.e., you will need at least 2 or 3 cores to be faster than single-core decompression. This is also partly because the generic parallel implementation cannot use ISA-L and instead uses a custom gzip implementation that lacks decades of fine-tuning.

In the end, if you want random access to data you compressed yourself, you are probably best off using bgzip / BGZF, which adds metadata for seeking in the gzip stream's "extra" headers. Rapidgzip can detect such files and is then >2x faster than for generic gzip files. pzstd would also be an option. The normal zstd compression tool does not add information for seeking, unfortunately, and even limiting the frame size is not yet possible. In that way, zstd and xz are still worse options than bzip2 and gzip when it comes to seekability and generic parallel decompressibility.

I read a bit into kerchunk and zarr. With any new format, there always is the issue of adoption, therefore I find it important to mention improvements over existing formats. The website doesn't mention much regarding that, but the video presentation from PyData Global 2021 has a list of features that are missing in HDF5. It also sounds similar to Parquet in a way. A comparison to that might also be helpful.
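To make the random-access idea more concrete, here is a minimal sketch of seeking within a gzip file via rapidgzip; the file name, offset, and parallelization level are placeholders, and the exact keyword arguments should be checked against the library's current documentation:

```python
import os
import rapidgzip

# Open a gzip file with parallel decompression. rapidgzip builds a seek index
# on the fly, which allows roughly constant-time seeks into the decompressed
# stream instead of re-decompressing everything from the start.
with rapidgzip.open("large_file.gz", parallelization=os.cpu_count()) as f:
    f.seek(1_000_000_000)  # jump deep into the decompressed data
    chunk = f.read(4096)   # read a small window at that offset
```

The same pattern should apply to indexed_bzip2 for .bz2 archives.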
Would people here be interested in having a coordination meeting? There are a few repos and a lot of interesting code and ideas here. I'm not really sure how big a niche there is for all this: whether fsspec should explicitly integrate, or if ad-hoc solutions are enough.
Agreed, more motivation/documentation would be very helpful! I'll answer your immediate questions here, and perhaps that's a start. Kerchunk is not a new format, but a way of presenting various binary array formats as if they were zarr. Zarr itself is missing a "why" section (https://zarr.readthedocs.io/en/stable/getting_started.html#highlights hmm), but it is a "cloud native" N-D array format, where the array data is chunked along each dimension, allowing for remote storage and parallel processing with the likes of dask, with the metadata stored in small JSON files. It is well integrated with xarray and some others. Zarr has been around for a while, and is well established in some specific (scientific) fields like climatology and microscopy. Yes, it shares "cloud native" with parquet, but the latter is columnar and nearly always used with tabular/2d data.

So kerchunk allows HDF5 files (and grib, fits, netcdf3) to be viewed as zarr, gaining all the advantages of that without copying/recoding the original data. You can even form logical datasets out of potentially thousands of source files, so that instead of some search interface to find the right files for a job, you simply do coordinate selection/slicing of the zarr or xarray object. This trick has been done in Python only, with one working POC for JS.
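As an illustration of the pattern described above (not taken from this thread), opening a kerchunk-generated reference set as zarr through fsspec and xarray might look roughly like this; the reference JSON file, S3 location, and variable names are hypothetical:

```python
import fsspec
import xarray as xr

# The "reference" filesystem maps zarr keys onto byte ranges inside the
# original HDF5/GRIB/netCDF files, as recorded in kerchunk's JSON output.
mapper = fsspec.get_mapper(
    "reference://",
    fo="combined_refs.json",        # hypothetical kerchunk output
    remote_protocol="s3",           # where the original files live
    remote_options={"anon": True},
)

ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})
# Coordinate-based selection across what may be thousands of source files:
subset = ds["temperature"].sel(time="2021-06")
```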
I tried to get myself an overview of the ecosystem, but I'm running out of steam, and the further away from ratarmount and raw compression it goes, the more out of my depth I am. Here is some kind of arranged chart. You can click on it to get the SVG. Please correct me if I understood or categorized anything wrong. Two frameworks with significant overlap in my opinion that haven't been mentioned here are: pyfilesystem2 and fox-it/dissect. The former has received no commits for over a year, and fox-it/dissect seems to have shot out from nothing in October 2022 if you look at the star history. It feels like everyone has started from different niches (forensics, cloud data analysis, local big-data archival) and while adding features arrived at the observed overlaps. The kinds of overlaps I observe are:
An online meeting might be interesting, but if it doesn't happen, then I guess I'll hopefully, slowly but steadily, progress as outlined above with the checkboxes whenever I have some free time and motivation. I don't feel it to be necessary or desirable to merge any project completely into another one. However, some backends as outlined above can probably be reused. A verbal meeting might also be easier to digest than my wall of text ;) I did some similar albeit much, much shorter contemplations for: mxmlnkn/ratarmount#109
Thanks for the detailed summary - quite a lot to digest there! First things first, fsspec should definitely wrap dissect; I was totally unaware of its existence. I agree that there's no huge motivation to try to merge projects, but it would be great if they can work together. fsspec has a history of hooking into third party libraries, providing no extra functionality but a familiar interface to users. Actually, the set of non-FS data inputs available in dissect feels a lot like something the Intake project would be interested in, but that's another issue. I see in your issue you are also interested in data storage formats with internal compression - you might be interested in kerchunk as a way to find the encoded chunks within them.
I should also mention that yes, fsspec has FUSE support, but it's super flaky and breaks under serious load. There haven't been enough users requesting better service to justify trying to make it better.
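For context, the FUSE integration being referred to lives in fsspec.fuse; a rough sketch of mounting a remote prefix locally might look like the following, with the bucket and mount point as placeholders:

```python
import fsspec
from fsspec.fuse import run

# Expose a remote prefix as a local directory via FUSE.
# Requires the fusepy dependency and a FUSE-capable OS; this call blocks
# in the foreground until the mount is interrupted.
fs = fsspec.filesystem("s3", anon=True)
run(fs, "some_bucket/some_prefix/", "/mnt/fsspec_demo/")
```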
Just spotted the bullet
If you can mount archives (or anything!) with fsspec as a backend, it would be valuable that way around too :) |
https://filesystem-spec.readthedocs.io/en/latest/ and related packages seem to cover much of the same ground as this repo. I don't know how I didn't come across it before!
There is probably scope to share and make each other's libraries better. While I have a look at what is here, I would kindly ask anyone interested to look over fsspec. A few key features I would point out:
- fsspec is being used by some high-profile projects such as dask and pandas.
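To make the overlap with smart_open concrete, here is a minimal illustration of fsspec's open-by-URL interface; the URLs are placeholders and any credentials are assumed to come from the environment:

```python
import fsspec

# The same call works for local files, S3, GCS, HTTP, ... with the backend
# chosen from the URL scheme, much like smart_open.open.
for url in [
    "local_file.txt",
    "s3://some_bucket/key.txt",
    "https://example.com/data.txt",
]:
    with fsspec.open(url, "rt") as f:
        print(f.readline())
```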