Use zstd instead of zip as the main compression algorithm #64

Open
Dri0m opened this issue Sep 24, 2021 · 1 comment

Comments

@Dri0m

Dri0m commented Sep 24, 2021

zstd is much, much faster than zip's DEFLATE at roughly the same (or better) compression ratio on the default settings, which is quite important once your web crawls reach hundreds of gigabytes.

source (includes some benchmarks):
https://github.com/facebook/zstd

some random benchmark which includes both DEFLATE and zstd:
https://www.gaia-gis.it/fossil/librasterlite2/wiki?name=benchmarks+(2019+update)

@ikreymer
Member

Yes, well aware of zstd for compression. A key goal of this spec in using zip is to bundle multiple resources together (raw web archive data, indexes, metadata, etc.) in such a way that they can be accessed via client-side random access. In particular, the web archives are stored in WARC, and there's a separate proposal for using zstd with WARCs, see: iipc/warc-specifications/issues/53

The WACZ format should be able to support zstd WARCs like any other WARC, though for practical applications, clients will need to be able to read zstd WARCs in the browser. I suppose the same could be done for the compressed CDX index. The other data stored in the zip is essentially negligible in size compared to the raw WARC data, so it really comes down to the usability of zstd WARCs.
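
To illustrate the random-access point, here is a minimal Python sketch, not taken from the spec or any WACZ tooling. It assumes the WARC member is placed in the zip without zip-level compression (`ZIP_STORED`), so a reader can seek straight to one record and decompress only that record. The member names, the record contents, and the helpers `build_bundle`, `member_data_offset`, and `read_record` are hypothetical, and gzip stands in for whatever per-record compression the WARC uses (zstd would slot in at the same place).

```python
import gzip
import io
import struct
import zipfile


def build_bundle() -> tuple[bytes, int, int]:
    """Build a tiny in-memory zip holding a stored, WARC-like member."""
    # Two independently compressed records, concatenated back to back.
    rec1 = gzip.compress(b"record one")
    rec2 = gzip.compress(b"record two")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Small metadata can use zip's own DEFLATE compression.
        zf.writestr("datapackage.json", b"{}", compress_type=zipfile.ZIP_DEFLATED)
        # The archive member is stored as-is: its records are already compressed.
        zf.writestr("archive/data.warc.gz", rec1 + rec2, compress_type=zipfile.ZIP_STORED)
    return buf.getvalue(), len(rec1), len(rec2)


def member_data_offset(raw: bytes, name: str) -> int:
    """Return the absolute offset of a stored member's data within the zip."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("byte-range access needs a stored (uncompressed) member")
    # The local file header is 30 fixed bytes; the name and extra field
    # lengths are little-endian 16-bit values at offsets 26 and 28.
    name_len, extra_len = struct.unpack_from("<HH", raw, info.header_offset + 26)
    return info.header_offset + 30 + name_len + extra_len


def read_record(raw: bytes, name: str, offset: int, length: int) -> bytes:
    """Decompress one record given its offset and length inside the member."""
    start = member_data_offset(raw, name) + offset
    return gzip.decompress(raw[start:start + length])


raw, len1, len2 = build_bundle()
# Fetch only the second record, without touching the first.
second = read_record(raw, "archive/data.warc.gz", len1, len2)
```

In a real setting the record offset and length would come from an index rather than being known up front, and `raw[start:start + length]` would be a ranged read against a remote file instead of a slice of bytes in memory. Because the zip container only stores the member, the choice between gzip and zstd affects the record decompression step alone.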


2 participants