Use zstd instead of zip as the main compression algorithm #64

Dri0m · 2021-09-24T08:02:14Z

zstd is much, much faster at roughly the same (or better) compression ratio on the default settings, which is quite important once your web crawls reach hundreds of gigabytes.

source (includes some benchmarks):
https://github.com/facebook/zstd

some random benchmark which includes both DEFLATE and zstd:
https://www.gaia-gis.it/fossil/librasterlite2/wiki?name=benchmarks+(2019+update)

ikreymer · 2021-09-24T16:55:49Z

Yes, well aware of zstd for compression. A key goal of this spec is to use zip to bundle multiple resources together (raw web archive data, indexes, metadata, etc.) in such a way that they can be accessed via client-side random access. In particular, the web archives are stored in WARC, and there's a separate proposal for using zstd with WARCs, see: iipc/warc-specifications/issues/53

The WACZ format should be able to support zstd WARCs like any other WARC, though practical applications will need to be able to read zstd WARCs in the browser. I suppose the same could be done for the compressed CDX index, though the other data stored in the zip is essentially negligible in size compared to the raw WARC data, so it really comes down to the usability of zstd WARCs.
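
To illustrate the random-access point above, here is a minimal Python sketch using only the standard library. It assumes the already-compressed archive data is stored in the zip without recompression, so a byte range inside that member maps directly to a byte range in the zip file. The member names, the payload, and the `read_stored_range` helper are hypothetical placeholders for illustration, not the layout or API defined by the spec.

```python
import io
import struct
import zipfile

# Build a small ZIP in memory. Names and contents are hypothetical placeholders.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # The large member is stored as-is (no zip-level compression), on the
    # assumption that the data inside is already compressed per record.
    zf.writestr("archive/data.warc.gz", b"RECORD-A|RECORD-B|RECORD-C",
                compress_type=zipfile.ZIP_STORED)
    # Small auxiliary files can use zip's own DEFLATE compression.
    zf.writestr("indexes/index.cdx", b"placeholder index",
                compress_type=zipfile.ZIP_DEFLATED)


def read_stored_range(raw, info, start, length):
    """Read `length` bytes at offset `start` inside a stored ZIP member."""
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("byte ranges map directly only for stored members")
    # A local file header has 30 fixed bytes, followed by the file name and
    # the extra field; the two lengths sit at byte offsets 26..29.
    raw.seek(info.header_offset)
    header = raw.read(30)
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    data_offset = info.header_offset + 30 + name_len + extra_len
    raw.seek(data_offset + start)
    return raw.read(length)


# Look up the member in the central directory, then fetch only one record.
with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("archive/data.warc.gz")
    member_names = zf.namelist()

record_b = read_stored_range(buf, info, 9, 8)
```

In a browser the same arithmetic would translate into an HTTP range request against the zip file, which is why the compression used inside the stored WARC (gzip or zstd) is a separate question from the zip container itself.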

ikreymer mentioned this issue Feb 21, 2022

ZST File Support webrecorder/archiveweb.page#68

Closed
