pipx run fastwarc check failed: binascii.Error: Non-base32 digit found #19
This looks like a non-standard WARC record digest. Can you post an example of where this happens? |
I created the WARC file with https://github.com/internetarchive/warcprox |
(venv) user@box:/tmp/warcs$ sha1sum WARCPROX-20220315191329244-00000-icvgw961.warc* | tee WARCPROX-20220315191329244-00000-icvgw961.warc.sha1
5cfa65c0cb6cf7aeed36be9a812dedbd7d2f7add WARCPROX-20220315191329244-00000-icvgw961.warc
c220d5ea3067eadb3ae6caa39b3ac919eeccb23e WARCPROX-20220315191329244-00000-icvgw961.warc.tar.gz
(venv) user@box:/tmp/warcs$ |
The file you uploaded is not a valid GZip file (although its hash matches the one you posted), so I cannot open it. |
The file seems to be a mixture of text and binary, but I can see what your original problem is: the digest hash is stored as hex, not as Base32, which is required by the WARC spec. I'll add support for that later, but it's non-standard and worth a bug report to warcprox. |
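To illustrate the difference between the two encodings, here is a minimal Python sketch using only the standard library. It takes the 40-character hex SHA-1 string quoted earlier in this thread purely as sample input (that value is the checksum of the whole file, not a record digest), shows that it cannot be decoded as Base32, and re-encodes it. The helper names `hex_to_base32` and `is_valid_base32` are hypothetical and are not part of FastWARC or warcprox.

```python
import base64
import binascii

# Sample input only: the hex SHA-1 string quoted in this thread.
HEX_DIGEST = "5cfa65c0cb6cf7aeed36be9a812dedbd7d2f7add"


def hex_to_base32(hex_digest: str) -> str:
    """Re-encode a hex digest as Base32 (hypothetical helper)."""
    return base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")


def is_valid_base32(value: str) -> bool:
    """Return True if the value decodes as Base32, False otherwise."""
    try:
        base64.b32decode(value)
        return True
    except binascii.Error:
        # Hex strings may contain 0, 1, 8 and 9, which are not in the
        # Base32 alphabet (A-Z and 2-7).
        return False


b32_digest = hex_to_base32(HEX_DIGEST)
print("hex decodes as Base32:", is_valid_base32(HEX_DIGEST.upper()))
print("re-encoded digest:", b32_digest)
```

A SHA-1 digest is 20 bytes, so the hex form is 40 characters long and the Base32 form is 32 characters long, which is one simple way to tell the two apart.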
I packed the WARC file with tar.
(venv) user@box:/tmp$ jhove -k warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
Jhove (Rel. 1.20.0, 2019-01-19)
Date: 2022-03-16 18:41:20 CET
RepresentationInformation: warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
ReportingModule: BYTESTREAM, Rel. 1.3 (2007-04-10)
LastModified: 2022-03-15 20:13:31 CET
Size: 16927625
Format: bytestream
Status: Well-Formed and valid
MIMEtype: application/octet-stream
Checksum: 2371829d
Type: CRC32
Checksum: 297bd32582ca019fb5922efb8d74b1a4
Type: MD5
Checksum: 5cfa65c0cb6cf7aeed36be9a812dedbd7d2f7add
Type: SHA-1
(venv) user@box:/tmp$ |
Am I wrong with this interpretation? Feedback is welcome. If I am not misreading the discussion about a clarification of the specification: |
Yeah, I figured. But no, a tar archive is not a valid WARC file, and tar is not a compression algorithm either. A compressed WARC file is a series of records that are each compressed individually with gzip. I do not recommend that you try to do that manually. An uncompressed .warc file is perfectly valid, although space-inefficient.
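As a rough illustration of what "compressed individually" means, the following Python sketch concatenates several gzip members, one per record, and then splits them apart again. The record contents are placeholders rather than valid WARC records, and `iter_gzip_members` is a hypothetical helper, not a FastWARC function. For real files, a WARC library should do this work.

```python
import gzip
import zlib

# Placeholder byte strings standing in for WARC records.
records = [b"record one\r\n\r\n", b"record two\r\n\r\n", b"record three\r\n\r\n"]

# Each record becomes its own gzip member; the members are simply concatenated.
compressed = b"".join(gzip.compress(record) for record in records)


def iter_gzip_members(data: bytes):
    """Yield the decompressed content of each gzip member in turn."""
    while data:
        # wbits = MAX_WBITS | 16 selects the gzip container format.
        decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        content = decompressor.decompress(data) + decompressor.flush()
        yield content
        # Bytes after the end of this member belong to the next one.
        data = decompressor.unused_data


members = list(iter_gzip_members(compressed))
print(len(members), "members recovered")
```

Because every record is a separate member, a reader can seek to a record offset and decompress just that record. A single tar or gzip stream over the whole file does not allow this.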
The WARC specification makes no mention of hex-encoded digests. As per the specification, these should be Base32, although it only mentions it as an example and does not explicitly say that no other encoding is allowed: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/#warc-block-digest |
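A tolerant digest check could accept either encoding. The sketch below is an assumption about how such a check might look, not FastWARC's actual implementation: it splits a `WARC-Block-Digest` style value of the form `algorithm:value`, recomputes the digest with `hashlib`, and compares it against both the Base32 and the hex form. The function name `digest_matches` is hypothetical.

```python
import base64
import hashlib


def digest_matches(header_value: str, block: bytes) -> bool:
    """Check a labelled digest against a block, allowing Base32 or hex."""
    algorithm, _, encoded = header_value.strip().partition(":")
    raw = hashlib.new(algorithm.lower(), block).digest()
    base32_form = base64.b32encode(raw).decode("ascii")
    hex_form = raw.hex().upper()
    # Compare case-insensitively, since writers differ in letter case.
    return encoded.upper() in (base32_form, hex_form)


block = b"example block content"
raw_digest = hashlib.sha1(block).digest()
base32_header = "sha1:" + base64.b32encode(raw_digest).decode("ascii")
hex_header = "sha1:" + raw_digest.hex()

print(digest_matches(base32_header, block))
print(digest_matches(hex_header, block))
```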
FastWARC now supports hex digests. The new wheels should be up on PyPI as soon as this is done: https://github.com/chatnoir-eu/chatnoir-resiliparse/actions/runs/1995457297 |