pipx run fastwarc check failed: binascii.Error: Non-base32 digit found #19

Closed
MaxPeal opened this issue Mar 16, 2022 · 9 comments

Comments

@MaxPeal

MaxPeal commented Mar 16, 2022

$ pipx run --verbose fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
pipx >(setup:729): pipx version is 1.0.0
pipx >(setup:730): Default python interpreter is '/home/user/.local/pipx/venvs/pipx/bin/python'
pipx >(needs_upgrade:69): Time since last upgrade of shared libs, in seconds: 1561898. Upgrade will be run by pipx if greater than 2592000.
pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
pipx >(run:103): Reusing cached venv /home/user/.local/pipx/.cache/7a73b1e86637c39
pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
pipx >(exec_app:387): exec_app: /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
0 records were verified successfully.                           
1 records were skipped without digest.
Error in sys.excepthook:
Traceback (most recent call last):
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
    for v in pbar:
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "fastwarc/tools.pyx", line 178, in verify_digests
  File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
  File "/usr/lib/python3.9/base64.py", line 231, in b32decode
    raise binascii.Error('Non-base32 digit found') from None
binascii.Error: Non-base32 digit found

Original exception was:
Traceback (most recent call last):
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
    for v in pbar:
  File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "fastwarc/tools.pyx", line 178, in verify_digests
  File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
  File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
  File "/usr/lib/python3.9/base64.py", line 231, in b32decode
    raise binascii.Error('Non-base32 digit found') from None
binascii.Error: Non-base32 digit found
$
@phoerious
Member

This looks like a non-standard WARC record digest. Can you post an example of where this happens?

@MaxPeal
Author

MaxPeal commented Mar 16, 2022

I created the WARC file with https://github.com/internetarchive/warcprox

@MaxPeal
Author

MaxPeal commented Mar 16, 2022

(venv) user@box:/tmp/warcs$ sha1sum WARCPROX-20220315191329244-00000-icvgw961.warc* | tee WARCPROX-20220315191329244-00000-icvgw961.warc.sha1
5cfa65c0cb6cf7aeed36be9a812dedbd7d2f7add  WARCPROX-20220315191329244-00000-icvgw961.warc
c220d5ea3067eadb3ae6caa39b3ac919eeccb23e  WARCPROX-20220315191329244-00000-icvgw961.warc.tar.gz
(venv) user@box:/tmp/warcs$ 

WARCPROX-20220315191329244-00000-icvgw961.warc.tar.gz

@phoerious
Member

The file you uploaded is not a valid gzip file (although its hash matches the one you posted), so I cannot open it.

@phoerious
Member

phoerious commented Mar 16, 2022

The file seems to be a mixture of text and binary, but I can see what your original problem is: the digest hash is stored as hex, not as Base32, which is required by the WARC spec.

I'll add support for that later, but it's non-standard and worth a bug report to warcprox.
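To illustrate the difference, here is a minimal sketch of a reader that accepts both encodings for a SHA-1 digest. It is not FastWARC's implementation; the function names `decode_sha1_digest` and `verify_block` are hypothetical, and the sketch assumes the encoding can be told apart by length alone (32 characters for Base32, 40 for hex).

```python
import base64
import hashlib


def decode_sha1_digest(header_value: str) -> bytes:
    """Decode a labelled digest such as 'sha1:<value>' into raw bytes.

    Assumption: a SHA-1 digest is 20 bytes, which is 32 characters in
    Base32 and 40 characters in hex, so the length selects the decoder.
    """
    algorithm, _, encoded = header_value.partition(":")
    if algorithm.lower() != "sha1":
        raise ValueError(f"unsupported digest algorithm: {algorithm!r}")
    if len(encoded) == 32:
        # Base32 form, as used in the example in the WARC specification.
        return base64.b32decode(encoded, casefold=True)
    if len(encoded) == 40:
        # Hex form, as found in the file from this report.
        return bytes.fromhex(encoded)
    raise ValueError(f"unexpected digest length: {len(encoded)}")


def verify_block(block: bytes, header_value: str) -> bool:
    """Compare the SHA-1 of a record block against its declared digest."""
    return hashlib.sha1(block).digest() == decode_sha1_digest(header_value)
```

Checking the length first avoids passing a hex string to `base64.b32decode`, which is what raises `Non-base32 digit found` in the traceback above.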

@MaxPeal
Author

MaxPeal commented Mar 16, 2022

I packed the WARC file with tar.
Am I missing something?
Jhove, installed via apt on Debian 11, says it is valid:

(venv) user@box:/tmp$ jhove -k warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
Jhove (Rel. 1.20.0, 2019-01-19)
 Date: 2022-03-16 18:41:20 CET
 RepresentationInformation: warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
  ReportingModule: BYTESTREAM, Rel. 1.3 (2007-04-10)
  LastModified: 2022-03-15 20:13:31 CET
  Size: 16927625
  Format: bytestream
  Status: Well-Formed and valid
  MIMEtype: application/octet-stream
  Checksum: 2371829d
   Type: CRC32
  Checksum: 297bd32582ca019fb5922efb8d74b1a4
   Type: MD5
  Checksum: 5cfa65c0cb6cf7aeed36be9a812dedbd7d2f7add
   Type: SHA-1
(venv) user@box:/tmp$ 

@MaxPeal
Author

MaxPeal commented Mar 16, 2022

Am I wrong with this interpretation? Feedback is welcome.

If I am not misreading the discussion about a specification clarification,
storing the digest hash as hex instead of Base32 is permitted by the WARC spec:
iipc/warc-specifications#29
webrecorder/warcio#74 (comment)

@phoerious
Member

phoerious commented Mar 16, 2022

> i packed the WARC file with tar.

Yeah, I figured. But no, a tar archive is not a valid WARC file, and tar is not a compression algorithm either. A compressed WARC file is a series of records, each compressed individually as its own gzip member. I do not recommend that you try to do that manually. An uncompressed .warc file is perfectly valid, although space-inefficient.
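As a rough sketch of the difference, the snippet below compresses each record separately and compares the result with a plain tar archive of the same bytes. The record payloads are placeholders rather than real WARC records, and the helper names `looks_like_gzip` and `compress_records` are made up for this illustration.

```python
import gzip
import io
import tarfile

GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the data starts with the two-byte gzip magic number."""
    return data[:2] == GZIP_MAGIC


def compress_records(records: list[bytes]) -> bytes:
    """Compress every record as its own gzip member and concatenate them."""
    return b"".join(gzip.compress(record) for record in records)


# Placeholder payloads standing in for serialized WARC records.
records = [b"record one\r\n\r\n", b"record two\r\n\r\n"]
warc_gz = compress_records(records)

# A plain (uncompressed) tar archive of the same bytes, built in memory.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    payload = b"".join(records)
    info = tarfile.TarInfo(name="example.warc")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
tar_bytes = buffer.getvalue()

print("per-record gzip starts with gzip magic:", looks_like_gzip(warc_gz))
print("tar archive starts with gzip magic:", looks_like_gzip(tar_bytes))
```

The concatenated members still decompress as one stream with `gzip.decompress`, while the tar archive begins with a file name header rather than gzip data.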

> the digest hash stored as hex, not as Base32, is possible by the WARC spec.

The WARC specification makes no mention of hex-encoded digests. As per the specification, these should be Base32, although it only mentions it as an example and does not explicitly say that no other encoding is allowed: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/#warc-block-digest
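For comparison, this small sketch prints the same SHA-1 digest in both forms, using the `algorithm:value` layout of the `WARC-Block-Digest` field. The function name `labelled_digests` and the sample block are illustrative only.

```python
import base64
import hashlib


def labelled_digests(block: bytes) -> dict[str, str]:
    """Return the SHA-1 of a block as Base32 and as hex labelled digests."""
    raw = hashlib.sha1(block).digest()
    return {
        # Form shown in the specification's example.
        "base32": "sha1:" + base64.b32encode(raw).decode("ascii"),
        # Form not mentioned by the specification.
        "hex": "sha1:" + raw.hex(),
    }


for encoding, value in labelled_digests(b"example block").items():
    print(f"{encoding}: WARC-Block-Digest: {value}")
```

Both values describe the same 20 bytes; only the text encoding differs (32 characters in Base32, 40 in hex).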

@phoerious
Member

FastWARC now supports hex digests. The new wheels should be up on PyPI as soon as this is done: https://github.com/chatnoir-eu/chatnoir-resiliparse/actions/runs/1995457297
