-
-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect WARC-Payload-Digest values when transfer encoding is present #74
Comments
Is this a bug in the standard instead of a bug in warcio and many other tools? Can you point out any common tool that actually obeys the standard? There are a couple of examples where the standard was changed to match the tools, this might be a new one. I see from your links that this has been discussed by the WARC standards folks, but it sure doesn't seem like the subset of tools that preserve transfer encoding have adopted that digest algorithm. One thing this does inspire me to do is to add some code to "warcio check" and "warcio test" to check for and comment on this situation |
From the discussion I linked, it seems like the intention was as written, not as implemented. So in that sense, I'd say it's not a bug in the standard. I haven't investigated every tool; I just did some validation of WARCs created with wpull and warcio, both of which had this issue. From a glance over the relevant code, it seems like wget does remove chunking for the payload digest. However, it also looks like it doesn't handle compression correctly (with which wget has had issues before). Due to the lack of a suitable testing setup, I haven't been able to test that yet though. |
Well, if you'd like to look farther, the develop branch of warcio supports "warcio check", which can be used to also check digests -- albeit with warcio's current opinion that digests are computed on the bytes without removing transfer encoding. You can also install the branch from my "warcio test" pullreq (which is a work in progress) and give "warcio test" a spin! It's my hope that "warcio test" will lead the community to interact about all of these conflicts between the standard and the tools people actually use. |
Thanks for flagging this! Hm, I wonder if any tools actually do what the standard suggests in this regard? I think the content-decoding would also need to be decoded. The change in warcio should be simple, I think just: - for buf in self._iter_stream(record.raw_stream):
+ for buf in self._iter_stream(record.content_stream()): Of course, would invalidate previous payload digests, so not sure if this is the right thing to do, especially if other tools also behave the same way now.. Also, I wonder what the implications of this are.. presumably the problem would be for deduplication if the same content is served with different transfer-encoding and/or content-encodings and would unnecessarily be fetched? Or chunks of different size used for same content? I ask this because making such a change will make the payload digests mismatch and affect deduplication with any existing records that have transfer- and content- encoding, while keeping it as is might affect some deduplication. |
These issues are exactly why I suspect the standard is what's going to change. Those "bad" digests are all over petabytes of response and revisit records, and the cdx indices ... |
@wumpus @ikreymer I guess the intention is indeed deduplication when the chunk boundaries or something in the compression changes. I've definitely seen varying chunks returned from web servers for identical content, which isn't too surprising since those chunks are often based on a buffer which may get flushed before being full depending on a variety of factors (current CPU/disk speed, concurrent activity, etc.). I have little experience with compression transfer encoding. I'm aware of the implications for existing archives. But unfortunately, those files do not conform to the standard as it's written currently, and I believe that's a significant issue that needs to be solved, whether that requires changes in the softwares or in the standard. Based on the discussion linked above (or rather, the fact that this exact issue was discussed fairly extensively before), I assumed that a change of the standard is not wanted/needed. But I realise now that this may very well just be an issue of the people working on the standard not knowing how the existing implementations work in this regard. I'll continue my investigations soon how other tools react to this and file an issue on the WARC specification repo instead if it turns out that the majority of relevant tools don't adhere to the written standard, i.e. if the standard should likely be changed. |
I'm planning on writing a jumbo article on what "warcio test" finds, and this will end up being a part of that, if you don't beat me to it. I've got a repo with a pile of warcs from all of the major tools, etc etc. |
Thanks for looking into this more deeply. At first glance, removing transfer-encoding but not content-encoding seems to make even less sense then keeping both or removing both. I wanted to add, the IIPC hosts a weekly open source web archiving call on Tuesdays, alternating between 9am and 6pm EST currently. I think this would be a great conversation to bring up there as well for more immediate feedback. The next one is tomorrow, April 29 at 9am EST, following week will be 6pm. I can also mention this issue on the IIPC slack channel as well if you'd like. |
This came up for me recently as well; it broke my expectations about how CDX indexes worked, in that when I fetched files via wayback (without replay) and wrote out to disk, the on-disk blob's digest didn't match the digest in the corresponding CDX line. Our CDX build process simply copies the WARC content digest header as-is without verification or re-calculation, and the browser (and/or wayback earlier) strips any encoding before writing to disk. Here's a URL which consistently induces the ambiguity: http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2012/Fischer12-BAF.pdf A notable tool which does "follow the spec" is warcprox, used as part of brozzler, as of 2017 (internetarchive/warcprox@3a0f6e0947) |
That's an easy problem to see because many http clients will undo chunked encoding and decompress by default. So you can't depend on the checksum not changing. And even if the community agreed about transfer encoding and checksums, decompression is still going to change it. |
@wumpus Sounds great! I remember your request on the wpull repository from a few months ago. Great to hear that this project of yours is coming along nicely. :-) @ikreymer At first glance, it seems silly, yes. But Transfer-Encoding is intended for encoding the data on-the-fly whereas Content-Encoding is at-rest compression. So the content-encoded data should never change between requests unless the source data has changed whereas the transfer encoding may vary freely. So I guess it kind-of makes sense. First time I hear about that call after working with WARCs daily for two years, although I must admit that I never searched for things like this either. Is this publicly accessible? Please do spread it wherever it makes sense. We should also probably find a better place for this discussion than here, but I have no idea where that might be. @bnewbold That example URL uses chunked encoding, so it makes perfect sense that this causes issues. Thanks for pointing out how warcprox behaves! |
@JustAnotherArchivist Yes, I see your point, though I suppose both could just as easily applied on-the-fly, eg. with nginx The IIPC call is open to everyone and there is more info in the IIPC slack at: https://iipc.slack.com #oh-sos channel (I can invite you if you'd like, PM your email) @bnewbold Good to know that warcprox does follow the standard, was about to look at what it does. So, I decided to run a few quick tests using different tools and got 3 different digests! However, they were consistent across multiple tries of the same tool. I tried:
This does preserve the chunked encoding in the WARC record. Trying each tool multiple times produces the same results, so that's good news. But, clearly there's a few off, such as 1) stripping out the transfer-encoding altogether) while 2) and 3) should be matching. Also, I've attached three files here. Will try to investigate further.. |
Thanks, will contact you shortly. 1-pywb.warc.gz is arguably the worst because it preserves the Transfer-Encoding header while decoding the body. That means the data consumer will have to detect and work around that. It means that the payload digest is according to spec though. 2-warcio.warc.gz has the expected, spec-violating "payload with transfer encoding" hash. 3-warcprox.warc.gz has the (arguably) correct "payload with transfer encoding removed" hash. But it writes the hash in hexadecimal form rather than the base-32 one which is common in other WARCing tools. |
Looks like 3-warcprox.warc.gz has an incorrect hash for the request record though: it is equal to the block hash when it should be |
I thought my latest pullreq (which got pulled into the devel branch) fixed that bug, and whatever I have installed locally sees digest pass. tests/data/example-iana.org-chunked.warc works fine for me, too, although it is excluded from test_check_digest_examples.py for some reason. Maybe you haven't pulled devel since that merge? |
(Your unchunking code wasn't reading the last couple of bytes... ;-) ) |
Hmm, perhaps this should move to #75 or separate issue, but running
Somehow, the |
Whoops, I fixed the bug in my develop-warcio-test branch, which is still a work in progress:
I'll prepare another "tweak" PR and make sure there's a test... hm and that assert needs to get wrapped in a try/except like the other ones in this method: #77 |
Regarding the choice made in the WARC standard, IMHO it makes as much sense as anything. My understand is that in theory, On the other hand,
(That last sentence might explain why we rarely see |
@JustAnotherArchivist oh ouch thanks for pointing out that warcprox writes an incorrected WARC-Payload-Digest header on request records |
... request records see webrecorder/warcio#74 (comment)
#97 brought this to my attention again. Sorry, I haven't really had time the past few months to investigate further. I'm not sure what the best route forward is. I still intend to initiate a software comparison about this, both for WARC writing and reading tools. However, I think we could only really call this a bug in the standard if all common tools used the same wrong digest, i.e. if there were a consensus on how it should be done. We've already established that at least one common tool (warcprox) follows the standard, which leads me to think that the best way may be to fix all other tools to do so rather than amending the standard. This may be painful, but on the other hand, this issue was not noticed for nearly a decade since the initial WARC/1.0 specification release, so clearly digest validation is not something that is commonly done (correctly). |
For the examples I mentioned at the top, one common tool was different and it was still considered a bug by IIPC... wget started putting <> around the URI in WARC-Target-URI, but no other tool did, and the standard was changed to not have brackets. My suspicion is that every tool that doesn't remove transfer encoding is doing this wrong, other than warcprox. And yes, digest validation isn't commonly done, that's why I put it in warcio! Along the way I discovered several other bugs in the standard. I opened tickets in the correct place for each of them; this repo isn't the place for the IIPC community to have a discussion about possible standards bugs. |
Yeah, I agree that this isn't the place. This issue turned from "warcio is writing invalid digests" to "(nearly) everything is writing standard-violating digests" very quickly. Later I didn't want to open an issue on the warc-specifications repository without data on how the most common tools behave. And then I didn't have time for that analysis. Anyway, I'll move this discussion elsewhere. warcio should probably keep doing its current digest calculation for now to avoid another mess like with wget's |
If this issue does get raised in another venue a link to it from here would be greatly appreciated! |
I think the correct place is the IIPC issue tracker, examples: iipc/warc-specifications#49 iipc/warc-specifications#33 iipc/warc-specifications#48 ... ideally one of us interested people would do a bit of a survey to see which tools keep transfer-encoding and then what they do for the digest. So far I've been too lazy to do that. |
Yep, that's exactly what I meant by "data on how the most common tools behave" above. I did start working on that back in April but then got distracted by other things. To be continued soon™. |
Hi all, apparently nobody care much about this issue but it's actually quite concerning, and I think that If nobody wants to fix that issue, it's fine, I don't personally use |
It's not that 'nobody cares', this is was an ambiguity in the standard, and we needed to be careful in making a backwards incompatible changes like this.. Unfortunately, there hasn't been much discussion of this elsewhere for a while.. |
I understand that it is a mistake that is fairly easy to do because it's not clearly stated in the specs, but it's not ambiguous because
I think the priority is to stop giving the possibility to verify WARCs with If I understand all the messages above, it sounds like people are expecting the standard to change to match what tools like I also understand that fixing |
JWAT and https://github.com/nlnwa/warchaeology also mark these digests as incorrect. I though the purpose of these digests were to validate the content as it is in the payload. This is the first time since writing JWAT that I've been made aware of this digest issue. |
Wget has this issue as well. At Archive Team we use the modified version of Wget at https://github.com/ArchiveTeam/wget-lua. This is used to archive petabytes of data. This modified Wget is now updated ArchiveTeam/wget-lua#13, and as of version 1.21.3-at.20230623.01 calculates the correct Over the next days this will be rolled out for all Archive Team distributed Warrior projects. |
Coming back to this after a few years, I think the best solution is to strip out the Transfer-Encoding altogether when writing the WARCs, so that WARCs are written with the Transfer-Encoding removed from the data, and header replaced with something like X-Removed-Transfer-Removed (we might have discussed that in another issue), and the payload digest matches the data with no Transfer-Encoding. After all these years, haven't really seen a need to access the transfer-encoded data, so I agree with @nclarkekb that it just creates more work for anyone using the data. WARCs created capture via other methods, such as browser extensions / warcio.js also has the encoding already removed. |
I also remember that (brief) discussion, but don't know where it happened. I thought it was in iipc/warc-specifications but couldn't find it there. It's certainly a discussion worth having, probably over there. However, any tools that currently strip the transfer encoding from the record body, transform the HTTP headers, or write HTTP/2 or HTTP/3 data to WARC are undeniably violating the standard. This is not ambiguous in the WARC spec: only HTTP/1.1 with the exact HTTP-layer-level data as sent/received is compatible with WARC versions 1.0 and 1.1. And since this is a format intended for long-term archival, spec compliance is crucial in my opinion. (Side note: perhaps the IIPC repo should also contain a list of known WARC spec violations found in the wild. It would fit in with the implementation recommendations and the annotated spec, I think. Apart from this issue right here, the angle brackets misunderstanding would also belong there.) |
@ikreymer I very strongly disagree with this idea, and I agree with what @JustAnotherArchivist noted. If the WARCs captured with "browser extensions / warcio.js" already have encoding removed, then these implementation are in violation of the standard and do not create proper WARCs. Whether this is an issue is up to the individual using the tool, but they should be aware of this, if the issue is known. Previously (#74 (comment)) the argument was made that
If the suggestion is to strip Transfer-Encoding entirely, store this in the WARC, and calculate the payload digest anyway on the stripped content, then what is stopping us from just calculating the payload anyway on the stripped body, while writing the raw body (including Transfer-Encoding) to WARC? The WARC standard notes clearly what should be stored, and what the digests should be calculated over. For many implementations the standard was not accurately read/analysed. I would say it would be best if implementations are fixed whenever possible and simply start calculating the correct digests moving forward.
We should agree on a solution for this in https://github.com/iipc/warc-specifications, and move forward with that.
Yes, I agree with this. When people use tools that write WARCs, they will likely assume that these WARCs are entirely following the WARC standard. Some tools do not. Sometimes that is known, sometimes it is yet to be discovered. I think it would be good to make clear to people who use WARC tools in what ways these tools do not follow the WARC standard, and possibly why they do not follow the WARC standard. They can then decide for themselves whether that is a problem, possibly with recommendations from the IIPC on the long term aspect. |
Weird that after 4 years of discussion, no one has yet to look at IA's and CC's warcs to see what they do with this issue. |
Just putting my 2c, as a developer / maintainer of several different versions of crawlers and wayback machine at Internet Archive...
|
Hey, so we are more than a year later now since the last message, and almost 5 years since this issue has been opened, the damages made has been huge so I wonder: still no plan about fixing this library? Thanks. |
Per WARC/1.0 spec section 5.9:
The entity-body is the HTTP body without transfer encoding per section 4.3 in RFC 2616. (In the newer RFC 723# family, it's called "payload body" instead and defined in section 3.3 of RFC 7230.)
Just to be clear to avoid confusion: this is the definition of the payload; the WARC record should still contain the exact response sent by the server with transfer encoding intact. But when calculating the WARC-Payload-Digest, the transfer encoding must be stripped.
warcio (like many other tools) passes the response data directly into the payload digester without removing transfer encoding. This means that it produces an invalid WARC-Payload-Digest when the HTTP body is transfer-encoded.
The text was updated successfully, but these errors were encountered: