-
-
Notifications
You must be signed in to change notification settings - Fork 220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Chunked encoding is recorded incorrectly #584
Comments
Side note: any tool that decodes the payload and removes the header would produce invalid WARCs. The Transfer-Encoding must be preserved (section 6.3.2 in WARC/1.0). (The payload digest calculation is a different matter and one that is implemented inconsistently across tools.) |
Yeah, that's sort of why it works the way it does. On the one hand, removing the Transfer-Encoding header would violate the spec.. But on the other, storing it with the chunked encoding turns out to be more work, and is generally less useful, since at least 99% of the time, you just want to de-chunk it anyway when consuming the payload. Others have complained about transfer-encoding and even gzip encoding being in the WARCs at all.. There's a few reasons for not storing with transfer-encoding in particular, including that actually equivalent records may be different due to different chunks. Also, I've found browsers tend not respect the header in certain circumstances at all: for example, getting the payload from the browser (via debugging protocol) also has the payload de-chunked, as well as when receiving a response via fetch() and serving via service-workers. In these cases, I've found that the header is there, while the payload is already de-chunked. I suppose some options are:
|
It may be worth holding a discussion about changing the WARC specification. Chunked TE is essentially a transmission artefact; it can provide useful clues about the HTTP server (e.g. based on how the chunk size lines are formatted), but the gain in terms of archived information is small. And removing TE in data and headers before writing the records would solve the issue you mentioned: two retrievals with the same payload digest (when calculated according to spec, which most tools don't seem to do; cf. webrecorder/warcio#74) may actually hold technically different data due to TE, and that difference would be lost when writing a revisit record instead. The situation is however complicated by the fact that nobody uses TE correctly for on-the-fly compression and abuses Content-Encoding instead (which is intended for at-rest compression, e.g. serving a static .tar.gz from disk). But that's a whole other can of worms. Regardless, the current WARC specification (1.0 and 1.1) require that chunked TE be preserved, so I think that until a version 1.2 or later is released which removes this requirement and that version is implemented in pywb, pywb should adhere to the specification. In other words, I believe that number 2 is the only valid option at this time. |
I'd like to know if there is any progress on this issue. We are currently using the jwarc library to read warc files created by pywb, and as many others, we have encountered the problem that jwarc throws a parse exception when reading responses that have the Transfer-Encoding chunked header, while the body is decoded. Is the consensus still that the only solution in pywb is to keep and store the Transfer-Encoding? If so, has any progress been made to implement this? What about WARC specification 1.2? Is there any progress on this, and has this issue been addressed? What about having a setting in pywb to produce 'invalid' WARC files without the Transfer-Encoding header? This sounds like something easy to implement, and it lets the user of pywb make a deliberate choice to deviate from the specification. |
No changes have been made to the WARC specifications, no proper discussion on this topic has been launched, and a version 1.2 is not currently on the horizon to my knowledge. So yes, preserving the transfer encoding in the WARC records is still the only valid option for the foreseeable future in my eyes. I had a quick look at how this could be fixed, and it doesn't seem to be straighforward. The |
Describe the bug
When used to record an HTTP response that uses
Transfer-Encoding: chunked
, pywb produces a WARC record where the chunks have been decoded but theTransfer-Encoding
header is still in place. This behavior is incompatible with other WARC implementations, which either preserve the exact encoded data of the original response, or decode the data and remove theTransfer-Encoding
header.Steps to reproduce the bug
Start pywb in record mode and run a command like:
The resulting WARC file will look like this:
Expected behavior
The WARC record should include the original chunk-encoded data, like this:
The text was updated successfully, but these errors were encountered: