Chunked encoding is recorded incorrectly #584

yotann2 · 2020-10-09T04:10:14Z

Describe the bug

When used to record an HTTP response that uses Transfer-Encoding: chunked, pywb produces a WARC record where the chunks have been decoded but the Transfer-Encoding header is still in place. This behavior is incompatible with other WARC implementations, which either preserve the exact encoded data of the original response, or decode the data and remove the Transfer-Encoding header.

Steps to reproduce the bug

Start pywb in record mode and run a command like:

curl http://127.0.0.1:8080/my-web-archive/record/https://httpbin.org/stream/1

The resulting WARC file will look like this:

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://httpbin.org/stream/1
...

HTTP/1.1 200 OK
Transfer-Encoding: chunked
...

{"url": "https://httpbin.org/stream/1", "args": {}, "headers": {"Host": "httpbin.org", "X-Amzn-Trace-Id": "Root=1-5f7fde1e-324e053b6cccf5765be1c3ca", "Accept-Encoding": "identity", "User-Agent": "curl/7.70.0", "Accept": "*/*"}, "origin": "35.222.231.221", "id": 0}

Expected behavior

The WARC record should include the original chunk-encoded data, like this:

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://httpbin.org/stream/1
...

HTTP/1.1 200 OK
Transfer-Encoding: chunked
...

12f
{"url": "https://httpbin.org/stream/1", "args": {}, "headers": {"Host": "httpbin.org", "X-Amzn-Trace-Id": "Root=1-5f7f735d-5f540ab35fef070975fdd02b", "User-Agent": "Wpull
/2.0.3 (gzip)", "Accept-Encoding": "gzip, deflate", "Referer": "https://httpbin.org/stream/1"}, "origin": "35.222.231.221", "id": 0}

0

The text was updated successfully, but these errors were encountered:

JustAnotherArchivist · 2020-10-09T17:25:00Z

Side note: any tool that decodes the payload and removes the header would produce invalid WARCs. The Transfer-Encoding must be preserved (section 6.3.2 in WARC/1.0). (The payload digest calculation is a different matter and one that is implemented inconsistently across tools.)

ikreymer · 2020-10-09T17:47:46Z

Yeah, that's sort of why it works the way it does. On the one hand, removing the Transfer-Encoding header would violate the spec.. But on the other, storing it with the chunked encoding turns out to be more work, and is generally less useful, since at least 99% of the time, you just want to de-chunk it anyway when consuming the payload.

Others have complained about transfer-encoding and even gzip encoding being in the WARCs at all..

There's a few reasons for not storing with transfer-encoding in particular, including that actually equivalent records may be different due to different chunks. Also, I've found browsers tend not respect the header in certain circumstances at all: for example, getting the payload from the browser (via debugging protocol) also has the payload de-chunked, as well as when receiving a response via fetch() and serving via service-workers. In these cases, I've found that the header is there, while the payload is already de-chunked.

I suppose some options are:

remove the transfer-encoding header, or rewrite it? X-Original-Transfer-Encoding: chunked ?
ensure the WARC payload keeps the transfer-encoding. I think that's probably doable, but is more work and may be less desirable for users of the WARC.
Do nothing, maybe document current behavior : )
'Fake' transfer-encoding when its not available by simply writing a WARC with response as a single chunk?

JustAnotherArchivist · 2020-10-09T18:14:41Z

It may be worth holding a discussion about changing the WARC specification. Chunked TE is essentially a transmission artefact; it can provide useful clues about the HTTP server (e.g. based on how the chunk size lines are formatted), but the gain in terms of archived information is small. And removing TE in data and headers before writing the records would solve the issue you mentioned: two retrievals with the same payload digest (when calculated according to spec, which most tools don't seem to do; cf. webrecorder/warcio#74) may actually hold technically different data due to TE, and that difference would be lost when writing a revisit record instead. The situation is however complicated by the fact that nobody uses TE correctly for on-the-fly compression and abuses Content-Encoding instead (which is intended for at-rest compression, e.g. serving a static .tar.gz from disk). But that's a whole other can of worms.

Regardless, the current WARC specification (1.0 and 1.1) require that chunked TE be preserved, so I think that until a version 1.2 or later is released which removes this requirement and that version is implemented in pywb, pywb should adhere to the specification. In other words, I believe that number 2 is the only valid option at this time.

robertvanloenhout · 2021-09-14T13:22:52Z

I'd like to know if there is any progress on this issue. We are currently using the jwarc library to read warc files created by pywb, and as many others, we have encountered the problem that jwarc throws a parse exception when reading responses that have the Transfer-Encoding chunked header, while the body is decoded.

Is the consensus still that the only solution in pywb is to keep and store the Transfer-Encoding? If so, has any progress been made to implement this?
Will this be difficult to implement? I'm an experienced software engineer, but unfortunately I have almost no experience with Python. So I'm somewhat hesitant to try to fix this for pywb and offer you my changes. Still, if nobody started work on this yet, and you think this would be a valuable addition to pywb, then I could take a look.

What about WARC specification 1.2? Is there any progress on this, and has this issue been addressed?

What about having a setting in pywb to produce 'invalid' WARC files without the Transfer-Encoding header? This sounds like something easy to implement, and it lets the user of pywb make a deliberate choice to deviate from the specification.

JustAnotherArchivist · 2021-09-14T16:27:16Z

No changes have been made to the WARC specifications, no proper discussion on this topic has been launched, and a version 1.2 is not currently on the horizon to my knowledge. So yes, preserving the transfer encoding in the WARC records is still the only valid option for the foreseeable future in my eyes.

I had a quick look at how this could be fixed, and it doesn't seem to be straighforward. The pywb.recorder.recorderapp.RecorderApp accesses requests.Response.raw for retrieving the body, which already decodes the TE as far as I remember. warcio gets around that by replacing low-level httplib things. Perhaps pywb should just leverage warcio.capture_http, but I'm not sure about that.

anjackson mentioned this issue Dec 11, 2020

Problems playing back some Twitter records ukwa/ukwa-pywb#61

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Chunked encoding is recorded incorrectly #584

Chunked encoding is recorded incorrectly #584

yotann2 commented Oct 9, 2020

JustAnotherArchivist commented Oct 9, 2020

ikreymer commented Oct 9, 2020

JustAnotherArchivist commented Oct 9, 2020

robertvanloenhout commented Sep 14, 2021

JustAnotherArchivist commented Sep 14, 2021

Chunked encoding is recorded incorrectly #584

Chunked encoding is recorded incorrectly #584

Comments

yotann2 commented Oct 9, 2020

Describe the bug

Steps to reproduce the bug

Expected behavior

JustAnotherArchivist commented Oct 9, 2020

ikreymer commented Oct 9, 2020

JustAnotherArchivist commented Oct 9, 2020

robertvanloenhout commented Sep 14, 2021

JustAnotherArchivist commented Sep 14, 2021