Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chunked encoding is recorded incorrectly #584

Open
yotann2 opened this issue Oct 9, 2020 · 5 comments
Open

Chunked encoding is recorded incorrectly #584

yotann2 opened this issue Oct 9, 2020 · 5 comments

Comments

@yotann2
Copy link

yotann2 commented Oct 9, 2020

Describe the bug

When used to record an HTTP response that uses Transfer-Encoding: chunked, pywb produces a WARC record where the chunks have been decoded but the Transfer-Encoding header is still in place. This behavior is incompatible with other WARC implementations, which either preserve the exact encoded data of the original response, or decode the data and remove the Transfer-Encoding header.

Steps to reproduce the bug

Start pywb in record mode and run a command like:

curl http://127.0.0.1:8080/my-web-archive/record/https://httpbin.org/stream/1

The resulting WARC file will look like this:

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://httpbin.org/stream/1
...

HTTP/1.1 200 OK
Transfer-Encoding: chunked
...

{"url": "https://httpbin.org/stream/1", "args": {}, "headers": {"Host": "httpbin.org", "X-Amzn-Trace-Id": "Root=1-5f7fde1e-324e053b6cccf5765be1c3ca", "Accept-Encoding": "identity", "User-Agent": "curl/7.70.0", "Accept": "*/*"}, "origin": "35.222.231.221", "id": 0}

Expected behavior

The WARC record should include the original chunk-encoded data, like this:

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://httpbin.org/stream/1
...

HTTP/1.1 200 OK
Transfer-Encoding: chunked
...

12f
{"url": "https://httpbin.org/stream/1", "args": {}, "headers": {"Host": "httpbin.org", "X-Amzn-Trace-Id": "Root=1-5f7f735d-5f540ab35fef070975fdd02b", "User-Agent": "Wpull
/2.0.3 (gzip)", "Accept-Encoding": "gzip, deflate", "Referer": "https://httpbin.org/stream/1"}, "origin": "35.222.231.221", "id": 0}

0
@JustAnotherArchivist
Copy link

Side note: any tool that decodes the payload and removes the header would produce invalid WARCs. The Transfer-Encoding must be preserved (section 6.3.2 in WARC/1.0). (The payload digest calculation is a different matter and one that is implemented inconsistently across tools.)

@ikreymer
Copy link
Member

ikreymer commented Oct 9, 2020

Yeah, that's sort of why it works the way it does. On the one hand, removing the Transfer-Encoding header would violate the spec.. But on the other, storing it with the chunked encoding turns out to be more work, and is generally less useful, since at least 99% of the time, you just want to de-chunk it anyway when consuming the payload.

Others have complained about transfer-encoding and even gzip encoding being in the WARCs at all..

There's a few reasons for not storing with transfer-encoding in particular, including that actually equivalent records may be different due to different chunks. Also, I've found browsers tend not respect the header in certain circumstances at all: for example, getting the payload from the browser (via debugging protocol) also has the payload de-chunked, as well as when receiving a response via fetch() and serving via service-workers. In these cases, I've found that the header is there, while the payload is already de-chunked.

I suppose some options are:

  1. remove the transfer-encoding header, or rewrite it? X-Original-Transfer-Encoding: chunked ?
  2. ensure the WARC payload keeps the transfer-encoding. I think that's probably doable, but is more work and may be less desirable for users of the WARC.
  3. Do nothing, maybe document current behavior : )
  4. 'Fake' transfer-encoding when its not available by simply writing a WARC with response as a single chunk?

@JustAnotherArchivist
Copy link

It may be worth holding a discussion about changing the WARC specification. Chunked TE is essentially a transmission artefact; it can provide useful clues about the HTTP server (e.g. based on how the chunk size lines are formatted), but the gain in terms of archived information is small. And removing TE in data and headers before writing the records would solve the issue you mentioned: two retrievals with the same payload digest (when calculated according to spec, which most tools don't seem to do; cf. webrecorder/warcio#74) may actually hold technically different data due to TE, and that difference would be lost when writing a revisit record instead. The situation is however complicated by the fact that nobody uses TE correctly for on-the-fly compression and abuses Content-Encoding instead (which is intended for at-rest compression, e.g. serving a static .tar.gz from disk). But that's a whole other can of worms.

Regardless, the current WARC specification (1.0 and 1.1) require that chunked TE be preserved, so I think that until a version 1.2 or later is released which removes this requirement and that version is implemented in pywb, pywb should adhere to the specification. In other words, I believe that number 2 is the only valid option at this time.

@robertvanloenhout
Copy link

I'd like to know if there is any progress on this issue. We are currently using the jwarc library to read warc files created by pywb, and as many others, we have encountered the problem that jwarc throws a parse exception when reading responses that have the Transfer-Encoding chunked header, while the body is decoded.

Is the consensus still that the only solution in pywb is to keep and store the Transfer-Encoding? If so, has any progress been made to implement this?
Will this be difficult to implement? I'm an experienced software engineer, but unfortunately I have almost no experience with Python. So I'm somewhat hesitant to try to fix this for pywb and offer you my changes. Still, if nobody started work on this yet, and you think this would be a valuable addition to pywb, then I could take a look.

What about WARC specification 1.2? Is there any progress on this, and has this issue been addressed?

What about having a setting in pywb to produce 'invalid' WARC files without the Transfer-Encoding header? This sounds like something easy to implement, and it lets the user of pywb make a deliberate choice to deviate from the specification.

@JustAnotherArchivist
Copy link

No changes have been made to the WARC specifications, no proper discussion on this topic has been launched, and a version 1.2 is not currently on the horizon to my knowledge. So yes, preserving the transfer encoding in the WARC records is still the only valid option for the foreseeable future in my eyes.

I had a quick look at how this could be fixed, and it doesn't seem to be straighforward. The pywb.recorder.recorderapp.RecorderApp accesses requests.Response.raw for retrieving the body, which already decodes the TE as far as I remember. warcio gets around that by replacing low-level httplib things. Perhaps pywb should just leverage warcio.capture_http, but I'm not sure about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants