Encoded-words underspecified and unsupported #67

Open
yotann2 opened this issue Sep 13, 2020 · 1 comment

Comments

@yotann2

yotann2 commented Sep 13, 2020

WARC 1.0 and 1.1 say that "the ‘encoded-word’ mechanism of [RFC2047] may also be used when writing WARC fields." However, RFC2047§5 gives strict limitations on which fields may hold encoded-words, and requires that encoded-words be separated from various other tokens with whitespace. The last version of HTTP that included encoded-words, RFC2616§2.2, also limited which fields may hold them. The WARC standards don't specify which of these requirements apply to WARC, if any. If they do apply, it would seem that encoded-words are not allowed in any of the standard WARC fields.

I've checked several WARC implementations and none of them actually produce or decode encoded-words. These tools can produce headers like WARC-Target-URI: http://example.com/=?iso-8859-1?q?=31?= that would cause compatibility issues with a hypothetical tool that supported encoded-words. Furthermore, even in web browsers, encoded-words are parsed inconsistently.

Therefore, I propose removing RFC2047 encoded-words entirely from future versions of WARC.

@yotann2
Author

yotann2 commented Sep 15, 2020

(To clarify the compatibility issue: http://example.com/=?iso-8859-1?q?=31?= is a valid URI. If you archive it with one of the existing WARC tools, you'll get a field like WARC-Target-URI: http://example.com/=?iso-8859-1?q?=31?=. If you feed that into a tool that supports RFC2047, it would decode the field into http://example.com/1, which is different from the original URI.)
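The round-trip problem described above can be reproduced with a small sketch. This is not taken from any existing WARC tool: `naive_decode_field` is a hypothetical, deliberately lenient RFC 2047 decoder that replaces any encoded-word it finds anywhere in a field value, without the whitespace-separation and field-type restrictions of RFC 2047 §5. It only illustrates what a decoder that applied encoded-words to every WARC field might do.

```python
import base64
import re

# Loose pattern for an RFC 2047 encoded-word: =?charset?encoding?text?=
# Assumption: no whitespace-separation check is made, unlike RFC 2047 section 5.
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")

# Matches one "=XX" hex escape inside Q-encoded text.
HEX_ESCAPE = re.compile(rb"=([0-9A-Fa-f]{2})")


def _decode_match(match):
    """Decode a single encoded-word match into text."""
    charset = match.group(1)
    encoding = match.group(2).lower()
    text = match.group(3)
    if encoding == "b":
        raw = base64.b64decode(text)
    else:
        # "Q" encoding: "_" stands for a space, "=XX" is a hex-escaped byte.
        data = text.replace("_", " ").encode("ascii")
        raw = HEX_ESCAPE.sub(lambda h: bytes([int(h.group(1), 16)]), data)
    return raw.decode(charset)


def naive_decode_field(value):
    """Hypothetical decoder: rewrite every encoded-word found in a field value."""
    return ENCODED_WORD.sub(_decode_match, value)


original = "http://example.com/=?iso-8859-1?q?=31?="
decoded = naive_decode_field(original)
print(decoded)              # http://example.com/1
print(decoded == original)  # False: the archived URI no longer round-trips
```

A value that contains no encoded-word, such as http://example.com/, passes through this sketch unchanged, so the damage is limited to URIs that happen to contain the `=?...?=` pattern.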
