WARC 1.0 and 1.1 say that "the ‘encoded-word’ mechanism of [RFC2047] may also be used when writing WARC fields." However, RFC2047§5 gives strict limitations on which fields may hold encoded-words, and requires that encoded-words be separated from various other tokens with whitespace. The last version of HTTP that included encoded-words, RFC2616§2.2, also limited which fields may hold them. The WARC standards don't specify which of these requirements apply to WARC, if any. If they do apply, it would seem that encoded-words are not allowed in any of the standard WARC fields.
I've checked several WARC implementations and none of them actually produce or decode encoded-words. These tools can produce headers like WARC-Target-URI: http://example.com/=?iso-8859-1?q?=31?= that would cause compatibility issues with a hypothetical tool that supported encoded-words. Furthermore, even in web browsers, encoded-words are parsed inconsistently.
Therefore, I propose removing RFC2047 encoded-words entirely from future versions of WARC.
(To clarify the compatibility issue: http://example.com/=?iso-8859-1?q?=31?= is a valid URI. If you archive it with one of the existing WARC tools, you'll get a field like WARC-Target-URI: http://example.com/=?iso-8859-1?q?=31?=. If you feed that into a tool that supports RFC2047, it would decode the field into http://example.com/1, which is different from the original URI.)
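The round-trip problem can be reproduced with a minimal sketch. It assumes Python's standard `email.header.decode_header` as a stand-in for a hypothetical RFC2047-aware WARC reader; the `decode_rfc2047` helper is illustrative only and is not taken from any WARC tool. Python's parser is lenient about the whitespace rules in RFC2047§5, which is exactly the behavior that would corrupt this URI.

```python
from email.header import decode_header


def decode_rfc2047(value: str) -> str:
    """Decode any RFC 2047 encoded-words found in a field value.

    Hypothetical helper: it stands in for a WARC reader that chose to
    honor encoded-words. No existing WARC tool is known to do this.
    """
    parts = []
    for chunk, charset in decode_header(value):
        # decode_header yields bytes for decoded pieces and may yield str
        # when the value contains no encoded-word at all.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


# A valid URI whose path happens to look like an encoded-word.
original_uri = "http://example.com/=?iso-8859-1?q?=31?="
decoded_uri = decode_rfc2047(original_uri)

print(decoded_uri)
print("round-trips unchanged:", decoded_uri == original_uri)
```

A reader that followed RFC2047§5 strictly might leave this value alone, because the encoded-word is not separated from the preceding text by whitespace. That ambiguity is part of the problem: the WARC standards don't say which behavior is expected.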