diff --git a/issue4070.html b/issue4070.html new file mode 100644 index 0000000000..05343c9f99 --- /dev/null +++ b/issue4070.html @@ -0,0 +1,179 @@ + + + + +Issue 4070: Transcoding by std::formatter<std::filesystem::path> + + + + + + + + + +
+

This page is a snapshot from the LWG issues list; see the Library Active Issues List for more information and the meaning of New status.

+

4070. Transcoding by std::formatter<std::filesystem::path>

+

Section: 99 [fs.path.fmtr.funcs] Status: New + Submitter: Jonathan Wakely Opened: 2024-04-19 Last modified: 2024-04-19

+

Priority: Not Prioritized +

+

View all issues with New status.

+

Discussion:

+

+99 [fs.path.fmtr.funcs] says: + +

+If charT is char, path::value_type is wchar_t, +and the literal encoding is UTF-8, then the escaped path is +transcoded from the native encoding for wide character strings to UTF-8 +with maximal subparts of ill-formed subsequences substituted with +u+fffd +replacement character per the Unicode Standard [...]. +Otherwise, transcoding is implementation-defined. +
+

+ +

+This seems to mean that the Unicode substitutions are only done +for an escaped path, i.e. when the ? option is used. Otherwise, the form +of transcoding is completely implementation-defined. +However, this makes no sense. +An escaped string will have no ill-formed subsequences, because they will +already have been replaced as per 22.14.6.4 [format.string.escaped]: +

+Otherwise (X is a sequence of ill-formed code units), +each code unit U is appended to E in order as +the sequence \x{hex-digit-sequence}, +where hex-digit-sequence is the shortest hexadecimal +representation of U using lower-case hexadecimal digits. +
+

+

+So only unescaped strings can have ill-formed sequences by the time +we do transcoding to char, but whether or not any +u+fffd substitution +occurs is just implementation-defined. +

+ +

+I believe we want to specify the substitutions are done when transcoding +an unescaped path (and it doesn't matter whether we specify it +for escaped paths, because it's a no-op if escaping happens first, +as is apparently intended). +
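
A minimal sketch of the substitution requested for unescaped paths, under stated assumptions: `char16_t` stands in for a 16-bit `wchar_t` whose native encoding is UTF-16, and `append_utf8` and `transcode_lossy` are hypothetical helper names, not library functions. Each unpaired surrogate is replaced by u+fffd while well-formed input is transcoded unchanged.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Append the UTF-8 encoding of a Unicode scalar value to `out`.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper (assumption, not part of any library):
// transcode UTF-16 to UTF-8, replacing each unpaired surrogate with U+FFFD.
inline std::string transcode_lossy(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t u = in[i];
        bool high = u >= 0xD800 && u <= 0xDBFF;
        bool low = u >= 0xDC00 && u <= 0xDFFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Well-formed surrogate pair: combine into one scalar value.
            char32_t lo = in[i + 1];
            append_utf8(out, 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
            ++i;
        } else if (high || low) {
            // Ill-formed: a surrogate with no partner becomes U+FFFD.
            append_utf8(out, 0xFFFD);
        } else {
            append_utf8(out, u);
        }
    }
    return out;
}
```

Under these assumptions, a path containing `a`, a lone `0xD800`, then `b` would be written as `a`, the three bytes `EF BF BD`, then `b`, so the original invalid code unit is no longer visible in the output.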

+ +

+It does matter whether we escape first or perform substitutions first. +If we escape first then every code unit in an ill-formed sequence is +individually escaped as \x{hex-digit-sequence}. +So an ill-formed sequence of two wchar_t values will be escaped as +two \x{...} strings, which are then transcoded to UTF-8. +If we transcode (with substitutions) first then the entire +ill-formed sequence is replaced with a single replacement character, +which will then be escaped as \x{fffd}. +SG16 should be asked to confirm that escaping first is intended, +so that an escaped string shows the original invalid code units. +For a non-escaped string, we want the ill-formed sequence to be +formatted as �, which the proposed resolution tries to ensure. +
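
The escape-first order can be sketched as follows, again assuming `char16_t` as a stand-in for a UTF-16 `wchar_t`. `escape_ill_formed` is a hypothetical name, and the sketch covers only the ill-formed code unit rule quoted from [format.string.escaped]; the quoting and the other escape rules are deliberately left out.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

inline bool is_high(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper (assumption, not part of any library):
// replace each unpaired surrogate with \x{hex-digit-sequence},
// using the shortest lower-case hexadecimal representation.
inline std::u16string escape_ill_formed(std::u16string_view in) {
    static constexpr char16_t digits[] = u"0123456789abcdef";
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char16_t u = in[i];
        if (is_high(u) && i + 1 < in.size() && is_low(in[i + 1])) {
            // Well-formed surrogate pair: copy both code units unchanged.
            out += u;
            out += in[++i];
        } else if (is_high(u) || is_low(u)) {
            // Ill-formed code unit: emit \x{...} with no leading zeros.
            out += u"\\x{";
            bool started = false;
            for (int shift = 12; shift >= 0; shift -= 4) {
                unsigned d = (u >> shift) & 0xF;
                if (d != 0 || started || shift == 0) {
                    out += digits[d];
                    started = true;
                }
            }
            out += u'}';
        } else {
            out += u;
        }
    }
    return out;
}
```

Because the result no longer contains unpaired surrogates, a later transcoding step has nothing to substitute, and each original invalid code unit remains visible: two unpaired surrogates `0xDC00 0xD800` come out as `\x{dc00}\x{d800}`.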

+ + + +

Proposed resolution:

+

+This wording is relative to N4981. +

+
    +
  1. Modify 99 [fs.path.fmtr.funcs] as indicated:

    + +
    +
    
    +template<class FormatContext>
    +  typename FormatContext::iterator
    +    format(const filesystem::path& p, FormatContext& ctx) const;
    +
    +
    -5- +Effects: +Let s be p.generic_string<filesystem::path::value_type>() +if the g option is used, otherwise p.native(). +Writes s into ctx.out(), adjusted according to the path-format-spec. +If charT is char, path::value_type is wchar_t, +and the literal encoding is UTF-8, then the +escaped path +(possibly escaped) string +is transcoded from the native encoding for wide character strings to UTF-8 +with maximal subparts of ill-formed subsequences substituted with +u+fffd replacement character per +the Unicode Standard, Chapter 3.9 u+fffd +Substitution in Conversion. +If charT and path::value_type are the same then no transcoding is performed. +Otherwise, transcoding is implementation-defined. +
    +
    +
  2. +
  3. +Modify the entry in the index of implementation-defined behavior as indicated: +
    +transcoding of a formatted path when charT and path::value_type differ +and not converting from wchar_t to UTF-8 +
    +
  4. + +
+ + + + + + + diff --git a/lwg-active.html b/lwg-active.html index 319e667f41..27c0d63656 100644 --- a/lwg-active.html +++ b/lwg-active.html @@ -79,7 +79,7 @@

C++ Standard Library Active Issues List (Revision D125)

-

Revised 2024-04-19 at 14:07:09 UTC +

Revised 2024-04-19 at 15:21:35 UTC

Reference ISO/IEC IS 14882:2020(E)

Also see:

@@ -199,17 +199,17 @@

Revision History