From 5f29a5a5d2a4595f59954b345549586c5f3d6cbd Mon Sep 17 00:00:00 2001
From: github-actions
This page is a snapshot from the LWG issues list, see the Library Active Issues List for more information and the meaning of New status.
+4070. Transcoding by std::formatter<std::filesystem::path>
Section: 99 [fs.path.fmtr.funcs] Status: New
+ Submitter: Jonathan Wakely Opened: 2024-04-19 Last modified: 2024-04-19
+Priority: Not Prioritized
+
+View all issues with New status.
+Discussion:
+99 [fs.path.fmtr.funcs] says:
+
+If charT is char, path::value_type is wchar_t,
+and the literal encoding is UTF-8, then the escaped path is
+transcoded from the native encoding for wide character strings to UTF-8
+with maximal subparts of ill-formed subsequences substituted with
+U+FFFD REPLACEMENT CHARACTER per the Unicode Standard [...].
+Otherwise, transcoding is implementation-defined.
+
+This seems to mean that the Unicode substitutions are only done
+for an escaped path, i.e. when the ? option is used. Otherwise, the form
+of transcoding is completely implementation-defined.
+However, this makes no sense.
+An escaped string will have no ill-formed subsequences, because they will
+already have been replaced as per 22.14.6.4 [format.string.escaped]:
+
+Otherwise (X is a sequence of ill-formed code units),
+each code unit U is appended to E in order as
+the sequence \x{hex-digit-sequence},
+where hex-digit-sequence is the shortest hexadecimal
+representation of U using lower-case hexadecimal digits.
+
+So only unescaped strings can have ill-formed sequences by the time
+we do transcoding to char, but whether or not any
+U+FFFD substitution
+occurs is just implementation-defined.
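
To make the quoted escaping rule concrete, here is a minimal sketch of how ill-formed code units could be turned into \x{hex-digit-sequence} text. It assumes a 16-bit wide encoding (UTF-16), uses char16_t as a stand-in for wchar_t, and treats unpaired surrogates as the only ill-formed input. The names append_hex_escape and escape_ill_formed are hypothetical and are not part of the standard library; the other escaping rules of [format.string.escaped] (quotes, control characters and so on) are deliberately left out.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: appends "\x{...}" for one code unit, using the
// shortest lower-case hexadecimal representation of its value.
void append_hex_escape(std::u16string& out, char16_t unit) {
    static constexpr char16_t digits[] = u"0123456789abcdef";
    out += u"\\x{";
    bool started = false;
    for (int shift = 12; shift >= 0; shift -= 4) {
        unsigned nibble = (unit >> shift) & 0xF;
        // Skip leading zero digits, but always emit at least one digit.
        if (nibble != 0 || started || shift == 0) {
            out += digits[nibble];
            started = true;
        }
    }
    out += u'}';
}

bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool is_low_surrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical sketch: copies well-formed UTF-16 unchanged and escapes each
// unpaired surrogate individually as \x{...}.
std::u16string escape_ill_formed(std::u16string_view in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char16_t c = in[i];
        if (is_high_surrogate(c) && i + 1 < in.size() && is_low_surrogate(in[i + 1])) {
            out += c;          // well-formed surrogate pair: keep both units
            out += in[++i];
        } else if (is_high_surrogate(c) || is_low_surrogate(c)) {
            append_hex_escape(out, c);  // ill-formed: one escape per code unit
        } else {
            out += c;          // ordinary BMP code unit
        }
    }
    return out;
}
```

Under these assumptions the result contains only well-formed UTF-16, so a later transcoding step has nothing left to substitute.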
+
+I believe we want to specify the substitutions are done when transcoding
+an unescaped path (and it doesn't matter whether we specify it
+for escaped paths, because it's a no-op if escaping happens first,
+as is apparently intended).
+
+It does matter whether we escape first or perform substitutions first.
+If we escape first then every code unit in an ill-formed sequence is
+individually escaped as \x{hex-digit-sequence}.
+So an ill-formed sequence of two wchar_t values will be escaped as
+two \x{...} strings, which are then transcoded to UTF-8.
+If we transcode (with substitutions) first then the entire
+ill-formed sequence is replaced with a single replacement character,
+which will then be escaped as \x{fffd}.
+SG16 should be asked to confirm that escaping first is intended,
+so that an escaped string shows the original invalid code units.
+For a non-escaped string, we want the ill-formed sequence to be
+formatted as �, which the proposed resolution tries to ensure.
+
Proposed resolution:
+This wording is relative to N4981.
+
+Modify 99 [fs.path.fmtr.funcs] as indicated:
+
+template<class FormatContext>
+  typename FormatContext::iterator
+    format(const filesystem::path& p, FormatContext& ctx) const;
+
+-5- Effects:
+Let s be p.generic_string<filesystem::path::value_type>()
+if the g option is used, otherwise p.native().
+Writes s into ctx.out(), adjusted according to the path-format-spec.
+If charT is char, path::value_type is wchar_t,
+and the literal encoding is UTF-8, then the
+[-escaped path-]{+(possibly escaped) string+}
+is transcoded from the native encoding for wide character strings to UTF-8
+with maximal subparts of ill-formed subsequences substituted with
+U+FFFD REPLACEMENT CHARACTER per
+the Unicode Standard, Chapter 3.9 U+FFFD
+Substitution in Conversion.
+If charT and path::value_type are the same then no transcoding is performed.
+Otherwise, transcoding is implementation-defined.
+
+transcoding of a formatted path when charT and path::value_type differ
+and not converting from wchar_t to UTF-8
+
Revised 2024-04-19 at 14:07:09 UTC
Revised 2024-04-19 at 15:21:35 UTC
Reference ISO/IEC IS 14882:2020(E)