-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: validate UTF-8 at C++ -> Lean boundary #3963
Conversation
!bench |
(disabled auto-merge to run a benchmark first) |
Mathlib CI status (docs):
|
Here are the benchmark results for commit 88cad9a. Benchmark Metric Change
================================================
+ stdlib tactic execution -1.9% (-302.2 σ) |
The affected tests are regression tests from #301, #1707, #1690, all of which appear to have been strange behaviors of lean caused by invalid UTF-8. The new behavior is for the C code to throw an unspecific error before even calling lean code; this is inevitable because the lean parser/elaborator assumes it can store the file in a |
src/include/lean/lean.h
Outdated
@@ -2064,11 +2066,11 @@ static inline uint8_t lean_version_get_is_release(lean_obj_arg _unit) { | |||
} | |||
|
|||
static inline lean_obj_res lean_version_get_special_desc(lean_obj_arg _unit) { | |||
return lean_mk_string(LEAN_SPECIAL_VERSION_DESC); | |||
return lean_mk_ascii_string(LEAN_SPECIAL_VERSION_DESC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’d probably lean towards using mk_utf8_string
here. It’s probably not performance critical, and maybe some German developer with a ü in their name builds lean from a branch with v0.7.8-Günnar
….
return lean_mk_ascii_string(LEAN_SPECIAL_VERSION_DESC); | |
return lean_mk_utf8_string(LEAN_SPECIAL_VERSION_DESC); |
src/include/lean/lean.h
Outdated
} | ||
|
||
static inline lean_obj_res lean_system_platform_target(lean_obj_arg _unit) { | ||
return lean_mk_string(LEAN_PLATFORM_TARGET); | ||
return lean_mk_ascii_string(LEAN_PLATFORM_TARGET); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return lean_mk_ascii_string(LEAN_PLATFORM_TARGET); | |
return lean_mk_utf8_string(LEAN_PLATFORM_TARGET); |
Unlikely to matter, but better safe than sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, the "better safe than sorry" option is lean_mk_string
. If we have any reason to believe that the data isn't necessarily in the subset it appears to be in, we should just validate it. (Note: is it confusing that the function mk_utf8_string
does not validate? Your comments here suggest that you've misunderstood the role of the functions in which case the name should probably be changed to something less confusing.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, after I wrote this and was away from the computer I paused wondered if my suggestion is bogus.
It’s not entirely bogus, though: If the there is utf8 in the TARGET, then mk_ascii_string
will produce an invalid length. So if LEAN_SPECIAL_VERSION_DESC
etc. happens to be utf8, then mk_utf8_string
is better than mk_ascii_string
, but not as safe as a validating function.
I agree that it’s worth changing the lean_mk_{utf8,utf8}_string
function names to include `unchecked to avoid future bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on better names
src/runtime/platform.cpp
Outdated
@@ -40,7 +40,7 @@ extern "C" LEAN_EXPORT uint8 lean_system_platform_emscripten(obj_arg) { | |||
#endif | |||
} | |||
|
|||
extern "C" object * lean_get_githash(obj_arg) { return lean_mk_string(LEAN_GITHASH); } | |||
extern "C" object * lean_get_githash(obj_arg) { return lean_mk_ascii_string(LEAN_GITHASH); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also unlikely, but since the source is out of our (local) control, better safe than sorry.
extern "C" object * lean_get_githash(obj_arg) { return lean_mk_ascii_string(LEAN_GITHASH); } | |
extern "C" object * lean_get_githash(obj_arg) { return lean_mk_utf8_string(LEAN_GITHASH); } |
src/include/lean/lean.h
Outdated
} | ||
|
||
static inline lean_obj_res lean_system_platform_target(lean_obj_arg _unit) { | ||
return lean_mk_string(LEAN_PLATFORM_TARGET); | ||
return lean_mk_ascii_string(LEAN_PLATFORM_TARGET); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on better names
src/runtime/object.cpp
Outdated
if (validate_utf8((const uint8_t *)s, len)) { | ||
return lean_mk_string_from_bytes(s, len); | ||
} else { | ||
return lean_mk_string_core("", 0, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really a sensible fallback? I matches the panic value but there is no message. Shouldn't this raise an exception if done as part of I/O?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. I was planning to put in a follow-up PR which does something like rust's String::from_utf8_lossy
with replacement characters. In most cases it's not really a good idea to rely on this for error handling, and instead code should call validate
separately and throw IO errors etc as appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to elaborate on this plan before going forward. It would also be helpful to mention such relevant plans in the PR description :) .
If we want to get to the point where no-one, especially not Lean APIs, depends on this branch, should it become a hard assertion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a revision in the works which addresses many of the review comments. I'll post that and then we can talk about how we want to proceed here.
1e6967c
to
248a17a
Compare
I pushed a new version of the PR. Summary of changes (w.r.t master, not ver.1):
Other (non-API) changes:
|
Sounds like a nice and thorough cleanup, thanks a lot! I'd lean towards rejecting invalid utf8 in lean files (not replacing, not using the empty string). Who knows what kind of surprises will occur when people accidentally have invalid utf8 and lean does something, but not what they expect to. |
I'm happy as well modulo below -
Agreed, and more generally I/O functions should at least by default throw an error when decoding fails. I'm pretty sure this is the default in all the big stdlibs out there as well, no? |
I agree, but can we agree that this is an improvement on the status quo and do that in a follow up PR? Handling UTF-8 errors properly means having new kinds of error reporting functions we don't have currently, as well as touching a bunch more code. The main purpose of this PR is to avoid undefined behavior in the present version caused by invalid lean objects being created, not to add new error reporting mechanisms. |
Sure, I just want to make sure it is documented and discussed that this is a partial solution in that way. That is very relevant context for reviewing. |
Sounds like as soon as the PR description reflects the current state I can press the button? |
PR description updated |
Linux-debug fails with
|
@digama0 Do you have time to look into the assertion failure? |
Sorry for the delay, the assertion failure should be fixed now. |
Continuation of #3958. To ensure that lean code is able to uphold the invariant that
String
s are valid UTF-8 (which is assumed by the lean model), we have to make sure that no lean objects are created with invalid UTF-8. #3958 covers the case of lean code creating strings viafromUTF8Unchecked
, but there are still many cases where C++ code constructs strings from aconst char *
orstd::string
with unclear UTF-8 status.To address this and minimize accidental missed validation, the
(lean_)mk_string
function is modified to validate UTF-8. The original function is renamed tomk_string_unchecked
, with several other variants depending on whether we know the string is UTF-8 or ASCII and whether we have the length and/or utf8 char count on hand. I reviewed every function which leads tomk_string
or its variants in the C code, and used the appropriate validation function, defaulting tomk_string
if the provenance is unclear.This PR adds no new error handling paths, meaning that incorrect UTF-8 will still produce incorrect results in e.g. IO functions, they are just not causing unsound behavior anymore. A subsequent PR will handle adding better error reporting for bad UTF-8.