Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: validate UTF-8 at C++ -> Lean boundary #3963

Merged
merged 4 commits into from
Jun 19, 2024

Conversation

digama0
Copy link
Collaborator

@digama0 digama0 commented Apr 20, 2024

Continuation of #3958. To ensure that lean code is able to uphold the invariant that Strings are valid UTF-8 (which is assumed by the lean model), we have to make sure that no lean objects are created with invalid UTF-8. #3958 covers the case of lean code creating strings via fromUTF8Unchecked, but there are still many cases where C++ code constructs strings from a const char * or std::string with unclear UTF-8 status.

To address this and minimize accidental missed validation, the (lean_)mk_string function is modified to validate UTF-8. The original function is renamed to mk_string_unchecked, with several other variants depending on whether we know the string is UTF-8 or ASCII and whether we have the length and/or utf8 char count on hand. I reviewed every function which leads to mk_string or its variants in the C code, and used the appropriate validation function, defaulting to mk_string if the provenance is unclear.

This PR adds no new error handling paths, meaning that incorrect UTF-8 will still produce incorrect results in e.g. IO functions, they are just not causing unsound behavior anymore. A subsequent PR will handle adding better error reporting for bad UTF-8.

@digama0 digama0 requested a review from joehendrix as a code owner April 20, 2024 20:10
@digama0 digama0 changed the title fix: validate UTF-8 at C++ -> Lean gboundary fix: validate UTF-8 at C++ -> Lean boundary Apr 20, 2024
@nomeata nomeata enabled auto-merge April 20, 2024 20:13
@digama0 digama0 disabled auto-merge April 20, 2024 20:16
@digama0
Copy link
Collaborator Author

digama0 commented Apr 20, 2024

!bench

@digama0
Copy link
Collaborator Author

digama0 commented Apr 20, 2024

(disabled auto-merge to run a benchmark first)

@github-actions github-actions bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label Apr 20, 2024
@leanprover-community-mathlib4-bot
Copy link
Collaborator

leanprover-community-mathlib4-bot commented Apr 20, 2024

Mathlib CI status (docs):

  • ❗ Std/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase 62cdb51ed5b9d8487877d5a4adbcd4659d81fc6a --onto 291bb84c972dae470e8f5602729e9d5a5e9433a2. (2024-04-20 20:29:09)
  • ❗ Std/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase fb135b8cfe83c8d9286a9d0198114b7cb306a67a --onto 62cdb51ed5b9d8487877d5a4adbcd4659d81fc6a. (2024-04-23 21:32:17)

@leanprover-bot
Copy link
Collaborator

Here are the benchmark results for commit 88cad9a.
There were significant changes against commit 62cdb51:

  Benchmark   Metric             Change
  ================================================
+ stdlib      tactic execution    -1.9% (-302.2 σ)

@Kha Kha self-requested a review April 20, 2024 20:37
@digama0
Copy link
Collaborator Author

digama0 commented Apr 20, 2024

The affected tests are regression tests from #301, #1707, #1690, all of which appear to have been strange behaviors of lean caused by invalid UTF-8. The new behavior is for the C code to throw an unspecific error before even calling lean code; this is inevitable because the lean parser/elaborator assumes it can store the file in a String so we can't really use most of lean's error handling.

@leodemoura leodemoura requested a review from nomeata April 21, 2024 21:19
@@ -2064,11 +2066,11 @@ static inline uint8_t lean_version_get_is_release(lean_obj_arg _unit) {
}

static inline lean_obj_res lean_version_get_special_desc(lean_obj_arg _unit) {
return lean_mk_string(LEAN_SPECIAL_VERSION_DESC);
return lean_mk_ascii_string(LEAN_SPECIAL_VERSION_DESC);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d probably lean towards using mk_utf8_string here. It’s probably not performance critical, and maybe some German developer with a ü in their name builds lean from a branch with v0.7.8-Günnar….

Suggested change
return lean_mk_ascii_string(LEAN_SPECIAL_VERSION_DESC);
return lean_mk_utf8_string(LEAN_SPECIAL_VERSION_DESC);

}

static inline lean_obj_res lean_system_platform_target(lean_obj_arg _unit) {
return lean_mk_string(LEAN_PLATFORM_TARGET);
return lean_mk_ascii_string(LEAN_PLATFORM_TARGET);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return lean_mk_ascii_string(LEAN_PLATFORM_TARGET);
return lean_mk_utf8_string(LEAN_PLATFORM_TARGET);

Unlikely to matter, but better safe than sorry.

Copy link
Collaborator Author

@digama0 digama0 Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, the "better safe than sorry" option is lean_mk_string. If we have any reason to believe that the data isn't necessarily in the subset it appears to be in, we should just validate it. (Note: is it confusing that the function mk_utf8_string does not validate? Your comments here suggest that you've misunderstood the role of the functions in which case the name should probably be changed to something less confusing.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, after I wrote this and was away from the computer I paused wondered if my suggestion is bogus.

It’s not entirely bogus, though: If the there is utf8 in the TARGET, then mk_ascii_string will produce an invalid length. So if LEAN_SPECIAL_VERSION_DESC etc. happens to be utf8, then mk_utf8_string is better than mk_ascii_string, but not as safe as a validating function.

I agree that it’s worth changing the lean_mk_{utf8,utf8}_string function names to include `unchecked to avoid future bugs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on better names

@@ -40,7 +40,7 @@ extern "C" LEAN_EXPORT uint8 lean_system_platform_emscripten(obj_arg) {
#endif
}

extern "C" object * lean_get_githash(obj_arg) { return lean_mk_string(LEAN_GITHASH); }
extern "C" object * lean_get_githash(obj_arg) { return lean_mk_ascii_string(LEAN_GITHASH); }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also unlikely, but since the source is out of our (local) control, better safe than sorry.

Suggested change
extern "C" object * lean_get_githash(obj_arg) { return lean_mk_ascii_string(LEAN_GITHASH); }
extern "C" object * lean_get_githash(obj_arg) { return lean_mk_utf8_string(LEAN_GITHASH); }

src/include/lean/lean.h Outdated Show resolved Hide resolved
}

static inline lean_obj_res lean_system_platform_target(lean_obj_arg _unit) {
return lean_mk_string(LEAN_PLATFORM_TARGET);
return lean_mk_ascii_string(LEAN_PLATFORM_TARGET);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on better names

src/runtime/object.cpp Outdated Show resolved Hide resolved
if (validate_utf8((const uint8_t *)s, len)) {
return lean_mk_string_from_bytes(s, len);
} else {
return lean_mk_string_core("", 0, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a sensible fallback? I matches the panic value but there is no message. Shouldn't this raise an exception if done as part of I/O?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. I was planning to put in a follow-up PR which does something like rust's String::from_utf8_lossy with replacement characters. In most cases it's not really a good idea to rely on this for error handling, and instead code should call validate separately and throw IO errors etc as appropriate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to elaborate on this plan before going forward. It would also be helpful to mention such relevant plans in the PR description :) .

If we want to get to the point where no-one, especially not Lean APIs, depends on this branch, should it become a hard assertion?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a revision in the works which addresses many of the review comments. I'll post that and then we can talk about how we want to proceed here.

@digama0 digama0 requested a review from leodemoura as a code owner April 23, 2024 21:08
@digama0
Copy link
Collaborator Author

digama0 commented Apr 23, 2024

I pushed a new version of the PR. Summary of changes (w.r.t master, not ver.1):

  • lean_mk_string_unchecked(const char * s, size_t sz, size_t utf8_chars) was previously lean_mk_string_core
    • Use this when we have both the length and utf8_length values available and also know the input is valid UTF-8.
  • lean_mk_string_from_bytes_unchecked(const char * s, size_t sz) was lean_mk_string_from_bytes
    • Use this when we have the length and know the input is valid UTF-8, but need to compute the utf8_chars.
  • lean_mk_string_from_bytes(const char * s, size_t sz) does validation now, and if validation fails it converts invalid UTF-8 sequences to U+FFFD instead of just returning the empty string.
    • Use this when we have a (char*, len) pair but do not know it encodes valid UTF-8
  • lean_mk_string(const char * s) is a wrapper around lean_mk_string_from_bytes
    • Use this when we have char*, not necessarily UTF-8 and need strlen to get the length
  • lean_string_from_utf8_unchecked was lean_string_from_utf8
    • This is the implementation of String.fromUTF8, you should probably not call it directly. I added _unchecked to the name because the C version doesn't validate, even though the lean version takes a validity proof witness as an argument.
  • lean_mk_ascii_string_unchecked(char const * s) was lean_mk_ascii_string
    • Use this when you know the string is valid ASCII but need strlen to get the length.
  • lean::mk_ascii_string_unchecked(std::string const & s) was lean::mk_ascii_string
    • Use this when you know the string is valid ASCII and have a std::string (which counts the length).

Other (non-API) changes:

  • The code generator produces calls to lean_mk_string_unchecked now instead of lean_mk_string_from_bytes, since we don't want to validate string literals, plus we can calculate the utf8_chars in advance.
  • bool validate_utf8(uint8_t const * str, size_t size, size_t & pos, size_t & i) also returns the byte position of the first bad character, which can be used for better error reporting in addition to its use here as a fast path for the replacement character algorithm.
  • I removed the special error handling for invalid UTF-8 in lean files. These cases, in addition to the IO uses of mk_string, are handled by the replacement algorithm, meaning that lean will complain in the bad UTF-8 tests with a "more normal" error saying that U+FFFD is not a token; if you stick it inside a comment or guillemets it will be accepted like any other unicode character. This is not necessarily a good choice, but at least it's a less damaging default behavior than erasing the whole string. You can however do separate validation, by calling validate_utf8 and lean_mk_string_unchecked instead of lean_mk_string.

@nomeata
Copy link
Collaborator

nomeata commented Apr 24, 2024

Sounds like a nice and thorough cleanup, thanks a lot!

I'd lean towards rejecting invalid utf8 in lean files (not replacing, not using the empty string). Who knows what kind of surprises will occur when people accidentally have invalid utf8 and lean does something, but not what they expect to.

@Kha
Copy link
Member

Kha commented Apr 24, 2024

I'm happy as well modulo below -

I'd lean towards rejecting invalid utf8 in lean files (not replacing, not using the empty string)

Agreed, and more generally I/O functions should at least by default throw an error when decoding fails. I'm pretty sure this is the default in all the big stdlibs out there as well, no?

@digama0
Copy link
Collaborator Author

digama0 commented Apr 24, 2024

I agree, but can we agree that this is an improvement on the status quo and do that in a follow up PR? Handling UTF-8 errors properly means having new kinds of error reporting functions we don't have currently, as well as touching a bunch more code. The main purpose of this PR is to avoid undefined behavior in the present version caused by invalid lean objects being created, not to add new error reporting mechanisms.

@Kha
Copy link
Member

Kha commented Apr 25, 2024

Sure, I just want to make sure it is documented and discussed that this is a partial solution in that way. That is very relevant context for reviewing.

@nomeata
Copy link
Collaborator

nomeata commented Apr 25, 2024

Sounds like as soon as the PR description reflects the current state I can press the button?

@digama0
Copy link
Collaborator Author

digama0 commented Apr 25, 2024

PR description updated

@Kha Kha added this pull request to the merge queue Apr 25, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 25, 2024
@nomeata
Copy link
Collaborator

nomeata commented Apr 25, 2024

Linux-debug fails with

1658/1862 Test #1661: leanruntest_subset.lean ...................................***Failed    1.07 sec
../../common.sh: line 42: 48489 Aborted                 (core dumped) LEAN_BACKTRACE=0 "$@" 2>&1
     48490 Done                    | perl -pe 's/(\?(\w|_\w+))\.[0-9]+/\1/g' > "$f.produced.out"
Unexpected return code 134 executing 'lean -Dlinter.all=false subset.lean'; expected 0. Output:
LEAN ASSERTION VIOLATION
File: /home/runner/work/lean4/lean4/build/stage1/include/lean/lean.h
Line: 473
lean_is_string(o)
(C)ontinue, (A)bort/exit, (S)top/trap

@Kha
Copy link
Member

Kha commented May 21, 2024

@digama0 Do you have time to look into the assertion failure?

@Kha Kha removed the full-ci label May 24, 2024
@kim-em kim-em self-requested a review as a code owner May 29, 2024 16:23
@Kha Kha added the awaiting-author Waiting for PR author to address issues label Jun 17, 2024
@digama0
Copy link
Collaborator Author

digama0 commented Jun 18, 2024

Sorry for the delay, the assertion failure should be fixed now.

@digama0 digama0 added full-ci and removed awaiting-author Waiting for PR author to address issues labels Jun 18, 2024
leanprover-community-mathlib4-bot added a commit to leanprover-community/batteries that referenced this pull request Jun 18, 2024
leanprover-community-mathlib4-bot added a commit to leanprover-community/mathlib4 that referenced this pull request Jun 18, 2024
@Kha Kha added this pull request to the merge queue Jun 19, 2024
Merged via the queue into leanprover:master with commit 0a1a855 Jun 19, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants