Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22890 MF2: Add lone surrogate test to parser #3167

Merged
merged 1 commit into from
Sep 19, 2024

Conversation

catamorphism
Copy link
Contributor

@catamorphism catamorphism commented Sep 14, 2024

Also see #3166

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22890
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@catamorphism
Copy link
Contributor Author

Note that #3063 (not landed yet) and #3092 (draft) have more general fixes for keyword lookahead and wide character parsing, but I wanted to include a minimal solution here to address the infinite loop bug.

@FrankYFTang
Copy link
Contributor

@echeran you review other MessageFormatter changes before. Could you also review this PR? Thanks

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/messageformat2_parser.cpp is no longer changed in the branch
  • icu4c/source/i18n/messageformat2_parser.h is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism
Copy link
Contributor Author

I rebased, and it turns out that #3063 fully fixed the infinite loop bug, so I removed all the commits except the one with the new tests.

@FrankYFTang FrankYFTang self-requested a review September 18, 2024 17:54
@FrankYFTang
Copy link
Contributor

Could you change the title of this PR from "Fix infinity loop in parser" to "Add lone surrogate test to MessageFormatter2 Parser" Thanks

@FrankYFTang
Copy link
Contributor

Could you also add similar test to Java ?

@catamorphism catamorphism changed the title ICU-22890: Fix infinite loop in parser ICU-22890: MF2: Add lone surrogate test to parser Sep 18, 2024
@catamorphism
Copy link
Contributor Author

@FrankYFTang First, I added checks to your ICU4C test so as to require a syntax error.

I also added a test for Java. I was hoping to add this to the shared data-driven tests, but I couldn't figure out how to escape the unpaired surrogate strings for JSON. So, the tests are separate for now.

ICU4J wasn't erroring out on this case (though there was no infinite loop either), so I fixed that -- cc @mihnita. But maybe that should be a separate PR, since the ICU4J bug isn't critical. What do you think?

.build(errorCode);
UnicodeString result = msgfmt1.formatToString({}, errorCode);
assertEquals("testHighLoneSurrogate", U_MF_SYNTAX_ERROR, errorCode);
errorCode.reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we reset the errorCode here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that in intltest, each test has to reset the error code so it doesn't carry over to the next test -- is that not the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, errorCode is a local variable in this test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove the reset

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remove the errorCode.reset() call, then the test fails. I assume this has to do with the reference to this:

    IcuTestErrorCode errorCode(*this, "testHighLoneSurrogate");

and that there's some shared state that results in an error if the error code is non-success at the end of a test.

A lot of other tests in intltest have this pattern (errorCode declared as a local IcuTestErrorCode variable, and then errorCode.reset() at the end of the method.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so sorry, I got confused, it should be

errorCode.expectErrorAndReset(U_MF_SYNTAX_ERROR, "testHighLoneSurrogate");

instead of

assertEquals("testHighLoneSurrogate", U_MF_SYNTAX_ERROR, errorCode);
errorCode.reset();

same below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fixed in a9a6de2

.build(errorCode);
UnicodeString result = msgfmt2.formatToString({}, errorCode);
assertEquals("testLowLoneSurrogate", U_MF_SYNTAX_ERROR, errorCode);
errorCode.reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we reset the errorCode here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above

FrankYFTang
FrankYFTang previously approved these changes Sep 18, 2024
Copy link
Contributor

@FrankYFTang FrankYFTang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@catamorphism
Copy link
Contributor Author

I haven't been able to reproduce the fuzzer failure in the latest CI run yet, but I'll keep trying.

@mihnita
Copy link
Contributor

mihnita commented Sep 18, 2024

Unpaired surrogates are not an error, according to the spec.
We had a long argument about that a while ago, I can try to dig it out.

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/unicode/messageformat2.h is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism
Copy link
Contributor Author

135a911 should fix the fuzzer error.

@catamorphism
Copy link
Contributor Author

@mihnita:

Unpaired surrogates are not an error, according to the spec. We had a long argument about that a while ago, I can try to dig it out.

I think they're a syntax error?

simple-message    -> o [simple-start pattern]
simple-start -> simple-start-char / escaped-char / placeholder
simple-start-char -> content-char / "@" / "|"

And the definition of content-char excludes surrogates. Am I missing something?

FrankYFTang
FrankYFTang previously approved these changes Sep 19, 2024
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism catamorphism changed the title ICU-22890: MF2: Add lone surrogate test to parser ICU-22890 MF2: Add lone surrogate test to parser Sep 19, 2024
@catamorphism
Copy link
Contributor Author

I won't merge this right away so that @mihnita has a chance to reply further re: whether unpaired surrogates are a syntax error.

@mihnita
Copy link
Contributor

mihnita commented Sep 19, 2024

https://github.com/unicode-org/message-format-wg/blob/main/meetings/2022/notes-2022-06-13.md

RGN: Surrogate code points. Those are code points reserved for representing code points in UTF-16 that are beyond the first plane (BMP) of 2^16 code points.

MIH: I understand what you mean. But we also implement this is C and Java, and so on. So what should we do if we receive a message with invalid UTF-8 code points. Do we expect to replace them with the replacement character, or do we just pass them through?

RGN: I think what you're asking about, using JavaScript as a concrete example, is that a JS string is allowed to have unpaired surrogates. So the question is a question for the JS adapter / implementation, but that's not a question for the standard itself.

MIH: So we leave it to the implementation?

RGN: Yes.

MIH: Okay, that is fine with me.

If the spec ended up saying something else, I am really not happy about it.
I will try to track it down when this got in and how.

I don't want to reopen that discussion.
But it is not the job of the MF2 to validate correct / incorrect UTF sequences.
Not in the message proper.
I can live with it in the code part. But once we are in the pattern, we should not.

And I would not make such a change so very late.

One might make an argument about the C++ implementation.

But Java is UTF-16 everywhere, and it does not "explode" on improperly paired surrogates.
And was explicitly mentioned and accepted in the quoted thread.

(OK, deep in the belly of String there is an optimization making some strings Latin1, but that is not visible in the public APIs, it is only an implementation detail, for size)

@catamorphism
Copy link
Contributor Author

@mihnita Looks like unicode-org/message-format-wg#290 is the PR that introduced this (from Aug. 2022).

@mihnita
Copy link
Contributor

mihnita commented Sep 19, 2024

I will open an issue with MessageFormat WG.

This is really unfriendly for Java, JavaScript, Windows C "wide APIs" (that take a wchar_t, which is 16 bit).

It is not the job of a formatting function to validate and reject UTF corectness, at least in the above mentioned environments.

For example in Java we have String.format, java.text.MessageFormat, com.ibm.icu.text.MessageFormat, all of them work fine with unpaired surrogates.

I will make my case in the WG issue.

But I am against making this change in Java with this PR, sorry.

Copy link
Contributor

@mihnita mihnita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the Java changes from the PR.

@markusicu
Copy link
Member

We normally treat unpaired surrogates like unassigned code points.

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4j/main/core/src/main/java/com/ibm/icu/message2/InputSource.java is no longer changed in the branch
  • icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java is no longer changed in the branch
  • icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/MessageFormat2Test.java is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism
Copy link
Contributor Author

OK -- I've removed the Java test and changes. Needs a re-approval.

mihnita
mihnita previously approved these changes Sep 19, 2024
Copy link
Contributor

@mihnita mihnita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much!
Mihai

Add a test to ICU4C for handling of lone surrogates.

Incidentally fix uninitialized-memory bug in MessageFormatter
(initialize `errors` to nullptr)

Co-authored-by: Frank Tang <[email protected]>
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/unicode/messageformat2.h is no longer changed in the branch
  • icu4c/source/test/intltest/messageformat2test.cpp is different
  • icu4c/source/test/intltest/messageformat2test.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@mihnita
Copy link
Contributor

mihnita commented Sep 19, 2024

I've created unicode-org/message-format-wg#895

@catamorphism
Copy link
Contributor Author

Sorry @mihnita -- one more approval? I had to rebase because I landed #3083 first.

@catamorphism catamorphism merged commit 5991c93 into unicode-org:main Sep 19, 2024
94 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants