ICU-22940 MF2 ICU4C: Update for bidi support #3236

catamorphism · 2024-10-08T21:43:43Z

The tests in this PR are also included in a PR against the MF2 spec. However, they will need some editing unless #3198 (matching on variables instead of expressions) lands first: the spec changed the syntax of .match constructs, and some of the tests include .match constructs.

Checklist

Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22940
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

jira-pull-request-webhook · 2024-10-08T22:33:02Z

Notice: the branch changed across the force-push!

icu4c/source/test/intltest/messageformat2test_utils.h is different
testdata/message2/bidi.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-10-08T22:33:27Z

Notice: the branch changed across the force-push!

icu4c/source/test/intltest/messageformat2test.cpp is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

srl295 · 2024-10-18T18:32:09Z

icu4c/source/i18n/messageformat2_parser.cpp

-           inRange(c, 0x00F8, 0x02FF) || inRange(c, 0x0370, 0x037D) || inRange(c, 0x037F, 0x1FFF) ||
-           inRange(c, 0x200C, 0x200D) || inRange(c, 0x2070, 0x218F) || inRange(c, 0x2C00, 0x2FEF) ||
+           inRange(c, 0x00F8, 0x02FF) || inRange(c, 0x0370, 0x037D) || inRange(c, 0x037F, 0x061B) ||
+           inRange(c, 0x061D, 0x200D) || inRange(c, 0x2070, 0x218F) || inRange(c, 0x2C00, 0x2FEF) ||


@catamorphism I see that ALM (U+061C) is not name-start, but this change makes isNameStart true for U+2000…U+200B. They are dashes and spaces, and not ID_Start.

BTW, I think this would be far more reliable using a UnicodeSet. That can be created as the C++ equivalent of a static final immutable object.

@srl295 Fixed

@macchiati Done in 780a947
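
To make the range problem above concrete, here is a minimal sketch using only the two affected ranges from the diff. It is not the ICU source: `inRange` is reimplemented here, and `nameStartBefore` / `nameStartAfter` are hypothetical names covering just the excerpted ranges. It shows that carving out U+061C by extending the second range down to U+061D also pulls in U+2000…U+200B.

```cpp
#include <cassert>

// Hypothetical stand-in for the parser's range helper (inclusive bounds).
static bool inRange(char32_t c, char32_t first, char32_t last) {
    return c >= first && c <= last;
}

// Only the two ranges touched by the removed lines of the diff.
static bool nameStartBefore(char32_t c) {
    return inRange(c, 0x037F, 0x1FFF) || inRange(c, 0x200C, 0x200D);
}

// Only the two ranges touched by the added lines of the diff.
static bool nameStartAfter(char32_t c) {
    return inRange(c, 0x037F, 0x061B) || inRange(c, 0x061D, 0x200D);
}
```

With these definitions, U+061C is excluded after the change as intended, but U+2000 and U+200B switch from rejected to accepted. A set-based representation, as suggested above, would make such an accidental merge easier to spot.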

srl295 · 2024-10-18T18:33:13Z

icu4c/source/i18n/messageformat2_parser.cpp

@@ -125,7 +125,13 @@ static bool isContentChar(UChar32 c) {
           || inRange(c, 0xE000, 0x10FFFF);
 }

-// See `s` in the MessageFormat 2 grammar
+// See `bidi` in the MF2 grammar
+static bool isBidi(UChar32 c) {


maybe isBidiControl might be better?

Done in 780a947

srl295 · 2024-11-05T22:50:52Z

needs squash but LGTM, seems like all issues addressed

jira-pull-request-webhook · 2024-11-06T00:07:59Z

Notice: the branch changed across the force-push!

icu4c/source/i18n/messageformat2_formatter.cpp is now changed in the branch
icu4c/source/i18n/messageformat2_parser.cpp is different
icu4c/source/i18n/messageformat2_parser.h is different
icu4c/source/i18n/ucln_in.h is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-11-06T00:24:09Z

Notice: the branch changed across the force-push!

icu4c/source/i18n/messageformat2_formatter.cpp is no longer changed in the branch
icu4c/source/i18n/messageformat2_parser.cpp is different
icu4c/source/i18n/messageformat2_parser.h is different
icu4c/source/i18n/ucln_in.h is no longer changed in the branch
icu4c/source/test/intltest/messageformat2test.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-11-06T00:29:59Z

Notice: the branch changed across the force-push!

icu4c/source/i18n/messageformat2_formatter.cpp is now changed in the branch
icu4c/source/i18n/messageformat2_parser.cpp is different
icu4c/source/i18n/messageformat2_parser.h is different
icu4c/source/i18n/ucln_in.h is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-11-06T21:51:12Z

Notice: the branch changed across the force-push!

icu4c/source/i18n/messageformat2_formatter.cpp is different
icu4c/source/i18n/messageformat2_parser.cpp is different
icu4c/source/i18n/messageformat2_parser.h is different
icu4c/source/i18n/messageformat2.cpp is now changed in the branch
icu4c/source/test/intltest/messageformat2test_read_json.cpp is different
icu4c/source/test/intltest/messageformat2test_utils.h is different
icu4c/source/test/intltest/messageformat2test.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

catamorphism · 2024-11-06T21:54:35Z

I added 5eb2b7d to fix a bug that was actually in a previous PR, #3239, but wasn't caught when merging that PR because of a bug in the test runner.

This PR incidentally fixes that bug, so the test failure showed up here when I rebased this PR against main, and I'm fixing it here. (Will squash after @srl295 gets a chance to look at it, I left it unsquashed so far just so it's clear what I changed after he reviewed the PR.)

jira-pull-request-webhook · 2024-12-09T18:18:12Z

Notice: the branch changed across the force-push!

icu4c/source/i18n/messageformat2_parser.cpp is different
icu4c/source/i18n/messageformat2_parser.h is different
icu4c/source/i18n/messageformat2.cpp is different
icu4c/source/i18n/ucln_in.h is different
icu4c/source/test/intltest/messageformat2test_read_json.cpp is different
icu4c/source/test/intltest/messageformat2test_utils.h is different
icu4c/source/test/intltest/messageformat2test.cpp is different
testdata/message2/bidi.json is different
testdata/message2/matches-whitespace.json is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

catamorphism · 2024-12-09T19:28:38Z

The fuzzer is reporting a timeout bug with this test data: https://github.com/unicode-org/icu/actions/runs/12241776312/artifacts/2295676981

I downloaded the artifact and tried running it. The test string is "\u007b\u0000\u002f\u0000\u0067\u0020\u007d\u0000\u0000\u0000\u0000\u000c". ~~I can't reproduce the timeout locally.~~ I was able to reproduce with the instructions at https://unicode-org.github.io/icu/userguide/dev/fuzzer_targets.html#how-to-locally-reproduce-fuzzer-findings . Working on it.

catamorphism · 2024-12-09T21:11:46Z

I think I've found the cause of the timeout bug; if so, 41994fa fixes it. There's a loop in the parser that checks for the presence of a syntax error and exits if there already is one, so that it doesn't loop infinitely. But if the UErrorCode is already set, the operation that adds a syntax error will fail, hasSyntaxError() will keep returning false, and the loop never exits.

What I don't understand is why the UErrorCode is being set when running this test. I still can't reproduce the failure outside of the fuzzer, and I only have limited ability to run gdb on the fuzzer since it's chroot'ed.
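
The failure mode described above can be modelled in a few lines. This is a sketch under stated assumptions, not the parser code: `Status`, `Errors`, `loopUnguarded`, `loopGuarded`, and the `maxIters` cap are all hypothetical, and the guard shown is only one possible fix, not necessarily what 41994fa does. The key assumption is that recording a syntax error is a no-op when the status already signals failure.

```cpp
#include <cassert>

// Hypothetical stand-in for UErrorCode; not the ICU type.
enum class Status { Ok, Failure };

// Hypothetical error collector that does nothing when the incoming
// status already signals failure.
struct Errors {
    bool syntaxError = false;
    void addSyntaxError(Status& status) {
        if (status != Status::Ok) {
            return;  // the syntax error is silently not recorded
        }
        syntaxError = true;
    }
    bool hasSyntaxError() const { return syntaxError; }
};

// Loop that relies only on hasSyntaxError() to terminate. maxIters caps
// the sketch so that it returns instead of spinning forever.
int loopUnguarded(Errors& errors, Status& status, int maxIters) {
    int iters = 0;
    while (iters < maxIters) {
        ++iters;
        if (errors.hasSyntaxError()) {
            break;
        }
        errors.addSyntaxError(status);  // input never matches: report it
    }
    return iters;
}

// Same loop, with an additional exit when the status is already a failure.
int loopGuarded(Errors& errors, Status& status, int maxIters) {
    int iters = 0;
    while (iters < maxIters) {
        ++iters;
        if (status != Status::Ok || errors.hasSyntaxError()) {
            break;
        }
        errors.addSyntaxError(status);
    }
    return iters;
}
```

With a clean status, both loops stop on the second iteration. With a failing status, the unguarded loop runs until the cap, which corresponds to the fuzzer timeout, while the guarded loop exits immediately.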

catamorphism · 2024-12-09T23:08:40Z

Will squash after final review.

srl295

fixes LGTM

Per unicode-org/message-format-wg#884

jira-pull-request-webhook · 2024-12-11T20:09:32Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

FrankYFTang · 2024-12-12T19:58:08Z

icu4c/source/i18n/messageformat2_parser.h

 	  parseError.line = 0;
 	  parseError.offset = 0;
 	  parseError.lengthBeforeCurrentLine = 0;
 	  parseError.preContext[0] = '\0';
 	  parseError.postContext[0] = '\0';
 	}

+        UnicodeSet initContentChars(UErrorCode& status);


Is the return type of these initX methods "UnicodeSet", or should it be "UnicodeSet*"?

Should be UnicodeSet*, but these declarations aren't actually necessary as the functions are defined before they're used.

FrankYFTang · 2024-12-12T19:58:53Z

icu4c/source/i18n/messageformat2_parser.cpp

+
+UnicodeSet* initContentChars(UErrorCode& status) {
+    if (U_FAILURE(status)) {
+        return {};


Why do you return {} here but nullptr later? What is the difference?

There is no difference, but I'll change it to nullptr for consistency.

FrankYFTang · 2024-12-12T20:10:27Z

icu4c/source/i18n/messageformat2_parser.cpp

+        return {};
+    }
+
+    UnicodeSet* result = new UnicodeSet(*unisets::getImpl(unisets::ALPHA));


Please check that unisets::getImpl(unisets::ALPHA) is not nullptr before dereferencing it. I understand that in the current code, this code inside initNameStartChars is only reached if U_FAILURE(status) is false after initAlpha(), but nothing enforces that initNameStartChars is always called after initAlpha, or that the calls will not be moved around in the future. The assumption that unisets::getImpl(unisets::ALPHA) is not nullptr here is therefore very weak.

Could we change it to:

    UnicodeSet* isAlpha = unisets::getImpl(unisets::ALPHA);
    if (isAlpha == nullptr) {
        status = U_MEMORY_ALLOCATION_ERROR;
        return nullptr;
    }
    UnicodeSet* result = new UnicodeSet(*isAlpha);

Will do this in a future PR.

FrankYFTang · 2024-12-12T20:10:48Z

icu4c/source/i18n/messageformat2_parser.cpp

+        status = U_MEMORY_ALLOCATION_ERROR;
+        return nullptr;
+    };
+    result->addAll(*unisets::getImpl(unisets::NAME_START));


same as above

Will fix in a future PR

FrankYFTang · 2024-12-12T20:11:00Z

icu4c/source/i18n/messageformat2_parser.cpp

+        status = U_MEMORY_ALLOCATION_ERROR;
+        return nullptr;
+    };
+    result->addAll(*unisets::getImpl(unisets::CONTENT));


Will fix in a future PR

FrankYFTang · 2024-12-12T20:11:09Z

icu4c/source/i18n/messageformat2_parser.cpp

+        status = U_MEMORY_ALLOCATION_ERROR;
+        return nullptr;
+    };
+    result->addAll(*unisets::getImpl(unisets::CONTENT));


Will fix in a future PR

FrankYFTang · 2024-12-12T20:16:42Z

icu4c/source/i18n/messageformat2_parser.cpp

+const UnicodeSet* get(Key key) {
+    UErrorCode localStatus = U_ZERO_ERROR;
+    umtx_initOnce(gMF2ParseUniSetsInitOnce, &initMF2ParseUniSets, localStatus);
+    if (U_FAILURE(localStatus)) {


If an error occurs the first time the code calls initMF2ParseUniSets, localStatus will hold the error and get() will return nullptr, but gUnicodeSets will still be only partially initialized. For example, gUnicodeSets[unisets::TEXT] may be nullptr if initNameChars failed. Later on, a nullptr will then be passed into contentChars, and a subsequent contentChars->contains() call will dereference it.

Will fix in a future PR (by making get() take a UErrorCode).

FrankYFTang · 2024-12-12T20:20:52Z

I found the error handling code in this PR very weak. If any of the new operations during initMF2ParseUniSets returns nullptr because memory is exhausted, the code will dereference a nullptr later on.

FrankYFTang · 2024-12-12T20:25:25Z

icu4c/source/i18n/messageformat2_parser.h

+            bidiControlChars(unisets::get(unisets::BIDI)),
+            alphaChars(unisets::get(unisets::ALPHA)),
+            digitChars(unisets::get(unisets::DIGIT)),
+            nameStartChars(unisets::get(unisets::NAME_START)),


Note that with the current code, some of these unisets::get() calls may return nullptr while others do not, which causes a nullptr dereference later.

FrankYFTang · 2024-12-12T20:34:30Z

icu4c/source/i18n/messageformat2_parser.cpp

+}
+
+UnicodeSet* initNameStartChars(UErrorCode& status) {
+    if (U_FAILURE(status)) {


You have an implicit requirement that initNameStartChars must be called after initAlpha, or it may crash here. I think this is dangerous. Could we enforce that explicitly, for example by calling initAlpha at the beginning of initNameStartChars()? There is also an implicit assumption that the status here is the one that was passed to initAlpha, so that if initAlpha fails, the status here will contain that failure.

Will fix in a future PR
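
One way to make the ordering requirement explicit is for the dependent initializer to obtain its prerequisite itself. The following is a minimal sketch of that idea, not the ICU code: `Status`, `CodePointSet` (a `std::set` standing in for UnicodeSet), `Sets`, `ensureAlpha`, and `initNameStart` are all hypothetical, and the set contents are illustrative only.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <set>

// Hypothetical stand-ins; not the ICU types.
enum class Status { Ok, MemoryAllocationError };
using CodePointSet = std::set<char32_t>;

struct Sets {
    std::unique_ptr<CodePointSet> alpha;
    std::unique_ptr<CodePointSet> nameStart;
};

// Builds the alpha set on first use and returns it; returns nullptr and
// leaves everything untouched if status already signals failure.
const CodePointSet* ensureAlpha(Sets& sets, Status& status) {
    if (status != Status::Ok) {
        return nullptr;
    }
    if (!sets.alpha) {
        sets.alpha.reset(new (std::nothrow) CodePointSet);
        if (!sets.alpha) {
            status = Status::MemoryAllocationError;
            return nullptr;
        }
        for (char32_t c = U'a'; c <= U'z'; ++c) sets.alpha->insert(c);
        for (char32_t c = U'A'; c <= U'Z'; ++c) sets.alpha->insert(c);
    }
    return sets.alpha.get();
}

// The dependency on alpha is enforced here rather than assumed from the
// order in which a caller happens to run the initializers.
const CodePointSet* initNameStart(Sets& sets, Status& status) {
    const CodePointSet* alpha = ensureAlpha(sets, status);
    if (alpha == nullptr) {
        return nullptr;  // status already carries the failure
    }
    sets.nameStart.reset(new (std::nothrow) CodePointSet(*alpha));
    if (!sets.nameStart) {
        status = Status::MemoryAllocationError;
        return nullptr;
    }
    sets.nameStart->insert(U'_');  // illustrative extra member
    return sets.nameStart.get();
}
```

Calling `initNameStart` on a fresh `Sets` builds both sets regardless of call order, and a failing status on entry results in a nullptr return without any dereference.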

FrankYFTang · 2024-12-12T20:39:44Z

icu4c/source/i18n/messageformat2_parser.cpp

+    gUnicodeSets[unisets::BIDI] = initBidiControls(status);
+    gUnicodeSets[unisets::ALPHA] = initAlpha(status);
+    gUnicodeSets[unisets::DIGIT] = initDigits(status);
+    gUnicodeSets[unisets::NAME_START] = initNameStartChars(status);


Consider the case where initMF2ParseUniSets is called for the first time while memory is almost full, and initContentChars, initWhitespace and initBidiControls all succeed but initAlpha fails. In that case, every entry except CONTENT, WHITESPACE and BIDI will be nullptr, and initMF2ParseUniSets will report an error in status on that first call. So the first get() call will return nullptr, and later on getImpl may also return nullptr. I think you need to change
const UnicodeSet* get(Key key)
to
const UnicodeSet* get(Key key, UErrorCode& status)
so that the error is passed up to the Parser() constructor, and also set status to an error code if getImpl() returns nullptr.

Will fix in a future PR
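
The suggested signature change can be sketched in isolation as follows. This is an assumption-laden model rather than the eventual ICU change: `Status`, `Key`, `CodePointSet`, `SetTable`, and `getSet` are hypothetical names, and the one-time initialization step is omitted so that only the lookup behaviour is shown.

```cpp
#include <array>
#include <cassert>
#include <set>

// Hypothetical stand-ins; not the ICU types.
enum class Status { Ok, MemoryAllocationError };
enum Key { CONTENT, ALPHA, KEY_COUNT };
using CodePointSet = std::set<char32_t>;

// Table of lazily built sets; a slot stays nullptr if its initializer failed.
struct SetTable {
    std::array<const CodePointSet*, KEY_COUNT> slots{};
};

// Lookup that reports a missing set through status instead of silently
// handing a nullptr to the caller.
const CodePointSet* getSet(const SetTable& table, Key key, Status& status) {
    if (status != Status::Ok) {
        return nullptr;  // propagate an earlier failure unchanged
    }
    const CodePointSet* result = table.slots[key];
    if (result == nullptr) {
        status = Status::MemoryAllocationError;
    }
    return result;
}
```

A caller such as a constructor can then check status once after fetching all sets and refuse to proceed, rather than discovering a null pointer at the first `contains()` call.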

Improve checking for OOM errors when allocating UnicodeSets, per post-merge comments on unicode-org#3236

catamorphism · 2024-12-13T23:50:55Z

@FrankYFTang I submitted #3306 to apply the fixes you suggested. Thanks!

catamorphism force-pushed the bidi-tests branch from 911d047 to 703002a Compare October 8, 2024 22:32

catamorphism force-pushed the bidi-tests branch from 703002a to 01d2fdc Compare October 8, 2024 22:33

catamorphism marked this pull request as ready for review October 10, 2024 16:00

catamorphism requested review from echeran and mihnita October 10, 2024 16:01

catamorphism changed the title ~~ICU-22940 DRAFT: MF2 ICU4C: Update for bidi support~~ ICU-22940 MF2 ICU4C: Update for bidi support Oct 10, 2024

catamorphism requested a review from srl295 October 18, 2024 15:45

srl295 reviewed Oct 18, 2024

catamorphism mentioned this pull request Oct 25, 2024

ICU-22953 MF2: Allow unpaired surrogates in text and quoted literals #3256

Merged

7 tasks

srl295 previously approved these changes Nov 5, 2024

catamorphism dismissed srl295’s stale review via 01d2fdc November 6, 2024 00:01

catamorphism force-pushed the bidi-tests branch 2 times, most recently from 01d2fdc to 9054d0c Compare November 6, 2024 00:07

catamorphism force-pushed the bidi-tests branch from 9054d0c to 5ff8167 Compare November 6, 2024 00:24

catamorphism force-pushed the bidi-tests branch from 5ff8167 to cebadc4 Compare November 6, 2024 00:29

catamorphism force-pushed the bidi-tests branch from cebadc4 to 5eb2b7d Compare November 6, 2024 21:51

srl295 previously approved these changes Dec 9, 2024

catamorphism dismissed srl295’s stale review via 03cba1f December 9, 2024 18:18

catamorphism force-pushed the bidi-tests branch from 5eb2b7d to 03cba1f Compare December 9, 2024 18:18

srl295 approved these changes Dec 11, 2024

ICU-22940 MF2 ICU4C: Update for bidi support

63b7228

Per unicode-org/message-format-wg#884

catamorphism force-pushed the bidi-tests branch from e2353bd to 63b7228 Compare December 11, 2024 20:09

catamorphism merged commit 1b81180 into unicode-org:main Dec 11, 2024
101 checks passed

FrankYFTang reviewed Dec 12, 2024

catamorphism added a commit to catamorphism/icu that referenced this pull request Dec 13, 2024

ICU-22940 MF2 ICU4C: Error checking improvements in parser

d48eefc

Improve checking for OOM errors when allocating UnicodeSets, per post-merge comments on unicode-org#3236

catamorphism mentioned this pull request Dec 13, 2024

ICU-22940 MF2 ICU4C: Error checking improvements in parser #3306

Merged

7 tasks

catamorphism added a commit to catamorphism/icu that referenced this pull request Jan 10, 2025

ICU-22940 MF2 ICU4C: Error checking improvements in parser

cb61cf1

Improve checking for OOM errors when allocating UnicodeSets, per post-merge comments on unicode-org#3236

catamorphism added a commit that referenced this pull request Jan 10, 2025

ICU-22940 MF2 ICU4C: Error checking improvements in parser

f8aa68b

Improve checking for OOM errors when allocating UnicodeSets, per post-merge comments on #3236
