ICU-22940 MF2 ICU4C: Update for bidi support #3236

Merged: 1 commit, Dec 11, 2024

Contributor

@catamorphism catamorphism commented Oct 8, 2024

The tests in this PR are also included in a PR against the MF2 spec. However, some editing will have to occur unless #3198 (matching on variables instead of expressions) lands before then. There was a spec change to the syntax of .match constructs, and some of the tests include .match constructs.

Checklist

  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22940
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/test/intltest/messageformat2test_utils.h is different
  • testdata/message2/bidi.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/test/intltest/messageformat2test.cpp is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism catamorphism marked this pull request as ready for review October 10, 2024 16:00
@catamorphism catamorphism changed the title ICU-22940 DRAFT: MF2 ICU4C: Update for bidi support ICU-22940 MF2 ICU4C: Update for bidi support Oct 10, 2024
@catamorphism catamorphism requested a review from srl295 October 18, 2024 15:45
- inRange(c, 0x00F8, 0x02FF) || inRange(c, 0x0370, 0x037D) || inRange(c, 0x037F, 0x1FFF) ||
- inRange(c, 0x200C, 0x200D) || inRange(c, 0x2070, 0x218F) || inRange(c, 0x2C00, 0x2FEF) ||
+ inRange(c, 0x00F8, 0x02FF) || inRange(c, 0x0370, 0x037D) || inRange(c, 0x037F, 0x061B) ||
+ inRange(c, 0x061D, 0x200D) || inRange(c, 0x2070, 0x218F) || inRange(c, 0x2C00, 0x2FEF) ||
Member

@catamorphism I see that ALM (U+061C) is not name-start, but this change makes isNameStart true for U+2000…U+200B. They are dashes and spaces, and not ID_Start.
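
The effect described in this comment can be checked with a minimal sketch. It copies only the two sub-ranges that differ between the old and new lines; the rest of the name-start condition is left out. `CodePoint`, `oldFragment`, and `newFragment` are hypothetical names used for illustration, and `CodePoint` stands in for ICU's `UChar32` so that the sketch compiles without ICU headers.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for ICU's UChar32 (an assumption made so no ICU headers are needed).
using CodePoint = std::int32_t;

constexpr bool inRange(CodePoint c, CodePoint first, CodePoint last) {
    return c >= first && c <= last;
}

// Old sub-ranges: U+037F..U+1FFF and U+200C..U+200D.
constexpr bool oldFragment(CodePoint c) {
    return inRange(c, 0x037F, 0x1FFF) || inRange(c, 0x200C, 0x200D);
}

// New sub-ranges: U+037F..U+061B and U+061D..U+200D.
constexpr bool newFragment(CodePoint c) {
    return inRange(c, 0x037F, 0x061B) || inRange(c, 0x061D, 0x200D);
}

// U+061C is excluded by the change, as intended.
static_assert(oldFragment(0x061C) && !newFragment(0x061C), "U+061C excluded");
// U+2000..U+200B were outside the old ranges but fall inside the new one.
static_assert(!oldFragment(0x2000) && newFragment(0x2000), "U+2000 now included");
static_assert(!oldFragment(0x200B) && newFragment(0x200B), "U+200B now included");
```

Splitting the second range again at U+1FFF and U+200C would restore the original gap.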

Member

BTW, I think this would be far more reliable using a UnicodeSet. That can be created as the C++ equivalent of a static final immutable object.
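
As a rough illustration of "static final immutable" in C++, the sketch below builds a set once on first use and only ever hands out a const reference. It deliberately does not use ICU: `CodePointSet` and `sampleNameStartRanges` are hypothetical stand-ins, and the ranges are just four of the ranges quoted in the diff above, not the full name-start definition. The code later in this thread uses `umtx_initOnce` for the one-time initialization instead of a function-local static.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using CodePoint = std::int32_t;                  // stand-in for UChar32
using Range = std::pair<CodePoint, CodePoint>;   // inclusive [first, last]

// Hypothetical immutable set of code point ranges (not ICU's UnicodeSet).
class CodePointSet {
public:
    explicit CodePointSet(std::vector<Range> ranges) : ranges_(std::move(ranges)) {}

    bool contains(CodePoint c) const {
        for (const Range& r : ranges_) {
            if (c >= r.first && c <= r.second) return true;
        }
        return false;
    }

private:
    const std::vector<Range> ranges_;  // never modified after construction
};

// Constructed once, on first call; callers only ever see a const reference.
inline const CodePointSet& sampleNameStartRanges() {
    static const CodePointSet set(std::vector<Range>{
        {0x00F8, 0x02FF}, {0x0370, 0x037D}, {0x2070, 0x218F}, {0x2C00, 0x2FEF}});
    return set;
}
```

Keeping the ranges in one data table makes a gap such as U+2000…U+200B easier to spot than in a chain of `||` expressions.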

Contributor Author

@srl295 Fixed

@macchiati Done in 780a947

@@ -125,7 +125,13 @@ static bool isContentChar(UChar32 c) {
|| inRange(c, 0xE000, 0x10FFFF);
}

// See `s` in the MessageFormat 2 grammar
// See `bidi` in the MF2 grammar
static bool isBidi(UChar32 c) {
Member

Maybe isBidiControl would be a better name?

Contributor Author

Done in 780a947

srl295 previously approved these changes Nov 5, 2024
@srl295
Member

srl295 commented Nov 5, 2024

needs squash but LGTM, seems like all issues addressed

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/messageformat2_formatter.cpp is now changed in the branch
  • icu4c/source/i18n/messageformat2_parser.cpp is different
  • icu4c/source/i18n/messageformat2_parser.h is different
  • icu4c/source/i18n/ucln_in.h is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/messageformat2_formatter.cpp is no longer changed in the branch
  • icu4c/source/i18n/messageformat2_parser.cpp is different
  • icu4c/source/i18n/messageformat2_parser.h is different
  • icu4c/source/i18n/ucln_in.h is no longer changed in the branch
  • icu4c/source/test/intltest/messageformat2test.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/messageformat2_formatter.cpp is now changed in the branch
  • icu4c/source/i18n/messageformat2_parser.cpp is different
  • icu4c/source/i18n/messageformat2_parser.h is different
  • icu4c/source/i18n/ucln_in.h is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/messageformat2_formatter.cpp is different
  • icu4c/source/i18n/messageformat2_parser.cpp is different
  • icu4c/source/i18n/messageformat2_parser.h is different
  • icu4c/source/i18n/messageformat2.cpp is now changed in the branch
  • icu4c/source/test/intltest/messageformat2test_read_json.cpp is different
  • icu4c/source/test/intltest/messageformat2test_utils.h is different
  • icu4c/source/test/intltest/messageformat2test.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism
Contributor Author

I added 5eb2b7d to fix a bug that was actually introduced in a previous PR, #3239, but wasn't caught when merging that PR because of a bug in the test runner.

This PR incidentally fixes the test runner bug, so the test failure showed up here when I rebased this PR against main, and I'm fixing it here. (I will squash after @srl295 gets a chance to look at it. I have left it unsquashed so far so that it's clear what I changed after he reviewed the PR.)

srl295 previously approved these changes Dec 9, 2024
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/messageformat2_parser.cpp is different
  • icu4c/source/i18n/messageformat2_parser.h is different
  • icu4c/source/i18n/messageformat2.cpp is different
  • icu4c/source/i18n/ucln_in.h is different
  • icu4c/source/test/intltest/messageformat2test_read_json.cpp is different
  • icu4c/source/test/intltest/messageformat2test_utils.h is different
  • icu4c/source/test/intltest/messageformat2test.cpp is different
  • testdata/message2/bidi.json is different
  • testdata/message2/matches-whitespace.json is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism
Contributor Author

catamorphism commented Dec 9, 2024

The fuzzer is reporting a timeout bug with this test data: https://github.com/unicode-org/icu/actions/runs/12241776312/artifacts/2295676981

I downloaded the artifact and tried running it. The test string is "\u007b\u0000\u002f\u0000\u0067\u0020\u007d\u0000\u0000\u0000\u0000\u000c". I can't reproduce the timeout by running that string locally outside the fuzzer, but I was able to reproduce it with the instructions at https://unicode-org.github.io/icu/userguide/dev/fuzzer_targets.html#how-to-locally-reproduce-fuzzer-findings . Working on it.

@catamorphism
Contributor Author

I think I've found the cause of the timeout bug; if so, 41994fa fixes it. There's a loop in the parser that checks for the presence of a syntax error and exits if there already is one, so that it doesn't loop infinitely. But if the UErrorCode is already set, the operation that adds a syntax error will fail and hasSyntaxError() will never return true, so the loop never exits.
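
The failure mode described here can be sketched in isolation. This is not the code from 41994fa, which is not shown in this thread; `Status`, `Errors`, and `skipUntil` are hypothetical stand-ins for `UErrorCode`, the parser's error collector, and the parser loop. The `maxSteps` cap exists only so the demonstration terminates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-ins for UErrorCode / U_FAILURE (assumptions, no ICU headers needed).
enum class Status { Ok, Failure };
inline bool failed(Status s) { return s != Status::Ok; }

// Hypothetical error collector: once the status reports a failure,
// adding a syntax error silently does nothing.
struct Errors {
    bool syntaxError = false;
    void addSyntaxError(Status& status) {
        if (failed(status)) return;  // the error is never recorded
        syntaxError = true;
    }
    bool hasSyntaxError() const { return syntaxError; }
};

// Scans for `stop`; running off the end of the input reports a syntax error,
// which is expected to end the loop. Returns the number of iterations executed.
inline std::size_t skipUntil(const std::string& input, char stop, Errors& errors,
                             Status& status, bool checkStatus, std::size_t maxSteps) {
    std::size_t index = 0;
    std::size_t steps = 0;
    while (!errors.hasSyntaxError() && steps < maxSteps) {
        if (checkStatus && failed(status)) break;  // extra exit condition
        ++steps;
        if (index >= input.size()) {
            errors.addSyntaxError(status);  // no-op when status is a failure
            continue;
        }
        if (input[index] == stop) break;
        ++index;
    }
    return steps;
}
```

With a failing status and `checkStatus` set to false, the loop only stops at the `maxSteps` cap; with `checkStatus` set to true it exits immediately.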

What I don't understand is why the UErrorCode is being set when running this test. I still can't reproduce the failure outside of the fuzzer, and I only have limited ability to run gdb on the fuzzer since it's chroot'ed.

@catamorphism
Contributor Author

Will squash after final review.

Member

@srl295 srl295 left a comment

fixes LGTM

@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism catamorphism merged commit 1b81180 into unicode-org:main Dec 11, 2024
101 checks passed
parseError.line = 0;
parseError.offset = 0;
parseError.lengthBeforeCurrentLine = 0;
parseError.preContext[0] = '\0';
parseError.postContext[0] = '\0';
}

UnicodeSet initContentChars(UErrorCode& status);
Contributor

Is the return type of these initX methods "UnicodeSet", or should it be "UnicodeSet*"?

Contributor Author

Should be UnicodeSet*, but these declarations aren't actually necessary as the functions are defined before they're used.


UnicodeSet* initContentChars(UErrorCode& status) {
if (U_FAILURE(status)) {
return {};
Contributor

Why do you return {} here but nullptr later? What is the difference?

Contributor Author

There is no difference, but I'll change it to nullptr for consistency.
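
A minimal sketch of why the two spellings are equivalent for a pointer return type: `return {};` value-initializes the returned pointer, which yields a null pointer. `Set`, `viaBraces`, and `viaNullptr` are hypothetical names used only for this illustration.

```cpp
#include <cassert>

struct Set {};  // placeholder type standing in for UnicodeSet

// Empty braces value-initialize the returned pointer, giving a null pointer.
inline Set* viaBraces() { return {}; }

// The explicit spelling states the intent directly.
inline Set* viaNullptr() { return nullptr; }
```

Both functions return the same value; `nullptr` simply reads more clearly when the return type is a raw pointer.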

return {};
}

UnicodeSet* result = new UnicodeSet(*unisets::getImpl(unisets::ALPHA));
Contributor

Please check that unisets::getImpl(unisets::ALPHA) is not nullptr before you dereference it. I understand that in the current code, this line inside initNameStartChars only runs if U_FAILURE(status) is false after initAlpha(), but nothing enforces that initNameStartChars is always called after initAlpha, or that the calls will not be moved around in the future. The assumption that unisets::getImpl(unisets::ALPHA) is not nullptr here is therefore very weak.

Could we change this to:

UnicodeSet* isAlpha = unisets::getImpl(unisets::ALPHA);
if (isAlpha == nullptr) {
    status = U_MEMORY_ALLOCATION_ERROR;
    return nullptr;
}
UnicodeSet* result = new UnicodeSet(*isAlpha);

Contributor Author

Will do this in a future PR.
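
The check requested in this thread can be sketched without ICU as follows. This is an illustration under assumptions, not the follow-up change itself: `Status`, `Set`, and `copyOrFail` are hypothetical stand-ins for `UErrorCode`, `UnicodeSet`, and the copy step inside the init functions.

```cpp
#include <cassert>
#include <new>
#include <set>

// Stand-ins for UErrorCode and UnicodeSet (assumptions for this sketch).
enum class Status { Ok, MemoryAllocationError };
inline bool failed(Status s) { return s != Status::Ok; }
using Set = std::set<int>;

// Copies `base`, reporting an error instead of dereferencing a null pointer.
// The caller owns the returned set.
inline Set* copyOrFail(const Set* base, Status& status) {
    if (failed(status)) return nullptr;      // respect an earlier failure
    if (base == nullptr) {                   // dependency was never created
        status = Status::MemoryAllocationError;
        return nullptr;
    }
    Set* result = new (std::nothrow) Set(*base);
    if (result == nullptr) {                 // the allocation of the copy failed
        status = Status::MemoryAllocationError;
    }
    return result;
}
```

The function no longer relies on the caller having checked the dependency first, which is the weak assumption the review points out.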

status = U_MEMORY_ALLOCATION_ERROR;
return nullptr;
};
result->addAll(*unisets::getImpl(unisets::NAME_START));
Contributor

same as above

Contributor Author

Will fix in a future PR

status = U_MEMORY_ALLOCATION_ERROR;
return nullptr;
};
result->addAll(*unisets::getImpl(unisets::CONTENT));
Contributor

Also here

Contributor Author

Will fix in a future PR

status = U_MEMORY_ALLOCATION_ERROR;
return nullptr;
};
result->addAll(*unisets::getImpl(unisets::CONTENT));
Contributor

and here

Contributor Author

Will fix in a future PR

const UnicodeSet* get(Key key) {
UErrorCode localStatus = U_ZERO_ERROR;
umtx_initOnce(gMF2ParseUniSetsInitOnce, &initMF2ParseUniSets, localStatus);
if (U_FAILURE(localStatus)) {
Contributor

If an error happens the first time the code calls initMF2ParseUniSets, localStatus will hold the error and get() will return nullptr, but gUnicodeSets will still be partially initialized. For example, gUnicodeSets[unisets::TEXT] may be nullptr if initNameChars failed. Later on, you will pass that nullptr into contentChars, and the later call contentChars->contains() will dereference a nullptr.

Contributor Author

Will fix in a future PR (by making get() take a UErrorCode).

@FrankYFTang
Contributor

I find the error handling code in this PR very weak. If any of the new operations during initMF2ParseUniSets returns nullptr because of an out-of-memory condition, the code will dereference a nullptr later on.

bidiControlChars(unisets::get(unisets::BIDI)),
alphaChars(unisets::get(unisets::ALPHA)),
digitChars(unisets::get(unisets::DIGIT)),
nameStartChars(unisets::get(unisets::NAME_START)),
Contributor

Note that with the current code, some of these unisets::get() calls may return nullptr while others do not, causing a nullptr dereference later.

}

UnicodeSet* initNameStartChars(UErrorCode& status) {
if (U_FAILURE(status)) {
Contributor

You have an implicit requirement that initNameStartChars must be called after initAlpha, or it may crash here. I think this is dangerous. Could we enforce that explicitly, for example by calling initAlpha at the beginning of initNameStartChars()? There is also an implicit assumption that status is the same status that was passed to initAlpha, so that if initAlpha fails, the status here will contain that failure.

Contributor Author

Will fix in a future PR
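
One way to make the ordering requirement explicit, sketched without ICU: the dependent initializer obtains its dependency itself instead of assuming an earlier call created it. `Sets`, `ensureAlpha`, and `ensureNameStart` are hypothetical names, `Set` and `Status` stand in for `UnicodeSet` and `UErrorCode`, and the set contents are illustrative values only.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <set>

// Stand-ins for UErrorCode and UnicodeSet (assumptions for this sketch).
enum class Status { Ok, MemoryAllocationError };
inline bool failed(Status s) { return s != Status::Ok; }
using Set = std::set<int>;

struct Sets {
    std::unique_ptr<Set> alpha;
    std::unique_ptr<Set> nameStart;

    // Creates the alpha set on demand; safe to call any number of times.
    const Set* ensureAlpha(Status& status) {
        if (failed(status)) return nullptr;
        if (!alpha) {
            alpha.reset(new (std::nothrow) Set{'a', 'b'});  // illustrative contents
            if (!alpha) {
                status = Status::MemoryAllocationError;
                return nullptr;
            }
        }
        return alpha.get();
    }

    // Requests its dependency directly, so call order no longer matters and
    // a failure in ensureAlpha is visible here through the same status.
    const Set* ensureNameStart(Status& status) {
        const Set* base = ensureAlpha(status);
        if (failed(status)) return nullptr;
        if (!nameStart) {
            nameStart.reset(new (std::nothrow) Set(*base));
            if (!nameStart) {
                status = Status::MemoryAllocationError;
                return nullptr;
            }
            nameStart->insert('_');  // illustrative extra member
        }
        return nameStart.get();
    }
};
```

Calling `ensureNameStart` on a fresh `Sets` object creates the alpha set as a side effect rather than crashing on a null dependency.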

gUnicodeSets[unisets::BIDI] = initBidiControls(status);
gUnicodeSets[unisets::ALPHA] = initAlpha(status);
gUnicodeSets[unisets::DIGIT] = initDigits(status);
gUnicodeSets[unisets::NAME_START] = initNameStartChars(status);
Contributor

Let's consider the case where initMF2ParseUniSets is called for the first time while memory is almost full, and initContentChars, initWhitespace, and initBidiControls all succeed but initAlpha fails. In that case, every entry except CONTENT, WHITESPACE, and BIDI will be nullptr, and initMF2ParseUniSets will report an error in status the first time. So the first get() call will return nullptr, and later on getImpl may also return nullptr. I think you need to change
const UnicodeSet* get(Key key)
to
const UnicodeSet* get(Key key, UErrorCode& status)
to pass the error up to the Parser() constructor, and also set status to an error code if getImpl() returns nullptr.

Contributor Author

Will fix in a future PR
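
The proposed signature change can be sketched as below. This is an assumption-laden illustration, not the follow-up change: `Registry`, `Status`, `Set`, and the `failAlpha` switch are hypothetical, and a plain `bool` flag replaces `umtx_initOnce`, so the sketch is not thread-safe.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>

// Stand-ins for UErrorCode and UnicodeSet (assumptions for this sketch).
enum class Status { Ok, MemoryAllocationError };
inline bool failed(Status s) { return s != Status::Ok; }
using Set = std::set<int>;

enum Key : std::size_t { CONTENT, ALPHA, KEY_COUNT };

struct Registry {
    std::array<std::unique_ptr<Set>, KEY_COUNT> sets;
    bool initialized = false;
    Status initStatus = Status::Ok;
    bool failAlpha = false;  // simulates an out-of-memory failure part-way through

    void init() {
        sets[CONTENT].reset(new Set{1, 2, 3});
        if (failAlpha) {  // CONTENT exists, ALPHA does not: partial initialization
            initStatus = Status::MemoryAllocationError;
            return;
        }
        sets[ALPHA].reset(new Set{4, 5});
    }

    // Reports failures through `status` instead of silently returning nullptr.
    const Set* get(Key key, Status& status) {
        if (failed(status)) return nullptr;
        if (!initialized) {
            initialized = true;
            init();
        }
        if (failed(initStatus)) {  // remembered from the first initialization
            status = initStatus;
            return nullptr;
        }
        const Set* result = sets[key].get();
        if (result == nullptr) status = Status::MemoryAllocationError;
        return result;
    }
};
```

A caller such as a parser constructor can then test the status once after its `get()` calls, rather than holding a mix of valid and null pointers.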

catamorphism added a commit to catamorphism/icu that referenced this pull request Dec 13, 2024
Improve checking for OOM errors when allocating UnicodeSets,
per post-merge comments on unicode-org#3236
@catamorphism
Contributor Author

@FrankYFTang I submitted #3306 to apply the fixes you suggested. Thanks!

catamorphism added a commit to catamorphism/icu that referenced this pull request Jan 10, 2025
Improve checking for OOM errors when allocating UnicodeSets,
per post-merge comments on unicode-org#3236
catamorphism added a commit that referenced this pull request Jan 10, 2025
Improve checking for OOM errors when allocating UnicodeSets,
per post-merge comments on #3236
4 participants