New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

ICU-22984 Generate old monkeys #3287

Merged

eggrobin merged 4 commits into unicode-org:main from eggrobin:generate-old-monkeys

Dec 4, 2024

Member

eggrobin commented Nov 29, 2024 •

edited by markusicu

Loading

C++ only for now, I will do the old monkeys from Java separately (that will be more of the same but with subtle differences, as we will need to expand the UnicodeSets ourselves before feeding them to Pattern). This uses ICU regexes to match the context before and after, or to match the left-hand side of a remap rule, as in UAX14 and UAX29. The regexes are the ones used to generate the conformance tests in the Unicode tools; they now closely match the ones in the UAX.
Also line breaking only; the other ones will be very similar (and simpler).

The generated partition and rules are from unicode-org/unicodetools#979. Note that the generation involves transforming & and - (UnicodeSet syntax) to && and -- (ICU regex character class syntax).

Checklist

Required: Issue filed: ICU-22984
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

ALLOW_MANY_COMMITS=true

eggrobin mentioned this pull request

Generate old monkeys unicode-org/unicodetools#979

Merged

eggrobin marked this pull request as ready for review

November 29, 2024 18:54

eggrobin requested a review from markusicu

November 29, 2024 18:54

macchiati previously approved these changes

View reviewed changes

Member

macchiati left a comment

Looks like a good optimization (and cleanup)

eggrobin dismissed macchiati’s stale review via

9d5cec9

December 2, 2024 15:43

markusicu self-assigned this

markusicu requested a review from aheninger

December 2, 2024 19:53

markusicu reviewed

View reviewed changes

Member

markusicu left a comment

very nice replacement of hardcoded rules with generated ones!

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved

icu4c/source/test/intltest/rbbitst.cpp

-                  // TODO(egg): The following two are a workaround for what seems to be an ICU bug.
+                  // TODO(egg): The following two workarounds for what seems to be ICU bugs;
+                  // with UREGEX_DOTALL (but not UREGEX_MULTILINE):
+                  // 1. /.*\u000A/ does not match CR LF;

Member

markusicu Dec 2, 2024

I think that's because . matches CR LF, so it won't match just half of that.

Member Author

eggrobin Dec 2, 2024

That may well be the explanation, but even with regex greed, that still feels like a bug: /(rn|[a-z])*n/ still matches rn.

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved

eggrobin requested a review from markusicu

December 3, 2024 18:19

markusicu reviewed

View reviewed changes

Member

markusicu left a comment

great comments!

icu4c/source/test/intltest/rbbitst.cpp Outdated

+                      //    𒀀     ◌́     ␠     ◌𝅲
+                      // 0  1  2     3     4  5  6         ⟨ 𒀀, ◌́, ␠, ◌𝅲 ⟩   (none)
+                      // 0  ␀  ␀     2     3  4  5         ⟨ 𒀀,    ␠, ◌𝅲 ⟩   1                -1
+                      // 0  ␀  ␀     2     3  ␀  4         ⟨ 𒀀,    ␠, A ⟩   2                -1

Member

markusicu Dec 3, 2024

should the final offset be -2?

Member Author

eggrobin Dec 3, 2024 •

edited

Loading

No, the offset variable is with respect to the previous value of indexInRemapped, so 6 + -1 + -1 = 4.

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved

markusicu previously approved these changes

View reviewed changes

eggrobin dismissed markusicu’s stale review via

287e4d6

December 3, 2024 23:18

eggrobin requested a review from markusicu

December 3, 2024 23:19

markusicu approved these changes

View reviewed changes

Member

markusicu commented Dec 3, 2024

Remember to squash the commits. It's ok if you want to merge multiple commits if they make sense as separate ones.
Each one's message needs to start with ICU-22984
Squashing without contents change will not require re-approval.

eggrobin added 4 commits

December 4, 2024 18:17


          ICU-22984 Generate old monkeys

4466c57


          ICU-22984 Optimize old monkeys

c587ad5


          ICU-22984 Move old monkeys

8341bee


          ICU-22984 Clean up old monkeys

eb7fd5f

eggrobin force-pushed the generate-old-monkeys branch from 287e4d6 to eb7fd5f Compare

December 4, 2024 17:18

jira-pull-request-webhook bot commented Dec 4, 2024

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Member Author

eggrobin commented Dec 4, 2024

It's ok if you want to merge multiple commits if they make sense as separate ones.

I tried with four steps (unicodetools-like implementation, optimization, code motion which wants to be on its own because it is undiffable, post-motion cleanup). Let me know if you like that, otherwise I can squish it all into one commit.

Member

markusicu commented Dec 4, 2024

It's ok if you want to merge multiple commits if they make sense as separate ones.

I tried with four steps (unicodetools-like implementation, optimization, code motion which wants to be on its own because it is undiffable, post-motion cleanup). Let me know if you like that, otherwise I can squish it all into one commit.

wfm tnx
I added the allow-many-commits magic into the description.

eggrobin merged commit e59065c into unicode-org:main

95 checks passed

This was referenced Dec 10, 2024

ICU-22986 GL takes CM #3296

Merged

ICU-22984 Generate old Java monkeys #3301

Open

Contributor

FrankYFTang commented Dec 17, 2024

The Exhaustive Tests for ICU #22 is broken between e025466 and 2e57f07 in TestMonkey

https://github.com/unicode-org/icu/actions/runs/12209763924

Likely caused by this PR

Member Author

eggrobin commented Dec 18, 2024

Yes, fixing in #3296.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet