Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22984 Generate old monkeys #3287

Merged
merged 4 commits into from
Dec 4, 2024

Conversation

eggrobin
Copy link
Member

@eggrobin eggrobin commented Nov 29, 2024

C++ only for now, I will do the old monkeys from Java separately (that will be more of the same but with subtle differences, as we will need to expand the UnicodeSets ourselves before feeding them to Pattern). This uses ICU regexes to match the context before and after, or to match the left-hand side of a remap rule, as in UAX14 and UAX29. The regexes are the ones used to generate the conformance tests in the Unicode tools; they now closely match the ones in the UAX.
Also line breaking only; the other ones will be very similar (and simpler).

The generated partition and rules are from unicode-org/unicodetools#979. Note that the generation involves transforming & and - (UnicodeSet syntax) to && and -- (ICU regex character class syntax).

Checklist

  • Required: Issue filed: ICU-22984
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

ALLOW_MANY_COMMITS=true

@eggrobin eggrobin marked this pull request as ready for review November 29, 2024 18:54
@eggrobin eggrobin requested a review from markusicu November 29, 2024 18:54
macchiati
macchiati previously approved these changes Nov 29, 2024
Copy link
Member

@macchiati macchiati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good optimization (and cleanup)

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice replacement of hardcoded rules with generated ones!

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved
icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved
icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved
icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved
icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved
icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved
icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved
icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved
// TODO(egg): The following two are a workaround for what seems to be an ICU bug.
// TODO(egg): The following two workarounds for what seems to be ICU bugs;
// with UREGEX_DOTALL (but not UREGEX_MULTILINE):
// 1. /.*\u000A/ does not match CR LF;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's because . matches CR LF, so it won't match just half of that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That may well be the explanation, but even with regex greed, that still feels like a bug: /(rn|[a-z])*n/ still matches rn.

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved
@eggrobin eggrobin requested a review from markusicu December 3, 2024 18:19
Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great comments!

// 𒀀 ◌́ ␠ ◌𝅲
// 0 1 2 3 4 5 6 ⟨ 𒀀, ◌́, ␠, ◌𝅲 ⟩ (none)
// 0 ␀ ␀ 2 3 4 5 ⟨ 𒀀, ␠, ◌𝅲 ⟩ 1 -1
// 0 ␀ ␀ 2 3 ␀ 4 ⟨ 𒀀, ␠, A ⟩ 2 -1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the final offset be -2?

Copy link
Member Author

@eggrobin eggrobin Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the offset variable is with respect to the previous value of indexInRemapped, so 6 + -1 + -1 = 4.

icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved
icu4c/source/test/intltest/rbbitst.cpp Outdated Show resolved Hide resolved
markusicu
markusicu previously approved these changes Dec 3, 2024
@markusicu
Copy link
Member

Remember to squash the commits. It's ok if you want to merge multiple commits if they make sense as separate ones.
Each one's message needs to start with ICU-22984
Squashing without contents change will not require re-approval.

@eggrobin eggrobin force-pushed the generate-old-monkeys branch from 287e4d6 to eb7fd5f Compare December 4, 2024 17:18
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin
Copy link
Member Author

eggrobin commented Dec 4, 2024

It's ok if you want to merge multiple commits if they make sense as separate ones.

I tried with four steps (unicodetools-like implementation, optimization, code motion which wants to be on its own because it is undiffable, post-motion cleanup). Let me know if you like that, otherwise I can squish it all into one commit.

@markusicu
Copy link
Member

It's ok if you want to merge multiple commits if they make sense as separate ones.

I tried with four steps (unicodetools-like implementation, optimization, code motion which wants to be on its own because it is undiffable, post-motion cleanup). Let me know if you like that, otherwise I can squish it all into one commit.

wfm tnx
I added the allow-many-commits magic into the description.

@eggrobin eggrobin merged commit e59065c into unicode-org:main Dec 4, 2024
95 checks passed
@FrankYFTang
Copy link
Contributor

The Exhaustive Tests for ICU #22 is broken between e025466 and 2e57f07 in TestMonkey

https://github.com/unicode-org/icu/actions/runs/12209763924

Likely caused by this PR

@eggrobin
Copy link
Member Author

Yes, fixing in #3296.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants