ICU-22984 Generate old monkeys #3287
Looks like a good optimization (and cleanup)
very nice replacement of hardcoded rules with generated ones!
// TODO(egg): The following two are a workaround for what seems to be an ICU bug.
// TODO(egg): The following two workarounds for what seems to be ICU bugs;
// with UREGEX_DOTALL (but not UREGEX_MULTILINE):
// 1. /.*\u000A/ does not match CR LF;
I think that's because . matches CR LF, so it won't match just half of that.
That may well be the explanation, but even with regex greed, that still feels like a bug: `/(rn|[a-z])*n/` still matches `rn`.
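The backtracking analogy in the comment above can be checked with a small sketch. It uses `std::regex` (ECMAScript grammar) rather than ICU's regex engine, which is an assumption made only to keep the example self-contained; `matchesAnalogy` is a hypothetical name, and the sketch says nothing about how ICU treats CR LF under UREGEX_DOTALL.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole of `text` matches /(rn|[a-z])*n/.
// Assumption: std::regex stands in for ICU regexes here.
bool matchesAnalogy(const std::string& text) {
    static const std::regex pattern("(rn|[a-z])*n");
    return std::regex_match(text, pattern);
}
```

Called with `"rn"`, the greedy group first consumes `rn` as a unit, the trailing `n` then fails, and the engine backtracks to `[a-z]` matching only `r`, which lets the final `n` match.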
great comments!
// 𒀀 ◌́ ␠ ◌𝅲
// 0 1 2 3 4 5 6 ⟨ 𒀀, ◌́, ␠, ◌𝅲 ⟩ (none)
// 0 ␀ ␀ 2 3 4 5 ⟨ 𒀀, ␠, ◌𝅲 ⟩ 1 -1
// 0 ␀ ␀ 2 3 ␀ 4 ⟨ 𒀀, ␠, A ⟩ 2 -1
should the final offset be -2?
No, the `offset` variable is with respect to the previous value of `indexInRemapped`, so 6 + -1 + -1 = 4.
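The arithmetic in the reply above can be written out as a minimal sketch. `applyOffsets` is a hypothetical helper, not code from this PR; it only illustrates that each offset is relative to the value left by the previous rule, so the offsets accumulate.

```cpp
#include <vector>

// Hypothetical helper, not the PR's implementation: each rule's offset is
// relative to the index value left by the previous rule, so they add up.
int applyOffsets(int indexInRemapped, const std::vector<int>& offsets) {
    for (const int offset : offsets) {
        indexInRemapped += offset;
    }
    return indexInRemapped;
}
```

With the table from the code comment, `applyOffsets(6, {-1, -1})` follows the final index through both rules: 6, then 5, then 4.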
Remember to squash the commits. It's ok if you want to merge multiple commits if they make sense as separate ones.
Force-pushed from 287e4d6 to eb7fd5f.
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot
I tried with four steps (unicodetools-like implementation, optimization, code motion which wants to be on its own because it is undiffable, post-motion cleanup). Let me know if you like that, otherwise I can squish it all into one commit.
wfm tnx
The Exhaustive Tests for ICU #22 is broken between e025466 and 2e57f07 in TestMonkey https://github.com/unicode-org/icu/actions/runs/12209763924 Likely caused by this PR
Yes, fixing in #3296.
C++ only for now, I will do the old monkeys from Java separately (that will be more of the same but with subtle differences, as we will need to expand the UnicodeSets ourselves before feeding them to Pattern). This uses ICU regexes to match the context before and after, or to match the left-hand side of a remap rule, as in UAX14 and UAX29. The regexes are the ones used to generate the conformance tests in the Unicode tools; they now closely match the ones in the UAX.
Also line breaking only; the other ones will be very similar (and simpler).
The generated partition and rules are from unicode-org/unicodetools#979. Note that the generation involves transforming `&` and `-` (UnicodeSet syntax) to `&&` and `--` (ICU regex character class syntax).
Checklist
ALLOW_MANY_COMMITS=true
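The operator transformation mentioned in the description (`&` and `-` in UnicodeSet syntax becoming `&&` and `--` in ICU regex character class syntax) could look roughly like the sketch below. This is not the unicodetools implementation: `toRegexSetSyntax` is a hypothetical name, and the sketch assumes an operator is any `&` or `-` that directly follows `]` or `}` and directly precedes `[` or `\`. It does not handle whitespace around operators or escaped brackets.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not the unicodetools code.
// Doubles '&' and '-' only where they sit between two set operands,
// i.e. right after ']' or '}' and right before '[' or '\'.
// Range hyphens such as the one in [a-z] are left alone.
std::string toRegexSetSyntax(const std::string& unicodeSet) {
    std::string result;
    for (std::size_t i = 0; i < unicodeSet.size(); ++i) {
        const char c = unicodeSet[i];
        result += c;
        const bool isOperator = c == '&' || c == '-';
        const bool afterOperand =
            i > 0 && (unicodeSet[i - 1] == ']' || unicodeSet[i - 1] == '}');
        const bool beforeOperand =
            i + 1 < unicodeSet.size() &&
            (unicodeSet[i + 1] == '[' || unicodeSet[i + 1] == '\\');
        if (isOperator && afterOperand && beforeOperand) {
            result += c;  // '&' becomes "&&", '-' becomes "--".
        }
    }
    return result;
}
```

For example, `[[a-z]&[aeiou]]` would come out as `[[a-z]&&[aeiou]]`, with the range hyphen untouched because it is not preceded by a closing bracket or brace.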