ICU-22707 Unicode 16 alpha #2930

Merged
merged 18 commits into from
Apr 30, 2024

markusicu
Member

@markusicu markusicu commented Mar 27, 2024

  • Unicode 16 alpha data
  • Add "security" and "UCA" files, generated after the alpha
    • Unicode 16 Identifier_Type no longer has a bug in U+A9CF that required a hack in ICU
  • Update script metadata via CLDR
  • Change the encoding of the Age property so that 16.0 fits
  • Fix a bug in C++ normalization for MaybeYes characters that combine both backward & forward; these are new in Unicode 16
  • Normalization data format & code to support MaybeNo characters with NF*C_QC=Maybe and NF*D_QC=No, that is, they have two-way mappings but also combine-back themselves, or their decompositions combine-back; also new in Unicode 16
  • Newly testing normalization quick check properties against ppucd.txt
  • UCA data from Unicode Tools, not yet in CLDR
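
To illustrate why the Age encoding had to change, here is a minimal sketch. It assumes the old layout packed the version into one byte as 4 bits major plus 4 bits minor, and that a wider layout uses 6 bits major plus 2 bits minor. Both layouts, and the names `Version`, `packNibbles`, `packSixTwo`, and `unpackSixTwo`, are hypothetical and not copied from `uprops.h`.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical version pair; the field names are illustrative only.
struct Version {
    unsigned majorVer;
    unsigned minorVer;
};

// Assumed old layout: 4 bits major, 4 bits minor.
// The major version cannot exceed 15, so 16.0 does not fit.
std::optional<std::uint8_t> packNibbles(Version v) {
    if (v.majorVer > 0xF || v.minorVer > 0xF) return std::nullopt;
    return static_cast<std::uint8_t>((v.majorVer << 4) | v.minorVer);
}

// Assumed wider layout: 6 bits major (up to 63), 2 bits minor (up to 3).
std::optional<std::uint8_t> packSixTwo(Version v) {
    if (v.majorVer > 0x3F || v.minorVer > 0x3) return std::nullopt;
    return static_cast<std::uint8_t>((v.majorVer << 2) | v.minorVer);
}

// Inverse of packSixTwo.
Version unpackSixTwo(std::uint8_t b) {
    return {static_cast<unsigned>(b >> 2), static_cast<unsigned>(b & 3)};
}
```

Under these assumptions `packNibbles({16, 0})` yields no value, while `packSixTwo({16, 0})` round-trips through `unpackSixTwo`. The trade-off is that the minor version is limited to 3, which matches the Unicode minor versions published so far.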

Known issues, to be fixed separately:

  • ICU-22757 genuca --icu4x does not work with Unicode 16
  • ICU-22758 icuexportdata --mode norm does not work with Unicode 16

Will do later:

  • Unihan collators
  • Unicode 16 beta
  • Unicode 16 final

For comparison:

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22707
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

ALLOW_MANY_COMMITS=true

@markusicu
Member Author

@hsivonen FYI -- I am working on Unicode 16 alpha in ICU. The properties are in.
I am debugging normalization which currently fails as expected.
It may or may not be good enough for generating ICU4X data. Normalization mappings should come out ok but don't trust the quick check properties.
The alpha did not include security & UCA files, so those are unchanged from 15.1.
@echeran FYI

@markusicu
Member Author

@hsivonen I think I am done with ICU4C changes for normalization to support the characters with the new combinations of properties. All of the C/C++ normalization tests pass. 🎉 Hopefully this can get you started.

There are still ICU4C test failures, but they are currently expected. They are due to missing Unicode 16 data of several types, and some outdated test expectations.

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/common/uprops.h is different
  • icu4c/source/data/unidata/generate.sh is different
  • icu4c/source/test/intltest/ucdtest.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • .ci-builds/.azure-pipelines-icu4c.yml is now changed in the branch
  • icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/testdata/testnorm.nrm is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu
Member Author

@eggrobin @cjchapman FYI

It looks like I got this snapshot to work well enough in ICU4C & ICU4J.
I intend to rebase-and-merge the chain of commits without squashing. I tried to keep them reasonably clean, both for reviewing and for future reference.

There are two problems with generating ICU4X data; for now I disabled those generators and filed separate ICU tickets. I think it's most productive if I can merge this PR and @hsivonen can then look into fixing them on the main branch.

@aheninger FYI I got a UBSan failure in rbbitst.cpp. It was unhappy about accessing the 8-bit version of RBBIStateTableRow at an odd address. I changed the test code to cast directly to the 8-bit or 16-bit version of the row struct.
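
A minimal sketch of the alignment issue described above, using hypothetical stand-ins (`Row8`, `Row16`, `RowUnion`) rather than the real `RBBIStateTableRow` definitions. It assumes the row type is a union of an 8-bit and a 16-bit struct: the union then inherits the 16-bit alignment, so a pointer to it must not point at an odd address, even if only the 8-bit member is read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 8-bit row: 5 bytes, alignment 1, so consecutive rows
// alternate between even and odd addresses.
struct Row8 {
    std::uint8_t accepting, lookAhead, tagsIdx;
    std::uint8_t nextState[2];
};

// Hypothetical 16-bit row with the same fields.
struct Row16 {
    std::uint16_t accepting, lookAhead, tagsIdx;
    std::uint16_t nextState[2];
};

// Hypothetical union of both layouts; it takes the stricter alignment.
union RowUnion {
    Row8 r8;
    Row16 r16;
};

static_assert(sizeof(Row8) == 5, "8-bit rows are packed with no padding");
static_assert(alignof(Row8) == 1, "8-bit rows may start at any address");
static_assert(alignof(RowUnion) == alignof(Row16), "union needs 16-bit alignment");

// True if p satisfies the alignment requirement of T.
template <typename T>
bool alignedFor(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Cast directly to the row struct that matches the table width,
// instead of going through a RowUnion pointer.
unsigned acceptingOf(const void *row, bool use8Bits) {
    if (use8Bits) {
        return static_cast<const Row8 *>(row)->accepting;
    }
    return static_cast<const Row16 *>(row)->accepting;
}
```

With two adjacent `Row8` rows, one of them starts at an odd address, so it is valid for `Row8` but not for `RowUnion`. Casting to `Row8` directly avoids forming a misaligned `RowUnion` pointer in the first place.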

@markusicu markusicu marked this pull request as ready for review April 27, 2024 02:43
@hsivonen
Member

> I think it's most productive if I can merge this PR and @hsivonen can then look into fixing them on the main branch.

I intend to look into this in the later part of this week.

Contributor

@echeran echeran left a comment

RSLGTM
