Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLDR-15830 Extensive refactoring of NameGetter; new enum NameType #4246

Merged
merged 12 commits into from
Dec 19, 2024

Conversation

btangmu
Copy link
Member

@btangmu btangmu commented Dec 15, 2024

CLDR-15830

-New enum NameType replaces integers CLDRFile.LANGUAGE_NAME, etc.; avoid arithmetic based on such integers

-Encapsulate in NameType all dependencies on ordering of NameType.LANGUAGE, etc.

-Move related methods and NameTable from CLDRFile to NameType

-Rename NameGetter.getNameFromBCP47 to getNameFromIdentifier, etc., since identifiers may not be only BCP47; change some comments accordingly

-Avoid boolean parameter onlyConstructCompound; use enum NameGetter.NameOpt instead, values COMPOUND_ONLY and DEFAULT

-Fix all 3 TODO comments added in previous PR 4228

-Fix typo: rename getZoneToCounty to getZoneToCountry (country, not county)

-Remove dead code, including ShowLanguages.printMissing, ShowLanguages.print, TestCoverageLevel.TypeName

-Remove some commented-out code

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

-Fix all 3 TODO comments added in previous PR
…ME, etc.

-Replace many but not all int CLDRFile.LANGUAGE_NAME with Enum NameType.LANGUAGE, and so forth

-New method getNameFromTypeEnumCode(NameType, ...) to replace getNameFromTypenumCode(int, ...)

-New method NameType.getKeyPath to replace CLDRFile.getKey, work in progress

-Remove CLDRFile.TZ_START and CLDRFile.TZ_LIMIT

-New temporary methods NameType.fromCldrInt and NameType.toCldrInt, only to facilitate transition

-New TODO comments in CLDRTest.java
…ME, etc.

-Fix TestDisplayNameCollisions to use NameType instead of int

-New NameType.getNameTypeName, wraps CLDRFile.getNameTypeName for now

-New NameType.fromPath like CLDRFile.getNameType

-New NameType.NONE like CLDRFile.NO_NAME

-Remove some commented-out code
-Fix ShowLanguages, TestScriptMetadata to use NameType instead of int
-Change int to NameType, avoid arithmetic
…etCode/getNameType

-Also remove an unused private method getName from TimezoneFormatter
…omTypenumCode

-Also move the NameTable-associated methods, and consolidate

-Change more int to NameType

-Make toCldrInt private

-Remove dead code
-Encapsulate in NameType all dependencies on ordering of NameType.LANGUAGE, etc.

-Remove dead code, including ShowLanguages.printMissing, ShowLanguages.print, TestCoverageLevel.TypeName
…instead of boolean

-Rename getNameFromBCP47 to getNameFromIdentifier, since may not be only BCP47; change some comments accordingly

-Avoid boolean parameter onlyConstructCompound; use enum NameGetter.NameOpt instead, values COMPOUND_ONLY and DEFAULT
}
}
pw.println("</table></div>");
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above two methods were unused, and I deleted them rather than refactor them

@@ -922,10 +925,13 @@ public void TestMinimalLocalization() throws IOException {
new String[] {
"sun", "mon", "tue", "wed", "thu", "fri", "sat"
}));
for (int i = -7; i < 0; ++i) {
checkForItems(item, (i < -4 ? months : days), i, missing, failureCount, null);
Copy link
Member Author

@btangmu btangmu Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This usage of the magic number "-7" to mean "CLDRFile.TZ_EXEMPLAR and getDateKey", and so forth, involved some tricky arithmetic. This PR reduces the trickiness, at the expense of making the parameters explicit instead of looping.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much clearer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this as a comment to the code -- otherwise it hard for us in the future to follow why this is.

A practice I liked, is to move magic numbers to the top of the file as constant integers so when you call on them you call on English-readable labels.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fortunately my change here removes the magic numbers, so I don't think there's any need for a comment

for (int i = -7; i < 0; ++i) { is the old code being replaced

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, this is a review on the old code. no comment required.

* @return
*/
public static int getNameType(String xpath) {
for (int i = 0; i < NameTable.length; ++i) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code in CLDRFile.java moved (in refactored form) to NameType.java

public NameType toNameType() {
switch (this) {
case language:
return NameType.LANGUAGE;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This maps StandardCodes.CodeType to NameType. These enums are strangely overlapping, suggesting possible future integration

public NameType toNameType() {
switch (this) {
case region:
return NameType.TERRITORY;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly this maps StandardCodes.LstrType to NameType. I'm not sure how StandardCodes.LstrType and StandardCodes.CodeType are related; again, they seem to overlap

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they definitely overlap

@@ -368,48 +368,6 @@ static String isBigLanguage(String lang) {

private static final boolean DEBUG = false;

static class TypeName implements Transform<String, String> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was unused

@@ -1601,31 +1601,39 @@ public String getLocaleID() {
static {
StandardCodes sc = StandardCodes.make();
Map<String, Set<String>> countries_zoneSet = sc.getCountryToZoneSet();
Map<String, String> zone_countries = sc.getZoneToCounty();

for (int typeNo = 0; typeNo <= CLDRFile.TZ_START; ++typeNo) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here was a loop depending on the numeric values of CLDRFile.LANGUAGE_NAME, etc. The new code lists the enum values explicitly. This static code block is crucial for code fallback, and while studying it for https://unicode-org.atlassian.net/browse/CLDR-17014 and also seeing how it involves getName, it appeared that refactoring would make that ticket easier to understand and solve


public enum NameType {
NONE,
LANGUAGE,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the replacements for CLDRFile.NO_NAME, CLDRFile.LANGUAGE_NAME, etc. (which were integers, NOT enum)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like putting this in an enum :D

However, I'm having a hard time understanding what NameType means -- can you add a longer comment about what it means? I wonder if there's a better enum name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I just wrote in another reply, "A path for a language name is associated with NameType.LANGUAGE; a path for a script name is associated with NameType.SCRIPT; etc."

There are different types of names, and this enum enumerates those types, so I think it's a pretty good name... You're right, though, adding a comment here could help

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be a future PR to add some brief enum docs (at the enum level)

// The row with "key|type" has four strings, while the others have three.
private static final String[][] NameTable = {
{"//ldml/localeDisplayNames/languages/language[@type=\"", "\"]", "language"},
{"//ldml/localeDisplayNames/scripts/script[@type=\"", "\"]", "script"},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NameTable is moved from CLDRFile, otherwise unchanged. Further refactoring could be beneficial, to avoid the entanglement with integers INDEX_LANGUAGE, etc., but at least that entanglement is now encapsulated inside this class

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, actually something that works with XPathParts (thus cache) - an XPathCalculator (builder) or something could be a completely different utility. That's what this is, it knows how to make an XPath for certain purposes.

@btangmu btangmu marked this pull request as ready for review December 16, 2024 00:56
@btangmu btangmu requested review from srl295 and macchiati December 16, 2024 00:57
@@ -238,15 +239,15 @@ public void TestLocalePartsValidity() {
if (!xpath.startsWith("//ldml/localeDisplayNames/")) {
continue;
}
switch (CLDRFile.getNameType(xpath)) {
case 0:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0, 1, and 2 were used here instead of CLDRFile.LOCALE_NAME, CLDRFile.SCRIPT_NAME, and CLDRFile.TERRITORY_NAME

Copy link
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good so far

@@ -922,10 +925,13 @@ public void TestMinimalLocalization() throws IOException {
new String[] {
"sun", "mon", "tue", "wed", "thu", "fri", "sat"
}));
for (int i = -7; i < 0; ++i) {
checkForItems(item, (i < -4 ? months : days), i, missing, failureCount, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much clearer

@@ -189,7 +189,7 @@ public void writeSubcharts(Anchors anchors) throws IOException {
for (Entry<LanguageGroup, Set<R3<Integer, String, String>>> groupPairs :
groupToNameAndCodeSorted.keyValuesSet()) {
LanguageGroup group = groupPairs.getKey();
String ename = nameGetter.getNameFromBCP47Bool("en", true);
String ename = nameGetter.getNameFromIdentifierCompoundOnly("en");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are a lot better.

public NameType toNameType() {
switch (this) {
case region:
return NameType.TERRITORY;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they definitely overlap

+ xpath
+ "\t"
+ xpath2);
String theName = nameType.getNameTypeName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is it the name of? There are too many variables like "noun2" -- and reading this code without having written it, I don't know what "theName" really is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Join the party :-) Doing this refactoring, I've also been diving into a lot of unfamiliar and perplexing code. The same variable or parameter may be called a name, or a type, or a code, ... This method in particular is looping through all the paths, so depending on the path, theName could be the name of a language, a script, a territory, ... -- any of the values of the new NameType enum. A path for a language name is associated with NameType.LANGUAGE; a path for a script name is associated with NameType.SCRIPT; etc.

@@ -922,10 +925,13 @@ public void TestMinimalLocalization() throws IOException {
new String[] {
"sun", "mon", "tue", "wed", "thu", "fri", "sat"
}));
for (int i = -7; i < 0; ++i) {
checkForItems(item, (i < -4 ? months : days), i, missing, failureCount, null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this as a comment to the code -- otherwise it hard for us in the future to follow why this is.

A practice I liked, is to move magic numbers to the top of the file as constant integers so when you call on them you call on English-readable labels.

@@ -843,6 +843,12 @@ void getSupplementalData(
}
}

/** Settings for checkForItems behavior */
private enum KeyOpt {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an optional key? Is this a specific key possibility? What is the key for? It would be great to make variable names clearer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a required parameter for one function, checkForItems. There are also two wrapper versions of checkForItems with fewer parameters, in which it implicitly has the value KeyOpt.DEFAULT, and there's one wrapper function checkForItemsDate in which it implicitly has the value KeyOpt.DATE.

Its purpose is to control the behavior of the only place it's referenced:

            String key =
                    (keyOpt == KeyOpt.DATE)
                            ? getDateKey(nameType, code)
                            : nameType.getKeyPath(code);

I don't know why exactly that's required; this is old code that I'm refactoring. These are the only callers where it implicitly has KeyOpt.DATE:

        checkForItemsDate(item, months, NameType.TZ_EXEMPLAR, missing, failureCount);
        checkForItemsDate(item, months, NameType.CURRENCY_SYMBOL, missing, failureCount);
        checkForItemsDate(item, months, NameType.CURRENCY, missing, failureCount);
        checkForItemsDate(item, days, NameType.VARIANT, missing, failureCount);
        checkForItemsDate(item, days, NameType.TERRITORY, missing, failureCount);
        checkForItemsDate(item, days, NameType.SCRIPT, missing, failureCount);
        checkForItemsDate(item, days, NameType.LANGUAGE, missing, failureCount);

It's testing something about months and days, for which evidently getDateKey is required.

Rather than take a lot of additional time, which might not have been justified, to figure out what this whole test accomplishes and why, I approached the refactoring as a formal problem: how can I revise this code so that it still accomplishes the same result, without depending on the magic values of the numbers -7, etc. The function has two options for how to get a key, and instead of making one of the parameters negative or positive to indicate the option, it now has a KeyOpt parameter with two possible values.

When there's time and motivation for further refactoring, a builder pattern might be employed to avoid the large number of parameters...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; we can make further improvements in later PRs now that the bulk of this is done.

However, as a double check before merging, I would generate before and after charts, then diff to see that each generated files is identical (generate both on the same day, so there are not spurious diffs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@conradarcturus it's to be an improvement of a boolean parameter like "useDate" or something, because then you call it with getDateKey(nameType, true) where true has no obvious meaning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generate before and after charts, then diff to see that each generated files is identical

I will do that before merging


public enum NameType {
NONE,
LANGUAGE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like putting this in an enum :D

However, I'm having a hard time understanding what NameType means -- can you add a longer comment about what it means? I wonder if there's a better enum name.

Copy link
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great cleanup.

@macchiati
Copy link
Member

macchiati commented Dec 18, 2024 via email

@btangmu
Copy link
Member Author

btangmu commented Dec 19, 2024

As @macchiati advised, I ran GenerateAllCharts with branch main, moved ../cldr-staging/docs/charts/47 to another location, then ran GenerateAllCharts again with branch t15830_c. The resulting ../cldr-staging/docs/charts/47 matched the version from main, comparing with diff -r -- no differences.

Some errors were reported by GenerateAllCharts both times, all similar to this:

Err: on path //supplementalData/calendarData/calendar[@type="japanese"]/eras/era[@type="106"][@start="1184-4-16"], missing territories. Probably, this tool is out of date.

I don't know what to make of that but it was the same regardless of the changes in this PR so I'm assuming it's unrelated, and I'm going ahead with merging this PR. I'll follow up with an explanatory comment on the NameType enum as @conradarcturus advised.

@btangmu btangmu merged commit 72b7e8a into unicode-org:main Dec 19, 2024
12 checks passed
@btangmu btangmu deleted the t15830_c branch December 19, 2024 20:19
@srl295
Copy link
Member

srl295 commented Dec 19, 2024

yeah, I think we talked about a ticket to reduce the manual error checking

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants