Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
CLDR-17566 converting index general p1 (#3792)
Showing 14 changed files with 677 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
CLDR DDL Subcommittee | ||
The Common Locale Data Repository (CLDR) is widely used, and the content has grown dramatically over the years with participation by organizations of all types and sizes, as well as many individual contributors. | ||
Contributors for Digitally Disadvantaged Languages (DDL) face unique challenges. The CLDR-DDL subcommittee has been formed to evaluate mechanisms to make it easier for contributors for DDLs to: | ||
become contributors to CLDR | ||
improve the coverage for their language in CLDR | ||
raise the status of their contributions, so that the CLDR data for their language is incorporated into more products. | ||
The DDL Subcommittee has been meeting every other week since June 2023. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
Unicode Extensions for BCP 47 | ||
IETF BCP 47 Tags for Identifying Languages defines the language identifiers (tags) used on the Internet and in many standards. It has an extension mechanism that allows additional information to be included. The Unicode Consortium is the maintainer of the extension 'u' for Locale Extensions, as described in RFC 6067, and the extension 't' for Transformed Content, as described in RFC 6497. | ||
The subtags available for use in the 'u' extension carry additional information needed to identify locales. They consist of a set of keys and associated values (types). For example, a locale identifier for British English with numeric collation has the following form: en-GB-u-kn-true | ||
The subtags available for use in the 't' extension carry additional information needed to identify transformed content, or to request that content be transformed in a certain way. For example, the language tag "ja-Kana-t-it" can be used as a content tag indicating Japanese Katakana transformed from Italian. It can also be used as a request for a given transformation. | ||
For more details on the valid subtags for these extensions, their syntax, and their meanings, see LDML Section 3.7 Unicode BCP 47 Extension Data. | ||
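As a rough illustration of how the 'u' and 't' extensions sit inside a tag, the following Python sketch splits a tag at its singleton subtags and then groups the 'u' subtags into keys and types. The function names `split_extensions` and `u_keywords` are hypothetical, and the sketch assumes a well-formed, non-grandfathered tag; it does no validation against LDML.

```python
def split_extensions(tag):
    """Split a tag into (base language tag, {singleton: [subtags]}).

    Assumes a well-formed, non-grandfathered BCP 47 tag. Tags are
    case-insensitive, so everything is lowercased first.
    """
    base, extensions, current = [], {}, None
    for subtag in tag.lower().split("-"):
        # A one-character subtag starts a new extension, except inside
        # private use ("x"), where everything that follows is opaque.
        if len(subtag) == 1 and current != "x":
            current = subtag
            extensions[current] = []
        elif current is None:
            base.append(subtag)
        else:
            extensions[current].append(subtag)
    return "-".join(base), extensions


def u_keywords(u_subtags):
    """Group 'u' subtags into (attributes, {key: [type subtags]}).

    Keys are two characters long; attributes and type subtags are longer.
    """
    attributes, keywords, key = [], {}, None
    for subtag in u_subtags:
        if len(subtag) == 2:
            key = subtag
            keywords[key] = []
        elif key is None:
            attributes.append(subtag)
        else:
            keywords[key].append(subtag)
    return attributes, keywords


base, ext = split_extensions("en-GB-u-kn-true")
print(base, u_keywords(ext["u"]))
print(split_extensions("ja-Kana-t-it"))
```

Splitting on singletons first keeps the two extensions independent, so the same helper handles a content tag such as "ja-Kana-t-it" without any 'u'-specific logic.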
Machine-Readable Files for Validity Testing | ||
Beginning with CLDR version 1.7.2, machine-readable files are available listing the valid attributes, keys, and types for each successive version of LDML. The most recently released version is always available at http://unicode.org/Public/cldr/latest/ in a file of the form cldr-common*.zip (in older versions the file was of the form cldr-core*.zip). Inside that file, the directory "common/bcp47/" contains the data files defining the valid attributes, keys, and types. | ||
The BCP47 data is also currently maintained in a source code repository, with each release tagged, for viewing directly without unzipping. For example, see https://github.com/unicode-org/cldr/tree/release-38/common/bcp47. The current development snapshot is found at https://github.com/unicode-org/cldr/tree/master/common/bcp47. | ||
All releases, including the latest, are listed on http://cldr.unicode.org/index/downloads, with a link to each respective data directory under the column heading Data, and direct access to the repository under the column heading GitHub Tag. | ||
For example, the timezone.xml file looks like the following: | ||
<keyword> | ||
<key name="tz" alias="timezone"> | ||
<type name="adalv" alias="Europe/Andorra"/> | ||
<type name="aedxb" alias="Asia/Dubai"/> | ||
Using this data, an implementation would determine that "fr-u-tz-adalv" and "fr-u-tz-aedxb" are both valid. Some data in the CLDR data files also requires reference to LDML for validation according to Appendix Q of LDML. For example, LDML defines the type 'codepoints' to define specific code point ranges in Unicode for specific purposes. | ||
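The validity check described above can be sketched in Python as follows. This is a minimal sketch, not a full validator: the closing `</key>` and `</keyword>` tags in `SAMPLE` are added here only to make the excerpt parse, the helper names are hypothetical, and only the simple "key-type" shape is handled. Special types such as 'codepoints', which need LDML rules, are out of scope.

```python
import xml.etree.ElementTree as ET

# Excerpt of timezone.xml from the text, closed off so that it parses.
SAMPLE = """<keyword>
  <key name="tz" alias="timezone">
    <type name="adalv" alias="Europe/Andorra"/>
    <type name="aedxb" alias="Asia/Dubai"/>
  </key>
</keyword>"""


def load_valid_types(xml_text):
    """Return {key name: set of valid type names} from bcp47-style XML."""
    root = ET.fromstring(xml_text)
    return {
        key.get("name"): {t.get("name") for t in key.iter("type")}
        for key in root.iter("key")
    }


def u_keyword_is_valid(valid, tag):
    """Check a tag whose 'u' extension is exactly one key and one type."""
    _, _, extension = tag.lower().partition("-u-")
    parts = extension.split("-")
    return len(parts) == 2 and parts[1] in valid.get(parts[0], set())


valid = load_valid_types(SAMPLE)
for tag in ("fr-u-tz-adalv", "fr-u-tz-aedxb", "fr-u-tz-zzzzz"):
    print(tag, u_keyword_is_valid(valid, tag))
```

In practice the XML text would come from the files under "common/bcp47/" rather than from an inline string.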
Version Information | ||
The following is not necessary for correct validation of the -u- extension, but may be useful for some readers. | ||
Each release has an associated data directory of the form "http://unicode.org/Public/cldr/<version>", where "<version>" is replaced by the release number. The version number for any file is given by the directory from which it was downloaded. If that information is no longer available, the version can still be determined by looking at the common/dtd/ldml.dtd file in the cldr-common*.zip file (for older versions, the core.zip file), at the cldrVersion attribute of the version element, such as the following. This information is also accessible with a validating XML parser. | ||
<!ATTLIST version cldrVersion CDATA #FIXED "1.8" > | ||
For each release after CLDR 1.8, types introduced in that release are also marked in the data files by the XML attribute "since", such as in the following example: <type name="adp" since="1.9"/> |
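Both pieces of version information can be read mechanically. The Python sketch below pulls the cldrVersion value out of DTD text with a regular expression and lists the types marked with a given "since" value. The helper names and the wrapping `<key name="example">` element are hypothetical; only the `<!ATTLIST ...>` line and the `<type name="adp" since="1.9"/>` element come from the text above.

```python
import re
import xml.etree.ElementTree as ET

DTD_LINE = '<!ATTLIST version cldrVersion CDATA #FIXED "1.8" >'

# Hypothetical wrapper around the <type> example so that it parses.
TYPES_XML = """<key name="example">
  <type name="adp" since="1.9"/>
  <type name="aed"/>
</key>"""


def cldr_version(dtd_text):
    """Return the fixed cldrVersion value declared in DTD text, or None."""
    match = re.search(
        r'<!ATTLIST\s+version\s+cldrVersion\s+CDATA\s+#FIXED\s+"([^"]+)"',
        dtd_text,
    )
    return match.group(1) if match else None


def types_since(xml_text, version):
    """Return names of <type> elements whose 'since' attribute equals version."""
    root = ET.fromstring(xml_text)
    return [t.get("name") for t in root.iter("type") if t.get("since") == version]


print(cldr_version(DTD_LINE))
print(types_since(TYPES_XML, "1.9"))
```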
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
CLDR Charts | ||
The Unicode CLDR Charts provide different ways to view the Common Locale Data Repository data. | ||
Latest - The charts for the latest release version | ||
Dev - A snapshot of data under development | ||
Previous - Previous available charts are linked from the download page in the Charts column | ||
The format of most of the fields in the charts will be clear from the Name and ID, such as the months of the year. The format for others, such as the date or time formats, is structured and requires more interpretation. For more information, see UTS #35: Locale Data Markup Language (LDML). | ||
Most charts have "double links" somewhere in each row. These are links that put the address of that row into the address bar of the browser for copying. | ||
Note that not all CLDR data is included in the charts. | ||
Version Deltas | ||
Delta Data - Data that changed in the current release. | ||
Delta DTDs - Differences between CLDR DTDs over time. | ||
Locale-Based Data | ||
Verification - Constructed data for verification: Dates, Timezones, Numbers | ||
Summary - Provides a summary view of the main locale data. Language locales (those with no territory or variant) are presented with fully resolved data; the inherited or aliased data can be hidden if desired. Other locales do not show inherited or aliased data, just the differences from the respective language locale. The English value is provided for comparison (shown as "=" if it is equal to the localized value, and n/a if not available). The Sublocales column shows variations across locales. Hovering over each Sublocale value shows a pop-up with the locales that have that value. | ||
By-Type - Provides a side-by-side comparison of data from different locales for each field. For example, one can see all the locales that are left-to-right, or all the different translations of the Arabic script across languages. Data that is unconfirmed or provisional is marked by a red-italic locale ID, such as ·bn_BD·. | ||
Character Annotations - The CLDR emoji character annotations. | ||
Subdivision Names - The (draft) CLDR subdivision names (names for states, provinces, cantons, etc.). | ||
Collation Tailorings - Collation charts (draft) for CLDR locales. | ||
Other Data | ||
Supplemental Data - General data that is not part of the locale hierarchy but is still part of CLDR. Includes: plural rules, day-period rules, language matching, language-script information, territories (countries) and their subdivisions, timezones, and so on. | ||
Transform - (Disabled temporarily) Some of the transforms in CLDR: the transliterations between different scripts. For more on transliterations, see Transliteration Guidelines. | ||
Keyboards - Provides a view of keyboard data: layouts for different locales, mappings from characters to keyboards, and from keyboards to characters. | ||
For more details on the locale data collection process, please see the CLDR process. For filing or viewing bug reports, see CLDR Bug Reports. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
CLDR Keyboard Subcommittee | ||
The CLDR Keyboard Subcommittee is developing a new cross-platform standard XML format for use by keyboard authors for inclusion in the CLDR source repository. | ||
News | ||
2024-Feb-29: The CLDR-TC has authorized the proposed specification to be released as stable (out of Technical Preview). | ||
2023-May-15: The CLDR-TC has authorized Public Review Issue #476 of the proposed specification, as a "Technical Preview." The PRI closed on 2023-Jul-15. | ||
Background | ||
CLDR (Common Locale Data Repository) | ||
Computing devices have become increasingly personal and increasingly affordable to the point that they are now within reach of most people on the planet. The diverse linguistic requirements of the world's 7+ billion people do not scale to traditional models of software development. In response to this, Unicode CLDR has emerged as a standards-based solution that empowers specialist and community input, as a means of balancing the needs of language communities with the technologies of major platform and service providers. | ||
The challenge and promise of Keyboards | ||
Text input is a core component of most computing experiences and is most commonly achieved using a keyboard, whether hardware or virtual (on-screen or touch). However, keyboard support for most of the world's languages is either completely missing or often does not adequately support the input needs of language communities. Improving text input support for minority languages is an essential part of the Unicode mission. | ||
Keyboard data is currently completely platform-specific. Consequently, language communities and other keyboard authors must see their designs developed independently for every platform/operating system, resulting in unnecessary duplication of technical and organizational effort. | ||
There is no central repository or contact point for this data, meaning that such authors must separately and independently contact all platform/operating system developers. | ||
LDML: The universal interchange format for keyboards | ||
The CLDR Keyboard Subcommittee is currently rewriting and redeveloping the existing LDML (XML) definition for keyboards (UTS #35 Part 7) in order to define core keyboard-based text input requirements for the world's languages. This format allows the physical and virtual (on-screen or touch) keyboard layouts for a language to be defined in a single file. Input Method Editors (IMEs) and other input methods are not currently in scope for this format. | ||
CLDR: A home for the world's newest keyboards | ||
Today, there are many existing platform-specific implementations and keyboard definitions. This project does not intend to remove or replace existing well-established support. | ||
The goal of this project is that, where otherwise unsupported languages are concerned, CLDR becomes the common source for keyboard data, for use by platform/operating system developers and vendors. | ||
As a result, CLDR will also become the point of contact for keyboard authors and language communities to submit new or updated keyboard layouts to serve those user communities. CLDR has already become the definitive and publicly available source for the world's locale data. | ||
Unicode: Enabling the world's languages | ||
Keyboard support is part of a multi-step, often multi-year process of enabling a new language or script. | ||
Three critical parts of initial support for a language in content are: | ||
Encoding, in the Unicode Standard | ||
Display, including fonts and text layout | ||
Input | ||
Today, the vast majority of the languages of the world are already in the Unicode encoding. The open-source Noto font family provides a wide range of fonts to support display, and the Unicode character properties play a vital role in display. However, input support often lags many years behind the addition of a script to Unicode. | ||
The LDML keyboard format, and the CLDR repository, will make it much easier to deliver text input. | ||
Common Questions | ||
What is the history of this effort? | ||
In 2012, the original LDML keyboard format was designed to describe keyboards for comparative purposes. In 2018, a PRI was created soliciting further feedback. | ||
The CLDR Keyboard Subcommittee was formed and has been meeting since mid-2020. It quickly became apparent that the existing LDML format was insufficient for implementing new keyboard layouts. | ||
What is the current status? | ||
Release | ||
Updates to LDML (UTS#35) Part 7: Keyboards are scheduled to be released as part of CLDR v45. | ||
Implementations | ||
The SIL Keyman project is actively working on an open-source implementation of the LDML format. | ||
How can I get involved? | ||
If you want to be engaged in this workgroup, please contact the CLDR Keyboard Subcommittee via the Unicode contact form. |