CLDR-17566 converting index general p1 (#3792)
chpy04 authored Jun 7, 2024
1 parent 3a79293 commit c5c2381
Showing 14 changed files with 677 additions and 0 deletions.
7 changes: 7 additions & 0 deletions docs/site/TEMP-TEXT-FILES/ddl.txt
@@ -0,0 +1,7 @@
CLDR DDL Subcommittee
The Common Locale Data Repository (CLDR) is widely used, and the content has grown dramatically over the years with participation by organizations of all types and sizes, as well as many individual contributors.
Contributors for Digitally Disadvantaged Languages (DDL) face unique challenges. The CLDR-DDL subcommittee has been formed to evaluate mechanisms to make it easier for contributors for DDLs to:
become contributors to CLDR
improve the coverage for their language in CLDR
raise the status of their contributions, so that the CLDR data for their language is incorporated into more products.
The DDL Subcommittee has been meeting every other week since June 2023.
20 changes: 20 additions & 0 deletions docs/site/TEMP-TEXT-FILES/index-bcp47-extension.txt
@@ -0,0 +1,20 @@
Unicode Extensions for BCP 47
IETF BCP 47, Tags for Identifying Languages, defines the language identifiers (tags) used on the Internet and in many standards. It has an extension mechanism that allows additional information to be included. The Unicode Consortium maintains the extension 'u' for Locale Extensions, as described in RFC 6067, and the extension 't' for Transformed Content, as described in RFC 6497.
The subtags available for use in the 'u' extension carry additional information needed for identifying locales. The 'u' subtags consist of a set of keys and associated values (types). For example, a locale identifier for British English with numeric collation has the following form: en-GB-u-kn-true
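The key/type structure described above can be illustrated with a short sketch. This is a simplified reading, not a conformant BCP 47 parser: it assumes the tag is already well formed, treats two-character subtags after the 'u' singleton as keys, and treats longer subtags as either attributes (before the first key) or type subtags (after a key). The function name parse_u_extension is hypothetical.

```python
def parse_u_extension(tag):
    """Return (attributes, {key: [type subtags]}) for the 'u' extension of a tag.

    Simplified sketch: assumes a well-formed tag and does no validation.
    """
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return [], {}
    attributes, keywords = [], {}
    current_key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            # Another singleton starts a different extension; stop here.
            break
        if len(subtag) == 2:
            # Two-character subtags are keys.
            current_key = subtag
            keywords[current_key] = []
        elif current_key is None:
            # Longer subtags before any key are attributes.
            attributes.append(subtag)
        else:
            # Longer subtags after a key belong to that key's type.
            keywords[current_key].append(subtag)
    return attributes, keywords


print(parse_u_extension("en-GB-u-kn-true"))  # ([], {'kn': ['true']})
```

Whether a given key or type is actually valid is a separate question, answered by the data files described later in this document.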
The subtags available for use in the 't' extension carry additional information needed for identifying transformed content, or for requesting that content be transformed in a certain way. For example, the language tag "ja-Kana-t-it" can be used as a content tag indicating Japanese Katakana transformed from Italian. It can also be used as a request for a given transformation.
For more details on the valid subtags for these extensions, their syntax, and their meanings, see LDML Section 3.7 Unicode BCP 47 Extension Data.
Machine-Readable Files for Validity Testing
Beginning with CLDR version 1.7.2, machine-readable files are available listing the valid attributes, keys, and types for each successive version of LDML. The most recently released version is always available at http://unicode.org/Public/cldr/latest/ in a file of the form cldr-common*.zip (in older versions the file was of the form cldr-core*.zip). Inside that file, the directory "common/bcp47/" contains the data files defining the valid attributes, keys, and types.
The BCP47 data is also currently maintained in a source code repository, with each release tagged, for viewing directly without unzipping. For example, see https://github.com/unicode-org/cldr/tree/release-38/common/bcp47. The current development snapshot is found at https://github.com/unicode-org/cldr/tree/master/common/bcp47.
All releases including the latest are listed on http://cldr.unicode.org/index/downloads, with a link to each respective data directory under the column heading Data, and direct access to the repository under the GitHub Tag.
For example, the timezone.xml file looks like the following:
<keyword>
    <key name="tz" alias="timezone">
        <type name="adalv" alias="Europe/Andorra"/>
        <type name="aedxb" alias="Asia/Dubai"/>
        <!-- further types omitted -->
    </key>
</keyword>
Using this data, an implementation would determine that "fr-u-tz-adalv" and "fr-u-tz-aedxb" are both valid. Some data in the CLDR data files also requires reference to LDML for validation according to Appendix Q of LDML. For example, LDML defines the type 'codepoints' to define specific code point ranges in Unicode for specific purposes.
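As a rough illustration of that validity check, the sketch below loads key and type names from an XML fragment shaped like the timezone.xml excerpt and tests a tag against them. It is a minimal sketch under stated assumptions: the sample is the excerpt with closing tags added, the function names are hypothetical, and only tags with a single key followed by a single type subtag are handled. Special types such as 'codepoints', which require reference to LDML, are not covered.

```python
import xml.etree.ElementTree as ET

# Excerpt modeled on timezone.xml; the real file contains many more types.
SAMPLE = """
<keyword>
  <key name="tz" alias="timezone">
    <type name="adalv" alias="Europe/Andorra"/>
    <type name="aedxb" alias="Asia/Dubai"/>
  </key>
</keyword>
"""


def load_valid_types(xml_text):
    """Map each key name to the set of type names listed under it."""
    root = ET.fromstring(xml_text.strip())
    return {
        key.get("name"): {t.get("name") for t in key.iter("type")}
        for key in root.iter("key")
    }


def is_valid_u_keyword(tag, valid):
    """Check a tag whose 'u' extension is exactly one key plus one type subtag."""
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return False
    rest = subtags[subtags.index("u") + 1:]
    if len(rest) != 2:
        return False  # anything more complex is out of scope for this sketch
    key, type_name = rest
    return type_name in valid.get(key, set())


valid = load_valid_types(SAMPLE)
print(is_valid_u_keyword("fr-u-tz-adalv", valid))  # True
print(is_valid_u_keyword("fr-u-tz-zzzzz", valid))  # False
```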
Version Information
The following is not necessary for correct validation of the -u- extension, but may be useful for some readers.
Each release has an associated data directory of the form "http://unicode.org/Public/cldr/<version>", where "<version>" is replaced by the release number. The version number for any file is given by the directory from which it was downloaded. If that information is no longer available, the version can still be found in the common/dtd/ldml.dtd file in the cldr-common*.zip file (for older versions, the core.zip file), in the cldrVersion attribute of the version element, as in the following line. This information is also accessible with a validating XML parser.
<!ATTLIST version cldrVersion CDATA #FIXED "1.8" >
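A declaration of that shape can be read without an XML parser. The following is a minimal sketch that assumes the DTD text contains a line formatted like the one above; the function name cldr_version_from_dtd is hypothetical.

```python
import re

DTD_LINE = '<!ATTLIST version cldrVersion CDATA #FIXED "1.8" >'

# Matches the fixed default value of the cldrVersion attribute declaration.
CLDR_VERSION_RE = re.compile(
    r'<!ATTLIST\s+version\s+cldrVersion\s+CDATA\s+#FIXED\s+"([^"]+)"'
)


def cldr_version_from_dtd(dtd_text):
    """Return the cldrVersion value declared in the DTD text, or None if absent."""
    match = CLDR_VERSION_RE.search(dtd_text)
    return match.group(1) if match else None


print(cldr_version_from_dtd(DTD_LINE))  # 1.8
```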
For each release after CLDR 1.8, types introduced in that release are also marked in the data files by the XML attribute "since", such as in the following example: <type name="adp" since="1.9"/>
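The "since" attribute makes it possible to work out which types were available in a given release. The sketch below is illustrative only: the enclosing key element and the second type are made-up sample data, the function names are hypothetical, and version strings are assumed to be dot-separated integers. Types without a "since" attribute are treated as available in every release, on the reading that they predate the marking.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the "adp" line comes from the text above.
SAMPLE = """
<key name="cu">
  <type name="adp" since="1.9"/>
  <type name="aed"/>
</key>
"""


def version_tuple(text):
    """Turn a version such as "1.9" into (1, 9) for numeric comparison."""
    return tuple(int(part) for part in text.split("."))


def types_available_in(xml_text, release):
    """List type names whose "since" value is absent or not later than release."""
    root = ET.fromstring(xml_text.strip())
    target = version_tuple(release)
    names = []
    for type_element in root.iter("type"):
        since = type_element.get("since")
        if since is None or version_tuple(since) <= target:
            names.append(type_element.get("name"))
    return names


print(types_available_in(SAMPLE, "1.8"))  # ['aed']
print(types_available_in(SAMPLE, "1.9"))  # ['adp', 'aed']
```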
23 changes: 23 additions & 0 deletions docs/site/TEMP-TEXT-FILES/index-charts.txt
@@ -0,0 +1,23 @@
CLDR Charts
The Unicode CLDR Charts provide different ways to view the Common Locale Data Repository data.
Latest - The charts for the latest release version
Dev - A snapshot of data under development
Previous - Previous available charts are linked from the download page in the Charts column
The format of most of the fields in the charts will be clear from the Name and ID, such as the months of the year. The format for others, such as the date or time formats, is structured and requires more interpretation. For more information, see UTS #35: Locale Data Markup Language (LDML).
Most charts have "double links" somewhere in each row. These are links that put the address of that row into the address bar of the browser for copying.
Note that not all CLDR data is included in the charts.
Version Deltas
Delta Data - Data that changed in the current release.
Delta DTDs - Differences between CLDR DTDs over time.
Locale-Based Data
Verification - Constructed data for verification: Dates, Timezones, Numbers
Summary - Provides a summary view of the main locale data. Language locales (those with no territory or variant) are presented with fully resolved data; the inherited or aliased data can be hidden if desired. Other locales do not show inherited or aliased data, just the differences from the respective language locale. The English value is provided for comparison (shown as "=" if it is equal to the localized value, and n/a if not available). The Sublocales column shows variations across locales. Hovering over each Sublocale value shows a pop-up with the locales that have that value.
By-Type - Provides a side-by-side comparison of data from different locales for each field. For example, one can see all the locales that are left-to-right, or all the different translations of the Arabic script across languages. Data that is unconfirmed or provisional is marked by a red-italic locale ID, such as ·bn_BD·.
Character Annotations - The CLDR emoji character annotations.
Subdivision Names - The (draft) CLDR subdivision names (names for states, provinces, cantons, etc.).
Collation Tailorings - Collation charts (draft) for CLDR locales.
Other Data
Supplemental Data - General data that is not part of the locale hierarchy but is still part of CLDR. Includes: plural rules, day-period rules, language matching, language-script information, territories (countries) and their subdivisions, timezones, and so on.
Transform - (Disabled temporarily) Some of the transforms in CLDR: the transliterations between different scripts. For more on transliterations, see Transliteration Guidelines.
Keyboards - Provides a view of keyboard data: layouts for different locales, mappings from characters to keyboards, and from keyboards to characters.
For more details on the locale data collection process, please see the CLDR process. For filing or viewing bug reports, see CLDR Bug Reports.
37 changes: 37 additions & 0 deletions docs/site/TEMP-TEXT-FILES/index-keyboard-workgroup.txt
@@ -0,0 +1,37 @@
CLDR Keyboard Subcommittee
The CLDR Keyboard Subcommittee is developing a new cross-platform standard XML format for use by keyboard authors for inclusion in the CLDR source repository.
News
2023-Feb-29: The CLDR-TC has authorized the proposed specification to be released as stable (out of Technical Preview).
2023-May-15: The CLDR-TC has authorized Public Review Issue #476 of the proposed specification, as a "Technical Preview." The PRI closed on 2023-Jul-15.
Background
CLDR (Common Locale Data Repository)
Computing devices have become increasingly personal and increasingly affordable to the point that they are now within reach of most people on the planet. The diverse linguistic requirements of the world's 7+ billion people do not scale to traditional models of software development. In response to this, Unicode CLDR has emerged as a standards-based solution that empowers specialist and community input, as a means of balancing the needs of language communities with the technologies of major platform and service providers.
The challenge and promise of Keyboards
Text input is a core component of most computing experiences and is most commonly achieved using a keyboard, whether hardware or virtual (on-screen or touch). However, keyboard support for most of the world's languages is either completely missing or often does not adequately support the input needs of language communities. Improving text input support for minority languages is an essential part of the Unicode mission.
Keyboard data is currently completely platform-specific. Consequently, language communities and other keyboard authors must see their designs developed independently for every platform/operating system, resulting in unnecessary duplication of technical and organizational effort.
There is no central repository or contact point for this data, meaning that such authors must separately and independently contact all platform/operating system developers.
LDML: The universal interchange format for keyboards
The CLDR Keyboard Subcommittee is currently rewriting and redeveloping the existing LDML (XML) definition for keyboards (UTS#35 part 7) in order to define core keyboard-based text input requirements for the world's languages. This format allows the physical and virtual (on-screen or touch) keyboard layouts for a language to be defined in a single file. Input Method Editors (IME) or other input methods are not currently in scope for this format.
CLDR: A home for the world's newest keyboards
Today, there are many existing platform-specific implementations and keyboard definitions. This project does not intend to remove or replace existing well-established support.
The goal of this project is that, where otherwise unsupported languages are concerned, CLDR becomes the common source for keyboard data, for use by platform/operating system developers and vendors.
As a result, CLDR will also become the point of contact for keyboard authors and language communities to submit new or updated keyboard layouts to serve those user communities. CLDR has already become the definitive and publicly available source for the world's locale data.
Unicode: Enabling the world's languages
Keyboard support is part of a multi-step, often multi-year process of enabling a new language or script.
Three critical parts of initial support for a language in content are:
Encoding, in the Unicode Standard
Display, including fonts and text layout
Input
Today, the vast majority of the world's languages can already be encoded in Unicode. The open-source Noto font family supports display for a wide range of scripts, and the Unicode character properties play a vital role in display. However, input support often lags many years behind the addition of a script to Unicode.
The LDML keyboard format, and the CLDR repository, will make it much easier to deliver text input.
Common Questions
What is the history of this effort?
In 2012, the original LDML keyboard format was designed to describe keyboards for comparative purposes. In 2018, a PRI was created soliciting further feedback.
The CLDR Keyboard Subcommittee was formed and has been meeting since mid-2020. It quickly became apparent that the existing LDML format was insufficient for implementing new keyboard layouts.
What is the current status?
Release
Updates to LDML (UTS#35) Part 7: Keyboards are scheduled to be released as part of CLDR v45.
Implementations
The SIL Keyman project is actively working on an open-source implementation of the LDML format.
How can I get involved?
If you want to be engaged in this workgroup, please contact the CLDR Keyboard Subcommittee via the Unicode contact form.