| Version | 48.2 |
|---|---|
| Editors | Mark Davis (markdavis@google.com) and other CLDR committee members |
| Date | 2026-03-03 |
| This Version | https://www.unicode.org/reports/tr35/tr35-78/tr35.html |
| Previous Version | https://www.unicode.org/reports/tr35/tr35-77/tr35.html |
| Latest Version | https://www.unicode.org/reports/tr35/ |
| Corrigenda | https://cldr.unicode.org/index/corrigenda |
| Latest Proposed Update | https://www.unicode.org/reports/tr35/proposed.html |
| Namespace | https://www.unicode.org/cldr/ |
| DTDs | https://www.unicode.org/cldr/dtd/48/ |
| Change History | Modifications |
This document describes an XML format (vocabulary) for the exchange of structured locale data. This format is used in the Unicode Common Locale Data Repository.
This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the CLDR bug reporting form [Bugs]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For more information see About Unicode Technical Reports and the Specifications FAQ. Unicode Technical Reports are governed by the Unicode Terms of Use.
The LDML specification is divided into the following parts:
Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system can be built of many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging the gaps; in the internationalization arena, Unicode has provided a lingua franca for communicating textual data. However, there remain differences in the locale data used by different systems.
The best practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting can take place on any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine does the formatting. The same goes for parsing data, and locale-sensitive analysis of data.
But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many of those differences are simply gratuitous; all within acceptable limits for human beings, but yielding different results. In many other cases there are outright errors. Whatever the cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of collation (sort-order), where different collation causes not only ordering differences, but also different results of queries! That is, with a query of customers with names between "Abbot, Cosmo" and "Arnold, James", if different systems have different sort orders, different lists will be returned. (For comparisons across systems formatted as HTML tables, see [Comparisons].)
Note: There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale.
This document specifies an XML format for the communication of locale data: the Unicode Locale Data Markup Language (LDML). This provides a common format for systems to interchange locale data so that they can get the same results in the services provided by internationalization libraries. It also provides a standard format that can allow users to customize the behavior of a system. With it, for example, collation (sorting) rules can be exchanged, allowing two implementations to exchange a specification of tailored collation rules. Using the same specification, the two implementations will achieve the same results in comparing strings. Unicode LDML can also be used to let a user encapsulate specialized sorting behavior for a specific domain, or create a customized locale for a minority language. Unicode LDML is also used in the Unicode Common Locale Data Repository (CLDR). CLDR uses an open process for reconciling differences between the locale data used on different systems and validating the data, to produce a useful, common, consistent base of locale data.
For more information, see the Common Locale Data Repository project page [LocaleProject].
As LDML is an interchange format, it was designed for ease of maintenance and simplicity of transformation into other formats, above efficiency of run-time lookup and use. Implementations should consider converting LDML data into a more compact format prior to use.
There are many ways to use the Unicode LDML specification and the CLDR data. The Unicode Consortium does not restrict the ways in which the format or data are used. However, an implementation may also claim conformance to the LDML specification and/or to CLDR data, as follows:
UAX35-C1. An implementation that claims conformance to this specification shall:
alt data//ldml/numbers/symbols/group an implementation could use alt="official" data.
An implementation may also make a general claim of conformance to the LDML specification and/or CLDR data. Such a claim is understood to claim conformance to all portions of this specification that are relevant to the operations performed by the implementation, except for those specifically declared as exceptions. For example, if an implementation making a general claim of conformance performs date formatting, and does not declare date formatting as an exception, it is understood to be claiming conformance to date formatting as described in the section listed below.
UAX35-C2. An implementation that claims conformance to Unicode locale or language identifiers shall:
1. Specify whether Unicode locale extensions are allowed
2. Specify the canonical form used for identifiers in terms of casing and field separator characters.
External specifications may also reference particular components of Unicode locale or language identifiers, such as:
> Field X can contain any Unicode region subtag values as given in Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML), excluding grouping codes.
NOTE: UAX35-C2 is replaced by the following generalization.
The following lists the high-level sections with structures and/or processing algorithms. Conformance to a particular section may reference and require conformance to another section.
| Sections | Topics |
|---|---|
| Unicode Locale Identifier | identifier syntax, interpretation, and validity |
| Annex C. LocaleId Canonicalization | canonicalize |
| CLDR to BCP 47, BCP 47 to CLDR | convert |
| Language Identifier Field Definitions | interpretation and validity of -u key-value pairs |
| Locale Display Name Algorithm | locale display names |
| Sections | Topics |
|---|---|
| Locale Inheritance and Matching | locale inheritance |
| Likely Subtags | likely subtags |
| Language Matching | locale matching |
| Sections | Topics |
|---|---|
| Unit Identifiers | unit identifier syntax, interpretation, and validity |
| Unit Identifier Normalization | identifier normalization |
| Unit Conversion | unit conversion |
| Unit Preferences | evaluation of user preferences |
| Unit Identifier Uniqueness | converting units into BCP47 format |
| Compound Units | unit display names |
| Sections | Topics |
|---|---|
| Number Format Patterns | number format patterns, syntax and interpretation |
| Compact Number Formats | compact number formats |
| Rule-Based Number Formatting | spell-out number formatting |
| Sections | Topics |
|---|---|
| Elements availableFormats, appendItems | date formatting, patterns |
| Date Format Patterns | date format patterns and symbols |
| Using Time Zone Names | timezone forms, fallback and parsing |
| Sections | Topics |
|---|---|
| Root Collation | Root collation syntax and structure |
| Collation Tailorings | Rule syntax and interpretation for language-specific ordering |
| Sections | Topics |
|---|---|
| Grammatical Features | noun classes (except for plurals) |
| Language Plural Rules | plural and ordinal category rules, ranges |
| Sections | Topics |
|---|---|
| Unicode Sets | Unicode set syntax and interpretation |
| String Range | string-range syntax and interpretation |
| Transforms | transform identifier and rule syntax and interpretation |
| Segmentations | segmentation customizations |
| Synthesizing Sequence Names | constructing derived emoji names |
| Formatting Process | person name formatting |
| Part 7: Keyboards | keyboard structure and interpretation |
| Conformance (Message Format) | message formatting |
Conformant implementations cannot modify CLDR structures, such as the syntax or interpretation of locale identifiers.
There are usually mechanisms for implementations to customize these to a certain extent, using what are known as private-use codes.
For example, an implementation could use the private-use language code qfz to mean a language that was not covered by BCP 47,
or use a private-use extension in a Unicode locale identifier, or use a private-use unit such as xxx-smoot-per-second.
An implementation may also use a deprecated code instead of the corresponding preferred code.
For example, the most frequent case of this is with an implementation whose earlier versions predated BCP 47, and used iw for Hebrew,
rather than the BCP 47 (and CLDR) code he.
When this is done, the CLDR data needs to be modified in appropriate places, not just in some file names.
For example, the languageAlias data requires modification, from:
<languageAlias type="iw" replacement="he" reason="deprecated"/> <!-- Hebrew -->
to
<languageAlias type="he" replacement="iw" reason="deprecated"/> <!-- Hebrew -->
Minimized locale identifiers are also not required. For example, an implementation could consistently expand locale identifiers to include regions, such as en → en_DE or de → de-AT.
Implementations may customize CLDR data, as long as they declare that they are doing so. This may include:
An implementation may dispense with locale data for locales that it does not support, or for locales it does support, dispense with data that is at CoverageLevel=Comprehensive, or dispense with particular sorts of data, such as annotations for emoji.
An implementation could add data for a locale that CLDR does not yet support, or add higher-coverage data for a locale than what CLDR has.
CLDR has a mechanism for overriding data using the alt mechanism.
At build time, an implementation could override the default value by using an alt value.
For example, take the following data:
<territory type="HK">Sonderverwaltungsregion Hongkong</territory>
<territory type="HK" alt="short">Hongkong</territory>
An implementation could, at build time, substitute the short value for the regular value, getting "Hongkong". It could instead support both values at runtime, using display option settings to pick between the regular value and the short value.
Implementations can override the data in other ways as well, such as changing the spelling of a particular value.
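As a minimal sketch of the build-time substitution described above, the following Python reads the two sample territory elements and prefers the alt="short" value when one exists. The wrapper element territories, the function name prefer_alt, and the use of xml.etree are assumptions made for illustration; they are not part of CLDR tooling.

```python
import xml.etree.ElementTree as ET

# The two <territory> lines are the sample data from the text; the
# <territories> wrapper is added here only so the snippet parses on its own.
SAMPLE = """<territories>
  <territory type="HK">Sonderverwaltungsregion Hongkong</territory>
  <territory type="HK" alt="short">Hongkong</territory>
</territories>"""


def prefer_alt(root, alt=None):
    """Return {region code: display name}, preferring the given alt variant."""
    names = {}
    for element in root.iter("territory"):
        code, variant = element.get("type"), element.get("alt")
        if variant is None:
            # Regular value: keep it unless an alt value already replaced it.
            names.setdefault(code, element.text)
        elif variant == alt:
            # Requested alt value overrides the regular value.
            names[code] = element.text
    return names


root = ET.fromstring(SAMPLE)
default_names = prefer_alt(root)              # regular values only
short_names = prefer_alt(root, alt="short")   # build-time override
```

A runtime variant of the same idea would keep both dictionaries and choose between them from a display option, rather than discarding the regular value at build time.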
The files in testData can be used to test conformance.
Brief instructions for use are supplied in _readme.txt files in the different directories and/or in the headers of the files in question.
For example, the following is from a sample header:
# Format:
# <source locale identifier> ; <expected canonicalized locale identifier>
#
# The data lines are divided into 4 sets:
# explicit: a short list of explicit test cases.
# fromAliases: test cases generated from the alias data.
# decanonicalized: test cases generated by reversing the normalization process.
# withIrrelevants: test cases generated from the others by adding irrelevant fields where possible,
# to ensure that the canonicalization implementation is not sensitive to irrelevant fields. These include:
# Language: aaa
# Script: Adlm
# Region: AC
# Variant: fonipa
If an implementation overrides CLDR data, then various lines in the relevant test files may need to be modified correspondingly, or skipped.
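A test harness for files in this format could look roughly like the sketch below. It assumes only the line format shown in the sample header (source ; expected, with # starting a comment). The function names, the skip parameter, the sample data, and the toy canonicalizer are hypothetical and are not taken from the CLDR test files.

```python
def parse_cases(text):
    """Yield (source, expected) pairs from '<source> ; <expected>' lines."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if ";" not in line:
            continue
        source, expected = (field.strip() for field in line.split(";", 1))
        yield source, expected


def failures(text, canonicalize, skip=()):
    """Return (source, expected, actual) for every case that does not match."""
    result = []
    for source, expected in parse_cases(text):
        if source in skip:                    # cases affected by local overrides
            continue
        actual = canonicalize(source)
        if actual != expected:
            result.append((source, expected, actual))
    return result


# Illustrative data only; not copied from the CLDR test files.
SAMPLE = """\
# <source locale identifier> ; <expected canonicalized locale identifier>
iw ; he
en ; en
"""


def toy_canonicalize(tag):
    """Hypothetical stand-in for a real canonicalizer."""
    return {"iw": "he"}.get(tag, tag)
```

The skip parameter reflects the point made above: an implementation that deliberately keeps iw, for example, would list that source identifier as skipped, or edit the expected value instead.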
The EBNF syntax used in LDML is a variant of the Extended Backus-Naur Form (EBNF) notation used in W3C XML Notation. The main differences are:
- digit{3} for 3 digits, digit{3,5} for 3 to 5 digits, and digit{3,} for 3 or more digits.
- [A-Z a-z] is the same as [A-Za-z]
- \x20 is the same as #x20 and [\&\-] is the same as [#x26#x2D]

In the text, this is sometimes referred to as "EBNF (Perl-based)".
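For readers translating these productions into a regular-expression engine, the sketch below shows how the notations above map onto Python's re module. It is an illustration, not a full EBNF translator, and the helper name to_python_class is hypothetical. The one real difference is whitespace inside a character class, which Python treats as a literal member.

```python
import re

# Bounded repetition uses the same syntax in Python's re module.
three = re.compile(r"[0-9]{3}")
three_to_five = re.compile(r"[0-9]{3,5}")
three_or_more = re.compile(r"[0-9]{3,}")


def to_python_class(ebnf_class):
    """Drop the insignificant spaces that the LDML notation allows in a class."""
    return ebnf_class.replace(" ", "")


# [A-Z a-z] in the LDML notation becomes [A-Za-z] in Python.
alpha = re.compile(to_python_class("[A-Z a-z]"))

# Backslash escapes carry over unchanged.
space = re.compile(r"\x20")
amp_or_hyphen = re.compile(r"[\&\-]")
```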
Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use data in LDML, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses.
The first issue is basic: what is a locale? In this model, a locale is an identifier (id) that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for time zones, languages, countries (regions), and scripts. The data can also include support for text boundaries (character, word, line, and sentence), text transformations (including transliterations), and other services.
Locale data is not cast in stone: the data used on someone's machine generally may reflect the US format, for example, but preferences can typically be set to override particular items, such as setting the date format for 2002.03.15, or using metric or Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may want to also remember the user's time zone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher, and so on), music preference, religion, party affiliation, favorite charity, and so on.
Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go; committees impose new standards; bugs are found and fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.
In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking, and so on). The format in this document does not attempt to represent all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common use in systems and internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences according to different countries or regions. However, the line between locales and languages, as commonly used in the industry, is rather fuzzy. Note also that the vast majority of the locale data in CLDR is in fact language data; all non-linguistic data is separated out into a separate tree. For more information, see Language and Locale IDs.
We will speak of data as being "in locale X". That does not imply that a locale is a collection of data; it is simply shorthand for "the set of data associated with the locale id X". Each individual piece of data is called a resource or field, and a tag indicating the key of the resource is called a resource tag.
Unicode LDML uses stable identifiers based on [BCP47] for distinguishing among languages, locales, regions, currencies, time zones, transforms, and so on. There are many systems for identifiers for these entities. The Unicode LDML identifiers may not match the identifiers used on a particular target system. If so, some process of identifier translation may be required when using LDML data.
The BCP 47 extensions (-u- and -t-) are described in Unicode BCP 47 U Extension and Unicode BCP 47 T Extension.
A Unicode language identifier has the following structure (provided in EBNF (Perl-based)). The following table defines syntactically well-formed identifiers: they are not necessarily valid identifiers. For additional validity criteria, see the links on the right.
| EBNF | | Validity / Comments |
|---|---|---|
| unicode_language_id | = "root" \| (((unicode_language_subtag (sep unicode_script_subtag)?) \| unicode_script_subtag) (sep unicode_region_subtag)? (sep unicode_variant_subtag)*) ; | "root" is treated as a special unicode_language_subtag |
| unicode_language_subtag | = alpha{2,3} \| alpha{5,8} ; | validity latest-data |
| unicode_script_subtag | = alpha{4} ; | validity latest-data |
| unicode_region_subtag | = (alpha{2} \| digit{3}) ; | validity latest-data |
| unicode_variant_subtag | = (alphanum{5,8} \| digit alphanum{3}) ; | validity latest-data |
| sep | = [-_] ; | |
| digit | = [0-9] ; | |
| alpha | = [A-Z a-z] ; | |
| alphanum | = [0-9 A-Z a-z] ; | |
The following is an additional well-formedness constraint:
The semantics of the various subtags is explained in Language Identifier Field Definitions; there are also direct links from unicode_language_subtag, etc. While theoretically the unicode_language_subtag may have more than 3 letters through the IANA registration process, in practice that has not occurred. The unicode_language_subtag "und" may be omitted when there is a unicode_script_subtag; for that reason unicode_language_subtag values with 4 letters are not permitted. However, such unicode_language_id values are not intended for general interchange, because they are not valid BCP 47 tags. Instead, they are intended for certain protocols such as the identification of transliterators or font ScriptLangTag values. For more information on language subtags with 4 letters, see BCP 47 Language Tag to Unicode BCP 47 Locale Identifier.
For example, "en-US" (American English), "en_GB" (British English), "es-419" (Latin American Spanish), and "uz-Cyrl" (Uzbek in Cyrillic) are all valid Unicode language identifiers.
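The subtag productions above translate fairly directly into a regular expression. The sketch below checks syntactic well-formedness only; it does not check validity against CLDR data or any additional constraints. The constant and function names are hypothetical. The overall shape of unicode_language_id (an optional script, region, and any number of variants after the language subtag, or a script subtag in initial position) and the second variant alternative (a digit followed by three alphanumerics, as in BCP 47) are assumptions stated here rather than taken from an authoritative implementation.

```python
import re

ALPHA = "[A-Za-z]"
ALPHANUM = "[0-9A-Za-z]"
SEP = "[-_]"

LANGUAGE = f"(?:{ALPHA}{{2,3}}|{ALPHA}{{5,8}})"
SCRIPT = f"{ALPHA}{{4}}"
REGION = f"(?:{ALPHA}{{2}}|[0-9]{{3}})"
VARIANT = f"(?:{ALPHANUM}{{5,8}}|[0-9]{ALPHANUM}{{3}})"

# "root", or (language [sep script] | script) [sep region] (sep variant)*
LANGUAGE_ID = re.compile(
    "(?:root|"
    f"(?:{LANGUAGE}(?:{SEP}{SCRIPT})?|{SCRIPT})"
    f"(?:{SEP}{REGION})?"
    f"(?:{SEP}{VARIANT})*"
    ")"
)


def is_well_formed_language_id(tag):
    """True if the whole string matches the unicode_language_id syntax."""
    return LANGUAGE_ID.fullmatch(tag) is not None
```

With this sketch, the four example identifiers above are accepted, while a string such as "en-USA" is rejected because "USA" fits none of the script, region, or variant shapes.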
A Unicode locale identifier is composed of a Unicode language identifier plus (optional) locale extensions. It has the following structure. The semantics of the U and T extensions are explained in Unicode BCP 47 U Extension and Unicode BCP 47 T Extension. Other extensions and private use extensions are supported for pass-through. The following table defines syntactically well-formed identifiers: they are not necessarily valid identifiers. For additional validity criteria, see the links on the right.
| EBNF | | Validity / Comments |
|---|---|---|
| unicode_locale_id | = unicode_language_id extensions* pu_extensions? ; | |
| extensions | = unicode_locale_extensions \| transformed_extensions \| other_extensions ; | |
| unicode_locale_extensions | = sep [uU] ((sep keyword)+ \| (sep uattribute)+ (sep ufield)*) ; | |
| transformed_extensions | = sep [tT] ((sep tlang (sep tfield)*) \| (sep tfield)+) ; | |
| pu_extensions | = sep [xX] (sep alphanum{1,8})+ ; | |
| other_extensions | = sep [alphanum-[tTuUxX]] (sep alphanum{2,8})+ ; | |
| ufield (Also known as keyword) | = ukey (sep uvalue)? ; | |
| ukey (Also known as key) | = alphanum alpha ; | validity latest-data (Note that this is narrower than in [RFC6067], so that it is disjoint with tkey.) |
| uvalue (Also known as type) | = alphanum{3,8} (sep alphanum{3,8})* ; | validity latest-data |
| uattribute (Also known as attribute) | = alphanum{3,8} ; | |
| unicode_subdivision_id | = unicode_region_subtag unicode_subdivision_suffix ; | validity latest-data |
| unicode_subdivision_suffix | = alphanum{1,4} ; | |
| unicode_measure_unit | = alphanum{3,8} (sep alphanum{3,8})* ; | validity latest-data |
| tlang | = unicode_language_subtag (sep unicode_script_subtag)? (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ; | same as in unicode_language_id |
| tfield | = tkey tvalue ; | validity latest-data |
| tkey | = alpha digit ; | |
| tvalue | = (sep alphanum{3,8})+ ; | |
The following are additional well-formedness constraints:
For historical reasons, this is called a Unicode locale identifier. However, it also functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see Language and Locale IDs.
As of the release of this specification, there were no other_extensions defined. The other_extensions are present in the syntax to allow implementations to preserve that information.
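Because every subtag in a unicode_language_id has at least two characters, the first single-character subtag marks the start of the extensions, and an x singleton claims everything after it as private use. The sketch below uses that observation to split an identifier into its language identifier, its extensions keyed by singleton, and its private-use subtags, so that unrecognized extensions can be passed through untouched. The function name and the return shape are hypothetical, and the input is assumed to be well-formed.

```python
import re


def split_locale_id(locale_id):
    """Split into (language id, {singleton: subtags}, private-use subtags)."""
    subtags = re.split(r"[-_]", locale_id)
    language_id = []
    extensions = {}      # lowercased singleton -> list of its subtags
    private_use = []
    current = None       # list receiving subtags of the extension being read
    for index, subtag in enumerate(subtags):
        if len(subtag) == 1:
            singleton = subtag.lower()
            if singleton == "x":
                # Private use runs to the end and may contain 1-char subtags.
                private_use = subtags[index + 1:]
                break
            current = extensions.setdefault(singleton, [])
        elif current is None:
            language_id.append(subtag)
        else:
            current.append(subtag)
    return "-".join(language_id), extensions, private_use


parts = split_locale_id("en-US-u-ca-buddhist-t-it-x-foo-1")
```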
As for terminology, the term code may also be used instead of "subtag", and "territory" instead of "region". The primary language subtag is also called the base language code. For example, the base language code for "en-US" (American English) is "en" (English). The type may also be referred to as a value or key-value.
All identifier field values are case-insensitive. Although case distinctions do not carry any special meaning, an implementation of LDML should use the casing recommendations in [BCP47], especially when a Unicode locale identifier is used for locale data exchange in software protocols.
The identifiers can vary in case and in the separator characters. The "-" and "_" separators are treated as equivalent, although "-" is preferred.
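A small normalizer for case and separators might look like the following sketch. It applies the BCP 47 casing conventions (lowercase language, title-case script, uppercase region, lowercase everything else) and the preferred "-" separator. The function name is hypothetical, and the sketch assumes a well-formed identifier that begins with a language subtag; it does not reorder or replace any subtags.

```python
import re


def recommended_form(locale_id):
    """Normalize separators to '-' and apply BCP 47 casing conventions."""
    out = []
    in_extension = False
    for index, subtag in enumerate(re.split(r"[-_]", locale_id)):
        if len(subtag) == 1:
            in_extension = True               # singleton: the rest stays lowercase
            out.append(subtag.lower())
        elif in_extension or index == 0:
            out.append(subtag.lower())        # language subtag or extension content
        elif len(subtag) == 4 and subtag.isalpha():
            out.append(subtag.capitalize())   # script: title case
        elif len(subtag) == 2 and subtag.isalpha():
            out.append(subtag.upper())        # region: uppercase
        else:
            out.append(subtag.lower())        # numeric regions, variants
    return "-".join(out)
```

Subtags inside an extension are lowercased even when they look like a script or region, since they are no longer part of the unicode_language_id.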
A Unicode BCP 47 locale identifier (unicode_bcp47_locale_id) is a unicode_locale_id that meets the following additional constraints:
- sep is restricted to only [-] in unicode_language_id and unicode_locale_id.]
- unicode_language_subtag.] Thus it can be neither of the following:
- unicode_script_subtag.
- unicode_language_subtag is used instead of "root").

A well-formed Unicode BCP 47 locale identifier is always a well-formed BCP 47 language tag. The reverse, however, is not guaranteed; a BCP 47 language tag that contains an extlang subtag, an irregular subtag, or an initial 'x' subtag would not be a well-formed Unicode BCP 47 locale identifier. For details, see BCP 47 Conformance. However, any BCP 47 language tag can easily be converted to a Unicode BCP 47 locale identifier as specified in BCP 47 Language Tag Conversion.
A Unicode CLDR locale identifier (unicode_cldr_locale_id) is a unicode_locale_id that meets the following additional constraints:
- sep is restricted to only [_] in unicode_language_id and unicode_locale_id.]
- unicode_language_id "und" is replaced by "root".]
- unicode_script_subtag.]

Note: The current version of CLDR data uses Unicode CLDR locale identifiers for backward compatibility. This might be changed in future CLDR releases.
A unicode_locale_id has canonical syntax when:
- ufields and tfields are sorted by alphabetical order of their keys, within their respective extensions.
- ufield or tfield value "true" is removed.

For example, the canonical form of "en-u-foo-bar-nu-thai-ca-buddhist-kk-true" is "en-u-bar-foo-ca-buddhist-kk-nu-thai". The attributes "foo" and "bar" in this example are provided only for illustration; no attribute subtags are defined by the current CLDR specification.
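The reordering in this example can be sketched as follows for the -u- extension alone. The function name is hypothetical. The sketch assumes a well-formed identifier whose only extension is -u-, sorts attributes alphabetically (inferred from the example, where "foo-bar" becomes "bar-foo"), sorts fields by key, and drops a value that is exactly "true". It leaves the language identifier untouched and does not replace aliased subtags.

```python
import re


def canonical_u_syntax(locale_id):
    """Reorder the -u- extension: sorted attributes, fields sorted by key."""
    subtags = re.split(r"[-_]", locale_id)
    lowered = [subtag.lower() for subtag in subtags]
    if "u" not in lowered[1:]:
        return "-".join(subtags)              # no -u- extension to reorder
    start = lowered.index("u", 1)
    head, extension = subtags[:start], lowered[start + 1:]

    attributes, fields = [], []               # fields: [key, [value subtags]]
    for subtag in extension:
        if len(subtag) == 2:                  # a two-character subtag is a key
            fields.append([subtag, []])
        elif fields:
            fields[-1][1].append(subtag)      # value subtag of the latest key
        else:
            attributes.append(subtag)         # before any key: an attribute

    result = head + ["u"] + sorted(attributes)
    for key, values in sorted(fields, key=lambda field: field[0]):
        if values == ["true"]:
            values = []                       # the value "true" is removed
        result += [key] + values
    return "-".join(result)
```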
NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rather than the ordering in Section 4.1 of BCP 47. Here are the considerations that led to that decision:
A unicode_locale_id is in canonical form when it has canonical syntax and contains no aliased subtags. A unicode_locale_id can be transformed into canonical form according to Annex C. LocaleId Canonicalization.
A unicode_locale_id is maximal when the unicode_language_id and tlang (if any) have been transformed by the Add Likely Subtags operation in Likely Subtags, excluding "und".
Example: the maximal form of ja-Kana-t-it is ja-Kana-JP-t-it-latn-it
Note that the latn and final it don't use any uppercase characters, since they are not inside unicode_language_id.
Two unicode_locale_ids are equivalent when their maximal canonical forms are identical.
Example: "IW-HEBR-u-ms-imperial" ~ "he-u-ms-uksystem"
The equivalence relationship may change over time, such as when subtags are deprecated or likely subtag mappings change. For example, if two countries were to merge, then various subtags would become deprecated. These kinds of changes are generally very infrequent.
Unicode language and locale identifiers inherit the design and the repertoire of subtags from [BCP47] Language Tags. There are some extensions and restrictions made for the use of the Unicode locale identifier in CLDR:
There are thus two subtypes of Unicode locale identifiers, as defined above.