Unicode Technical Standard #35

Unicode Locale Data Markup Language (LDML)

Version	48.2
Editors	Mark Davis (markdavis@google.com) and other CLDR committee members
Date	2026-03-03
This Version	https://www.unicode.org/reports/tr35/tr35-78/tr35.html
Previous Version	https://www.unicode.org/reports/tr35/tr35-77/tr35.html
Latest Version	https://www.unicode.org/reports/tr35/
Corrigenda	https://cldr.unicode.org/index/corrigenda
Latest Proposed Update	https://www.unicode.org/reports/tr35/proposed.html
Namespace	https://www.unicode.org/cldr/
DTDs	https://www.unicode.org/cldr/dtd/48/
Change History	Modifications

Summary

This document describes an XML format (vocabulary) for the exchange of structured locale data. This format is used in the Unicode Common Locale Data Repository.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the CLDR bug reporting form [Bugs]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For more information see About Unicode Technical Reports and the Specifications FAQ. Unicode Technical Reports are governed by the Unicode Terms of Use.

Parts

The LDML specification is divided into the following parts:

Part 1: Core (languages, locales, basic structure)
Part 2: General (display names & transforms, etc.)
Part 3: Numbers (number & currency formatting)
Part 4: Dates (date, time, time zone formatting)
Part 5: Collation (sorting, searching, grouping)
Part 6: Supplemental (supplemental data)
Part 7: Keyboards (keyboard mappings)
Part 8: Person Names (person names)
Part 9: MessageFormat (message format)
Appendix A: Modifications
Appendix B: Acknowledgments

Contents of Part 1, Core

Introduction
What is a Locale?
Unicode Language and Locale Identifiers
- Unicode Language Identifier
- Unicode Locale Identifier
  - Canonical Unicode Locale Identifiers
- BCP 47 Conformance
  - BCP 47 Language Tag Conversion
- Language Identifier Field Definitions
  - unicode_language_subtag (also known as a Unicode base language code)
  - unicode_script_subtag (also known as a Unicode script code)
  - unicode_region_subtag (also known as a Unicode region code, or a Unicode territory code)
  - unicode_variant_subtag (also known as a Unicode language variant code)
- Special Codes
- Special Script Codes
- Unicode BCP 47 U Extension
- Unicode BCP 47 T Extension
  - T Extension Data Files
- Compatibility with Older Identifiers
  - Old Locale Extension Syntax
    - Table: Locale Extension Mappings
  - Legacy Variants
    - Table: Legacy Variant Mappings
  - Relation to OpenI18n
- Transmitting Locale Information
  - Message Formatting and Exceptions
- Unicode Language and Locale IDs
  - Written Language
  - Hybrid Locale Identifiers
- Validity Data
Locale Inheritance and Matching
XML Format
Property Data
Issues in Formatting and Parsing
- Lenient Parsing
  - Motivation
  - Loose Matching
- Handling Invalid Patterns
Data Size Reduction
- Vertical Slicing
- Horizontal Slicing
Annex A Deprecated Structure
Annex B Links to Other Parts
- Table: Part 2 Links: General (display names & transforms, etc.)
- Table: Part 3 Links: Numbers (number & currency formatting)
- Table: Part 4 Links: Dates (date, time, time zone formatting)
- Table: Part 5 Links: Collation (sorting, searching, grouping)
- Table: Part 6 Links: Supplemental (supplemental data)
- Table: Part 7 Links: Keyboards (keyboard mappings)
Annex C. LocaleId Canonicalization
References
Acknowledgments
Modifications

Introduction

Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system can be built of many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging the gaps; in the internationalization arena, Unicode has provided a lingua franca for communicating textual data. However, there remain differences in the locale data used by different systems.

The best practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting can take place on any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine does the formatting. The same goes for parsing data, and locale-sensitive analysis of data.

But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many of those differences are simply gratuitous; all within acceptable limits for human beings, but yielding different results. In many other cases there are outright errors. Whatever the cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of collation (sort-order), where different collation causes not only ordering differences, but also different results of queries! That is, with a query of customers with names between "Abbot, Cosmo" and "Arnold, James", if different systems have different sort orders, different lists will be returned. (For comparisons across systems formatted as HTML tables, see [Comparisons].)

Note: There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale.

This document specifies an XML format for the communication of locale data: the Unicode Locale Data Markup Language (LDML). This provides a common format for systems to interchange locale data so that they can get the same results in the services provided by internationalization libraries. It also provides a standard format that can allow users to customize the behavior of a system. With it, for example, collation (sorting) rules can be exchanged, allowing two implementations to exchange a specification of tailored collation rules. Using the same specification, the two implementations will achieve the same results in comparing strings. Unicode LDML can also be used to let a user encapsulate specialized sorting behavior for a specific domain, or create a customized locale for a minority language. Unicode LDML is also used in the Unicode Common Locale Data Repository (CLDR). CLDR uses an open process for reconciling differences between the locale data used on different systems and validating the data, to produce a useful, common, consistent base of locale data.

For more information, see the Common Locale Data Repository project page [LocaleProject].

As LDML is an interchange format, it was designed for ease of maintenance and simplicity of transformation into other formats, above efficiency of run-time lookup and use. Implementations should consider converting LDML data into a more compact format prior to use.

Conformance

There are many ways to use the Unicode LDML specification and the CLDR data. The Unicode Consortium does not restrict the ways in which the format or data are used. However, an implementation may also claim conformance to the LDML specification and/or to CLDR data, as follows:

UAX35-C1. An implementation that claims conformance to this specification shall:

Identify the sections of the specification that it conforms to.
- For example, an implementation might claim conformance to all LDML features except for transforms and segments.
- The names of sections may change for clarity, so the associated links should be included in any reference — links into LDML will remain stable.
Interpret the relevant elements and attributes of LDML data in accordance with the descriptions in those sections.
- For example, an implementation that claims conformance to the date format patterns must interpret the characters in such patterns according to Date Field Symbol Table.
Declare which types of CLDR data it uses.
- For example, an implementation might declare that it only uses language names, and those with a draft status of contributed or approved.
Declare when it overrides CLDR data, or uses alt data.
- For example, for //ldml/numbers/symbols/group an implementation could use alt="official" data.

An implementation may also make a general claim of conformance to the LDML specification and/or CLDR data. Such a claim is understood to claim conformance to all portions of this specification that are relevant to the operations performed by the implementation, except for those specifically declared as exceptions. For example, if an implementation making a general claim of conformance performs date formatting, and does not declare date formatting as an exception, it is understood to be claiming conformance to date formatting as described in the section listed below.

UAX35-C2. An implementation that claims conformance to Unicode locale or language identifiers shall:

~~1. Specify whether Unicode locale extensions are allowed~~
~~2. Specify the canonical form used for identifiers in terms of casing and field separator characters.~~

~~External specifications may also reference particular components of Unicode locale or language identifiers, such as:~~

> Field X can contain any Unicode region subtag values as given in Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML), excluding grouping codes.

NOTE: UAX35-C2. is replaced by the following generalization.

The following lists the high-level sections with structures and/or processing algorithms. Conformance to a particular section may reference and require conformance to another section.

Unicode Locale Identifiers

Sections	Topics
Unicode Locale Identifier	identifier syntax, interpretation, and validity
Annex C. LocaleId Canonicalization	canonicalize
CLDR to BCP 47, BCP 47 to CLDR	convert
Language Identifier Field Definitions	interpretation and validity of -u key-value pairs
Locale Display Name Algorithm	locale display names

Unicode Locale Inheritance and Matching

Sections	Topics
Locale Inheritance and Matching	locale inheritance
Likely Subtags	likely subtags
Language Matching	locale matching

Units of Measurement

Sections	Topics
Unit Identifiers	unit identifier syntax, interpretation, and validity
Unit Identifier Normalization	identifier normalization
Unit Conversion	unit conversion
Unit Preferences	evaluation of user preferences
Unit Identifier Uniqueness	converting units into BCP47 format
Compound Units	unit display names

Number Formatting

Sections	Topics
Number Format Patterns	number format patterns, syntax and interpretation
Compact Number Formats	compact number formats
Rule-Based Number Formatting	spell-out number formatting

Date Formatting

Sections	Topics
Elements availableFormats, appendItems	date formatting, patterns
Date Format Patterns	date format patterns and symbols
Using Time Zone Names	timezone forms, fallback and parsing

Collation

Sections	Topics
Root Collation	Root collation syntax and structure
Collation Tailorings	Rule syntax and interpretation for language-specific ordering

Grammar

Sections	Topics
Grammatical Features	noun classes (except for plurals)
Language Plural Rules	plural and ordinal category rules, ranges

Miscellaneous

Sections	Topics
Unicode Sets	Unicode set syntax and interpretation
String Range	string-range syntax and interpretation
Transforms	transform identifier and rule syntax and interpretation
Segmentations	segmentation customizations
Synthesizing Sequence Names	constructing derived emoji names
Formatting Process	person name formatting
Part 7: Keyboards	keyboard structure and interpretation
Conformance (Message Format)	message formatting

Customization

Conformant implementations cannot modify CLDR structures, such as the syntax or interpretation of locale identifiers. There are usually mechanisms for implementations to customize these to a certain extent, using what are known as private-use codes. For example, an implementation could use the private-use language code qfz to mean a language that was not covered by BCP 47, or use a private-use extension in a Unicode locale identifier, or use a private-use unit such as xxx-smoot-per-second.

An implementation may also use a deprecated code instead of the corresponding preferred code. For example, the most frequent case of this is with an implementation whose earlier versions predated BCP 47, and used iw for Hebrew, rather than the BCP 47 (and CLDR) code he. When this is done, the CLDR data needs to be modified in appropriate places, not just in some file names. For example, the languageAlias data requires modification, from:

<languageAlias type="iw" replacement="he" reason="deprecated"/> <!-- Hebrew -->

to:

<languageAlias type="he" replacement="iw" reason="deprecated"/> <!-- Hebrew -->

Minimized locale identifiers are also not required. For example, an implementation could consistently expand locale identifiers to include regions, such as en → en_DE or de → de-AT.

Implementations may customize CLDR data, as long as they declare that they are doing so. This may include:

Omitting data

An implementation may dispense with locale data for locales that it does not support, or for locales it does support, dispense with data that is at CoverageLevel=Comprehensive, or dispense with particular sorts of data, such as annotations for emoji.

Adding data

An implementation could add data for a locale that CLDR does not yet support, or add higher-coverage data for a locale than what CLDR has.

Overriding data

CLDR has a mechanism for overriding data using the alt mechanism. At build time, an implementation could override the default value by using an alt value. For example, take the following data:

<territory type="HK">Sonderverwaltungsregion Hongkong</territory>
<territory type="HK" alt="short">Hongkong</territory>

An implementation could, at build time, substitute the short value for the regular value, getting "Hongkong". It could instead support both values at runtime, using display option settings to pick between the regular value and the short value.
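As a rough illustration of the build-time substitution described above, the following Python sketch reads the two sample elements and prefers a requested alt value over the regular one. The function name `apply_alt_override`, the dictionary output, and the bare `<territories>` wrapper used to make the sample parseable are assumptions made for this example; they are not part of any CLDR tooling.

```python
import xml.etree.ElementTree as ET

# The two <territory> elements are the sample from the text; the wrapper
# element is added here only so that the fragment parses as one document.
SAMPLE = """<territories>
<territory type="HK">Sonderverwaltungsregion Hongkong</territory>
<territory type="HK" alt="short">Hongkong</territory>
</territories>"""


def apply_alt_override(parent, alt):
    """Return {type: text}, preferring elements whose alt attribute equals `alt`."""
    result = {}
    for element in parent:
        key = element.get("type")
        element_alt = element.get("alt")
        if element_alt is None:
            # Regular value: keep it unless an override was already stored.
            result.setdefault(key, element.text)
        elif element_alt == alt:
            # Requested alt value replaces the regular value.
            result[key] = element.text
    return result


root = ET.fromstring(SAMPLE)
print(apply_alt_override(root, "short"))
```

An implementation that supports both values at runtime would instead keep both entries and choose between them according to a display option.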

Implementations can override the data in other ways as well, such as changing the spelling of a particular value.

Testing

The files in testData can be used to test conformance. Brief instructions for use are supplied in _readme.txt files in the different directories and/or in the headers of the files in question. For example, the following is from a sample header:

# Format:
# <source locale identifier>	;	<expected canonicalized locale identifier>
#
# The data lines are divided into 4 sets:
#   explicit:    a short list of explicit test cases.
#   fromAliases: test cases generated from the alias data.
#   decanonicalized: test cases generated by reversing the normalization process.
#   withIrrelevants: test cases generated from the others by adding irrelevant fields where possible,
#                           to ensure that the canonicalization implementation is not sensitive to irrelevant fields. These include:
#     Language: aaa
#     Script:   Adlm
#     Region:   AC
#     Variant:  fonipa

If an implementation overrides CLDR data, then various lines in the relevant test files may need to be modified correspondingly, or skipped.
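A test harness has to read such files before it can compare results. The sketch below parses lines of the `<source> ; <expected>` form shown in the sample header. It assumes that `#` starts a comment anywhere on a line and that blank lines can be skipped; the function name `parse_test_lines` and the sample input are hypothetical, and the real test files may need additional handling.

```python
def parse_test_lines(text):
    """Return (source, expected) pairs from 'source ; expected' lines."""
    pairs = []
    for raw in text.splitlines():
        # Assumption: '#' introduces a comment that runs to the end of the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank and comment-only lines
        source, sep, expected = line.partition(";")
        if not sep:
            raise ValueError(f"missing ';' separator: {raw!r}")
        pairs.append((source.strip(), expected.strip()))
    return pairs


# Illustrative input only; the data line is not copied from the CLDR test files.
SAMPLE = "# Format:\n# <source>\t;\t<expected>\n\niw\t;\the\n"
print(parse_test_lines(SAMPLE))
```

Lines that an implementation has overridden could be filtered out of the returned list before the comparison is run.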

EBNF

The EBNF syntax used in LDML is a variant of the Extended Backus-Naur Form (EBNF) notation used in W3C XML Notation. The main differences are:

Bounded repetition following Perl regex syntax is allowed, such as digit{3} for 3 digits, digit{3,5} for 3 to 5 digits, and digit{3,} for 3 or more digits.
Whitespace inside bracketed enumerations and ranges is ignored.
- e.g., [A-Z a-z] is the same as [A-Za-z]
A backslash may be used to escape a following "x"-prefixed hexadecimal code point or the immediately following character.
- e.g., \x20 is the same as #x20 and [\&\-] is the same as [#x26#x2D]
Constraints (well-formedness or validity) may use separate notes, and/or the W3C notations:
- [ wfc: ... ]
- [ vc: ... ]

In the text, this is sometimes referred to as "EBNF (Perl-based)".
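Because the bounded-repetition notation follows Perl regex syntax, productions written this way map almost directly onto a regular-expression engine. The Python sketch below shows one such mapping under two assumptions: the helper name `ebnf_class_to_regex` is hypothetical, and it handles only the whitespace rule for bracketed enumerations, not backslash escapes.

```python
import re


def ebnf_class_to_regex(bracketed):
    """Convert a bracketed enumeration such as '[A-Z a-z]' to a Python regex class.

    Whitespace is ignored inside LDML brackets, but Python treats it as a
    literal character, so it is removed. Escaped whitespace is not handled.
    """
    return re.sub(r"\s+", "", bracketed)


ALPHA = ebnf_class_to_regex("[A-Z a-z]")
DIGIT = ebnf_class_to_regex("[0-9]")


def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


# digit{3}, digit{3,5} and digit{3,} carry over unchanged.
print(matches(DIGIT + "{3}", "123"))
print(matches(DIGIT + "{3,5}", "123456"))
print(matches(DIGIT + "{3,}", "1234567"))
```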

What is a Locale?

Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use data in LDML, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses.

The first issue is basic: what is a locale? In this model, a locale is an identifier (id) that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for time zones, languages, countries (regions), and scripts. The data can also include support for text boundaries (character, word, line, and sentence), text transformations (including transliterations), and other services.

Locale data is not cast in stone: the data used on someone's machine generally may reflect the US format, for example, but preferences can typically be set to override particular items, such as setting the date format for 2002.03.15, or using metric or Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may want to also remember the user's time zone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher, and so on), music preference, religion, party affiliation, favorite charity, and so on.

Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go; committees impose new standards; bugs are found and fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.

In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking, and so on). The format in this document does not attempt to represent all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common use in systems and internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences according to different countries or regions. However, the line between locales and languages, as commonly used in the industry, is rather fuzzy. Note also that the vast majority of the locale data in CLDR is in fact language data; all non-linguistic data is separated out into a separate tree. For more information, see Language and Locale IDs.

We will speak of data as being "in locale X". That does not imply that a locale is a collection of data; it is simply shorthand for "the set of data associated with the locale id X". Each individual piece of data is called a resource or field, and a tag indicating the key of the resource is called a resource tag.

Unicode Language and Locale Identifiers

Unicode LDML uses stable identifiers based on [BCP47] for distinguishing among languages, locales, regions, currencies, time zones, transforms, and so on. There are many systems for identifiers for these entities. The Unicode LDML identifiers may not match the identifiers used on a particular target system. If so, some process of identifier translation may be required when using LDML data.

The BCP 47 extensions (-u- and -t-) are described in Unicode BCP 47 U Extension and Unicode BCP 47 T Extension.

Unicode Language Identifier

A Unicode language identifier has the following structure (provided in EBNF (Perl-based)). The following table defines syntactically well-formed identifiers: they are not necessarily valid identifiers. For additional validity criteria, see the links on the right.

	EBNF	Validity / Comments
`unicode_language_id`	`= "root" | (unicode_language_subtag (sep unicode_script_subtag)? | unicode_script_subtag) (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;`	"root" is treated as a special `unicode_language_subtag`
`unicode_language_subtag`	= alpha{2,3} | alpha{5,8} ;	validity latest-data
`unicode_script_subtag`	= alpha{4} ;	validity latest-data
`unicode_region_subtag`	= (alpha{2} | digit{3}) ;	validity latest-data
`unicode_variant_subtag`	= (alphanum{5,8} | digit alphanum{3}) ;	validity latest-data
`sep`	= [-_] ;
`digit`	= [0-9] ;
`alpha`	= [A-Z a-z] ;
`alphanum`	= [0-9 A-Z a-z] ;

The following is an additional well-formedness constraint:

[ wfc: The sequence of variant subtags must not have any duplicates (eg, de-1996-fonipa-1996 is not syntactically well-formed). ]

The semantics of the various subtags is explained in Language Identifier Field Definitions; there are also direct links from unicode_language_subtag, etc. While theoretically the unicode_language_subtag may have more than 3 letters through the IANA registration process, in practice that has not occurred. The unicode_language_subtag "und" may be omitted when there is a unicode_script_subtag; for that reason unicode_language_subtag values with 4 letters are not permitted. However, such unicode_language_id values are not intended for general interchange, because they are not valid BCP 47 tags. Instead, they are intended for certain protocols such as the identification of transliterators or font ScriptLangTag values. For more information on language subtags with 4 letters, see BCP 47 Language Tag to Unicode BCP 47 Locale Identifier.

For example, "en-US" (American English), "en_GB" (British English), "es-419" (Latin American Spanish), and "uz-Cyrl" (Uzbek in Cyrillic) are all valid Unicode language identifiers.
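The grammar and the duplicate-variant constraint above can be approximated with a regular expression plus one extra check. The following Python sketch is one possible reading of the table, in which "root" stands alone as an alternative to the rest of the production; the names `LANGUAGE_ID` and `is_well_formed_language_id` are hypothetical. It tests well-formedness only, not validity against the CLDR data.

```python
import re

ALPHA = "[A-Za-z]"
ALNUM = "[0-9A-Za-z]"
SEP = "[-_]"
LANG = f"(?:{ALPHA}{{2,3}}|{ALPHA}{{5,8}})"           # unicode_language_subtag
SCRIPT = f"{ALPHA}{{4}}"                              # unicode_script_subtag
REGION = f"(?:{ALPHA}{{2}}|[0-9]{{3}})"               # unicode_region_subtag
VARIANT = f"(?:{ALNUM}{{5,8}}|[0-9]{ALNUM}{{3}})"     # unicode_variant_subtag

LANGUAGE_ID = re.compile(
    f"root|(?:{LANG}(?:{SEP}{SCRIPT})?|{SCRIPT})"
    f"(?:{SEP}{REGION})?"
    f"(?P<variants>(?:{SEP}{VARIANT})*)"
)


def is_well_formed_language_id(tag):
    """Check the syntax of a unicode_language_id, including the variant wfc."""
    match = LANGUAGE_ID.fullmatch(tag)
    if match is None:
        return False
    # Subtags are case-insensitive, so compare variants in lowercase.
    variants = [v.lower() for v in re.split(SEP, match.group("variants") or "") if v]
    return len(variants) == len(set(variants))  # wfc: no duplicate variants


for tag in ["en-US", "en_GB", "es-419", "uz-Cyrl", "de-1996-fonipa-1996"]:
    print(tag, is_well_formed_language_id(tag))
```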

Unicode Locale Identifier

A Unicode locale identifier is composed of a Unicode language identifier plus (optional) locale extensions. It has the following structure. The semantics of the U and T extensions are explained in Unicode BCP 47 U Extension and Unicode BCP 47 T Extension. Other extensions and private use extensions are supported for pass-through. The following table defines syntactically well-formed identifiers: they are not necessarily valid identifiers. For additional validity criteria, see the links on the right.

	EBNF	Validity / Comments
`unicode_locale_id`	`= unicode_language_id` `extensions*` `pu_extensions? ;`
`extensions`	`= unicode_locale_extensions` `| transformed_extensions` `| other_extensions ;`
`unicode_locale_extensions`	`= sep [uU]` `((sep keyword)+` `|(sep uattribute)+ (sep ufield)*) ;`
`transformed_extensions`	`= sep [tT]` `((sep tlang (sep tfield)*)` `| (sep tfield)+) ;`
`pu_extensions`	`= sep [xX]` `(sep alphanum{1,8})+ ;`
`other_extensions`	`= sep [alphanum-[tTuUxX]]` `(sep alphanum{2,8})+ ;`
`ufield` (Also known as `keyword`)	`= ukey (sep uvalue)? ;`
`ukey` (Also known as `key`)	`= alphanum alpha ;`	`validity` `latest-data` (Note that this is narrower than in [RFC6067], so that it is disjoint with `tkey`.)
`uvalue` (Also known as `type`)	`= alphanum{3,8}` `(sep alphanum{3,8})* ;`	`validity` `latest-data`
`uattribute` (Also known as `attribute`)	`= alphanum{3,8} ;`
`unicode_subdivision_id`	`=` `unicode_region_subtag` `unicode_subdivision_suffix ;`	`validity` `latest-data`
`unicode_subdivision_suffix`	`= alphanum{1,4} ;`
`unicode_measure_unit`	`= alphanum{3,8}` `(sep alphanum{3,8})* ;`	`validity` `latest-data`
`tlang`	`= unicode_language_subtag` `(sep unicode_script_subtag)?` `(sep unicode_region_subtag)?` `(sep unicode_variant_subtag)* ;`	same as in unicode_language_id
`tfield`	`= tkey tvalue;`	`validity` `latest-data`
`tkey`	`= alpha digit ;`
`tvalue`	`= (sep alphanum{3,8})+ ;`

The following are additional well-formedness constraints:

[ wfc: There cannot be more than one extension with the same singleton. For example, en-u-ca-buddhist-u-cf-standard is ill-formed.]
[ wfc: There cannot be more than one occurrence of the same ukey or tkey. For example, en-u-ca-buddhist-ca-islamic is ill-formed. ]
[ wfc: The sequence of variant subtags in a tlang must not have any duplicates. ]
[ wfc: The private use extension (-x-) must come after all other extensions. ]
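The first two constraints can be checked with a single pass over the subtags. The Python sketch below assumes that the input already matches the grammar in the table, so it only looks for repeated singletons and repeated keys; the function name `extension_problems` and the wording of the returned messages are hypothetical. Everything after an "x" singleton is treated as private use and is not inspected.

```python
import re

U_KEY = re.compile(r"[0-9a-z][a-z]")  # ukey = alphanum alpha
T_KEY = re.compile(r"[a-z][0-9]")     # tkey = alpha digit


def extension_problems(locale_id):
    """Return a list of violated constraints (empty if none are found)."""
    subtags = re.split(r"[-_]", locale_id.lower())
    problems = []
    singletons, keys = set(), set()
    current = None  # singleton of the extension currently being scanned
    for subtag in subtags[1:]:  # subtags[0] belongs to the language identifier
        if len(subtag) == 1:
            if subtag == "x":
                break  # the rest is private use
            if subtag in singletons:
                problems.append(f"duplicate singleton {subtag}")
            singletons.add(subtag)
            current = subtag
        elif (current == "u" and U_KEY.fullmatch(subtag)) or (
            current == "t" and T_KEY.fullmatch(subtag)
        ):
            if subtag in keys:
                problems.append(f"duplicate key {subtag}")
            keys.add(subtag)
    return problems


print(extension_problems("en-u-ca-buddhist-u-cf-standard"))
print(extension_problems("en-u-ca-buddhist-ca-islamic"))
```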

For historical reasons, this is called a Unicode locale identifier. However, it also functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see Language and Locale IDs.

As of the release of this specification, there were no other_extensions defined. The other_extensions are present in the syntax to allow implementations to preserve that information.

As for terminology, the term code may also be used instead of "subtag", and "territory" instead of "region". The primary language subtag is also called the base language code. For example, the base language code for "en-US" (American English) is "en" (English). The type may also be referred to as a value or key-value.

All identifier field values are case-insensitive. Although case distinctions do not carry any special meaning, an implementation of LDML should use the casing recommendations in [BCP47], especially when a Unicode locale identifier is used for locale data exchange in software protocols.

The identifiers can vary in case and in the separator characters. The "-" and "_" separators are treated as equivalent, although "-" is preferred.

A Unicode BCP 47 locale identifier (unicode_bcp47_locale_id) is a unicode_locale_id that meets the following additional constraints:

[ wfc: The EBNF sep is restricted to only [-] in unicode_language_id and unicode_locale_id.]
[ wfc: The first subtag must be a unicode_language_subtag.] Thus it can be neither of the following:
- a unicode_script_subtag.
- a "root" subtag (the "und" unicode_language_subtag is used instead of "root").

A well-formed Unicode BCP 47 locale identifier is always a well-formed BCP 47 language tag. The reverse, however, is not guaranteed; a BCP 47 language tag that contains an extlang subtag, an irregular subtag, or an initial 'x' subtag would not be a well-formed Unicode BCP 47 locale identifier (for details, see BCP 47 Conformance). However, any BCP 47 language tag can easily be converted to a Unicode BCP 47 locale identifier as specified in BCP 47 Language Tag Conversion.

A Unicode CLDR locale identifier (unicode_cldr_locale_id) is a unicode_locale_id that meets the following additional constraints:

[ wfc: The EBNF sep is restricted to only [_] in unicode_language_id and unicode_locale_id.]
[ wfc: The unicode_language_id "und" is replaced by "root".]
[ wfc: The first subtag cannot be a unicode_script_subtag.]

Note: The current version of CLDR data uses Unicode CLDR locale identifiers for backward compatibility. This might be changed in future CLDR releases.

Canonical Unicode Locale Identifiers

A unicode_locale_id has canonical syntax when:

It starts with a language subtag (those beginning with a script subtag are only for specialized use)
Casing
- Any script subtag inside unicode_language_id is in title case (eg, Hant)
- Any region subtag inside unicode_language_id is in uppercase (eg, DE)
- All other subtags are in lowercase (eg, en, fonipa)
Order
- Any variants are in alphabetical order (eg, en-fonipa-scouse, not en-scouse-fonipa)
- Any extensions are in alphabetical order by their singleton (eg, en-t-xxx-u-yyy, not en-u-yyy-t-xxx)
- All attributes are sorted in alphabetical order.
- All ufields and tfields are sorted by alphabetical order of their keys, within their respective extensions.
- Any ufield or tfield value "true" is removed.

For example, the canonical form of "en-u-foo-bar-nu-thai-ca-buddhist-kk-true" is "en-u-bar-foo-ca-buddhist-kk-nu-thai". The attributes "foo" and "bar" in this example are provided only for illustration; no attribute subtags are defined by the current CLDR specification.
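The casing and ordering rules above can be sketched in Python as follows. This is a simplified illustration, not the Annex C algorithm: it assumes a well-formed identifier that starts with a language subtag, it does not replace aliased subtags, and it leaves the subtags of a tlang in their original order. The function names `canonical_syntax` and `_split_fields` are hypothetical.

```python
import re


def _split_fields(subtags, is_key):
    """Split an extension body into leading subtags and key-sorted fields."""
    leading, fields = [], []
    for subtag in subtags:
        if is_key(subtag):
            fields.append((subtag, []))
        elif fields:
            fields[-1][1].append(subtag)
        else:
            leading.append(subtag)  # attributes (u) or tlang (t)
    ordered = []
    for key, values in sorted(fields, key=lambda field: field[0]):
        if values == ["true"]:
            values = []  # a value of "true" is removed
        ordered += [key] + values
    return leading, ordered


def canonical_syntax(locale_id):
    """Apply canonical casing and ordering to a well-formed unicode_locale_id."""
    subtags = re.split(r"[-_]", locale_id.lower())
    first_ext = next((i for i, s in enumerate(subtags) if len(s) == 1), len(subtags))
    lang_part, ext_part = subtags[:first_ext], subtags[first_ext:]

    head, variants = [lang_part[0]], []
    for s in lang_part[1:]:
        if len(s) == 4 and s.isalpha():
            head.append(s.title())   # script in title case
        elif (len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit()):
            head.append(s.upper())   # region in uppercase
        else:
            variants.append(s)

    extensions, private_use, i = {}, [], 0
    while i < len(ext_part):
        if ext_part[i] == "x":
            private_use = ext_part[i:]   # private use stays last, unchanged
            break
        j = i + 1
        while j < len(ext_part) and len(ext_part[j]) != 1:
            j += 1
        extensions[ext_part[i]] = ext_part[i + 1:j]
        i = j

    parts = head + sorted(variants)
    for singleton in sorted(extensions):
        body = extensions[singleton]
        if singleton == "u":
            attributes, fields = _split_fields(body, lambda s: len(s) == 2)
            body = sorted(attributes) + fields
        elif singleton == "t":
            tlang, fields = _split_fields(
                body, lambda s: re.fullmatch(r"[a-z][0-9]", s) is not None)
            body = tlang + fields
        parts += [singleton] + body
    return "-".join(parts + private_use)


print(canonical_syntax("en-u-foo-bar-nu-thai-ca-buddhist-kk-true"))
```

Because the result is a single normalized string, two identifiers in canonical syntax can be compared with plain string equality.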

NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rather than the ordering in Section 4.1 of BCP 47. Here are the considerations that led to that decision:

The ordering in Section 4.1 is recommended, but not required for conformance. In particular, use of and ordering by Prefix is recommended but not required.
Moreover, Section 4.5 states that “If more than one variant appears within a tag, processors MAY reorder the variants to obtain better matching behavior or more consistent presentation.”
The best practices for internationalization have moved well beyond some of the guidelines and recommendations in BCP 47, especially for language matching and language fallback.
Robust implementations will accept the variants in any order, just as they accept extensions in any order.
A canonical order allows for determination of identity of identifiers via string comparison.
The ordering in Section 4.1 does not result in a deterministic order for canonicalization, since the mechanism for determining “importance” is not specified: ca-valencia-fonipa and ca-fonipa-valencia could both be ‘canonical’ variants of one another.
Pure alphabetical order is deterministic and simple to implement, while the ordering in Section 4.1 is nondeterministic, more complex, and provides no significant benefit in modern applications.

A unicode_locale_id is in canonical form when it has canonical syntax and contains no aliased subtags. A unicode_locale_id can be transformed into canonical form according to Annex C. LocaleId Canonicalization.

A unicode_locale_id is maximal when the unicode_language_id and tlang (if any) have been transformed by the Add Likely Subtags operation in Likely Subtags, excluding "und".

Example: the maximal form of ja-Kana-t-it is ja-Kana-JP-t-it-latn-it

Note that the latn and final it don't use any uppercase characters, since they are not inside unicode_language_id.

Two unicode_locale_ids are equivalent when their maximal canonical forms are identical.

Example: "IW-HEBR-u-ms-imperial" ~ "he-u-ms-uksystem"

The equivalence relationship may change over time, such as when subtags are deprecated or likely subtag mappings change. For example, if two countries were to merge, then various subtags would become deprecated. These kinds of changes are generally very infrequent.

BCP 47 Conformance

Unicode language and locale identifiers inherit the design and the repertoire of subtags from [BCP47] Language Tags. There are some extensions and restrictions made for the use of the Unicode locale identifier in CLDR:

It does not allow for the full syntax of [BCP47]:
- No extlang subtags are allowed (as in the BCP 47 canonical form, see BCP 47 Section 4.5 and Section 3.1.7)
- No irregular BCP 47 legacy language tags (marked as “Type: grandfathered” in BCP 47) are allowed (these are all deprecated in BCP 47)
- A tag must not start with the subtag "x": thus a privateuse (eg x-abc) can only be after a language subtag, like "und"
It allows for certain semantic additions and constraints:
- Certain codes that are private-use in BCP 47 and ISO are given semantics by LDML
- Each macrolanguage has an identified primary encompassed language, which is treated as an alias for the macrolanguage, and thus is replaced when canonicalizing (as allowed by BCP 47, see Section 4.1.2)
It allows certain syntax for backwards compatibility (not BCP 47-compatible):
- The "_" character for field separator characters, as well as the "-" used in [BCP47] (however, the canonical form is with "-")
- The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)
- The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.

There are thus two subtypes of Unicode locale identifiers, as defined above.

Unicode BCP 47 locale identifier (