Closed Bug 1382253 Opened 8 years ago Closed 7 years ago

Uppercase privateuse delimiter not correctly handled in IsStructurallyValidLanguageTag

Categories

(Core :: JavaScript: Internationalization API, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: anba, Assigned: anba, Mentored)

References

Details

Attachments

(4 files, 3 obsolete files)

Test case: --- Intl.getCanonicalLocales("de-X-a-a"); --- Expected: No RangeError Actual: Throws `RangeError: invalid language tag: de-X-a-a`
Priority: -- → P3
function IsStructurallyValidLanguageTag(locale) { assert(typeof locale === "string", "IsStructurallyValidLanguageTag"); var languageTagRE = getLanguageTagRE(); if (!regexp_test_no_statics(languageTagRE, locale)) return false; // Before checking for duplicate variant or singleton subtags with // regular expressions, we have to get private use subtag sequences // out of the picture. if (callFunction(std_String_startsWith, locale, "x-")) return true; var pos = callFunction(std_String_indexOf, locale, "-x-"); if (pos !== -1) locale = callFunction(String_substring, locale, 0, pos); A capitalized "x" passes the languageTagRE test, but because we keep using the original input string for testing, the capitalized "X" isn't recognized by the startsWith/indexOf checks. Probably we should just check for capitalized variants. The ultimate *best* fix, IMO, is to validate language tags into a structured object that represents all the possible details in a langtag. (I've wanted to do this for years.) But this looks very finicky/tricky (not least in JS which lacks any kind of enforced exhaustive-mapping functionality), so certainly can't delay a fix here.
Mentor: jwalden+bmo
Drive-by change to remove another use of regular expression parsing in Intl code. IsWellFormedCurrencyCode: - The input is always a string value, so we can replace the ToString call with an assertion. - We only need to test if the input is a three letter currency code. And while converting the input to upper-case and then throwing a RegExp at it works, it's faster to use the IsASCIIAlphaString helper function here. CurrencyDigits: - The regular expression can easily be replaced by using IsWellFormedCurrencyCode to test that the input has the correct length and contains only ASCII alpha characters and by using toASCIIUpperCase to ensure all ASCII characters are in upper-case.
Assignee: nobody → andrebargull
Status: NEW → ASSIGNED
Attachment #8948423 - Flags: review?(jwalden+bmo)
Replaces the RegExp-based BCP47 language tag parser with a String-based. The next patch in this queue will return a structured object holding the individual subtags of a language tag. The parser works on code-units because that's faster for JavaScript and while we could move it to C++, leaving it in JavaScript for now makes it easier to implement Intl.Locale (bug 1433303). The new parser also directly tests for duplicate variants and duplicate extension keys, so we no longer need getDuplicateVariantRE() and getDuplicateSingletonRE().
Attachment #8948424 - Flags: review?(jwalden+bmo)
Amends the language tag parser to return a structured object holding the individual subtags. Language tag canonicalization is now a bit easier to follow, because we no longer need to split the language tag into its individual subtags in the canonicalization function, but instead can directly take the subtags from the language tag object. There are now five functions to work on language tags: 1. parseLanguageTag() - The actual language tag parser. 2. IsStructurallyValidLanguageTag() - Matches the ECMA402 abstract operation. - The input is a string value and returns true if the input is a well-formed language tag. - Should only be used if no canonicalization is needed, e.g. in assertions. 3. CanonicalizeLanguageTag() - Matches the ECMA402 abstract operation. - The input is a well-formed language tag and returns its canonical form. - Should only be used if no language tag object is available or if performance is not an issue, e.g. in assertions. 4. CanonicalizeLanguageTagFromObject() - The input is a language tag object returned from parseLanguageTag() and the function returns the canonical form of this language tag. 5. ValidateAndCanonicalizeLanguage() - Combined function to validate and canonicalize a language tag. - Has a fast-path for standalone languages. ValidateAndCanonicalizeLanguage() and DefaultLocaleIgnoringAvailableLocales() were changed to call the new parseLanguageTag() and CanonicalizeLanguageTagFromObject() for efficiency reasons. CanonicalizeLocaleList() was updated to call ValidateAndCanonicalizeLanguage(), so it can make use of the fast path in ValidateAndCanonicalizeLanguage(). There's one catch: Regular grandfathered language tags are returned as non-grandfathered language tags in the parser, e.g. `parseLanguageTag("art-lojban")` returns the object `{language: "art", ..., variants: ["lojban"], ...}`. There are two reasons for this behaviour: It makes the parser a bit easier and the Intl.Locale proposal (at least in its current form) requires to handle regular grandfathered language tags as non-grandfathered tags. I'd like to leave this as is for now, at least until the Intl.Locale proposal has matured a bit more and we know the concrete requirements for Intl.Locale.
Attachment #8948426 - Flags: review?(jwalden+bmo)
Blocks: 1433303
Attachment #8948423 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 8948424 [details] [diff] [review] bug1382253-part2-langtag-parser.patch Review of attachment 8948424 [details] [diff] [review]: ----------------------------------------------------------------- Yum. ::: js/src/builtin/intl/CommonFunctions.js @@ +84,5 @@ > + // Current parse index in |locale|. > + var index = 0; > + > + // The three possible token types. > + var NONE = 0, ALPHA = 1, DIGIT = 2; Could maybe #define/#undef these locally to avoid a bunch of name lookups for the JITs et al. to boil away. Also use binary to imply bitfieldness. Also maybe "token type bits". Would also be nice to have, instead of opaque constants, things like #define LOWER_A 0x41 #define LOWER_Z 0x5A #define UPPER_A 0x61 #define UPPER_Z 0x7A #define ZERO 0x30 #define NINE 0x39 #define HYPHEN 0x2D #define UPPER_X 0x78 and then a series of assertions that each of these really is the character code for that letter, to give the reader who hasn't memorized these confidence in them. Then use these throughout the patch to eliminate every magic unnamed constant. @@ +120,5 @@ > + > + // Returns the code unit of the first character at the current token > + // position. If the character is an ASCII upper-case character, the > + // code unit of its lower-case form is returned. > + function tokenStartCodeUnit() { Put "Upper" in the name somewhere. @@ +197,5 @@ > + // RFC 5646 section 2.1 > + // alphanum = (ALPHA / DIGIT) ; letters and numbers > + var seenVariants = []; > + while ((5 <= tokenLength && tokenLength <= 8) || > + (tokenLength === 4 && tokenStartCodeUnit() <= 0x39)) Would prefer if there were a sanity assert of >= 0x30 in the latter case, if we're going to not do the lower-bound comparison because of known-rangeness. @@ +200,5 @@ > + while ((5 <= tokenLength && tokenLength <= 8) || > + (tokenLength === 4 && tokenStartCodeUnit() <= 0x39)) > + { > + var variant = callFunction(String_substring, locale, tokenStart, > + tokenStart + tokenLength); There's a Substring helper function in builtin/String.js that you should use instead. Also in any other places where substringing happens in this patch later on. @@ +212,5 @@ > + // characters here: we have already verified the variant is > + // alphanumeric ASCII.) > + variant = callFunction(std_String_toLowerCase, variant); > + > + // Reject the language tag if a duplicate variant was found. At least note here that this linear-time verification step means that validity checking is potentially quadratic, even if we're okay doing that because language tags are unlikely to be deliberately pathological. @@ +230,5 @@ > + // / %x79-7A ; y - z > + var seenSingletons = []; > + while (tokenLength === 1) { > + var singleton = tokenStartCodeUnit(); > + if (singleton === 0x78) Maybe // "x" / "X". @@ +241,5 @@ > + assert(!(0x41 <= singleton && singleton <= 0x5A), "unexpected upper-case code unit"); > + > + // Reject the input if a duplicate singleton was found. > + if (callFunction(ArrayIndexOf, seenSingletons, singleton) !== -1) > + return false; Also here note something like O(35**2), 35 because there are max 35 singletons possible. @@ +272,5 @@ > + return false; > + } while (1 <= tokenLength && tokenLength <= 8); > + } > + > + // Return if the complete input was successfully parsed. Maybe add "-- which is to say, this is a non-grandfathered langtag language-tag" or so. @@ +279,2 @@ > > + // Ensure any remaining parts are ASCII-only characters. alphanum-only? ::: js/src/tests/non262/Intl/tolower-ascii-equivalent.js @@ +3,5 @@ > +// Language tags are processed case-insensitive, but unconditionally calling > +// the built-in String.prototype.toLowerCase() function before parsing a > +// language tag can map non-ASCII characters into the ASCII range. > +// > +// Validate the BCP47 language tag parser handles this case (pun intended) U+1F4AF HUNDRED POINTS SYMBOL (historically stupid Bugzilla) @@ +11,5 @@ > +// The lower-case form of "i-ha\u212A" is "i-hak". > +assertEq("i-hak", "i-ha\u212A".toLowerCase()); > + > +// "i-hak" is a valid language tag. > +assertEqArray(Intl.getCanonicalLocales("i-hak"), ["hak"]); May as well add the opposite case of U+0131 LATIN SMALL LETTER DOTLESS I uppercasing to U+0049 LATIN CAPITAL LETTER I and that not being a valid langtag. @@ +17,5 @@ > +// But "i-ha\u212A" is not a valid language tag. > +assertThrowsInstanceOf(() => Intl.getCanonicalLocales("i-ha\u212A"), RangeError); > + > +if (typeof reportCompare === 'function') > + reportCompare(0, 0); This test should be upstreamed in a PR, then imported into local/ (if that still remains in the modern brave new test directory structure). ::: js/src/tests/non262/Intl/uppercase-privateuse.js @@ +1,3 @@ > +// |reftest| skip-if(!this.hasOwnProperty("Intl")) > + > +// privatuse subtags can start with upper-case 'X'. privateuse -- BCP47 ain't Deutsch. :-)
Attachment #8948424 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 8948426 [details] [diff] [review] bug1382253-part3-structured-object.patch Review of attachment 8948426 [details] [diff] [review]: ----------------------------------------------------------------- Very nice. At least once I wrapped my head around the double-viewing bit where I ask for a comment pointing it out. ::: js/src/builtin/intl/CommonFunctions.js @@ +68,5 @@ > > return combined; > } > > +/* eslint-disable complexity */ "stfu eslint" @@ +149,5 @@ > + // Language tags are compared and processed case-insensitively, so > + // technically it's not necessary to adjust case. But for easier processing, > + // and because the canonical form for most subtags is lower case, we start > + // with lower case for all. > + var localeLowercase = callFunction(std_String_toLowerCase, locale); 1. I misread this somewhat as we would do this, then stop using |locale|. But no -- what happens is we keep iterating over |locale|, but when we substring and do other copying-of-contents actions, we use the lowercase copy. This is worth a comment specifically pointing out this simultaneous usage. 2. Assert the two strings have the same length as sanity-check? @@ +162,5 @@ > } > > + // Returns the current token part transformed to lower-case. > + function tokenString() { > + return callFunction(String_substr, localeLowercase, tokenStart, tokenLength); Substring as noted in the last patch, and anywhere else in these changes. @@ +257,5 @@ > // no worry about case transformation generating invalid > // characters here: we have already verified the variant is > // alphanumeric ASCII.) > + assert(variant === callFunction(std_String_toLowerCase, variant), > + "unexpected non lower-case variant"); Comment above this can probably be trimmed some now. @@ +352,5 @@ > > + // Compare the lower-case form of locale to the list of grandfathered > + // language tags. Because we have just verified the input contains only > + // ASCII characters, we don't have to worry about case transformations > + // generating invalid characters here. There's no case transform any more, so this comment needs cleaning up. @@ +457,2 @@ > */ > +function CanonicalizeLanguageTagFromObject(localeObj) { It would be preferable if all these changes were moved down to where CanonicalizeLanguageTag was before this patch, for reviewing ease and for blame and archaeology in the future. @@ +465,5 @@ > + // Handle mappings for complete tags. > + if (hasOwn(locale, langTagMappings)) > + return langTagMappings[locale]; > + > + assert(!hasOwn("grandfathered", localeObj), "grandfathered tags should be mapped completely"); Span this across two lines, please -- more readable. @@ +503,5 @@ > + // no extlang mapping will replace a normal language code. > + // The preferred value for all current deprecated extlang subtags > + // is equal to the extlang subtag, so we only need remove the > + // redundant prefix to get the preferred value. > + if (hasOwn(extlang1, extlangMappings) && extlangMappings[extlang1] === canonical) I think I would slightly prefer if the comparison were to |language|, because that's the value that actually matters. That at this instant, that happens to be the partial canonical langtag, is not a good reason to use that name instead. @@ +525,5 @@ > + if (script) { > + // The first character of a script code needs to be capitalized. > + // "hans" -> "Hans" > + script = callFunction(std_String_toUpperCase, script[0]) + > + callFunction(String_substring, script, 1); I'd prefer if Substring were used here, if we're touching this. @@ +529,5 @@ > + callFunction(String_substring, script, 1); > + > + // Replace deprecated subtags with their preferred values. > + if (hasOwn(script, langSubtagMappings)) > + script = langSubtagMappings[script]; This doesn't look like it should ever hit -- no keys with first letter capitalized. Remove it, or replace it with an assert of not applying if that makes any semantic sense (which it doesn't seem to me like it should)? @@ +543,5 @@ > + // "BU" -> "MM" > + // This has to come after we capitalize region codes because > + // otherwise some language and region codes could be confused. > + // For example, "in" is an obsolete language code for Indonesian, > + // but "IN" is the country code for India. "This has to..." doesn't apply any more, right? We're no longer doing the grody things we did before. Also the first sentence should be made more precise -- refer to regions and not subtags generically, maybe? (This probably applies earlier, too, doesn't it?) Also, I don't remember how we generated these mappings, but it might be nice to split up the mapping goo by the separate categories, now that we don't have a loop that's *compelled* to group everything together -- at least for clarity. Doesn't need to happen immediately.
Attachment #8948426 - Flags: review?(jwalden+bmo) → review+
(In reply to Jeff Walden [:Waldo] from comment #5) > Would also be nice to have, instead of opaque constants, things like > > #define LOWER_A 0x41 > #define LOWER_Z 0x5A > #define UPPER_A 0x61 > #define UPPER_Z 0x7A > #define ZERO 0x30 > #define NINE 0x39 > #define HYPHEN 0x2D > #define UPPER_X 0x78 > > and then a series of assertions that each of these really is the character > code for that letter, to give the reader who hasn't memorized these > confidence in them. Makes sense, especially because lower<->upper has to be swapped above. :-) > > @@ +120,5 @@ > > + > > + // Returns the code unit of the first character at the current token > > + // position. If the character is an ASCII upper-case character, the > > + // code unit of its lower-case form is returned. > > + function tokenStartCodeUnit() { > > Put "Upper" in the name somewhere. Yes, I'll add "Lower". :-D > @@ +17,5 @@ > > +// But "i-ha\u212A" is not a valid language tag. > > +assertThrowsInstanceOf(() => Intl.getCanonicalLocales("i-ha\u212A"), RangeError); > > + > > +if (typeof reportCompare === 'function') > > + reportCompare(0, 0); > > This test should be upstreamed in a PR, then imported into local/ (if that > still remains in the modern brave new test directory structure). I have no idea how the upstreaming process is supposed to work. There is a test262-export.py script, but it appears to only accept directories and the generated test262-like files are a bit corrupted (they all end with |if (typeof assert.sameValue === "function")|, so an if-statement without a consequent). So I'll guess manual PRs are still the way to go. :-/
(In reply to Jeff Walden [:Waldo] from comment #6) > 2. Assert the two strings have the same length as sanity-check? Not possible because of U+0130 which lower-cases to the sequence <U+0069 U+0307>. :-( > @@ +457,2 @@ > > */ > > +function CanonicalizeLanguageTagFromObject(localeObj) { > > It would be preferable if all these changes were moved down to where > CanonicalizeLanguageTag was before this patch, for reviewing ease and for > blame and archaeology in the future. Hmm, I don't think that'll work, because the lines have different indentations. > @@ +503,5 @@ > > + // no extlang mapping will replace a normal language code. > > + // The preferred value for all current deprecated extlang subtags > > + // is equal to the extlang subtag, so we only need remove the > > + // redundant prefix to get the preferred value. > > + if (hasOwn(extlang1, extlangMappings) && extlangMappings[extlang1] === canonical) > > I think I would slightly prefer if the comparison were to |language|, > because that's the value that actually matters. That at this instant, that > happens to be the partial canonical langtag, is not a good reason to use > that name instead. I've used |canonical| at this point instead of |language| for consistency with the extlang2 and extlang3 checks, and because the IANA language tag registry requires a matching prefix for conditional subtag replacements. (In general the registry specifies conditional mappings based on prefixes and not based on individual subtags.) But now I'm reading https://tools.ietf.org/html/rfc5646#section-2.2.2 and I roll my eyes: --- [...] 2. Extended language subtag records MUST include exactly one 'Prefix' field indicating an appropriate subtag or sequence of subtags for that extended language subtag. [...] 4. Although the ABNF production 'extlang' permits up to three extended language tags in the language tag, extended language subtags MUST NOT include another extended language subtag in their 'Prefix'. That is, the second and third extended language subtag positions in a language tag are permanently reserved and tags that include those subtags in that position are, and will always remain, invalid. [...] --- Per 2. the prefix can contain more than one subtag, per 4. the prefix will never contain other extlang subtags, per the ABNF the prefix contains exactly one language subtag and up to three extlang subtags. I assume the editors were parents: "Child, you can stay outside as long as you want. But you're back at six!" :-) > @@ +529,5 @@ > > + callFunction(String_substring, script, 1); > > + > > + // Replace deprecated subtags with their preferred values. > > + if (hasOwn(script, langSubtagMappings)) > > + script = langSubtagMappings[script]; > > This doesn't look like it should ever hit -- no keys with first letter > capitalized. Remove it, or replace it with an assert of not applying if > that makes any semantic sense (which it doesn't seem to me like it should)? I'll change it to an assertion and then we can add a check to the update script that script subtags don't have preferred values. > > @@ +543,5 @@ > > + // "BU" -> "MM" > > + // This has to come after we capitalize region codes because > > + // otherwise some language and region codes could be confused. > > + // For example, "in" is an obsolete language code for Indonesian, > > + // but "IN" is the country code for India. > > "This has to..." doesn't apply any more, right? We're no longer doing the > grody things we did before. Also the first sentence should be made more > precise -- refer to regions and not subtags generically, maybe? (This > probably applies earlier, too, doesn't it?) This still applies, because langSubtagMappings still contains language _and_ region subtags. But if we split langSubtagMappings as suggested below, we can reword this comment. > > Also, I don't remember how we generated these mappings, but it might be nice > to split up the mapping goo by the separate categories, now that we don't > have a loop that's *compelled* to group everything together -- at least for > clarity. Doesn't need to happen immediately. Yes, I agree we should split langSubtagMappings.
Updated part 2 to apply review comments. Carrying r+.
Attachment #8948424 - Attachment is obsolete: true
Attachment #8949503 - Flags: review+
Updated part 3 to apply review comments. Carrying r+.
Attachment #8948426 - Attachment is obsolete: true
Attachment #8949504 - Flags: review+
Splits |langSubtagMappings| into |languageMappings| for language subtag replacements and |regionMappings| for region subtag replacements. - Removes these comments [1] about subtag and extlang mappings, because we no longer canonicalize subtags in a loop. - Amend some comments in |CanonicalizeLanguageTagFromObject()| by referring to [RFC 5646 section 2.2.2]. - Replace this check [2] because it can never apply for extlang mappings since the preferred value for a extlang replacement is always equal to the extlang value [RFC 5646 section 2.2.2]. [1] https://searchfox.org/mozilla-central/rev/a539f8f7fe288d0b4d8d6a32510be20f52bb7a0f/js/src/builtin/intl/CommonFunctions.js#382-383,389-390 [2] https://searchfox.org/mozilla-central/rev/a539f8f7fe288d0b4d8d6a32510be20f52bb7a0f/js/src/builtin/intl/make_intl_data.py#149-151
Attachment #8949512 - Flags: review?(jwalden+bmo)
Comment on attachment 8949512 [details] [diff] [review] bug1382253-part4-split-maps.patch Review of attachment 8949512 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/builtin/intl/CommonFunctions.js @@ +545,5 @@ > > var canonical = language; > > if (extlang1) { > + // Replace deprecated extlang subtags with their preferred values. We're not just replacing the extlang subtag, we're replacing it *plus* its prefix. Which I guess is what the next paragraph says, but a bit more slowly. How about if the whole thing here were """ When a deprecated extlang subtag is encountered with its corresponding language tag prefix, replace the combination with the preferred value -- which MUST be the unadorned extlang subtag. For example, this entry Type: extlang Subtag: nan Description: Min Nan Chinese Added: 2009-07-29 Preferred-Value: nan Prefix: zh Macrolanguage: zh is interpreted to say that "nan" extlang appearing after a "zh" prefix is deprecated, so "zh-nan" must be replaced by the preferred value "nan". See also <https://tools.ietf.org/html/rfc5646#section-2.2.2>. """ ::: js/src/builtin/intl/make_intl_data.py @@ +84,5 @@ > file date. > > + We also check that extlang mappings don't generate preferred values > + which in turn are subject to language subtag mappings, so that > + CanonicalizeLanguageTag can process subtags in sequential order. can process subtags sequentially
Attachment #8949512 - Flags: review?(jwalden+bmo) → review+
Updated to apply review comments (*), carrying r+.
Attachment #8949512 - Attachment is obsolete: true
Attachment #8950226 - Flags: review+
Pushed by nerli@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/99dd9444c2df Part 1: Remove regular expressions to validate currency digits. r=Waldo https://hg.mozilla.org/integration/mozilla-inbound/rev/cd80bec8de58 Part 2: Remove regular expressions to validate language tags. r=Waldo https://hg.mozilla.org/integration/mozilla-inbound/rev/2c6c9489d9de Part 3: Use a structured object for language-tag canonicalization. r=Waldo https://hg.mozilla.org/integration/mozilla-inbound/rev/996deca6cc2d Part 4: Split maps for preferred-values in language tags. r=Waldo
Keywords: checkin-needed
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: