Closed Bug 728180 Opened 9 years ago Closed 4 years ago
Use ICU to replace the (obsolete) implementation of ns
I was intending to make use of our normalization support, but on looking into intl/unicharutil to see what nsIUnicodeNormalizer could offer, it appeared to be quite out-of-date, with the key normalization data in http://mxr.mozilla.org/mozilla-central/source/intl/unicharutil/src/normalization_data.h not having been updated since the move to HG (at least). There's a NormalizationTest tool in intl/unicharutil/tests, but the necessary test data isn't actually present in the tree, so I don't see how it can be testing much - and it doesn't look like it is run as part of "make check" anyway, AFAICT. I tried running the NormalizationTest after generating up-to-date test data with the genNormalizationData.pl tool. As expected, the results indicate that the implementation of normalization is seriously behind the times: NormalizationTest: test nsIUnicodeNormalizer. UCD version: 6.1.0 Test Part0: Specific cases 23 cases passed, 0 failed Test Part1: Character by character test 1113792 cases passed, 319 failed Test Part2: Canonical Order Test 740 cases passed, 566 failed Test Part3: PRI #29 Test 25 cases passed, 143 failed Test finished
It turns out there are two aspects to this. First, we need to regenerate normalization_data.h based on current Unicode data files. This will resolve the failures in Part1 and Part2 of the test. However, there's also the correction to the definition of normalization, as discussed in http://www.unicode.org/review/pr-29.html and incorporated into UAX#15 as of Unicode 4.1. This requires a code fix in nsUnicodeNormalizer.cpp. Patches upcoming, after I finish building & testing locally....
One complication to consider: I'm not sure what all the various users of nsIUnicodeNormalizer may be, but I believe there are some standards/protocols that explicitly require the use of a specific, older version of Unicode normalization. Do we even _know_ what version of Unicode is currently implemented by nsIUnicodeNormalizer? I didn't see any obvious mention of the version; I assumed it was meant to support the same Unicode version as we're supporting in the rest of our code (the current version, allowing for a short time lag to update our data tables when a new version is released). But that may not be true, even in intention (let alone in actuality). So do we actually need to support _multiple_ normalization tables, so as to provide one for general use by code that just wants "support for the current Unicode version", and one (or more) specific older versions to support protocols that mandate their use? Taking a quick look at some docs: http://tools.ietf.org/html/rfc3454 (stringprep) references Unicode 3.2.0 http://tools.ietf.org/html/rfc3491 (nameprep) also uses Unicode 3.2, by reference to stringprep http://tools.ietf.org/html/rfc5890 (IDNA - definitions) references Unicode 5.2.0 http://tools.ietf.org/html/rfc5891 (IDNA - protocol) references RFC5890, but also (normatively) references the _current_ version of UAX#15 (which would now be 6.1.0) - I suspect that may be unintentional? Ugh.
According to MXR, it seems that the only two users of nsIUnicodeNormalizer in the m-c tree are at (1) http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/nsIDNService.cpp#539 (as part of the implementation of stringprep), and (2) http://mxr.mozilla.org/mozilla-central/source/dom/plugins/base/nsNPAPIPlugin.cpp#2205 (looks like part of an API to return the document origin, applies normalization form NFKC before returning it). For use (1), stringprep, the spec seems to insist on Unicode 3.2.0, so updating nsIUnicodeNormalizer would _break_ that. For use (2), it's less clear. I found https://wiki.mozilla.org/NPAPI:DocumentOrigin, but this only says that the API returns "the Unicode serialization of the origin converted to NFKC-encoded (normalized) UTF-8"; in the absence of a mention of any specific Unicode version, it would be natural to assume this refers to whatever version of Unicode the browser currently "supports" (i.e. 6.1, for Gecko). In view of item (1) here, I think we need to be very cautious about updating the normalization code; however, I also think it is highly misleading to provide an "nsIUnicodeNormalizer" service that is tied to an old version of Unicode, without making this limitation extremely clear in its documentation/headers/even its NAME. (nsIUnicode_3_2_0_Normalizer?)
FTR, I'm attaching a patch that updates the normalization data to Unicode 6.1, although as discussed above, I don't think we can just go ahead and do this.
Forgot to add, with that patch, the NormalizationTest (run against Unicode 6.1 test data) report: NormalizationTest: test nsIUnicodeNormalizer. UCD version: 6.1.0 Test Part0: Specific cases 23 cases passed, 0 failed Test Part1: Character by character test 1114111 cases passed, 0 failed Test Part2: Canonical Order Test 1306 cases passed, 0 failed Test Part3: PRI #29 Test 12 cases passed, 156 failed
And this is the normalization code fix for PRI-29; with this patch in addition to the data update, NormalizationTest passes all tests.
See bug 210502 comment 26 and following: we deliberately kept the normalization out-of-date in order not to break IDNA. I agree with you that it would have made more sense to make this explicit in the name. However, bug 479520 should make all this moot.
Depends on: IDNA2008
Ah, thanks, I hadn't run across bug 210502. That confirms my concern about updating this. What about the NPAPI consumer of normalization - is there any definitive ruling on what version that's supposed to be using? I do think we should have an up-to-date normalization service, for the sake of the (new) text-shaping consumer, and also to expose to any code (extensions, whatever) that might want normalization and isn't specifically tied to an old version.
Now we have ICU, but are low-level functions (such as nsUnicodeNormalizer::DecomposeNonRecursively) implementable without modifying the ICU source?
For DecomposeNonRecursively (which AFAIK is used only by the harfbuzz shaping backend), I think unorm2_getRawDecomposition can be used. http://icu-project.org/apiref/icu4c/unorm2_8h.html#a0551019637648af4638435d79f24be17
...though we should adapt the callback in gfxHarfBuzzShaper.cpp to use ICU directly rather than adding an extra layer of wrapper around it to emulate the old nsUnicodeNormalizer API. Especially as this is a particularly hot codepath. So I suspect we could scrap DecomposeNonRecursively altogether.
When ICU is available (i.e. when ENABLE_INTL_API is true), we can use ICU normalization to replace our old (unmaintained and obsolete) code here. The issue of requiring Unicode 3.2 normalization for IDNA is no longer a concern when ENABLE_INTL_API is true, as we've migrated to ICU stringprep services for that case. The ICU-based implementation doesn't need to provide the Compose and DecomposeNonRecursively functions, as these are not used in ENABLE_INTL_API builds, and will disappear from the tree completely when non-ENABLE_INTL_API codepaths are eliminated.
Summary: nsIUnicodeNormalizer is out of date and is not being tested → Use ICU to replace the (obsolete) implementation of nsUnicodeNormalizer
Attachment #598327 - Attachment is obsolete: true
Attachment #598330 - Attachment is obsolete: true
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0628ce6f4b8 FTR, this reduces the impact of ENABLE_INTL_API on the final .apk size of an Android build by about 32K, as we no longer end up with two copies of normalization data in the product. (It'll have a similar effect on desktop builds, too, though we're less concerned about size there.)
Do we need to preserve nsUnicodeNormalizer in the first place? The only remaining user is NPAPI if ICU is available. How about rewriting the last caller instead?
I've been assuming there'd be add-on users of the service that haven't necessarily all been updated to use the new JS String method. If we're willing to break such code then yes, I think we could remove the service entirely.
Incidentally, I see that bug 1292476 intends to remove the NPAPI consumer of this interface.
Depends on: 1292476
Comment on attachment 8795662 [details] [diff] [review] Use ICU normalization functions to implement nsUnicodeNormalizer when ENABLE_INTL_API is defined, in place of our obsolete/unmaintained normalization code Review of attachment 8795662 [details] [diff] [review]: ----------------------------------------------------------------- OK, let's discuss dropping nsIUnicodeNormalizer elsewhere.
Attachment #8795662 - Flags: review?(VYV03354) → review+
It's bug 1292486. I'm inclined to think we should go ahead and do it, but to be polite to any existing users, we should announce it ahead of time (an "intent to unship"?) and document it (with recommended alternatives). Meanwhile, we can do this internally without affecting any consumers.
https://hg.mozilla.org/integration/mozilla-inbound/rev/01cd7a8a158d425c216cb319898832c4891ec35e Bug 728180 - Use ICU normalization functions to implement nsUnicodeNormalizer when ENABLE_INTL_API is defined, in place of our obsolete/unmaintained normalization code. r=emk
You need to log in before you can comment on or make changes to this bug.