Closed Bug 345156 Opened 14 years ago Closed 14 years ago

Support Unicode 5.0.0

Categories

(Core :: Internationalization, defect)

RESOLVED FIXED

People

(Reporter: smontagu, Assigned: smontagu)

Details

(Keywords: fixed1.8.1)

Attachments

(2 files, 4 obsolete files)

Unicode 5.0.0 has been released. We need to update data files generated from the Unicode Character Database.

I started gathering documentation on how to do this a little while ago, see http://wiki.mozilla.org/I18n:Updating_Unicode_version.
Attached patch Code changes (obsolete)
I'll attach the diffs to the generated files after checking in bug 345024. What should we do about normalization (ref. bug 210502 comment 26 and following)?
Attachment #229843 - Flags: review?(jshin1987)
Attached patch Diffs of generated files (obsolete)
This doesn't really need review as such, but I'd appreciate it if you check that I haven't omitted any files.
Attachment #229958 - Flags: review?(jshin1987)
I realize it's late in the game, but this is a very low-risk change, and it would be a big feather in our cap to release Firefox 2 with support for Unicode 5.0.0.
Flags: blocking1.8.1?
Attachment #229843 - Attachment is obsolete: true
Attachment #230303 - Flags: review?(jshin1987)
Attachment #229843 - Flags: review?(jshin1987)
Comment on attachment 230303
Code changes, with better handling of undefined characters in genbidicattable.pl

Please ignore BidiMirroring.txt in this patch. I'm not sure why it's even in the Mozilla source tree and I'm going to go ahead and cvs remove it.
Phil, Axel, how important is Unicode 5.0.0 support for our localizers? Important enough to block? I don't think anyone's arguing against taking this, but I'm not sure we want to block the release on it.
not a blocker, will consider a trunk-baked patch
Flags: blocking1.8.1? → blocking1.8.1-
Comment on attachment 230303
Code changes, with better handling of undefined characters in genbidicattable.pl

r=jshin
sorry for the delay.
Attachment #230303 - Flags: review?(jshin1987) → review+
Comment on attachment 229958
Diffs of generated files

Simon,
Didn't you get a warning about the number of patterns for plane 0 being larger than 256?
We now have 261 patterns, which can no longer be indexed with a PRUint8.

We need to come up with a way to deal with that. Using PRUint16 for the index will double the table size (about a 3kB increase). Perhaps splitting the BMP into two would fare better in terms of memory footprint.

Do you have any better idea?
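
For anyone following along, the index-width arithmetic above can be sketched as follows. This is only an illustration in Python, not code from genbidicattable.pl; the function name `index_bytes_per_entry` is hypothetical.

```python
def index_bytes_per_entry(num_patterns: int) -> int:
    """Return the smallest unsigned index width, in bytes, that can number
    num_patterns distinct patterns (1 byte covers pattern numbers 0..255)."""
    if num_patterns <= 0:
        raise ValueError("need at least one pattern")
    width = 1
    while num_patterns > 256 ** width:
        width += 1
    return width


# 256 patterns still fit a one-byte (PRUint8-sized) index; 261 do not,
# so every index entry would have to grow to two bytes.
assert index_bytes_per_entry(256) == 1
assert index_bytes_per_entry(261) == 2
```

Because every index entry grows from one byte to two, the index table doubles in size even though only a handful of patterns are over the limit.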
Attachment #229958 - Flags: review?(jshin1987)
It looks like the code that should issue that warning is broken :)

Output was
 Plane 0 has 262 patterns
 Plane 1 has 57 patterns
 total = 3874
and I forgot that the index was PRUint8. Thanks for picking up on that!
Attached patch Fix bug in warning (obsolete)
Attachment #230743 - Flags: review?(jshin1987)
Comment on attachment 230743
Fix bug in warning

r=jshin

Sorry that I made a mistake in that script.
Attachment #230743 - Flags: review?(jshin1987) → review+
Looking at the generated file, it turns out that most of the variety in the patterns is right at the beginning of the BMP. I experimented with a few different values and ended up splitting U+001 - U+1CFF into one "plane" and all the rest of planes 0 and 1 into another "plane". This gives:

 Plane 0 has 181 patterns
 Plane 1 has 154 patterns
 total = 3938
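
The general idea of splitting the range so that each half has its own, smaller pattern set can be sketched like this. It is a toy Python illustration under stated assumptions, not the logic of genbidicattable.pl: the block size `BLOCK`, the helper names, the split point, and the synthetic `toy_category` data are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

BLOCK = 8  # hypothetical block size; the real script's value is not given in this bug

Pattern = Tuple[int, ...]


def build_table(lo: int, hi: int,
                category: Callable[[int], int]) -> Tuple[List[int], List[Pattern]]:
    """Compress the range [lo, hi) into an index of pattern numbers plus the
    list of unique patterns. Identical blocks share a single pattern."""
    patterns: List[Pattern] = []
    seen: Dict[Pattern, int] = {}
    index: List[int] = []
    for start in range(lo, hi, BLOCK):
        block = tuple(category(cp) for cp in range(start, min(start + BLOCK, hi)))
        if block not in seen:
            seen[block] = len(patterns)
            patterns.append(block)
        index.append(seen[block])
    return index, patterns


def lookup(cp: int, lo: int, index: List[int], patterns: List[Pattern]) -> int:
    """Look up one code point in a table built by build_table for a range starting at lo."""
    offset = cp - lo
    return patterns[index[offset // BLOCK]][offset % BLOCK]


def toy_category(cp: int) -> int:
    # Synthetic data: varied values low in the range, uniform above it.
    return cp % 5 if cp < 0x100 else 0


SPLIT = 0x100  # hypothetical split point for the toy data
low_index, low_patterns = build_table(0x0, SPLIT, toy_category)
high_index, high_patterns = build_table(SPLIT, 0x1000, toy_category)

# Each half numbers its patterns independently, so each count only has to
# fit the index type on its own.
assert len(low_patterns) <= 256 and len(high_patterns) <= 256
assert lookup(0x42, 0x0, low_index, low_patterns) == toy_category(0x42)
```

The trade-off visible in the numbers above is that patterns shared between the two halves are stored twice, which is why the total goes up slightly while each per-"plane" count drops below 256.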
Attachment #230303 - Attachment is obsolete: true
Attachment #230743 - Attachment is obsolete: true
Attachment #230888 - Flags: review?(jshin1987)
Attachment #229958 - Attachment is obsolete: true
Comment on attachment 230888
Code changes fixing the table overflow

r=jshin
Attachment #230888 - Flags: review?(jshin1987) → review+
Attachment #230888 - Flags: superreview?(jag)
Comment on attachment 230888
Code changes fixing the table overflow

Add a check that $planeSplit < $tl && %planeSplit > $th to reinforce that comment there.
Attachment #230888 - Flags: superreview?(jag) → superreview+
s/%/$/
Checked in.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment on attachment 230888
Code changes fixing the table overflow

This has been baking on the trunk for two weeks with no regressions reported.
Attachment #230888 - Flags: approval1.8.1?
Comment on attachment 230888
Code changes fixing the table overflow

a=beltzner on behalf of drivers, for the mozilla 181 branch
Attachment #230888 - Flags: approval1.8.1? → approval1.8.1+
Keywords: fixed1.8.1
Some questions about standards compliance: Unicode 5 has some incompatible changes in it, specifically that curly quotes, amongst other characters, were added to the bidi-mirroring list.

1) I can't understand (or find, really) the legalese describing exactly how the HTML spec changes as the Unicode spec changes out from under it. I see that it refers to automatically accepting new characters into the character set, but does that mean that mirroring changes are automatically added as well? Is this a job for a putative HTML-4.0.2? Does ISO 10646 have mirroring properties also, or is that just Unicode? If it is just Unicode, then by that logic maybe we should be using the Unicode-3.0 mirroring table.

2) Are all web pages that assume the Unicode-4 semantics for curly quotes instantaneously wrong the minute that Unicode-5 is published? Maybe not, but how would an HTML author indicate which bidi semantics are to be used? A different doctype?

3) What is the right mailing list for this question? :)

My point is, maybe there are parts of Unicode-5 that we /don't/ want to blindly adopt, since it would automatically break every single bidi page containing curly quotes. Adding characters is a backward-compatible change. Changing existing layout semantics is not, and should be done carefully.
Just to put a finer point on the question: in FF2, the behavior is Unicode-5-ish on OSX and Unicode-4-ish on Windows.
https://bugzilla.mozilla.org/attachment.cgi?id=250434
Which is right, or is the answer fuzzy? Your answer will let me understand whether to report this bug against Windows or OSX.
See http://unicode.org/review/resolved-pri.html#pri72. Unicode chose to destabilize existing documents now and introduce a stability policy for the future, so right now the answer is fuzzy. What makes it even fuzzier is that on platforms that can do their own Bidi layout (which in FF2 is only Windows, but in FF3 will be all platforms) we don't do the mirroring ourselves but let the OS handle it.

The right mailing list for the questions in comment 21 is probably http://lists.w3.org/Archives/Public/www-international/