Bug 210502 (closed): Update generated files in unicharutils to Unicode 4.0.1 database

Opened 21 years ago, closed 19 years ago
Status: RESOLVED FIXED
Component: Core :: Internationalization (defect, normal)
Reporter: jshin1987, Assigned: jshin1987
Keywords: intl
Attachments: 4 files, 1 obsolete file

Now that Unicode 4.0 has been released, we need to update intl/unicharutils to
Unicode 4.0. That includes normalization, character properties, categories, and
so forth.
Summary: Update unicharutils to Unicode 4.0 → Update unicharutils to Unicode 4.0
This link can be used for testing Unicode 4 casing when you are ready.

http://www.w3.org/International/tests/test-text-transform.html
What we need to do is rather simple (at least on the surface):

1. grab idnkit at http://www.nic.ad.jp/en/idn/index.html
   (http://www.nic.ad.jp/ja/idn/idnkit/download/sources/idnkit-1.0-src.tar.gz)
2. there are three files we need in the kit:
   generate_normalize_data.pl, UCD.pm and SparseMap.pm
3. download the following Unicode data files:
   CaseFolding.txt, CompositionExclusions.txt, SpecialCasing.txt, UnicodeData.txt
4. run generate_normalize_data.pl, edit the result (remove the case folding part
   because we have separate scripts for that, and replace 'unsigned short' and
   'unsigned long' with 'PRUnichar' and 'PRUint32') and save it to 'unicodedata.h'
   (the current name is 'unicodedata_320.c', but I'm not sure it's wise to have
   the Unicode version number in the file name).

5. generate casetable.h and cattable.h with  gencasetable.pl and gencattable.pl
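
The hand edit in step 4 could probably be scripted. A minimal sketch in Python, assuming the generated file uses the plain C spellings 'unsigned short' and 'unsigned long'; the function name `retype_generated_source` is made up for illustration and nothing like it exists in the tree:

```python
import re

# Mapping taken from step 4: C types emitted by the generator -> NSPR types.
TYPE_MAP = {
    "unsigned short": "PRUnichar",
    "unsigned long": "PRUint32",
}

# Match either spelling as whole words, tolerating runs of spaces or tabs.
_TYPE_RE = re.compile(r"\bunsigned[ \t]+(short|long)\b")

def retype_generated_source(text):
    """Return `text` with the generator's C types replaced by NSPR types."""
    def swap(match):
        return TYPE_MAP["unsigned " + match.group(1)]
    return _TYPE_RE.sub(swap, text)
```

Removing the case folding part is not covered here, since the thread does not say how that part is delimited in the generated output.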

It'd have been nice if we had made this ready for 1.6, but I'm afraid it's too
late. 


  
Assignee: smontagu → jshin
Attached patch: preliminary patch (obsolete)
In addition to updating data to Unicode 4.0x, I had to make some changes to
support non-BMP characters. Still, more changes are necessary to support
non-BMP characters in case-conversion. So, even with this patch, the Deseret
alphabet is not properly supported when it comes to case conversion.
I forgot that case-conversion had been filed as a separate bug (bug 210501).
Status: NEW → ASSIGNED
It would be nice to have some documentation in the tree somewhere that explains
how to upgrade all the data we have that's generated from the Unicode database.

Also, the patch in bug 238844 might have some useful changes to the generation
scripts to avoid changes made manually to the files (license headers, alecf's
comment removal).

Also, retitling bug with additional search terms so it's possible to find it
(since I did about 10 searches before filing bug 238844).
Summary: Update unicharutils to Unicode 4.0 → Update generated files in unicharutils to Unicode 4.0 database
*** Bug 238844 has been marked as a duplicate of this bug. ***
(In reply to comment #5)
> It would be nice to have some documentation in the tree somewhere that explains

That's on my TODO list. I had to reverse engineer to figure out how to update
files here. 
 
> Also, the patch in bug 238844 might have some useful changes to the generation

 I may (or may not) have made a similar change in my tree. Anyway, thanks for
the patch.

> it's possible to find it (since I did about 10 searches before filing bug 238844).

I wonder what your 10 queries were.... Searching with the component
'internationalization', a summary containing 'Unicode' and the owner set to me
or smontagu would have brought you right here in a couple of queries ;-)

I have to run now. I'll try to fix it early next week. 

Updating summary since Unicode 4.0.1 has been released in the meanwhile.
http://www.unicode.org/versions/Unicode4.0.1/
http://www.unicode.org/Public/4.0-Update1/
Summary: Update generated files in unicharutils to Unicode 4.0 database → Update generated files in unicharutils to Unicode 4.0.1 database
dbaron's patch was combined with attachment 137911. I also changed the license
to MPL.
Attachment #137911 - Attachment is obsolete: true
Comment on attachment 150203
patch v2 (dbaron's patch incorporated and license change)

>Index: intl/unicharutil/tools/gentransliterate.pl
>+$header = <<END_OF_HEADER;
> ##
> ## The contents of this file are subject to the Netscape Public

Do you want this to spit out MPL-tri?
What's the status of this patch?
I'll make a new patch once Unicode 4.1 is released in February. In the meantime,
I converted nsIUGenCategory.h to nsIUGenCategory.idl and changed its methods,
'Get' and 'Is', to accept PRUint32 (UTF-32) instead of UCS-2. Nobody has ever
used them, so we don't have to worry about callers or compatibility (not that
it's hard to fix callers). Btw, I also want to change their names. Any
suggestions? How about 'getCategory' and 'isCategory'?
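
To illustrate why the argument type matters: a single 16-bit unit cannot name a character outside the BMP, so a caller holding UTF-16 has to combine a surrogate pair into one 32-bit code point before asking for its category. A minimal sketch in Python; the helper name is made up for illustration and is not a Mozilla API:

```python
def code_point_from_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # 10 payload bits from each half, offset past the BMP.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, the pair 0xD801 0xDC00 yields U+10400, the first code point of the Deseret block.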


Simon, I vaguely remember you made (at least suggested) a similar change in
nsICaseConversion, but the change is not in the tree. Will you take care of it
in bug 210501? 

With this patch, each Unicode plane has its own category pattern table because
otherwise the number of different category patterns (categories for 8
characters are stored in each pattern) exceeds 256. If I used a single pattern
table, PRUint16 instead of PRUint8 would have to be used for lookup keys, which
would increase the code size by about 3k.
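
The pattern-table idea can be sketched roughly as follows. This is a Python model, not the layout gencattable.pl actually emits: it assumes category codes that fit in 4 bits, packed eight to a pattern, and all names (`build_tables`, `get_cat`, `fits_uint8_index`) are hypothetical.

```python
def build_tables(cats):
    """Build (patterns, index) from one small category code per code point.

    `cats` must have a length that is a multiple of 8; each code is 0..15.
    """
    patterns = []   # unique packed patterns, 8 categories each
    seen = {}       # packed pattern -> its position in `patterns`
    index = []      # one entry per block of 8 consecutive code points
    for start in range(0, len(cats), 8):
        packed = 0
        for offset, cat in enumerate(cats[start:start + 8]):
            packed |= (cat & 0xF) << (4 * offset)
        if packed not in seen:
            seen[packed] = len(patterns)
            patterns.append(packed)
        index.append(seen[packed])
    return patterns, index

def get_cat(patterns, index, cp):
    """Look up the category of code point `cp` (relative to the table start)."""
    return (patterns[index[cp >> 3]] >> (4 * (cp & 7))) & 0xF

def fits_uint8_index(patterns):
    """True if every pattern number can be stored in an 8-bit index entry."""
    return len(patterns) <= 256
```

In this model the index holds one entry per 8 code points, so widening each entry from 8 to 16 bits doubles the index. Keeping a separate pattern table per plane is one way to stay under 256 distinct patterns per table.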

I updated nsIUnicodeGeneralCategory (renamed from nsIUGenCategory) to deal with
the full Unicode repertoire and made it scriptable. Its implementation was also
updated. 

The category table was generated from UCD 4.0.1, but will be updated to 4.1.0
once it's released.
See also bug 288137. I'd like to get all these bugs fixed by 1.8b2, preferably
with Unicode 4.1 (which should be released this month), but at least with 4.0.1
in the event that 4.1 is delayed.
This patch only deals with the Unicode normalization. It also has a new file
README.txt explaining how to generate various properties and header files.
Btw, diff was made against 'unicodedata_320.c', but I'm gonna cvs remove it and
add a new file 'normalization_data.h' in its place. 

This should be safe enough for branch landing because Unicode normalization
routines don't need any change to deal with non-BMP characters (they're already
full-unicode-proof), which means only the data file was updated to Unicode
4.1.0.
Attachment #179694 - Flags: superreview?(dbaron)
Attachment #179694 - Flags: review?(smontagu)
Comment on attachment 179694
normalization part patch + README.txt

>+   The latest version is, as of this writing is in 

Nit: one "is" is superfluous
Attachment #179694 - Flags: review?(smontagu) → review+
This fixes cattable.h and gencattable.pl. I haven't changed the interface
definition and its implementation that uses cattable.h because it's not used
anywhere. 

cattable.h in spellchecker is not updated. My plan is either to export
cattable.h in intl and cvs-remove the copy in spellchecker so that spellchecker
can refer to the copy in intl or to make spellchecker use nsIUGenCategory.
That'll be done in bug 287340.
Attachment #179704 - Flags: superreview?(dbaron)
Attachment #179704 - Flags: review?(smontagu)
Comment on attachment 179704
cattable.h and gencattable.pl patch

>-printf OUT "static PRUint8 GetCat(PRUnichar u)\n{\n";
>+printf OUT "static PRUint16 GetCat(PRUint32 u)\n{\n";

Why does this now return a PRUint16? The return value is still in the same
range, no?
(In reply to comment #18)

> Why does this now return a PRUint16? The return value is still in the same
> range, no?
 
Oops. You're absolutely right. Do you want me to upload a new patch with that
fixed? 

Comment on attachment 179694
normalization part patch + README.txt

>Index: intl/unicharutil/src/nsUnicodeNormalizer.cpp
>-#include "unicodedata_320.c"
>+#include "normalization_data.h"

I don't follow what's going on here, but sr=dbaron.
Attachment #179694 - Flags: superreview?(dbaron) → superreview+
Thanks for r/sr.
Attachment 179694 was landed on the trunk.

(In reply to comment #20)
> (From update of attachment 179694)
> >Index: intl/unicharutil/src/nsUnicodeNormalizer.cpp
> >-#include "unicodedata_320.c"
> >+#include "normalization_data.h"
> 
> I don't follow what's going on here, but sr=dbaron.

see comment #15. I removed unicodedata_320.c and added 'normalization_data.h',
but made a diff against 'unicodedata_320.c' to avoid making the patch a lot
longer than necessary. 
(In reply to comment #21)
> > I don't follow what's going on here, but sr=dbaron.
> 
> see comment #15. I removed unicodedata_320.c and added 'normalization_data.h',
> but made a diff against 'unicodedata_320.c' to avoid making the patch a lot
> longer than necessary. 

I figured it was something like that, but didn't see it.  Makes sense.
Blocks: 287340
Comment on attachment 179704
cattable.h and gencattable.pl patch

r=smontagu with the change I mentioned earlier (no need for a new patch)
Attachment #179704 - Flags: review?(smontagu) → review+
Attachment #179704 - Flags: superreview?(dbaron) → superreview+
cattable patch got landed on the trunk.
Case folding and transliteration tables were updated by smontagu in separate
bugs, so I think we can resolve this as fixed.
Btw, we may want to bring firefox 1.0.x and mozilla 1.7.x up to date (in terms of
Unicode support).
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8b2) Gecko/20050406
Firefox/1.0+

Is there any chance this patch is causing characters like ë ö etc to be
displayed as a &#65533; (? in a dark diamond) again ?
I'm seeing this again in todays build.
It may be OK to update categories to a newer version of Unicode, but we need to
be careful about updating the normalization tables. The IDN-related RFCs
specifically call for Unicode 3.2.0 and there were some changes in 4.0 in the
normalization:

http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt

As far as I can tell, JPNIC's idnkit does not take these corrections into
account. Also, Unicode fixed an error in the normalization spec. The Public
Review document says that idnkit needs to be fixed:

http://www.unicode.org/review/pr-29.html

However, the IETF has not decided what to do about PR-29 in Stringprep.
Also, Unicode has a normalization test file:

http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt

Has anyone run this test for Mozilla?
(In reply to comment #26)

> be careful about updating the normalization tables. The IDN-related RFCs
> specifically call for Unicode 3.2.0 and there were some changes in 4.0 in the
> normalization:
> 
> http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt
> 
> As far as I can tell, JPNIC's idnkit does not take these corrections into
> account. 

You meant they should be kept 'misnormalized' because the IDN-related RFCs are
based on Unicode 3.2.0, didn't you? All of them (with the possible exception of
the first one in the BMP) are rarely used, but ....

> Also, Unicode fixed an error in the normalization spec. The Public
> Review document says that idnkit needs to be fixed:
> 
> http://www.unicode.org/review/pr-29.html
> 
> However, the IETF has not decided what to do about PR-29 in Stringprep.

 Hmm.. what do you think we need to do here? Go ahead proactively or just wait
for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ :
Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's
another revision -hopefully requiring no change on our part - in the draft
stage) Again, the change appears to have little practical implication (except
for potential security issues)

> Also, Unicode has a normalization test file:
> 
> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
> 
> Has anyone run this test for Mozilla?

Not me. Perhaps, we have to write a test program based on the file unless it's
already been written. 
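
A test program along those lines might look like the sketch below. It is written in Python and uses Python's own `unicodedata.normalize` as a stand-in for the normalizer under test; a Mozilla version would call the Mozilla normalizer instead. The function names are hypothetical. The six invariants are the ones described in the header of NormalizationTest.txt for its five columns (c1..c5).

```python
import unicodedata

def parse_test_line(line):
    """Parse one data line of NormalizationTest.txt into five strings c1..c5.

    Returns None for comments, blank lines and '@Part' headers.
    """
    line = line.split("#", 1)[0].strip()
    if not line or line.startswith("@"):
        return None
    fields = line.split(";")[:5]
    return [
        "".join(chr(int(cp, 16)) for cp in field.split())
        for field in fields
    ]

def failed_invariants(columns, normalize=unicodedata.normalize):
    """Return the forms whose conformance invariant `normalize` breaks."""
    c1, c2, c3, c4, c5 = columns
    checks = [
        ("NFC", c2, (c1, c2, c3)),
        ("NFC", c4, (c4, c5)),
        ("NFD", c3, (c1, c2, c3)),
        ("NFD", c5, (c4, c5)),
        ("NFKC", c4, (c1, c2, c3, c4, c5)),
        ("NFKD", c5, (c1, c2, c3, c4, c5)),
    ]
    failures = []
    for form, expected, inputs in checks:
        if any(normalize(form, text) != expected for text in inputs):
            failures.append(form)
    return failures
```

A driver would read the file line by line, skip the `None` results, and report every line for which `failed_invariants` returns a non-empty list.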
> You meant they should be kept 'misnormalized' because the IDN-related RFCs are
> based on Unicode 3.2.0, didn't you?

Right. The IETF takes stability very seriously, and we should be careful too.

> All of them (with the possible exception of
> the first one in the BMP) are rarely used, but ....

They may be rare, but that is not the point. The point is that we seem to be
making changes blindly here. I don't see any evidence that this issue was
considered.

>  Hmm.. what do you think we need to do here? Go ahead proactively or just wait
> for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ :
> Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's
> another revision -hopefully requiring no change on our part - in the draft
> stage)

What revision are you talking about? Please give me a pointer. As far as PR-29
is concerned, we should probably wait for the IETF response. I don't know what
we should do about the Unicode 4.X changes that were already checked in. The
IETF has not decided what to do about NormalizationCorrections.txt either.
We could just leave the Mozilla source as it is now -- and correct it later
when the IETF decides both of these issues. This leaves Mozilla users exposed
somewhat, maybe not much.

Looking at the source (intl/unicharutil/src/nsUnicodeNormalizer.cpp), it seems
that the PR-29 issue has not been fixed yet:

  if ((last_class < cl || cl == 0) &&
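
To make the PR-29 point concrete, here is a toy model of the blocking rule in Python. It is not the Mozilla or idnkit code: the table is hard-coded with an Oriya sequence of the kind PR-29 discusses, and `is_blocked` and `compose_run` are hypothetical names. Under the original wording, an intervening character blocks composition only if it is a starter or has the same combining class as the candidate; under the corrected wording, it blocks when its class is greater than or equal to the candidate's.

```python
# Toy data, hard-coded for illustration only.
CCC = {0x0B47: 0, 0x0300: 230, 0x0B3E: 0, 0x0B4B: 0}   # combining classes
COMPOSE = {(0x0B47, 0x0B3E): 0x0B4B}                    # primary composite

def is_blocked(between, cl, corrected):
    """Is a character of class `cl` blocked by the uncombined chars `between`?"""
    for b in between:
        b_cl = CCC[b]
        if b_cl == 0:
            return True               # an intervening starter always blocks
        if corrected and b_cl >= cl:
            return True               # corrected rule: greater or equal class
        if not corrected and b_cl == cl:
            return True               # original wording: equal class only
    return False

def compose_run(cps, corrected):
    """Compose a canonically ordered run whose first element is a starter."""
    starter, kept = cps[0], []
    for c in cps[1:]:
        pair = (starter, c)
        if pair in COMPOSE and not is_blocked(kept, CCC[c], corrected):
            starter = COMPOSE[pair]   # c is absorbed into the starter
        else:
            kept.append(c)            # c stays in the output as-is
    return [starter] + kept
```

With the original wording, `compose_run([0x0B47, 0x0300, 0x0B3E], corrected=False)` gives `[0x0B4B, 0x0300]`; with the corrected rule the sequence comes back unchanged. In the quoted condition, the `cl == 0` alternative appears to play the role of the uncorrected rule, since it lets a starter combine no matter which mark precedes it.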
By the way, I've been talking about Stringprep and Nameprep. If there are other
consumers of the Unicode normalization service in Mozilla, they may have other
considerations. I haven't checked for other consumers.
(In reply to comment #27)

> > http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
> > 
> > Has anyone run this test for Mozilla?
> 
> Not me. Perhaps, we have to write a test program based on the file unless it's
> already been written. 

There is a very short NFD test in intl/unicharutil/tests/UnicharSelfTest.cpp. We
could try importing the Unicode test file into there to do a more comprehensive
test.
(In reply to comment #28)
>
> > All of them (with the possible exception of
> > the first one in the BMP) are rarely used, but ....
> 
> They may be rare, but that is not the point. The point is that we seem to be
> making changes blindly here. I

That's what 'but ...' was for :-)

> >  Hmm.. what do you think we need to do here? Go ahead proactively or just wait
> > for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ :
> > Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's
> > another revision -hopefully requiring no change on our part - in the draft
> > stage)
> 
> What revision are you talking about? Please give me a pointer. 

I already sorta gave one :-). The 'revision history' in the latest version of
UAX #15 explicitly mentions that PR #29 was taken into account (see
http://www.unicode.org/unicode/reports/tr15/#Modifications ). You can also find
the same information at
http://www.unicode.org/review/resolved-pri.html
which also lists PRI #61. That one was approved as well, but its content hasn't
yet been published in UAX #15
( http://www.unicode.org/reports/tr15/tr15-24.html ).


Simon, the normalization test file is 2 MB. I wouldn't check it into the tree.
Maybe you didn't mean to do that. (But it would be nice to check in a program
that performs the tests.)

Jungshik, tr15-24.html is the previous version (a draft). The newest version is
tr15-25.html, and PR-29 is reflected in that one. Look for the numbers 24 and
25 near the top. In 24, it's called "Tracking Number", in 25 "Revision".

There is at least one implementor who believes that this is an incompatible
change. Even though Unicode has decided to correct tr15, the IETF has not
decided how to address that change in Stringprep and its profiles (including
Nameprep). See the thread "stringprep: PRI #29" of March 19, 2005 in the IDN
mailing list archive:

http://ops.ietf.org/lists/idn/idn.2005/maillist.html

Gerv has indicated that an IAB Working Group is looking into Stringprep and
Nameprep revisions:

http://weblogs.mozillazine.org/gerv/archives/007785.html

At least one of the Stringprep profile authors appears to be aware of PR-29.
See section 4 of:

http://ietf.org/rfc/rfc4013.txt

Stringprep itself specifically refers to tr15-22.html:

http://ietf.org/rfc/rfc3454.txt
Jungshik, it might be a good idea to file one or more new bugs to address
PR-29, NormalizationCorrections.txt and the fact that Stringprep does not
allow characters that were unassigned in Unicode 3.2.0 in "stored strings"
(though they are allowed in "queries"). See section 7 of Stringprep.

I suspect that Mozilla code would mostly use the IDN routines in "queries",
but if any server, client or other program based on Mozilla code ever uses
"stored strings", it would be illegal to store characters outside 3.2.0.

If we decide to comply to this extent, we may need to support both 3.2.0
and 4.1 in mozilla/intl.
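
The stored-string versus query distinction could be modelled as below. This is a Python sketch, not Mozilla code: it relies on the standard library's `stringprep` module, whose `in_table_a1` function implements Table A.1 of RFC 3454 (code points unassigned in Unicode 3.2). The function name `check_unassigned` and the `stored` flag are made up for illustration.

```python
import stringprep

def check_unassigned(text, stored):
    """Raise ValueError if a stored string holds a code point unassigned in 3.2.

    Queries (stored=False) are allowed to carry such code points.
    """
    if stored:
        for ch in text:
            if stringprep.in_table_a1(ch):
                raise ValueError("U+%04X is unassigned in Unicode 3.2" % ord(ch))
    return text
```

For instance, U+0221 was added in Unicode 4.0, so it passes as part of a query but is rejected in a stored string.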

See also the part about normalization in section 10 of IDNA:

http://ietf.org/rfc/rfc3490.txt

If you like, I can file the bugs.
See bug 326207. Jungshik, would you like to request 1.8 branch approval for this, or shall we just wait for Unicode 5, due for release at the end of next month?
(In reply to comment #34)
> See bug 326207. Jungshik, would you like to request 1.8 branch approval for
> this, or shall we just wait for Unicode 5, due for release at the end of next
> month?

I'm not sure. Either way is fine with me. Even though we have to bother drivers twice, I'm slightly more inclined to get approval now. 

BTW, we have to revert to Unicode 3.2 for normalization as pointed out by Erik, don't we? So, only attachment 179704 needs to go in. For the trunk, I'll file a new bug to revert to Unicode 3.2.0 and change 'README.txt' accordingly.
And, there is an issue with UTR PR#29. 


I recommend trying to get this in to the 1.8 branch.