210502 - Update generated files in unicharutils to Unicode 4.0.1 database

Assignee

Description

•

21 years ago

Now that Unicode 4.0 has been released, we need to update intl/unicharutils to
Unicode 4.0. That includes normalization, character properties, categories, and
so forth.

Roland Mainz

Updated

•

21 years ago

Summary: Update unicharutils to Unicode 4.0 → Update unicharutils to Unicode 4.0

Tex

Comment 1

•

21 years ago

This link can be used for testing Unicode 4 casing when you are ready.

http://www.w3.org/International/tests/test-text-transform.html

Jungshik Shin

Assignee

Comment 2

•

21 years ago

What we need to do is rather simple (at least on the surface):

1. grab idnkit at http://www.nic.ad.jp/en/idn/index.html
   (http://www.nic.ad.jp/ja/idn/idnkit/download/sources/idnkit-1.0-src.tar.gz)
2. there are three files we need in the kit:
   generate_normalize_data.pl, UCD.pm and SparseMap.pm
3. download the following Unicode data files:
   CaseFolding.txt, CompositionExclusions.txt, SpecialCasing.txt, UnicodeData.txt
4. run generate_normalize_data.pl, edit the result (remove the case folding part
   because we have separate scripts for that, and replace 'unsigned short' and
   'unsigned long' with 'PRUnichar' and 'PRUint32'), and save it to 'unicodedata.h'
   (the current name is 'unicodedata_320.c', but I'm not sure it's wise to have the
   Unicode version number in the file name).

5. generate casetable.h and cattable.h with  gencasetable.pl and gencattable.pl
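
The type substitution in step 4 could be scripted instead of done by hand. What follows is only a minimal Python sketch, not the script used for this bug: the helper name `retype_generated_source` is hypothetical, and it assumes the generated file spells the types exactly as 'unsigned short' and 'unsigned long'.

```python
import re

# C types emitted by the generator, mapped to the NSPR-style names from step 4.
TYPE_MAP = {
    "unsigned short": "PRUnichar",
    "unsigned long": "PRUint32",
}

# Match either type as a whole phrase, tolerating runs of whitespace.
_TYPE_RE = re.compile(r"\bunsigned\s+(short|long)\b")


def retype_generated_source(text):
    """Return `text` with the generator's C types replaced (hypothetical helper)."""
    return _TYPE_RE.sub(lambda m: TYPE_MAP["unsigned " + m.group(1)], text)
```

Other types such as 'unsigned char' are left untouched, so the result still needs a manual look before it is saved.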

It'd have been nice if we had made this ready for 1.6, but I'm afraid it's too
late.

Assignee: smontagu → jshin

Jungshik Shin

Assignee

Comment 3

•

21 years ago

Attached patch: preliminary patch (obsolete)

In addition to updating the data to Unicode 4.0.x, I had to make some changes to
support non-BMP characters. Still, more changes are necessary to support
non-BMP characters in case conversion. So, even with this patch, the Deseret
alphabet is not properly supported when it comes to case conversion.
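
To illustrate what supporting non-BMP characters involves: code that walks 16-bit units has to combine surrogate pairs into a single code point before any table lookup. The Python sketch below shows the arithmetic only. The function name `decode_utf16` is hypothetical, and passing unpaired surrogates through unchanged is an assumption of the sketch, not a statement about what the patch does.

```python
def decode_utf16(units):
    """Yield Unicode code points from a sequence of 16-bit code units.

    Unpaired surrogates are passed through unchanged (an assumption of
    this sketch).
    """
    units = list(units)
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine the two 10-bit halves and add the non-BMP offset.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1
```

For example, the unit pair 0xD801 0xDC00 decodes to U+10400, the first Deseret capital letter.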

Jungshik Shin

Assignee

Comment 4

•

21 years ago

I forgot that case conversion had been filed as a separate bug (bug 210501).

Status: NEW → ASSIGNED

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 5

•

20 years ago

It would be nice to have some documentation in the tree somewhere that explains
how to upgrade all the data we have that's generated from the Unicode database.

Also, the patch in bug 238844 might have some useful changes to the generation
scripts to avoid changes made manually to the files (license headers, alecf's
comment removal).

Also, retitling bug with additional search terms so it's possible to find it
(since I did about 10 searches before filing bug 238844).

Summary: Update unicharutils to Unicode 4.0 → Update generated files in unicharutils to Unicode 4.0 database

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 6

•

20 years ago

*** Bug 238844 has been marked as a duplicate of this bug. ***

Jungshik Shin

Assignee

Comment 7

•

20 years ago

(In reply to comment #5)
> It would be nice to have some documentation in the tree somewhere that explains

That's on my TODO list. I had to reverse-engineer the process to figure out how
to update the files here.
 
> Also, the patch in bug 238844 might have some useful changes to the generation

 I may (or may not) have made a similar change in my tree. Anyway, thanks for
the patch.

> it's possible to find it (since I did about 10 searches before filing bug 238844).

I wonder what your 10 queries were.... With the component
'internationalization', the summary containing 'Unicode' and the owner set to me
or smontagu would have  brought you right here in a couple of queries ;-)

I have to run now. I'll try to fix it early next week.

Simon Montagu :smontagu

Comment 8

•

20 years ago

Updating summary since Unicode 4.0.1 has been released in the meanwhile.
http://www.unicode.org/versions/Unicode4.0.1/
http://www.unicode.org/Public/4.0-Update1/

Summary: Update generated files in unicharutils to Unicode 4.0 database → Update generated files in unicharutils to Unicode 4.0.1 database

Jungshik Shin

Assignee

Comment 9

•

20 years ago

Attached patch: patch v2 (dbaron's patch incorporated and license change)

dbaron's patch was combined with attachment 137911. I also changed the license
to MPL.

Attachment #137911 - Attachment is obsolete: true

timeless

Comment 10

•

20 years ago

Comment on attachment 150203
patch v2 (dbaron's patch incorporated and license change)

>Index: intl/unicharutil/tools/gentransliterate.pl
>+$header = <<END_OF_HEADER;
> ##
> ## The contents of this file are subject to the Netscape Public

Do you want this to spit out MPL-tri?

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 11

•

20 years ago

What's the status of this patch?

Jungshik Shin

Assignee

Comment 12

•

20 years ago

I'll make a new patch once Unicode 4.1 is released in February. In the meantime,
I converted nsIUGenCategory.h to nsIUGenCategory.idl while changing its methods,
'Get' and 'Is', to accept PRUint32 (UTF-32) instead of UCS-2. Nobody has ever used
them, so we don't have to worry about callers or compatibility (not that
it's hard to fix callers). Btw, I also want to change their names. Any
suggestions? How about 'getCategory' and 'isCategory'?


Simon, I vaguely remember you made (at least suggested) a similar change in
nsICaseConversion, but the change is not in the tree. Will you take care of it
in bug 210501?

Jungshik Shin

Assignee

Comment 13

•

19 years ago

Attached patch: patch for general category only

With this patch, each Unicode plane has its own category pattern table because
otherwise the number of different category patterns (categories for 8
characters are stored in each pattern) exceeds 256. If I used a single pattern
table, PRUint16 instead of PRUint8 would have to be used for the lookup keys, which
would increase the code size by about 3k.
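
The trade-off described above can be sketched as follows. This is an illustration in Python and not the code generated by gencattable.pl. It assumes 4 bits per category code (so that 8 characters fit in one 32-bit pattern), and the names `build_plane_table` and `lookup` are hypothetical. The point is that the keys stay one byte wide only while a table holds at most 256 distinct patterns.

```python
def build_plane_table(categories):
    """Pack one plane's category codes into (keys, patterns).

    `categories` holds one small integer (0..15) per code point; its length
    must be a multiple of 8.
    """
    if len(categories) % 8:
        raise ValueError("length must be a multiple of 8")
    patterns = []   # distinct 32-bit patterns, in first-seen order
    index_of = {}   # pattern -> position in `patterns`
    keys = []       # one key per block of 8 code points
    for start in range(0, len(categories), 8):
        pattern = 0
        for offset, cat in enumerate(categories[start:start + 8]):
            if not 0 <= cat <= 0xF:
                raise ValueError("category code does not fit in 4 bits")
            pattern |= cat << (4 * offset)
        if pattern not in index_of:
            index_of[pattern] = len(patterns)
            patterns.append(pattern)
        keys.append(index_of[pattern])
    if len(patterns) > 256:
        raise OverflowError("more than 256 patterns: keys need 16 bits")
    return bytes(keys), patterns


def lookup(keys, patterns, offset_in_plane):
    """Return the category code of the code point at `offset_in_plane`."""
    pattern = patterns[keys[offset_in_plane >> 3]]
    return (pattern >> (4 * (offset_in_plane & 7))) & 0xF
```

Building one such table per plane keeps each pattern list short enough for byte-sized keys, at the cost of selecting the table by plane before the lookup.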

I updated nsIUnicodeGeneralCategory (renamed from nsIUGenCategory) to deal with
the full Unicode repertoire and made it scriptable. Its implementation was also
updated. 

The category table was generated from UCD 4.0.1, but will be updated to 4.1.0
once it's released.

Simon Montagu :smontagu

Comment 14

•

19 years ago

See also bug 288137. I'd like to get all these bugs fixed by 1.8b2, preferably
with Unicode 4.1 (which should be released this month), but at least with 4.0.1
in the event that 4.1 is delayed.

Jungshik Shin

Assignee

Comment 15

•

19 years ago

Attached patch: normalization part patch + README.txt

This patch only deals with the Unicode normalization. It also has a new file
README.txt explaining how to generate various properties and header files.
Btw, diff was made against 'unicodedata_320.c', but I'm gonna cvs remove it and
add a new file 'normalization_data.h' in its place. 

This should be safe enough for branch landing because Unicode normalization
routines don't need any change to deal with non-BMP characters (they're already
full-unicode-proof), which means only the data file was updated to Unicode
4.1.0.

Attachment #179694 - Flags: superreview?(dbaron)

Attachment #179694 - Flags: review?(smontagu)

Simon Montagu :smontagu

Comment 16

•

19 years ago

Comment on attachment 179694
normalization part patch + README.txt

>+   The latest version is, as of this writing is in 

Nit: one "is" is superfluous

Attachment #179694 - Flags: review?(smontagu) → review+

Jungshik Shin

Assignee

Comment 17

•

19 years ago

Attached patch: cattable.h and gencattable.pl patch

This fixes cattable.h and gencattable.pl. I haven't changed the interface
definition and its implementation that uses cattable.h because it's not used
anywhere. 

cattable.h in spellchecker is not updated. My plan is either to export
cattable.h in intl and cvs-remove the copy in spellchecker so that spellchecker
can refer to the copy in intl or to make spellchecker use nsIUGenCategory.
That'll be done in bug 287340.

Attachment #179704 - Flags: superreview?(dbaron)

Attachment #179704 - Flags: review?(smontagu)

Simon Montagu :smontagu

Comment 18

•

19 years ago

Comment on attachment 179704
cattable.h and gencattable.pl patch

>-printf OUT "static PRUint8 GetCat(PRUnichar u)\n{\n";
>+printf OUT "static PRUint16 GetCat(PRUint32 u)\n{\n";

Why does this now return a PRUint16? The return value is still in the same
range, no?

Jungshik Shin

Assignee

Comment 19

•

19 years ago

(In reply to comment #18)

> Why does this now return a PRUint16? The return value is still in the same
> range, no?
 
Oops. You're absolutely right. Do you want me to upload a new patch with that
fixed?

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 20

•

19 years ago

Comment on attachment 179694
normalization part patch + README.txt

>Index: intl/unicharutil/src/nsUnicodeNormalizer.cpp
>-#include "unicodedata_320.c"
>+#include "normalization_data.h"

I don't follow what's going on here, but sr=dbaron.

Attachment #179694 - Flags: superreview?(dbaron) → superreview+

Jungshik Shin

Assignee

Comment 21

•

19 years ago

Thanks for r/sr.
Attachment 179694 was landed on the trunk.

(In reply to comment #20)
> (From update of attachment 179694)
> >Index: intl/unicharutil/src/nsUnicodeNormalizer.cpp
> >-#include "unicodedata_320.c"
> >+#include "normalization_data.h"
> 
> I don't follow what's going on here, but sr=dbaron.

see comment #15. I removed unicodedata_320.c and added 'normalization_data.h',
but made a diff against 'unicodedata_320.c' to avoid making the patch a lot
longer than necessary.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 22

•

19 years ago

(In reply to comment #21)
> > I don't follow what's going on here, but sr=dbaron.
> 
> see comment #15. I removed unicodedata_320.c and added 'normalization_data.h',
> but made a diff against 'unicodedata_320.c' to avoid making the patch a lot
> longer than necessary. 

I figured it was something like that, but didn't see it.  Makes sense.

Jungshik Shin

Assignee

Updated

•

19 years ago

Blocks: 287340

Simon Montagu :smontagu

Comment 23

•

19 years ago

Comment on attachment 179704
cattable.h and gencattable.pl patch

r=smontagu with the change I mentioned earlier (no need for a new patch)

Attachment #179704 - Flags: review?(smontagu) → review+

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Updated

•

19 years ago

Attachment #179704 - Flags: superreview?(dbaron) → superreview+

Jungshik Shin

Assignee

Comment 24

•

19 years ago

The cattable patch landed on the trunk.
The case folding and transliteration tables were updated by smontagu in separate
bugs, so I think we can resolve this as fixed.
Btw, we may want to bring Firefox 1.0.x and Mozilla 1.7.x up to date (in terms of
Unicode support).

Status: ASSIGNED → RESOLVED

Closed: 19 years ago

Resolution: --- → FIXED

Peter van der Woude [:Peter6]

Comment 25

•

19 years ago

Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8b2) Gecko/20050406
Firefox/1.0+

Is there any chance this patch is causing characters like ë ö etc to be
displayed as a &#65533; (? in a dark diamond) again ?
I'm seeing this again in todays build.

Erik van der Poel

Comment 26

•

19 years ago

It may be OK to update categories to a newer version of Unicode, but we need to
be careful about updating the normalization tables. The IDN-related RFCs
specifically call for Unicode 3.2.0 and there were some changes in 4.0 in the
normalization:

http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt

As far as I can tell, JPNIC's idnkit does not take these corrections into
account. Also, Unicode fixed an error in the normalization spec. The Public
Review document says that idnkit needs to be fixed:

http://www.unicode.org/review/pr-29.html

However, the IETF has not decided what to do about PR-29 in Stringprep.
Also, Unicode has a normalization test file:

http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt

Has anyone run this test for Mozilla?

Jungshik Shin

Assignee

Comment 27

•

19 years ago

(In reply to comment #26)

> be careful about updating the normalization tables. The IDN-related RFCs
> specifically call for Unicode 3.2.0 and there were some changes in 4.0 in the
> normalization:
> 
> http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt
> 
> As far as I can tell, JPNIC's idnkit does not take these corrections into
> account. 

You meant they should be kept 'misnormalized' because the IDN-related RFCs are
based on Unicode 3.2.0, didn't you? All of them (with the possible exception of
the first one in the BMP) are rarely used, but ....

> Also, Unicode fixed an error in the normalization spec. The Public
> Review document says that idnkit needs to be fixed:
> 
> http://www.unicode.org/review/pr-29.html
> 
> However, the IETF has not decided what to do about PR-29 in Stringprep.

 Hmm.. what do you think we need to do here? Go ahead proactively or just wait
for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ :
Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's
another revision -hopefully requiring no change on our part - in the draft
stage) Again, the change appears to have little practical implication (except
for potential security issues)

> Also, Unicode has a normalization test file:
> 
> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
> 
> Has anyone run this test for Mozilla?

Not me. Perhaps, we have to write a test program based on the file unless it's
already been written.

Erik van der Poel

Comment 28

•

19 years ago

> You meant they should be kept 'misnormalized' because the IDN-related RFCs are
> based on Unicode 3.2.0, didn't you?

Right. The IETF takes stability very seriously, and we should be careful too.

> All of them (with the possible exception of
> the first one in the BMP) are rarely used, but ....

They may be rare, but that is not the point. The point is that we seem to be
making changes blindly here. I don't see any evidence that this issue was
considered.

>  Hmm.. what do you think we need to do here? Go ahead proactively or just wait
> for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ :
> Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's
> another revision -hopefully requiring no change on our part - in the draft
> stage)

What revision are you talking about? Please give me a pointer. As far as PR-29
is concerned, we should probably wait for the IETF response. I don't know what
we should do about the Unicode 4.X changes that were already checked in. The
IETF has not decided what to do about NormalizationCorrections.txt either.
We could just leave the Mozilla source as it is now -- and correct it later
when the IETF decides both of these issues. This leaves Mozilla users exposed
somewhat, maybe not much.

Looking at the source (intl/unicharutil/src/nsUnicodeNormalizer.cpp), it seems
that the PR-29 issue has not been fixed yet:

  if ((last_class < cl || cl == 0) &&
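
For readers unfamiliar with PR-29, the difference between the two wordings of the "blocked" rule can be shown in a few lines of Python. This is only a sketch of the two definitions as stated in the Public Review document, not a port of nsUnicodeNormalizer.cpp; the function names are hypothetical. The `cl == 0` arm of the condition quoted above is presumably what lets a class-0 character combine across an intervening mark.

```python
import unicodedata


def blocked_old(between_classes, ccc_c):
    """Old wording: blocked by a starter or by an equal combining class."""
    return any(b == 0 or b == ccc_c for b in between_classes)


def blocked_corrected(between_classes, ccc_c):
    """Corrected wording: blocked by a starter or an equal or higher class."""
    return any(b == 0 or b >= ccc_c for b in between_classes)


def classes_between(text):
    """Combining classes of the characters strictly between first and last."""
    return [unicodedata.combining(ch) for ch in text[1:-1]]
```

Taking the sequence U+0B47 U+0300 U+0B3E as an example, the last character has combining class 0 and the one in between has class 230. The old wording does not block the last character from the starter, while the corrected wording does.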

Erik van der Poel

Comment 29

•

19 years ago

By the way, I've been talking about Stringprep and Nameprep. If there are other
consumers of the Unicode normalization service in Mozilla, they may have other
considerations. I haven't checked for other consumers.

Simon Montagu :smontagu

Comment 30

•

19 years ago

(In reply to comment #27)

> > http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
> > 
> > Has anyone run this test for Mozilla?
> 
> Not me. Perhaps, we have to write a test program based on the file unless it's
> already been written. 

There is a very short NFD test in intl/unicharutil/tests/UnicharSelfTest.cpp. We
could try importing the Unicode test file into there to do a more comprehensive
test.
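
A test driver for that file does not need much code. The Python sketch below parses one data line and checks the NFC and NFD invariants listed in the header of NormalizationTest.txt. It uses Python's own `unicodedata.normalize` as a stand-in for the normalizer under test, so it says nothing about Mozilla's results; the function names are hypothetical, and a real test would call into nsUnicodeNormalizer instead.

```python
import unicodedata


def parse_test_line(line):
    """Parse one data line into five strings (columns c1..c5).

    Returns None for comments, blank lines and '@Part' headers.
    """
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    fields = data.split(";")[:5]
    if len(fields) != 5:
        raise ValueError("expected five columns: %r" % line)
    return tuple(
        "".join(chr(int(cp, 16)) for cp in field.split()) for field in fields
    )


def check_nfc_nfd(columns, normalize=unicodedata.normalize):
    """Return True if the NFC and NFD invariants hold for one test line."""
    c1, c2, c3, c4, c5 = columns

    def nfc(s):
        return normalize("NFC", s)

    def nfd(s):
        return normalize("NFD", s)

    return (
        c2 == nfc(c1) == nfc(c2) == nfc(c3)
        and c4 == nfc(c4) == nfc(c5)
        and c3 == nfd(c1) == nfd(c2) == nfd(c3)
        and c5 == nfd(c4) == nfd(c5)
    )
```

The `normalize` parameter is there so that a different implementation can be plugged in. The NFKC and NFKD invariants from the file header would be checked the same way.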

Jungshik Shin

Assignee

Comment 31

•

19 years ago

(In reply to comment #28)
>
> > All of them (with the possible exception of
> > the first one in the BMP) are rarely used, but ....
> 
> They may be rare, but that is not the point. The point is that we seem to be
> making changes blindly here. I

That's what 'but ...' was for :-)

> >  Hmm.. what do you think we need to do here? Go ahead proactively or just wait
> > for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ :
> > Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's
> > another revision -hopefully requiring no change on our part - in the draft
> > stage)
> 
> What revision are you talking about? Please give me a pointer. 

I already sorta gave one :-). The 'revision history' in the latest version of
UAX #15 explicitly mentions that PR #29 was taken into account (see
http://www.unicode.org/unicode/reports/tr15/#Modifications ). You can also find
the same information at
http://www.unicode.org/review/resolved-pri.html
where you can also find PRI #61, which was also approved but whose content hasn't
yet been published in UAX #15
( http://www.unicode.org/reports/tr15/tr15-24.html ).

Erik van der Poel

Comment 32

•

19 years ago

Simon, the normalization test file is 2 MB. I wouldn't check it into the tree.
Maybe you didn't mean to do that. (But it would be nice to check in a program
that performs the tests.)

Jungshik, tr15-24.html is the previous version (a draft). The newest version is
tr15-25.html, and PR-29 is reflected in that one. Look for the numbers 24 and
25 near the top. In 24, it's called "Tracking Number", in 25 "Revision".

There is at least one implementor who believes that this is an incompatible
change. Even though Unicode has decided to correct tr15, the IETF has not
decided how to address that change in Stringprep and its profiles (including
Nameprep). See the thread "stringprep: PRI #29" of March 19, 2005 in the IDN
mailing list archive:

http://ops.ietf.org/lists/idn/idn.2005/maillist.html

Gerv has indicated that an IAB Working Group is looking into Stringprep and
Nameprep revisions:

http://weblogs.mozillazine.org/gerv/archives/007785.html

At least one of the Stringprep profile authors appears to be aware of PR-29.
See section 4 of:

http://ietf.org/rfc/rfc4013.txt

Stringprep itself specifically refers to tr15-22.html:

http://ietf.org/rfc/rfc3454.txt

Erik van der Poel

Comment 33

•

19 years ago

Jungshik, it might be a good idea to file one or more new bugs to address
PR-29, NormalizationCorrections.txt and the fact that Stringprep does not
allow characters that were unassigned in Unicode 3.2.0 in "stored strings"
(though they are allowed in "queries"). See section 7 of Stringprep.

I suspect that Mozilla code would mostly use the IDN routines in "queries",
but if any server, client or other program based on Mozilla code ever uses
"stored strings", it would be illegal to store characters outside 3.2.0.

If we decide to comply to this extent, we may need to support both 3.2.0
and 4.1 in mozilla/intl.
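
As an illustration of the "stored strings" restriction, Python's standard `stringprep` module exposes RFC 3454 Table A.1 (code points unassigned in Unicode 3.2) as `in_table_a1`. The sketch below uses it as a stand-in; the two helper names are hypothetical and nothing here reflects existing Mozilla code.

```python
import stringprep


def first_unassigned_in_3_2(text):
    """Return the first character of `text` listed in RFC 3454 Table A.1
    (unassigned in Unicode 3.2), or None if there is none."""
    for ch in text:
        if stringprep.in_table_a1(ch):
            return ch
    return None


def allowed_as_stored_string(text):
    """True if `text` contains no code point unassigned in Unicode 3.2."""
    return first_unassigned_in_3_2(text) is None
```

A character such as U+0221, which was assigned only in Unicode 4.0, would be rejected for a stored string but tolerated in a query.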

See also the part about normalization in section 10 of IDNA:

http://ietf.org/rfc/rfc3490.txt

If you like, I can file the bugs.

Simon Montagu :smontagu

Comment 34

•

19 years ago

See bug 326207. Jungshik, would you like to request 1.8 branch approval for this, or shall we just wait for Unicode 5, due for release at the end of next month?

Jungshik Shin

Assignee

Comment 35

•

18 years ago

(In reply to comment #34)
> See bug 326207. Jungshik, would you like to request 1.8 branch approval for
> this, or shall we just wait for Unicode 5, due for release at the end of next
> month?

I'm not sure. Either way is fine with me. Even though we have to bother drivers twice, I'm slightly more inclined to get approval now. 

BTW, we have to revert to Unicode 3.2 for normalization as pointed out by Erik, don't we? So, only attachment 179704 needs to go in. For the trunk, I'll file a new bug to revert to Unicode 3.2.0 and change 'README.txt' accordingly.
And, there is an issue with UTR PR#29.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 36

•

18 years ago

I recommend trying to get this in to the 1.8 branch.

preliminary patch, 21 years ago, Jungshik Shin, 440.57 KB, patch
patch v2 (dbaron's patch incorporated and license change), 20 years ago, Jungshik Shin, 213.78 KB, patch
patch for general category only, 19 years ago, Jungshik Shin, 191.58 KB, patch
normalization part patch + README.txt, 19 years ago, Jungshik Shin, 246.50 KB, patch (smontagu: review+, dbaron: superreview+)
cattable.h and gencattable.pl patch, 19 years ago, Jungshik Shin, 185.82 KB, patch (smontagu: review+, dbaron: superreview+)