210502 - Update generated files in unicharutils to Unicode 4.0.1 database

Assignee

Description

22 years ago

Now that Unicode 4.0 has been released, we need to update intl/unicharutils to Unicode 4.0. That includes normalization, character properties, categories, and so forth.

Roland Mainz

Updated

22 years ago

Summary: Update unicharutils to Unicode 4.0 → Update unicharutils to Unicode 4.0

Tex

Comment 1

22 years ago

This link can be used for testing Unicode 4 casing when you are ready. http://www.w3.org/International/tests/test-text-transform.html

Jungshik Shin

Assignee

Comment 2

21 years ago

What we need to do is rather simple (at least on the surface):

1. Grab idnkit at http://www.nic.ad.jp/en/idn/index.html (http://www.nic.ad.jp/ja/idn/idnkit/download/sources/idnkit-1.0-src.tar.gz).
2. There are three files we need in the kit: generate_normalize_data.pl, UCD.pm and SparseMap.pm.
3. Download the following Unicode data files: CaseFolding.txt, CompositionExclusions.txt, SpecialCasing.txt, UnicodeData.txt.
4. Run generate_normalize_data.pl, edit the result (remove the case folding part because we have separate scripts for that, and replace 'unsigned short' and 'unsigned long' with 'PRUnichar' and 'PRUint32') and save it to 'unicodedata.h' (the current name is 'unicodedata_320.c', but I'm not sure it's wise to have the Unicode version in the file name).
5. Generate casetable.h and cattable.h with gencasetable.pl and gencattable.pl.

It'd have been nice if we had made this ready for 1.6, but I'm afraid it's too late.
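The type substitution described in step 4 could be scripted instead of edited by hand. The following is only a sketch in Python, not part of idnkit or the Mozilla tree: the function name retype and the regular expression are assumptions made for illustration, and only the four type names come from the comment above.

```python
import re

# Type names are taken from step 4; everything else here is illustrative.
TYPE_MAP = {
    "unsigned short": "PRUnichar",
    "unsigned long": "PRUint32",
}

def retype(c_source: str) -> str:
    """Replace the C integer type names emitted by the generator with the Mozilla ones."""
    for old, new in TYPE_MAP.items():
        # Tolerate any run of whitespace between the two words of the type name.
        pattern = r"\b" + r"\s+".join(map(re.escape, old.split())) + r"\b"
        c_source = re.sub(pattern, new, c_source)
    return c_source

sample = (
    "static const unsigned short table[] = {0};\n"
    "static const unsigned long seq[] = {0};\n"
)
print(retype(sample))
```

Removing the case folding part is left out on purpose, since the comment does not say how that part is delimited in the generated file.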

Assignee: smontagu → jshin

Jungshik Shin

Assignee

Comment 3

21 years ago

Attached patch: preliminary patch (obsolete)

In addition to updating the data to Unicode 4.0.x, I had to make some changes to support non-BMP characters. Still, more changes are necessary to support non-BMP characters in case conversion. So, even with this patch, the Deseret alphabet is not properly supported when it comes to case conversion.
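To illustrate why non-BMP case conversion needs extra work: a Deseret letter occupies two UTF-16 code units, and only the combined code point has a case mapping. The sketch below uses Python's built-in Unicode data as a stand-in; the helper name from_surrogates is an assumption and nothing here is Mozilla code.

```python
def from_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one UTF-32 code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+10400 DESERET CAPITAL LETTER LONG I is stored in UTF-16 as D801 DC00.
cp = from_surrogates(0xD801, 0xDC00)

# The case mapping is defined on the combined code point, not on either half.
lower = ord(chr(cp).lower())
print(hex(cp), hex(lower))
```

A converter that looks at one 16-bit unit at a time sees only the two surrogate halves, which have no case mappings, so it leaves the text unchanged.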

Jungshik Shin

Assignee

Comment 4

21 years ago

I forgot that case conversion had been filed as a separate bug (bug 210501).

Status: NEW → ASSIGNED

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 5

21 years ago

It would be nice to have some documentation in the tree somewhere that explains how to upgrade all the data we have that's generated from the Unicode database. Also, the patch in bug 238844 might have some useful changes to the generation scripts to avoid changes made manually to the files (license headers, alecf's comment removal). Also, retitling bug with additional search terms so it's possible to find it (since I did about 10 searches before filing bug 238844).

Summary: Update unicharutils to Unicode 4.0 → Update generated files in unicharutils to Unicode 4.0 database

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 6

21 years ago

*** Bug 238844 has been marked as a duplicate of this bug. ***

Jungshik Shin

Assignee

Comment 7

21 years ago

(In reply to comment #5)
> It would be nice to have some documentation in the tree somewhere that explains

That's on my TODO list. I had to reverse engineer to figure out how to update files here.

> Also, the patch in bug 238844 might have some useful changes to the generation

I may (or may not) have made a similar change in my tree. Anyway, thanks for the patch.

> it's possible to find it (since I did about 10 searches before filing bug 238844).

I wonder what your 10 queries were.... With the component 'internationalization', the summary containing 'Unicode' and the owner set to me or smontagu would have brought you right here in a couple of queries ;-)

I have to run now. I'll try to fix it early next week.

Simon Montagu :smontagu

Comment 8

21 years ago

Updating summary since Unicode 4.0.1 has been released in the meantime. http://www.unicode.org/versions/Unicode4.0.1/ http://www.unicode.org/Public/4.0-Update1/

Summary: Update generated files in unicharutils to Unicode 4.0 database → Update generated files in unicharutils to Unicode 4.0.1 database

Jungshik Shin

Assignee

Comment 9

21 years ago

Attached patch: patch v2 (dbaron's patch incorporated and license change)

dbaron's patch was combined with attachment 137911. I also changed the license to MPL.

Attachment #137911 - Attachment is obsolete: true

timeless

Comment 10

21 years ago

Comment on attachment 150203: patch v2 (dbaron's patch incorporated and license change)

>Index: intl/unicharutil/tools/gentransliterate.pl
>+$header = <<END_OF_HEADER;
> ##
> ## The contents of this file are subject to the Netscape Public

Do you want this to spit out MPL-tri?

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 11

20 years ago

What's the status of this patch?

Jungshik Shin

Assignee

Comment 12

20 years ago

I'll make a new patch once Unicode 4.1 is released in February. In the meantime, I converted nsIUGenCategory.h to nsIUGenCategory.idl while changing its methods, 'Get' and 'Is', to accept PRUint32 (UTF-32) instead of UCS-2. Nobody has ever used them, so we don't have to worry about callers or compatibility (not that it's hard to fix callers).

Btw, I also want to change their names. Any suggestions? How about 'getCategory' and 'isCategory'?

Simon, I vaguely remember you made (or at least suggested) a similar change in nsICaseConversion, but the change is not in the tree. Will you take care of it in bug 210501?

Jungshik Shin

Assignee

Comment 13

20 years ago

Attached patch: patch for general category only

With this patch, each Unicode plane has its own category pattern table because otherwise the number of different category patterns (categories for 8 characters are stored in each pattern) exceeds 256. If I used a single pattern table, PRUint16 instead of PRUint8 would have to be used as lookup keys, which would increase the code size by about 3k. I updated nsIUnicodeGeneralCategory (renamed from nsIUGenCategory) to deal with the full Unicode repertoire and made it scriptable. Its implementation was also updated. The category table was generated from UCD 4.0.1, but will be updated to 4.1.0 once it's released.
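The pattern-table idea can be sketched roughly as follows. This is an illustration in Python, not the code gencattable.pl generates: the 4-bit category codes, the function names and the toy data are all assumptions. The comment above only states that each pattern holds the categories of 8 characters and that the lookup key should fit in a PRUint8.

```python
def build_tables(cats):
    """Build (patterns, index) from 4-bit category codes, one per code point.

    len(cats) must be a multiple of 8. Each pattern packs the codes of
    8 consecutive code points into one 32-bit integer.
    """
    patterns = []      # unique packed patterns
    pattern_ids = {}   # packed value -> position in `patterns`
    index = []         # one key per block of 8 code points
    for start in range(0, len(cats), 8):
        packed = 0
        for offset, cat in enumerate(cats[start:start + 8]):
            packed |= (cat & 0xF) << (4 * offset)
        if packed not in pattern_ids:
            pattern_ids[packed] = len(patterns)
            patterns.append(packed)
        index.append(pattern_ids[packed])
    return patterns, index

def lookup(patterns, index, cp):
    """Return the category code of code point cp (relative to the table start)."""
    packed = patterns[index[cp >> 3]]
    return (packed >> (4 * (cp & 7))) & 0xF

# Toy data: four blocks of 8 code points, two of which share a pattern.
cats = [1] * 8 + [2] * 8 + [1] * 8 + [1, 2, 3, 4, 5, 6, 7, 8]
patterns, index = build_tables(cats)
fits_in_one_byte = len(patterns) <= 256
print(len(patterns), index, fits_in_one_byte)
```

As long as a table has at most 256 distinct patterns, every entry of index fits in one byte, which is the reason given above for building one table per plane.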

Simon Montagu :smontagu

Comment 14

20 years ago

See also bug 288137. I'd like to get all these bugs fixed by 1.8b2, preferably with Unicode 4.1 (which should be released this month), but at least with 4.0.1 in the event that 4.1 is delayed.

Jungshik Shin

Assignee

Comment 15

20 years ago

Attached patch: normalization part patch + README.txt

This patch only deals with the Unicode normalization. It also has a new file README.txt explaining how to generate various properties and header files. Btw, diff was made against 'unicodedata_320.c', but I'm gonna cvs remove it and add a new file 'normalization_data.h' in its place. This should be safe enough for branch landing because Unicode normalization routines don't need any change to deal with non-BMP characters (they're already full-unicode-proof), which means only the data file was updated to Unicode 4.1.0.

Attachment #179694 - Flags: superreview?(dbaron)

Attachment #179694 - Flags: review?(smontagu)

Simon Montagu :smontagu

Comment 16

20 years ago

Comment on attachment 179694: normalization part patch + README.txt

>+ The latest version is, as of this writing is in

Nit: one "is" is superfluous.

Attachment #179694 - Flags: review?(smontagu) → review+

Jungshik Shin

Assignee

Comment 17

20 years ago

Attached patch: cattable.h and gencattable.pl patch

This fixes cattable.h and gencattable.pl. I haven't changed the interface definition and its implementation that uses cattable.h because it's not used anywhere. cattable.h in spellchecker is not updated. My plan is either to export cattable.h in intl and cvs-remove the copy in spellchecker so that spellchecker can refer to the copy in intl or to make spellchecker use nsIUGenCategory. That'll be done in bug 287340

Attachment #179704 - Flags: superreview?(dbaron)

Attachment #179704 - Flags: review?(smontagu)

Simon Montagu :smontagu

Comment 18

20 years ago

Comment on attachment 179704: cattable.h and gencattable.pl patch

>-printf OUT "static PRUint8 GetCat(PRUnichar u)\n{\n";
>+printf OUT "static PRUint16 GetCat(PRUint32 u)\n{\n";

Why does this now return a PRUint16? The return value is still in the same range, no?

Jungshik Shin

Assignee

Comment 19

20 years ago

(In reply to comment #18)
> Why does this now return a PRUint16? The return value is still in the same
> range, no?

Oops. You're absolutely right. Do you want me to upload a new patch with that fixed?

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 20

20 years ago

Comment on attachment 179694: normalization part patch + README.txt

>Index: intl/unicharutil/src/nsUnicodeNormalizer.cpp
>-#include "unicodedata_320.c"
>+#include "normalization_data.h"

I don't follow what's going on here, but sr=dbaron.

Attachment #179694 - Flags: superreview?(dbaron) → superreview+

Jungshik Shin

Assignee

Comment 21

20 years ago

Thanks for r/sr. Attachment 179694 landed on the trunk.

(In reply to comment #20)
> (From update of attachment 179694)
> >Index: intl/unicharutil/src/nsUnicodeNormalizer.cpp
> >-#include "unicodedata_320.c"
> >+#include "normalization_data.h"
>
> I don't follow what's going on here, but sr=dbaron.

See comment #15. I removed unicodedata_320.c and added 'normalization_data.h', but made a diff against 'unicodedata_320.c' to avoid making the patch a lot longer than necessary.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 22

20 years ago

(In reply to comment #21)
> > I don't follow what's going on here, but sr=dbaron.
>
> see comment #15. I removed unicodedata_320.c and added 'normalization_data.h',
> but made a diff against 'unicodedata_320.c' to avoid making the patch a lot
> longer than necessary.

I figured it was something like that, but didn't see it. Makes sense.

Jungshik Shin

Assignee

Updated

20 years ago

Blocks: 287340

Simon Montagu :smontagu

Comment 23

20 years ago

Comment on attachment 179704: cattable.h and gencattable.pl patch

r=smontagu with the change I mentioned earlier (no need for a new patch).

Attachment #179704 - Flags: review?(smontagu) → review+

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Updated

20 years ago

Attachment #179704 - Flags: superreview?(dbaron) → superreview+

Jungshik Shin

Assignee

Comment 24

20 years ago

The cattable patch landed on the trunk. Case folding and transliteration tables were updated by smontagu in separate bugs, so I think we can resolve this as fixed. Btw, we may want to bring Firefox 1.0.x and Mozilla 1.7.x up to date (in terms of Unicode support).

Status: ASSIGNED → RESOLVED

Closed: 20 years ago

Resolution: --- → FIXED

Peter van der Woude [:Peter6]

Comment 25

20 years ago

Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8b2) Gecko/20050406 Firefox/1.0+

Is there any chance this patch is causing characters like ë, ö, etc. to be displayed as a � (? in a dark diamond) again? I'm seeing this again in today's build.

Erik van der Poel

Comment 26

20 years ago

It may be OK to update categories to a newer version of Unicode, but we need to be careful about updating the normalization tables. The IDN-related RFCs specifically call for Unicode 3.2.0 and there were some changes in 4.0 in the normalization:

http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt

As far as I can tell, JPNIC's idnkit does not take these corrections into account.

Also, Unicode fixed an error in the normalization spec. The Public Review document says that idnkit needs to be fixed:

http://www.unicode.org/review/pr-29.html

However, the IETF has not decided what to do about PR-29 in Stringprep.

Also, Unicode has a normalization test file:

http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt

Has anyone run this test for Mozilla?

Jungshik Shin

Assignee

Comment 27

20 years ago

(In reply to comment #26)
> be careful about updating the normalization tables. The IDN-related RFCs
> specifically call for Unicode 3.2.0 and there were some changes in 4.0 in the
> normalization:
>
> http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt
>
> As far as I can tell, JPNIC's idnkit does not take these corrections into
> account.

You meant they should be kept 'misnormalized' because the IDN-related RFCs are based on Unicode 3.2.0, didn't you? All of them (with the possible exception of the first one in the BMP) are rarely used, but ....

> Also, Unicode fixed an error in the normalization spec. The Public
> Review document says that idnkit needs to be fixed:
>
> http://www.unicode.org/review/pr-29.html
>
> However, the IETF has not decided what to do about PR-29 in Stringprep.

Hmm.. what do you think we need to do here? Go ahead proactively or just wait for the IETF to make a decision? (http://www.unicode.org/unicode/reports/tr15/ : Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's another revision - hopefully requiring no change on our part - in the draft stage.) Again, the change appears to have little practical implication (except for potential security issues).

> Also, Unicode has a normalization test file:
>
> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
>
> Has anyone run this test for Mozilla?

Not me. Perhaps we have to write a test program based on the file unless it's already been written.

Erik van der Poel

Comment 28

20 years ago

> You meant they should be kept 'misnormalized' because the IDN-related RFCs are
> based on Unicode 3.2.0, didn't you?

Right. The IETF takes stability very seriously, and we should be careful too.

> All of them (with the possible exception of
> the first one in the BMP) are rarely used, but ....

They may be rare, but that is not the point. The point is that we seem to be making changes blindly here. I don't see any evidence that this issue was considered.

> Hmm.. what do you think we need to do here? Go ahead proactively or just wait
> for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ :
> Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's
> another revision -hopefully requiring no change on our part - in the draft
> stage)

What revision are you talking about? Please give me a pointer.

As far as PR-29 is concerned, we should probably wait for the IETF response. I don't know what we should do about the Unicode 4.X changes that were already checked in. The IETF has not decided what to do about NormalizationCorrections.txt either. We could just leave the Mozilla source as it is now -- and correct it later when the IETF decides both of these issues. This leaves Mozilla users exposed somewhat, maybe not much.

Looking at the source (intl/unicharutil/src/nsUnicodeNormalizer.cpp), it seems that the PR-29 issue has not been fixed yet:

if ((last_class < cl || cl == 0) &&
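For readers unfamiliar with PR-29, the difference comes down to the definition of when a character is "blocked" from composing with an earlier starter. The sketch below states both wordings as I understand them from UAX #15 (original: an intervening character with the same combining class blocks; corrected: same or higher). It is written in Python for illustration only; it is not the idnkit or Mozilla code, and the function name and the choice of example sequence are assumptions.

```python
import unicodedata

def is_blocked(text: str, s: int, c: int, pr29_fixed: bool = True) -> bool:
    """Return True if text[c] is blocked from the starter text[s] (s < c)."""
    ccc_c = unicodedata.combining(text[c])
    for b in text[s + 1:c]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0:
            return True   # an intervening starter always blocks
        if pr29_fixed and ccc_b >= ccc_c:
            return True   # corrected wording: same or higher combining class
        if not pr29_fixed and ccc_b == ccc_c:
            return True   # original wording: same combining class only
    return False

# ORIYA VOWEL SIGN E, COMBINING GRAVE ACCENT (class 230), ORIYA VOWEL SIGN AA (class 0)
seq = "\u0B47\u0300\u0B3E"
old_rule = is_blocked(seq, 0, 2, pr29_fixed=False)
new_rule = is_blocked(seq, 0, 2, pr29_fixed=True)
print(old_rule, new_rule)
```

Under the original wording the last character is not blocked, so a composer would combine it with the first one across the accent. Under the corrected wording it is blocked and the sequence stays as it is.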

Erik van der Poel

Comment 29

20 years ago

By the way, I've been talking about Stringprep and Nameprep. If there are other consumers of the Unicode normalization service in Mozilla, they may have other considerations. I haven't checked for other consumers.

Simon Montagu :smontagu

Comment 30

20 years ago

(In reply to comment #27)
> > http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
> >
> > Has anyone run this test for Mozilla?
>
> Not me. Perhaps, we have to write a test program based on the file unless it's
> already been written.

There is a very short NFD test in intl/unicharutil/tests/UnicharSelfTest.cpp. We could try importing the Unicode test file into there to do a more comprehensive test.
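A test program driven by NormalizationTest.txt would not need to be large. The sketch below is in Python and uses the standard unicodedata module as a stand-in for the normalizer under test; for Mozilla, the normalize argument would have to be replaced by a wrapper around nsUnicodeNormalizer, which is not shown here. The function names are assumptions; the column layout and the invariants follow the header of the test file.

```python
import unicodedata

def parse_line(line: str):
    """Return the five columns c1..c5 as strings, or None for comments and @Part headers."""
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    fields = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields]

def check(cols, normalize=unicodedata.normalize) -> bool:
    """Verify the conformance invariants from the test file header for one line."""
    c1, c2, c3, c4, c5 = cols
    return (
        all(normalize("NFC", x) == c2 for x in (c1, c2, c3))
        and all(normalize("NFC", x) == c4 for x in (c4, c5))
        and all(normalize("NFD", x) == c3 for x in (c1, c2, c3))
        and all(normalize("NFD", x) == c5 for x in (c4, c5))
        and all(normalize("NFKC", x) == c4 for x in cols)
        and all(normalize("NFKD", x) == c5 for x in cols)
    )

sample_lines = [
    "@Part1 # Character by character test",
    "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE",
    "FB01;FB01;FB01;0066 0069;0066 0069; # LATIN SMALL LIGATURE FI",
]
results = [check(cols) for cols in map(parse_line, sample_lines) if cols is not None]
print(results)
```

Reading the real file line by line through parse_line and check avoids having to check the test data itself into the tree.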

Jungshik Shin

Assignee

Comment 31

20 years ago

(In reply to comment #28)
> > All of them (with the possible exception of
> > the first one in the BMP) are rarely used, but ....
>
> They may be rare, but that is not the point. The point is that we seem to be
> making changes blindly here. I

That's what 'but ...' was for :-)

> > Hmm.. what do you think we need to do here? Go ahead proactively or just wait
> > for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ :
> > Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's
> > another revision -hopefully requiring no change on our part - in the draft
> > stage)
>
> What revision are you talking about? Please give me a pointer.

I already sorta gave one :-). The 'revision history' in the latest version of UAX #15 explicitly mentions that PR #29 was taken into account (see http://www.unicode.org/unicode/reports/tr15/#Modifications ). You can also find the same information at http://www.unicode.org/review/resolved-pri.html where you can also find PRI #61, which was also approved but whose content hasn't yet been published in UAX #15 ( http://www.unicode.org/reports/tr15/tr15-24.html ).

Erik van der Poel

Comment 32

20 years ago

Simon, the normalization test file is 2 MB. I wouldn't check it into the tree. Maybe you didn't mean to do that. (But it would be nice to check in a program that performs the tests.)

Jungshik, tr15-24.html is the previous version (a draft). The newest version is tr15-25.html, and PR-29 is reflected in that one. Look for the numbers 24 and 25 near the top. In 24, it's called "Tracking Number", in 25 "Revision".

There is at least one implementor who believes that this is an incompatible change. Even though Unicode has decided to correct tr15, the IETF has not decided how to address that change in Stringprep and its profiles (including Nameprep). See the thread "stringprep: PRI #29" of March 19, 2005 in the IDN mailing list archive:

http://ops.ietf.org/lists/idn/idn.2005/maillist.html

Gerv has indicated that an IAB Working Group is looking into Stringprep and Nameprep revisions:

http://weblogs.mozillazine.org/gerv/archives/007785.html

At least one of the Stringprep profile authors appears to be aware of PR-29. See section 4 of:

http://ietf.org/rfc/rfc4013.txt

Stringprep itself specifically refers to tr15-22.html:

http://ietf.org/rfc/rfc3454.txt

Erik van der Poel

Comment 33

20 years ago

Jungshik, it might be a good idea to file one or more new bugs to address PR-29, NormalizationCorrections.txt and the fact that Stringprep does not allow characters that were unassigned in Unicode 3.2.0 in "stored strings" (though they are allowed in "queries"). See section 7 of Stringprep. I suspect that Mozilla code would mostly use the IDN routines in "queries", but if any server, client or other program based on Mozilla code ever uses "stored strings", it would be illegal to store characters outside 3.2.0. If we decide to comply to this extent, we may need to support both 3.2.0 and 4.1 in mozilla/intl. See also the part about normalization in section 10 of IDNA: http://ietf.org/rfc/rfc3490.txt If you like, I can file the bugs.
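The "stored strings" restriction can be illustrated with Python's standard stringprep module, whose in_table_a1 function implements Table A.1 of RFC 3454 (code points unassigned in Unicode 3.2). This is only a sketch of the check, not Mozilla code; the two helper names are assumptions.

```python
import stringprep

def unassigned_in_3_2(text: str):
    """Return the characters of text that RFC 3454 Table A.1 lists as unassigned."""
    return [ch for ch in text if stringprep.in_table_a1(ch)]

def allowed_as_stored_string(text: str) -> bool:
    """Stored strings must not contain unassigned code points; queries may."""
    return not unassigned_in_3_2(text)

# U+0221 LATIN SMALL LETTER D WITH CURL was added after Unicode 3.2.
print(allowed_as_stored_string("example"))
print(allowed_as_stored_string("ex\u0221mple"))
```

A check like this depends on Unicode 3.2.0 data, so it keeps working only if that data remains available next to the newer tables.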

Simon Montagu :smontagu

Comment 34

19 years ago

See bug 326207. Jungshik, would you like to request 1.8 branch approval for this, or shall we just wait for Unicode 5, due for release at the end of next month?

Jungshik Shin

Assignee

Comment 35

19 years ago

(In reply to comment #34)
> See bug 326207. Jungshik, would you like to request 1.8 branch approval for
> this, or shall we just wait for Unicode 5, due for release at the end of next
> month?

I'm not sure. Either way is fine with me. Even though we have to bother drivers twice, I'm slightly more inclined to get approval now.

BTW, we have to revert to Unicode 3.2 for normalization as pointed out by Erik, don't we? So, only attachment 179704 needs to go in. For the trunk, I'll file a new bug to revert to Unicode 3.2.0 and change 'README.txt' accordingly. And, there is an issue with UTR PR#29.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 36

19 years ago

I recommend trying to get this into the 1.8 branch.

preliminary patch: 21 years ago, Jungshik Shin, 440.57 KB, patch
patch v2 (dbaron's patch incorporated and license change): 21 years ago, Jungshik Shin, 213.78 KB, patch
patch for general category only: 20 years ago, Jungshik Shin, 191.58 KB, patch
normalization part patch + README.txt: 20 years ago, Jungshik Shin, 246.50 KB, patch (smontagu: review+, dbaron: superreview+)
cattable.h and gencattable.pl patch: 20 years ago, Jungshik Shin, 185.82 KB, patch (smontagu: review+, dbaron: superreview+)