Closed Bug 210502 Opened 22 years ago Closed 20 years ago

Update generated files in unicharutils to Unicode 4.0.1 database

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

RESOLVED FIXED

People

(Reporter: jshin1987, Assigned: jshin1987)

References

Details

(Keywords: intl)

Attachments

(4 files, 1 obsolete file)

Now that Unicode 4.0 has been released, we need to update intl/unicharutil to Unicode 4.0. That includes normalization, character properties, categories, and so forth.
Summary: Update unicharutils to Unicode 4.0 → Update unicharutils to Unicode 4.0
This link can be used for testing Unicode 4 casing when you are ready. http://www.w3.org/International/tests/test-text-transform.html
What we need to do is rather simple (at least on the surface):
1. Grab idnkit at http://www.nic.ad.jp/en/idn/index.html (http://www.nic.ad.jp/ja/idn/idnkit/download/sources/idnkit-1.0-src.tar.gz).
2. There are three files we need in the kit: generate_normalize_data.pl, UCD.pm and SparseMap.pm.
3. Download the following Unicode data files: CaseFolding.txt, CompositionExclusions.txt, SpecialCasing.txt, UnicodeData.txt.
4. Run generate_normalize_data.pl, edit the result (remove the case folding part because we have separate scripts for that, replace 'unsigned short' and 'unsigned long' with 'PRUnichar' and 'PRUint32') and save it to 'unicodedata.h' (the current name is 'unicodedata_320.c', but I'm not sure it's wise to have the Unicode version in the file name).
5. Generate casetable.h and cattable.h with gencasetable.pl and gencattable.pl.
It'd have been nice if we had made this ready for 1.6, but I'm afraid it's too late.
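The type replacement in step 4 could be scripted instead of done by hand. The following is a minimal sketch in Python, assuming the generated file spells the two C types exactly as named in that step. `retype` is a hypothetical helper, not part of idnkit or the Mozilla tree, and it does not remove the case folding part.

```python
import re

# Mapping taken from step 4 above: C type keyword -> NSPR type.
TYPE_MAP = {"short": "PRUnichar", "long": "PRUint32"}

# Matches "unsigned short" / "unsigned long" with any whitespace in between.
_UNSIGNED = re.compile(r"\bunsigned\s+(short|long)\b")


def retype(source: str) -> str:
    """Return `source` with the two generated C types swapped for NSPR types."""
    return _UNSIGNED.sub(lambda m: TYPE_MAP[m.group(1)], source)
```

Other declarations such as `unsigned int` are left untouched, so the output would still need a manual review before being saved as the new header.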
Assignee: smontagu → jshin
Attached patch: preliminary patch (obsolete)
In addition to updating data to Unicode 4.0x, I had to make some changes to support non-BMP characters. Still, more changes are necessary to support non-BMP characters in case conversion. So, even with this patch, the Deseret alphabet is not properly supported when it comes to case conversion.
I forgot that case-conversion had been filed as a separate bug (bug 210501)
Status: NEW → ASSIGNED
It would be nice to have some documentation in the tree somewhere that explains how to upgrade all the data we have that's generated from the Unicode database. Also, the patch in bug 238844 might have some useful changes to the generation scripts to avoid changes made manually to the files (license headers, alecf's comment removal). Also, retitling bug with additional search terms so it's possible to find it (since I did about 10 searches before filing bug 238844).
Summary: Update unicharutils to Unicode 4.0 → Update generated files in unicharutils to Unicode 4.0 database
*** Bug 238844 has been marked as a duplicate of this bug. ***
(In reply to comment #5)
> It would be nice to have some documentation in the tree somewhere that explains
That's on my TODO list. I had to reverse engineer to figure out how to update files here.
> Also, the patch in bug 238844 might have some useful changes to the generation
I may (or may not) have made a similar change in my tree. Anyway, thanks for the patch.
> it's possible to find it (since I did about 10 searches before filing bug 238844).
I wonder what your 10 queries were.... With the component 'internationalization', the summary containing 'Unicode' and the owner set to me or smontagu would have brought you right here in a couple of queries ;-) I have to run now. I'll try to fix it early next week.
Updating summary since Unicode 4.0.1 has been released in the meantime. http://www.unicode.org/versions/Unicode4.0.1/ http://www.unicode.org/Public/4.0-Update1/
Summary: Update generated files in unicharutils to Unicode 4.0 database → Update generated files in unicharutils to Unicode 4.0.1 database
dbaron's patch was combined with attachment 137911. I also changed the license to MPL.
Attachment #137911 - Attachment is obsolete: true
Comment on attachment 150203: patch v2 (dbaron's patch incorporated and license change)
>Index: intl/unicharutil/tools/gentransliterate.pl
>+$header = <<END_OF_HEADER;
> ##
> ## The contents of this file are subject to the Netscape Public
Do you want this to spit out MPL-tri?
What's the status of this patch?
I'll make a new patch once Unicode 4.1 is released in February. In the meantime, I converted nsIUGenCategory.h to nsIUGenCategory.idl and changed its methods, 'Get' and 'Is', to accept PRUint32 (UTF-32) instead of UCS-2. Nobody has ever used them, so we don't have to worry about callers or compatibility (not that it's hard to fix callers). Btw, I also want to change their names. Any suggestions? How about 'getCategory' and 'isCategory'? Simon, I vaguely remember you made (or at least suggested) a similar change in nsICaseConversion, but the change is not in the tree. Will you take care of it in bug 210501?
With this patch, each Unicode plane has its own category pattern table because otherwise the number of different category patterns (categories for 8 characters are stored in each pattern) exceeds 256. If I used a single pattern table, PRUint16 instead of PRUint8 would have to be used as lookup keys, which would increase the code size by about 3k. I updated nsIUnicodeGeneralCategory (renamed from nsIUGenCategory) to deal with the full Unicode repertoire and made it scriptable. Its implementation was also updated. The category table was generated from UCD 4.0.1, but will be updated to 4.1.0 once it's released.
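To make the pattern-table trade-off concrete, here is a rough sketch in Python of a per-plane two-level lookup. It assumes each category fits in 4 bits, so the categories of 8 consecutive characters pack into one 32-bit pattern. The function names, the 4-bit width and the full-plane coverage are illustrative assumptions, not the layout gencattable.pl actually emits.

```python
def build_plane_table(category_of, plane):
    """Build (patterns, keys) for one 64K plane.

    category_of: callable mapping a code point to a small int (0..15).
    patterns:    unique packed patterns, 8 categories of 4 bits each.
    keys:        one index into `patterns` per block of 8 code points.
    """
    patterns, index_of, keys = [], {}, []
    base = plane << 16
    for block in range(0, 0x10000, 8):
        packed = 0
        for i in range(8):
            packed |= (category_of(base + block + i) & 0xF) << (4 * i)
        if packed not in index_of:
            index_of[packed] = len(patterns)
            patterns.append(packed)
        keys.append(index_of[packed])
    return patterns, keys


def lookup(patterns, keys, cp):
    """Return the category of `cp` using the tables built for its plane."""
    offset = cp & 0xFFFF
    packed = patterns[keys[offset >> 3]]
    return (packed >> (4 * (offset & 7))) & 0xF
```

As long as `len(patterns)` stays at or below 256 for a plane, every entry in `keys` fits in one byte. Merging all planes into a single table is what would push the count past 256 and force two-byte keys.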
See also bug 288137. I'd like to get all these bugs fixed by 1.8b2, preferably with Unicode 4.1 (which should be released this month), but at least with 4.0.1 in the event that 4.1 is delayed.
This patch only deals with the Unicode normalization. It also has a new file README.txt explaining how to generate various properties and header files. Btw, diff was made against 'unicodedata_320.c', but I'm gonna cvs remove it and add a new file 'normalization_data.h' in its place. This should be safe enough for branch landing because Unicode normalization routines don't need any change to deal with non-BMP characters (they're already full-unicode-proof), which means only the data file was updated to Unicode 4.1.0.
Attachment #179694 - Flags: superreview?(dbaron)
Attachment #179694 - Flags: review?(smontagu)
Comment on attachment 179694: normalization part patch + README.txt
>+ The latest version is, as of this writing is in
Nit: one "is" is superfluous
Attachment #179694 - Flags: review?(smontagu) → review+
This fixes cattable.h and gencattable.pl. I haven't changed the interface definition and its implementation that uses cattable.h because it's not used anywhere. cattable.h in spellchecker is not updated. My plan is either to export cattable.h in intl and cvs-remove the copy in spellchecker so that spellchecker can refer to the copy in intl or to make spellchecker use nsIUGenCategory. That'll be done in bug 287340
Attachment #179704 - Flags: superreview?(dbaron)
Attachment #179704 - Flags: review?(smontagu)
Comment on attachment 179704: cattable.h and gencattable.pl patch
>-printf OUT "static PRUint8 GetCat(PRUnichar u)\n{\n";
>+printf OUT "static PRUint16 GetCat(PRUint32 u)\n{\n";
Why does this now return a PRUint16? The return value is still in the same range, no?
(In reply to comment #18)
> Why does this now return a PRUint16? The return value is still in the same
> range, no?
Oops. You're absolutely right. Do you want me to upload a new patch with that fixed?
Comment on attachment 179694: normalization part patch + README.txt
>Index: intl/unicharutil/src/nsUnicodeNormalizer.cpp
>-#include "unicodedata_320.c"
>+#include "normalization_data.h"
I don't follow what's going on here, but sr=dbaron.
Attachment #179694 - Flags: superreview?(dbaron) → superreview+
Thanks for r/sr. Attachment 179694 was landed on the trunk.
(In reply to comment #20)
> (From update of attachment 179694)
> >Index: intl/unicharutil/src/nsUnicodeNormalizer.cpp
> >-#include "unicodedata_320.c"
> >+#include "normalization_data.h"
>
> I don't follow what's going on here, but sr=dbaron.
See comment #15. I removed unicodedata_320.c and added 'normalization_data.h', but made a diff against 'unicodedata_320.c' to avoid making the patch a lot longer than necessary.
(In reply to comment #21)
> > I don't follow what's going on here, but sr=dbaron.
>
> see comment #15. I removed unicodedata_320.c and added 'normalization_data.h',
> but made a diff against 'unicodedata_320.c' to avoid making the patch a lot
> longer than necessary.
I figured it was something like that, but didn't see it. Makes sense.
Blocks: 287340
Comment on attachment 179704: cattable.h and gencattable.pl patch
r=smontagu with the change I mentioned earlier (no need for a new patch)
Attachment #179704 - Flags: review?(smontagu) → review+
Attachment #179704 - Flags: superreview?(dbaron) → superreview+
The cattable patch landed on the trunk. Case folding and transliteration tables were updated by smontagu in separate bugs, so I think we can resolve this as fixed. Btw, we may want to bring firefox 1.0.x and mozilla 1.7.x up to date (in terms of Unicode support).
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8b2) Gecko/20050406 Firefox/1.0+ Is there any chance this patch is causing characters like ë ö etc to be displayed as a &#65533; (? in a dark diamond) again ? I'm seeing this again in todays build.
It may be OK to update categories to a newer version of Unicode, but we need to be careful about updating the normalization tables. The IDN-related RFCs specifically call for Unicode 3.2.0 and there were some changes in 4.0 in the normalization: http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt As far as I can tell, JPNIC's idnkit does not take these corrections into account. Also, Unicode fixed an error in the normalization spec. The Public Review document says that idnkit needs to be fixed: http://www.unicode.org/review/pr-29.html However, the IETF has not decided what to do about PR-29 in Stringprep. Also, Unicode has a normalization test file: http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt Has anyone run this test for Mozilla?
(In reply to comment #26)
> be careful about updating the normalization tables. The IDN-related RFCs
> specifically call for Unicode 3.2.0 and there were some changes in 4.0 in the
> normalization:
>
> http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt
>
> As far as I can tell, JPNIC's idnkit does not take these corrections into
> account.
You meant they should be kept 'misnormalized' because the IDN-related RFCs are based on Unicode 3.2.0, didn't you? All of them (with the possible exception of the first one in the BMP) are rarely used, but ....
> Also, Unicode fixed an error in the normalization spec. The Public
> Review document says that idnkit needs to be fixed:
>
> http://www.unicode.org/review/pr-29.html
>
> However, the IETF has not decided what to do about PR-29 in Stringprep.
Hmm.. what do you think we need to do here? Go ahead proactively or just wait for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ : Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's another revision -hopefully requiring no change on our part - in the draft stage) Again, the change appears to have little practical implication (except for potential security issues)
> Also, Unicode has a normalization test file:
>
> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
>
> Has anyone run this test for Mozilla?
Not me. Perhaps, we have to write a test program based on the file unless it's already been written.
> You meant they should be kept 'misnormalized' because the IDN-related RFCs are
> based on Unicode 3.2.0, didn't you?
Right. The IETF takes stability very seriously, and we should be careful too.
> All of them (with the possible exception of
> the first one in the BMP) are rarely used, but ....
They may be rare, but that is not the point. The point is that we seem to be making changes blindly here. I don't see any evidence that this issue was considered.
> Hmm.. what do you think we need to do here? Go ahead proactively or just wait
> for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ :
> Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's
> another revision -hopefully requiring no change on our part - in the draft
> stage)
What revision are you talking about? Please give me a pointer. As far as PR-29 is concerned, we should probably wait for the IETF response. I don't know what we should do about the Unicode 4.X changes that were already checked in. The IETF has not decided what to do about NormalizationCorrections.txt either. We could just leave the Mozilla source as it is now -- and correct it later when the IETF decides both of these issues. This leaves Mozilla users exposed somewhat, maybe not much. Looking at the source (intl/unicharutil/src/nsUnicodeNormalizer.cpp), it seems that the PR-29 issue has not been fixed yet:
if ((last_class < cl || cl == 0) &&
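The effect of the condition quoted above can be seen in a toy model. The sketch below is written in Python and is not the Mozilla or idnkit code: `compose`, its parameters and the tiny data tables are hypothetical, and the "fixed" test is just one way to express the corrected blocking rule (a later starter may combine only when nothing uncomposed sits between it and the previous starter). The input is assumed to be already decomposed and canonically ordered.

```python
def compose(seq, ccc, pairs, pr29_fixed):
    """Canonical composition pass over a list of code points.

    ccc:   combining class per code point (missing means 0).
    pairs: (starter, follower) -> composed code point.
    """
    out = []
    starter = None   # index in `out` of the current starter
    last_class = 0   # class of the last uncomposed char after the starter
    for c in seq:
        cl = ccc.get(c, 0)
        if starter is not None:
            if pr29_fixed:
                unblocked = last_class < cl or last_class == 0
            else:
                # Same shape as the condition quoted from the source.
                unblocked = last_class < cl or cl == 0
            composed = pairs.get((out[starter], c))
            if unblocked and composed is not None:
                out[starter] = composed
                continue
        out.append(c)
        if cl == 0:
            starter, last_class = len(out) - 1, 0
        else:
            last_class = cl
    return out


# Illustrative data: U+0B47 + U+0B3E compose to U+0B4B; U+0300 has class 230.
CCC = {0x0300: 230}
PAIRS = {(0x0B47, 0x0B3E): 0x0B4B}
SEQ = [0x0B47, 0x0300, 0x0B3E]
old_result = compose(SEQ, CCC, PAIRS, pr29_fixed=False)
new_result = compose(SEQ, CCC, PAIRS, pr29_fixed=True)
```

With the old test, the intervening U+0300 does not stop U+0B3E from combining with U+0B47. With the stricter test the sequence is left unchanged.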
By the way, I've been talking about Stringprep and Nameprep. If there are other consumers of the Unicode normalization service in Mozilla, they may have other considerations. I haven't checked for other consumers.
(In reply to comment #27)
> > http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
> >
> > Has anyone run this test for Mozilla?
>
> Not me. Perhaps, we have to write a test program based on the file unless it's
> already been written.
There is a very short NFD test in intl/unicharutil/tests/UnicharSelfTest.cpp. We could try importing the Unicode test file into there to do a more comprehensive test.
(In reply to comment #28)
> > All of them (with the possible exception of
> > the first one in the BMP) are rarely used, but ....
>
> They may be rare, but that is not the point. The point is that we seem to be
> making changes blindly here. I
That's what 'but ...' was for :-)
> > Hmm.. what do you think we need to do here? Go ahead proactively or just wait
> > for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ :
> > Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's
> > another revision -hopefully requiring no change on our part - in the draft
> > stage)
>
> What revision are you talking about? Please give me a pointer.
I already sorta gave one :-). The 'revision history' in the latest version of UAX #15 explicitly mentions that PR #29 was taken into account. (see http://www.unicode.org/unicode/reports/tr15/#Modifications ). You can also find the same information at http://www.unicode.org/review/resolved-pri.html where you can also find PRI #61 that was also approved whose content hasn't yet been published in UAX #15 ( http://www.unicode.org/reports/tr15/tr15-24.html)
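A test program driven by NormalizationTest.txt could look roughly like the sketch below. It is written in Python and uses the standard `unicodedata.normalize` only as a stand-in: to test Mozilla, the `normalize` argument would have to be replaced by a wrapper around the Mozilla normalizer, which is not shown here. `parse_line` and `check` are hypothetical names. The checks follow the conformance rules stated in the test file header for its five columns.

```python
import unicodedata


def parse_line(line):
    """Return the five columns as strings, or None for comments and @Part lines."""
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    fields = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields]


def check(cols, normalize=unicodedata.normalize):
    """Return the list of normalization forms that fail for one test line."""
    c1, c2, c3, c4, c5 = cols
    rules = {
        "NFC": [(c2, (c1, c2, c3)), (c4, (c4, c5))],
        "NFD": [(c3, (c1, c2, c3)), (c5, (c4, c5))],
        "NFKC": [(c4, (c1, c2, c3, c4, c5))],
        "NFKD": [(c5, (c1, c2, c3, c4, c5))],
    }
    failed = []
    for form, expectations in rules.items():
        for want, inputs in expectations:
            if any(normalize(form, s) != want for s in inputs):
                failed.append(form)
                break
    return failed
```

Since the data file is read at run time, only the small driver would need to live in the tree. Note that Python's tables track a newer Unicode version than the ones discussed in this bug, so results from the stand-in say nothing about Mozilla's behavior.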
Simon, the normalization test file is 2 MB. I wouldn't check it into the tree. Maybe you didn't mean to do that. (But it would be nice to check in a program that performs the tests.) Jungshik, tr15-24.html is the previous version (a draft). The newest version is tr15-25.html, and PR-29 is reflected in that one. Look for the numbers 24 and 25 near the top. In 24, it's called "Tracking Number", in 25 "Revision". There is at least one implementor who believes that this is an incompatible change. Even though Unicode has decided to correct tr15, the IETF has not decided how to address that change in Stringprep and its profiles (including Nameprep). See the thread "stringprep: PRI #29" of March 19, 2005 in the IDN mailing list archive: http://ops.ietf.org/lists/idn/idn.2005/maillist.html Gerv has indicated that an IAB Working Group is looking into Stringprep and Nameprep revisions: http://weblogs.mozillazine.org/gerv/archives/007785.html At least one of the Stringprep profile authors appears to be aware of PR-29. See section 4 of: http://ietf.org/rfc/rfc4013.txt Stringprep itself specifically refers to tr15-22.html: http://ietf.org/rfc/rfc3454.txt
Jungshik, it might be a good idea to file one or more new bugs to address PR-29, NormalizationCorrections.txt and the fact that Stringprep does not allow characters that were unassigned in Unicode 3.2.0 in "stored strings" (though they are allowed in "queries"). See section 7 of Stringprep. I suspect that Mozilla code would mostly use the IDN routines in "queries", but if any server, client or other program based on Mozilla code ever uses "stored strings", it would be illegal to store characters outside 3.2.0. If we decide to comply to this extent, we may need to support both 3.2.0 and 4.1 in mozilla/intl. See also the part about normalization in section 10 of IDNA: http://ietf.org/rfc/rfc3490.txt If you like, I can file the bugs.
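For the "stored strings" restriction, a quick way to see which characters were unassigned in Unicode 3.2.0 is Python's frozen 3.2.0 database, `unicodedata.ucd_3_2_0`. This is only an illustration of the check, not a proposal for how Mozilla should implement it. `unassigned_in_3_2` is a hypothetical helper, and it treats every character with category `Cn` as unassigned, which is slightly broader than Stringprep's table A.1 (that table excludes noncharacters).

```python
from unicodedata import ucd_3_2_0


def unassigned_in_3_2(text):
    """Return the characters of `text` that have no assignment in Unicode 3.2.0."""
    return [ch for ch in text if ucd_3_2_0.category(ch) == "Cn"]
```

A stored string would be acceptable under this rule only if the function returns an empty list for it.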
See bug 326207. Jungshik, would you like to request 1.8 branch approval for this, or shall we just wait for Unicode 5, due for release at the end of next month?
(In reply to comment #34)
> See bug 326207. Jungshik, would you like to request 1.8 branch approval for
> this, or shall we just wait for Unicode 5, due for release at the end of next
> month?
I'm not sure. Either way is fine with me. Even though we have to bother drivers twice, I'm slightly more inclined to get approval now. BTW, we have to revert to Unicode 3.2 for normalization as pointed out by Erik, don't we? So, only attachment 179704 needs to go in. For the trunk, I'll file a new bug to revert to Unicode 3.2.0 and change 'README.txt' accordingly. And, there is an issue with UTR PR#29.
I recommend trying to get this in to the 1.8 branch.