Closed
Bug 210502
Opened 21 years ago
Closed 19 years ago
Update generated files in unicharutils to Unicode 4.0.1 database
Categories: Core :: Internationalization, defect
Status: RESOLVED FIXED
People: Reporter: jshin1987, Assigned: jshin1987
Keywords: intl
Attachments
(4 files, 1 obsolete file)
213.78 KB, patch
191.58 KB, patch
246.50 KB, patch (smontagu: review+, dbaron: superreview+)
185.82 KB, patch (smontagu: review+, dbaron: superreview+)
Now that Unicode 4.0 was released, we need to update intl/unicharutils to Unicode 4.0. That includes normalization, character properties, categories, and so forth.
Updated•21 years ago
Summary: Update unicharutils to Unicode 4.0 → Update unicharutils to Unicode 4.0
This link can be used for testing Unicode 4 casing when you are ready. http://www.w3.org/International/tests/test-text-transform.html
Comment 2•21 years ago (Assignee)
What we need to do is rather simple (at least on the surface):
1. Grab idnkit at http://www.nic.ad.jp/en/idn/index.html (http://www.nic.ad.jp/ja/idn/idnkit/download/sources/idnkit-1.0-src.tar.gz).
2. There are three files we need in the kit: generate_normalize_data.pl, UCD.pm and SparseMap.pm.
3. Download the following Unicode data files: CaseFolding.txt, CompositionExclusions.txt, SpecialCasing.txt, UnicodeData.txt.
4. Run generate_normalize_data.pl, edit the result (remove the case folding part because we have separate scripts for that, replace 'unsigned short' and 'unsigned long' with 'PRUnichar' and 'PRUint32') and save it to 'unicodedata.h' (the current name is 'unicodedata_320.c', but I'm not sure it's wise to have the Unicode version in the file name).
5. Generate casetable.h and cattable.h with gencasetable.pl and gencattable.pl.
It'd have been nice if we had made this ready for 1.6, but I'm afraid it's too late.
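The type substitution in step 4 could be scripted rather than done by hand. A minimal sketch in Python, assuming the generated output spells the C types literally as `unsigned short` and `unsigned long`; the helper name `port_types` is hypothetical and is not part of idnkit or the Mozilla tree:

```python
import re

# Map the C types emitted by the generator to the NSPR types named in step 4.
# Assumption: the generated text spells the types exactly this way.
TYPE_MAP = {
    "unsigned short": "PRUnichar",
    "unsigned long": "PRUint32",
}

_PATTERN = re.compile(r"\bunsigned\s+(short|long)\b")


def port_types(source: str) -> str:
    """Replace 'unsigned short'/'unsigned long' with PRUnichar/PRUint32."""
    def repl(match):
        return TYPE_MAP["unsigned " + match.group(1)]
    return _PATTERN.sub(repl, source)
```

Removing the case folding part would still need a separate step, since that depends on how the generator lays out its output.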
Assignee: smontagu → jshin
Comment 3•21 years ago (Assignee)
In addition to updating the data to Unicode 4.0.x, I had to make some changes to support non-BMP characters. Still, more changes are necessary to support non-BMP characters in case conversion. So, even with this patch, the Deseret alphabet is not properly supported when it comes to case conversion.
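To illustrate why non-BMP case conversion needs more than a data update, here is a small Python sketch. It is only an illustration and not Mozilla's code: `lower_by_code_unit` is a hypothetical stand-in for an API that maps one 16-bit code unit at a time, and Python's own `str.lower()` serves as the code-point-aware reference.

```python
def lower_by_code_unit(text: str) -> str:
    """Lowercase one UTF-16 code unit at a time, as a 16-bit-only API would."""
    data = text.encode("utf-16-le")
    out = bytearray()
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        # A lone surrogate has no case mapping, so it passes through untouched.
        if not 0xD800 <= unit <= 0xDFFF:
            lowered = chr(unit).lower()
            # Keep only simple one-to-one mappings that stay in the BMP.
            if len(lowered) == 1 and ord(lowered) <= 0xFFFF:
                unit = ord(lowered)
        out += unit.to_bytes(2, "little")
    return out.decode("utf-16-le")


# U+10400 DESERET CAPITAL LETTER LONG I lowercases to U+10428.
sample = "A\U00010400"
per_unit = lower_by_code_unit(sample)   # the Deseret letter stays uppercase
per_code_point = sample.lower()         # both letters are lowercased
```

The per-unit version handles the Latin letter but leaves the Deseret letter alone, because each half of the surrogate pair is looked up on its own.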
Comment 4•21 years ago (Assignee)
I forgot that case-conversion had been filed as a separate bug (bug 210501).
Status: NEW → ASSIGNED
It would be nice to have some documentation in the tree somewhere that explains how to upgrade all the data we have that's generated from the Unicode database. Also, the patch in bug 238844 might have some useful changes to the generation scripts to avoid changes made manually to the files (license headers, alecf's comment removal). Also, retitling bug with additional search terms so it's possible to find it (since I did about 10 searches before filing bug 238844).
Summary: Update unicharutils to Unicode 4.0 → Update generated files in unicharutils to Unicode 4.0 database
*** Bug 238844 has been marked as a duplicate of this bug. ***
Comment 7•20 years ago (Assignee)
(In reply to comment #5)
> It would be nice to have some documentation in the tree somewhere that explains
That's on my TODO list. I had to reverse engineer to figure out how to update files here.
> Also, the patch in bug 238844 might have some useful changes to the generation
I may (or may not) have made a similar change in my tree. Anyway, thanks for the patch.
> it's possible to find it (since I did about 10 searches before filing bug 238844).
I wonder what your 10 queries were.... With the component 'internationalization', the summary containing 'Unicode' and the owner set to me or smontagu would have brought you right here in a couple of queries ;-)
I have to run now. I'll try to fix it early next week.
Comment 8•20 years ago
Updating summary since Unicode 4.0.1 has been released in the meantime.
http://www.unicode.org/versions/Unicode4.0.1/
http://www.unicode.org/Public/4.0-Update1/
Summary: Update generated files in unicharutils to Unicode 4.0 database → Update generated files in unicharutils to Unicode 4.0.1 database
Comment 9•20 years ago (Assignee)
dbaron's patch was combined with attachment 137911. I also changed the license to MPL.
Attachment #137911 -
Attachment is obsolete: true
Comment 10•20 years ago
Comment on attachment 150203
patch v2 (dbaron's patch incorporated and license change)
>Index: intl/unicharutil/tools/gentransliterate.pl
>+$header = <<END_OF_HEADER;
> ##
> ## The contents of this file are subject to the Netscape Public
Do you want this to spit out MPL-tri?
What's the status of this patch?
Comment 12•20 years ago (Assignee)
I'll make a new patch once Unicode 4.1 is released in February. In the meantime, I converted nsIUGenCategory.h to nsIUGenCategory.idl while changing its methods, 'Get' and 'Is', to accept PRUint32 (UTF-32) instead of UCS-2. Nobody has ever used them, so we don't have to worry about callers or compatibility (not that it's hard to fix callers). Btw, I also want to change their names. Any suggestions? How about 'getCategory' and 'isCategory'? Simon, I vaguely remember you made (or at least suggested) a similar change in nsICaseConversion, but the change is not in the tree. Will you take care of it in bug 210501?
Comment 13•19 years ago (Assignee)
With this patch, each Unicode plane has its own category pattern table because otherwise the number of different category patterns (categories for 8 characters are stored in each pattern) exceeds 256. If I used a single pattern table, PRUint16 instead of PRUint8 would have to be used as lookup keys, which would increase the code size by about 3k. I updated nsIUnicodeGeneralCategory (renamed from nsIUGenCategory) to deal with the full Unicode repertoire and made it scriptable. Its implementation was also updated. The category table was generated from UCD 4.0.1, but will be updated to 4.1.0 once it's released.
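The per-plane pattern table described above can be sketched as follows. This is an illustration under stated assumptions, not the layout of the real cattable.h: the 4-bit codes in `MAJOR`, the function names, and the use of Python's `unicodedata` (whose Unicode version depends on the interpreter) are all choices made for this example.

```python
import unicodedata

# Hypothetical 4-bit codes keyed by the major class of the general category
# (the first letter of "Lu", "Nd", ...). The real encoding may differ.
MAJOR = {"C": 0, "L": 1, "M": 2, "N": 3, "P": 4, "S": 5, "Z": 6}


def cat_code(cp: int) -> int:
    return MAJOR[unicodedata.category(chr(cp))[0]]


def build_plane_table(plane: int):
    """Return (index, patterns) for one plane of 65536 code points."""
    base = plane << 16
    patterns = []   # distinct patterns, each packing 8 categories as nibbles
    seen = {}       # pattern value -> position in `patterns`
    index = []      # one entry per block of 8 code points
    for block in range(0x10000 // 8):
        pat = 0
        for i in range(8):
            pat |= cat_code(base + block * 8 + i) << (4 * i)
        if pat not in seen:
            seen[pat] = len(patterns)
            patterns.append(pat)
        index.append(seen[pat])
    return index, patterns


def get_cat(tables, cp: int) -> int:
    """Two-level lookup: plane -> block index -> nibble inside the pattern."""
    index, patterns = tables[cp >> 16]
    low = cp & 0xFFFF
    return (patterns[index[low >> 3]] >> (4 * (low & 7))) & 0xF
```

An 8-bit index entry works only while a plane has at most 256 distinct patterns, which is the constraint that motivates one table per plane; whether that holds depends on the data and on the category encoding, so the sketch does not assert it.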
Comment 14•19 years ago
See also bug 288137. I'd like to get all these bugs fixed by 1.8b2, preferably with Unicode 4.1 (which should be released this month), but at least with 4.0.1 in the event that 4.1 is delayed.
Comment 15•19 years ago (Assignee)
This patch only deals with the Unicode normalization. It also has a new file README.txt explaining how to generate various properties and header files. Btw, diff was made against 'unicodedata_320.c', but I'm gonna cvs remove it and add a new file 'normalization_data.h' in its place. This should be safe enough for branch landing because Unicode normalization routines don't need any change to deal with non-BMP characters (they're already full-unicode-proof), which means only the data file was updated to Unicode 4.1.0.
Attachment #179694 -
Flags: superreview?(dbaron)
Attachment #179694 -
Flags: review?(smontagu)
Comment 16•19 years ago
Comment on attachment 179694
normalization part patch + README.txt
>+ The latest version is, as of this writing is in
Nit: one "is" is superfluous
Attachment #179694 -
Flags: review?(smontagu) → review+
Comment 17•19 years ago (Assignee)
This fixes cattable.h and gencattable.pl. I haven't changed the interface definition and its implementation that uses cattable.h because it's not used anywhere. cattable.h in spellchecker is not updated. My plan is either to export cattable.h in intl and cvs-remove the copy in spellchecker so that spellchecker can refer to the copy in intl or to make spellchecker use nsIUGenCategory. That'll be done in bug 287340
Attachment #179704 -
Flags: superreview?(dbaron)
Attachment #179704 -
Flags: review?(smontagu)
Comment 18•19 years ago
Comment on attachment 179704
cattable.h and gencattable.pl patch
>-printf OUT "static PRUint8 GetCat(PRUnichar u)\n{\n";
>+printf OUT "static PRUint16 GetCat(PRUint32 u)\n{\n";
Why does this now return a PRUint16? The return value is still in the same range, no?
Comment 19•19 years ago (Assignee)
(In reply to comment #18)
> Why does this now return a PRUint16? The return value is still in the same
> range, no?
Oops. You're absolutely right. Do you want me to upload a new patch with that fixed?
Comment on attachment 179694
normalization part patch + README.txt
>Index: intl/unicharutil/src/nsUnicodeNormalizer.cpp
>-#include "unicodedata_320.c"
>+#include "normalization_data.h"
I don't follow what's going on here, but sr=dbaron.
Attachment #179694 -
Flags: superreview?(dbaron) → superreview+
Comment 21•19 years ago (Assignee)
Thanks for r/sr. Attachment 179694 was landed on the trunk.
(In reply to comment #20)
> (From update of attachment 179694)
> >Index: intl/unicharutil/src/nsUnicodeNormalizer.cpp
> >-#include "unicodedata_320.c"
> >+#include "normalization_data.h"
>
> I don't follow what's going on here, but sr=dbaron.
See comment #15. I removed unicodedata_320.c and added 'normalization_data.h', but made a diff against 'unicodedata_320.c' to avoid making the patch a lot longer than necessary.
(In reply to comment #21)
> > I don't follow what's going on here, but sr=dbaron.
>
> see comment #15. I removed unicodedata_320.c and added 'normalization_data.h',
> but made a diff against 'unicodedata_320.c' to avoid making the patch a lot
> longer than necessary.
I figured it was something like that, but didn't see it. Makes sense.
Comment 23•19 years ago
Comment on attachment 179704
cattable.h and gencattable.pl patch
r=smontagu with the change I mentioned earlier (no need for a new patch)
Attachment #179704 -
Flags: review?(smontagu) → review+
Attachment #179704 -
Flags: superreview?(dbaron) → superreview+
Comment 24•19 years ago (Assignee)
The cattable patch landed on the trunk. The case folding and transliteration tables were updated by smontagu in separate bugs, so I think we can resolve this as fixed. Btw, we may want to bring Firefox 1.0.x and Mozilla 1.7.x up to date (in terms of Unicode support).
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Comment 25•19 years ago
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8b2) Gecko/20050406 Firefox/1.0+
Is there any chance this patch is causing characters like ë, ö, etc. to be displayed as a � (? in a dark diamond) again? I'm seeing this again in today's build.
Comment 26•19 years ago
It may be OK to update categories to a newer version of Unicode, but we need to be careful about updating the normalization tables. The IDN-related RFCs specifically call for Unicode 3.2.0 and there were some changes in 4.0 in the normalization:
http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt
As far as I can tell, JPNIC's idnkit does not take these corrections into account.
Also, Unicode fixed an error in the normalization spec. The Public Review document says that idnkit needs to be fixed:
http://www.unicode.org/review/pr-29.html
However, the IETF has not decided what to do about PR-29 in Stringprep.
Also, Unicode has a normalization test file:
http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
Has anyone run this test for Mozilla?
Comment 27•19 years ago (Assignee)
(In reply to comment #26)
> be careful about updating the normalization tables. The IDN-related RFCs
> specifically call for Unicode 3.2.0 and there were some changes in 4.0 in the
> normalization:
>
> http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt
>
> As far as I can tell, JPNIC's idnkit does not take these corrections into
> account.
You meant they should be kept 'misnormalized' because the IDN-related RFCs are based on Unicode 3.2.0, didn't you? All of them (with the possible exception of the first one in the BMP) are rarely used, but ....
> Also, Unicode fixed an error in the normalization spec. The Public
> Review document says that idnkit needs to be fixed:
>
> http://www.unicode.org/review/pr-29.html
>
> However, the IETF has not decided what to do about PR-29 in Stringprep.
Hmm.. what do you think we need to do here? Go ahead proactively or just wait for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ : Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's another revision -hopefully requiring no change on our part - in the draft stage) Again, the change appears to have little practical implication (except for potential security issues)
> Also, Unicode has a normalization test file:
>
> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
>
> Has anyone run this test for Mozilla?
Not me. Perhaps, we have to write a test program based on the file unless it's already been written.
Comment 28•19 years ago
> You meant they should be kept 'misnormalized' because the IDN-related RFCs are
> based on Unicode 3.2.0, didn't you?
Right. The IETF takes stability very seriously, and we should be careful too.
> All of them (with the possible exception of
> the first one in the BMP) are rarely used, but ....
They may be rare, but that is not the point. The point is that we seem to be making changes blindly here. I don't see any evidence that this issue was considered.
> Hmm.. what do you think we need to do here? Go ahead proactively or just wait
> for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ :
> Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's
> another revision -hopefully requiring no change on our part - in the draft
> stage)
What revision are you talking about? Please give me a pointer.
As far as PR-29 is concerned, we should probably wait for the IETF response. I don't know what we should do about the Unicode 4.X changes that were already checked in. The IETF has not decided what to do about NormalizationCorrections.txt either. We could just leave the Mozilla source as it is now -- and correct it later when the IETF decides both of these issues. This leaves Mozilla users exposed somewhat, maybe not much.
Looking at the source (intl/unicharutil/src/nsUnicodeNormalizer.cpp), it seems that the PR-29 issue has not been fixed yet:
if ((last_class < cl || cl == 0) &&
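For reference, the corrected blocking rule from UAX #15 can be sketched in Python as below. This is not Mozilla's code: `is_blocked` and `pr29_safe` are hypothetical helper names, and Python's `unicodedata` module (which follows the corrected rule) stands in for the normalizer under discussion. As quoted above, the `cl == 0` alternative appears to let a starter compose even after an intervening combining mark, which is the case PR-29 addresses.

```python
import unicodedata


def is_blocked(between: str, c: str) -> bool:
    """Corrected rule: C is blocked from the starter if any character B
    between them is a starter or has ccc(B) >= ccc(C)."""
    ccc_c = unicodedata.combining(c)
    return any(
        unicodedata.combining(b) == 0 or unicodedata.combining(b) >= ccc_c
        for b in between
    )


# Sequences of the kind discussed in PR-29: a starter, a combining mark,
# then a second starter that would otherwise compose with the first.
PR29_SAMPLES = ["\u0b47\u0300\u0b3e", "\u1100\u0300\u1161"]


def pr29_safe(normalize=unicodedata.normalize) -> bool:
    """True if NFC leaves every sample unchanged, as the corrected rule requires."""
    return all(normalize("NFC", s) == s for s in PR29_SAMPLES)
```

Passing a different `normalize` callable to `pr29_safe` would let the same check be run against another implementation.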
Comment 29•19 years ago
By the way, I've been talking about Stringprep and Nameprep. If there are other consumers of the Unicode normalization service in Mozilla, they may have other considerations. I haven't checked for other consumers.
Comment 30•19 years ago
(In reply to comment #27)
> > http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
> >
> > Has anyone run this test for Mozilla?
>
> Not me. Perhaps, we have to write a test program based on the file unless it's
> already been written.
There is a very short NFD test in intl/unicharutil/tests/UnicharSelfTest.cpp. We could try importing the Unicode test file into there to do a more comprehensive test.
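A test program driven by NormalizationTest.txt could look roughly like the Python sketch below. It checks the conformance invariants stated in the test file's header (five semicolon-separated columns c1..c5 per data line). The function names are hypothetical, and `unicodedata.normalize` is only a stand-in; a real test for Mozilla would call its own normalizer instead.

```python
import unicodedata


def parse_fields(line: str):
    """Return the five columns c1..c5 of a data line as strings, or None."""
    line = line.split("#", 1)[0].strip()
    if not line or line.startswith("@"):
        return None  # comment, blank line, or "@PartN" marker
    cols = line.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in cols]


def check_line(line: str, normalize=unicodedata.normalize) -> bool:
    """Check the NFC/NFD/NFKC/NFKD invariants for one line of the file."""
    fields = parse_fields(line)
    if fields is None:
        return True
    c1, c2, c3, c4, c5 = fields
    return (
        all(normalize("NFC", x) == c2 for x in (c1, c2, c3))
        and all(normalize("NFC", x) == c4 for x in (c4, c5))
        and all(normalize("NFD", x) == c3 for x in (c1, c2, c3))
        and all(normalize("NFD", x) == c5 for x in (c4, c5))
        and all(normalize("NFKC", x) == c4 for x in fields)
        and all(normalize("NFKD", x) == c5 for x in fields)
    )
```

Running `check_line` over every line of a locally downloaded copy of the file avoids checking the 2 MB data file into the tree while still exercising the full test set.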
Comment 31•19 years ago (Assignee)
(In reply to comment #28)
> > All of them (with the possible exception of
> > the first one in the BMP) are rarely used, but ....
>
> They may be rare, but that is not the point. The point is that we seem to be
> making changes blindly here. I
That's what 'but ...' was for :-)
> > Hmm.. what do you think we need to do here? Go ahead proactively or just wait
> > for the IETF to make a decision. (http://www.unicode.org/unicode/reports/tr15/ :
> > Unicode consortium approved the PR #29 and it was reflected in UAX 15. There's
> > another revision -hopefully requiring no change on our part - in the draft
> > stage)
>
> What revision are you talking about? Please give me a pointer.
I already sorta gave one :-). The 'revision history' in the latest version of UAX #15 explicitly mentions that PR #29 was taken into account (see http://www.unicode.org/unicode/reports/tr15/#Modifications ). You can also find the same information at http://www.unicode.org/review/resolved-pri.html where you can also find PRI #61, which was also approved but whose content hasn't yet been published in UAX #15 ( http://www.unicode.org/reports/tr15/tr15-24.html ).
Comment 32•19 years ago
Simon, the normalization test file is 2 MB. I wouldn't check it into the tree. Maybe you didn't mean to do that. (But it would be nice to check in a program that performs the tests.)
Jungshik, tr15-24.html is the previous version (a draft). The newest version is tr15-25.html, and PR-29 is reflected in that one. Look for the numbers 24 and 25 near the top. In 24, it's called "Tracking Number", in 25 "Revision".
There is at least one implementor who believes that this is an incompatible change. Even though Unicode has decided to correct tr15, the IETF has not decided how to address that change in Stringprep and its profiles (including Nameprep). See the thread "stringprep: PRI #29" of March 19, 2005 in the IDN mailing list archive:
http://ops.ietf.org/lists/idn/idn.2005/maillist.html
Gerv has indicated that an IAB Working Group is looking into Stringprep and Nameprep revisions:
http://weblogs.mozillazine.org/gerv/archives/007785.html
At least one of the Stringprep profile authors appears to be aware of PR-29. See section 4 of:
http://ietf.org/rfc/rfc4013.txt
Stringprep itself specifically refers to tr15-22.html:
http://ietf.org/rfc/rfc3454.txt
Comment 33•19 years ago
Jungshik, it might be a good idea to file one or more new bugs to address PR-29, NormalizationCorrections.txt and the fact that Stringprep does not allow characters that were unassigned in Unicode 3.2.0 in "stored strings" (though they are allowed in "queries"). See section 7 of Stringprep. I suspect that Mozilla code would mostly use the IDN routines in "queries", but if any server, client or other program based on Mozilla code ever uses "stored strings", it would be illegal to store characters outside 3.2.0. If we decide to comply to this extent, we may need to support both 3.2.0 and 4.1 in mozilla/intl. See also the part about normalization in section 10 of IDNA: http://ietf.org/rfc/rfc3490.txt If you like, I can file the bugs.
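The "stored strings" versus "queries" distinction can be illustrated with Python's standard `stringprep` module, whose `in_table_a1` implements RFC 3454 table A.1 (code points unassigned in Unicode 3.2). The wrapper functions below are hypothetical names for this sketch and say nothing about how Mozilla's IDN code is structured.

```python
import stringprep


def has_unassigned_3_2(text: str) -> bool:
    """True if any character was unassigned in Unicode 3.2 (RFC 3454 table A.1)."""
    return any(stringprep.in_table_a1(ch) for ch in text)


def allowed(text: str, stored: bool) -> bool:
    """Queries may contain unassigned code points; stored strings may not."""
    return not (stored and has_unassigned_3_2(text))


# U+0221 is listed in table A.1, so it is only acceptable in a query.
query_ok = allowed("\u0221", stored=False)
stored_ok = allowed("\u0221", stored=True)
```

A check of this kind would be needed only on code paths that produce stored strings, which is why supporting both 3.2.0 and a newer data set side by side comes up.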
Comment 34•19 years ago
See bug 326207. Jungshik, would you like to request 1.8 branch approval for this, or shall we just wait for Unicode 5, due for release at the end of next month?
Comment 35•18 years ago (Assignee)
(In reply to comment #34)
> See bug 326207. Jungshik, would you like to request 1.8 branch approval for
> this, or shall we just wait for Unicode 5, due for release at the end of next
> month?
I'm not sure. Either way is fine with me. Even though we have to bother drivers twice, I'm slightly more inclined to get approval now. BTW, we have to revert to Unicode 3.2 for normalization as pointed out by Erik, don't we? So, only attachment 179704 needs to go in. For the trunk, I'll file a new bug to revert to Unicode 3.2.0 and change 'README.txt' accordingly. And, there is an issue with UTR PR#29.
I recommend trying to get this in to the 1.8 branch.