Closed Bug 724826 Opened 12 years ago Closed 12 years ago

nsCategoryImp returns incorrect General Category for certain Unicode values

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla13

People

(Reporter: jfkthame, Assigned: jfkthame)

Details

Attachments

(3 files, 1 obsolete file)

I've noticed a couple of issues with the Unicode General Category service as implemented by nsCategoryImp, using GetCat() defined in intl/unicharutil/src/cattable.h.

First, the comment in intl/unicharutil/public/nsIUGenCategory.h
  http://mxr.mozilla.org/mozilla-central/source/intl/unicharutil/public/nsIUGenCategory.h#65
says that unassigned code points (GC=Cn in the Unicode database) should return kOther (4), which corresponds to the documentation at http://unicode.org/reports/tr44/#General_Category_Values, where Cn (Unassigned) is included in the group C (Other).

However, unassigned code points in fact return kUndefined (0). I assume this is because they are not explicitly listed in the UnicodeData.txt file, and the gencattable.pl script fails to fill in 4 (instead of 0) for characters that are not mentioned.

Second, and more seriously IMO, the service also returns the wrong value for the CJK COMPATIBILITY IDEOGRAPH block from U+F900..FAD9. These have GC=Lo, so should return kLetter (5), but actually return kUndefined (0). This appears to be an error in the GetCat() function, which includes a fragment intended to handle this block:
  http://mxr.mozilla.org/mozilla-central/source/intl/unicharutil/src/cattable.h#4166
but never reaches this, because it maps those codes earlier using a table (that apparently does not provide the correct result):
  http://mxr.mozilla.org/mozilla-central/source/intl/unicharutil/src/cattable.h#4045

I'm not sure offhand if there are any user-visible behavior issues that can be traced to these errors, but it seems possible that classifying these ideographs as "undefined" instead of "letter" could have repercussions somewhere.

(These issues are not new with the Unicode 6.1 update in bug 723509, nor are they fixed by it; they appear to be long-standing flaws in the table-generating tool.)
One approach here, at least as a short-term fix while we consider possibly moving to ICU for Unicode support, would be to make use of the character properties in gfxUnicodeProperties, originally added to support harfbuzz. The General Category implementation there does not suffer from the bugs in cattable.h, and also happens to be more efficient, particularly when dealing with higher Unicode blocks and planes. (In the ASCII range, the two implementations have essentially identical performance, but the GetCat function in cattable.h degrades seriously for higher ranges, whereas the gfxUnicodeProperties one doesn't.)

Simon, WDYT? It seems a bit untidy to have intl/unicharutil relying on gfx in this way, but with everything being part of libxul it should work OK.

We could consider moving everything that's in gfxUnicodeProperties over to the intl subtree, as long as we don't add anything to the cost of looking up the properties during shaping (such as going through a virtual call), as that's very performance-sensitive. But if we do move to ICU, that'll replace all of this anyway, so for now I'm inclined to take the easiest approach.
Attachment #595001 - Flags: review?(smontagu)
I really would rather move the gfxUnicodeProperties stuff into intl, if you wouldn't mind.
OK, here's my suggested approach, in three stages. First, this patch moves gfxUnicodeProperties.cpp and associated files over to the intl tree, renamed to nsUnicodeProperties.cpp; also, instead of a class full of static members, it seems more sensible to just have a "unicode" namespace with the property-access functions. So this patch is basically just moving and renaming.
Attachment #595001 - Attachment is obsolete: true
Attachment #595001 - Flags: review?(smontagu)
Attachment #599988 - Flags: review?(smontagu)
This fixes the cases where the old code returns incorrect GenCategory values, by removing cattable.h and the tool that generates it, and relying on the property data from nsUnicodeProperties instead. (As a bonus, the lookups are slightly faster, particularly with higher ranges of Unicode.) So this patch is almost entirely code removal. :)
Attachment #599991 - Flags: review?
Finally, for nsIUGenCategory users within libxul, there's really no reason to get a "service" and call through that; they can simply use the property-access functions directly. This should shave the odd nanosecond off here and there.

At this point, the nsIUGenCategory interface becomes unused (except that we're still using its enums for the categories), but I don't propose to remove it completely as there could easily be external code in other products/extensions/whatever that is relying on it. But using that extra indirection within our own code is gaining us nothing whatsoever.
Attachment #599995 - Flags: review?(smontagu)
Attachment #599991 - Flags: review? → review?(smontagu)
Attachment #599988 - Flags: review?(smontagu) → review+
Attachment #599991 - Flags: review?(smontagu) → review+
Comment on attachment 599995 [details] [diff] [review]
part 3 - general category clients within libxul can call the property accessors directly

Review of attachment 599995 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/peptest/tests/firefox/server/mozilla.org/index.html.orig
@@ -4,5 @@
> -Test mozilla home page.
> -</div>
> -</body>
> -</html>
> -

This file seems to be extraneous
Attachment #599995 - Flags: review?(smontagu) → review+
Comment on attachment 599991 [details] [diff] [review]
part 2 - use mozilla::unicode functions instead of cattable.h in nsGenCategoryImp

Review of attachment 599991 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/unicharutil/src/nsUnicodeProperties.cpp
@@ +74,5 @@
>  
> +nsIUGenCategory::nsUGenCategory sDetailedToGeneralCategory[] = {
> +  /*HB_CATEGORY_CONTROL*/             nsIUGenCategory::kOther,
> +  /*HB_CATEGORY_FORMAT*/              nsIUGenCategory::kOther,
> +  /*HB_CATEGORY_UNASSIGNED*/          nsIUGenCategory::kOther,

All the names in comments here are not the actual defined names in gfx/harfbuzz/src/hb-common.h, which are HB_UNICODE_GENERAL_CATEGORY_CONTROL, HB_UNICODE_GENERAL_CATEGORY_FORMAT etc. I understand if you wanted to keep each line within 80 characters, but can you at least add a meta-comment pointing to the actual names and/or their definition?
(In reply to Simon Montagu from comment #6)
> This file seems to be extraneous

Oops!

(In reply to Simon Montagu from comment #7)
> All the names in comments here are not the actual defined names in
> gfx/harfbuzz/src/hb-common.h, which are HB_UNICODE_GENERAL_CATEGORY_CONTROL,
> HB_UNICODE_GENERAL_CATEGORY_FORMAT etc. I understand if you wanted to keep
> each line within 80 characters, but can you at least add a meta-comment
> pointing to the actual names and/or their definition?

I'll fix them; I think that's better. It wasn't an intentional mismatch. The HB names changed when we took the update in bug 695857, but I already had the old ones in an earlier version of this patch - and of course the change didn't cause a compilation failure, so I overlooked updating them.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: