Closed Bug 738101 Opened 12 years ago Closed 12 years ago

Add support for more Unicode properties

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla15

People

(Reporter: smontagu, Assigned: smontagu)

References

Details

Attachments

(4 files, 4 obsolete files)

For the new IDNA display algorithm we need to be able to retrieve the character properties from http://www.unicode.org/Public/security/beta/xidmodifications.txt for identifying restricted characters as described at http://www.unicode.org/reports/tr39/#General_Security_Profile.
The data file we should be using at least for now is http://www.unicode.org/Public/security/latest/xidmodifications.txt
I could easily add this to the character properties available via nsUnicodeProperties - does that seem like the best way forward here?
Summary: Add support for Unicode Identifier Modification property → Add support for more Unicode property
I'm morphing this to add GetNumericValue, GetHanVariant (which will return whether characters are Simplified Chinese only or Traditional Chinese only), both of which we want for bug 722299; and also at jfkthame's suggestion to move GetBidiCat into GetUnicodeProperties and consolidate a whole lot of the properties there into a single table.

w-i-p pushed to try: https://hg.mozilla.org/try/rev/06e82a916179

Do we have good regression tests for combining classes or should I be testing that stuff manually?
Summary: Add support for more Unicode property → Add support for more Unicode properties
Attached patch Patch to UnicodeProperties (obsolete) — Splinter Review
Assignee: nobody → smontagu
Attachment #609761 - Flags: review?(jfkthame)
Attached patch Use the new GetBidiCat (obsolete) — Splinter Review
Attachment #609762 - Flags: review?(jfkthame)
Comment on attachment 609761 [details] [diff] [review]
Patch to UnicodeProperties

Review of attachment 609761 [details] [diff] [review]:
-----------------------------------------------------------------

I like this, but I think we can polish it a bit more - see suggestions below, unless you find they don't work out well in practice.

::: gfx/harfbuzz/src/hb-common.h
@@ +303,5 @@
>  
>  hb_script_t
>  hb_script_from_iso15924_tag (hb_tag_t tag);
>  
> +/* sugar for tag_from_string() then script_from_iso15924_tag */

Let's not do this here (although typos do annoy me!) - we should get Behdad to fix it upstream instead, and then pick it up that way.

::: intl/unicharutil/tools/genUnicodePropertyData.pl
@@ +163,5 @@
> +  "RLE" => "15", # Right-to-Left Embedding
> +  "PDF" => "15", # Pop Directional Format
> +  "LRO" => "15", # Left-to-Right Override
> +  "RLO" => "15"  # Right-to-Left Override
> +);

I realize this is following how bidicattable did things, but I really think the property table should return separate values for the LRE..RLO control codes, which are distinct bidi types in the UBA (http://www.unicode.org/reports/tr9/#Table_Bidirectional_Character_Types).

(Which will then mean an adjustment in nsBidiUtils, but I think it'll be cleaner overall.)

@@ +579,2 @@
>  }
> +&genTables("CharProp", "struct nsCharProps {\n  unsigned char mEAW:3;\n  unsigned char mCategory:5;\n  unsigned char mBidiCategory:4;\n  unsigned char mXidmod:4;\n  signed char mNumericValue:5;\n  unsigned char mCombiningClass:8;\n  unsigned char mHanVariant:2;\n};", "nsCharProps", 11, 5, \&sprintCharProps, 16);

Expanding mBidiCategory to support the full 19 values used in the UBA will require it to be 5 bits. I think all these fields still fit within 32 bits, though.

Also, we should rearrange the fields in such a way that mCombiningClass (being an 8-bit field) avoids straddling a byte boundary.

Actually, it might be better to put Script rather than CombiningClass into the CharProps structure, as CC doesn't require tables for the full 16 planes. Did you look at this both ways?

And while we're here, how about the DefaultIgnorable property? Only requires a single bit. Can we squeeze that in somehow? I think you can reduce mNumericValue to 4 bits to free up one for it.

@@ +583,5 @@
>  {
>    my $usv = shift;
>    return sprintf("%d,", $hangul[$usv]);
>  }
> +&genTables("Hangul", "", "PRUint8", 10, 6, \&sprintHangulType, 0);

And it might be advantageous to combine Mirror and HangulType (both of which only have values for Plane 0) into a single 16-bit record.
Comment on attachment 609762 [details] [diff] [review]
Use the new GetBidiCat

Review of attachment 609762 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/unicharutil/util/nsBidiUtils.cpp
@@ +153,5 @@
>    // This method is used when stripping Bidi control characters for
>    // display, so it will return TRUE for LRM and RLM as
>    // well as the characters with category eBidiCat_CC
> +  return (eBidiCat_CC == mozilla::unicode::GetBidiCat(aChar) ||
> +          ((aChar)&0xfffffe)==LRM_CHAR);

This will need adjustment if we make GetBidiCat return "real" UAX#9 categories, with the control codes having individual values....

@@ +180,1 @@
>    if (eBidiCat_CC != bCat) {

...but we'll be able to get rid of the extra complexity here - in fact, if we ensure that the GetBidiCat return values correspond to the nsCharType enumeration, we won't need an extra mapping table here at all.
(In reply to Jonathan Kew (:jfkthame) from comment #6)
> ::: gfx/harfbuzz/src/hb-common.h
> @@ +303,5 @@
> >  
> >  hb_script_t
> >  hb_script_from_iso15924_tag (hb_tag_t tag);
> >  
> > +/* sugar for tag_from_string() then script_from_iso15924_tag */
> 
> Let's not do this here (although typos do annoy me!) - we should get Behdad
> to fix it upstream instead, and then pick it up that way.

I don't know if this is at all relevant, but be advised of the existence of bug 739861.
Following a suggestion by jfkthame I tried introducing mistakes into the patch and sending to tryserver to assess how much test coverage we have for these properties.

GetHangulSyllableType caused 1 failure in layout/reftests/first-letter/329062-2.html
GetCombiningClass caused 2 failures in layout/reftests/bidi/bidi-004[-j].html (Windows XP only)
GetEastAsianWidth caused no failures.
Assignee: smontagu → nobody
Component: Internationalization → Installer: XPInstall Engine
QA Contact: i18n → xpi-engine
Component: Installer: XPInstall Engine → Internationalization
QA Contact: xpi-engine → i18n
Assignee: nobody → smontagu
Attached patch Patch to UnicodeProperties v.2 (obsolete) — Splinter Review
This adopts all the suggestions from comment 6, except for adding a bit for DefaultIgnorable -- we already have that as one of the values for GetIdentifierModification and I'll be attaching a patch in a second to use that instead of the existing DefaultIgnorable implementation.
Attachment #609761 - Attachment is obsolete: true
Attachment #609761 - Flags: review?(jfkthame)
Attachment #616577 - Flags: review?(jfkthame)
Attachment #609762 - Attachment is obsolete: true
Attachment #609762 - Flags: review?(jfkthame)
Attachment #616582 - Flags: review?(jfkthame)
This removes the last client of nsCompressedCharMap, so I'm removing all the infrastructure from that as well.
Attachment #616583 - Flags: review?(jfkthame)
(In reply to Simon Montagu from comment #10)
> Created attachment 616577 [details] [diff] [review]
> Patch to UnicodeProperties v.2
> 
> This adopts all the suggestions from comment 6, except for adding a bit for
> DefaultIgnorable -- we already have that as one of the values for
> GetIdentifierModification and I'll be attaching a patch in a second to use
> that instead of the existing DefaultIgnorable implementation.

Just wondering, do you have figures for the resulting table sizes on hand? I'm curious how this is working out in terms of footprint.
I have the output of genUnicodePropertyData.pl:

Before:
Data for Script = 27984
Data for CClass = 6657
Data for Mirror = 2944
Data for CatEAW = 28304
Data for Hangul = 1984
Data for CaseMap = 12929
(total 80802)

After:
Data for CClass = 6657
Data for MirrorHangul = 3648
Data for CharProp = 72144
Data for CaseMap = 12929
(total 95378)
Comment on attachment 616577 [details] [diff] [review]
Patch to UnicodeProperties v.2

Review of attachment 616577 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great, but I think we can go a little further - see suggestion for hangul/mirror packing.

Also a few other minor comments you may want to consider.

::: intl/unicharutil/tools/genUnicodePropertyData.pl
@@ +570,2 @@
>  }
> +&genTables("MirrorHangul", "struct nsMirrorHangul {\n  PRInt8 mMirrorValue;\n  PRUint8 mHangulType;\n};", "nsMirrorHangul", 10, 6, \&sprintMirrorHangul, 0);

Hmmm. mHangulType really only needs 3 bits; it seems a shame to keep spending 8 bits on it.

And as it happens, mMirrorValue can be squeezed into 5 bits without too much trouble. If you change smallMirrorOffset to 4 instead of 64, we'll still get the majority of the mirror offsets (+/- 1 or 2) packed directly into the field. The count of "distant" mirrors only rises from 14 to 20. So we can make the mMirrorValue a 5-bit field, and store either a "small" offset (signed value in the range -4..3 when interpreted as signed, or 0..3 and 28..31 if interpreted as unsigned) or the "distant" index + 4 (as an unsigned 5-bit value, with the range 4..27 being available, which is sufficient). Or add a bias of 4 to the small mirror values, so that they're stored using the range 0..7, and 8 to the distant indexes. That's maybe a bit less obscure, and allows us to test whether a value is "small" or "distant" with one comparison instead of two.

And with mHangulType and mMirrorValue packed into a single byte, it's probably advantageous to combine this and the combining class into a single table whose value is a 16-bit struct, as there will be extensive ranges where all these properties are zero.

(If a future version of Unicode adds several more mirrored pairs, this won't fit any longer and we'll need to revise the packing strategy, but I doubt many mirrored pairs are being added these days. Just make sure the tool checks that the values fit within the available bits, so that we're alerted if this happens.)

@@ +653,5 @@
>    print DATA_TABLES "};\n\n";
>  
> +  unless (length($typedef) == 0) {
> +    print DATA_TABLES "$typedef\n\n";
> +  }

IMO, this would be more idiomatic as

  print DATA_TABLES "$typedef\n\n" if $typedef ne '';

::: intl/unicharutil/util/nsUnicodeProperties.cpp
@@ +118,5 @@
> +        return sMirrorHangulValues[sMirrorHangulPages[0][aCh >> kMirrorHangulCharBits]]
> +                                  [aCh & ((1 << kMirrorHangulCharBits) - 1)];
> +    }
> +
> +    NS_NOTREACHED("Tried to get Bidi mirroring or Hangul type outside BMP");

I'd suggest removing this assertion and letting the clients of this function rely on it returning the proper result for the full Unicode range (as it actually does); then they don't need to do their own separate range check before calling it.

@@ +128,5 @@
>  PRUint32
>  GetMirroredChar(PRUint32 aCh)
>  {
>      // all mirrored chars are in plane 0
>      if (aCh < UNICODE_BMP_LIMIT) {

No need for the test, GetMirrorHangul does it.

@@ +170,5 @@
> +        return sCharPropValues[sCharPropPages[sCharPropPlanes[(aCh >> 16) - 1]]
> +                                             [(aCh & 0xffff) >> kCharPropCharBits]]
> +                              [aCh & ((1 << kCharPropCharBits) - 1)];
> +    }
> +

As there are specific CharProps for characters in Plane 16, we should never reach here unless someone is calling this with a character code that's outside the *full* Unicode range.

AFAIK, that should never happen, and we could reasonably have a NOTREACHED assertion here (but keep the safe-return value, too).

@@ +188,4 @@
>  PRUint8
>  GetGeneralCategory(PRUint32 aCh)
>  {
> +    return GetCharProps(aCh).mCategory;

I think we should put these single-line accessors into the .h file as inline functions.

@@ +241,5 @@
>  GetHangulSyllableType(PRUint32 aCh)
>  {
>      // all Hangul chars are in plane 0
>      if (aCh < UNICODE_BMP_LIMIT) {
> +        return HSType(GetMirrorHangul(aCh).mHangulType);

We could drop the UNICODE_BMP_LIMIT check here and just rely on GetMirrorHangul returning the appropriate default for non-BMP characters. Then this becomes a single-line accessor (suitable for inline treatment), too.

::: intl/unicharutil/util/nsUnicodeProperties.h
@@ +54,5 @@
>  PRUint32 GetMirroredChar(PRUint32 aCh);
>  
> +nsCharProps GetCharProps(PRUint32 aCh);
> +
> +nsMirrorHangul GetMirrorHangul(PRUint32 aCh);

I don't think there's any reason to expose these publicly, is there? I'd prefer to consider them an internal implementation detail, and hide them from the public namespace.
Comment on attachment 616582 [details] [diff] [review]
Use the new GetBidiCat v.2

Review of attachment 616582 [details] [diff] [review]:
-----------------------------------------------------------------

What a lovely lot of deleted code. :)
Attachment #616582 - Flags: review?(jfkthame) → review+
Comment on attachment 616583 [details] [diff] [review]
Use the new GetIdentifierModification for defaultIgnorable

Review of attachment 616583 [details] [diff] [review]:
-----------------------------------------------------------------

Likewise.
Attachment #616583 - Flags: review?(jfkthame) → review+
(In reply to Jonathan Kew (:jfkthame) from comment #15)
> And as it happens, mMirrorValue can be squeezed into 5 bits without too much
> trouble. If you change smallMirrorOffset to 4 instead of 64, we'll still get
> the majority of the mirror offsets (+/- 1 or 2) packed directly into the
> field. The count of "distant" mirrors only rises from 14 to 20.

Apparently I goofed when I came up with this count. There would actually be about 26 distant mirrors, which won't quite fit in the available bits using the packing scheme I suggested.

So reconsidering the strategy here.....I did a count of the mirror offsets for the complete Unicode 6.1 database, to see what we're actually working with:

MirrorOffset: -2108 occurrences: 1
MirrorOffset: -2106 occurrences: 2
MirrorOffset: -2104 occurrences: 1
MirrorOffset: -2016 occurrences: 1
MirrorOffset: -1824 occurrences: 1
MirrorOffset: -138 occurrences: 1
MirrorOffset: -16 occurrences: 1
MirrorOffset: -8 occurrences: 3
MirrorOffset: -7 occurrences: 2
MirrorOffset: -3 occurrences: 4
MirrorOffset: -2 occurrences: 7
MirrorOffset: -1 occurrences: 158
MirrorOffset: 1 occurrences: 158
MirrorOffset: 2 occurrences: 7
MirrorOffset: 3 occurrences: 4
MirrorOffset: 7 occurrences: 2
MirrorOffset: 8 occurrences: 3
MirrorOffset: 16 occurrences: 1
MirrorOffset: 138 occurrences: 1
MirrorOffset: 1824 occurrences: 1
MirrorOffset: 2016 occurrences: 1
MirrorOffset: 2104 occurrences: 1
MirrorOffset: 2106 occurrences: 2
MirrorOffset: 2108 occurrences: 1

Note that there are only 24 unique offsets altogether. So we can pack this into 5 bits per character by storing *all* of them as 5-bit indexes into a small table of PRInt16 mirror-offset values (rather than actual mirrored-character codes). We may as well make that a table of 25 mirror-offsets, including zero as the first entry, so that all characters in the covered planes are handled in the same way - look up the mirror-offset index in the 5-bit field of the table, use this to look up the offset value in the table of offsets, and add that value to the original character. This means all uses of GetMirroredChar will go through that auxiliary lookup, but on the other hand it avoids the need for a conditional to decide whether the value is "small" or "distant", so overall I think it's pretty well equivalent in efficiency.
There are problems here after bug 745780: rerunning genUnicodePropertyData.pl with the new hb-common.h changes many of the values of the MOZ_SCRIPT_ enum, which makes the static asserts at http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxPangoFonts.cpp#2160 fail.
Drat. OK, well, that's why I put the static asserts in there - so we'd know if this broke. I guess it's time to revise how the tool comes up with MOZ_SCRIPT_* values, if we can't base this on the list in the harfbuzz header any more. I'll look into it, unless you get there first....
Depends on: 747834
Attached patch Patch to UnicodeProperties v.3 (obsolete) — Splinter Review
Addressed review comments.

The figures for data size in comment 14 are hogwash because the code that calculates the sizes makes the assumption that every datatype that doesn't have '32' in the name is one byte long. With that corrected, the figures are:

Data for CharProp1 = 13057
Data for CharProp2 = 245520
Data for CaseMap = 12929
Total data = 271506
Attachment #616577 - Attachment is obsolete: true
Attachment #616577 - Flags: review?(jfkthame)
Attachment #617464 - Flags: review?(jfkthame)
The huge size of CharProp2 here is due to the HanVariant data, which varies "randomly" across huge ranges of the Unicode space, and thereby disrupts the packing scheme we're using. I'm planning to try alternative ways to pack that data so that we don't bloat the footprint so badly.
We need a more comprehensive test than we have for bidi mirroring.
Attachment #617496 - Flags: review?(jfkthame)
Here's an alternative patch that makes the data more compact. The key is to remove HanVariant from the CharProps2 struct, because those two bits cause a lot of "noise" and result in an excessive number of 32-bit CharProps2 records for Han characters that vary *only* in these two bits - but occur in varying places in the table rows, so the rows can't be shared.

As a result, it's better to "waste" those two bits in the CharProps2 record - at least until we think of another use for them - and store the HanVariant data separately. We can pack HanVariant values for 4 characters into a single byte, and then look up those bytes using the same multi-level array scheme as our other properties. Extracting the value becomes marginally more complex, but the space savings are substantial.

Data size reported for this version:

Data for CharProp1 = 13057
Data for CharProp2 = 102800
Data for HanVariant = 14818
Data for CaseMap = 12929
Total data = 143604

I've checked that GetHanVariant() returns the same values as the Perl tool originally had in its $hanVariant array for the entire first 3 planes of Unicode, so I think the logic here is OK. :)

Tryserver push that includes your patches and tests from bug 722299 is at https://tbpl.mozilla.org/?tree=Try&rev=88d6d1a46992.
Attachment #617635 - Flags: review?(smontagu)
And with your additional patches from this bug, which I forgot to include last time: https://tbpl.mozilla.org/?tree=Try&rev=2feeffe25050
Comment on attachment 617635 [details] [diff] [review]
patch, add support for more Unicode properties, v4

Review of attachment 617635 [details] [diff] [review]:
-----------------------------------------------------------------

r=me for Jonathan's changes.
Attachment #617635 - Flags: review?(smontagu) → review+
Attachment #617496 - Flags: review?(jfkthame) → review+
Attachment #617464 - Attachment is obsolete: true
Attachment #617464 - Flags: review?(jfkthame)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: