Closed Bug 553981 Opened 14 years ago Closed 14 years ago

first-letter splits up Hangul Jamo sequences

Categories

(Core :: Layout: Text and Fonts, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla2.0b8

People

(Reporter: smontagu, Assigned: jfkthame)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 3 obsolete files)

Spun off from bug 553834. See also attachment 213727 [details] in bug 329069.
Correcting the styles in the test makes it pass on Linux (only) on m-c.
(It doesn't pass on 1.9.2, so something has been fixed.)
http://hg.mozilla.org/mozilla-central/rev/7b49ee09259b
Blocks: 329069
This should be fixed by enhancing gfxPlatform::SetupClusterBoundaries() to handle Hangul syllables. Starting on a patch...
This patch should also fix bug 614468.
Assignee: nobody → jfkthame
Attachment #495020 - Flags: review?(karlt)
Blocks: 614468
Comment on attachment 495020 [details] [diff] [review]
patch, enhance gfxPlatform::SetupClusterBoundaries to support Hangul, etc

> # read Scripts.txt
>-my %scriptAliases = (
>-    'MEETEI_MAYEK' => 'MEITEI_MAYEK'
>-);
>-
> open FH, "< $ARGV[1]/Scripts.txt" or die "can't open UCD file Scripts.txt\n";
> while (<FH>) {
>     if (m/([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))*\s+;\s+([^ ]+)/) {
>         my $script = uc($3);
>-        $script = $scriptAliases{$script} if exists $scriptAliases{$script};

Can you leave a note in this bug, please, explaining why this is no longer
needed?

>         // 8-bit text doesn't have clusters.
>         // XXX is this true in all languages???
>         // behdad: don't think so.  Czech for example IIRC has a
>         // 'ch' grapheme.
>+        // jfkthame: but that's not expected to behave as a grapheme cluster
>+        // for selection/editing/etc.

FWIW "Slovak ch digraph" seems to be grouped with other clusters that I expect
would normally behave as a cluster for selection and first-letter.
http://www.unicode.org/reports/tr29/tr29-17.html#Grapheme_Cluster_Boundaries

Could we get "control" (incl Cf) characters here?  GB4 (and GB5 wrt Prepend if
the Thai/Lao code were enabled) are not handled.  Those are probably unusual
situations so no need to address that here.

>+        PRUint8 category = gfxUnicodeProperties::GetGeneralCategory(ch);

I would suggest "hb_category_t category =" for consistency in the comparisons
below.  Logically the relational operators will perform integral promotion of
operands anyway as part of "usual arithmetic conversions".  How much
difference this makes to compiled code and warnings about comparison between
different types, I don't know.

>+            if ((ch & ~0xff) == 0x1100 ||
>+                       (ch >= 0xa960 && ch <= 0xa97f) ||
>+                       (ch >= 0xac00 && ch <= 0xd7ff))

Indentation.

>+                    if ( (hangulState & gfxUnicodeProperties::LV) &&
>+                        !(hangulState & gfxUnicodeProperties::T))
>+                    {
>+                        aTextRun->SetGlyphs(i, extendCluster, nsnull);
>+                        hangulState = gfxUnicodeProperties::V;
>+                    }

  if (hangulState &&
      !(hangulState & gfxUnicodeProperties::T))

is a little simpler, or perhaps with hangulState != NONE for clarity.

IIU GB6 - 8 correctly, hangulState should be set here regardless of the old
hangulState.

>+                    if (hangulState & (gfxUnicodeProperties::V |
>+                                       gfxUnicodeProperties::T))
>+                    {
>+                        aTextRun->SetGlyphs(i, extendCluster, nsnull);
>+                        hangulState = gfxUnicodeProperties::T;
>+                    }

Similarly for hangulState here.

It may be simplest to always initialize hangulType to NONE at the start of the
loop.  That way hangulState can be set to hangulType at the end of the loop
and we don't need to make sure we have a
"hangulState = gfxUnicodeProperties::NONE" for every case.

>+    typedef enum {
>+        NONE = 0x00,
>+        L    = 0x01,
>+        V    = 0x02,
>+        T    = 0x04,
>+        LV   = 0x03,
>+        LVT  = 0x07
>+    } HSType;

The normal C++ idiom is enum HSType { ... }.
That might lead to more helpful compiler messages than a typedef for an
anonymous enum (but I'm don't know for certain).

This claims the identifiers gfxUnicodeProperties::NONE,
gfxUnicodeProperties::L, etc, which may lead to conflicts in the future.

How about HST_NONE/HST_L/etc.?

r+ if you explain MEETEI_MAYEK, fix up the indentation, always set hangulState
for each character, and are happy with HST_*.

I'll leave the other things to your discretion.
Attachment #495020 - Flags: review?(karlt) → review+
(In reply to comment #6)

> > # read Scripts.txt
> >-my %scriptAliases = (
> >-    'MEETEI_MAYEK' => 'MEITEI_MAYEK'
> >-);
> >-
> > open FH, "< $ARGV[1]/Scripts.txt" or die "can't open UCD file Scripts.txt\n";
> > while (<FH>) {
> >     if (m/([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))*\s+;\s+([^ ]+)/) {
> >         my $script = uc($3);
> >-        $script = $scriptAliases{$script} if exists $scriptAliases{$script};
> 
> Can you leave a note in this bug, please, explaining why this is no longer
> needed?

The name of the constant defined in hb-unicode.h was updated to match the spelling used in the Scripts.txt data file, so we no longer need to treat it as an "alias" when reading the Unicode data and relating this to the values in the harfbuzz header file.
Blocks: 614476
Attached patch patch for checkin (obsolete) — Splinter Review
From http://hg.mozilla.org/try/rev/ec6f06e5e2c3 which addresses review comments.

Also marked newly passing reftests.

Requesting approval2.0 so that bug 569770 can land without regression from bug 614468.
Attachment #496069 - Flags: approval2.0?
(In reply to comment #8)
> Created attachment 496069 [details] [diff] [review]
> patch for checkin
> 
> From http://hg.mozilla.org/try/rev/ec6f06e5e2c3 which addresses review
> comments.

Note that on tryserver, this caused crashtest orange on Linux debug (only) - an assertion "Glyphruns not coalesced" in gfx/tests/crashtests/419255-1.html. I've been looking into that, and believe it is (a) harmless, and (b) caused by a separate underlying bug in textrun construction, which I'll file shortly.

We should still land this, but be aware that we'll need to mark the assertion in the crashtest manifest until the issue causing it is fixed.
Attachment #496069 - Flags: approval2.0?
Comment on attachment 496069 [details] [diff] [review]
patch for checkin

>@@ -2377,16 +2377,20 @@ gfxFontGroup::InitTextRun(gfxContext *aC
>     gfxScriptItemizer scriptRuns(aString, aLength);
> 
>     PRUint32 runStart = 0, runLimit = aLength;
>     PRInt32 runScript = HB_SCRIPT_LATIN;
>     while (scriptRuns.Next(runStart, runLimit, runScript)) {
>         InitTextRun(aContext, aTextRun, aString, aLength,
>                     runStart, runLimit, runScript);
>     }
>+
>+    // Is this actually necessary? Without it, gfxTextRun::CopyGlyphDataFrom may assert
>+    // "Glyphruns not coalesced", but does that matter?
>+    aTextRun->SortGlyphRuns();

>@@ -2438,20 +2442,16 @@ gfxFontGroup::InitTextRun(gfxContext *aC
>     }
> 
>     // It's possible for CoreText to omit glyph runs if it decides they contain
>     // only invisibles (e.g., U+FEFF, see reftest 474417-1). In this case, we
>     // need to eliminate them from the glyph run array to avoid drawing "partial
>     // ligatures" with the wrong font.
>     aTextRun->SanitizeGlyphRuns();
> 
>-    // Is this actually necessary? Without it, gfxTextRun::CopyGlyphDataFrom may assert
>-    // "Glyphruns not coalesced", but does that matter?
>-    aTextRun->SortGlyphRuns();

Sounds like we should remove this part of the patch.
This is not needed for these bugs, right?
It looked right to me, but sounds like it is causing enough unexpected consequences that it should be dealt with separately.
(In reply to comment #10)
> Comment on attachment 496069 [details] [diff] [review]
> patch for checkin
> 
> >@@ -2377,16 +2377,20 @@ gfxFontGroup::InitTextRun(gfxContext *aC
> >     gfxScriptItemizer scriptRuns(aString, aLength);
> > 
> >     PRUint32 runStart = 0, runLimit = aLength;
> >     PRInt32 runScript = HB_SCRIPT_LATIN;
> >     while (scriptRuns.Next(runStart, runLimit, runScript)) {
> >         InitTextRun(aContext, aTextRun, aString, aLength,
> >                     runStart, runLimit, runScript);
> >     }
> >+
> >+    // Is this actually necessary? Without it, gfxTextRun::CopyGlyphDataFrom may assert
> >+    // "Glyphruns not coalesced", but does that matter?
> >+    aTextRun->SortGlyphRuns();
> 
> >@@ -2438,20 +2442,16 @@ gfxFontGroup::InitTextRun(gfxContext *aC
> >     }
> > 
> >     // It's possible for CoreText to omit glyph runs if it decides they contain
> >     // only invisibles (e.g., U+FEFF, see reftest 474417-1). In this case, we
> >     // need to eliminate them from the glyph run array to avoid drawing "partial
> >     // ligatures" with the wrong font.
> >     aTextRun->SanitizeGlyphRuns();
> > 
> >-    // Is this actually necessary? Without it, gfxTextRun::CopyGlyphDataFrom may assert
> >-    // "Glyphruns not coalesced", but does that matter?
> >-    aTextRun->SortGlyphRuns();
> 
> Sounds like we should remove this part of the patch.
> This is not needed for these bugs, right?
> It looked right to me, but sounds like it is causing enough unexpected
> consequences that it should be dealt with separately.

I don't think the movement of that code is the reason for the assertion showing up - moving it up from the individual script-run method to the top-level gfxFontGroup::InitTextRun merely avoids redundantly re-processing the runs as each new script run is handled.

The "Glyphruns not coalesced" assertion on Linux crashtest, however, comes from a textrun that is being created by nsTransformingTextRunFactory; the problem is that this may assemble a textrun by copying from several sources, and fails to coalesce adjacent sub-runs using the same font in this situation. The change to cluster-marking here must be tickling run-construction in such a way that this starts to show up; I'm going to try and reproduce it consistently with a simpler testcase (and without the clustering patch), so that we can file, fix, and verify it reliably.
OK, I've figured out why we're getting the assertion. A testcase that will reproduce it *without* the SetupClusterBoundaries patch here, at least on my Ubuntu system, is:

data:text/html,<div style="font-family:OpenSymbol;text-transform:uppercase">x&#x303; &#x303;

The contributing factors here are:
- the use of text-transform, so that we go through an nsTransformingTextRunFactory that rebuilds textruns
- the presence *after* the space of a combining mark which is NOT supported in the style's (first-listed) font family
- the last character *before* the space falls back to the same font as the combining mark after it

This situation arises in crashtests/419255-1.html after patching the cluster code because we then (correctly) treat the U+200C character similarly to a combining mark for clustering purposes (UAX#29, GB9; ZWNJ and ZWJ are Grapheme_Extend). This triggers the code path at http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxTextRunWordCache.cpp#514 that "backs up" by one character to include the space when copying words, and that leads to the assertion due to a bug in gfxTextRun::AddGlyphRun.

The fix is straightforward; I'll post a new patch here that includes it.
Marking for re-review so you can check the fix in gfxTextRun::AddGlyphRun. The rest is unchanged.
Attachment #496069 - Attachment is obsolete: true
Attachment #496160 - Flags: review?(karlt)
Attachment #496160 - Flags: approval2.0?
Comment on attachment 496160 [details] [diff] [review]
patch, updated to fix the textrun assertion

This version of the patch has lost the changes addressing previous review comments.

>         if (lastGlyphRun->mCharacterOffset == aUTF16Offset) {
>+            // avoid leaving a zero-length run if the offset hasn't changed
>+            // by overwriting the last entry instead of appending
>+            if (numGlyphRuns > 1 &&
>+                mGlyphRuns[numGlyphRuns - 2].mFont == aFont)
>+            {
>+                // if the (last-1) run had the same font as the new one wants,
>+                // merge with it instead of creating adjacent runs with the
>+                // same font
>+                mGlyphRuns.TruncateLength(numGlyphRuns - 1);
>+                return NS_OK;

This code looks good, but I got a bit confused reading the comments.

The "if the offset hasn't changed" condition has already been satisfied at the
point of the first comment.  Can we either move the comment to before the
conditional that it applies to, or change it to "The offset hasn't changed.
Avoid leaving a zero-length run by ..."

Similarly for the "(last-1)" comment.
Attachment #496160 - Flags: review?(karlt)
Attachment #496160 - Flags: review-
Attachment #496160 - Flags: feedback+
Attachment #496160 - Flags: approval2.0?
(In reply to comment #14)
> This version of the patch has lost the changes addressing previous review
> comments.

Argh, how did I do that? :( Thought I started from the latest one..... will re-post shortly. Thanks for catching it. Will update comments as well.
I believe this includes the proper updates this time - sorry about the bad version!
Attachment #495020 - Attachment is obsolete: true
Attachment #496160 - Attachment is obsolete: true
Attachment #496341 - Flags: review?(karlt)
Comment on attachment 496341 [details] [diff] [review]
patch, updated properly this time!

Excellent, thanks.
Attachment #496341 - Flags: review?(karlt)
Attachment #496341 - Flags: review+
Attachment #496341 - Flags: approval2.0?
Comment on attachment 496341 [details] [diff] [review]
patch, updated properly this time!

Seems like the logic in gfxTextRun::AddGlyphRun could be simplified by first checking whether we should wipe out the current last glyph run (because it's zero-length), then checking whether we should merge the new glyph run with the new last glyph run.
Attachment #496341 - Flags: approval2.0? → approval2.0+
Blocks: 617869
http://hg.mozilla.org/mozilla-central/rev/4861b72cc66d

Filed bug 617869 for the WINNT failure.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla2.0b8
(In reply to comment #18)

> Seems like the logic in gfxTextRun::AddGlyphRun could be simplified by first
> checking whether we should wipe out the current last glyph run (because it's
> zero-length), then checking whether we should merge the new glyph run with the
> new last glyph run.

Yes, it could be a bit simpler, though we'd need to watch out for the case where the "current last glyph run" that got wiped out was in fact the first (only) glyph run, so that we're left without a "new last glyph run" to compare with.

OTOH, the current code is arranged so that the common case (where "lastGlyphRun->mFont == aFont" is true) is checked first; a quick test with an instrumented browser shows that we get an early return from here about 95% of the time. So I'm inclined to leave it this way.
Depends on: 658881
Depends on: 658912
Depends on: 680402
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: