Bug 543200 (open), opened 11 years ago, updated 8 months ago

font fallback should try to use the same font for a complete character cluster or word

Categories

(Core :: Layout: Text and Fonts, defect)

x86
macOS
normal

People

(Reporter: jfkthame, Unassigned)

References

(Depends on 1 open bug, Blocks 4 open bugs)

Details

(Keywords: perf)

Attachments

(2 files, 5 obsolete files)

We currently do font fallback (when the primary font being used doesn't support all the characters in the text) on a character-by-character basis. This can lead to unsightly "ransom note" effects when an unexpected font change occurs in the middle of a word, or even within a glyph cluster (e.g. a base character and diacritics). It can also prevent proper shaping and diacritic placement, because shaping does not occur across font-run boundaries.

It might be better to make the font fallback process attempt to find a single font that can be used for all the characters in a cluster (at least) or possibly an entire word.
Trying to get each cluster in a single font (if possible) sounds good to me.

I'm less sure about words, though; that opens up the potential for ransom note effects on a word-by-word basis.  There may also be cases where authors intend the character-by-character fallback, particularly when they list a font with latin glyphs ahead of a font with CJK (or other) glyphs so that they don't end up using the latin in the CJK font.

I think it's also technically a violation of the spec.
I think per-cluster matching makes much more sense.  It's simpler and doesn't lead to cross-browser inconsistencies as easily as "word" matching.  If a Japanese author uses a Chinese word, one for which glyphs only exist in a Chinese font, what's the "word" here that you're going to match?  Are you going to break at punctuation, so that half the sentence is one font, half another? I'm sure there would be cases where word matching would provide better results but precisely what "word" means here seems problematic to define across scripts.
I agree with those reasons for making it per-cluster.
Blocks: 657655
See also:
Bug 677919 - Inconsistent rendering of Chinese characters
I'm trying to implement this, but it caused some test failures.
For example, layout/reftests/bugs/619511-1.html fails because DeLarge.woff doesn't contain ZWJ while Times New Roman does. What should be done in this case?
The search algorithm is the following:
1. If a group font is found which contains all characters in the whole cluster, select that font and terminate these steps.
2. Otherwise, if a group font is found which contains the character, select that font and terminate these steps.
3. Otherwise, fall back to pref fonts and system fonts without considering the cluster.

This patch will also fix some known reftest failures.
Assignee: nobody → VYV03354
Status: NEW → ASSIGNED
Attachment #600769 - Flags: review?(jfkthame)
(In reply to Masatoshi Kimura [:emk] from comment #6)
> Created attachment 600769 [details] [diff] [review]
> Bug 543200 - try to use the same font for a complete character cluster
> 
> The search algorithm is the following:
> 1. If a group font is found which contains all characters in the whole
> cluster, select that font and terminate these steps.
> 2. Otherwise, if a group font is found which contains the character,
> select that font and terminate these steps.
> 3. Otherwise, fall back to pref fonts and system fonts without
> considering the cluster.

Very pleased to see you're working on this - I've been wanting to do it for a long time but never found the time to focus on it.

However, I'm inclined to think the algorithm should be rearranged slightly, if possible:
1. As above.
2. Otherwise, fallback to pref fonts and system fonts looking for a font that supports the complete cluster.
3. Otherwise, try the group fonts and then fallback to pref or system fonts for each individual character.

This avoids the problem where, for example, the text includes a diacritic that is not available in the selected (group) font, but the base character is present - it's better to find a pref or system font that supports the complete cluster than to render the base char and diacritic from different fonts, which will almost invariably give poor results.
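The rearranged order could be sketched roughly as follows (the Font class and method names here are hypothetical stand-ins, not the actual gfx classes):

```python
class Font:
    """Minimal stand-in for a font entry with a character map (cmap)."""
    def __init__(self, name, chars):
        self.name = name
        self.chars = set(chars)

    def covers(self, text):
        return all(ch in self.chars for ch in text)


def match_cluster(cluster, group_fonts, fallback_fonts):
    # 1. A group font that supports the complete cluster.
    for f in group_fonts:
        if f.covers(cluster):
            return f
    # 2. Otherwise a pref/system font that supports the complete cluster,
    #    so base + diacritic stay in one font.
    for f in fallback_fonts:
        if f.covers(cluster):
            return f
    # 3. Otherwise fall back per character; sketched here as matching
    #    just the base character through both lists.
    for f in group_fonts + fallback_fonts:
        if cluster[0] in f.chars:
            return f
    return None
```

With group = [Font("page", "a")] and fallback = [Font("system", "a\u0301")], the cluster "a\u0301" matches the system font in step 2 rather than splitting base and mark across two fonts.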
Just to echo dbaron's comment, whatever algorithm we come up with needs
to be reflected in the CSS3 Fonts spec.  Right now, that spec only
defines per-character fallback but I think it's completely reasonable to
expand it to cover per-cluster fallback, along the lines of what
Jonathan proposes in comment 8.

I think the algorithm needed is significantly more complicated than the
logic contained in this patch.  First, it needs to be clear what the
exact definition of a cluster is.  The patch seems to treat clusters as
base + (mark or ZWJ/ZWNJ or katakana sound marks), based on the use of
IsClusterExtender.  Not unreasonable, but this doesn't seem to match
any definition of cluster listed in UAX #29.  It doesn't include Hangul
syllables for example.

The second thing is that we need to consider normalization here. The
simple example of this is that it doesn't make sense to search for 'base
+ combining mark' in pref and system fonts if the precomposed codepoint
is supported among the fonts in the given font list.  In fact, many
layout engines do NFC normalization before doing font matching.

But my biggest concern here is that I don't think we should implement
the handling of font matching for clusters within a general font
matching algorithm, we should have an entirely different method for this
so that we can make the 99.9% "all clusters are single characters" case
as fast as possible.  The clusters case is going to be far more
complicated than the normal case.
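The fast-path check suggested here, "do all clusters consist of single characters?", could be a single cheap scan of the string before choosing the matching algorithm. A rough Python approximation of such a check (the exact character classes are an assumption, modeled loosely on IsClusterExtender):

```python
import unicodedata

JOIN_CONTROLS = {"\u200c", "\u200d"}  # ZWNJ, ZWJ


def has_cluster_extenders(text):
    """One cheap pass to decide whether the slower cluster-aware
    matching path is needed: looks for combining marks, join controls,
    and variation selectors."""
    for ch in text:
        if (unicodedata.combining(ch) != 0
                or unicodedata.category(ch).startswith("M")
                or ch in JOIN_CONTROLS
                or "\ufe00" <= ch <= "\ufe0f"          # VS1-VS16
                or "\U000e0100" <= ch <= "\U000e01ef"):  # IVS range
            return True
    return False
```

Plain ASCII (and most text) fails every test on the first pass, so the common case pays only one table lookup per character.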
I'll delegate the definition of clusters to ClusterIterator which will be implemented in bug 721821.
Depends on: 721821
(In reply to John Daggett (:jtd) from comment #9)
> The second thing is that we need to consider normalization here. The
> simple example of this is that it doesn't make sense to search for 'base
> + combining mark' in pref and system fonts if the precomposed codepoint
> is supported among the fonts in the given font list.  In fact, many
> layout engines do NFC normalization before doing font matching.

Just a note to point out that using nsIUnicodeNormalizer for this purpose is not really appropriate, because it implements Unicode 3.2, whereas our text code in general (and rendering in particular) supports Unicode 6.1.

(I posted patches in bug 728180 that would bring normalization up-to-date, but because of IDNA requirements we can't simply update the existing service; we'd need to split it into a 3.2-based normalizer for IDNA, and an up-to-date one for other purposes. Ugh. Or move to IDNA2008.)
Attached patch patch v2 (obsolete) — Splinter Review
- Extend cluster-based matching to pref/system fonts.
- Use ClusterIterator to split the run.
- Normalize clusters to NFC before doing font matching. (I'm using nsIUnicodeNormalizer until an up-to-date normalizer is available.)
- Separate functions to minimize the performance impact.
Attachment #600769 - Attachment is obsolete: true
Attachment #602727 - Flags: review?(jfkthame)
Attachment #600769 - Flags: review?(jfkthame)
Filed spec bug to W3C Bugzilla.
Comment on attachment 602727 [details] [diff] [review]
patch v2

Review of attachment 602727 [details] [diff] [review]:
-----------------------------------------------------------------

Some initial comments below.... I intend to study this in more detail, so if you want to wait for more comments before revising, that's ok - or if it's convenient to post an updated version for further review, that's fine too.

Have you done any testing to see if there's any detectable impact on performance? (I hope there won't be, because the more complex codepaths will be taken relatively rarely.)

::: gfx/thebes/gfxFont.cpp
@@ +169,5 @@
>      return 0;
>  }
>  
> +bool gfxFontEntry::HasCluster(PRUint32 aCh, const PRUnichar *aTop,
> +                              const PRUnichar *aEnd, PRInt32 aScript)

These "top" and "end" parameters/variables used throughout the patch need to be explained in a comment somewhere.

(It looks like "top" is a pointer to the UTF-16 code unit immediately *following* the initial character of the cluster (which is passed separately as aCh); and "end" points after the end of the cluster - right?)

Maybe the names could be made clearer? Something like
  aCh -> aStartChar
  aTop -> aNext
  aEnd -> aLimit (or aEnd is ok, I guess)

@@ +192,5 @@
> +            }
> +        }
> +    }
> +    return true;
> +}

This logic is basically identical in gfxFontEntry::HasCluster and gfxFontFamily::HasCluster (and the static function in gfxPangoFonts.cpp), but it's not obvious how we can share the code between these in the current structure. Hmm. Can we do anything about that, do you think?

@@ +3379,5 @@
>  
> +static inline bool IsControlChar(PRUint32 aCh)
> +{
> +    return (aCh & ~0x80) < 0x20;
> +}

Is there a particular reason to introduce this instead of relying on the General Category to identify control chars?
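For reference, the bit test in the patch can be exercised directly (a hypothetical Python translation, not the Gecko code):

```python
def is_control_char(ch):
    """Python translation of the patch's test: (aCh & ~0x80) < 0x20.
    True for the C0 (U+0000-U+001F) and C1 (U+0080-U+009F) ranges.
    Note it does NOT catch U+007F (DEL), which the General Category
    (Cc) would include -- one concrete difference behind the question
    above."""
    return (ch & ~0x80) < 0x20
```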

@@ +3538,2 @@
>  void gfxFontGroup::ComputeRanges(nsTArray<gfxTextRange>& aRanges,
> +                                 const PRUint8 *aString, PRUint32 aLength,

If you're separating the 8-bit and 16-bit implementations of ComputeRanges, is there any reason to keep the templatized declaration? I think you can just have the two separate function signatures, can't you?

@@ +3632,5 @@
> +        // get the next cluster if the current position is top of the cluster
> +        MOZ_ASSERT(i <= iter - aString);
> +        if (i == iter - aString) {
> +            ch = iter.Next();
> +            i += 1 + (ch > 0xFFFF);

!IS_IN_BMP(ch)

@@ +3709,5 @@
> +                if (IsControlChar(ch) ||
> +                    gfxFontUtils::IsJoinControl(ch) ||
> +                    gfxFontUtils::IsJoinCauser(prevCh) &&
> +                    prevFont->HasCharacter(ch) ||
> +                    gfxFontUtils::IsVarSelector(ch)) {

When mixing || and && in a complex conditional, please use parentheses to make the intended logic clear, even if they're not strictly necessary because of precedence rules - it's easy to get confused otherwise.

::: intl/unicharutil/public/nsUnicodeProperties.h
@@ +100,5 @@
> +    return ScriptShapingType(aScriptCode) & (SHAPING_ARABIC | SHAPING_INDIC);
> +}
> +
> +static inline PRUint32
> +NextChar(const PRUnichar *&aPos, const PRUnichar *aEnd)

If this is going to modify aPos, I think you should declare it as "const PRUnichar **aPos", so that callers have to pass &mPos (or whatever their variable is), making it clearer that it is an in/out parameter.

@@ +114,5 @@
> +    }
> +
> +    if (ch == 0xa0) {
> +        ch = ' ';
> +    }

Let's not "hide" this replacement inside NextChar(), but do it in the calling code - it's not logically a part of "get the next Unicode character from a UTF-16 buffer", but a special-case substitution that we do for font-matching purposes, so let's put it in the font-matching code where it belongs.

@@ +117,5 @@
> +        ch = ' ';
> +    }
> +
> +    return ch;
> +}

On second thoughts, maybe it would be tidier to turn this into a (fairly trivial) CharacterIterator object - similar to ClusterIterator, except that the Next() method would be much simpler (everything would be inline). WDYT?
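In Python terms, the suggested character iterator might look like the following generator (a sketch of the surrogate-pair handling only; the actual C++ would be an inline-everything object as described above):

```python
def utf16_chars(units):
    """Sketch of the suggested CharacterIterator: walk a sequence of
    UTF-16 code units, combining surrogate pairs into code points and
    passing everything else through unchanged."""
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High + low surrogate -> supplementary-plane code point.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1
```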

@@ +139,5 @@
>      bool AtEnd() const {
>          return mPos >= mLimit;
>      }
>  
> +    PRUint32 Next();

Add a comment documenting what this returns, now that it's no longer 'void'.

::: intl/unicharutil/src/nsUnicodeProperties.cpp
@@ +299,5 @@
>          (ch >= 0xac00 && ch <= 0xd7ff)) {
>          // Handle conjoining Jamo that make Hangul syllables
>          HSType hangulState = GetHangulSyllableType(ch);
>          while (mPos < mLimit) {
> +            PRUint32 ch = *mPos;

Please don't re-use the variable name "ch" (here, and within the while-loop below); it shadows the original variable that you're intending to return, which is confusing. Maybe rename the original one to "baseChar" or something like that?
(In reply to Masatoshi Kimura [:emk] from comment #13)
> Filed spec bug to W3C Bugzilla.

Thanks for filing this, I've added a couple comments.

For this bug, I don't think we should be landing any patches until we've at least discussed some of the issues related to this on the www-style mailing list.  I think we need to define a few things clearly:

- How is a cluster defined?  Our code may or may not be a good starting point; UAX #29 may also be something to use.  Note that standards typically define clusters as including individual codepoints without combiners present.  We need to avoid normalization of single codepoints, to avoid normalizing CJK Compatibility characters, for example.

- When is normalization applied (i.e. when/how is base + combiner matched vs. an underlying precomposed character)?  Does a user agent try base + combiner, then the precomposed character, for each font in the font list?  Or try base + combiner for each font, then try the precomposed character for each font?  Is system fallback done the same way or differently?  It would be nice to be able to assure that base + combiner is picked up in the given font list, but I'm not sure this is necessary in system font fallback.

- Clusters with IVS selectors are a slightly different special case, ideally user agents would do system fallback with the base + IVS combination before falling back to just matching the base character.

We've had lots of problems related to system fallback; we need to assure that whatever we come up with minimizes unnecessary system font fallback (see bug 705594 and bug 705258 for more background).
Depends on: 705594
> -        FindFontForChar(PRUint32 ch, PRUint32 prevCh, PRInt32 aRunScript,
> +        FindFontForCluster(PRUint32 ch,
> +                           const PRUnichar *aTop,
> +                           const PRUnichar *aEnd,
> +                           PRInt32 aRunScript,
> +                           gfxFont *aPrevMatchedFont,
> +                           PRUint8 *aMatchType);

I don't think this is the right approach at all, the *vast* majority of
text runs processed at this level will not need to deal with clusters,
so we shouldn't impose a performance penalty on these cases.  Within
ComputeRanges do a simple check for whether clusters exist or not in the
string, and run the appropriate font matching algorithm.
Keywords: perf
(In reply to John Daggett (:jtd) from comment #17)
> > -        FindFontForChar(PRUint32 ch, PRUint32 prevCh, PRInt32 aRunScript,
> > +        FindFontForCluster(PRUint32 ch,
> > +                           const PRUnichar *aTop,
> > +                           const PRUnichar *aEnd,
> > +                           PRInt32 aRunScript,
> > +                           gfxFont *aPrevMatchedFont,
> > +                           PRUint8 *aMatchType);
> I don't think this is the right approach at all, the *vast* majority of
> text runs processed at this level will not need to deal with clusters,
> so we shouldn't impose a performance penalty on these cases.
It's just that diff fails to find the function body of FindFontForChar after FindFontForCluster. I didn't add any code to FindFontForChar.

> Within
> ComputeRanges do a simple check for whether clusters exist or not in the
> string, and run the appropriate font matching algorithm.
Is it really faster than the current patch? The Hangul Jamo test and the IsClusterExtender() call will be required anyway for every character in the run.
(In reply to John Daggett (:jtd) from comment #16)
> - How is a cluster defined?  Our code may or may not be a good starting point, UAX29 may also be something to use.  Note that standards typically define clusters as including individual codepoints without combiners present.
My assumption was UAX #29. If our code does not conform to UAX #29, that should be considered a bug, but it is out of scope for this bug. The same cluster definition should be used throughout our code.

> We need to avoid normalization of single codepoints, to avoid normalizing CJK Combatibility characters for example.
The current patch will not apply normalization if the cluster consists of a single codepoint.

> - When is normalization applied (i.e. when/how is base + combiner matched vs. an underlying precomposed character)?  Does a user agent try base + combiner, then precomposed character for each font in the font list?  Or try base + combiner for each font, then try precomposed character for each font?  Is system fallback done the same way or differently?  It would be nice to be able assure that base + combiner is picked up in the given font list but I'm not sure this is necessary in system font fallback.
The current patch will only try the precomposed character. It's hard for me to imagine a font which supports base + combiner but does not support the precomposed form. Are there any examples of such a font?

> - Clusters with IVS selectors are a slightly different special case, ideally user agents would do system fallback with the base + IVS combination before falling back to just matching the base character.
The current patch treats base + IVS as just a cluster per UAX #29 (or ClusterIterator). So base + IVS would be preferred just like other clusters when doing system fallback.

> We've had lots of problems related to system fallback, we need to assure that whatever we come up with minimizes unnecessary system font fallback (see bug 705594, bug 705258) for more background.
I modified FontSearch to avoid system fallback being invoked twice. If a font supports all characters, it is given a higher score than the others, and FontSearch returns whether the returned font matched the whole cluster. If it did not match the whole cluster, it is used only when no font among the group fonts and pref fonts supports the single codepoint.
> > - How is a cluster defined?  Our code may or may not be a good
> > starting point, UAX29 may also be something to use.  Note that
> > standards typically define clusters as including individual
> > codepoints without combiners present.
>
> My assumption was UAX #29. If our code does not conform to UAX #29, it
> should be considered as a bug but it is out of scope of this bug. The
> same cluster definition should be used throughout our code.

UAX #29 discusses a number of ways of segmenting text.  We need to
explicitly choose something that's clearly defined and appropriate for
efficient font mapping.  For example, clusters could be defined to be
what UAX #29 labels as "enhanced grapheme clusters".  I don't think it's
out-of-scope because how it's defined will affect performance.

> > We need to avoid normalization of single codepoints, to avoid
> > normalizing CJK Combatibility characters for example.
>
> The current patch will not apply normalization if the cluster consists
> of a single codepoint.

Right, I think we need to avoid the "singleton" decompositions that are
applied as part of NFC normalization.  But not having a pre-defined
algorithm for what is and isn't normalized means we have to define it
clearly ourselves and propose it as the standard way.  Maybe it's as
simple as "use NFC form when cluster is more than a single character".

I'm guessing that referencing the "Canonical Composition Algorithm"
(i.e. NFC without the decomposition) would give us the best results.
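A rough illustration of "composition without decomposition" (a Python sketch using unicodedata; the greedy pairwise-only strategy is an assumption for illustration, not what eventually landed):

```python
import unicodedata


def compose_only(text):
    """Sketch of 'NFC minus the decomposition step': greedily compose a
    base with each following combining mark when Unicode defines a
    primary composite, but never decompose anything, so singleton
    decompositions (e.g. CJK Compatibility characters, U+212B) are left
    alone.  Canonical reordering is ignored, as discussed above."""
    if not text:
        return text
    out = [text[0]]
    for ch in text[1:]:
        candidate = unicodedata.normalize("NFC", out[-1] + ch)
        # Accept only a true pairwise composition: one char out, and the
        # pair must really be its canonical decomposition (this rejects
        # results where NFC decomposed out[-1] as a side effect).
        if (len(candidate) == 1 and
                unicodedata.normalize("NFD", candidate) ==
                unicodedata.normalize("NFD", out[-1] + ch)):
            out[-1] = candidate
        else:
            out.append(ch)
    return "".join(out)
```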

> > - When is normalization applied (i.e. when/how is base + combiner
> > matched vs. an underlying precomposed character)?  Does a user agent
> > try base + combiner, then precomposed character for each font in the
> > font list?  Or try base + combiner for each font, then try
> > precomposed character for each font?  Is system fallback done the
> > same way or differently?  It would be nice to be able assure that
> > base + combiner is picked up in the given font list but I'm not sure
> > this is necessary in system font fallback.
> 
> The current patch will only try the precomposed character. It's hard
> for me to imagine a font which supports base + combiner but does not
> support the precomposed form. Are there any examples of such a font?

This is important for a couple reasons.  It allows sites to use a
combination of decomposed characters and downloadable font to reduce the
need to define glyphs for all possible diacritic combinations that exist
in precomposed form.  Plus it *does* occur in practice.  For example,
take the Windows font Constantia: it has an A-umlaut character (U+00C4)
and a macron diacritic (U+0304) but lacks an A-umlaut-macron (U+01DE)
precomposed character.
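The composition in question can be checked directly with Python's unicodedata module:

```python
import unicodedata

# <A-umlaut (U+00C4), combining macron (U+0304)> composes under NFC to
# the single precomposed A-umlaut-macron (U+01DE).  A matcher that only
# looks for the NFC form would therefore skip a font (like Constantia
# in the example above) that could render the decomposed sequence.
decomposed = "\u00c4\u0304"
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u01de"
```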

> > - Clusters with IVS selectors are a slightly different special case,
> > ideally user agents would do system fallback with the base + IVS
> > combination before falling back to just matching the base character.
>
> The current patch treats base + IVS as just a cluster per UAX #29 (or
> ClusterIterator). So base + IVS would be preferred just like other
> clusters when doing system fallback.

The handling will be different.  With base + diacritic clusters, I think
the desired behavior would be to for each font, try to match (base +
diacritics), then precomposed character, and match this way through
system fallback.

With IVS clusters, the matching behavior should be (base + IVS) all the
way through to system fallback, *then* use the first font containing the
base character.  These two processes aren't the same.
> > I don't think this is the right approach at all, the *vast* majority
> > of text runs processed at this level will not need to deal with
> > clusters, so we shouldn't impose a performance penalty on these
> > cases.
>
> It's just that diff fails to find the function body of FindFontForChar
> after FindFontForCluster. I didn't add any codes to FindFontForChar.

Ah, okay, you're calling both of them in different places within ComputeRanges.
 
> > Within ComputeRanges do a simple check for whether clusters exist or
> > not in the string, and run the appropriate font matching algorithm.
>
> Is it really faster than the current patch? The Hangul Jamo test and
> the IsClusterExtender() call will be required anyway for every
> character in the run.

I don't think the speed of "strings with clusters" handling is critical,
but the 99% case of strings with no clusters should be as fast as
possible.  With the way you have structured the code now in
ComputeRanges, all strings will be affected by the cluster-handling
logic.  Might even make sense to pull out the surrogate handling too.
(In reply to John Daggett (:jtd) from comment #20)
> UAX #29 discusses a number of ways of segmenting text.  We need to
> explicitly choose something that's clearly defined and appropriate for
> efficient font mapping.  For example, clusters could be defined to be
> what UAX #29 labels as "enhanced grapheme clusters".  I don't think it's
> out-of-scope because how it's defined will affect performance.
Ah, I forgot UAX #29 had multiple definitions of clusters. I considered "extended grapheme clusters" because:
- UAX #29 prefers it over "legacy grapheme clusters".
- "All spacing marks" are easier to test than "some spacing marks" using nsIUGenCategory.
- It's closest to our current ClusterIterator implementation.

> I'm guessing that referencing the "Canonical Composition Algorithm"
> (i.e. NFC without the decomposition) would give us the best results.
It makes sense.

> This is important for a couple reasons.  It allows sites to use a
> combination of decomposed characters and downloadable font to reduce the
> need to define glyphs for all possible diacritic combinations that exist
> in precomposed form.  Plus it *does* occur in practice.  For example,
> take the Windows font Constantia, it has a A-umlaut character (u+c4) and
> a macron diacritic (u+304) but lacks a A-umlaut-macron (u+1de)
> precomposed character.
A-umlaut-macron will not match Constantia anyway, because it consists of only one character unless yet another diacritic is appended. Is that OK? Or do we have to decompose the cluster first? (IMO we have to compromise here to balance performance with correctness.)

> The handling will be different.  With base + diacritic clusters, I think
> the desired behavior would be to for each font, try to match (base +
> diacritics), then precomposed character,
Precomposed forms may not be defined in Unicode even for diacritics.
> and match this way through system fallback.
Then use the first font containing the base character, then switch to the first font containing the diacritics. Of course it will produce ugly results, but it will be much better than a hexbox. Thus matching the single base character is required anyway.

So we can generalize the algorithm as follows:
1. For each font, try to match the whole cluster (base + diacritics or base + VS).
2. Try to match the precomposed form of the cluster if it is different from the unnormalized cluster (for some base + diacritics).
3. Match this way through system fallback.
4. If no fonts match the whole cluster, use the first font containing the base character, then switch the font for the following characters. (Here, a VS can be treated as if it were contained in every font regardless of the actual cmaps, as we already do.)
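Steps 1-4 above could be sketched as follows (Font and the list-of-lists structure are hypothetical stand-ins for the real Gecko structures; step 3 is represented by iterating the lists in order):

```python
import unicodedata


class Font:
    """Minimal stand-in for a font entry with a cmap."""
    def __init__(self, name, chars):
        self.name = name
        self.chars = set(chars)

    def covers(self, text):
        return all(ch in self.chars for ch in text)


def match_generalized(cluster, font_lists):
    """font_lists is, in order: group fonts, pref fonts, system fonts."""
    nfc = unicodedata.normalize("NFC", cluster)
    # Steps 1-3: for each font in each list, try the whole unnormalized
    # cluster, then its precomposed form when that differs.
    for fonts in font_lists:
        for f in fonts:
            if f.covers(cluster):
                return f
            if nfc != cluster and f.covers(nfc):
                return f
    # Step 4: no whole-cluster match anywhere; use the first font that
    # has the base character (per-character fallback takes over).
    for fonts in font_lists:
        for f in fonts:
            if cluster[0] in f.chars:
                return f
    return None
```

For example, a font containing only the precomposed U+01DE still matches the cluster <U+00C4, U+0304> via the NFC step, while a font with only U+00C4 is used as the step-4 base-character fallback.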
(In reply to John Daggett (:jtd) from comment #20)
> UAX #29 discusses a number of ways of segmenting text.  We need to
> explicitly choose something that's clearly defined and appropriate for
> efficient font mapping.  For example, clusters could be defined to be
> what UAX #29 labels as "enhanced grapheme clusters". 

(I presume you mean "extended" here.)

> we have to define it
> clearly ourselves and propose it as the standard way

I'm not convinced that it's necessary to have a precise specification in CSS Fonts of what definition of "cluster" is used here. At the point where font fallback is coming into play, and the choice of cluster definition may influence the fonts chosen, we're in a situation where the fonts actually specified by the user were not a good match for the content. At this point, what's important is to try and render recognizable text, drawing on whatever resources and heuristics the browser has at its disposal; I don't think it is necessarily important (or even helpful) to try and define precisely _how_ every browser should do this, or expect that all browsers will behave the same in every corner case.
(In reply to Masatoshi Kimura [:emk] from comment #22)
> > I'm guessing that referencing the "Canonical Composition Algorithm"
> > (i.e. NFC without the decomposition) would give us the best results.
> It makes sense.
I'm reconsidering this.
If the font contains A-umlaut-macron, it would match <A + combining umlaut + combining macron>, but would not match <A-macron (U+0100) + combining umlaut>, because we do not apply the Decompose and Canonical Reordering steps. Is this OK?
Ah, umlaut and macron couldn't be reordered because their Canonical Combining Class is identical. But what about U+1E08 (C-cedilla-acute)?
(In reply to Masatoshi Kimura [:emk] from comment #24)
> (In reply to Masatoshi Kimura [:emk] from comment #22)
> > > I'm guessing that referencing the "Canonical Composition Algorithm"
> > > (i.e. NFC without the decomposition) would give us the best results.
> > It makes sense.
> I'm reconsidering this.
> If the font contains A-umlaut-macron, it would match <A + combining umlaut +
> combining macron>, but would not match <A-macron (U+0100) + combining
> umlaut>, because we do not apply the Decompose and Canonical Reordering
> steps. Is this OK?

In this example, that would be correct, because umlaut and macron have the same combining class and therefore will _not_ be reordered; A-umlaut-macron is not equivalent to A-macron-umlaut.

However, if the font includes c-cedilla (U+00C7) and combining-cedilla and combining-acute, but not a precomposed c-acute (U+0107), what will happen if the text contains <c-acute, combining-cedilla>? Normalization would transform this to <c-cedilla, combining-acute>, which would match the font, but if you don't do the Decompose step, that won't happen.

(This is a particularly plausible example, because c-cedilla is present in the ISO-8859-1 codepage, while c-acute is not; many similar examples can of course be devised.)
(In reply to Jonathan Kew (:jfkthame) from comment #26)
> However, if the font includes c-cedilla (U+00C7) and combining-cedilla and
> combining-acute, but not a precomposed c-acute (U+0107), what will happen if
> the text contains <c-acute, combining-cedilla>? Normalization would
> transform this to <c-cedilla, combining-acute>, which would match the font,
> but if you don't do the Decompose step, that won't happen.

Actually, NFC would make it c-cedilla-acute, as you noted in the comment that I mid-aired - but the point remains that unless you do the decomposition step, this won't work.

Moreover, this illustrates a good example where it's entirely possible that a font may _not_ support the fully-composed form (c-cedilla-acute), yet would be able to render it using a combining mark.
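Both points can be verified with Python's unicodedata, which implements full NFC including the decompose and canonical-reordering steps:

```python
import unicodedata

# Cedilla (ccc 202) sorts before acute (ccc 230) under canonical
# reordering, so <c-acute (U+0107), combining cedilla (U+0327)>
# normalizes to the precomposed c-cedilla-acute (U+1E09).
assert unicodedata.combining("\u0327") == 202
assert unicodedata.combining("\u0301") == 230
assert unicodedata.normalize("NFC", "\u0107\u0327") == "\u1e09"

# By contrast, umlaut (U+0308) and macron (U+0304) share ccc 230, so
# <A-macron (U+0100), combining umlaut> is NOT reordered and does not
# become A-umlaut-macron (U+01DE).
assert unicodedata.normalize("NFC", "\u0100\u0308") == "\u0100\u0308"
```

A matcher that skips the decompose/reorder steps would miss the first equivalence entirely.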
I tested a decompose-before-matching patch locally, but it failed to render all of <C, combining-acute, combining-cedilla>, <C, combining-cedilla, combining-acute>, <C-acute, combining-cedilla>, and <C-cedilla, combining-acute> with Constantia, because our shaper composed the string (bug 728866).
Therefore I'll try matching using only precomposed clusters for now, and file a follow-up bug.
Depends on: 733376
Long term, we should add an API to HarfBuzz to return the longest prefix of a piece of text that it can render using a given font, and use that to guide breaking at grapheme boundaries.
(In reply to Jonathan Kew (:jfkthame) from comment #23)

> (I presume you mean "extended" here.)

Yes, sorry for the typo.

> I'm not convinced that it's necessary to have a precise specification
> in CSS Fonts of what definition of "cluster" is used here. At the
> point where font fallback is coming into play, and the choice of
> cluster definition may influence the fonts chosen, we're in a
> situation where the fonts actually specified by the user were not a
> good match for the content. At this point, what's important is to try
> and render recognizable text, drawing on whatever resources and
> heuristics the browser has at its disposal; I don't think it is
> necessarily important (or even helpful) to try and define precisely
> _how_ every browser should do this, or expect that all browsers will
> behave the same in every corner case.

I think we don't need to push this to an extreme level but we at least
need to set a reasonable minimum bar so that font designers and web
authors can rely on font matching for clusters behaving in a known and
consistent way.  I'd like to assure that (1) clusters formed with base +
diacritic will match even when the precomposed form is not supported and
(2) IVS matching is done in a normative way, especially given the issues
surrounding multiple IVS values mapping to the same underlying glyph.  I'm
not worried so much about corner case issues.
(In reply to Jonathan Kew (:jfkthame) from comment #27)
> Moreover, this illustrates a good example where it's entirely possible
> that a font may _not_ support the fully-composed form
> (c-cedilla-acute), yet would be able to render it using a combining mark.

Right, that's what I was trying to emphasize in the original example.
Constantia supports A-umlaut and the macron diacritic but lacks a glyph
for A-umlaut-macron.  Normalization will transform <A-umlaut> <macron>
to <A-umlaut-macron> so fallback will occur.
(In reply to John Daggett (:jtd) from comment #31)
> Right, that's what I was trying to emphasize in the original example.
> Constantia supports A-umlaut and the macron diacritic but lacks a glyph
> for A-umlaut-macron.  Normalization will transform <A-umlaut> <macron>
> to <A-umlaut-macron> so fallback will occur.
Nightly renders <A-umlaut> <macron> as a question box if Constantia is specified. We need to fix bug 733376 to implement this.
(In reply to Masatoshi Kimura [:emk] from comment #32)
> (In reply to John Daggett (:jtd) from comment #31)
> > Right, that's what I was trying to emphasize in the original example.
> > Constantia supports A-umlaut and the macron diacritic but lacks a glyph
> > for A-umlaut-macron.  Normalization will transform <A-umlaut> <macron>
> > to <A-umlaut-macron> so fallback will occur.
> Nightly renders <A-umlaut> <macron> as question-box if Constantia is
> specified. We need to fix bug 733376 to implement this.

Well, the solution to this bug and that bug is the same IMO.  I.e., the same pseudo-normalization logic that is implemented in HarfBuzz needs to be used for font selection, either by adding an API to HarfBuzz or by reimplementing it.
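A minimal sketch of that idea, assuming a hypothetical `font_has_glyph` cmap predicate. HarfBuzz's real logic is more elaborate (it decomposes and then recomposes pairwise against the font's cmap), but the principle — prefer whichever canonically-equivalent form the font actually supports — is the same:

```python
import unicodedata

def normalize_for_font(cluster, font_has_glyph):
    # Font-aware pseudo-normalization (simplified): keep whichever
    # canonically-equivalent form the font can fully render.
    composed = unicodedata.normalize("NFC", cluster)
    if all(font_has_glyph(ch) for ch in composed):
        return composed
    decomposed = unicodedata.normalize("NFD", cluster)
    if all(font_has_glyph(ch) for ch in decomposed):
        return decomposed
    return cluster  # neither form fully covered; leave as authored

# A Constantia-like cmap: base letter plus combining marks, but no
# precomposed U+01DE (A-umlaut-macron).
constantia_like = {"A", "\u0308", "\u0304"}.__contains__
print(normalize_for_font("\u01DE", constantia_like))  # A + U+0308 + U+0304
```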
(In reply to Behdad Esfahbod from comment #33)
> Well, the solution to this bug and that bug are the same IMO.  Ie. same
> pseudo-normalization logic that is implemented in HarfBuzz needs to be used
> for font selection.  Either by adding API to HarfBuzz, or reimplementing it.
Does the pseudo-normalization depend on the selected font? Some fonts (esp. older ones) prefer the precomposed form, but others (e.g. Constantia) don't.
(In reply to Masatoshi Kimura [:emk] from comment #34)
> (In reply to Behdad Esfahbod from comment #33)
> > Well, the solution to this bug and that bug are the same IMO.  Ie. same
> > pseudo-normalization logic that is implemented in HarfBuzz needs to be used
> > for font selection.  Either by adding API to HarfBuzz, or reimplementing it.
> Is the pseudo-normalization depends on the selected font? Some fonts (esp.
> older ones) prefer the precomposed form, but others (e.g. Constantia) don't.

Yes.  It does the smart thing.  See the comments here:

http://cgit.freedesktop.org/harfbuzz/tree/src/hb-ot-shape-normalize.cc
(In reply to Behdad Esfahbod from comment #35)
> (In reply to Masatoshi Kimura [:emk] from comment #34)
> > (In reply to Behdad Esfahbod from comment #33)
> > > Well, the solution to this bug and that bug are the same IMO.  Ie. same
> > > pseudo-normalization logic that is implemented in HarfBuzz needs to be used
> > > for font selection.  Either by adding API to HarfBuzz, or reimplementing it.
> > Is the pseudo-normalization depends on the selected font? Some fonts (esp.
> > older ones) prefer the precomposed form, but others (e.g. Constantia) don't.
> 
> Yes.  It does the smart thing.  See the comments here:
> 
> http://cgit.freedesktop.org/harfbuzz/tree/src/hb-ot-shape-normalize.cc

It's OK for our internal handling, but I'm not sure it's a good idea for the algorithm defined in the spec to depend on HarfBuzz.
(In reply to Behdad Esfahbod from comment #35)
> > Is the pseudo-normalization depends on the selected font? Some fonts (esp.
> > older ones) prefer the precomposed form, but others (e.g. Constantia) don't.
> 
> Yes.  It does the smart thing.  See the comments here:
> 
> http://cgit.freedesktop.org/harfbuzz/tree/src/hb-ot-shape-normalize.cc

This code will "decompose" singletons, which is not a good thing: because CJK Compatibility characters have canonical decompositions, this will "unify" those characters.  The same goes for the ohm and angstrom characters; they will be reduced to their Latin-range equivalents.
(In reply to John Daggett (:jtd) from comment #37)
> This code will "decompose" singletons which is not a good thing due to the
> inclusion of CJK Compatibility characters in canonical decomposition, this
> will "unify" these characters.  Same goes for the ohm and angstrom
> charcters, those will be reduced to their Latin-range equivalents.
hb_unicode_decompose just calls back into the user-supplied decompose function. If no function is supplied, it does nothing. Here, the user-supplied function is our HBUnicodeDecompose, which we implemented in bug 728866.
So if _hb_ot_shape_normalize decomposes singletons, it's our fault :)
(In reply to John Daggett (:jtd) from comment #37)
> (In reply to Behdad Esfahbod from comment #35)
> > > Is the pseudo-normalization depends on the selected font? Some fonts (esp.
> > > older ones) prefer the precomposed form, but others (e.g. Constantia) don't.
> > 
> > Yes.  It does the smart thing.  See the comments here:
> > 
> > http://cgit.freedesktop.org/harfbuzz/tree/src/hb-ot-shape-normalize.cc
> 
> This code will "decompose" singletons which is not a good thing due to the
> inclusion of CJK Compatibility characters in canonical decomposition, this
> will "unify" these characters.  Same goes for the ohm and angstrom
> charcters, those will be reduced to their Latin-range equivalents.

John, what's wrong with that?  If the font doesn't have a glyph for the singleton but it does for its canonical decomposition, what's wrong with using it?  These are not compatibility decompositions, they're still canonical decompositions, just happen to be singletons.
(In reply to Masatoshi Kimura [:emk] from comment #38)
> (In reply to John Daggett (:jtd) from comment #37)
> > This code will "decompose" singletons which is not a good thing due to the
> > inclusion of CJK Compatibility characters in canonical decomposition, this
> > will "unify" these characters.  Same goes for the ohm and angstrom
> > charcters, those will be reduced to their Latin-range equivalents.
> hb_unicode_decompose just callbacks user-supplied decompose function. If no
> function is supplied, it does nothing. Here, the user-supplied function is
> our HBUnicodeDecompose which we have implemented in bug 728866.

Firefox doesn't hit this because font matching is done before shaping is done.  The code in HBUnicodeDecompose just calls through to code that makes no distinction between singletons and non-singletons.
(In reply to Behdad Esfahbod from comment #40)
> > This code will "decompose" singletons which is not a good thing due
> > to the inclusion of CJK Compatibility characters in canonical
> > decomposition, this will "unify" these characters.  Same goes for
> > the ohm and angstrom charcters, those will be reduced to their
> > Latin-range equivalents.
> 
> John, what's wrong with that?  If the font doesn't have a glyph for
> the singleton but it does for its canonical decomposition, what's
> wrong with using it?  These are not compatibility decompositions,
> they're still canonical decompositions, just happen to be singletons.

The problem is that this will lose the distinction between compatibility
characters and their unified version; for example, U+FA5D and U+FA5E will
both be "normalized" to U+8279.  NFC is effectively a lossy operation
for singletons.

http://blogs.adobe.com/CCJKType/2012/03/cjk-compatibility-ideographs.html
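The lossiness is easy to reproduce with stock NFC (illustrative Python; Gecko does not use this module):

```python
import unicodedata

# U+FA5D and U+FA5E are two distinct CJK Compatibility Ideographs, yet
# both have the same canonical (singleton) decomposition to U+8279, so
# NFC collapses them and the original distinction is unrecoverable.
for cp in ("\uFA5D", "\uFA5E"):
    print("%04X -> %04X" % (ord(cp), ord(unicodedata.normalize("NFC", cp))))
```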
(In reply to John Daggett (:jtd) from comment #41)
> Firefox doesn't hit this because font matching is done before shaping is
> done. 
If _hb_unicode_decompose is exposed and we use it also for font matching, we should pass the same function.
> The code in HBUnicodeDecompose just calls through to code that makes
> no distinction between singletons and non-singletons.
Let's fix our callback as I already said.
BTW, CJK Compatibility Ideograph brokenness was my first concern when I found bug 728866, but actually Nightly didn't break it. I don't know why.
(In reply to Behdad Esfahbod from comment #40)
> > This code will "decompose" singletons which is not a good thing due to the
> > inclusion of CJK Compatibility characters in canonical decomposition, this
> > will "unify" these characters.  Same goes for the ohm and angstrom
> > charcters, those will be reduced to their Latin-range equivalents.
> John, what's wrong with that?  If the font doesn't have a glyph for the
> singleton but it does for its canonical decomposition, what's wrong with
> using it?  These are not compatibility decompositions, they're still
> canonical decompositions, just happen to be singletons.
It results in garbled characters ("mojibake") for Japanese users. Japanese fonts contain different glyphs for Compatibility Ideographs and Unified Ideographs, and the distinction is critical for some Japanese text. Some people (in the Japan NB) do not even accept that Compatibility Ideographs may be broken at any time.
https://twitter.com/kawabata/statuses/116795710782976000 (Japanese)
(In reply to John Daggett (:jtd) from comment #42)
> (In reply to Behdad Esfahbod from comment #40)
> > > This code will "decompose" singletons which is not a good thing due
> > > to the inclusion of CJK Compatibility characters in canonical
> > > decomposition, this will "unify" these characters.  Same goes for
> > > the ohm and angstrom charcters, those will be reduced to their
> > > Latin-range equivalents.
> > 
> > John, what's wrong with that?  If the font doesn't have a glyph for
> > the singleton but it does for its canonical decomposition, what's
> > wrong with using it?  These are not compatibility decompositions,
> > they're still canonical decompositions, just happen to be singletons.
> 
> The problem is that this will lose the distinction between compatibility
> characters and their unified version, for example U+fa5d and u+fa5e will
> both be "normalized" to u+8279.  NFC is effectively a lossy operation
> for singletons.

And therefore, authors who care about such a distinction should _not_ rely on those "compatibility characters" to encode it, because they are _by definition_ equivalent to their unified counterparts, and a Unicode-conformant process is free to apply NFC (or NFD) normalization to the data on the understanding that this makes absolutely _no_ semantic difference.

As that blog post points out, "...it’s not possible to guarantee that normalization will not be applied, except in completely closed environments. Any interchange of them [the CJK compatibility ideographs] is subject to data loss."

For example, if such data is served via a system that follows the "Net-Unicode" recommendation of http://tools.ietf.org/html/rfc5198:
    4. Before transmission, all character sequences SHOULD be normalized
       according to Unicode normalization form "NFC"
then the compatibility characters the author may have entered will already be lost, long before they even reach the rendering engine.

Arguably, normalizing these characters in the rendering engine is actually doing authors a favor, in that it increases the likelihood that they will notice that the distinction they may have intended to make is not actually present in their data, and therefore look for other, more robust ways to represent it.
(In reply to Jonathan Kew (:jfkthame) from comment #45)
> And therefore, authors who care about such a distinction should _not_
> rely on those "compatibility characters" to encode it, because they
> are _by definition_ equivalent to their unified counterparts, and a
> Unicode-conformant process is free to apply NFC (or NFD) normalization
> to the data on the understanding that this makes absolutely _no_
> semantic difference.

Right, no *semantic* difference.  If matching strings it makes complete
sense to be applying NFC normalization.  But in the context of glyph
selection, there is a requirement to preserve the visual distinction
between the codepoints.

The JIS X 0213-2004 standard includes 92 codepoints that will map to a
Unicode CJK Compatibility codepoint that will be decomposed to a unified
codepoint under NFC.  Of these, a number are included in Jinmeiyo kanji
which are used for registering names in Japan.  So for either displaying
JIS-encoded content or Japanese names correctly, decomposing the CJK
Compatibility codepoints will result in incorrect rendering, plain and
simple.  You can talk until you're blue in the face about semantic
equivalence, but if a name is required to be rendered with the glyph for
the CJK Compatibility codepoint, using the glyph for the unified
codepoint will not be considered correct.

The eventual hope is that this problem will be rectified through the use
of IVS selectors appended to the unified codepoint (which aren't canonically decomposed), but that doesn't completely resolve the issues that exist with legacy content.
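By contrast, variation sequences survive normalization, which is why IVS is the suggested long-term fix (the base/selector pair below is for illustration only; whether it is actually registered in the IVD has no effect on normalization):

```python
import unicodedata

# A unified ideograph followed by VS17 (U+E0100). Variation selectors
# have no canonical decomposition and take part in no composition, so
# both NFC and NFD leave the sequence intact.
ivs = "\u8279\U000E0100"
assert unicodedata.normalize("NFC", ivs) == ivs
assert unicodedata.normalize("NFD", ivs) == ivs
print("IVS survives NFC/NFD")
```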

> Arguably, normalizing these characters in the rendering engine is
> actually doing authors a favor, in that it increases the likelihood
> that they will notice that the distinction they may have intended to
> make is not actually present in their data, and therefore look for
> other, more robust ways to represent it.

Authors in Japan would certainly disagree.
(In reply to Jonathan Kew (:jfkthame) from comment #45)
> And therefore, authors who care about such a distinction should _not_ rely
> on those "compatibility characters" to encode it,
HTML5 is filled with a ton of sentences such as "Authors must not depend on XXX, but user agents must interpret XXX" due to legacy compat. It sometimes even violates other specs deliberately. CJK Compatibility Ideograph brokenness is exactly such a legacy-compat issue.
Eric Muller from Adobe says the Flash text engine does something similar for multi-character clusters, using a slightly modified definition of a cluster:

http://lists.w3.org/Archives/Public/www-style/2011Feb/0818.html
CJK Compatibility character test:
http://people.mozilla.org/~jdaggett/tests/cjkcompat.html

No user agent on either Windows or OS X does singleton normalization on fallback; all of them do appropriate fallback to fonts that support the CJK Compatibility glyphs.
(In reply to John Daggett (:jtd) from comment #46)
> Right, no *semantic* difference.  If matching strings it makes complete
> sense to be applying NFC normalization.  But in the context of glyph
> selection, there is a requirement to preserve the visual distinction
> between the codepoints.

For data that is being interchanged (e.g. from web server to browser) in Unicode, this violates conformance clause C6:

C6	A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct.
• The implications of this conformance clause are twofold. First, a process is never required to give different interpretations to two different, but canonical-equivalent character sequences. Second, no process can assume that another process will make a distinction between two different, but canonical-equivalent character sequences.

To repeat Ken Lunde's statement, "Any interchange of them [the CJK compatibility ideographs] is subject to data loss."

> The JIS X 0213-2004 standard includes 92 codepoints that will map to a
> Unicode CJK Compatibility codepoint that will be decomposed to a unified
> codepoint under NFC.

Given that JIS X 0213-2004 encodes these forms separately, it is a valid distinction for data *interchanged as JIS X 0213-2004*. And given that we'd render JIS X 0213-2004 pages by transcoding the text internally to Unicode, it's reasonable to expect that we should refrain from applying these decompositions, so that we can maintain the distinction that was present in the JIS data (although it would perhaps be better if we transcoded to IVS sequences, so that the distinction could be reliably maintained if the user - for example - copies text from a web page and pastes it into another program).

> The eventual hope is that this problem will be rectified through the use
> of IVS selectors appended to the unified codepoint (which aren't canonically
> decomposed) but that doesn't completely resolve the issues with legacy
> content that exist.

Legacy JIS-encoded content is a legitimate use case, and for this reason I'll agree that we should avoid these singleton decompositions during rendering. "Legacy Unicode content" that uses the CJK Compatibility ideographs - and "requires" that distinction to be maintained - is inherently unreliable for interchange purposes, and the sooner people understand and accept this (as the chances of getting Unicode normalization changed are essentially zero), and eradicate such data in favor of IVS sequences, the better.
Depends on: 734308
(In reply to Masatoshi Kimura [:emk] from comment #43)
> BTW, CJK Compatibility Ideograph brokenness was my first concern when I
> found bug 728866. But actually Nightly didn't break it. I don't know the
> reason why.
Compatibility Ideographs weren't decomposed because the cluster contains only one character.
https://mxr.mozilla.org/mozilla-central/source/gfx/harfbuzz/src/hb-ot-shape-normalize.cc#125
In theory Compatibility Ideographs can be followed by cluster extenders, but legacy JIS-encoded content is not likely to have such clusters.
Attached patch patch v3 (obsolete) — Splinter Review
I believe the current ClusterIterator is close enough to the definition of "extended grapheme cluster" in UAX #29. Any minor problems can be dealt with in follow-up bugs.

The current proposed algorithm is:
1. If the cluster consists of a single character, go to step 5.
 (This is an optimization for performance and legacy compatibility.)
2. Try the decomposed, then the precomposed form (if present) of the cluster for each font in the font list.
 (Note that this patch won't implement matching decomposed clusters until bug 733376 is fixed.)
3. If no font is found that supports all characters in the cluster, try matching the cluster against each font in the system font fallback list.
4. If no font is found, try the following steps for each character in the cluster.
5. Try matching the character against each font in the font list.
6. If no font is found that supports the character, try matching the character against each font in the system font fallback list.
7. If no font is found, indicate that the character cannot be displayed.
This algorithm can handle both diacritics and variation sequences.
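A rough sketch of those steps, with a hypothetical `supports()` predicate standing in for the cmap/UVS checks (it glosses over the decomposed-vs-precomposed detail of step 2):

```python
def match_cluster(cluster, font_list, system_fallback, supports):
    # Steps 1-3: for multi-character clusters, look for one font that
    # covers the whole cluster, first in the author's font list, then in
    # the system fallback list.
    if len(cluster) > 1:
        for fonts in (font_list, system_fallback):
            for font in fonts:
                if supports(font, cluster):
                    return [(cluster, font)]
    # Steps 4-7: fall back to per-character matching; None marks a
    # character that cannot be displayed (missing-glyph box).
    return [(ch, next((f for fonts in (font_list, system_fallback)
                       for f in fonts if supports(f, ch)), None))
            for ch in cluster]

# Toy cmaps as character sets; "supports" means every char is covered.
latin = {"A"}
latin_ext = {"A", "\u0308"}
covered = lambda font, text: all(ch in font for ch in text)
print(match_cluster("A\u0308", [latin, latin_ext], [], covered))
```

With these toy cmaps, the whole A + diaeresis cluster lands in the second font instead of being split across two.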

(In reply to Jonathan Kew (:jfkthame) from comment #15)
> Have you done any testing to see if there's any detectable impact on
> performance? (I hope there won't be, because the more complex codepaths will
> be taken relatively rarely.)
Here is a comparison of talos results with and without the patch.
http://perf.snarkfest.net/compare-talos/index.html?oldRevs=f898f3ed87a7&newRev=49ccfebe9f86&submit=true

> These "top" and "end" parameters/variables used throughout the patch need to
> be explained in a comment somewhere.
> 
> (It looks like "top" is a pointer to the UTF-16 code unit immediately
> *following* the initial character of the cluster (which is passed separately
> as aCh); and "end" points after the end of the cluster - right?)
> 
> Maybe the names could be made clearer? Something like
>   aCh -> aStartChar
>   aTop -> aNext
>   aEnd -> aLimit (or aEnd is ok, I guess)
Those parameters and variables are replaced with CharacterIterator.

> This logic is basically identical in gfxFontEntry::HasCluster and
> gfxFontFamily::HasCluster (and the static function in gfxPangoFonts.cpp),
> but it's not obvious how we can share the code between these in the current
> structure. Hmm. Can we do anything about that, do you think?
I have no idea how we can share the code without sacrificing performance, sorry.

> Is there a particular reason to introduce this instead of relying on the
> General Category to identify control chars?
I want to make the test as fast as possible. The IsControlChar() call is required even for the 8-bit code path, and the Unicode standard is unlikely to add any more general_category=Cc characters.

> Let's not "hide" this replacement inside NextChar(), but do it in the
> calling code - it's not logically a part of "get the next Unicode character
> from a UTF-16 buffer", but a special-case substitution that we do for
> font-matching purposes, so let's put it in the font-matching code where it
> belongs.
I put mozilla::gfx::CharacterIterator in gfxFontUtils.h so that it is not confused with a general-purpose Unicode character iterator.

Other nits are resolved.
Attachment #602727 - Attachment is obsolete: true
Attachment #602727 - Flags: review?(jfkthame)
Attached patch patch v3.1 (obsolete) — Splinter Review
>> Let's not "hide" this replacement inside NextChar(), but do it in the
>> calling code - it's not logically a part of "get the next Unicode character
>> from a UTF-16 buffer", but a special-case substitution that we do for
>> font-matching purposes, so let's put it in the font-matching code where it
>> belongs.
Eventually I moved it out of ClusterIterator, because U+00A0 can never be a cluster extender; checking for it there was a waste of CPU cycles.
Attachment #604794 - Attachment is obsolete: true
Attached patch patch v3.3 (obsolete) — Splinter Review
Rebased to tip
Attachment #604900 - Attachment is obsolete: true
Attached patch patch v4Splinter Review
- Rebased to tip.
- Separated out a method for cluster matching. The common case becomes faster because it no longer even checks for surrogates.
Now cluster matching is spec'ed.
Attachment #606006 - Attachment is obsolete: true
Attachment #703287 - Flags: review?(jfkthame)
Comment on attachment 703287 [details] [diff] [review]
patch v4

Review of attachment 703287 [details] [diff] [review]:
-----------------------------------------------------------------

I haven't been through this carefully yet, but one question up-front...

::: gfx/thebes/gfxFont.cpp
@@ +3992,5 @@
> +    const T *startPtr = aString + aScriptRunStart;
> +    uint32_t runLength = aScriptRunEnd - aScriptRunStart;
> +    if (sizeof(T) == sizeof(PRUnichar) &&
> +        MaybeHasComplexCluster(reinterpret_cast<const PRUnichar*>(startPtr),
> +                               runLength)) {

This will result in making an additional pass over the text, doing a number of comparisons on every character in the buffer, in the common case where no "complex cluster" is actually present (e.g. most extended Latin, Cyrillic, CJK, etc., etc.)

Have you tried profiling, or at least running talos jobs on tryserver, to see whether this has a significant performance impact? I'm concerned it may regress performance for virtually all text (except purely 8-bit), which seems a high price to pay for fixing the (relatively rare) cluster issues.
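For scale, the prescan amounts to something like the following per-unit range checks (the ranges here are illustrative, not the exact set the patch tests):

```python
def maybe_has_complex_cluster(utf16_units):
    # Single pass over UTF-16 code units: return True as soon as any
    # unit could begin or extend a multi-character cluster. Plain Latin,
    # Cyrillic or CJK text never matches, so it keeps the fast path.
    for u in utf16_units:
        if (0x0300 <= u <= 0x036F or    # combining diacritical marks
                u in (0x200C, 0x200D) or    # ZWNJ / ZWJ
                0xD800 <= u <= 0xDBFF or    # high surrogates (non-BMP, IVS)
                0xFE00 <= u <= 0xFE0F):     # variation selectors
            return True
    return False

print(maybe_has_complex_cluster([ord(c) for c in "plain text"]))  # False
```

The cost Jonathan describes is exactly this loop running over every character of every run, complex or not.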
(In reply to Jonathan Kew (:jfkthame) from comment #56)
> This will result in making an additional pass over the text, doing a number
> of comparisons on every character in the buffer, in the common case where no
> "complex cluster" is actually present (e.g. most extended Latin, Cyrillic,
> CJK, etc., etc.)

In exchange, this patch reduces the calls to IsJoinControl, IsJoinCauser, IsVarSelector and GetGeneralCategory, the building of non-BMP characters from surrogates, and the use of the aPrevCh parameter for each character.
I could reduce the number of comparisons even further by merging the Hangul Jamo Extended-B and high-surrogate ranges, but I'm trusting the compiler at the moment.

> Have you tried profiling, or at least running talos jobs on tryserver, to
> see whether this has a significant performance impact? I'm concerned it may
> regress performance for virtually all text (except purely 8-bit), which
> seems a high price to pay for fixing the (relatively rare) cluster issues.

http://perf.snarkfest.net/compare-talos/index.html?oldRevs=9a2924e7f42b&newRev=bd129c09eb39&submit=true (with separating the method)
http://perf.snarkfest.net/compare-talos/index.html?oldRevs=9a2924e7f42b&newRev=d197d4b30e2b&submit=true (without separating the method)
- Avoid two-pass scanning
- Make IsClusterExtender() inline
Oddly enough, this change caused a serious ts regression on Android Native.
http://perf.snarkfest.net/compare-talos/index.html?oldRevs=9a2924e7f42b&newRev=6c70f5ca6dd0&submit=true
Comment on attachment 703287 [details] [diff] [review]
patch v4

Review of attachment 703287 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry this is going so slowly! Here are a few more comments to consider, at least. Leaving r? as I haven't covered all of this yet...

Also, we should have jdaggett look at this, particularly in relation to his CSS3 Fonts draft, which goes into some detail about cluster-matching.

::: gfx/thebes/gfxFont.cpp
@@ +171,5 @@
> +        uint32_t prevCh = ch;
> +        ch = aCluster.Next();
> +        if (gfxFontUtils::IsVarSelector(ch)) {
> +            if (!GetUVSGlyph(prevCh, ch)) {
> +                return false;

This would lead to returning FALSE for any cluster that has multiple UVS codepoints. That's probably not a good thing; extra UVS codes should be ignored (I think this would best match Unicode expectations, and it's also what the current CSS3 Fonts ED says).

@@ +180,5 @@
> +            if (!HasCharacter(ch) &&
> +                !gfxFontUtils::IsNonJoiner(ch) &&
> +                (!gfxFontUtils::IsJoinCauser(ch) ||
> +                 IsArabicOrIndic(aScript))) {
> +                return false;

I'm not sure why ZWNJ and ZWJ are treated differently here; they should both be default-ignorable codepoints that need not be explicitly supported by the font, AFAIK.

@@ +1159,5 @@
>  }
>  
> +bool
> +gfxFontFamily::HasCluster(CharacterIterator aCluster,
> +                          int32_t aScript)

Same comments as for gfxFontEntry::HasCluster.

(I wish we could somehow refactor these two functions so as to have a single copy of the logic, but it's a bit tricky as they handle their char maps differently.)

@@ +4131,5 @@
> +    }
> +
> +    // Note that we don't need the actual runScript in matchData for
> +    // gfxFontFamily::SearchAllFontsForChar, it's only used for the
> +    // system-fallback case. So we can just set it to 0 here.

This isn't true for SearchAllFontsForCluster, given how HasCluster is currently written - it *does* look at the runScript. Though I'm not sure whether it really needs to (see earlier comment).

@@ +4412,5 @@
> +                    mNormalizer = new nsUnicodeNormalizer();
> +                }
> +                mNormalizer->NormalizeUnicodeNFC(nsDependentSubstring(
> +                                                     aString + origI, iter),
> +                                                 normalizedCluster);

If I'm understanding correctly, this means we'll *only* be checking for the presence of the precomposed form, not for the existence of all the components of its decomposed equivalent. That may be a problem, as it's possible for a font to support a combining character sequence (for which a precomposed form also exists in Unicode), yet not have an explicit precomposed character.

I think we should probably first check for the character sequence in the form it existed in the document, and only try normalizing if the original sequence was not supported.

Oh, and another thing: nsUnicodeNormalizer is badly out of date; see bug 728180. If we're going to use it to support font-matching/text-shaping, we need to deal with that somehow.

::: intl/unicharutil/util/nsUnicodeProperties.cpp
@@ +313,5 @@
>  ClusterIterator::Next()
>  {
>      if (AtEnd()) {
>          NS_WARNING("ClusterIterator has already reached the end");
> +        return 0xFFFF;

This seems a bit arbitrary. I wonder if we should have an assertion (rather than just a warning) here, to check that we don't ever call this inappropriately.

::: intl/unicharutil/util/nsUnicodeProperties.h
@@ +6,5 @@
>  #ifndef NS_UNICODEPROPERTIES_H
>  #define NS_UNICODEPROPERTIES_H
>  
>  #include "nsBidiUtils.h"
> +#include "nsCharTraits.h"

Just curious, why did this need to be put in the header?
Attachment #703287 - Flags: feedback?(jdaggett)
Depends on: 728180
(In reply to Jonathan Kew (:jfkthame) from comment #59)
> ::: gfx/thebes/gfxFont.cpp
> @@ +171,5 @@
> > +        uint32_t prevCh = ch;
> > +        ch = aCluster.Next();
> > +        if (gfxFontUtils::IsVarSelector(ch)) {
> > +            if (!GetUVSGlyph(prevCh, ch)) {
> > +                return false;
> 
> This would lead to returning FALSE for any cluster that has multiple UVS
> codepoints. That's probably not a good thing; extra UVS codes should be
> ignored (I think this would best match Unicode expectations, and it's also
> what the current CSS3 Fonts ED says).

Sure, will fix.

> @@ +180,5 @@
> > +            if (!HasCharacter(ch) &&
> > +                !gfxFontUtils::IsNonJoiner(ch) &&
> > +                (!gfxFontUtils::IsJoinCauser(ch) ||
> > +                 IsArabicOrIndic(aScript))) {
> > +                return false;
> 
> I'm not sure why ZWNJ and ZWJ are treated differently here; they should both
> be default-ignorable codepoints that need not be explicitly supported by the
> font, AFAIK.

IIRC I added that to fix test failures about complex text. I'll reconfirm it later.

> @@ +4131,5 @@
> > +    }
> > +
> > +    // Note that we don't need the actual runScript in matchData for
> > +    // gfxFontFamily::SearchAllFontsForChar, it's only used for the
> > +    // system-fallback case. So we can just set it to 0 here.
> 
> This isn't true for SearchAllFontsForCluster, given how HasCluster is
> currently written - it *does* look at the runScript. Though I'm not sure
> whether it really needs to (see earlier comment).

I'll either fix the comment or the code.

> @@ +4412,5 @@
> > +                    mNormalizer = new nsUnicodeNormalizer();
> > +                }
> > +                mNormalizer->NormalizeUnicodeNFC(nsDependentSubstring(
> > +                                                     aString + origI, iter),
> > +                                                 normalizedCluster);
> 
> If I'm understanding correctly, this means we'll *only* be checking for the
> presence of the precomposed form, not for the existence of all the
> components of its decomposed equivalent. That may be a problem, as it's
> possible for a font to support a combining character sequence (for which a
> precomposed form also exists in Unicode), yet not have an explicit
> precomposed character.
> 
> I think we should probably first check for the character sequence in the
> form it existed in the document, and only try normalizing if the original
> sequence was not supported.

That's bug 733376. I'm not going to fix that in this bug. It's already broken on the current Nightly, so it is not a regression, at least.

> Oh, and another thing: nsUnicodeNormalizer is badly out of date; see bug
> 728180. If we're going to use it to support font-matching/text-shaping, we
> need to deal with that somehow.

According to past discussions, we should add an API to HarfBuzz rather than use a general-purpose normalization API. nsUnicodeNormalizer is just an approximation until that API has been added. I'm not going to fix all edge cases in this bug (unless the API is already available).

> ::: intl/unicharutil/util/nsUnicodeProperties.cpp
> @@ +313,5 @@
> >  ClusterIterator::Next()
> >  {
> >      if (AtEnd()) {
> >          NS_WARNING("ClusterIterator has already reached the end");
> > +        return 0xFFFF;
> 
> This seems a bit arbitrary. I wonder if we should have an assertion (rather
> than just a warning) here, to check that we don't ever call this
> inappropriately.

Will do.

> ::: intl/unicharutil/util/nsUnicodeProperties.h
> @@ +6,5 @@
> >  #ifndef NS_UNICODEPROPERTIES_H
> >  #define NS_UNICODEPROPERTIES_H
> >  
> >  #include "nsBidiUtils.h"
> > +#include "nsCharTraits.h"
> 
> Just curious, why did this need to be put in the header?

Maybe for surrogate macros.
Depends on: 836197
(In reply to Masatoshi Kimura [:emk] from comment #60)
> > @@ +180,5 @@
> > > +            if (!HasCharacter(ch) &&
> > > +                !gfxFontUtils::IsNonJoiner(ch) &&
> > > +                (!gfxFontUtils::IsJoinCauser(ch) ||
> > > +                 IsArabicOrIndic(aScript))) {
> > > +                return false;
> > 
> > I'm not sure why ZWNJ and ZWJ are treated differently here; they should both
> > be default-ignorable codepoints that need not be explicitly supported by the
> > font, AFAIK.
> 
> IIRC I added that to fix test failures about complex text. I'll reconfirm it
> later.

It was layout/reftests/bugs/619511-1.html (and I found I wrote it in comment #5).
The font will need a glyph for ZWJ, even if it is empty, because otherwise glyph substitution and/or positioning will not work. We can't ignore ZWJ as long as we are trying cluster matching. (Of course, if no font contains ZWJ, we can fall back to character-based matching, ignoring ZWJ.)
Regarding ZWNJ, I'm treating it special because it is ignorable. Otherwise font-switching will occur when the font does not contain the ZWJ explicitly.
(In reply to Masatoshi Kimura [:emk] from comment #62)
> when the font does not contain the ZWJ explicitly.
when the font does not contain the ZWNJ explicitly.
layout/reftests/bugs/534919-1.html, layout/reftests/text-overflow/marker-basic.html, and layout/reftests/font-features/font-features-turkish.html fail without special-casing ZWNJ.
https://tbpl.mozilla.org/?tree=Try&rev=ec40423f038f
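The policy described in the quoted patch hunk and the comments above can be sketched in isolation. This is a hypothetical, self-contained illustration (the types and function names are invented, not Gecko's): ZWNJ is always treated as ignorable, while ZWJ must actually be in the font's cmap for Arabic/Indic scripts, because shaping rules there may key off a real ZWJ glyph.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

constexpr uint32_t kZWNJ = 0x200C;  // ZERO WIDTH NON-JOINER
constexpr uint32_t kZWJ  = 0x200D;  // ZERO WIDTH JOINER

// Stand-in for a font's character map (invented for this sketch).
struct FakeFont {
    std::set<uint32_t> cmap;
    bool HasCharacter(uint32_t ch) const { return cmap.count(ch) != 0; }
};

// True if this font may be used for `ch` during cluster-based matching.
// Mirrors the logic quoted above: unsupported characters reject the font
// unless they are ZWNJ (always ignorable) or ZWJ in a non-joining script.
bool SupportsForCluster(const FakeFont& font, uint32_t ch,
                        bool isJoiningScript) {
    if (font.HasCharacter(ch)) {
        return true;   // directly supported by the cmap
    }
    if (ch == kZWNJ) {
        return true;   // never forces a font switch
    }
    if (ch == kZWJ && !isJoiningScript) {
        return true;   // ZWJ only matters where shaping needs its glyph
    }
    return false;
}
```

With this shape, a Latin font lacking ZWJ would still be accepted for a cluster containing ZWJ in a non-joining script, but rejected for an Arabic cluster, matching the reftest behavior discussed above.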
Comment on attachment 703287 [details] [diff] [review]
patch v4

Sorry this has been waiting so long... clearing the r? for now, as rebasing/updating (at least) will be needed if we're going to move forward with this.

John, any feedback in relation to CSS Fonts, the matching algorithm, etc, would still be appreciated.

I did try to update this patch and pushed a tryserver job to see if it would help with bug 923007 (see https://bugzilla.mozilla.org/show_bug.cgi?id=923007#c13), but it sounds like that didn't work as I hoped. I still think that in theory it ought to be useful, but I don't currently have the right system on hand to try and debug that further.

Kimura-san, do you think you'll have time to look into this again and update it for current trunk, investigate the connection with bug 923007, etc?
Attachment #703287 - Flags: review?(jfkthame)
I still intend to update the patch, but I'm unassigning until I can resume the work. Feel free to take it if you have time before then.
It is unlikely that I can investigate bug 923007 because I have no Mac machines.
Assignee: VYV03354 → nobody
Blocks: 1107568
Blocks: 1162921
I was under the impression that this is already done.  John, Jonathan, can you please advise what is done and what is remaining?  Thanks.
(In reply to Behdad Esfahbod from comment #67)
> I was under the impression that this is already done.  John, Jonathan, can
> you please advise what is done and what is remaining?  Thanks.

We try to ensure that any combining marks are included in the same font run as the base character, provided they're supported by the font (so a font-fallback choice for the base letter "propagates" to its marks); but we don't currently handle cases where the combining marks aren't supported in their own right by the font, yet harfbuzz would be able to render them via canonical composition at the character level.

We really should resurrect and update :emk's patch here and try to get it landed, as this is an issue that people keep hitting.
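The strategy being proposed can be sketched as follows. This is a hypothetical illustration (invented names, not Gecko's actual font-matching code): for each grapheme cluster, first look for a single font in the fallback list that covers the whole cluster; only when none does would the caller fall back to per-character matching, which is what produces the "ransom note" effect on combining marks.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

// Stand-in for a font's character map (invented for this sketch).
struct Font {
    std::set<uint32_t> cmap;
    bool Has(uint32_t ch) const { return cmap.count(ch) != 0; }
};

// A grapheme cluster: base character followed by its combining marks.
using Cluster = std::vector<uint32_t>;

// Return the index of the first font covering the entire cluster,
// or nullopt if no single font does (caller then matches per character).
std::optional<size_t> FindFontForCluster(const std::vector<Font>& fonts,
                                         const Cluster& cluster) {
    for (size_t i = 0; i < fonts.size(); ++i) {
        bool coversAll = true;
        for (uint32_t ch : cluster) {
            if (!fonts[i].Has(ch)) {
                coversAll = false;
                break;
            }
        }
        if (coversAll) {
            return i;  // whole cluster rendered with one font
        }
    }
    return std::nullopt;
}
```

Note this sketch only checks cmap coverage; as the comments above point out, a real implementation also has to handle ZWJ/ZWNJ and characters a shaper could produce via canonical composition even though they are absent from the cmap.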
Thanks Jonathan.  When reconsidering this, please also consider a harfbuzz-driven font selection model.  I'll try to hash out the details of that later this month.  Will report.
Attachment #703287 - Flags: feedback?(jd.bugzilla)
Blocks: 1269484
(In reply to Behdad Esfahbod from comment #69)
> Thanks Jonathan.  When reconsidering this, please also consider a
> harfbuzz-driven font selection model.  I'll try to hash out the details of
> that later this month.  Will report.

For the record, this has been implemented and shipping in Chrome for a while now.  The Chrome team is very happy with it.
Is there any progress on this?
See Also: → 1425711
No assignee, updating the status.
Status: ASSIGNED → NEW
Blocks: 1513907
Blocks: 1544488
Blocks: 1551809
See Also: → 1608548
See Also: → 1258613