Closed Bug 831548 Opened 11 years ago Closed 11 years ago

characters that map explicitly to .notdef should not be considered supported by the font [was: Using wrong hyphen U+2010 for auto-hyphenation and failing to fall back to other font]

Categories

(Core :: Layout: Text and Fonts, defect)

18 Branch
x86
macOS
defect
Not set
normal

RESOLVED FIXED
mozilla21

People

(Reporter: tphinney, Assigned: jfkthame)

Details

Attachments

(2 files)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11

Steps to reproduce:

Viewed test page at: http://test.extensis.com.s3.amazonaws.com/wesTest/css_hyphentest_forff.html

The page uses CSS3 automatic hyphenation. It shows test cases in which fonts have different hyphen characters available. The possibilities are:
- U+002D the standard ASCII hyphen-minus
- U+00AD the "soft hyphen" control code
- U+2010 the distinct hyphen character that is not doing double duty as a minus sign.

The test cases are:
1) Font has all three of U+002D, U+00AD and U+2010
2) Font has U+002D, U+00AD only
3) Font has U+002D
4) Font has none of the above


Actual results:

Firefox 18.0 Mac: the first case hyphenates nicely. The remainder all show notdefs instead of hyphens.

When doing CSS3 hyphenation, Firefox is using the Unicode codepoint U+2010 to specify a hyphen to be displayed. (This can also be seen by using the wp-typography plugin on a WordPress site… don't know if that is using CSS3 behind the scenes or what.)

That's fine for many system fonts which often have rather large character sets. But when it comes to using arbitrary fonts as webfonts, it creates a bit of a problem. U+2010 is indeed an unambiguous hyphen, but it is not in the most common/basic character sets, such as MacRoman or WinANSI.

To give you an idea of its non-ubiquity, of the 5600 fonts in the WebINK library, only 29% of them have U+2010. At least half of those are from a single foundry, Adobe.

The problem is made much worse by the fact that when U+2010 is unavailable in the current font, instead of getting the right character in a fallback font, a notdef is displayed by Firefox.

I assume that other apps just use U+002D, which is a basic ASCII character. Or perhaps they start with U+2010, but manage to fall back to U+002D if U+2010 is unavailable, or at worst fall back to another font?

(It is also possible to use the soft hyphen U+00AD, but as best as I understand it, that's a control character to be used on the authoring side, and not something you should be using at the point where you wish to create pixels on screen for display.)

I've attached a screen grab of the offending behavior, as the test page may not be up forever.


Expected results:

I would expect the same result as Safari. In Safari 5.1.7 Mac: all cases hyphenate acceptably. In the fourth and last case a fallback font is used to display the hyphen.

One could argue there are actually two bugs in Firefox: (1) not using the right hyphen in the font, and (2) showing a notdef instead of using a fallback font when the desired character is missing. However, if (1) is fixed, users will almost never encounter (2).
Note that we have also verified this issue on Windows previously, just not with this latest test page.

This is related but not identical to https://bugzilla.mozilla.org/show_bug.cgi?id=476378 that was fixed several years back.
Works for me on Firefox 17 and 21 on Linux. CCing some font devs. Does this happen on a Nightly build? Download from http://nightly.mozilla.org
Component: Untriaged → Layout: Text
Product: Firefox → Core
It works on Linux, but fails on OS X. (Note that the whole font-matching process on Linux is very different to other platforms, because of the involvement of fontconfig. So it's not surprising an issue like this would differ. I'd expect Windows and Android to behave like OS X in a case like this, probably.)

Thomas, any chance you could post a testcase in a form where the subsetted fonts involved are loaded from simple URLs that can also be downloaded separately for inspection? The webink URLs make that a bit awkward without some extra hackery...
Status: UNCONFIRMED → NEW
Ever confirmed: true
Yup, reproduces on the mobile platform as well (tested on b2g).
Curiouser and curiouser..... although this reproduces on B2G (unagi), it *doesn't* reproduce on my HTC Desire HD phone (stock device running android 2.2). I wonder why; they should be using essentially the same font code (the gfxFT2* backend).

Now I'm more puzzled. Would still like a testcase with separately-downloadable fonts, though.
Jonathan, maybe I'm not understanding your request for separately downloadable fonts. Can't you just use firebug to capture/dump the downloaded fonts for inspection?
Hmm, maybe that'd work - I don't normally have firebug installed, but I could take a look. (I was hoping for a simple URL I could just grab with wget or curl...)
I created a repro case that uses static fonts instead of the WebINK service: http://test.extensis.com.s3.amazonaws.com/wesTest/FirefoxRepro/hyphentest.html.

You should be able to download the 4 test fonts directly via wget, curl, etc.

Let me know if I can do anything else to help.
Thanks, that'll make investigation easier - I'll try to take a look shortly (been distracted by other issues, sorry!)
OK, now I understand what's going on. These fonts are a bit unusual in how they're constructed (as an artifact of a particular subsetting process, no doubt), and as a result they trip over an issue in the Firefox code that determines which characters are supported by the font. The result is that Firefox believes U+2010 is supported, and so it doesn't fall back to an alternative font.

Note that this is -not- actually specific to hyphenation; it just happens that U+2010 is one of the affected character codes in these fonts. If you try a sample that includes certain other characters such as schwa (Ə, ə) or the fraction slash (⁄), you'll see the font's .notdef for these as well.

What's triggering the problem is that the 'cmap' includes explicit mappings for characters (such as U+2010) that have been omitted from the subsetted font; rather than the cmap entries being -removed-, they have been modified to map to glyph ID 0 (.notdef), but are still present. The code in Firefox that determines the character coverage of the font - and hence whether fallback is used - checks for the -presence- of a cmap mapping for the character but does not verify that it maps to a non-zero glyph ID; so it concludes that the font supports U+2010, etc.
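As a rough illustration of the difference between the two checks, here is a minimal Python sketch. It is not the Firefox code: the function names, the font list, and the glyph IDs are all hypothetical, and a cmap is modelled simply as a dict from code point to glyph ID.

```python
NOTDEF = 0  # glyph ID 0 is .notdef by convention


def covered_by_presence(cmap):
    """Naive coverage: any code point with a cmap entry counts as supported."""
    return set(cmap)


def covered_by_glyph(cmap):
    """Stricter coverage: only code points mapped to a non-zero glyph ID."""
    return {cp for cp, gid in cmap.items() if gid != NOTDEF}


def pick_font(ch, fonts, coverage):
    """Return the name of the first font whose coverage includes ch, else None."""
    cp = ord(ch)
    for name, cmap in fonts:
        if cp in coverage(cmap):
            return name
    return None


# Hypothetical data: a subsetted font that still maps U+2010 to glyph 0,
# followed by a fallback font that has a real glyph for it.
subset_font = {0x0041: 3, 0x002D: 7, 0x2010: 0}
system_font = {0x0041: 36, 0x002D: 16, 0x2010: 540}
fonts = [("subset", subset_font), ("system", system_font)]

print(pick_font("\u2010", fonts, covered_by_presence))  # subset font chosen, .notdef drawn
print(pick_font("\u2010", fonts, covered_by_glyph))     # fallback font chosen
```

With the presence-only check the subsetted font "wins" for U+2010 and its .notdef glyph is drawn; with the stricter check the search moves on to the next font.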

This only seems to affect the (relatively few) cases where an "isolated" cmap entry has been replaced by a mapping to 0, and the cmap subtable uses a single-character range with idRangeOffset=0, and an idDelta equal to the negative of the character code (resulting in a mapped glyph of zero). Where a contiguous range of characters (such as the entire Cyrillic alphabet) has been deleted, the subtable uses a non-zero idRangeOffset pointing into the glyphIdArray; subsetting will then have zeroed the relevant positions in the glyphIdArray, and Firefox handles that scenario correctly.
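The arithmetic behind that case can be sketched in Python as follows. This is a simplified model of one format-4 segment, not a full cmap parser: the function names are made up for illustration, and the glyphIdArray is passed as a per-segment slice rather than located through the byte offset that a real table uses.

```python
def format4_glyph(code, start, end, id_delta, id_range_offset, glyph_ids=()):
    """Map one code point through a single format-4 segment; 0 means .notdef.

    glyph_ids is this segment's slice of glyphIdArray (a simplification).
    """
    if not (start <= code <= end):
        return 0
    if id_range_offset == 0:
        # Glyph ID is computed directly, modulo 65536.
        return (code + id_delta) % 65536
    gid = glyph_ids[code - start]
    # A zero in glyphIdArray means "missing"; idDelta is not applied to it.
    return (gid + id_delta) % 65536 if gid != 0 else 0


def segment_coverage(start, end, id_delta, id_range_offset, glyph_ids=()):
    """Code points in the segment that map to a non-zero glyph ID."""
    return [
        c for c in range(start, end + 1)
        if format4_glyph(c, start, end, id_delta, id_range_offset, glyph_ids) != 0
    ]


# The problem case: a one-character segment for U+2010 with idRangeOffset=0
# and idDelta equal to the negative of the character code.
print(format4_glyph(0x2010, 0x2010, 0x2010, -0x2010, 0))

# The case that was already handled: a range whose glyphIdArray entries
# were zeroed by subsetting (values here are invented).
print(segment_coverage(0x0410, 0x0412, 0, 4, glyph_ids=(0, 50, 0)))
```

In both cases the lookup ends at glyph 0; the difference is only in whether the segment's mere presence is mistaken for support.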

I'll post a patch to update our cmap-analysis code so that these characters that map to glyph ID zero will -not- be considered "supported", so that we'll handle such fonts better in future.

Meanwhile, to work around the problem, what you can do on the font-development side is to ensure that when characters are omitted from the font's repertoire, the corresponding cmap entries are actually -removed-, not just updated to map the character to .notdef. Note that this will also reduce the size of the font file, so it's a desirable optimization anyway. Then Firefox won't believe that these characters are still supported, fallback will kick in, and we'll all be happy. :)
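On the font side, the workaround amounts to something like the following Python sketch. It assumes the cmap is available as a plain dict from code point to glyph name, and the helper name is hypothetical; a real subsetter would apply the same idea to each cmap subtable before writing the font.

```python
def prune_notdef_mappings(cmap, notdef_name=".notdef"):
    """Return a copy of cmap without entries that point at the .notdef glyph."""
    return {cp: name for cp, name in cmap.items() if name != notdef_name}


# Hypothetical subset output: U+2010 and U+0259 were dropped from the
# repertoire but their cmap entries were redirected instead of removed.
cmap = {0x0041: "A", 0x002D: "hyphen", 0x2010: ".notdef", 0x0259: ".notdef"}
pruned = prune_notdef_mappings(cmap)
print(sorted(hex(cp) for cp in pruned))
```

After pruning, a presence-only coverage check and a glyph-ID check agree on which characters the font supports.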

(I'd recommend going further, actually: I notice that the fonts include a large number of glyphs - corresponding to the omitted characters - whose outline data has been deleted, but that are still contributing size to the 'loca', 'hmtx', 'post', and OpenType layout tables. By completely deleting characters and corresponding glyphs in the subsetting operation, rather than just deleting the actual outlines and setting the cmap entries to point at .notdef, the font files could be made a great deal smaller.)
Assignee: nobody → jfkthame
Summary: Using wrong hyphen U+2010 for auto-hyphenation and failing to fall back to other font → characters that map explicitly to .notdef should not be considered supported by the font [was: Using wrong hyphen U+2010 for auto-hyphenation and failing to fall back to other font]
Comment on attachment 704335 [details] [diff] [review]
character codes with cmap mappings that result in glyph id 0 should not be included in the font's character map.

Looks fine but why the difference in behavior across Linux/OSX/Android, since this is platform-generic code?
Attachment #704335 - Flags: review?(jdaggett) → review+
Jonathan, and John:

I appreciate that y'all are willing to try to fix this on the Firefox end. But based on your discovery, I would classify this as a font bug and not a Firefox bug. These entries for deleted glyphs should NOT be in the cmap. That's just wrong!  :/

We are aware of the other limitations Jonathan describes in the current subsetting code. We are hoping to improve things over time in this area, but there are some immense performance advantages to be had if we do not remove GIDs at all....
(In reply to John Daggett (:jtd) from comment #12)
> Comment on attachment 704335 [details] [diff] [review]
> character codes with cmap mappings that result in glyph id 0 should not be
> included in the font's character map.
> 
> Looks fine but why the difference in behavior across Linux/OSX/Android,
> since this is platform-generic code?

It makes sense that the issue wouldn't appear on Linux, since font-matching there depends on fontconfig (gfxPangoFontGroup has its own version of FindFontForChar). It regards a character as missing if the cmap returns glyph ID 0, regardless of how that arose.

I don't know why it failed to reproduce for me on Android (although it did on b2g) -- I'd have expected to see the same behavior there. Possibly I did something wrong in my testing.
(In reply to Thomas Phinney from comment #13)
> Jonathan, and John:
> 
> I appreciate that y'all are willing to try to fix this on the Firefox end.
> But based on your discovery, I would classify this as a font bug and not a
> Firefox bug. These entries for deleted glyphs should NOT be in the cmap.
> That's just wrong!  :/

I considered describing this as a font bug, rather than just an "unusual" aspect of how the fonts are constructed, but it's not clear to me that it is actually -wrong- rather than just inefficient to explicitly map characters to .notdef rather than omit them from the cmap. (In the case of an old Mac font with a format-0 cmap subtable, that would be the natural thing to do for any missing characters in the charset.)

Moreover, you're not the only people to have used this kind of "poor man's subsetting" workflow (if I may call it that!), where glyph outlines are removed but the rest of the font is left largely undisturbed. So I wouldn't be all that surprised to see other people run into the same issue.

> We are aware of the other limitations Jonathan describes in the current
> subsetting code. We are hoping to improve things over time in this area, but
> there are some immense performance advantages to be had if we do not remove
> GIDs at all....

You're thinking of the fact that you don't need to rebuild potentially complex OpenType tables, I guess. Yes, I can certainly appreciate that.
Jonathan & John,

Thanks to both of you for your sleuthing on this one. I have fixed the bug in our subsetting that was resulting in the cmap entries pointing to .notdef.

BTW: We have two essentially different subsetting paths based on the requesting browser's ability to handle OpenType features. When stripping OpenType features, we remove unneeded glyphs and renumber the remaining ones. If, however, we are leaving OpenType features intact, then we take the computationally less intensive route and just remove the outlines from the unneeded glyphs.
But even in the second case, we remove the entries from the cmap. Leastways, that was the plan! And now more consistently the reality as well.
https://hg.mozilla.org/mozilla-central/rev/c681bf531dcd
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED