Closed Bug 1238243 Opened 8 years ago Closed 8 years ago

Some combined Hangul Jamos overlay with preceding character

Categories

(Core :: Layout: Text and Fonts, defect)

43 Branch
Unspecified
Windows 10
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox42 --- wontfix
firefox43 --- wontfix
firefox44 --- wontfix
firefox45 --- wontfix
firefox46 --- wontfix
firefox47 --- wontfix
firefox48 --- fixed

People

(Reporter: shanshandehongxing, Assigned: jfkthame)

References

Details

(Keywords: regression)

Attachments

(3 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:43.0) Gecko/20100101 Firefox/43.0
Build ID: 20151223140742

Steps to reproduce:

Chinese, Japanese and Korean Wikisource has taxts for Hunmin Jeongum (훈민정음, 訓民正音)
https://zh.wikisource.org/wiki/%E8%A8%93%E6%B0%91%E6%AD%A3%E9%9F%B3
https://ja.wikisource.org/wiki/%E8%A8%93%E6%B0%91%E6%AD%A3%E9%9F%B3
https://ko.wikisource.org/wiki/%ED%9B%88%EB%AF%BC%EC%A0%95%EC%9D%8C

In these pages some combined Hangul Jamos looks overlay with preceding character in some sentences as "ㄱ與ᅟᅮᆫ而爲군" and "ㆁ與ᅟᅥᆸ而爲ᅌᅥᆸ". What's wrong?
OS: Unspecified → Windows Phone
OS: Windows Phone → Windows 10
Can you provide a screenshot shows what happened?
Flags: needinfo?(shanshandehongxing)
Attached file Screenshots.zip
I take screenshots for all these pages, you can see what I have seen.
Flags: needinfo?(shanshandehongxing)
I see it now, the combination of characters superimposed incorrect.


18:47.39 LOG: MainThread main INFO Oh noes, no (more) inbound revisions :(
18:47.39 LOG: MainThread Bisector INFO Last good revision: f90e57c732b03c5aba62dbebbfe6511438f20db4
18:47.39 LOG: MainThread Bisector INFO First bad revision: a6f105aba8d22e793d853bced35605258b4fa936
18:47.40 LOG: MainThread Bisector INFO Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=f90e57c732b03c5aba62dbebbfe6511438f20db4&tochange=a6f105aba8d22e793d853bced35605258b4fa936
Blocks: 729993
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: regression
Jonathan, can you look into this regression? Your patches appear to be in the regression window.
Flags: needinfo?(jfkthame)
Huh, I see what's happening... some fonts, and in particular Malgun Gothic (which is a standard font on Windows), deal with sequences of combining jamos by positioning individual glyphs rather than ligating them to precomposed forms. That's normally fine, but it leads to a problem when one of the characters is an invisible "filler" character such as U+115F, because those are categorized as "default-ignorable" by Unicode, and we filter those out. But this font is using that character to carry the width of the whole cluster of combining jamos, and the following jamos are zero-width. So we mustn't discard it, even though it has no appearance of its own.
Flags: needinfo?(jfkthame)
Assignee: nobody → jfkthame
Status: NEW → ASSIGNED
(And bug 729993 regressed our behavior here because prior to that, a sequence of combining jamos was being treated as a single cluster, and so not eligible for filtering out as default-ignorable. That only started happening once we moved to finer-grained clustering, where each of the jamos remains individually addressable within the textrun.)
Comment on attachment 8731413 [details] [diff] [review]
Don't filter out Hangul jamo fillers as 'ignorable', because the font may require them to provide advance width

Review of attachment 8731413 [details] [diff] [review]:
-----------------------------------------------------------------

This doesn't look like the right way to fix. It would violate Unicode's guide about ignoring characters. More specifically, these would regress some (invalid) cases like: data:text/plain;charset=UTF-8,a%E1%85%9Fb (In current version, the filler between a and b is ignored, but with this patch, it is rendered as a blank.)

The comment above the callsite of FilterIfIgnorable says we want to remove default-ignorable characters which are not combined [1]. Could we probably find a way to know exactly whether the filler has been combined with characters follow it?

It seems to me that, when we ignore the character, ginfo[glyphStart].mask and ginfo[glyphStart].codepoint are both different for different cases. On my machine (which uses "Malgun Gothic" to render), when the ᅟ is combined, the codepoint varies, but the mask is always 3, while if it is not combined, the codepoint seems to be 0x0C04 all the time, and the mask is always 1. I don't know the exact meaning of those numbers (and couldn't find any related document), but I guess they might be useful somehow.

[1] https://dxr.mozilla.org/mozilla-central/rev/341344bdec8f10bf50646cd6ef2355361435cbf6/gfx/thebes/gfxHarfBuzzShaper.cpp#1718-1720
Attachment #8731413 - Flags: review?(quanxunzhen)
(In reply to Xidorn Quan [:xidorn] (UTC+8) from comment #8)
> Comment on attachment 8731413 [details] [diff] [review]
> Don't filter out Hangul jamo fillers as 'ignorable', because the font may
> require them to provide advance width
> 
> Review of attachment 8731413 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This doesn't look like the right way to fix. It would violate Unicode's
> guide about ignoring characters. More specifically, these would regress some
> (invalid) cases like: data:text/plain;charset=UTF-8,a%E1%85%9Fb (In current
> version, the filler between a and b is ignored, but with this patch, it is
> rendered as a blank.)
> 
> The comment above the callsite of FilterIfIgnorable says we want to remove
> default-ignorable characters which are not combined [1]. Could we probably
> find a way to know exactly whether the filler has been combined with
> characters follow it?

I don't think we can, because it hasn't actually been "combined" -- the following chars have been substituted with forms that will overstrike the blank space created by the filler, but that's a purely visual "combination", not something we can reliably recognize in code.

> 
> It seems to me that, when we ignore the character, ginfo[glyphStart].mask
> and ginfo[glyphStart].codepoint are both different for different cases. On
> my machine (which uses "Malgun Gothic" to render), when the ᅟ is
> combined, the codepoint varies, but the mask is always 3, while if it is not
> combined, the codepoint seems to be 0x0C04 all the time, and the mask is
> always 1. I don't know the exact meaning of those numbers (and couldn't find
> any related document), but I guess they might be useful somehow.

The "mask" value is an internal harfbuzz implementation detail (it relates to which features are being applied, but the actual value is not meaningful to us here). The codepoint (glyph ID) changes because Malgun is substituting the glyph (perhaps for one with a different width? perhaps just as an internal implementation detail of the font, so other lookups can recognize what's happening?), but again, this is an internal detail that isn't useful or reliable for our purposes here. Any solution along these lines would be specific to this particular font's implementation.


(FWIW, there was some discussion related to this on the Unicode mailing list a while back:
  http://unicode.org/mail-arch/unicode-ml/y2013-m03/0093.html
though it's not clear that there's a definitive answer there as to what renderers should do.)


What I suggest we can do, to address your concern with examples like data:text/plain;charset=UTF-8,a%E1%85%9Fb, is to further restrict the patch here so that it only applies when there are following characters within the same grapheme cluster, as that's the "problem" case where the font may be depending on the default-ignorable letter's glyph to provide advance width for the rest of the cluster.
Attachment #8731413 - Attachment is obsolete: true
Comment on attachment 8731643 [details] [diff] [review]
Don't filter out Hangul jamo fillers as 'ignorable', because the font may require them to provide advance width

Review of attachment 8731643 [details] [diff] [review]:
-----------------------------------------------------------------

This probably works... Given this patch, I wonder whether we can just check if the next glyph is a non-cluster-start.
Attachment #8731643 - Flags: review?(quanxunzhen) → review+
(In reply to Xidorn Quan [:xidorn] (UTC+8) from comment #11)
> Comment on attachment 8731643 [details] [diff] [review]
> Don't filter out Hangul jamo fillers as 'ignorable', because the font may
> require them to provide advance width
> 
> Review of attachment 8731643 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This probably works... Given this patch, I wonder whether we can just check
> if the next glyph is a non-cluster-start.

That could be an issue for a bunch of the other default-ignorables which have combining-mark categories themselves, and may occur in the midst of a cluster; the canonical example would be the use of COMBINING GRAPHEME JOINER to prevent combining-class based reordering of a series of diacritics. In a case like that, the default-ignorable CGJ will be followed by a non-cluster-start, but we'd still want to suppress it in rendering.

Limiting this to default-ignorables with Letter category makes it less likely it'll have undesired side-effects on other kinds of usage.
Here are a couple of simple tests to go with this patch; the first test fails with current trunk code, and is fixed by the patch here, while the second checks that we don't break things for the invalid 'a<filler>b' example. On Windows, they'll explicitly test with Malgun Gothic; on other platforms, they ought to also pass regardless of what font gets used, but let's see what tryserver says... https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0d31b7136d0.
Attachment #8731682 - Flags: review?(quanxunzhen)
Comment on attachment 8731643 [details] [diff] [review]
Don't filter out Hangul jamo fillers as 'ignorable', because the font may require them to provide advance width

Review of attachment 8731643 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxFont.cpp
@@ +767,5 @@
> +        // just the Hangul filler characters) that we'd better not discard
> +        // if they're followed by additional characters in the same cluster.
> +        // Some fonts use them to carry the width of a whole cluster of
> +        // combining jamos; see bug 1238243.
> +        if (GetGenCategory(aCh) == nsIUGenCategory::kLetter &&

It seems to me we can use
> GetGeneralCategory(aCh) == HB_UNICODE_GENERAL_CATEGORY_OTHER_LETTER
here instead, as all those characters fall in this category.
I don't have a machine on hand to reproduce this issue currently (the win7 install in my vm doesn't seem to have this issue). I'll check it on my win10 machine tomorrow morning.
(In reply to Xidorn Quan [:xidorn] (UTC+8) from comment #14)
> It seems to me we can use
> > GetGeneralCategory(aCh) == HB_UNICODE_GENERAL_CATEGORY_OTHER_LETTER
> here instead, as all those characters fall in this category.

Yes, we could. I tested for the wider kLetter category on the grounds that *if* there were ever a character with a different "Letter" subcategory in default-ignorables, it would likely make sense to treat that in the same way. But I doubt there'll ever be such a character in Unicode, so it will probably never make any difference.
Comment on attachment 8731682 [details] [diff] [review]
Reftests involving possibly-ignorable hangul choseong filler

Review of attachment 8731682 [details] [diff] [review]:
-----------------------------------------------------------------

It seems you need to address the reftest failure on 1238243-2, but that should be easy.
Attachment #8731682 - Flags: review?(quanxunzhen) → review+
(In reply to Xidorn Quan [:xidorn] (UTC+8) from comment #17)
> Comment on attachment 8731682 [details] [diff] [review]
> Reftests involving possibly-ignorable hangul choseong filler
> 
> Review of attachment 8731682 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> It seems you need to address the reftest failure on 1238243-2, but that
> should be easy.

That failure is a line-height mismatch, which we can fix by setting an explicit line-height.

Unfortunately, we still get a failure on Linux:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f5bd4160351b

This seems to be dependent on the available fonts on the system; on our test machines, it looks like we're getting a font that doesn't support the combining jamos very well. I can reproduce the original bug here on my local Ubuntu system using the Noto Sans CJK KR font, and the patch here fixes it; but we must be getting some other font on treeherder.

In principle, we could add a known-good Korean font to the reftest directory and load it with @font-face, but I'm not sure it's worth doing that just for this test; for now, I'm inclined to just mark the test as random-if(gtkWidget), and rely on Windows to test it properly.
https://hg.mozilla.org/integration/mozilla-inbound/rev/96680d60aa042031857c6cd78ed015322c351c61
Bug 1238243 - Don't filter out Hangul jamo fillers as 'ignorable', because the font may require them to provide advance width. r=xidorn

https://hg.mozilla.org/integration/mozilla-inbound/rev/dda7d8e0fd514cf86417e3a3012b42ba427fc739
Bug 1238243 - Reftests involving possibly-ignorable hangul choseong filler. r=xidorn
https://hg.mozilla.org/integration/mozilla-inbound/rev/de0a178499e2d31d885c4ab72ee8457f325b9e2f
Bug 1238243 followup - Mark test 1238243-2.html as random on OS X 10.6, due to dependency on available fonts.
https://hg.mozilla.org/mozilla-central/rev/96680d60aa04
https://hg.mozilla.org/mozilla-central/rev/dda7d8e0fd51
https://hg.mozilla.org/mozilla-central/rev/de0a178499e2
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: