Closed
Bug 1331339
Opened 8 years ago
Closed 8 years ago
Graphite rendering problems with shared Tamil and Grantha text
Categories
(Core :: Graphics: Text, defect, P3)
Tracking
()
RESOLVED
FIXED
mozilla53
Tracking | Status | |
---|---|---|
firefox53 | --- | fixed |
People
(Reporter: martin_hosken, Assigned: jfkthame)
Details
(Whiteboard: [gfx-noted])
Attachments
(3 files)
197.04 KB,
application/x-zip-compressed
|
Details | |
3.59 KB,
patch
|
jrmuizel
:
review+
|
Details | Diff | Splinter Review |
59.21 KB,
patch
|
jrmuizel
:
review+
|
Details | Diff | Splinter Review |
Enclosed is a firefox test document that shows the rendering of a Tamil text in both Graphite and OpenType. Due to the inclusion of U+1133C (Grantha nukta) in the text, both render wrongly. The correct rendering is shown in the enclosed pdf.
The columns in the html show the glyph outputs from graphite and harfbuzz respectively and their positions.
I can understand the OT being wrong, even though it shouldn't be. But the graphite should be rendering correctly.
Updated•8 years ago
|
Priority: -- → P3
Whiteboard: [gfx-noted]
Assignee | ||
Comment 1•8 years ago
|
||
The primary problem is almost certainly script-run itemization. U+1133C is listed in Scripts.txt as Script=Grantha, which is distinct from Script=Tamil, so it gets split into its own script run and shaped separately from the surrounding Tamil text.
ScriptExtensions.txt says "U+1133C ; Gran Taml", implying that it can also be used as part of Tamil, but our script-run itemization doesn't currently make use of ScriptExtensions.txt.
So we need to fix that; then we'll see if any further issues remain, but my guess is that's the problem here.
Assignee | ||
Comment 2•8 years ago
|
||
OK, I've confirmed that fixing the script run itemization will allow the Graphite font to shape correctly here. The OpenType version still fails; I haven't investigated that further. (Does it work in a standalone harfbuzz test?)
Fully fixing script run itemization to handle ScriptExtensions will be tricky, but we can at least do a partial fix that is adequate for the example here. The simple case (here) is that we've started a run of a given script (Tamil), and we need to avoid breaking it when we encounter a character (the nukta) whose Script assignment is different if the new character's ScriptExtensions happens to include the current run script. In this case, we should allow the run to continue uninterrupted. (I.e. it'll behave like a character with Script=Common or Script=Inherited.)
This still won't solve all cases: if the run _started_ with a character that has a primary Script assignment (i.e. it's not just COMMON), but also has secondary values in ScriptExtensions, and is then followed by a character in one of those secondary scripts, we'll have already (incorrectly) set the run's script, and we'll still interrupt it.
Concrete example: <0BAA TAMIL LETTER PA, 1133F GRANTHA VOWEL SIGN I>. U+0BAA has Script=Tamil, ScriptExtensions=Gran Taml. Because it has Script=Tamil, we'll begin a Tamil run. Then U+1133F has Script=Grantha (and no ScriptExtensions), so it'll start a new Grantha run. Oops. What we really want to do there is to go back and use the secondary Grantha script from the initial character. But fixing this in the general case will get hairy, as it could involve arbitrary-length lookahead or backtracking, and characters can have quite a list of secondary scripts; we'd potentially need to scan a series of characters, each assigned to different (but overlapping) collections of scripts, looking for what script code(s) they have in common, until we narrow it down to a single script for the run.
Implementing this would be possible, I'm sure, but may be expensive, and script run itemization is a pretty perf-critical part of the text-rendering pipeline. So I'm hesitant to add too much complexity unless it's truly necessary for real-world content. The most glaring examples can be fixed, I think, by ensuring that we don't break script runs within a grapheme cluster, as that's the case most likely to have bad effects on shaping.
Reporter | ||
Comment 3•8 years ago
|
||
On solving the general problem, there are (as of Unicode 8) 57 scripts involved in script extensions out of a total of 130 scripts referenced by Scripts.txt. So I suppose you could do a 64 bit bit field to look for extensions (bf[i] & bf[i-1]) if the result is 0 and the script[i] != script[i-1] then you have a script break. I suppose you could further divide the bitfield into clusters of scripts so more than one script could use the same bit, but then that could be risky for crazy data. The longest list is 17 scripts, with, from a quick hand calculation, the biggest class being 24 scripts.
As to never breaking run in a grapheme cluster. That is inviting, but I'm sure it would bite us somewhere. What about some Latin text with a non-latin diacritic between parenthesis with no spaces in there?
Assignee | ||
Comment 4•8 years ago
|
||
(In reply to martin_hosken from comment #3)
> On solving the general problem, there are (as of Unicode 8) 57 scripts
> involved in script extensions out of a total of 130 scripts referenced by
> Scripts.txt. So I suppose you could do a 64 bit bit field to look for
In Unicode 9, there are 59 scripts in ScriptExtensions, and I guess that'll continue to grow (gradually) in future versions, so we're uncomfortably close to overflowing a 64-bit value already.
> extensions (bf[i] & bf[i-1]) if the result is 0 and the script[i] !=
> script[i-1] then you have a script break.
Yes, this is the sort of thing I was picturing, though as noted, we'll need to be prepared for it to become more than just a single 64-bit value.
> As to never breaking run in a grapheme cluster. That is inviting, but I'm
> sure it would bite us somewhere. What about some Latin text with a non-latin
> diacritic between parenthesis with no spaces in there?
Yes, I'm sure you can construct cases where it's wrong, in some sense. I'm less sure that there will be real-world use cases that are negatively affected. What kind of script-specific shaping is likely to be expected (and fail to occur because the diacritic gets merged into the Latin run) in an example like this?
Many combining marks are Script=Inherited anyway; basically, all this idea does is to extend that convention to some diacritics where the Unicode committee decided to give them a specific script code instead, which is something of a judgement call anyway (see http://www.unicode.org/reports/tr24/#Assignment_Script_Values).
Reporter | ||
Comment 5•8 years ago
|
||
I think if we use clustering on the list of ScriptExtensions we get a largest cluster of 31, since once you get Latn in your cluster you start clustering clusters. But even then, I doubt it will hit 64 bits for a very long time.
I agree that expanding Script=Inherited gives a quick and easy solution that gets us a long way without too much danger. It'll merely miss non diacritics, but as you say, the main concern is not breaking grapheme clusters (that shouldn't be broken).
Assignee | ||
Comment 6•8 years ago
|
||
As noted above, this does not provide an exhaustive handling of ScriptExtensions and all its implications, but it should fix the majority of problems here (including the specific Tamil+Grantha shaping example) with minimal runtime cost.
Attachment #8827412 -
Flags: review?(jmuizelaar)
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → jfkthame
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Assignee | ||
Comment 7•8 years ago
|
||
I've filed https://github.com/behdad/harfbuzz/issues/400 about the harfbuzz shaping failure that still occurs with the OpenType version of the font even after script itemization is fixed.
Comment 8•8 years ago
|
||
Comment on attachment 8827412 [details] [diff] [review]
Don't start a new script run when the current script appears in the next character's ScriptExtensions property, or next char is a cluster-extender
Review of attachment 8827412 [details] [diff] [review]:
-----------------------------------------------------------------
Can we add a testcase for this?
Attachment #8827412 -
Flags: review?(jmuizelaar) → review+
Assignee | ||
Comment 9•8 years ago
|
||
Here's a reftest that fails without the patch here, because the presence of the Grantha nuktas interrupts the Tamil script run and therefore disrupts the shaping of the base text (specifically, glyph reordering that is required for the second U+0BC6 character doesn't happen).
Attachment #8827899 -
Flags: review?(jmuizelaar)
Updated•8 years ago
|
Attachment #8827899 -
Flags: review?(jmuizelaar) → review+
Assignee | ||
Comment 10•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/782f2a6bf284974ae03d85ce269b9a802589b045
Bug 1331339 - Don't start a new script run when the current script appears in the next character's ScriptExtensions property, or next char is a cluster-extender. r=jrmuizel
https://hg.mozilla.org/integration/mozilla-inbound/rev/5d4a0370b0a7b30b73342cf53f4cf2374d34cfb4
Bug 1331339 - Reftest to check that Tamil shaping is not disrupted by presence of Grantha nukta characters. r=jrmuizel
Comment 11•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/782f2a6bf284
https://hg.mozilla.org/mozilla-central/rev/5d4a0370b0a7
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
status-firefox53:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
You need to log in
before you can comment on or make changes to this bug.
Description
•