Arabic shaper not used for Arabic combining marks on whitespace
Categories
(Core :: Layout: Text and Fonts, defect)
Tracking
Version | Tracking | Status |
---|---|---|
firefox78 | --- | fixed |
People
(Reporter: eusgf4u4pw, Assigned: jfkthame)
Details
Attachments
(5 files, 1 obsolete file)
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36
Steps to reproduce:
Display the Unicode character sequences:
U+00A0 U+0654 U+0670 (nbspace hamzaAbove superscriptAlef)
and
U+00A0 U+0670 U+0654 (nbspace superscriptAlef hamzaAbove)
within both
a Latin string ("hi bye")
and
an Arabic string ("ب ي").
Actual results:
Observed rendering:
-- within the Latin string, the text was reordered to U+00A0 U+0670 U+0654, resulting in incorrect rendering (as if the hamza was being attached to the alef)
-- within the Arabic string, the text was reordered to U+00A0 U+0654 U+0670, rendering correctly (the alef being attached to the hamza)
Expected results:
The marks should always be displayed as if the order was U+00A0 U+0654 U+0670, whether the surrounding context is Arabic or Latin, as explained in Unicode Technical Report #53, Unicode Arabic Mark Rendering.
A combining character sequence of Arabic marks should always be rendered with the Arabic shaper in order to trigger the UTR53 Arabic Mark Transient Reordering Algorithm (AMTRA) within HarfBuzz. This should be the case whether or not the combining character sequence is on an Arabic letter base.
UTR53 should be applied to the Arabic mark sequence in both cases instead of only in the case of the Arabic context.
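For reference, the two input sequences above are canonically equivalent, so the rendering cannot reasonably depend on which order was stored. A minimal Python sketch using only the standard `unicodedata` module illustrates this (the variable names are my own, not taken from any implementation). It shows that the superscript alef has a lower canonical combining class than the hamza above, so normalization places the alef first, which is the opposite of the display order expected here.

```python
import unicodedata

NBSP = "\u00A0"              # NO-BREAK SPACE, used as the base
HAMZA_ABOVE = "\u0654"       # ARABIC HAMZA ABOVE
SUPERSCRIPT_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF

# Canonical combining classes decide the order produced by normalization.
for ch in (HAMZA_ABOVE, SUPERSCRIPT_ALEF):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ccc={unicodedata.combining(ch)}")

seq_a = NBSP + HAMZA_ABOVE + SUPERSCRIPT_ALEF
seq_b = NBSP + SUPERSCRIPT_ALEF + HAMZA_ABOVE

# Both inputs normalize to the same string, with the lower class first.
nfd_a = unicodedata.normalize("NFD", seq_a)
nfd_b = unicodedata.normalize("NFD", seq_b)
print("canonically equivalent:", nfd_a == nfd_b)
print("normalized order:", " ".join(f"U+{ord(c):04X}" for c in nfd_a))
```

Since normalized text carries the alef before the hamza, a shaper that does not apply the UTR53 reordering will stack the marks in the wrong visual order.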
Reporter
Comment 1•5 years ago
Unicode Standard suggests:
Nonspacing combining marks used by the Unicode Standard may be exhibited in apparent isolation by applying them to U+00A0 NO-BREAK SPACE. This convention might be employed, for example, when talking about the combining mark itself
Reporter
Comment 5•5 years ago
I also tried various markup around the problematic sequence (nbsp + marks), including:
- lang="ar"
- lang="ar-Arab"
- preceding the sequence with U+061C ARABIC LETTER MARK
- wrapping the sequence in RLO/PDF or RLE/PDF pairs
- wrapping the sequence in <span dir="rtl"></span>
But nothing seems to help.
Assignee
Comment 6•4 years ago
The basic issue here is that the script itemizer doesn't identify the <NBSP, hamza-above, superscript-alef> sequence as an Arabic script run, because NBSP has Script=Common and the two diacritics have Script=Inherited. So the run just resolves to Script=Common, and we send it through the generic shaper.
The two Arabic diacritics do have ScriptExtensions=arab,syrc in Unicode. So I think we should try looking at the ScriptExtensions property when a run otherwise resolves to Common, and use the first "real" script found there. This will generally result in the right shaping, for cases like this where the marks depend on being processed by a specific shaper.
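The proposed fallback can be sketched roughly as follows. This is an illustration in Python only, not the actual Gecko itemizer code: the function name `resolve_run_script` and the two small property tables are hypothetical, and the tables are hand-written for just the characters in this bug rather than read from the Unicode Character Database.

```python
# Hand-written property data for the characters discussed in this bug.
# Assumption: a real itemizer would look these up in the Unicode data files.
SCRIPT = {
    "\u00A0": "Common",     # NO-BREAK SPACE
    "\u0654": "Inherited",  # ARABIC HAMZA ABOVE
    "\u0670": "Inherited",  # ARABIC LETTER SUPERSCRIPT ALEF
    "\u0628": "Arabic",     # ARABIC LETTER BEH
}
SCRIPT_EXTENSIONS = {
    "\u0654": ["Arabic", "Syriac"],
    "\u0670": ["Arabic", "Syriac"],
}
NEUTRAL = {"Common", "Inherited"}


def resolve_run_script(run, use_extensions=True):
    """Pick a shaping script for a run of characters (hypothetical helper)."""
    # First pass: the first character with a "real" script decides.
    for ch in run:
        script = SCRIPT.get(ch, "Common")
        if script not in NEUTRAL:
            return script
    # Fallback: the run resolved to Common, so consult ScriptExtensions
    # and use the first "real" script listed there.
    if use_extensions:
        for ch in run:
            for script in SCRIPT_EXTENSIONS.get(ch, []):
                if script not in NEUTRAL:
                    return script
    return "Common"


marks_on_nbsp = "\u00A0\u0654\u0670"
print(resolve_run_script(marks_on_nbsp, use_extensions=False))  # current behavior
print(resolve_run_script(marks_on_nbsp))                        # with the fallback
```

Without the fallback the run stays Common and goes to the generic shaper; with it, the run is tagged Arabic because that is the first script listed in the marks' ScriptExtensions.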
Even then, it's unclear to me whether <NBSP, hamza-above, superscript-alef> should be expected to work in an LTR context (e.g. between Latin words), because of directionality concerns. None of these characters have strong RTL directionality (obviously not NBSP, and the diacritics have class NonSpacingMark, which means they adopt the directionality of the base to which they're applied). As a result, a line containing
hi <NBSP, hamza-above, superscript-alef> bye
will simply resolve to LTR, and so I'm doubtful whether shaping of the <NBSP, hamza-above, superscript-alef> should be expected to work as if it were RTL. But wrapping this in <span dir=rtl> would fix that; the base direction will then be RTL and the sequence should render properly if sent through the Arabic shaper.
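To make the directionality point concrete, here is a small Python check using the standard `unicodedata` module. It only lists bidi classes and looks for strong characters; it is not an implementation of the Unicode Bidirectional Algorithm, and the helper name `strong_directions` is my own.

```python
import unicodedata

# Bidi classes that carry strong directionality.
STRONG = {"L": "ltr", "R": "rtl", "AL": "rtl"}


def strong_directions(text):
    """Return the set of strong directions present in text (hypothetical helper)."""
    return {
        STRONG[unicodedata.bidirectional(ch)]
        for ch in text
        if unicodedata.bidirectional(ch) in STRONG
    }


marks = "\u00A0\u0654\u0670"  # NBSP, hamza above, superscript alef
classes = [unicodedata.bidirectional(ch) for ch in marks]
print(classes)

print(strong_directions(marks))                     # no strong characters at all
print(strong_directions("hi " + marks + " bye"))    # only LTR from the Latin words
print(strong_directions("\u0628 " + marks + " \u064A"))  # RTL from the Arabic letters
```

The sequence itself contributes no strong direction, so whatever direction it ends up with comes entirely from its context or from explicit markup such as dir=rtl.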
So the testcase
data:text/html;charset=utf-8,<span dir=rtl style="font:36px arial">&nbsp;&%23x654;&%23x670;</span>
should render with the superscript-alef on top of the hamza (as it does in Chrome), but this currently fails in Firefox. Fixing the script itemizer to check ScriptExtensions will resolve this.
(It's interesting that Chrome renders this "correctly" even without the dir=rtl, but as described above, it's unclear to me whether that should really be expected. Possibly an issue for further investigation as a followup.)
Assignee
Comment 8•4 years ago
Depends on D75744
Comment 12•4 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/3ac062ec44ec
https://hg.mozilla.org/mozilla-central/rev/42c6f55994f3