Closed Bug 1638478 Opened 5 months ago Closed 4 months ago

Arabic shaper not used for Arabic combining marks on whitespace

Categories

(Core :: Layout: Text and Fonts, defect)

78 Branch
defect

Tracking

RESOLVED FIXED
mozilla78
Tracking Status
firefox78 --- fixed

People

(Reporter: eusgf4u4pw, Assigned: jfkthame)

Details

Attachments

(5 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36

Steps to reproduce:

Display the Unicode character sequences:
U+00A0 U+0654 U+0670 (nbspace hamzaAbove superscriptAlef)
and
U+00A0 U+0670 U+0654 (nbspace superscriptAlef hamzaAbove)
within both
a Latin string ("hi bye")
and
an Arabic string ("ب ي").

Actual results:

Observed rendering:
-- within the Latin string, the text was reordered to U+00A0 U+0670 U+0654, resulting in incorrect rendering (as if the hamza were attached to the alef)
-- within the Arabic string, the text was reordered to U+00A0 U+0654 U+0670, rendering correctly (the alef attached to the hamza)

Expected results:

The marks should always be displayed as if the order were U+00A0 U+0654 U+0670, whether the surrounding context is Arabic or Latin, as explained in Unicode Technical Report #53, Unicode Arabic Mark Rendering (UTR53).

A combining character sequence of Arabic marks should always be rendered with the Arabic shaper, in order to trigger the UTR53 Arabic Mark Transient Reordering Algorithm (AMTRA) within HarfBuzz. This should be the case whether or not the combining character sequence is on an Arabic letter base.

UTR53 should be applied to the Arabic mark sequence in both cases, not only in the Arabic context.
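
As a side note on the "reordered" observation above: the two input orders are canonically equivalent, since U+0670 has canonical combining class 35 and U+0654 has 230, so Unicode normalization always sorts the superscript alef first. A minimal Python sketch (standard library only; the variable and function names are just for illustration, and this says nothing about what any browser does internally):

```python
import unicodedata

NBSP = "\u00a0"              # NO-BREAK SPACE
HAMZA_ABOVE = "\u0654"       # ARABIC HAMZA ABOVE
SUPERSCRIPT_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF


def describe(seq):
    """Return (code point, canonical combining class) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.combining(c)) for c in seq]


# The two sequences from the steps to reproduce.
typed_a = NBSP + HAMZA_ABOVE + SUPERSCRIPT_ALEF
typed_b = NBSP + SUPERSCRIPT_ALEF + HAMZA_ABOVE

# Canonical ordering sorts marks by combining class, so both inputs
# normalize to the same string, with the superscript alef first.
nfd_a = unicodedata.normalize("NFD", typed_a)
nfd_b = unicodedata.normalize("NFD", typed_b)

print(describe(nfd_a))
print(nfd_a == nfd_b)
```

So the normalized order is the opposite of the order wanted for display, which is why the reordering step in the Arabic shaper matters here.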

The Unicode Standard suggests:

Nonspacing combining marks used by the Unicode Standard may be exhibited in apparent isolation by applying them to U+00A0 NO-BREAK SPACE. This convention might be employed, for example, when talking about the combining mark itself.

Attachment #9149546 - Attachment is obsolete: true

I also tried various markup around the problematic sequence (nbsp + marks), including:

  • lang="ar"
  • lang="ar-Arab"
  • preceding the sequence with U+061C ARABIC LETTER MARK
  • wrapping the sequence in RLO/PDF or RLE/PDF pairs
  • wrapping the sequence in <span dir="rtl"></span>

But nothing seems to help.

The basic issue here is that the script itemizer doesn't identify the <NBSP, hamza-above, superscript-alef> sequence as an Arabic script run, because NBSP has Script=Common and the two diacritics have Script=Inherited. So the run just resolves to Script=Common, and we send it through the generic shaper.

The two Arabic diacritics do have ScriptExtensions=arab,syrc in Unicode. So I think we should try looking at the ScriptExtensions property when a run otherwise resolves to Common, and use the first "real" script found there. This will generally result in the right shaping, for cases like this where the marks depend on being processed by a specific shaper.
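
To make the idea concrete, here is a rough sketch of that fallback in Python. This is not the Gecko code; the property tables are hand-written and cover only the characters in this bug, and `resolve_run_script` is a hypothetical name. A real implementation would query the full Unicode Script and ScriptExtensions data.

```python
# Hypothetical, hand-written property tables (ISO 15924 codes).
# Zyyy = Common, Zinh = Inherited.
SCRIPT = {
    "\u00a0": "Zyyy",  # NO-BREAK SPACE
    "\u0654": "Zinh",  # ARABIC HAMZA ABOVE
    "\u0670": "Zinh",  # ARABIC LETTER SUPERSCRIPT ALEF
    "\u0628": "Arab",  # ARABIC LETTER BEH
    "\u064a": "Arab",  # ARABIC LETTER YEH
}
SCRIPT_EXTENSIONS = {
    "\u0654": ("Arab", "Syrc"),
    "\u0670": ("Arab", "Syrc"),
}
GENERIC = {"Zyyy", "Zinh"}


def resolve_run_script(run):
    """Pick a script for shaping a run of text.

    First use the Script property; if the whole run is Common/Inherited,
    fall back to the first "real" script listed in ScriptExtensions.
    """
    for ch in run:
        script = SCRIPT.get(ch, "Zyyy")  # unknown chars treated as Common
        if script not in GENERIC:
            return script
    for ch in run:
        for ext in SCRIPT_EXTENSIONS.get(ch, ()):
            if ext not in GENERIC:
                return ext
    return "Zyyy"


print(resolve_run_script("\u00a0\u0654\u0670"))
```

With only the Script property the run above would stay Common; with the ScriptExtensions fallback it resolves to Arab, so it can be handed to the Arabic shaper.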

Even then, it's unclear to me whether <NBSP, hamza-above, superscript-alef> should be expected to work in an LTR context (e.g. between Latin words), because of directionality concerns. None of these characters has strong RTL directionality: NBSP obviously doesn't, and the diacritics have bidi class NSM (nonspacing mark), which means they adopt the directionality of the base to which they're applied. As a result, a line containing

hi <NBSP, hamza-above, superscript-alef> bye

will simply resolve to LTR, and so I'm doubtful whether shaping of the <NBSP, hamza-above, superscript-alef> should be expected to work as if it were RTL. But wrapping this in <span dir=rtl> would fix that; the base direction will then be RTL and the sequence should render properly if sent through the Arabic shaper.
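
The bidi classes involved can be checked with Python's standard `unicodedata` module. This is only a quick illustration of the point about missing strong directionality; `bidi_classes` and `has_strong_char` are made-up helper names, and the check is not a full implementation of the Unicode Bidirectional Algorithm.

```python
import unicodedata

# Bidi classes that count as "strong" (L = left-to-right,
# R and AL = right-to-left).
STRONG = {"L", "R", "AL"}


def bidi_classes(text):
    """Return the Bidi_Class of each character in text."""
    return [unicodedata.bidirectional(c) for c in text]


def has_strong_char(text):
    """True if any character has a strong bidi class."""
    return any(cls in STRONG for cls in bidi_classes(text))


sequence = "\u00a0\u0654\u0670"  # NBSP, hamza-above, superscript-alef

print(bidi_classes(sequence))
print(has_strong_char(sequence))
print(bidi_classes("\u0628"))  # Arabic letter beh, for comparison
```

NBSP is class CS and both marks are NSM, so nothing in the sequence itself establishes an RTL direction; an Arabic letter such as beh is AL, which does.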

So the testcase

data:text/html;charset=utf-8,<span dir=rtl style="font:36px arial">&nbsp;&%23x654;&%23x670;</span>

should render with the superscript-alef on top of the hamza (as it does in Chrome), but this currently fails in Firefox. Fixing the script itemizer to check ScriptExtensions will resolve this.

(It's interesting that Chrome renders this "correctly" even without the dir=rtl, but as described above, it's unclear to me whether that should really be expected. Possibly an issue for further investigation as a followup.)

Assignee: nobody → jfkthame
Severity: -- → S3
Status: UNCONFIRMED → NEW
Ever confirmed: true
Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3ac062ec44ec
Try to resolve Script=Common runs to a specific script for shaping purposes based on the ScriptExtensions property. r=jrmuizel
https://hg.mozilla.org/integration/autoland/rev/42c6f55994f3
Add WPT reftest for shaping Arabic diacritics stacked on NBSP. r=jrmuizel
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/23692 for changes under testing/web-platform/tests
Upstream web-platform-tests status checks passed, PR will merge once commit reaches central.
Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla78
Upstream PR merged by moz-wptsync-bot