Closed Bug 1045139 Opened 10 years ago Closed 10 years ago

The Arabic text with "MS Sans Serif" font is rendered bad

Categories

(Core :: Graphics: Text, defect)

34 Branch
x86_64
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla34
Tracking Status
firefox32 --- unaffected
firefox33 + fixed
firefox34 + fixed

People

(Reporter: over68, Assigned: poiru)

References

Details

Attachments

(1 file)

The Arabic text does not appear correctly with MS Sans Serif font and HWA disabled.

Testcase https://dl.dropboxusercontent.com/u/95157096/85f61cf7/oj13vf3ek.html


This problem occurs since the version 33.0a1 (2014-06-14).

https://hg.mozilla.org/mozilla-central/rev/86aa28ce309e

Screenshot https://dl.dropboxusercontent.com/u/95157096/85f61cf7/71e5hi0ac.png
[Tracking Requested - why for this release]: Arabic-script sites that explicitly call for the MS Sans Serif font will show broken text on GDI-based (non-HW-acc) Windows systems.

This is clearly a regression from bug 985220. MS Sans Serif is an old bitmap font, so obviously it doesn't have OpenType tables; Uniscribe must have special-cased it one way or another. We should figure out some way to work around this, perhaps by forcing it to fall back to another font (which must be happening on all other platforms, including Windows+DWrite anyway).
Blocks: 985220
Status: UNCONFIRMED → NEW
Ever confirmed: true
One way to address this might be to implement bug 1006122 on Windows, as that should mask out the Arabic range in MS Sans Serif.
Depends on: 1006122
Jonathan, why is the fallback arabic shaping in HarfBuzz not working here?
(In reply to Behdad Esfahbod from comment #3)
> Jonathan, why is the fallback arabic shaping in HarfBuzz not working here?

I wondered that too, and am intending to take a look but haven't had time yet. I wouldn't be surprised if it turns out the font does not support the presentation-form codepoints, but relied on some other legacy hack for "shaping". But that's pure speculation, until I spend a bit of time in Windows to poke at it...
Ah, it's Windows-1255 encoding!
Well, Windows-1256 I mean.
So, what implementation do you use for get_glyph() callback with this font?
We should be using gfxGDIFont::GetGlyph(), found at http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxGDIFont.cpp#422, which is based on Uniscribe's ScriptGetCMap() API. So I need to check what that's actually doing.
Screenshot of page affected by this bug https://dl.dropboxusercontent.com/u/95157096/85f61cf7/jdgp9tl97.png
Thanks Jonathan.  I suppose it's returning a custom cmap that maps Arabic chars.  Looks like MS is using a custom glyph encoding for Windows-1256 shaping.  It's using 0x05..0x1F etc for shaped glyphs.

Technically, we can detect this in the arabic fallback shaper code and synthesize a GSUB table for it.  In fact, the GSUB table for it is fixed.  So we just need to do the mapping once and include the binary.

I can write that patch if you are interested in it.
(In reply to Behdad Esfahbod from comment #10)
> Thanks Jonathan.  I suppose it's returning a custom cmap that maps Arabic
> chars.  Looks like MS is using a custom glyph encoding for Windows-1256
> shaping.  It's using 0x05..0x1F etc for shaped glyphs.
> 
> Technically, we can detect this in the arabic fallback shaper code and
> synthesize a GSUB table for it.  In fact, the GSUB table for it is fixed. 
> So we just need to do the mapping once and include the binary.
> 
> I can write that patch if you are interested in it.

Sounds good, thanks. (Well, kinda. I hate to carry extra code just to support obsolete stuff like this. But there are enough people on affected Windows systems that we should do it, I guess.)
Ok, here's a patch:

  https://github.com/behdad/harfbuzz/commit/479ae6855d5611f9d05441b320e4db57d3502f4a

There's a tiny commit before it that you need also.

From the commit log:

commit 479ae6855d5611f9d05441b320e4db57d3502f4a
Author: Behdad Esfahbod <behdad@behdad.org>
Date:   Wed Jul 30 02:15:44 2014 -0400

    [arabic] Implement Windows-1256 private shaping
    
    Bug 1045139 - The Arabic text with "MS Sans Serif" font is rendered bad
    https://bugzilla.mozilla.org/show_bug.cgi?id=1045139
    
    This is only enabled on Windows platforms, and requires support from
    Uniscribe to work.  But for clients that do hook up to Uniscribe, this
    fixes shaping of Windows-1256-encoded bitmap fonts like "MS Sans Serif".
    
    The code and table together have just less than a 1kb footprint when
    enabled.
    
    UNTESTED.  I might even have broken regular Arabic fallback shaping.
Will track for now.
Ok, I fixed that commit a bit.  Look here instead:

https://github.com/behdad/harfbuzz/commits/win1256

It's just the top commit you need.
Thanks, confirmed that it now builds successfully under MSVC.

As for the behavior... almost, but not quite! With the simple testcase from here[1], the text shapes correctly *except* that the khah in 'al-khat' (U+062E) appears as a final form, rather than medial.
Looking at http://en.wikipedia.org/wiki/Arabic_alphabet#Table_of_basic_letters with the font forced to MS Sans Serif, I see that all three of JEEM, HAH and KHAH share the same problem: they display final where they should be medial, and they display isolate where they should be final.
You are right!  My bad.  Fixing.
Fix pushed:

https://github.com/behdad/harfbuzz/commit/946e610c7fe7ad23db0ea41c55f93c9ab0f7c10c

Also please test:

- Lam-Alef ligature, in final vs isolated forms.  You should get distinct glyphs.

- Diacritic marks:

  * I'm curious what happens with fallback positioning.  If the bitmap glyphs return a big box, positioning should fail, right?  We should detect that and skip positioning,

  * The Shadda followed by Fatha/Zamma/Kasra should ligate.
(In reply to Behdad Esfahbod from comment #18)
> Fix pushed:
> 
> https://github.com/behdad/harfbuzz/commit/
> 946e610c7fe7ad23db0ea41c55f93c9ab0f7c10c

Great, that appears to have fixed those letters.

> Also please test:
> 
> - Lam-Alef ligature, in final vs isolated forms.  You should get distinct
> glyphs.

I'm getting the isolated form even when there's a preceding left-joining letter, so that doesn't look right.

The other letter where I'm noticing a problem is U+0629 TEH MARBUTA; I get the isolated letter where it ought to be final.

> 
> - Diacritic marks:
> 
>   * I'm curious what happens with fallback positioning.  If the bitmap
> glyphs return a big box, positioning should fail, right?  We should detect
> that and skip positioning,

No fallback positioning seems to be happening, which is fine; the diacritics all stay at a constant level, which AFAIK is how the font has always been expected to render. I'll attach a screenshot of text from quran.com with the font forced to MS Sans Serif/cp1256. There are some glitches where an unsupported diacritic falls back to a different font and interrupts the shaping, but aside from that, and the two issues mentioned above, I think it looks about as good as can be expected.

> 
>   * The Shadda followed by Fatha/Zamma/Kasra should ligate.

Yes, that looks OK. There are occurrences of all three in the screenshot I'll attach.
Fixed both.  please take another look.
In your screenshot, in 2:16, the first word, third letter (from right), the Lam should be initial but is isolated.  I don't understand how that can happen.
Is your Unicode Data complete?  Looks like U+0670 SUPERSCRIPT ALEF is breaking the shaping?
Here's the sequence:

$ ./hb-unicode-decode ﻚﹷﺌﹻﻟﹶٰﻭﺃﹸ | ./hb-unicode-prettyname  
ALEF WITH HAMZA ABOVE + ARABIC DAMMA + WAW + LAM + ARABIC FATHA + SUPERSCRIPT ALEF + YEH WITH HAMZA ABOVE + ARABIC KASRA + KAF + ARABIC FATHA

I'm guessing that SUPERSCRIPT ALEF is coming from another font and breaks the shaping...
Right, that'll be what's happening there. We don't currently implement context across font runs, so a character like U+0670 that falls back to a different font will cause a break in shaping.
(In reply to Behdad Esfahbod from comment #21)
> Fixed both.  please take another look.

Yep, medial lam-alef is now working, and teh-marbuta is shaping correctly.

However, now that the mediLamLookup is hooked up, we're getting an extra isolated lam after any instance of medial lam that *isn't* followed by alef.

ISTM that we need to insert the LAM_MEDI_MARKER only in context when there is a following alef.
Oh shoot.  Yeah, of course.  That's why I hadn't hooked up the lamMedi, I need to write a contextual lookup.  Ok, will do that now.
Humm.  The LAM_MEDI_MARKER won't work if user disables rlig.  I know that's not supported. Still, I think it's better to add a contextual rlig feature that chooses the correct form based on the previous glyph.  I'll do that instead.
Humm.  The solution was much easier: we control the order of lookups here, so doing rlig before medi/fina does the trick.  Please try again.
Almost! You forgot to update the ligature rules to match the original (unshaped) lam/alef glyphs, but once I fixed that locally, it works.
In master now!
Depends on: 1036981
The harfbuzz update that landed in bug 1036981 includes the fix for this issue.
Status: NEW → RESOLVED
Closed: 10 years ago
No longer depends on: 1006122
Resolution: --- → FIXED
Assignee: nobody → birunthan
Target Milestone: --- → mozilla34
Jonathan, birunthan, what are the plan wrt 33? Are you planning to ask for an uplift of harfbuzz to see that bug fixed in 33? Thanks
Flags: needinfo?(jfkthame)
Flags: needinfo?(birunthan)
(In reply to Sylvestre Ledru [:sylvestre] from comment #33)
> Jonathan, birunthan, what are the plan wrt 33? Are you planning to ask for
> an uplift of harfbuzz to see that bug fixed in 33? Thanks

My plan was to nominate bug 1036981 for uplift to aurora once it's had a few days to bake on central, in order to fix this issue. (I think that's preferable to trying to cherry-pick just this fix onto our existing older harfbuzz version, though that would be a possible alternative.)
Flags: needinfo?(jfkthame)
Flags: needinfo?(birunthan)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: