Open Bug 1938822 Opened 2 months ago Updated 1 month ago

[Linux] Arabic text with manually inserted kashidas (ـ) breaks letter connections in Firefox

Component: Core :: Layout: Text and Fonts
Type: defect
Version: Firefox 133
Reporter: yahyazekry
Assignee: Unassigned
Attachments: 3 files

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0

Steps to reproduce:

1- Type or paste Arabic text with manually inserted kashidas (example: يــــحــــيـــى)
2- View text on websites like Facebook, Twitter, or Goodreads
3- Observe that the letters appear disconnected from the kashidas

Actual results:

The Arabic text appears broken/disconnected (example: ي ـــ ح ـــ ي ــــ ى): each letter and kashida is separated instead of flowing together.

Expected results:

The text should display as a continuous connected word with kashidas extending the connections between letters (يــــحــــيـــى), as it does in Brave browser

The Bugbug bot thinks this bug should belong to the 'Core::Layout: Text and Fonts' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Layout: Text and Fonts
Product: Firefox → Core
Attached file possible testcase

I noticed that on Twitter at least, they're using a font declaration that has a final fallback to sans-serif, and I get the same outcome if I just specify sans-serif directly as in this reduced testcase. I can see that the glyphs are indeed disconnected if I use a large font-size as in this testcase.

Reporter, could you confirm that you see the issue with this testcase too?

(I think this might be a case where the system is simply giving Firefox a bad choice of Arabic fallback font, and maybe we need to add some logic to forbid certain fonts or something? But jfkthame might have more insight...)

As shown in the font panel in my attached screenshot of the testcase, Firefox is rendering the dash-like kashida glyphs from a different font vs. the other glyphs.

We're rendering the kashidas using Noto Sans Adlam, whereas the other glyphs are rendered using Noto Sans Arabic.

Chrome is rendering all of the text with Noto Sans Arabic. And if I manually specify that font, then we match their rendering. (I'll post another attachment to demonstrate.)

Here's a reference case where I'm explicitly requesting "Noto Sans Arabic", which is sufficient to make us match the Chromium rendering.

jfkthame, do you know if there's a good reason for us to be switching fonts at the glyph boundaries here?

(Presumably that switching is what's causing the gaps, because I imagine we won't get nice ligatures between glyphs of different fonts.)

Flags: needinfo?(jfkthame)

I should mention that I'm testing on Linux (distro: Ubuntu 24.10), like the reporter. I'll add that to the bug title, for clarity/completeness.

I just tested on Windows and macOS, using the attached "possible testcase", and I can't reproduce any issues there -- I don't see any gaps. Firefox's font-devtools tell me that Firefox is using Arial to render the whole string on those platforms.

Severity: -- → S3
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Arabic text with manually inserted kashidas (ـ) breaks letter connections in Firefox → [Linux] Arabic text with manually inserted kashidas (ـ) breaks letter connections in Firefox

I think the issue here is that Noto Sans Adlam is being found ahead of Noto Sans Arabic when font fallback is happening, and it supports the kashida character (U+0640 ARABIC TATWEEL) but not the rest of the Arabic script block. So as we do font fallback for each character in turn, we find the Adlam font for the kashidas only, fall back to the actual Arabic font for the other letters, and hence get a break in shaping because we can't shape across font changes.
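
To make the per-character fallback concrete, here is a rough Python sketch (not Gecko code). The coverage tables are made up for illustration: the only facts taken from this bug are that the Adlam font covers the kashida but not the Arabic letters, and that the Arabic font covers both. The "ar" ordering beyond its first entry, and the helper names pick_font and split_into_runs, are assumptions.

```python
# Toy model of per-character font fallback. Coverage sets are illustrative
# assumptions, not the real cmap tables of these fonts.
KASHIDA = "\u0640"       # ARABIC TATWEEL
YEH = "\u064A"
HAH = "\u062D"
ALEF_MAKSURA = "\u0649"

FONT_COVERAGE = {
    "Noto Sans": set("abcdefghijklmnopqrstuvwxyz"),
    "Noto Sans Adlam": {KASHIDA},  # kashida only, no Arabic letters
    "Noto Sans Arabic": {KASHIDA, YEH, HAH, ALEF_MAKSURA},
}

# Order as reported for lang=en in this bug; the tail of the lang=ar order
# is an assumption (only "Arabic moves to the front" is known).
EN_ORDER = ["Noto Sans", "Noto Sans Adlam", "Noto Sans Arabic"]
AR_ORDER = ["Noto Sans Arabic", "Noto Sans", "Noto Sans Adlam"]


def pick_font(ch, font_order):
    """Return the first font in font_order that covers ch, or None."""
    for name in font_order:
        if ch in FONT_COVERAGE[name]:
            return name
    return None


def split_into_runs(text, font_order):
    """Group consecutive characters that resolve to the same font.

    Each run boundary is a place where shaping (and so joining) breaks.
    """
    runs = []
    for ch in text:
        font = pick_font(ch, font_order)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs


WORD = YEH + KASHIDA * 2 + HAH + KASHIDA * 2 + YEH + KASHIDA * 2 + ALEF_MAKSURA

if __name__ == "__main__":
    for label, order in (("en", EN_ORDER), ("ar", AR_ORDER)):
        runs = split_into_runs(WORD, order)
        print(label, len(runs), [font for font, _ in runs])
```

With the "en" ordering the word splits into alternating Arabic/Adlam runs, one boundary per letter-kashida transition; with the Arabic font first, the whole word stays in a single run.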

Note that this wouldn't happen (I think ... not tested just now) if the content were tagged with lang="ar", as that should cause us to map sans-serif to a suitable Arabic font. The trouble here is that without the lang attribute, we're mapping sans-serif to a Latin font that doesn't support the Arabic letters, and so then fallback comes into play and produces an unreliable result.

The change I'm hoping to make in bug 1919512 should address this, as the entire text will resolve to a run of Arabic script and then we'll apply the appropriate font prefs to it.

Another thing that would help here would be if gfxPlatformGtk::GetCommonFallbackFonts returned a suitable Arabic fallback font. That function currently just returns a few fonts that were commonly available on Linux systems, and hasn't been updated in a number of years. In particular, it's now common for systems to have a collection of Noto fonts covering many writing systems; we should make GetCommonFallbackFonts aware of this, and then we'll be less dependent on the arbitrary global-fallback search that is happening (and making a bad choice) in this case. I'll file a bug to update that.

Flags: needinfo?(jfkthame)
See Also: → 1939243
See Also: → 1919512

Hmm, looking into this a bit further, it seems that improving gfxPlatformGtk::GetCommonFallbackFonts isn't enough to resolve the problem; we still get the kashidas rendered from the Adlam font.

The issue is that we're not actually hitting that fallback search; rather, what's happening is that when we ask fontconfig to resolve sans-serif, with the (implied/default) language of English, we get back not a single default font for Latin text, but a sorted list of all the fonts that fontconfig thinks sans-serif can map to.

To see this, run fc-match -s :family=sans-serif:lang=en, which on my Ubuntu machine gives me a list of

Noto Sans
Noto Sans Adlam
Noto Sans Arabic
Noto Sans Armenian
Noto Sans Avestan
...etc

If I pass lang=ar instead of lang=en, then Noto Sans Arabic moves to the front of the list (appropriately), and the problem here wouldn't arise. But with lang=en, we get the Adlam font ahead of the Arabic one, and so the kashidas get assigned to that font.
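
For anyone who wants to compare the two orderings on their own machine, a small Python wrapper around the same fc-match invocation might look like this. It assumes fc-match is on PATH and simply prints whatever fc-match emits; the helper names fc_match_command and sorted_matches are made up here, and the results will depend on the installed fonts.

```python
import shutil
import subprocess


def fc_match_command(lang, family="sans-serif"):
    """Build the argv for `fc-match -s`, using the pattern syntax shown above."""
    return ["fc-match", "-s", f":family={family}:lang={lang}"]


def sorted_matches(lang, limit=5):
    """Return the first `limit` output lines, or [] if fc-match is unavailable."""
    if shutil.which("fc-match") is None:
        return []
    result = subprocess.run(
        fc_match_command(lang), capture_output=True, text=True, check=False
    )
    return result.stdout.splitlines()[:limit]


if __name__ == "__main__":
    # Print the head of the sorted list for each language side by side.
    for lang in ("en", "ar"):
        print(f"lang={lang}")
        for line in sorted_matches(lang):
            print("  " + line)
```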

(For performance reasons, we restrict the number of these mappings we'll actually try to use; by default, we just use the first 3, per gfx.font_rendering.fontconfig.max_generic_substitutions. But that's enough to result in the unfortunate mapping we see here.)

Setting gfx.font_rendering.fontconfig.max_generic_substitutions to 1, so that we only look at the first entry returned by fontconfig, does seem to help here (at least on my system; YMMV, depending on installed fonts), but it could also be harmful for cases where there are in fact several relevant fonts being returned, and the first one alone may have incomplete character coverage.
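
The effect of the limit can be sketched in Python as follows. This is a toy model, not the Gecko implementation: the font order is the fc-match list quoted above, the coverage sets are illustrative assumptions, and assign_fonts is a hypothetical helper. Characters that none of the retained fonts cover are reported as None, meaning they would be left to the later fallback search, which is not modeled here.

```python
# Toy model of truncating fontconfig's sorted list for a generic family.
KASHIDA = "\u0640"  # ARABIC TATWEEL
YEH = "\u064A"
HAH = "\u062D"

# Order taken from the fc-match output quoted in this bug (lang=en).
FONTCONFIG_ORDER_EN = [
    "Noto Sans",
    "Noto Sans Adlam",
    "Noto Sans Arabic",
    "Noto Sans Armenian",
    "Noto Sans Avestan",
]

# Illustrative coverage only; real fonts cover far more characters.
FONT_COVERAGE = {
    "Noto Sans": set("abcdefghijklmnopqrstuvwxyz"),
    "Noto Sans Adlam": {KASHIDA},
    "Noto Sans Arabic": {KASHIDA, YEH, HAH},
    "Noto Sans Armenian": set(),
    "Noto Sans Avestan": set(),
}


def assign_fonts(text, max_generic_substitutions):
    """Map each distinct character to the first retained font covering it.

    Only the first max_generic_substitutions fonts are considered; a value
    of None means no retained font covers the character.
    """
    candidates = FONTCONFIG_ORDER_EN[:max_generic_substitutions]
    assignment = {}
    for ch in text:
        assignment[ch] = next(
            (name for name in candidates if ch in FONT_COVERAGE[name]), None
        )
    return assignment


WORD = YEH + KASHIDA + HAH

if __name__ == "__main__":
    for limit in (3, 1):
        fonts = assign_fonts(WORD, limit)
        print(limit, {f"U+{ord(ch):04X}": name for ch, name in fonts.items()})
```

In this model, a limit of 3 reproduces the split (kashida from Adlam, letters from Arabic), while a limit of 1 leaves every Arabic character for the later fallback step.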

FWIW, a similar scenario (I think) with Tor Browser's bundled fonts on Linux - https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/43381#note_3144708 (it's a short read, I pinky swear)

(In reply to Thorin [:thorin] from comment #10)

I don't have an account to comment directly there, but I think you'd want to be using Noto Sans SC (not Noto Sans TC) in that situation: zh-Hans indicates Simplified, and the SC font is the Simplified variant.

Does it make any difference if the content is tagged zh-CN rather than zh-Hans? I don't remember offhand if we actually handle the script subtag there, or only the region subtags.
