Closed Bug 1742626 Opened 3 years ago Closed 3 years ago

first-letter doesn't select conjuncts properly in indic scripts

Categories

(Core :: Layout: Text and Fonts, defect)

Firefox 94
defect

RESOLVED FIXED
97 Branch
Tracking Status
firefox97 --- fixed

People

(Reporter: ishida, Assigned: jfkthame)

Attachments

(2 files)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:94.0) Gecko/20100101 Firefox/94.0

Steps to reproduce:

When the start of a line contains a consonant cluster that uses a conjunct (rather than visible virama), ::first-letter should highlight the whole cluster.

Consonant clusters that form conjuncts using an invisible virama between the component letters need to be selected as a unit. This doesn't work well if segmentation relies on Unicode grapheme clusters, since a conjunct with two consonants will be parsed as two grapheme clusters (the first ending after the virama, and the second starting with the second consonant and including any following vowel-signs or other combining characters).

For these situations it is necessary to tailor the segmentation algorithm, so that it recognises the whole consonant cluster plus any attached vowel-signs or combining characters as a single unit.
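
As a rough illustration of the two paragraphs above, the following Python sketch compares a simplified default segmentation with a tailored one. It is only an approximation: `simple_clusters` groups a base character with any following combining marks rather than implementing the full Unicode grapheme cluster rules, and `tailored_clusters` and `VIRAMAS` are hypothetical names introduced here. The tailoring looks at code points only, so it cannot tell whether a font would actually display a conjunct.

```python
import unicodedata

# Assumption: only two viramas are listed, enough for the example below.
VIRAMAS = {0x094D, 0x09CD}  # Devanagari, Bengali


def simple_clusters(text):
    """Approximate default segmentation: a base plus following combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def tailored_clusters(text):
    """Merge a cluster that ends in a virama with the cluster that follows it."""
    merged = []
    for cluster in simple_clusters(text):
        if merged and ord(merged[-1][-1]) in VIRAMAS:
            merged[-1] += cluster
        else:
            merged.append(cluster)
    return merged


# SA, virama, THA, vowel sign I, TA, vowel sign I
word = "\u0938\u094D\u0925\u093F\u0924\u093F"
print(simple_clusters(word))    # first cluster is SA + virama only
print(tailored_clusters(word))  # first cluster also takes THA and its vowel sign
```

With the default-style segmentation the first unit ends after the virama; with the tailored version it runs to the end of the consonant cluster and its vowel sign.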

For examples see Typographic character units in complex scripts.
https://www.w3.org/International/questions/qa-indic-graphemes

Actual results:

Interactive test, When ::first-letter is applied to Devanagari the browser will select a 2-consonant conjunct as a unit
https://github.com/w3c/line_paragraph_tests/issues/66

Interactive test, When ::first-letter is applied to Bengali the browser will select a conjunct as a unit, if the virama is hidden
https://github.com/w3c/line_paragraph_tests/issues/69

Gecko breaks most of the half-form conjuncts (which are the large majority of all conjuncts in Devanagari), splitting them into an initial consonant with visible virama and a following consonant. Blink and WebKit select all conjuncts in full as a unit.

I18n test suite, Devanagari text
https://www.w3.org/International/i18n-tests/results/first-letter-pseudo#devanagari

Expected results:

Gecko should select the full conjunct.

See also: https://w3c.github.io/iip/gap-analysis/deva-gap#issue94_initials

Note: When the start of a line contains a 2-consonant cluster that uses a visible virama, ::first-letter should highlight only the first consonant+virama. This corresponds to a grapheme cluster, as defined by Unicode.

Tests show that this functions correctly in Firefox. (Because browsers tend to either select the whole conjunct or otherwise only select the first grapheme cluster, Blink and WebKit are broken when it comes to visible viramas.)

So the solution here needs to be able to distinguish between a virama that is shown and a virama that is replaced by a conjunct. For example, Tamil rarely uses conjuncts, so Tamil consonant clusters must remain split by first-letter highlighting.

Component: Untriaged → Layout: Text and Fonts
Product: Firefox → Core

In principle, this may be quite tricky, because whether a given sequence displays with a visible virama or a conjunct may be dependent on the specific font that ends up being used, but at the point where we're deciding what text belongs to the ::first-letter pseudo, font selection hasn't happened yet.

Moreover, it's possible that ::first-letter might apply a different font, and this could have the result of changing whether a conjunct gets used.

Yes, i was planning to add a note along those lines. (I'm surprised you responded so quickly!)

There's also the question of what happens if a fallback font is used.

I don't believe that this is fixable by focusing on the code points used. That could possibly be useful for mitigating behaviour for Tamil, since it uses a different virama, but although (modern) Tamil mostly uses visible viramas it still sometimes uses a couple of conjuncts (one of which i think could occur word-initially), so it's not a perfect workaround. Note that Chrome and WebKit apply different behaviours to Deva/Beng compared with Tamil.

I think, though, that ultimately the browser would have to be sensitive to the font used and the capabilities of that font. Either that, or the content author might be enabled to identify the expected behaviour on a font-by-font basis in the CSS style sheet, though that sounds a bit complicated. It may not help for fallbacks, but it might help if you're using webfonts and know what's likely to be applied to your text (?).

I've no idea whether we can fix this, but it's certainly an issue for use of first-letter in Indic scripts, so i thought best to at least make sure that it's a known problem, so that we can track any developments if someone can figure out how to fix it.

An author who wants reliable control of an effect like this should probably not rely on ::first-letter at all, but put markup such as <span class="first-letter"> to identify the exact range of text they want to style.
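
A minimal sketch of that markup approach, for an author who generates pages from a script. The helper name `wrap_first_letter` is hypothetical, and the author has to supply the length of the range in code points; nothing is detected automatically.

```python
from html import escape


def wrap_first_letter(text, length, css_class="first-letter"):
    """Wrap the first `length` code points of `text` in a <span>.

    The caller chooses `length`, so the styled range is exactly what the
    author intends, whatever the font or browser does.
    """
    head, tail = text[:length], text[length:]
    return f'<span class="{escape(css_class)}">{escape(head)}</span>{escape(tail)}'


# SA + virama + THA + vowel sign I are the four code points of the first syllable.
print(wrap_first_letter("\u0938\u094D\u0925\u093F\u0924\u093F", 4))
```

The stylesheet would then target `.first-letter` instead of `::first-letter`.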

But of course we'd like to make ::first-letter do as good a job as is reasonably possible; I think there's at least some room for improvement over the existing behavior.

I think the issue here primarily relates to scripts like Devanagari that commonly use "half-form" consonants, whose underlying encoding is <consonant, virama>.

We already have code for first-letter that prohibits breaking within a ligature in Indic scripts, so conjuncts that form a ligature of the entire <consonant1, virama, consonant2> sequence work as desired here, but when <consonant1, virama> forms a half-consonant, and then <consonant2> is a separate glyph, we end the first-letter at the default grapheme cluster boundary after the virama.

So what I propose is that for such scripts, we additionally check whether the trailing character of the cluster is a virama, and has been combined by the font with the preceding letter (i.e. created a half-form); if so, don't stop at this position but include the next cluster as well.

This won't be 100% reliable, because it can't distinguish between the case where <C1, Vir> is rendered as a half-form and the case where the font similarly ligates the two glyphs, but only to produce a full consonant with virama as a single glyph. In that case, presumably ending first-letter after the virama (which remains visible from a user's point of view, even though the glyph has been absorbed into the ligature) would be desired, but we'll see it as equivalent to a half-form and include the next cluster.
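
The proposal and its limitation can be sketched in Python under a deliberately simplified model. This is not Gecko's code: `ShapedCluster`, its `virama_ligated` flag and `first_letter_length` are hypothetical names, and the flag stands in for whatever the shaped text reports about the virama glyph having been combined with the preceding letter. Only the Devanagari virama is handled.

```python
from dataclasses import dataclass

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


@dataclass
class ShapedCluster:
    text: str             # code points of one default grapheme cluster
    virama_ligated: bool  # hypothetical: font merged the trailing virama
                          # into the glyph of the preceding letter


def first_letter_length(clusters):
    """Return how many clusters the first letter covers under the heuristic."""
    count = 0
    for cluster in clusters:
        count += 1
        ends_in_virama = cluster.text.endswith(VIRAMA)
        if not (ends_in_virama and cluster.virama_ligated):
            break  # ordinary boundary: the first letter ends here
    return count


# Font renders SA + virama as a half-form: include the next cluster too.
half_form = [ShapedCluster("\u0938\u094D", True), ShapedCluster("\u0925\u093F", False)]
# Font shows a separate, visible virama glyph: stop after the first cluster.
visible = [ShapedCluster("\u0938\u094D", False), ShapedCluster("\u0925\u093F", False)]

print(first_letter_length(half_form))  # 2
print(first_letter_length(visible))    # 1
```

In this model a font that ligates the consonant and virama into a single glyph with the virama still visible sets the same flag as a half-form, so it also yields 2 even though 1 would be the desired result. That is the unreliable case described above.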

Richard, does this sound like a useful step forward? The main scripts that I think would be affected are Devanagari, Bengali and Gujarati; are there others with a similar pattern of half-form usage that we should treat the same way?

I'll put up a patch that implements this; we should also add some WPT testcases.

Flags: needinfo?(ishida)
Assignee: nobody → jfkthame

> I think the issue here primarily relates to scripts like Devanagari that commonly use "half-form" consonants, whose underlying encoding is <consonant, virama>.

And actually 'virama' here means just the subset of virama-type characters that are used to produce conjuncts. If one of the following 'Invisible_Stacker' characters is detected, i think the browser should just continue to gather codepoints to make up the unbreakable typographic character unit:

꫶ U+AAF6 MEETEI MAYEK VIRAMA
𑤾 U+1193E DIVES AKURU VIRAMA
𑵅 U+11D45 MASARAM GONDI VIRAMA
𑶗 U+11D97 GUNJALA GONDI VIRAMA
᮫ U+1BAB SUNDANESE SIGN VIRAMA
𐨿 U+10A3F KHAROSHTHI VIRAMA
𑩇 U+11A47 ZANABAZAR SQUARE SUBJOINER
𑪙 U+11A99 SOYOMBO SUBJOINER
္ U+1039 MYANMAR SIGN VIRAMA
𑄳 U+11133 CHAKMA VIRAMA
្ U+17D2 KHMER SIGN COENG
᩠ U+1A60 TAI THAM SIGN SAKOT

I did a little bit of testing and we may be ok there.
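
The gathering behaviour suggested for the 'Invisible_Stacker' characters can be sketched as follows. This is an illustration only: the code points are copied from the list above, `first_unit` is a hypothetical helper, and "combining mark" is approximated by the Unicode general categories starting with `M` rather than by the full grapheme cluster rules.

```python
import unicodedata

# Code points copied from the list of invisible stackers above.
INVISIBLE_STACKERS = {
    0xAAF6, 0x1193E, 0x11D45, 0x11D97, 0x1BAB, 0x10A3F,
    0x11A47, 0x11A99, 0x1039, 0x11133, 0x17D2, 0x1A60,
}


def first_unit(text):
    """Return the leading unit, continuing past any invisible stacker."""
    if not text:
        return ""
    end = 1
    while end < len(text):
        prev_is_stacker = ord(text[end - 1]) in INVISIBLE_STACKERS
        is_mark = unicodedata.category(text[end]).startswith("M")
        if prev_is_stacker or is_mark:
            end += 1  # still inside the same typographic unit
        else:
            break
    return text[:end]


# Khmer: SA, COENG, TA, COENG, RO, vowel sign II, followed by a separate KA
khmer = "\u179F\u17D2\u178F\u17D2\u179A\u17B8"
print(first_unit(khmer + "\u1780") == khmer)  # the trailing KA is excluded
```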

The characters that are problematic for the issue at hand include (presumably all of) the following, which have the indic property name 'virama'. These are the ones that may appear or disappear according to the capability of the font (or in some cases, use of ZWJ/ZWNJ).

্ U+09CD BENGALI SIGN VIRAMA
੍ U+0A4D GURMUKHI SIGN VIRAMA
્ U+0ACD GUJARATI SIGN VIRAMA
୍ U+0B4D ORIYA SIGN VIRAMA
் U+0BCD TAMIL SIGN VIRAMA
్ U+0C4D TELUGU SIGN VIRAMA
್ U+0CCD KANNADA SIGN VIRAMA
് U+0D4D MALAYALAM SIGN VIRAMA
් U+0DCA SINHALA SIGN AL-LAKUNA
꠆ U+A806 SYLOTI NAGRI SIGN HASANTA
꣄ U+A8C4 SAURASHTRA SIGN VIRAMA
𑂹 U+110B9 KAITHI SIGN VIRAMA
𑇀 U+111C0 SHARADA SIGN VIRAMA
𑈵 U+11235 KHOJKI SIGN VIRAMA
𑍍 U+1134D GRANTHA SIGN VIRAMA
𑑂 U+11442 NEWA SIGN VIRAMA
𑓂 U+114C2 TIRHUTA SIGN VIRAMA
𑖿 U+115BF SIDDHAM SIGN VIRAMA
𑘿 U+1163F MODI SIGN VIRAMA
𑚶 U+116B6 TAKRI SIGN VIRAMA
𑧠 U+119E0 NANDINAGARI SIGN VIRAMA
𑠹 U+11839 DOGRA SIGN VIRAMA
𑁆 U+11046 BRAHMI VIRAMA
𑰿 U+11C3F BHAIKSUKI SIGN VIRAMA
᭄ U+1B44 BALINESE ADEG ADEG
꧀ U+A9C0 JAVANESE PANGKON

These are mostly South Asian scripts, but also include Balinese & Javanese from SE Asia.
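
For building test strings, the effect of ZWJ/ZWNJ mentioned above can be sketched like this. The helper `conjunct_variants` is hypothetical, and the labels describe the conventional behaviour in scripts such as Devanagari and Bengali; other scripts in the list may treat the joiners differently, so each variant should be checked in the target script and font.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def conjunct_variants(c1, virama, c2):
    """Return three encodings of the same consonant pair for use in tests."""
    return {
        # The font decides whether a conjunct or a visible virama is shown.
        "default": c1 + virama + c2,
        # ZWNJ after the virama requests a visible (explicit) virama.
        "explicit_virama": c1 + virama + ZWNJ + c2,
        # ZWJ after the virama requests a half-form of the first consonant.
        "half_form": c1 + virama + ZWJ + c2,
    }


# Devanagari KA, virama, SSA
variants = conjunct_variants("\u0915", "\u094D", "\u0937")
for name, value in variants.items():
    print(name, [f"U+{ord(ch):04X}" for ch in value])
```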

Certainly Devanagari uses 'half-forms' in the sense of a consonant with a vertical stroke removed, but that particular approach to composing conjuncts is not that widespread. In case it helps, i updated some of my materials over the past few days that show the different strategies used for conjunct composition for a number of the scripts we are concerned with. Here are the links:

https://r12a.github.io/scripts/bengali/#clusters
https://r12a.github.io/scripts/devanagari/#clusters
https://r12a.github.io/scripts/gujarati/#clusters
https://r12a.github.io/scripts/gurmukhi/#clusters
https://r12a.github.io/scripts/newa/#clusters
https://r12a.github.io/scripts/malayalam/#clusters
https://r12a.github.io/scripts/oriya/#clusters
https://r12a.github.io/scripts/sinhala/#clusters
https://r12a.github.io/scripts/tamil/#clusters
https://r12a.github.io/scripts/telugu/#clusters
https://r12a.github.io/scripts/balinese/#clusters
https://r12a.github.io/scripts/javanese/#clusters

I'm not sure which of the cases listed in those descriptions fall into the category of 'half-forms' vs 'ligatures'.

Btw, I suppose it would be ideal if there was a way to detect whether the virama glyph had been used, since i suspect that that would correlate with all the contexts where the consonant cluster would be treated as more than one typographic unit. Presumably that's not possible?

hope that helps

Flags: needinfo?(ishida)

Btw, this might be useful for generating additional tests:
https://r12a.github.io/scripts/apps/conjunct_generator/?preset=beng

You can select presets for other languages than Bengali on the right-hand side of the top of the page, or you can create your own consonant lists.

The severity field is not set for this bug.
:hiro, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(hikezoe.birchill)
Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d5c9e8167503
Attempt to improve Indic-script ::first-letter behavior by not allowing a boundary after a ligated virama in scripts that use half-consonant forms. r=emilio
https://hg.mozilla.org/integration/autoland/rev/cb7d18a0b541
Add WPT reftests for ::first-letter in Devanagari script, based on i18n testcases. r=emilio
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/31958 for changes under testing/web-platform/tests
Flags: needinfo?(hikezoe.birchill)
Status: UNCONFIRMED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 97 Branch
Upstream PR merged by moz-wptsync-bot

Fwiw, I just added a related bug at https://bugzilla.mozilla.org/show_bug.cgi?id=1746172 which deals with independent vowels followed by virama+YA+AA.
