Open Bug 1848049 Opened 9 months ago Updated 4 days ago

Incompatible line-breaking rules with Blink and WebKit when characters are ASCII (Word breaks after one character)

Categories

(Core :: Internationalization, defect)

Firefox 118
Desktop
All
defect

Tracking

()

Tracking Status
firefox-esr102 --- unaffected
firefox-esr115 --- unaffected
firefox116 --- unaffected
firefox117 --- unaffected
firefox118 --- disabled
firefox119 --- disabled
firefox120 --- disabled

People

(Reporter: sime.vidas, Assigned: m_kato, NeedInfo)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: nightly-community, regression)

Attachments

(1 file)

Attached file test-page.html

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/118.0

Steps to reproduce:

  1. On desktop, open the attached test page
  2. Resize the browser width

Actual results:

Firefox breaks the word “!important” into “!” and “important” across two lines.

Expected results:

This word should not be broken by default. No other browser breaks it.

Component: Untriaged → Layout: Text and Fonts
Product: Firefox → Core

ICU4x seg probably just thinks ! is an end-of-sentence punctuation. Oops.

:m_kato, since you are the author of the regressor, bug 1719535, could you take a look? Also, could you set the severity field?

For more information, please visit BugBot documentation.

Flags: needinfo?(m_kato)

(This is a Nightly-only change.)

Assignee: nobody → m_kato
Flags: needinfo?(m_kato)

Interestingly, using "?" instead of "!" gives the same behavior in Nightly, but in Chrome and Safari there's a line-break after "?" but not after "!":

data:text/html,<h1 style="width:1em">?testing !testing

Checking the Unicode line-break data, we see that both characters have the same line-break class:

0021;EX           # Po         EXCLAMATION MARK
...
003F;EX           # Po         QUESTION MARK

This implies that in a direct, non-tailored implementation of UAX#14, they would behave the same; but the other browsers must have tailored the class or breaking rules for the exclamation mark.
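To make the class lookup concrete, here is a minimal sketch that parses lines in the LineBreak.txt layout quoted above and confirms that both characters land in the same class. The function name `parse_line_break_data` and the `SAMPLE` string are made up for illustration; this is not how Gecko or ICU4X loads the data.

```python
def parse_line_break_data(text):
    """Map characters to line-break classes from LineBreak.txt-style lines."""
    classes = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code_points, lb_class = (field.strip() for field in line.split(";"))
        # The first field is a single code point or a "first..last" range.
        first, _, last = code_points.partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            classes[chr(cp)] = lb_class
    return classes


# The two data lines quoted in this comment (hypothetical sample input).
SAMPLE = """\
0021;EX           # Po         EXCLAMATION MARK
003F;EX           # Po         QUESTION MARK
"""

classes = parse_line_break_data(SAMPLE)
print(classes["!"], classes["?"])
```

Since both entries resolve to EX, any difference between "!" and "?" in another browser has to come from something layered on top of the default data.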

Duplicate of this bug: 1848091
No longer duplicate of this bug: 1848091

A similar tailoring seems to be happening to some other characters, including SOLIDUS /, which by default would be linebreak class=SY, and VERTICAL LINE |, which would be class=BA. But not to other class=BA characters like EN DASH –.

With a test like

data:text/html;charset=utf-8,<h1 style="width:0">?test !test .test /test |test %E2%80%93test test?test test!test test.test test/test test|test test%E2%80%93test

both Safari and Chrome give the / and | the same behavior as FULL STOP .

Re comment 5:

FWIW. On the ICU demo page, ?testing !testing outputs line break opportunities like

?|testing !|testing|

So ICU4X line breaking matches ICU in this particular case.
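For reference, the same marks fall out of a heavily reduced model of the UAX#14 rules. The sketch below only knows three classes (SP, EX, AL) and five rules (LB7, LB13, LB18, LB28, LB31), and it assumes every character other than space, "!" and "?" is alphabetic; the helper names are invented for this example, and it is not a substitute for ICU or ICU4X.

```python
def lb_class(ch):
    """Reduced class lookup: only what this example needs."""
    if ch == " ":
        return "SP"
    if ch in "!?":
        return "EX"
    return "AL"  # assumption: everything else is treated as alphabetic


def allows_break(before, after):
    """Return True if a break is allowed between two adjacent classes."""
    if after == "SP":   # LB7: do not break before a space
        return False
    if after == "EX":   # LB13: do not break before "!" or "?", even after spaces
        return False
    if before == "SP":  # LB18: break after spaces
        return True
    if before == "AL" and after == "AL":  # LB28: do not break between letters
        return False
    return True         # LB31: break everywhere else


def mark_breaks(text):
    """Insert "|" at every break opportunity, including end of text."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        is_last = i == len(text) - 1
        if is_last or allows_break(lb_class(ch), lb_class(text[i + 1])):
            out.append("|")
    return "".join(out)


print(mark_breaks("?testing !testing"))  # ?|testing !|testing|
```

The break after the leading "!" comes from the catch-all rule (LB31): nothing in the default rules ties an EX character to a following letter.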

Ah, interesting. Sure enough, if we replace the Latin letters with Cyrillic, for example, webkit and blink do put a break after the initial !, just like Gecko:

data:text/html;charset=utf-8,<h1 style="width:0">?%D1%82%D0%B5%D1%81%D1%82%20!%D1%82%D0%B5%D1%81%D1%82%20.%D1%82%D0%B5%D1%81%D1%82%20/%D1%82%D0%B5%D1%81%D1%82%20|%D1%82%D0%B5%D1%81%D1%82%20%E2%80%93%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82?%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82!%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82.%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82/%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82|%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82%E2%80%93%D1%82%D0%B5%D1%81%D1%82

So I guess we need to consider a few questions here.

  • do we want to follow webkit and blink's behavior here?
  • if so, do we want to limit it to ASCII letters only (not even Latin-1), as they do?
    • testcase: data:text/html;charset=utf-8,<h1 style="width:0">!exchange<br>!échanger; safari and chrome break after the exclamation point in "!exchange" but not in "!échanger"
  • if we extend this beyond basic ASCII to include accented letters, etc, then what about non-Latin letters?
  • should we seek to get this standardized in some way?

My feeling is that we should probably do this, but we should do it by tailoring the line-break class of exclamation point (etc) rather than by creating an ASCII-only alternative code path, and so it would apply more widely than is currently the case in the other browsers. Their special-casing of ASCII only leads to anomalies such as allowing a line-break in "hiver/été" but not in "été/hiver" (testcase: data:text/html;charset=utf-8,<h1 style="width:0">été/hiver<br>hiver/été), which IMO is undesirable.
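A rough model of that anomaly, as a sketch only: the two predicates below are hypothetical stand-ins, one for an ASCII-only special case and one for a script-independent tailoring, and neither is taken from Blink, WebKit, or Gecko code.

```python
def break_after_slash_ascii_only(next_ch):
    # Hypothetical ASCII-only special case: the break after "/" is
    # suppressed only when the following character is an ASCII letter.
    return not (next_ch.isascii() and next_ch.isalpha())


def break_after_slash_tailored(next_ch):
    # Hypothetical tailoring: suppress the break before any letter.
    return not next_ch.isalpha()


def breaks_after_slash(word, rule):
    """Apply a rule to the character that follows the first "/" in word."""
    i = word.index("/")
    return rule(word[i + 1])


# \u00e9 is precomposed "é", written as an escape to avoid normalization issues.
for word in ("hiver/\u00e9t\u00e9", "\u00e9t\u00e9/hiver"):
    print(
        word,
        breaks_after_slash(word, break_after_slash_ascii_only),
        breaks_after_slash(word, break_after_slash_tailored),
    )
```

Under the ASCII-only rule the two words get different answers; under the tailored rule they get the same one, which is the consistency being argued for.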

The severity field is not set for this bug.
:boris, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(boris.chiou)
Severity: -- → S3
Flags: needinfo?(boris.chiou)

Set release status flags based on info from the regressing bug 1719535

Summary: Word breaks after one character → Incompatible line-breaking rules with Blink and WebKit when characters are ASCII (Word breaks after one character)
Duplicate of this bug: 1851323
No longer duplicate of this bug: 1851323

testing/web-platform/tests/css/css-text/white-space/white-space-pre-wrap-justify-004.html may also fail under UAX#14 for the same reason. (If the EX character is not Latin-1, this test fails even in Blink.)

If we're going to adjust our behavior to be closer to the other browsers here, I think we should not make it conditional on the word being strictly ASCII-only. As noted in comment 11, this results in unexpected inconsistencies in behavior when content is largely ASCII-only, but some words include accented letters -- which is very common for major European languages.

Therefore, I don't think we should simply copy their special ASCII-only table; instead, can we tailor the line-break classes of certain characters such as ! (and some others, as mentioned in comments above)? We could do this either unconditionally, or only for Latin-script content, but not just for ASCII -- that's too limited of a special-case.

That means we would not be exactly copying their behavior... that's deliberate. I propose to file bugs against Blink and WebKit about their current inconsistency, as seen in examples like:

data:text/html;charset=utf-8,<p style="width:0">écouter/parler</p> <p style="width:0">parler/écouter</p>

We can and should do better. Either both of these cases should break, or neither of them.

(In reply to Jonathan Kew [:jfkthame] from comment #19)

> If we're going to adjust our behavior to be closer to the other browsers here, I think we should not make it conditional on the word being strictly ASCII-only. As noted in comment 11, this results in unexpected inconsistencies in behavior when content is largely ASCII-only, but some words include accented letters -- which is very common for major European languages.

Should we escalate this to Unicode.org or the CSSWG? I don't know whether this behavior is convenient for European languages. The ASCII table is a historical artifact kept for compatibility with old Gecko and old IE, so I guess it would be better to remove these old line-breaking rules and match UAX#14.

> Therefore, I don't think we should simply copy their special ASCII-only table; instead, can we tailor the line-break classes of certain characters such as ! (and some others, as mentioned in comments above)? We could do this either unconditionally, or only for Latin-script content, but not just for ASCII -- that's too limited of a special-case.

> That means we would not be exactly copying their behavior... that's deliberate. I propose to file bugs against Blink and WebKit about their current inconsistency, as seen in examples like:

> data:text/html;charset=utf-8,<p style="width:0">écouter/parler</p> <p style="width:0">parler/écouter</p>

> We can and should do better. Either both of these cases should break, or neither of them.

But I don't know whether they (WebKit and Blink) will change their breaking rules. I think this is a compatibility issue, not a question of introducing new behavior. If we change first, we create a new compatibility issue until they fix theirs.

Do you think we should wait to enable ICU4X's line breaker on the stable channel until they fix it?

IMO, we don't need to wait for webkit and/or blink to change; we can still enable ICU4X (with the customizations we decide are worthwhile). I expect this would bring us closer to webkit/blink behavior in many cases, and I don't think 100%-identical behavior in all cases is a hard requirement.

So my idea of the way forward:

  • We should tailor ! because of the CSS !important example as reported here (note that our legacy behavior was not to break there, so we'll be maintaining existing behavior as well as matching the other browsers).

  • We should also probably tailor / to not break between slash and a following letter (though UAX#14 by default would allow a break), because that matches what the other browsers do for ASCII only and matches our existing behavior. And the same for |, although that's much less common.

(There may be a handful of other tailorings to consider... it should be feasible to compare the default ICU4X/UAX#14 behavior for the ASCII range with the other browsers' legacy table and evaluate any other differences to see if they seem important.)

(In reply to Jonathan Kew [:jfkthame] from comment #22)

> (There may be a handful of other tailorings to consider... it should be feasible to compare the default ICU4X/UAX#14 behavior for the ASCII range with the other browsers' legacy table and evaluate any other differences to see if they seem important.)

Other differences that I know of are https://searchfox.org/mozilla-central/source/testing/web-platform/tests/css/css-text/i18n/css3-text-line-break-baspglwj-026.html (which fails on WebKit and Blink due to that table) and the handling between a number and a hyphen, such as (111-222); see https://searchfox.org/wubkat/rev/d65248aa5c753da56ad0c0e0028e4c4cddd5c1f4/Source/WebCore/rendering/BreakLines.h#62-65.

(In reply to Makoto Kato [:m_kato] from comment #23)

> (In reply to Jonathan Kew [:jfkthame] from comment #22)

>> (There may be a handful of other tailorings to consider... it should be feasible to compare the default ICU4X/UAX#14 behavior for the ASCII range with the other browsers' legacy table and evaluate any other differences to see if they seem important.)

> Other differences that I know of are https://searchfox.org/mozilla-central/source/testing/web-platform/tests/css/css-text/i18n/css3-text-line-break-baspglwj-026.html (which fails on WebKit and Blink due to that table)

Right, that's the vertical-bar character |, where UAX#14 calls for a break-after, but they suppress it, like with slash /. As per comment 22, I'd be inclined to tailor the class of | to achieve similar behavior.

> and the handling between a number and a hyphen, such as (111-222); see https://searchfox.org/wubkat/rev/d65248aa5c753da56ad0c0e0028e4c4cddd5c1f4/Source/WebCore/rendering/BreakLines.h#62-65.

Hyphen is a tricky one; there's no "right" answer in all cases, but probably some special-case heuristics are worth doing. Webkit's ASCII-only processing gives inconsistent results for this, too: compare

data:text/html;charset=utf-8,<p style="width:0">abc-123</p> <p style="width:0">абц-123</p>

IMO, it'd be better for them to do something like

    if (character == '-' && isDigit(nextCharacter))
        return isAlphanumeric(lastCharacter);

than the ASCII-specific version they currently have. ASCII letters should not be privileged.
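To illustrate the difference, here is a sketch of both forms of that check in Python. It assumes the ASCII-specific version differs from the snippet above only in using ASCII-restricted predicates; the function names are invented, and Python's `str.isdigit` / `str.isalnum` stand in for whatever Unicode-aware helpers a real implementation would use.

```python
def break_after_hyphen_ascii(last_ch, next_ch):
    """ASCII-restricted form: only ASCII letters and digits count."""
    if next_ch.isascii() and next_ch.isdigit():
        return last_ch.isascii() and last_ch.isalnum()
    return None  # the hyphen-before-digit rule does not apply


def break_after_hyphen_unicode(last_ch, next_ch):
    """Script-independent form, mirroring the snippet above."""
    if next_ch.isdigit():
        return last_ch.isalnum()
    return None  # the hyphen-before-digit rule does not apply


def decide(text, rule):
    """Apply a rule to the characters around the first "-" in text."""
    i = text.index("-")
    return rule(text[i - 1], text[i + 1])


for sample in ("abc-123", "абц-123", "(-5)"):
    print(
        sample,
        decide(sample, break_after_hyphen_ascii),
        decide(sample, break_after_hyphen_unicode),
    )
```

Both forms keep "(-5)" together, where the hyphen is probably a minus sign, but only the script-independent form treats "abc-123" and "абц-123" alike.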

If having an ASCII-only fast-path/table is worthwhile for performance (rather than to fix compat issues), then I'm not opposed to doing that; but we should make sure the behavior of the ASCII path matches what our full Unicode breaker would do for the same text.

So if we want to deviate from the default UAX#14 behavior (for example, avoiding breaks between !, /, and | and a following letter), I think we should first tailor the classes of those characters to give the desired result in the ICU4X codepath; and then create an ASCII-only fast lookup table that matches this behavior.
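One way to keep the two paths from drifting apart is to generate the ASCII table from the full breaker instead of maintaining it by hand. The sketch below assumes, purely for illustration, that the decision depends only on the two adjacent characters (real UAX#14 rules need more context than that). `full_breaker_allows_break` is a toy stand-in, not ICU4X, and the tailored set simply lists the characters discussed in this bug.

```python
# Characters after which we (hypothetically) suppress a break before a letter.
TAILORED = {"!", "/", "|"}


def full_breaker_allows_break(before, after):
    """Toy stand-in for the full Unicode breaker's decision on one boundary."""
    if after in " !?/|":
        return False  # no break before a space or before these characters
    if before == " ":
        return True   # break after a space
    if before in "!?/|" and after.isalpha():
        return before not in TAILORED  # tailoring applies to any letter
    return False


# Build the ASCII fast path by asking the full breaker about every ASCII pair.
ASCII = [chr(c) for c in range(128)]
FAST_TABLE = [[full_breaker_allows_break(a, b) for b in ASCII] for a in ASCII]


def allows_break(before, after):
    """Use the table for ASCII pairs and the full breaker otherwise."""
    if before.isascii() and after.isascii():
        return FAST_TABLE[ord(before)][ord(after)]
    return full_breaker_allows_break(before, after)


print(allows_break("!", "i"), allows_break("/", "\u00e9"), allows_break("?", "t"))
```

Because the table is derived rather than hand-written, changing the tailoring changes both paths at once, and the ASCII path cannot disagree with the Unicode path for the same text.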

No longer blocks: 1854032
Component: Layout: Text and Fonts → Internationalization
Depends on: line-breaking
No longer depends on: line-breaking
See Also: → 1876874

Hi Jonathan and Ting-Yu, I'm just trying to understand where we're at with this bug. Do you think it needs to be fixed?

Makoto filed an issue with the CSSWG to specify this more precisely but it was suggested the spec is deliberately open-ended on this point: https://github.com/w3c/csswg-drafts/issues/9432

Do you think we should specify it or is it ok to leave it open-ended?

Flags: needinfo?(jfkthame)
Flags: needinfo?(aethanyc)

I guess it's OK for the spec to be somewhat open-ended about line breaking, as there are lots of competing factors and no single solution will be "right" in all circumstances. Treating it as a quality-of-implementation issue leaves some room for browsers to come up with rules or heuristics to try and improve their results.

At the same time, we've seen that discrepancies in line-breaking behavior do lead to webcompat issues, so I think there'd be benefit in the spec including informative notes about any deviations from the default UAX#14 behavior that browsers have chosen to implement. For example, as described in comments above, we can see that Blink and Webkit are not following UAX#14 defaults for characters like ! and /. It'd be helpful if these details were documented in an informative note, so that other implementors could make an informed choice as to whether to follow the same customizations or make different trade-offs.

On the implementation side, I do think we should make some adjustments here, as suggested in comment 22 & comment 24 above.

Flags: needinfo?(jfkthame)
Duplicate of this bug: 1881600

Just adding my 2 cents here.
Maybe just add an about:config setting to override the list of line-breaking characters.
If the change is to make Firefox conform to the standards, then that's fine; otherwise the line-break behavior shouldn't change.

In my case, this broke some of my date text that went from:
23/Jan/2024

to

23/
Jan/
2024

(In reply to Jonathan Kew [:jfkthame] from comment #26)

> It'd be helpful if these details were documented in an informative note, so that other implementors could make an informed choice as to whether to follow the same customizations or make different trade-offs.
> ...
> On the implementation side, I do think we should make some adjustments here, as suggested in comment 22 & comment 24 above.

Excellent, thanks so much Jonathan. It sounds like the next step is to draft an informative note for CSS issue #9432 that captures the contents of comment 22 and comment 24. I can try to do that next week assuming I'm not on paternity leave then.

I submitted https://github.com/w3c/csswg-drafts/pull/9997 to try and capture the discussion here. Feel free to fix it as needed.

Brian, thank you for submitting a PR to the spec.

Flags: needinfo?(aethanyc)
See Also: → 1883494

Hi Jonathan, would you mind reviewing https://github.com/w3c/csswg-drafts/pull/9997? Thanks!

Flags: needinfo?(jfkthame)