1848049 - Line-breaking rules incompatible with Blink and WebKit when characters are ASCII (word breaks after one character)

Interestingly, using "?" instead of "!" gives the same behavior in Nightly, but in Chrome and Safari there's a line-break after "?" but not after "!":

data:text/html,<h1 style="width:1em">?testing !testing

Checking the Unicode line-break data, we see that both characters have the same line-break class:

0021;EX           # Po         EXCLAMATION MARK
...
003F;EX           # Po         QUESTION MARK

This implies that in a direct, non-tailored implementation of UAX#14, they would behave the same; but the other browsers must have tailored the class or breaking rules for the exclamation mark.

Makoto Kato [:m_kato] (Assignee)
Updated • 9 months ago

Duplicate of this bug: 1848091

Makoto Kato [:m_kato] (Assignee)
Updated • 9 months ago

No longer duplicate of this bug: 1848091

Jonathan Kew [:jfkthame]
Comment 7 • 9 months ago

A similar tailoring seems to be happening to some other characters, including SOLIDUS /, which by default would be linebreak class=SY, and VERTICAL LINE |, which would be class=BA. But not to other class=BA characters like EN DASH –.

With a test like

data:text/html;charset=utf-8,<h1 style="width:0">?test !test .test /test |test %E2%80%93test test?test test!test test.test test/test test|test test%E2%80%93test

both Safari and Chrome give the / and | the same behavior as FULL STOP .

Ting-Yu Lin [:TYLin] (PDT, UTC-7)
Comment 8 • 9 months ago

Re comment 5:

FWIW, on the ICU demo page, the input ?testing !testing produces line-break opportunities like

?|testing !|testing|

So ICU4X line breaking matches ICU in this particular case.
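As a rough illustration of why an untailored breaker gives this result, here is a minimal sketch that reproduces the output above from a handful of simplified UAX#14 pair rules. The three-class model and the names `classify`, `canBreakBetween`, and `markBreaks` are hypothetical, made up for this example; this is not how ICU or ICU4X are actually implemented.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Simplified line-break classes: only the ones this example needs.
enum class LineBreakClass { AL, EX, SP };

// Hypothetical classifier covering just spaces, '!' and '?'.
// Everything else is treated as a letter (AL) for the purposes of this sketch.
LineBreakClass classify(char c) {
    if (c == ' ') return LineBreakClass::SP;
    if (c == '!' || c == '?') return LineBreakClass::EX;
    return LineBreakClass::AL;
}

// Pair rules in priority order, loosely following UAX#14:
//   LB7  : no break before a space
//   LB13 : no break before EX, even after spaces
//   LB18 : break after spaces
//   LB28 : no break between letters
//   LB31 : break everywhere else
bool canBreakBetween(LineBreakClass before, LineBreakClass after) {
    if (after == LineBreakClass::SP) return false;   // LB7
    if (after == LineBreakClass::EX) return false;   // LB13
    if (before == LineBreakClass::SP) return true;   // LB18
    if (before == LineBreakClass::AL && after == LineBreakClass::AL)
        return false;                                // LB28
    return true;                                     // LB31
}

// Returns the text with '|' inserted at each break opportunity,
// including the one at the end of the text (LB3).
std::string markBreaks(std::string_view text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        out += text[i];
        bool atEnd = (i + 1 == text.size());
        if (atEnd || canBreakBetween(classify(text[i]), classify(text[i + 1])))
            out += '|';
    }
    return out;
}
```

Under these assumptions, `markBreaks("?testing !testing")` gives `?|testing !|testing|`. The break after `!` comes from the default "break everywhere else" rule, since nothing in the untailored rules forbids a break between EX and a following letter.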

Makoto Kato [:m_kato] (Assignee)
Comment 9 • 9 months ago • Edited

Maybe WebKit and Blink don't use ICU for Latin-1?

https://searchfox.org/wubkat/rev/cd25edd92284ea5ea247483e66b404c0774949b2/Source/WebCore/rendering/BreakLines.cpp#53

Jonathan Kew [:jfkthame]
Comment 10 • 9 months ago • Edited

Ah, interesting. Sure enough, if we replace the Latin letters with Cyrillic, for example, webkit and blink do put a break after the initial !, just like Gecko:

data:text/html;charset=utf-8,<h1 style="width:0">?%D1%82%D0%B5%D1%81%D1%82%20!%D1%82%D0%B5%D1%81%D1%82%20.%D1%82%D0%B5%D1%81%D1%82%20/%D1%82%D0%B5%D1%81%D1%82%20|%D1%82%D0%B5%D1%81%D1%82%20%E2%80%93%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82?%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82!%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82.%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82/%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82|%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82%E2%80%93%D1%82%D0%B5%D1%81%D1%82

Jonathan Kew [:jfkthame]
Comment 11 • 9 months ago

So I guess we need to consider a few questions here.

- Do we want to follow webkit and blink's behavior here?
- If so, do we want to limit it to ASCII letters only (not even Latin-1), as they do?
  - testcase: data:text/html;charset=utf-8,<h1 style="width:0">!exchange<br>!échanger; safari and chrome break after the exclamation point in "!exchange" but not in "!échanger"
- If we extend this beyond basic ASCII to include accented letters, etc, then what about non-Latin letters?
- Should we seek to get this standardized in some way?

My feeling is that we should probably do this, but we should do it by tailoring the line-break class of exclamation point (etc) rather than by creating an ASCII-only alternative code path, and so it would apply more widely than is currently the case in the other browsers. Their special-casing of ASCII only leads to anomalies such as allowing a line-break in "hiver/été" but not in "été/hiver" (testcase: data:text/html;charset=utf-8,<h1 style="width:0">été/hiver<br>hiver/été), which IMO is undesirable.
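To make the anomaly concrete, here is a minimal sketch of the two-path behavior described in this comment, assuming the ASCII-only path is taken whenever the character after the slash is ASCII and that the generic path follows the UAX#14 default (break allowed after SOLIDUS when a letter follows). The function `breaksAfterSlash` and its `asciiOnlyTailoring` flag are hypothetical names for illustration only; they do not correspond to actual WebKit, Blink, or Gecko code.

```cpp
#include <cstddef>
#include <string_view>

bool isAscii(char32_t cp) { return cp < 0x80; }

// Reports whether a break opportunity follows the first '/' in `text`,
// assuming the character after the slash is a letter.
//
// asciiOnlyTailoring == true  : models the two-path design, where the
//     "no break after '/'" tailoring only applies if the next character
//     is ASCII; otherwise the untailored default applies.
// asciiOnlyTailoring == false : models a single-path design, where the
//     tailoring applies regardless of the following letter's script.
bool breaksAfterSlash(std::u32string_view text, bool asciiOnlyTailoring) {
    std::size_t slash = text.find(U'/');
    if (slash == std::u32string_view::npos || slash + 1 >= text.size())
        return false;  // no slash, or nothing after it to break before
    char32_t next = text[slash + 1];
    bool tailoringApplies = !asciiOnlyTailoring || isAscii(next);
    // When the tailoring applies, the break is suppressed; otherwise the
    // default behavior (break allowed after SY before a letter) is used.
    return !tailoringApplies;
}
```

With the ASCII-only flag set, `U"hiver/\u00E9t\u00E9"` (hiver/été) reports a break after the slash while `U"\u00E9t\u00E9/hiver"` (été/hiver) does not; with the flag cleared, both behave the same.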

BugBot [:suhaib / :marco/ :calixte]
Comment 12 • 9 months ago

The severity field is not set for this bug.
:boris, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(boris.chiou)

Boris Chiou [:boris]
Updated • 9 months ago

Severity: -- → S3

Flags: needinfo?(boris.chiou)

Makoto Kato [:m_kato] (Assignee)
8 months ago

testing/web-platform/tests/css/css-text/white-space/white-space-pre-wrap-justify-004.html may also be failing with UAX#14 for the same reason. (If the EX character isn't in Latin-1, this test fails even in Blink.)

Daniel Holbert [:dholbert]
Comment 17 • 8 months ago

Updating status to disabled for v119, but still affecting 120 (since it's nightly-only for now per https://searchfox.org/mozilla-central/rev/57f6fbd39c0b5957e11b27b4db58b821d8e1607d/modules/libpref/init/StaticPrefList.yaml#7220-7223 )

status-firefox119: affected → disabled

status-firefox120: --- → affected

Makoto Kato [:m_kato] (Assignee)
Comment 18 • 8 months ago

WIP: https://treeherder.mozilla.org/jobs?repo=try&revision=5b872df55ed4e833972491d8a47138d53a7c171e

Jonathan Kew [:jfkthame]
Comment 19 • 8 months ago

If we're going to adjust our behavior to be closer to the other browsers here, I think we should not make it conditional on the word being strictly ASCII-only. As noted in comment 11, this results in unexpected inconsistencies in behavior when content is largely ASCII-only, but some words include accented letters -- which is very common for major European languages.

Therefore, I don't think we should simply copy their special ASCII-only table; instead, can we tailor the line-break classes of certain characters such as ! (and some others, as mentioned in comments above)? We could do this either unconditionally, or only for Latin-script content, but not just for ASCII -- that's too limited of a special-case.

That means we would not be exactly copying their behavior... that's deliberate. I propose to file bugs against Blink and WebKit about their current inconsistency, as seen in examples like:

data:text/html;charset=utf-8,<p style="width:0">écouter/parler</p> <p style="width:0">parler/écouter</p>

We can and should do better. Either both of these cases should break, or neither of them.

Jonathan Kew [:jfkthame]
Comment 20 • 8 months ago

FTR, I've just filed https://bugs.webkit.org/show_bug.cgi?id=262497 and https://bugs.chromium.org/p/chromium/issues/detail?id=1488608 about their inconsistent behavior.

Makoto Kato [:m_kato] (Assignee)
Comment 21 • 8 months ago

(In reply to Jonathan Kew [:jfkthame] from comment #19)

> If we're going to adjust our behavior to be closer to the other browsers here, I think we should not make it conditional on the word being strictly ASCII-only. As noted in comment 11, this results in unexpected inconsistencies in behavior when content is largely ASCII-only, but some words include accented letters -- which is very common for major European languages.

Should we escalate this to Unicode.org or the CSSWG? I don't know whether this behavior is convenient for European languages. The ASCII table is a historical artifact, kept for compatibility with old Gecko and old IE, so I guess it would be better to remove these old line-breaking rules and match UAX#14 instead.

> Therefore, I don't think we should simply copy their special ASCII-only table; instead, can we tailor the line-break classes of certain characters such as ! (and some others, as mentioned in comments above)? We could do this either unconditionally, or only for Latin-script content, but not just for ASCII -- that's too limited of a special-case.
>
> That means we would not be exactly copying their behavior... that's deliberate. I propose to file bugs against Blink and WebKit about their current inconsistency, as seen in examples like:
> data:text/html;charset=utf-8,<p style="width:0">écouter/parler</p> <p style="width:0">parler/écouter</p>
> We can and should do better. Either both of these cases should break, or neither of them.

But I don't know whether they (WebKit and Blink) will change their breaking rules. I think this bug is a compatibility issue, not a request for new behavior. If we change our behavior first, that creates a new compatibility issue until they fix theirs.

Do you think we should wait to enable ICU4X's line breaker on the stable channel until they fix it?

Jonathan Kew [:jfkthame]
Comment 22 • 8 months ago

IMO, we don't need to wait for webkit and/or blink to change; we can still enable ICU4X (with the customizations we decide are worthwhile). I expect this would bring us closer to webkit/blink behavior in many cases, and I don't think 100%-identical behavior in all cases is a hard requirement.

So my idea of the way forward:

- We should tailor ! because of the CSS !important example as reported here (note that our legacy behavior was not to break there, so we'll be maintaining existing behavior as well as matching the other browsers).
- We should also probably tailor / to not break between slash and a following letter (though UAX#14 by default would allow a break), because that matches what the other browsers do for ASCII only and matches our existing behavior. And the same for |, although that's much less common.

(There may be a handful of other tailorings to consider... it should be feasible to compare the default ICU4X/UAX#14 behavior for the ASCII range with the other browsers' legacy table and evaluate any other differences to see if they seem important.)
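A minimal sketch of the kind of comparison described above, restricted to the four characters whose classes are stated in this bug (! and ? as EX, / as SY, | as BA). It expresses the proposed tailoring as a rule override rather than a class change, which is only one possible approach. All names here (`defaultClass`, `defaultBreakBeforeLetter`, `tailoredBreakBeforeLetter`, `differingCharacters`) are hypothetical, and the `Other` bucket simply means "not modeled in this sketch".

```cpp
#include <string>

enum class LineBreakClass { AL, BA, EX, SY, Other };

// Default classes for the characters discussed in this bug.
// Anything that is not one of these or an ASCII letter is not modeled.
LineBreakClass defaultClass(char c) {
    switch (c) {
        case '!':
        case '?': return LineBreakClass::EX;
        case '/': return LineBreakClass::SY;
        case '|': return LineBreakClass::BA;
        default:  break;
    }
    bool asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    return asciiLetter ? LineBreakClass::AL : LineBreakClass::Other;
}

// Untailored behavior: a break is allowed between an EX, SY, or BA
// character and a following letter. Unmodeled characters report false.
bool defaultBreakBeforeLetter(char before) {
    switch (defaultClass(before)) {
        case LineBreakClass::EX:
        case LineBreakClass::SY:
        case LineBreakClass::BA: return true;
        default:                 return false;
    }
}

// Tailoring proposed in this comment: suppress the break between
// '!', '/', '|' and a following letter; leave everything else alone.
bool tailoredBreakBeforeLetter(char before) {
    if (before == '!' || before == '/' || before == '|') return false;
    return defaultBreakBeforeLetter(before);
}

// Lists the candidate characters on which the two rule sets disagree.
std::string differingCharacters(const std::string& candidates) {
    std::string out;
    for (char c : candidates) {
        if (defaultBreakBeforeLetter(c) != tailoredBreakBeforeLetter(c))
            out += c;
    }
    return out;
}
```

For the candidate set `"!?/|"`, this reports `!/|` as the characters whose behavior changes; `?` keeps its default break, consistent with the observation at the top of this bug that the other browsers break after ? but not after !. A real comparison would run the actual ICU4X breaker against the other browsers' table over the whole ASCII range.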

Makoto Kato [:m_kato] (Assignee)
Comment 23 • 8 months ago

(In reply to Jonathan Kew [:jfkthame] from comment #22)

> (There may be a handful of other tailorings to consider... it should be feasible to compare the default ICU4X/UAX#14 behavior for the ASCII range with the other browsers' legacy table and evaluate any other differences to see if they seem important.)

Other differences that I know of are https://searchfox.org/mozilla-central/source/testing/web-platform/tests/css/css-text/i18n/css3-text-line-break-baspglwj-026.html (which fails on WebKit and Blink because of that table), and the handling between a number and a hyphen, as in (111-222). The latter is at https://searchfox.org/wubkat/rev/d65248aa5c753da56ad0c0e0028e4c4cddd5c1f4/Source/WebCore/rendering/BreakLines.h#62-65.

Jonathan Kew [:jfkthame]
Comment 24 • 8 months ago

(In reply to Makoto Kato [:m_kato] from comment #23)

> (In reply to Jonathan Kew [:jfkthame] from comment #22)
>
> > (There may be a handful of other tailorings to consider... it should be feasible to compare the default ICU4X/UAX#14 behavior for the ASCII range with the other browsers' legacy table and evaluate any other differences to see if they seem important.)
>
> Other differences that I know are https://searchfox.org/mozilla-central/source/testing/web-platform/tests/css/css-text/i18n/css3-text-line-break-baspglwj-026.html (failed on WebKit and Blink due to that table)

Right, that's the vertical-bar character |, where UAX#14 calls for a break-after, but they suppress it, like with slash /. As per comment 22, I'd be inclined to tailor the class of | to achieve similar behavior.

> and between number and hyphen such as (111-222). It is https://searchfox.org/wubkat/rev/d65248aa5c753da56ad0c0e0028e4c4cddd5c1f4/Source/WebCore/rendering/BreakLines.h#62-65.

Hyphen is a tricky one; there's no "right" answer in all cases, but probably some special-case heuristics are worth doing. Webkit's ASCII-only processing gives inconsistent results for this, too: compare

data:text/html;charset=utf-8,<p style="width:0">abc-123</p> <p style="width:0">абц-123</p>

IMO, it'd be better for them to do something like

    if (character == '-' && isDigit(nextCharacter))
        return isAlphanumeric(lastCharacter);

than the ASCII-specific version they currently have. ASCII letters should not be privileged.
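A self-contained sketch of that hyphen heuristic, comparing an ASCII-only variant with a script-neutral one. It assumes the returned value means "a break is allowed after the hyphen". The helpers `isLetter`, `isAlphanumeric`, and `isAsciiAlphanumeric` are hypothetical stand-ins that only know ASCII and the basic Cyrillic letters U+0410..U+044F; a real implementation would use proper Unicode property lookups. `hyphenDigitRule` is likewise a made-up name, not a WebKit function.

```cpp
#include <optional>

// Hypothetical stand-ins for real Unicode property lookups.
bool isDigit(char32_t c) { return c >= U'0' && c <= U'9'; }

bool isLetter(char32_t c) {
    bool asciiLetter = (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
    bool cyrillicLetter = c >= 0x0410 && c <= 0x044F;  // demo range only
    return asciiLetter || cyrillicLetter;
}

bool isAlphanumeric(char32_t c) { return isLetter(c) || isDigit(c); }

bool isAsciiAlphanumeric(char32_t c) { return c < 0x80 && isAlphanumeric(c); }

// Special case for a hyphen followed by a digit.
// Returns std::nullopt when the special case does not apply, so the caller
// can fall through to its normal rules. Otherwise returns whether a break
// is allowed after the hyphen: yes if the preceding character is
// alphanumeric (as in "abc-123"), no if the hyphen looks like a minus
// sign (as in "(-123").
std::optional<bool> hyphenDigitRule(char32_t lastCharacter,
                                    char32_t character,
                                    char32_t nextCharacter,
                                    bool asciiOnly) {
    if (character != U'-' || !isDigit(nextCharacter))
        return std::nullopt;
    return asciiOnly ? isAsciiAlphanumeric(lastCharacter)
                     : isAlphanumeric(lastCharacter);
}
```

With `asciiOnly` set, the rule allows the break in abc-123 but not in абц-123 (last character U+0446); with it cleared, both cases agree.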

If having an ASCII-only fast-path/table is worthwhile for performance (rather than to fix compat issues), then I'm not opposed to doing that; but we should make sure the behavior of the ASCII path matches what our full Unicode breaker would do for the same text.

So if we want to deviate from the default UAX#14 behavior (for example, avoiding breaks between !, /, and | and a following letter), I think we should first tailor the classes of those characters to give the desired result in the ICU4X codepath; and then create an ASCII-only fast lookup table that matches this behavior.
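One way to keep the fast path and the full breaker in agreement is to generate the ASCII table from the full breaker rather than maintaining it by hand. Here is a minimal sketch of that idea. `fullBreakerCanBreakBetween` is a deliberately tiny, hypothetical stand-in for the tailored ICU4X code path (it only knows spaces, ASCII letters, and ! ? / |), and it treats breaking as a pure pair decision, which real UAX#14 rules are not; the names `AsciiBreakTable`, `buildAsciiTable`, and `canBreakBetween` are also invented for this example.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

constexpr std::size_t kAsciiCount = 128;
using AsciiBreakTable = std::array<std::bitset<kAsciiCount>, kAsciiCount>;

bool isAsciiLetter(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}

// Hypothetical stand-in for the full, tailored Unicode breaker.
bool fullBreakerCanBreakBetween(char32_t before, char32_t after) {
    if (after == U' ') return false;        // never break before a space
    if (before == U' ') return true;        // break after a space
    if (!isAsciiLetter(after)) return false;
    // Tailored: no break between '!', '/', '|' and a following letter.
    if (before == U'!' || before == U'/' || before == U'|') return false;
    // Untailored: '?' still allows a break before a following letter.
    return before == U'?';
}

// Derives the ASCII table by querying the full breaker for every pair,
// so the two code paths cannot drift apart.
AsciiBreakTable buildAsciiTable() {
    AsciiBreakTable table{};
    for (std::size_t b = 0; b < kAsciiCount; ++b) {
        for (std::size_t a = 0; a < kAsciiCount; ++a) {
            table[b][a] = fullBreakerCanBreakBetween(
                static_cast<char32_t>(b), static_cast<char32_t>(a));
        }
    }
    return table;
}

// Fast path for ASCII pairs, full breaker for everything else.
bool canBreakBetween(const AsciiBreakTable& table,
                     char32_t before, char32_t after) {
    if (before < kAsciiCount && after < kAsciiCount)
        return table[before][after];
    return fullBreakerCanBreakBetween(before, after);
}
```

Because the table is derived rather than hand-written, a later change to the tailoring only has to be made in one place.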

Ting-Yu Lin [:TYLin] (PDT, UTC-7)
Updated • 8 months ago

Blocks: 1854032

Emilio Cobos Álvarez (:emilio)
Updated • 7 months ago

status-firefox120: affected → disabled

Makoto Kato [:m_kato] (Assignee)
4 months ago

Blocks: segmenter

Ting-Yu Lin [:TYLin] (PDT, UTC-7)
Updated • 4 months ago

Comment 25 • 3 months ago

Hi Jonathan and Ting-Yu, I'm just trying to understand where we're at with this bug. Do you think it needs to be fixed?

Makoto filed an issue with the CSSWG to specify this more precisely but it was suggested the spec is deliberately open-ended on this point: https://github.com/w3c/csswg-drafts/issues/9432

Do you think we should specify it or is it ok to leave it open-ended?

Flags: needinfo?(jfkthame)

Flags: needinfo?(aethanyc)

Jonathan Kew [:jfkthame]
Comment 26 • 3 months ago

I guess it's OK for the spec to be somewhat open-ended about line breaking, as there are lots of competing factors and no single solution will be "right" in all circumstances. Treating it as a quality-of-implementation issue leaves some room for browsers to come up with rules or heuristics to try and improve their results.

At the same time, we've seen that discrepancies in line-breaking behavior do lead to webcompat issues, so I think there'd be benefit in the spec including informative notes about any deviations from the default UAX#14 behavior that browsers have chosen to implement. For example, as described in comments above, we can see that Blink and Webkit are not following UAX#14 defaults for characters like ! and /. It'd be helpful if these details were documented in an informative note, so that other implementors could make an informed choice as to whether to follow the same customizations or make different trade-offs.

On the implementation side, I do think we should make some adjustments here, as suggested in comment 22 & comment 24 above.

Flags: needinfo?(jfkthame)

Jonathan Kew [:jfkthame]
Updated • 3 months ago

Duplicate of this bug: 1881600

Jason
Comment 28 • 3 months ago

Just adding my 2cents here.
Maybe just add an about:config setting to override the list of line-breaking characters.
If the change is to make Firefox conform to the standards, then that's fine; otherwise the line-break behavior shouldn't change.

In my case, this broke some of my date text that went from:
23/Jan/2024

to

23/
Jan/
2024

Brian Birtles (:birtles)
Comment 29 • 3 months ago • Edited

(In reply to Jonathan Kew [:jfkthame] from comment #26)

> It'd be helpful if these details were documented in an informative note, so that other implementors could make an informed choice as to whether to follow the same customizations or make different trade-offs.
> ...
> On the implementation side, I do think we should make some adjustments here, as suggested in comment 22 & comment 24 above.

Excellent, thanks so much Jonathan. It sounds like the next step is to draft an informative note for CSS issue #9432 that captures the contents of comment 22 and comment 24. I can try to do that next week assuming I'm not on paternity leave then.

Brian Birtles (:birtles)
Comment 30 • 3 months ago

I submitted https://github.com/w3c/csswg-drafts/pull/9997 to try and capture the discussion here. Feel free to fix it as needed.

Ting-Yu Lin [:TYLin] (PDT, UTC-7)
Comment 31 • 3 months ago

Brian, thank you for submitting a PR to the spec.

Flags: needinfo?(aethanyc)

Ting-Yu Lin [:TYLin] (PDT, UTC-7)
Updated • 3 months ago

Comment 32 • 4 days ago

Hi Jonathan, would you mind reviewing https://github.com/w3c/csswg-drafts/pull/9997? Thanks!

Flags: needinfo?(jfkthame)