Incompatible line-breaking rules with Blink and WebKit when characters are ASCII (word breaks after one character)
Categories
(Core :: Internationalization, defect)
Tracking
()
Version | Tracking | Status |
---|---|---|
firefox-esr102 | --- | unaffected |
firefox-esr115 | --- | unaffected |
firefox116 | --- | unaffected |
firefox117 | --- | unaffected |
firefox118 | --- | disabled |
firefox119 | --- | disabled |
firefox120 | --- | disabled |
People
(Reporter: sime.vidas, Assigned: m_kato)
References
(Blocks 2 open bugs, Regression)
Details
(Keywords: nightly-community, regression)
Attachments
(1 file: 687 bytes, text/html)
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/118.0
Steps to reproduce:
- On desktop, open the attached test page
- Resize the browser width
Actual results:
Firefox breaks the word “!important” into “!” and “important” across two lines.
Expected results:
This word should not be broken by default. No other browser breaks it.
Comment 1•1 year ago
Regression window:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=6d877fdb9a1e892fe6528a26aab81b53cfae55c5&tochange=d6760db62dc1cc0831c09c42f086cf57d1c2bf48
Comment 2•1 year ago
ICU4X seg probably just thinks ! is an end-of-sentence punctuation. Oops.
Comment 3•1 year ago
:m_kato, since you are the author of the regressor, bug 1719535, could you take a look? Also, could you set the severity field?
For more information, please visit BugBot documentation.
Comment 4•1 year ago (Assignee)
(This is a Nightly-only change.)
Comment 5•1 year ago
Interestingly, using "?" instead of "!" gives the same behavior in Nightly, but in Chrome and Safari there's a line-break after "?" but not after "!":
data:text/html,<h1 style="width:1em">?testing !testing
Checking the Unicode line-break data, we see that both characters have the same line-break class:
0021;EX # Po EXCLAMATION MARK
...
003F;EX # Po QUESTION MARK
This implies that in a direct, non-tailored implementation of UAX#14, they would behave the same; but the other browsers must have tailored the class or breaking rules for the exclamation mark.
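As a minimal sketch of the lookup described above, the snippet below parses entries in the style of the Unicode line-break data quoted in this comment and confirms that both characters resolve to the same class. The function name `parse_line_break_data` and the `SAMPLE` list are illustrative assumptions, not part of any browser code.

```python
def parse_line_break_data(lines):
    """Map each code point to its UAX#14 class from LineBreak.txt-style lines."""
    classes = {}
    for raw in lines:
        # Drop the trailing "# ..." comment, then skip blank lines.
        entry = raw.split("#", 1)[0].strip()
        if not entry:
            continue
        code_points, line_break_class = (field.strip() for field in entry.split(";"))
        # An entry covers either one code point or a "first..last" range.
        first, _, last = code_points.partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            classes[chr(cp)] = line_break_class
    return classes


# The two entries quoted in the comment above.
SAMPLE = [
    "0021;EX # Po EXCLAMATION MARK",
    "003F;EX # Po QUESTION MARK",
]

classes = parse_line_break_data(SAMPLE)
print(classes["!"] == classes["?"])  # same class, so same default behavior
```

Since the class is the only input the default rules see, any difference between "!" and "?" in another browser has to come from a tailoring layered on top of this data.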
Comment 7•1 year ago
A similar tailoring seems to be happening to some other characters, including SOLIDUS /, which by default would be linebreak class=SY, and VERTICAL LINE |, which would be class=BA. But not to other class=BA characters like EN DASH –.
With a test like
data:text/html;charset=utf-8,<h1 style="width:0">?test !test .test /test |test %E2%80%93test test?test test!test test.test test/test test|test test%E2%80%93test
both Safari and Chrome give the / and | the same behavior as FULL STOP .
Comment 8•1 year ago
Re comment 5:
FWIW, on the ICU demo page, ?testing !testing outputs line break opportunities like
?|testing !|testing|
So ICU4X line breaking matches ICU in this particular case.
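The output above can be reproduced with a toy model that covers only the handful of default UAX#14 rules involved here. This is a rough sketch under stated assumptions, not how ICU or ICU4X is implemented: `classify` and `mark_breaks` are hypothetical names, and every character other than "!", "?" and space is treated as a letter.

```python
def classify(ch):
    """Toy classifier: only the classes needed for this example."""
    if ch in "!?":
        return "EX"  # both share class EX in the Unicode data
    if ch == " ":
        return "SP"
    return "AL"  # assumption: everything else is treated as a letter


def mark_breaks(text):
    """Return text with "|" inserted at each line-break opportunity."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i + 1 == len(text):
            out.append("|")  # a break is always allowed at the end of text
            break
        cur, nxt = classify(ch), classify(text[i + 1])
        if nxt in ("SP", "EX"):
            continue  # never break before a space or before EX
        if cur in ("SP", "EX"):
            out.append("|")  # break after spaces, and after EX before a letter
    return "".join(out)


print(mark_breaks("?testing !testing"))  # ?|testing !|testing|
```

Note that there is no opportunity between the space and the second "!": the default rules forbid a break before EX even after a space, which is why the only break near "!testing" falls after the exclamation mark.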
Comment 9•1 year ago (Assignee)
Maybe WebKit and Blink don't use ICU for Latin-1?
Comment 10•1 year ago
Ah, interesting. Sure enough, if we replace the Latin letters with Cyrillic, for example, webkit and blink do put a break after the initial !, just like Gecko:
data:text/html;charset=utf-8,<h1 style="width:0">?%D1%82%D0%B5%D1%81%D1%82%20!%D1%82%D0%B5%D1%81%D1%82%20.%D1%82%D0%B5%D1%81%D1%82%20/%D1%82%D0%B5%D1%81%D1%82%20|%D1%82%D0%B5%D1%81%D1%82%20%E2%80%93%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82?%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82!%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82.%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82/%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82|%D1%82%D0%B5%D1%81%D1%82%20%D1%82%D0%B5%D1%81%D1%82%E2%80%93%D1%82%D0%B5%D1%81%D1%82
Comment 11•1 year ago
So I guess we need to consider a few questions here.
- do we want to follow webkit and blink's behavior here?
- if so, do we want to limit it to ASCII letters only (not even Latin-1), as they do?
  - testcase: data:text/html;charset=utf-8,<h1 style="width:0">!exchange<br>!échanger (Safari and Chrome break after the exclamation point in "!échanger" but not in "!exchange")
- if we extend this beyond basic ASCII to include accented letters, etc, then what about non-Latin letters?
- should we seek to get this standardized in some way?
My feeling is that we should probably do this, but we should do it by tailoring the line-break class of exclamation point (etc) rather than by creating an ASCII-only alternative code path, and so it would apply more widely than is currently the case in the other browsers. Their special-casing of ASCII only leads to anomalies such as allowing a line-break in "hiver/été" but not in "été/hiver" (testcase: data:text/html;charset=utf-8,<h1 style="width:0">été/hiver<br>hiver/été), which IMO is undesirable.
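To make the anomaly concrete, here is a small sketch of an ASCII-only code path sitting in front of a default breaker. It is a model of the behavior observed in this thread, not WebKit or Blink code: the two sets and the function names are assumptions chosen for illustration, and only breaks after "!", "?", "/" and "|" before a letter are modelled.

```python
# Punctuation after which default UAX#14 allows a break before a letter,
# per the observations in this thread (EX, SY and BA characters).
DEFAULT_BREAK_AFTER = {"!", "?", "/", "|"}

# Hypothetical ASCII-only table: breaks suppressed after these characters.
LEGACY_NO_BREAK_AFTER = {"!", "/", "|"}


def is_ascii(ch):
    return ord(ch) < 128


def break_allowed_after(punct, next_letter):
    """May a line break follow `punct` when `next_letter` comes next?"""
    if punct not in DEFAULT_BREAK_AFTER:
        return False  # outside the scope of this sketch
    if is_ascii(punct) and is_ascii(next_letter):
        # Both characters are ASCII, so the legacy table decides.
        return punct not in LEGACY_NO_BREAK_AFTER
    # Anything else falls through to the default behavior.
    return True


def break_after_slash(word):
    i = word.index("/")
    return break_allowed_after("/", word[i + 1])


print(break_after_slash("hiver/été"))  # True: "é" is not ASCII
print(break_after_slash("été/hiver"))  # False: "/" and "h" are both ASCII
```

The result depends only on whether the letter after the slash happens to be ASCII, which is the inconsistency the comment objects to.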
Comment 12•1 year ago
The severity field is not set for this bug.
:boris, could you have a look please?
For more information, please visit BugBot documentation.
Comment 13•1 year ago (Assignee)
Comment 14•1 year ago
Set release status flags based on info from the regressing bug 1719535
Comment 16•1 year ago (Assignee)
testing/web-platform/tests/css/css-text/white-space/white-space-pre-wrap-justify-004.html may also fail with UAX#14 for the same reason. (With non-Latin-1 text around the EX character, this test would fail even in Blink.)
Comment 17•1 year ago
Updating status to disabled for v119, but still affecting 120 (since it's nightly-only for now per https://searchfox.org/mozilla-central/rev/57f6fbd39c0b5957e11b27b4db58b821d8e1607d/modules/libpref/init/StaticPrefList.yaml#7220-7223)
Comment 18•1 year ago (Assignee)
Comment 19•1 year ago
If we're going to adjust our behavior to be closer to the other browsers here, I think we should not make it conditional on the word being strictly ASCII-only. As noted in comment 11, this results in unexpected inconsistencies in behavior when content is largely ASCII-only, but some words include accented letters -- which is very common for major European languages.
Therefore, I don't think we should simply copy their special ASCII-only table; instead, can we tailor the line-break classes of certain characters such as ! (and some others, as mentioned in comments above)? We could do this either unconditionally, or only for Latin-script content, but not just for ASCII -- that's too limited of a special-case.
That means we would not be exactly copying their behavior... that's deliberate. I propose to file bugs against Blink and WebKit about their current inconsistency, as seen in examples like:
data:text/html;charset=utf-8,<p style="width:0">écouter/parler</p> <p style="width:0">parler/écouter</p>
We can and should do better. Either both of these cases should break, or neither of them.
Comment 20•1 year ago
FTR, I've just filed https://bugs.webkit.org/show_bug.cgi?id=262497 and https://bugs.chromium.org/p/chromium/issues/detail?id=1488608 about their inconsistent behavior.
Comment 21•1 year ago (Assignee)
(In reply to Jonathan Kew [:jfkthame] from comment #19)
> If we're going to adjust our behavior to be closer to the other browsers here, I think we should not make it conditional on the word being strictly ASCII-only. As noted in comment 11, this results in unexpected inconsistencies in behavior when content is largely ASCII-only, but some words include accented letters -- which is very common for major European languages.
Should we escalate this to Unicode.org or the CSSWG? I don't know whether this behavior is convenient for European languages. The ASCII table is a historical artifact kept for compatibility with old Gecko and old IE, so I guess it would be better to remove these old line-breaking rules and match UAX#14.
> Therefore, I don't think we should simply copy their special ASCII-only table; instead, can we tailor the line-break classes of certain characters such as ! (and some others, as mentioned in comments above)? We could do this either unconditionally, or only for Latin-script content, but not just for ASCII -- that's too limited of a special-case.
> That means we would not be exactly copying their behavior... that's deliberate. I propose to file bugs against Blink and WebKit about their current inconsistency, as seen in examples like:
> data:text/html;charset=utf-8,<p style="width:0">écouter/parler</p> <p style="width:0">parler/écouter</p>
> We can and should do better. Either both of these cases should break, or neither of them.
But I don't know whether they (WebKit and Blink) will change their breaking rules. I think this is a compatibility issue, not a question of new behavior. If we change first, it becomes a new compatibility issue until they fix it.
Do you think we should wait to enable ICU4X's line breaker on the stable channel until they fix it?
Comment 22•1 year ago
IMO, we don't need to wait for webkit and/or blink to change; we can still enable ICU4X (with the customizations we decide are worthwhile). I expect this would bring us closer to webkit/blink behavior in many cases, and I don't think 100%-identical behavior in all cases is a hard requirement.
So my idea of the way forward:
- We should tailor ! because of the CSS !important example as reported here (note that our legacy behavior was not to break there, so we'll be maintaining existing behavior as well as matching the other browsers).
- We should also probably tailor / to not break between slash and a following letter (though UAX#14 by default would allow a break), because that matches what the other browsers do for ASCII only and matches our existing behavior. And the same for |, although that's much less common.
(There may be a handful of other tailorings to consider... it should be feasible to compare the default ICU4X/UAX#14 behavior for the ASCII range with the other browsers' legacy table and evaluate any other differences to see if they seem important.)
Comment 23•1 year ago (Assignee)
(In reply to Jonathan Kew [:jfkthame] from comment #22)
> (There may be a handful of other tailorings to consider... it should be feasible to compare the default ICU4X/UAX#14 behavior for the ASCII range with the other browsers' legacy table and evaluate any other differences to see if they seem important.)
Other differences that I know of are https://searchfox.org/mozilla-central/source/testing/web-platform/tests/css/css-text/i18n/css3-text-line-break-baspglwj-026.html (which fails on WebKit and Blink due to that table) and the handling between a number and a hyphen, such as (111-222). The latter comes from https://searchfox.org/wubkat/rev/d65248aa5c753da56ad0c0e0028e4c4cddd5c1f4/Source/WebCore/rendering/BreakLines.h#62-65.
Comment 24•1 year ago
(In reply to Makoto Kato [:m_kato] from comment #23)
> (In reply to Jonathan Kew [:jfkthame] from comment #22)
> > (There may be a handful of other tailorings to consider... it should be feasible to compare the default ICU4X/UAX#14 behavior for the ASCII range with the other browsers' legacy table and evaluate any other differences to see if they seem important.)
> Other differences that I know are https://searchfox.org/mozilla-central/source/testing/web-platform/tests/css/css-text/i18n/css3-text-line-break-baspglwj-026.html (failed on WebKit and Blink due to that table)
Right, that's the vertical-bar character |, where UAX#14 calls for a break-after, but they suppress it, like with slash /. As per comment 22, I'd be inclined to tailor the class of | to achieve similar behavior.
> and between number and hyphen such as (111-222). It is https://searchfox.org/wubkat/rev/d65248aa5c753da56ad0c0e0028e4c4cddd5c1f4/Source/WebCore/rendering/BreakLines.h#62-65.
Hyphen is a tricky one; there's no "right" answer in all cases, but probably some special-case heuristics are worth doing. Webkit's ASCII-only processing gives inconsistent results for this, too: compare
data:text/html;charset=utf-8,<p style="width:0">abc-123</p> <p style="width:0">абц-123</p>
IMO, it'd be better for them to do something like
    if (character == '-' && isDigit(nextCharacter))
        return isAlphanumeric(lastCharacter);
than the ASCII-specific version they currently have. ASCII letters should not be privileged.
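A runnable rendering of that suggestion might look like the sketch below. It assumes, as in the linked WebKit function, that the returned value means "a break is allowed after the hyphen"; the function name is hypothetical, and Python's Unicode-aware `str.isdigit` and `str.isalnum` stand in for `isDigit` and `isAlphanumeric`.

```python
def break_after_hyphen_before_digit(last_ch, ch, next_ch):
    """Apply the hyphen-before-digit heuristic.

    Returns True or False when the rule applies, or None when it does not
    (so that the caller can fall back to its other rules).
    """
    if ch == "-" and next_ch.isdigit():
        # Allow the break only when the hyphen follows a letter or digit,
        # so that a leading minus sign stays attached to its number.
        return last_ch.isalnum()
    return None


# Latin and Cyrillic letters now get the same answer.
print(break_after_hyphen_before_digit("c", "-", "1"))
print(break_after_hyphen_before_digit("ц", "-", "1"))
```

Because the checks are not restricted to ASCII, the two paragraphs in the testcase above would be treated alike.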
If having an ASCII-only fast-path/table is worthwhile for performance (rather than to fix compat issues), then I'm not opposed to doing that; but we should make sure the behavior of the ASCII path matches what our full Unicode breaker would do for the same text.
So if we want to deviate from the default UAX#14 behavior (for example, avoiding breaks between !, /, and | and a following letter), I think we should first tailor the classes of those characters to give the desired result in the ICU4X codepath; and then create an ASCII-only fast lookup table that matches this behavior.
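One way to keep a fast path and the full breaker in agreement is to generate the ASCII table from the tailored breaker instead of maintaining it by hand. The sketch below illustrates that ordering only; `full_breaker` is a toy stand-in for the tailored ICU4X code path, the two sets are assumptions drawn from this thread, and only breaks after punctuation before a letter are modelled.

```python
# Punctuation after which default UAX#14 allows a break before a letter.
DEFAULT_BREAK_AFTER = {"!", "?", "/", "|"}

# Proposed tailoring from this thread: suppress the break after these.
TAILORED_NO_BREAK_AFTER = {"!", "/", "|"}


def full_breaker(prev, nxt):
    """Toy tailored breaker: is a break allowed between prev and nxt?"""
    if prev not in DEFAULT_BREAK_AFTER or not nxt.isalpha():
        return False  # not modelled in this sketch
    return prev not in TAILORED_NO_BREAK_AFTER


# Step 2: derive the ASCII-only lookup table from the tailored breaker,
# so the two can never disagree.
ASCII = [chr(cp) for cp in range(128)]
ASCII_TABLE = {(p, n): full_breaker(p, n) for p in ASCII for n in ASCII}


def break_allowed(prev, nxt):
    """Fast path for ASCII pairs, full breaker for everything else."""
    cached = ASCII_TABLE.get((prev, nxt))
    return full_breaker(prev, nxt) if cached is None else cached


# ASCII and non-ASCII letters now behave the same after "/" and "?".
print(break_allowed("/", "h") == break_allowed("/", "é"))
print(break_allowed("?", "h") == break_allowed("?", "é"))
```

With this arrangement the table is purely a performance optimization: changing a tailoring in `full_breaker` automatically changes the ASCII behavior too.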
Comment 25•8 months ago
Hi Jonathan and Ting-Yu, I'm just trying to understand where we're at with this bug. Do you think it needs to be fixed?
Makoto filed an issue with the CSSWG to specify this more precisely but it was suggested the spec is deliberately open-ended on this point: https://github.com/w3c/csswg-drafts/issues/9432
Do you think we should specify it or is it ok to leave it open-ended?
Comment 26•8 months ago
I guess it's OK for the spec to be somewhat open-ended about line breaking, as there are lots of competing factors and no single solution will be "right" in all circumstances. Treating it as a quality-of-implementation issue leaves some room for browsers to come up with rules or heuristics to try and improve their results.
At the same time, we've seen that discrepancies in line-breaking behavior do lead to webcompat issues, so I think there'd be benefit in the spec including informative notes about any deviations from the default UAX#14 behavior that browsers have chosen to implement. For example, as described in comments above, we can see that Blink and Webkit are not following UAX#14 defaults for characters like ! and /. It'd be helpful if these details were documented in an informative note, so that other implementors could make an informed choice as to whether to follow the same customizations or make different trade-offs.
On the implementation side, I do think we should make some adjustments here, as suggested in comment 22 & comment 24 above.
Comment 28•8 months ago
Just adding my 2cents here.
Maybe just add an about:config setting to override the list of line-breaking characters.
If the change is to make Firefox conform to the standards, then that's fine; otherwise the line-break behavior shouldn't change.
In my case, this broke some of my date text that went from:
23/Jan/2024
to
23/
Jan/
2024
Comment 29•8 months ago
(In reply to Jonathan Kew [:jfkthame] from comment #26)
> It'd be helpful if these details were documented in an informative note, so that other implementors could make an informed choice as to whether to follow the same customizations or make different trade-offs.
> ...
> On the implementation side, I do think we should make some adjustments here, as suggested in comment 22 & comment 24 above.
Excellent, thanks so much Jonathan. It sounds like the next step is to draft an informative note for CSS issue #9432 that captures the contents of comment 22 and comment 24. I can try to do that next week assuming I'm not on paternity leave then.
Comment 30•7 months ago
I submitted https://github.com/w3c/csswg-drafts/pull/9997 to try and capture the discussion here. Feel free to fix it as needed.
Comment 31•7 months ago
Brian, thank you for submitting a PR to the spec.
Comment 32•5 months ago
Hi Jonathan, would you mind reviewing https://github.com/w3c/csswg-drafts/pull/9997? Thanks!
Comment 33•20 days ago
The note has been merged into https://drafts.csswg.org/css-text-4/#line-break-property. Hopefully that unblocks implementation work on this bug.