Closed Bug 1935148 Opened 2 months ago Closed 1 month ago

Remove newline around CJ(K) punctuation instead of replacing it with space

Categories

(Core :: Layout: Text and Fonts, defect)

Firefox 133
defect

Tracking

()

RESOLVED FIXED
135 Branch
Tracking Status
firefox135 --- fixed

People

(Reporter: tats.u, Assigned: jfkthame, NeedInfo)

References

(Blocks 1 open bug, Regressed 1 open bug)

Details

Attachments

(2 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0

Steps to reproduce:

<p lang="ja">
  Firefoxは最高のブラウザです。
  Androidでも広告をブロックできます。
  皆さんもFirefoxを使いましょう。
</p>
<p lang="ja">
  Chrome・
  Edge・
  Vivaldi・Braveなどのブラウザは、
  Chromiumという共通のレンダリングエンジンを
  採用しており、
  多様性が不足しています。
</p>

https://codepen.io/tats-u/pen/jENbJdm

Actual results:

Spaces are inserted at some of the line breaks inside each paragraph. These spaces look unnatural to Japanese readers.
https://github.com/w3c/csswg-drafts/issues/5086

Expected results:

No spaces inside each paragraph.

The current behavior also has bad effects on the future behavior of Markdown/HTML formatters such as Prettier.
https://github.com/prettier/prettier/pull/16805
Prettier is planning to match Firefox's behavior by inserting such spaces, but I would prefer that it didn't, because the behavior is not natural.

Section 4.3.3 of CSS Text 4 does not specify a concrete rule yet.
https://drafts.csswg.org/css-text-4/#line-break-transform

I want Firefox to remove a newline that meets either of the following conditions:

  • Its next character is CJ(K) punctuation
  • Its previous character is CJ(K) punctuation

Firefox currently removes a newline only when both the previous and the next character are CJ (Chinese or Japanese).
Examples of CJ(K) punctuation are "、", "。", "・", "（", "）", "「", "」". All of their Unicode general categories start with P.
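
The category claim can be checked against the Unicode Character Database. Below is a minimal sketch using Python's standard `unicodedata` module. It assumes the parentheses in the list are the full-width forms (U+FF08, U+FF09); the names `EXAMPLES`, `categories` and `all_punctuation` are illustrative only.

```python
import unicodedata

# Punctuation examples from the report, written as escapes.
# Assumption: the parentheses are the full-width forms U+FF08 and U+FF09.
EXAMPLES = "\u3001\u3002\u30fb\uff08\uff09\u300c\u300d"

# Map each character to its Unicode General Category (for example "Po" or "Ps").
categories = {ch: unicodedata.category(ch) for ch in EXAMPLES}

# True when every listed character is in one of the P* (punctuation) categories.
all_punctuation = all(cat.startswith("P") for cat in categories.values())

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {ch} {cat}")
```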

The Markdown equivalent of the above HTML is formatted as intended by the latest Prettier (the playground link is followed by the input, then the formatted output):

https://prettier.io/playground/#N4Igxg9gdgLgprEAuEAxAlgJzgMwgD0HsGQAHNANrMDsGQNYZBLhkDKGQNoZBzBkE0GQIAYAdKAQSgBNMI6Xo0BBDIGC9QFIqgJIZKgW4ZAwwyB6hkaBZBkB+DG06Awt0CqDIGSGERmx58UwP7ygEQY1gdQZA4QyAxBg5ROAYQAWAgLZxA3wycAorwA5r6cAGroAG4AhgA2vOg+AEKY0ZFwgFYMgJYMVHT0hICADK4eEJ7oAK6egBYMFvaAjoqAWAnkgDcMgM8MgAMMgFcMrYANDIAVDK2AHQytUpyAhcaAFK7WgGYMgFIMgFEMRVCAWJqAG5aA5AaAMgyAsHKAzbEzVmwgADQgEAAOMOjQAM7IoNGYAgDuAAqPCHcocc-RAJ53U4AI1SYAA1nAYABlaLeAAy6CgcGQODiNzgwNBEOh52iYERQWQMEw5QxIHRZSJJLJcHw5zgmHQ3lgcQAKgyoI90HAvqjYujTjcCbE4ABFcoQeAotFkgBWN3wUOFYolUqQfIFIAAjqq4K8BOcviBojcALRIuC8S0nEDE6LoWIElylTzRZDG2KxG1CqBBEVcGDE9BA8rwV4MhFI6X8sluGCeWIAdTc6HgN1xYDgUM+qaiqb+7rAN0BIEipIAknwENCwIzLjxeFCYH8RdHNecBOjE6lzu6kelMDbEejMDB9dEgq622TcZgR+7XZgwbwIM8oDaO4iYImhDA3MgABwABlO2B1WD1qUnbvVMtOMGiQJ3vD3yAATKdyujWY-eXeQHAnhApaVq8HC0S+uUE5wKgECYK6gYEu60ShhAIAAL7oUAA

Firefoxは最高のブラウザです。
Androidでも広告をブロックできます。
皆さんもFirefoxを使いましょう。

Chrome・
Edge・
Vivaldi・Braveなどのブラウザは、
Chromiumという共通のレンダリングエンジンを
採用しており、
多様性が不足しています。

Firefoxは最高のブラウザです。Androidでも広告をブロックできます。皆さんもFirefoxを使いましょう。

Chrome・Edge・Vivaldi・Braveなどのブラウザは、Chromiumという共通のレンダリングエンジンを採用しており、多様性が不足しています。

Component: Untriaged → Layout: Text and Fonts
Product: Firefox → Core

The current behavior in Firefox is intended to be that a newline is discarded (instead of converted to a space) if the characters on each side of it in the source (i.e. at the end of the previous line and at the start of the next line) both have East Asian Width category F, H, or W, and the script is not Hangul (so this behavior applies to Japanese and Chinese content, but not to Korean).
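
As a rough illustration of the rule described above (not the Gecko implementation), the decision for a single line break might be sketched as follows. The function names are hypothetical. Python's standard library has no Script property, so the character name is used here as a stand-in for Script=Hangul, and treating Hangul on either side as disqualifying is an assumption.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    """East Asian Width is Fullwidth (F), Halfwidth (H), or Wide (W)."""
    return unicodedata.east_asian_width(ch) in ("F", "H", "W")


def is_hangul(ch: str) -> bool:
    # Approximation: use the character name instead of the Script property,
    # which the standard library does not expose.
    return "HANGUL" in unicodedata.name(ch, "")


def discards_newline(prev_ch: str, next_ch: str) -> bool:
    """Sketch of the pre-fix rule: the newline is dropped only if both
    neighbours are F/H/W and neither is Hangul; otherwise it becomes a space."""
    if is_hangul(prev_ch) or is_hangul(next_ch):
        return False
    return is_wide(prev_ch) and is_wide(next_ch)


print(discards_newline("を", "採"))  # both sides are Wide
print(discards_newline("。", "A"))   # Wide punctuation, then a Latin letter
```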

In the examples here, the punctuation characters have East Asian Width = Wide, so they would be candidates for discarding the newline, but the character after the newline is a Latin letter (EAW = Narrow), and so the space is retained.

So to implement the requested behavior, we'd need an additional rule to say that the newline is discarded if the character before it is a Wide or Fullwidth punctuation character (EAW=[FW], GC=P*), regardless of the category of the character after the newline. And for symmetry (and because of opening-fullwidth punctuation such as brackets), probably the same thing applies if the character after the newline is fullwidth punctuation.
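
A minimal sketch of the existing rule plus the proposed additional rule, applied to the first paragraph from the report. All function names are hypothetical and this is not the actual patch. How Hangul next to wide punctuation should be handled is not settled in this comment, so the sketch leaves the punctuation rule unconditional.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    return unicodedata.east_asian_width(ch) in ("F", "H", "W")


def is_hangul(ch: str) -> bool:
    # Approximation of Script=Hangul via the character name.
    return "HANGUL" in unicodedata.name(ch, "")


def is_wide_punctuation(ch: str) -> bool:
    """EAW=[FW] and General Category P*, as proposed in the comment above."""
    return (unicodedata.east_asian_width(ch) in ("F", "W")
            and unicodedata.category(ch).startswith("P"))


def join_segments(lines):
    """Join source lines, deciding for each line break whether it is
    removed or turned into a single space."""
    lines = [s.strip() for s in lines if s.strip()]
    if not lines:
        return ""
    out = lines[0]
    for nxt in lines[1:]:
        prev_ch, next_ch = out[-1], nxt[0]
        # Existing rule: both neighbours F/H/W and neither is Hangul.
        both_wide = (is_wide(prev_ch) and is_wide(next_ch)
                     and not (is_hangul(prev_ch) or is_hangul(next_ch)))
        # Proposed rule: wide punctuation on either side of the break.
        near_punct = is_wide_punctuation(prev_ch) or is_wide_punctuation(next_ch)
        out += nxt if (both_wide or near_punct) else " " + nxt
    return out


paragraph_1 = [
    "  Firefoxは最高のブラウザです。",
    "  Androidでも広告をブロックできます。",
    "  皆さんもFirefoxを使いましょう。",
]
print(join_segments(paragraph_1))
```

With the punctuation rule in place, the break between "。" and "Android" no longer produces a space, which is the result the reporter asked for.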

In general this seems reasonable to me, but I do have a question: what about the Halfwidth CJK punctuation characters such as "｢", "｣", "､", "･" -- should the newline disappear next to those, or should a space be retained, as these characters do not create an inherent visual space of their own to the same extent as wide ones do?

I'd also like to hear from some of our Japanese specialists to confirm whether they agree this would be a good change.

Flags: needinfo?(masayuki)
Flags: needinfo?(m_kato)

Includes the examples from the report as a testcase, though there is not yet
any formal spec for the exact behavior of segment break transformation.
(But nevertheless there is an existing collection of tests, so this just adds
one for the punctuation case.)

Assignee: nobody → jfkthame
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Severity: -- → S3

> I do have a question: what about the Halfwidth CJK punctuation characters such as "｢", "｣", "､", "･" -- should newline disappear next to those, or should a space be retained as these characters do not create an inherent visual space of their own to the same extent as wide ones do?

Japanese writers don't insert a space around them either. (Half-width katakana isn't used in the first place unless resources are limited.)

リヨウカンキョウ:Windows・64ビット、プロバン
(利用環境:Windows・64ビット、プロ版)
(Environment: Windows & 64 bit, Pro version)

ユーザ「ヤマダタロウ」ノアカウントヲショウキョシマス。ヨロシイデスカ?
(ユーザ「山田太郎」のアカウントを消去します。よろしいですか?)
(The account of the user "John Smith" will be deleted. Are you sure?)

  • testing/web-platform/tests/css/css-text/line-breaking/segment-break-transformation-punctuation-001.html
  • testing/web-platform/tests/css/css-text/line-breaking/segment-break-transformation-punctuation-001-ref.html

I'd recommend not syncing the text in them to WPT.
It is very opinionated and may give developers of other browsers an unpleasant feeling.

opinionated → subjective

I strongly recommend running it through a machine translator or a generative AI once, to translate it into English or your native language.
The following text is much more neutral and safer for WPT:

<p lang="ja">
  本システムはサポート切れのブラウザに対応しません。
  Internet Explorerをお使いの場合、
  Edge
  ・
  Chrome
  ・
  Firefoxなどに移行してください。
  (EdgeはChromium版をお使いください)
</p>
<p lang="ja">
  ユーザメイ
  「ジョン
  ・
  スミス」
  、
  ID
  「smith」
  ノアカウントヲショウキョシマス。
  y/N
</p>

(In reply to Tatsunori Uchino from comment #5)

> > I do have a question: what about the Halfwidth CJK punctuation characters such as "｢", "｣", "､", "･" -- should newline disappear next to those, or should a space be retained as these characters do not create an inherent visual space of their own to the same extent as wide ones do?

> Japanese writers don't insert a space around them either. (Half-width katakana isn't used in the first place unless resources are limited.)

> リヨウカンキョウ:Windows・64ビット、プロバン
> (利用環境:Windows・64ビット、プロ版)
> (Environment: Windows & 64 bit, Pro version)

> ユーザ「ヤマダタロウ」ノアカウントヲショウキョシマス。ヨロシイデスカ?
> (ユーザ「山田太郎」のアカウントを消去します。よろしいですか?)
> (The account of the user "John Smith" will be deleted. Are you sure?)

OK, I'll update the patch to handle the half-width punctuation as well.
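
For reference, a small sketch of why the width test has to include Halfwidth (H) to cover these characters. The helper name and its `widths` parameter are hypothetical and do not come from the patch. The half-width punctuation U+FF61..U+FF65 is written as escapes so that the code points are unambiguous.

```python
import unicodedata

# Half-width CJK punctuation: U+FF61 (full stop), U+FF62/U+FF63 (corner
# brackets), U+FF64 (comma), U+FF65 (katakana middle dot).
HALFWIDTH_PUNCTUATION = "\uff61\uff62\uff63\uff64\uff65"


def is_east_asian_punctuation(ch: str, widths=("F", "H", "W")) -> bool:
    """Punctuation (General Category P*) whose East Asian Width is in `widths`."""
    return (unicodedata.east_asian_width(ch) in widths
            and unicodedata.category(ch).startswith("P"))


# A test restricted to Fullwidth/Wide misses the half-width forms.
wide_only = [is_east_asian_punctuation(ch, widths=("F", "W"))
             for ch in HALFWIDTH_PUNCTUATION]

# Including Halfwidth in the width test picks them up.
with_halfwidth = [is_east_asian_punctuation(ch) for ch in HALFWIDTH_PUNCTUATION]

print(wide_only)
print(with_halfwidth)
```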

>   • testing/web-platform/tests/css/css-text/line-breaking/segment-break-transformation-punctuation-001.html
>   • testing/web-platform/tests/css/css-text/line-breaking/segment-break-transformation-punctuation-001-ref.html

> I'd recommend not syncing the text in them to WPT.
> It is very opinionated and may give developers of other browsers an unpleasant feeling.

Thank you for mentioning this; I'll make sure to update the text.

Attachment #9442299 - Attachment description: Bug 1935148 - Remove newline (instead of transforming to space) if adjacent to wide East-Asian punctuation character. r=m_kato,masayuki → Bug 1935148 - Remove newline (instead of transforming to space) if adjacent to East-Asian punctuation character. r=m_kato,masayuki

Yeah, Japanese text usually has no whitespace, even before or after an ASCII character. Therefore, except for implicitly inserted whitespace (i.e., a U+0020 that is not a direct sibling of a linefeed), all collapsible spaces should be discarded at rendering time. It's hard to say for half-width characters, but I think the same behavior as for fullwidth characters would be reasonable.

Flags: needinfo?(masayuki)

Should we treat U+FF5E (～) FULLWIDTH TILDE as punctuation?
Its general category is "Sm", but it is widely used on Japanese Windows as a substitute for U+301C (〜) WAVE DASH.

https://www.compart.com/en/unicode/U+FF5E
https://github.com/prettier/prettier/pull/16832
https://ja.wikipedia.org/wiki/%E6%B3%A2%E3%83%80%E3%83%83%E3%82%B7%E3%83%A5#Unicode%E3%81%AB%E9%96%A2%E9%80%A3%E3%81%99%E3%82%8B%E5%95%8F%E9%A1%8C (Japanese)
https://www.tohoho-web.com/ex/dash-tilde.html (Japanese)

Attachment #9442299 - Attachment description: Bug 1935148 - Remove newline (instead of transforming to space) if adjacent to East-Asian punctuation character. r=m_kato,masayuki → Bug 1935148 - Remove newline (instead of transforming to space) if adjacent to East-Asian punctuation character. r=m_kato

This change should be safe for Chinese, too.
We can exclude characters whose Script is Hangul.

https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt

Japanese & Chinese → should be safe
Other languages → can be Nightly only

Flags: needinfo?(m_kato)
Pushed by jkew@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/87246fb462dd Remove newline (instead of transforming to space) if adjacent to East-Asian punctuation character. r=m_kato
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/49864 for changes under testing/web-platform/tests

Backed out for causing build bustages @ nsTextFrameUtils.cpp

/builds/worker/checkouts/gecko/layout/generic/nsTextFrameUtils.cpp(259,33): error: use of overloaded operator '[]' is ambiguous (with operand types 'const char16ptr_t' and 'int')
/builds/worker/checkouts/gecko/layout/generic/nsTextFrameUtils.cpp(260,33): error: use of overloaded operator '[]' is ambiguous (with operand types 'const char16ptr_t' and 'int')
/builds/worker/checkouts/gecko/layout/generic/nsTextFrameUtils.cpp(261,33): error: use of overloaded operator '[]' is ambiguous (with operand types 'const char16ptr_t' and 'int')
/builds/worker/checkouts/gecko/layout/generic/nsTextFrameUtils.cpp(262,33): error: use of overloaded operator '[]' is ambiguous (with operand types 'const char16ptr_t' and 'int')
/builds/worker/checkouts/gecko/layout/generic/nsTextFrameUtils.cpp(263,51): error: use of overloaded operator '[]' is ambiguous (with operand types 'const char16ptr_t' and 'int')
gmake[4]: *** [/builds/worker/checkouts/gecko/config/rules.mk:674: Unified_cpp_layout_generic4.obj] Error 1
gmake[3]: *** [/builds/worker/checkouts/gecko/config/recurse.mk:72: layout/generic/target-objects] Error 2
gmake[2]: *** [/builds/worker/checkouts/gecko/config/recurse.mk:34: compile] Error 2
gmake[1]: *** [/builds/worker/checkouts/gecko/config/rules.mk:359: default] Error 2
Flags: needinfo?(jfkthame)
Upstream PR was closed without merging
Pushed by jkew@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ce6faa97d947 Remove newline (instead of transforming to space) if adjacent to East-Asian punctuation character. r=m_kato

Backed out for causing build bustages @ nsTextFrameUtils.h

/builds/worker/checkouts/gecko/layout/generic/nsTextFrameUtils.h:132:37: error: unknown type name 'nsAtom'

/builds/worker/checkouts/gecko/layout/generic/nsTextFrameUtils.cpp:210:26: error: out-of-line definition of 'TransformText' does not match any declaration in 'nsTextFrameUtils'

/builds/worker/checkouts/gecko/layout/generic/nsTextFrameUtils.cpp:364:37: error: explicit instantiation of 'TransformText' does not refer to a function template, variable template, member function, member class, or static data member

/builds/worker/checkouts/gecko/layout/generic/nsTextFrameUtils.cpp:368:38: error: explicit instantiation of 'TransformText' does not refer to a function template, variable template, member function, member class, or static data member

gmake[4]: *** [/builds/worker/checkouts/gecko/config/rules.mk:676: nsTextFrameUtils.o] Error 1

gmake[3]: *** [/builds/worker/checkouts/gecko/config/recurse.mk:72: layout/generic/target-objects] Error 2

gmake[2]: *** [/builds/worker/checkouts/gecko/config/recurse.mk:34: compile] Error 2

gmake[1]: *** [/builds/worker/checkouts/gecko/config/rules.mk:359: default] Error 2

gmake: *** [client.mk:59: build] Error 2

'mach build -v' did not run successfully. Please check log for errors.
Backout by amarc@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/fa7bad38ce12 Backed out changeset ce6faa97d947 for causing build bustages @ nsTextFrameUtils.h CLOSED TREE
Upstream PR was closed without merging
Pushed by jkew@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4e7974e77cd8 Remove newline (instead of transforming to space) if adjacent to East-Asian punctuation character. r=m_kato
Status: ASSIGNED → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Target Milestone: --- → 135 Branch
Upstream PR merged by moz-wptsync-bot

https://codepen.io/tats-u/pen/GgKxpyE

Hey, U+FF5E has not been treated as punctuation yet.

  • category: Sm
  • East Asian Width: F

https://www.compart.com/en/unicode/U+FF5E

It is an exception that should be special-cased by code point.

https://github.com/mozilla/gecko-dev/blob/29e186485fe1b835f05bde01f650e371545de98e/intl/unicharutil/util/nsUnicharUtils.cpp#L524-L525

 bool IsEastAsianPunctuation(uint32_t u) {
   return intl::UnicodeProperties::IsEastAsianWidthFHW(u) &&
-         intl::UnicodeProperties::IsPunctuation(u);
+         (intl::UnicodeProperties::IsPunctuation(u) || u == 0xff5e);
 }
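
To illustrate the effect of the suggested change, here is a Python analogue of the predicate in the diff above. It is a sketch, not Gecko code: it assumes that `IsEastAsianWidthFHW` corresponds to East Asian Width F/H/W and that `IsPunctuation` corresponds to General Category P*. The `special_case_tilde` parameter is hypothetical and only toggles the proposed exception.

```python
import unicodedata

FULLWIDTH_TILDE = "\uff5e"  # General Category Sm, East Asian Width F
WAVE_DASH = "\u301c"        # General Category Pd, East Asian Width W


def is_east_asian_punctuation(ch: str, special_case_tilde: bool = False) -> bool:
    """Analogue of the predicate in the diff, with the proposed exception
    for U+FF5E switched on or off."""
    wide = unicodedata.east_asian_width(ch) in ("F", "H", "W")
    punct = unicodedata.category(ch).startswith("P")
    if special_case_tilde:
        # Proposed exception: treat FULLWIDTH TILDE like punctuation.
        punct = punct or ch == FULLWIDTH_TILDE
    return wide and punct


for ch in (FULLWIDTH_TILDE, WAVE_DASH):
    print(f"U+{ord(ch):04X}",
          unicodedata.category(ch),
          unicodedata.east_asian_width(ch),
          is_east_asian_punctuation(ch),
          is_east_asian_punctuation(ch, special_case_tilde=True))
```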
See Also: → 1941096
See Also: → 1941093
See Also: → 1940947
Regressions: 1941097
Blocks: 1945813
See Also: → 1945813