Open Bug 1851131 Opened 2 years ago Updated 3 months ago

Bad line breaking behavior with ICU4X for Chinese text containing quotation marks

Status:

UNCONFIRMED

People

(Reporter: cpplearner, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Screen Shot 2023-09-01 at 16.33.52.png 2 years ago cpplearner 4.38 KB, image/png		Details

cpplearner

Reporter

Description

2 years ago

Attached image Screen Shot 2023-09-01 at 16.33.52.png — Details

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0

Steps to reproduce:

1. Ensure that intl.icu4x.segmenter.enabled is true in about:config (this is the default for Nightly).
2. Open data:text/html;charset=utf-8,<div lang=zh style="border:1px dashed;width:180px">%E4%BB%96%E7%AB%99%E8%B5%B7%E6%9D%A5%E9%97%AE%EF%BC%9A%E2%80%9C%E8%80%81%E5%B8%88%EF%BC%8C%E2%80%98%E6%9C%89%E6%9D%A1%E4%B8%8D%E7%B4%8A%E2%80%99%E7%9A%84%E2%80%98%E7%B4%8A%E2%80%99%E6%98%AF%E4%BB%80%E4%B9%88%E6%84%8F%E6%80%9D%EF%BC%9F%E2%80%9D
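
For readers who want to see what the percent-encoded payload in the data URL contains, here is a minimal Python sketch that decodes it and lists the quotation marks with their offsets. The helper name describe_quotes is hypothetical and only for illustration; the encoded string is copied from the URL above.

```python
from urllib.parse import unquote
import unicodedata

# Percent-encoded body of the <div> in the data URL from the steps to reproduce.
ENCODED = (
    "%E4%BB%96%E7%AB%99%E8%B5%B7%E6%9D%A5%E9%97%AE%EF%BC%9A%E2%80%9C"
    "%E8%80%81%E5%B8%88%EF%BC%8C%E2%80%98%E6%9C%89%E6%9D%A1%E4%B8%8D"
    "%E7%B4%8A%E2%80%99%E7%9A%84%E2%80%98%E7%B4%8A%E2%80%99%E6%98%AF"
    "%E4%BB%80%E4%B9%88%E6%84%8F%E6%80%9D%EF%BC%9F%E2%80%9D"
)

# unquote() decodes the bytes as UTF-8 by default.
text = unquote(ENCODED)


def describe_quotes(s):
    """Return (offset, code point, name) for each initial/final quotation mark.

    Pi and Pf are the Unicode general categories for initial and final
    quote punctuation; U+2018/U+201C are Pi and U+2019/U+201D are Pf.
    """
    return [
        (i, f"U+{ord(c):04X}", unicodedata.name(c))
        for i, c in enumerate(s)
        if unicodedata.category(c) in ("Pi", "Pf")
    ]


if __name__ == "__main__":
    print(text)
    for entry in describe_quotes(text):
        print(entry)
```

The decoded text has no spaces at all, which matters for the later discussion of space-based rules.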

Actual results:

No line break opportunity is offered around the quotation marks, so the lines break awkwardly and the text looks poor. See the attached screenshot.

Expected results:

There's a line break opportunity before each

U+2018 LEFT SINGLE QUOTATION MARK
U+201C LEFT DOUBLE QUOTATION MARK

There's a line break opportunity after each

U+2019 RIGHT SINGLE QUOTATION MARK
U+201D RIGHT DOUBLE QUOTATION MARK

Note that UAX 14 says:

Note: If language information is available, it can be used to determine which character is used as the opening quote and which as the closing quote. See the information in Section 6.2, General Punctuation, in [Unicode]. In such a case, the quotation marks could be tailored to either OP or CL depending on their actual usage.

This tailoring is essential for Chinese text.
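
As a rough illustration of the expected results listed above, the following Python sketch computes only the quotation-mark break opportunities requested in this report. It is not how ICU4X or Gecko implements line breaking: the names OPENING, CLOSING and quote_break_opportunities are hypothetical, and every other UAX 14 rule is deliberately ignored.

```python
# Tailoring assumed for Chinese text in this report:
# left quotes behave like opening punctuation (OP),
# right quotes behave like closing punctuation (CL).
OPENING = {"\u2018", "\u201C"}
CLOSING = {"\u2019", "\u201D"}


def quote_break_opportunities(text):
    """Return sorted offsets where a break is expected because of a quote.

    An offset i means "a line may break between text[i-1] and text[i]".
    The start and end of the text are not reported as opportunities.
    """
    breaks = set()
    for i, ch in enumerate(text):
        if ch in OPENING and i > 0:
            breaks.add(i)          # break before an opening quote
        elif ch in CLOSING and i + 1 < len(text):
            breaks.add(i + 1)      # break after a closing quote
    return sorted(breaks)


# The test string from the steps to reproduce, with punctuation escaped.
sample = (
    "他站起来问\uFF1A\u201C老师\uFF0C\u2018有条不紊\u2019"
    "的\u2018紊\u2019是什么意思\uFF1F\u201D"
)

if __name__ == "__main__":
    print(quote_break_opportunities(sample))  # [6, 10, 16, 17, 20]
```

A real tailoring would assign the OP and CL line-break classes to these characters and let the existing pair rules decide, rather than listing offsets directly.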

BugBot [:suhaib / :marco/ :calixte]

Comment 1

2 years ago

The Bugbug bot thinks this bug should belong to the 'Core::DOM: Core & HTML' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → DOM: Core & HTML

Product: Firefox → Core

Andreas Farre [:farre]

Comment 2

2 years ago

Henri, do you know if this is something that we can do anything about?

Flags: needinfo?(hsivonen)

Makoto Kato [:m_kato]

Updated

1 year ago

Component: DOM: Core & HTML → Internationalization

Flags: needinfo?(hsivonen) → needinfo?(m_kato)

Makoto Kato [:m_kato]

Comment 3

1 year ago

Actually, ICU4X doesn't use language information; its line-breaking rules are unified across languages. This is a variant of bug 465457.

Blocks: 465457

Severity: -- → S3

Type: defect → enhancement

Flags: needinfo?(m_kato)

Priority: -- → P3

Henri Sivonen (:hsivonen)

Comment 4

1 year ago

FWIW, this isn't in any way specific to Chinese. Finnish and Swedish, for example, use U+201D RIGHT DOUBLE QUOTATION MARK as both the opening and the closing quotation mark.

It's unclear to me why this needs language information as opposed to making quotation marks that have a space-like character on one side and a non-space-like character on the other side not have a line break opportunity on the side of the non-space-like character. Why doesn't UAX 14 itself have such a rule?
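
The language-independent idea in this comment can be sketched in a few lines of Python. This is only an illustration of the heuristic as worded here, not the text of any UAX 14 rule; the names QUOTES, is_space_like and forbidden_breaks are hypothetical, and treating the start and end of the text as space-like is an assumption of the sketch.

```python
QUOTES = {"\u2018", "\u2019", "\u201C", "\u201D"}


def is_space_like(ch):
    # None stands for the start or end of the text (assumed space-like here).
    return ch is None or ch.isspace()


def forbidden_breaks(text):
    """Return offsets where the heuristic suppresses a break opportunity.

    An offset i means "no break between text[i-1] and text[i]".
    A quote with a space-like neighbour on one side and a non-space-like
    neighbour on the other is glued to the non-space-like neighbour.
    """
    forbidden = []
    for i, ch in enumerate(text):
        if ch not in QUOTES:
            continue
        before = text[i - 1] if i > 0 else None
        after = text[i + 1] if i + 1 < len(text) else None
        if is_space_like(before) and not is_space_like(after):
            forbidden.append(i + 1)  # keep the quote with what follows it
        elif is_space_like(after) and not is_space_like(before):
            forbidden.append(i)      # keep the quote with what precedes it
    return forbidden


if __name__ == "__main__":
    # Finnish-style quoting: U+201D is used on both sides of the quotation.
    finnish = "H\u00E4n sanoi \u201Dhei\u201D ja"
    print(forbidden_breaks(finnish))  # [11, 14]
```

Because the decision depends only on the neighbouring characters, the same U+201D can act as an opening mark in one place and a closing mark in another without any language tag.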

Henri Sivonen (:hsivonen)

Comment 5

1 year ago

(In reply to Henri Sivonen (:hsivonen) from comment #4)

> It's unclear to me why this needs language information as opposed to making quotation marks that have a space-like character on one side and a non-space-like character on the other side not have a line break opportunity on the side of the non-space-like character. Why doesn't UAX 14 itself have such a rule?

Hmm. Superficially, rules LB15a and LB15b in UAX 14 seem to be about exactly this. What am I missing?

Makoto Kato [:m_kato]

Comment 6

1 year ago

LB15a and LB15b were introduced in Unicode 15.1. ICU4X implements the Unicode 15.0 version of UAX 14 (https://www.unicode.org/reports/tr14/tr14-49.html).

Brian Birtles (:birtles)

Comment 7

1 year ago

So far as I can tell, the proposal that introduced LB15a and LB15b is https://www.unicode.org/L2/L2023/23063-break-quot-mark.pdf and implementing it is tracked in https://github.com/unicode-org/icu4x/issues/3255

See Also: → https://github.com/unicode-org/icu4x/issues/3255

Henri Sivonen (:hsivonen)

Comment 8

1 year ago

The author of the proposal that resulted in LB15a and LB15b points out that those rules don't address the Chinese (without spaces) case: https://github.com/unicode-org/icu4x/issues/3255#issuecomment-1771263967
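
To make the "without spaces" point concrete, here is a small Python check, offered as a sketch only: it looks for quotation marks that sit next to a whitespace character, which is the kind of context a space-based rule would need. The helper name quotes_next_to_space is hypothetical, and the sample is the test string from the description with its punctuation escaped.

```python
QUOTES = {"\u2018", "\u2019", "\u201C", "\u201D"}


def quotes_next_to_space(text):
    """Return offsets of quotation marks that touch a whitespace character."""
    hits = []
    for i, ch in enumerate(text):
        if ch in QUOTES:
            # Slicing keeps this safe at the start and end of the text.
            neighbours = text[max(i - 1, 0):i] + text[i + 1:i + 2]
            if any(c.isspace() for c in neighbours):
                hits.append(i)
    return hits


# Chinese test string from the bug description: it contains no spaces.
chinese = (
    "他站起来问\uFF1A\u201C老师\uFF0C\u2018有条不紊\u2019"
    "的\u2018紊\u2019是什么意思\uFF1F\u201D"
)
# A space-delimited counterexample for comparison.
spaced = "say \u201Chi\u201D now"

if __name__ == "__main__":
    print(quotes_next_to_space(chinese))  # []
    print(quotes_next_to_space(spaced))   # [4, 7]
```

With no whitespace next to any quotation mark in the Chinese string, a rule keyed on adjacent spaces has nothing to act on, which is why the report asks for a language-based tailoring instead.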

Categories

(Core :: Internationalization, enhancement, P3)
