Open Bug 1851131 Opened 1 year ago Updated 10 months ago

Bad line breaking behavior with ICU4X for Chinese text containing quotation marks

Categories

(Core :: Internationalization, enhancement, P3)

Firefox 119

Tracking

()

UNCONFIRMED

People

(Reporter: cpplearner, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0

Steps to reproduce:

  1. Ensure that intl.icu4x.segmenter.enabled is true in about:config (this is the default for Nightly)
  2. Open data:text/html;charset=utf-8,<div lang=zh style="border:1px dashed;width:180px">%E4%BB%96%E7%AB%99%E8%B5%B7%E6%9D%A5%E9%97%AE%EF%BC%9A%E2%80%9C%E8%80%81%E5%B8%88%EF%BC%8C%E2%80%98%E6%9C%89%E6%9D%A1%E4%B8%8D%E7%B4%8A%E2%80%99%E7%9A%84%E2%80%98%E7%B4%8A%E2%80%99%E6%98%AF%E4%BB%80%E4%B9%88%E6%84%8F%E6%80%9D%EF%BC%9F%E2%80%9D
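For readers who want to see the test sentence without opening the data: URL, the percent-encoded payload from step 2 can be decoded as in the sketch below. It uses only the Python standard library; the variable names are illustrative and not part of the report.

```python
from urllib.parse import unquote

# Percent-encoded text copied from the data: URL in step 2, split for readability.
ENCODED = (
    "%E4%BB%96%E7%AB%99%E8%B5%B7%E6%9D%A5%E9%97%AE%EF%BC%9A"
    "%E2%80%9C%E8%80%81%E5%B8%88%EF%BC%8C"
    "%E2%80%98%E6%9C%89%E6%9D%A1%E4%B8%8D%E7%B4%8A%E2%80%99"
    "%E7%9A%84%E2%80%98%E7%B4%8A%E2%80%99"
    "%E6%98%AF%E4%BB%80%E4%B9%88%E6%84%8F%E6%80%9D%EF%BC%9F%E2%80%9D"
)

# unquote() decodes the escapes as UTF-8 by default.
text = unquote(ENCODED)
print(text)

# List every curly quotation mark with its 0-based position and code point.
QUOTE_MARKS = {"\u2018", "\u2019", "\u201c", "\u201d"}
quote_positions = [
    (index, f"U+{ord(ch):04X}") for index, ch in enumerate(text) if ch in QUOTE_MARKS
]
for index, code_point in quote_positions:
    print(index, code_point)
```

The sentence contains no spaces at all, which matters for the later discussion of space-based rules.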

Actual results:

There is no line break opportunity around the quotation marks, which makes the text look poor. See the attached screenshot.

Expected results:

There should be a line break opportunity before each

  • U+2018 LEFT SINGLE QUOTATION MARK
  • U+201C LEFT DOUBLE QUOTATION MARK

There should be a line break opportunity after each

  • U+2019 RIGHT SINGLE QUOTATION MARK
  • U+201D RIGHT DOUBLE QUOTATION MARK
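The expectation above can be written down as a small sketch. It encodes only the reporter's two rules (break before an opening mark, break after a closing mark) and deliberately ignores every other UAX 14 rule, so it is not a line breaker; the function name `expected_quote_breaks` is hypothetical.

```python
# Assumption: in Chinese text these marks are used strictly as opening/closing quotes.
OPENING = {"\u2018", "\u201c"}  # LEFT SINGLE / LEFT DOUBLE QUOTATION MARK
CLOSING = {"\u2019", "\u201d"}  # RIGHT SINGLE / RIGHT DOUBLE QUOTATION MARK


def expected_quote_breaks(text):
    """Return offsets i where a break between text[i-1] and text[i] is expected.

    Only the two expectations from this report are applied; breaks at the very
    start or end of the text are not reported.
    """
    breaks = set()
    for i, ch in enumerate(text):
        if ch in OPENING and i > 0:
            breaks.add(i)          # opportunity before the opening mark
        if ch in CLOSING and i + 1 < len(text):
            breaks.add(i + 1)      # opportunity after the closing mark
    return sorted(breaks)


# The sentence from the test case, with the quotation marks written as escapes.
SAMPLE = (
    "他站起来问：\u201c老师，\u2018有条不紊\u2019的"
    "\u2018紊\u2019是什么意思？\u201d"
)

for offset in expected_quote_breaks(SAMPLE):
    print(offset, SAMPLE[:offset][-3:], "|", SAMPLE[offset:][:3])
```

Real text can nest or stack quotation marks next to other punctuation, where these two rules alone would be too permissive.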

Note that UAX 14 says:

Note: If language information is available, it can be used to determine which character is used as the opening quote and which as the closing quote. See the information in Section 6.2, General Punctuation, in [Unicode]. In such a case, the quotation marks could be tailored to either OP or CL depending on their actual usage.

This tailoring is essential for Chinese text.

The Bugbug bot thinks this bug should belong to the 'Core::DOM: Core & HTML' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → DOM: Core & HTML
Product: Firefox → Core

Henri, do you know if this is something that we can do anything about?

Flags: needinfo?(hsivonen)
Component: DOM: Core & HTML → Internationalization
Flags: needinfo?(hsivonen) → needinfo?(m_kato)

Actually, ICU4X doesn't use language information; its rules are unified across languages. This is a kind of bug 465457.

Blocks: 465457
Severity: -- → S3
Type: defect → enhancement
Flags: needinfo?(m_kato)
Priority: -- → P3

FWIW, this isn't in any way specific to Chinese. Finnish and Swedish, for example, use U+201D RIGHT DOUBLE QUOTATION MARK as both open and close quotation mark.

It's unclear to me why this needs language information as opposed to making quotation marks that have a space-like character on one side and a non-space-like character on the other side not have a line break opportunity on the side of the non-space-like character. Why doesn't UAX 14 itself have such a rule?

(In reply to Henri Sivonen (:hsivonen) from comment #4)

> It's unclear to me why this needs language information as opposed to making quotation marks that have a space-like character on one side and a non-space-like character on the other side not have a line break opportunity on the side of the non-space-like character. Why doesn't UAX 14 itself have such a rule?

Hmm. Superficially, rules LB15a and LB15b in UAX 14 seem to be about exactly this. What am I missing?

LB15a and LB15b are from Unicode 15.1. ICU4X uses Unicode 15.0 (https://www.unicode.org/reports/tr14/tr14-49.html).

As far as I can tell, the proposal that introduced LB15a and LB15b is https://www.unicode.org/L2/L2023/23063-break-quot-mark.pdf and implementing it is tracked in https://github.com/unicode-org/icu4x/issues/3255

The author of the proposal that resulted in LB15a and LB15b points out that those rules don't address the Chinese (without spaces) case: https://github.com/unicode-org/icu4x/issues/3255#issuecomment-1771263967
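A rough way to see why a space-based heuristic cannot help here: the sketch below looks for quotation marks that have whitespace on at least one side. It approximates "space-like" with Python's `str.isspace()`, which is an assumption made for illustration; the actual LB15a and LB15b rules are defined over more line-break classes than whitespace. The function name is hypothetical.

```python
QUOTES = {"\u2018", "\u2019", "\u201c", "\u201d"}


def quotes_with_adjacent_space(text):
    """Return positions of quotation marks that have whitespace on either side."""
    hits = []
    for i, ch in enumerate(text):
        if ch not in QUOTES:
            continue
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        # "".isspace() is False, so text boundaries do not count as spaces here.
        if before.isspace() or after.isspace():
            hits.append(i)
    return hits


# The Chinese sentence from the test case: no spaces anywhere.
CHINESE = (
    "他站起来问：\u201c老师，\u2018有条不紊\u2019的"
    "\u2018紊\u2019是什么意思？\u201d"
)
# A space-delimited sentence for comparison.
ENGLISH = "He said, \u201chello\u201d to me"

print(quotes_with_adjacent_space(CHINESE))
print(quotes_with_adjacent_space(ENGLISH))
```

In the Chinese sample no quotation mark has a neighbouring space, so a rule keyed on spaces has nothing to anchor on, whereas both marks in the space-delimited sample qualify.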
