Bad line breaking behavior with ICU4X for Chinese text containing quotation marks
Categories
(Core :: Internationalization, enhancement, P3)
Tracking
()
People
(Reporter: cpplearner, Unassigned)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
4.38 KB,
image/png
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0
Steps to reproduce:
- Ensure that intl.icu4x.segmenter.enabled is true in about:config (this is the default for Nightly)
- Open data:text/html;charset=utf-8,<div lang=zh style="border:1px dashed;width:180px">%E4%BB%96%E7%AB%99%E8%B5%B7%E6%9D%A5%E9%97%AE%EF%BC%9A%E2%80%9C%E8%80%81%E5%B8%88%EF%BC%8C%E2%80%98%E6%9C%89%E6%9D%A1%E4%B8%8D%E7%B4%8A%E2%80%99%E7%9A%84%E2%80%98%E7%B4%8A%E2%80%99%E6%98%AF%E4%BB%80%E4%B9%88%E6%84%8F%E6%80%9D%EF%BC%9F%E2%80%9D
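For readers who want to see the test text without opening the data: URL, here is a minimal Python sketch that decodes the percent-encoded payload and lists the quotation marks it contains. The variable names are illustrative only and are not part of the report.

```python
import unicodedata
from urllib.parse import unquote

# Percent-encoded text copied from the data: URL in the steps to reproduce.
ENCODED = (
    "%E4%BB%96%E7%AB%99%E8%B5%B7%E6%9D%A5%E9%97%AE%EF%BC%9A%E2%80%9C"
    "%E8%80%81%E5%B8%88%EF%BC%8C%E2%80%98%E6%9C%89%E6%9D%A1%E4%B8%8D"
    "%E7%B4%8A%E2%80%99%E7%9A%84%E2%80%98%E7%B4%8A%E2%80%99%E6%98%AF"
    "%E4%BB%80%E4%B9%88%E6%84%8F%E6%80%9D%EF%BC%9F%E2%80%9D"
)

# unquote() decodes the UTF-8 percent escapes into a Python string.
text = unquote(ENCODED)
print(text)

# List every quotation mark together with its index and Unicode name.
QUOTES = "\u2018\u2019\u201C\u201D"
quote_positions = [(i, ch) for i, ch in enumerate(text) if ch in QUOTES]
for i, ch in quote_positions:
    print(i, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The decoded string contains no spaces at all, which matters for the later discussion of space-based rules.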
Actual results:
There is no line break opportunity around the quotation marks, so the text wraps poorly. See the attached screenshot.
Expected results:
There's a line break opportunity before each
- U+2018 LEFT SINGLE QUOTATION MARK
- U+201C LEFT DOUBLE QUOTATION MARK
There's a line break opportunity after each
- U+2019 RIGHT SINGLE QUOTATION MARK
- U+201D RIGHT DOUBLE QUOTATION MARK
Note that UAX 14 says
Note: If language information is available, it can be used to determine which character is used as the opening quote and which as the closing quote. See the information in Section 6.2, General Punctuation, in [Unicode]. In such a case, the quotation marks could be tailored to either OP or CL depending on their actual usage.
This tailoring is essential for Chinese text.
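The expected results above can be sketched in Python as follows. This is only an illustration of the requested tailoring, assuming U+2018/U+201C always open and U+2019/U+201D always close (as in Chinese). It ignores every other UAX 14 rule, and the function name is hypothetical.

```python
# Assumption: in Chinese text these marks have fixed roles.
OPENING = {"\u2018", "\u201C"}  # treated like class OP
CLOSING = {"\u2019", "\u201D"}  # treated like class CL

SAMPLE = "他站起来问：“老师，‘有条不紊’的‘紊’是什么意思？”"


def expected_quote_breaks(text):
    """Return offsets (positions between characters) where the report
    expects a line break opportunity next to a quotation mark."""
    breaks = []
    for i, ch in enumerate(text):
        # Break before an opening quote, except at the start of the text
        # or directly after another opening quote.
        if ch in OPENING and i > 0 and text[i - 1] not in OPENING:
            breaks.append(i)
        # Break after a closing quote, except at the end of the text
        # or directly before another closing quote.
        if ch in CLOSING and i + 1 < len(text) and text[i + 1] not in CLOSING:
            breaks.append(i + 1)
    return breaks


print(expected_quote_breaks(SAMPLE))
```

An offset n here means "between character n-1 and character n". A real implementation would have to combine this with the remaining line breaking rules rather than look at quotation marks in isolation.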
Comment 1•2 years ago
The Bugbug bot thinks this bug should belong to the 'Core::DOM: Core & HTML' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.
Comment 2•2 years ago
Henri, do you know if this is something that we can do anything about?
Comment 3•1 year ago
Actually, ICU4X doesn't use language information; its rules are unified across languages. This is a variant of bug 465457.
Comment 4•1 year ago
FWIW, this isn't in any way specific to Chinese. Finnish and Swedish, for example, use U+201D RIGHT DOUBLE QUOTATION MARK as both open and close quotation mark.
It's unclear to me why this needs language information as opposed to making quotation marks that have a space-like character on one side and a non-space-like character on the other side not have a line break opportunity on the side of the non-space-like character. Why doesn't UAX 14 itself have such a rule?
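The language-independent heuristic described in this comment could be sketched roughly as below. This is a simplified reading of the comment, not the text of any UAX 14 rule. "Space-like" is approximated with Python's str.isspace(), quotation marks at the very start or end of the text are skipped, and the function name is hypothetical.

```python
QUOTES = {"\u2018", "\u2019", "\u201C", "\u201D"}


def no_break_offsets(text):
    """Return offsets (positions between characters) where a break would be
    suppressed: the side of a quotation mark that touches a non-space
    character while the other side touches a space-like character."""
    blocked = []
    for i, ch in enumerate(text):
        # Skip non-quotes and quotes at the text edges (a simplification).
        if ch not in QUOTES or i == 0 or i + 1 == len(text):
            continue
        space_before = text[i - 1].isspace()
        space_after = text[i + 1].isspace()
        if space_before and not space_after:
            blocked.append(i + 1)  # keep the quote attached to what follows
        elif space_after and not space_before:
            blocked.append(i)  # keep the quote attached to what precedes
    return blocked


# Finnish-style quoting: U+201D is used as both opening and closing mark.
print(no_break_offsets("sanoi \u201Dhei\u201D minulle"))
# Chinese sample from the report: there are no spaces to key off.
print(no_break_offsets("他站起来问：“老师，‘有条不紊’的‘紊’是什么意思？”"))
```

With spaces around the quoted word, the heuristic works without knowing which mark opens and which closes. For the Chinese sample it returns nothing, because no quotation mark has a space on either side.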
Comment 5•1 year ago
(In reply to Henri Sivonen (:hsivonen) from comment #4)
It's unclear to me why this needs language information as opposed to making quotation marks that have a space-like character on one side and a non-space-like character on the other side not have a line break opportunity on the side of the non-space-like character. Why doesn't UAX 14 itself have such a rule?
Hmm. Superficially, rules LB15a and LB15b in UAX 14 seem to be about exactly this. What am I missing?
Comment 6•1 year ago
LB15a and LB15b are from Unicode 15.1. ICU4X uses 15.0 (https://www.unicode.org/reports/tr14/tr14-49.html)
Comment 7•1 year ago
So far as I can tell, the proposal that introduced LB15a and LB15b is https://www.unicode.org/L2/L2023/23063-break-quot-mark.pdf and implementing it is tracked in https://github.com/unicode-org/icu4x/issues/3255
Comment 8•1 year ago
The author of the proposal that resulted in LB15a and LB15b points out that those rules don't address the Chinese (without spaces) case: https://github.com/unicode-org/icu4x/issues/3255#issuecomment-1771263967