Closed
Bug 1265631
Opened 8 years ago
Closed 8 years ago
Supplementary-plane Unicode characters (including Emoji) fail to line-break properly
Categories
(Core :: Layout: Text and Fonts, defect)
Tracking
()
RESOLVED
FIXED
mozilla49
Tracking | Status | |
---|---|---|
firefox49 | --- | fixed |
People
(Reporter: rez, Assigned: jfkthame)
References
()
Details
Attachments
(6 files)
611 bytes,
text/html
|
Details | |
4.87 KB,
patch
|
Details | Diff | Splinter Review | |
5.68 KB,
patch
|
masayuki
:
review+
|
Details | Diff | Splinter Review |
898.09 KB,
patch
|
masayuki
:
review+
|
Details | Diff | Splinter Review |
1.92 KB,
patch
|
masayuki
:
review+
|
Details | Diff | Splinter Review |
14.77 KB,
patch
|
masayuki
:
review+
|
Details | Diff | Splinter Review |
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Firefox/45.0 Build ID: 20160407164938 Steps to reproduce: Uninterrupted emoji characters fail to wrap correctly within a confined space. Demo: http://codepen.io/cliener/pen/BKVNVv Actual results: Actual rendering: XXXXXXXXXXXXXXXXXXXXXXXX Expected results: Expecting to see: XXXXXXXX XXXXXXXX XXXXXXXX
Reporter | ||
Updated•8 years ago
|
Summary: Emoji characters fail to wrap → HTML Emoji characters fail to wrap
Updated•8 years ago
|
Component: Untriaged → Layout: Text
Product: Firefox → Core
At the very least, Chrome does wrap here. I didn't check Edge. I'm somewhat curious what http://www.unicode.org/reports/tr14/ says. (And also surprised this hasn't previously been reported.) The other question is whether there should be priorities involved, e.g., whether text should break as: hello
Flags: needinfo?(masayuki)
Summary: HTML Emoji characters fail to wrap → Unicode Emoji characters fail to wrap
Oh, right. ******* BUGZILLA (bug 405011). Here's the rest of my comment, with the Emoji replaced by X's: hello XXX XX hello (giving the spaces higher priority as a break point) or as: hello XXX hello (treating the emoji as just-as-breakable as spaces). (I also admit I don't remember what we do for similar cases with Chinese-within-English.)
Assignee | ||
Comment 4•8 years ago
|
||
(In reply to David Baron [:dbaron] ⌚️UTC-7 from comment #2) > At the very least, Chrome does wrap here. I didn't check Edge. > > I'm somewhat curious what http://www.unicode.org/reports/tr14/ says. TR14 doesn't directly talk about emoji, afaics, but most of them are given line-break class ID in the LineBreak.txt data file, which means they ought to behave similarly to CJK ideographs. And apparently they don't, so it does look like we have a bug here.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee | ||
Comment 5•8 years ago
|
||
We seem to have the same problem with a sequence of plane-2 ideographs, too. Which suggests this is probably a lack of proper surrogate-pair handling somewhere in the line-breaker code.
Assignee | ||
Comment 6•8 years ago
|
||
Indeed, the code in nsJISx4051LineBreaker.cpp looks like it works purely on 16-bit code units and has no awareness of surrogates. (And the line-break class tables it uses haven't been updated lately, afaict, so there are probably a scattering of more recently-encoded BMP characters that we don't handle properly, let alone all the supplementary-plane stuff.) (If it weren't for Android, maybe we could just use ICU instead of having to maintain that ancient code....)
Assignee | ||
Comment 7•8 years ago
|
||
This is a quick-'n'-dirty proof of concept that adds some basic surrogate support to the line-breaker, and makes the example here behave as expected. For a real patch, we should probably import the properties from Unicode's LineBreak.txt data file into nsUnicodeProperties, and use that not just for non-BMP characters but also to replace the bulk of the existing GetClass function in nsJISx4051LineBreaker.cpp, just keeping any custom overrides that are needed for any special cases where we need to differ from the default Unicode line-breaking properties.
Assignee | ||
Updated•8 years ago
|
Summary: Unicode Emoji characters fail to wrap → Supplementary-plane Unicode characters (including Emoji) fail to line-break properly
Comment 8•8 years ago
|
||
If we use ICU for linebreaking of Unicode characters (in other words, non-ASCII characters), this should be gone. Makoto-san, what's the status?
Flags: needinfo?(masayuki) → needinfo?(m_kato)
Comment 9•8 years ago
|
||
(In reply to Masayuki Nakano [:masayuki] (Mozilla Japan) from comment #8) > If we use ICU for linebreaking of Unicode characters (in other words, > non-ASCII characters), this should be gone. Makoto-san, what's the status? ICU line breaker spends more binary size (maybe additional 1-2 MB). So I don't integrate it within the near feature. (And Fennec team doesn't agree for integrating ICU yet....)
Flags: needinfo?(m_kato)
Assignee | ||
Comment 10•8 years ago
|
||
What we could try doing (without needing to add the full ICU BreakIterator code) is to implement the pair-based algorithm from UAX#14 (based on the LineBreak.txt data from Unicode, so it covers the full character set, unlike our current code). However, this has the potential to change behavior in various cases, and we'd probably want to tailor parts of the UAX#14 algorithm to suit Web content better. Altogether, a far-from-trivial change that I think we should consider in its own right, in a separate bug. As a shorter-term fix here, though, we can fairly easily add some basic support for supplementary-plane characters to the existing code, without changing its overall behavior for everything it currently supports. This just requires an added check for surrogates, and a way to look up suitable break-class values for the supplementary codepoints that GetClass doesn't currently know anything about. The simplest of all fixes would be to just treat all supplementary-plane characters as a single default class; but that's not quite good enough, because there are some important distinctions there. The emoji should be treated much like CJK ideographs, so that a long run of smiley-faces will wrap whenever it happens to hit the line edge; but that's not appropriate for the various supplementary-plane alphabets, or some of the other types of symbols. So I propose a sort of "half-way" fix for now: add the LineBreak.txt data to our unicode properties (for platforms where we don't have ICU available, i.e. Android; where we do have ICU, we can simply ask it for the character properties), and then for characters that GetClass doesn't currently support, use a simple mapping from Unicode's LineBreak classes to the classes currently used in the nsJISx4051LineBreaker code. This will avoid disturbing current behavior for the characters where it has been carefully designed and tested, while adding support for the supplementary characters where we currently fail.
Assignee | ||
Comment 11•8 years ago
|
||
Here are a couple of simple reftests for supplementary-plane line-breaking; they currently fail because of our lack of support in nsJISx4051LineBreaker.
Attachment #8744283 -
Flags: review?(masayuki)
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → jfkthame
Status: NEW → ASSIGNED
Assignee | ||
Comment 12•8 years ago
|
||
This adds LineBreak.txt to the data available through nsUnicodeProperties; it is only included when ENABLE_INTL_API is disabled, so that we can't get the properties via ICU.
Attachment #8744284 -
Flags: review?(masayuki)
Assignee | ||
Comment 13•8 years ago
|
||
And an accessor so we can get the UAX#14 linebreak class for any character, either from ICU or our own data.
Attachment #8744286 -
Flags: review?(masayuki)
Assignee | ||
Comment 14•8 years ago
|
||
This adds basic surrogate support in the line-breaker, and makes the testcases here work properly.
Attachment #8744287 -
Flags: review?(masayuki)
Assignee | ||
Comment 15•8 years ago
|
||
Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1bc7226d7e83
Updated•8 years ago
|
Attachment #8744283 -
Flags: review?(masayuki) → review+
Comment 16•8 years ago
|
||
Comment on attachment 8744287 [details] [diff] [review] patch 3 - Add surrogate support to nsJISx4051LineBreaker so that supplementary-plane characters get correct line-breaking behavior >+ if (l <= 0xbf) { // Hangul Compatibility Jamo, Bopomofo, Kanbun > // XXX: This is per UAX #14, but UAX #14 may change > // the line breaking rules about Kanbun and Bopomofo. nit: Why don't you indent these two lines? >+ // Mapping for Unicode LineBreak.txt classes to the (simplified) set of >+ // character classes used here. Let me check this later... >- char16_t GetPreviousNonHyphenCharacter() const { >+ uint32_t GetPreviousNonHyphenCharacter() const { Why don't you use char32_t here? > return mPreviousNonHyphenCharacter; > } >- void NotifyNonHyphenCharacter(char16_t ch) { >+ void NotifyNonHyphenCharacter(uint32_t ch) { and here... >- char16_t mPreviousNonHyphenCharacter; // The last character we have seen >+ uint32_t mPreviousNonHyphenCharacter; // The last character we have seen and here... >@@ -825,17 +874,22 @@ nsJISx4051LineBreaker::GetJISx4051Breaks > uint8_t aWordBreak, > uint8_t* aBreakBefore) > { > uint32_t cur; > int8_t lastClass = CLASS_NONE; > ContextState state(aChars, aLength); > > for (cur = 0; cur < aLength; ++cur, state.AdvanceIndex()) { >- char16_t ch = aChars[cur]; >+ uint32_t ch = aChars[cur]; And here. >+ if (ch > 0xffff) { >+ ++cur; >+ aBreakBefore[cur] = false; >+ state.AdvanceIndex(); >+ } Could you add a comment like "Mark before the following low surrogate character isn't breakable and skip it" or something.
Updated•8 years ago
|
Attachment #8744286 -
Flags: review?(masayuki) → review+
Updated•8 years ago
|
Attachment #8744284 -
Flags: review?(masayuki) → review+
Comment 17•8 years ago
|
||
Comment on attachment 8744287 [details] [diff] [review] patch 3 - Add surrogate support to nsJISx4051LineBreaker so that supplementary-plane characters get correct line-breaking behavior My understanding is, this is new map when the old code class is CLASS_CHARACTER or CLASS_BREAKABLE (for CJK characters), right? >+ // Mapping for Unicode LineBreak.txt classes to the (simplified) set of >+ // character classes used here. >+ // XXX The mappings here were derived by comparing the Unicode LineBreak >+ // values of BMP characters to the classes our existing GetClass returns >+ // for the same codepoints; in cases where characters with the same >+ // LineBreak class mapped to various classes here, I picked what seemed >+ // the most prevalent equivalence. >+ // Some of these are unclear to me, but currently they are ONLY used >+ // for characters not handled by the old code above, so all the JISx405 >+ // special cases should already be accounted for. >+ static const int8_t sUnicodeLineBreakToClass[] = { >+ /* UNKNOWN = 0, [XX] */ CLASS_CHARACTER, >+ /* AMBIGUOUS = 1, [AI] */ CLASS_CHARACTER, >+ /* ALPHABETIC = 2, [AL] */ CLASS_CHARACTER, >+ /* BREAK_BOTH = 3, [B2] */ CLASS_CHARACTER, CLASS_BREAKABLE? But I think that keeping to use CLASS_CHARACTER in this bug is better. We should *try* to change later and separately. >+ /* BREAK_AFTER = 4, [BA] */ CLASS_CHARACTER, CLASS_CLOSE_LIKE_CHARACTER? But same above. >+ /* BREAK_BEFORE = 5, [BB] */ CLASS_OPEN_LIKE_CHARACTER, // ??? I think so. >+ /* MANDATORY_BREAK = 6, [BK] */ CLASS_CHARACTER, CLASS_BREAKABLE? But same above. >+ /* CONTINGENT_BREAK = 7, [CB] */ CLASS_CHARACTER, >+ /* CLOSE_PUNCTUATION = 8, [CL] */ CLASS_CHARACTER, CLASS_CLOSE_LIKE_CHARACTER? But same above. >+ /* COMBINING_MARK = 9, [CM] */ CLASS_CHARACTER, >+ /* CARRIAGE_RETURN = 10, [CR] */ CLASS_BREAKABLE, >+ /* EXCLAMATION = 11, [EX] */ CLASS_CHARACTER, CLASS_CLOSE_LIKE_CHARACTER? But same above. >+ /* GLUE = 12, [GL] */ CLASS_NON_BREAKABLE, >+ /* HYPHEN = 13, [HY] */ CLASS_CHARACTER, >+ /* IDEOGRAPHIC = 14, [ID] */ CLASS_BREAKABLE, >+ /* INSEPARABLE = 15, [IN] */ CLASS_CLOSE_LIKE_CHARACTER, >+ /* INFIX_NUMERIC = 16, [IS] */ CLASS_CHARACTER, >+ /* LINE_FEED = 17, [LF] */ CLASS_BREAKABLE, >+ /* NONSTARTER = 18, [NS] */ CLASS_CLOSE, I guess that CLASS_CLOSE_LIKE_CHARACTER is safer for now if this is not the class of emoji characters. >+ /* NUMERIC = 19, [NU] */ CLASS_CHARACTER, >+ /* OPEN_PUNCTUATION = 20, [OP] */ CLASS_CHARACTER, >+ /* POSTFIX_NUMERIC = 21, [PO] */ CLASS_CHARACTER, >+ /* PREFIX_NUMERIC = 22, [PR] */ CLASS_CHARACTER, >+ /* QUOTATION = 23, [QU] */ CLASS_CHARACTER, >+ /* COMPLEX_CONTEXT = 24, [SA] */ CLASS_CHARACTER, >+ /* SURROGATE = 25, [SG] */ CLASS_CHARACTER, >+ /* SPACE = 26, [SP] */ CLASS_BREAKABLE, >+ /* BREAK_SYMBOLS = 27, [SY] */ CLASS_CHARACTER, CLASS_CLOSE? CLASS_CLOSE_LIKE_CHARACTER? Not sure... >+ /* ZWSPACE = 28, [ZW] */ CLASS_BREAKABLE, >+ /* NEXT_LINE = 29, [NL] */ CLASS_CHARACTER, CLASS_CLOSE? But try later. >+ /* WORD_JOINER = 30, [WJ] */ CLASS_NON_BREAKABLE, >+ /* H2 = 31, [H2] */ CLASS_BREAKABLE, >+ /* H3 = 32, [H3] */ CLASS_BREAKABLE, >+ /* JL = 33, [JL] */ CLASS_CHARACTER, >+ /* JT = 34, [JT] */ CLASS_CHARACTER, >+ /* JV = 35, [JV] */ CLASS_CHARACTER, >+ /* CLOSE_PARENTHESIS = 36, [CP] */ CLASS_CLOSE_LIKE_CHARACTER, >+ /* CONDITIONAL_JAPANESE_STARTER = 37, [CJ] */ CLASS_CLOSE, If this is applied only for Japanese characters, it must be safe because basically Japanese text is breakable in any points. >+ /* HEBREW_LETTER = 38, [HL] */ CLASS_CHARACTER, >+ /* REGIONAL_INDICATOR = 39, [RI] */ CLASS_CHARACTER >+ }; >+ >+ return sUnicodeLineBreakToClass[mozilla::unicode::GetLineBreakClass(u)]; If we should try to change above suggestions, we should try it in another bug and land them separately for easier to check regression range. So, for this bug, I think that NONSTARTER case should be changed for safe.
Attachment #8744287 -
Flags: review?(masayuki) → review+
Assignee | ||
Comment 18•8 years ago
|
||
(In reply to Masayuki Nakano [:masayuki] (Mozilla Japan) (offline: 4/29-5/8) from comment #16) > >- char16_t GetPreviousNonHyphenCharacter() const { > >+ uint32_t GetPreviousNonHyphenCharacter() const { > > Why don't you use char32_t here? Habit, mainly. :) But on checking the gecko codebase, I don't think we use char32_t for this anywhere else; the only occurrences I see are under a couple of AOSP-derived /media/... directories. Throughout intl, gfx, layout, dom, etc., whenever we need to handle a Unicode scalar value (rather than just a utf16 code unit), we simply use uint32_t. So for consistency with the existing code, I'm inclined to leave this unchanged for now.
Comment 19•8 years ago
|
||
> So for consistency with the existing code, I'm inclined to leave this unchanged for now.
I don't like the current style, but sounds reasonable for using uint32_t for it.
Assignee | ||
Comment 20•8 years ago
|
||
(In reply to Masayuki Nakano [:masayuki] (Mozilla Japan) (offline: 4/29-5/8) from comment #17) > If we should try to change above suggestions, we should try it in another > bug and land them separately for easier to check regression range. So, for > this bug, I think that NONSTARTER case should be changed for safe. OK, I'll change that one. I agree it may be possible to adjust some of the others, but it's better to do that separately. I also wonder if we can remove some of the existing ad hoc if-then code and tables, and rely more widely on mapping the unicode classes; but again, I didn't want to add that risk to this bug.
Assignee | ||
Comment 21•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/387681d67d20df7ba607e022b49810686fb9dd5e Bug 1265631 - Reftests for line-breaking with supplementary-plane characters. r=masayuki https://hg.mozilla.org/integration/mozilla-inbound/rev/77509a99fab42bf659801a9319b94463e6639b35 Bug 1265631 - patch 1 - Add line-break data from Unicode's LineBreak.txt to nsUnicodeProperties for builds without ICU. r=masayuki https://hg.mozilla.org/integration/mozilla-inbound/rev/769a374b634718a3e8d05dcf0f392e2bae782312 Bug 1265631 - patch 2 - Implement GetLineBreakClass() accessor to get Unicode line-break class from ICU or nsUnicodeProperties data. r=masayuki https://hg.mozilla.org/integration/mozilla-inbound/rev/f6ac8d1b4243d84078c3daa7e30cce393f950557 Bug 1265631 - patch 3 - Add surrogate support to nsJISx4051LineBreaker so that supplementary-plane characters get correct line-breaking behavior. r=masayuki
Assignee | ||
Comment 22•8 years ago
|
||
Argh, sorry - I landed a stale version of "patch 3" that didn't include the last couple of adjustments to address review comments. I'll push a quick followup to tidy up...
Comment 24•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/387681d67d20 https://hg.mozilla.org/mozilla-central/rev/77509a99fab4 https://hg.mozilla.org/mozilla-central/rev/769a374b6347 https://hg.mozilla.org/mozilla-central/rev/f6ac8d1b4243 https://hg.mozilla.org/mozilla-central/rev/90bd0f479ed7
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
status-firefox49:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
You need to log in
before you can comment on or make changes to this bug.
Description
•