Closed Bug 1265631 Opened 8 years ago Closed 8 years ago

Supplementary-plane Unicode characters (including Emoji) fail to line-break properly

Categories

(Core :: Layout: Text and Fonts, defect)

45 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed

People

(Reporter: rez, Assigned: jfkthame)

References

()

Details

Attachments

(6 files)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Firefox/45.0
Build ID: 20160407164938

Steps to reproduce:

Uninterrupted emoji characters fail to wrap correctly within a confined space. Demo: http://codepen.io/cliener/pen/BKVNVv


Actual results:

Actual rendering:
XXXXXXXXXXXXXXXXXXXXXXXX


Expected results:

Expecting to see:
XXXXXXXX
XXXXXXXX
XXXXXXXX
Summary: Emoji characters fail to wrap → HTML Emoji characters fail to wrap
Component: Untriaged → Layout: Text
Product: Firefox → Core
At the very least, Chrome does wrap here.  I didn't check Edge.

I'm somewhat curious what http://www.unicode.org/reports/tr14/ says.

(And also surprised this hasn't previously been reported.)


The other question is whether there should be priorities involved, e.g., whether text should break as:

  hello 
Flags: needinfo?(masayuki)
Summary: HTML Emoji characters fail to wrap → Unicode Emoji characters fail to wrap
Oh, right. ******* BUGZILLA (bug 405011).  Here's the rest of my comment, with the Emoji replaced by X's:


  hello XXX
  XX hello

(giving the spaces higher priority as a break point) or as:

  hello
  XXX
  hello

(treating the emoji as just-as-breakable as spaces).

(I also admit I don't remember what we do for similar cases with Chinese-within-English.)
(In reply to David Baron [:dbaron] ⌚️UTC-7 from comment #2)
> At the very least, Chrome does wrap here.  I didn't check Edge.
> 
> I'm somewhat curious what http://www.unicode.org/reports/tr14/ says.

TR14 doesn't directly talk about emoji, afaics, but most of them are given line-break class ID in the LineBreak.txt data file, which means they ought to behave similarly to CJK ideographs. And apparently they don't, so it does look like we have a bug here.
Status: UNCONFIRMED → NEW
Ever confirmed: true
We seem to have the same problem with a sequence of plane-2 ideographs, too. Which suggests this is probably a lack of proper surrogate-pair handling somewhere in the line-breaker code.
Indeed, the code in nsJISx4051LineBreaker.cpp looks like it works purely on 16-bit code units and has no awareness of surrogates. (And the line-break class tables it uses haven't been updated lately, afaict, so there are probably a scattering of more recently-encoded BMP characters that we don't handle properly, let alone all the supplementary-plane stuff.)

(If it weren't for Android, maybe we could just use ICU instead of having to maintain that ancient code....)
This is a quick-'n'-dirty proof of concept that adds some basic surrogate support to the line-breaker, and makes the example here behave as expected. For a real patch, we should probably import the properties from Unicode's LineBreak.txt data file into nsUnicodeProperties, and use that not just for non-BMP characters but also to replace the bulk of the existing GetClass function in nsJISx4051LineBreaker.cpp, just keeping any custom overrides that are needed for any special cases where we need to differ from the default Unicode line-breaking properties.
Summary: Unicode Emoji characters fail to wrap → Supplementary-plane Unicode characters (including Emoji) fail to line-break properly
If we use ICU for linebreaking of Unicode characters (in other words, non-ASCII characters), this should be gone. Makoto-san, what's the status?
Flags: needinfo?(masayuki) → needinfo?(m_kato)
(In reply to Masayuki Nakano [:masayuki] (Mozilla Japan) from comment #8)
> If we use ICU for linebreaking of Unicode characters (in other words,
> non-ASCII characters), this should be gone. Makoto-san, what's the status?

ICU line breaker spends more binary size (maybe additional 1-2 MB).  So I don't integrate it within the near feature.
(And Fennec team doesn't agree for integrating ICU yet....)
Flags: needinfo?(m_kato)
What we could try doing (without needing to add the full ICU BreakIterator code) is to implement the pair-based algorithm from UAX#14 (based on the LineBreak.txt data from Unicode, so it covers the full character set, unlike our current code). However, this has the potential to change behavior in various cases, and we'd probably want to tailor parts of the UAX#14 algorithm to suit Web content better. Altogether, a far-from-trivial change that I think we should consider in its own right, in a separate bug.

As a shorter-term fix here, though, we can fairly easily add some basic support for supplementary-plane characters to the existing code, without changing its overall behavior for everything it currently supports. This just requires an added check for surrogates, and a way to look up suitable break-class values for the supplementary codepoints that GetClass doesn't currently know anything about.

The simplest of all fixes would be to just treat all supplementary-plane characters as a single default class; but that's not quite good enough, because there are some important distinctions there. The emoji should be treated much like CJK ideographs, so that a long run of smiley-faces will wrap whenever it happens to hit the line edge; but that's not appropriate for the various supplementary-plane alphabets, or some of the other types of symbols.

So I propose a sort of "half-way" fix for now: add the LineBreak.txt data to our unicode properties (for platforms where we don't have ICU available, i.e. Android; where we do have ICU, we can simply ask it for the character properties), and then for characters that GetClass doesn't currently support, use a simple mapping from Unicode's LineBreak classes to the classes currently used in the nsJISx4051LineBreaker code. This will avoid disturbing current behavior for the characters where it has been carefully designed and tested, while adding support for the supplementary characters where we currently fail.
Here are a couple of simple reftests for supplementary-plane line-breaking; they currently fail because of our lack of support in nsJISx4051LineBreaker.
Attachment #8744283 - Flags: review?(masayuki)
Assignee: nobody → jfkthame
Status: NEW → ASSIGNED
This adds LineBreak.txt to the data available through nsUnicodeProperties; it is only included when ENABLE_INTL_API is disabled, so that we can't get the properties via ICU.
Attachment #8744284 - Flags: review?(masayuki)
And an accessor so we can get the UAX#14 linebreak class for any character, either from ICU or our own data.
Attachment #8744286 - Flags: review?(masayuki)
This adds basic surrogate support in the line-breaker, and makes the testcases here work properly.
Attachment #8744287 - Flags: review?(masayuki)
Attachment #8744283 - Flags: review?(masayuki) → review+
Comment on attachment 8744287 [details] [diff] [review]
patch 3 - Add surrogate support to nsJISx4051LineBreaker so that supplementary-plane characters get correct line-breaking behavior

>+      if (l <= 0xbf) { // Hangul Compatibility Jamo, Bopomofo, Kanbun
>                       // XXX: This is per UAX #14, but UAX #14 may change
>                       // the line breaking rules about Kanbun and Bopomofo.

nit: Why don't you indent these two lines?

>+  // Mapping for Unicode LineBreak.txt classes to the (simplified) set of
>+  // character classes used here.

Let me check this later...

>-  char16_t GetPreviousNonHyphenCharacter() const {
>+  uint32_t GetPreviousNonHyphenCharacter() const {

Why don't you use char32_t here?

>     return mPreviousNonHyphenCharacter;
>   }
>-  void NotifyNonHyphenCharacter(char16_t ch) {
>+  void NotifyNonHyphenCharacter(uint32_t ch) {

and here...

>-  char16_t mPreviousNonHyphenCharacter; // The last character we have seen
>+  uint32_t mPreviousNonHyphenCharacter; // The last character we have seen

and here...

>@@ -825,17 +874,22 @@ nsJISx4051LineBreaker::GetJISx4051Breaks
>                                          uint8_t aWordBreak,
>                                          uint8_t* aBreakBefore)
> {
>   uint32_t cur;
>   int8_t lastClass = CLASS_NONE;
>   ContextState state(aChars, aLength);
> 
>   for (cur = 0; cur < aLength; ++cur, state.AdvanceIndex()) {
>-    char16_t ch = aChars[cur];
>+    uint32_t ch = aChars[cur];

And here.

>+    if (ch > 0xffff) {
>+      ++cur;
>+      aBreakBefore[cur] = false;
>+      state.AdvanceIndex();
>+    }

Could you add a comment like "Mark before the following low surrogate character isn't breakable and skip it" or something.
Attachment #8744286 - Flags: review?(masayuki) → review+
Attachment #8744284 - Flags: review?(masayuki) → review+
Comment on attachment 8744287 [details] [diff] [review]
patch 3 - Add surrogate support to nsJISx4051LineBreaker so that supplementary-plane characters get correct line-breaking behavior

My understanding is, this is new map when the old code class is CLASS_CHARACTER or CLASS_BREAKABLE (for CJK characters), right?

>+  // Mapping for Unicode LineBreak.txt classes to the (simplified) set of
>+  // character classes used here.
>+  // XXX The mappings here were derived by comparing the Unicode LineBreak
>+  //     values of BMP characters to the classes our existing GetClass returns
>+  //     for the same codepoints; in cases where characters with the same
>+  //     LineBreak class mapped to various classes here, I picked what seemed
>+  //     the most prevalent equivalence.
>+  //     Some of these are unclear to me, but currently they are ONLY used
>+  //     for characters not handled by the old code above, so all the JISx405
>+  //     special cases should already be accounted for.
>+  static const int8_t sUnicodeLineBreakToClass[] = {
>+    /* UNKNOWN = 0,                       [XX] */ CLASS_CHARACTER,
>+    /* AMBIGUOUS = 1,                     [AI] */ CLASS_CHARACTER,
>+    /* ALPHABETIC = 2,                    [AL] */ CLASS_CHARACTER,
>+    /* BREAK_BOTH = 3,                    [B2] */ CLASS_CHARACTER,

CLASS_BREAKABLE? But I think that keeping to use CLASS_CHARACTER in this bug is better. We should *try* to change later and separately.

>+    /* BREAK_AFTER = 4,                   [BA] */ CLASS_CHARACTER,

CLASS_CLOSE_LIKE_CHARACTER? But same above.

>+    /* BREAK_BEFORE = 5,                  [BB] */ CLASS_OPEN_LIKE_CHARACTER, // ???

I think so.

>+    /* MANDATORY_BREAK = 6,               [BK] */ CLASS_CHARACTER,

CLASS_BREAKABLE? But same above.

>+    /* CONTINGENT_BREAK = 7,              [CB] */ CLASS_CHARACTER,
>+    /* CLOSE_PUNCTUATION = 8,             [CL] */ CLASS_CHARACTER,

CLASS_CLOSE_LIKE_CHARACTER? But same above.

>+    /* COMBINING_MARK = 9,                [CM] */ CLASS_CHARACTER,
>+    /* CARRIAGE_RETURN = 10,              [CR] */ CLASS_BREAKABLE,
>+    /* EXCLAMATION = 11,                  [EX] */ CLASS_CHARACTER,

CLASS_CLOSE_LIKE_CHARACTER? But same above.

>+    /* GLUE = 12,                         [GL] */ CLASS_NON_BREAKABLE,
>+    /* HYPHEN = 13,                       [HY] */ CLASS_CHARACTER,
>+    /* IDEOGRAPHIC = 14,                  [ID] */ CLASS_BREAKABLE,
>+    /* INSEPARABLE = 15,                  [IN] */ CLASS_CLOSE_LIKE_CHARACTER,
>+    /* INFIX_NUMERIC = 16,                [IS] */ CLASS_CHARACTER,
>+    /* LINE_FEED = 17,                    [LF] */ CLASS_BREAKABLE,
>+    /* NONSTARTER = 18,                   [NS] */ CLASS_CLOSE,

I guess that CLASS_CLOSE_LIKE_CHARACTER is safer for now if this is not the class of emoji characters.

>+    /* NUMERIC = 19,                      [NU] */ CLASS_CHARACTER,
>+    /* OPEN_PUNCTUATION = 20,             [OP] */ CLASS_CHARACTER,
>+    /* POSTFIX_NUMERIC = 21,              [PO] */ CLASS_CHARACTER,
>+    /* PREFIX_NUMERIC = 22,               [PR] */ CLASS_CHARACTER,
>+    /* QUOTATION = 23,                    [QU] */ CLASS_CHARACTER,
>+    /* COMPLEX_CONTEXT = 24,              [SA] */ CLASS_CHARACTER,
>+    /* SURROGATE = 25,                    [SG] */ CLASS_CHARACTER,
>+    /* SPACE = 26,                        [SP] */ CLASS_BREAKABLE,
>+    /* BREAK_SYMBOLS = 27,                [SY] */ CLASS_CHARACTER,

CLASS_CLOSE? CLASS_CLOSE_LIKE_CHARACTER? Not sure...

>+    /* ZWSPACE = 28,                      [ZW] */ CLASS_BREAKABLE,
>+    /* NEXT_LINE = 29,                    [NL] */ CLASS_CHARACTER,

CLASS_CLOSE? But try later.

>+    /* WORD_JOINER = 30,                  [WJ] */ CLASS_NON_BREAKABLE,
>+    /* H2 = 31,                           [H2] */ CLASS_BREAKABLE,
>+    /* H3 = 32,                           [H3] */ CLASS_BREAKABLE,
>+    /* JL = 33,                           [JL] */ CLASS_CHARACTER,
>+    /* JT = 34,                           [JT] */ CLASS_CHARACTER,
>+    /* JV = 35,                           [JV] */ CLASS_CHARACTER,
>+    /* CLOSE_PARENTHESIS = 36,            [CP] */ CLASS_CLOSE_LIKE_CHARACTER,
>+    /* CONDITIONAL_JAPANESE_STARTER = 37, [CJ] */ CLASS_CLOSE,

If this is applied only for Japanese characters, it must be safe because basically Japanese text is breakable in any points.

>+    /* HEBREW_LETTER = 38,                [HL] */ CLASS_CHARACTER,
>+    /* REGIONAL_INDICATOR = 39,           [RI] */ CLASS_CHARACTER
>+  };
>+
>+  return sUnicodeLineBreakToClass[mozilla::unicode::GetLineBreakClass(u)];


If we should try to change above suggestions, we should try it in another bug and land them separately for easier to check regression range. So, for this bug, I think that NONSTARTER case should be changed for safe.
Attachment #8744287 - Flags: review?(masayuki) → review+
(In reply to Masayuki Nakano [:masayuki] (Mozilla Japan) (offline: 4/29-5/8) from comment #16)
> >-  char16_t GetPreviousNonHyphenCharacter() const {
> >+  uint32_t GetPreviousNonHyphenCharacter() const {
> 
> Why don't you use char32_t here?

Habit, mainly. :)

But on checking the gecko codebase, I don't think we use char32_t for this anywhere else; the only occurrences I see are under a couple of AOSP-derived /media/... directories. Throughout intl, gfx, layout, dom, etc., whenever we need to handle a Unicode scalar value (rather than just a utf16 code unit), we simply use uint32_t.

So for consistency with the existing code, I'm inclined to leave this unchanged for now.
> So for consistency with the existing code, I'm inclined to leave this unchanged for now.

I don't like the current style, but sounds reasonable for using uint32_t for it.
(In reply to Masayuki Nakano [:masayuki] (Mozilla Japan) (offline: 4/29-5/8) from comment #17)
> If we should try to change above suggestions, we should try it in another
> bug and land them separately for easier to check regression range. So, for
> this bug, I think that NONSTARTER case should be changed for safe.

OK, I'll change that one. I agree it may be possible to adjust some of the others, but it's better to do that separately. I also wonder if we can remove some of the existing ad hoc if-then code and tables, and rely more widely on mapping the unicode classes; but again, I didn't want to add that risk to this bug.
https://hg.mozilla.org/integration/mozilla-inbound/rev/387681d67d20df7ba607e022b49810686fb9dd5e
Bug 1265631 - Reftests for line-breaking with supplementary-plane characters. r=masayuki

https://hg.mozilla.org/integration/mozilla-inbound/rev/77509a99fab42bf659801a9319b94463e6639b35
Bug 1265631 - patch 1 - Add line-break data from Unicode's LineBreak.txt to nsUnicodeProperties for builds without ICU. r=masayuki

https://hg.mozilla.org/integration/mozilla-inbound/rev/769a374b634718a3e8d05dcf0f392e2bae782312
Bug 1265631 - patch 2 - Implement GetLineBreakClass() accessor to get Unicode line-break class from ICU or nsUnicodeProperties data. r=masayuki

https://hg.mozilla.org/integration/mozilla-inbound/rev/f6ac8d1b4243d84078c3daa7e30cce393f950557
Bug 1265631 - patch 3 - Add surrogate support to nsJISx4051LineBreaker so that supplementary-plane characters get correct line-breaking behavior. r=masayuki
Argh, sorry - I landed a stale version of "patch 3" that didn't include the last couple of adjustments to address review comments. I'll push a quick followup to tidy up...
Depends on: 1403656
No longer depends on: 1403656
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: