Closed Bug 1265631 Opened 8 years ago Closed 8 years ago

Supplementary-plane Unicode characters (including Emoji) fail to line-break properly

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla49

Tracking Flags:

Tracking

Status

firefox49

---

fixed

People

(Reporter: rez, Assigned: jfkthame)

References

(
URL
)

Details

Attachments

(6 files)

reporter's testcase 8 years ago David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla) 611 bytes, text/html		Details
hack to handle surrogate pairs in line breaking 8 years ago Jonathan Kew [:jfkthame] 4.87 KB, patch		Details \| Diff \| Splinter Review
Reftests for line-breaking with supplementary-plane characters 8 years ago Jonathan Kew [:jfkthame] 5.68 KB, patch	masayuki : review+	Details \| Diff \| Splinter Review
patch 1 - Add line-break data from Unicode's LineBreak.txt to nsUnicodeProperties for builds without ICU 8 years ago Jonathan Kew [:jfkthame] 898.09 KB, patch	masayuki : review+	Details \| Diff \| Splinter Review
patch 2 - Implement GetLineBreakClass() accessor to get Unicode line-break class from ICU or nsUnicodeProperties data 8 years ago Jonathan Kew [:jfkthame] 1.92 KB, patch	masayuki : review+	Details \| Diff \| Splinter Review
patch 3 - Add surrogate support to nsJISx4051LineBreaker so that supplementary-plane characters get correct line-breaking behavior 8 years ago Jonathan Kew [:jfkthame] 14.77 KB, patch	masayuki : review+	Details \| Diff \| Splinter Review

Chris Lienert

Reporter

Description

•

8 years ago

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Firefox/45.0
Build ID: 20160407164938

Steps to reproduce:

Uninterrupted emoji characters fail to wrap correctly within a confined space. Demo: http://codepen.io/cliener/pen/BKVNVv


Actual results:

Actual rendering:
XXXXXXXXXXXXXXXXXXXXXXXX


Expected results:

Expecting to see:
XXXXXXXX
XXXXXXXX
XXXXXXXX

Chris Lienert

Reporter

Updated

•

8 years ago

URL: http://codepen.io/cliener/pen/BKVNVv

Summary: Emoji characters fail to wrap → HTML Emoji characters fail to wrap

Kohei Yoshino [:kohei]

Updated

•

8 years ago

Component: Untriaged → Layout: Text

Product: Firefox → Core

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 1

•

8 years ago

Attached file reporter's testcase — Details

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 2

•

8 years ago

At the very least, Chrome does wrap here.  I didn't check Edge.

I'm somewhat curious what http://www.unicode.org/reports/tr14/ says.

(And also surprised this hasn't previously been reported.)


The other question is whether there should be priorities involved, e.g., whether text should break as:

  hello

Flags: needinfo?(masayuki)

Summary: HTML Emoji characters fail to wrap → Unicode Emoji characters fail to wrap

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 3

•

8 years ago

Oh, right. ******* BUGZILLA (bug 405011).  Here's the rest of my comment, with the Emoji replaced by X's:


  hello XXX
  XX hello

(giving the spaces higher priority as a break point) or as:

  hello
  XXX
  hello

(treating the emoji as just-as-breakable as spaces).

(I also admit I don't remember what we do for similar cases with Chinese-within-English.)

Jonathan Kew [:jfkthame]

Assignee

Comment 4

•

8 years ago

(In reply to David Baron [:dbaron] ⌚️UTC-7 from comment #2)
> At the very least, Chrome does wrap here.  I didn't check Edge.
> 
> I'm somewhat curious what http://www.unicode.org/reports/tr14/ says.

TR14 doesn't directly talk about emoji, afaics, but most of them are given line-break class ID in the LineBreak.txt data file, which means they ought to behave similarly to CJK ideographs. And apparently they don't, so it does look like we have a bug here.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Jonathan Kew [:jfkthame]

Assignee

Comment 5

•

8 years ago

We seem to have the same problem with a sequence of plane-2 ideographs, too. Which suggests this is probably a lack of proper surrogate-pair handling somewhere in the line-breaker code.

Jonathan Kew [:jfkthame]

Assignee

Comment 6

•

8 years ago

Indeed, the code in nsJISx4051LineBreaker.cpp looks like it works purely on 16-bit code units and has no awareness of surrogates. (And the line-break class tables it uses haven't been updated lately, afaict, so there are probably a scattering of more recently-encoded BMP characters that we don't handle properly, let alone all the supplementary-plane stuff.)

(If it weren't for Android, maybe we could just use ICU instead of having to maintain that ancient code....)

Jonathan Kew [:jfkthame]

Assignee

Comment 7

•

8 years ago

Attached patch hack to handle surrogate pairs in line breaking — Details — Splinter Review

This is a quick-'n'-dirty proof of concept that adds some basic surrogate support to the line-breaker, and makes the example here behave as expected. For a real patch, we should probably import the properties from Unicode's LineBreak.txt data file into nsUnicodeProperties, and use that not just for non-BMP characters but also to replace the bulk of the existing GetClass function in nsJISx4051LineBreaker.cpp, just keeping any custom overrides that are needed for any special cases where we need to differ from the default Unicode line-breaking properties.

Jonathan Kew [:jfkthame]

Assignee

Updated

•

8 years ago

Summary: Unicode Emoji characters fail to wrap → Supplementary-plane Unicode characters (including Emoji) fail to line-break properly

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 8

•

8 years ago

If we use ICU for linebreaking of Unicode characters (in other words, non-ASCII characters), this should be gone. Makoto-san, what's the status?

Flags: needinfo?(masayuki) → needinfo?(m_kato)

Makoto Kato [:m_kato]

Comment 9

•

8 years ago

(In reply to Masayuki Nakano [:masayuki] (Mozilla Japan) from comment #8)
> If we use ICU for linebreaking of Unicode characters (in other words,
> non-ASCII characters), this should be gone. Makoto-san, what's the status?

ICU line breaker spends more binary size (maybe additional 1-2 MB).  So I don't integrate it within the near feature.
(And Fennec team doesn't agree for integrating ICU yet....)

Flags: needinfo?(m_kato)

Jonathan Kew [:jfkthame]

Assignee

Comment 10

•

8 years ago

What we could try doing (without needing to add the full ICU BreakIterator code) is to implement the pair-based algorithm from UAX#14 (based on the LineBreak.txt data from Unicode, so it covers the full character set, unlike our current code). However, this has the potential to change behavior in various cases, and we'd probably want to tailor parts of the UAX#14 algorithm to suit Web content better. Altogether, a far-from-trivial change that I think we should consider in its own right, in a separate bug.

As a shorter-term fix here, though, we can fairly easily add some basic support for supplementary-plane characters to the existing code, without changing its overall behavior for everything it currently supports. This just requires an added check for surrogates, and a way to look up suitable break-class values for the supplementary codepoints that GetClass doesn't currently know anything about.

The simplest of all fixes would be to just treat all supplementary-plane characters as a single default class; but that's not quite good enough, because there are some important distinctions there. The emoji should be treated much like CJK ideographs, so that a long run of smiley-faces will wrap whenever it happens to hit the line edge; but that's not appropriate for the various supplementary-plane alphabets, or some of the other types of symbols.

So I propose a sort of "half-way" fix for now: add the LineBreak.txt data to our unicode properties (for platforms where we don't have ICU available, i.e. Android; where we do have ICU, we can simply ask it for the character properties), and then for characters that GetClass doesn't currently support, use a simple mapping from Unicode's LineBreak classes to the classes currently used in the nsJISx4051LineBreaker code. This will avoid disturbing current behavior for the characters where it has been carefully designed and tested, while adding support for the supplementary characters where we currently fail.

Jonathan Kew [:jfkthame]

Assignee

Comment 11

•

8 years ago

Attached patch Reftests for line-breaking with supplementary-plane characters — Details — Splinter Review

Here are a couple of simple reftests for supplementary-plane line-breaking; they currently fail because of our lack of support in nsJISx4051LineBreaker.

Attachment #8744283 - Flags: review?(masayuki)

Jonathan Kew [:jfkthame]

Assignee

Updated

•

8 years ago

Assignee: nobody → jfkthame

Status: NEW → ASSIGNED

Jonathan Kew [:jfkthame]

Assignee

Comment 12

•

8 years ago

Attached patch patch 1 - Add line-break data from Unicode's LineBreak.txt to nsUnicodeProperties for builds without ICU — Details — Splinter Review

This adds LineBreak.txt to the data available through nsUnicodeProperties; it is only included when ENABLE_INTL_API is disabled, so that we can't get the properties via ICU.

Attachment #8744284 - Flags: review?(masayuki)

Jonathan Kew [:jfkthame]

Assignee

Comment 13

•

8 years ago

Attached patch patch 2 - Implement GetLineBreakClass() accessor to get Unicode line-break class from ICU or nsUnicodeProperties data — Details — Splinter Review

And an accessor so we can get the UAX#14 linebreak class for any character, either from ICU or our own data.

Attachment #8744286 - Flags: review?(masayuki)

Jonathan Kew [:jfkthame]

Assignee

Comment 14

•

8 years ago

Attached patch patch 3 - Add surrogate support to nsJISx4051LineBreaker so that supplementary-plane characters get correct line-breaking behavior — Details — Splinter Review

This adds basic surrogate support in the line-breaker, and makes the testcases here work properly.

Attachment #8744287 - Flags: review?(masayuki)

Jonathan Kew [:jfkthame]

Assignee

Comment 15

•

8 years ago

Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1bc7226d7e83

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Updated

•

8 years ago

Attachment #8744283 - Flags: review?(masayuki) → review+

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 16

•

8 years ago

Comment on attachment 8744287 [details] [diff] [review]
patch 3 - Add surrogate support to nsJISx4051LineBreaker so that supplementary-plane characters get correct line-breaking behavior

>+      if (l <= 0xbf) { // Hangul Compatibility Jamo, Bopomofo, Kanbun
>                       // XXX: This is per UAX #14, but UAX #14 may change
>                       // the line breaking rules about Kanbun and Bopomofo.

nit: Why don't you indent these two lines?

>+  // Mapping for Unicode LineBreak.txt classes to the (simplified) set of
>+  // character classes used here.

Let me check this later...

>-  char16_t GetPreviousNonHyphenCharacter() const {
>+  uint32_t GetPreviousNonHyphenCharacter() const {

Why don't you use char32_t here?

>     return mPreviousNonHyphenCharacter;
>   }
>-  void NotifyNonHyphenCharacter(char16_t ch) {
>+  void NotifyNonHyphenCharacter(uint32_t ch) {

and here...

>-  char16_t mPreviousNonHyphenCharacter; // The last character we have seen
>+  uint32_t mPreviousNonHyphenCharacter; // The last character we have seen

and here...

>@@ -825,17 +874,22 @@ nsJISx4051LineBreaker::GetJISx4051Breaks
>                                          uint8_t aWordBreak,
>                                          uint8_t* aBreakBefore)
> {
>   uint32_t cur;
>   int8_t lastClass = CLASS_NONE;
>   ContextState state(aChars, aLength);
> 
>   for (cur = 0; cur < aLength; ++cur, state.AdvanceIndex()) {
>-    char16_t ch = aChars[cur];
>+    uint32_t ch = aChars[cur];

And here.

>+    if (ch > 0xffff) {
>+      ++cur;
>+      aBreakBefore[cur] = false;
>+      state.AdvanceIndex();
>+    }

Could you add a comment like "Mark before the following low surrogate character isn't breakable and skip it" or something.

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Updated

•

8 years ago

Attachment #8744286 - Flags: review?(masayuki) → review+

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Updated

•

8 years ago

Attachment #8744284 - Flags: review?(masayuki) → review+

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 17

•

8 years ago

Comment on attachment 8744287 [details] [diff] [review]
patch 3 - Add surrogate support to nsJISx4051LineBreaker so that supplementary-plane characters get correct line-breaking behavior

My understanding is, this is new map when the old code class is CLASS_CHARACTER or CLASS_BREAKABLE (for CJK characters), right?

>+  // Mapping for Unicode LineBreak.txt classes to the (simplified) set of
>+  // character classes used here.
>+  // XXX The mappings here were derived by comparing the Unicode LineBreak
>+  //     values of BMP characters to the classes our existing GetClass returns
>+  //     for the same codepoints; in cases where characters with the same
>+  //     LineBreak class mapped to various classes here, I picked what seemed
>+  //     the most prevalent equivalence.
>+  //     Some of these are unclear to me, but currently they are ONLY used
>+  //     for characters not handled by the old code above, so all the JISx405
>+  //     special cases should already be accounted for.
>+  static const int8_t sUnicodeLineBreakToClass[] = {
>+    /* UNKNOWN = 0,                       [XX] */ CLASS_CHARACTER,
>+    /* AMBIGUOUS = 1,                     [AI] */ CLASS_CHARACTER,
>+    /* ALPHABETIC = 2,                    [AL] */ CLASS_CHARACTER,
>+    /* BREAK_BOTH = 3,                    [B2] */ CLASS_CHARACTER,

CLASS_BREAKABLE? But I think that keeping to use CLASS_CHARACTER in this bug is better. We should *try* to change later and separately.

>+    /* BREAK_AFTER = 4,                   [BA] */ CLASS_CHARACTER,

CLASS_CLOSE_LIKE_CHARACTER? But same above.

>+    /* BREAK_BEFORE = 5,                  [BB] */ CLASS_OPEN_LIKE_CHARACTER, // ???

I think so.

>+    /* MANDATORY_BREAK = 6,               [BK] */ CLASS_CHARACTER,

CLASS_BREAKABLE? But same above.

>+    /* CONTINGENT_BREAK = 7,              [CB] */ CLASS_CHARACTER,
>+    /* CLOSE_PUNCTUATION = 8,             [CL] */ CLASS_CHARACTER,

CLASS_CLOSE_LIKE_CHARACTER? But same above.

>+    /* COMBINING_MARK = 9,                [CM] */ CLASS_CHARACTER,
>+    /* CARRIAGE_RETURN = 10,              [CR] */ CLASS_BREAKABLE,
>+    /* EXCLAMATION = 11,                  [EX] */ CLASS_CHARACTER,

CLASS_CLOSE_LIKE_CHARACTER? But same above.

>+    /* GLUE = 12,                         [GL] */ CLASS_NON_BREAKABLE,
>+    /* HYPHEN = 13,                       [HY] */ CLASS_CHARACTER,
>+    /* IDEOGRAPHIC = 14,                  [ID] */ CLASS_BREAKABLE,
>+    /* INSEPARABLE = 15,                  [IN] */ CLASS_CLOSE_LIKE_CHARACTER,
>+    /* INFIX_NUMERIC = 16,                [IS] */ CLASS_CHARACTER,
>+    /* LINE_FEED = 17,                    [LF] */ CLASS_BREAKABLE,
>+    /* NONSTARTER = 18,                   [NS] */ CLASS_CLOSE,

I guess that CLASS_CLOSE_LIKE_CHARACTER is safer for now if this is not the class of emoji characters.

>+    /* NUMERIC = 19,                      [NU] */ CLASS_CHARACTER,
>+    /* OPEN_PUNCTUATION = 20,             [OP] */ CLASS_CHARACTER,
>+    /* POSTFIX_NUMERIC = 21,              [PO] */ CLASS_CHARACTER,
>+    /* PREFIX_NUMERIC = 22,               [PR] */ CLASS_CHARACTER,
>+    /* QUOTATION = 23,                    [QU] */ CLASS_CHARACTER,
>+    /* COMPLEX_CONTEXT = 24,              [SA] */ CLASS_CHARACTER,
>+    /* SURROGATE = 25,                    [SG] */ CLASS_CHARACTER,
>+    /* SPACE = 26,                        [SP] */ CLASS_BREAKABLE,
>+    /* BREAK_SYMBOLS = 27,                [SY] */ CLASS_CHARACTER,

CLASS_CLOSE? CLASS_CLOSE_LIKE_CHARACTER? Not sure...

>+    /* ZWSPACE = 28,                      [ZW] */ CLASS_BREAKABLE,
>+    /* NEXT_LINE = 29,                    [NL] */ CLASS_CHARACTER,

CLASS_CLOSE? But try later.

>+    /* WORD_JOINER = 30,                  [WJ] */ CLASS_NON_BREAKABLE,
>+    /* H2 = 31,                           [H2] */ CLASS_BREAKABLE,
>+    /* H3 = 32,                           [H3] */ CLASS_BREAKABLE,
>+    /* JL = 33,                           [JL] */ CLASS_CHARACTER,
>+    /* JT = 34,                           [JT] */ CLASS_CHARACTER,
>+    /* JV = 35,                           [JV] */ CLASS_CHARACTER,
>+    /* CLOSE_PARENTHESIS = 36,            [CP] */ CLASS_CLOSE_LIKE_CHARACTER,
>+    /* CONDITIONAL_JAPANESE_STARTER = 37, [CJ] */ CLASS_CLOSE,

If this is applied only for Japanese characters, it must be safe because basically Japanese text is breakable in any points.

>+    /* HEBREW_LETTER = 38,                [HL] */ CLASS_CHARACTER,
>+    /* REGIONAL_INDICATOR = 39,           [RI] */ CLASS_CHARACTER
>+  };
>+
>+  return sUnicodeLineBreakToClass[mozilla::unicode::GetLineBreakClass(u)];


If we should try to change above suggestions, we should try it in another bug and land them separately for easier to check regression range. So, for this bug, I think that NONSTARTER case should be changed for safe.

Attachment #8744287 - Flags: review?(masayuki) → review+

Jonathan Kew [:jfkthame]

Assignee

Comment 18

•

8 years ago

(In reply to Masayuki Nakano [:masayuki] (Mozilla Japan) (offline: 4/29-5/8) from comment #16)
> >-  char16_t GetPreviousNonHyphenCharacter() const {
> >+  uint32_t GetPreviousNonHyphenCharacter() const {
> 
> Why don't you use char32_t here?

Habit, mainly. :)

But on checking the gecko codebase, I don't think we use char32_t for this anywhere else; the only occurrences I see are under a couple of AOSP-derived /media/... directories. Throughout intl, gfx, layout, dom, etc., whenever we need to handle a Unicode scalar value (rather than just a utf16 code unit), we simply use uint32_t.

So for consistency with the existing code, I'm inclined to leave this unchanged for now.

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 19

•

8 years ago

> So for consistency with the existing code, I'm inclined to leave this unchanged for now.

I don't like the current style, but sounds reasonable for using uint32_t for it.

Jonathan Kew [:jfkthame]

Assignee

Comment 20

•

8 years ago

(In reply to Masayuki Nakano [:masayuki] (Mozilla Japan) (offline: 4/29-5/8) from comment #17)
> If we should try to change above suggestions, we should try it in another
> bug and land them separately for easier to check regression range. So, for
> this bug, I think that NONSTARTER case should be changed for safe.

OK, I'll change that one. I agree it may be possible to adjust some of the others, but it's better to do that separately. I also wonder if we can remove some of the existing ad hoc if-then code and tables, and rely more widely on mapping the unicode classes; but again, I didn't want to add that risk to this bug.

Jonathan Kew [:jfkthame]

Assignee

Comment 21

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/387681d67d20df7ba607e022b49810686fb9dd5e
Bug 1265631 - Reftests for line-breaking with supplementary-plane characters. r=masayuki

https://hg.mozilla.org/integration/mozilla-inbound/rev/77509a99fab42bf659801a9319b94463e6639b35
Bug 1265631 - patch 1 - Add line-break data from Unicode's LineBreak.txt to nsUnicodeProperties for builds without ICU. r=masayuki

https://hg.mozilla.org/integration/mozilla-inbound/rev/769a374b634718a3e8d05dcf0f392e2bae782312
Bug 1265631 - patch 2 - Implement GetLineBreakClass() accessor to get Unicode line-break class from ICU or nsUnicodeProperties data. r=masayuki

https://hg.mozilla.org/integration/mozilla-inbound/rev/f6ac8d1b4243d84078c3daa7e30cce393f950557
Bug 1265631 - patch 3 - Add surrogate support to nsJISx4051LineBreaker so that supplementary-plane characters get correct line-breaking behavior. r=masayuki

Jonathan Kew [:jfkthame]

Assignee

Comment 22

•

8 years ago

Argh, sorry - I landed a stale version of "patch 3" that didn't include the last couple of adjustments to address review comments. I'll push a quick followup to tidy up...

Pulsebot

Comment 23

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/90bd0f479ed7

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 24

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/387681d67d20
https://hg.mozilla.org/mozilla-central/rev/77509a99fab4
https://hg.mozilla.org/mozilla-central/rev/769a374b6347
https://hg.mozilla.org/mozilla-central/rev/f6ac8d1b4243
https://hg.mozilla.org/mozilla-central/rev/90bd0f479ed7

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

status-firefox49: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla49

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

Depends on: 1403656

Jonathan Kew [:jfkthame]

Assignee

Updated

•

7 years ago

No longer depends on: 1403656

You need to log in before you can comment on or make changes to this bug.