Closed Bug 1342686 Opened 7 years ago Closed 7 years ago

line-breaking code does not handle supplementary-plane characters properly

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla54

Tracking Flags:

Tracking

Status

firefox54

---

fixed

People

(Reporter: jfkthame, Assigned: jfkthame)

Details

Attachments

(2 files, 3 obsolete files)

Reftests for line-breaking with text that includes surrogate pairs 7 years ago Jonathan Kew [:jfkthame] 6.70 KB, patch	emk : review+	Details \| Diff \| Splinter Review
Fix up the handling of surrogates in nsJISx4051LineBreaker 7 years ago Jonathan Kew [:jfkthame] 10.37 KB, patch	emk : review-	Details \| Diff \| Splinter Review
Fix up the handling of surrogates in nsJISx4051LineBreaker 7 years ago Jonathan Kew [:jfkthame] 14.55 KB, patch		Details \| Diff \| Splinter Review
Fix up the handling of surrogates in nsJISx4051LineBreaker 7 years ago Jonathan Kew [:jfkthame] 14.56 KB, patch		Details \| Diff \| Splinter Review
Fix up the handling of surrogates in nsJISx4051LineBreaker 7 years ago Jonathan Kew [:jfkthame] 15.40 KB, patch	emk : review+	Details \| Diff \| Splinter Review

Jonathan Kew [:jfkthame]

Assignee

Description

•

7 years ago

In the course of looking at bug 809020, I noticed that the code in nsJISx4051LineBreaker operates on buffers of char16_t codepoints without paying sufficient attention to surrogate pairs.

In bug 1265631 we went part of the way, but key parts of the code (in particular, the ContextualAnalysis function used when we encounter characters such as hyphen, open-quote, and others, and the Init function that scans for the presence of CJK characters) still lack surrogate support.

Jonathan Kew [:jfkthame]

Assignee

Comment 1

•

7 years ago

Attached patch Reftests for line-breaking with text that includes surrogate pairs — Details — Splinter Review

Some simple reftests (currently failing) for the behavior of ContextualAnalysis in line-breaking when surrogate pairs are present.

Attachment #8841253 - Flags: review?(VYV03354)

Jonathan Kew [:jfkthame]

Assignee

Comment 2

•

7 years ago

Attached patch Fix up the handling of surrogates in nsJISx4051LineBreaker (obsolete) — Details — Splinter Review

With the fixes here, the tests above should all pass as expected.

Attachment #8841254 - Flags: review?(VYV03354)

Masatoshi Kimura [:emk]

Comment 3

•

7 years ago

Comment on attachment 8841254 [details] [diff] [review]
Fix up the handling of surrogates in nsJISx4051LineBreaker

Review of attachment 8841254 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/lwbrk/nsJISx4051LineBreaker.cpp
@@ +380,5 @@
>    return ((0xff66 <= (u)) && ((u) <= 0xff70));
>  }
>  
>  static inline int
> +IS_CJK_CHAR(uint32_t u)

char32_t? (applies throughout of the patch)

@@ +732,5 @@
> +          break;
> +        }
> +      } else if (mUniText && !mHasCJKChar) {
> +        if (NS_IS_HIGH_SURROGATE(u) && i + 1 < mLength &&
> +            NS_IS_LOW_SURROGATE(GetCharAt(i + 1))) {

You can directly access mUniText to make this faster.
If the performance impact is neligible, could you add GetCodePointAt to make the code more readable?

@@ +992,5 @@
> +        if (NS_IS_HIGH_SURROGATE(c) && end + 1 < aLength &&
> +            NS_IS_LOW_SURROGATE(aChars[end + 1])) {
> +          c = SURROGATE_TO_UCS4(c, aChars[end + 1]);
> +        }
> +        if (CLASS_COMPLEX != GetClass(aChars[end])) {

GetClass(c)? And could you add a unit test to catch this error?
(Also this proves that GetCodePointAt would be useful.)

Attachment #8841254 - Flags: review?(VYV03354) → review-

Masatoshi Kimura [:emk]

Updated

•

7 years ago

Attachment #8841253 - Flags: review?(VYV03354) → review+

Jonathan Kew [:jfkthame]

Assignee

Comment 4

•

7 years ago

(In reply to Masatoshi Kimura [:emk] from comment #3)

> @@ +732,5 @@
> > +          break;
> > +        }
> > +      } else if (mUniText && !mHasCJKChar) {
> > +        if (NS_IS_HIGH_SURROGATE(u) && i + 1 < mLength &&
> > +            NS_IS_LOW_SURROGATE(GetCharAt(i + 1))) {
> 
> You can directly access mUniText to make this faster.
> If the performance impact is neligible, could you add GetCodePointAt to make
> the code more readable?

Will do. Actually, what I'd like to do is rename GetCharAt (which is painfully ambiguous -- does "Char" mean a code unit in the 8- or 16-bit buffer, or does it mean a Unicode character?) to the more explicit GetCodeUnitAt, which always gets a single char16_t code unit, and add GetUnicodeCharAt to do the more complex thing (only needed for 16-bit) that handles surrogates and returns a char32_t Unicode value.

(I'd prefer that name over GetCodePointAt because I suspect many people may be unclear about the terminology: "code unit" vs "code point" is an important technical distinction in Unicode encoding forms, but how widely is that understood? I think "code unit" vs "Unicode char" is probably clearer.)

> 
> @@ +992,5 @@
> > +        if (NS_IS_HIGH_SURROGATE(c) && end + 1 < aLength &&
> > +            NS_IS_LOW_SURROGATE(aChars[end + 1])) {
> > +          c = SURROGATE_TO_UCS4(c, aChars[end + 1]);
> > +        }
> > +        if (CLASS_COMPLEX != GetClass(aChars[end])) {
> 
> GetClass(c)? And could you add a unit test to catch this error?
> (Also this proves that GetCodePointAt would be useful.)

Oops, thanks for catching that!

Actually, I don't think this currently makes any difference (and so it cannot readily be tested), because we don't have any non-BMP characters for which GetClass would return CLASS_COMPLEX. So in theory we could skip the surrogate check here, and just rely on GetClass for an individual surrogate not returning COMPLEX, but that seems like an ugly hack.

Jonathan Kew [:jfkthame]

Assignee

Comment 5

•

7 years ago

Attached patch Fix up the handling of surrogates in nsJISx4051LineBreaker (obsolete) — Details — Splinter Review

Attachment #8841548 - Flags: review?(VYV03354)

Jonathan Kew [:jfkthame]

Assignee

Updated

•

7 years ago

Attachment #8841254 - Attachment is obsolete: true

Jonathan Kew [:jfkthame]

Assignee

Updated

•

7 years ago

Assignee: nobody → jfkthame

Status: NEW → ASSIGNED

Jonathan Kew [:jfkthame]

Assignee

Comment 6

•

7 years ago

Attached patch Fix up the handling of surrogates in nsJISx4051LineBreaker (obsolete) — Details — Splinter Review

Rebased on top of updated patch in bug 809020 (sorry for the noise).

Attachment #8841640 - Flags: review?(VYV03354)

Jonathan Kew [:jfkthame]

Assignee

Updated

•

7 years ago

Attachment #8841548 - Attachment is obsolete: true

Attachment #8841548 - Flags: review?(VYV03354)

Masatoshi Kimura [:emk]

Comment 7

•

7 years ago

Comment on attachment 8841640 [details] [diff] [review]
Fix up the handling of surrogates in nsJISx4051LineBreaker

Review of attachment 8841640 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/lwbrk/nsJISx4051LineBreaker.cpp
@@ +733,4 @@
>      mLastBreakIndex = 0;
>      mPreviousNonHyphenCharacter = U_NULL;
>      mHasCJKChar = 0;
>      mHasNonbreakableSpace = 0;

Let's change 1/0 to true/false while we are at it.

@@ +951,5 @@
>    int8_t lastClass = CLASS_NONE;
>    ContextState state(aChars, aLength);
>  
>    for (cur = 0; cur < aLength; ++cur, state.AdvanceIndex()) {
> +    char32_t ch = state.GetUnicodeCharAt(cur);

If it guaranteed that aChars == mUniText? If so, why aChars is required at all? If not, shouldn't GetUnicodeCharAt take a char16_t buffer as a parameter?

@@ +1002,3 @@
>  
> +      while (end < aLength) {
> +        char32_t c = state.GetUnicodeCharAt(end);

Same here.

Jonathan Kew [:jfkthame]

Assignee

Comment 8

•

7 years ago

(In reply to Masatoshi Kimura [:emk] from comment #7)
> @@ +951,5 @@
> >    int8_t lastClass = CLASS_NONE;
> >    ContextState state(aChars, aLength);
> >  
> >    for (cur = 0; cur < aLength; ++cur, state.AdvanceIndex()) {
> > +    char32_t ch = state.GetUnicodeCharAt(cur);
> 
> If it guaranteed that aChars == mUniText?

Yes, in the char16_t* version of this method, `state` is initialized with aChars and mUniText is never changed thereafter. 

I'll make mUniText (and mText) const members, to emphasize that they're never changed once the ContextState object is initialized.

> If so, why aChars is required at
> all?

It's required because it's the input param to GetJISx4051Breaks, used to initialize the ContextState. After that, it's not strictly needed as the state has cached it, though it can still be used for convenient access to the input code units when we don't need the surrogate processing that GetUnicodeCharAt provides.

Jonathan Kew [:jfkthame]

Assignee

Comment 9

•

7 years ago

Attached patch Fix up the handling of surrogates in nsJISx4051LineBreaker — Details — Splinter Review

Attachment #8841912 - Flags: review?(VYV03354)

Jonathan Kew [:jfkthame]

Assignee

Updated

•

7 years ago

Attachment #8841640 - Attachment is obsolete: true

Attachment #8841640 - Flags: review?(VYV03354)

Masatoshi Kimura [:emk]

Comment 10

•

7 years ago

Comment on attachment 8841912 [details] [diff] [review]
Fix up the handling of surrogates in nsJISx4051LineBreaker

Review of attachment 8841912 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the explanation.

::: intl/lwbrk/nsJISx4051LineBreaker.cpp
@@ +770,5 @@
>      }
>    }
>  
> +  const char16_t* const mUniText;
> +  const uint8_t* const mText;

Nice constification :)

Attachment #8841912 - Flags: review?(VYV03354) → review+

Jonathan Kew [:jfkthame]

Assignee

Comment 11

•

7 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/84e809d7bf0b6da8cdbb02475cfbb2b97377c836
Bug 1342686 - Reftests for line-breaking with text that includes surrogate pairs. r=emk

https://hg.mozilla.org/integration/mozilla-inbound/rev/f7b0f3b271228e2f33e0f6ce641060862012f07d
Bug 1342686 - Fix up the handling of surrogates in nsJISx4051LineBreaker. r=emk

Carsten Book [:Tomcat]

Comment 12

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/84e809d7bf0b
https://hg.mozilla.org/mozilla-central/rev/f7b0f3b27122

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

status-firefox54: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla54

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Quick Search

line-breaking code does not handle supplementary-plane characters properly

Categories

(Core :: Layout: Text and Fonts, defect)

Tracking

()

People

(Reporter: jfkthame, Assigned: jfkthame)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(2 files, 3 obsolete files)

Description

Comment 1

Comment 2

Comment 3

Updated

Comment 4

Comment 5

Updated

Updated

Comment 6

Updated

Comment 7

Comment 8

Comment 9

Updated

Comment 10

Comment 11

Comment 12

Attachment

General

Description

File Name

Content Type