Closed Bug 1342686 Opened 7 years ago Closed 7 years ago

line-breaking code does not handle supplementary-plane characters properly

Categories

(Core :: Layout: Text and Fonts, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- fixed

People

(Reporter: jfkthame, Assigned: jfkthame)

Details

Attachments

(2 files, 3 obsolete files)

In the course of looking at bug 809020, I noticed that the code in nsJISx4051LineBreaker operates on buffers of char16_t codepoints without paying sufficient attention to surrogate pairs.

In bug 1265631 we went part of the way, but key parts of the code (in particular, the ContextualAnalysis function used when we encounter characters such as hyphen, open-quote, and others, and the Init function that scans for the presence of CJK characters) still lack surrogate support.
Some simple reftests (currently failing) for the behavior of ContextualAnalysis in line-breaking when surrogate pairs are present.
Attachment #8841253 - Flags: review?(VYV03354)
With the fixes here, the tests above should all pass as expected.
Attachment #8841254 - Flags: review?(VYV03354)
Comment on attachment 8841254 [details] [diff] [review]
Fix up the handling of surrogates in nsJISx4051LineBreaker

Review of attachment 8841254 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/lwbrk/nsJISx4051LineBreaker.cpp
@@ +380,5 @@
>    return ((0xff66 <= (u)) && ((u) <= 0xff70));
>  }
>  
>  static inline int
> +IS_CJK_CHAR(uint32_t u)

char32_t? (applies throughout of the patch)

@@ +732,5 @@
> +          break;
> +        }
> +      } else if (mUniText && !mHasCJKChar) {
> +        if (NS_IS_HIGH_SURROGATE(u) && i + 1 < mLength &&
> +            NS_IS_LOW_SURROGATE(GetCharAt(i + 1))) {

You can directly access mUniText to make this faster.
If the performance impact is neligible, could you add GetCodePointAt to make the code more readable?

@@ +992,5 @@
> +        if (NS_IS_HIGH_SURROGATE(c) && end + 1 < aLength &&
> +            NS_IS_LOW_SURROGATE(aChars[end + 1])) {
> +          c = SURROGATE_TO_UCS4(c, aChars[end + 1]);
> +        }
> +        if (CLASS_COMPLEX != GetClass(aChars[end])) {

GetClass(c)? And could you add a unit test to catch this error?
(Also this proves that GetCodePointAt would be useful.)
Attachment #8841254 - Flags: review?(VYV03354) → review-
Attachment #8841253 - Flags: review?(VYV03354) → review+
(In reply to Masatoshi Kimura [:emk] from comment #3)

> @@ +732,5 @@
> > +          break;
> > +        }
> > +      } else if (mUniText && !mHasCJKChar) {
> > +        if (NS_IS_HIGH_SURROGATE(u) && i + 1 < mLength &&
> > +            NS_IS_LOW_SURROGATE(GetCharAt(i + 1))) {
> 
> You can directly access mUniText to make this faster.
> If the performance impact is neligible, could you add GetCodePointAt to make
> the code more readable?

Will do. Actually, what I'd like to do is rename GetCharAt (which is painfully ambiguous -- does "Char" mean a code unit in the 8- or 16-bit buffer, or does it mean a Unicode character?) to the more explicit GetCodeUnitAt, which always gets a single char16_t code unit, and add GetUnicodeCharAt to do the more complex thing (only needed for 16-bit) that handles surrogates and returns a char32_t Unicode value.

(I'd prefer that name over GetCodePointAt because I suspect many people may be unclear about the terminology: "code unit" vs "code point" is an important technical distinction in Unicode encoding forms, but how widely is that understood? I think "code unit" vs "Unicode char" is probably clearer.)

> 
> @@ +992,5 @@
> > +        if (NS_IS_HIGH_SURROGATE(c) && end + 1 < aLength &&
> > +            NS_IS_LOW_SURROGATE(aChars[end + 1])) {
> > +          c = SURROGATE_TO_UCS4(c, aChars[end + 1]);
> > +        }
> > +        if (CLASS_COMPLEX != GetClass(aChars[end])) {
> 
> GetClass(c)? And could you add a unit test to catch this error?
> (Also this proves that GetCodePointAt would be useful.)

Oops, thanks for catching that!

Actually, I don't think this currently makes any difference (and so it cannot readily be tested), because we don't have any non-BMP characters for which GetClass would return CLASS_COMPLEX. So in theory we could skip the surrogate check here, and just rely on GetClass for an individual surrogate not returning COMPLEX, but that seems like an ugly hack.
Attachment #8841548 - Flags: review?(VYV03354)
Attachment #8841254 - Attachment is obsolete: true
Assignee: nobody → jfkthame
Status: NEW → ASSIGNED
Rebased on top of updated patch in bug 809020 (sorry for the noise).
Attachment #8841640 - Flags: review?(VYV03354)
Attachment #8841548 - Attachment is obsolete: true
Attachment #8841548 - Flags: review?(VYV03354)
Comment on attachment 8841640 [details] [diff] [review]
Fix up the handling of surrogates in nsJISx4051LineBreaker

Review of attachment 8841640 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/lwbrk/nsJISx4051LineBreaker.cpp
@@ +733,4 @@
>      mLastBreakIndex = 0;
>      mPreviousNonHyphenCharacter = U_NULL;
>      mHasCJKChar = 0;
>      mHasNonbreakableSpace = 0;

Let's change 1/0 to true/false while we are at it.

@@ +951,5 @@
>    int8_t lastClass = CLASS_NONE;
>    ContextState state(aChars, aLength);
>  
>    for (cur = 0; cur < aLength; ++cur, state.AdvanceIndex()) {
> +    char32_t ch = state.GetUnicodeCharAt(cur);

If it guaranteed that aChars == mUniText? If so, why aChars is required at all? If not, shouldn't GetUnicodeCharAt take a char16_t buffer as a parameter?

@@ +1002,3 @@
>  
> +      while (end < aLength) {
> +        char32_t c = state.GetUnicodeCharAt(end);

Same here.
(In reply to Masatoshi Kimura [:emk] from comment #7)
> @@ +951,5 @@
> >    int8_t lastClass = CLASS_NONE;
> >    ContextState state(aChars, aLength);
> >  
> >    for (cur = 0; cur < aLength; ++cur, state.AdvanceIndex()) {
> > +    char32_t ch = state.GetUnicodeCharAt(cur);
> 
> If it guaranteed that aChars == mUniText?

Yes, in the char16_t* version of this method, `state` is initialized with aChars and mUniText is never changed thereafter. 

I'll make mUniText (and mText) const members, to emphasize that they're never changed once the ContextState object is initialized.

> If so, why aChars is required at
> all?

It's required because it's the input param to GetJISx4051Breaks, used to initialize the ContextState. After that, it's not strictly needed as the state has cached it, though it can still be used for convenient access to the input code units when we don't need the surrogate processing that GetUnicodeCharAt provides.
Attachment #8841640 - Attachment is obsolete: true
Attachment #8841640 - Flags: review?(VYV03354)
Comment on attachment 8841912 [details] [diff] [review]
Fix up the handling of surrogates in nsJISx4051LineBreaker

Review of attachment 8841912 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the explanation.

::: intl/lwbrk/nsJISx4051LineBreaker.cpp
@@ +770,5 @@
>      }
>    }
>  
> +  const char16_t* const mUniText;
> +  const uint8_t* const mText;

Nice constification :)
Attachment #8841912 - Flags: review?(VYV03354) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/84e809d7bf0b6da8cdbb02475cfbb2b97377c836
Bug 1342686 - Reftests for line-breaking with text that includes surrogate pairs. r=emk

https://hg.mozilla.org/integration/mozilla-inbound/rev/f7b0f3b271228e2f33e0f6ce641060862012f07d
Bug 1342686 - Fix up the handling of surrogates in nsJISx4051LineBreaker. r=emk
https://hg.mozilla.org/mozilla-central/rev/84e809d7bf0b
https://hg.mozilla.org/mozilla-central/rev/f7b0f3b27122
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
You need to log in before you can comment on or make changes to this bug.