If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

Stop passing in lone surrogates to nsUnicodeProperties methods

NEW
Assigned to

Status

()

Core
Internationalization
5 years ago
5 years ago

People

(Reporter: smontagu, Assigned: smontagu)

Tracking

Trunk
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments)

(Assignee)

Description

5 years ago
jfkthame made a tryserver run with assertions when methods in nsUnicodeProperties that expect a Unicode character are passed surrogate code points, results at https://tbpl.mozilla.org/?tree=Try&rev=3c9393b3f2d8.

We should stop doing that.
(Assignee)

Comment 1

5 years ago
Created attachment 624024 [details] [diff] [review]
Patch for paired surrogates

I'm going to split this into two patches: this one handles cases where we get a supplementary character correctly encoded as a surrogate pair but pass it to nsUnicodeProperties as two separate "characters".
Assignee: nobody → smontagu
Attachment #624024 - Flags: review?(jfkthame)
(Assignee)

Comment 2

5 years ago
Created attachment 624025 [details] [diff] [review]
Patch for  lone surrogates

This patch is for the case where there is an actual lone surrogate in the DOM, e.g. in content/base/crashtests/377360-1.xhtml which does

 div.textContent = String.fromCharCode(0xDCBF);

I've tested that it doesn't cause problems in cases where both halves of a legal surrogate pair get inserted into the DOM one after another.
Attachment #624025 - Flags: review?(jfkthame)
Comment on attachment 624024 [details] [diff] [review]
Patch for paired surrogates

Review of attachment 624024 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxFont.cpp
@@ +2396,5 @@
> +                nextUcs4Char = '\n';
> +            }
> +            boundary = IsBoundarySpace(ucs4Char, nextUcs4Char);
> +            invalid = !boundary && gfxFontGroup::IsInvalidChar(ch);
> +        }

Hmmm. I'd like to try and optimize this so that we don't bother with the surrogate checks unless absolutely necessary, as this loop is pretty perf-sensitive for text rendering. Note that if the current code unit is a surrogate, IsBoundarySpace doesn't need to look at the next one at all in order to answer "false"; and it doesn't need to actually decode the following surrogate pair unless the current char is space or  . So I think this can be reworked to take several comparisons out of the common path. Maybe the surrogate-decoding can be moved into IsBoundarySpace, after taking early returns for the common cases that don't need it?
Comment on attachment 624025 [details] [diff] [review]
Patch for  lone surrogates

Review of attachment 624025 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxFont.cpp
@@ +3488,5 @@
> +                !(i > 0 &&
> +                  NS_IS_LOW_SURROGATE(origCh) &&
> +                  NS_IS_HIGH_SURROGATE(aString[i - 1]))) {
> +                newCh = UCS2_REPLACEMENT_CHAR;
> +            }

AFAICS, this hunk means that lone surrogates will now be displayed as U+FFFD (�), whereas previously they would have been displayed as hexboxes. I'm trying to decide whether I think that's the correct behavior... wonder what other browsers do?

I'd certainly agree with replacing lone surrogates in incoming text with U+FFFD (and I believe that's what the parser does?), but given that it's possible - regrettably - for JavaScript to add them to the DOM, I'm not completely sure about "hiding" them behind U+FFFD at display time.

Note that I think this results in slightly surprising behavior whereby some � glyphs, resulting from lone surrogates (or other encoding errors) in input text, will yield U+FFFD characters when copy-pasted elsewhere, whereas others that result from lone surrogates poked into the DOM will result in the original surrogate codepoints when copy-pasted.

As of now, I'm leaning towards the view that we should only display � in cases where the actual character in the DOM is U+FFFD, and other invalid/non-displayable junk should be displayed in some other way - such as our hexboxes. But I'm open to possible persuasion.......

::: layout/base/nsBidi.cpp
@@ +463,5 @@
>        if(!IS_FIRST_SURROGATE(uchar) || i+1==length || !IS_SECOND_SURROGATE(aText[i+1])) {
> +        /* not a surrogate pair: if it is a lone surrogate treat it as BN */
> +        flags|=DIRPROP_FLAG(dirProps[i]=dirProp=
> +                             IS_SURROGATE(uchar) ?
> +                              BN : GetBidiCat((PRUint32)uchar));

This causes us to do two separate checks on the current codepoint (!IS_FIRST_SURROGATE, and then IS_SURROGATE), where the 99% case will be that they are both false. I think it would be better to check !IS_SURROGATE first, so that for the common case we only do the single check and then immediately call GetBidiCat. That way we can avoid introducing an extra test into the hot code path.

@@ +494,5 @@
>      uchar=aText[i];
>      if(!IS_FIRST_SURROGATE(uchar) || i+1==length || !IS_SECOND_SURROGATE(aText[i+1])) {
> +      /* not a surrogate pair: if it is a lone surrogate treat it as BN */
> +      flags|=DIRPROP_FLAG(dirProps[i]=IS_SURROGATE(uchar) ?
> +                                        BN : GetBidiCat((PRUint32)uchar));

Ditto here.
Created attachment 629168 [details] [diff] [review]
patch for paired surrogates, alternate version

This is a reworking of your patch to try and avoid adding surrogate processing wherever possible within the perf-critical SplitAndInitTextRun loop.
Attachment #629168 - Flags: review?(smontagu)
(Assignee)

Comment 6

5 years ago
(In reply to Jonathan Kew (:jfkthame) from comment #4)
> AFAICS, this hunk means that lone surrogates will now be displayed as U+FFFD
> (�), whereas previously they would have been displayed as hexboxes. I'm
> trying to decide whether I think that's the correct behavior... wonder what
> other browsers do?

I think displaying as hexboxes is non-compliant with clause C10 in http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf which says 

"When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters."

> Note that I think this results in slightly surprising behavior whereby some
> � glyphs, resulting from lone surrogates (or other encoding errors) in input
> text, will yield U+FFFD characters when copy-pasted elsewhere, whereas
> others that result from lone surrogates poked into the DOM will result in
> the original surrogate codepoints when copy-pasted.

I think that the status quo that lone surrogates after being poked into the DOM can be copy-pasted is also a bug and should be fixed.

I'll attach a testcase in which a lone high surrogate and a lone low surrogate are poked into the DOM individually before and after another character, which can then be removed. Currently before removing the intervening character, both surrogates are displayed as hexboxes, and it's possible to copy-paste the whole string, which I think is wrong.

In Chrome the surrogates aren't displayed at all in the original webpage, but when the string is copy-pasted they appear as replacement characters.

In practice with my patch the surrogates appear as garbage rather than either hexboxes or replacement characters; there must be another place that needs patching to substitute 0xFFFD for the lone surrogates.
(Assignee)

Comment 7

5 years ago
Created attachment 642762 [details]
lone surrogate testcase
You need to log in before you can comment on or make changes to this bug.