Open Bug 753262 Opened 12 years ago Updated 2 years ago

Stop passing in lone surrogates to nsUnicodeProperties methods

Categories

(Core :: Internationalization, defect, P3)

defect

Tracking

()

People

(Reporter: smontagu, Unassigned)

Details

Attachments

(4 files)

jfkthame made a tryserver run with assertions when methods in nsUnicodeProperties that expect a Unicode character are passed surrogate code points, results at https://tbpl.mozilla.org/?tree=Try&rev=3c9393b3f2d8.

We should stop doing that.
I'm going to split this into two patches: this one handles cases where we get a supplementary character correctly encoded as a surrogate pair but pass it to nsUnicodeProperties as two separate "characters".
Assignee: nobody → smontagu
Attachment #624024 - Flags: review?(jfkthame)
This patch is for the case where there is an actual lone surrogate in the DOM, e.g. in content/base/crashtests/377360-1.xhtml which does

 div.textContent = String.fromCharCode(0xDCBF);

I've tested that it doesn't cause problems in cases where both halves of a legal surrogate pair get inserted into the DOM one after another.
Attachment #624025 - Flags: review?(jfkthame)
Comment on attachment 624024 [details] [diff] [review]
Patch for paired surrogates

Review of attachment 624024 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxFont.cpp
@@ +2396,5 @@
> +                nextUcs4Char = '\n';
> +            }
> +            boundary = IsBoundarySpace(ucs4Char, nextUcs4Char);
> +            invalid = !boundary && gfxFontGroup::IsInvalidChar(ch);
> +        }

Hmmm. I'd like to try and optimize this so that we don't bother with the surrogate checks unless absolutely necessary, as this loop is pretty perf-sensitive for text rendering. Note that if the current code unit is a surrogate, IsBoundarySpace doesn't need to look at the next one at all in order to answer "false"; and it doesn't need to actually decode the following surrogate pair unless the current char is space or  . So I think this can be reworked to take several comparisons out of the common path. Maybe the surrogate-decoding can be moved into IsBoundarySpace, after taking early returns for the common cases that don't need it?
Comment on attachment 624025 [details] [diff] [review]
Patch for  lone surrogates

Review of attachment 624025 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxFont.cpp
@@ +3488,5 @@
> +                !(i > 0 &&
> +                  NS_IS_LOW_SURROGATE(origCh) &&
> +                  NS_IS_HIGH_SURROGATE(aString[i - 1]))) {
> +                newCh = UCS2_REPLACEMENT_CHAR;
> +            }

AFAICS, this hunk means that lone surrogates will now be displayed as U+FFFD (�), whereas previously they would have been displayed as hexboxes. I'm trying to decide whether I think that's the correct behavior... wonder what other browsers do?

I'd certainly agree with replacing lone surrogates in incoming text with U+FFFD (and I believe that's what the parser does?), but given that it's possible - regrettably - for JavaScript to add them to the DOM, I'm not completely sure about "hiding" them behind U+FFFD at display time.

Note that I think this results in slightly surprising behavior whereby some � glyphs, resulting from lone surrogates (or other encoding errors) in input text, will yield U+FFFD characters when copy-pasted elsewhere, whereas others that result from lone surrogates poked into the DOM will result in the original surrogate codepoints when copy-pasted.

As of now, I'm leaning towards the view that we should only display � in cases where the actual character in the DOM is U+FFFD, and other invalid/non-displayable junk should be displayed in some other way - such as our hexboxes. But I'm open to possible persuasion.......

::: layout/base/nsBidi.cpp
@@ +463,5 @@
>        if(!IS_FIRST_SURROGATE(uchar) || i+1==length || !IS_SECOND_SURROGATE(aText[i+1])) {
> +        /* not a surrogate pair: if it is a lone surrogate treat it as BN */
> +        flags|=DIRPROP_FLAG(dirProps[i]=dirProp=
> +                             IS_SURROGATE(uchar) ?
> +                              BN : GetBidiCat((PRUint32)uchar));

This causes us to do two separate checks on the current codepoint (!IS_FIRST_SURROGATE, and then IS_SURROGATE), where the 99% case will be that they are both false. I think it would be better to check !IS_SURROGATE first, so that for the common case we only do the single check and then immediately call GetBidiCat. That way we can avoid introducing an extra test into the hot code path.

@@ +494,5 @@
>      uchar=aText[i];
>      if(!IS_FIRST_SURROGATE(uchar) || i+1==length || !IS_SECOND_SURROGATE(aText[i+1])) {
> +      /* not a surrogate pair: if it is a lone surrogate treat it as BN */
> +      flags|=DIRPROP_FLAG(dirProps[i]=IS_SURROGATE(uchar) ?
> +                                        BN : GetBidiCat((PRUint32)uchar));

Ditto here.
This is a reworking of your patch to try and avoid adding surrogate processing wherever possible within the perf-critical SplitAndInitTextRun loop.
Attachment #629168 - Flags: review?(smontagu)
(In reply to Jonathan Kew (:jfkthame) from comment #4)
> AFAICS, this hunk means that lone surrogates will now be displayed as U+FFFD
> (�), whereas previously they would have been displayed as hexboxes. I'm
> trying to decide whether I think that's the correct behavior... wonder what
> other browsers do?

I think displaying as hexboxes is non-compliant with clause C10 in http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf which says 

"When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters."

> Note that I think this results in slightly surprising behavior whereby some
> � glyphs, resulting from lone surrogates (or other encoding errors) in input
> text, will yield U+FFFD characters when copy-pasted elsewhere, whereas
> others that result from lone surrogates poked into the DOM will result in
> the original surrogate codepoints when copy-pasted.

I think that the status quo that lone surrogates after being poked into the DOM can be copy-pasted is also a bug and should be fixed.

I'll attach a testcase in which a lone high surrogate and a lone low surrogate are poked into the DOM individually before and after another character, which can then be removed. Currently before removing the intervening character, both surrogates are displayed as hexboxes, and it's possible to copy-paste the whole string, which I think is wrong.

In Chrome the surrogates aren't displayed at all in the original webpage, but when the string is copy-pasted they appear as replacement characters.

In practice with my patch the surrogates appear as garbage rather than either hexboxes or replacement characters; there must be another place that needs patching to substitute 0xFFFD for the lone surrogates.

Can we drop the r? and close this bug?

Flags: needinfo?(jfkthame)

I'm inclined to agree with Simon's argument that our current behavior of displaying lone surrogates as hexboxes is wrong -- a hexbox is what we show for a Unicode codepoint for which no font is available; but a lone surrogate is not a Unicode codepoint, rather, the content has an ill-formed code unit sequence. That means we still have a bug here that we should aim to fix (although the existing patches are likely to be heavily bit-rotted).

I'll try to find cycles to have another look at this sometime.

Flags: needinfo?(jfkthame)
Attachment #624024 - Flags: review?(jfkthame)
Attachment #624025 - Flags: review?(jfkthame)

Comment on attachment 629168 [details] [diff] [review]
patch for paired surrogates, alternate version

Clearing review flags so this doesn't show up in triage.

Attachment #629168 - Flags: review?(smontagu)

The bug assignee didn't login in Bugzilla in the last 7 months.
:m_kato, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: smontagu → nobody
Flags: needinfo?(m_kato)
Severity: normal → S3
Flags: needinfo?(m_kato)
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: