It takes two backspaces to remove a smiley + VS16

RESOLVED FIXED in Firefox 44

Status

()

defect
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: timdream, Assigned: jfkthame)

Tracking

unspecified
mozilla44
Points:
---

Firefox Tracking Flags

(firefox44 fixed)

Details

()

Attachments

(3 attachments, 6 obsolete attachments)

13.22 KB, patch
emk
: review+
Details | Diff | Splinter Review
2.73 KB, patch
emk
: review+
Details | Diff | Splinter Review
9.27 KB, patch
emk
: review+
Details | Diff | Splinter Review
STR:

1. Load the URL, which is a <input> with value of Unicode sequence U+263a U+fe0f (rendered as a color Emoji)
2. Focus on the input and hit backspace

Expected:

1. I can remove the color smiley in one backspace

Actual:

1. The first backspace turn the smiley into a black white one, only the second smiley removes the smiley.

Note:

1. Looks like the first backspace only removes VS16 (U+fe0f)?
2. Bug 1216409 will ensure BMP Emojis are output with VS16 -- this bug will be very visible after that.

Masayuki, do you know what's the desired behavior here? I can see this behavior happen on Mac and FxOS so it's a core bug not a InputMethod API one. Thanks.
Flags: needinfo?(masayuki)
IIRC, this is handled by gfx via editor. ni'ing to experts of combined character handling.

I think that the behavior you expected is ideal behavior.
Component: DOM: Events → Editor
Flags: needinfo?(masayuki)
Flags: needinfo?(jfkthame)
Flags: needinfo?(VYV03354)
(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #1)
> IIRC, this is handled by gfx via editor. ni'ing to experts of combined
> character handling.
> 
> I think that the behavior you expected is ideal behavior.

Thanks for the quick reply! It's worthy to point out the current behavior is different compare to Safari. I can remove the smiley with one backspace only.
Currently Gecko handles VSes just like all other combining characters. For example, if you hit a backspace right after A+acute, only accute will be deleted.
This is consistent with the platform behavior at least on windows. (Do we need another per-platform pref?)
Flags: needinfo?(VYV03354)
(In reply to Tim Guan-tin Chien [:timdream] (please needinfo) from comment #2)
> (In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #1)
> > IIRC, this is handled by gfx via editor. ni'ing to experts of combined
> > character handling.
> > 
> > I think that the behavior you expected is ideal behavior.
> 
> Thanks for the quick reply! It's worthy to point out the current behavior is
> different compare to Safari. I can remove the smiley with one backspace only.

In general, Safari always removes a "cluster" (base character + combining marks) as a single unit, whereas Firefox allows backspace to remove a trailing mark from the cluster without affecting the base. So in an example like

  data:text/html,<div contenteditable style="font-size:36px">fo&%23x331;x&%23x302;

you'll find that with Firefox, it's possible to click after one of the accented letters and backspace to delete just the accent, while in Safari the entire accented letter is deleted as a unit.

This difference is intended behavior on our part. In languages where combining marks (not just precomposed accented letters, as in typical European languages) are widely used, the inability to correct a mistyped mark without retyping the entire cluster becomes a hassle for users.

(Note that this behavior applies only to backspace; moving the cursor with the left/right arrow keys will step over the cluster as a whole.)

So what we see here is the Variation Selector behaving similarly, as it is also a combining character at the end of a cluster.

However, I think it would be justifiable to special-case variation selectors here, and handle them differently from combining marks in general. As the VSxx characters are never visible in their own right, and rarely if ever typed separately from their base -- they'll normally be generated automatically by an input method that is allowing the user to select a particular form of a character -- it probably makes more sense to delete the char+VS cluster as a unit even when backspacing.
Flags: needinfo?(jfkthame)
(In reply to Masatoshi Kimura [:emk] from comment #3)
> Currently Gecko handles VSes just like all other combining characters. For
> example, if you hit a backspace right after A+acute, only accute will be
> deleted.
> This is consistent with the platform behavior at least on windows. (Do we
> need another per-platform pref?)

I think we should just special-case VSes (on all platforms) for this purpose, and avoid adding a pref unless we see significant user demand.
Hm, IE11 and Edge on Windows 10 deleted the character with only one backspace. Chrome and Notepad required two backspaces.
Here's a simple mochitest (based on the existing tests for deleting surrogate pairs) for the issue here; marked 'todo' because this will fail with current code.
Attachment #8676231 - Flags: review?(VYV03354)
Assignee: nobody → jfkthame
Status: NEW → ASSIGNED
And here's a patch that makes the tests actually pass, assuming we're happy to adopt this behavior (which I think makes sense).
Attachment #8676232 - Flags: review?(VYV03354)
Comment on attachment 8676231 [details] [diff] [review]
Tests for backspacing over a character with variation selector

Please add testcases for:
* CJK Ideographs in BMP + IVS
* CJK Ideographs in SIP + IVS
Attachment #8676231 - Flags: review?(VYV03354) → review+
Comment on attachment 8676232 [details] [diff] [review]
Ensure a character+VS sequence is deleted as a single unit when backspacing

Perhaps the current behavior would be only beneficial to character encoding lawyers.
Attachment #8676232 - Flags: review?(VYV03354) → review+
(In reply to Masatoshi Kimura [:emk] from comment #9)
> Comment on attachment 8676231 [details] [diff] [review]
> Tests for backspacing over a character with variation selector
> 
> Please add testcases for:
> * CJK Ideographs in BMP + IVS
> * CJK Ideographs in SIP + IVS

Hmm, the patch above won't actually work for VS17 and upwards, because the fix in IsAcceptableCaretPosition() only checks IsVarSelector on individual UTF16 code units.

Rather than add surrogate processing there, I propose an alternative implementation. We already have the FLAG_CHAR_IS_LOW_SURROGATE flag on CompressedGlyph records in the textrun; this exists purely to support the IsAcceptableCaretPosition() function. So let's extend it to become CHAR_IS_LOW_SURROGATE_OR_VS, so that we mark variation selectors in the textrun just like low surrogates; these are both cases that should never be separated from the preceding code unit.

Then IsAcceptableCaretPosition() will just need to check the character flags in the textrun (as it already does); it won't need to fetch characters from the frame's underlying contents' textFragment and do its own surrogate handling.
Added more testcases, including Plane 14 variation selectors; carrying over r=emk. These new tests fail with the initial patch above; alternative patch coming next.
Attachment #8676231 - Attachment is obsolete: true
Here's the revised patch, as per comment 9. Marking r? again as this is an entirely (or mostly, anyhow) different implementation. With this, all the testcases pass for me locally; I'll push to try as well, to check across platforms.
Attachment #8676334 - Flags: review?(VYV03354)
Attachment #8676232 - Attachment is obsolete: true
Comment on attachment 8676334 [details] [diff] [review]
Ensure a character+VS sequence is deleted as a single unit when backspacing

Review of attachment 8676334 [details] [diff] [review]:
-----------------------------------------------------------------

The overall direction is good, but I have some questions.

::: editor/libeditor/nsPlaintextEditor.cpp
@@ +621,5 @@
> +            if ((offset > 1 &&
> +                 NS_IS_LOW_SURROGATE(data[offset - 1]) &&
> +                 NS_IS_HIGH_SURROGATE(data[offset - 2])) ||
> +                (offset > 0 &&
> +                 gfxFontUtils::IsVarSelector(data[offset - 1]))) {

Is non-BMP VS check unnecessary here?

::: gfx/thebes/gfxFont.cpp
@@ +563,5 @@
>  
>      while (!iter.AtEnd()) {
>          if (*iter == char16_t(' ')) {
>              glyphs->SetIsSpace();
> +        } else if (gfxFontUtils::IsVarSelector(*iter)) {

Why are low surrogates not checked here?

::: layout/generic/nsTextFrame.cpp
@@ -7026,2 @@
>      return false;
> -  if (index > 0) {

Why is this removed? What happens if the first character in the cluster is a low surrogate or a VS?
Attachment #8676334 - Flags: review?(VYV03354) → feedback+
I hope I am not too late: I realized it takes two backspaces to remove a flag composed of regional indicator symbols too.

https://en.wikipedia.org/wiki/Regional_Indicator_Symbol#Unicode_Block

It's great for me when writing the comment in https://github.com/mozilla-b2g/gaia/pull/31263/files#r42579402 but I guess this falls under the rationale in comment 4 too since user rarely need them individually?

Should I file another bug or the patch here can address it altogether?
(In reply to Tim Guan-tin Chien [:timdream] (please needinfo) from comment #16)
> I hope I am not too late: I realized it takes two backspaces to remove a
> flag composed of regional indicator symbols too.

I can also place caret between these symbols ... something not possible for VS16.
(In reply to Masatoshi Kimura [:emk] from comment #15)
> Comment on attachment 8676334 [details] [diff] [review]
> Ensure a character+VS sequence is deleted as a single unit when backspacing
> 
> Review of attachment 8676334 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> The overall direction is good, but I have some questions.
> 
> ::: editor/libeditor/nsPlaintextEditor.cpp
> @@ +621,5 @@
> > +            if ((offset > 1 &&
> > +                 NS_IS_LOW_SURROGATE(data[offset - 1]) &&
> > +                 NS_IS_HIGH_SURROGATE(data[offset - 2])) ||
> > +                (offset > 0 &&
> > +                 gfxFontUtils::IsVarSelector(data[offset - 1]))) {
> 
> Is non-BMP VS check unnecessary here?

Yes; a non-BMP VS is a surrogate pair, so the first part of this condition will match, and so that will already cause us to enter CharacterExtendForBackspace() here.

> 
> ::: gfx/thebes/gfxFont.cpp
> @@ +563,5 @@
> >  
> >      while (!iter.AtEnd()) {
> >          if (*iter == char16_t(' ')) {
> >              glyphs->SetIsSpace();
> > +        } else if (gfxFontUtils::IsVarSelector(*iter)) {
> 
> Why are low surrogates not checked here?

Low surrogates are found and marked during the loop below that marks trailing characters in the cluster. The only case that doesn't handle is the initial codepoint of the cluster; but that can't be a low surrogate, except in the case of ill-formed data where there's an unmatched surrogate. And in that case, I don't think we want to mark the low surrogate as such (to "glue" it to the preceding codepoint), as it's not part of a valid pair.

> 
> ::: layout/generic/nsTextFrame.cpp
> @@ -7026,2 @@
> >      return false;
> > -  if (index > 0) {
> 
> Why is this removed? What happens if the first character in the cluster is a
> low surrogate or a VS?

It shouldn't ever be marked as a low surrogate, because an initial low surrogate would be ill-formed, and we don't mark those (see above). I think it could be a VS, in the (unlikely) edge case where there's an element boundary between a base char and a following VS, and in that case we probably still want to avoid putting the caret in between.

However, in view of comment 16, I think I'm going to reverse direction here and do the work in IsAcceptableCaretPosition instead of using CompressedGlyph flags after all. New patch coming...
(In reply to Tim Guan-tin Chien [:timdream] (please needinfo) from comment #16)
> I hope I am not too late: I realized it takes two backspaces to remove a
> flag composed of regional indicator symbols too.
> 
> https://en.wikipedia.org/wiki/Regional_Indicator_Symbol#Unicode_Block
> 
> It's great for me when writing the comment in
> https://github.com/mozilla-b2g/gaia/pull/31263/files#r42579402 but I guess
> this falls under the rationale in comment 4 too since user rarely need them
> individually?
> 
> Should I file another bug or the patch here can address it altogether?

Hmm, that's a trickier case, actually. The Regional Indicator letters are not like VS characters, which always "stick" to the preceding character; nor are they like surrogates, which come in distinct High and Low ranges and should always stick together in pairs. Any given RI letter may be either the first or the second of a pair, and the acceptable caret positions depend on the entire string.

E.g. if we have a sequence of [G][B][G][B] (using bracketed letters because bugzilla will choke on the real characters), we'd expect to see a pair of British flags, and the caret should step over each [G][B] pair as a unit. But if we then insert an extra [B] at the start, we suddenly have a pair of Bulgarian flags, with a stray [B] at the end. And the caret should step over the [B][G] pairs, falling in the middle of the previously-indivisible [G][B].

Moreover, we'd only want to do this if the characters are actually rendered as a flag; if we have unsupported combinations [X][Y] that are displayed as individual characters, they should be editable as such, I think.

We can't simply mark the RI characters with the LOW_SURROGATE_OR_VS flag, because that would cause an entire run of flags to act as a unit; we'd only want this flag on the second of a ligated pair. But at SetupClusterBoundaries time, when we're identifying surrogates and VSes, we don't yet know where such ligatures will be formed.

To handle this, I think it's going to make more sense to drop the LOW_SURROGATE_OR_VS flag altogether and instead let IsAcceptableCaretPosition inspect the underlying text; this, together with the glyph flags that identify ligatures should be sufficient.
Added tests for behavior with Regional Indicator flags as well. These require an actual emoji font (because if the characters don't ligate, cursor movement and backspacing will treat them individually instead of in pairs), so for the time being I've set it to only run these tests on recent Mac and Windows systems, where we can expect a suitable font to be present. Later, we can deploy the Firefox Emoji font through @font-face, and test consistently across all platforms/versions.
Attachment #8677324 - Flags: review?(VYV03354)
Attachment #8676333 - Attachment is obsolete: true
Now with support for Regional Indicator pairs. This version eliminates the textrun's low-surrogate flag altogether, as it was only used for this purpose, and consolidates all the character-level checking directly into IsAcceptableCaretPosition().
Attachment #8677327 - Flags: review?(VYV03354)
Attachment #8676334 - Attachment is obsolete: true
(In reply to Jonathan Kew (:jfkthame) from comment #22)
> Created attachment 8677324 [details] [diff] [review]
> Tests for backspacing over a character with variation selector
> 
> Added tests for behavior with Regional Indicator flags as well. These
> require an actual emoji font (because if the characters don't ligate, cursor
> movement and backspacing will treat them individually instead of in pairs),
> so for the time being I've set it to only run these tests on recent Mac and
> Windows systems, where we can expect a suitable font to be present. Later,
> we can deploy the Firefox Emoji font through @font-face, and test
> consistently across all platforms/versions.

I couldn't display regional indicator pairs as emoji on my Windows 10 PC (and hence the test failed even with the next patch applied). How did you confirm?
Flags: needinfo?(jfkthame)
(In reply to Jonathan Kew (:jfkthame) from comment #24)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=57187a8563c5

Looks like the test also fails on OS X 10.10.
Also, I think this patch is a bit over-engineering. IE and Edge does not treat RI pair as an atomic unit. Chrome always treat RI pair as a unit even without font support.
Oh, interesting. I assumed they'd be supported in Segoe UI Emoji (I'd only tested this locally on OS X, where they definitely *do* work!), but you're right -- checking seguiemj.ttf with fontforge, I see it has glyphs for the individual RI letters, but doesn't include actual flag ligatures.

So that part of the test will need to be Mac-only for now, I suppose. I'll remove the Windows part of the hasEmojiFont condition.

As for 10.10: I'll look into that. It passes for me locally on 10.9, so I wonder what's different...
Flags: needinfo?(jfkthame)
(In reply to Masatoshi Kimura [:emk] from comment #26)
> Chrome always treat RI pair as a unit even
> without font support.

Actually, Chrome mishandles this somewhat, AFAICT, if you have a series of flags (more than just a single pair of RI chars in succession). In an example like

data:text/html,<div contenteditable>foo&%23x1f1e7;&%23x1f1ec;&%23x1f1e7;&%23x1f1ec;&%23x1f1e7;&%23x1f1ec;&%23x1f1e7;&%23x1f1ec;bar

I find that the left/right arrows will jump over the first three flags all at once.

I see a similar bug in Safari, too. In TextEdit, OTOH, the cursor steps neatly over each individual flag as a unit -- the same behavior as this patch provides for Firefox.

(Comparison with IE/Edge isn't really useful if the font on Windows doesn't support the flag ligatures. The point here is that once the ligatures are formed, they behave as indivisible characters; but RI characters that have not ligated appear as individual units and can reasonably be edited as such. And trying to identify pairs that "should" become a unit, in the absence of a font that supports this, is problematic because it would require arbitrarily-long lookback to see whether there's an even or odd number of RI characters preceding the position.)
Comment on attachment 8677324 [details] [diff] [review]
Tests for backspacing over a character with variation selector

I could test the patch locally using WadaLabMaruGo2004Emoji.
http://osdn.jp/projects/jis2004/downloads/63433/wlmaru2004emoji4370.lzh/
Unfortunately, I could select the half of RI pair with mouse. So something in this patch is incomplete.
Attachment #8677324 - Flags: review?(VYV03354) → review-
IE/Edge displayed a flag but did not treat it as a unit even with WadaLabMaruGo2004Emoji installed.
(In reply to Masatoshi Kimura [:emk] from comment #30)
> IE/Edge displayed a flag but did not treat it as a unit even with
> WadaLabMaruGo2004Emoji installed.

OK; I guess they haven't considered this behavior yet because they're not shipping font support for it.

(In reply to Masatoshi Kimura [:emk] from comment #29)
> Comment on attachment 8677324 [details] [diff] [review]
> Tests for backspacing over a character with variation selector
> 
> I could test the patch locally using WadaLabMaruGo2004Emoji.
> http://osdn.jp/projects/jis2004/downloads/63433/wlmaru2004emoji4370.lzh/
> Unfortunately, I could select the half of RI pair with mouse. So something
> in this patch is incomplete.

Ah, good catch: this is true on OS X as well. The patch fixes things for cursor movement with the arrow keys, but does not address mouse selection; I hadn't tried that. :(

BTW, for experimentation I've just put up a testcase at http://jsfiddle.net/rb4dLnnz/.
This fixes the issue of selection with the mouse, afaict. Still need to figure out what's up with the test on OS X 10.10.
Attachment #8677537 - Flags: review?(VYV03354)
Extended the tests a bit, and made the RI portion Mac-only; tryserver seems to be happier now.
Attachment #8677663 - Flags: review?(VYV03354)
Attachment #8677324 - Attachment is obsolete: true
Hmm. So actually, what's happening on tryserver is that the RI tests are failing on 10.10, but pass correctly on 10.10.5. (They also pass for me locally on 10.9.5.)
Oh, I see what's going on. Older versions of Apple Color Emoji only support a few of the flag symbols; then in 10.10.5, a lot more of them are present. And I've got a newer font installed locally, which is why I wasn't seeing the failures here; with the stock 10.9 font, several of the flags are missing and then the tests do indeed fail -- which is expected because the unligated RI letters don't cluster for the purpose of cursor movement and backspacing.

I'll update the test to use only flag combinations that were supported in the older font.
OK, at last.... tryserver (comment 37) confirms this will pass on both original OS X 10.10 and 10.10.5.
Attachment #8678034 - Flags: review?(VYV03354)
Attachment #8677663 - Attachment is obsolete: true
Attachment #8677663 - Flags: review?(VYV03354)
(In reply to Jonathan Kew (:jfkthame) from comment #18)
> It shouldn't ever be marked as a low surrogate, because an initial low
> surrogate would be ill-formed, and we don't mark those (see above). I think
> it could be a VS, in the (unlikely) edge case where there's an element
> boundary between a base char and a following VS, and in that case we
> probably still want to avoid putting the caret in between.

I tested
  <div contenteditable="true"><span>&#x263a;&#x1F1EC;</span><span>&#x1F1E7;b</span></div>
  <div contenteditable="true"><span>a&#x263a;</span><span>&#xfe0f;b</span></div>
with the latest patch. The glyph was ligated, and left arrow key and backspace worked. But right arrow key, delete key, and mouse selection didn't work as expected (took two types, could select half of the flag). Is this expected?
Flags: needinfo?(jfkthame)
(In reply to Masatoshi Kimura [:emk] from comment #39)
> (In reply to Jonathan Kew (:jfkthame) from comment #18)
> > It shouldn't ever be marked as a low surrogate, because an initial low
> > surrogate would be ill-formed, and we don't mark those (see above). I think
> > it could be a VS, in the (unlikely) edge case where there's an element
> > boundary between a base char and a following VS, and in that case we
> > probably still want to avoid putting the caret in between.
> 
> I tested
>   <div
> contenteditable="true"><span>&#x263a;&#x1F1EC;</span><span>&#x1F1E7;b</
> span></div>
>   <div
> contenteditable="true"><span>a&#x263a;</span><span>&#xfe0f;b</span></div>
> with the latest patch. The glyph was ligated, and left arrow key and
> backspace worked. But right arrow key, delete key, and mouse selection
> didn't work as expected (took two types, could select half of the flag). Is
> this expected?

I'm not surprised by this, though I hadn't specifically tested these cases. The problem here is that the look-ahead code doesn't "see" across the element boundary. In theory we could fix this, I guess, but ISTM it's such a rare edge case (and the resulting inconsistency is pretty minor) that it's probably not worth added code complexity.
Flags: needinfo?(jfkthame)
Attachment #8678034 - Flags: review?(VYV03354) → review+
Attachment #8677327 - Flags: review?(VYV03354) → review+
Attachment #8677537 - Flags: review?(VYV03354) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/007390c592ebe683d57923ce5155117a56a57355
Bug 1216427 - Tests for backspacing over a character with variation selector, and over Regional Indicator flag symbols. r=emk

https://hg.mozilla.org/integration/mozilla-inbound/rev/019608d8a449322f00784012d18be0bad0b61291
Bug 1216427 - part 1 - Ensure a character+VS sequence or a ligated Regional-Indicator flag symbol is deleted as a single unit when backspacing. r=emk

https://hg.mozilla.org/integration/mozilla-inbound/rev/9a59037687802067d469167e5eafe5cdf3ce1ecb
Bug 1216427 - part 2 - Ensure mouse selection does not split up a Regional Indicator flag symbol. r=emk
https://hg.mozilla.org/mozilla-central/rev/007390c592eb
https://hg.mozilla.org/mozilla-central/rev/019608d8a449
https://hg.mozilla.org/mozilla-central/rev/9a5903768780
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
See Also: → 850043
You need to log in before you can comment on or make changes to this bug.