Bug 780449 (Open): opened 12 years ago, updated 1 year ago

Password fields display two replacement characters for a single supplementary Unicode symbol like emoji

Categories: (Core :: DOM: Editor, defect)
People: (Reporter: mathias, Unassigned)

Details

Password fields display two replacement characters for a single supplementary Unicode symbol.

Test case here: data:text/html;charset=utf-8,<!DOCTYPE html><title>Test</title><input%20type=password%20value=&%23x1D306;>

As you can see, two replacement characters are shown instead of one. What’s worse is you can delete each individual surrogate separately.
Component: Layout: Text → Editor
In nsTextControlFrame::UpdateValueDisplay, we currently do 

   if (!value.IsEmpty() && IsPasswordTextControl()) {
     nsTextEditRules::FillBufWithPWChars(&value, value.Length());
   }

where value is an nsAutoString, so Length() is the number of UTF-16 code units in the string. We need an API that returns the length of a string in Unicode characters instead.
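
For illustration, a minimal standalone sketch of the mismatch (plain C++ here rather than nsAutoString): a supplementary-plane character is one Unicode character but two UTF-16 code units, so a mask sized by Length() gets two bullets.

   #include <cassert>
   #include <string>

   int main() {
     // U+1D306 TETRAGRAM FOR CENTRE: one character, encoded as the
     // surrogate pair 0xD834 0xDF06 in UTF-16.
     std::u16string value = u"\U0001D306";
     assert(value.size() == 2);  // code units, not characters

     // One bullet per code unit yields two bullets for one character.
     std::u16string mask(value.size(), u'\u2022');
     assert(mask.size() == 2);
     return 0;
   }
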
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to Simon Montagu from comment #1)
> In nsTextControlFrame::UpdateValueDisplay, we currently do 
> 
>    if (!value.IsEmpty() && IsPasswordTextControl()) {
>      nsTextEditRules::FillBufWithPWChars(&value, value.Length());
>    }
> 
> where value is an nsAutoString, so Length() is the number of UTF-16 code
> units in the string. We need an API that returns the length of a string in
> Unicode characters instead.

That's easy enough to provide, but I'm concerned there may be further complications with regard to editing - we'll need to handle mapping between offsets in the "real" text and offsets (for cursor-positioning/selection purposes) in the replacement string.
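
To make the mapping concern concrete, here is a sketch of the kind of helper that would be needed (hypothetical, not an existing Gecko API): converting a character offset in the one-bullet-per-character display back to a code-unit offset in the real value.

   #include <string>

   // Hypothetical helper: map an offset counted in Unicode characters
   // (i.e. an index into the bullet display) to the corresponding
   // UTF-16 code-unit offset in the underlying text.
   size_t CodePointToUtf16Offset(const std::u16string& aText,
                                 size_t aCodePointOffset) {
     size_t units = 0;
     for (size_t cp = 0; cp < aCodePointOffset && units < aText.size(); ++cp) {
       char16_t c = aText[units];
       bool isPair = c >= 0xD800 && c <= 0xDBFF &&
                     units + 1 < aText.size() &&
                     aText[units + 1] >= 0xDC00 && aText[units + 1] <= 0xDFFF;
       // A surrogate pair is one character occupying two code units.
       units += isPair ? 2 : 1;
     }
     return units;
   }
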
(In reply to Jonathan Kew (:jfkthame) from comment #2)
> (In reply to Simon Montagu from comment #1)
> > In nsTextControlFrame::UpdateValueDisplay, we currently do 
> > 
> >    if (!value.IsEmpty() && IsPasswordTextControl()) {
> >      nsTextEditRules::FillBufWithPWChars(&value, value.Length());
> >    }
> > 
> > where value is an nsAutoString, so Length() is the number of UTF-16 code
> > units in the string. We need an API that returns the length of a string in
> > Unicode characters instead.
> 
> That's easy enough to provide, but I'm concerned there may be further
> complications with regard to editing - we'll need to handle mapping between
> offsets in the "real" text and offsets (for cursor-positioning/selection
> purposes) in the replacement string.

We only show the replacement string for the password field.  But as far as editing is concerned, I think we handle the surrogates correctly, so this should not be an issue (but we still need to test it once the initial bug is fixed).

What is the proper API that we should be using here?
FWIW, WebKit seems to behave the same as us.  Opera shows it as two characters but you delete it as one.  IE10 behaves correctly.  And this is why UTF-16 is the worst Unicode encoding in the world.  (Okay, UTF-7 is probably worse.)
The Opera bug was reported as DSK-371460. (https://bugs.opera.com/browse/DSK-371460)
OS: Mac OS X → All
Hardware: x86 → All
Summary: Password fields display two replacement characters for a single supplementary Unicode symbol → Password fields display two replacement characters for a single supplementary Unicode symbol like emoji

(In reply to Jonathan Kew (:jfkthame) from comment #2)
> (In reply to Simon Montagu from comment #1)
> > In nsTextControlFrame::UpdateValueDisplay, we currently do
> > 
> >    if (!value.IsEmpty() && IsPasswordTextControl()) {
> >      nsTextEditRules::FillBufWithPWChars(&value, value.Length());
> >    }
> > 
> > where value is an nsAutoString, so Length() is the number of UTF-16 code
> > units in the string. We need an API that returns the length of a string in
> > Unicode characters instead.
> 
> That's easy enough to provide, but I'm concerned there may be further
> complications with regard to editing - we'll need to handle mapping between
> offsets in the "real" text and offsets (for cursor-positioning/selection
> purposes) in the replacement string.

Hey Jonathan, we're running into a similar problem in bug 1543449. Do you think you could give some guidance on how to write this API? I could try implementing it with your mentoring.

Flags: needinfo?(jfkthame)

If what you need is to count the number of Unicode characters in a string (i.e. accounting for surrogate pairs, so that emoji etc. count as one rather than two), it would be pretty straightforward to just iterate over the char16_t code units checking for surrogates; there are macros in xpcom/string/nsCharTraits.h that would help.
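
A sketch of that loop (using plain range checks here; the nsCharTraits.h macros presumably wrap the same comparisons):

   #include <string>

   // Count Unicode characters (code points) in a UTF-16 string,
   // treating each surrogate pair as a single character.
   size_t CountCodePoints(const std::u16string& aText) {
     size_t count = 0;
     for (size_t i = 0; i < aText.size(); ++i) {
       char16_t c = aText[i];
       if (c >= 0xD800 && c <= 0xDBFF && i + 1 < aText.size() &&
           aText[i + 1] >= 0xDC00 && aText[i + 1] <= 0xDFFF) {
         ++i;  // skip the low surrogate of the pair
       }
       ++count;
     }
     return count;
   }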

But now that we have ICU in the tree, a better option might be to use its APIs to do it. There's a u_countChar32() function (see unicode/ustring.h) that will directly give you the answer: just pass it the string's char16_t buffer and length. You could probably use that directly, or maybe provide a simple inline wrapper like CountUnicodeChars(const nsAString&) if that's more convenient.
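
For example (CountUnicodeChars is just the wrapper name suggested above, not an existing API; a Gecko version would take an nsAString and use its buffer and length):

   #include <unicode/ustring.h>  // u_countChar32

   #include <string>

   // Thin wrapper over ICU: returns the number of code points,
   // counting each surrogate pair as one.
   int32_t CountUnicodeChars(const std::u16string& aText) {
     return u_countChar32(reinterpret_cast<const UChar*>(aText.data()),
                          static_cast<int32_t>(aText.size()));
   }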

There are further questions you might want to think about, though: do you really want a count of Unicode characters, or are you actually looking for what the user perceives as "characters" on a screen? The difference starts to become pretty noticeable especially in the emoji world, where a single emoji like 👩‍👩‍👦‍👦 may be composed of something like 7 Unicode characters. Does it count as 1 or 7 for your purposes? What about invisible or zero-width characters?
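
If the answer turns out to be user-perceived characters, ICU can also count grapheme clusters. A sketch using its C break-iterator API, assuming an ICU recent enough to treat emoji ZWJ sequences as single clusters:

   #include <unicode/ubrk.h>

   #include <string>

   // Count extended grapheme clusters ("user-perceived characters").
   // With current segmentation rules this counts 👩‍👩‍👦‍👦 as 1,
   // whereas u_countChar32 counts it as 7.
   int32_t CountGraphemeClusters(const std::u16string& aText) {
     UErrorCode status = U_ZERO_ERROR;
     UBreakIterator* bi =
         ubrk_open(UBRK_CHARACTER, "en",
                   reinterpret_cast<const UChar*>(aText.data()),
                   static_cast<int32_t>(aText.size()), &status);
     if (U_FAILURE(status)) {
       return -1;
     }
     ubrk_first(bi);  // position at the start of the text
     int32_t count = 0;
     while (ubrk_next(bi) != UBRK_DONE) {
       ++count;
     }
     ubrk_close(bi);
     return count;
   }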

(Will you care about the ability to map between Unicode character counts as reported by this API and offsets in the underlying UTF-16 string? If so, that would complicate things.)

Flags: needinfo?(jfkthame)

Yeah, I did some reading earlier about this problem and understand that it can be very tricky. Not all environments will combine 👩‍👩‍👦‍👦; this emoji may be rendered as its four separate parts when the zero-width joiners are not honored.

Ideally we would be "looking for what the user perceives as 'characters' on a screen". It seems our editor already handles this with selections and caret placement. Our editor also handles "Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞" as I would expect.

It seems one (backwards) way to gather this number may be to place a caret at the beginning of the string and move the caret to the end, counting how many moves forward were needed.

I'm not sure it matters if 👪 is 1 or 7 characters, as long as we are internally consistent. Array.from("👪").length == 1, so in my book that is a single-character string? For bug 1543449, the context is about distinguishing a likely password from a probably-not-a-password by length. As long as it agrees with the number of "•" in the password field we should be good.

(In reply to Sam Foster [:sfoster] (he/him) from comment #10)

In the context of this bug, yes, it matters, because an unmasked password such as 🧀 looks and acts like a single character in terms of selection and caret movement, but when masked it is displayed as two characters. Unfortunately Array.from().length doesn't work for all emoji: Array.from("❤️").length returns 2 (U+2764 U+FE0F).

(In reply to Sam Foster [:sfoster] (he/him) from comment #10)

> I'm not sure it matters if 👪 is 1 or 7 characters, as long as we are internally consistent. Array.from("👪").length == 1, so in my book that is a single-character string?

Well, 👪 is a single character in Unicode (U+1F46A), but others such as 👩‍👩‍👦‍👦 are not:

Array.from("👪").length   // 1
Array.from("👩‍👩‍👦‍👦").length   // 7

(In reply to (behind on needinfos) Jared Wein [:jaws] (please needinfo? me) from comment #9)

> Yeah, I did some reading earlier about this problem and understand that it can be very tricky. Not all environments will combine 👩‍👩‍👦‍👦; this emoji may be rendered as its four separate parts when the zero-width joiners are not honored.

Right; it's encoded as 4 emoji characters, with three zero-width joiners in between (hence the array length of 7); how it displays is dependent on the font being used. So to figure out how many "graphical characters" the user would see, you'd have to actually shape/render the text (making sure you're using the same font). You can't get that answer from the string alone without associated font information.

For purposes of showing a masked password field, though, I'm not sure it matters if 👩‍👩‍👦‍👦 appears as •••••••. It's true it might be slightly surprising to a user who doesn't know that many emojis are encoded as character sequences, but it accurately reflects the number of characters that are actually present. (I do think it's wrong for it to appear as •••••••••••, with one bullet per UTF-16 code unit; but fixing that just requires counting characters rather than the individual surrogates.)
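
Concretely, the fix would be to size the mask by the character count instead of the code-unit count, along the lines of (a sketch only, using the hypothetical CountUnicodeChars wrapper from above):

   if (!value.IsEmpty() && IsPasswordTextControl()) {
     // One bullet per Unicode character, not per UTF-16 code unit.
     nsTextEditRules::FillBufWithPWChars(&value, CountUnicodeChars(value));
   }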

This isn't much different from a case like श्री in Hindi, which visually has two glyphs, but the first one is really a ligature of three underlying characters and so in masked form श्री would appear as ••••. On my Hindi keyboard, श्र is a single keystroke, though it generates three codepoints, and pressing backspace deletes them one by one, revealing the underlying encoding.

Similarly, I can insert 👩‍👩‍👦‍👦 as a single unit (e.g. from the emoji character-picker palette), but if I then press backspace, it starts deleting individual parts of the family group. So presenting it as ••••••• when masked may be reasonable.

This will be (at least) partially fixed by bug 1548389. It'll handle non-BMP characters correctly, but it will keep treating a complicated grapheme cluster as an array of Unicode characters.

I think the new behavior is the best option, because if we treated a complicated grapheme cluster as a single character, it might not have worked with some web apps that limit the length of password data.

Depends on: 1548389
Severity: normal → S3
Duplicate of this bug: 1579776