Open Bug 670837 Opened 11 years ago Updated 4 years ago

maxlength shouldn't count one non-BMP character as two characters

Categories

(Core :: DOM: Editor, defect)

Not set
normal

People

(Reporter: emk, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intl)

Steps to reproduce:
1. Enter data:text/html,<input maxlength=3> into the location bar.
2. Type (or copy and paste) 
Oops, Bugzilla didn't accept non-BMP characters...

Steps to reproduce:
1. Enter data:text/html,<input maxlength=3> into the location bar.
2. Type (or copy and paste) &#x20BB7;野家 into the text box.
Expected result:
&#x20BB7;野家
Actual result:
&#x20BB7;野
Per the HTML5 spec, maxlength is the maximum allowed value length in code points (not UTF-16 code units).
Chrome works as expected.
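
To make the discrepancy in the steps above concrete, here is a minimal sketch of the two ways of counting the test string. It only illustrates how JavaScript strings measure length; it is not the editor's actual code, and the variable names are illustrative.

```javascript
// U+20BB7 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
// The string is the three characters from the steps to reproduce.
const strValue = "\u{20BB7}\u91CE\u5BB6";

// String.prototype.length counts UTF-16 code units.
const strCodeUnits = strValue.length;

// The string iterator walks code points, so spreading counts code points.
const strCodePoints = [...strValue].length;

// Cutting at 3 code units keeps the surrogate pair plus one more character,
// which matches the "Actual result" above.
const strTruncated = strValue.slice(0, 3);

console.log(strCodeUnits, strCodePoints, [...strTruncated].length); // 4 3 2
```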
For what it's worth, WebKit's maxlength impl counts glyph clusters.  We should do the same, but we don't have good infrastructure for it....  We probably need a bug on said infrastructure, and also some conversation between Mounir and whoever implements it on how it needs to work for performance to not be hosed.
And I think the HTML5 spec is sort of wrong here.  Maxlength should not count combining marks and the like, imo.
That depends why you want maxlength in the first place. See http://lists.w3.org/Archives/Public/www-international/2011AprJun/0119.html

A more serious case than the STR in comment 1 is entering 野家&#x20BB7; -- this leaves an unpaired surrogate in the input field.
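
A minimal sketch of why this ordering is worse, assuming truncation is a plain cut at 3 UTF-16 code units (the helper name `hasLoneSurrogate` is made up for illustration and is not an editor API):

```javascript
// Same characters, but the non-BMP character comes last.
const reordered = "\u91CE\u5BB6\u{20BB7}";

// A naive cut at 3 code units splits the surrogate pair of U+20BB7.
const naiveCut = reordered.slice(0, 3);

// Hypothetical helper: reports whether a string contains a surrogate
// code unit that is not part of a well-formed high+low pair.
function hasLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const next = s.charCodeAt(i + 1); // NaN at the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // well-formed pair, skip the low surrogate
        continue;
      }
      return true; // high surrogate without a low surrogate
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) return true; // stray low surrogate
  }
  return false;
}

console.log(hasLoneSurrogate(reordered)); // false
console.log(hasLoneSurrogate(naiveCut));  // true
```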
The DOM doesn't do anything with maxlength except the attribute reflection, even validation has been disabled before shipping Firefox 4. This should be fixed in the editor.
FWIW, we might change the maxlength behavior so that it no longer prevents typing but just makes the field invalid if the text length is greater than maxlength, see bug 613016.
Component: DOM: Core & HTML → Editor
QA Contact: general → editor
Version: unspecified → Trunk
(In reply to comment #3)
> For what it's worth, WebKit's maxlength impl counts glyph clusters.  We
> should do the same....

I don't think I agree with this, in general. For cases like Latin letters with accents, it seems a reasonable interpretation (although as Simon mentions, this depends on the use case); however, for Indic scripts where a "cluster" may consist of multiple conjoined consonants, plus a vowel mark, plus additional marks such as nasalization, I don't believe it makes any sense (to users) to count the "length" of a string in terms of glyph clusters; they are well aware of the constituent characters within such clusters and would expect to count them separately.

Whether "length" (for the purposes of maxlength) is most usefully measured in terms of Unicode characters or UTF16 code units is a tricky question, but given that Javascript and DOMStrings expose the UTF16 encoding form, I think it would be most consistent for maxlength to be expressed as a count of UTF16 units, too.
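
The three candidate measures discussed so far can be compared side by side. This is a rough sketch that assumes an environment with `Intl.Segmenter`; the function name `lengths` is illustrative, and the grapheme count for Indic conjuncts depends on the Unicode version the engine ships, so no expected value is given for that case.

```javascript
// Illustrative helper: measure a string three ways.
function lengths(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: s.length,                             // UTF-16 code units
    codePoints: [...s].length,                       // Unicode code points
    graphemes: [...segmenter.segment(s)].length,     // grapheme clusters
  };
}

// Latin letter plus combining acute accent: two code points, one cluster.
console.log(lengths("e\u0301"));

// One non-BMP character: two code units, one code point, one cluster.
console.log(lengths("\u{20BB7}"));

// Devanagari consonant + virama + consonant + vowel sign. How many
// clusters this yields varies with the engine's segmentation rules.
console.log(lengths("\u0915\u094D\u0937\u093F"));
```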
(In reply to comment #7)

> Whether "length" (for the purposes of maxlength) is most usefully measured
> in terms of Unicode characters or UTF16 code units is a tricky question, but
> given that Javascript and DOMStrings expose the UTF16 encoding form, I think
> it would be most consistent for maxlength to be expressed as a count of
> UTF16 units, too.

Either way, we shouldn't truncate in mid-supplementary character. If maxlength is interpreted as UTF-16 code units, "野家&#x20BB7;" with a maxlength of 3 should give "野家"
(In reply to comment #8)
> Either way, we shouldn't truncate in mid-supplementary character. If
> maxlength is interpreted as UTF-16 code units, "野家&#x20BB7;" with a
> maxlength of 3 should give "野家"

Definitely. We should never allow unpaired surrogates.

(Well, I don't think we can prevent people creating them via JS string-munging. So our code needs to handle them robustly anywhere they occur. But wherever possible, we should prevent them arising.)
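
As a sketch of what "never truncate in mid-supplementary character" could look like under each interpretation of maxlength (both function names are hypothetical and this is not the editor's implementation):

```javascript
// Truncate to at most `max` UTF-16 code units, backing off by one unit
// if the cut would land between a high and a low surrogate.
function truncateCodeUnits(s, max) {
  if (s.length <= max) return s;
  let end = max;
  const before = s.charCodeAt(end - 1); // NaN when end is 0
  const after = s.charCodeAt(end);
  const splitsPair =
    before >= 0xD800 && before <= 0xDBFF &&
    after >= 0xDC00 && after <= 0xDFFF;
  if (splitsPair) end -= 1; // drop the high surrogate too
  return s.slice(0, end);
}

// Truncate to at most `max` code points. The string iterator never
// splits a well-formed surrogate pair.
function truncateCodePoints(s, max) {
  return [...s].slice(0, max).join("");
}

const tail = "\u91CE\u5BB6\u{20BB7}"; // non-BMP character last
const head = "\u{20BB7}\u91CE\u5BB6"; // non-BMP character first

console.log(truncateCodeUnits(tail, 3).length);  // 2
console.log(truncateCodeUnits(head, 3).length);  // 3
console.log(truncateCodePoints(head, 3) === head); // true
```

Under the code-unit interpretation the first call keeps only the two BMP characters, matching the result proposed in comment 8.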
It really sounds like there are two separate issues here:

1)  Fix editor to not truncate in mid-surrogate-pair.
2)  Spec discussion that needs to happen (and I'm probably the wrong person to
    drive it).

Who's willing to take on #2?
I have created a pen that makes this slightly easier to check:

https://codepen.io/thany/pen/zmRZKM

My current findings:
Firefox 64 has this bug
Chrome 72 has this bug
Edge 17 does NOT have this bug