Closed Bug 702296 Opened 13 years ago Closed 8 years ago

Textarea tags with "maxlength" attributes allow too many characters to be posted when value contains hard newlines

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set
normal

RESOLVED INVALID

People

(Reporter: emmecinque, Unassigned, Mentored)

Details

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Build ID: 20111008085652
Steps to reproduce: When a "textarea" tag has a "maxlength" attribute and the text typed into it includes hard newlines (that is, newlines explicitly typed into the field), the length of the value is computed as if each newline were a single character (and, indeed, that's what the "value" of the element reports). However, when the containing form is submitted, the value of the textarea is posted with each newline as a pair of characters (a CR and an LF, instead of just an LF).
Actual results: Here is a trivial jsfiddle to illustrate: http://jsfiddle.net/Pointy/LqMGH/ If one types several newlines into the textarea, carefully counting the characters, the textarea stops accepting input at the point where the "value" attribute is 10 characters long. However, when the form is submitted, it's clear that the newlines are sent as "%0D%0A" pairs.
Expected results: Bug 670837 and bug 590554 are somewhat similar to this, but in my opinion this is a more basic issue. Firefox has *always* reported the "value" of a textarea differently than the actual posted value. Some client-side libraries (jQuery at least) normalize all browsers to mimic that (mis-)behavior. Now that "maxlength" is supported, however, it seems like a UX problem for a form to impose the input length limit only to have server interactions (that want to preserve hard newlines) respond with errors.
Component: General → Editor
OS: Linux → All
Product: Firefox → Core
QA Contact: general → editor
Hardware: x86_64 → All
As far as I understand it, the specs say we should send CRLF when the element is submitted [1]. I would tend to say this should be marked as INVALID but I would like someone to confirm :) [1] http://www.whatwg.org/specs/web-apps/current-work/multipage/the-button-element.html#concept-textarea-api-value (see the difference between API value and element's value)
Component: Editor → DOM: Core & HTML
QA Contact: editor → general
Whiteboard: INVALID?
Version: 7 Branch → Trunk
Or a WHATWG bug?
I believe that the browser definitely should send CRLF to the server. That's not the issue. The issue is that the "value" of the element, and the internal checks done to verify that the contents of the field do not exceed the value of the "maxlength" property, count characters *as if* the browser sent back just LF, not CRLF. I question the value of a "maxlength" feature, in other words, that does not reliably limit the number of characters in the value of a form parameter. Consider the user experience here. A form may include messaging to tell me that there's a limit to the number of characters I might type into a form field. I type up to the limit (and hit "Enter" a couple times along the way), and sure enough the form stops allowing more characters when the limit is hit. I submit the form, and BOOM an error comes back from the server (where an additional field length check is made) and I'm told that I've exceeded the maximum length for the field.
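To make the mismatch described above concrete, here is a minimal sketch in Python (not code from this bug) that models the two counts. The function name `submitted_length` is hypothetical, and the sketch assumes the value seen by script contains only bare LF line breaks and that no wrap="hard" processing applies.

```python
def submitted_length(api_value: str) -> int:
    """Length of the value once every bare LF is sent as a CRLF pair.

    Assumption: api_value uses a single LF per line break and
    contains no CR characters of its own.
    """
    return len(api_value) + api_value.count("\n")


typed = "ab\ncd\nef\ng"          # 7 letters + 3 hard newlines
print(len(typed))                # 10: what a maxlength of 10 sees
print(submitted_length(typed))   # 13: what the server receives
```

With a maxlength of 10, the field stops accepting input at 10 characters, yet the server-side check sees 13.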
Mike, maxlength is defined in the spec as operating on length in codepoints (though there is also talk of having it work on grapheme clusters). So if your server is doing a naive max length on bytes it'll break all over the place for non-ASCII characters (for which a single codepoint can end up as up to 4 bytes depending on the encoding). Or is your server converting the input to UTF-16 and doing the length check on UTF-16 codepoints? If so, why is the length in terms of those a useful quantity for you? In any case, this issue really does need to be raised with the HTML spec. What we're doing right now is what the spec says to do.
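As a rough illustration of why the unit of measurement matters, the following Python sketch reports one string's length three ways. The helper name `lengths` is hypothetical and the sample strings are chosen only for illustration; none of this is taken from a browser or server implementation.

```python
def lengths(text: str) -> dict:
    """Return the length of text in three common units."""
    return {
        "code_points": len(text),
        # Each UTF-16 code unit is 2 bytes in the utf-16-le encoding.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


print(lengths("\u00e9"))      # e-acute: 1 code point, 1 code unit, 2 bytes
print(lengths("\U0001F600"))  # emoji: 1 code point, 2 code units, 4 bytes
print(lengths("\r\n"))        # CRLF: 2 in every unit
```

Note that CRLF counts as 2 in every one of these units, which is the case the reporter is concerned with.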
Are you seriously suggesting that the two-character sequence CRLF is a single codepoint? A server correctly accepting UTF-8 encoding would have exactly the same problem as a 7-bit ASCII server in this particular case. The CR and LF characters are two separate codepoints in every character set I know of.
> Are you seriously suggesting that the two-character sequence CRLF is a single codepoint?
No, I'm trying to understand what your server-side length check is trying to do and how it actually works.
The spec was recently updated on this matter and I believe reflects current best effort at compatibility across all the various constraints. Let me know if there's any specific questions on it.
OK, well that's fair, but the thing is that the most locale-aware, UTF sensitive server-side code in the world that chooses not to flatten CRLF into a single-character sequence is going to consider a CRLF pair to be two distinct characters. That's because it **is** a two-character (two-codepoint) sequence. I'm perfectly OK with the browser sending back hard line breaks as CRLF pairs. That's what IE does, so there's no reason to deviate from that. The issue is that the "maxlength" check does not acknowledge that hard returns in input text will in fact be delivered to the server as two codepoint characters. Instead, it treats hard returns as if they were to be delivered to the server as a single codepoint. Note that in Internet Explorer, the "value" attribute of a textarea tag **does** treat hard line breaks as 2-codepoint sequences. In other words, the length of the string value of the "value" attribute in IE matches the length of the field value as posted back to the server. I don't know whether IE9 supports "maxlength" properly on textarea tags, but if it does, I strongly suspect that it works properly in that regard - that is, that it treats hard returns as contributing two characters to the length of the value.
Mike, I'd still like to understand what sorts of length checks a locale-aware server would be performing and why. Understanding that would help a lot in terms of understanding what the spec should or should not say.
At the server end, I can offer these practical observations. Legacy database systems (or, for that matter, brand new ones) often have fixed-length fields. Thus, a "comments" or "notes" field that is likely to be presented as a textarea in a form may have a maximum size of, say, 1024 characters. Now whether that's really 1024 bytes, or 1024 16-bit words, or some "smart" 1024-codepoint limitation is up to the database. I'm also not a database expert, but in my experience database systems are not particularly "modern" with regard to string length. I imagine that a server-side system that's really concerned with accommodating input in a wide variety of encodings would be careful to either maintain unlimited-length fields, or else pad such fields with additional capacity beyond the nominal maximum. I suspect that many server-side applications are *not* particularly smart about accommodating a variety of input encodings, but that's not necessarily a failure. There are perfectly good reasons for applications to anticipate dealing with a restricted audience, an example being financial applications designed for particular legal domains. Thus I don't think it's realistic to expect *all* server applications to be prepared to handle input encoded in unexpected ways. What I'm saying is that there are many server applications that are designed to deal with strictly 8- or 16-bit encodings, or at least I'd bet that there are (in fact I'd bet that the vast majority are so designed). Thus the issue of dealing with problematic string lengths from UTF-encoded input is simply not undertaken by a good deal of server code. For such applications, that 1024 byte field has a real maximum length of 1024 8-bit (or, possibly, 16-bit) characters. 
For such systems, "stretchy" UTF-8 input will fail to work properly for reasons besides string length computation: the application, for example, is likely to regurgitate user input in some fixed 8-bit encoding anyway, rendering multi-byte codepoints as garbage.
It seems to me that if the server is assuming that there will never be non-ASCII characters in a "comments" or "notes" field, then that server is just a disaster waiting to happen...
That's true. Thank goodness that both CR and LF are ASCII characters :-)
My point is that any assumptions about correlation between textarea .length values and byte counts are probably bogus... In any case, this is an issue to take up with the spec folks.
Well, in this particular case - the miscounting of CRLF in the "maxlength" rules - there is simply no question that the browser is misbehaving. Yes, things get complicated when we start talking about multi-byte code points, but that's not at all what this is about. The browser here is simply and quite clearly doing the wrong thing. To put it another way, it is inconceivable that any standards body would ever declare that a CRLF pair should be treated as a single codepoint. It makes no sense at all, as the characters have distinctly different meanings, ones that have had currency for about 50 years now.
The standard says that maxlength works on the .value DOM property of the textarea. Newlines in .value are represented as a single LF. Seriously, what we're doing right now is _exactly_ what the standard says. If you think it needs to change, you need to get the standard changed.
Well, so be it then. I'll just point out, as an actual web developer, that the current behavior renders the "maxlength" feature worthless, in practical terms. Like, I will literally not use it at all, instead using a JavaScript shim that correctly computes the length. I guess if that's what the WHATWG has come up with you're basically stuck, however.
Wait a second - as I read it, the HTML 5 spec clearly states that the "value" attribute is the "raw value" with the following replacement (quoting the spec): "every occurrence of a U+000A LINE FEED (LF) character not preceded by a U+000D CARRIAGE RETURN (CR) character, by a two-character string consisting of a U+000D CARRIAGE RETURN U+000A LINE FEED (CRLF) character pair."
Hmm. There's some confusion here between "value" and "API Value". Mounir, can you please take a close look?
Assignee: nobody → mounir
Status: UNCONFIRMED → NEW
Ever confirmed: true
So, is anybody going to do something about this? It's still broken. Note that the bug was fixed (in an odd sort of way) in WebKit: https://bugs.webkit.org/show_bug.cgi?id=74686 Firefox 26 still allows 10 hard newlines in a textarea with "maxlength" set to 10, which means that 20 characters will be sent to the server. Internet Explorer 11 gets it wrong, however.
If I'm reading the spec correctly, the only time when we are supposed to count a new line as two code units is when the user enters such a value directly into the textarea, hence affecting its raw value. Hence, I think Firefox and IE's behavior here conforms to the spec, and the WebKit fix quoted in comment 19 actually violates the spec. Also note that the maxlength handling only applies to text that is entered by the user, and other kinds of changes to the value of a textarea (for example, manipulating it from script) will not be subject to maxlength restrictions, so the server side needs to perform its own validation anyway.
OK well this is my last word on the topic, as it's clear that us web developers are doomed to ignoring any native implementations of textarea maxlength as unreliable anyway, but I'll recap: The "maxlength" attribute - on *any* element - is used to help enforce an actual server-side limit so that a user doesn't unwittingly exceed the limit and submit a form that returns with an error. This has nothing to do with programmatic manipulation of the form element value; it's all about making the keyboard stop adding text when the size limit is reached. The current Firefox implementation of maxlength on textarea elements fails. Why? Because it allows a user to continue typing beyond the declared field length limit. The server, which as a matter of course is also checking the field length, will respond with an error if the user exceeded the length. I don't want that sort of bad user experience on any site for which I'm responsible, so I'll be using my own JavaScript code to correctly detect when the number of characters that will be POSTed back to the server has reached the limit that will actually be accepted.
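The reporter's JavaScript shim is not shown in this bug. The counting rule such a shim would need can be modeled as below, in Python rather than JavaScript, with hypothetical names (`truncate_to_limit`, `limit`). It assumes each LF costs two characters on submission, everything else costs one, and wrap="hard" is not in play.

```python
def truncate_to_limit(api_value: str, limit: int) -> str:
    """Keep the longest prefix whose submitted length fits within limit.

    Assumption: each LF is submitted as CRLF (cost 2); every other
    character costs 1. A newline is never split: if only one unit of
    budget remains, the newline is dropped entirely.
    """
    used = 0
    kept = []
    for ch in api_value:
        cost = 2 if ch == "\n" else 1
        if used + cost > limit:
            break
        kept.append(ch)
        used += cost
    return "".join(kept)


print(repr(truncate_to_limit("a\nb\nc", 5)))   # 'a\nb'
print(len(truncate_to_limit("\n" * 10, 10)))   # 5
```

Under this rule a field limited to 10 accepts at most 5 hard newlines, so the posted value never exceeds 10 characters.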
I agree with Mike, this is a big issue. If Firefox posts both characters but counts the pair as only 1 when enforcing maxlength, that is a huge failure to enforce maxlength properly. The only browser I have seen do this check correctly so far is Chrome. IE seems to suffer from the same issue. Like Mike, I am forced to spend more time writing JavaScript to correctly enforce maxlength on the client side, when the browser should be doing this. Regardless of how the HTML spec may be written, if the browser is posting 2 characters for newline then, logically, 2 characters should be counted when enforcing maxlength.
> Regardless of how the HTML spec may be written
That's not how specs work. If you think the spec is wrong, you should be sending feedback to that effect to the spec authors and asking them to change the spec, not going around telling people to just ignore the spec.
In this case the spec says that the value of a textarea should have all linefeed characters not preceded by a carriage return modified to include the carriage return, so Firefox is wrong here.
Flags: needinfo?(mounir)
Mike, here's what the spec says for textarea:
> For historical reasons, the element's value is normalised in three different ways for three different purposes. The raw value is the value as it was originally set. It is not normalized. The API value is the value used in the value IDL attribute. It is normalized so that line breaks use U+000A LINE FEED (LF) characters. Finally, there is the form submission value. It is normalized so that line breaks use U+000D CARRIAGE RETURN U+000A LINE FEED (CRLF) character pairs, and in addition, if necessary given the element's wrap attribute, additional line breaks are inserted to wrap the text at the given width.
and here's what the definition of maxlength says:
> User agents may prevent the user from causing the element's value to be set to a value whose code-unit length is greater than the element's maximum allowed value length.
where "value" links to http://www.whatwg.org/specs/web-apps/current-work/#concept-fe-value
Unfortunately, it doesn't make it 100% clear which of the three normalizations it's talking about here, but all the other places that reference "value" without clarifying are talking about the raw value, as comment 20 points out.
One other note on comment 21: note that UAs are not even required to prevent the user from entering more than maxlength characters. It's not even a "should" requirement; it's a "may" requirement...
Now I agree this is silly, and I think I even agree that it might make sense to have maxlength be based on the form submission value (including the @wrap processing, right?), but that's just not what the spec says to do right now.
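The three normalizations quoted above can be sketched roughly as follows. This is a simplified Python model, not browser code: the function names are hypothetical, a lone CR is assumed to count as a line break, and the wrap="hard" line insertion is deliberately left out.

```python
import re

# A line break is CRLF, a lone CR, or a lone LF (CRLF is tried first).
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def api_value(raw: str) -> str:
    """Model of the API value: every line break becomes a single LF."""
    return _LINE_BREAK.sub("\n", raw)


def submission_value(raw: str) -> str:
    """Model of the form submission value: every line break becomes CRLF.

    Simplification: wrap="hard" processing is not modeled.
    """
    return _LINE_BREAK.sub("\r\n", raw)


raw = "a\r\nb\nc\rd"                 # raw value with mixed line breaks
print(len(raw))                      # 8
print(len(api_value(raw)))           # 7
print(len(submission_value(raw)))    # 10
```

The same raw value therefore has three different lengths, which is why it matters which of them maxlength is measured against.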
Flags: needinfo?(mounir)
I agree it's vague in the spec. Sticking with behavior that's plainly not useful, however, therefore seems strange, given that the spec *could* be interpreted otherwise. (The Chrome people seem to agree.)
I don't think that's an accurate reading of the spec as written (though I agree it's confusing). In particular, see the part that says "The element's value is defined to be the element's raw value with the following transformation applied", which links to the same #concept-fe-value. The only place where I said "value" but meant "raw value" was in the reset algorithm (but that still linked to #concept-textarea-raw-value) (now fixed). I've tried to clean it up a bit. If the spec isn't Web-compatible, do let me know. It's worrying that there's multiple browsers that disagree with it.
> which links to the same #concept-fe-value.
Ah, COMEFROM strikes again. The link should be the other way around: the other places that reference "value" should be linking to the place where "value" is actually defined... I agree that this part makes it clear that maxlength is determined based on the form submission value in the spec, thank you.
I can't find a single UA that actually implements the spec as written, though. For example, loading http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=2964 and typing "1 2 3 4 5 " in the textarea shows that every single browser (tested Firefox, IE, Safari, Chrome, Presto Opera) ignores the @wrap when enforcing maxlength. Of course Presto Opera also ignores @maxlength on textarea altogether, and IE has the same behavior as us in that it doesn't treat newline as 2 chars for maxlength purposes, at least in this testcase...
We could try making the change to count newline as 2 chars for maxlength. Given that no browser does it, I don't think we should include the newlines that wrap="hard" includes in maxlength.
Whiteboard: INVALID?
And Ian, thank you for cleaning up the spec text here!
(In reply to Boris Zbarsky [:bz] (on PTO, reviews are slow) from comment #23)
> > Regardless of how the HTML spec may be written
>
> That's not how specs work.
>
> If you think the spec is wrong, you should be sending feedback to that effect to the spec authors and asking them to change the spec, not going around telling people to just ignore the spec.
My point is it does not make sense to implement the maxlength check as such, even if the spec had written it that way, which it did not. Having the maxlength attribute check one value for a length and post another value with a different length is confusing and makes it harder for us to define such constraints on the backend. It is good to see that this issue will be looked at. Thanks!
Nick, note that:
1) It's completely optional for a UA to do anything at all with maxlength in terms of preventing user input. They only do it to make the user experience better.
2) Every single UA posts data that's longer than the maxlength when wrap="hard" is used. This is not an accident: the behavior you want requires expensive (O(N) in the text length) operations on every keystroke and users get annoyed when they type and the display doesn't keep up.
So you should just assume the maxlength is a nice convenience for users but not rely on it in any way and have fallback UI for the cases when the value is longer than your maxlength. We'll do what we think is best for our users, which may involve tradeoffs between a "may" part of the spec, memory usage, and performance...
(In reply to comment #28)
> We could try making the change to count newline as 2 chars for maxlength. Given that no browser does it, I don't think we should include the newlines that wrap="hard" includes in maxlength.
That sounds good to me. We should probably fix the spec to not ask for @wrap processing to be taken into account here too, right?
Unless we think it's likely to be web-compatible and we can convince every single browser to change behavior, yes.
(In reply to Boris Zbarsky [:bz] (on PTO, reviews are slow) from comment #33)
> Unless we think it's likely to be web-compatible and we can convince every single browser to change behavior, yes.
(In reply to Boris Zbarsky [:bz] (on PTO, reviews are slow) from comment #31)
> Nick, note that:
>
> 1) It's completely optional for a UA to do anything at all with maxlength in terms of preventing user input. They only do it to make the user experience better.
>
> 2) Every single UA posts data that's longer than the maxlength when wrap="hard" is used. This is not an accident: the behavior you want requires expensive (O(N) in the text length) operations on every keystroke and users get annoyed when they type and the display doesn't keep up.
>
> So you should just assume the maxlength is a nice convenience for users but not rely on it in any way and have fallback UI for the cases when the value is longer than your maxlength. We'll do what we think is best for our users, which may involve tradeoffs between a "may" part of the spec, memory usage, and performance...
That's fine and understandable, and as you said since no other browser handles the @wrap="hard" logic in the algorithm, it might not be a major issue to the community if it were not implemented. However, the first part would be a nice to have, especially since there is logic already in place that tries to do this calculation. I think most developers have already gone the route of writing code on the UI side to take care of this, especially since this is a newer HTML feature, but having native browser support would be preferable in the long run.
I hope somebody blinks (ok I regret this pun) and congeals to some kind of consistency here. I’ve added some additional documentation of current cross-browser behavior here: https://github.com/scottjehl/Device-Bugs/issues/73
I'm happy to mentor someone who wants to fix the \n part of this, I guess. I still think anyone relying on maxlength for anything server-side is just setting themselves up for failure...
Assignee: mounir → nobody
Mentor: bzbarsky
And note that I still think the "must scan all the text in the textarea on every keypress" bit is very unfortunate.
@Boris nobody's *relying* on it; at least, I'm certainly not. The point is to provide a user experience at the client that's consistent with what the server value-checking code will do. If the client has maxlength set in good faith to what the server will actually impose, and the browser does something that does not correspond with what it actually *sends* to the server, then that user experience goal is not satisfied. That is, if the browser lets me type and type up to the maxlength of 100 characters, but what gets sent to the server is 103 characters and the server returns an error, then the user has been abused.
I understand all that. But keep in mind that you lose anyway as soon as someone types pretty much anything non-ASCII.
Well yes, that's certainly true, and quite regrettable.
I guess that means this is now INVALID, yes?
I'm inclined to agree (while wondering what the point of "maxlength" is anyway).
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID