Closed Bug 194498 Opened 19 years ago Closed 17 years ago
non-breaking space turned into space
User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.3b) Gecko/20030221
Build Identifier: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.3b) Gecko/20030221

If I enter a non-breaking space character (Alt-0160) in an input type=text box, the ASP code that handles the form receives an ordinary space character. See also the test HTML code below under Steps to Reproduce.

Reproducible: Always

Steps to Reproduce:
1. Write this simple HTML page:
   <html><head><title>Try a non-breaking space.</title></head>
   <body><form method='get' action='./Test0160.html'>
   <label>Enter non-breaking space Alt-0160</label>
   <input type='text' name='inbox' />
   <button type='submit'>Submit</button>
   </form></body></html>
2. Enter Alt-0160 in the entry box and push Submit.
3. Look at the address bar.

Actual Results: The address ends in inbox=+
Expected Results: The address should end in inbox=%A0

Same trouble with version 1.2.1 (20021130) and on Linux with 2001091712.
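For readers who want to see the two query strings side by side, here is a minimal Python sketch of the expected and the actual encoding. It is not Mozilla code. The helper name `encode_field` is hypothetical, and the ISO-8859-1 default is an assumption inferred from the expected `%A0` (the test page declares no charset).

```python
from urllib.parse import quote_plus

NBSP = "\u00a0"  # NO-BREAK SPACE, what Alt-0160 types on Windows


def encode_field(name, value, charset="iso-8859-1"):
    """Encode one form field the way a GET submission would (hypothetical helper)."""
    return f"{name}={quote_plus(value, encoding=charset)}"


print(encode_field("inbox", NBSP))                   # inbox=%A0 (expected result)
print(encode_field("inbox", " "))                    # inbox=+   (what the report observes)
print(encode_field("inbox", NBSP, charset="utf-8"))  # inbox=%C2%A0
```

The last line shows that under UTF-8 the same character would be submitted as `%C2%A0`, so the exact expected value depends on the form's charset.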
Assignee: form → form-submission
Component: Layout: Form Controls → Form Submission
QA Contact: desale → ashishbhatt
*** Bug 206925 has been marked as a duplicate of this bug. ***
This affects TEXTAREA and single-line "text" INPUT fields, but not "hidden" form elements. I have set up a testcase under <http://bugzilla.ohrbelag.de/mozilla-bug-06.php>.

Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko/20030624
Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6
Confirmed for Mozilla 2003121816 on Win2k.
Status: UNCONFIRMED → NEW
Ever confirmed: true
I can't imagine what was in the developer's mind when he wrote a nsPlainTextSerializer::Output which converts U+00A0 (NO-BREAK SPACE) to U+0020 (SPACE) and does nothing else! It doesn't make any kind of sense, especially if other Unicode space characters aren't treated in the same way: this code just *has* to be wrong. The attached patch (somebody please review!) just removes this absurd transformation.
All right, now that I made this patch, what's the next step? I don't have CVS checkin permission, obviously. And I'm sure nobody who does is watching this bug, since it's assigned to <firstname.lastname@example.org> which is a euphemism for "nobody". So how do I attract attention?
Comment on attachment 140944: Remove replacement of nbsp by space
How do I know whom to ask to review the patch? (The help is most unclear.)
Attachment #140944 - Flags: review?
This affects the editor of Mozilla Mailnews too. I cannot write any mail which contains the character U+00A0. I do not think that Form Submission is the right component. Editor: Core should be better. (Note that I have put several non-breaking spaces in this comment, using the browser Konqueror.)
(In reply to comment #8)
> This affects the editor of Mozilla Mailnews too. I cannot write any mail which contains the
> character U+00A0.
The Mozilla location bar is affected too. If I type a non-breaking space in the location bar, it is turned into a simple space.
> (Note that I have put several non-breaking spaces in this comment, using the browser
> Konqueror.)
Well... Not convincing. :-\
Moving bug to "Editor: Core" component, at Laurent Rineau's suggestion.
Assignee: form-submission → mozeditor
QA Contact: ashishbhatt → bugzilla
Why should I think that the original comment is wrong? If it isn't, there are plaintext consumers out there that will be confused. What happens if I copy and paste into a unix command line, for example?
It won't work right on a Unix command line; it won't display right in many plaintext mailers. Non-breaking space is something that doesn't exist in a lot of fonts. The biggest reason for the conversion is that consecutive spaces in the mozilla editor are turned into runs of <space><nbsp><space><nbsp>. So if you go into the plaintext mail editor and type Hello<sp><sp><sp><sp>world, if we didn't do the conversion, it would send as Hello<sp><nbsp><sp><nbsp>world, and lots of recipients would see it as Hello ? ?world, or instead of question marks they might see an A-umlaut or a sequence like Hello \0xa0 \0xa0world. It would be nice if we had some way of distinguishing non-breaking spaces inserted by our editor from non-breaking spaces deliberately typed by the user. Suggestions for how to represent this in the DOM would be welcomed. Simply emitting non-breaking spaces and breaking all the plaintext people is not a viable option.
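To make the trade-off described above concrete, here is a small Python model of it. This is only a sketch, not the actual editor or nsPlainTextSerializer code. Both function names are hypothetical, and the rule "every second space in a run becomes a non-breaking space" is an assumption drawn from the <space><nbsp><space><nbsp> example.

```python
NBSP = "\u00a0"


def editor_representation(text):
    """Model the editor: every second space in a run of spaces becomes NBSP (assumed rule)."""
    out = []
    run = 0  # number of consecutive spaces seen so far
    for ch in text:
        if ch == " ":
            out.append(NBSP if run % 2 == 1 else " ")
            run += 1
        else:
            out.append(ch)
            run = 0
    return "".join(out)


def serialize_plain_text(text):
    """Model the serializer behaviour under discussion: every NBSP becomes a space."""
    return text.replace(NBSP, " ")


stored = editor_representation("Hello    world")
print(stored == "Hello" + " " + NBSP + " " + NBSP + "world")  # True
print(serialize_plain_text(stored) == "Hello    world")       # True

# A non-breaking space typed on purpose is lost in the same step.
typed = "Prix" + NBSP + ": 5"
print(serialize_plain_text(editor_representation(typed)) == "Prix : 5")  # True
```

The last check is the behaviour this bug reports: the serializer cannot tell an editor-inserted non-breaking space from one the user typed deliberately.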
Well, I work in a French-speaking environment, we exchange a lot of plain text files, typically encoded in iso-8859-1, and these tend to contain lots of unbreakable spaces, because certain punctuation marks in French (colon, semicolon, question mark, exclamation mark) are supposed to be preceded by an unbreakable space. The result of Mozilla mangling these spaces to ordinary breakable spaces is catastrophic. And the unbreakable space isn't any less supported than any other iso-8859-1 character. (Well, I admit I did *once* find a font that was buggy as far as \240 was concerned, but it was on the Solaris 2.5 Openwin, which is a long time ago.)

But I won't argue. If people can't agree as to what the correct behavior is, the simple solution is to make that a configuration option. I'm willing to code this if nobody else is (although given my lack of Mozilla expertise I'd much prefer it if somebody else volunteered; it's probably a very simple job just to add one boolean option). Just tell me what the option should be called, and whether the patch would have a chance of being positively reviewed (assuming I did things correctly; and understanding of course that this is no promise, but I'd rather make sure in advance that it won't be rejected on principle).

As for a more acceptable long-term solution than replacing multiple spaces by unbreakable spaces in the editor, I might suggest replacing them by a span element with appropriate width (say, as many em's as the user typed spaces).
I'm confused about your example. How is the conversion to spaces catastrophic?
(I am a French speaker too.)
(In reply to comment #12)
> It won't work right on a Unix command line;
Can you explain what you mean by not working on a Unix command line? Obviously a nbsp will not be taken as a word separator.
> Non-breaking space is something that doesn't exist in a lot of fonts.
U+00A0 is in the iso-8859-1 charset (aka latin-1). I have never seen an iso-8859-1 font where U+00A0 does not exist.
> The biggest reason for the conversion is that consecutive spaces in the
> mozilla editor are turned into runs of <space><nbsp><space><nbsp>.
It seems that this is the problem. The editor creates non-breaking spaces to display consecutive spaces. As some (really?) us-ascii fonts do not know U+00A0, the plain text serializer has to transform them. This is not a good policy, because some latin-1 users will want to use it. In French, for example, a lot of punctuation marks should almost always be preceded by a non-breaking space. Actually, there are a lot of other spacing characters that should be used in well-written French: those between U+2000 and U+200B, but they are not in the latin-1 charset. I am part of the French translation project and we decided not to use them, because only a few fonts have them. Anyway, as far as U+00A0 is concerned, some latin-1 users would like to use it. Consecutive spaces, in the editor, should be replaced by something other than <sp><nbsp><sp><nbsp><sp>.
It's easy to say that the editor shouldn't use nbsp's, but that's a dead snake. The alternatives are even worse. Whatever the solution is here, changing the editor's nbsp usage internally is unlikely to be it.
Are you aware that Internet Explorer (and Opera for textareas only) doesn't replace nbsp by space and still nothing "breaks"? If this is an issue for plain text mail composition (it really shouldn't), couldn't that "feature" only be turned on in mail, and not in web forms?
(In reply to comment #14)
> I'm confused about your example. How is the conversion to spaces catastrophic?
Well, it can happen in two ways. One is when the user enters plain text in a textarea or text input field and takes care to enter unbreakable spaces before punctuation marks (or in various other contexts) and these are transformed to ordinary spaces: if the resulting text is, say, HTMLified and displayed by the server, line breaks can occur at the wrong places; I first noticed the problem because, when commenting on various people's blogs, my comments tended to be broken at places where I was certain I had placed unbreakable spaces (I enter them by typing compose-space-space under Unix, incidentally). The other problem is the reverse: assume the user copies some text from an HTML page into a plain text editor, and asks the editor to reformat the text (M-q under Emacs, say); then spaces which were (justly) unbreakable in the HTML document might have become ordinary spaces and might cause unwanted line breaks after reformatting.

Basically, unbreakable spaces are just a different character from the ordinary space. Turning one into the other is as bad as turning an X into a Y: then the whole point of having unbreakable spaces is lost. But I agree that the editor's "internal" use of unbreakable spaces, although bugware, can't be simply ignored. This is why I suggest having a configuration option as a short-term solution to the problem (for knowledgeable users who are aware of the existence of the unbreakable space and don't use the Mozilla editor, or at least don't type two spaces in a row or something), although a longer-term solution probably has to be found also, but it won't be as urgent.
First, the context from my point of view: fr.wikipedia.org recently switched from &nbsp; to the non-breaking space character, and we have more than 8K pages containing non-breaking spaces. This is annoying, since each time someone edits one of these pages with any Mozilla version on any platform, it breaks them. So we will either have to re-run a bot on all these pages to switch back to &nbsp;, or live with the fact that we need to periodically run a bot on all the broken pages and use regexps to replace spaces with the 0xA0 code. Both solutions have drawbacks: the first uglifies the HTML code, the second means we will get many spurious changes to review and the history log will be uglier. So I'll be very happy if there is any progress on this issue. If I understand correctly, the magic sequence used internally is 0x20 0xA0, so can't you change the 0xA0 to 0x20 only if it is preceded by a 0x20? Is this an acceptable workaround for other people looking at this problem?
(In reply to comment #19)
> So I'll be very happy if there is any progress on this issue. If I understand
> correctly the magic sequence used internally is 0x20 0xA0 so can't you change
> the 0xA0 to 0x20 only if preceded by a 0x20 ? Is this an acceptable workaround
> for other people looking at this problem ?
Well, I agree that U+0020 U+00A0 is not a sequence that may appear in French, for example.
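The workaround proposed in comment #19 can be sketched in a few lines of Python. This is an illustration only, not a patch against the serializer, and the function name `serialize_keep_user_nbsp` is hypothetical. It assumes the editor's magic sequence really is always 0x20 0xA0.

```python
import re

NBSP = "\u00a0"

# Match an NBSP only when the character just before it is an ordinary space.
_NBSP_AFTER_SPACE = re.compile(f"(?<= ){NBSP}")


def serialize_keep_user_nbsp(text):
    """Turn NBSP into a space only inside the 0x20 0xA0 sequence (assumed editor marker)."""
    return _NBSP_AFTER_SPACE.sub(" ", text)


# Editor-generated run: collapses back to four plain spaces.
print(serialize_keep_user_nbsp("Hello " + NBSP + " " + NBSP + "world") == "Hello    world")  # True

# NBSP typed before French punctuation: left untouched.
print(serialize_keep_user_nbsp("Prix" + NBSP + ": 5") == "Prix" + NBSP + ": 5")  # True
```

The remaining cost is the one conceded in the reply: a user who deliberately types a space followed by a non-breaking space would still lose the latter.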
Supposing that the editor has to do some temporary replacement, I don't see any reason why it must be U+00A0 of all characters. If it is for internal use only and replaced by an ordinary space later anyway, any other character could be used as well, such as U+001A (substitute character), U+FFFC (object placeholder), or some character from the U+E000 to U+E0FF range (private use area). All of these are much less likely to appear in text entered by the user.
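A rough Python sketch of this idea follows, using U+E000 from the private use area as the internal marker. The choice of U+E000, the "every second space" rule, and both function names are assumptions made for illustration. None of this reflects how the Mozilla editor is actually implemented.

```python
NBSP = "\u00a0"
PLACEHOLDER = "\ue000"  # private-use character chosen arbitrarily for this sketch


def editor_representation(text):
    """Mark every second space of a run with PLACEHOLDER instead of NBSP (assumed rule)."""
    out = []
    run = 0  # number of consecutive spaces seen so far
    for ch in text:
        if ch == " ":
            out.append(PLACEHOLDER if run % 2 == 1 else " ")
            run += 1
        else:
            out.append(ch)
            run = 0
    return "".join(out)


def serialize_plain_text(text):
    """Only the internal marker is mapped back to a space; NBSP is never touched."""
    return text.replace(PLACEHOLDER, " ")


typed = "Prix" + NBSP + ":   5"  # one deliberate NBSP, then three ordinary spaces
print(serialize_plain_text(editor_representation(typed)) == typed)  # True
```

With a dedicated marker the round trip is lossless for U+00A0, at the price of the editor having to display the marker as a space.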
Philippe: On the Wikipedia problem, what do you mean by "each time someone edit one of these page with any mozilla version"? Are they submitting a form, where the user has been editing in a textarea and the submitted text changes nbsp to space? Form submission might be a different question, since we (I hope?) know the charset the server is using, so it might be more reasonable to keep the nbsp characters if we know we're using utf-8 (or some other charset) and not plain us-ascii. There may be other issues, though; do I remember rightly that the text in a textarea might not always be the same charset as the page which embeds it? As to Uwe's suggestion of using some other character for space runs in the editor: that might be possible in the editor (Joe would have to comment on that), and the serializer part would be very easy, but it would also need a layout change, since whatever character we use would have to display as a space. The core issue for the editor: if users type two spaces, they want to see the caret move right twice, and be farther right than if they'd just typed one space.
(In reply to comment #22)
Yes, users submit new text through a form and, as far as I know, both page and form use UTF-8. You can try it here: http://fr.wikipedia.org/wiki/Utilisateur:Phe/test1 (click on "Modifier cette page" to modify the text).
> The core issue for the editor: if users type two spaces, they want to see
> caret move right twice, and be farther right than if they'd just typed one
> space.
I see what your problem is. Magic values are evil, but such a case is very tedious to fix without some sort of magic value.
This breaks i18n, and I think the input of a Mozilla i18n expert would be useful here to find the best way out of this, I CC smontagu and jshin. Maybe a solution would be, instead of replacing by nbsp one of every two spaces typed, to insert a default_ignorable_code_point or a ZWNJ/ZWJ character between successive spaces. And conversion to plain text would remove that character when found between two spaces.
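The ZWNJ variant suggested here could look roughly like the Python sketch below. It is only a model of the suggestion: the function names are hypothetical, and whether layout would keep the spaces from collapsing with a ZWNJ between them is not something this sketch can show.

```python
import re

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def editor_representation(text):
    """Insert a ZWNJ after every space that is directly followed by another space."""
    return re.sub(r" (?= )", " " + ZWNJ, text)


def serialize_plain_text(text):
    """Drop a ZWNJ only when it sits between two ordinary spaces."""
    return re.sub(f"(?<= ){ZWNJ}(?= )", "", text)


stored = editor_representation("a   b")
print(stored == "a " + ZWNJ + " " + ZWNJ + " b")  # True
print(serialize_plain_text(stored) == "a   b")    # True

# A ZWNJ that is part of the user's text (not between two spaces) is kept.
print(serialize_plain_text("x" + ZWNJ + "y") == "x" + ZWNJ + "y")  # True
```

Because the marker is only removed between two spaces, non-breaking spaces and legitimate uses of ZWNJ inside words pass through unchanged.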
This is a huge bug. It breaks the most important rule: don't touch user data. If you need to replace something because another piece of code needs it, fix the other broken code.
Glad to know Mozilla alters user data for the benefit of the outdated us-ascii encoding... Who is still using it? ISO-8859-1 exists, guys! When I read that "non-breaking space is something that doesn't exist in a lot of fonts" I can't believe it! "&nbsp;" is a well-known entity, used since the early days of mass web authoring. Don't tell me authors used a character user agents didn't know how to render! So what's the difference with the equivalent character? We do want to use the richness of our encoding, not even considering Unicode! This is another bug where Mozilla sucks, see #70610 for instance. And this bug again has been put to sleep… (This non us-ascii character was brought to you by UTF-8.)
Bug #218277 is about the same issue and has an r+sr patch and dependencies; this bug should probably be marked as a duplicate.
*** This bug has been marked as a duplicate of 218277 ***
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → DUPLICATE
*** Bug 268995 has been marked as a duplicate of this bug. ***