Closed Bug 218277 Opened 21 years ago Closed 18 years ago

no-break spaces (nbsp) in submitted text are replaced by breakable spaces

Categories

(Core :: DOM: Core & HTML, defect)

x86
Windows 2000
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla1.9alpha1

People

(Reporter: moz, Assigned: dbaron)

References

()

Details

(Keywords: dataloss, intl)

Attachments

(4 files, 1 obsolete file)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6 Let's have a string containing 30 times the letter 'a', 5 no-break space (character 0xA0 or 160 in iso-8859-1, iso-8859-15 or unicode) and a dot '.' Note : you can input no-breack space on Windows with by letting the Alt key down and type 0160 on the keypad Input this string into a text input element in a form. Submit the form and display the submitted string. The 5 no-break space have changed to 1 breakable space (0x20 or 32). The 5 no-break spaces should have been kept. You can test this on http://mess.genezys.net/NoBreak/ This is quite important when you submit french texts as this language uses a lot of no-break spaces (for instance before those punctuation marks --> ? ! : ;) Reproducible: Always Steps to Reproduce: 1.input a string with many no-break space in a text input in a form 2.submit the data 3.display the data received Actual Results: One single no-break space (0xA0) are replaced by one breakable space (0x20) Continuous multiple no-break spaces are replaced by one breakable space Expected Results: Mozilla should keep the no-break spaces just as they appear in the original input string.
Will need PHP version 4 or better
Whiteboard: DUPEME
*** Bug 219774 has been marked as a duplicate of this bug. ***
I would like to point that Mozilla just *transforms* 0xA0 bytes (non-breakable space) into 0x20 bytes (breakable space). Why ? HTTP posting should just send bytes through the network, and not analyse them. As you can see with the test page attached, Internet Explorer does not make that transformation.
I discovered I was wrong about one point : « The 5 no-break space have changed to 1 breakable space (0x20 or 32) » Instead, each no-break space (0xA0) is replaced by one breakable space (0x20). For instance, « a b », encoded as : 0x0061, 0x00A0, 0x00A0, 0x00A0, 0x0062 will be converted to : 0x0061, 0x0020, 0x0020, 0x0020, 0x0062 I downloaded and compiled the source code and successfully reproduced the bug. I will try to find the origin of that bug.
I found the problem. The scary thing is that... it is done on purpose ! Here are more details about the bug : (btw, I used the official source archive : MozillaFirebird-source-0.7.tar.gz) Here is the function call stack I had when debugging (the most recent call is written first, the parent calls follow) : nsTextControlFrame::GetText(nsTextControlFrame * const 0x03a03028, nsString * 0x0012f00c {""}) line 2758 nsTextControlFrame::GetValue(nsTextControlFrame * const 0x03a030b4, nsAString & {...}, int 0x00000001) line 2949 + 54 bytes nsPlaintextEditor::OutputToString(nsPlaintextEditor * const 0x03a81320, const nsAString & {...}, unsigned int 0x00000418, nsAString & {...}) line 1424 + 39 bytes nsDocumentEncoder::EncodeToString(nsDocumentEncoder * const 0x0392a438, nsAString & {...}) line 935 + 39 bytes nsPlainTextSerializer::Flush(nsPlainTextSerializer * const 0x0392ac08, nsAString & {...}) line 453 nsPlainTextSerializer::FlushLine() line 1273 nsPlainTextSerializer::Output(nsString & {"u v"}) line 1296 ----------------------------------------------------------------------------------- So here we are, the problem is in void nsPlainTextSerializer::Output(nsString& aString) In file : /mozilla/content/base/src/nsPlainTextSerializer.cpp line 1279 Here is the function : /** * Prints the text to output to our current output device (the string mOutputString). * The only logic here is to replace non breaking spaces with a normal space since * most (all?) receivers of the result won't understand the nbsp and even be * confused by it. */ void nsPlainTextSerializer::Output(nsString& aString) { if (!aString.IsEmpty()) { mStartedOutput = PR_TRUE; } // First, replace all nbsp characters with spaces, // which the unicode encoder won't do for us. static PRUnichar nbsp = 160; static PRUnichar space = ' '; aString.ReplaceChar(nbsp, space); mOutputString->Append(aString); } -------------------------------------------------------------------- So here you see the special replacement. This is bad to do such a thing. Those three code lines should be removed : static PRUnichar nbsp = 160; static PRUnichar space = ' '; aString.ReplaceChar(nbsp, space); I removed them in my personnal build and rebuild the browser. I tried the page on http://mess.genezys.net/NoBreak and everything worked lovely. Internet Explorer and Opera behave correctly. They do not replace 0x00A0 (160) chars with 0x0020 (32) chars. The same should be done for Mozilla. -------------------------------------------------------------------------- Now that the bug was found, who could fix it in the CVS tree ?
D'oh ! I forgot I made reorganisation. Please read the trace stack with the parent calls first instead of « (the most recent call is written first, the parent calls follow) : »
Flags: blocking1.6b?
Bug 195946 may be resolved too with this patch... Need some tests.
> Vincent Robert : > Bug 195946 may be resolved too with this patch... Need some tests. I did some tests, the patch also corrects bug 195946 and bug 213628
We changed this behavior over two years ago, see bug 62189, thinking we were doing the right thing. Comment from nsPlainTextSerializer.cpp where this happens: "Prints the text to output to our current output device (the string mOutputString).The only logic here is to replace non breaking spaces with a normal space sincemost (all?) receivers of the result won't understand the nbsp and even be confused by it." If we're going to change this, it probably needs to happen early in an alpha or beta cycle so we can have some time to see if we break some applications that might expect Mozilla's current behavior.
Flags: blocking1.6b? → blocking1.6b-
Asa, from what I can tell, the nsPlainTextSerializer.cpp was checked in by Vidur with this nbsp->sp conversion directly from nsHTMLToTXTSinkStream. bug 62189 cleaned up a bunch of stuff and added comments to nsPlainTextSerializer::Output but didn't change the already existing nbsp->sp conversion. I see no reason to be doing this conversion and no justification found in the comments. Perhaps Daniel Glazman and others can take a look at this and comment. I agree that this is something for alpha and not beta.
As suggested, I set a review request flag for 1.7a.
Flags: blocking1.7a?
Comment on attachment 136358 [details] [diff] [review] This patch corrects the « nsPlainTextSerializer::Output » function Requesting r from Daniel per comment 11.
Attachment #136358 - Flags: review?(daniel)
I can't think of any reason why this patch is wrong but I would like a few other opinions before I give this r=. Akkana, Joe : we need your help here. Do you think the proposed change is safe?
Well, someone should enumerate the common cases where the plaintext serializer is used, and think abuo the results. Example: you compose mail in composer in the html view, and then choose the "send as plaintext" option when you get a warning about one of your recipients not savvy to html. Are there relevant RFC's on plaintext mail that address this? Are there internationalization issues with 0xA0? My experience doens't help much here. I know more about which situations demand nbsp's in our internal representation of the data while editing is in progress, rather than the various consumers of output from the plaintext serializer. If I understand this bug correctly, though, one side effect should be that copy/paste of nbsp's within a plaintext editor should transform them to spaces? So if you use nbsp's in soem ascii art to prevent a line from wrapping, and then copy/paste that line in a plaintext editting session, it will suddenly wrap (if long enough), ya?
time to make the call in this for 1.7a. how much, and what kind of testing has been done with the patch?
Someone should figure out what codepaths this is used for and make sure this doesn't happen where we don't want it. Furthermore, I don't see any reason this should block the release.
Flags: blocking1.7a? → blocking1.7a-
Here is some piece of information I could gather : --------------------------------------------------------------- The function to change is : protected nsPlainTextSerializer::Output(nsString& aString) called by: protected nsPlainTextSerializer::FlushLine() protected nsPlainTextSerializer::EndLine(PRBool aSoftlinebreak) protected nsPlainTextSerializer::OutputQuotesAndIndent(PRBool stripTrailingSpaces /* = PR_FALSE */) protected nsPlainTextSerializer::Write(const nsAString& aString) class nsPlainTextSerializer used in NS_NewPlainTextSerializer(nsIContentSerializer** aSerializer) NS_NewPlainTextSerializer(nsIContentSerializer** aSerializer) used in \layout\build\nsLayoutModule.cpp(466):MAKE_CTOR(CreatePlainTextSerializer, nsIContentSerializer, NS_NewPlainTextSerializer) CreatePlainTextSerializer found twice in \layout\build\nsLayoutModule.cpp in the array : static const nsModuleComponentInfo gComponents[] { "plaintext content serializer", NS_PLAINTEXTSERIALIZER_CID, NS_CONTENTSERIALIZER_CONTRACTID_PREFIX "text/plain", CreatePlainTextSerializer }, { "plaintext sink", NS_PLAINTEXTSERIALIZER_CID, NS_PLAINTEXTSINK_CONTRACTID, CreatePlainTextSerializer }, --------------------------------------------------------------- Now I don't have a good global vision of the project so I can't say what gComponents is used for. I hope this could help anyway. From a user point of view I didn't see any side effect with my personnal Mozilla version of Mozilla. From a logical point of view I just think like Bob Clary : I see no reason to be doing this conversion. Did someone else try the patch with his own build ?
I can see a very good reason to be doing this conversion. When using the HTML editor, hitting the space bar multiple times turns spaces into  . Mail composed using the HTML editor is often sent as text, and if this is the code used to convert that mail to text, those nbsp characters need to be converted to spaces.
Comment #19: "those nbsp characters need to be converted to spaces" This depends on the charset, and does not only concern no-break spaces. When I'm using the iso-8859-1 charset, I have the right to use the 0xA0 character with this charset (and others which know about 0xA0, like windows-1252, iso-8859-15, unicode flavours,...), the 0xA0 is a perfeclty valid character, like é, è, à, ç... Conversion from characters written in the html editor and not included in the charset selected by the user is another issue. Btw, I found that Thunderbird/Mozilla mail components fails to manage 0xA0 characters. During edition when I insert some 0xA0 characters they also get converted into 0x20, this is very annoying as my default mail encoding is iso-8859-15. Other iso-8859-15 characters are kept. So this is just a matter of consistency. Why some valid characters should be changed, and some others kept ? At best the patch will also correct the mail component. At worst it will do nothing, and then another bug file should be open about this 0xA0 conversion.
My point is that when the user types <SPACE> <SPACE>, we should not send non-breaking space characters in plaintext messages. Does the patch break that?
Comment #21: there is a side effect of the patch in the mail/news component. Here are the results of my tests. A) I composed mails with Thunderbird 0.5 in html mode (my default encoding is iso-8859-15). The sending format is set to « Auto-Detect ». 1) I only used ascii characters, and started a line with two spaces. result: the mail was sent as : Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit info: 7bit, so I only got 0x20 spaces. user expectation: OK. With plain text renderer, spaces are not merged together, so having 0x20 spaces is fine for separation, moreover I did type 0x20 characters. 2) I used ascii letters but inserted manually several 0xA0 characters : Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit info: 7bit, this means my 0xA0 characters were converted to 0x20 spaces. user expectation: Not OK. The 0xA0 I inserted manually should be kept, and so the Content-Transfer-Encoding should be 8bit. 3) I used characters with accent, and started a line with two spaces. result: Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 8bit info: In an hexa viewer I don't see any 0xA0 spaces. user expectation: OK. The Content-Transfer-Encoding is 8bit to handle the characters with accent. No 0xA0 characters were inserted manually so none should be appear. 4) Same test than A3) but I forced the sending in html result: Content-Type: text/html; charset=ISO-8859-15 Content-Transfer-Encoding: 8bit info: the line which starts with two spaces is encoded as : 0xA0 0x20 user expectation: the rendering is OK. The insertion of 0xA0 in order to do a margin ? Why not. B) I composed mails with Mozilla 1.5 with the patch applied, in html mode (my default encoding is iso-8859-1). The sending format is set to « Auto-Detect ». 1) Same test than A1) result: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit info: this time, the 0xA0 automatically inserted was kept, the Content-Transfer-Encoding has been set to 8bits instead of 7bit. user expectation: Not OK. I never inserted any 0xA0 characters manually. I just inserted regular spaces. So I should only get 0x20 spaces. 2) Same test than A2) (manually inserting 0xA0 characters) result: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit info: same as A2). user expectation: Not OK. Same as A2). 3) Same test than B1) but I forced the format to text mode. info: same as B1). user expectation: I expected to avoid the 0xA0 composer behaviour, but same as B1) ------------------------------- In all cases the mail data itself is perfectly valid, but I do not always get expected characters : With the patch, I get unexpected 0xA0 characters instead of 0x20 ones. In both cases, I get unexpected 0x20 characters instead of 0xA0 ones. The problems comes from the fact the composer does not make any difference between a 0xA0 inserted manually by the user, which should be always kept, and its own automatically generated 0xA0 in order to create a margin, which could be kept in html mode, but should be removed in text mode. Daniel may have better ideas than mines to deal with this issue, but here you are anyway : - have a real text mode for the mail composer. In that case the automatic 0xA0 insertion would not be activated ; - use another technic for margins ; - make a difference in the composer between user's 0xA0, and automatic 0xA0. For example : Four spaces<span class="moznbsp"> </span>then three spaces<span class="moznbsp"> </span>done. Automatic 0xA0 would be enclosed with a span in order to make the difference. - use the entity "&nbsp;" for automatic 0xA0, and reals 0xA0 for user's 0xA0. When sending in text/plain mode, &nbps; would be converted to 0x20 and real 0xA0 would be kept. - store somewhere in memory where are the automatic 0xA0 located ------------------------------- A temporary-not-perfect solution would be to keep the 0xA0 --> 0x20 conversion but at the mail component level, not in the nsPlainTextSerializer class itself, so at least we would relieve the browser of this bug (and then open a new bug about the mail component or about the composer).
Today I've been inspecting the code in order to to implement the "temporary-not-perfect solution" but then I thought about a nice improvement in the way the composer could handle spaces. In Comment #22 I proposed to enclose 0xA0 with a special span element : Four spaces<span class="moznbsp"> </span>done. I didn't really like the fact of using 0xA0 characters as the user just inserted normal 0x20 characters. But I discovered those 0x20 characters can be kept given the CSS class : .moznbsp { white-space: pre; } This works just fine. ---------------------------------- Small note : Well, if you want to be stricter the CSS 1 recommandation says that white-space applies to: block-level elements, so maybe it should be changed to : Four spaces<div class="moznbsp"> </div>done. .moznbsp { white-space: pre; display: inline; } but with the span it has been working so far in many browsers. ---------------------------------- Summary : hitting once the space key in the composer will insert a 0x20 character. hitting once again the space key in the composer will insert another 0x20, and enclose both of them into : <span class="moznbsp"> </span> hitting again the space key will insert 0x20 characters into the span element : <span class="moznbsp"> </span> 0xA0 inserted by the user will be managed just like another normal character. I'm going to look around in the source code to find where the automatic 0xA0 insertion mechanism is made and think about how to implement such a solution (but maybe somebody else knows more about this routine and could do the change more easily).
Note that bug 213628 is dataloss and this potentially could be.
Blocks: 195946, 213628
This bug is major on french Wikipédia (http://fr.wikipedia.org/) because in french non-breaking spaces have to be used before colon, exclamation/question mark... Many articles correctly written by IE users using nbsp are altered by editors using Mozilla/Firefox user agent without even being aware of this bug. This bug should be at least written in in the Release Notes of the next version, or better, fixed. This is an HTML/XHTML bug as a readonly control will always have its "current value" (see http://www.w3.org/TR/html401/interact/forms.html#current-value ) different from the "initial value" if the ''initial value'' contains nbsp (which is not the definition of "readonly"). (and the bug is not limited to readonly controls). Severity should be changed to major. Keywords 'conversion' and 'xhtml' should be added.
Attached file Testcase
This is a testcase that can be used locally (without involving form submission over HTTP). "bug!" is shown in the page if your user agent has the bug.
This bug is still NEW, and severity normal... But I just broke my whole wiki content because of it ! I was transferring my database content using phpMyAdmin, and my content did contain the 'à' character which is encoded as 'Ã[nbsp]' in UTF-8. But when submitting this using phpMyAdmin, 'Ã[nbsp]' are simply transformed in 'Ã[sp]', and my char is lost as it is not a valid UTF-8 character. Fortunately, I have IE to upload my text again but I am wondering how would someone under Linux have repaired the problem.
Cc'ing jst in case he has any insights. /be
One possibility here is to stop calling that code only when serializing the value of form controls. I'll attach a patch...
Attachment #162609 - Flags: superreview?(dbaron)
Attachment #162609 - Flags: review?(bryner)
Great! It looks like the temporary-not-perfect I though of. It seems sensible now, as this bug is about the browser only. I may open a new bug for the mail component.
Sorry for the spam but... Pleaaaaase. Check this patch in for 1.0 final.
We're way too close to the release of 1.0 to consider taking this kind of changes. If you strongly disagree you can set the blocking-aviary1.0 flag to ? and make a case for this, but I doubt it'll go through :(
You may be right, but the thing is that this software is opensource (aka community based), and now that the community has found a solution to a very annoying problem, there seem to have no way to help solving this problem :-( I think there is something else we could do to help instead of just waiting for a guru to check in this patch, but I don't know what... Testing ? Voting ? What can be done ?
Increasing severity and requesting block status due to dependent bugs and the existence of a patch.
Severity: normal → critical
Flags: blocking-aviary1.0?
Keywords: dataloss
(In reply to comment #34) > You may be right, but the thing is that this software is opensource (aka community based), and now > that the community has found a solution to a very annoying problem, there seem to have no way to > help solving this problem :-( Yeah, you're right, it's open source n' all, and the community is what matters here. What we, the community, everyone involved, is dealing with here though is balancing between risk of this patch causing problems and delaying the release (or even worse, causing embarasing problems that we don't notice before the release goes out the door) and the benefit of shipping with this fix. IOW, is this problem worse to the community as a whole than the Firefox 1.0 release being a failure due to a problem that could be caused by this fix? With all the amazing effort being put into this release, in the actual product, and at spreadfirefox.com et al, I *really* don't want to screw this up. So unless there's a *huge* benefit to a *lot* of users in taking any given fix, I wouldn't want to take the fix at this point...
Well I don't think we have a keyword for that, but FWIW I can confirm this bug is indeed important for French-speaking users, and for French l10n in general.
Is 251404 a duplicate of this bug?
not this late in the game. sorry. not a blocker.
Flags: blocking1.7a-
Flags: blocking1.6b-
Flags: blocking-aviary1.0?
Flags: blocking-aviary1.0-
Attachment #162609 - Flags: review?(bryner) → review+
Comment on attachment 162609 [details] [diff] [review] Persist nbsp characters only when serializing the value of form controls. >+ // Normally &npsp; is replaced with a space character when &nbsp;, not &npsp; sr=dbaron ***if*** you're sure that the editor never generates non-breaking spaces when the user presses the space bar in text inputs (e.g., by pressing space multiple times).
Attachment #162609 - Flags: superreview?(dbaron) → superreview+
Most probably a duplicate of #194498 (or probably the reverse since this one has a patch with r/sr and dependencies)
(In reply to comment #41) > Most probably a duplicate of #194498 (or probably the reverse since this one has > a patch with r/sr and dependencies) But this one is newer so it is rather the duplicate.
Blocks: 194498
*** Bug 251404 has been marked as a duplicate of this bug. ***
Requesting for 1.8b since there is a patch which could be checked in now (and hoping that it would make it in Firefox 1.1 as well)
Flags: blocking1.8b?
Adding intl keyword since this affects particularly french-speaking users who (should) have to use nbsp extensively (see comment #27, comment #29 and comment #41)
Keywords: intl
Sorry but I don't understand why this is a problem in french, and I am not sure why we need this fix. And I am a native french speaker who has the book "Règles Typographiques en vigueur à l'Imprimerie Nationale" on his work desk, who uses french-speaking wikis. Typographical rules in french allow only one non-breaking space before a semi-colon, a question mark or an exclamation point. I've never seen a prose with more than one unless it's in preformatted style, like in code for instance. In that case - for a wiki - a pre element is the solution, isn't it? Furthermore, David Baron is perfectly right: that code is needed for email... The editor can't preserve multiple spaces if we don't turn some of them into &nbsp;. Again, I am not convinced we need this fix, I can't see the plusses, and we have one possible side-effect on email.
(In reply to comment #46) > Sorry but I don't understand why this is a problem in french, This is not especially a problem in French, but since French uses more non-breaking spaces than other languages, this is a problem dear to many French Mozilla users. > and I am not sure why we need this fix. We need to fix this for two reason: a Mozilla user cannot input a non-breaking space in a web form; and most importantly, Mozilla destroys the work of other users of web forms by removing the non-breaking spaces they might have put in the form. Look at it this way: you edit a document, you save it without touching anything, yet the text has been modified and you cannot do anything about it. > Typographical rules in french allow only one non-breaking space before a > semi-colon, a question mark or an exclamation point. You forgot a few cases. > I've never seen a prose with more than one unless it's in preformatted style, > like in code for instance. The problem is not with using several non-breaking spaces. The problem is using even a single non-breaking space. You cannot do that. > Furthermore, David Baron is perfectly right: that code is needed for email... That code is *currently* needed for email, and that's a purely terrible design. Unfortunately the code is deeply embedded at the core of Mozilla, so it cannot be changed easily. > The editor can't preserve multiple spaces if we don't turn some of them into > &nbsp;. Oh yes it can, just turn some of them into something else than a non-breaking space! Use a character in the Unicode private area, for example.
If this problem is not fixed here, then bug 213628 absolutely must be fixed since in that case _Mozilla doesn't work_.
Bug #194498 comment 22 explains why it is so tricky and exposes some possible solutions based on charset detection, or by using another character internally for editor-generated spaces (using it for user-entered data would probably work too, although it may cause problems when copy/pasting from a text control to another application). In any case, I'm confident that a solution will eventually be found. My two cents on this: in my understanding, 'nbsp's created by the editor internally are never "alone", they are either preceded or followed by other nbsp characters, line returns or 'classic' spaces. So, instead of preserving any nbsp found in a text control like attachment 162609 [details] [diff] [review] does, let's try preserving only nbsps found "alone" in a form control (input text, input file or textarea). That is, when they are surrounded by letters (any language, not only latin ones), numbers or punctuation signs. It would fix 99% of the cases, including the upload problem described in bug #213628 (I can't possibly imagine the use for a file name with multiple nbsps in a row. If it exists it is probably a virus trying to hide a file extension anyway). The only problem I think of would be with so-called ASCII art and the like, which is not so important IMHO, and for which users should rightfully use "pre" or the associated command in their wiki syntax. Is that a sensible approach? I've been thinking of this all day long, but I may not have understood all the ins and outs. I don't know for example how it would affect performance.
This bug is not about mail or multiple nbsp's. This bug is about data loss in form serialization in the browser. Characters provided in a form are not the characters received by the server, and that is not a correct behaviour. It has nothing to do with mails, and the last patch provide a way to keep the current mail behavior while making the browser behaviour correct.
(In reply to comment #50) > This bug is not about mail or multiple nbsp's. This bug is about data loss in > form serialization in the browser. Characters provided in a form are not the > characters received by the server, and that is not a correct behaviour. The relation with mail is that the same code (editor) is used for both, and that it generates "fake" nbsp characters for internal use when you type multiple spaces in a row (see comment #40 and bug #194498 comment 12). That's why they are removed at submission time, because in most cases they are in fact generated by the editor and thus not desired. But that code is just being a little over-zealous, since it also removes nbsp characters entered manually by the user, by the file selector (upload case), by copy-paste from a Word processor (french quote marks and punctuation) or existing nbsp characters that were pre-filled in the form input from a remote database (wikipedia case).
The patch unactivates the 0x00A0 removal for Web forms only. So at least the browser component would be kept bug-free, and many related bugs could be closed. As I stated in comment #22 and comment #23, the composer uses an ugly method to keep adjacent spaces. In an ideal world, this method would be replaced by another one which does not use the 0x00A0 character hack. Therefore, the 0x00A0 to 0x0020 conversion would vanish. Unfortunately this is quite a big work, so the current patch is a very good and safe temporary solution (thanks Johnny). (For Daniel, here is a reminder from the « Lexiques des règles typographiques en usage à l'Imprimerie Nationale (2002) » : http://psydk.org/gecko-bugs/french-spaces.jpg )
Did anyone ever answer dbaron's question in comment 40?
*** Bug 194498 has been marked as a duplicate of this bug. ***
(In reply to comment #53) > Did anyone ever answer dbaron's question in comment 40? I think it was Daniel's point in comment #46. Here is what he said on IRC: [09:53] <Benoit-> I see, so the result would be that with the patch applied it would *generate* nbsps when someone uses multiple spaces in a text form input? [09:54] <glazou> right In my understanding, that means the latest patch is not good enough but it could be tweaked a bit to do what I describe in comment #49 (conserve only _single_ nbsps). Here is what it does in nsPlainTextSerializer.cpp: + if (!(mFlags & nsIDocumentEncoder::OutputPersistNBSP)) { + // First, replace all nbsp characters with spaces, + // which the unicode encoder won't do for us. + static PRUnichar nbsp = 160; + static PRUnichar space = ' '; + aString.ReplaceChar(nbsp, space); + } To that could be added something like (that's really pseudo-code, I don't know how you'd do char concatenation here but you get the picture) + else { + // nbsp characters which are surrounded by spaces or line breaks + // are probably generated by the editor and not meant by the user. + // We are converting them anyway. + static PRUnichar nbsp = 160; + static PRUnichar space = ' '; + static PRUnichar linebreak = '\n'; + + // multiple nbsp characters + aString.ReplaceSubstring(nbsp+nbsp, space+space); + + // nbsp characters following or preceding a space + aString.ReplaceSubstring(nbsp+space, space+space); + aString.ReplaceSubstring(space+nbsp, space+space); + + // nbsp characters beginning or ending a line (the latter could be trimmed) + aString.ReplaceSubstring(nbsp+linebreak, space+linebreak); + aString.ReplaceSubstring(linebreak+nbsp, linebreak+space); + } That "else" would never be reached when not in a form control (e.g. in mail). It would take care of all edge case I can think of, some of them are probably not necessary at all. I know it's a very dirty hack, I'd at least have done that using only one regular expression but I don't think we can do that in c++ can we? (Just trying to be helpful here, sorry if you think I'm being overly naive and consider this as bugspam, that's not in my intention - I'd try this at home, see if it builds and send a proper patch if I was not in a time of exams)
The last patch does NOT affect mail or composer. Comment #30 says that it applies the nbsp-to-sp convertion to everything but web forms. In Comment #31, Hadrien says that he will file a new bug for Mail. I think he is refering to the current wrong behaviour of Mail described in Comment #22. This wrong behaviour is not corrected by the last patch because it does only affect web forms. Just trying to clarify some thoughts :)
I got two ideas to keep adjacent spaces in the editor without the NBSP method. 1) First idea Use the private Unicode area for Gecko and create an "EditorInternalNbsp". Some value carefully chosen in the range 0xE000..0xF8FF. The editor would insert this new character instead of 0x00A0. Pros : - minor changes to the editor. Cons : - changes needed in the html renderer ; - non portable html files. 2) Second idea Instead of using 0x00A0 (NBSP) sequences, a pair of 0x0020 (SPACE) and 0xFEFF (ZERO WIDTH NBSP) could be used. Here you are a scenario after pressing the space key 3 times in the editor : Currently : 0x00A0 0x00A0 0x0020 Suggestion : 0x0020 0xFEFF 0x0020 0xFEFF 0x0020 When converting from html to text, a (0x0020 0xFEFF) pair would change to a single 0x0020 character. Pros : - no change in the html render, adjacent spaces are kept with this method ; - 0x00A0 are always kept and becomes a normal char ; - portable html files, meaning the content can be sent as-is. Cons : - in the editor, the user as to type twice on the arrow key to move the caret ; - when copying & pasting text, the 0xFEFF values are kept, and may be rendered strangely in the destination software (for example, Notepad renders the 0xFEFF as dots). In the last two items, the editor would be aware of (0x0020 0xFEFF) sequences and could move of one space only (I think it would be the same kind of code that when managing surrogate pairs). About copy & pasting, the html to text converter could be aware of the kind of sequence too, and convert them to single 0x0020 characters (same idea than converting html representation to plain text representation, like <li> to "-").
> Use the private Unicode area for Gecko and create an "EditorInternalNbsp". I'd argue that the new character should be treated as a space, not as a non-breaking space; the non-breaking aspect of nbsp isn't why we're using it here, and if someone actually wants non-breaking behavior they'll use a real nbsp. The important part would be modifying layout to do the right thing with the new character. If you could interest a layout person in this, the editor and serializer portions probably wouldn't be that difficult. My guess is that the layout changes wouldn't be too difficult either, for someone familiar with the code: the new character would be treated exactly like normal spaces: no new behavior, just a new char which gets the old behavior. > 0xFEFF (ZERO WIDTH NBSP) Interesting idea, but ...the argument here is that nobody actually wants 0xFEFF in output, while they do want 0xA0? I don't think most people thought anybody wanted 0xA0 when the current code was originally written, so I'd be leery of solutions which assumed "Our character is important, but this other character must not be ..." I also worry that the new caret-moving code in gecko and the code for inserting/deleting would be fairly tricky to account for sometimes skipping the zwsp characters. There would be a potential for a lot of new bugs there if it wasn't tested very thoroughly.
It seems like what we should really be doing is: * when we're editing plain text, store multiple presses of space as spaces (this is a change) * when we're editing HTML, store multiple presses of space using non-breaking spaces for all but the last press (tricky with deletion) (we probably do this fine already) * when serializing HTML to text, convert runs of non-breaking spaces terminated by a space to spaces * when using nsPlainTextSerializer to convert text to text (if we need to use it at all, although we seem to now), don't mess with spaces Does this make sense?
> It seems like what we should really be doing is: > * when we're editing plain text, store multiple presses of space as spaces > (this is a change) Yes, there is a need for a full text editor in the mail component. No need of the NBSP-for-adjacent-spaces trick then. Fine for me. > * when we're editing HTML, store multiple presses of space using non-breaking > spaces for all but the last press (tricky with deletion) (we probably do this > fine already) The editor actually does this, and this behaviour is ok when editing HTML. Fine for me. > * when serializing HTML to text, convert runs of non-breaking spaces terminated > by a space to spaces When you are editing a plain text mail, currently the mail component seems to do HTML editing internally, and converts the HTML to text. That's why the conversion from NBSP to SP was needed. With a true plain text editor (as you suggest in the first point), this conversion won't be needed anymore. The conversion from NBSP to SP would be kept at an higher level, for example when the user starts writing a mail in HTML and then decides to switch to plain text mode while still editing his message. He knows this may lead to some losses or some conversion tricks. Fine for me. > * when using nsPlainTextSerializer to convert text to text (if we need to use > it at all, although we seem to now), don't mess with spaces YES. In that case we can come back to the original patch which just removes the conversion. Fine for me. > Does this make sense? I think you get it right David, your ideas look sensible imho. So changes are : - a real plain text edition mode for the editor ; - a move of the NBSP to SP conversion code out of the nsPlainTextSerializer. I'm clueless about the new location of the NBSP to SP conversion, but I guess there is already a piece of code which goal is to convert HTML to plain text ?
> So changes are : > - a real plain text edition mode for the editor ; this so isn't happening.
Too late for 1.8b1; I'll try to do this for 1.8b2.
Assignee: form-submission → dbaron
Flags: blocking1.8b?
Flags: blocking1.8b2+
Flags: blocking1.8b-
Blocks: 290565
dbaron, should we try to keep this in the b2 timeframe (next few days?) or move it to 1.8b3?
Flags: blocking1.8b2+ → blocking1.8b3+
Target Milestone: --- → mozilla1.8beta3
not a blocker. dbaron will keep this on his list in case he has time.
Flags: blocking1.8b3+ → blocking1.8b3-
This is a duplicate of bug 195946.
*** Bug 195946 has been marked as a duplicate of this bug. ***
*** Bug 310877 has been marked as a duplicate of this bug. ***
Is there any chance of this fixed in Fx 2.0? (FYI, I've just ran into this bug while editing Russian Wikipedia, ru.wikipedia.org... I've had to use lots of &nbsp; instead, in order to fix the Wikipedia article, and that in turn annoyed several further editors who thought that HTML entity, while used much, makes the source code of Wiki article nearly unreadable, and reading diffs of it is made a real pain. They were obviously right. Now I start thinking that mere using of Firefox in Wikipedia sometimes should be considered as a severe case of vandalism, and the wikiuser's account should be banned from the system until he/she starts using some better designed browser.)
Summary: no-break spaces in submited text are replaced by breakable spaces → no-break spaces in submitted text are replaced by breakable spaces
This bug has been opened for 3 years now. What's stopping from fixing it? This is the most distrubing 'feature'. If users type nbsp, copy and paste it or type the Alt+num value it's because they want it.
> What's stopping from fixing it? Lack of time. Feel free to help out -- see comment 59 for what needs to be done.
> What's stopping from fixing it? Lack of time. Feel free to help out -- see comment 59 for what needs to be done.
I can understand that more time would be needed for an elaborate fix, but there is already a patch to fix this problem, attached to this bug, which has been reviewed and superreviewed: what is stopping it from getting committed? Why are two years not sufficient to check in a patch which has already been approved? This bug has such dramatic consequences everywhere (disastrous edits on Wikipedia en masse, for one thing) that I'm willing to stand on my knees and beg, or donate money, or whatever, but I don't have check-in permission so I can't commit the patch myself...
> what is stopping it from getting committed? Just the need to merge it to trunk. I'd be happy to commit if someone does that. Note that nsIDocumentEncoder is an IDL file now, not a .h....
jpl24, would you mind updating that patch to tip?
Sure, no problem.
The patch would still convert all nbsps, though, right? So runs of multiple spaces typed by the user would generate 0x20 0xa0 0x20 0xa0 etc. with this patch even if it were updated to the trunk. There were several suggestions to avoid that (use zwsp, don't convert nbsp when it's by itself but do convert it if it's interspersed with spaces) but the patch that was posted here doesn't attempt to address those.
[To answer Akkana Peck's objection.] What I understand from the discussion so far and from the diff itself is that the patch we're talking about suppresses conversion of nbsp to space on plain text input forms, not on HTML composer. Now on plain text input forms, no spurious nbsp's are generated by Mozilla hacks (contrary to composer), so there any nbsp found there is a genuinely input nbsp (either by user input or by initial form value). Am I misunderstanding something? Because if this is correct, there is really no reason not to apply this patch, it can't break anything. Of course, I still think it's a bug to convert nbsp to space on composer text, but it's not nearly as bad as converting nbsp to space on plain text input forms which is what the patch (again, if I understand correctly) does. Please correct me if I'm wrong.
I believe comment 77 is correct.
Attached patch patch_v2Splinter Review
This is an updated version of jst's patch. It fixes the test case in this bug, but does not change composer behavior. Do we have any unit tests for the plaintext serializer?
Attachment #162609 - Attachment is obsolete: true
If the plaintext edit rules used for form fields don't insert the nbsp characters, then by all means let's get this fixed for forms (then the debate re. html composer can continue: there are valid arguments both ways). I thought that was the holdup, that both html and plaintext edit rules are inserting nbsps when the user types a run of spaces. Is there a form test case somewhere that reports exactly what got submitted? The test case referenced in this bug's URL field doesn't do that, but it would be useful both for testing the patch and as a regression test afterward.
Status: NEW → ASSIGNED
> Do we have any unit tests for the plaintext serializer? The DOM to Text conversion tests (in htmlparser/tests/outsinks, iirc) that run from Tinderbox test the the html and plaintext serializer, and it would definitely be worth adding a case to those to cover this issue. They don't test form submission, though, which takes a slightly different path through the parser (but which might require a manual test).
> Do we have any unit tests for the plaintext serializer? Not really. Bug 333060 more or less covers that (initing the nsIDocumentEncoder with text/plain as the type should do it). It should be possible to test a variety of output types and flags and such, including whatever flags form submission uses in this case. Akkana is probably right that we should add a test to the outsinks if we can too, esp. since tinderbox actually runs those.
(In reply to comment #73) > > what is stopping it from getting committed? > > Just the need to merge it to trunk. I thought comment 40 was, but maybe that was addressed and I don't see it.
I did verify that typing spaces in a text input and textarea generates spaces, not non-breaking spaces, when we hit nsPlainTextSerializer::Output. Checked in that patch to trunk; marking fixed. I've filed bug 347689 on implementing the proposal in comment 59.
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Would it be too imprudent to nominate the patch of this dataloss for 1.8.1 blockers? (I mean, somewhere about 2006-08-08, if it survives on the trunk with no regressions reported in 2 days.)
This patch can't land on 1.8 as-is -- it changes interfaces.
Flags: in-testsuite?
Summary: no-break spaces in submitted text are replaced by breakable spaces → no-break spaces (nbsp) in submitted text are replaced by breakable spaces
Whiteboard: DUPEME
RCS file: /cvsroot/mozilla/testing/mochitest/tests/test_bug218277.html,v done Checking in tests/test_bug218277.html; /cvsroot/mozilla/testing/mochitest/tests/test_bug218277.html,v <-- test_bug218277.html initial revision: 1.1 done
Flags: in-testsuite? → in-testsuite+
when will this become functional on other browsers, such as camino, ffx, etc?
Whenever they update to Gecko 1.9. So Firefox 3 and whatever Camino, etc versions will be using that version of Gecko.
Target Milestone: mozilla1.8beta3 → mozilla1.9alpha
FYI, the fact that this fix was intentionally made to only affect TEXTAREAs and not all text fields has resulted in bug 213628 and bug 290565 remaining not fixed when they could have been resolved at the same time. can someone explain why this patch was intentionally castrated?
> the fact that this fix was intentionally made to only affect TEXTAREAs and > not all text fields This patch affects <textarea>, <input type="text">, and <input type="password">. > as resulted in bug 213628 and bug 290565 remaining not > fixed when they could have been resolved at the same time Bug 213628 should have been fixed by this patch. Can you point to a testcase that demonstrates that it's not? Bug 290565 would need to be fixed in totally different code. > can someone explain See comment 19 and comment 21. And in general, please just read bugs carefully and in their entirety before commenting. It'll help prevent the hundreds of emails you've now generated.
(In reply to comment #89) > Whenever they update to Gecko 1.9. So Firefox 3 and whatever Camino, etc > versions will be using that version of Gecko. > How Gecko 1.9-using browser can be identified from User-Agent string?
It'll have "rv:1.9" in it.
Comment on attachment 136358 [details] [diff] [review] This patch corrects the « nsPlainTextSerializer::Output » function Since this bug has been resolved as FIXED, I'm removing the review request.
Attachment #136358 - Flags: review?(daniel)
This bug still exists. When I copy a non breaking space from Firefox, it is automatically converted to a normal space. Please, reopen this case. My u.a. : Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008090210 Mandriva/1.9.0.1-15mdv2009.0 (2009.0) Firefox/3.0.1
Cut'n'paste would be another bug, this one was about form submission. Feel free to open a new bug on cut'n'paste if it doesn't exist yet.
As far as I can tell this still happens in 4.0b11. For example, NBSP characters typed in the <textarea>s used by GMail and Launchpad are converted to normal spaces. I suspect it also happens with the textarea I’m typing this in. On my system, the second paragraph of comment 96 (above) wraps between “converted to” and “a normal”. Below I type the exact same paragraph, but I’ll type NBSPs instead of normal spaces after “automatically”: When I copy a non breaking space from Firefox, it is automatically converted to a normal space.
Note that while I was typing the above comment, the text wrapped correctly (everything from “automatically”, including that word, went on the next line before I clicked “save changes”); however the text above wraps between “to” and “a”.
Hmm. Sorry for the many messages, it appears I’m wrong. Looking at the source for the above test, there is a “&nbsp;” where I typed NBSP characters. There’s apparently a different bug of Firefox that causes it to line-wrap at those entities; can anyone point me to the relevant bug, or to an explanation why that isn’t a bug?
What you are experiencing seems to be a case of server-side conversion, about which Firefox can't do anything. The testcase attached to this bug still returns "no bug" and a preview edit on a Wikipedia page shows that NBSPs are correctly parsed and submitted by Firefox 4 beta 11. If this very sentence full of NBSPs is wrapping, you should probably file a bug on the Bugzilla product itself.
Component: HTML: Form Submission → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: