Non-breaking spaces (nbsp) not copied as such

Status: REOPENED
Product: Core
Component: Serializers
Severity: critical
Keywords: dataloss
Reporter: Cody 'codeman38' Boisclair
Assignee: Unassigned
Firefox Tracking Flags: Not tracked
Opened: 11 years ago
Last updated: 19 days ago
Attachments: 1

(Reporter)

Description

11 years ago
User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0

When copying text containing non-breaking spaces (&nbsp; or UTF-8 character 0xA0) to the clipboard, those characters are converted into regular spaces.  I'm not certain that this is a bug rather than a feature, but my expected behavior was that non-breaking spaces would be preserved as such.

Reproducible: Always

Steps to Reproduce:
1. Copy text containing a non-breaking space character (&nbsp; or UTF-8 0xA0) to the clipboard.
2. Paste into either a text area in Firefox, or into another application.
Actual Results:
Text wraps at the non-breaking space.

Expected Results:
Text should not wrap at the non-breaking space.
(Reporter)

Comment 1

11 years ago
Created attachment 244503 [details]
An HTML file containing two text areas with text using only non-breaking spaces, one using the HTML entity and another using the Unicode character 0xA0.
(Reporter)

Comment 2

11 years ago
(In reply to comment #0)
> When copying text containing non-breaking spaces (&nbsp; or UTF-8 character
> 0xA0)

Gack...I realize I misspoke here; my brain obviously wasn't working.  I of course meant *Unicode* character 0xA0, which is encoded quite differently in UTF-8!

Comment 3

11 years ago
I have the same problem with Firefox 2.0: all no-break space characters (U+00A0) are replaced by spaces. This bug does not depend on the character encoding of the HTML page. It happens whether the encoding is UTF-8, UTF-16 or Latin-1.
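As a small illustration of the distinction drawn in comments 2 and 3, the sketch below encodes the single code point U+00A0 in the three encodings mentioned. It only shows how the same character is represented as bytes; it makes no claim about what Firefox does internally.

```python
# U+00A0 NO-BREAK SPACE is one code point with different byte encodings.
NBSP = "\u00a0"

latin1_bytes = NBSP.encode("latin-1")    # single byte a0
utf8_bytes = NBSP.encode("utf-8")        # two bytes c2 a0
utf16_bytes = NBSP.encode("utf-16-be")   # two bytes 00 a0 (big-endian, no BOM)

for name, data in [("Latin-1", latin1_bytes),
                   ("UTF-8", utf8_bytes),
                   ("UTF-16BE", utf16_bytes)]:
    print(name, data.hex())
```

The byte 0xA0 on its own is therefore the Latin-1 form only; in UTF-8 the character is the two-byte sequence 0xC2 0xA0.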

Comment 4

11 years ago
Confirming, also reproducible in Firefox 2.0 on Windows XP.
(In reply to comment #0)
> User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US;
> rv:1.8.1) Gecko/20061010 Firefox/2.0
> Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US;
> rv:1.8.1) Gecko/20061010 Firefox/2.0
> 
> When copying text containing non-breaking spaces (&nbsp; or UTF-8 character
> 0xA0) to the clipboard, those characters are converted into regular spaces. 
> I'm not certain that this is a bug rather than a feature, but my expected
> behavior was that non-breaking spaces would be preserved as such.
> 
> Reproducible: Always
> 
> Steps to Reproduce:
> 1. Copy text containing a non-breaking space character (&nbsp; or UTF-8 0xA0)
> to the clipboard.
> 2. Paste into either a text area in Firefox, or into another application.
> Actual Results:
> Text wraps at the non-breaking space.
> 
> Expected Results:
> Text should not wrap at the non-breaking space.
> 

Please raise the severity of this bug to Major.
Rationale:
The input control coded by <input type="text" lang="pl" value="1&nbsp;000" > 
has the numerical value of 1,000 (one thousand).
If the value changes to "1 000", the numerical value is lost.
I understand it is kind of weird that there is no visible difference between a number and two numbers; however, while I may disagree with this setting, somebody declared it this way.
I do not know who has authority over the national numeric format, but I am sure the system vendor consulted a standards body before including it in the operating system.
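The rationale above can be sketched as follows. The helper names `tokens` and `parse_grouped_int` are hypothetical and exist only for this illustration; the sketch assumes a consumer that separates values on U+0020 and treats U+00A0 as a digit-group separator.

```python
NBSP = "\u00a0"

def tokens(value):
    """Split on U+0020 only, so an NBSP stays inside its token."""
    return [t for t in value.split(" ") if t]

def parse_grouped_int(token):
    """Parse an integer whose digit groups are joined by NBSP."""
    return int(token.replace(NBSP, ""))

original = "1" + NBSP + "000"             # value as authored: one number
corrupted = original.replace(NBSP, " ")   # value after the reported conversion

print([parse_grouped_int(t) for t in tokens(original)])
print([parse_grouped_int(t) for t in tokens(corrupted)])
```

With the NBSP intact the field yields a single value of one thousand; after the conversion the same consumer sees two separate numbers.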

Comment 6

10 years ago
Confirming for FF 2.0.0.7 on Linux.

The &nbsp; is converted to 0x20 at writing time. Any subsequent copy to the clipboard or submission via HTTP shows regular spaces. OTOH, &thinsp; and the non-breaking thin space are preserved properly.

Comment 7

10 years ago
(In reply to comment #5)
> The input control coded by <input type="text" lang="pl" value="1&nbsp;000" > 
> has the numerical value of 1,000 (one thousand).
> If the value changes to "1 000", the numerical value is lost.
>
The numerical value would be lost even if you used "1,000" (in English source code), or "1&#x202F;000" or "1.000" (in Czech). The source code should be locale-independent. The user agent should do l10n on the value (e.g. using the @lang attribute). The same problem arises in date/time format processing.

In addition, HTML forms are very weak in type handling. Maybe XForms or HTML5 will improve this.

Comment 8

10 years ago
(In reply to comment #6)
> Confirming for FF 2.0.0.7 on Linux.
> 
> The &nbsp; is converted to 0x20 at writing time. Any subsequent 
[...]
> sending via HTTP shows regular spaces.

3.0a8 (alpha version of FF) doesn't suffer from this problem. Only copying to clipboard remains affected.
(In reply to comment #7)
> (In reply to comment #5)
> > The input control coded by <input type="text" lang="pl" value="1&nbsp;000" > 
> > has the numerical value of 1,000 (one thousand).
> > If the value changes to "1 000", the numerical value is lost.
> >
> The numerical value would be lost even if you used "1,000" (in English source
> code), or "1&#x202F;000" or "1.000" (in Czech). The source code should be
> locale-independent. 

The source code for what?  Everything is source code in HTML.  And currently there is no standard that allows you to write a locale-independent 'Hamlet'.

> The user agent should do l10n on the value (e.g. using the @lang
> attribute). The same problem arises in date/time format processing.
> 

I doubt it should; it would be enough if it refrained from misrepresenting localized data as it gets them.  It converts a number to a sequence of numbers by converting non-breaking spaces to breaking spaces.  This is a bad thing and there is no excuse for this oddity.

Comment 10

10 years ago
(In reply to comment #9)
> (In reply to comment #7)
> > (In reply to comment #5)
> > > The input control coded by <input type="text" lang="pl" value="1&nbsp;000" > 
> > > has the numerical value of 1,000 (one thousand).
> > > If the value changes to "1 000", the numerical value is lost.
> > >
> > The numerical value would be lost even if you used "1,000" (in English source
> > code), or "1&#x202F;000" or "1.000" (in Czech). The source code should be
> > locale-independent. 
> 
> The source code for what?  Everything is source code in HTML.  And currently
> there is no standard that allows you to write a locale-independent 'Hamlet'.

It depends on where you place the presentation layer. You like to cook everything on the server; I like to do it on the client.

> 
> > The user agent should do l10n on the value (e.g. using the @lang
> > attribute). The same problem arises in date/time format processing.
> > 
> 
> I doubt it should; it would be enough if it refrained from misrepresenting
> localized data as it gets them. It converts a number to a sequence of numbers
> by converting non-breaking spaces to breaking spaces.  This is a bad thing and
> there is no excuse for this oddity.
> 

It doesn't convert a number into a list of numbers. It just replaces one character with another in a string. My point was that you should not describe this as a localization problem, and I showed you other languages where different (and more than one) thousands separators are used.

Despite our misunderstanding, I agree the client should not change something it doesn't understand.
(In reply to comment #10)
> It doesn't convert a number into a list of numbers. It just replaces one
> character with another in a string. My point was that you should not
> describe this as a localization problem, and I showed you other
> languages where different (and more than one) thousands separators are used.

Your statement amounts to "since there are languages where this bug does not cause problems, it is not a localization problem".
A similar statement would be "since there are languages that can be written using ISO-8859-1, the fact that the application is fixed to this character set is not a localization problem."

Comment 12

10 years ago
I have also stumbled across this very annoying bug under Linux in Firefox 2.0.0.10 ("Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.8.1.10) Gecko/20071015 SUSE/2.0.0.10-0.1 Firefox/2.0.0.10").

How to reproduce:

I have remapped my X11 keyboard with

  xmodmap -e 'keycode 113 = Mode_switch Mode_switch'
  xmodmap -e 'keysym space = space NoSymbol nobreakspace NoSymbol'

such that pressing AltGr+Space results in the keysym "nobreakspace" being sent to the application (and have tested that this works fine with xterm and xev). Having NBSP easily available on my keyboard is *far* more convenient than having to type &nbsp; when editing HTML files and wiki pages.

When I type AltGr+Space into a textarea field in Firefox 2.0.0.10, and then copy-and-paste the entered NBSP back into xterm into the stdin of "od -t x1", it receives only the byte 0x20, which is a normal space. Likewise, if I submit the text field to an HTTP server, it receives only 0x20. I would have expected to receive the bytes 0xc2 0xa0, which is the UTF-8 encoding of the NBSP character (U+00A0).
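For readers without od at hand, the same check can be sketched in a few lines. The helper names `hex_dump` and `contains_utf8_nbsp` are hypothetical, and the two sample byte strings merely stand in for what the terminal would receive in the expected and the observed case.

```python
def hex_dump(data):
    """Render bytes as space-separated hex pairs, similar to `od -t x1`."""
    return " ".join(f"{b:02x}" for b in data)

def contains_utf8_nbsp(data):
    """True if the UTF-8 encoding of U+00A0 (c2 a0) occurs in the bytes."""
    return b"\xc2\xa0" in data

expected = "a\u00a0b".encode("utf-8")   # what a transparent form field would deliver
observed = "a b".encode("utf-8")        # what the comment reports receiving

print(hex_dump(expected), contains_utf8_nbsp(expected))
print(hex_dump(observed), contains_utf8_nbsp(observed))
```

To inspect real pasted input, the bytes could be read with `sys.stdin.buffer.read()` and passed to the same two functions.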

So something in Firefox is covertly scanning any text that I type into form fields and immediately replacing every entered U+00A0 character with U+0020. This is surprising, disturbing and undesirable, because it unexpectedly corrupts my entered character sequence and prevents me from typing no-break space characters directly into content-management systems and wikis.

I think Firefox forms should be fully transparent for NBSP characters, such that we can start to use them in content-management systems and wikis instead of the awkward &nbsp; SGML character reference. Thanks!

Any idea where exactly this unexpected NBSP-character replacement happens, and why it was introduced in the first place?
(Reporter)

Comment 13

10 years ago
For what it's worth, Mac and Windows users can also perform the same test mentioned in comment #12; on Mac, a non-breaking space may be entered with option+space, while on Windows, it can be entered with alt+0160.

Comment 14

10 years ago
Isn’t this a duplicate of bug 218277, which has been fixed in the meantime?

Comment 15

10 years ago
(In reply to comment #14)
> Isn’t this a duplicate of bug 218277, which has been fixed in the meantime?
> 
No. This bug is about the undesired transformation when copying text to the clipboard (or primary_selection on X11). Submitting data via HTTP preserves whitespace characters.

Comment 16

10 years ago
Bug 218277 has only been fixed in Gecko rv: 1.9 (Firefox 3.0), but as far as I can see, this trivial patch has not yet been backported to the earlier Gecko releases used by Firefox 2.0 and many other browsers. All these widely used browsers continue to cause Firefox users to unknowingly murder lots of precious characters on wiki pages every day. It is remarkable, how long it took (four (4) years) to fix what was a really trivial and very nasty data-corrupting bug! Silently converting 0xa0 to 0x20 is morally about as sensible as silently converting the ASCII letter N into M everywhere, just for fun. There can be absolutely no excuse for this merciless global 0xa0-genocide on wiki pages.

(Sorry for the strong language, but it is clear from the lengthy discussion on bug 218277 that some people have failed to understand the severity of this bug. Recall that 0xa0 occurs not only in the ISO 8859 representation of NBSP, but also in the UTF-8 and EUC representations of many other characters.)
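The parenthetical remark about the byte 0xa0 can be illustrated with a hypothetical byte-level replacement. This is an assumption made purely for the example: the thread itself later points to a character-level `ReplaceChar` call, which would not damage other characters in this way.

```python
def naive_byte_replace(data):
    """Hypothetical byte-level conversion of 0xa0 to 0x20."""
    return data.replace(b"\xa0", b"\x20")

text = "voilà"                      # à is U+00E0, encoded in UTF-8 as c3 a0
encoded = text.encode("utf-8")
damaged = naive_byte_replace(encoded)

try:
    damaged.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    # The lead byte c3 is now followed by 20, which is not a continuation byte.
    decodable = False

print(encoded.hex(), damaged.hex(), decodable)
```

The text contains no NBSP at all, yet a replacement that works on raw bytes rather than on code points leaves it undecodable.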
I can still confirm the silent assassination of NBSP (0x00a0) when copying Text with NBSPs.
FF shows it right, but when you copy the text there are only normal spaces (0x0020).

Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2 (.NET CLR 3.5.30729)

AFAIK it happens on all platforms, not only Mac.

This bug should be confirmed and critical due to dataloss!
Hello, I have a possible solution:

Line 1265 of content\base\src\nsPlainTextSerializer.cpp is the root of all evil. What is it good for? Seriously, can anyone tell me?

Whatever, I have commented this line out and built FF. Bug solved ;)

Comment 19

7 years ago
No, you shouldn't remove this line, because you would then break the behavior of the flag OutputPersistNBSP. I think we should just replace this kSpace constant with the Unicode character 0xA0. At least in this method. I didn't look at the other methods of the class.

Updated

7 years ago
Status: UNCONFIRMED → NEW
Component: General → Serializers
Ever confirmed: true
OS: Mac OS X → All
Product: Firefox → Core
QA Contact: general → dom-to-text
Hardware: PowerPC → All
(In reply to comment #19)
> No, you shouldn't remove this line, because you then break the behavior of the
> flag OutputPersistNBSP.

Which we won’t need if we don’t change any NBSPs to Spaces, right? 

I’m not sure (I pulled the source just a few days ago and it’s by far the biggest code base I have ever worked with), but so far I can’t think of a case where this replacement is necessary.

Can anyone show me a case where this transformation is really needed?

In an old bug (IIRC bug 218277) David mentioned that the FF (HTML) editor replaces multiple spaces with NBSPs because HTML collapses several spaces in a row.
If you think that this case must be handled (it is just the normal HTML behaviour), you can do something to change the output.
But replacing them with NBSPs is a very bad idea. I haven’t looked at those methods yet, but I think it would be better to put a <pre> tag around the text.

As Markus said, the current FF behaviour is as bad as replacing any other valid character with another one that looks quite similar.

Greetings Florian

P.S.: Furthermore, I think this bug is critical due to dataloss ;)

Comment 21

7 years ago
Sorry, I didn't read the source code correctly. Replacing kNBSP by 0xA0 does nothing, since kNBSP is equal to... 0xA0 :-)

Apparently, this transformation was meant to fix bug 218277, no? I'm not sure. I can't tell the real purpose of this transformation.

Perhaps the solution is to use the flag nsIDocumentEncoder::OutputPersistNBSP when calling the serializer during copy-paste...
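The idea of gating the replacement behind a flag can be modelled with a toy function. This is not Gecko code: `serialize_plain_text` and the bit value of `OUTPUT_PERSIST_NBSP` are hypothetical stand-ins, and only the flag name is borrowed from the comment above.

```python
NBSP = "\u00a0"
SPACE = " "

# Hypothetical bit value; the real nsIDocumentEncoder constant may differ.
OUTPUT_PERSIST_NBSP = 1 << 0

def serialize_plain_text(text, flags=0):
    """Toy serializer: keep NBSP only when the persist flag is set."""
    if flags & OUTPUT_PERSIST_NBSP:
        return text
    # Default path, mirroring the reported behaviour: NBSP becomes a space.
    return text.replace(NBSP, SPACE)

sample = "1" + NBSP + "000"
print(repr(serialize_plain_text(sample)))
print(repr(serialize_plain_text(sample, OUTPUT_PERSIST_NBSP)))
```

In this model a caller such as the clipboard path would opt in by passing the flag, while callers that rely on the conversion keep the default.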
(In reply to comment #21)
> Apparently, this transformation was meant to fix bug 218277, no? I'm not sure. I
> can't tell the real purpose of this transformation.

The first patches would have solved it, but after some comments (#19 by David and some more) it was changed only for form submissions.
I think this was a mistake, even if it fixed bug 218277.

> Perhaps the solution is to use the flag nsIDocumentEncoder::OutputPersistNBSP
> when calling the serializer during copy-paste...
and in all the other cases… It’s never OK to replace valid characters, is it?

But I think the real bug is hidden in the mail editor (which seems to need the replacement, which is btw a really bad design), but I need more time to look through the code …
Yes, the mail editor is the tricky part here.

Commenting out line 1265 in nsPlainTextSerializer.cpp¹ fixes the bug in FF (I’ve not found any misbehaviour), but it also affects bug 290565.

Although it now allows some NBSPs to be used, it doesn’t solve that bug; it just changes the wrong behaviour.

So I would say delete this line to fix this bug, and then it’s time to fix the mail editor (there is no reason to leave this bug open just because it affects a part of the mail editor that is wrong either way).

¹ aString.ReplaceChar(kNBSP, kSPACE);

Comment 24

7 years ago
Does the code in http://mxr.mozilla.org/comm-central/source/mailnews/compose/src/nsMsgSend.cpp#1706 have something to do with this issue? The mail editor uses this code when it takes the body of the message from the editor. And maybe the plain text that is copied to the clipboard undergoes this procedure, thus losing the NBSPs?
I think the main problem is here:

http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsPlainTextSerializer.cpp#1253

But your piece of code is wrong, too. It’s never fine to replace a space with an NBSP.

I’ve built a Thunderbird with 3 or 4 fixes, but it sometimes behaves strangely because space ←→ NBSP conversions are used often and I have not yet found all the pieces of code that cause these replacements.

Comment 26

7 years ago
I didn’t see this mentioned above: it’s not only when copying to clipboard that the conversion happens (as the title of the bug states), it happens also to data sent to the server.

The most visible (annoying) effect of this is that I can’t type NBSP characters in email messages (with GMail in my case), though it also happens to all other textarea fields (used for commenting in general). French requires NBSPs around some punctuation marks, and when they’re converted to normal spaces the punctuation can wrap to the next line, which is plainly bad. (I use AltGr+Space to type it, on a custom keyboard layout. As stated before, this also happens when pasting it.)
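The wrapping effect described for French punctuation can be sketched with Python's `textwrap` module, which treats only ASCII whitespace as a break opportunity. This is a stand-in for a browser's line breaking, not a model of it, and the sample sentence and width are arbitrary choices for the illustration.

```python
import textwrap

NBSP = "\u00a0"

sentence = "Quelle surprise" + NBSP + "!"   # NBSP before the exclamation mark
corrupted = sentence.replace(NBSP, " ")     # after the reported conversion

wrapped_ok = textwrap.wrap(sentence, width=15)
wrapped_bad = textwrap.wrap(corrupted, width=15)

print(wrapped_ok)
print(wrapped_bad)
```

With the NBSP in place the exclamation mark stays attached to the preceding word; once it is a regular space, the mark can end up alone on the next line.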

As far as I can tell this doesn’t happen in <input type="text"> fields (at least copying to clipboard preserves NBSP characters, I didn’t check if they’re sent to the server), so it’s clearly not an insurmountable problem.
Input fields were fixed with bug 218277. And they fixed only this because they think it is a good idea to format HTML mails with NBSPs :(.

Comment 28

6 years ago
Apparently, this also happens when pasting content. I copied an NBSP from a program that keeps it, then tried to paste it in the search box. The result in FF 10.0.1 is that the browser freezes while searching for all the regular spaces in the page.

Comment 29

4 years ago
Still here, as of Firefox 24.0

Comment 30

4 years ago
I believe I've also wasted some hours of my life before stumbling on this issue report.
I'm trying to develop a FF extension that involves creating an xpath expression based on an html element text. To deal with &nbsp; I did something like "elementText = elementText.replace("&nbsp;", "\u00A0");".
In the end the extension returns a regular space and the xpath won't work.
I would work on this if I had the proper knowledge.
If someone could help, it would be most appreciated.
Spotted on FF 25.
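The failure mode in this comment can be reproduced outside the browser. The sketch below uses Python's standard library rather than the extension's JavaScript, and the document, the `span` element and the `find_by_text` helper are all hypothetical; it only shows that a text-equality XPath predicate stops matching once U+00A0 has been turned into U+0020.

```python
import html
import xml.etree.ElementTree as ET

NBSP = "\u00a0"

# Hypothetical document whose element text contains a real NBSP.
root = ET.fromstring("<div><span>1" + NBSP + "000</span></div>")

def find_by_text(text):
    """Find span elements whose complete text equals `text`."""
    return root.findall(f".//span[.='{text}']")

wanted = html.unescape("1&nbsp;000")     # entity decoded to U+00A0
degraded = wanted.replace(NBSP, " ")     # what is left after the conversion

print(len(find_by_text(wanted)), len(find_by_text(degraded)))
```

The helper interpolates the text directly into the expression, so it is only suitable for strings that contain no single quotes.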
Duplicate of this bug: 483762
Severity: normal → critical
Keywords: dataloss

Comment 32

3 years ago
Bug 613223 looks like a duplicate.
Duping forward to better documented bug 613223
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 613223
I hadn't read far enough here
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Duplicate of this bug: 613223

Comment 36

3 years ago
Still here, as of Firefox 30.0 (Linux)

Comment 37

2 years ago
This bug also occurs with drag-and-drop.

In Firefox Mac, in a text area, I drag-and-drop some text to another place in the same text area. 
The characters must not be altered. 
But the non-breaking spaces are altered into normal spaces.

Comment 38

2 years ago
Also confirmed on Firefox 38.0.5 on Windows 7.

Copying from Firefox to either Firefox or Notepad++ converts U+00A0 to U+0020.
Copying Notepad++ to Firefox keeps U+00A0 unchanged.

Comment 39

2 years ago
Confirmed in Firefox 41.0.2 on Windows 7 x86.

Copying from the developer console and pasting back into it changes NBSP to a regular space. Check this gif I made to see what I mean: http://i.imgur.com/2StB85Q.gif It looks like a string variable isn't equal to itself.
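Two strings that render identically but compare unequal can be told apart by listing their code points. The `describe` helper below is a hypothetical debugging aid, and the two sample strings stand in for the original value and its pasted copy.

```python
import unicodedata

def describe(text):
    """List each character as 'U+XXXX NAME' to expose look-alike characters."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

original = "a\u00a0b"   # contains NO-BREAK SPACE
pasted = "a b"          # same text after the conversion to SPACE

print(original == pasted)
for line in describe(original):
    print(line)
for line in describe(pasted):
    print(line)
```

The comparison is false even though both strings look the same on screen, and the listing shows the middle character as NO-BREAK SPACE in one and SPACE in the other.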

Comment 40

a year ago
Confirming in Firefox 45.0 on Ubuntu 14.04 x86_64.
Blocks: 290565

Updated

19 days ago
Duplicate of this bug: 624666

Comment 42

19 days ago
STR (I tried them):
Enter |huhu&nbsp;!| into http://www-archive.mozilla.org/editor/midasdemo/ (source view).
Copy this text.
Paste into Notepad++.
You get a space and not an NBSP.

Likely caused by
https://dxr.mozilla.org/mozilla-central/source/dom/base/nsPlainTextSerializer.cpp#1231-1250

Comment 43

19 days ago
Maybe one can fix this bug by adapting the solution from bug 333064:
https://hg.mozilla.org/mozilla-central/rev/a35b2ac359e5

Add nsIDocumentEncoder::OutputPersistNBSP in the right spot where plain text is placed onto the clipboard. The test can also be adapted.