Open Bug 359303 Opened 14 years ago Updated 10 days ago

Non-breaking spaces (nbsp) not copied as such

Categories: Core :: DOM: Serializers, defect
Status: REOPENED
People: Reporter: cody, Unassigned, NeedInfo
Keywords: dataloss
Attachments: 2 files

User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0

When copying text containing non-breaking spaces (&nbsp; or UTF-8 character 0xA0) to the clipboard, those characters are converted into regular spaces.  I'm not certain that this is a bug rather than a feature, but my expected behavior was that non-breaking spaces would be preserved as such.

Reproducible: Always

Steps to Reproduce:
1. Copy text containing a non-breaking space character (&nbsp; or UTF-8 0xA0) to the clipboard.
2. Paste into either a text area in Firefox, or into another application.
Actual Results:  
Text wraps at the non-breaking space.

Expected Results:  
Text should not wrap at the non-breaking space.
(In reply to comment #0)
> When copying text containing non-breaking spaces (&nbsp; or UTF-8 character
> 0xA0)

Gack...I realize I misspoke here; my brain obviously wasn't working.  I of course meant *Unicode* character 0xA0, which is encoded quite differently in UTF-8!
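For anyone else who trips over the same distinction, a small Python sketch (illustration only, nothing here comes from the Firefox source) shows how the single code point U+00A0 is encoded in a few common encodings. The helper name `encodings_of` is made up for this example.

```python
# Encode the single code point U+00A0 (NO-BREAK SPACE) in several encodings.
NBSP = "\u00a0"

def encodings_of(ch):
    """Return the hex byte sequence of `ch` in a few common encodings."""
    return {
        "latin-1": ch.encode("latin-1").hex(),
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
    }

print(encodings_of(NBSP))
# {'latin-1': 'a0', 'utf-8': 'c2a0', 'utf-16-be': '00a0'}
```

So a lone byte 0xA0 is the Latin-1 form, while UTF-8 needs the two bytes 0xC2 0xA0.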
I have the same problem with Firefox 2.0: all no-break space characters (U+00A0) are replaced by spaces. This bug does not depend on the character encoding of the HTML page; it behaves the same whether the encoding is UTF-8, UTF-16, or Latin-1.
Confirming, also reproducible in Firefox 2.0 on Windows XP.
(In reply to comment #0)
> User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US;
> rv:1.8.1) Gecko/20061010 Firefox/2.0
> Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US;
> rv:1.8.1) Gecko/20061010 Firefox/2.0
> 
> When copying text containing non-breaking spaces (&nbsp; or UTF-8 character
> 0xA0) to the clipboard, those characters are converted into regular spaces. 
> I'm not certain that this is a bug rather than a feature, but my expected
> behavior was that non-breaking spaces would be preserved as such.
> 
> Reproducible: Always
> 
> Steps to Reproduce:
> 1. Copy text containing a non-breaking space character (&nbsp; or UTF-8 0xA0)
> to the clipboard.
> 2. Paste into either a text area in Firefox, or into another application.
> Actual Results:  
> Text wraps at the non-breaking space.
> 
> Expected Results:  
> Text should not wrap at the non-breaking space.
> 

Please raise the severity of this bug to Major.
Rationale:
The input control coded by <input type="text" lang="pl" value="1&nbsp;000" > 
has the numerical value of 1,000 (one thousand).
If the value changes to "1 000", the numerical value is lost.
I understand it is kind of weird that there is no visible difference between a number and two numbers; however, while I may disagree with this setting, somebody declared it this way.
I do not know who has the authority over the national numeric format but I am sure the system vendor had consulted a standard body prior to including it in the operating system.
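A minimal Python sketch of that rationale, assuming a hypothetical consumer that splits the field value on U+0020 the way many simple parsers do (`tokens` is an invented helper, not anything Firefox runs):

```python
NBSP = "\u00a0"

def tokens(value):
    """Split a field value on ordinary spaces only, as a naive parser might."""
    return value.split(" ")

original = "1" + NBSP + "000"          # what the page author wrote: 1&nbsp;000
mangled = original.replace(NBSP, " ")  # what arrives after the conversion

print(tokens(original))  # ['1\xa0000']  -> one token
print(tokens(mangled))   # ['1', '000']  -> two tokens
```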
Confirming for FF 2.0.0.7 on Linux.

The &nbsp; is converted to 0x20 at writing time. Any subsequent copy to the clipboard or submission via HTTP shows regular spaces. OTOH, &thinsp; and the non-breakable thin space are saved properly.
(In reply to comment #5)
> The input control coded by <input type="text" lang="pl" value="1&nbsp;000" > 
> has the numerical value of 1,000 (one thousand).
> If the value changes to "1 000", the numerical value is lost.
>
The numerical value would be lost even if you used "1,000" (in English source code), "1&#x202F;000", or "1.000" (in Czech). The source code should be locale-independent. The user agent should do l10n on the value (e.g. using the @lang attribute). You could raise the same problem for date/time format processing.

In addition, HTML forms are very weak in type handling. Maybe XForms or HTML5 will affect this problem.
(In reply to comment #6)
> Confirming for FF 2.0.0.7 on Linux.
> 
> The &nbsp; is converted to 0x20 at writing time. Any consequent 
[...]
> sending via HTTP shows regular spaces.

3.0a8 (alpha version of FF) doesn't suffer from this problem. Only copying to clipboard remains affected.
(In reply to comment #7)
> (In reply to comment #5)
> > The input control coded by <input type="text" lang="pl" value="1&nbsp;000" > 
> > has the numerical value of 1,000 (one thousand).
> > If the value changes to "1 000", the numerical value is lost.
> >
> The numerical value would be lost, even if you used  "1,000" (in English source
> code), or "1&#x202F;000"  or "1.000" in Czech). The source code should be
> locale independent. 

The source code for what?  Everything is source code in HTML.  And currently there is no standard that allows you to write a locale-independent 'Hamlet'.

> User agent should do l10n on the value (e.g. using @lang
> attribute). You could address the same problem on data/time format processing.
> 

I doubt it should; it would be enough if it refrained from misrepresenting localized data as it gets them.  It converts a number to a sequence of numbers by converting non-breaking spaces to breaking spaces.  This is a bad thing and there is no excuse for this oddity.
(In reply to comment #9)
> (In reply to comment #7)
> > (In reply to comment #5)
> > > The input control coded by <input type="text" lang="pl" value="1&nbsp;000" > 
> > > has the numerical value of 1,000 (one thousand).
> > > If the value changes to "1 000", the numerical value is lost.
> > >
> > The numerical value would be lost, even if you used  "1,000" (in English source
> > code), or "1&#x202F;000"  or "1.000" in Czech). The source code should be
> > locale independent. 
> 
> The source code for what?  Everything is source code in HTML.  And currently
> there is no standard that allowed you to write locale-independent 'Hamlet'.

It depends on where you place the presentation layer. You like to cook everything on the server; I like to do it on the client.

> 
> > User agent should do l10n on the value (e.g. using @lang
> > attribute). You could address the same problem on data/time format processing.
> > 
> 
> I doubt it should; it would be enough if it refrained from misrepresenting
> localized data as it gets them. It converts a number to a sequence of numbers
> by converting non-breaking spaces to breaking spaces.  This is a bad thing and
> there is no excuse for this oddity.
> 

It doesn't convert a number into a list of numbers. It just replaces one character with another in a string. I wanted to point out that you should not describe this as a localization problem, and I showed you other languages where a different (and not just one) thousands separator is used.

Despite our misunderstanding, I agree that the client should not change something it doesn't understand.
(In reply to comment #10)
> It doesn't convert a number into list of numbers. It just replaces one
> character with another one in a string. I wanted to point you should not
> describe this problem as problem with locazation and I showed you another
> languages where different (and not only one) thousand separator is used.

Your statement amounts to "since there are languages where this bug does not cause problems, it is not a localization problem".
A similar statement would be "since there are languages that can be written using ISO-8859-1, the fact that the application is fixed to this character set is not a localization problem."
I have also stumbled across this very annoying bug under Linux in Firefox 2.0.0.10 ("Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.8.1.10) Gecko/20071015 SUSE/2.0.0.10-0.1 Firefox/2.0.0.10").

How to reproduce:

I have remapped my X11 keyboard with

  xmodmap -e 'keycode 113 = Mode_switch Mode_switch'
  xmodmap -e 'keysym space = space NoSymbol nobreakspace NoSymbol'

so that pressing AltGr+Space sends the keysym "nobreakspace" to the application (I have tested that this works fine with xterm and xev). Having NBSP easily available on my keyboard is *far* more convenient than having to type &nbsp; when editing HTML files and wiki pages.

When I type AltGr+Space into a textarea field in Firefox 2.0.0.10 and then copy-and-paste the entered NBSP back into an xterm, into the stdin of "od -t x1", it receives only the byte 0x20, which is a normal space. Likewise, if I submit the text field to an HTTP server, it receives only 0x20. I would have expected to receive the bytes 0xc2 0xa0, which is the UTF-8 encoding of the NBSP character (U+00A0).

So something in Firefox is covertly scanning any text that I type into form fields and immediately replacing any entered U+00A0 character with U+0020. This is surprising, disturbing and undesirable, because it unexpectedly corrupts my entered character sequence and prevents me from typing no-break space characters directly into content-management systems and wikis.

I think Firefox forms should be fully transparent for NBSP characters, so that we can start to use them in content-management systems and wikis instead of the awkward &nbsp; SGML character reference. Thanks!

Any idea where exactly this unexpected NBSP-character replacement happens, and why it was introduced in the first place?
For what it's worth, Mac and Windows users can also perform the same test mentioned in comment #12; on Mac, a non-breaking space may be entered with option+space, while on Windows, it can be entered with alt+0160.
Isn’t this a duplicate of bug 218277, which has been fixed in the meantime?
(In reply to comment #14)
> Isn’t this a duplicate of bug 218277 which has been fixed in the mean time?
> 
No. This bug is about the undesired transformation when copying text to the clipboard (or primary_selection on X11). Submitting data via HTTP preserves whitespace characters.
Bug 218277 has only been fixed in Gecko rv: 1.9 (Firefox 3.0), but as far as I can see, this trivial patch has not yet been backported to the earlier Gecko releases used by Firefox 2.0 and many other browsers. All these widely used browsers continue to cause Firefox users to unknowingly murder lots of precious characters on wiki pages every day. It is remarkable, how long it took (four (4) years) to fix what was a really trivial and very nasty data-corrupting bug! Silently converting 0xa0 to 0x20 is morally about as sensible as silently converting the ASCII letter N into M everywhere, just for fun. There can be absolutely no excuse for this merciless global 0xa0-genocide on wiki pages.

(Sorry for the strong language, but it is clear from the lengthy dialog on bug 218277 that some people have failed to understand the severity of this bug. Recall that 0xa0 occurs not only in the ISO 8859 representation of NBSP, but also in the UTF-8 and EUC representation of many other characters.)
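The remark about 0xa0 turning up inside other characters can be checked with a short Python sketch. It only illustrates what a byte-level replacement would do; the serializer discussed in this bug works on characters, and the sample strings are arbitrary.

```python
# Characters other than NBSP whose UTF-8 form contains the byte 0xA0.
samples = ["\u00e0", "\u0120", "\u2020"]

for ch in samples:
    print(f"U+{ord(ch):04X}", ch.encode("utf-8").hex())
# U+00E0 c3a0
# U+0120 c4a0
# U+2020 e280a0

def byte_level_replace(text):
    """Naively swap every 0xA0 byte for 0x20 in the UTF-8 form of `text`."""
    return text.encode("utf-8").replace(b"\xa0", b"\x20")

damaged = byte_level_replace("voil\u00e0")
print(damaged)  # b'voil\xc3 '
```

The result is no longer valid UTF-8, because the lead byte 0xC3 has lost its continuation byte.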
I can still confirm the silent assassination of NBSP (0x00a0) when copying text containing NBSPs.
FF displays it correctly, but when you copy the text there are only normal spaces (0x0020).

Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2 (.NET CLR 3.5.30729)

AFAIK it happens on all platforms, not only Mac.

This bug should be confirmed and marked critical due to dataloss!
Hello, I have a possible solution:

Line 1265 in content\base\src\nsPlainTextSerializer.cpp is the root of all evil. What is it good for? Seriously, can anyone tell me?

Whatever; I have commented this line out and built a FF. Bug solved ;)
No, you shouldn't remove this line, because you would then break the behavior of the flag OutputPersistNBSP. I think we should just replace this kSpace const by the Unicode character 0xA0. At least in this method. I didn't look at the other methods of the class.
Status: UNCONFIRMED → NEW
Component: General → Serializers
Ever confirmed: true
OS: Mac OS X → All
Product: Firefox → Core
QA Contact: general → dom-to-text
Hardware: PowerPC → All
(In reply to comment #19)
> No, you shouldn't remove this line, because you then break the behavior of the
> flag OutputPersistNBSP.

Which we won’t need if we don’t change any NBSPs to Spaces, right? 

I’m not sure (I pulled the source just a few days ago and it’s by far the biggest I have ever worked with), but so far I can’t think of a case where this replacement is necessary.

Can anyone show me a case where this transformation is really needed?

In an old bug (IIRC bug 218277) David mentioned that the FF (HTML) editor replaces multiple spaces with NBSPs because HTML ignores additional spaces in a row.
If you think that this case must be handled (it is just the NORMAL HTML behaviour), you can do something to change the output.
But replacing them with NBSPs is a very bad idea. I haven’t looked at those methods yet, but I think it would be better to put a <pre> tag around it.

As Markus said, the current FF behaviour is as bad as replacing any other valid character with another one (that looks quite similar).

Greetings Florian

P.S.: Furthermore, I think this bug is critical due to dataloss ;)
Sorry, I didn't read the source code correctly. Replacing kNBSP by 0xA0 does nothing, since kNBSP is equal to... 0xA0 :-)


Apparently, this transformation was meant to fix bug 218277, no? I'm not sure. I can't tell the real purpose of this transformation.

Perhaps the solution is to use the flag nsIDocumentEncoder::OutputPersistNBSP when calling the serializer during copy-paste...
(In reply to comment #21)
> Apparently, this transformation was to fixed bug 218277, no ? I'm not sure. I
> can't tell the real purpose of this transformation.

The first patches would have solved it, but after some comments (#19 by David and some more) it was changed only for form submissions.
I think this was a mistake, even if it fixed bug 218277.

> Perhaps the solution is to use the flag nsIDocumentEncoder::OutputPersistNBSP
> when calling the serializer during copy-paste...
and all the other cases… It’s never OK to replace valid characters, is it?

But I think the real bug is hidden in the mail editor (which seems to need the replacement, which is, by the way, a really bad design), but I need more time to look through the code…
Yes, the mail editor is the tricky part here.

Commenting out line 1265 in nsPlainTextSerializer.cpp¹ fixes the bug in FF (I’ve not found any misbehaviour), but it also affects bug 290565.

Although it now allows some NBSPs to be used there, it doesn’t solve that bug; it just changes the wrong behaviour.

So I would say: delete this line to fix this bug, and then it’s time to fix the mail editor (there is no reason to keep this bug just because it affects a part of the mail editor that is wrong either way).

¹ aString.ReplaceChar(kNBSP, kSPACE);
Does the code in http://mxr.mozilla.org/comm-central/source/mailnews/compose/src/nsMsgSend.cpp#1706 have something to do with this issue? The mail editor uses this code when it takes the body of the message from the editor. And maybe the plain text that is copied to the clipboard undergoes this procedure, thus losing the NBSPs?
I think the main problem is here:

http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsPlainTextSerializer.cpp#1253

But your piece of code is wrong, too. It’s never fine to replace a space with an NBSP. 

I’ve built a Thunderbird with 3 or 4 fixes, but it sometimes behaves strangely because space ←→ NBSP conversions are used in many places and I have not yet found all the pieces of code that cause these replacements.
I didn’t see this mentioned above: it’s not only when copying to clipboard that the conversion happens (as the title of the bug states), it happens also to data sent to the server.

The most visible (annoying) effect of this is that I can’t type NBSP characters in email messages (with GMail in my case), though it also happens to all other textarea fields (used for commenting in general). French requires NBSPs around some punctuation marks, and when they’re converted to normal spaces the punctuation can wrap to the next line, which is plainly bad. (I use AltGr+Space to type it, on a custom keyboard layout. As stated before, this also happens when pasting it.)

As far as I can tell this doesn’t happen in <input type="text"> fields (at least copying to clipboard preserves NBSP characters, I didn’t check if they’re sent to the server), so it’s clearly not an insurmountable problem.
Input fields were fixed with bug 218277. And they fixed only this because they think it is a good idea to format HTML mails with NBSPs :(.
Apparently, this also happens when pasting content. I copied an NBSP from a program that keeps it, then tried to paste it in the search box. The result in FF10.0.1 is that the browser freezes while searching for all the regular spaces in the page
Still here, as of Firefox 24.0
I believe I've also wasted some hours of my life before stumbling on this issue report.
I'm trying to develop a FF extension that involves creating an XPath expression based on an HTML element's text. To deal with &nbsp; I did something like "elementText = elementText.replace("&nbsp;", "\u00A0");".
In the end the extension returns a regular space and the XPath won't work.
I would work on this if I had the proper knowledge.
If someone could help, it would be most appreciated.
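For what it's worth, the matching problem itself can be reproduced outside the browser. The following is a rough Python sketch, not extension code: it assumes the element text arrives with the entity still spelled out, uses the limited XPath subset of `xml.etree.ElementTree`, and `find_cells` is an invented helper. The sample document is made up.

```python
import html
import xml.etree.ElementTree as ET

# Two cells that look identical on screen; XML has no &nbsp;, so &#160; is used.
doc = ET.fromstring("<root><td>1&#160;km</td><td>1 km</td></root>")

def find_cells(text_with_entities):
    """Decode entities, then match cells whose full text equals the result.

    Assumes the text contains no apostrophe, since it is pasted into a
    single-quoted XPath literal.
    """
    text = html.unescape(text_with_entities)  # "&nbsp;" becomes U+00A0
    return doc.findall(f".//td[.='{text}']")

print([cell.text for cell in find_cells("1&nbsp;km")])  # ['1\xa0km']
print([cell.text for cell in find_cells("1 km")])       # ['1 km']
```

The expression only matches the first cell while the U+00A0 survives; once it has been turned into U+0020, it silently matches the other one instead.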
Spotted on FF 25.
Duplicate of this bug: 483762
Severity: normal → critical
Keywords: dataloss
Bug 613223 looks like a duplicate.
Duping forward to better documented bug 613223
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 613223
I hadn't read far enough here
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Duplicate of this bug: 613223
Still here, as of Firefox 30.0 (Linux)
This bug also occurs with drag-and-drop.

In Firefox Mac, in a text area, I drag-and-drop some text to another place in the same text area. 
The characters must not be altered. 
But the non-breaking spaces are altered into normal spaces.
Also confirmed on Firefox 38.0.5 on Windows 7.

Copying from Firefox to either Firefox or Notepad++ converts U+00A0 to U+0020.
Copying Notepad++ to Firefox keeps U+00A0 unchanged.
Confirmed in Firefox 41.0.2 on Windows 7 x86.

Copying from developer console and pasting back in it changes nbsp to regular space. Check this gif I made to see what I mean: http://i.imgur.com/2StB85Q.gif It looks like string variable isn't equal to itself.
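A quick way to see why the two strings compare unequal is to list their code points. A minimal Python sketch (the helper `describe` is invented for this illustration):

```python
import unicodedata

def describe(s):
    """List each character of `s` as 'U+XXXX NAME' so look-alikes stand out."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}" for c in s]

before = "a\u00a0b"  # original text with a no-break space
after = "a b"        # what comes back after copy and paste

print(before == after)  # False
print(describe(before))
# ['U+0061 LATIN SMALL LETTER A', 'U+00A0 NO-BREAK SPACE', 'U+0062 LATIN SMALL LETTER B']
print(describe(after))
# ['U+0061 LATIN SMALL LETTER A', 'U+0020 SPACE', 'U+0062 LATIN SMALL LETTER B']
```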
Confirming in Firefox 45.0 on Ubuntu 14.04 x86_64.
Blocks: 290565
Duplicate of this bug: 624666
STR (I tried them):
Enter |huhu&nbsp;!| into http://www-archive.mozilla.org/editor/midasdemo/ (source view).
Copy this text.
Paste into Notepad++.
You get a space and not a NBSP.

Likely caused by
https://dxr.mozilla.org/mozilla-central/source/dom/base/nsPlainTextSerializer.cpp#1231-1250
Maybe one can fix this bug by adapting the solution from bug 333064:
https://hg.mozilla.org/mozilla-central/rev/a35b2ac359e5

Add nsIDocumentEncoder::OutputPersistNBSP in the right spot where plain text is placed onto the clipboard. The test can also be adapted.
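To spell out the idea, here is a toy Python model of a flag-gated replacement. It is only a sketch of the behaviour described in this bug, not the Gecko code: `to_plain_text` is an invented name and the flag value is a placeholder, not the real nsIDocumentEncoder constant.

```python
NBSP = "\u00a0"
OUTPUT_PERSIST_NBSP = 1 << 0  # placeholder bit, NOT the real nsIDocumentEncoder value

def to_plain_text(text, flags=0):
    """Toy stand-in for the plain-text serializer's NBSP handling."""
    if flags & OUTPUT_PERSIST_NBSP:
        return text                 # caller asked to keep U+00A0
    return text.replace(NBSP, " ")  # the behaviour reported in this bug

copied = "huhu" + NBSP + "!"
print(to_plain_text(copied) == "huhu !")                     # True
print(to_plain_text(copied, OUTPUT_PERSIST_NBSP) == copied)  # True
```

In this model the fix amounts to the clipboard caller passing the flag, while callers that rely on the replacement keep their current behaviour.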
I'll argue that Firefox should be converting &nbsp; characters to their respective normal space characters.  It helps you avoid problems like these:

https://stackoverflow.com/questions/51794845/pre-html-nbsp-m-bm-bash-scripts-syntax-highlighting-on-the-web-remove-asc

Linux does NOT like &nbsp; characters in the bash shell.  

The real problem is that each browser handles this differently.  Chrome copies the &nbsp; characters as they are, but then you run into problems with that approach when the copied text is used in different environments.

I even filed a bug with Chrome thinking that it should be converting &nbsp; like Firefox does:

https://bugs.chromium.org/p/chromium/issues/detail?id=887511

With Chrome, I have to use the extra step of 

> textToCopy.replace(new RegExp(String.fromCharCode(160), "g"), " ");

To replace &nbsp; with normal space characters. 

I think a discussion needs to take place as to what the correct behavior should be, and all major browsers should pick one way or the other, but I hope that &nbsp; characters end up being converted when copied because why would you ever want to preserve &nbsp; characters?  Seems to be a thing only the web uses?
(In reply to Eric from comment #44)
> I'll argue that Firefox should be converting &nbsp; characters to their
> respective normal space characters.  It helps you avoid problems like these:
> 
> https://stackoverflow.com/questions/51794845/pre-html-nbsp-m-bm-bash-scripts-
> syntax-highlighting-on-the-web-remove-asc

That is an issue with the site, not with the web browser: source code listings should not use one character (e.g. NBSP, curly quotes) and mean a different one (normal space, typewriter quotes). If they need to use multiple consecutive white space characters they should rely on `white-space` CSS property.

> Chrome copies the &nbsp; characters as they are, but then you run into problems
> with that approach when the copied text is used in different environments.

And when you are editing an article for a news site and paste the text into a word processor, you will lose the non-breaking spaces present in it, breaking the typography.

> I think a discussion needs to take place as to what the correct behavior
> should be, and all major browsers should pick one way or the other, but I
> hope that &nbsp; characters end up being converted when copied because why
> would you ever want to preserve &nbsp; characters?  Seems to be a thing only
> the web uses?

In many languages non-breaking space is expected to be used; for example, you are supposed to use NBSP after non-syllabic prepositions in Czech. This is not a web thing but professional text editing thing.
(In reply to Eric from comment #44)
> I'll argue that Firefox should be converting &nbsp; characters to their
> respective normal space characters.  It helps you avoid problems like these:
> 
> https://stackoverflow.com/questions/51794845/pre-html-nbsp-m-bm-bash-scripts-
> syntax-highlighting-on-the-web-remove-asc

As said in the answer there, the syntax highlighting script should not be corrupting the source code it is highlighting. It is not the browser’s job to undo damage introduced by buggy software.


> I think a discussion needs to take place as to what the correct behavior
> should be, and all major browsers should pick one way or the other, but I
> hope that &nbsp; characters end up being converted when copied because why
> would you ever want to preserve &nbsp; characters?  Seems to be a thing only
> the web uses?

Nope. The U+00A0 NO-BREAK SPACE character is useful in plain text, word processors, and all kinds of natural language text that is intended to be displayed on multiple lines. Its purpose is to prevent a line break between words or other runs of non-whitespace characters. Here are a few examples where non-breaking spaces are essential:

* Some locales (such as Russian) use a non-breaking space for digit grouping. If you replace it with a normal space, you get weird line wrapping with numbers such as 300 000 (three hundred thousand) where 300 remains on one line and 000 is wrapped. Phone numbers are also often formatted with digit grouping, and should not be broken.

* In French typography, you are supposed to put a space on the inside of quotes « like this » and before some punctuation marks such as colons : like this. A line break is not permitted at these points.

* If an abbreviation ending in a period ends up at the end of line, it can be visually mistaken for end of sentence. To prevent that, abbreviation period should be followed with a non-breaking space; sentence end full-stop, with a normal space.

This bug makes it impossible to preserve good formatting when copying and pasting text.
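The digit-grouping case is easy to reproduce with Python's `textwrap`, which only breaks lines at ASCII whitespace. This is merely a sketch with an arbitrary sentence and width, not a model of any browser's line breaking:

```python
import textwrap

NBSP = "\u00a0"
sentence = "The total came to 300" + NBSP + "000 rubles"

# With the no-break space intact, the number stays on one line.
print(textwrap.wrap(sentence, width=21))
# ['The total came to', '300\xa0000 rubles']

# After the conversion to U+0020, the number is split across lines.
print(textwrap.wrap(sentence.replace(NBSP, " "), width=21))
# ['The total came to 300', '000 rubles']
```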
I guess if it's a language locale thing, that makes sense.  I don't personally like it, but your logic seems perfectly reasonable.  In any event, the bash shell doesn't like &nbsp;... so if you end up copying text that contains &nbsp; and try to use it there, you're going to have a bad time.
Looks like there is an undocumented configuration option for the syntax highlighter (http://alexgorbatchev.com/SyntaxHighlighter/manual/configuration/) script I'm using:

> SyntaxHighlighter.config.space = " ";
> SyntaxHighlighter.all();

Fixes my issue.  The script was already using white-space: pre for the css styling.  SyntaxHighlighter.config.space was set to "&nbsp;" by default for some reason.
Another couple of examples are figures with units, like 1 km, references, like fig. 12 on p. 9, and footnote markers [1].

Using nonbreaking spaces will break most program code. A similar annoyance is code with curly quotes (“hello world” instead of "hello world", which I’ve seen many times as well). Firefox will not magically replace those curly quotes with ASCII quotes. I would prefer if it wouldn’t replace nonbreaking spaces with regular spaces either.

----

[1] https://practicaltypography.com/nonbreaking-spaces.html
Non-breaking spaces make sense in code, and it is important to preserve them to avoid buggy behavior with potentially bad consequences. For instance, in a shell, if I write:

  rm foo bar

(where the space between foo and bar is a non-breaking space), the file "foo bar" is expected to be removed. But if the non-breaking space is converted to a normal space (e.g. after a copy-paste), then the shell will attempt to remove files "foo" and "bar"!
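The difference can be sketched with Python's `shlex`, which approximates POSIX shell word splitting; a real shell splits on its IFS characters, so treat this as an illustration rather than a faithful bash model:

```python
import shlex

NBSP = "\u00a0"
command = "rm foo" + NBSP + "bar"

# U+00A0 is not shell whitespace, so "foo<NBSP>bar" stays a single argument.
print(shlex.split(command))                     # ['rm', 'foo\xa0bar']

# After the conversion to U+0020 the same text names two separate files.
print(shlex.split(command.replace(NBSP, " ")))  # ['rm', 'foo', 'bar']
```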
How does one even insert a non-breaking space character in a terminal window in bash?  I'm not seeing a "&nbsp;" key on my keyboard unfortunately.
(In reply to Eric from comment #51)
> How does one even insert a non-breaking space character in a terminal window
> in bash?  I'm not seeing a "&nbsp;" key on my keyboard unfortunately.

Aside from copy-pasting or Tab-completing, there are several ways:

When running bash or zsh under a Unicode locale, you can use an ANSI C-like string containing a unicode escape: $'\u00a0'

If your terminal emulator is GTK-based, you can hit Ctrl+Shift+u and then type 00a0<Enter>

If you're using the IBus IME system, you can enable Ctrl+u as a cross-toolkit equivalent to GTK's Ctrl+Shift+u.

If you haven't set URxvt.iso14755 to false in ~/.Xresources to enable mapping Ctrl+Shift+... combinations as hotkeys, rxvt-unicode will let you type a non-breaking space by holding down Ctrl+Shift, typing 00a0, and then releasing them.

Your keyboard layout may map it to something like AltGr+Spacebar.

I use these settings to amend my US ASCII layout so that Right Alt is "Level 3 Shift" (ie. AltGr), AltGr+Spacebar produces a non-breaking space and AltGr+Shift+Spacebar produces thin non-breaking spaces.

setxkbmap -variant altgr-intl -option lv3:ralt_switch -option nbsp:level3n

I also wrote a blog post on using setxkbmap for this sort of thing (with an Xmodmap explanation in one of the comments), in case you want it:

http://blog.ssokolow.com/archives/2011/12/24/getting-your-way-with-setxkbmap/

The nice thing about using setxkbmap and Xmodmap is that they only persist across X server restarts if you update the right config files and don't affect SSH sessions, so you can experiment without fear of getting stuck.
Please do not use Bugzilla as a forum and follow the Bugzilla etiquette [1], comments in this bug should only help fixing the issue in Firefox.

[1] https://bugzilla.mozilla.org/page.cgi?id=etiquette.html

Thank you.

When will this bug finally get solved? I always have to use Chrome when I want to paste text containing non-breaking spaces into a website.

There are so many duplicates that are marked as resolved but are not solved in any way.

This bug should also be added as a duplicate because it is the same problem: https://bugzilla.mozilla.org/show_bug.cgi?id=268995

No, in bug 268995, a copy is not involved: the nbsp came from the original form data and was not copied or pasted.

Oh, I see. But then this bug is still annoying, and it has been rated critical for quite a while.

So please take a look at it again. I am sure this can be solved quite fast. Unfortunately I am not experienced enough to create a patch for this issue myself.

This bug is, in fact, trivial to fix, and the attached patch does this (a trivial adaptation of a patch already attached to bug #194498; see that bug as well).

The reason is that it's not so much a bug as an intentional misfeature: it's not that Firefox accidentally breaks U+00A0 NO-BREAK SPACE (=:NBSP), it's that somebody deliberately decided that it should, and added explicit code to do it. The logic behind the code is that HTML composer sometimes turns some U+0020 SPACE (when there are multiple in a row, or something) into U+00A0 NO-BREAK SPACE and the fear is that pasting them as such would break things.

I think this logic is bad in every sense. Every bit of code which turns U+00A0 NO-BREAK SPACE into U+0020 SPACE or vice versa should be removed, and Firefox should always remain Unicode-transparent. (A middle ground, if fear of breaking things is so great, would be to make this a user-settable preference. I don't know how to do this.) But I don't know who makes these decisions or how to contact them or how to argue about this; I'm pretty sure they don't read this kind of bug reports, however, so it's useless to discuss it here.

In the meantime, I simply apply the patch when compiling Firefox.

Comment on attachment 9060386 [details] [diff] [review]
patch removing special treatment that breaks U+00A0 NO-BREAK SPACE

Review of attachment 9060386 [details] [diff] [review]:
-----------------------------------------------------------------

I fully agree with David A. Madore's assessment here and I strongly recommend to merge in his trivial patch to remove an ancient and very unfortunate, data-corrupting design choice in Firefox. The no-break space character exists in ISO 8859 and ISO 10646 for a good reason, and a web browser must never quietly replace this character with another character without explicit informed user consent.

I registered here just to express my support for the change.
A program should not mangle data, unless there is some really good reason - which i do not see here.

We are now in 2019, almost 20 years after the year 2000, and Unicode is now well supported in every OS. This conversion was a bad idea from the very beginning; 13 years later it is complete nonsense.
I can type in non-breaking and thin spaces ( ) with my keyboard, but when I paste a text containing such characters in a Firefox form input, I lose the non-breaking spaces, whereas the thin spaces are kept. Do you really see a good reason for this design?!

As a French proofreader, I use an external text editor (Gedit) that can display spaces and draw these special spaces differently. I make some automatic corrections to comply with the French orthotypographic rules, and some of these corrections replace certain spaces with non-breaking and thin spaces. As you can guess, when I copy the corrected text from my editor and then paste it into Firefox, I lose all the non-breaking spaces. :-( It’s a big waste of time!

So please, consider removing this useless, bad and annoying conversion of the non-breaking spaces. As a benefit, you will remove a couple lines of code and you will save a few CPU cycles when pasting text in form inputs.
No caveats, just benefits. So, please, what are you waiting for?…

(In reply to Jorg K (GMT+2) from comment #62)

> What do you think, Masayuki-san, should we remove
> https://searchfox.org/mozilla-central/rev/ab6f4c453d15ab82147c630a8b886b40240ca72b/dom/base/nsPlainTextSerializer.cpp#1143-1147
> as suggested in attachment 9060386 [details] [diff] [review]?

The change would break copy in HTML editor. In HTML editor without <pre> element, 2nd and later ASCII whitespaces are automatically converted to NBSPs since if web browsers don't do that, whitespace collapsing of HTML spec causes only one whitespace rendered. Unfortunately, we cannot distinguish whether every NBSP in an editable text node is truly NBSP or not (i.e., automatically converted one). Therefore, we cannot fix this bug without big changes in HTML editor. However, in non-editable content, not so. So, if nsPlaintextSerializer handles NBSPs when it retrieves text from a node, we can fix this for non-editable content.
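The ambiguity described above can be shown with a deliberately simplified Python model. It assumes only what this comment states (the second and later spaces of a run become NBSPs); the real editor logic is more involved, and `editor_store` is an invented name.

```python
import re

NBSP = "\u00a0"

def editor_store(typed):
    """Toy model: keep the first space of a run, turn the rest into NBSPs."""
    return re.sub(r" ( +)", lambda m: " " + NBSP * len(m.group(1)), typed)

two_spaces = "a  b"                 # user typed two ordinary spaces
real_nbsp = "a " + NBSP + "b"       # user typed a space plus a genuine NBSP

print(editor_store(two_spaces) == editor_store(real_nbsp))  # True
```

Both inputs end up as the same stored text, so a serializer looking at the text node alone cannot tell which NBSPs the user really meant.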

Flags: needinfo?(masayuki)

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(Still struggling with the pain, but becoming better) from comment #63)

  1. Is there any reason to preserve multiple spaces? Formatting with spaces (typewriter-like) is generally a bad idea, since it is not reproducible, and there are more proper ways for indentation and alignment.
  2. If preserving them is necessary, doesn't your HTML support the white-space: pre-wrap style?

(In reply to Mikhail Ryazanov from comment #64)

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(Still struggling with the pain, but becoming better) from comment #63)

  1. Is there any reason to preserve multiple spaces? Formatting with spaces (typewriter-like) is generally a bad idea, since it is not reproducible, and there are more proper ways for indentation and alignment.
  2. If preserving them is necessary, doesn't your HTML support the white-space: pre-wrap style?

Not to mention that, for people trying to produce rich-text inputs that are both well-formed and non-confusing to users (i.e. me), it's a pain to have to write and debug JavaScript which replicates LyX's behaviour of discarding Space or Enter keypresses if they'd cause the editor view to abuse semantic markup to insert presentational whitespace (e.g. non-breaking spaces and empty paragraph tags).

(In reply to Mikhail Ryazanov from comment #64)

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(Still struggling with the pain, but becoming better) from comment #63)

  1. Is there any reason to preserve multiple spaces? Formatting with spaces (typewriter-like) is generally a bad idea, since it is not reproducible, and there are more proper ways for indentation and alignment.

Most users don't know:

  • whether the current editor is an HTML editor (contenteditable or designMode) or a text editor (<input> or <textarea>).
  • how multiple whitespaces are treated in an HTML editor (without <pre>).

Therefore it makes sense to display multiple typed ASCII whitespaces as-is.

  2. If preserving them is necessary, doesn't your HTML support the white-space: pre-wrap style?

Ideally yes, but all browsers convert whitespaces to NBSPs:

Therefore, I think that we should keep converting NBSPs in editable elements (contenteditable or designMode), but I think we shouldn't convert them in non-editable elements or <input>/<textarea>, so that users can work with the raw data. So, I think that patch may break something in the former case.

Let me reiterate what I suggested in a comment above: why not make this a user-settable preference? So experienced users who know what an unbreakable space is and care about it can set this preference and Firefox's behavior would then be transparent in every respect (in the editor and while copying and pasting, every character would be retained exactly as it is, including multiple spaces), whereas the default behavior would be either the current one, or some other compromise.

I don't know enough about how Firefox works to code this myself, but I imagine it would be fairly trivial for someone who knows: at least, it doesn't add any significant amount of complexity.

I agree with David that a user-settable preference would be a pretty good compromise until a better solution is found. And it’s only a couple lines of code. Ignoring the problem shouldn’t be an option anymore.

My main concern here is composing mail in Thunderbird. Will not converting in non-editable elements fix that? Because I cannot type French correctly without NBSP support in Thunderbird, and that has been quite an issue for years. I work around it by using NWNBSP, but that is far from ideal.

If an NBSP is alone, we can stop converting it even in editable content. E.g., <p>foo&nbsp;bar</p>.

I'm not a native speaker of any Western language, so I don't know how NBSP is used in practice. Is it ever used consecutively, like foo&nbsp;&nbsp;&nbsp;bar?

See comment #26 and comment #46 under "French typography", for example: «&nbsp;like this&nbsp;»

No, actually this is a great idea; I don’t know any language where you are supposed to have multiple ones in a row. In French you use it before colons, within quotes and for « incises » mostly, like this « Une idée : ne pas remplacer les espaces insécables uniques — pour lesquelles ça ne changerait a priori rien pour le rendu et qui sont en revanche l’utilisation légitime la plus probable. ». So yes, that would work for French at least.

(In reply to Bruno Pagani from comment #73)

No, actually this is a great idea; I don’t know any language where you are supposed to have multiple ones in a row. In French you use it before colons, within quotes and for « incises » mostly, like this «&nbsp;Une idée&nbsp;: ne pas remplacer les espaces insécables uniques —&nbsp;pour lesquelles ça ne changerait a priori rien pour le rendu et qui sont en revanche l’utilisation légitime la plus probable.&nbsp;». So yes, that would work for French at least.

Of course those NBSPs in my comment were converted for display, so I’m quoting myself and putting the French excerpt in code formatting just above.

+1 to that heuristic: If NBSP is encountered adjacent to another NBSP or to an ordinary SPACE, assume it is used for alignment/indentation and convert it to a SPACE. Otherwise, assume it is used in its intended function and leave it alone.
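A rough Python sketch of this heuristic, for illustration only: the function name `serialize_with_heuristic` is made up here, and nothing in it reflects how nsPlainTextSerializer is actually structured.

```python
NBSP = "\u00a0"
SPACE_LIKE = (" ", NBSP)

def serialize_with_heuristic(text: str) -> str:
    """Convert an NBSP to SPACE only when it touches another NBSP or a SPACE."""
    out = []
    for i, ch in enumerate(text):
        if ch == NBSP:
            # Neighbours are read from the original text, not the output so far.
            prev_ch = text[i - 1] if i > 0 else ""
            next_ch = text[i + 1] if i + 1 < len(text) else ""
            if prev_ch in SPACE_LIKE or next_ch in SPACE_LIKE:
                ch = " "  # looks like alignment/indentation
        out.append(ch)
    return "".join(out)

# Lone NBSPs (French typography) survive; runs used for padding do not.
french = "\u00ab\u00a0like this\u00a0\u00bb"
padded = "foo \u00a0\u00a0bar"
print(serialize_with_heuristic(french) == french)
print(serialize_with_heuristic(padded) == "foo   bar")
```

Reading the neighbours from the original string rather than from the partly converted output keeps the result independent of scan direction.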

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(no pain of the broken bone, but the corset blocks me to concentrate) from comment #71)

I'm not a native speaker of any Western language, so I don't know how NBSP is used in practice. Is it ever used consecutively, like foo&nbsp;&nbsp;&nbsp;bar?

I'm not an expert on every European language but, in my experience, there are three ways &nbsp; gets used:

  1. Official language rules of the type that has already been mentioned. I've never heard of any of these situations requiring more than one non-breaking space in a row. (The closest thing that comes to mind is the typewriter-era rule that you type two spaces after a sentence-ending period when using a monospace font to make it easier to see sentence boundaries in among all the whitespace that isn't getting kerned away. Aside from that, the convention was to invent new types of space characters rather than typing multiple of an existing one to ensure that the typeface would preserve the desired proportional relationships and, in the case of digital typography, the semantic meanings.)

  2. Replacing what would otherwise be a normal space in order to manipulate where a line may break without help from whoever is developing the CMS or forum software. (e.g. putting &nbsp; between the last two/few words in a paragraph so that word-wrapping will treat them as a single word. This is done as a way of preventing what typesetters call "orphans"... basically, when a paragraph's last line is so short that it gives the impression of the paragraph break being too tall.)

  3. A chain of &nbsp; to subvert the stylesheet's rules for how text should be displayed. (Pranksters used to try to feed this into comment forms to force a page to be wider than the viewport.)

Only the third option requires multiple non-breaking spaces in a row.

I disagree about the heuristic as this will introduce confusion (e.g. users thinking that nbsp are preserved, but in some particular cases, they actually aren't). A user-settable preference would be better.

(In reply to Vincent Lefevre from comment #77)

I disagree about the heuristic as this will introduce confusion (e.g. users thinking that nbsp are preserved, but in some particular cases, they actually aren't). A user-settable preference would be better.

Which is still better than what we have today, because there is no indication that NBSPs are discarded in a lot of cases, and you might only notice after, e.g., receiving a copy of your message that has been wrapped to 80 chars.

In our case we use non-breaking spaces between words and numbers like "page 35" or "article 5" or "§ 50" and so on. In our company we develop web applications to convert text of any kind to a well laid out, ready-to-print PDF or other formats. It happens all the time that someone copies text containing non-breaking spaces into our application, expecting the spaces to be preserved as-is after submitting the form.
At the moment this only works in browsers other than Firefox. Our issue really is only submitting forms.

Another use, in French, is to avoid a number being separated from its unit:

  • « L’altitude du Mont Everest est de 8 848[non breaking space]mètres ».

But a thin space ( ) is used when the unit symbol is used (the thin space is also the thousands separator):

  • « altitude du Mont Everest : 8 848 m ».

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(no pain of the broken bone, but the corset blocks me to concentrate) from comment #63)

The change would break copy in the HTML editor. In an HTML editor without a <pre> element, the 2nd and later ASCII whitespaces are automatically converted to NBSPs [...]

This is completely broken! I've just received a mail written from Firefox in something like a webmail, containing C code. This C code (as very often) contains multiple consecutive spaces for indenting. But in the mail, all spaces except the last one were converted to NBSPs. And as C code, this is invalid!

Another example: it's impossible to use Transifex (https://www.transifex.com/) for correct French translation since Firefox keeps changing nbsp's into normal spaces.

Quite disappointing that this bug is still here.

The nbsp is used as a typographic treatment in many European languages. When a text is typeset, nbsp prevents unwanted line breaks between characters or words (e.g. nbsp is placed between a single-letter word and the following word). The fact that Firefox, alone among browsers, deliberately removes nbsp from textarea/input forces users to switch between browsers when they’re addressing typography on the web. Tools such as https://typopo.tota.sk (automatic correction of typography errors) do not work in Firefox, since one of their jobs is to place non-breaking spaces properly. I’d say the solution suggested in comment 57 would be the best option to solve this problem.

Severity: critical → N/A

Bonjour David,
I am not saying that the patch would be accepted but managing preferences is pretty easy to do in Firefox.
Would you be interested in proposing a patch? I can help internally to find a reviewer and find help if needed.

We also have a new step-by-step contribution tutorial now:
https://firefox-source-docs.mozilla.org/contributing/contribution_quickref.html

merci

Flags: needinfo?(david+bugs)

I build Firefox with David’s patch for my personal use, and I have to say that some code syntax highlighters and markdown processors out there have become addicted to this corruption of non-breaking spaces by browsers. E.g. when somebody sends a block of code over Slack, and I copy it to my editor, I get nbsps in indentation. So some heuristic to only convert spans of multiple adjacent nbsps would be desirable.

(In reply to Yuri Khan from comment #86)

I build Firefox with David’s patch for my personal use, and I have to say that some code syntax highlighters and markdown processors out there have become addicted to this corruption of non-breaking spaces by browsers. E.g. when somebody sends a block of code over Slack, and I copy it to my editor, I get nbsps in indentation. So some heuristic to only convert spans of multiple adjacent nbsps would be desirable.

I think a Web browser should behave correctly at all times. Fixing Web pages designed by incompetent designers is their own problem.

I think a Web browser should behave correctly at all times. Fixing Web pages designed by incompetent designers is their own problem.

You’re right of course, but web browsers have not been allowed to choose correct behavior over backward bug compatibility since the Netscape/IE times. A partial fix now is preferable to a perfect fix never.

Yes, the editor needs to insert NBSPs when multiple white-spaces are adjacent, because adjacent ASCII white-spaces are collapsed to a single ASCII white-space at rendering time, as the HTML/SGML/XML specs require, unless the parent element is explicitly styled as "preformatted" (currently, I'm rewriting the normalizer to behave similarly to Chrome). So, I think that only when an NBSP is not adjacent to another NBSP or an ASCII white-space can we say the NBSP is truly an NBSP. Unfortunately, as far as I've tested, you cannot insert multiple NBSPs in one place in contenteditable, because web browsers cannot remember which white-spaces were NBSPs. To do that, browsers would unfortunately need to double the memory footprint of each text node.

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900) from comment #89)

Yes, the editor needs to insert NBSPs when multiple white-spaces are adjacent, because adjacent ASCII white-spaces are collapsed to a single ASCII white-space at rendering time, as the HTML/SGML/XML specs require, unless the parent element is explicitly styled as "preformatted"

I have always found this "feature" extremely annoying TBH. I use French spacing when typing and I hate to see one of them become non-breaking. I can type a non-breaking space all right whenever I need one.

(In reply to Yuri Khan from comment #88)

I think a Web browser should behave correctly at all times. Fixing Web pages designed by incompetent designers is their own problem.

You’re right of course, but web browsers have not been allowed to choose correct behavior over backward bug compatibility since the Netscape/IE times. A partial fix now is preferable to a perfect fix never.

I think a perfect fix (by removing the nbsp absurdity) would be easier than a partial fix in this case, so it should win the now/never contest.

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900) from comment #63)

(In reply to Jorg K (GMT+2) from comment #62)

What do you think, Masayuki-san, should we remove
https://searchfox.org/mozilla-central/rev/ab6f4c453d15ab82147c630a8b886b40240ca72b/dom/base/nsPlainTextSerializer.cpp#1143-1147
as suggested in attachment 9060386 [details] [diff] [review]?

The change would break copy in the HTML editor. In an HTML editor without a <pre> element, the 2nd and later ASCII whitespaces are automatically converted to NBSPs, because otherwise the whitespace collapsing required by the HTML spec would cause only one whitespace to be rendered. Unfortunately, we cannot distinguish whether each NBSP in an editable text node is truly an NBSP or an automatically converted one. Therefore, we cannot fix this bug without big changes in the HTML editor. In non-editable content, however, there is no such ambiguity. So, if nsPlainTextSerializer handles NBSPs when it retrieves text from a node, we can fix this for non-editable content.

I’d like to apologize for my previous comment and offer something actually constructive.

If I understand both the discussion and the snippets of source code that have been posted here, on type/paste, Firefox converts SP characters (U+0020) into NBSP characters (U+00A0) in order to preserve apparent whitespace. This leaves no way to disambiguate between typed/pasted and automatically converted SP/NBSP characters.

What if automatically converted SP-to-NBSP characters were marked by preceding them with a zero-width joiner (U+200D ZWJ)? It would have no visible impact on the content presented on-screen but also could act as a flag for converting ZWJ NBSP back to plain SP on cut/copy in order to leave unmarked NBSP alone. Also, because it’s so apparently useless as a combination, ZWJ NBSP is extremely unlikely to appear anywhere for just about any reason, so it should be a reliably safe way to mark these spaces.
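As a sketch of how that marking could round-trip, here is a small Python model. It is purely illustrative: `mark_on_input` and `unmark_on_copy` are hypothetical names, the ordinary space is taken to be U+0020 SPACE, and the "2nd and later spaces become NBSP" rule is borrowed from the editor behaviour described earlier in this bug.

```python
import re

ZWJ = "\u200d"   # ZERO WIDTH JOINER
NBSP = "\u00a0"  # NO-BREAK SPACE
MARKED = ZWJ + NBSP

def mark_on_input(text: str) -> str:
    """Store every ASCII space that follows another ASCII space as ZWJ + NBSP."""
    # The lookbehind is evaluated against the original text.
    return re.sub(r"(?<= ) ", MARKED, text)

def unmark_on_copy(stored: str) -> str:
    """Turn marked NBSPs back into ASCII spaces; unmarked NBSPs are untouched."""
    return stored.replace(MARKED, " ")

typed = "a   b" + NBSP + "c"      # three typed spaces, then one genuine NBSP
stored = mark_on_input(typed)
print(unmark_on_copy(stored) == typed)
```

The scheme relies on the assumption stated above that ZWJ followed by NBSP does not occur in real content; any such pre-existing pair would be turned into a plain space on copy.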

I’d offer the code myself, but sadly the most sophisticated anything I’ve programmed was a violin tuner in BASIC about 25 years ago.

What if automatically converted SP-to-NBSP characters were marked by preceding them with a zero-width joiner (U+200D ZWJ)? It would have no visible impact on the content presented on-screen but also could act as a flag for converting ZWJ NBSP back to plain SP on cut/copy in order to leave unmarked NBSP alone. Also, because it’s so apparently useless as a combination, ZWJ NBSP is extremely unlikely to appear anywhere for just about any reason, so it should be a reliably safe way to mark these spaces.

The combination of plain U+0020 SPACE with U+00A0 NO-BREAK SPACE is already as useless as possible for purposes other than indentation.

(In reply to Yuri Khan from comment #95)

What if automatically converted SP-to-NBSP characters were marked by preceding them with a zero-width joiner (U+200D ZWJ)? It would have no visible impact on the content presented on-screen but also could act as a flag for converting ZWJ NBSP back to plain SP on cut/copy in order to leave unmarked NBSP alone. Also, because it’s so apparently useless as a combination, ZWJ NBSP is extremely unlikely to appear anywhere for just about any reason, so it should be a reliably safe way to mark these spaces.

The combination of plain U+0020 SPACE with U+00A0 NO-BREAK SPACE is already as useless as possible for purposes other than indentation.

That was kind of my point. Isn’t the problem we’re trying to solve here trying to balance the needs of lay users with the more typographically advanced? By marking which space characters were automatically converted on input so that they’re properly un-converted on output, we can have our proverbial cake and eat it, too. Automatically converting U+0020 ↔ U+200D U+00A0 achieves that.

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900) from comment #66)

(In reply to Mikhail Ryazanov from comment #64)

  2. If preserving them is necessary, doesn't your HTML support the white-space: pre-wrap style?

Ideally yes, but all browsers convert whitespaces to NBSPs:

I don't understand this comment. Why cannot Firefox choose to do it differently, in a better way?

Or couldn't there be a way to disable conversions between normal space and nbsp for users that don't need such conversions?

When the NBSP came from the site (as opposed to the editor in Firefox), changing the NBSP is bad. Yet, it appears that Chrome, too, has the behavior that designMode / contenteditable changes every other space into an NBSP when pressing the space bar multiple times, and copying the text undoes the hack.

I think we should investigate what heuristic Chrome uses exactly, but off the top of my head, I suggest changing an NBSP to an ASCII space upon plain text clipboard export only if it is adjacent to an ASCII space. This would leave non-editor-generated NBSP intact, and I think it is a legitimate concern to want those to be left intact e.g. in the case of French quotation marks.
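For comparison, a minimal Python sketch of the narrower rule suggested here (convert only an NBSP that touches an ASCII space). This models the off-the-top-of-the-head suggestion only; it is not Chrome's actual heuristic, which has not been investigated in this bug, and `export_plain_text` is a hypothetical name.

```python
import re

# An NBSP directly after or directly before an ASCII space.
NBSP_NEXT_TO_SPACE = re.compile("(?<= )\u00a0|\u00a0(?= )")

def export_plain_text(text: str) -> str:
    """Replace an NBSP with an ASCII space only if it is adjacent to one."""
    return NBSP_NEXT_TO_SPACE.sub(" ", text)

# Alternating pattern produced by pressing the space bar repeatedly.
editor_run = "foo \u00a0 \u00a0bar"
# Site-provided NBSPs, as in French quotation marks.
french = "\u00ab\u00a0Une id\u00e9e\u00a0: ...\u00a0\u00bb"

print(export_plain_text(editor_run) == "foo    bar")
print(export_plain_text(french) == french)
```

Because the rule is applied in a single pass, an NBSP whose only neighbours are other NBSPs stays as it is: in `a \u00a0\u00a0b` only the first NBSP is converted. Whether that is acceptable depends on which pattern the editor actually generates.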

Chrome's editing code appears to also have a special case that if you delete an ASCII space that is adjacent to an NBSP, the NBSP turns into an ASCII space at that point.
