Closed Bug 359303 Opened 18 years ago Closed 2 years ago

Non-breaking spaces (nbsp) not copied as such

Categories

(Core :: DOM: Serializers, defect)

Tracking

RESOLVED FIXED
103 Branch
Tracking Status
relnote-firefox --- 103+
firefox103 --- fixed

People

(Reporter: cody, Assigned: kemenaran)

References

(Blocks 1 open bug)

Details

(Keywords: dataloss)

Attachments

(3 files)

User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0

When copying text containing non-breaking spaces (&nbsp; or UTF-8 character 0xA0) to the clipboard, those characters are converted into regular spaces. I'm not certain that this is a bug rather than a feature, but my expected behavior was that non-breaking spaces would be preserved as such.

Reproducible: Always

Steps to Reproduce:
1. Copy text containing a non-breaking space character (&nbsp; or UTF-8 0xA0) to the clipboard.
2. Paste into either a text area in Firefox, or into another application.
Actual Results:  
Text wraps at the position of the former non-breaking space.

Expected Results:  
Text should not wrap at the non-breaking space.
(In reply to comment #0)
> When copying text containing non-breaking spaces (  or UTF-8 character
> 0xA0)

Gack...I realize I misspoke here; my brain obviously wasn't working.  I of course meant *Unicode* character 0xA0, which is encoded quite differently in UTF-8!
I have the same problem with Firefox 2.0: all 'no-break space' characters (0x00A0) are replaced by spaces. This bug does not depend on the character encoding of the HTML page. It behaves the same whether the encoding is UTF-8, UTF-16, or Latin-1.
Confirming, also reproducible in Firefox 2.0 on Windows XP.
(In reply to comment #0)
> User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US;
> rv:1.8.1) Gecko/20061010 Firefox/2.0
> Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US;
> rv:1.8.1) Gecko/20061010 Firefox/2.0
> 
> When copying text containing non-breaking spaces (  or UTF-8 character
> 0xA0) to the clipboard, those characters are converted into regular spaces. 
> I'm not certain that this is a bug rather than a feature, but my expected
> behavior was that non-breaking spaces would be preserved as such.
> 
> Reproducible: Always
> 
> Steps to Reproduce:
> 1. Copy text containing a non-breaking space character (  or UTF-8 0xA0)
> to the clipboard.
> 2. Paste into either a text area in Firefox, or into another application.
> Actual Results:  
> Text should not wrap at the non-breaking space.
> 
> Expected Results:  
> Text wraps at the non-breaking space.
> 

Please raise the severity of this bug to Major.
Rationale:
The input control coded by <input type="text" lang="pl" value="1&nbsp;000" > 
has the numerical value of 1,000 (one thousand).
If the value changes to "1 000", the numerical value is lost.
I understand it is kind of weird that there is no visible difference between a number and two numbers; however, while I may disagree with this setting, somebody declared it this way.
I do not know who has the authority over the national numeric format but I am sure the system vendor had consulted a standard body prior to including it in the operating system.
Confirming for FF 2.0.0.7 on Linux.

The &nbsp; is converted to 0x20 at writing time. Any subsequent copy to the clipboard or submission via HTTP shows regular spaces. OTOH, &thinsp; and the non-breaking thin space are saved properly.
(In reply to comment #5)
> The input control coded by <input type="text" lang="pl" value="1&nbsp;000" > 
> has the numerical value of 1,000 (one thousand).
> If the value changes to "1 000", the numerical value is lost.
>
The numerical value would be lost even if you used "1,000" (in English source code), or "1&#x202F;000" or "1.000" (in Czech). The source code should be locale independent. The user agent should do l10n on the value (e.g. using the @lang attribute). The same problem arises with date/time format processing.

In addition, HTML forms are very weak in type handling. Maybe XForms or HTML5 will affect this problem.
(In reply to comment #6)
> Confirming for FF 2.0.0.7 on Linux.
> 
> The &nbsp; is converted to 0x20 at writing time. Any consequent 
[...]
> sending via HTTP shows regular spaces.

3.0a8 (alpha version of FF) doesn't suffer from this problem. Only copying to clipboard remains affected.
(In reply to comment #7)
> (In reply to comment #5)
> > The input control coded by <input type="text" lang="pl" value="1&nbsp;000" > 
> > has the numerical value of 1,000 (one thousand).
> > If the value changes to "1 000", the numerical value is lost.
> >
> The numerical value would be lost, even if you used  "1,000" (in English source
> code), or "1&#x202F;000"  or "1.000" in Czech). The source code should be
> locale independent. 

The source code for what?  Everything is source code in HTML.  And currently there is no standard that allowed you to write locale-independent 'Hamlet'.

> User agent should do l10n on the value (e.g. using @lang
> attribute). You could address the same problem on data/time format processing.
> 

I doubt it should; it would be enough if it refrained from misrepresenting localized data as it gets them.  It converts a number to a sequence of numbers by converting non-breaking spaces to breaking spaces.  This is a bad thing and there is no excuse for this oddity.
(In reply to comment #9)
> (In reply to comment #7)
> > (In reply to comment #5)
> > > The input control coded by <input type="text" lang="pl" value="1&nbsp;000" > 
> > > has the numerical value of 1,000 (one thousand).
> > > If the value changes to "1 000", the numerical value is lost.
> > >
> > The numerical value would be lost, even if you used  "1,000" (in English source
> > code), or "1&#x202F;000"  or "1.000" in Czech). The source code should be
> > locale independent. 
> 
> The source code for what?  Everything is source code in HTML.  And currently
> there is no standard that allowed you to write locale-independent 'Hamlet'.

It depends on where you place the presentation layer. You like to cook everything on the server; I like to do it on the client.

> 
> > User agent should do l10n on the value (e.g. using @lang
> > attribute). You could address the same problem on data/time format processing.
> > 
> 
> I doubt it should; it would be enough if it refrained from misrepresenting
> localized data as it gets them. It converts a number to a sequence of numbers
> by converting non-breaking spaces to breaking spaces.  This is a bad thing and
> there is no excuse for this oddity.
> 

It doesn't convert a number into a list of numbers. It just replaces one character with another one in a string. I wanted to point out that you should not describe this problem as a localization problem, and I showed you other languages where a different (and more than one) thousands separator is used.

Despite our misunderstanding, I agree the client should not change something it doesn't understand.
(In reply to comment #10)
> It doesn't convert a number into list of numbers. It just replaces one
> character with another one in a string. I wanted to point you should not
> describe this problem as problem with locazation and I showed you another
> languages where different (and not only one) thousand separator is used.

Your statement amounts to "since there are languages where this bug does not cause problems, it is not a localization problem".
A similar statement would be "since there are languages that can be written using ISO-8859-1, the fact that the application is fixed to this character set is not a localization problem."
I have also stumbled across this very annoying bug under Linux in Firefox 2.0.0.10 ("Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.8.1.10) Gecko/20071015 SUSE/2.0.0.10-0.1 Firefox/2.0.0.10").

How to reproduce:

I have remapped my X11 keyboard with

  xmodmap -e 'keycode 113 = Mode_switch Mode_switch'
  xmodmap -e 'keysym space = space NoSymbol nobreakspace NoSymbol'

such that pressing AltGr+Space results in the keysym "nobreakspace" to be sent to the application (and have tested that this works fine with xterm and xev). Having NBSP easily available on my keyboard is *far* more convenient than having to type &nbsp; when editing HTML files and wiki pages.

When I type AltGr+Space into a textarea field in Firefox 2.0.0.10, and then copy-and-paste the entered NBSP back into xterm into the stdin of "od -t x1", then it receives only the byte 0x20, which is a normal space. Likewise, if I submit the text field to a HTTP server, it receives only 0x20. I would have expected to receive the bytes 0xc2 0xa0, which is the UTF-8 encoding for the NBSP character (U+00A0).
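The byte-level expectation described above can be checked outside the browser. The following is a minimal Python sketch, added for illustration and not part of the original report; the helper name `hex_dump` is hypothetical. It prints the UTF-8 bytes of a string, roughly what `od -t x1` shows:

```python
NBSP = "\u00a0"   # U+00A0 NO-BREAK SPACE
SPACE = "\u0020"  # U+0020 SPACE

def hex_dump(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex (similar to od -t x1)."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))

print(hex_dump(NBSP))              # c2 a0
print(hex_dump(SPACE))             # 20
print(hex_dump("a" + NBSP + "b"))  # 61 c2 a0 62
```

If the pasted text dumps as `20` where `c2 a0` was typed, the character was replaced somewhere between the keyboard and the receiving program.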

So something in Firefox is covertly scanning any text that I type into form fields and replaces any entered U+00A0 character immediately with the U+0020 character. This is surprising, disturbing and undesirable, because it unexpectedly corrupts my entered character sequence and it prevents me from typing in no-break space characters directly into content-management systems and wikis.

I think Firefox forms should be fully transparent for NBSP characters, such that we can start to use them in content-management systems and wikis instead of the awkward &nbsp; SGML character reference. Thanks!

Any idea where exactly this unexpected NBSP-character replacement happens and why it was introduced in the first place?
For what it's worth, Mac and Windows users can also perform the same test mentioned in comment #12; on Mac, a non-breaking space may be entered with option+space, while on Windows, it can be entered with alt+0160.
Isn’t this a duplicate of bug 218277 which has been fixed in the mean time?
(In reply to comment #14)
> Isn’t this a duplicate of bug 218277 which has been fixed in the mean time?
> 
No. This bug is about an undesired transformation when copying text to the clipboard (or primary_selection on X11). Submitting data via HTTP preserves whitespace characters.
Bug 218277 has only been fixed in Gecko rv: 1.9 (Firefox 3.0), but as far as I can see, this trivial patch has not yet been backported to the earlier Gecko releases used by Firefox 2.0 and many other browsers. All these widely used browsers continue to cause Firefox users to unknowingly murder lots of precious characters on wiki pages every day. It is remarkable, how long it took (four (4) years) to fix what was a really trivial and very nasty data-corrupting bug! Silently converting 0xa0 to 0x20 is morally about as sensible as silently converting the ASCII letter N into M everywhere, just for fun. There can be absolutely no excuse for this merciless global 0xa0-genocide on wiki pages.

(Sorry for the strong language, but it is clear from the lengthy dialog on bug 218277 that some people have failed to understand the severity of this bug. Recall that 0xa0 occurs not only in the ISO 8859 representation of NBSP, but also in the UTF-8 and EUC representation of many other characters.)
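The parenthetical point about the byte 0xa0 can be illustrated with a short Python sketch. This is an illustration only: nothing in this thread establishes that any Gecko code path replaces bytes rather than characters, and the helper `is_valid_utf8` is a hypothetical name. It shows that the byte also occurs inside the UTF-8 encoding of other characters, so a byte-wise replacement of 0xa0 by 0x20 would corrupt them:

```python
# U+00E0 (a with grave accent) encodes in UTF-8 as the two bytes c3 a0.
encoded = "\u00e0".encode("utf-8")

# Hypothetical byte-wise replacement of 0xa0 by 0x20 (not taken from Gecko).
damaged = encoded.replace(b"\xa0", b"\x20")

def is_valid_utf8(data: bytes) -> bool:
    """Report whether data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(encoded.hex())           # c3a0
print(is_valid_utf8(encoded))  # True
print(is_valid_utf8(damaged))  # False: c3 is no longer followed by a continuation byte
```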
I can still confirm the silent assassination of NBSP (0x00a0) when copying text with NBSPs.
FF displays it correctly, but when you copy the text there are only normal spaces (0x0020).

Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2 (.NET CLR 3.5.30729)

AFAIK it happens on all platforms, not only Mac.

This bug should be confirmed and critical due to dataloss!
Hello, I have a possible solution:

Line 1265 of content\base\src\nsPlainTextSerializer.cpp is the root of all evil. What is it good for? Seriously, can anyone tell me?

Whatever; I commented this line out and built Firefox. Bug solved ;)
No, you shouldn't remove this line, because you would then break the behavior of the OutputPersistNBSP flag. I think we should just replace this kSpace constant with the Unicode character 0xA0, at least in this method. I didn't look at the other methods of the class.
Status: UNCONFIRMED → NEW
Component: General → Serializers
Ever confirmed: true
OS: Mac OS X → All
Product: Firefox → Core
QA Contact: general → dom-to-text
Hardware: PowerPC → All
(In reply to comment #19)
> No, you shouldn't remove this line, because you then break the behavior of the
> flag OutputPersistNBSP.

Which we won’t need if we don’t change any NBSPs to Spaces, right? 

I’m not sure (I pulled the source just a few days ago and it’s by far the biggest I have ever worked with), but so far I couldn’t think of a case where this replacement is necessary.

Can anyone show me a case where this transformation is really needed?

In an old bug (IIRC bug 218277) David mentioned that the FF (HTML) editor replaces multiple spaces with NBSPs because HTML collapses consecutive spaces.
If you think that this case must be handled (it is just the normal HTML behaviour), you can do something to change the output.
But replacing spaces with NBSPs is a very bad idea. I haven’t looked at those methods yet, but I think it would be better to put a <pre> tag around the text.

As Markus said, the current FF behaviour is as bad as replacing any other valid character with another one that looks quite similar.

Greetings Florian

P.S.: Furthermore, I think this bug is critical due to dataloss ;)
Sorry, I didn't read the source code correctly. Replacing kNBSP by 0xA0 does nothing, since kNBSP is equal to... 0xA0 :-)

Apparently, this transformation was meant to fix bug 218277, no? I'm not sure. I can't tell the real purpose of this transformation.

Perhaps the solution is to use the flag nsIDocumentEncoder::OutputPersistNBSP when calling the serializer during copy-paste...
(In reply to comment #21)
> Apparently, this transformation was to fixed bug 218277, no ? I'm not sure. I
> can't tell the real purpose of this transformation.

The first patches would have solved it, but after some comments (#19 by David and some more) it was changed only for form submissions.
I think this was a mistake, even if it fixed bug 218277.

> Perhaps the solution is to use the flag nsIDocumentEncoder::OutputPersistNBSP
> when calling the serializer during copy-paste...
and all the other cases… It’s never OK to replace valid characters, is it?

I think the real bug is hidden in the mail editor (which seems to need the replacement; that is, by the way, really bad design), but I need more time to look through the code …
Yes, the mail editor is the tricky part here.

Commenting out line 1265 in nsPlainTextSerializer.cpp¹ fixes the bug in FF (I’ve not found any misbehaviour), but it also affects bug 290565.

Although it now allows some NBSPs to be used there, it doesn’t solve that bug; it just changes the wrong behaviour.

So I would say: delete this line to fix this bug, and then it’s time to fix the mail editor (there is no reason to keep this bug just because fixing it affects a part of the mail editor that is wrong either way).

¹ aString.ReplaceChar(kNBSP, kSPACE);
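
For readers unfamiliar with the serializer, the effect of that line and of the flag mentioned in comment #19 can be modelled as follows. This is a simplified Python sketch, not the Gecko implementation: the function name `serialize_plain_text` and the boolean parameter `persist_nbsp` are invented here to stand in for `nsIDocumentEncoder::OutputPersistNBSP`, and the assumption is that the flag simply skips the replacement.

```python
K_NBSP = "\u00a0"  # same value as kNBSP in the discussion above
K_SPACE = " "

def serialize_plain_text(text: str, persist_nbsp: bool = False) -> str:
    """Model of the serializer: NBSP becomes a space unless the caller opts out."""
    if persist_nbsp:
        return text  # assumed effect of the OutputPersistNBSP flag
    return text.replace(K_NBSP, K_SPACE)  # models aString.ReplaceChar(kNBSP, kSPACE)

price = "1" + K_NBSP + "000"
print(serialize_plain_text(price) == "1 000")                  # True: NBSP lost
print(serialize_plain_text(price, persist_nbsp=True) == price)  # True: NBSP kept
```

Under this model, deleting the replacement line and passing the flag on the copy path have the same effect for the clipboard, but only the second leaves other callers of the serializer untouched.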
Does the code in http://mxr.mozilla.org/comm-central/source/mailnews/compose/src/nsMsgSend.cpp#1706 have something to do with this issue? The mail editor uses this code when it takes the body of the message from the editor. And maybe the plain text that is copied to the clipboard undergoes this procedure, thus losing the nbsps?
I think the main problem is here:

http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsPlainTextSerializer.cpp#1253

But your piece of code is wrong, too. It’s never fine to replace a space with an NBSP.

I’ve built a Thunderbird with 3 or 4 fixes, but it sometimes behaves strangely because space ←→ NBSP conversions are used in many places and I have not yet found all the pieces of code that cause these replacements.
I didn’t see this mentioned above: it’s not only when copying to clipboard that the conversion happens (as the title of the bug states), it happens also to data sent to the server.

The most visible (annoying) effect of this is that I can’t type NBSP characters in email messages (with GMail in my case), though it also happens to all other textarea fields (used for commenting in general). French requires NBSPs around some punctuation marks, and when they’re converted to normal spaces the punctuation can wrap to the next line, which is plainly bad. (I use AltGr+Space to type it, on a custom keyboard layout. As stated before, this also happens when pasting it.)

As far as I can tell this doesn’t happen in <input type="text"> fields (at least copying to clipboard preserves NBSP characters, I didn’t check if they’re sent to the server), so it’s clearly not an insurmountable problem.
Input fields were fixed with bug 218277. And they fixed only this because they think it is a good idea to format HTML mails with NBSPs :(.
Apparently, this also happens when pasting content. I copied an NBSP from a program that keeps it, then tried to paste it in the search box. The result in FF10.0.1 is that the browser freezes while searching for all the regular spaces in the page
Still here, as of Firefox 24.0
I believe I've also wasted some hours of my life before stumbling on this issue report.
I'm trying to develop a FF extension that involves creating an XPath expression based on an HTML element's text. To deal with &nbsp; I did something like "elementText = elementText.replace("&nbsp;", "\u00A0");".
In the end the extension returns a regular space and the XPath won't work.
I would work on this if I had the proper knowledge.
If someone could help, it would be most appreciated.
Spotted on FF 25.
Severity: normal → critical
Keywords: dataloss
Bug 613223 looks like a duplicate.
Duping forward to better documented bug 613223
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE
I hadn't read far enough here
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Still here, as of Firefox 30.0 (Linux)
This bug also occurs with drag-and-drop.

In Firefox Mac, in a text area, I drag-and-drop some text to another place in the same text area. 
The characters must not be altered. 
But the non-breaking spaces are altered into normal spaces.
Also confirmed on Firefox 38.0.5 on Windows 7.

Copying from Firefox to either Firefox or Notepad++ converts U+00A0 to U+0020.
Copying Notepad++ to Firefox keeps U+00A0 unchanged.
Confirmed in Firefox 41.0.2 on Windows 7 x86.

Copying from developer console and pasting back in it changes nbsp to regular space. Check this gif I made to see what I mean: http://i.imgur.com/2StB85Q.gif It looks like string variable isn't equal to itself.
Confirming in Firefox 45.0 on Ubuntu 14.04 x86_64.
Blocks: 290565
STR (I tried them):
Enter |huhu&nbsp;!| into http://www-archive.mozilla.org/editor/midasdemo/ (source view).
Copy this text.
Paste into Notepad++.
You get a space and not a NBSP.

Likely caused by
https://dxr.mozilla.org/mozilla-central/source/dom/base/nsPlainTextSerializer.cpp#1231-1250
Maybe one can fix this bug by adapting the solution from bug 333064:
https://hg.mozilla.org/mozilla-central/rev/a35b2ac359e5

Add nsIDocumentEncoder::OutputPersistNBSP in the right spot where plain text is placed onto the clipboard. The test can also be adapted.
I'll argue that Firefox should be converting &nbsp; characters to their respective normal space characters.  It helps you avoid problems like these:

https://stackoverflow.com/questions/51794845/pre-html-nbsp-m-bm-bash-scripts-syntax-highlighting-on-the-web-remove-asc

Linux does NOT like &nbsp; characters in the bash shell.  

The real problem is that each browser handles this differently.  Chrome copies the &nbsp; characters as they are, but then you run into problems with that approach when the copied text is used in different environments.

I even filed a bug with Chrome thinking that it should be converting &nbsp; like Firefox does:

https://bugs.chromium.org/p/chromium/issues/detail?id=887511

With Chrome, I have to use the extra step of 

> textToCopy.replace(new RegExp(String.fromCharCode(160), "g"), " ");

To replace &nbsp; with normal space characters. 

I think a discussion needs to take place as to what the correct behavior should be, and all major browsers should pick one way or the other, but I hope that &nbsp; characters end up being converted when copied because why would you ever want to preserve &nbsp; characters?  Seems to be a thing only the web uses?
(In reply to Eric from comment #44)
> I'll argue that Firefox should be converting &nbsp; characters to their
> respective normal space characters.  It helps you avoid problems like these:
> 
> https://stackoverflow.com/questions/51794845/pre-html-nbsp-m-bm-bash-scripts-
> syntax-highlighting-on-the-web-remove-asc

That is an issue with the site, not with the web browser: source code listings should not use one character (e.g. NBSP, curly quotes) and mean a different one (normal space, typewriter quotes). If they need to use multiple consecutive white space characters they should rely on `white-space` CSS property.

> Chrome copies the &nbsp; characters as they are, but then you run into problems
> with that approach when the copied text is used in different environments.

And when you are editing an article for a news site and paste the text into a word processor, you will lose the non-breaking spaces it contains, breaking the typography.

> I think a discussion needs to take place as to what the correct behavior
> should be, and all major browsers should pick one way or the other, but I
> hope that &nbsp; characters end up being converted when copied because why
> would you ever want to preserve &nbsp; characters?  Seems to be a thing only
> the web uses?

In many languages non-breaking space is expected to be used; for example, you are supposed to use NBSP after non-syllabic prepositions in Czech. This is not a web thing but professional text editing thing.
(In reply to Eric from comment #44)
> I'll argue that Firefox should be converting &nbsp; characters to their
> respective normal space characters.  It helps you avoid problems like these:
> 
> https://stackoverflow.com/questions/51794845/pre-html-nbsp-m-bm-bash-scripts-
> syntax-highlighting-on-the-web-remove-asc

As said in the answer there, the syntax highlighting script should not be corrupting the source code it is highlighting. It is not the browser’s job to undo damage introduced by buggy software.


> I think a discussion needs to take place as to what the correct behavior
> should be, and all major browsers should pick one way or the other, but I
> hope that &nbsp; characters end up being converted when copied because why
> would you ever want to preserve &nbsp; characters?  Seems to be a thing only
> the web uses?

Nope. The U+00A0 NO-BREAK SPACE character is useful in plain text, word processors, and all kinds of natural language text that is intended to be displayed on multiple lines. Its purpose is to prevent a line break between words or other runs of non-whitespace characters. Here are a few examples where non-breaking spaces are essential:

* Some locales (such as Russian) use a non-breaking space for digit grouping. If you replace it with a normal space, you get weird line wrapping with numbers such as 300 000 (three hundred thousand) where 300 remains on one line and 000 is wrapped. Phone numbers are also often formatted with digit grouping, and should not be broken.

* In French typography, you are supposed to put a space on the inside of quotes « like this » and before some punctuation marks such as colons : like this. A line break is not permitted at these points.

* If an abbreviation ending in a period ends up at the end of line, it can be visually mistaken for end of sentence. To prevent that, abbreviation period should be followed with a non-breaking space; sentence end full-stop, with a normal space.

This bug makes it impossible to preserve good formatting when copying and pasting text.
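
The digit-grouping example above can be reproduced with a small sketch. It uses Python's `textwrap` module, which treats only ASCII whitespace as a break opportunity; this is merely a stand-in for a real layout engine, and the sample sentence and width are made up for illustration.

```python
import textwrap

NBSP = "\u00a0"

with_nbsp = f"The total is 300{NBSP}000 rubles"
with_space = with_nbsp.replace(NBSP, " ")  # what the copy operation produces

# With the NBSP intact, the number stays together on one line.
for line in textwrap.wrap(with_nbsp, width=16):
    print(repr(line))

# After the replacement, the wrapper is free to split the number in two.
for line in textwrap.wrap(with_space, width=16):
    print(repr(line))
```

In the second case the break falls between "300" and "000", which is exactly the weird line wrapping described in the first bullet.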
I guess if it's a language locale thing, that makes sense.  I don't personally like it, but your logic seems perfectly reasonable.  In any event, the bash shell doesn't like &nbsp;... so if you end up copying text that contains &nbsp; and try to use it there, you're going to have a bad time.
Looks like there is an undocumented configuration option for the syntax highlighter (http://alexgorbatchev.com/SyntaxHighlighter/manual/configuration/) script I'm using:

> SyntaxHighlighter.config.space = " ";
> SyntaxHighlighter.all();

Fixes my issue.  The script was already using white-space: pre for the css styling.  SyntaxHighlighter.config.space was set to "&nbsp;" by default for some reason.
Another couple of examples are figures with units, like 1 km, references, like fig. 12 on p. 9, and footnote markers [1].

Using nonbreaking spaces will break most program code. A similar annoyance is code with curly quotes (“hello world” instead of "hello world", which I’ve seen many times as well). Firefox will not magically replace those curly quotes with ASCII quotes. I would prefer if it wouldn’t replace nonbreaking spaces with regular spaces either.

----

[1] https://practicaltypography.com/nonbreaking-spaces.html
Non-breaking spaces make sense in code, and it is important to preserve them to avoid buggy behavior with potentially bad consequences. For instance, in a shell, if I write:

  rm foo bar

(where the space between foo and bar is a non-breaking space), the file "foo bar" is expected to be removed. But if the non-breaking space is converted to a normal space (e.g. after a copy-paste), then the shell will attempt to remove files "foo" and "bar"!
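The difference can be sketched without touching any files. The snippet below uses Python's `shlex.split` as an approximation of shell word splitting; it is not bash itself, and a real shell's behaviour depends on `IFS`, which by default contains only space, tab and newline.

```python
import shlex

NBSP = "\u00a0"

as_typed = f"rm foo{NBSP}bar"            # one argument: a file named "foo<NBSP>bar"
after_copy = as_typed.replace(NBSP, " ")  # NBSP silently turned into a space

print(shlex.split(as_typed))    # two words: the command and one file name
print(shlex.split(after_copy))  # three words: the command and two file names
```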
How does one even insert a non-breaking space character in a terminal window in bash?  I'm not seeing a "&nbsp;" key on my keyboard unfortunately.
(In reply to Eric from comment #51)
> How does one even insert a non-breaking space character in a terminal window
> in bash?  I'm not seeing a "&nbsp;" key on my keyboard unfortunately.

Aside from copy-pasting or Tab-completing, there are several ways:

When running bash or zsh under a Unicode locale, you can use an ANSI C-like string containing a unicode escape: $'\u00a0'

If your terminal emulator is GTK-based, you can hit Ctrl+Shift+u and then type 00a0<Enter>

If you're using the IBus IME system, you can enable Ctrl+u as a cross-toolkit equivalent to GTK's Ctrl+Shift+u.

If you haven't set URxvt.iso14755 to false in ~/.Xresources to enable mapping Ctrl+Shift+... combinations as hotkeys, rxvt-unicode will let you type a non-breaking space by holding down Ctrl+Shift, typing 00a0, and then releasing them.

Your keyboard layout may map it to something like AltGr+Spacebar.

I use these settings to amend my US ASCII layout so that Right Alt is "Level 3 Shift" (ie. AltGr), AltGr+Spacebar produces a non-breaking space and AltGr+Shift+Spacebar produces thin non-breaking spaces.

setxkbmap -variant altgr-intl -option lv3:ralt_switch -option nbsp:level3n

I also wrote a blog post on using setxkbmap for this sort of thing (with an Xmodmap explanation in one of the comments), in case you want it:

http://blog.ssokolow.com/archives/2011/12/24/getting-your-way-with-setxkbmap/

The nice thing about using setxkbmap and Xmodmap is that they only persist across X server restarts if you update the right config files and don't affect SSH sessions, so you can experiment without fear of getting stuck.
Please do not use Bugzilla as a forum and follow the Bugzilla etiquette [1], comments in this bug should only help fixing the issue in Firefox.

[1] https://bugzilla.mozilla.org/page.cgi?id=etiquette.html

Thank you.

When will this bug finally get solved? I always have to use Chrome when I want to paste text containing non-breaking spaces into a website.

There are so many duplicates that are marked as resolved but are not solved in any way.

This bug should also be added as a duplicate because it is the same problem: https://bugzilla.mozilla.org/show_bug.cgi?id=268995

No, in bug 268995, a copy is not involved: the nbsp came from the original form data and was not copied or pasted.

Oh, I see. But this bug is still annoying and has been rated critical for quite a while.

So please take a look at it again. I am sure this can be solved quite fast. Unfortunately I am not experienced enough to create a patch for this issue myself.

This bug is, in fact, trivial to fix, and the attached patch does this (a trivial adaptation of a patch already attached to bug #194498, which see also).

The reason is that it's not so much a bug as an intentional misfeature: it's not that Firefox accidentally breaks U+00A0 NO-BREAK SPACE (=:NBSP), it's that somebody deliberately decided that it should, and added explicit code to do it. The logic behind the code is that HTML composer sometimes turns some U+0020 SPACE (when there are multiple in a row, or something) into U+00A0 NO-BREAK SPACE and the fear is that pasting them as such would break things.

I think this logic is bad in every sense. Every bit of code which turns U+00A0 NO-BREAK SPACE into U+0020 SPACE or vice versa should be removed, and Firefox should always remain Unicode-transparent. (A middle ground, if fear of breaking things is so great, would be to make this a user-settable preference. I don't know how to do this.) But I don't know who makes these decisions or how to contact them or how to argue about this; I'm pretty sure they don't read this kind of bug reports, however, so it's useless to discuss it here.

In the mean time, I simply apply the patch when compiling Firefox.

Comment on attachment 9060386 [details] [diff] [review]
patch removing special treatment that breaks U+00A0 NO-BREAK SPACE

Review of attachment 9060386 [details] [diff] [review]:
-----------------------------------------------------------------

I fully agree with David A. Madore's assessment here and I strongly recommend to merge in his trivial patch to remove an ancient and very unfortunate, data-corrupting design choice in Firefox. The no-break space character exists in ISO 8859 and ISO 10646 for a good reason, and a web browser must never quietly replace this character with another character without explicit informed user consent.

I registered here just to express my support for the change.
A program should not mangle data, unless there is some really good reason - which i do not see here.

We are now in 2019, almost 20 years after the year 2000, and Unicode is now well supported in every OS. This conversion was a bad idea from the very beginning; 13 years later it is complete nonsense.
I can type in non-breaking and thin spaces ( ) with my keyboard, but when I paste a text containing such characters in a Firefox form input, I lose the non-breaking spaces, whereas the thin spaces are kept. Do you really see a good reason for this design?!

As a French proofreader, I use an external text editor (Gedit) that can display spaces and draw these special spaces differently. I make some automatic corrections to comply with the French orthotypographic rules, and some of these corrections replace certain spaces with non-breaking and thin spaces. As you can guess, when I copy the corrected text from my editor and then paste it into Firefox, I lose all the non-breaking spaces. :-( It’s a big waste of time!

So please, consider removing this useless, bad and annoying conversion of the non-breaking spaces. As a benefit, you will remove a couple lines of code and you will save a few CPU cycles when pasting text in form inputs.
No caveats, just benefits. So, please, what are you waiting for?…

(In reply to Jorg K (GMT+2) from comment #62)

What do you think, Masayuki-san, should we remove
https://searchfox.org/mozilla-central/rev/ab6f4c453d15ab82147c630a8b886b40240ca72b/dom/base/nsPlainTextSerializer.cpp#1143-1147
as suggested in attachment 9060386 [details] [diff] [review]?

The change would break copy in HTML editor. In HTML editor without <pre> element, 2nd and later ASCII whitespaces are automatically converted to NBSPs since if web browsers don't do that, whitespace collapsing of HTML spec causes only one whitespace rendered. Unfortunately, we cannot distinguish whether every NBSP in an editable text node is truly NBSP or not (i.e., automatically converted one). Therefore, we cannot fix this bug without big changes in HTML editor. However, in non-editable content, not so. So, if nsPlaintextSerializer handles NBSPs when it retrieves text from a node, we can fix this for non-editable content.
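
The ambiguity described here can be made concrete with a toy model. The following Python sketch is a deliberate simplification and not the actual editor algorithm: the function `editor_preserve_spaces` is hypothetical and simply keeps the first space of a run and turns the rest into NBSPs. It shows that two different user intentions end up as the same text node content, so a serializer looking only at that content cannot tell them apart.

```python
import re

NBSP = "\u00a0"

def editor_preserve_spaces(typed: str) -> str:
    """Toy model: keep the first ASCII space of a run, turn the others into NBSP."""
    return re.sub(
        r" {2,}",
        lambda run: " " + NBSP * (len(run.group()) - 1),
        typed,
    )

# The user typed two plain spaces; the editor stores the second one as NBSP.
auto_converted = editor_preserve_spaces("a  b")

# The user deliberately typed a space followed by a real NBSP.
intentional = editor_preserve_spaces("a " + NBSP + "b")

print(auto_converted == intentional)  # True: the stored text is identical
```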

Flags: needinfo?(masayuki)

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(Still struggling with the pain, but becoming better) from comment #63)

  1. Is there any reason to preserve multiple spaces? Formatting with spaces (typewriter-like) is generally a bad idea, since it is not reproducible, and there are more proper ways for indentation and alignment.
  2. If preserving them is necessary, doesn't HTML support the white-space: pre-wrap style?

(In reply to Mikhail Ryazanov from comment #64)

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(Still struggling with the pain, but becoming better) from comment #63)

  1. Is there any reason to preserve multiple spaces? Formatting with spaces (typewriter-like) is generally a bad idea, since it is not reproducible, and there are more proper ways for indentation and alignment.
  2. If preserving them is necessary, doesn't HTML support the white-space: pre-wrap style?

Not to mention that, for people trying to produce rich-text inputs that are both well-formed and non-confusing to users (i.e. me), it's a pain to have to write and debug JavaScript which replicates LyX's behaviour of discarding Space or Enter keypresses if they'd cause the editor view to abuse semantic markup to insert presentational whitespace (e.g. non-breaking spaces and empty paragraph tags).

(In reply to Mikhail Ryazanov from comment #64)

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(Still struggling with the pain, but becoming better) from comment #63)

  1. Is there any reason to preserve multiple spaces? Formatting with spaces (typewriter-like) is generally a bad idea, since it is not reproducible, and there are more proper ways for indentation and alignment.

Most users don't know:

  • whether current editor is HTML editor (contenteditable or designMode) or Text Editor (<input> or <textarea>).
  • how multiple whitespaces are treated in HTML editor (without <pre>).

Therefore it makes sense to make typed multiple ASCII whitespaces look as-is.

  2. If preserving them is necessary, doesn't HTML support the white-space: pre-wrap style?

Ideally yes, but all browsers convert whitespaces to NBSPs:

Therefore, I think that we should keep converting NBSPs in editable elements (contenteditable or designMode), but I think we shouldn't convert them in non-editable elements or <input>/<textarea>, so that users can work with the raw data. So, I think that patch may break something in the former case.

Let me reiterate what I suggested in a comment above: why not make this a user-settable preference? So experienced users who know what an unbreakable space is and care about it can set this preference and Firefox's behavior would then be transparent in every respect (in the editor and while copying and pasting, every character would be retained exactly as it is, including multiple spaces), whereas the default behavior would be either the current one, or some other compromise.

I don't know enough about how Firefox works to code this myself, but I imagine it would be fairly trivial for someone who knows: at least, it doesn't add any significant amount of complexity.

I agree with David that a user-settable preference would be a pretty good compromise until a better solution is found. And it’s only a couple lines of code. Ignoring the problem shouldn’t be an option anymore.

My main concern here is composing mail in Thunderbird. Will not converting in non-editable elements fix that? Because I cannot type French correctly without NBSP support in Thunderbird, and that has been quite an issue for years. I work around it by using NNBSP, but that is far from ideal.

If a NBSP is alone, we can stop converting it even in editable contents. E.g., <p>foo&nbsp;bar</p>.

I'm not a native speaker of any Western languages so that I don't know how NBSP is used in actual usage. Is that used continuously like foo&nbsp;&nbsp;&nbsp;bar?

See comment #26 and comment #46 under "French typography", for example: «&nbsp;like this&nbsp;»

No, actually this is a great idea, I don’t know any language where you are supposed to have multiple ones in a row. In french you use it before colons, within quotes and for « incises » mostly, like this « Une idée : ne pas remplacer les espaces insécables uniques — pour lesquelles ça ne changerait a priori rien pour le rendu et qui sont en revanche l’utilisation légitime la plus probable. ». So yes, that would work for french at least.

(In reply to Bruno Pagani from comment #73)

No, actually this is a great idea, I don’t know any language where you are supposed to have multiple ones in a row. In french you use it before colons, within quotes and for « incises » mostly, like this «&nbsp;Une idée&nbsp;: ne pas remplacer les espaces insécables uniques —&nbsp;pour lesquelles ça ne changerait a priori rien pour le rendu et qui sont en revanche l’utilisation légitime la plus probable.&nbsp;». So yes, that would work for french at least.

Of course those NBSPs in my comment have been converted for display, so I’m quoting myself and putting the French excerpt in code formatting just above.

+1 to that heuristic: If NBSP is encountered adjacent to another NBSP or to an ordinary SPACE, assume it is used for alignment/indentation and convert it to a SPACE. Otherwise, assume it is used in its intended function and leave it alone.
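To make the heuristic above concrete, here is a minimal sketch in Python. It is only an illustration of the rule as stated in this comment, not the code of any attached patch; the function name `convert_adjacent_nbsps` is made up for the example.

```python
NBSP = "\u00a0"   # U+00A0 NO-BREAK SPACE
SPACE = "\u0020"  # U+0020 SPACE

def convert_adjacent_nbsps(text: str) -> str:
    """Turn an NBSP into a plain space only if it touches another NBSP or a space.

    Hypothetical helper: it mirrors the heuristic from the comment, nothing more.
    Neighbours are always read from the original text, so the result does not
    depend on the order in which characters are processed.
    """
    out = []
    for i, ch in enumerate(text):
        if ch == NBSP:
            prev_ch = text[i - 1] if i > 0 else ""
            next_ch = text[i + 1] if i + 1 < len(text) else ""
            if prev_ch in (NBSP, SPACE) or next_ch in (NBSP, SPACE):
                # Looks like alignment/indentation: export a plain space.
                out.append(SPACE)
                continue
        # Isolated NBSP (or any other character): keep it untouched.
        out.append(ch)
    return "".join(out)
```

With this rule, `"5\u00a0km"` and `"«\u00a0like this\u00a0»"` come out unchanged, while `"foo\u00a0\u00a0\u00a0bar"` becomes `"foo   bar"` with three plain spaces.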

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(no pain of the broken bone, but the corset blocks me to concentrate) from comment #71)

I'm not a native speaker of any Western languages so that I don't know how NBSP is used in actual usage. Is that used continuously like foo&nbsp;&nbsp;&nbsp;bar?

I'm not an expert on every European language but, in my experience, there are three ways &nbsp; gets used:

  1. Official language rules of the type that has already been mentioned. I've never heard of any of these situations requiring more than one non-breaking space in a row. (The closest thing that comes to mind is the typewriter-era rule that you type two spaces after a sentence-ending period when using a monospace font to make it easier to see sentence boundaries in among all the whitespace that isn't getting kerned away. Aside from that, the convention was to invent new types of space characters rather than typing multiple of an existing one to ensure that the typeface would preserve the desired proportional relationships and, in the case of digital typography, the semantic meanings.)

  2. Replacing what would otherwise be a normal space in order to manipulate where a line may break without help from whoever is developing the CMS or forum software. (e.g. Putting &nbsp; between the last two/few words in a paragraph so that word-wrapping will treat them as a single word. This is done as a way of preventing what typesetters call "orphans"... basically, when a paragraph's last line is so short that it gives the impression of the paragraph break being too tall.)

  3. A chain of &nbsp; to subvert the stylesheet's rules for how text should be displayed. (Pranksters used to try to feed this into comment forms to force a page to be wider than the viewport.)

Only the third option requires multiple non-breaking spaces in a row.

I disagree about the heuristic as this will introduce confusion (e.g. users thinking that nbsp are preserved, but in some particular cases, they actually aren't). A user-settable preference would be better.

(In reply to Vincent Lefevre from comment #77)

I disagree about the heuristic as this will introduce confusion (e.g. users thinking that nbsp are preserved, but in some particular cases, they actually aren't). A user-settable preference would be better.

Which is still better than what we have today, because there is no indication that NBSP are discarded in a lot of cases, and you might only see it after e.g. receiving a copy of your message that has been wrapped to 80 chars for instance.

In our case we use non-breaking spaces between words and numbers like "page 35" or "article 5" or "§ 50" and so on. In our company we develop web applications to convert text of any kind to a well-laid-out, ready-to-print PDF or other formats. It happens all the time that someone copies text containing non-breaking spaces into our application with the intent that the spaces will be preserved as is after submitting the form.
At the moment this only works in other browsers than Firefox. Our issue really is only submitting forms.

Another use, in French, is to avoid a number being separated from its unit:

  • « L’altitude du Mont Everest est de 8 848[non breaking space]mètres ».

But a thin space ( ) is used when the unit symbol is used (the thin space is also the thousands separator):

  • « altitude du Mont Everest : 8 848 m ».

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(no pain of the broken bone, but the corset blocks me to concentrate) from comment #63)

The change would break copy in HTML editor. In HTML editor without <pre> element, 2nd and later ASCII whitespaces are automatically converted to NBSPs [...]

This is completely broken! I've just received a mail written from Firefox in something like a webmail, containing C code. This C code (as very often) contains multiple consecutive spaces for indenting. But in the mail, all spaces except the last one were converted to NBSPs. And as C code, this is invalid!

Another example: it's impossible to use Transifex (https://www.transifex.com/) for correct French translation since Firefox keeps changing nbsp's into normal spaces.

Quite disappointing that this bug is still here.

The nbsp is used as a typographic treatment in many European languages. When a text is typeset, nbsp prevents unwanted line breaks between characters or words (e.g. nbsp is placed between a single-letter word and the following word). Firefox is the only browser that deliberately removes nbsp from textarea/input, which forces users to switch between browsers when they’re addressing typography on the web. Tools such as https://typopo.tota.sk (automatic correction of typography errors) do not work in Firefox since one of their jobs is to place non-breaking spaces properly. I’d say the solution as suggested in comment 57 would be the best option to solve this problem.

Severity: critical → N/A

Bonjour David,
I am not saying that the patch would be accepted but managing preferences is pretty easy to do in Firefox.
Would you be interested in proposing a patch? I can help internally to find a reviewer and find help if needed.

We have also a step by step new contribution tutorial now:
https://firefox-source-docs.mozilla.org/contributing/contribution_quickref.html

merci

Flags: needinfo?(david+bugs)

I build Firefox with David’s patch for my personal use, and I have to say that some code syntax highlighters and markdown processors out there have become addicted to this corruption of non-breaking spaces by browsers. E.g. when somebody sends a block of code over Slack, and I copy it to my editor, I get nbsps in indentation. So some heuristic to only convert spans of multiple adjacent nbsps would be desirable.

(In reply to Yuri Khan from comment #86)

I build Firefox with David’s patch for my personal use, and I have to say that some code syntax highlighters and markdown processors out there have become addicted to this corruption of non-breaking spaces by browsers. E.g. when somebody sends a block of code over Slack, and I copy it to my editor, I get nbsps in indentation. So some heuristic to only convert spans of multiple adjacent nbsps would be desirable.

I think a Web browser should behave correctly at all times. Fixing Web pages designed by incompetent designers is their own problem.

I think a Web browser should behave correctly at all times. Fixing Web pages designed by incompetent designers is their own problem.

You’re right of course, but web browsers have not been allowed to choose correct behavior over backward bug compatibility since the Netscape/IE times. A partial fix now is preferable to a perfect fix never.

Yes, the editor needs to insert NBSPs when multiple white-spaces are adjacent, because adjacent ASCII white-spaces are collapsed to a single ASCII white-space at rendering time, per the basics of the HTML/SGML/XML specs, unless the parent element is explicitly styled as "preformatted" (currently, I'm rewriting the normalizer to behave similarly to Chrome). So, I think that only when an NBSP is surrounded by neither another NBSP nor ASCII white-space can we say the NBSP is truly an NBSP. Unfortunately, as far as I've tested, you cannot insert multiple NBSPs in one place in contenteditable because web browsers cannot remember which white-spaces were NBSPs. If browsers wanted to do that, they would unfortunately need to store twice the footprint per text node.

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900) from comment #89)

Yes, editor requires to put NBSPs if there are multiple white-spaces are adjacent because adjacent ASCII white-spaces are collapsed to an ASCII white-space at rendering time due to basic of HTML/SGML/XML spec unless parent element is explicitly styled as "preformatted"

I have always found this "feature" extremely annoying TBH. I use French spacing when typing and I hate to see one of them become non-breaking. I can type a non-breaking space all right whenever I need one.

(In reply to Yuri Khan from comment #88)

I think a Web browser should behave correctly at all times. Fixing Web pages designed by incompetent designers is their own problem.

You’re right of course, but web browsers have not been allowed to choose correct behavior over backward bug compatibility since the Netscape/IE times. A partial fix now is preferable to a perfect fix never.

I think a perfect fix (by removing the nbsp absurdity) would be easier than a partial fix in this case, so it should win the now/never contest.

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900) from comment #63)

(In reply to Jorg K (GMT+2) from comment #62)

What do you think, Masayuki-san, should we remove
https://searchfox.org/mozilla-central/rev/ab6f4c453d15ab82147c630a8b886b40240ca72b/dom/base/nsPlainTextSerializer.cpp#1143-1147
as suggested in attachment 9060386 [details] [diff] [review]?

The change would break copy in HTML editor. In HTML editor without <pre> element, 2nd and later ASCII whitespaces are automatically converted to NBSPs since if web browsers don't do that, whitespace collapsing of HTML spec causes only one whitespace rendered. Unfortunately, we cannot distinguish whether every NBSP in an editable text node is truly NBSP or not (i.e., automatically converted one). Therefore, we cannot fix this bug without big changes in HTML editor. However, in non-editable content, not so. So, if nsPlaintextSerializer handles NBSPs when it retrieves text from a node, we can fix this for non-editable content.

I’d like to apologize for my previous comment and offer something actually constructive.

If I understand both the discussion and the snippets of source code that have been posted here correctly, on type/paste, Firefox converts SP characters (U+0020) into NBSP characters (U+00A0) in order to preserve apparent whitespace. This leaves no way to disambiguate between typed/pasted and automatically converted SP/NBSP characters.

What if automatically converted SP-to-NBSP characters were marked by preceding them with a zero-width joiner (U+200D ZWJ)? It would have no visible impact on the content presented on-screen but also could act as a flag for converting ZWJ NBSP back to plain SP on cut/copy in order to leave unmarked NBSP alone. Also, because it’s so apparently useless as a combination, ZWJ NBSP is extremely unlikely to appear anywhere for just about any reason, so it should be a reliably safe way to mark these spaces.

I’d offer the code myself, but sadly the most sophisticated thing I’ve programmed was a violin tuner in BASIC about 25 years ago.

What if automatically converted SP-to-NBSP characters were marked by preceding them with a zero-width joiner (U+200D ZWJ)? It would have no visible impact on the content presented on-screen but also could act as a flag for converting ZWJ NBSP back to plain SP on cut/copy in order to leave unmarked NBSP alone. Also, because it’s so apparently useless as a combination, ZWJ NBSP is extremely unlikely to appear anywhere for just about any reason, so it should be a reliably safe way to mark these spaces.

The combination of plain U+0020 SPACE with U+00A0 NO-BREAK SPACE is already as useless as possible for purposes other than indentation.

(In reply to Yuri Khan from comment #95)

What if automatically converted SP-to-NBSP characters were marked by preceding them with a zero-width joiner (U+200D ZWJ)? It would have no visible impact on the content presented on-screen but also could act as a flag for converting ZWJ NBSP back to plain SP on cut/copy in order to leave unmarked NBSP alone. Also, because it’s so apparently useless as a combination, ZWJ NBSP is extremely unlikely to appear anywhere for just about any reason, so it should be a reliably safe way to mark these spaces.

The combination of plain U+0020 SPACE with U+00A0 NO-BREAK SPACE is already as useless as possible for purposes other than indentation.

That was kind of my point. Isn’t the problem we’re trying to solve here trying to balance the needs of lay users with the more typographically advanced? By marking which space characters were automatically converted on input so that they’re properly un-converted on output, we can have our proverbial cake and eat it, too. Automatically converting U+0020 ↔ U+200D U+00A0 achieves that.
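For readers who want to see the marker idea spelled out, here is a rough round-trip sketch in Python. It assumes the editor converts the second and later spaces of a run (as described earlier in this bug); the function names `mark_on_input` and `unmark_on_copy` are invented for the illustration and do not correspond to anything in Gecko.

```python
SPACE = "\u0020"  # U+0020 SPACE
NBSP = "\u00a0"   # U+00A0 NO-BREAK SPACE
ZWJ = "\u200d"    # U+200D ZERO WIDTH JOINER, used here as the marker

def mark_on_input(text: str) -> str:
    """Editor side (hypothetical): the 2nd and later spaces of a run become ZWJ+NBSP."""
    out = []
    prev = ""
    for ch in text:
        if ch == SPACE and prev == SPACE:
            # Automatically converted space: store it with the marker.
            out.append(ZWJ + NBSP)
        else:
            # First space of a run, typed NBSP, or any other character.
            out.append(ch)
        prev = ch
    return "".join(out)

def unmark_on_copy(text: str) -> str:
    """Copy side (hypothetical): only marked NBSPs go back to plain spaces."""
    return text.replace(ZWJ + NBSP, SPACE)
```

In this sketch `unmark_on_copy(mark_on_input(s))` returns `s` for any text that does not already contain the ZWJ NBSP pair, and an NBSP typed by the user survives the copy. The obvious limitation is that a genuine ZWJ followed by NBSP in the input would be flattened to a space.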

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900) from comment #66)

(In reply to Mikhail Ryazanov from comment #64)

  1. If preserving them is necessary, doesn't HTML support the white-space: pre-wrap style?

Ideally yes, but all browsers convert whitespaces to NBSPs:

I don't understand this comment. Why cannot Firefox choose to do it differently, in a better way?

Or couldn't there be a way to disable conversions between normal space and nbsp for users that don't need such conversions?

When the NBSP came from the site (as opposed to editor in Firefox), changing the NBSP is bad. Yet, it appears that Chrome, too, has the behavior that designMode / contenteditable changes every other space into an NBSP when pressing the space bar multiple times and copying the text undoes the hack.

I think we should investigate what heuristic Chrome uses exactly, but off the top of my head, I suggest changing an NBSP to an ASCII space upon plain text clipboard export only if it is adjacent to an ASCII space. This would leave non-editor-generated NBSP intact, and I think it is a legitimate concern to want those to be left intact e.g. in the case of French quotation marks.

Chrome's editing code appears to also have a special case that if you delete an ASCII space that is adjacent to an NBSP, the NBSP turns into an ASCII space at that point.

Would a Firefox patch based on the Thunderbird patch above be acceptable, and considered for review?
If so, I can submit a Firefox patch for this.

Flags: needinfo?(sledru)

Pierre, the patch looks good, but I don't have the technical expertise on this.
Henri does.

Rachel, could you please submit your patch for review:
https://firefox-source-docs.mozilla.org/contributing/contribution_quickref.html

Flags: needinfo?(sledru) → needinfo?(hsivonen)

(In reply to Pierre de La Morinerie from comment #100)

Would a Firefox patch based on the Thunderbird patch above be acceptable, and considered for review?
If so, I can submit a Firefox patch for this.

If you submit the patch, please follow our attribution policy (examples here). It's likely that Mozilla will request to add a test, like this one here. Please note that the initial version of the patch wasn't correct, we've rectified this now.

Attribution: the actual nsPlainTextSerializer changes were written by Rachel Martin <rachel@betterbird.eu>,
as a part of Betterbird.

Assignee: nobody → kemenaran

Please add the following comment to the code:

/*
Mail composers enforce consecutive spaces in HTML output by replacing them with non-breaking spaces.  
Revert this hack when converting HTML text to plain text.
*/

Thank you.

(In reply to Pierre de La Morinerie from comment #103)

Created attachment 9269182 [details]
Bug 359303 - preserve nbsps on clipboard export. r=hsivonen

Nice test. You might want to add test cases for nnns where n is an NBSP and s is a regular space. That will test the left-promotion. For full test coverage we suggest nnns, snnn, nsns, snsn and maybe some random stuff like nnssnsnnsnn or so. In all those cases, the NBSP should be replaced.
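The suggested cases can be checked against a simplified model. The sketch below is not the nsPlainTextSerializer code: it assumes that "promotion" means every NBSP in a run of spaces/NBSPs gets replaced as soon as the run contains at least one regular space. `normalize_runs` and `from_pattern` are hypothetical names used only here.

```python
import re

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

# A run is a maximal sequence made only of regular spaces and NBSPs.
_RUN = re.compile("[ \u00a0]+")

def normalize_runs(text: str) -> str:
    """Replace all NBSPs in a run that contains at least one regular space.

    Simplified model of left/right promotion (an assumption, not the patch).
    Runs made only of NBSPs are left alone.
    """
    def fix(match):
        run = match.group(0)
        return " " * len(run) if " " in run else run
    return _RUN.sub(fix, text)

def from_pattern(pattern: str) -> str:
    """Expand the n/s shorthand from the comment: n = NBSP, s = regular space."""
    return pattern.replace("n", NBSP).replace("s", " ")

# The cases proposed above: every NBSP should end up as a regular space.
for pattern in ["nnns", "snnn", "nsns", "snsn", "nnssnsnnsnn"]:
    assert normalize_runs(from_pattern(pattern)) == " " * len(pattern)
```

Under this model a lone NBSP such as the one in `"5\u00a0km"` is preserved, which is the behavior the French typography examples in this bug rely on.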

Thanks a lot for the suggestions. I added the comment, and updated the test with replacement examples.

I also updated the original code to use uints for length indexing (instead of ints), which makes the clang linter happy. This involved rewriting the indices of the second promotion loop (while retaining the same semantics).

It was not meant to be a structured comment. If you want it to be, you should probably use some comment tag.

I saw other functions in this file being described with the same kind of top-level comments, formatted similarly (e.g. nsPlainTextSerializer::CurrentLine::CreateQuotesAndIndent). I think keeping this descriptive comment as a general function description (rather than in the function body) would be useful; which comment tag would you add to it?

It really depends on the comment structure that is used in a particular project. E.g. <summary>. But I do not know which convention, if any, is used here. If there is no convention, it is best not to use structured comments at all for fear of future conflicts.

I reverted the comment to a non-structured comment (which seems the safest thing to do).

Chrome copies the &nbsp; characters as they are

What exactly do Chrome and Safari do? I tried copying from https://hsivonen.com/test/moz/nbsp.html in Chrome and pasting into gedit on Ubuntu, Notepad on Windows, and TextEdit on Mac. I also copied from Safari and pasted into TextEdit.

AFAICT, in all four cases all no-break spaces got replaced by regular spaces. What am I missing?

Attribution: the actual nsPlainTextSerializer changes were written by Rachel Martin <rachel@betterbird.eu>, as a part of Betterbird.

Is this correct? The upstream patch said "Betterbird <betterbird@betterbird.eu>".

Flags: needinfo?(hsivonen)

That page has &nbsp; as in <p>nbsp between number and unit: 5&nbsp;km.</p>. The patch is about 0xA0. Anyway, copying this from Chrome and looking at the clipboard in FreeClipViewer (on Windows), it appears that spaces are placed on the clipboard, even for the single NBSPs in « quote ! ». Can you clarify comment #98 which seems to suggest that not replacing NBSP with space could be a good thing.

As for the attribution, anything goes ;-) - https://hg.mozilla.org/comm-central/log?rev=betterbird

(In reply to Rachel Martin from comment #112)
That page has &nbsp; as in <p>nbsp between number and unit: 5&nbsp;km.</p>. The patch is about 0xA0.

They are the same thing in the DOM.

Can you clarify comment #98 which seems to suggest that not replacing NBSP with space could be a good thing.

I think it could be a good thing, but it's also bad to make stuff up relative to other engines. Earlier comments suggested that Chrome was already doing what was being requested here, in which case it would generally make sense to match. Now it seems that Chrome is not, in fact, already doing what's being requested here. Has Chrome changed between https://bugs.chromium.org/p/chromium/issues/detail?id=887511 and now? If Chrome has made the opposite change, it would be useful to understand why.

Introducing a novel (among engines) behavior may still be a good thing but requires more careful consideration.

As for the attribution, anything goes ;-) - https://hg.mozilla.org/comm-central/log?rev=betterbird

OK.

Hmm, https://bugs.chromium.org/p/chromium/issues/detail?id=887511 is a little confusing. Seems like the reporter wanted all NBSP to be changed to normal space. The developer said (cmt 4): "Mark WontFix since Chrome should change U+00A0 to U+0020 when putting into clipboard." Is there the word "not" missing? "Should not change"? They go on to say (cmt 7): "I said from Chrome's point of view putting U+00A0 into clipboard is correct and expected". But yes, Chrome converts NBSP to space, so that issue got "fixed" somehow.

(In reply to Henri Sivonen (:hsivonen) from comment #113)

Introducing a novel (among engines) behavior may still be a good thing but requires more careful consideration.

This bug is 16 years old and we're still debating whether it should be implemented or not. Incredible!

When I copy-paste things around, the expectation is that they will arrive unchanged (thus the name "copy"). I also asked around and this is the case with everybody who answered my very unscientific quiz. The reference here is not browsers, but text editors. Does MS Word change non-breaking spaces? Open/LibreOffice? Generic text editors (Notepad, Notepad++, Sublime, vim, emacs etc.) with their default settings?

(In reply to Rachel Martin from comment #114)

Hmm, https://bugs.chromium.org/p/chromium/issues/detail?id=887511 is a little confusing. Seems like the reporter wanted all NBSP to be changed to normal space. The developer said (cmt 4): "Mark WontFix since Chrome should change U+00A0 to U+0020 when putting into clipboard." Is there the word "not" missing? "Should not change"? They go on to say (cmt 7): "I said from Chrome's point of view putting U+00A0 into clipboard is correct and expected". But yes, Chrome converts NBSP to space, so that issue got "fixed" somehow.

I have just tried in Google Chrome 100.0.4896.60, it does not damage the content of the text areas when selected in the sample form.

What exactly do Chrome and Safari do? I tried copying from https://hsivonen.com/test/moz/nbsp.html in Chrome and pasting into gedit on Ubuntu, Notepad on Windows, and TextEdit on Mac. I also copied from Safari and pasted into TextEdit.

AFAICT, in all four cases all no-break spaces got replaced by regular spaces. What am I missing?

It seems that current versions of Chrome and Safari:

  • preserve nbsps when copying from a user-editable area (such as a <textarea>),
  • but convert nbsps to spaces when copying from non-editable text.

As far as I understand, this bug is about user-editable content. In any case, the behavior this patch is trying to implement is just for user-editable content. I'm updating the patch's test to use a textarea, which better reflects the intended behavior.

Now there's the question of why Chrome and Safari replace nbsps when copying non-editable text, and whether Firefox should do the same…

(In reply to Strainu from comment #115)

When I copy-paste things around, the expectation is that they will arrive unchanged (thus the name "copy"). I also asked around and this is the case with everybody who answered my very unscientific quiz. The reference here is not browsers, but text editors. Does MS Word change non-breaking spaces? Open/LibreOffice? Generic text editors (Notepad, Notepad++, Sublime, vim, emacs etc.) with their default settings?

Text editors do not collapse white space at all. Word processors collapse white space at the end of line only.

(In reply to Pierre de La Morinerie from comment #117)

Now there's the question of why Chrome and Safari replace nbsps when copying non-editable text, and whether Firefox should do the same…

The most probable answer is they want to be Firefox-compatible ;-)

(In reply to Pierre de La Morinerie from comment #117)

As far as I understand, this bug is about user-editable content.

A text area is not user-editable content, it is a plain text input control.

(In reply to Pierre de La Morinerie from comment #117)

As far as I understand, this bug is about user-editable content. In any case, the behavior this patch is trying to implement is just for user-editable content. I'm updating the patch's test to use a textarea, which better reflects the intended behavior.

Please do not limit the scope to user-editable content.

Given {

  • a non-editable web page,
  • an editable web page,
  • a textarea control,
  • a readonly textarea control,
  • an input control,
  • a readonly input control

} containing text where non-breaking spaces are used in the intended way (between a number and its unit, or around French punctuation, or between groups of digits in a single number, etc),
when such text is copied,
then the non-breaking spaces should be preserved,
so that proper formatting propagates to whatever document the user is pasting into.

Given any of the above, where non-breaking spaces are misguidedly used to prevent space collapsing,
when such text is copied,
then non-breaking spaces should be converted to spaces,
so that pasting such text into a plain text document yields multiple spaces.

It should be noted that French punctuation is different from French spacing; the latter is an example of an inappropriate use of non-breaking spaces.

(In reply to Rachel Martin from comment #114)

Hmm, https://bugs.chromium.org/p/chromium/issues/detail?id=887511 is a little confusing. Seems like the reporter wanted all NBSP to be changed to normal space. The developer said (cmt 4): "Mark WontFix since Chrome should change U+00A0 to U+0020 when putting into clipboard." Is there the word "not" missing? "Should not change"? They go on to say (cmt 7): "I said from Chrome's point of view putting U+00A0 into clipboard is correct and expected". But yes, Chrome converts NBSP to space, so that issue got "fixed" somehow.

For what it's worth, I just tested older builds of Chrome on macOS (dating back to 2016), and all nbsps in non-user-editable content were already converted into spaces on copy.

Which means that, on macOS at least, Chrome has been converting non-editable nbsps to spaces for a while (it didn't go back on an older decision).

(I wonder if this behavior was different on Chrome Linux; if someone with a Linux box can check, that'd be great.)

Google Chrome 100.0.4896.60

cmt 7

document.getSelection().toString().charCodeAt(3)

160

// pasted
'cmt 7'.charCodeAt(3)

32

:-(
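The same kind of check can be done outside the browser on whatever ends up in the clipboard. Below is a small Python sketch that lists every space-like character in a string with its code point and Unicode name; `describe_spaces` is a made-up helper name, and it relies only on the standard `unicodedata` module.

```python
import unicodedata

def describe_spaces(text: str) -> list:
    """Return (index, 'U+XXXX NAME') for each space separator (category Zs) in text.

    Hypothetical helper for inspecting pasted text; it does not modify anything.
    """
    found = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Zs":
            found.append((i, f"U+{ord(ch):04X} {unicodedata.name(ch)}"))
    return found

# A preserved NBSP at index 3 versus a converted one.
print(describe_spaces("cmt\u00a07"))  # [(3, 'U+00A0 NO-BREAK SPACE')]
print(describe_spaces("cmt 7"))       # [(3, 'U+0020 SPACE')]
```

Code point 160 in the console output above corresponds to U+00A0 and 32 to U+0020, so a converted NBSP shows up here as `U+0020 SPACE`.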

At the last company I worked for, we wrote web applications to work with documents, and we were heavily copy-pasting text between Word/LibreOffice and the browser. Firefox has always converted non-breaking spaces to normal spaces, so we all needed to switch to Chrome. There was no other possibility because of this issue. See also my comment from 3 years ago: https://bugzilla.mozilla.org/show_bug.cgi?id=359303#c79

In my opinion nothing should be done when copying text from the browser. Just let it be the way it intentionally was put on the website.

(In reply to N. Göddel from comment #125)

In my opinion nothing should be done when copying text from the browser. Just let it be the way it intentionally was put on the website.

The problem is about copying text from the browser but the reason for the problem is copying text from the mail reader which is caused by a systematic abuse of logic in the mail composer (which is a part of the Mozilla suite).

(In reply to Yuri Khan from comment #121)

(In reply to Pierre de La Morinerie from comment #117)

Please do not limit the scope to user-editable content.

Thanks for the detailed spec. This is what the attached patch currently implements: in all content (editable or not), intended non-breaking spaces are preserved, and editor-generated non-breaking spaces are replaced.

However, the concern here is that Webkit/Blink browsers implement a different behavior, where in non-editable content, all non-breaking spaces are replaced. The question is whether Firefox should implement what you describe (which is arguably a more correct behavior), or try to stick to the behavior of Webkit/Blink.

All intended non-breaking spaces should be preserved, whatever the context. What's the reason for the behavior of Webkit/Blink browsers?

(In reply to Vincent Lefevre from comment #129)

All intended non-breaking spaces should be preserved, whatever the context. What's the reason for the behavior of Webkit/Blink browsers?

The reason is presumably compatibility with Gecko.

Does anybody know what other parts of Firefox change U+00A0 NO-BREAK SPACE into U+0020 SPACE or vice versa? I ask because for years I've been compiling Firefox with a patch, linked at the top of this bug, which is a more obvious and more complete fix to this bug than the one under discussion, but even with that, I still can't use U+00A0 NO-BREAK SPACE with Firefox in many contexts, e.g., in Twitter. So even if this patch is accepted in one form or another, it still won't make it possible to tweet U+00A0 using Firefox because there are still other U+00A0 manglers lurking in the code. Does anyone know where they might be, if a bug about them is already opened, and how they might be fixed?

(Sorry if I'm hijacking this bug's discussion to mention other nbsp-related bugs, but I think it's part of the confusion that there are actually several of them and that even fixing this one will leave a lot of problems with nbsp.)

Flags: needinfo?(david+bugs)

Twitter is known to change the text entered by the user, so this may not be a good example (Twitter itself may be the culprit). You need to provide a simple testcase.

(In reply to David A. Madore from comment #131)

I still can't use U+00A0 NO-BREAK SPACE with Firefox in many contexts, e.g., in Twitter. So even if this patch is accepted in one form or another, it still won't make it possible to tweet U+00A0 using Firefox because there are still other U+00A0 manglers lurking in the code. Does anyone know where they might be, if a bug about them is already opened, and how they might be fixed?

(Sorry if I'm hijacking this bug's discussion to mention other nbsp-related bugs, but I think it's part of the confusion that there are actually several of them and that even fixing this one will leave a lot of problems with nbsp.)

Firefox has a command Report a problem with this page, please use that one to report the problem with Twitter. Note, however, that it need not be the browser’s fault. Facebook, for example, does not allow non-breaking spaces in any browser, which forces you to use ‘_’ instead, which gets you a ban on some groups with strict moderation. I know, this is crazy, but so is the world 🙁

In the early days of HTML WYSIWYG editing, the only way to encode multiple space characters was to replace all but one of the SP = U+0020 codes that the user had entered with NBSP codes. That was of course an ugly hack, as it had nothing to do with the actual purpose of NBSP = U+00A0, which is to be a "no-break space", not an "additional space".
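To make the hack concrete, here is a minimal sketch of it and of why undoing it is lossy. The function names are hypothetical and this is not the code of any particular editor; it only assumes the rule stated above (in a run of spaces, all but one become U+00A0).

```javascript
const SP = " ";
const NBSP = "\u00A0";

// Hypothetical illustration of the hack: in every run of two or more spaces,
// keep the last one as U+0020 and turn the others into U+00A0.
function encodeSpaceRuns(text) {
  return text.replace(/ {2,}/g, (run) => NBSP.repeat(run.length - 1) + SP);
}

// The naive inverse: every U+00A0 becomes U+0020, intended or not.
function decodeAllNbsp(text) {
  return text.replace(/\u00A0/g, SP);
}

const typed = "a   b";                 // three spaces typed by the user
const stored = encodeSpaceRuns(typed); // "a\u00A0\u00A0 b"
console.log(decodeAllNbsp(stored) === typed); // true

const intended = "100" + NBSP + "km";  // a deliberate no-break space
console.log(decodeAllNbsp(intended) === intended); // false
```

The round trip works for the spaces the editor generated, but the same inverse also destroys a no-break space the author put there on purpose, which is the data loss this bug is about.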

The problem should have gone away when CSS gave us the 'white-space' property, where you can now select for each HTML element whether multiple SP codes are collapsed into one single visible space during rendering or not: https://www.w3.org/TR/CSS2/text.html#white-space-prop

Unfortunately, by the time that CSS-based solution became available, browsers and their built-in HTML editors were already riddled with various ugly hacks to convert between SP and NBSP codes. I doubt that adding yet ever more such HACKs will get us closer to the actual solution. I suspect the actual solution is to remove any automated conversions between NBSP and SP from all browsers and HTML editors, and to instead fix all the places that originally had motivated these hacks (WYSIWYG HTML editors that support the ability to type multiple spaces in a row) by using instead the CSS 'white-space' property where visual support for multiple spaces is desired.

(In reply to David A. Madore from comment #131)

... there are still other U+00A0 manglers lurking in the code ...

Please turn to bug 532712. All the mangling happens in WSRunObject.cpp, search for u" "_ns. For our mail client we've implemented Ctrl+Shift+Space here (something that was rejected by Mozilla in bug 894919) and hacked the Mozilla editor not to remove those spaces straight away. Users can now write e-mail and add NBSP around the French quotation marks, etc.

(In reply to Markus Kuhn from comment #134)

I doubt that adding yet ever more such HACKs will get us closer to the actual solution.

The proposed change does not add more hacks, it makes an existing hack less obnoxious. I understand your holy rage but this bug has been sitting here for sixteen years now, rendering Firefox useless as a desktop document reader. Please let us do something about it instead of preserving the status quo for the next sixteen years.

(In reply to Christopher Yeleighton from comment #119)

(In reply to Pierre de La Morinerie from comment #117)

Now there's the question of why Chrome and Safari replace nbsps when copying non-editable text, and whether Firefox should do the same…

The most probable answer is they want to be Firefox-compatible ;-)

That's indeed a possible explanation.

AFAICT, both Gecko and Blink export no-break spaces to the clipboard in the HTML flavor. Both Gecko and Blink replace no-break spaces with regular spaces when converting HTML into the plain-text clipboard flavor. When copying from a plain-text context (textarea), Gecko replaces no-break spaces with regular ones but Blink does not.

I suspect the actual solution is to remove any automated conversions between NBSP and SP from all browsers and HTML editors, and to instead fix all the places that originally had motivated these hacks (WYSIWYG HTML editors that support the ability to type multiple spaces in a row) by using instead the CSS 'white-space' property where visual support for multiple spaces is desired.

Ideally, yes. However, doing so carries even more compatibility risk than what's being proposed here.

As a user I see the patch here as something I'd like to see landed. As a reviewer, I think it's appropriate to discuss this in a standardization context, to avoid making stuff up in an area that is potentially interop-sensitive, even if less so than many other things.

Note that under Linux, it is generally possible to paste the selection directly with the middle button, i.e. without using the clipboard. I'm wondering whether the behavior concerning spaces is similar to copy-paste via the clipboard.

Linux does not offer a clipboard at all. Linux users usually connect to a display server (which need not run under Linux at all). The most popular display server is X Windows, which is based on the X Terminal (X meaning ‘cross’). The X Terminal does not offer a clipboard either; instead, you can paste from either the application-owned primary selection, which is more instantaneous, or the display-owned cut buffer, which is more persistent. Both pastes give the same wrong result.

(In reply to Christopher Yeleighton from comment #139)

That was overkill and some of it is misleading at best.

  1. Unless you've got a citation, "X" doesn't mean "cross"... it was named similarly to how the B programming language begat C which begat C++ and D. Originally, there was an operating system named V which had a windowing system named W. When they ported W to UNIX, they named the port X.
  2. People generally refer to X Windows as X11. It's both shorter and more precise, since previous versions like X10 did exist. (For example, "X11" is the form used in Firefox User-Agent strings, as in Mozilla/5.0 (X11; Linux; ...).)
  3. If we're sticking to maximum relevance, "I'm talking about Firefox on X11. Firefox gives the wrong result whether using the PRIMARY selection or the CLIPBOARD selection." was all you needed to say.
  4. It's unarguable that X11 does offer a clipboard, in that the three selections are named PRIMARY, SECONDARY, and CLIPBOARD and covered by the XDG Clipboards Spec.

As the spec puts it:

X has a thing called "selections" which are just clipboards, essentially (implemented with an asynchronous protocol so you don't actually copy data to them). There are three standard selections defined in the ICCCM: PRIMARY, SECONDARY, and CLIPBOARD.

As for cut buffers, JWZ's "X Selections, Cut Buffers, and Kill Rings" has more, but the gist is that they're a hold-over from X10 that was retained for compatibility, nobody should be using them, and probably nobody is, given that Qt and emacs, which were already using selections, switched from only using PRIMARY to Windows-style handling of PRIMARY vs. CLIPBOARD way back around the release of Qt 3.

Also, there was such demand to preserve X11-style "PRIMARY is selecting and middle-click, CLIPBOARD is Ctrl+X/C/V" behaviour that Wayland also has both a "primary" clipboard and a "clipboard" clipboard, and utilities like wl-clipboard can manipulate both, though I don't know what the low-level API constants are as that section of The Wayland Book is still a blank TODO.

(In reply to Stephan Sokolow from comment #140)

Ugh. I brought up Wayland, but forgot to make the point it was intended to support. Namely, that both of the graphical APIs Firefox on Linux can be built against (X11 and Wayland) support a PRIMARY/CLIPBOARD distinction. Therefore, it's only relevant to mention that "Linux" doesn't provide the clipboard services if you think/suspect X11 and Wayland builds may be exhibiting different behaviour.

(In reply to Henri Sivonen (:hsivonen) from comment #137)

Henri, thanks for your clarifications, and for opening an issue against the spec.

Until the spec issues are sorted out, I have a smaller patch to suggest. Here's the current behavior of Webkit/Blink, when copying from:

  • a plain-text control (textarea, input): preserves NBSPs ✅
  • a contenteditable node: replaces NBSPs ❌
  • a non-editable node: replaces NBSPs ❌

(To be clear, this is the relevant code in Webkit/Editor.cpp: when copying from a plain-text control, no replacement happens.)

I suggest, as a minimal beginning, adopting the same behavior in Firefox. I attached a new single-line patch that configures copies from text widgets to preserve non-breaking spaces. For all other kinds of copies (content editable, or non-editable nodes), the current behavior is preserved.

What do you think?

Flags: needinfo?(hsivonen)
See Also: → 532712

Henri, any thoughts on this minimal variation of the patch?

(In reply to Rachel Martin from comment #99)

Implemented the suggestion from the previous comment here:
https://github.com/Betterbird/thunderbird-patches/blob/main/91/bugs/359303-dont-replace-NBSP-m-c.patch

I’ve been testing this patch on my build of Firefox. I have discovered one case where it doesn’t do the right thing: When a line starts with a run of NBSPs followed by a non-space character or end of line, these should be converted to spaces because it’s likely code indentation. In other words, line breaks should be treated as whitespace for the purposes of the heuristic.

(In reply to Pierre de La Morinerie from comment #143)

Henri, any thoughts on this minimal variation of the patch?

Sorry about the delay. I wanted to get some signal from the Web Editing Working Group before replying. I had a chance to get this on the WG's meeting agenda today.

Based on that discussion, I think we should proceed with:

  1. When copying from a plaintext context, such as textarea, let's pass all no-break spaces to the clipboard as-is and assume that this behavior is going to stick.
  2. When copying from HTML, keep exporting the HTML clipboard flavor as today.
  3. When copying from HTML, when a text node has a sequence of no-break spaces that isn't preceded or followed by an ASCII space, don't replace those no-break spaces with ASCII spaces. Assume that this change might need to be reverted if problems arise.
  4. When JavaScript writes a string to clipboard, we shouldn't touch the no-break spaces.
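As an illustration of point 3 only, the rule can be sketched as a pure string function. This is not the Gecko implementation: `serializeTextNode` is a hypothetical name, and treating the start and end of the text as "not an ASCII space" is an assumption that the list above does not settle.

```javascript
// Hypothetical sketch of point 3: a run of U+00A0 is kept unless it touches
// an ASCII space, in which case it is assumed to be editor-generated.
function serializeTextNode(text) {
  // Match each maximal run of U+00A0 and inspect its immediate neighbours.
  return text.replace(/\u00A0+/g, (run, offset, whole) => {
    const before = whole[offset - 1];          // undefined at the start of the text
    const after = whole[offset + run.length];  // undefined at the end of the text
    const nextToAsciiSpace = before === " " || after === " ";
    return nextToAsciiSpace ? " ".repeat(run.length) : run;
  });
}

// A deliberate no-break space between a number and its unit is preserved.
console.log(serializeTextNode("10\u00A0km") === "10\u00A0km"); // true
// A run adjacent to an ASCII space is replaced by ASCII spaces.
console.log(serializeTextNode("a\u00A0\u00A0 b") === "a   b"); // true
```

Because each match is a maximal run, its neighbours are never U+00A0 themselves, so one pass over the original text is enough.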
Flags: needinfo?(hsivonen)

Sounds great; thanks a lot for carrying this discussion to the relevant working group. The four points above look good to me (especially the distinction between changes that are expected to stick, and those which are more experimental).

The one-line patch currently in review for this bug (https://phabricator.services.mozilla.com/D141934) already implements 1. and 2. – which are the changes expected to stick. I think it would be interesting to land these first, before tackling the more experimental changes (3 and 4). Does that look good to you?

Flags: needinfo?(hsivonen)

(In reply to Pierre de La Morinerie from comment #146)

Sounds great; thanks a lot for carrying this discussion to the relevant working group. The four points above look good to me (especially the distinction between changes that are expected to stick, and those which are more experimental).

The one-line patch currently in review for this bug (https://phabricator.services.mozilla.com/D141934) already implements 1. and 2. – which are the changes expected to stick. I think it would be interesting to land these first, before tackling the more experimental changes (3 and 4). Does that look good to you?

Looks OK to me, but I'm not familiar with the two-step behavior that's being made use of here, so I requested review also from mbrodesser.

Flags: needinfo?(hsivonen)
Blocks: 1769534

I wonder if we should not add this to the release notes?

Pushed by mbrodesser@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1fca61c97256
preserve nbsps on clipboard export. r=hsivonen,mbrodesser

(In reply to Mirko Brodesser (:mbrodesser) from comment #150)

@Sylvestre: the criteria for that are vague (https://wiki.mozilla.org/Release_Management/Release_Notes#How_to_decide_whether_a_change_should_be_included_in_release_notes.3F). Feel free to nominate it.

Include in release notes as a fix for a serious data loss without any work-around (see comment #5)!

Release Note Request (optional, but appreciated)
[Why is this notable]: A 16 y.o. bug seriously affecting interoperability in non-US culture environments has been fixed.
[Affects Firefox for Android]: Yes
[Suggested wording]: Non-US users will be able to copy numbers published on Web pages because non-breaking spaces used to format them will be preserved.
[Links (documentation, blog post, etc)]: <URL: https://bugzilla.mozilla.org/show_bug.cgi?id=359303#c145 >

relnote-firefox: --- → ?

Nitpick: this is also relevant to English-speaking users; non-breaking space as a unit separator is definitely a thing in English typography.

(In reply to Pierre de La Morinerie from comment #154)

Nitpick: this is also relevant to English-speaking users; non-breaking space as a unit separator is definitely a thing in English typography.

This is less important because tools that read physical quantities with units are less common and more likely to accept a space after all.

(In reply to Pulsebot from comment #151)

Pushed by mbrodesser@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1fca61c97256
preserve nbsps on clipboard export. r=hsivonen,mbrodesser

Note: This change implements 2 requirements out of 4. This bug will be resolved only after all 4 requirements have been implemented.

The remaining requirements were extracted to bug 1769534, so that this bug can be closed.

Status: REOPENED → RESOLVED
Closed: 10 years ago → 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 103 Branch

(In reply to Pierre de La Morinerie from comment #157)

The remaining requirements were extracted to bug 1769534, so that this bug can be closed.

Oh, in this case publishing a release note will only cause confusion :-(

relnote-firefox: ? → ---

I'll try a release notes request with a narrower scope.

Release Note Request (optional, but appreciated)
[Why is this notable]: A 16 y.o. bug about text copied to the clipboard being changed by Firefox has been fixed.
[Affects Firefox for Android]: Yes
[Suggested wording]: Non-breaking spaces are now preserved when copying text from a form control.
[Links (documentation, blog post, etc)]: <URL: https://bugzilla.mozilla.org/show_bug.cgi?id=359303#c145 >

relnote-firefox: --- → ?

Note added to 103 nightly notes.

Note added to 103 Release Notes
