Closed Bug 624666 Opened 14 years ago Closed 7 years ago

Dataloss through Replacement of No-Break Space with Space.

Categories

(Core :: DOM: Serializers, defect)

x86
All
defect
Not set
critical

Tracking

()

RESOLVED DUPLICATE of bug 359303

People

(Reporter: u.mail.me, Unassigned)

References

()

Details

User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b9pre) Gecko/20110109 Firefox/4.0b9pre
Build Identifier: 

Every times a NBSP is handled it is replaced by space → dataloss.

One source I found is http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsPlainTextSerializer.cpp#1252.

The mentioned reason for this behavior is hilarious: “replace non breaking spaces with a normal space since most (all?) receivers of the result won't understand the nbsp and even be confused by it”

WTF? These are two complete different characters! In some languages like French it is possilbe and common to place a NBSP between the words and the punctuation marks. With NBSP replaced by Space a break may cut this off.

Just because it looks quite similar?  Why not replace „”” with " and ‚‘’ with ' and so on?

Based the comments in the nsPlainTextSerializer.cpp they should have known that it is bad:

“First, replace all nbsp characters with spaces, which the unicode encoder won't do for us.“

Yeah, of course it won't replace nbsp with spaces due they’re different characters.

For firefox an easy solution would be to comment these harmful lines out, I’ve done it and it works fine, but unfortunately Thunderbird uses this to format html mails (why?).

Is it possible to fix it? Or split the code for firefox and thunderbird, because it is complexer to fix the thunderbird.

Reproducible: Always
bug related to bug 359303 . duplicate ?
No, it is not a duplicate. But maybe it has the same root of evil. 

Bug does not only affect copying. Try for example to write a plain text mail with some NBSP [e.g. with utf-8 (ascii doesn’t have NBSP]. You won't be successful, because NBSP are replaced in the text mail editor.
Assignee: nobody → ehsan
Whiteboard: [post-2.0]
It is not a complete duplicate of 359303, but it is covered by bug 359303 and bug 290565. 

On the other hand both bugs are 4 to 6 years old and nothing changed and this one is already assigned to 	Ehsan Akhgari [:ehsan].
It's an important bug for the french users because no breaking space
is often used before some punctuation marks.
Also, why not replace narrow no-break space (\u202F) with space and so
on : <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{Block=General%20Punctuation}> ?

For example, this web service <http://typographie.piprime.fr/> cannot
be used with Firefox…
Hoping this bug will be fix rapidly.
Updating to reality: I won't have time to work on this in the near future.
Assignee: ehsan → nobody
Whiteboard: [post-2.0]
Hi,

Since the ‘near future’ has gone, I'd like to catch more attention?

I confirm it's really annoying for French users (among others). I love software with copyleft and want to write plain text e-mails but very often I have problems with this awful bug.

There are also several people who claimed a fix on the bug 290565 page.

Thanks for considering.
Blocks: 290565
Jörg, could you be the hero who gangs up with :Ehsan to fix this bug? (For Thunderbird, this is bug 290565, but you might prefer this bug which is less noisy).

UX evaluation: For languages like French who regularly use nbsp to stick their punctuation unto the last word of the sentence, but also for anyone trying to just use nbsp (as in 100&nbsp;Euro), the current UX of converting nbsp's to normal spaces is really annoying. It's also real dataloss with potentially hazardous consequences when things sent by email are meant for use in wiki's etc. We eliminate nbsp's both on direct input, and when sending simple HTML as plaintext (if Auto-Detect's auto-downgrade aka "Send messages as plaintext if possible" is on). Per my bug 290565 comment 30, workarounds are complex hence unfit for general use.

Per comment 1, the prime suspect in code is this (updated URL):
https://dxr.mozilla.org/mozilla-beta/source/dom/base/nsPlainTextSerializer.cpp#1211-1231

> if (!(mFlags & nsIDocumentEncoder::OutputPersistNBSP)) {
>    // First, replace all nbsp characters with spaces,
>    // which the unicode encoder won't do for us.
>    aString.ReplaceChar(kNBSP, kSPACE);
>  }
>  mOutputString->Append(aString);

Many commenters including comment 0 here have argued that this blanket substitution is unwarranted and should be completely removed.

There's not much in favor of keeping it: David Baron hints in Bug 218277, comment 19, that we're doing this to undo our own other hack where multiple normal spaces entered by user get converted into nbsp's to actually create visible whitespace in HTML which would otherwise conflate. However, I understand from Bug 218277, comment 21 that he's mainly thinking of plaintext composition, but in that case, the magical substitution doesn't even seem to happen (confirmed in my own testing), so that's a non-argument. So nobody has ever come forward to expain *why* we couldn't keep nbsp's in plaintext these days (even if we perhaps created them ourselves in HTML composition), given that our default plaintext format of UTF-8 has no problems of handling nbsp's (0xA0). If anything, we'd at least have to limit the removal of nbsp's for plaintext to that pattern which we're actually creating in HTML, i.e. one or more nbsp's terminated by a normal space (which would leave all of users' single nbsp's intact).

Otherwise, I think the grand plan for these issues is bug 347689 (sic), essentially described here:

(In reply to David Baron :dbaron: ⌚️UTC-7 from bug 218277 comment 59:)
> It seems like what we should really be doing is:
>  * when we're editing plain text, store multiple presses of space as spaces
> (this is a change)
>  * when we're editing HTML, store multiple presses of space using
> non-breaking
> spaces for all but the last press (tricky with deletion) (we probably do this
> fine already)
>  * when serializing HTML to text, convert runs of non-breaking spaces
> terminated
> by a space to spaces
>  * when using nsPlainTextSerializer to convert text to text (if we need to
> use
> it at all, although we seem to now), don't mess with spaces
> 
> Does this make sense?

Coming back to nsPlainTextSerializer.cpp, I'm wondering if we can get away with less to fix the worst of this issue for our users.

Could you please try the following?

Outcomment this line in nsPlainTextSerializer.cpp (see URL above), and let's have a *trybuild* with that.
>    aString.ReplaceChar(kNBSP, kSPACE);

I'm curious to see how much of the problem actually goes away with that.
If it turns out to be helpful, short of removing that code, we could set OutputPersistNBSP flag for our nsIDocumentEncoder to disable the unwanted elimination of nbsp's in the plaintext serializer:

https://dxr.mozilla.org/mozilla-beta/source/dom/base/nsIDocumentEncoder.idl#184
const unsigned long OutputPersistNBSP = (1 << 17);

I guess deactivating this will fix the HTML->plaintext losses, but may or may not allow direct input of nbsp during composition (e.g. via Alt+255); if not, how can we find the spot in code where that direct input gets immediately eliminated?

On behalf of a large number of affected users out there, thanks a lot for your precious time to look into this, and/or get others on board who can finally fix this, after more than a decade of UX failure.
(In reply to Thomas D. (currently busy elsewhere; needinfo?me) from comment #7)
> Jörg, could you be the hero who gangs up with :Ehsan to fix this bug? (For
> Thunderbird, this is bug 290565, but you might prefer this bug which is less
> noisy).
> ...
> [SNIP]

Jörg, there's comment 7 for you which explains everything...
Flags: needinfo?(jorgk)
Hmm, funny, I was just looking at bug 1377836 (really just a duplicate of bug 144230 but less busy) to preserve tabs and spaces better. There's also a patch that should preserve spaces/tabs better in the plain text serialiser. I'll see whether there is a connection here.
I don't think it makes sense to stare at
https://dxr.mozilla.org/mozilla-central/source/dom/base/nsPlainTextSerializer.cpp#1231-1250
without understanding when this code is called.

Plain text serialiser has to do with turning a HTML document into plain text, that's typically used for TB when converting HTML messages to plain text.

I entered |huhu&nbsp;!| into http://www-archive.mozilla.org/editor/midasdemo/ and when I copied that, it was copied just fine 100%. Surely that code doesn't run through nsPlainTextSerializer.cpp#1231-1250.

So unless someone gives steps to reproduce a problem here, looking at some code as described in comment #0 isn't a problem as such. I'll comment in bug 290565 next.
Flags: needinfo?(jorgk)
(In reply to Jorg K (GMT+2) from comment #11)
> I don't think it makes sense to stare at
> https://dxr.mozilla.org/mozilla-central/source/dom/base/
> nsPlainTextSerializer.cpp#1231-1250
> without understanding when this code is called.

Well, I can't speak for comment 0 but as for myself I correctly understood that this certainly gets called for HTML to plaintext conversion, that's why I suggested in comment 10 that this will probably only fix that part of the problem.
I'm sure Jörg does not expect reporter nor me to have an immediate deep understanding of when this code is called from somewhere deep in the internals of TB. I'm glad that contributors like reporter succeeded to identify that relevant spot in code, and at least my intensive staring at that code combined with my initiative here in comment 7 has enabled Jörg to rapidly fix a large chunk of this long-standing problem in bug 290565, comment 34, within minutes.

> Plain text serialiser has to do with turning a HTML document into plain
> text, that's typically used for TB when converting HTML messages to plain
> text.

Exactly, and that's certainly a big part of the problem reporter was facing, because succeeding to paste 0xA0 into composition is pretty meaningless when it gets eaten by delivery-format:Auto-detect's plaintext conversion of simple HTML messages lateron.

> I entered |huhu&nbsp;!| into
> http://www-archive.mozilla.org/editor/midasdemo/ and when I copied that, it
> was copied just fine 100%. Surely that code doesn't run through
> nsPlainTextSerializer.cpp#1231-1250.

Well, there's core serializer bug 359303 which claims *copying* nbsp is still a problem. But in my tests, copying Hello&nbsp;world into TB composition succeeds, whereas just copying the &nbsp; only didn't. Might be clipboard flavors, not necessarily a Problem in TB or Core.

> So unless someone gives steps to reproduce a problem here, looking at some
> code as described in comment #0 isn't a problem as such. I'll comment in bug
> 290565 next.

Jörg is right that unfortunately, this bug doesn't come with STR.
I think we now have the curious situation that the problem of this bug (with comment 0 correctly pointing to plaintext serializer) is being fixed in bug 290565, whereas the precise STR of bug 290565, comment 0 explicitly talk about "entering" (or copy-pasting), which as Jörg points out will not be fixed by his current patch there.
So maybe we can fix the problem of bug 290565 here ;p

To reproduce the nbsp-input part of the problem (mentioned in my comment 7, which also references my bug 290565 comment 30 with details), simply press Alt+255 in composition, or add &nbsp; in HTML source, and realize that it's just entering plain spaces instead (as seen by pushing the words in question to line break point, when they break which they shouldn't; and only normal spaces 0x20 seen in source before sending, and after.).

Does that suffice or do you need more information or proper STR format to reproduce the problem?
Thanks a lot for looking into this and the first fast fix following my request...
I suspect that STR could be:
Enter |huhu&nbsp;!| into http://www-archive.mozilla.org/editor/midasdemo/ or a TB compose window.
Copy this text.
Paste into Notepad++.
I suspect that you'll get a space and not a NBSP.

That would be the same effect as described in bug 359303 where bug 359303 comment #39 has a nice GIF video (http://i.imgur.com/2StB85Q.gif).

Let's do a bit of bug clean-up and dupe this over to bug 359303.
No longer blocks: 290565
Status: UNCONFIRMED → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
(In reply to Thomas D. (currently busy elsewhere; needinfo?me) from comment #12)
> Well, I can't speak for comment 0 but as for myself I correctly understood
> that this certainly gets called for HTML to plaintext conversion, that's why
> I suggested in comment 10 that this will probably only fix that part of the
> problem.

Typo, that's comment 7.
BTW, 359303 is the lower number, so it makes perfect sense to dupe this over to bug 359303 where I also added STR.
(In reply to Jorg K (GMT+2) from comment #13)
> I suspect that STR could be:
> Enter |huhu&nbsp;!| into http://www-archive.mozilla.org/editor/midasdemo/ or
> a TB compose window.
> Copy this text.
> Let's do a bit of bug clean-up and dupe this over to bug 359303.

I would have preferred to attach your patch attachment 8884559 [details] [diff] [review] here where it fits summary and comment 0, then fix the direct input problem in bug 290565 as per that bug's description, and after that, evaluate if copy/paste operations described in bug 359303 are fixed or not, so as to keep the 3 aspects of the problem separate.

Bug 359303 is currently focused on copy/paste and I don't see anything in comment 0 which mentions that, whereas to my layman's technical understanding copy/paste might be different problem from directly typing nbsp, because of clipboard flavors involved.
But as long as you're happy and you know what you're doing, and as long as we address the direct nbsp input problem described in bug 290565, comment 0, before closing that, I guess it's fine with me! ;)
As I said, this bug has no STR and people are just complaining about some code in comment #0 without putting it into context. Surely, that code is the root cause of the problem, so I can rip it out, or elegantly avoid executing it as you suggested. I think the latter is the less painful approach.

The plain text serialiser is used in various circumstances:
FF+TB: When copying HTML, also place a text/plain flavour onto the clipboard --> Bug 359303.
TB: When downgrading HTML to plain text --> Bug 290565.

We can fix both problems separately by adding nsIDocumentEncoder::OutputPersistNBSP in the right spots. We can also deliberate about the "faulty" code here for another six years without ever removing it. I prefer the former.
You need to log in before you can comment on or make changes to this bug.