Copy-Paste of simple HTML adds extra space not present in original content

REOPENED
Assigned to

Status

()

defect
P2
major
REOPENED
10 years ago
Last year

People

(Reporter: mats, Assigned: laurent)

Tracking

(Blocks 1 bug, {qawanted, regression, testcase})

1.9.2 Branch
Points:
---
Dependency tree / graph
Bug Flags:
blocking1.9.2 -

Firefox Tracking Flags

(Not tracked)

Details

()

Attachments

(4 attachments, 1 obsolete attachment)

STEPS TO REPRODUCE
1. Open URL (Testcase #5 in bug 39098)
2. press CTRL+C
3. focus the TEXTAREA
4. press CTRL+V

ACTUAL RESULT
This is a draggable  bit of text.

EXPECTED RESULT
This is a draggable bit of text.

PLATFORMS AND BUILDS TESTED
Bug occurs in Firefox trunk on Linux (and most likely all platforms).
Bug occurs in Firefox 3.6b2pre 20091028
Bug does not occur in Firefox 3.0.x or 3.5.3

Regression window: 2009-04-25-03 -- 2009-04-26-03
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=dfff608ebad0&tochange=b552b1ef8aa0
I'm guessing it's a regression from bug 422403.
Flags: blocking1.9.2?
Blocks: 39098
I discovered this by making the drag-n-drop serialization of the selection
use the same code as the kbd Copy command (in bug 39098).
This made the "content/events/test/test_dragstart.html" mochitest fail:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1256695352.1256705398.7411.gz
It appears the bug occurs both for the text/plain and text/html serialization.
for text/html serialization, it is not a regression. It is just that the new html serializer does a better job for pretty printing. so you have to update that test.

For text/plain, I don't know yet. I didn't modify the plain text serializer with the patch of bug 422403.

I'm investigating on this.
I understood the issue.

nsHTMLCopyEncoder is used for the serialization, and is called into SelectionCopyHelper (nsCopySupport.cpp).

There is a first serialization here http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsCopySupport.cpp#134 of the selection. It returns

<div id="draggable" ondragstart="doDragStartSelection(event)">This is a <em>draggable</em>\n bit of text.</div>

Note the \n added by the serializer, because of the default flag on wrapping in the HTML serializer. As I said before, this \n is added because of the better pretty printing made by the serializer.

Then this string is converted by nsHTMLFormatConverter http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsCopySupport.cpp#153 . This converter remove all html tag, and it seems it replaces all \n by a space, this why we have two spaces between "draggable" and "bit".

A solution is to disable pretty printing in the nsHTMLCopyEncoder. We have to verify if it doesn't break something somewhere.


I notice an other issue in the SelectionCopyHelper function (not related to the bug) : if the selection to copy is not in a plain text context (bIsHTMLCopy = true), then the selection is serialized twice ! The comment for the first serialization says "We always require a plaintext version", and then the kUnicodeMime is passed to the init method of the encoder. However, it is useless, because nsHTMLCopyEncoder always uses text/html, it doesn't use the given mime type  : http://mxr.mozilla.org/mozilla-central/source/content/base/src/nsDocumentEncoder.cpp#1116

In the "if (bIsHTMLCopy)" statement, the selection is serialized a second time. The only difference is the called method: EncodeToString vs EncodeToStringWithContext.

I think we could optimize this piece of code : only do the first serialization when bIsHTMLCopy == false. 

Do you agree ? Olly ? Mats ?

If ok I can provide a patch to fix these two issues.
(In reply to comment #3)
> A solution is to disable pretty printing in the nsHTMLCopyEncoder. We have to
> verify if it doesn't break something somewhere.
Sounds reasonable.


> I notice an other issue in the SelectionCopyHelper function (not related to the
> bug)
I this a regression too? If it is, is it a serious one?
And in any case,  please file a new bug for this issue.
Posted patch patch v1 (obsolete) — Splinter Review
here is the patch, which only changes flags for the encoder for the copy. I added also a test file.

For the second issue, it is not really a regression, just an optimization we can do. I will create a new bug for that and will attach a patch to it.
Assignee: nobody → laurent
Status: NEW → ASSIGNED
Attachment #409067 - Flags: review?(Olli.Pettay)
Comment on attachment 409067 [details] [diff] [review]
patch v1

How does this work with long lines?
Based on the documentation OutputRaw may wrap.
But I'm not sure if the documentation is right, since here you say it doesn't wrap 
https://bugzilla.mozilla.org/show_bug.cgi?id=422403#c7
I don't know which documentation you talk about, but as indicated in the comment for the OutputRaw flag in the nsIDocumentEncoder.idl, using these flag disable the wrapping and the formating. And this is the real behavior not only with the patch of bug 422403, but it was also the previous behavior (gecko 1.9.1 and before).

Of course, with OutputRaw, there isn't any more new \n, but existing \n are kept. so if in the orignal source code there is a line break like this

<div>This is a <em>draggable</em>
 bit of text.</div>

However, with such cases, I think we will still have the issue, (here, a double space between "draggable" and "bit"), because of the behavior of nsHTMLFormatConverter used in nsCopySupport. As I said, it seems it simply replaces \n by a space, after removing html tags. I don't know if this issue in nsHTMLFormatConverter can be fixed (and how, I don't know this part of gecko). 

I don't know if using the plain text serializer instead of nsHTMLFormatConverter could be an other solution...
Ah, oops, I misunderstood the comment in nsIDocumentEncoder.
Flags: blocking1.9.2? → blocking1.9.2+
Priority: -- → P2
Does the patch bring back the exactly same behavior what we have on 1.9.1?
I wonder especially the removal of flags = 0;
Posted file Testcase #1
Attachment #409067 - Flags: review?(Olli.Pettay) → review-
Comment on attachment 409067 [details] [diff] [review]
patch v1

Hmm, did this pass regression tests?
This patch breaks copy from a plain text context (see Testcase #1).
I believe the first value for 'flags' should be:
  nsIDocumentEncoder::OutputPreformatted | nsIDocumentEncoder::OutputRaw
the second should be:
  nsIDocumentEncoder::OutputRaw

(fwiw, that passed regression tests on TryServer together with the patch
in bug 39098)
Posted patch patch v2Splinter Review
I added the OutputPreformatted flag and re introduce "flags = 0". You're right, it was useless to remove both.

In fact, I didn't see the code of nsHTMLCopyEncoder::setSelection, which select a different serializer in some case. It uses the plaintext serializer when the selection to copy is inside an input and textarea or in a preformatted html element. And it uses the html serializer in the other cases. I thought it uses always the html serializer. And so, the second issue I explained is no more valid for some cases.

I updated the unit test file to include your test case.
Attachment #409067 - Attachment is obsolete: true
Attachment #409324 - Flags: review?(matspal)
Comment on attachment 409324 [details] [diff] [review]
patch v2

Looks good, r=mats.  It still breaks the "test_dragstart.html"
test together with the patch in bug 39098, but that bug isn't
for 1.9.2, so I'll add a \n to the expected result on trunk.

As far as I know this is a change in behavior, and should be considered
when approving this patch for 1.9.2.  I don't see a problem with having
the extra newline in HTML-to-HTML copy on 1.9.2, OTOH, I don't see the
benefit from it either.

I'm requesting an explicit approval1.9.2 to consider the above.
Attachment #409324 - Flags: review?(matspal)
Attachment #409324 - Flags: review+
Attachment #409324 - Flags: approval1.9.2?
Attachment #409324 - Flags: superreview?(Olli.Pettay)
Attachment #409324 - Flags: superreview?(Olli.Pettay) → superreview+
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment on attachment 409324 [details] [diff] [review]
patch v2

(this is a blocker, doesn't need explicit approval, please land ASAP)
Attachment #409324 - Flags: approval1.9.2?
Jonas, any thoughts on whether we'll be better off with or without this patch for 1.9.2?
Wow, this bug is confusing for someone not familiar with our serializers :)

A few questions:
1. So the problem is that when copying, we would first prettyprint the selection, and then concatenate all the textnodes in the result of the prettyprint? And the fix is to not prettyprint but rather *just* concatenate the existing textnodes?

2. Does this patch make us behave as we did in 1.9.1 in all cases? Comment 13 seems to indicate "no". If the answer is "no", then when and how do we differ?

3. If this patch is correct, then why does it combined with the patch in bug 39098 break a existing testcase? Is that a problem with this patch, the patch in bug 39098, or a problem with the testcase?
1) selection inside input/textarea is serialized with the plaintext serializer (so concatenation of textnode + pretty printing), otherwise, it is serialized with the html serializer. And in this case, tags are then removed and \n are replaced by spaces (by using the html converter) to have a text/plain version in the clipboard.

The new html serializer does a better job on wrapping for the pretty printing, so in the example, we have a new \n. The problem is that the \n are replaced by a space, by the html converter. So a solution is to not do pretty printing during the serialization.

2) Yes, for most of case. 

3) No, Mats just changed the way the drag-n-drop serialization is made, by using the same code as the kbd Copy command, so it uses the html serializer, and because of the new wrapping behavior, it changes a test, and  then Mats discovers the bug 524975.

I think we should land my patch to 1.9.2, since kbd copy command doesn't behave anymore as in 1.9.1.
(In reply to comment #18)
> 1) selection inside input/textarea is serialized with the plaintext serializer
> (so concatenation of textnode + pretty printing), otherwise, it is serialized
> with the html serializer. And in this case, tags are then removed and \n are
> replaced by spaces (by using the html converter) to have a text/plain version
> in the clipboard.
> 
> The new html serializer does a better job on wrapping for the pretty printing,
> so in the example, we have a new \n. The problem is that the \n are replaced by
> a space, by the html converter. So a solution is to not do pretty printing
> during the serialization.

Ok, so it sounds like the answer is simply "yes" to my question? Or more precisely "yes. Except when copying from input/textarea, in which case none of the code relevant to this bug is used".

Is that correct?

> 2) Yes, for most of case. 

Well, that's a "no" then :). So in which cases are we now different from 1.9.1?

> 3) No, Mats just changed the way the drag-n-drop serialization is made, by
> using the same code as the kbd Copy command, so it uses the html serializer,
> and because of the new wrapping behavior, it changes a test, and  then Mats
> discovers the bug 524975.

That still doesn't answer the question: If mats patch were to land, is the problem in his patch, in your patch or in the test. It sounds like you're saying that it's in the test?


Another important question here is does this patch change any behavior that webpages see? I.e. is it possible that this patch will break (or fix) websites? Or is there a risk that it will?
let's summary. There are two issues:

1) this current bug: "copy-paste of simple HTML adds extra space not present in original content". This is clearly a regression after my work on the serializer (bug 422403). I already explains why in previous comments. So  should be landed in 1.9.2 (IMHO) since 1.9.2 has the new serializer. 

However, with the patch I provided on this bug 524975, Copy-paste will behave in a different manner for HTML content. 1.9.1: it was serialized with pretty printing flags. With my patch: without pretty printing. I don't think this difference is important, at least for the text/html flavor for the clipboard.

2) the test on drag-n-drop. The problem is in the test. Since mats change the code of the serialization for drag-n-drop (and then the result of the serialization, as I explained), he has to adapt the test.


> Another important question here is does this patch change any behavior that
webpages see?

Clearly, it doesn't.

However, I just found a regression with my patch: in fact, copying HTML produces now non pretty printing, not only for the text/html flavor, but also for text/unicode flavor. So copying this DOM:

bla
<ul>
  <li>foo</li>
  <li>bar</li>
</ul>

When we paste it, we have

bla
foo
bar

instead of 

bla
  * foo
  * bar

like in gecko 1.9.1.

I reopen the bug. (or should I create a new bug ?)
I'm working on a new patch.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #20)
> let's summary. There are two issues:
> 
> 1) this current bug: "copy-paste of simple HTML adds extra space not present in
> original content". This is clearly a regression after my work on the serializer
> (bug 422403). I already explains why in previous comments. So  should be landed
> in 1.9.2 (IMHO) since 1.9.2 has the new serializer. 
> 
> However, with the patch I provided on this bug 524975, Copy-paste will behave
> in a different manner for HTML content. 1.9.1: it was serialized with pretty
> printing flags. With my patch: without pretty printing. I don't think this
> difference is important, at least for the text/html flavor for the clipboard.

My gut reaction is that the behavior with your patch is better, yes. However it's a scary change this late in the game.

> 2) the test on drag-n-drop. The problem is in the test. Since mats change the
> code of the serialization for drag-n-drop (and then the result of the
> serialization, as I explained), he has to adapt the test.

Ok, if the problem is in the test then that's all good. Let's never talk about it in this bug any more :) I only care about bugs in the code, not in the tests.

> However, I just found a regression with my patch: in fact, copying HTML
> produces now non pretty printing, not only for the text/html flavor, but also
> for text/unicode flavor. So copying this DOM:
> 
> bla
> <ul>
>   <li>foo</li>
>   <li>bar</li>
> </ul>
> 
> When we paste it, we have
> 
> bla
> foo
> bar
> 
> instead of 
> 
> bla
>   * foo
>   * bar
> 
> like in gecko 1.9.1.
> 
> I reopen the bug. (or should I create a new bug ?)
> I'm working on a new patch.

This scares me, a lot. This would have been a bad regression to ship with.

I think given how late in the game this is, I don't think we should take anything on the branch. These changes are better done early in the cycle.

For example while not prettyprinting copied HTML seems like a good idea, it's a bigger change than i'm not comfortable with right before release.

So my recommendation is to not take this for the 1.9.2 release. *Possibly* it's something to consider for 1.9.2.x.


This leaves us with the extra space regression. Either we can leave things as is now and consider fixing it in 1.9.2.x, or we can consider backing out bug 422403 on the branch. Do you have an opinion?
Ok. I'm agree to leave the 1.9.2 branch as its current state. I think it's a good idea to not land the patch in it, since it introduces a more critical regression (described in my previous comment) than the extra space bug IMHO. And I will have more time to provide a better patch for bug 524975 :-)

Backing out bug 422403 will be harder, because other changes have been made on the serializer after the check in of the html5 parser. I prefer to spend time on the patch for bug 524975 instead of on the backout of bug 422403 (which is a huge patch).
blocking1.9.2- per the above comments, we'll work on getting this fixed on trunk for the next release! Thank you Laurent for your contributions here!
Flags: blocking1.9.2+ → blocking1.9.2-
Laurent, do you think you would have time to work on an updated patch for this bug?
sorry for the delay. I'm working on it this week.
My apologizes, it seems in fact there is no regression. Perhaps my old builds were bad. I can't reproduce the regression I found. I added some tests to verify that copy paste of a piece HTML produce a well formated text in text/plain. See this patch.
Attachment #424228 - Flags: review?(Olli.Pettay)
The full patch for the 1.9.2 branch.
Attachment #424229 - Flags: superreview?(Olli.Pettay)
Attachment #424229 - Flags: review?(Olli.Pettay)
Attachment #424229 - Flags: approval1.9.2.1?
Attachment #424228 - Flags: review?(Olli.Pettay) → review+
So why would we take the patch for 1.9.2?
This looks like the underlying issue is exactly what bug 549295 is about, which is that the behavior of not passing any wrapping flags at all was totally changed from "wrap if the text is long" to "wrap all the time"....
Laurent, does the patch fix bug 549295 on FF 3.6? Or vice-versa?
no the patch doesn't fix bug 549295. But the patch of bug 549295 probably fixes (or at least, helps to fix) bug 524975. I have to verify it.
Comment on attachment 424229 [details] [diff] [review]
Patch for the 1.9.2 branch

Please renominate when the reviews are complete.
Attachment #424229 - Flags: approval1.9.2.2? → approval1.9.2.2-
checked in additionnal tests http://hg.mozilla.org/mozilla-central/rev/e9e8eaec0cbb
(In reply to comment #31)
> no the patch doesn't fix bug 549295. But the patch of bug 549295 probably fixes
> (or at least, helps to fix) bug 524975. I have to verify it.

Any news here?
As Boris said, the regression comes from the issue described in bug 549295, and the patch of bug 549295 fix the current bug, even without the attachement 409324 already landed into the trunk.

However I think the real issue is in the html converter used by SelectionCopyHelper, when it replaces all LF by a space. If a LF is preceded by a space, it should not replace it by a space, but only remove it. However I don't know how the html convert works, so perhaps I say silly things :-)
Depends on: 549295
Should we back out attachment 409324 [details] [diff] [review] from trunk, if the real regression fix is
bug 549295?
well, attachment 409324 [details] [diff] [review] just add a flag for the encoder. I'm not sure that this flag does some regression, so we can keep it... In fact, now with the patch for bug 549295, the real question is : do we need to wrap on long lines (old behavior before the new serializer) or not, for serialized content for the clipboard. If yes, we should revert the change.

However, if we revert this change, we should keep the test I added, and I can remove the added flag directly with the patch for bug 549295.
We should have the old behavior, unless there is some very good reason
not to.
Ok I will remove the flag in my patch for bug 549295
Attachment #424229 - Flags: superreview?(Olli.Pettay)
Attachment #424229 - Flags: review?(Olli.Pettay)
Blocks: 567362
I'm not sure if this is the most relevant bug to my issue; searching bugzilla for "copy paste space" (without the quotes) returns several bugs.  I'm seeing this behavior in Thunderbird 3.0.4/Linux, whereas in Tbird 2.x, the bug was not present.

What happens is, when you select the last word in a paragraph, you need to be careful not to select "too far" past the end of the word (into the empty area to the right of the word).  If you go too far, your selection will not visibly contain any whitespace; but when you copy & paste it, a trailing space appears in the pasted text.  This also happens when you double-click to select the word, as long as it's the last word in the paragraph (and there's no punctuation after it).  This happens even for plain-text messages (content-type: text/plain).

Firefox does a similar thing if the word you're double-clicking is followed by a newline in the source code, as opposed to being immediately followed by an HTML tag.  You can plainly see that there's no space in the visible selection, yet upon pasting, it's there.

From the comments here about serializers and test cases and regressions, and the several other linked bugs, this seems like a complex issue.  I hope my report helps, or is at least relevant; please let me know if there's a better bug for me to look at instead.
Finally, I forgot to remove the flag in the patch for bug 549295.

And because of this  nsIDocumentEncoder::OutputRaw flag, it causes some bugs like bug 572543.

I really should remove this flag :-) Patch soon.
Blocks: 572543
I have the same issue with Firefox 4.0.1: Firefox adds an invisible space if there is a newline between the last word and a tag (as in "word&#10;<div>") or if there is a space after the word (as in "word <br/>".

This is really annoying if you copy passwords since you can't see the additional space character when you paste.
Depends on: 723163
Any news ? I have this bug
Still had this issue with Firefox 28.0.
Ironically, the workaround was to add &nbsp; after the word and then it worked fine.
Persists to FF 29
Should be fixed by bug 333064?
Flags: needinfo?(mats)
I think it must have been fixed earlier than that.
I can't reproduce this bug using FF46 on Linux.

QA: please test this bug on all platforms; resolve as worksforme if OK.  Thanks.
Flags: needinfo?(mats)
Keywords: qawanted
Persists to FF 46.0.1 on OSX 10.10
To reproduce:

1. Open this issue, https://bugzilla.mozilla.org/show_bug.cgi?id=524975 , in a new private window (so that you're not authenticated)
2. Scroll down to this comment
3. Double-click my name ("aaron"), to highlight the entire thing
4. copy it (either ctrl-C or right-click, copy)
5. paste it somewhere and observe the leading space character, not present in the HTML (which is like <span>aaron</span>)

See video documentation: https://www.youtube.com/watch?v=HI1LqAUaMrI
So the DOM range we get when double-clicking the name per STR in comment 49 :
let range = window.getSelection().getRangeAt(0)
range.commonAncestorContainer is this:

<span class="bz_comment_user"><a href="user_profile?user_id=497669">
  <img src="https://secure.gravatar.com/avatar/d89dd360c498c1bf4ca9d9a01d08596a?d=mm&amp;size=64" width="32" height="32" align="middle" border="0"></a>
          <span class="vcard vcard_497669"><span class="ln">aaron</span>
</span>
       </span>

range.startContainer is the text node before the vcard_497669 span.
range.startOffset is 1.
range.endContainer is the text node after the vcard_497669 span (i.e. the newline).
range.endOffset is 0.

So, given that range, copy-paste should actually result in " aaron".

The error seems to be that the double-click included the space in
the text node before the vcard_497669 span.  It shouldn't do that,
at least not on platforms that excludes the surrounding space when
double-clicking words in plain text.

I'm pretty sure this is a different bug than what I originally reported
though, but let's keep the bug open for now until we know if the latter
bug is on file.
I can confirm the problem described in comment #49. I'm using FF 47 on Linux (Fedora 23).
Happens to me as well. Pasting into a terminal adds 0~ and 1~ around the copied text. Really **** behavior. Never using Firefox again.
This appears to be fixed
FF 51.0.1
OSX 10.10.5
(In reply to aaron from comment #53)
> This appears to be fixed
> FF 51.0.1
> OSX 10.10.5

Nope, sorry, spoke too soon.
It is not fixed in latest 51.0.1, OSX 10.10.5
You need to log in before you can comment on or make changes to this bug.