Closed Bug 1509102 Opened 6 years ago Closed 6 years ago

Serialising a tag attribute with a newline in the string attribute value results in 
 resulting in broken roundtrip - STR: comment #3

Categories

(Core :: DOM: Serializers, defect)

60 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla65
Tracking Status
firefox65 --- fixed

People

(Reporter: heine.svendsen, Assigned: bzbarsky)

References

Details

Attachments

(1 file, 2 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0

Steps to reproduce:

Every time I edit a message these characters  &#xA  are added to for instance <td style>
This has started with the latest version Thunderbird 60.3.1 32-bit.
And only when ThunderHTMLedit is installed and active.
Just opening an email for editing and then closing it again (with no edits), makes Thunderbird ask me, if I want to save the changes to the email.
For every save, ThunderHTMLedit (or Thunderbiord?) doubles the instances of "&#xA" added, so it quickly becomes a bit of a problem, even though these extra characters does not affect the apperance of the email.


Actual results:

This:
    <table style="font-size: 12px; font-family: Verdana,
      Arial, Helvetica; box-shadow: #999999 0px 0px 10px 0px;
      border: 0px; solid; #a0a0a0;" max-width="600"
      cellspacing="0" cellpadding="0" border="0" align="center">

Quickly becomes like this:
    <table style="font-size: 12px; font-family: Verdana,&#xA;&#xA;
      Arial,&#xA;&#xA; Helvetica; box-shadow: #999999 0px 0px 10px&#xA;
      0px;&#xA; border: 0px&#xA; solid&#xA; #a0a0a0;" max-width="600"
      cellspacing="0" cellpadding="0" border="0" align="center">
(notise the   &#xA;



Expected results:

Nothing should have happened, since I did not change anything
See Also: → 1508591
OK, most likely we're looking at a Core::Editor bug here that the Mozilla team will need to fix. It's not limited to tables, it happens for all tag attribute strings broken across lines.

So please:
- Let's not talk about my add-on (yes, it was updated)
- Let's not talk about any other Thunderbird issues.
I'm obsoleting all comments which distract from the real problem.

We have a very valuable contributor who can find where a problem started occurring, so for him we need very simple steps to reproduce the problem.

So here goes. STR:

Create a new HTML message. Click into the body. Form the menu: Insert > HTML. Insert this:
<div style="font-size: 12px; font-family: Verdana,
  Arial, Helvetica;">
</div>
Save the message as draft, inspect the message source.

Bad:
<div style="font-size: 12px; font-family: Verdana,&#xA; Arial,
   Helvetica;">
</div>
Notice the unwanted &#xA;

Alice, when did this start? TB 52 doesn't do that. Can you find the regression range for us.
Flags: needinfo?(alice0775)
Summary: The latest v. 60.3.1 adds &#xA to <td style> and more, when I edit with ThunderHTMLedit plugin → Serialising a tag attribute with a newline in the string attribute value results in &#xA; - STR: comment #3
See Also: 1508591
Thanks Alice, that looks like this changeset caused it:
fe4514c84e7a Ashish Kulkarni - Bug 169521: fix XML attribute serialization for proper roundtripping r=bz
https://hg.mozilla.org/mozilla-central/rev/fe4514c84e7a#l1.32

Boris and Ashish: Looks like serialising now returns &#xA; which causes strive in TB. As you can see in comment #3, the round trip doesn't actually work, since the line is now broken after the
  style="font-size: 12px; font-family: Verdana,&#xA; Arial,
which will lead to another &#xA; being inserted.
Flags: needinfo?(masayuki) → needinfo?(kulkarni.ashish)
For e-mail the roundtrip is this:

Start with a text file (the e-mail) and parse this into a DOM. Display that DOM in an editor. When editing is done, serialise DOM back to text file. The latter seems to add the &#xA; now, but the former doesn't remove them.

But that's not all. For "sanitising" HTML e-mail, we actually add another full round trip here:
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/mime/src/mimeTextHTMLParsed.cpp#91

That code was only added in May 2018 here
https://hg.mozilla.org/comm-central/rev/96fab4a2b811
but it aggravates the problem that started in February 2018.

Adding some debug I see:

Read from the draft/file:
    <div style="font-size: 12px; font-family: Verdana,&#xA; Arial,
      Helvetica;">
(where the &#xA; already resulted from a previous serialisation of the DOM when writing the draft).

After the roundtrip in MimeInlineTextHTMLParsed_parse_eof(), that is, ParseFromString() and EncodeToString():
    <div style="font-size: 12px; font-family: Verdana,&#xA; Arial,&#xA;      Helvetica;">
(note the six spaces before Helvetica are still there).

So I guess there is a bug in the DOMParser which should have undone the inserted &#xA;

NI'ing Henri for the DOM parser here.
Component: General → DOM
Flags: needinfo?(hsivonen)
Flags: needinfo?(bzbarsky)
Product: Thunderbird → Core
Summary: Serialising a tag attribute with a newline in the string attribute value results in &#xA; - STR: comment #3 → Serialising a tag attribute with a newline in the string attribute value results in &#xA; Parsing doesn't remove &#xA; resulting in broken roundtrip - STR: comment #3
Version: 60 → 60 Branch
Status: UNCONFIRMED → NEW
Ever confirmed: true
Looking at this again, I don't think we want to touch the DOM parser here. If bug 169521 was about serialising XML, why is HTML affected? I guess that's the real bug here. Moving this to Core::Serializers is of course dangerous since that's un-owned :-(

The quick fix for TB will be a backout of bug 169521 from TB's branch on mozilla-esr60.
Component: DOM → Serializers
> If bug 169521 was about serialising XML, why is HTML affected?

nsHTMLContentSerializer inherits from nsXHTMLContentSerializer which inherits from nsXMLContentSerializer.  nsHTMLContentSerializer::SerializeHTMLAttributes calls nsXMLContentSerializer::SerializeAttr which does the serialization.  Shouldn't be a problem per se, in general.

> So I guess there is a bug in the DOMParser which should have undone the inserted &#xA;

Testing in the Firefox console, 

  (new DOMParser()).parseFromString("<div foo='a&#xA;b'>", "text/html").querySelector("div").getAttribute("foo")

shows a string with a newline in it, which means the &#xA; got parsed into a newline, as expected.  Why exactly do you think there's an issue with DOMParser here?

As far as I can tell, all the core bits are working fine.  The serializer outputs an entity reference.  The parser converts it to a newline character.  This is all how it should work,.

It's true that if you start with raw newlines and then go through this process they will get converted to &#xA;.  The DOM representation has no way to tell whether the original source had raw newlines or &#xA;, so you will get roundtrip failures no matter what you do here...  I suppose we could restrict the behavior from bug 169521 to XML, in which case HTML with &#xA; on disk will fail to round-trip.

None of that accounts for why the number of newlines keeps increasing, as reported above.  That presumably depends on the exact serialization flags Thunderbird is using.  When you land in nsXMLContentSerializer::SerializeAttr for the relevant attribute, which of the codepaths do you end up following in Thunderbird?  Is it AppendToStringConvertLF or AppendToStringFormatedWrapped or AppendToStringWrapped?  I am guessing that what happens is that the newline is converted to &#xA, but then Thunderbird's serialization adds an extra newline anyway to wrap things nicely or some such.  The next time it's parsed (as HTML) the &#xA; becomes a newline, and the newline that was added is still there, so you get two newlines, etc.
Flags: needinfo?(bzbarsky)
Flags: needinfo?(jorgk)
Hmm, as I said in comment #7, the problem is already visible in about 15 lines of code here:
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/mime/src/mimeTextHTMLParsed.cpp#91-107

Original 'rawHTML' is
    <div style="font-size: 12px; font-family: Verdana,&#xA; Arial,
      Helvetica;">
(where the &#xA; already resulted from a previous serialisation of the DOM when writing the draft).

'parsed' comes back as:
<div style="font-size: 12px; font-family: Verdana,&#xA; Arial,&#xA;      Helvetica;">

So it looks like the serialisation added the &#xA; after "Arial" when it didn't before and the parser didn't remove it.

I'll answer your questions in a minute.
Flags: needinfo?(jorgk)
OK, when saving the draft, so NOT the code quote above, we come through here:
  if (mDoRaw || PreLevel() > 0) {
    NS_ENSURE_TRUE(AppendToStringConvertLF(attrString, aStr), false);
    printf("=== (1)\n");
  }
  else if (mDoFormat) {
    NS_ENSURE_TRUE(AppendToStringFormatedWrapped(attrString, aStr), false);
    printf("=== (2)\n");  <======
  }
  else if (mDoWrap) {
    NS_ENSURE_TRUE(AppendToStringWrapped(attrString, aStr), false);
    printf("=== (3)\n");
  }
  else {
    NS_ENSURE_TRUE(AppendToStringConvertLF(attrString, aStr), false);
    printf("=== (4)\n");
  }

Saving the draft is most likely
https://searchfox.org/comm-central/rev/aa357565819d38429d69190fbcc80f149c6488fe/mailnews/compose/src/nsMsgSend.cpp#1493
which uses nsIDocumentEncoder::OutputFormatted. There the &#xA; is already undesired.

Displaying the message, so the MIME code I quoted, doesn't appear to trigger that block.
(In reply to Jorg K (GMT+1) from comment #11)
> Displaying the message, so the MIME code I quoted, doesn't appear to trigger
> that block.
Correct, it instead triggers this block just above:
  if (aDoEscapeEntities) {
    printf("=== (0)\n");  <====

I hope comment #11 was clear enough, the code path printing "=== (2)\n" was executed. All that changed somehow :-(
If I revert rev fe4514c84e7a everything becomes good again. The saved draft doesn't contain the &#xA; and the code mimeTextHTMLParsed.cpp#91-107 shows:
    <div style="font-size: 12px; font-family: Verdana, Arial,
      Helvetica;">
and then unchanged:
    <div style="font-size: 12px; font-family: Verdana, Arial,
      Helvetica;">
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #9)
> None of that accounts for why the number of newlines keeps increasing, as
> reported above.
Actually, it does. If saving a draft adds a &#xA; and then editing the draft breaks the line elsewhere, then anther one is added.

Does that give enough information to work with? I think I'll just backout rev fe4514c84e7a for TB 60.
Flags: needinfo?(bzbarsky)
> So it looks like the serialisation added the &#xA; after "Arial" when it didn't before

More precisely, it replaced the \n with &xA;.  But it might have added _another_ \n.

> the code path printing "=== (2)\n" was executed

This is the part I was looking for.  That's definitely going to add newlines.

It looks like we have two different kAttrEntities bits, one in the HTML serializer one in the XML one.  The HTML one is only used for the nsIDocumentEncoder::OutputEncodeBasicEntities case, and encodes quot/amp/lt/gt/nbsp.  The XML one used to only do the first 4 of those until the changes in bug 169521.

The right fix here is probably to introduce an nsXMLContentSerializer method that represents most of the work of  nsXMLContentSerializer::AppendAndTranslateEntities but takes an entity table argument (or probably an entity table, a string table, and a max value).  Then we could call it from the HTML serializer with the HTML entity tables.

Writing that fix is easy; the hard part is writing the tests.  What cases does Thunderbird actually care about here?  Can you create automated tests for them (which presumably fail on trunk right now)?
Blocks: 169521
Flags: needinfo?(bzbarsky) → needinfo?(jorgk)
Reading my comments above, they are a bit convoluted, so I'll restate them here:

Displaying a message runs
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/mime/src/mimeTextHTMLParsed.cpp#91-107
and this
  if (aDoEscapeEntities) {
    printf("=== (0)\n");  <====
printing "=== (0)\n".

Saving a draft (most likely) runs this
https://searchfox.org/comm-central/rev/aa357565819d38429d69190fbcc80f149c6488fe/mailnews/compose/src/nsMsgSend.cpp#1493
and prints "=== (2)\n". You confirmed that that adds "newlines", umm, you mean those &#xA;?

I'm also interested in round trip in mimeTextHTMLParsed.cpp#91-107.
    <div style="font-size: 12px; font-family: Verdana, Arial,
      Helvetica;">
gets turned into
    <div style="font-size: 12px; font-family: Verdana, Arial,&#xA;      Helvetica;">
but maybe that doesn't matter. I noticed that running our "Edit Draft" command doesn't trigger any debug output, so loading the draft again doesn't seem to add another &#xA;. If it did, saving and edit/save cycle would add two &#xA;, but only one gets added when saving.

As for writing tests: There are pure C++/gTests in M-C, so one of those could be added. We can of course also write tests to check message content during a save/edit again cycle.

What we care about is that no additional &#xA; get saved to disk, my add-on ThunderHTMLedit which started this bug, also uses the editor's outputToString() as the send does, so that also goes through (2) and shows undesired &#xA;

Personally I'd also like to understand what has changed on the (0) code path.
Flags: needinfo?(kulkarni.ashish)
Flags: needinfo?(jorgk)
Flags: needinfo?(hsivonen)
> umm, you mean those &#xA;?

No, I mean actual newlines.

> gets turned into

Right, that part is expected given the current setup.  Whether it's desirable, I don't know.

> There are pure C++/gTests in M-C, so one of those could be added.

Yes, I'm aware of what the general idea is.  I just don't have time myself to figure out what behaviors Thunderbird depends on here and write tests for them.

> What we care about is that no additional &#xA; get saved to disk

But you don't care if &#xA; that are actually in the files get lost, right?

> Personally I'd also like to understand what has changed on the (0) code path.

You mean other than bug 169521?
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #17)
> > What we care about is that no additional &#xA; get saved to disk
> But you don't care if &#xA; that are actually in the files get lost, right?
Well, ideally a roundtrip compose/save draft/edit draft would leave us where the started as it has always been. I can write a test for that in an hour (or less).

> > Personally I'd also like to understand what has changed on the (0) code path.
> You mean other than bug 169521?
No, I meant from that bug. Before you stated that the (2) code path (save draft) would add newlines (which later are turned into &#xA;?). 

Right now you commented on the (0) path, the mimeTextHTMLParsed.cpp#91-107 used to display a message:
> > gets turned into
> Right, that part is expected given the current setup.  Whether it's desirable, I don't know.

so it looks like you figured out where the extra &#xA; comes from for both, right?
> would leave us where the started as it has always been.

It hasn't been.  The pre-60 behavior was that &#xA; in the source file would get converted to \n on save.

> No, I meant from that bug.

What changed is that \n gets converted to &#xA;.

> so it looks like you figured out where the extra &#xA; comes from for both, right?

Look, you start with a \n.  You parse it, you now have a \n in the DOM.  You go to serialize, convert the \n to &#xA;.  The serializer then tries to output that string, sees that it's going to be a really long line, and inserts a newline (\n).  This step used to not happen because there was already a \n there, but now that's been converted to &#xA;.  So you started with "\n" and by the time you save you are saving "&#xA;\n".

Then you parse it again, the DOM now has "\n\n".  You serialize, you have "&#xA;&#xA;" and the serializer helpfully breaks your long line so now you save "&#xA;&#xA;\n".  And so forth.
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #19)
> It hasn't been.  The pre-60 behavior was that &#xA; in the source file would
> get converted to \n on save.
Alright. Well, no one complained about that. In fact, bug 1508591 comment #0, appear to suggest that TB used to remove "bad" &#xA; when it found them.

> What changed is that \n gets converted to &#xA;.
OK, that the mimeTextHTMLParsed.cpp#91-107 bit:
Feed this
    <div style="font-size: 12px; font-family: Verdana, Arial,\n      Helvetica;">
to the parser and get this when serialising it "unformatted" again:
    <div style="font-size: 12px; font-family: Verdana, Arial,&#xA;      Helvetica;">
Here is doesn't matter since the longer line will eventually be parsed again and then the \n goes back into the DOM.

> Look, you start with a \n.  You parse it, you now have a \n in the DOM.  You
> go to serialize, convert the \n to &#xA;.  The serializer then tries to
> output that string, sees that it's going to be a really long line, and
> inserts a newline (\n).  This step used to not happen because there was
> already a \n there, but now that's been converted to &#xA;.  So you started
> with "\n" and by the time you save you are saving "&#xA;\n".

I guess, finally the penny dropped ;-) - As you've already stated in comment #9, the parsing bit is OK, in the DOM there are always \n.

So if we can get a fix here, I can write a test.

Note the following unrelated item: You kindly provided this API
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/mime/src/mimeTextHTMLParsed.cpp#94
once upon a time for use in
https://searchfox.org/comm-central/rev/efacd58fa63199a34709dc95b05bb9ab56d45f13/mailnews/import/winlivemail/nsWMUtils.cpp#140
and that has since found use on the main code path for message display. Sadly we never wrote a test for it, but I guess TB would fall apart if that stopped working.
Summary: Serialising a tag attribute with a newline in the string attribute value results in &#xA; Parsing doesn't remove &#xA; resulting in broken roundtrip - STR: comment #3 → Serialising a tag attribute with a newline in the string attribute value results in &#xA; resulting in broken roundtrip - STR: comment #3
https://hg.mozilla.org/releases/mozilla-esr60/rev/9f4280a52132bd8335d4833304e4bb7ee4d7c07d
Backed out part of changeset fe4514c84e7a (bug 169521 for causing bug 1509102) to build Thunderbird 60.3.2. a=jorgk
(In reply to Jorg K (GMT+1) from comment #20)
> I guess, finally the penny dropped ;-) - As you've already stated in comment
> #9, the parsing bit is OK, in the DOM there are always \n.
> 
> So if we can get a fix here, I can write a test.

I'm willing to work on a fix if you add a test :-)
Jorg, does that patch work?

Ashish, sorry, I had that written before I saw comment 22....
Flags: needinfo?(jorgk)
No worries, great that you've already done it!
Comment on attachment 9027792 [details] [diff] [review]
Stop converting \n to &#xA; in attribute values when serializing HTML

Review of attachment 9027792 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Boris, yes, the patch works fine.

::: dom/base/nsHTMLContentSerializer.cpp
@@ +472,5 @@
>    }
>  
> +  // We don't want to call into our superclass 2-arg version of
> +  // AppendAndTranslateEntities, because it wants to encode more characters
> +  // than we do.  Use our tables, but avoir encoding &nbsp; by passing in a smaller max index.

typo: avoid.

::: dom/base/nsXMLContentSerializer.h
@@ +191,5 @@
> +   *
> +   * The first integer template argument, which callers need to specify
> +   * explicitly, is the index of the last entry in aEntityTable that should be
> +   * considered for encoding as an entity reference.  The second integer
> +   * argument will be deduced rom the actual table passed in.

typo: from.
Attachment #9027792 - Flags: feedback+
Flags: needinfo?(jorgk)
Attachment #9027792 - Attachment is obsolete: true
Assignee: nobody → bzbarsky
Status: NEW → ASSIGNED
Comment on attachment 9028005 [details] [diff] [review]
Stop converting \n to &#xA; in attribute values when serializing HTML

Review of attachment 9028005 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/nsHTMLContentSerializer.cpp
@@ +473,5 @@
>  
> +  // We don't want to call into our superclass 2-arg version of
> +  // AppendAndTranslateEntities, because it wants to encode more characters
> +  // than we do.  Use our tables, but avoid encoding &nbsp; by passing in a
> +  // smaller max index.

This comment is good, but I think it would be a bit better to clarify further what characters would be encoded in the two main branches in this function (the OutputEncodeBasicEntities case vs this one.)  I'd appreciate if you could add a comment to help clarify that a bit.

::: dom/base/nsXMLContentSerializer.h
@@ +214,5 @@
> +
> +  /**
> +   * Max index that can be used with some of our entity tables.
> +   */
> +  static const uint16_t kGTVal = 62;

Now that you're defining this, https://searchfox.org/mozilla-central/rev/d35199ef15ffa57d190886e20d5df6471faf0870/dom/base/nsXMLContentSerializer.cpp#1178 is unneeded, right?  Can you please get rid of it here as well?
Attachment #9028005 - Flags: review?(ehsan) → review+
> I'd appreciate if you could add a comment to help clarify that a bit.

Done.

> Can you please get rid of it here as well?

The patch does that, yes...
Jorg, I would really like to have some tests here before checking this in...  Do you think you can do that?
Flags: needinfo?(jorgk)
Hmm, I was thinking more of tests in TB for the round trip. I could try some tests for M-C too, but in a follow-up. Or do do you want to wait? I'm pretty busy (still) chasing regressions in TB 60.x, this is just one of them.
Flags: needinfo?(jorgk)
I'd like to have m-c tests, since this is all inside m-c serializers.  

I can wait, especially because I want to make sure that the tests fail without the patch and pass with before I check in.  Do you have an approximate timeframe?
Flags: needinfo?(jorgk)
Hmm, a few days, maybe until the weekend. The other bug landed this:
https://hg.mozilla.org/mozilla-central/rev/fe4514c84e7a#l2.13

Would it be much more than
let input = "<div foo='a&#xA;b'>";
let doc  = (new DOMParser()).parseFromString(input, "text/html");
let encoder = Cu.createDocumentEncoder("text/html");
encoder.init(doc, "text/html", 0);
let output = encoder.encodeToString();

Then some check that input.replace("&#xA;", "\n") is pretty much like output?

What steps do you have in mind? And where to best put such a test?
Flags: needinfo?(jorgk)
Attached patch With testsSplinter Review
Attachment #9028005 - Attachment is obsolete: true
Pushed by bzbarsky@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8fc810b6e528
Stop converting \n to &#xA; in attribute values when serializing HTML.  r=ehsan
Backout by cbrindusan@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7e8744eac2ca
Backed out changeset 8fc810b6e528 for xpcshell failures from test_serializers_entities.js. CLOSED TREE
Ugh, Windows newlines different from everyone else's newlines...  I guess now I get to do some Windows try runs.
I can take care of it.
I already have a fix running on try, fwiw.
Both pass here (on Windows), thanks again!
Flags: needinfo?(bzbarsky)
Pushed by bzbarsky@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9e3b1b073d5e
Stop converting \n to &#xA; in attribute values when serializing HTML.  r=ehsan
https://hg.mozilla.org/mozilla-central/rev/9e3b1b073d5e
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Resolved issue for us after the update: https://www.thunderbird.net/en-US/thunderbird/60.3.2/releasenotes/

My initial bug: 1508591

Thank you
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: