Serialising a tag attribute with a newline in the string attribute value results in 
 resulting in broken roundtrip - STR: comment #3

RESOLVED FIXED in Firefox 65

Status

()

defect
RESOLVED FIXED
5 months ago
5 months ago

People

(Reporter: hs, Assigned: bzbarsky)

Tracking

60 Branch
mozilla65
Points:
---

Firefox Tracking Flags

(firefox65 fixed)

Details

Attachments

(1 attachment, 2 obsolete attachments)

(Reporter)

Description

5 months ago
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0

Steps to reproduce:

Every time I edit a message these characters  &#xA  are added to for instance <td style>
This has started with the latest version Thunderbird 60.3.1 32-bit.
And only when ThunderHTMLedit is installed and active.
Just opening an email for editing and then closing it again (with no edits), makes Thunderbird ask me, if I want to save the changes to the email.
For every save, ThunderHTMLedit (or Thunderbiord?) doubles the instances of "&#xA" added, so it quickly becomes a bit of a problem, even though these extra characters does not affect the apperance of the email.


Actual results:

This:
    <table style="font-size: 12px; font-family: Verdana,
      Arial, Helvetica; box-shadow: #999999 0px 0px 10px 0px;
      border: 0px; solid; #a0a0a0;" max-width="600"
      cellspacing="0" cellpadding="0" border="0" align="center">

Quickly becomes like this:
    <table style="font-size: 12px; font-family: Verdana,&#xA;&#xA;
      Arial,&#xA;&#xA; Helvetica; box-shadow: #999999 0px 0px 10px&#xA;
      0px;&#xA; border: 0px&#xA; solid&#xA; #a0a0a0;" max-width="600"
      cellspacing="0" cellpadding="0" border="0" align="center">
(notise the   &#xA;



Expected results:

Nothing should have happened, since I did not change anything
Comment hidden (obsolete)

Updated

5 months ago
See Also: → 1508591
Comment hidden (obsolete)

Comment 3

5 months ago
OK, most likely we're looking at a Core::Editor bug here that the Mozilla team will need to fix. It's not limited to tables, it happens for all tag attribute strings broken across lines.

So please:
- Let's not talk about my add-on (yes, it was updated)
- Let's not talk about any other Thunderbird issues.
I'm obsoleting all comments which distract from the real problem.

We have a very valuable contributor who can find where a problem started occurring, so for him we need very simple steps to reproduce the problem.

So here goes. STR:

Create a new HTML message. Click into the body. Form the menu: Insert > HTML. Insert this:
<div style="font-size: 12px; font-family: Verdana,
  Arial, Helvetica;">
</div>
Save the message as draft, inspect the message source.

Bad:
<div style="font-size: 12px; font-family: Verdana,&#xA; Arial,
   Helvetica;">
</div>
Notice the unwanted &#xA;

Alice, when did this start? TB 52 doesn't do that. Can you find the regression range for us.
Flags: needinfo?(alice0775)
Summary: The latest v. 60.3.1 adds &#xA to <td style> and more, when I edit with ThunderHTMLedit plugin → Serialising a tag attribute with a newline in the string attribute value results in &#xA; - STR: comment #3

Updated

5 months ago
See Also: 1508591

Updated

5 months ago
Duplicate of this bug: 1508591

Comment 6

5 months ago
Thanks Alice, that looks like this changeset caused it:
fe4514c84e7a Ashish Kulkarni - Bug 169521: fix XML attribute serialization for proper roundtripping r=bz
https://hg.mozilla.org/mozilla-central/rev/fe4514c84e7a#l1.32

Boris and Ashish: Looks like serialising now returns &#xA; which causes strive in TB. As you can see in comment #3, the round trip doesn't actually work, since the line is now broken after the
  style="font-size: 12px; font-family: Verdana,&#xA; Arial,
which will lead to another &#xA; being inserted.
Flags: needinfo?(masayuki) → needinfo?(kulkarni.ashish)

Comment 7

5 months ago
For e-mail the roundtrip is this:

Start with a text file (the e-mail) and parse this into a DOM. Display that DOM in an editor. When editing is done, serialise DOM back to text file. The latter seems to add the &#xA; now, but the former doesn't remove them.

But that's not all. For "sanitising" HTML e-mail, we actually add another full round trip here:
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/mime/src/mimeTextHTMLParsed.cpp#91

That code was only added in May 2018 here
https://hg.mozilla.org/comm-central/rev/96fab4a2b811
but it aggravates the problem that started in February 2018.

Adding some debug I see:

Read from the draft/file:
    <div style="font-size: 12px; font-family: Verdana,&#xA; Arial,
      Helvetica;">
(where the &#xA; already resulted from a previous serialisation of the DOM when writing the draft).

After the roundtrip in MimeInlineTextHTMLParsed_parse_eof(), that is, ParseFromString() and EncodeToString():
    <div style="font-size: 12px; font-family: Verdana,&#xA; Arial,&#xA;      Helvetica;">
(note the six spaces before Helvetica are still there).

So I guess there is a bug in the DOMParser which should have undone the inserted &#xA;

NI'ing Henri for the DOM parser here.
Component: General → DOM
Flags: needinfo?(hsivonen)
Flags: needinfo?(bzbarsky)
Product: Thunderbird → Core
Summary: Serialising a tag attribute with a newline in the string attribute value results in &#xA; - STR: comment #3 → Serialising a tag attribute with a newline in the string attribute value results in &#xA; Parsing doesn't remove &#xA; resulting in broken roundtrip - STR: comment #3
Version: 60 → 60 Branch

Updated

5 months ago
Status: UNCONFIRMED → NEW
Ever confirmed: true

Comment 8

5 months ago
Looking at this again, I don't think we want to touch the DOM parser here. If bug 169521 was about serialising XML, why is HTML affected? I guess that's the real bug here. Moving this to Core::Serializers is of course dangerous since that's un-owned :-(

The quick fix for TB will be a backout of bug 169521 from TB's branch on mozilla-esr60.
Component: DOM → Serializers
> If bug 169521 was about serialising XML, why is HTML affected?

nsHTMLContentSerializer inherits from nsXHTMLContentSerializer which inherits from nsXMLContentSerializer.  nsHTMLContentSerializer::SerializeHTMLAttributes calls nsXMLContentSerializer::SerializeAttr which does the serialization.  Shouldn't be a problem per se, in general.

> So I guess there is a bug in the DOMParser which should have undone the inserted &#xA;

Testing in the Firefox console, 

  (new DOMParser()).parseFromString("<div foo='a&#xA;b'>", "text/html").querySelector("div").getAttribute("foo")

shows a string with a newline in it, which means the &#xA; got parsed into a newline, as expected.  Why exactly do you think there's an issue with DOMParser here?

As far as I can tell, all the core bits are working fine.  The serializer outputs an entity reference.  The parser converts it to a newline character.  This is all how it should work,.

It's true that if you start with raw newlines and then go through this process they will get converted to &#xA;.  The DOM representation has no way to tell whether the original source had raw newlines or &#xA;, so you will get roundtrip failures no matter what you do here...  I suppose we could restrict the behavior from bug 169521 to XML, in which case HTML with &#xA; on disk will fail to round-trip.

None of that accounts for why the number of newlines keeps increasing, as reported above.  That presumably depends on the exact serialization flags Thunderbird is using.  When you land in nsXMLContentSerializer::SerializeAttr for the relevant attribute, which of the codepaths do you end up following in Thunderbird?  Is it AppendToStringConvertLF or AppendToStringFormatedWrapped or AppendToStringWrapped?  I am guessing that what happens is that the newline is converted to &#xA, but then Thunderbird's serialization adds an extra newline anyway to wrap things nicely or some such.  The next time it's parsed (as HTML) the &#xA; becomes a newline, and the newline that was added is still there, so you get two newlines, etc.
Flags: needinfo?(bzbarsky)
(Assignee)

Updated

5 months ago
Flags: needinfo?(jorgk)

Comment 10

5 months ago
Hmm, as I said in comment #7, the problem is already visible in about 15 lines of code here:
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/mime/src/mimeTextHTMLParsed.cpp#91-107

Original 'rawHTML' is
    <div style="font-size: 12px; font-family: Verdana,&#xA; Arial,
      Helvetica;">
(where the &#xA; already resulted from a previous serialisation of the DOM when writing the draft).

'parsed' comes back as:
<div style="font-size: 12px; font-family: Verdana,&#xA; Arial,&#xA;      Helvetica;">

So it looks like the serialisation added the &#xA; after "Arial" when it didn't before and the parser didn't remove it.

I'll answer your questions in a minute.
Flags: needinfo?(jorgk)

Comment 11

5 months ago
OK, when saving the draft, so NOT the code quote above, we come through here:
  if (mDoRaw || PreLevel() > 0) {
    NS_ENSURE_TRUE(AppendToStringConvertLF(attrString, aStr), false);
    printf("=== (1)\n");
  }
  else if (mDoFormat) {
    NS_ENSURE_TRUE(AppendToStringFormatedWrapped(attrString, aStr), false);
    printf("=== (2)\n");  <======
  }
  else if (mDoWrap) {
    NS_ENSURE_TRUE(AppendToStringWrapped(attrString, aStr), false);
    printf("=== (3)\n");
  }
  else {
    NS_ENSURE_TRUE(AppendToStringConvertLF(attrString, aStr), false);
    printf("=== (4)\n");
  }

Saving the draft is most likely
https://searchfox.org/comm-central/rev/aa357565819d38429d69190fbcc80f149c6488fe/mailnews/compose/src/nsMsgSend.cpp#1493
which uses nsIDocumentEncoder::OutputFormatted. There the &#xA; is already undesired.

Displaying the message, so the MIME code I quoted, doesn't appear to trigger that block.

Comment 12

5 months ago
(In reply to Jorg K (GMT+1) from comment #11)
> Displaying the message, so the MIME code I quoted, doesn't appear to trigger
> that block.
Correct, it instead triggers this block just above:
  if (aDoEscapeEntities) {
    printf("=== (0)\n");  <====

I hope comment #11 was clear enough, the code path printing "=== (2)\n" was executed. All that changed somehow :-(

Comment 13

5 months ago
If I revert rev fe4514c84e7a everything becomes good again. The saved draft doesn't contain the &#xA; and the code mimeTextHTMLParsed.cpp#91-107 shows:
    <div style="font-size: 12px; font-family: Verdana, Arial,
      Helvetica;">
and then unchanged:
    <div style="font-size: 12px; font-family: Verdana, Arial,
      Helvetica;">

Comment 14

5 months ago
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #9)
> None of that accounts for why the number of newlines keeps increasing, as
> reported above.
Actually, it does. If saving a draft adds a &#xA; and then editing the draft breaks the line elsewhere, then anther one is added.

Does that give enough information to work with? I think I'll just backout rev fe4514c84e7a for TB 60.
Flags: needinfo?(bzbarsky)
> So it looks like the serialisation added the &#xA; after "Arial" when it didn't before

More precisely, it replaced the \n with &xA;.  But it might have added _another_ \n.

> the code path printing "=== (2)\n" was executed

This is the part I was looking for.  That's definitely going to add newlines.

It looks like we have two different kAttrEntities bits, one in the HTML serializer one in the XML one.  The HTML one is only used for the nsIDocumentEncoder::OutputEncodeBasicEntities case, and encodes quot/amp/lt/gt/nbsp.  The XML one used to only do the first 4 of those until the changes in bug 169521.

The right fix here is probably to introduce an nsXMLContentSerializer method that represents most of the work of  nsXMLContentSerializer::AppendAndTranslateEntities but takes an entity table argument (or probably an entity table, a string table, and a max value).  Then we could call it from the HTML serializer with the HTML entity tables.

Writing that fix is easy; the hard part is writing the tests.  What cases does Thunderbird actually care about here?  Can you create automated tests for them (which presumably fail on trunk right now)?
Blocks: 169521
Flags: needinfo?(bzbarsky) → needinfo?(jorgk)

Comment 16

5 months ago
Reading my comments above, they are a bit convoluted, so I'll restate them here:

Displaying a message runs
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/mime/src/mimeTextHTMLParsed.cpp#91-107
and this
  if (aDoEscapeEntities) {
    printf("=== (0)\n");  <====
printing "=== (0)\n".

Saving a draft (most likely) runs this
https://searchfox.org/comm-central/rev/aa357565819d38429d69190fbcc80f149c6488fe/mailnews/compose/src/nsMsgSend.cpp#1493
and prints "=== (2)\n". You confirmed that that adds "newlines", umm, you mean those &#xA;?

I'm also interested in round trip in mimeTextHTMLParsed.cpp#91-107.
    <div style="font-size: 12px; font-family: Verdana, Arial,
      Helvetica;">
gets turned into
    <div style="font-size: 12px; font-family: Verdana, Arial,&#xA;      Helvetica;">
but maybe that doesn't matter. I noticed that running our "Edit Draft" command doesn't trigger any debug output, so loading the draft again doesn't seem to add another &#xA;. If it did, saving and edit/save cycle would add two &#xA;, but only one gets added when saving.

As for writing tests: There are pure C++/gTests in M-C, so one of those could be added. We can of course also write tests to check message content during a save/edit again cycle.

What we care about is that no additional &#xA; get saved to disk, my add-on ThunderHTMLedit which started this bug, also uses the editor's outputToString() as the send does, so that also goes through (2) and shows undesired &#xA;

Personally I'd also like to understand what has changed on the (0) code path.
Flags: needinfo?(kulkarni.ashish)
Flags: needinfo?(jorgk)
Flags: needinfo?(hsivonen)
> umm, you mean those &#xA;?

No, I mean actual newlines.

> gets turned into

Right, that part is expected given the current setup.  Whether it's desirable, I don't know.

> There are pure C++/gTests in M-C, so one of those could be added.

Yes, I'm aware of what the general idea is.  I just don't have time myself to figure out what behaviors Thunderbird depends on here and write tests for them.

> What we care about is that no additional &#xA; get saved to disk

But you don't care if &#xA; that are actually in the files get lost, right?

> Personally I'd also like to understand what has changed on the (0) code path.

You mean other than bug 169521?

Comment 18

5 months ago
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #17)
> > What we care about is that no additional &#xA; get saved to disk
> But you don't care if &#xA; that are actually in the files get lost, right?
Well, ideally a roundtrip compose/save draft/edit draft would leave us where the started as it has always been. I can write a test for that in an hour (or less).

> > Personally I'd also like to understand what has changed on the (0) code path.
> You mean other than bug 169521?
No, I meant from that bug. Before you stated that the (2) code path (save draft) would add newlines (which later are turned into &#xA;?). 

Right now you commented on the (0) path, the mimeTextHTMLParsed.cpp#91-107 used to display a message:
> > gets turned into
> Right, that part is expected given the current setup.  Whether it's desirable, I don't know.

so it looks like you figured out where the extra &#xA; comes from for both, right?
> would leave us where the started as it has always been.

It hasn't been.  The pre-60 behavior was that &#xA; in the source file would get converted to \n on save.

> No, I meant from that bug.

What changed is that \n gets converted to &#xA;.

> so it looks like you figured out where the extra &#xA; comes from for both, right?

Look, you start with a \n.  You parse it, you now have a \n in the DOM.  You go to serialize, convert the \n to &#xA;.  The serializer then tries to output that string, sees that it's going to be a really long line, and inserts a newline (\n).  This step used to not happen because there was already a \n there, but now that's been converted to &#xA;.  So you started with "\n" and by the time you save you are saving "&#xA;\n".

Then you parse it again, the DOM now has "\n\n".  You serialize, you have "&#xA;&#xA;" and the serializer helpfully breaks your long line so now you save "&#xA;&#xA;\n".  And so forth.

Comment 20

5 months ago
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #19)
> It hasn't been.  The pre-60 behavior was that &#xA; in the source file would
> get converted to \n on save.
Alright. Well, no one complained about that. In fact, bug 1508591 comment #0, appear to suggest that TB used to remove "bad" &#xA; when it found them.

> What changed is that \n gets converted to &#xA;.
OK, that the mimeTextHTMLParsed.cpp#91-107 bit:
Feed this
    <div style="font-size: 12px; font-family: Verdana, Arial,\n      Helvetica;">
to the parser and get this when serialising it "unformatted" again:
    <div style="font-size: 12px; font-family: Verdana, Arial,&#xA;      Helvetica;">
Here is doesn't matter since the longer line will eventually be parsed again and then the \n goes back into the DOM.

> Look, you start with a \n.  You parse it, you now have a \n in the DOM.  You
> go to serialize, convert the \n to &#xA;.  The serializer then tries to
> output that string, sees that it's going to be a really long line, and
> inserts a newline (\n).  This step used to not happen because there was
> already a \n there, but now that's been converted to &#xA;.  So you started
> with "\n" and by the time you save you are saving "&#xA;\n".

I guess, finally the penny dropped ;-) - As you've already stated in comment #9, the parsing bit is OK, in the DOM there are always \n.

So if we can get a fix here, I can write a test.

Note the following unrelated item: You kindly provided this API
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/mime/src/mimeTextHTMLParsed.cpp#94
once upon a time for use in
https://searchfox.org/comm-central/rev/efacd58fa63199a34709dc95b05bb9ab56d45f13/mailnews/import/winlivemail/nsWMUtils.cpp#140
and that has since found use on the main code path for message display. Sadly we never wrote a test for it, but I guess TB would fall apart if that stopped working.

Updated

5 months ago
Summary: Serialising a tag attribute with a newline in the string attribute value results in &#xA; Parsing doesn't remove &#xA; resulting in broken roundtrip - STR: comment #3 → Serialising a tag attribute with a newline in the string attribute value results in &#xA; resulting in broken roundtrip - STR: comment #3

Comment 21

5 months ago
https://hg.mozilla.org/releases/mozilla-esr60/rev/9f4280a52132bd8335d4833304e4bb7ee4d7c07d
Backed out part of changeset fe4514c84e7a (bug 169521 for causing bug 1509102) to build Thunderbird 60.3.2. a=jorgk
(In reply to Jorg K (GMT+1) from comment #20)
> I guess, finally the penny dropped ;-) - As you've already stated in comment
> #9, the parsing bit is OK, in the DOM there are always \n.
> 
> So if we can get a fix here, I can write a test.

I'm willing to work on a fix if you add a test :-)
Jorg, does that patch work?

Ashish, sorry, I had that written before I saw comment 22....
Flags: needinfo?(jorgk)
No worries, great that you've already done it!

Comment 26

5 months ago
Comment on attachment 9027792 [details] [diff] [review]
Stop converting \n to &#xA; in attribute values when serializing HTML

Review of attachment 9027792 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Boris, yes, the patch works fine.

::: dom/base/nsHTMLContentSerializer.cpp
@@ +472,5 @@
>    }
>  
> +  // We don't want to call into our superclass 2-arg version of
> +  // AppendAndTranslateEntities, because it wants to encode more characters
> +  // than we do.  Use our tables, but avoir encoding &nbsp; by passing in a smaller max index.

typo: avoid.

::: dom/base/nsXMLContentSerializer.h
@@ +191,5 @@
> +   *
> +   * The first integer template argument, which callers need to specify
> +   * explicitly, is the index of the last entry in aEntityTable that should be
> +   * considered for encoding as an entity reference.  The second integer
> +   * argument will be deduced rom the actual table passed in.

typo: from.
Attachment #9027792 - Flags: feedback+

Updated

5 months ago
Flags: needinfo?(jorgk)
(Assignee)

Updated

5 months ago
Attachment #9027792 - Attachment is obsolete: true
(Assignee)

Updated

5 months ago
Assignee: nobody → bzbarsky
Status: NEW → ASSIGNED

Comment 28

5 months ago
Comment on attachment 9028005 [details] [diff] [review]
Stop converting \n to &#xA; in attribute values when serializing HTML

Review of attachment 9028005 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/nsHTMLContentSerializer.cpp
@@ +473,5 @@
>  
> +  // We don't want to call into our superclass 2-arg version of
> +  // AppendAndTranslateEntities, because it wants to encode more characters
> +  // than we do.  Use our tables, but avoid encoding &nbsp; by passing in a
> +  // smaller max index.

This comment is good, but I think it would be a bit better to clarify further what characters would be encoded in the two main branches in this function (the OutputEncodeBasicEntities case vs this one.)  I'd appreciate if you could add a comment to help clarify that a bit.

::: dom/base/nsXMLContentSerializer.h
@@ +214,5 @@
> +
> +  /**
> +   * Max index that can be used with some of our entity tables.
> +   */
> +  static const uint16_t kGTVal = 62;

Now that you're defining this, https://searchfox.org/mozilla-central/rev/d35199ef15ffa57d190886e20d5df6471faf0870/dom/base/nsXMLContentSerializer.cpp#1178 is unneeded, right?  Can you please get rid of it here as well?
Attachment #9028005 - Flags: review?(ehsan) → review+
> I'd appreciate if you could add a comment to help clarify that a bit.

Done.

> Can you please get rid of it here as well?

The patch does that, yes...
Jorg, I would really like to have some tests here before checking this in...  Do you think you can do that?
Flags: needinfo?(jorgk)

Comment 31

5 months ago
Hmm, I was thinking more of tests in TB for the round trip. I could try some tests for M-C too, but in a follow-up. Or do do you want to wait? I'm pretty busy (still) chasing regressions in TB 60.x, this is just one of them.
Flags: needinfo?(jorgk)
I'd like to have m-c tests, since this is all inside m-c serializers.  

I can wait, especially because I want to make sure that the tests fail without the patch and pass with before I check in.  Do you have an approximate timeframe?
(Assignee)

Updated

5 months ago
Flags: needinfo?(jorgk)

Comment 33

5 months ago
Hmm, a few days, maybe until the weekend. The other bug landed this:
https://hg.mozilla.org/mozilla-central/rev/fe4514c84e7a#l2.13

Would it be much more than
let input = "<div foo='a&#xA;b'>";
let doc  = (new DOMParser()).parseFromString(input, "text/html");
let encoder = Cu.createDocumentEncoder("text/html");
encoder.init(doc, "text/html", 0);
let output = encoder.encodeToString();

Then some check that input.replace("&#xA;", "\n") is pretty much like output?

What steps do you have in mind? And where to best put such a test?
Flags: needinfo?(jorgk)
(Assignee)

Updated

5 months ago
Attachment #9028005 - Attachment is obsolete: true

Comment 35

5 months ago
Pushed by bzbarsky@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8fc810b6e528
Stop converting \n to &#xA; in attribute values when serializing HTML.  r=ehsan

Comment 36

5 months ago
Backout by cbrindusan@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7e8744eac2ca
Backed out changeset 8fc810b6e528 for xpcshell failures from test_serializers_entities.js. CLOSED TREE
Ugh, Windows newlines different from everyone else's newlines...  I guess now I get to do some Windows try runs.

Comment 39

5 months ago
I can take care of it.
I already have a fix running on try, fwiw.

Comment 42

5 months ago
Both pass here (on Windows), thanks again!
Flags: needinfo?(bzbarsky)

Comment 43

5 months ago
Pushed by bzbarsky@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9e3b1b073d5e
Stop converting \n to &#xA; in attribute values when serializing HTML.  r=ehsan

Comment 44

5 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/9e3b1b073d5e
Status: ASSIGNED → RESOLVED
Last Resolved: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65

Comment 45

5 months ago
Resolved issue for us after the update: https://www.thunderbird.net/en-US/thunderbird/60.3.2/releasenotes/

My initial bug: 1508591

Thank you
You need to log in before you can comment on or make changes to this bug.