New HTML processing introduced in bug 1419417 causes badly rendered HTML

RESOLVED FIXED

Status

defect
--
critical
RESOLVED FIXED
Last year
Last year

People

(Reporter: jorgk, Assigned: jorgk)

Tracking

({regression})

Trunk
Dependency tree / graph

Thunderbird Tracking Flags

(thunderbird_esr52 fixed, thunderbird_esr60 fixed, thunderbird60 fixed, thunderbird61 wontfix, thunderbird62 fixed)

Details

Attachments

(2 attachments, 3 obsolete attachments)

The attached message is rendered badly now.

I has some peculiar HTML:
<p dir="ltr"
      style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">It has been a while since I last sent a status update on council activities, ...

Note the white-space:pre and pre-wrap.

When viewed in TB, the work "It" shows on a separate line. This is because the original HTML has all one long line, the parsed and serialised HTML looks like this:

    <p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family
:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decora
tion:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">It
 has been a while since I last sent a status update on council
activities, ...

The serialising causes line-break after "It" and then the "white-space:pre" is applied.

So, bug 1419417 needed some more testing and here is the first regression :-(

Looks like this is fixable by setting proper serialisation flags:
  uint32_t aFlags = 0; <==== Here.
  rv = encoder->Init(document, NS_LITERAL_STRING("text/html"), aFlags);
  NS_ENSURE_SUCCESS(rv, -1);
  rv = encoder->EncodeToString(parsed);
  NS_ENSURE_SUCCESS(rv, -1);
This works. Magnus, please review as a matter of urgency.
Assignee: nobody → jorgk
Status: NEW → ASSIGNED
Attachment #8980569 - Flags: review?(mkmelin+mozilla)
There is another problem :-(

After the changes in bug 1419417, the <div class="moz-text-html"> no longer wrapped the body, but the body was written after the closing </div>. That was reported in bug 1462895. In bug 1463133
  https://hg.mozilla.org/comm-central/rev/18881dd127e3b0c0d3f97390c9094e309d4dd9c1#l3.22
that prelude was removed and with it, the lang="writing system".

Turns out that that was indeed required as Ben stated in bug 1463133 comment #1. Yes, stuff went broken, but not by the fact that I removed the lang thing, but by the fact that bug 1419417 caused the class to become ineffective.

In any case, that class stuff and other preludes really need to go to fix bug 1429491.

Instead of having two conflicting bugs, I'll fold the two together here.
Duplicate of this bug: 1429491
Merged with the patch from bug 1429491. I still need to add back the lang attribute.
Attachment #8980569 - Attachment is obsolete: true
Attachment #8980569 - Flags: review?(mkmelin+mozilla)
Boris, in an attempt to pass the HTML without any MIME prelude to the Gecko parser, I removed code that inserted:
  <div class="moz-text-html"  lang="...">
https://hg.mozilla.org/comm-central/rev/18881dd127e3b0c0d3f97390c9094e309d4dd9c1#l3.22
It turned out the the font Gecko uses to render text depends on the language, or you can also pass a "language group" or "writing system".

The code here compiles but doesn't link:
error LNK2019: unresolved external symbol "public: class mozilla::dom::HTMLBodyElement * __cdecl nsIDocument::GetBodyElement(void)

What am I doing wrong?

Does the code look reasonable otherwise?
Attachment #8980579 - Attachment is obsolete: true
Flags: needinfo?(bzbarsky)
Comment on attachment 8980730 [details] [diff] [review]
1464391-fix-flags.patch (v3) - WIP.

Review of attachment 8980730 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/mime/src/mimethsa.cpp
@@ +101,5 @@
> +  NS_NAMED_LITERAL_CSTRING(prefix, "<html><head>");
> +  if (StringBeginsWith(resultCStr, prefix)) {
> +    resultCStr.Insert(
> +      NS_LITERAL_CSTRING("<link rel=\"important stylesheet\" href=\"chrome://messagebody/skin/messageBody.css\">"),
> +      prefix.Length());

We also need to get the language into here somehow. Maybe set it on the html tag?
> What am I doing wrong?

GetBodyElement is inline, implemented in nsIDocumentInlines.h.  You need to include that.

> We also need to get the language into here somehow. Maybe set it on the html tag?

Or on the <body>, yes.
Flags: needinfo?(bzbarsky)
Posted patch 1464391-fix-flags.patch (v4) (obsolete) — Splinter Review
This works now with Boris' help - Thanks again!!

In this patch ...
- I correct the serialiser flags.
- I remove the last MIME prelude and replace it with 
  DOM or text manipulation.
- I restore setting the lang attribute which was over-zealously removed
  in bug 1463133. According to Boris the lang attribute
  can be placed on body or html tag.

Note: The only way forward to non-quirks precessing is the removal of all MIME preludes.
Attachment #8980730 - Attachment is obsolete: true
Attachment #8980767 - Flags: review?(mkmelin+mozilla)
Please keep the <div class="moz-text-html">. It's useful for many things later in the processing chain.
It was not OK that you removed that during patch landing process and without review.

Idea: If you think adding it as a string front of the document (as we did before) is a problem, you can add it to the DOM after parsing and before serializing. That allows to add it in the proper place in the HTML document.

> non-quirks precessing

I'm not sure that helps much or is even possible. We stick 5 different HTML documents, some generated by libmime, others from the email sender, all in the same document. All with their own <html> and <meta> and everything.
-    UtilityWriteCRLF("<link rel=\"important stylesheet\" href=\"chrome://messagebody/skin/messageBody.css\">");

I don't think that is correct. The emitter function always runs, for any output.
Your new code only runs for HTML documents.
If I read this right, this will completely break plaintext rendering, among many other things. Please keep this code.
(In reply to Ben Bucksch (:BenB) from comment #9)
> Please keep the <div class="moz-text-html">. It's useful for many things
> later in the processing chain.
> It was not OK that you removed that during patch landing process and without
> review.
Sorry, that was removed *with* review in bug 1463133. Plus you actually rendered it useless by placing the content outside the <div>. There is no CSS defined for moz-text-html, so as far as I can see, it's pretty much useless.

> > non-quirks precessing
> I'm not sure that helps much or is even possible. We stick 5 different HTML
> documents, some generated by libmime, others from the email sender, all in
> the same document. All with their own <html> and <meta> and everything.
We'll if you just render a single HTML message, quirks mode is gone with this patch, try the test message in bug 1429491.

(In reply to Ben Bucksch (:BenB) from comment #10)
> I don't think that is correct. The emitter function always runs, for any
> output.
Good catch, indeed, in plaintext mode the smileys now don't work :-(
Boris, sorry, I thought I had asked the last question, but there is always more. I tried to compile the patch here against mozilla60, our ESR, and I get:
mimeTextHTMLParsed.cpp(102): error C2039: 'CreateElem': is not a member of 'nsIDOMDocument'
mimeTextHTMLParsed.cpp(108): error C2039: 'GetHeadElement': is not a member of 'nsIDOMDocument'
mimeTextHTMLParsed.cpp(115): error C2039: 'GetBodyElement': is not a member of 'nsIDOMDocument'
Can you recall the equivalent calls in mozilla60?
Flags: needinfo?(bzbarsky)
Comment on attachment 8980767 [details] [diff] [review]
1464391-fix-flags.patch (v4)

-    UtilityWriteCRLF("<link rel=\"important stylesheet\" href=\"chrome://messagebody/skin/messageBody.css\">");

This completely breaks all plaintext messages. Not just smileys, but all plaintext styles. Quotes, signatures, even wrapping.

And all other styles defined there, too. Attachment headers, vcards, everything.
Attachment #8980767 - Flags: review?(mkmelin+mozilla) → review-
Please remove that first hunk from the patch, and keep the changes specific to the HTML classes.
Yes, Ben, I know that plaintext doesn't work with the patch, see comment #11 last line.

So how do I distinguish between classes in nsMimeHtmlDisplayEmitter::EndHeader(). Looks like it's only called once and the caller has the usual MIME object. Does it know which class it is?
> So how do I distinguish between classes in nsMimeHtmlDisplayEmitter::EndHeader()

That's what I meant: Leave this function completely alone, please.
> Can you recall the equivalent calls in mozilla60?

Same thing, but you should be working with nsIDocument, not nsIDOMDocument.  Which on trunk presumably you already are.
Flags: needinfo?(bzbarsky)
Summary on current status:

In bug 1419417 Ben replaced the "HTML passthrough" processing of HTML in MIME with a new approach: HTML is parsed into a DOM, then serialised out again.

While doing so, two bugs were introduced: The easy one is that the serialisation was done without any serialisation flag, leading to some messages, like attachment 8980567 [details], being incorrectly displayed.

The more severe issue is that traditionally HTLM rendered was wrapped into a <div class="moz-text-html" lang="...">. With the patch version 4 from bug 1419417 the HTML is no longer placed into the <div> but after it, that means, the </div> tag is written out too early. Sanitised/simple HTML always had this problem. I reported it in bug 1462895. The solution Ben offered doesn't work.

Ben told me via IRC that users modify their HTML display by adding CSS to userChrome.css for that class. This is in fact documented at https://www-archive.mozilla.org/unix/customizing.html

The lang="..." attribute is important, since based on the language or language group, Gecko chooses the font to use for rendering the text. Like CJK languages are typically rendered in other fonts than western languages.

Since the <div> was broken anyway, and at the time, despite Ben's warning, I hadn't realised the importance of the lang attribute, writing of the <div> was removed in bug 1463133.

So now here's here to fix the issues. Another issue to consider in this context is quirks mode. Gecko knows two modes of rendering HTML, standards mode and quirks mode. It distinguishes between the based on the DOCTYPE of the document. Since TB's MIME processing inserts its "MIME prelude" before the original HTML from the message, this detection doesn't work and all HTML from messages is always rendered in quirks mode, and many users complain about that, see bug 1429491 and friends. I had a long discussion with Ben on IRC whether correct rendering of HTML is important. He was more in favour of leaving the MIME prelude in place to allow users some customisation.

So what are possible solutions:
The lang tag can be injected into the HTML document as this patch shows making the broken <div> unnecessary. Another option would be to fix bug 1462895 and restore the <div lang="...">. That of course would close the door to getting rid of quirks mode.

To get rid of quirks mode, another MIME prelude needs to be removed, the one from nsMimeHtmlDisplayEmitter::EndHeader(). The current patch here removes it. I forgot to also remove the closing tags in nsMimeHtmlDisplayEmitter::EndBody(), but that can be corrected. Ben is opposed to the removal. Since the MIME prelude adds the TB stylesheet messageBody.css, when the prelude is removed, this needs to be injected like the lang attribute which is what the patch also does. However, the patch only does this for HTML, not for plaintext, so with the patch here plaintext mail shows no smileys, for example.

We should decide about the way forward.
> The lang="..." attribute is important, since based on the language or language group, Gecko chooses the
> font to use for rendering the text. Like CJK languages are typically rendered in other fonts than western languages.

Yes, that's the important bit. With lang, Asian users will get the wrong font and be very upset. A special, dedicated CJK font is important for them, for readability.

> He was more in favour of leaving the MIME prelude in place to allow users some customisation.

I think the <div> should stay in place, but for many reasons. Mostly so that we can clearly identify where the stuff comes from.

Regarding quirks mode, we've always been in quirks mode. We throw a log of ugly stuff at the HTML parser.
Not to the least the fact that we put several plaintext and HTML files all concatenated into the same output HTML document.
Given that it's always been like that, I don't consider that a serious issue that needs to be fixed right now. We have a serious security bug at our hand, and the release is waiting for that, so right now is not the time for polish.

> To get rid of quirks mode, another MIME prelude needs to be removed, the one from nsMimeHtmlDisplayEmitter::EndHeader()
> The current patch here removes it.

Right. And that is very important for display. See comment 13.

You CANNOT remove that, that introduces bugs FAR more serious than the one here. Please leave that generic emitter function alone. I don't know why we're still talking about that.

> To get rid of quirks mode

We've always been in quirks mode, since 20 years, so that cannot possibly cause any recent regressions.
Or are you just polishing?
From comment 9:
> you can add [the <div>] to the DOM after parsing and before serializing [at least for the HTMLParsed class].
> That allows to add it in the proper place in the HTML document.

---

I'm really confused here. Is this a regression, or not? If not, we should leave it alone. If yes, quirks mode cannot be the reason for a regression, because we've always been in quirks mode.
I think comment #18 was pretty clear. But I can repeat it:

There are two regressions from bug 1419417:
- The missing serialiser flags (easy to fix)
- The content outside the dedicated <div>. That's bug 1462895.
  The removal of the defunct <div> can be backed out by backing out bug 1463133.
  That can be done in a minute.

If the <div> is brought back, than bug 1462895 *MUST* be fixed, preferably by the persons who caused it.

Getting rid of quirks mode is just a side issue. Bug 1463133 removed two MIME preludes, the <div> and another one that was really unnecessary. If we go with removing MIME preludes to achieve standards mode, then the one remove here that inserts the stylesheet needs to be replaced with something else.

Is that clear now? If you want, backout bug 1463133 locally, then fix the <div> issue and present the solution for review. Otherwise the fix for bug 1419417 CANNOT be shipped at all and we go back to square one.
I have confirmed that when backing out bug 1463133 a Japanese test message is rendered in the wrong font since the <div> is closed too early.

<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-text-html" lang="ja">
<meta http-equiv="content-type" content="text/html; charset=ISO-2022-JP">
</div>
    <meta http-equiv="Content-Type" content="text/html; ">
    <p><tt>これは長い日本語のテキストですので、行が折れると思います。</tt></p>
    <p><b><tt>This is Jap.</tt></b><br>
    </p>
</body>
I am for keeping rendering in quirks mode. I remember it was kept for now also due to the messy HTML that Outlook produces, so turning standards mode just so could cause surprising rendering for old messages.
So I'm against trying to enable standards mode in a quick fix patch for eFail (and the fallout patches of that patch). This must be discussed very thoroughly by the engineering committee and other knowledgeable users.
So if there is a solution to just fix eFail and keep the rest (including support for user css), I vote for that one.
OK. Let's go with what Aceman says.
Attachment #8980767 - Attachment is obsolete: true
Comment on attachment 8980569 [details] [diff] [review]
1464391-fix-flags.patch

Minimal fix for using the correct flags.
Attachment #8980569 - Attachment is obsolete: false
Attachment #8980569 - Flags: review?(mkmelin+mozilla)
Attachment #8980569 - Flags: review?(acelists)
Severity: normal → critical
(In reply to :aceman from comment #23)
> I am for keeping rendering in quirks mode. I remember it was kept for now
> also due to the messy HTML that Outlook produces, so turning standards mode
> just so could cause surprising rendering for old messages.
You misunderstood. Only HTML that says "I'm standards mode" would have been rendered in standards mode. All other HTML, including old Outlook mail, would still be rendered in quirks mode.

Anyway, to bring some order into that and not mix quirks mode into this bug, I've now partly backed out bug 1463133 to bring back the <div>.

I've withdrawn all the complicated stuff here and left fixing the flags.

The assignee of bug 1462895, Ben, now needs to fix that critical bug, or else I will have to back out bug 1419417 as well.
> The assignee of bug 1462895, Ben, now needs to fix that critical bug, or else I will have to back out bug 1419417 as well.

If you back out the security fix, you'll have to find another way to fix the security bugs in a way that stops all known and continuously incoming exploits. So, this is not productive.

I told you on email that I've spent 2 weeks full time work on this, for free, and I cannot continue like this.
I'm a contractor, I do this for a living, and you are free to hire me.
Comment on attachment 8980569 [details] [diff] [review]
1464391-fix-flags.patch

This is a good and targetted fix. Thanks for that.
r=BenB
Attachment #8980569 - Flags: review?(mkmelin+mozilla)
Attachment #8980569 - Flags: review?(acelists)
Attachment #8980569 - Flags: review+
From what I understand, once this lands, this leaves only:
* <div lang="..."> is missing, which means we use the wrong font for CJK languages.
Jörg, is that correct?
Given that you filed bug 1462895 about the <div>, landing the latest patch would conclude this bug.
(In reply to Ben Bucksch (:BenB) from comment #27)
> If you back out the security fix, you'll have to find another way to fix the
> security bugs in a way that stops all known and continuously incoming
> exploits. So, this is not productive.
Right. But we can't obliterate HTML rendering for 25 million users to fix security issues that affect a few hundred thousand.

(In reply to Ben Bucksch (:BenB) from comment #29)
> From what I understand, once this lands, this leaves only:
> * <div lang="..."> is missing, which means we use the wrong font for CJK
> languages.
It's not missing, it's closing too early, so the lang attribute has no effect.

(In reply to Ben Bucksch (:BenB) from comment #30)
> Given that you filed bug 1462895 about the <div>, landing the latest patch
> would conclude this bug.
Yes.
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/6ce2b9815dc0
Use appropriate flags when serialising HTML in mimeTextHTMLParsed.cpp. r=BenB DONTBUILD
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Attachment #8980569 - Flags: approval-comm-esr60+
Attachment #8980569 - Flags: approval-comm-esr52+
Attachment #8980569 - Flags: approval-comm-beta+
You need to log in before you can comment on or make changes to this bug.