Open Bug 792270 Opened 12 years ago Updated 2 years ago

Add encoding information ("charset", etc.) to e-mails and feed articles saved as HTML files

Categories: MailNews Core :: Backend, defect
Tracking: Not tracked
Status: REOPENED
People: Reporter: klasse, Unassigned
Whiteboard: patchlove
Attachments: 2 files

User Agent: Mozilla/5.0 (Windows NT 5.1; rv:15.0) Gecko/20100101 Firefox/15.0.1
Build ID: 20120905151427

Steps to reproduce:

1) Feed articles:
When an RSS feed article is saved as HTML file, the following lines:
    <meta content="text/html; charset=..."
      http-equiv="Content-Type">
are not added. That, due to a missing "charset" parameter, prevents a browser from correctly displaying the HTML file.

It would be nice, if the above lines were also added with feed articles - the necessary "charset" information is provided with the article.
Or better: The encoding placed into the <head> of the HTML file, for all of the file, could generally be the one currently used for display (menu View -> Encoding). Independently of the original encoding information provided with the content. That would also cope with rare cases when the encoding needs to be changed manually by the user to show the content correctly before saving it.

Reproduction info:
* Received an RSS feed article with German umlauts in article title and text and following info in the feed header:
    Content-Type: text/html; charset=UTF-8
* Stored the article as an HTML file
* Opened the file with Firefox --> Result: Not OK: The umlauts are not shown correctly because Firefox assumes the ISO-8859-1 encoding (but shows umlauts correctly after manual encoding change to UTF-8).
* Using a text editor, added the above lines with charset="utf-8" before the <title> in the <head> of the saved HTML file
* Opened the file with Firefox --> Result: OK: The umlauts, both in file title and its text, are shown correctly and Firefox assumes UTF-8 encoding.
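
The symptom in the steps above can be reproduced outside a browser. The following is a minimal Python sketch, assuming the consumer falls back to ISO-8859-1 as described; the sample text is made up for illustration and is not taken from the actual feed.

```python
# Illustration only: hypothetical sample text, not from the actual feed.
text = "Reisehinweise für Dänemark"

# The saved file contains the UTF-8 bytes of the text.
utf8_bytes = text.encode("utf-8")

# Without a charset declaration, a consumer that guesses ISO-8859-1
# turns every two-byte UTF-8 sequence into two wrong characters.
misread = utf8_bytes.decode("iso-8859-1")

# With the correct charset, the text round-trips unchanged.
correct = utf8_bytes.decode("utf-8")

print(misread)
print(correct)
```

The bytes on disk are identical in both cases; only the assumed encoding differs, which is why a declaration in the file is enough to fix the display.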

Additional proposals:
a) To additionally improve browser display with regard to future HTML development, the used HTML version could also be added to saved HTML file, e.g.:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
         "http://www.w3.org/TR/html4/transitional.dtd">
b) The Thunderbird version (e.g. as shown in its User-Agent string) used to save the HTML file could also be added as the "generator" meta tag in the <head> of the saved HTML file.


2) E-mails:
Similarly with e-mails, e.g. the Facebook "XY would like to be your friend" e-mails (which, in German, have umlauts, too; used here just as an example). They have the following line at the beginning of the respective message body MIME parts:
    Content-Type: text/html; charset="UTF-8"
The "Subject", "To", etc. fields carry e.g. the following encoding information when the message is stored as an EML file: ?UTF-8?
But that information is not carried over to the respective part of the saved HTML file. The same applies to all other e-mails.

Please note that, unlike with feeds, e-mail fields ("Subject", "To", etc., and the different message body MIME parts) each may have a different encoding, and each provide the encoding information separately for themselves. Therefore, the encoding information needs to be provided separately for each of the respective fields, or different encodings translated to one common - e.g. to the one of the body - and the encoding then placed in the <head> for the whole file. (Heavily affected would be e.g. e-mails with mixed western and cyrillic body and fields.)
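
The "translate to one common encoding" option can be sketched with Python's standard email package. This is only an illustration of the idea, not how Thunderbird does it: the header values are hypothetical samples, and the table markup merely imitates the saved header block.

```python
from email.header import Header, decode_header, make_header
from html import escape

# Hypothetical sample fields, each encoded with a different charset.
# They are built here so the RFC 2047 encoded words are well-formed.
raw_fields = {
    "Subject": Header("Grüße aus Köln", "utf-8").encode(),
    "From": Header("Иван Петров", "koi8-r").encode(),
}

def decode_field(raw: str) -> str:
    """Decode RFC 2047 encoded words, each with its own charset, to Unicode."""
    return str(make_header(decode_header(raw)))

decoded = {name: decode_field(raw) for name, raw in raw_fields.items()}

# Once everything is Unicode, the whole file can use one common encoding.
rows = "".join(
    f"<tr><td><b>{escape(name)}: </b>{escape(value)}</td></tr>"
    for name, value in decoded.items()
)
html_bytes = rows.encode("utf-8")

print(raw_fields["From"])
print(decoded["From"])
```

Because every field is decoded with its own charset first, the Western and Cyrillic values can sit side by side in a single UTF-8 file with one declaration in the <head>.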

Or better: The encoding placed into the <head> of the HTML file, for all of the file, could generally be the one currently used for display (menu View -> Encoding). (I.e. the shown content of the e-mail fields which have possibly been decoded using different "charsets" would be encoded with current encoding setting at the time of saving.) Independently of the original encoding information provided with the content. That would also cope with rare cases when the encoding needs to be changed manually by the user to show the content correctly before saving it.

Additional proposals:
a - b) Same as the respective "additional proposals" for feed articles above.
Component: General → Feed Reader
Product: Thunderbird → MailNews Core
The web page linked to in a feed item has the responsibility to set a char encoding; in addition, the encoding of a feed summary 'mail' item is always utf8 and does not necessarily have anything to do with the encoding of its linked web page.
Status: UNCONFIRMED → RESOLVED
Closed: 10 years ago
Resolution: --- → INVALID
I'm not sure I can follow the reasoning for closure. The web page (or e-mail received) DOES set the char encoding (in HTML or as a server header), but the problem is that that encoding is not stored when the feed article (or e-mail) is stored as an HTML file. Which leads to its messy display by the browser.

Or how else does Thunderbird know which char encoding to use for display AND storage into an EML file. To summarize the long bug description into one sentence:
*The problem is that the char encoding, which is correctly stored when a feed article (or e-mail) is exported to an EML file, is not stored when the same article (or e-mail) is exported to an HTML file.*
The eml file has nothing to do with the link to the web page, it is merely the feed file's content tag and metadata, and is always decoded to utf8, as already stated.  The web page being saved is done so both in Tb and Fx by Save as link, which is a direct save of the publishers html file.  Fx has done a lot of work regarding charset detection so this problem likely no longer exists, otherwise the only remaining option is to set Fallback char encoding in Fx.
I don't quite agree that the HTML export "is a direct save of the publishers html file". Because neither e-mails nor feeds use the HTML format (in the case of feeds it's actually XML). So in both cases, when (a) the feed article/e-mail is saved to EML or (b) it's saved to HTML, there is a conversion from the original format (e-mail or XML) to EML or HTML respectively. (Okay, e-mail-to-EML might be a one-to-one export, but in all other cases there is already a conversion.)

And the feed (or e-mail) in question DOES already provide the encoding info - see e.g. http://www.auswaertiges-amt.de/SiteGlobals/Functions/RSSFeed/DE/RSSNewsfeed/RSS_Reisehinweise.xml feed. It's there both, as HTTP "Content-Type" server header and also at the top of the XML file. And the e-mails I tested also had the encoding set correctly.

So, if this is already not a "direct save" but a conversion to HTML, how about saving it correctly - e.g. also "always in UTF8", just as already the case with EML? Because, as a user, if I'm already satisfied with how the e-mail/feed article is being displayed (which as you state is always decoded to UTF8) and decide to export it to an HTML file, I expect it to be shown by the browser the exact same way I saw it in TB - so let it always be UTF8 then also.

Would that work?

P.S.: The problem still exists when the saved HTML file is opened in FF 32.0, but a user may use other browsers also. So, to aid the browsers in determining the character encoding, its inclusion into the saved HTML file would be very useful, and is also required according to http://www.w3.org/TR/html4/charset.html#doc-char-set, which states that "To promote interoperability, SGML requires that each application (including HTML) specify its document character set."

Also needed is the used HTML version, as suggested in the "Additional proposals" in the bug description, which is required for valid HTML files according to http://www.w3.org/TR/html4/struct/global.html#h-7.2, which states that "a valid HTML document declares what version of HTML is used in the document." The HTML files currently produced by TB are therefore not valid.

The HTML files produced from feeds (and e-mails) by TB also fail the markup validation at http://validator.w3.org/check for various other reasons, on (successfully validated) feed articles saved as HTML.

P.P.S.: As I mentioned, as a user, I think one expects equal (to TB) display of the saved HTML file anyway, not the raw unmodified e-mail/feed data. If one wanted the latter, they would need to open the feed URL in a browser and observe its source XML there. (Or in e-mail source view in TB in case of e-mails.)
Please also note that I use TB as my feed reader, not Firefox. (I remember that there is some difference, with the one in FF being newer, but I don't use it.) I'm referring to using the "Save as..." with feed articles as I see them in TB under "Blogs & News Feeds", where they appear similarly to how e-mails appear under their respective account.
Because I assume that @alta88 mistook this proposal for referring to the feed/e-mail handling in Firefox instead of Thunderbird, I'm going to tentatively re-open it.
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---
There is no mistake.  Do not reopen this.  Your misunderstanding here is quite large, and this is not a support forum.
Status: UNCONFIRMED → RESOLVED
Closed: 10 years ago
Resolution: --- → INVALID
The "feed article" throughout this bug report refers to the feed article *summary* by the way, not the web page. Also, both, e-mail and feed article summary are viewed in the text-only mode, not as HTML; and then saved as HTML.
That is a big clarification.  The request was read as though it was for unencoded feed web pages and extrapolated to links related to mail.  So the large misunderstanding is mine.

Yes, if Tb is creating an html file, it should publish the encoding.
Status: RESOLVED → REOPENED
Component: Feed Reader → Backend
Ever confirmed: true
OS: Windows XP → All
Hardware: x86 → All
Resolution: INVALID → ---
Version: 15 → unspecified
This is simple to knock off.  Originally it was going to apply only to feeds, then to non multipart email where the charset wasn't wrong.  But whatever late 90s reasons (likely ripped out now by henri) may have existed to save html with original encodings no longer exist; utf-8 all the way.  Bizarrely, mails saved as .txt have always been utf-8..

(so even chiaki's email works in html now: https://bugzilla.mozilla.org/attachment.cgi?id=8499388)

A real appname/version for generator would be nice but someone would have to feel strongly to create a util func to implement extIApplication.
Assignee: nobody → alta88
Attachment #8500726 - Flags: review?(kent)
Comment on attachment 8500726 [details] [diff] [review]
htmlSaveEncoding.patch

I'd like to be able to help out with this review, but I have not really been following the character set saga as closely as some others. jcranmer is really busy now, but he is the best reviewer, and he is the formal peer on this module. So I will defer to him unless he throws it back.
Attachment #8500726 - Flags: review?(kent) → review?(Pidgeot18)
@alta88: Thanks for the patch. Just a few questions, after trying the newly added lines on http://validator.w3.org/check (using the direct text input field), with a change notification e-mail for this implementation proposal, which was received and viewed as text-only:

1. Are you sure it's HTML5 and not HTML4? Because the e-mail header table ("From", "Subject", etc.) has "width", etc. defined, which the validator complains about when checking as HTML5 ("use CSS instead"), among other things.

2. It seems that the "/>" at the end of the new lines should be a ">". Is that true?

The validator seems to have fewer errors (only 2 on HTML4.01 instead of 11 on HTML5) when I used the following declarations instead (with charset declaration for HTML4.01 and removed forward slashes at the tag end):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
[...]
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="generator" content="Mozilla Mailnews">


P.S.: I don't know if the "HTML 4.01 Transitional" could be "HTML 4.01 Frameset" instead, as I don't know if HTML saved by Thunderbird can contain framesets or not.


-------------------------

P.P.S.: Here is the unmodified e-mail-to-HTML export used for testing (e-mail addresses modified, hopefully no other modification/filtering on comment submission):

<html>
<head>
<title>[Bug 792270] Add encoding information (&quot;charset&quot;, etc.) to e-mails and feed articles saved as HTML files</title>
<link rel="important stylesheet" href="chrome://messagebody/skin/messageBody.css">
</head>
<body>
<table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part1"><tr><td><b>Betreff: </b>[Bug 792270] Add encoding information (&quot;charset&quot;, etc.) to e-mails and feed articles saved as HTML files</td></tr><tr><td><b>Von: </b>&quot;Bugzilla@Mozilla&quot; &lt;test@example.com&gt;</td></tr><tr><td><b>Datum: </b>07.10.2014 01:35</td></tr></table><table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part2"><tr><td><b>An: </b>test@example.com</td></tr></table><br>
<div class="moz-text-plain"><pre wrap>
Do not reply to this email. You can add comments to this bug at
<a class="moz-txt-link-freetext" href="https://bugzilla.mozilla.org/show_bug.cgi?id=792270">https://bugzilla.mozilla.org/show_bug.cgi?id=792270</a>

Kent James (:rkent) <a class="moz-txt-link-rfc2396E" href="mailto:test@example.com">&lt;test@example.com&gt;</a> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |<a class="moz-txt-link-abbreviated" href="mailto:test@example.com">test@example.com</a>
         Attachment|review?(<a class="moz-txt-link-abbreviated" href="mailto:test@example.com">test@example.com</a>)    |review?(<a class="moz-txt-link-abbreviated" href="mailto:test@example.com">test@example.com</a>
     #8500726 Flags|                            |)

--- Comment #12 from Kent James (:rkent) <a class="moz-txt-link-rfc2396E" href="mailto:test@example.com">&lt;test@example.com&gt;</a> 2014-10-06 16:35:46 PDT ---
Comment on attachment 8500726 [details] [diff] [review]
  --&gt; <a class="moz-txt-link-freetext" href="https://bugzilla.mozilla.org/attachment.cgi?id=8500726">https://bugzilla.mozilla.org/attachment.cgi?id=8500726</a>
htmlSaveEncoding.patch

I'd like to be able to help out with this review, but I have not really been
following the character set saga as closely as some others. jcranmer is really
busy now, but he is the best reviewer, and he is the formal peer on this
module. So I will defer to him unless he throws it back.

<div class="moz-txt-sig">-- 
Configure bugmail: <a class="moz-txt-link-freetext" href="https://bugzilla.mozilla.org/userprefs.cgi?tab=email">https://bugzilla.mozilla.org/userprefs.cgi?tab=email</a>

-------------------------------
Product/Component: MailNews Core :: Backend



------- You are receiving this mail because: -------
You voted for the bug.
You reported the bug.

</div></pre></div></body>
</html>
Please note that I didn't test the patch. I just added the three new declarations from nsMimeHtmlEmitter.cpp to the HTML source produced by unpatched TB.
I don't think html5 or html4x doctype practically matters to a browser (as long as there is a doctype, and I'm not even sure quirks mode exists in modern browsers) and only matters to a validator of yesteryear.  The html emitter is also used for message pane display/print preview, thus the useless existing embedded links saved to file and non validating tag attributes used for internal css.  Libmime is actively being replaced by jsmime, and it is a non goal to pull the endless threads of libmime issues.

However, the original patch had html4x and non self closing tags etc, and it would be easy enough to do that fwiw.
@alta88: Thanks for the explanation.

The only downside with usage of the new charset attribute introduced in HTML5 seems to be that some non-browser applications do not (yet) understand it - e.g. the LibreOffice Writer office SW in its current release 4.3.2.

But: In the specific case of LibreOffice Writer, it seems to assume UTF-8 when it doesn't recognize the charset declaration. So, assuming that TB with the provided patch always produces UTF-8 encoded HTML, it should be displayed correctly on LibreOffice, even if it doesn't recognize the charset declaration.

Just a thought on other applications: The "old-style" http-equiv attribute charset declaration (even inside a document with the HTML5 version declaration) would ensure compatibility with both, older and newer applications. Because it's explicitly defined as an alternative in the HTML5 specs - http://www.w3.org/TR/html5/document-metadata.html#character-encoding-declaration. (As seem to be the non self closing tags, which are all over the "meta" element code examples there.)
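
To illustrate that a consumer can accept both declaration styles, here is a small Python sketch built on the standard html.parser module. The class and function names are made up for this example, and it is a simplification of what real applications do, not a description of any of them.

```python
from html.parser import HTMLParser

class CharsetSniffer(HTMLParser):
    """Records the first charset declared in a <meta> element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # HTML5 style: <meta charset="UTF-8">
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Older style: <meta http-equiv="Content-Type" content="...; charset=...">
            content = attributes.get("content") or ""
            for param in content.split(";")[1:]:
                name, _, value = param.partition("=")
                if name.strip().lower() == "charset":
                    self.charset = value.strip().strip("\"'").lower()

def sniff_charset(html: str):
    """Return the declared charset in lower case, or None if there is none."""
    sniffer = CharsetSniffer()
    sniffer.feed(html)
    sniffer.close()
    return sniffer.charset

print(sniff_charset('<head><meta charset="UTF-8"/></head>'))
print(sniff_charset('<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>'))
```

An application that only implements the second branch would miss the HTML5-style attribute, which is the compatibility concern raised above.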
Legacy apps are a valid point, and of course the spec cannot be argued against (even if one wants to avoid boilerplate which is mostly irrelevant and the fact that there was never a doctype to begin with).

Original attached (which also avoids printing some useless lines); review can decide what to go with.
(In reply to alta88 from comment #15)
> I don't think html5 or html4x doctype practically matters to a browser (as
> long as there is a doctype, and I'm not even sure quirks mode exists in
> modern browsers)

HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA

Yes, doctypes do matter. Yes, quirks mode still exists.
Comment on attachment 8500726 [details] [diff] [review]
htmlSaveEncoding.patch

Review of attachment 8500726 [details] [diff] [review]:
-----------------------------------------------------------------

So, I'm going to admit that I haven't bothered to test this patch.

The problem with libmime is that code paths like this affect *a lot* of components, including display, save as, edit as new, and even reply or forward.

My concern is that without testing all of these paths in detail to assure that they're not affected, a change like this is highly dangerous. And the patch author does not appear to have done sufficient testing.

::: mailnews/mime/emitters/nsMimeHtmlEmitter.cpp
@@ +286,5 @@
>  nsMimeHtmlDisplayEmitter::EndHeader(const nsACString &name)
>  {
>    if (mDocHeader && (mFormat != nsMimeOutput::nsMimeMessageFilterSniffer))
>    {
> +    UtilityWriteCRLF("<!DOCTYPE html>");

This change makes unconditional HTML5 doctype, which is highly dangerous given the state of HTML in email.

@@ +291,4 @@
>      UtilityWriteCRLF("<html>");
>      UtilityWriteCRLF("<head>");
> +    UtilityWriteCRLF("<meta charset=\"UTF-8\"/>");
> +    UtilityWriteCRLF("<meta name=\"generator\" content=\"Mozilla Mailnews\"/>");

Why are you adding a generator with such an incorrect value? If you can't be bothered to make a correct value, then don't add it in the first place.
Attachment #8500726 - Flags: review?(Pidgeot18) → review-
Sorry "Joe", looks like a no go.  And dealing with reviewers with bad hair days is also a no go.
Assignee: alta88 → nobody
No worries. Thanks for giving this a try. I guess this will have to wait for jsmime to arrive to the affected parts.
For the record, the view in comment 15 is supported in several places on stackoverflow, like here:
http://stackoverflow.com/questions/10030688/is-it-safe-to-change-to-the-html5-doctype

So it seems the reviewer's grasp of doctype is wholly inadequate.  Likewise with libmime, as evidenced by the throwing up a lot of false scary scenarios.  Smells like territoriality combined with couldn't be bothered.
(In reply to alta88 from comment #22)
> For the record, the view in comment 15 is supported in several places on
> stackoverflow, like here:
> http://stackoverflow.com/questions/10030688/is-it-safe-to-change-to-the-
> html5-doctype
> 
> So it seems the reviewer's grasp of doctype is wholly inadequate.  Likewise
> with libmime, as evidenced by the throwing up a lot of false scary
> scenarios.  Smells like territoriality combined with couldn't be bothered.

Your grasp of doctype is incorrect. The question you referred to was about the exact format of the standards-mode doctype. Presently, libmime does not use a doctype--and thus it uses quirks mode. The addition of a doctype switches to standards mode.
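
For illustration, whether a saved file would leave quirks mode can be checked by looking for a leading doctype. This Python sketch uses a made-up helper name and only handles leading whitespace; it ignores comments that may precede a doctype, and it does not distinguish between the individual doctypes.

```python
import re

# Matches a doctype at the very start of the markup, after optional whitespace.
_DOCTYPE = re.compile(r"\s*<!DOCTYPE\b", re.IGNORECASE)

def has_doctype(html: str) -> bool:
    """Return True if the markup begins with a doctype declaration."""
    return _DOCTYPE.match(html) is not None

# Output as currently produced (no doctype) versus the patched form.
current_output = "<html>\r\n<head>\r\n<title>Subject</title>\r\n</head>\r\n</html>"
patched_output = "<!DOCTYPE html>\r\n" + current_output

print(has_doctype(current_output))
print(has_doctype(patched_output))
```

A file for which this returns False is rendered in quirks mode by browsers, so adding the doctype line is a rendering change in addition to a markup change.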

If you don't trust my word, perhaps you should read the WHATWG living mode specification as I do. Or maybe you'd prefer to slander me because you don't like my code and my opinions.
What I find absolutely unacceptable is your allcaps attempt at ridicule, which you have exhibited in several other bugs.  You are slandering yourself and if you are incapable of professional behavior in a review role, then remove yourself.

If you are arguing that save as html should (erroneously as an ancient artifact) remain without doctype, to be rendered in quirks mode, then you need to justify why.
(In reply to alta88 from comment #24)
> If you are arguing that save as html should (erroneously as an ancient
> artifact) remain without doctype, to be rendered in quirks mode, then you
> need to justify why.

You need to justify why such a change:
a) would not affect legacy HTML (e.g., the crap HTML produced by Outlook), or
b) why breaking legacy HTML is an acceptable trade-off in terms of cost/benefit tradeoff.

MIME is a standard where many points are observed more in the breach. It's implemented by a library which was designed in 1997 and thenceafter mostly maintained by hacking everything on increasingly large piles of hacks. And libmime is basically functionally untested.

In light of that, conservatism is basically the default mode: you need to justify the changes rather than needing to justify the status quo. Not saving to UTF-8 is clearly (to me) an example of where the status quo is wrong, and thus that is safe to (and needs to) change. Saving an HTML charset information on output-UTF-8 files satisfies the "not affecting legacy" bit. But adding a doctype to use standards mode where we use quirks mode is not so clearly justified, and this needs its own justification.

As a prior contributor to Thunderbird, I fully expect you to be capable of doing due diligence. The actions you performed in this bug were clearly not doing the due diligence expected, hence why I rejected the patch.
I will also addend--as I forgot about it until just now--HTML in email is mostly used in compatibility with GUI document editors (e.g., Microsoft Word, LibreOffice Writer), which tend (last I checked, admittedly a few years ago since I had hard data) to use very old, very outdated HTML rendering engines in comparison with web browsers. So one could argue that you care more about MS Word compatibility with the HTML than Firefox Nightly-level compatibility.

And ISTR reading at one point that MS Word was built on the same layout engine as IE 5.5.
Thanks to everyone involved for discussing this.

(In reply to Joshua Cranmer [:jcranmer] from comment #25)
> MIME is a standard where many points are observed more in the breach. It's
> implemented by a library which was designed in 1997 and thenceafter mostly
> maintained by hacking everything on increasingly large piles of hacks. And
> libmime is basically functionally untested.
>
> In light of that, conservatism is basically the default mode: you need to
> justify the changes rather than needing to justify the status quo. Not
> saving to UTF-8 is clearly (to me) an example of where the status quo is
> wrong, and thus that is safe to (and needs to) change. Saving an HTML
> charset information on output-UTF-8 files satisfies the "not affecting
> legacy" bit. But adding a doctype to use standards mode where we use quirks
> mode is not so clearly justified, and this needs its own justification.

My primary concern, personally, is the correct display of non-English e-mails and feed articles exported to HTML. I would like to suggest, therefore, in order to proceed conservatively, to henceforth ignore my "Additional proposals" (i.e. HTML version and "generator" info) in the original bug report and consider the missing character encoding only.

What do you guys think of only taking over the following parts of the second patch by alta88 (https://bugzilla.mozilla.org/attachment.cgi?id=8503622&action=diff)? Sorry, I myself lack the necessary infrastructure for patch creation and implementation:

1. The "charset" declaration (a/mailnews/mime/emitters/nsMimeHtmlEmitter.cpp, line 293):
    UtilityWriteCRLF("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">");

2. The conversion to UTF-8 (a/mailnews/mime/src/mimetpla.cpp, line 429):
      CopyUTF16toUTF8(lineResultUnichar, outString);

I.e., it will still be the quirks mode.
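
The effect of these two steps on a saved file can be sketched in Python. This is not the libmime code, only an illustration under the assumption that the emitted markup is available as a Unicode string; the function name is made up.

```python
import re

META = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'

def add_charset_declaration(html: str) -> bytes:
    """Insert META right after the first <head> tag and encode as UTF-8.

    No doctype is added, so the result is still rendered in quirks mode.
    """
    match = re.search(r"<head\b[^>]*>", html, flags=re.IGNORECASE)
    if match is None:
        # No <head> found: leave the markup alone, only fix the encoding.
        return html.encode("utf-8")
    pos = match.end()
    return (html[:pos] + "\r\n" + META + html[pos:]).encode("utf-8")

saved = add_charset_declaration(
    "<html>\r\n<head>\r\n<title>Grüße</title>\r\n</head>\r\n<body></body>\r\n</html>"
)
print(saved.decode("utf-8"))
```

The declaration lands before the <title>, matching the manual fix from the original report, and the bytes written always agree with the declared charset.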


In the meantime, I did some testing with just manually adding the "HTML4-style" charset declaration to HTML source exported by unpatched Thunderbird, to test if various applications detect it (in quirks mode). I did it with:
- a utf-8 encoded feed article in German
- a koi8-r (Cyrillic) encoded e-mail

Here are the results:
---------------------------------------------------------------------------
                          | utf-8 Feed Article  |    koi8-r E-mail
                          | ------------------- | --------------------
                          | charset |  charset  |  charset |  charset
Application               | missing |  utf-8    |  missing |  koi8-r
---------------------------------------------------------------------------
Firefox 33.0              |   O     |    X      |    O     |    X
Chrome 38                 |   O     |    X      |    O     |    X
LibreOffice Writer 4.3.2  |   X     |    X      |    O     |    X
LibreOffice Calc 4.3.2    |   O     |    X      |    O     |    X
MS Office Word/Excel 2007 |   O     |    X      |    X     |    X
---------------------------------------------------------------------------

All of the above on WinXP. X = displayed correctly; O = non-English characters not displayed correctly.

So, adding the charset declaration ensures correct display in all of the above applications, and all of them seem to be able to cope with quirks mode.

Thanks in advance.
Whiteboard: patchlove
Removing myself on all the bugs I'm cced on. Please NI me if you need something on MailNews Core bugs from me.
Severity: normal → S3