Comment 1

•

11 years ago

The web page linked to in a feed item is responsible for setting its own character encoding. In addition, a feed summary 'mail' item is always encoded as UTF-8, and that encoding does not necessarily have anything to do with the encoding of the linked web page.

Status: UNCONFIRMED → RESOLVED

Closed: 11 years ago

Resolution: --- → INVALID

Reporter

Comment 2

•

11 years ago

I'm not sure I can follow the reasoning for closure. The web page (or e-mail received) DOES set the character encoding (in HTML or as a server header), but the problem is that this encoding is not stored when the feed article (or e-mail) is saved as an HTML file, which leads to garbled display in the browser. How else would Thunderbird know which character encoding to use for display AND for storage into an EML file? To summarize the long bug description in one sentence: *The problem is that the character encoding, which is correctly stored when a feed article (or e-mail) is exported to an EML file, is not stored when the same article (or e-mail) is exported to an HTML file.*
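The display problem described in this comment can be reproduced outside Thunderbird. The following is a minimal Python sketch, not Thunderbird code. It assumes the exported file holds UTF-8 bytes with no charset declaration and that the reading application falls back to windows-1252, which is a common but by no means universal fallback. The sample text and the `decode_with_fallback` helper are made up for illustration.

```python
# Illustrative only: simulate a reader guessing the wrong encoding
# for an HTML file that carries no charset declaration.


def decode_with_fallback(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes the way a reader might when no charset is declared.

    The cp1252 fallback is an assumption; real applications use
    locale-dependent fallbacks or content sniffing.
    """
    return raw.decode(fallback, errors="replace")


original = "Ausw\u00e4rtiges Amt"        # contains a non-ASCII "a umlaut"
stored = original.encode("utf-8")        # bytes as written to the file

garbled = decode_with_fallback(stored)   # reader guesses the wrong encoding
correct = stored.decode("utf-8")         # reader is told the real encoding

print(garbled)
print(correct)
```

The bytes on disk are identical in both cases; only the encoding assumed by the reader differs, which is why a declaration inside the file is enough to fix the display.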

Comment 3

•

11 years ago

The EML file has nothing to do with the link to the web page; it is merely the feed file's content tag and metadata, and is always decoded to UTF-8, as already stated. The web page is saved, in both Tb and Fx, by Save as link, which is a direct save of the publisher's HTML file. Fx has done a lot of work on charset detection, so this problem likely no longer exists; otherwise the only remaining option is to set the Fallback char encoding in Fx.

Reporter

Comment 4

•

11 years ago

I don't quite agree that the HTML export "is a direct save of the publishers html file", because neither e-mails nor feeds use the HTML format (feeds are actually XML). So in both cases, whether (a) the feed article/e-mail is saved to EML or (b) it's saved to HTML, there is a conversion from the original format (e-mail or XML) to EML or HTML respectively. (Okay, e-mail-to-EML might be a one-to-one export, but in all other cases there is already a conversion.)

The feed (or e-mail) in question DOES already provide the encoding info; see e.g. the http://www.auswaertiges-amt.de/SiteGlobals/Functions/RSSFeed/DE/RSSNewsfeed/RSS_Reisehinweise.xml feed. It's there both as the HTTP "Content-Type" server header and at the top of the XML file. The e-mails I tested also had the encoding set correctly.

So, if this is not a "direct save" but a conversion to HTML, how about saving it correctly, e.g. also "always in UTF8", as is already the case with EML? As a user, if I'm satisfied with how the e-mail/feed article is displayed (which, as you state, is always decoded to UTF8) and decide to export it to an HTML file, I expect the browser to show it exactly the way I saw it in TB. So let it always be UTF8 there as well. Would that work?

P.S.: The problem still exists when the saved HTML file is opened in FF 32.0, and a user may use other browsers as well. So, to help browsers determine the character encoding, including it in the saved HTML file would be very useful. It is also required according to http://www.w3.org/TR/html4/charset.html#doc-char-set, which states that "To promote interoperability, SGML requires that each application (including HTML) specify its document character set."

Also needed is the HTML version used, as suggested under "Additional proposals" in the bug description. It is required for valid HTML files according to http://www.w3.org/TR/html4/struct/global.html#h-7.2, which states that "a valid HTML document declares what version of HTML is used in the document." The HTML files currently produced by TB are therefore not valid. The HTML files produced from feeds (and e-mails) by TB also fail the markup validation at http://validator.w3.org/check for various other reasons, even when the source feed itself validates successfully.

P.P.S.: As I mentioned, as a user, I think one expects the saved HTML file to display the same as in TB anyway, not the raw unmodified e-mail/feed data. If one wanted the latter, they would need to open the feed URL in a browser and look at its source XML there. (Or use the e-mail source view in TB in the case of e-mails.)

Reporter

Comment 5

•

11 years ago

Please also note that I use TB as my feed reader, not Firefox. (I remember that there is some difference, with the one in FF being newer, but I don't use it.) I'm referring to using the "Save as..." with feed articles as I see them in TB under "Blogs & News Feeds", where they appear similarly to how e-mails appear under their respective account.

Reporter

Comment 6

•

11 years ago

Because I assume that @alta88 mistook this proposal for referring to the feed/e-mail handling in Firefox instead of Thunderbird, I'm going to tentatively re-open it.

Status: RESOLVED → UNCONFIRMED

Resolution: INVALID → ---

Comment 7

•

11 years ago

There is no mistake. Do not reopen this. Your misunderstanding here is quite large, and this is not a support forum.

Status: UNCONFIRMED → RESOLVED

Closed: 11 years ago → 11 years ago

Resolution: --- → INVALID

Reporter

Comment 8

•

11 years ago

The "feed article" throughout this bug report refers to the feed article *summary*, by the way, not the web page. Also, both the e-mail and the feed article summary are viewed in text-only mode, not as HTML, and are then saved as HTML.

Comment 9

•

11 years ago

That is a big clarification. The request was read as though it was for unencoded feed web pages and extrapolated to links related to mail. So the large misunderstanding is mine. Yes, if Tb is creating an html file, it should publish the encoding.

Status: RESOLVED → REOPENED

Component: Feed Reader → Backend

Ever confirmed: true

OS: Windows XP → All

Hardware: x86 → All

Resolution: INVALID → ---

Version: 15 → unspecified

Comment 10

•

11 years ago

This is simple to knock off. Originally it was going to apply only to feeds, then to non-multipart email where the charset wasn't wrong. But whatever late-90s reasons may have existed to save HTML with original encodings (likely ripped out now by henri) no longer exist; UTF-8 all the way. Bizarrely, mails saved as .txt have always been UTF-8. (So even chiaki's email works in HTML now: https://bugzilla.mozilla.org/attachment.cgi?id=8499388) A real appname/version for the generator would be nice, but someone would have to feel strongly enough to create a util func to implement extIApplication.

Comment 11

•

11 years ago

Attached patch htmlSaveEncoding.patch — Details — Splinter Review

Assignee: nobody → alta88

Attachment #8500726 - Flags: review?(kent)

Kent James (:rkent)

Comment 12

•

11 years ago

Comment on attachment 8500726 [details] [diff] [review] htmlSaveEncoding.patch I'd like to be able to help out with this review, but I have not really been following the character set saga as closely as some others. jcranmer is really busy now, but he is the best reviewer, and he is the formal peer on this module. So I will defer to him unless he throws it back.

Attachment #8500726 - Flags: review?(kent) → review?(Pidgeot18)

Reporter

Comment 13

•

11 years ago

@alta88: Thanks for the patch. Just a few questions, after trying the newly added lines on http://validator.w3.org/check (using the direct text input field) with a change notification e-mail for this implementation proposal, which was received and viewed as text-only:

1. Are you sure it's HTML5 and not HTML4? Because the e-mail header table ("From", "Subject", etc.) has "width", etc. defined, which the validator complains about when checking as HTML5 ("use CSS instead"), among other things.

2. It seems that the "/>" at the end of the new lines should be a ">". Is that true?

The validator seems to have fewer errors (only 2 on HTML4.01 instead of 11 on HTML5) when I used the following declarations instead (with charset declaration for HTML4.01 and removed forward slashes at the tag end):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
[...]
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="generator" content="Mozilla Mailnews">

P.S.: I don't know if the "HTML 4.01 Transitional" could be "HTML 4.01 Frameset" instead, as I don't know if HTML saved by Thunderbird can contain framesets or not.

-------------------------

P.P.S.: Here is the unmodified e-mail-to-HTML export used for testing (e-mail addresses modified, hopefully no other modification/filtering on comment submission):

<html>
<head>
<title>[Bug 792270] Add encoding information ("charset", etc.) to e-mails and feed articles saved as HTML files</title>
<link rel="important stylesheet" href="chrome://messagebody/skin/messageBody.css">
</head>
<body>
<table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part1"><tr><td><b>Betreff: </b>[Bug 792270] Add encoding information ("charset", etc.)
to e-mails and feed articles saved as HTML files</td></tr><tr><td><b>Von: </b>"Bugzilla@Mozilla" <test@example.com></td></tr><tr><td><b>Datum: </b>07.10.2014 01:35</td></tr></table><table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part2"><tr><td><b>An: </b>test@example.com</td></tr></table><br> <div class="moz-text-plain"><pre wrap> Do not reply to this email. You can add comments to this bug at <a class="moz-txt-link-freetext" href="https://bugzilla.mozilla.org/show_bug.cgi?id=792270">https://bugzilla.mozilla.org/show_bug.cgi?id=792270</a> Kent James (:rkent) <a class="moz-txt-link-rfc2396E" href="mailto:test@example.com"><test@example.com></a> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |<a class="moz-txt-link-abbreviated" href="mailto:test@example.com">test@example.com</a> Attachment|review?(<a class="moz-txt-link-abbreviated" href="mailto:test@example.com">test@example.com</a>) |review?(<a class="moz-txt-link-abbreviated" href="mailto:test@example.com">test@example.com</a> #8500726 Flags| |) --- Comment #12 from Kent James (:rkent) <a class="moz-txt-link-rfc2396E" href="mailto:test@example.com"><test@example.com></a> 2014-10-06 16:35:46 PDT --- Comment on attachment 8500726 [details] [diff] [review] --> <a class="moz-txt-link-freetext" href="https://bugzilla.mozilla.org/attachment.cgi?id=8500726">https://bugzilla.mozilla.org/attachment.cgi?id=8500726</a> htmlSaveEncoding.patch I'd like to be able to help out with this review, but I have not really been following the character set saga as closely as some others. jcranmer is really busy now, but he is the best reviewer, and he is the formal peer on this module. So I will defer to him unless he throws it back. 
<div class="moz-txt-sig">-- Configure bugmail: <a class="moz-txt-link-freetext" href="https://bugzilla.mozilla.org/userprefs.cgi?tab=email">https://bugzilla.mozilla.org/userprefs.cgi?tab=email</a> ------------------------------- Product/Component: MailNews Core :: Backend ------- You are receiving this mail because: ------- You voted for the bug. You reported the bug. </div></pre></div></body> </html>

Reporter

Comment 14

•

11 years ago

Please note that I didn't test the patch. I just added the three new declarations from nsMimeHtmlEmitter.cpp to the HTML source produced by unpatched TB.

Comment 15

•

11 years ago

I don't think html5 or html4x doctype practically matters to a browser (as long as there is a doctype, and I'm not even sure quirks mode exists in modern browsers) and only matters to a validator of yesteryear. The html emitter is also used for message pane display/print preview, thus the useless existing embedded links saved to file and non validating tag attributes used for internal css. Libmime is actively being replaced by jsmime, and it is a non goal to pull the endless threads of libmime issues. However, the original patch had html4x and non self closing tags etc, and it would be easy enough to do that fwiw.

Reporter

Comment 16

•

11 years ago

@alta88: Thanks for the explanation. The only downside of using the new charset attribute introduced in HTML5 seems to be that some non-browser applications do not (yet) understand it, e.g. the LibreOffice Writer office software in its current release 4.3.2. However, LibreOffice Writer seems to assume UTF-8 when it doesn't recognize the charset declaration. So, assuming that TB with the provided patch always produces UTF-8 encoded HTML, it should be displayed correctly in LibreOffice even if the declaration isn't recognized. Just a thought on other applications: the "old-style" http-equiv charset declaration (even inside a document with the HTML5 version declaration) would ensure compatibility with both older and newer applications, because it's explicitly defined as an alternative in the HTML5 spec: http://www.w3.org/TR/html5/document-metadata.html#character-encoding-declaration. (As, it seems, are the non-self-closing tags, which are all over the "meta" element code examples there.)
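To make the two declaration styles discussed in this comment concrete, here is a rough Python sketch that looks for either form in a saved HTML file. It is a regex heuristic written for this illustration, not how any browser, LibreOffice, or Thunderbird detects encodings. The function name `declared_charset` is hypothetical, and the patterns assume the attribute order used in the examples in this bug.

```python
import re
from typing import Optional

# HTML5 form: <meta charset="UTF-8">
_META_CHARSET = re.compile(
    r'<meta\s+charset\s*=\s*["\']?([\w.:-]+)',
    re.IGNORECASE,
)

# Older form: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
_META_HTTP_EQUIV = re.compile(
    r'<meta\s+http-equiv\s*=\s*["\']?content-type["\']?\s+'
    r'content\s*=\s*["\'][^"\']*charset\s*=\s*([\w.:-]+)',
    re.IGNORECASE,
)


def declared_charset(html: str) -> Optional[str]:
    """Return the charset named in a <meta> declaration, or None.

    Only the two attribute layouts shown above are recognized;
    a real HTML parser would be needed for anything more general.
    """
    for pattern in (_META_CHARSET, _META_HTTP_EQUIV):
        match = pattern.search(html)
        if match:
            return match.group(1).lower()
    return None


print(declared_charset('<head><meta charset="UTF-8"/></head>'))
print(declared_charset(
    '<head><meta http-equiv="Content-Type" '
    'content="text/html; charset=UTF-8"></head>'
))
print(declared_charset('<head><title>no declaration</title></head>'))
```

An application that only understands the older form would behave like a detector with just the second pattern, which is the compatibility argument made above for emitting the http-equiv variant.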

Comment 17

•

11 years ago

Attached patch htmlSaveEncodingSpec.patch — Details — Splinter Review

Legacy apps are a valid point, and of course the spec cannot be argued against (even if one wants to avoid boilerplate which is mostly irrelevant and the fact that there was never a doctype to begin with). Original attached (which also avoids printing some useless lines); review can decide what to go with.

Comment 18

•

11 years ago

(In reply to alta88 from comment #15)
> I don't think html5 or html4x doctype practically matters to a browser (as
> long as there is a doctype, and I'm not even sure quirks mode exists in
> modern browsers)

HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA HA

Yes, doctypes do matter. Yes, quirks mode still exists.

Comment 19

•

11 years ago

Comment on attachment 8500726 [details] [diff] [review]
htmlSaveEncoding.patch

Review of attachment 8500726 [details] [diff] [review]:
-----------------------------------------------------------------

So, I'm going to admit that I haven't bothered to test this patch. The problem with libmime is that code paths like this affect *a lot* of components, including display, save as, edit as new, and even reply or forward. My concern is that without testing all of these paths in detail to ensure that they're not affected, a change like this is highly dangerous. And the patch author does not appear to have done sufficient testing.

::: mailnews/mime/emitters/nsMimeHtmlEmitter.cpp
@@ +286,5 @@
> nsMimeHtmlDisplayEmitter::EndHeader(const nsACString &name)
> {
> if (mDocHeader && (mFormat != nsMimeOutput::nsMimeMessageFilterSniffer))
> {
> + UtilityWriteCRLF("<!DOCTYPE html>");

This change makes unconditional HTML5 doctype, which is highly dangerous given the state of HTML in email.

@@ +291,4 @@
> UtilityWriteCRLF("<html>");
> UtilityWriteCRLF("<head>");
> + UtilityWriteCRLF("<meta charset=\"UTF-8\"/>");
> + UtilityWriteCRLF("<meta name=\"generator\" content=\"Mozilla Mailnews\"/>");

Why are you adding a generator with such an incorrect value? If you can't be bothered to make a correct value, then don't add it in the first place.

Attachment #8500726 - Flags: review?(Pidgeot18) → review-

Comment 20

•

11 years ago

Sorry "Joe", looks like a no go. And dealing with reviewers with bad hair days is also a no go.

Assignee: alta88 → nobody

Reporter

Comment 21

•

11 years ago

No worries. Thanks for giving this a try. I guess this will have to wait for jsmime to arrive to the affected parts.

Comment 22

•

11 years ago

For the record, the view in comment 15 is supported in several places on stackoverflow, like here: http://stackoverflow.com/questions/10030688/is-it-safe-to-change-to-the-html5-doctype So it seems the reviewer's grasp of doctype is wholly inadequate. Likewise with libmime, as evidenced by the throwing up a lot of false scary scenarios. Smells like territoriality combined with couldn't be bothered.

Comment 23

•

11 years ago

(In reply to alta88 from comment #22)
> For the record, the view in comment 15 is supported in several places on
> stackoverflow, like here:
> http://stackoverflow.com/questions/10030688/is-it-safe-to-change-to-the-html5-doctype
>
> So it seems the reviewer's grasp of doctype is wholly inadequate. Likewise
> with libmime, as evidenced by the throwing up a lot of false scary
> scenarios. Smells like territoriality combined with couldn't be bothered.

Your grasp of doctype is incorrect. The question you referred to was about the exact format of the standards-mode doctype. Presently, libmime does not use a doctype--and thus it uses quirks mode. The addition of a doctype switches to standards mode. If you don't trust my word, perhaps you should read the WHATWG living mode specification as I do. Or maybe you'd prefer to slander me because you don't like my code and my opinions.

Comment 24

•

11 years ago

What I find absolutely unacceptable is your allcaps attempt at ridicule, which you have exhibited in several other bugs. You are slandering yourself and if you are incapable of professional behavior in a review role, then remove yourself. If you are arguing that save as html should (erroneously as an ancient artifact) remain without doctype, to be rendered in quirks mode, then you need to justify why.

Comment 25

•

11 years ago

(In reply to alta88 from comment #24)
> If you are arguing that save as html should (erroneously as an ancient
> artifact) remain without doctype, to be rendered in quirks mode, then you
> need to justify why.

You need to justify why such a change:
a) would not affect legacy HTML (e.g., the crap HTML produced by Outlook), or
b) why breaking legacy HTML is an acceptable cost/benefit trade-off.

MIME is a standard where many points are observed more in the breach. It's implemented by a library which was designed in 1997 and thenceafter mostly maintained by hacking everything on increasingly large piles of hacks. And libmime is basically functionally untested.

In light of that, conservatism is basically the default mode: you need to justify the changes rather than needing to justify the status quo. Not saving to UTF-8 is clearly (to me) an example of where the status quo is wrong, and thus that is safe to (and needs to) change. Saving an HTML charset information on output-UTF-8 files satisfies the "not affecting legacy" bit. But adding a doctype to use standards mode where we use quirks mode is not so clearly justified, and this needs its own justification.

As a prior contributor to Thunderbird, I fully expect you to be capable of doing due diligence. The actions you performed in this bug clearly did not show the due diligence expected, which is why I rejected the patch.
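The quirks-mode point argued in comments 23 and 25 can be illustrated with a deliberately simplified Python sketch. It only checks whether the document starts with a doctype: a file with no doctype, like the current libmime output, is rendered in quirks mode, while `<!DOCTYPE html>` selects standards (no-quirks) mode. Every other doctype is reported as undetermined, because the real rules depend on the public and system identifiers and are not reproduced here. The function name `rendering_mode_hint` and its return strings are invented for this example.

```python
import re

# A doctype at the very start of the document (leading whitespace allowed).
# Comments before the doctype are not handled in this sketch.
_DOCTYPE = re.compile(r'^\s*<!doctype\s+([^>]*)>', re.IGNORECASE)


def rendering_mode_hint(html: str) -> str:
    """Give a rough hint about the rendering mode a browser would pick."""
    match = _DOCTYPE.match(html)
    if match is None:
        # No doctype at all: browsers fall back to quirks mode.
        return "quirks"
    if match.group(1).strip().lower() == "html":
        # The short HTML5 doctype selects standards (no-quirks) mode.
        return "no-quirks"
    # Legacy doctypes can map to any mode; not decided in this sketch.
    return "depends on doctype"


unpatched = "<html>\n<head>\n<title>t</title>\n</head>\n<body></body>\n</html>"
patched = "<!DOCTYPE html>\n" + unpatched

print(rendering_mode_hint(unpatched))
print(rendering_mode_hint(patched))
```

This is why adding the doctype line is a behavioral change for every consumer of the emitter output, whereas adding only a charset declaration leaves the mode untouched.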

Comment 26

•

11 years ago

I will also add (I forgot about it until just now) that HTML in email is mostly used for compatibility with GUI document editors (e.g., Microsoft Word, LibreOffice Writer), which tend (last I checked; admittedly it has been a few years since I had hard data) to use very old, very outdated HTML rendering engines in comparison with web browsers. So one could argue that you care more about MS Word compatibility with the HTML than Firefox Nightly-level compatibility. And ISTR reading at one point that MS Word was built on the same layout engine as IE 5.5.

Reporter

Comment 27

•

11 years ago

Thanks to everyone involved for discussing this.

(In reply to Joshua Cranmer [:jcranmer] from comment #25)
> MIME is a standard where many points are observed more in the breach. It's
> implemented by a library which was designed in 1997 and thenceafter mostly
> maintained by hacking everything on increasingly large piles of hacks. And
> libmime is basically functionally untested.
>
> In light of that, conservatism is basically the default mode: you need to
> justify the changes rather than needing to justify the status quo. Not
> saving to UTF-8 is clearly (to me) an example of where the status quo is
> wrong, and thus that is safe to (and needs to) change. Saving an HTML
> charset information on output-UTF-8 files satisfies the "not affecting
> legacy" bit. But adding a doctype to use standards mode where we use quirks
> mode is not so clearly justified, and this needs its own justification.

My primary concern, personally, is the correct display of non-English e-mails and feed articles exported to HTML. To proceed conservatively, I would therefore like to suggest that we henceforth ignore my "Additional proposals" (i.e. HTML version and "generator" info) from the original bug report and consider the missing character encoding only. What do you guys think of taking over only the following parts of the second patch by alta88 (https://bugzilla.mozilla.org/attachment.cgi?id=8503622&action=diff)? Sorry, I myself lack the necessary infrastructure for patch creation and implementation:

1. The "charset" declaration (a/mailnews/mime/emitters/nsMimeHtmlEmitter.cpp, line 293):
UtilityWriteCRLF("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">");

2. The conversion to UTF-8 (a/mailnews/mime/src/mimetpla.cpp, line 429):
CopyUTF16toUTF8(lineResultUnichar, outString);

I.e., it will still be the quirks mode.

In the meantime, I did some testing by manually adding the "HTML4-style" charset declaration to the HTML source exported by unpatched Thunderbird, to test whether various applications detect it (in quirks mode). I did it with:
- a utf-8 encoded feed article in German
- a koi8-r (Cyrillic) encoded e-mail

Here are the results:

                          | utf-8 Feed Article  | koi8-r E-mail
                          | charset | charset   | charset | charset
Application               | missing | utf-8     | missing | koi8-r
--------------------------|---------|-----------|---------|--------
Firefox 33.0              | O       | X         | O       | X
Chrome 38                 | O       | X         | O       | X
LibreOffice Writer 4.3.2  | X       | X         | O       | X
LibreOffice Calc 4.3.2    | O       | X         | O       | X
MS Office Word/Excel 2007 | O       | X         | X       | X

All of the above on WinXP. X = displayed correctly; O = non-English characters not displayed correctly.

So, adding the charset declaration ensures correct display in all of the above applications, and all of them seem to be able to cope with quirks mode. Thanks in advance.
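The manual edit used for this testing can be approximated with a short script. The Python sketch below is only an illustration of the idea and is not part of either patch: it follows the "always UTF-8" proposal by re-encoding the exported bytes as UTF-8 and inserting the HTML4-style declaration after the opening `<head>` tag, without adding a doctype. The function name `add_charset_declaration`, the sample markup, and the assumption that the caller knows the source encoding are all hypothetical.

```python
import re

# The HTML4-style declaration proposed in this bug.
META = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'

# First opening <head> tag, with or without attributes (but not <header>).
_HEAD_OPEN = re.compile(r'<head(\s[^>]*)?>', re.IGNORECASE)


def add_charset_declaration(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode exported HTML as UTF-8 and declare that charset.

    source_encoding must be supplied by the caller; this sketch does not
    attempt any detection. If no <head> tag is found, the markup is only
    re-encoded and no declaration is inserted.
    """
    text = raw.decode(source_encoding)
    patched = _HEAD_OPEN.sub(
        lambda m: m.group(0) + "\n" + META, text, count=1
    )
    return patched.encode("utf-8")


# Hypothetical koi8-r export with a Cyrillic title and no declaration.
sample = (
    "<html>\n<head>\n<title>\u041f\u0440\u0438\u0432\u0435\u0442</title>\n"
    "</head>\n<body></body>\n</html>"
)
exported = sample.encode("koi8-r")

result = add_charset_declaration(exported, "koi8-r")
print(result.decode("utf-8"))
```

Because no doctype is added, a file produced this way would still be rendered in quirks mode, matching the conservative scope suggested above.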