Closed Bug 657286 Opened 13 years ago Closed 13 years ago

HTML tags are visible in e-mail body for PST imported Outlook messages originally received by Exchange 2003

Categories

(MailNews Core :: Import, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 207156

People

(Reporter: kiddm_mozilla, Unassigned)

Details

(Keywords: testcase)

Attachments

(8 files)

User-Agent:       Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.17) Gecko/20110414 Thunderbird/3.1.10

Many of my imported HTML e-mails appear as plaintext and show the HTML markup.

I exported e-mail folders from an Exchange 2010 server using Outlook 2007, saving the files in the default PST format (i.e. not the older 2002 format). I then imported these folders into Outlook 2003 on another box (where the e-mail still looks fine) and then imported those folders into Thunderbird using the Tools (menu) --> Import (menu) --> Mail (radio button) --> etc.

First, all messages originally composed by me look fine (up to minor nitpicks). This makes sense because my Outlook settings were to compose e-mail in HTML format and they were presumably saved on the Exchange server as HTML. Second, messages sent to me that were received by Exchange 2010 often look fine. However, HTML messages sent to me and received by Exchange 2003, prior to upgrading Exchange a few months ago, show up in Thunderbird as plaintext with the HTML tags visible.

I am attaching two examples of standard Netflix Disc Ship e-mails, one sent when the Exchange server was 2003 and the other when it was 2010. The files are similar, with two differences. The 2003 version begins with the line 'Microsoft Mail Internet Headers Version 2.0', which is not present in the 2010 version. The 2010 version contains a <!DOCTYPE> declaration and <html>, <head>, and <body> elements, none of which are present in the 2003 version, where the bare HTML content, a <table> element, follows immediately after this <meta> line:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">

Arguably this is a Microsoft-induced problem; they never were very good at writing compliant HTML. But many users migrate from Outlook. Maybe it makes sense to slap a basic wrapper around bare HTML during the import process, if that is what is causing the problem.
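
The "basic wrapper" idea could look roughly like the following. This is only a sketch in Python, not Thunderbird's import code: the function name `wrap_bare_html` and the exact wrapper markup are assumptions made for illustration, and a real importer would probably also move the stray <META> line into the <head>.

```python
import re

# Hypothetical helper for illustration; not part of Thunderbird's importer.
_HTML_ROOT = re.compile(r"<html[\s>]", re.IGNORECASE)


def wrap_bare_html(fragment: str, charset: str = "utf-8") -> str:
    """Return the fragment unchanged if it already has an <html> element,
    otherwise enclose it in a minimal HTML document."""
    if _HTML_ROOT.search(fragment):
        return fragment
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">\n'
        "</head>\n<body>\n"
        f"{fragment}\n"
        "</body>\n</html>\n"
    )
```

Messages that already carry an <html> element, like the Exchange 2010 sample, pass through untouched; only bare fragments such as the Exchange 2003 <table> get wrapped.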

But the problem seems to go beyond non-compliant HTML and Exchange 2003. I see the same issue with the AppleMail.txt attachment which was received by Exchange 2010 and has valid HTML, <!DOCTYPE> and all.

Reproducible: Always
Does it work better if you use this version of Thunderbird to do the import : <http://ftp.mozilla.org/pub/mozilla.org/thunderbird/try-builds/bienvenu@nventure.com-e4866a647ff0/try-win32/thunderbird-3.3a4pre.en-US.win32.installer.exe>

if not

Can you build a .pst containing only this file ?

xref bug 207156
Component: Migration → Import
Product: Thunderbird → MailNews Core
QA Contact: migration → import
The attached file reproduces the Exchange 2003 / 2010 difference on Thunderbird 3.1.10 but curiously does not reproduce the issue cited with Apple Mail. I will try the Thunderbird build cited by Ludovic Hirlimann next.
Behavior appears correct in 3.3a4pre (2011-05-10).
Hmm... maybe not solved. I imported the full folder of Netflix e-mails (~900) into 3.3a4pre and the problem returned.
(In reply to comment #7)
> Hmm... maybe not solved. I imported the full folder of Netflix e-mails
> (~900) into 3.3a4pre and the problem returned.

Can you build a pst with more than one file then ?
When I import this Outlook PST file into 3.3apre4 (2011-05-16), I see the issue for e-mail received 2011-02-08 ("For Wed: Made in Spain: Season 2: Disc 1") and earlier.
Matthew Kidd,
you wrote in comment #6 that you used 3.3a4pre (2011-05-10), and later in comment #9 that you used 3.3apre4 (2011-05-16). But the version that Ludovic pointed to was compiled 10-May-2011 17:25. Note that this version is a tryserver build containing a specific patch for Bug 207156 (not yet on trunk) that is now undergoing testing, and presumably it should solve your problem (which I believe is a duplicate of Bug 395745, which is itself incorrectly marked as a duplicate). Please ensure that you use that specific build, and retest your data.
I will check your attachment as soon as possible, but not earlier than in 10 hours.
What I stated was correct (except 3.3apre4 should be 3.3a4pre). The 2011-05-10 version automatically downloaded an update and I foolishly applied it between comment #6 and comment #7.

Anyway, I just reverted to the 2011-05-06 tryserver build and retested using my full PST file of ~900 Netflix e-mails and everything looks good.

I checked a greater variety of e-mail from other PST folders and things look MUCH better.
Incidentally, the 2011-05-06 tryserver build, specifically the Bug 207156 fix I suppose, corrects another problem: suit symbols are now imported correctly. This will not matter to most users but I am a bridge player and it matters a lot to me.

If one does View Source in Outlook on a message with suit symbols, one does not see nice HTML character entities like &spades; &hearts; &diams; and &clubs;. Rather, a hex editor shows A2 BE where one might expect &hearts;. This value does not correspond to the equivalent Unicode value of &#x2665;, nor does it look like the UTF-8 encoding for Black Heart Suit. But whatever it is, the 2011-05-06 tryserver build seems to be properly converting it to the correct UTF-8 representation of E2 99 A5, as can be checked by Other Actions (menu) --> View Source (menu item), followed by examination in a hex editor.
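
The expected UTF-8 bytes for the suit symbols can be checked independently of any mail client. The snippet below is a small Python sketch using the standard library's HTML entity table; the helper name `entity_utf8_hex` is made up for this example.

```python
from html.entities import name2codepoint


def entity_utf8_hex(entity_name: str) -> str:
    """Return the UTF-8 bytes of a named HTML entity as spaced hex."""
    ch = chr(name2codepoint[entity_name])
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


for name in ("spades", "hearts", "diams", "clubs"):
    print(f"&{name}; U+{name2codepoint[name]:04X} -> {entity_utf8_hex(name)}")
```

For &hearts; this prints U+2665 -> E2 99 A5, which is the byte sequence to look for in the hex editor.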

Hooray!
(I presume you refer to 2011-05-10, not 2011-05-06)
Oh gaaack! I meant to say 2011-05-10 (not 2011-05-06), but I see I now have the 2011-05-16 version! This time I did not apply any updates. It must have automatically applied the downloaded update when I restarted the application. Looking at the options, I see that is in fact the default behavior. I'm going to revert to the 2011-05-10 tryserver version and shut off the auto updating.

Then I have to re-import some new test folders where I thought I was seeing new issues since they probably got imported with 2011-05-16.
Then I ask you to test everything you want with the 2011-05-10 build, note the result, and then test these same things with "updated" unpatched build so to be able to tell if a problem is solved by the patch, or is solved outside of it, or isn't solved at all.
I've now imported ~10,000 e-mails using the 2011-05-10 tryserver build and after looking through ~1000 of the imported e-mails I have not seen any with raw HTML in any message body. It is looking really good. The 2011-05-15 and 2011-05-16 versions definitely had many messages with raw HTML in the body. It seems that the patch has solved the core problem.

Moreover, the imported e-mails look right in terms of fonts, colors, indentation, bullet points, special symbols, attachments, functioning hyperlinks.

However, there are a couple of issues. First, old e-mail that I originally composed in RTF format (prior to switching to HTML format), is imported as plaintext. Ideally it would be converted to HTML. But all that e-mail is from several years ago and I personally can't be bothered much about the issue.

But there is one problem with the RTF to plaintext conversion. \rquote RTF symbols seem to get dropped instead of being converted to a single quote in the plaintext version. I'll attach an example.

Second, the arrow character that one gets by typing two consecutive dashes followed by a greater-than symbol (-->) in Microsoft Word (provided "Replace text as you type" is checked under the AutoCorrect tab) gets munged after import. As far as I can tell the HTML that Word emits for this character looks like:

<font size=3 face=Wingdings><span style='font-size:12.0pt;
font-family:Wingdings;mso-ascii-font-family:"Times New Roman";mso-hansi-font-family:
"Times New Roman";mso-char-type:symbol;mso-symbol-font-family:Wingdings'><span
style='mso-char-type:symbol;mso-symbol-font-family:Wingdings'>&agrave;</span></span></font>

I don't know why the &agrave; looks like a right arrow in the Wingdings font, but after it is imported, it just looks like a normal &agrave; character entity. It would be reasonable to just blame Microsoft for this one. Or it could be special cased to be converted to an &rarr; entity. I run into this issue because I have messages where I explain how to navigate menus, buttons, tabs, etc. to change settings. There are related cases from the auto-correction of <-- ==> and <== .
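
The special-casing suggested above might be sketched like this in Python. The table and function names are hypothetical, and only the one glyph reported in this bug is filled in; entries for the other auto-corrected arrows would have to be looked up and added.

```python
# Hypothetical lookup table: code point as it appears in the HTML -> Unicode
# replacement. Only the case reported in this bug is included.
WINGDINGS_TO_UNICODE = {
    0xE0: "\u2192",  # &agrave; shown in Wingdings looks like a right arrow
}


def remap_symbol_char(ch: str, font_family: str) -> str:
    """Replace a character styled with a symbol font by a Unicode equivalent.

    Characters outside the table, or in any other font, are returned as-is.
    """
    if font_family.strip().lower() != "wingdings":
        return ch
    return WINGDINGS_TO_UNICODE.get(ord(ch), ch)
```

The lookup is keyed on the font because the same &agrave; must stay an ordinary accented letter whenever it is not styled as Wingdings.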
Example RTF for an e-mail where \rquote RTF symbol is dropped when RTF is converted to plaintext during import. Correct behavior is to convert to a single quote character. Note: \rquote is part of the "I'm" that leads off the e-mail, i.e. I\rquote m in RTF.
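
The expected behavior can be illustrated with a short Python sketch. It is not the importer's RTF parser: it handles only a handful of symbol control words, ignores escaped backslashes, and the names `RTF_SYMBOLS` and `rtf_symbols_to_text` are invented for the example.

```python
import re

# Assumed subset of RTF symbol control words and plaintext stand-ins.
RTF_SYMBOLS = {
    "rquote": "'",
    "lquote": "'",
    "rdblquote": '"',
    "ldblquote": '"',
    "endash": "-",
    "bullet": "*",
}

# A control word is a backslash, lowercase letters, an optional numeric
# parameter, and at most one delimiting space that belongs to the word.
_CONTROL_WORD = re.compile(r"\\([a-z]+)(-?\d+)? ?")


def rtf_symbols_to_text(rtf: str) -> str:
    """Replace known symbol control words; leave all other text untouched."""
    def replace(match: re.Match) -> str:
        return RTF_SYMBOLS.get(match.group(1), match.group(0))
    return _CONTROL_WORD.sub(replace, rtf)
```

With this, the RTF text `I\rquote m` becomes `I'm`, because the space after `\rquote` is the control word's delimiter and is consumed along with it.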
Matthew Kidd,
please attach a PST with messages with RTF and -->.
The \rquote issue should and can be fixed easily, while the other issues you describe are rather difficult to handle.
1. An RTF->HTML converter is above my skills. I have created the RTF parser that can either extract the (hidden) saved HTML, or convert it to plaintext. Generally, conversion between document formats is an enormously difficult problem that is almost never solved 100% (e.g., you may try to convert any .doc or .odt with rich markup to HTML using their native editors, and inspect the result). However, it _may_ be appropriate to ask Outlook to provide its HTML version in this case. I need a testcase to check it out, and I never met one.
2. The symbol fonts that MS coded in its proprietary way are a well-known issue. E.g., the OpenOffice.org/LibreOffice suite has problems because of this (as it is cross-platform and must use universal standards, not some program's arbitrary decision). However, in this specific case, we may develop some reasonable technique to handle this. In order to achieve this, we need to map any symbol character that we may encounter to a Unicode codepoint, or (if it's inside HTML) to the HTML entity. I need help in this area, as I am unable both to find all these mappings, and to devote enough time to this.
(And please attach the \rquote as PST)
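
For the second half of point 2 (Unicode codepoint to HTML entity), Python's standard library already ships the HTML 4 entity table, so that step could be sketched as below. The function name `to_html_entity` is an assumption for illustration; the hard part, the symbol-font-to-Unicode mapping itself, is not covered here.

```python
from html.entities import codepoint2name


def to_html_entity(ch: str) -> str:
    """Return a named HTML entity for ch if one exists, else a numeric one."""
    codepoint = ord(ch)
    name = codepoint2name.get(codepoint)
    return f"&{name};" if name else f"&#x{codepoint:X};"
```

A right arrow (U+2192) comes out as &rarr;, while a character without a named HTML 4 entity, such as a chess piece, falls back to a numeric reference.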
Yes, I understand full well the RTF --> HTML conversion headaches. Though RTF --> HTML should in principle be possible, while HTML --> RTF is necessarily incomplete, I've noticed that Microsoft never even tried. Within Outlook, you can only convert an HTML or RTF message to plaintext and plaintext to HTML or RTF.

Thus, when the message is originally composed in RTF, I don't think there will be an HTML version for Outlook to provide, except perhaps when Outlook was set up to always send HTML to one or more of the recipients (maybe Microsoft *does* have an RTF to HTML converter hidden away somewhere).
Keywords: testcase
(In reply to comment #21)
> Though RTF --> HTML should in principle be possible
I disagree. RTF was maintained to be capable of representing the full MS Word repertoire up to (and including) v2007. I would rather say that it's quite a task to create a decent converter that is somehow able to represent in HTML such features as versioning, page headers/footers, footnotes, etc. Furthermore, the RTF spec allows anyone to extend the specification by adding their custom RTF tags. So I see no possibility of creating a complete RTF --> HTML converter.
However, this is just a theory. I personally would be happy to see that I'm wrong.

Could you please add a testcase with Wingdings -->? I'm not sure if this issue will be fixed at all, but it definitely should be inspected.

And please keep informing us if you find more cases where RTF tags are ignored, like that \rquote issue. As you seem to have a rich collection of such messages, and your computer expertise is high, you are the one who can contribute a lot in this area.
All the Microsoft Word auto-correct generated arrow-like symbols seem problematic.

The letter-like symbols (copyright, registered, and trademark) seem to be working just fine.

For good measure I threw in the Unicode chess pieces and they also look good.
Today David :Bienvenu has landed the fix for Bug 207156 on trunk. The fix for the \rquote is included.
(In reply to comment #24)
> Today David :Bienvenu has landed the fix for Bug 207156 on trunk. The fix
> for the \rquote is included.

So what's left mike ?
(In reply to comment #25)
> So what's left mike ?

Well, as to this specific bug, it's clearly a dup of Bug 395745, and it's clear from Comment #16, it's solved by the patch.

The \rquote issue is patched (but as I currently have no access to a development machine, I cannot test its resolution). It's entirely possible, and to be expected, that similar issues will arise and will be patched the same way.

The RTF to HTML conversion is only possible if MS has provided a means for it. As I said, I cannot check it right now. But anyway, it's a totally different issue, and I would say it should be filed under a separate enhancement request.

The "Wingdings" issue is separate, too.

So I think this bug is done. If it turns out that \rquote isn't working, then I think it should be posted to Bug 207156.
Status: UNCONFIRMED → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE