Imported Japanese Email from Outlook shows as garbled text

RESOLVED FIXED in Thunderbird 3.3a1

Status

MailNews Core
Import
RESOLVED FIXED
8 years ago
7 years ago

People

(Reporter: dythim, Assigned: Jorg K (GMT+2))

Tracking

(Blocks: 1 bug)

unspecified
Thunderbird 3.3a1
x86
Windows Vista

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [gs], URL)

Attachments

(3 attachments, 6 obsolete attachments)

(Reporter)

Description

8 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.89 Safari/532.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.0; ja; rv:1.9.1.7) Gecko/20100111 Thunderbird/3.0.1

When migrating from Outlook 2007 from Thunderbird 3.01 (both Japanese),
using the Tools -> Import to retrieve Outlook messages apparently succeeded, 
but when viewing the imported messages in Thunderbird the text shows as garbage.

The character set is auto-detected as Shift_JIS, and ISO-2022-JP must be manually selected in order to read the messages properly.

Newly sent/received mail does not seem to have the same problem -- it is correctly auto-detected as ISO-2022-JP.

Reproducible: Always

Steps to Reproduce:
1. Have existing Outlook Japanese e-mail.
2. Install Thunderbird 3.0.1 Japanese version.
3. Select Tools -> Import (ツール→設定とデータのインポート)
4. Select Mailbox (メールボックス) from the Import window, click Next (次へ), select Outlook as the type, press Next (次へ) to execute.
5. Import completes automatically, no problems reported.
6. Open one of the imported messages, it shows as garbage.
7. Change the View -> Character Encoding (表示→文字エンコーディング) from Shift_JIS (日本語(Shift_JIS) to (日本語(ISO-2022-JP)).
8. Message shows as normal Japanese. 
Actual Results:  
Thunderbird detects character encoding of imported messages as Shift_JIS, which results in garbled text.

Expected Results:  
Thunderbird should be able to detect that the messages are encoded in ISO-2022-JP.

All mail sent and received after the migration is correctly auto-detected as ISO-2022-JP.  It seems like a problem with the import.

Also, switching between tabs while an imported message is displayed causes the message to be refreshed, and regarbled.
Could you attach sample *.pst file to reproduce this?
(Reporter)

Comment 2

8 years ago
Created attachment 427682 [details]
PST containing a single Japanese-encoded message.
(Reporter)

Comment 3

8 years ago
I uploaded a PST file, but I was not able to force Thunderbird to import from it...  It would only import directly from Outlook.
Thank you!.  I confirmed on 3.1 alpha1.

When Tb imports your mailbox as HTML mail, mail data is following.  Although charset of Content-type is Shift_JIS, but charset of HTML is iso-2022-jp.


 :
Content-Type: text/html; charset=Shift_JIS
 :

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp">
 :
 :
Status: UNCONFIRMED → NEW
Ever confirmed: true
related to bug 378008 (same) and bug 505072 (wrong fix for this).

When I open HTML source via Outlook, there is no charset...  If we import HTML mail data from Outlook, I should remove charset of header, maybe...
(Assignee)

Comment 6

8 years ago
This is a problem with Korean messages, too. For example this is the source of a message I imported from Outlook:

This is a multi-part message in MIME format.
--------------040206010205070609040902
Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: 8bit

Her name is Yena Ryu and that&#39;s 예나 류 in Korean<br><br>

===============
Where does the Content-Type come from???

The message looks OK when viewed as UTF-8, but when viewed with the default windows-1252, it looks like this:
Her name is Yena Ryu and that's 예나 류 in Korean

This is not only a problem with Asian text. Even German text with Umlaut (äöü) gets garbled:

This is a multi-part message in MIME format.
--------------090001070201040804030900
Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: 8bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Ihre simyo Rechnung</TITLE>
<META content="text/html; charset=utf-8" http-equiv=Content-Type><!-- Title ON --><!-- Title OFF -->

Later in the body of the message:
  der Rechnungsbetrag für die 
which when viewed at the default looks like this:
  der Rechnungsbetrag für die

========

So once again, is there away to detect the default encoding for the message a little better?

If it's a HTML message, it could be taken from the charset of the HTML header.
(Assignee)

Comment 7

7 years ago
The problem is here in the code:

http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/nsOutlookCompose.cpp#607.

The import tries to get the charset from the header, if none is there, it defaults to the system character set, for Windows machines with an English version of Windows, that's normally "windows-1251".

Instead, what should be done is extract the charset from the appropriate HTML header.
(In reply to comment #7)
> The problem is here in the code:
> 
> http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/nsOutlookCompose.cpp#607.
> 
> The import tries to get the charset from the header, if none is there, it
> defaults to the system character set, for Windows machines with an English
> version of Windows, that's normally "windows-1251".
> 
> Instead, what should be done is extract the charset from the appropriate HTML
> header.

That's right.  we need a meta element parser like nsMsgI18NParseMetaCharset() to outlook importer.
Assignee: nobody → mozilla
Duplicate of this bug: 378008
(Assignee)

Comment 10

7 years ago
Created attachment 442378 [details]
This is how the test case looks like after I imported it with my fixed version.
(Assignee)

Comment 11

7 years ago
Created attachment 442379 [details] [diff] [review]
Patch to header file
Attachment #442379 - Flags: review?(bienvenu)
(Assignee)

Comment 12

7 years ago
Created attachment 442380 [details]
Patch to CPP file.

I'd like to request that this fix be rolled into the source base quickly since I want to continue working on another bug (bug 558653) which will require changes to this file.
Attachment #442380 - Flags: review?(bienvenu)
(Assignee)

Comment 13

7 years ago
Created attachment 442382 [details] [diff] [review]
Patch to CPP file. Oops, didn't check the box.

I'd like to request that this fix be rolled into the source base quickly since
I want to continue working on another bug (bug 558653) which will require
changes to this file.
Attachment #442380 - Attachment is obsolete: true
Attachment #442382 - Flags: review?(bienvenu)
Attachment #442380 - Flags: review?(bienvenu)
Can you provide both .h and cpp changes in the same patch ?

Comment 15

7 years ago
Jorg, thx for working on this.
Yes, one patch per bug fix, and you might want to look into Mercurial queues as a way to manage multiple patches to the same files. https://developer.mozilla.org/en/Mercurial_Queues

We can't land anything until the tree re-opens, so you're going to have to be patient...
(Assignee)

Comment 16

7 years ago
Created attachment 442493 [details] [diff] [review]
Combined patch to header and CPP file
Attachment #442379 - Attachment is obsolete: true
Attachment #442382 - Attachment is obsolete: true
Attachment #442493 - Flags: review?(bienvenu)
Attachment #442379 - Flags: review?(bienvenu)
Attachment #442382 - Flags: review?(bienvenu)
(Assignee)

Comment 17

7 years ago
(In reply to comment #15)
> Yes, one patch per bug fix, and you might want to look into Mercurial queues as
> a way to manage multiple patches to the same files.
> https://developer.mozilla.org/en/Mercurial_Queues

I prefer not to spend time on config management. Once this patch has been included into the source base, I will submit the patch for bug 558653.

Updated

7 years ago
Blocks: 564148
(Assignee)

Updated

7 years ago
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Jorg K, something's amiss here - the bug got marked fixed, but there's no checkin mentioned and the patch doesn't yet have review approval
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 19

7 years ago
Sorry, looks like I did the wrong thing. I just wanted to give this a little nudge.
David , ping for the review.

Updated

7 years ago
Attachment #442493 - Flags: review?(bienvenu) → review?(bugzilla)
Comment on attachment 442493 [details] [diff] [review]
Combined patch to header and CPP file

Neil, I'm happy to test this, but can you comment on the string related stuff first (and if they will be fine with external api).
Attachment #442493 - Flags: superreview?(neil)
Comment on attachment 442493 [details] [diff] [review]
Combined patch to header and CPP file

Well, we haven't started on porting mailnews/import to the external API (mainly because a Linux guy has been doing most of the work). But I'll comment anyway.

>+  PRInt32 idx_endhead = str.Find( "/head", PR_TRUE);
>+  PRInt32 idx_charset = str.Find( "charset=", PR_TRUE);
[This happens to be the one usage of Find that is directly portable. Which is a shame because we probably should be using nsCString here.]

>+  str.Right( tStr, str.Length() - idx_charset);
Right isn't available in the external API. Instead you should use
tStr = Substring(str, idx_charset)
Or you can use an offset to your FindChar functions and use the three-argument version of substring.

>+  PRInt32 idx  = tStr.FindChar( ';');
>+  PRInt32 idx0 = tStr.FindChar( '"');
>+  PRInt32 idx1 = tStr.FindChar( ' ');
This should use FindCharInSet. (Well, it should use MsgFindCharInSet, but that's currently broken.) Conveniently FindCharInSet accepts an initial offset parameter, so you could write
PRInt32 idx = str.FindCharInSet(";\" ", idx_charset);
[this would of course return an offset from the start of the string]

>+    tStr.Left( str, idx);
str = StringHead(tStr, idx);
Or if you didn't create tStr,
str = Substring(str, idx_charset, idx - idx_charset);

>+  int recheck_charset = 0;
>   if (headerVal.IsEmpty())
>   {
>+    recheck_charset = 1;
This should use bool/false/true or PRBool/PR_FALSE/PR_TRUE. (And the variable name isn't strictly accurate either, since we're checking a charset that we didn't even know existed before.)

>+    pMimeType = strdup ("text/plain");
[I know this was moved code but it should be NS_strdup.]

>+    CopyASCIItoUTF16(m_Body.get(), headerVal);
>+    ExtractMetaCharset (headerVal);
It's not ideal to convert the entire body just to extract the charset. It would be better if the ASCII charset could be extracted from the ASCII body. (Everyone really wants an ASCII charset anyway, but that's harder to fix.)

>-  charSet = headerVal;
So, the one thing I don't understand is why we set the character set field, but not actually use the meta character set when creating the body.
(Assignee)

Comment 23

7 years ago
Thanks for the review.

Being new here, I used the existing ExtractCharset
http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/nsOutlookCompose.cpp#455
as a model.

That uses all the things you complained about, ie. str.Left, str.Right. strdup is also used throughout the same source file.

CopyASCIItoUTF16 is used so that my new function ExtractMetaCharset can work the same way as the existing ExtractCharset.

So please indicate with of the changes you suggested are mandatory.

Most importantly, the logic that you don't understand:

Quite frankly, I don't understand it either, all I can say is that it works this way.

I observed that when the message is the converted using the current version of TB, it creates a 

Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: 8bit

To make the message appear correctly, all you need to do is to change the charset to whatever the HTML code said, for example UTF-8 or iso-2022-jp.

The actual creation of the message is not affected, only the header.

Therefore in the code, I use:
m_pMsgFields->SetCharacterSet( NS_LossyConvertUTF16toASCII(headerVal).get() );
(headerVal being the new charset extracted from the HTML header)

nsMsgI18NConvertFromUnicode( NS_LossyConvertUTF16toASCII(charSet).get(),uniBody, body);
(charSet being the character set retrieved the "traditional" way as in the previous version).

If I use the new charset from the HTML header for the body conversion, things fall apart and I get dog's breakfast.

Please advise how to proceed.
Component: Migration → Import
Product: Thunderbird → MailNews Core
QA Contact: migration → import
standard8 ^^
(Assignee)

Comment 25

7 years ago
Sorry, what does "standard8 ^^" mean?
Asking standard8 to comment on the previous comment that's what it means

Updated

7 years ago
Attachment #442493 - Flags: superreview?(neil) → superreview+
Comment on attachment 442493 [details] [diff] [review]
Combined patch to header and CPP file

OK, you've convinced me.
(Assignee)

Comment 28

7 years ago
Well, I'm happy to make further changes.

1) FindCharInSet seems to me more elegant.
2) If I can avoid to convert the whole message with
     CopyASCIItoUTF16(m_Body.get(), headerVal);
   then surely that's the more elegant solution.
   Perhaps you can give me a hint how I'd do this instead.
(In reply to comment #28)
> If I can avoid to convert the whole message with
>   CopyASCIItoUTF16(m_Body.get(), headerVal);
> then surely that's the more elegant solution.
> Perhaps you can give me a hint how I'd do this instead.
You would need to change ExtractMetaCharset to use nsCString instead of nsString, and you would pass CaseInsensitiveCompare as the second parameter to Find, so it would work with the external API. And you might also want to add a second parameter for the return value, so that you don't have to copy m_Body.
(Assignee)

Comment 30

7 years ago
Created attachment 461877 [details] [diff] [review]
New patch after review

All review comments have now been incorporated.
Attachment #442493 - Attachment is obsolete: true
Attachment #461877 - Flags: review?(neil)
Attachment #442493 - Flags: review?(bugzilla)
(Assignee)

Comment 31

7 years ago
Created attachment 461880 [details] [diff] [review]
New patch after review (take 2)

All review comments have now been incorporated.

However, I don't understand this part of comment 29:

... and you would pass CaseInsensitiveCompare as the second parameter to
Find, so it would work with the external API.
Attachment #461877 - Attachment is obsolete: true
Attachment #461880 - Flags: review?(neil)
Attachment #461877 - Flags: review?(neil)
(In reply to comment #31)
> I don't understand this part of comment 29:
> 
> ... and you would pass CaseInsensitiveCompare as the second parameter to
> Find, so it would work with the external API.
Starting in bug 377319, we've been trying to convert the mailnews codebase to be compatible with the external string API, although there are some notable omissions, such as most of import. In the case of Find, the external string API wants a specific parameter to indicate a case insensitive find. There is a #define CaseInsensitiveCompare PR_TRUE in nsMsgUtils.h for use with the internal API, so that the code continues to compile and work.
Comment on attachment 461880 [details] [diff] [review]
New patch after review (take 2)

>+void nsOutlookCompose::ExtractMetaCharset( nsCString str, nsString& newstr)
>+{ 
Nit: unnecessary space after the { ...

>+} 
>+
> void nsOutlookCompose::ExtractType( nsString& str)
... and again after the }
Attachment #461880 - Flags: review?(neil) → review+
(Assignee)

Comment 34

7 years ago
Created attachment 461937 [details] [diff] [review]
New patch after review (take 3)

used CaseInsensitiveCompare and removed extra whitespace as requested by most recent review.
Attachment #461880 - Attachment is obsolete: true
Attachment #461937 - Flags: review?(neil)

Updated

7 years ago
Attachment #461937 - Flags: review?(neil) → review+

Updated

7 years ago
Duplicate of this bug: 584504

Comment 36

7 years ago
I have the same problem with russian charsets, too.
I want to note that TB not only garbles the HTML-style text, it also changes the text-only messages that do have the 'Content-Type: text/plain; charset="iso-8859-5"' and 'Content-Transfer-Encoding: quoted-printable' into garbled with 'Content-Type: text/html; charset=windows-1251
' and 'Content-transfer-encoding: 8bit', as shown in my (duplicate) https://bugzilla.mozilla.org/show_bug.cgi?id=584504

Updated

7 years ago
Whiteboard: [gs]

Updated

7 years ago
Whiteboard: [gs]
(Assignee)

Comment 37

7 years ago
This bug is about HTML messages only, which are imported correctly but display garbled.

Garbled (badly imported, data loss) plain text messages are covered in bug 207156.

Updated

7 years ago
Whiteboard: [checkin-needed]
Keywords: checkin-needed
Checked into trunk: http://hg.mozilla.org/comm-central/rev/5107881ac105
Status: REOPENED → RESOLVED
Last Resolved: 7 years ago7 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Whiteboard: [checkin-needed]
Target Milestone: --- → Thunderbird 3.2a1

Comment 39

7 years ago
Incorrectly removed the "[gs]" tag and URL due to a mis-reading of Comment #37  (GS topic has BOTH HTML and plain-text problems).  Re-adding tag/URL.
Whiteboard: [gs]
You need to log in before you can comment on or make changes to this bug.