Closed
Bug 547119
Opened 15 years ago
Closed 14 years ago
Imported Japanese Email from Outlook shows as garbled text
Categories
(MailNews Core :: Import, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
Thunderbird 3.3a1
People
(Reporter: dythim, Assigned: jorgk-bmo)
References
(Blocks 1 open bug, )
Details
(Whiteboard: [gs])
Attachments
(3 files, 6 obsolete files)
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.89 Safari/532.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.0; ja; rv:1.9.1.7) Gecko/20100111 Thunderbird/3.0.1
When migrating from Outlook 2007 from Thunderbird 3.01 (both Japanese),
using the Tools -> Import to retrieve Outlook messages apparently succeeded,
but when viewing the imported messages in Thunderbird the text shows as garbage.
The character set is auto-detected as Shift_JIS, and ISO-2022-JP must be manually selected in order to read the messages properly.
Newly sent/received mail does not seem to have the same problem -- it is correctly auto-detected as ISO-2022-JP.
Reproducible: Always
Steps to Reproduce:
1. Have existing Outlook Japanese e-mail.
2. Install Thunderbird 3.0.1 Japanese version.
3. Select Tools -> Import (ツール→設定とデータのインポート)
4. Select Mailbox (メールボックス) from the Import window, click Next (次へ), select Outlook as the type, press Next (次へ) to execute.
5. Import completes automatically, no problems reported.
6. Open one of the imported messages, it shows as garbage.
7. Change the View -> Character Encoding (表示→文字エンコーディング) from Shift_JIS (日本語(Shift_JIS) to (日本語(ISO-2022-JP)).
8. Message shows as normal Japanese.
Actual Results:
Thunderbird detects character encoding of imported messages as Shift_JIS, which results in garbled text.
Expected Results:
Thunderbird should be able to detect that the messages are encoded in ISO-2022-JP.
All mail sent and received after the migration is correctly auto-detected as ISO-2022-JP. It seems like a problem with the import.
Also, switching between tabs while an imported message is displayed causes the message to be refreshed, and regarbled.
Comment 1•15 years ago
|
||
Could you attach sample *.pst file to reproduce this?
I uploaded a PST file, but I was not able to force Thunderbird to import from it... It would only import directly from Outlook.
Comment 4•15 years ago
|
||
Thank you!. I confirmed on 3.1 alpha1.
When Tb imports your mailbox as HTML mail, mail data is following. Although charset of Content-type is Shift_JIS, but charset of HTML is iso-2022-jp.
:
Content-Type: text/html; charset=Shift_JIS
:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp">
:
:
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 5•15 years ago
|
||
related to bug 378008 (same) and bug 505072 (wrong fix for this).
When I open HTML source via Outlook, there is no charset... If we import HTML mail data from Outlook, I should remove charset of header, maybe...
Assignee | ||
Comment 6•15 years ago
|
||
This is a problem with Korean messages, too. For example this is the source of a message I imported from Outlook:
This is a multi-part message in MIME format.
--------------040206010205070609040902
Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: 8bit
Her name is Yena Ryu and that's 예나 류 in Korean<br><br>
===============
Where does the Content-Type come from???
The message looks OK when viewed as UTF-8, but when viewed with the default windows-1252, it looks like this:
Her name is Yena Ryu and that's 예나 류 in Korean
This is not only a problem with Asian text. Even German text with Umlaut (äöü) gets garbled:
This is a multi-part message in MIME format.
--------------090001070201040804030900
Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: 8bit
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Ihre simyo Rechnung</TITLE>
<META content="text/html; charset=utf-8" http-equiv=Content-Type><!-- Title ON --><!-- Title OFF -->
Later in the body of the message:
der Rechnungsbetrag für die
which when viewed at the default looks like this:
der Rechnungsbetrag für die
========
So once again, is there away to detect the default encoding for the message a little better?
If it's a HTML message, it could be taken from the charset of the HTML header.
Assignee | ||
Comment 7•15 years ago
|
||
The problem is here in the code:
http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/nsOutlookCompose.cpp#607.
The import tries to get the charset from the header, if none is there, it defaults to the system character set, for Windows machines with an English version of Windows, that's normally "windows-1251".
Instead, what should be done is extract the charset from the appropriate HTML header.
Comment 8•15 years ago
|
||
(In reply to comment #7)
> The problem is here in the code:
>
> http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/nsOutlookCompose.cpp#607.
>
> The import tries to get the charset from the header, if none is there, it
> defaults to the system character set, for Windows machines with an English
> version of Windows, that's normally "windows-1251".
>
> Instead, what should be done is extract the charset from the appropriate HTML
> header.
That's right. we need a meta element parser like nsMsgI18NParseMetaCharset() to outlook importer.
Updated•15 years ago
|
Assignee: nobody → mozilla
Assignee | ||
Comment 10•15 years ago
|
||
Assignee | ||
Comment 11•15 years ago
|
||
Attachment #442379 -
Flags: review?(bienvenu)
Assignee | ||
Comment 12•15 years ago
|
||
I'd like to request that this fix be rolled into the source base quickly since I want to continue working on another bug (bug 558653) which will require changes to this file.
Attachment #442380 -
Flags: review?(bienvenu)
Assignee | ||
Comment 13•15 years ago
|
||
I'd like to request that this fix be rolled into the source base quickly since
I want to continue working on another bug (bug 558653) which will require
changes to this file.
Attachment #442380 -
Attachment is obsolete: true
Attachment #442382 -
Flags: review?(bienvenu)
Attachment #442380 -
Flags: review?(bienvenu)
Comment 14•15 years ago
|
||
Can you provide both .h and cpp changes in the same patch ?
Comment 15•15 years ago
|
||
Jorg, thx for working on this.
Yes, one patch per bug fix, and you might want to look into Mercurial queues as a way to manage multiple patches to the same files. https://developer.mozilla.org/en/Mercurial_Queues
We can't land anything until the tree re-opens, so you're going to have to be patient...
Assignee | ||
Comment 16•15 years ago
|
||
Attachment #442379 -
Attachment is obsolete: true
Attachment #442382 -
Attachment is obsolete: true
Attachment #442493 -
Flags: review?(bienvenu)
Attachment #442379 -
Flags: review?(bienvenu)
Attachment #442382 -
Flags: review?(bienvenu)
Assignee | ||
Comment 17•15 years ago
|
||
(In reply to comment #15)
> Yes, one patch per bug fix, and you might want to look into Mercurial queues as
> a way to manage multiple patches to the same files.
> https://developer.mozilla.org/en/Mercurial_Queues
I prefer not to spend time on config management. Once this patch has been included into the source base, I will submit the patch for bug 558653.
Updated•15 years ago
|
Blocks: tb-enterprise
Assignee | ||
Updated•15 years ago
|
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 18•15 years ago
|
||
Jorg K, something's amiss here - the bug got marked fixed, but there's no checkin mentioned and the patch doesn't yet have review approval
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 19•15 years ago
|
||
Sorry, looks like I did the wrong thing. I just wanted to give this a little nudge.
Comment 20•15 years ago
|
||
David , ping for the review.
Updated•15 years ago
|
Attachment #442493 -
Flags: review?(bienvenu) → review?(bugzilla)
Comment 21•15 years ago
|
||
Comment on attachment 442493 [details] [diff] [review]
Combined patch to header and CPP file
Neil, I'm happy to test this, but can you comment on the string related stuff first (and if they will be fine with external api).
Attachment #442493 -
Flags: superreview?(neil)
Comment 22•15 years ago
|
||
Comment on attachment 442493 [details] [diff] [review]
Combined patch to header and CPP file
Well, we haven't started on porting mailnews/import to the external API (mainly because a Linux guy has been doing most of the work). But I'll comment anyway.
>+ PRInt32 idx_endhead = str.Find( "/head", PR_TRUE);
>+ PRInt32 idx_charset = str.Find( "charset=", PR_TRUE);
[This happens to be the one usage of Find that is directly portable. Which is a shame because we probably should be using nsCString here.]
>+ str.Right( tStr, str.Length() - idx_charset);
Right isn't available in the external API. Instead you should use
tStr = Substring(str, idx_charset)
Or you can use an offset to your FindChar functions and use the three-argument version of substring.
>+ PRInt32 idx = tStr.FindChar( ';');
>+ PRInt32 idx0 = tStr.FindChar( '"');
>+ PRInt32 idx1 = tStr.FindChar( ' ');
This should use FindCharInSet. (Well, it should use MsgFindCharInSet, but that's currently broken.) Conveniently FindCharInSet accepts an initial offset parameter, so you could write
PRInt32 idx = str.FindCharInSet(";\" ", idx_charset);
[this would of course return an offset from the start of the string]
>+ tStr.Left( str, idx);
str = StringHead(tStr, idx);
Or if you didn't create tStr,
str = Substring(str, idx_charset, idx - idx_charset);
>+ int recheck_charset = 0;
> if (headerVal.IsEmpty())
> {
>+ recheck_charset = 1;
This should use bool/false/true or PRBool/PR_FALSE/PR_TRUE. (And the variable name isn't strictly accurate either, since we're checking a charset that we didn't even know existed before.)
>+ pMimeType = strdup ("text/plain");
[I know this was moved code but it should be NS_strdup.]
>+ CopyASCIItoUTF16(m_Body.get(), headerVal);
>+ ExtractMetaCharset (headerVal);
It's not ideal to convert the entire body just to extract the charset. It would be better if the ASCII charset could be extracted from the ASCII body. (Everyone really wants an ASCII charset anyway, but that's harder to fix.)
>- charSet = headerVal;
So, the one thing I don't understand is why we set the character set field, but not actually use the meta character set when creating the body.
Assignee | ||
Comment 23•15 years ago
|
||
Thanks for the review.
Being new here, I used the existing ExtractCharset
http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/nsOutlookCompose.cpp#455
as a model.
That uses all the things you complained about, ie. str.Left, str.Right. strdup is also used throughout the same source file.
CopyASCIItoUTF16 is used so that my new function ExtractMetaCharset can work the same way as the existing ExtractCharset.
So please indicate with of the changes you suggested are mandatory.
Most importantly, the logic that you don't understand:
Quite frankly, I don't understand it either, all I can say is that it works this way.
I observed that when the message is the converted using the current version of TB, it creates a
Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: 8bit
To make the message appear correctly, all you need to do is to change the charset to whatever the HTML code said, for example UTF-8 or iso-2022-jp.
The actual creation of the message is not affected, only the header.
Therefore in the code, I use:
m_pMsgFields->SetCharacterSet( NS_LossyConvertUTF16toASCII(headerVal).get() );
(headerVal being the new charset extracted from the HTML header)
nsMsgI18NConvertFromUnicode( NS_LossyConvertUTF16toASCII(charSet).get(),uniBody, body);
(charSet being the character set retrieved the "traditional" way as in the previous version).
If I use the new charset from the HTML header for the body conversion, things fall apart and I get dog's breakfast.
Please advise how to proceed.
Updated•15 years ago
|
Component: Migration → Import
Product: Thunderbird → MailNews Core
QA Contact: migration → import
Comment 24•15 years ago
|
||
standard8 ^^
Assignee | ||
Comment 25•15 years ago
|
||
Sorry, what does "standard8 ^^" mean?
Comment 26•15 years ago
|
||
Asking standard8 to comment on the previous comment that's what it means
Updated•15 years ago
|
Attachment #442493 -
Flags: superreview?(neil) → superreview+
Comment 27•15 years ago
|
||
Comment on attachment 442493 [details] [diff] [review]
Combined patch to header and CPP file
OK, you've convinced me.
Assignee | ||
Comment 28•15 years ago
|
||
Well, I'm happy to make further changes.
1) FindCharInSet seems to me more elegant.
2) If I can avoid to convert the whole message with
CopyASCIItoUTF16(m_Body.get(), headerVal);
then surely that's the more elegant solution.
Perhaps you can give me a hint how I'd do this instead.
Comment 29•15 years ago
|
||
(In reply to comment #28)
> If I can avoid to convert the whole message with
> CopyASCIItoUTF16(m_Body.get(), headerVal);
> then surely that's the more elegant solution.
> Perhaps you can give me a hint how I'd do this instead.
You would need to change ExtractMetaCharset to use nsCString instead of nsString, and you would pass CaseInsensitiveCompare as the second parameter to Find, so it would work with the external API. And you might also want to add a second parameter for the return value, so that you don't have to copy m_Body.
Assignee | ||
Comment 30•15 years ago
|
||
All review comments have now been incorporated.
Attachment #442493 -
Attachment is obsolete: true
Attachment #461877 -
Flags: review?(neil)
Attachment #442493 -
Flags: review?(bugzilla)
Assignee | ||
Comment 31•15 years ago
|
||
All review comments have now been incorporated.
However, I don't understand this part of comment 29:
... and you would pass CaseInsensitiveCompare as the second parameter to
Find, so it would work with the external API.
Attachment #461877 -
Attachment is obsolete: true
Attachment #461880 -
Flags: review?(neil)
Attachment #461877 -
Flags: review?(neil)
Comment 32•15 years ago
|
||
(In reply to comment #31)
> I don't understand this part of comment 29:
>
> ... and you would pass CaseInsensitiveCompare as the second parameter to
> Find, so it would work with the external API.
Starting in bug 377319, we've been trying to convert the mailnews codebase to be compatible with the external string API, although there are some notable omissions, such as most of import. In the case of Find, the external string API wants a specific parameter to indicate a case insensitive find. There is a #define CaseInsensitiveCompare PR_TRUE in nsMsgUtils.h for use with the internal API, so that the code continues to compile and work.
Comment 33•15 years ago
|
||
Comment on attachment 461880 [details] [diff] [review]
New patch after review (take 2)
>+void nsOutlookCompose::ExtractMetaCharset( nsCString str, nsString& newstr)
>+{
Nit: unnecessary space after the { ...
>+}
>+
> void nsOutlookCompose::ExtractType( nsString& str)
... and again after the }
Attachment #461880 -
Flags: review?(neil) → review+
Assignee | ||
Comment 34•15 years ago
|
||
used CaseInsensitiveCompare and removed extra whitespace as requested by most recent review.
Attachment #461880 -
Attachment is obsolete: true
Attachment #461937 -
Flags: review?(neil)
Updated•15 years ago
|
Attachment #461937 -
Flags: review?(neil) → review+
Comment 36•14 years ago
|
||
I have the same problem with russian charsets, too.
I want to note that TB not only garbles the HTML-style text, it also changes the text-only messages that do have the 'Content-Type: text/plain; charset="iso-8859-5"' and 'Content-Transfer-Encoding: quoted-printable' into garbled with 'Content-Type: text/html; charset=windows-1251
' and 'Content-transfer-encoding: 8bit', as shown in my (duplicate) https://bugzilla.mozilla.org/show_bug.cgi?id=584504
Updated•14 years ago
|
Whiteboard: [gs]
Updated•14 years ago
|
Whiteboard: [gs]
Assignee | ||
Comment 37•14 years ago
|
||
This bug is about HTML messages only, which are imported correctly but display garbled.
Garbled (badly imported, data loss) plain text messages are covered in bug 207156.
Updated•14 years ago
|
Whiteboard: [checkin-needed]
Updated•14 years ago
|
Keywords: checkin-needed
Comment 38•14 years ago
|
||
Checked into trunk: http://hg.mozilla.org/comm-central/rev/5107881ac105
Status: REOPENED → RESOLVED
Closed: 15 years ago → 14 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Whiteboard: [checkin-needed]
Target Milestone: --- → Thunderbird 3.2a1
Comment 39•14 years ago
|
||
Incorrectly removed the "[gs]" tag and URL due to a mis-reading of Comment #37 (GS topic has BOTH HTML and plain-text problems). Re-adding tag/URL.
Whiteboard: [gs]
You need to log in
before you can comment on or make changes to this bug.
Description
•