Closed Bug 686985 Opened 10 years ago Closed 10 years ago
Try to restore missing 8-bit header's charset at import from Outlook
19.50 KB, text/plain
The result of importing the message in the previous file under Windows with Windows-1251 as the default codepage.
19.50 KB, text/plain
25.45 KB, patch
When importing from Outlook, if a message is not fully RFC 822/MIME-conformant in the sense that it uses 8-bit characters in its headers without specifying a charset or encoding those characters, the resulting imported message may contain these headers with garbled contents.

Attached is a sample Outlook message file (MSG) whose "From:", "To:", "Subject:" and "Disposition-Notification-To:" headers, as well as its body, use the KOI8-R charset. Note that none of these headers is encoded or carries a charset indication. The "Content-Type:" header specifying the body charset is also missing.

The current import method is able to guess the body charset correctly from the information that Outlook provides. Thus, on import, the missing "Content-Type:" header is recreated with the correct contents. Then the header-processing code inside the message-creating API detects that some headers (namely "From:", "To:" and "Subject:") contain 8-bit characters, converts them from the current charset to the charset specified in "Content-Type:", and then encodes them. This makes these headers unreadable. Note that the result of importing this message's headers depends on the default charset of your OS.

The proposal is that the import code should detect this situation itself before passing headers to the message-creating API and, if 8-bit characters are detected in a header, try to convert that header from the "Content-Type:" charset to Unicode.
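To make the failure mode concrete, here is a minimal Python sketch of what happens when raw KOI8-R header bytes are interpreted with a different default codepage. The header text and the `has_8bit` helper are made up for illustration; this is not the import code itself, which is C++.

```python
# Hypothetical raw header as it might sit in the message:
# KOI8-R bytes, no RFC 2047 encoding, no charset label.
raw_subject = "Subject: привет".encode("koi8_r")


def has_8bit(data: bytes) -> bool:
    """Return True if any byte is outside the 7-bit ASCII range."""
    return any(b >= 0x80 for b in data)


# Interpreting the bytes with the OS default codepage (assumed here to be
# Windows-1251) yields garbled Cyrillic; the body charset yields the real text.
wrong = raw_subject.decode("cp1251")
right = raw_subject.decode("koi8_r")

print(has_8bit(raw_subject))
print(wrong)
print(right)
```

The ASCII part of the header survives either way; only the 8-bit part depends on which charset is assumed, which is why the result varies with the OS default.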
Component: Migration → Import
Product: Thunderbird → MailNews Core
Assignee: nobody → mikekaganski
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
I need some advice. It may look dumb, but I'm not an RFC 822 expert, so please don't be harsh. Here are my assumptions. Please check whether they are correct.

A message header must be 7-bit ASCII text. If it needs to include internationalized text, this text must be encoded (quoted-printable adjusted for headers, with a charset specification), so it becomes 7-bit anyway. A correctly encoded header string may use a charset that is different from that of the other headers and the body. If a sending party sends a message whose headers contain 8-bit characters, it should always be presumed that these characters use the same charset as the body of the message, shouldn't it?

So, our procedure may look like this:
1. Get the transport headers, which may include the body charset.
2. Get the body, and do some magic to decide which charset it uses (this includes using the information from step 1 and, in case it's absent, the data reported by Outlook, the OS default charset etc.).
3. Convert all headers from the found charset to Unicode. This will not alter the valid 7-bit characters, and all 8-bit ones will hopefully be converted to their correct codepoints and will afterwards be processed correctly by the message-creating code.

This may be wrong if a header can exist that uses 8-bit text AND explicitly specifies its charset. Is that possible? Thank you for your help.
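The three steps above can be sketched roughly as follows in Python. This is only an illustration of the idea under the stated assumption (8-bit header bytes share the body charset), not the actual patch; `restore_header` and its parameters are hypothetical names, and the body charset is assumed to have been determined already in step 2.

```python
from email.header import Header, decode_header, make_header


def restore_header(raw_value: bytes, body_charset: str) -> str:
    """Return a 7-bit-safe header value.

    Pure ASCII values are passed through untouched. Values containing
    8-bit bytes are assumed to use the body charset: they are decoded
    to Unicode and then RFC 2047-encoded as UTF-8.
    """
    if all(b < 0x80 for b in raw_value):
        return raw_value.decode("ascii")
    # errors="replace" keeps the import going even if the guess is wrong.
    text = raw_value.decode(body_charset, errors="replace")
    return Header(text, "utf-8").encode()


# Usage: a raw KOI8-R value round-trips to the intended text.
encoded = restore_header("привет".encode("koi8_r"), "koi8_r")
print(encoded)
print(str(make_header(decode_header(encoded))))
```

Decoding to Unicode first means the later encoding step no longer depends on the OS default charset, which is the point of step 3.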
Assuming my assumptions are right, this patch takes care of such messages. Tested on 4 GB of Outlook messages; it seems to cure the problem without introducing new ones.
Re the assertion that Message-IDs can't fold, I'm not sure that's the case. I know the Message-ID can be on its own line when Exchange/Outlook generates the message. Mike, can you look at the info in the bug below? https://bugzilla.mozilla.org/show_bug.cgi?id=676916 The patch itself has bit-rotted slightly, but I've refreshed it and will attach the refreshed version after I check that it builds...
(In reply to David :Bienvenu from comment #4) Thank you, David. That's definitely my mistake (it was introduced in the patch for bug 207156); RFC 2822 allows folding and comments both before and after the Message-Id field. So this comment needs to be removed. However, is there a real need to change the code itself? The Message-Id is passed to CreateAndSendMessage (or its current replacement) to allow it to create the message successfully; we don't use the (possibly changed) value of this header from the composed message, as this header is copied as is. Will CreateAndSendMessage fail when a folded Message-Id is passed?
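For reference, a folded Message-Id of the kind discussed here can be unfolded by dropping each line break that is immediately followed by whitespace, which is how RFC 2822 describes unfolding. The sketch below is a hypothetical Python helper with a made-up Message-Id, not code from the patch or from Thunderbird.

```python
import re


def unfold(header: str) -> str:
    """Remove each CRLF (or bare LF) that is directly followed by a space or tab."""
    return re.sub(r"\r?\n(?=[ \t])", "", header)


# A Message-Id placed on its own continuation line.
folded = "Message-ID:\r\n <abc123@example.com>"
print(unfold(folded))
```

The whitespace that follows the line break is kept, so the unfolded value still has its separator after the colon.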
I have checked the code of nsMsgComposeAndSend::InitCompositionFields in http://mxr.mozilla.org/comm-central/source/mailnews/compose/src/nsMsgSend.cpp#2714 (it is called in nsMsgComposeAndSend::Init). It looks like the code wouldn't mind if the Message-Id weren't passed at all; it would simply generate a new one. So here's the same patch without that incorrect comment.
Sorry for the noise.
This is the de-bitrotted patch I was planning on checking in. I removed the comment locally...
Oh, excuse me. I misunderstood you: I thought you were referring to bug 676916 when you said that its patch had bit-rotted... Thank you.
By the way, do you know of a bug report about incorrect display, in the message list (and sometimes in the message view window), of the sender and subject of messages received _in the usual way_ that contain such improper 8-bit headers? I tried to find one and failed. I think that TB could use a similar technique to work around this.
This bug 686985 is a duplicate of bug 270638. I should have posted there, all the more so since I had already seen it a while ago.
http://hg.mozilla.org/comm-central/rev/76cb5cca90ec Fixed on trunk. Thx for the patch, Mike.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 9.0