Closed Bug 479214 Opened 14 years ago Closed 14 years ago

Gloda JS mime emitter needs to deal with character set and encoding issues.


(MailNews Core :: Database, defect)

Not set


(Not tracked)

Thunderbird 3.0b3


(Reporter: bugzilla, Assigned: asuth)



(Keywords: intl)


(1 file, 2 obsolete files)

When we search for Japanese (or other multibyte) strings within a message body (or attachmentName) with Thunderbird (or Postbox full-text search), we cannot find them.
In other words, current Thunderbird does not support multibyte characters in message bodies.

I first thought that this was just a bug in the frontend, but when I checked global-messages-db.sqlite with SQLite Manager, I found that the body/attachmentNames fields of the messagesText/messagesText_content tables contain mojibake (garbage characters).
# the subject field contains correct strings

And I checked the insertMessage method of datastore.js (to see whether the mojibake is introduced before the insert or not). It receives an aMessage argument; aMessage._subject is OK, but aMessage._indexedBodyText contains mojibake strings.

So, I think it is not exptoolbar but the gloda backend (the caller of insertMessage) that does not currently handle multibyte characters correctly.

In general, the Subject (and other) headers and the message body are not necessarily encoded the same way in a mail message. I guess the current code just treats the body the same way (or some wrong way) as the subject (and other) headers, and that causes the mojibake in the message body.

I believe the body/attachmentNames fields should contain UTF-8, the same as the subject field; but if the body/attachment fields are supposed to contain non-UTF-8 strings, then I'm wrong and you can ignore my report. In that case it is a bug in exptoolbar (and Postbox), not in gloda.

note: One curious thing is that full-text search for multibyte strings works correctly via the Edit -> Find -> Find Messages menu, even when we use gloda and global-messages-db.sqlite contains mojibake.
Flags: blocking-thunderbird3?
Thank you for the detailed investigation and bug report!

I was hoping really hard that libmime and the rest of the message streaming process had this under control.  (I was going to look into this when I implemented the tokenizer stuff...)  I suspect the jsmimeemitter should probably be doing something with updateCharacterSet; we previously tried to tell the channel about it, but the IMAP mock channel got angry.

cc-ing bienvenu in case he knows exactly what we should be doing :)

Dropping importance since gloda is not yet exposed in trunk, but marking as blocking enabling gloda since this should be a fairly easy fix.
Blocks: 464354
Severity: blocker → major
Summary: Gloda database must support multibyte chars (for body/attachemntNames fields) → Gloda database must support multibyte chars (for body/attachmentNames fields)
Target Milestone: --- → Thunderbird 3.0b3
Summary: Gloda database must support multibyte chars (for body/attachmentNames fields) → Gloda JS mime emitter needs to deal with character set and encoding issues.
blocking-thunderbird3+ as this sounds pretty bad
Flags: blocking-thunderbird3? → blocking-thunderbird3+
I would store utf8 in sqlite - e.g., mime2 decode the part header to unicode, and then convert that to utf8 (or maybe mime2 decode directly to utf8 - don't know if our decoders can do that...) - is that not what we're doing?

I would also store the message body as utf8, but I'm not real familiar with what sqlite supports.
libmime is, in fact, doing all the hard work for us.  The problem is that nsIMimeEmitter.writeBody passes its utf-8 data with a "string" signature.  XPConnect takes this to mean we are passing ASCII data and not utf8 data.  So we get utf8 stored in wide characters.  mozStorage uses utf-8 storage in sqlite, but of course utf-8 stored in utf-16 does not reduce to valid utf-8 when converted.  You get escaped utf-8 in utf-8...
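To make the failure mode concrete, here is a small standalone illustration (Python, hypothetical; not Mozilla code) of "utf8 stored in wide characters": each UTF-8 byte gets widened into its own character, and serializing that wide string back to UTF-8 yields mojibake rather than the original bytes.

```python
# Hypothetical illustration of the bug mechanism, not Mozilla code.
utf8_bytes = "日本語".encode("utf-8")   # the bytes libmime hands to writeBody

# Treating the data as ASCII widens each byte to its own character,
# which is equivalent to decoding the UTF-8 bytes as Latin-1:
widened = utf8_bytes.decode("latin-1")  # nine garbage characters

# Serializing that wide string back to UTF-8 for storage then gives
# "escaped utf-8 in utf-8":
stored = widened.encode("utf-8")

assert stored != utf8_bytes             # the stored value is mojibake
assert len(widened) == len(utf8_bytes)  # one character per original byte
```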

The attached patch does the simplest thing possible, which is to use scriptableunicodeconverter to convert all the body data that gets thrown at us.  Generally speaking, this is silly and inefficient.  The arguably better course of action is to change the signature of writeBody to be AUTF8String so that things come out of XPConnect correct.  The trick there is that then all the C++ dudes need to change to using an nsACString& signature.  I am not a "native speaker" when it comes to mozilla string classes, so I'd appreciate input on how far I would want to propagate such changes, or whether I should just make that one change using a dependent string or such...
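In Python terms, the converter workaround amounts to undoing that widening by reinterpreting the per-byte characters as UTF-8. This is a hypothetical sketch of the effect, not the actual scriptableunicodeconverter call; the helper name is invented for illustration.

```python
def convert_stewed_utf8(stewed: str) -> str:
    """Hypothetical helper: reinterpret a byte-per-character string as
    UTF-8, roughly the effect the converter has on the widened data."""
    return stewed.encode("latin-1").decode("utf-8")

# The mojibake produced by the byte-widening round trip:
widened = "日本語".encode("utf-8").decode("latin-1")
assert convert_stewed_utf8(widened) == "日本語"
```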

I did add a simple unit test that sanity checks the current implementation and should be useful to sanity-check a better implementation too.
Assignee: nobody → bugmail
Attachment #363086 - Attachment is obsolete: true
Attachment #363619 - Flags: superreview?(bienvenu)
Attachment #363619 - Flags: review?(bienvenu)
There's only one call to WriteBody, mime_output_fn, in mimemoz2.cpp, so I think it's well worthwhile changing the idl signature. I think you can use nsDependentCString(buf) and change WriteBody to take AUTF8String, which the idl compiler changes to nsACString, and then I think our utf8 data will pass through unharmed. I'm not sure you couldn't even use ACString in the idl, but I bet Neil knows off the top of his head.
(In reply to comment #5)
> I'm not sure you couldn't even use ACString in the idl
My understanding was that string didn't even guarantee 8 bits when passed through XPConnect; ACString is best for binary data since AUTF8String unsurprisingly only works for UTF8 data.
With the patch Thunderbird will (convert and) import Japanese message body correctly into the global-messages-db.sqlite database as utf-8. ;)

But I still see mojibake in the attachmentNames field of the messagesText/messagesText_content table.

And as far as I tested, the following will solve the problem for the attachmentNames field too:
# do the same for mime_emitter_startAttachment() as for mime_emitter_writeBody()

       if (partMatch) {
         let part = new this._mimeMsg.MimeMessageAttachment(partMatch[1],
-            aName, aContentType, aUrl, aIsExternalAttachment);
+            this._converter.ConvertToUnicode(aName), aContentType, aUrl, aIsExternalAttachment);
Blocks: 479783
(In reply to comment #7)
> With the patch Thunderbird will (convert and) import Japanese message body
> correctly into the global-messages-db.sqlite database as utf-8. ;)
> But I still see mojibake for attachmentNames field of the
> message_Text/Messate_Text_content table.

Thanks for (thoroughly) testing!  I'll fix the attachment name vector (and
audit the rest of the interface call-sites) and add tests for this to my
enhanced unit test.  (The unit test in the patch sadly fails to prove much, as
the message payload was actually utf-8 anyways.  Happily, it turns out that it
does work for euc-jp and shift-jis :)
Comment on attachment 363619 [details] [diff] [review]
v1 path of least resistance, convert the stewed utf-8

I'd prefer fixing the idl for WriteBody since that's where we want to end up, I believe.
In addition to changing the signature for writeBody, we also need startAttachment's signature changed for that to be correct.  Also, as Neil points out, ACString is best for binary data, so I changed the signature to write as well.  I tried to propagate signatures just far enough, but may not have struck the right balance.

Unit test now tests subject, body, and attachment name, and does so for 3 encodings.
Attachment #363619 - Attachment is obsolete: true
Attachment #364856 - Flags: superreview?(bienvenu)
Attachment #364856 - Flags: review?(neil)
Attachment #363619 - Flags: superreview?(bienvenu)
Attachment #363619 - Flags: review?(bienvenu)
Comment on attachment 364856 [details] [diff] [review]
v2 convert startAttachment/writeBody/write signatures on nsIMimeEmitter

+ *  expected/actual value. 
I only skimmed the gloda bits, but there's trailing whitespace here. *

>-    mCurrentAttachment->displayName = strdup(name);
>+    mCurrentAttachment->displayName = ToNewCString(name);
This is still an allocator mismatch :-( [According to the rules, strdup goes with strfree and ToNewCString goes with NS_Free and neither go with PR_FREEIF]

>+      mBufferMgr->IncreaseBuffer(buf.Data(), buf.Length());
Data() is internal API only; I think BeginReading() should work instead. *

>+    rv = Write(NS_LITERAL_CSTRING(""), &written);
Use EmptyCString() *

>+  NS_IMETHOD          UtilityWrite(const nsACString &buf);
>   NS_IMETHOD          UtilityWriteCRLF(const char *buf);
Weird, why are these public virtual instead of protected? (Or indeed the "original" UtilityWrite method, which made it into the .idl!)

>     void utilityWrite([const] in string buf);
Ah yes, here it is ;-)

>+      msd->output_emitter->WriteBody(nsDependentCString(buf, (PRUint32)size),
>+                                     &written);
>+      msd->output_emitter->Write(nsDependentCString(buf, (PRUint32)size), &written);
IIRC Substring() would be better, although unfortunately the internal API doesn't have a start/length version... *

>+  // use of nsDependentCString demands nul-termination, so add 1.
Substring() would avoid this. *

* required for r=me
Attachment #364856 - Flags: review?(neil) → review+

In addition to Neil's comments,

you can lose the blank lines here:

+nsMimeBaseEmitter::UtilityWrite(const nsACString &buf)
+  PRUint32    written;
+  Write(buf, &written);
   return NS_OK;
Attachment #364856 - Flags: superreview?(bienvenu) → superreview+
Comment on attachment 364856 [details] [diff] [review]
v2 convert startAttachment/writeBody/write signatures on nsIMimeEmitter

sr=me, but don't forget to fix Neil's starred comments.
comments addressed.  Thanks all.

Closed: 14 years ago
Resolution: --- → FIXED
As far as I tested with the latest tinderbox builds, ISO-2022-JP (the most popular Japanese mail encoding), Shift_JIS, EUC-JP, and UTF-8 Japanese mails are all stored into the db as UTF-8 correctly.
# I sent test mails to myself only for Shift_JIS and EUC-JP, since almost no one uses them in e-mail

Thanks for fixing this bug.

I feel it might be better to have tests for more languages and more encodings if we want to make sure we can handle every locale/encoding.
# it's just nice to have, not required
Thank you for verifying!  In this case, it's not essential to have additional encodings.  The main thing was just to make sure to have more than just UTF-8, so we don't get some form of accidentally correct behaviour on UTF-8 while failing on all other encodings.  I believe this is sufficient to verify that the proper decoding mechanisms are in play, and it is probably best for the codec code, rather than gloda, to make sure it has full unit test coverage of the multiple possible encodings.

I think once we implement the tokenizer, that is where we will want to have a large set of different strings in different encodings to verify that tokenization is occurring as expected/desired.
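The "more than just UTF-8" point can be sketched as a roundtrip check over several Japanese charsets (Python, hypothetical illustration; the real coverage lives in the xpcshell unit test, and the sample string is invented):

```python
# Hypothetical sketch of a multi-encoding sanity check: the same text,
# encoded several ways on the wire, must decode to identical Unicode
# (and hence identical UTF-8 in the database).
SAMPLE = "件名テスト"  # hypothetical sample string
for charset in ("utf-8", "euc-jp", "shift_jis", "iso-2022-jp"):
    raw = SAMPLE.encode(charset)   # as the message body appears on the wire
    decoded = raw.decode(charset)  # what libmime should hand to gloda
    assert decoded == SAMPLE
    assert decoded.encode("utf-8") == SAMPLE.encode("utf-8")
```

With only a UTF-8 fixture, the widening bug can go unnoticed on the subject/body comparison; any of the non-UTF-8 charsets above would have caught it.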
Note: The fix for this bug accidentally introduced/exposed bug 482416.  (The removal of a guard exposed an oversight; nested rfc822 messages are falling into the void a bit.)
Flags: in-testsuite+