Closed Bug 1515254 Opened 5 years ago Closed 4 years ago

mailstore conversion doesn't preserve character encodings

Categories

(Thunderbird :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 76.0

People

(Reporter: benc, Assigned: benc)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 5 obsolete files)

My unit test to try a roundtrip maildir->mbox->maildir conversion in Bug 1508931 highlighted a possible encoding issue.

Two of the test emails I'm using contain some non-UTF-8 encoded characters which aren't preserved by the roundtrip conversion.

Steps to replicate:

1) Apply the latest `mailstore_roundtrip_test_<n>.patch` from Bug 1508931.

2) Run the test and note that it succeeds:

   ./mach xpcshell-test comm/mailnews/base/test/unit/test_mailstoreConverter.js


3) Edit `mailnews/base/test/unit/test_mailstoreConverter.js` and find the `testEmails` list. The two emails with the non-UTF-8 encodings are commented out:
   //"../../../data/12-plaintext+attachment.eml",  // using ISO-8859-7 (Greek)
and:
   //"../../../data/20-plaintext+(HTML+embedded-image)+attachment.eml",  // using windows-1252

Uncomment them.


4) Run the test again, and note that it now fails.
Version: unspecified → Trunk

"opened 11 months ago". That's why people produce duplicates: https://bugzilla.mozilla.org/show_bug.cgi?id=1593455

What's the obstacle to getting rid of this bug? Converting mbox to maildir is just cutting the mbox text file into smaller pieces of text. Just don't touch the umlauts.

Assignee: nobody → benc

Notes to self:
The issue is that we're treating the mbox as a text file. It's not, really: it's a mashup of text files, all with potentially different encodings. It needs to be treated as binary.
The complication is that the converter code uses a regexp to implement the (rather hairy) rules for detecting the separator lines ("From ..."). We have to cope with all kinds of different formats, and we also need heuristics to cope with mbox files which don't properly escape lines in the message body beginning with "From ", so we don't accidentally cut messages into bits.

So, we need to ditch the TextEncoder/TextDecoder and deal with the raw bytes directly. This could still involve converting it into a string to use the existing regexp, but that's a bit icky and doubles the memory requirements (stupid UCS-2 JavaScript strings and their 16-bit chars). So if we can efficiently scan the raw Uint8Array chunks we're already reading in, that'd be much nicer.
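
For illustration, a minimal sketch of what scanning the raw bytes for a separator might look like (a hypothetical helper for this note, not the eventual patch):

// Hypothetical sketch: find the next "From " at the start of a line in a
// raw byte buffer, without decoding it into a string first.
const LF = 0x0a;
const FROM_BYTES = [0x46, 0x72, 0x6f, 0x6d, 0x20]; // "From "

function findFrom(buf, startIndex) {
  for (let i = startIndex; i + FROM_BYTES.length <= buf.length; i++) {
    // Must be at the start of the buffer or directly after a line feed.
    if (i > 0 && buf[i - 1] != LF) {
      continue;
    }
    let match = true;
    for (let j = 0; j < FROM_BYTES.length; j++) {
      if (buf[i + j] != FROM_BYTES[j]) {
        match = false;
        break;
      }
    }
    if (match) {
      return i;
    }
  }
  return -1; // Not found in this chunk.
}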

Longer term:

  1. We should probably aspire to using the mbox format standardised in RFC 4155. But we're quite a way from that (for one, it specifies a 7-bit-clean stream, so we'd have to parse headers and re-encode appropriately). i.e. be forgiving about the mbox formats we can read in, but be really strict about the one we write out.
  2. The converter should use the same mbox code as the C++ messagestore. We don't at the moment, due to concerns about single-threadedness and freezing the GUI while the conversion runs, but we should figure out how to resolve this and unify the code.

At the moment Thunderbird accepts every line starting with "From " (regardless of anything that follows on that line) as an envelope line. The elaborate regexp used for the conversion already misses some (admittedly very old) envelope styles present in some of my old mboxes. So indeed it might not have to be so complicated. If the conversion uses

let sepRE = /^(?:From .*\r?\n)[\x21-\x7E]+:/gm;

that would be (almost) equivalent to what nsParseMailMessageState::IsEnvelopeLine() in comm/mailnews/local/src/nsParseMailbox.cpp detects. Could then the whole decode/encode cycle be ditched?
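
For illustration, that regexp behaves like this on a small sample (a hypothetical snippet, not taken from the patch):

let sepRE = /^(?:From .*\r?\n)[\x21-\x7E]+:/gm;

let sample =
  "From bob@example.com Sat Jan  3 01:05:34 1996\r\n" +
  "Subject: hello\r\n" +
  "\r\n" +
  "From my point of view this line is body text, not a separator.\r\n";

// Only the first "From " line matches, because the line after it looks
// like a header field ("Subject:"). The body line doesn't qualify.
console.log(sample.match(sepRE).length); // 1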

Work-in-progress patch which should sort things out. Works for me, although I want to do some stress-testing and I need to do a pass through to clear out the debug output cruft and tidy the code up.

The approach I took is to treat all the incoming data as binary rather than (insanely) trying to decode it as text. It means regexps can't be used, and the alternative code is a little clunky, but that's what you get for doing byte-wrangling in Javascript :-)

(In reply to Cluster15 from comment #4)

let sepRE =/^(?:From .*\r?\n)[\x21-\x7E]+:/gm;

That's more or less what I ended up going for.
A "From " at the beginning of a line, ignoring the rest of the line, as long as the next line looks like it's a header field.

A tidied up/tweaked version of the patch.

Attachment #9123990 - Attachment is obsolete: true
Attachment #9124399 - Flags: review?(mkmelin+mozilla)

Just some notes for future reference:
The original version of the mbox->maildir conversion used a UTF-8 decoder upon the incoming mbox data to get it into a JavaScript string. It would then search for "From " separator lines, break up the string and pass it back through a UTF-8 encoder as it was written out to individual files in the maildir.
With a 7-bit-clean mbox file (which I think most mbox variants are expected to be, using base64 encoding or whatever when contents are 8-bit) this does no harm. 7-bit ASCII maps directly to UTF-8 and back again with no loss. However, out in the real world, we've got mbox files which have messages containing all sorts of non-7-bit encodings.
Especially when the obvious maildir->mbox conversion method is to concatenate files, adding "From " lines between each one.
It's not guaranteed that such 8-bit messages in an mbox would survive the UTF-8 round trip intact.
So, the safe thing is to treat the mbox files as binary, and preserve everything between "From " separators as is.
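
A tiny illustration of the corruption, assuming a byte such as 0xE9 ("é" in ISO-8859-1) which isn't valid UTF-8 on its own:

// Hypothetical snippet: "café" encoded in ISO-8859-1.
let original = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);

// Decoding as UTF-8 replaces the invalid 0xE9 byte with U+FFFD...
let decoded = new TextDecoder("utf-8").decode(original);

// ...so re-encoding gives different bytes: [99, 97, 102, 239, 191, 189].
let roundTripped = new TextEncoder().encode(decoded);
console.log(Array.from(roundTripped));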

Comment on attachment 9124399 [details] [diff] [review]
1515254-preserve-encodings-1.patch

Review of attachment 9124399 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not sure I still understand *why* the earlier thing wasn't working properly for all cases. Even if we're encoding to UTF-8 and it's not, we only care about finding the From parts, which are always ASCII and thus UTF-8 compatible.

Anyway, while it seems to "work", the performance is really abysmal. 

Status quo: 1 GB mbox -> maildir took 18s.
With the patch the same operation takes around 10 minutes.

::: mailnews/base/util/converterWorker.js
@@ +240,5 @@
> +  /**
> +   * Helper. Returns true if the specified bytes look like an rfc2822-style
> +   * header. It checks that there's something looking like a header name at
> +   * the given position (see https://tools.ietf.org/html/rfc2822#section-2.2)
> +   * ie printable US-ASCII 33-126 inclusive, followed by a colon.

i.e.

@@ +277,5 @@
> +   *
> +   * @param {Uint8Array} buf - The bytes to search.
> +   * @param {Number} startIndex - The position to begin searching from.
> +   * @return {Object} An object containing 'start' and 'end' to denote the
> +   *                  range of the separator line, including the LF.

would wrap these lines just indenting by two past @return

@@ +295,5 @@
> +        return null; // "From " not found in buffer.
> +      }
> +      // Found one. Is it at start of line (or file)?
> +      if (idx > 0) {
> +        if (buf[idx - 1] != LF) {

should combine these ifs
Attachment #9124399 - Flags: review?(mkmelin+mozilla) → review-
Status: NEW → ASSIGNED

(In reply to Magnus Melin [:mkmelin] from comment #8)

I'm not sure I still understand why the earlier thing wasn't working
properly for all cases. Even if we're encoding to UTF-8 and it's not, we
only care about finding the From parts which are always ascii and thus utf-8
compatible.

We still need to get the parts in between the "From " separator lines as 8-bit bytes, and those chunks won't survive the UTF-8 round trip (some byte sequences are invalid UTF-8, so will be corrupted).

Anyway, while it seems to "work", the performance is really abysmal.
Status quo: 1 GB mbox -> maildir took 18s.
With the patch the same operation takes around 10 minutes.

Ouch! Yep, that's unacceptable. Will investigate and try some alternatives.

Lots of faffing about, but here's a different approach.

It's a simpler patch which ditches the UTF-8 TextEncoder/TextDecoder pair (which was corrupting the data, because the data isn't UTF-8). Instead, it uses a pair of functions which just map bytes directly to and from characters in a JavaScript string (so only characters 0-255 appear in the string form).
It's still a little sucky because we're using twice the memory we really need to.
But on the plus side:

  • we can still use regexps to look for "From " separators (which makes for faster searching than manually looking through Uint8Arrays)
  • JS strings have all kinds of magic optimisations to speed things up.

This patch works on the previously-failing unit tests, and seems OK-but-not-great on my 150MB mbox example: 10 secs or so (I need to synthesize up some multi-GB mbox tests). But this approach should be more amenable to speeding up than my previous patch.
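
For the record, the direct byte-to-char mapping boils down to something like this (a sketch of the idea; the actual helpers in the patch may differ in detail):

// Map each byte to the character with the same code (0-255), no encoding.
function bytesToString(bytes) {
  let str = "";
  for (let i = 0; i < bytes.length; i++) {
    str += String.fromCharCode(bytes[i]);
  }
  return str;
}

// And back again: keep only the low byte of each 16-bit character.
function stringToBytes(s) {
  let bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    bytes[i] = s.charCodeAt(i) & 0xff;
  }
  return bytes;
}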

Geoff, a couple of JS questions:

  1. Is there a "proper" way to do some performance timers in JS? I want to wrap various sections of my code up to accumulate timings to see where the time is being spent. The MDN docs seem to suggest the JS performance timers are hobbled for security reasons...
  2. In the attached patch, I suspect stringToBytes and bytesToString are the choke points. I've left them both pretty naive as I know JS optimisation isn't always intuitive. Can you see any obvious ways to speed them up? (I did try let str = String.fromCharCode(...rawBytes), but of course hit some upper limit on param count. Next step I guess would be to try this in batches, but seems silly without code to measure it).
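
(For reference, the batching idea mentioned in point 2 might look roughly like this; the batch size is an arbitrary guess and this isn't measured code.)

// Hypothetical batched conversion, to dodge the argument-count limit of
// String.fromCharCode(...rawBytes).
function bytesToStringBatched(bytes, batchSize = 0x8000) {
  let parts = [];
  for (let i = 0; i < bytes.length; i += batchSize) {
    parts.push(String.fromCharCode.apply(null, bytes.subarray(i, i + batchSize)));
  }
  return parts.join("");
}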
Attachment #9124399 - Attachment is obsolete: true
Flags: needinfo?(geoff)

Before I look at this more closely, you do know that TextDecoder/TextEncoder doesn't have to be UTF-8, right?

The problem is that Berkeley-style mboxes can contain a mixture of valid 8-bit encoded mails (and of course any 7-bit encoded ones as well). One encoding, whichever you choose, will simply not fit in those cases. I have hundreds of 8-bit mails in mboxes where 7-bit ASCII, 8-bit ISO-8859-1(5) and, later on, UTF-8 encoded mails are happily glued together. Unless such a case is captured by the 'x-user-defined' encoding mentioned (but not documented) near the end of the encoding list, decoding with any single text encoding will fail whenever there is an encoding mismatch.

I might add that while performance may be an issue, in a way I do think a couple of minutes (up to 20 min) is easily tolerable. It's not something you will do on a daily basis. And I personally prefer correctness of code over performance. Tell the user how long it might take and that this is a one-time action, and it will be fine.

(In reply to Geoff Lankow (:darktrojan) from comment #11)

Before I look at this more closely, you do know that TextDecoder/TextEncoder doesn't have to be UTF-8, right?

Yes, but I didn't see anything which I was confident would map bytes unmolested to UTF-16 0-255. I'd love to be proved wrong here ;-)

Flags: needinfo?(geoff)

(In reply to Ben Campbell from comment #10)

Geoff, a couple of JS questions:

  1. Is there a "proper" way to do some performance timers in JS? I want to wrap various bits sections of my code up to accumulate timings to see where the time is being spent. The MDN docs seem to suggest the JS performance timers are hobbled for security reasons...

The imprecision might not be there for chrome code but I'm not sure. It might pay to ask somewhere like mozilla.dev.platform.

  2. In the attached patch, I suspect stringToBytes and bytesToString are the choke points. I've left them both pretty naive as I know JS optimisation isn't always intuitive. Can you see any obvious ways to speed them up? (I did try let str = String.fromCharCode(...rawBytes), but of course hit some upper limit on param count. Next step I guess would be to try this in batches, but seems silly without code to measure it).

I suspect you'll get better performance going through the Blob class, but it's only a suspicion:

// string to array buffer
let buffer = await new Blob([string]).arrayBuffer();
let array = new Uint8Array(buffer);

// back again
let reader = new FileReader();
reader.addEventListener("loadend", () => {
  string = reader.result;
});
reader.readAsBinaryString(new Blob([array]));

I've been doing some timings with my patch, and it's about half the speed of the original. Which isn't toooo bad considering it's doing the binary->text decoding in naive loop-over-every-byte JavaScript rather than native C++ (or Rust - I forget which).
I'm sure we can shave off some time by being a little more clever (e.g. Geoff's Blob suggestion, although we're in a worker thread, so there might be complications there...).
On my machine, it converts a faked-up 1.2GB mailbox with 1 million small text-only emails in about 5 minutes (vs 2:30 with the old code).
It seems to scale linearly and I didn't spot any weird runaway memory usage (can never quite tell what the garbage collection will do ;-) ).
So the next step would be to try some simple-as-possible approaches to shave some time off, then call it done.

However.

While I was running my tests, I noticed that the file counts in the resulting maildir were way lower than they should be - looks like a 30-40% loss. It happens both with and without my patch, so I think the problem was already there, we just never noticed it before. The minimal test cases use a small number of emails (like 20-30, not thousands).
I'll look into it tomorrow and try and see what's happening. Probably tackle it in a new bug. Sigh.

Just uploading my hacky little speed-test script for future reference (I did it as a unit test to make it simple to run).

The missing-emails issue I stumbled upon yesterday turned out to be the separator line parsing being too strict. I'll fix that in another patch, but for now, here's a revised version of the main patch to fix the encodings.
To recap, the conversion is about half the speed of the old code (which erroneously used the native TextEncoder/TextDecoder, screwing up some content). For a one-off operation, I can live with that, but I'm happy to look at it afterwards (in a followup bug maybe?). But for now I'd like to be moving on to other stuff.

To be honest, I'd ideally like to ditch this code and do such conversions using the C++ pluggable mbox/maildir implementations rather than maintain these (but there are a few threading & GUI issues that'd need to be worked through, so not trivial). I'm pretty sure going that route would be a lot quicker (it should be an I/O-bound operation).

Magnus: you're my default reviewer on this patch, but an r? flag will have to wait until your bugzilla account is sorted :-)

Attachment #9125380 - Attachment is obsolete: true
Attached patch 1515254-free-form-from-1.patch (obsolete) — Splinter Review

And this one loosens the overly-rigid "From " matching, as per Comment 4.

This was the problem I encountered in Comment 15. It was tripping up on my faked-up mbox data, which had slight date-formatting differences from what was expected (spaces vs leading zeros for single-digit days).

(The overly-rigid regexp was used to distinguish real "From " lines from non-escaped "From " lines in the message body, but since then the much better check-for-a-mail-header-on-the-next-line hack was added).
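
Roughly speaking, the loosened matching amounts to something like this (an illustrative regexp, not necessarily the exact one in the patch):

// Accept any line starting with "From ", but only treat it as a separator
// when the next line looks like a header field (printable ASCII, then ":").
let sepRE = /^From .*\r?\n(?=[\x21-\x7E]+:)/gm;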

Attachment #9129000 - Flags: review?(mkmelin+mozilla)
Attachment #9129002 - Flags: review?(mkmelin+mozilla)
Blocks: 1611897
Comment on attachment 9129002 [details] [diff] [review]
1515254-free-form-from-1.patch

Review of attachment 9129002 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/base/util/converterWorker.js
@@ +176,5 @@
>    // cope with separator lines which might span chunks.
>    const SAFE_MARGIN = 100;
>  
>    // A regexp to match mbox separator lines.
> +  // Look for any line which begins with "From ".

Please keep the examples.
Attachment #9129002 - Flags: review?(mkmelin+mozilla) → review+
Comment on attachment 9129000 [details] [diff] [review]
1515254-make-mbox-maildir-8-bit-safe-2.patch

Review of attachment 9129000 [details] [diff] [review]:
-----------------------------------------------------------------

comm/mailnews/base/test/unit/test_mailstoreConverter.js seems to fail with the patch(es) applied (tested locally)

::: mailnews/base/util/converterWorker.js
@@ +196,5 @@
> +  /**
> +   * Helper. Convert a string into a Uint8Array, using no encoding. The low
> +   * byte of each 16 bit character will be used, the high byte discarded.
> +   *
> +   * @param {String} s     - input JS string, with chars from 0 to 255 only.

please capital I, and no extra indentation, like
@param {string} s - Input string with chars only from 0-255.

@@ +213,5 @@
> +   * directly as a character code. So all characters in the resulting string
> +   * will range from 0 to 255, even though they are 16 bit values.
> +   *
> +   * @param {Uint8Array} bytes - the bytes to convert.
> +   * @returns {String} s       - resultant JS string.

@returns {string} The string corresponding to the bytes in the specified Uint8Array

(no name for @returns)
Talking about JS strings is a bit confusing.

@@ +215,5 @@
> +   *
> +   * @param {Uint8Array} bytes - the bytes to convert.
> +   * @returns {String} s       - resultant JS string.
> +   */
> +  let bytesToString = function(bytes) {

Maybe it should just be 
bytes.map(b => String.fromCharCode(b)).join("")

@@ +218,5 @@
> +   */
> +  let bytesToString = function(bytes) {
> +    let str = "";
> +    for (let i = 0; i < bytes.length; i++) {
> +      str = str + String.fromCharCode(bytes[i]);

+=

@@ +235,5 @@
>        let outPath = OS.Path.join(curDirPath, ident.toString() + ".eml");
>        ident += 1;
>        outFile = OS.File.open(outPath, { write: true, create: true }, {});
>      }
> +    // We know that str is really raw 8-bit data, not UTF-16. So we can

Hmm, what does happen with UTF-16? Test case at https://searchfox.org/comm-central/source/mail/test/browser/composition/data/body-utf16.eml - maybe you can add that to the tests?

@@ +258,5 @@
>    let buf = "";
>    let eof = false;
>    while (!eof) {
> +    let rawBytes = mboxFile.read(CHUNK_SIZE);
> +    // We're using javascript strings (which hold 16bit characters) to store

nit: JavaScript

@@ +261,5 @@
> +    let rawBytes = mboxFile.read(CHUNK_SIZE);
> +    // We're using javascript strings (which hold 16bit characters) to store
> +    // 8 bit data. This sucks, but is faster than trying to operate directly
> +    // upon Uint8Arrays. A lot of work goes into optimising JS strings.
> +    buf = buf + bytesToString(rawBytes);

+=
Attachment #9129000 - Flags: review?(mkmelin+mozilla) → review-

(In reply to Magnus Melin [:mkmelin] from comment #20)

  • let bytesToString = function(bytes) {

Maybe it should just be
bytes.map(b => String.fromCharCode(b)).join("")

No, it doesn't work. map() returns another Uint8Array, so the join() doesn't work properly (you just end up with a rubbish string of digits). I did change bytesToString to accumulate the string using reduce, as in:

    return bytes.reduce(function(str, b) { return str + String.fromCharCode(b); }, "");

that works, but isn't any faster.
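
A quick illustration of the difference (a hypothetical snippet, not part of the patch):

let bytes = new Uint8Array([72, 105]); // "Hi"
// TypedArray.map() coerces the returned characters back to numbers (NaN -> 0):
bytes.map(b => String.fromCharCode(b)); // Uint8Array [0, 0]
// Array.from() with a map function does build an array of strings:
Array.from(bytes, b => String.fromCharCode(b)).join(""); // "Hi"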

  • // We know that str is really raw 8-bit data, not UTF-16. So we can

Hmm, what does happen with UTF-16? Test case at
https://searchfox.org/comm-central/source/mail/test/browser/composition/data/
body-utf16.eml - maybe you can add that to the tests?

No, it really is raw 8-bit data. The whole point is to treat all incoming data as bytes, regardless of encoding, and write it back out verbatim (and split by the "From " lines). The only reason we want it in string form is to run the "From " regex, which is waaay quicker than searching the raw bytes.

Attachment #9129000 - Attachment is obsolete: true
Attachment #9132739 - Flags: review?(mkmelin+mozilla)

And this update includes the fix for the broken unit test.
I'd screwed up the regexp. Doh.

Attachment #9129002 - Attachment is obsolete: true
Attachment #9132743 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9132739 [details] [diff] [review]
1515254-make-mbox-maildir-8-bit-safe-3.patch

Review of attachment 9132739 [details] [diff] [review]:
-----------------------------------------------------------------

This is better. I'm clocking in around 52sec/GB which is not too bad.

::: mailnews/base/util/converterWorker.js
@@ +256,5 @@
>    let buf = "";
>    let eof = false;
>    while (!eof) {
> +    let rawBytes = mboxFile.read(CHUNK_SIZE);
> +    // We're using Javascript strings (which hold 16bit characters) to store

JavaScript

@@ +258,5 @@
>    while (!eof) {
> +    let rawBytes = mboxFile.read(CHUNK_SIZE);
> +    // We're using Javascript strings (which hold 16bit characters) to store
> +    // 8 bit data. This sucks, but is faster than trying to operate directly
> +    // upon Uint8Arrays. A lot of work goes into optimising JS strings.

and maybe here too
Attachment #9132739 - Flags: review?(mkmelin+mozilla) → review+
Attachment #9132743 - Flags: review?(mkmelin+mozilla) → review+
Target Milestone: --- → Thunderbird 76.0

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/b8cc5965aca0
Make mbox->maildir conversion 8-bit safe. r=mkmelin
https://hg.mozilla.org/comm-central/rev/4a0bd51ee9c0
Loosen restrictive "From "-line matching during mbox->maildir conversions. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED