Last Comment Bug 790855 - Make the new MIME parser charset-aware
: Make the new MIME parser charset-aware
Status: RESOLVED FIXED
:
Product: MailNews Core
Classification: Components
Component: MIME (show other bugs)
: unspecified
: All All
: -- normal (vote)
: Thunderbird 32.0
Assigned To: Joshua Cranmer [:jcranmer]
:
:
Mentors:
Depends on: 746052 764234 858337 959309 1022723 1045753 1060901
Blocks:
  Show dependency treegraph
 
Reported: 2012-09-12 20:31 PDT by Joshua Cranmer [:jcranmer]
Modified: 2015-03-13 11:14 PDT (History)
12 users (show)
Pidgeot18: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
First cut (27.46 KB, patch)
2012-09-12 20:31 PDT, Joshua Cranmer [:jcranmer]
no flags Details | Diff | Splinter Review
Part 1: Separate out header parsing from the rest of the core (37.15 KB, patch)
2013-04-10 06:42 PDT, Joshua Cranmer [:jcranmer]
irving: review+
Details | Diff | Splinter Review
Part 2: Implement header tokenization (12.48 KB, patch)
2013-04-10 06:43 PDT, Joshua Cranmer [:jcranmer]
irving: review+
Details | Diff | Splinter Review
Part 3: Introduce JSMime to charsets (54.84 KB, patch)
2013-04-10 06:47 PDT, Joshua Cranmer [:jcranmer]
irving: review+
Details | Diff | Splinter Review
Part α: Implement RFC 2047 encoding (12.35 KB, patch)
2013-04-10 06:53 PDT, Joshua Cranmer [:jcranmer]
irving: review+
Details | Diff | Splinter Review
Part β: Make nsIMimeConverter return an AUTF8String (6.45 KB, patch)
2013-04-10 06:55 PDT, Joshua Cranmer [:jcranmer]
irving: feedback+
Details | Diff | Splinter Review
Part γ: Implement nsIMimeConverter in JS (23.02 KB, patch)
2013-04-10 06:56 PDT, Joshua Cranmer [:jcranmer]
irving: feedback+
Details | Diff | Splinter Review
Part 1: Replace the ACString in the API with AUTF8String (7.48 KB, patch)
2014-04-05 18:13 PDT, Joshua Cranmer [:jcranmer]
neil: review+
Details | Diff | Splinter Review
Implement the API in JS and remove the C++ version (55.63 KB, patch)
2014-04-05 18:14 PDT, Joshua Cranmer [:jcranmer]
irving: review+
Details | Diff | Splinter Review

Description Joshua Cranmer [:jcranmer] 2012-09-12 20:31:15 PDT
Created attachment 660684 [details] [diff] [review]
First cut

The new MIME parser may have lots of goodies, but it doesn't yet deal with the mess that is charsets. It also needs to learn about RFC 2047 and RFC 2231. But I need goodies first: TextDecoder, in particular.

Attached is a first cut on this patch, which adds a shim to get around the fact that I don't have decoders yet. It also adds in parts of the API that I removed from bug 746052 since that bug didn't implement them yet. And that I still don't implement. It also does RFC 2231 decoding, largely stolen from necko's parser, but it does appear to have somewhat different error handling. RFC 2047 decoding is still unconsidered.

I'm also proud of my testcase for charsets in MIME, which is an evil little message that I hope no one ever sees on the web: it has "Hello world" in 6 languages and a variety of charsets and encodings, including a binary UTF-16 encoding.

Oh crap, I just realized: I never included any non-BMP characters in that testcase. I guess it's time to see which charsets other than Unicode have them...
Comment 1 Joshua Cranmer [:jcranmer] 2012-10-07 10:23:21 PDT
I've been busy testing the 2231 decoding with all of the test_MIME_params.js and arrived at an interesting dichotomy here:

The necko parser has a preference for options that looks like the following order:
1. Prefer extended value
2. Prefer continuations
3. Prefer basic

Continuations are preferred even if the continuation found is bad, using only the subset of the continuation that is good.

The preliminary implementation I have in the first cut patch ignores the continuation potential if it is malformed, preferring to use a weaker version. It's not too hard to make it follow the necko version instead (it's just commenting out about three lines of code, in fact), but I wonder if it is the best idea to use for error handling...
Comment 2 Julian Reschke 2012-10-07 10:32:09 PDT
(In reply to Joshua Cranmer [:jcranmer] from comment #1)
> The preliminary implementation I have in the first cut patch ignores the
> continuation potential if it is malformed, preferring to use a weaker
> version. It's not too hard to make it follow the necko version instead (it's
> just commenting out about three lines of code, in fact), but I wonder if it
> is the best idea to use for error handling...

I don't think it matters in practice.

The existing code works the way it does because nobody wants to change things for no good reasons; but that doesn't mean it needs to work exactly that way.
Comment 3 Joshua Cranmer [:jcranmer] 2013-04-10 06:42:59 PDT
Created attachment 735738 [details] [diff] [review]
Part 1: Separate out header parsing from the rest of the core

That file's getting a bit large...
Comment 4 Joshua Cranmer [:jcranmer] 2013-04-10 06:43:54 PDT
Created attachment 735739 [details] [diff] [review]
Part 2: Implement header tokenization
Comment 5 Joshua Cranmer [:jcranmer] 2013-04-10 06:47:23 PDT
Created attachment 735741 [details] [diff] [review]
Part 3: Introduce JSMime to charsets

I particularly like the insanity that is test/data/charsets. At least Mercurial was nice enough to consider it a binary file instead of choking on displaying a file that is simultaneously UTF-8, Big5, and UTF-16.
Comment 6 Joshua Cranmer [:jcranmer] 2013-04-10 06:53:51 PDT
Created attachment 735743 [details] [diff] [review]
Part α: Implement RFC 2047 encoding

Harder than it looks.

Also, this is numbered differently, since it needs the patches in the other bugs to be applied between part 3 and part α due to patch conflicts, although the addressing parser/emitter will be needed for part γ.
Comment 7 Joshua Cranmer [:jcranmer] 2013-04-10 06:55:32 PDT
Created attachment 735745 [details] [diff] [review]
Part β: Make nsIMimeConverter return an AUTF8String

So I thought I had fixed all the problems with the last nsIMimeConverter changes, but apparently not. If I don't tell xpconnect that this is UTF-8, then we break sending IDN messages (since we don't try to encode the UTF-8 in email addresses).
Comment 8 Joshua Cranmer [:jcranmer] 2013-04-10 06:56:52 PDT
Created attachment 735747 [details] [diff] [review]
Part γ: Implement nsIMimeConverter in JS
Comment 9 Joshua Cranmer [:jcranmer] 2013-04-10 09:21:05 PDT
I should explain part γ's design points a bit better:

Getting RFC 2047 encoding working correctly in all cases--including in particular non-BMP characters--is very difficult and annoying. The algorithm I use only works because of the particulars of how UTF-8 encoding works, and isn't applicable to all charsets, e.g., Shift-JIS. (note that the current implementation in comi18n.cpp does not encode non-BMP characters correctly in some situations--that's why I have tests for exactly that).

Additionally, the only charsets that can perfectly map all Unicode characters are UTF-* and GB18030, so using any other charset in the input would mean my code has to specify how to handle someone passing in incorrect data. The tool that I'm using, TextEncoder, can only encode to UTF-8 or UTF-16 as well, so implementing other parsers would mean resorting to implementing a kludge glue API myself.

So, in short, I only want to support UTF-8 in the JS mime version of 2047 encoding (and 2231 encoding, when I rewrite mime assembly eventually). That leaves me with two options for implementing the interface: I can either ignore the charset parameter altogether and unconditionally use UTF-8, or I can fallback to the old implementation of 2047-encoding. The former strikes me as the better long-term proposition, but I discovered that a surprising amount of our tests assume that this will be called with ISO-8859-1, and I didn't want yet another round of fix-all-the-broken-tests, so my plan is to leave this for a followup.
Comment 10 :Irving Reid (No longer working on Firefox) 2013-05-01 13:07:10 PDT
Comment on attachment 735738 [details] [diff] [review]
Part 1: Separate out header parsing from the rest of the core

Review of attachment 735738 [details] [diff] [review]:
-----------------------------------------------------------------

The raw patch contains changes to mimeParserCore.js, followed by an HG copy with further changes from mimeParserCore.js to mimeParserHeaders.js; this makes it hard to see what really happened in the splinter review. If I'm reading it correctly, the actual change was "copy core to headers, remove header bits from core, remove non-header bits from headers" - does that sound right?
Comment 11 :Irving Reid (No longer working on Firefox) 2013-05-01 13:07:53 PDT
Comment on attachment 735739 [details] [diff] [review]
Part 2: Implement header tokenization

Review of attachment 735739 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/mime/jsmime/mimeParserHeaders.js
@@ +193,5 @@
> +      //  token = token.replace(/%([0-9A-Fa-f]{2})/g,
> +      //    function percent_deencode(match, hexchars) {
> +      //      return String.fromCharCode(parseInt(hexchars, 16));
> +      //  });
> +      //}

Does this chunk of commented out code need be resurrected, or does it need to be removed?

@@ +206,5 @@
>  
>    // Now matches holds the parameters. Clean up for RFC 2231. There are four
>    // cases: param=val, param*=us-ascii'en-US'blah, and param*n= variants. The
>    // order of preference is to pick the middle, then the last, then the first.
>    // TODO: RFC 2231 is yet to be implemented

Do. Or do not. There is no "TODO" ;->
Comment 12 :Irving Reid (No longer working on Firefox) 2013-05-01 13:40:47 PDT
Comment on attachment 735741 [details] [diff] [review]
Part 3: Introduce JSMime to charsets

Review of attachment 735741 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/mime/jsmime/mimeParserHeaders.js
@@ +186,5 @@
> +function convertHeaderToUnicode(headerValue, fallbackCharset) {
> +  // Only attempt to convert the headerValue if it contains non-ASCII
> +  // characters.
> +  if (/[\x80-\xff]/.exec(headerValue)) {
> +    // First convert the value to a typed-arry for TextDecoder.

nit: s/arry/array/

@@ -193,5 @@
> -      //  token = token.replace(/%([0-9A-Fa-f]{2})/g,
> -      //    function percent_deencode(match, hexchars) {
> -      //      return String.fromCharCode(parseInt(hexchars, 16));
> -      //  });
> -      //}

Ah, this answers my question from one of the earlier patch reviews...

@@ -208,3 @@
>    // cases: param=val, param*=us-ascii'en-US'blah, and param*n= variants. The
>    // order of preference is to pick the middle, then the last, then the first.
> -  // TODO: RFC 2231 is yet to be implemented

And this answers another ;->
Comment 13 Joshua Cranmer [:jcranmer] 2013-05-02 21:30:38 PDT
(In reply to :Irving Reid from comment #10)
> The raw patch contains changes to mimeParserCore.js, followed by an HG copy
> with further changes from mimeParserCore.js to mimeParserHeaders.js; this
> makes it hard to see what really happened in the splinter review. If I'm
> reading it correctly, the actual change was "copy core to headers, remove
> header bits from core, remove non-header bits from headers" - does that
> sound right?

That is correct.
Comment 14 :Irving Reid (No longer working on Firefox) 2013-05-26 17:57:07 PDT
Comment on attachment 735743 [details] [diff] [review]
Part α: Implement RFC 2047 encoding

Review of attachment 735743 [details] [diff] [review]:
-----------------------------------------------------------------

Had bitrot problems when I tried to apply the whole stack, so I'm not doing my usual build / run / look for error messages routine, just reading the code for now.

::: mailnews/mime/jsmime/mimeAssemblerHeaders.js
@@ +62,5 @@
>   * and trailing whitespace is trimmed from inputs.
>   */
>  function HeaderAssembler(options) {
>    this._options = options;
> +  this._use2047 = options.use2047 === undefined ? false : options.use2047;

Is this meant to be ("use2047" in options) ? : , or is passing undefined a thing callers might do?

@@ +243,5 @@
> +/**
> + * Add a block of text as a single RFC 2047 encoded word. This does not try to
> + * split words if they are too long.
> + *
> + * @param encodedText   A typed array of octets to encode.

it's not encoded yet - maybe call this textToEncode?

@@ +260,5 @@
> +      if (encodedText[i] < 0x20 || encodedText[i] >= 0x7F ||
> +          qpForbidden.contains(binaryString[i])) {
> +        let ch = encodedText[i];
> +        let hexString = "0123456789abcdef";
> +        token += "=" + hexString[(ch & 0xf0) >> 4] + hexString[ch & 0x0f];

I'm surprised we don't have a general purpose QP-converter we could use here. I didn't see one based on a quick mxr search.

@@ +308,5 @@
> +        qpForbidden.contains(String.fromCharCode(encodedText[i]))) {
> +      qpInc = 3;
> +    } else {
> +      qpInc = 1;
> +    }

Keeping the side-by-side QP or b64 conversion check here seems like a bit much - I'd be inclined to just take a sample of the text to encode, count what proportion of the characters in the sample need encoding, and then choose to QP or b64 everything based on that.

::: mailnews/mime/test/unit/test_header_assembler.js
@@ +63,5 @@
> +    "=?UTF-8?Q?=1faaa=1faaaaaaaaa?=\r\n =?UTF-8?Q?a?="],
> +
> +  // Choose base64/qp independently for each word
> +  ["\ud83d\udca9\ud83d\udca9\ud83d\udca9a",
> +    "=?UTF-8?B?8J+SqfCfkqnwn5Kp?=\r\n =?UTF-8?Q?a?="],

Any chance a receiver would be confused by getting mixed QP and b64 applied to words in the same string?
Comment 15 :Irving Reid (No longer working on Firefox) 2013-05-26 18:12:58 PDT
Comment on attachment 735745 [details] [diff] [review]
Part β: Make nsIMimeConverter return an AUTF8String

Review of attachment 735745 [details] [diff] [review]:
-----------------------------------------------------------------

I'll make this an f+ for now, and we can magically turn it into an r+ if it's not worth spreading the String type farther out in this patch.

::: mailnews/base/util/nsMsgI18N.cpp
@@ +218,2 @@
>  
> +  return NS_SUCCEEDED(res) ? PL_strdup(encodedString.get()) : nullptr;

Is it going to cause too much spreading String infection to also change this return type (and the one in nsMsgUtils.cpp)? Sad to have to strdup(encodedString.get()) here.
Comment 16 :Irving Reid (No longer working on Firefox) 2013-05-26 18:43:16 PDT
Comment on attachment 735747 [details] [diff] [review]
Part γ: Implement nsIMimeConverter in JS

Review of attachment 735747 [details] [diff] [review]:
-----------------------------------------------------------------

is it worth putting the mime converter into mailServices.js?

::: mailnews/mime/src/mimeJSComponents.js
@@ +173,5 @@
> +    // The JSMime encoder only works in UTF-8, so if someone requests to not do
> +    // it in UTF-8, fallback to the broken C version.
> +    if (aCharset.toLowerCase() != "utf-8") {
> +      //Deprecated.warning("Encoding to non-UTF-8 values is obsolete",
> +      //  "http://bugzilla.mozilla.org/show_bug.cgi?id=790855");

You commented out the deprecation warning here, but not the module load above.

How risky would it be to just completely remove non-UTF8 support? I see from the bug comments that it will break our tests, which is a big pain, but does it break any other code? Are there any extensions using it?
Comment 17 Joshua Cranmer [:jcranmer] 2013-12-14 11:46:47 PST
[Apologies for bugspam, there's no easy way to mass-obsolete patches without uploading a new one]

After some consideration, I've changed the landing strategy for the next steps of JSMime, which makes the existing patches in the queue outside of bug 842632 obsolete.
Comment 18 Joshua Cranmer [:jcranmer] 2014-04-05 18:13:07 PDT
Created attachment 8402242 [details] [diff] [review]
Part 1: Replace the ACString in the API with AUTF8String

Turns out nsIMimeConverter has an edge case where it fails to convert strings to ASCII: <foo.bar@ケツァルコアトル.tlalocan> wouldn't be converted by the API. To make it work with the JS implementation, it needs to be an AUTF8String instead.
Comment 19 Joshua Cranmer [:jcranmer] 2014-04-05 18:14:22 PDT
Created attachment 8402243 [details] [diff] [review]
Implement the API in JS and remove the C++ version

Don't you love the smell of 1000 lines of code down the drain?
Comment 20 Joshua Cranmer [:jcranmer] 2014-04-05 18:15:51 PDT
This requires the patch in bug 858337 to work properly.
Comment 21 :Irving Reid (No longer working on Firefox) 2014-04-07 11:27:03 PDT
Comment on attachment 8402243 [details] [diff] [review]
Implement the API in JS and remove the C++ version

Review of attachment 8402243 [details] [diff] [review]:
-----------------------------------------------------------------

It's nice.

::: mailnews/mime/src/mimeJSComponents.js
@@ +260,5 @@
> +}
> +MimeConverter.prototype = {
> +  classID: Components.ID("93f8c049-80ed-4dda-9000-94ad8daba44c"),
> +  QueryInterface: XPCOMUtils.generateQI([Components.interfaces.nsIMimeConverter]),
> + 

White space!

@@ +273,4 @@
>  
> +    // Compute the encoding options. The way our API is structured in this
> +    // method is really horrendous and does not align with the way that JSMime
> +    // handles it. Instead, we'll need to create a fake header to take into

Do we have a follow up bug to restructure the API and get rid of the hack?

@@ +308,5 @@
> +    emitter.finish(true);
> +    let value = handler.value;
> +    value = value.substring(value.indexOf(': ') + 2);
> +    return value.substring(0, value.length - 2);
> +  },

Blank line between method definitions
Comment 22 neil@parkwaycc.co.uk 2014-04-07 14:33:25 PDT
Comment on attachment 8402242 [details] [diff] [review]
Part 1: Replace the ACString in the API with AUTF8String

>       return PL_strdup(convertedStr.get());
>     else
>       return PL_strdup(header);
>   }
> 
>-  char *encodedString = nullptr;
>+  nsAutoCString encodedString;
>   nsresult res;
>   nsCOMPtr<nsIMimeConverter> converter = do_GetService(NS_MIME_CONVERTER_CONTRACTID, &res);
>   if (NS_SUCCEEDED(res) && nullptr != converter)
>-    res = converter->EncodeMimePartIIStr_UTF8(nsDependentCString(header), structured, charset,
>-      fieldnamelen, nsIMimeConverter::MIME_ENCODED_WORD_SIZE, &encodedString);
>+    res = converter->EncodeMimePartIIStr_UTF8(nsDependentCString(header),
>+      structured, charset, fieldnamelen,
>+      nsIMimeConverter::MIME_ENCODED_WORD_SIZE, encodedString);
> 
>-  return NS_SUCCEEDED(res) ? encodedString : nullptr;
>+  return NS_SUCCEEDED(res) ? PL_strdup(encodedString.get()) : nullptr;
Normally I'd say use NS_strdup but that would look inconsistent so I guess that's OK...
Comment 23 Joshua Cranmer [:jcranmer] 2014-04-07 18:39:19 PDT
(In reply to :Irving Reid from comment #21)
> @@ +273,4 @@
> >  
> > +    // Compute the encoding options. The way our API is structured in this
> > +    // method is really horrendous and does not align with the way that JSMime
> > +    // handles it. Instead, we'll need to create a fake header to take into
> 
> Do we have a follow up bug to restructure the API and get rid of the hack?

I'm largely planning on making use of this API obsolete via the structured header revolution (bug 970116) and a refactoring of compose code (not yet filed). Once I've made it so that you can compose mail messages that automatically does RFC 2047, I'll deprecate the interface altogether (anyone who desperately needs it from JS can use jsmime, and there shouldn't be a need for it from C++).

Note You need to log in before you can comment on or make changes to this bug.