Closed
Bug 790855
Opened 11 years ago
Closed 10 years ago
Make the new MIME parser charset-aware
Categories
(MailNews Core :: MIME, defect)
MailNews Core
MIME
Tracking
(Not tracked)
RESOLVED
FIXED
Thunderbird 32.0
People
(Reporter: jcranmer, Assigned: jcranmer)
References
Details
Attachments
(2 files, 7 obsolete files)
7.48 KB,
patch
|
neil
:
review+
|
Details | Diff | Splinter Review |
55.63 KB,
patch
|
Irving
:
review+
|
Details | Diff | Splinter Review |
The new MIME parser may have lots of goodies, but it doesn't yet deal with the mess that is charsets. It also needs to learn about RFC 2047 and RFC 2231. But I need goodies first: TextDecoder, in particular. Attached is a first cut on this patch, which adds a shim to get around the fact that I don't have decoders yet. It also adds in parts of the API that I removed from bug 746052 since that bug didn't implement them yet. And that I still don't implement. It also does RFC 2231 decoding, largely stolen from necko's parser, but it does appear to have somewhat different error handling. RFC 2047 decoding is still unconsidered. I'm also proud of my testcase for charsets in MIME, which is an evil little message that I hope no one ever sees on the web: it has "Hello world" in 6 languages and a variety of charsets and encodings, including a binary UTF-16 encoding. Oh crap, I just realized: I never included any non-BMP characters in that testcase. I guess it's time to see which charsets other than Unicode have them...
Assignee | ||
Comment 1•11 years ago
|
||
I've been busy testing the 2231 decoding with all of the test_MIME_params.js and arrived at an interesting dichotomy here: The necko parser has a preference for options that looks like the following order: 1. Prefer extended value 2. Prefer continuations 3. Prefer basic Continuations are preferred even if the continuation found is bad, using only the subset of the continuation that is good. The preliminary implementation I have in the first cut patch ignores the continuation potential if it is malformed, preferring to use a weaker version. It's not too hard to make it follow the necko version instead (it's just commenting out about three lines of code, in fact), but I wonder if it is the best idea to use for error handling...
Comment 2•11 years ago
|
||
(In reply to Joshua Cranmer [:jcranmer] from comment #1) > The preliminary implementation I have in the first cut patch ignores the > continuation potential if it is malformed, preferring to use a weaker > version. It's not too hard to make it follow the necko version instead (it's > just commenting out about three lines of code, in fact), but I wonder if it > is the best idea to use for error handling... I don't think it matters in practice. The existing code works the way it does because nobody wants to change things for no good reasons; but that doesn't mean it needs to work exactly that way.
Assignee | ||
Comment 3•11 years ago
|
||
That file's getting a bit large...
Attachment #660684 -
Attachment is obsolete: true
Attachment #735738 -
Flags: review?(irving)
Assignee | ||
Comment 4•11 years ago
|
||
Attachment #735739 -
Flags: review?(irving)
Assignee | ||
Comment 5•11 years ago
|
||
I particularly like the insanity that is test/data/charsets. At least Mercurial was nice enough to consider it a binary file instead of choking on displaying a file that is simultaneously UTF-8, Big5, and UTF-16.
Attachment #735741 -
Flags: superreview?(neil)
Attachment #735741 -
Flags: review?(irving)
Assignee | ||
Comment 6•11 years ago
|
||
Harder than it looks. Also, this is numbered differently, since it needs the patches in the other bugs to be applied between part 3 and part α due to patch conflicts, although the addressing parser/emitter will be needed for part γ.
Attachment #735743 -
Flags: review?(irving)
Assignee | ||
Comment 7•11 years ago
|
||
So I thought I had fixed all the problems with the last nsIMimeConverter changes, but apparently not. If I don't tell xpconnect that this is UTF-8, then we break sending IDN messages (since we don't try to encode the UTF-8 in email addresses).
Attachment #735745 -
Flags: review?(irving)
Assignee | ||
Comment 8•11 years ago
|
||
Attachment #735747 -
Flags: superreview?(neil)
Attachment #735747 -
Flags: review?(irving)
Assignee | ||
Comment 9•11 years ago
|
||
I should explain part γ's design points a bit better: Getting RFC 2047 encoding working correctly in all cases--including in particular non-BMP characters--is very difficult and annoying. The algorithm I use only works because of the particulars of how UTF-8 encoding works, and isn't applicable to all charsets, e.g., Shift-JIS. (note that the current implementation in comi18n.cpp does not encode non-BMP characters correctly in some situations--that's why I have tests for exactly that). Additionally, the only charsets that can perfectly map all Unicode characters are UTF-* and GB18030, so using any other charset in the input would mean my code has to specify how to handle someone passing in incorrect data. The tool that I'm using, TextEncoder, can only encode to UTF-8 or UTF-16 as well, so implementing other parsers would mean resorting to implementing a kludge glue API myself. So, in short, I only want to support UTF-8 in the JS mime version of 2047 encoding (and 2231 encoding, when I rewrite mime assembly eventually). That leaves me with two options for implementing the interface: I can either ignore the charset parameter altogether and unconditionally use UTF-8, or I can fallback to the old implementation of 2047-encoding. The former strikes me as the better long-term proposition, but I discovered that a surprising amount of our tests assume that this will be called with ISO-8859-1, and I didn't want yet another round of fix-all-the-broken-tests, so my plan is to leave this for a followup.
Comment 10•11 years ago
|
||
Comment on attachment 735738 [details] [diff] [review] Part 1: Separate out header parsing from the rest of the core Review of attachment 735738 [details] [diff] [review]: ----------------------------------------------------------------- The raw patch contains changes to mimeParserCore.js, followed by an HG copy with further changes from mimeParserCore.js to mimeParserHeaders.js; this makes it hard to see what really happened in the splinter review. If I'm reading it correctly, the actual change was "copy core to headers, remove header bits from core, remove non-header bits from headers" - does that sound right?
Attachment #735738 -
Flags: review?(irving) → review+
Comment 11•11 years ago
|
||
Comment on attachment 735739 [details] [diff] [review] Part 2: Implement header tokenization Review of attachment 735739 [details] [diff] [review]: ----------------------------------------------------------------- ::: mailnews/mime/jsmime/mimeParserHeaders.js @@ +193,5 @@ > + // token = token.replace(/%([0-9A-Fa-f]{2})/g, > + // function percent_deencode(match, hexchars) { > + // return String.fromCharCode(parseInt(hexchars, 16)); > + // }); > + //} Does this chunk of commented out code need be resurrected, or does it need to be removed? @@ +206,5 @@ > > // Now matches holds the parameters. Clean up for RFC 2231. There are four > // cases: param=val, param*=us-ascii'en-US'blah, and param*n= variants. The > // order of preference is to pick the middle, then the last, then the first. > // TODO: RFC 2231 is yet to be implemented Do. Or do not. There is no "TODO" ;->
Attachment #735739 -
Flags: review?(irving) → review+
Comment 12•11 years ago
|
||
Comment on attachment 735741 [details] [diff] [review] Part 3: Introduce JSMime to charsets Review of attachment 735741 [details] [diff] [review]: ----------------------------------------------------------------- ::: mailnews/mime/jsmime/mimeParserHeaders.js @@ +186,5 @@ > +function convertHeaderToUnicode(headerValue, fallbackCharset) { > + // Only attempt to convert the headerValue if it contains non-ASCII > + // characters. > + if (/[\x80-\xff]/.exec(headerValue)) { > + // First convert the value to a typed-arry for TextDecoder. nit: s/arry/array/ @@ -193,5 @@ > - // token = token.replace(/%([0-9A-Fa-f]{2})/g, > - // function percent_deencode(match, hexchars) { > - // return String.fromCharCode(parseInt(hexchars, 16)); > - // }); > - //} Ah, this answers my question from one of the earlier patch reviews... @@ -208,3 @@ > // cases: param=val, param*=us-ascii'en-US'blah, and param*n= variants. The > // order of preference is to pick the middle, then the last, then the first. > - // TODO: RFC 2231 is yet to be implemented And this answers another ;->
Attachment #735741 -
Flags: review?(irving) → review+
Assignee | ||
Comment 13•11 years ago
|
||
(In reply to :Irving Reid from comment #10) > The raw patch contains changes to mimeParserCore.js, followed by an HG copy > with further changes from mimeParserCore.js to mimeParserHeaders.js; this > makes it hard to see what really happened in the splinter review. If I'm > reading it correctly, the actual change was "copy core to headers, remove > header bits from core, remove non-header bits from headers" - does that > sound right? That is correct.
Comment 14•11 years ago
|
||
Comment on attachment 735743 [details] [diff] [review] Part α: Implement RFC 2047 encoding Review of attachment 735743 [details] [diff] [review]: ----------------------------------------------------------------- Had bitrot problems when I tried to apply the whole stack, so I'm not doing my usual build / run / look for error messages routine, just reading the code for now. ::: mailnews/mime/jsmime/mimeAssemblerHeaders.js @@ +62,5 @@ > * and trailing whitespace is trimmed from inputs. > */ > function HeaderAssembler(options) { > this._options = options; > + this._use2047 = options.use2047 === undefined ? false : options.use2047; Is this meant to be ("use2047" in options) ? : , or is passing undefined a thing callers might do? @@ +243,5 @@ > +/** > + * Add a block of text as a single RFC 2047 encoded word. This does not try to > + * split words if they are too long. > + * > + * @param encodedText A typed array of octets to encode. it's not encoded yet - maybe call this textToEncode? @@ +260,5 @@ > + if (encodedText[i] < 0x20 || encodedText[i] >= 0x7F || > + qpForbidden.contains(binaryString[i])) { > + let ch = encodedText[i]; > + let hexString = "0123456789abcdef"; > + token += "=" + hexString[(ch & 0xf0) >> 4] + hexString[ch & 0x0f]; I'm surprised we don't have a general purpose QP-converter we could use here. I didn't see one based on a quick mxr search. @@ +308,5 @@ > + qpForbidden.contains(String.fromCharCode(encodedText[i]))) { > + qpInc = 3; > + } else { > + qpInc = 1; > + } Keeping the side-by-side QP or b64 conversion check here seems like a bit much - I'd be inclined to just take a sample of the text to encode, count what proportion of the characters in the sample need encoding, and then choose to QP or b64 everything based on that. ::: mailnews/mime/test/unit/test_header_assembler.js @@ +63,5 @@ > + "=?UTF-8?Q?=1faaa=1faaaaaaaaa?=\r\n =?UTF-8?Q?a?="], > + > + // Choose base64/qp independently for each word > + ["\ud83d\udca9\ud83d\udca9\ud83d\udca9a", > + "=?UTF-8?B?8J+SqfCfkqnwn5Kp?=\r\n =?UTF-8?Q?a?="], Any chance a receiver would be confused by getting mixed QP and b64 applied to words in the same string?
Attachment #735743 -
Flags: review?(irving) → review+
Comment 15•11 years ago
|
||
Comment on attachment 735745 [details] [diff] [review] Part β: Make nsIMimeConverter return an AUTF8String Review of attachment 735745 [details] [diff] [review]: ----------------------------------------------------------------- I'll make this an f+ for now, and we can magically turn it into an r+ if it's not worth spreading the String type farther out in this patch. ::: mailnews/base/util/nsMsgI18N.cpp @@ +218,2 @@ > > + return NS_SUCCEEDED(res) ? PL_strdup(encodedString.get()) : nullptr; Is it going to cause too much spreading String infection to also change this return type (and the one in nsMsgUtils.cpp)? Sad to have to strdup(encodedString.get()) here.
Updated•11 years ago
|
Attachment #735745 -
Flags: review?(irving) → feedback+
Comment 16•11 years ago
|
||
Comment on attachment 735747 [details] [diff] [review] Part γ: Implement nsIMimeConverter in JS Review of attachment 735747 [details] [diff] [review]: ----------------------------------------------------------------- is it worth putting the mime converter into mailServices.js? ::: mailnews/mime/src/mimeJSComponents.js @@ +173,5 @@ > + // The JSMime encoder only works in UTF-8, so if someone requests to not do > + // it in UTF-8, fallback to the broken C version. > + if (aCharset.toLowerCase() != "utf-8") { > + //Deprecated.warning("Encoding to non-UTF-8 values is obsolete", > + // "http://bugzilla.mozilla.org/show_bug.cgi?id=790855"); You commented out the deprecation warning here, but not the module load above. How risky would it be to just completely remove non-UTF8 support? I see from the bug comments that it will break our tests, which is a big pain, but does it break any other code? Are there any extensions using it?
Attachment #735747 -
Flags: review?(irving) → feedback+
Assignee | ||
Updated•10 years ago
|
Attachment #735738 -
Attachment is obsolete: true
Assignee | ||
Updated•10 years ago
|
Attachment #735739 -
Attachment is obsolete: true
Assignee | ||
Updated•10 years ago
|
Attachment #735741 -
Attachment is obsolete: true
Attachment #735741 -
Flags: superreview?(neil)
Assignee | ||
Updated•10 years ago
|
Attachment #735743 -
Attachment is obsolete: true
Assignee | ||
Updated•10 years ago
|
Attachment #735745 -
Attachment is obsolete: true
Assignee | ||
Updated•10 years ago
|
Attachment #735747 -
Attachment is obsolete: true
Attachment #735747 -
Flags: superreview?(neil)
Assignee | ||
Comment 17•10 years ago
|
||
[Apologies for bugspam, there's no easy way to mass-obsolete patches without uploading a new one] After some consideration, I've changed the landing strategy for the next steps of JSMime, which makes the existing patches in the queue outside of bug 842632 obsolete.
Assignee | ||
Comment 18•10 years ago
|
||
Turns out nsIMimeConverter has an edge case where it fails to convert strings to ASCII: <foo.bar@ケツァルコアトル.tlalocan> wouldn't be converted by the API. To make it work with the JS implementation, it needs to be an AUTF8String instead.
Attachment #8402242 -
Flags: review?(neil)
Assignee | ||
Comment 19•10 years ago
|
||
Don't you love the smell of 1000 lines of code down the drain?
Attachment #8402243 -
Flags: review?(irving)
Assignee | ||
Comment 20•10 years ago
|
||
This requires the patch in bug 858337 to work properly.
Comment 21•10 years ago
|
||
Comment on attachment 8402243 [details] [diff] [review] Implement the API in JS and remove the C++ version Review of attachment 8402243 [details] [diff] [review]: ----------------------------------------------------------------- It's nice. ::: mailnews/mime/src/mimeJSComponents.js @@ +260,5 @@ > +} > +MimeConverter.prototype = { > + classID: Components.ID("93f8c049-80ed-4dda-9000-94ad8daba44c"), > + QueryInterface: XPCOMUtils.generateQI([Components.interfaces.nsIMimeConverter]), > + White space! @@ +273,4 @@ > > + // Compute the encoding options. The way our API is structured in this > + // method is really horrendous and does not align with the way that JSMime > + // handles it. Instead, we'll need to create a fake header to take into Do we have a follow up bug to restructure the API and get rid of the hack? @@ +308,5 @@ > + emitter.finish(true); > + let value = handler.value; > + value = value.substring(value.indexOf(': ') + 2); > + return value.substring(0, value.length - 2); > + }, Blank line between method definitions
Attachment #8402243 -
Flags: review?(irving) → review+
Comment 22•10 years ago
|
||
Comment on attachment 8402242 [details] [diff] [review] Part 1: Replace the ACString in the API with AUTF8String > return PL_strdup(convertedStr.get()); > else > return PL_strdup(header); > } > >- char *encodedString = nullptr; >+ nsAutoCString encodedString; > nsresult res; > nsCOMPtr<nsIMimeConverter> converter = do_GetService(NS_MIME_CONVERTER_CONTRACTID, &res); > if (NS_SUCCEEDED(res) && nullptr != converter) >- res = converter->EncodeMimePartIIStr_UTF8(nsDependentCString(header), structured, charset, >- fieldnamelen, nsIMimeConverter::MIME_ENCODED_WORD_SIZE, &encodedString); >+ res = converter->EncodeMimePartIIStr_UTF8(nsDependentCString(header), >+ structured, charset, fieldnamelen, >+ nsIMimeConverter::MIME_ENCODED_WORD_SIZE, encodedString); > >- return NS_SUCCEEDED(res) ? encodedString : nullptr; >+ return NS_SUCCEEDED(res) ? PL_strdup(encodedString.get()) : nullptr; Normally I'd say use NS_strdup but that would look inconsistent so I guess that's OK...
Attachment #8402242 -
Flags: review?(neil) → review+
Assignee | ||
Comment 23•10 years ago
|
||
(In reply to :Irving Reid from comment #21) > @@ +273,4 @@ > > > > + // Compute the encoding options. The way our API is structured in this > > + // method is really horrendous and does not align with the way that JSMime > > + // handles it. Instead, we'll need to create a fake header to take into > > Do we have a follow up bug to restructure the API and get rid of the hack? I'm largely planning on making use of this API obsolete via the structured header revolution (bug 970116) and a refactoring of compose code (not yet filed). Once I've made it so that you can compose mail messages that automatically does RFC 2047, I'll deprecate the interface altogether (anyone who desperately needs it from JS can use jsmime, and there shouldn't be a need for it from C++).
Assignee | ||
Updated•10 years ago
|
tracking-thunderbird31:
--- → ?
Assignee | ||
Comment 24•10 years ago
|
||
https://hg.mozilla.org/comm-central/rev/29d10e5f69ab https://hg.mozilla.org/comm-central/rev/8877b30a38c3
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 32.0
Updated•10 years ago
|
tracking-thunderbird31:
? → ---
Assignee | ||
Updated•9 years ago
|
Flags: in-testsuite+
You need to log in
before you can comment on or make changes to this bug.
Description
•