Last Comment Bug 746052 - Make a JS-implemented raw MIME parser
: Make a JS-implemented raw MIME parser
Status: RESOLVED FIXED
:
Product: MailNews Core
Classification: Components
Component: MIME (show other bugs)
: unspecified
: All All
: -- normal (vote)
: Thunderbird 21.0
Assigned To: Joshua Cranmer [:jcranmer]
:
Mentors:
Depends on: 744952
Blocks: 790852 790855
  Show dependency treegraph
 
Reported: 2012-04-16 21:04 PDT by Joshua Cranmer [:jcranmer]
Modified: 2013-01-25 08:59 PST (History)
20 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
Preliminary cut (34.67 KB, patch)
2012-04-16 21:04 PDT, Joshua Cranmer [:jcranmer]
no flags Details | Diff | Splinter Review
Less preliminary cut (37.81 KB, patch)
2012-05-05 20:03 PDT, Joshua Cranmer [:jcranmer]
no flags Details | Diff | Splinter Review
Version 0.8 (46.25 KB, patch)
2012-05-15 09:48 PDT, Joshua Cranmer [:jcranmer]
no flags Details | Diff | Splinter Review
Version 0.9 (49.35 KB, patch)
2012-05-18 08:31 PDT, Joshua Cranmer [:jcranmer]
mozilla: feedback+
Details | Diff | Splinter Review
Part 1: Add the JS parser (48.52 KB, patch)
2012-07-30 07:13 PDT, Joshua Cranmer [:jcranmer]
neil: superreview+
Details | Diff | Splinter Review
Part 2: JS parser tests (1.77 MB, patch)
2012-07-30 07:14 PDT, Joshua Cranmer [:jcranmer]
standard8: review-
Details | Diff | Splinter Review
Part 3: Use the JS parser in the fakeserver code (21.26 KB, patch)
2012-07-30 07:17 PDT, Joshua Cranmer [:jcranmer]
no flags Details | Diff | Splinter Review
Part 1: Add the JS parser (47.15 KB, patch)
2012-09-12 19:51 PDT, Joshua Cranmer [:jcranmer]
irving: review-
Details | Diff | Splinter Review
Part 2: JS parser tests (1.78 MB, patch)
2012-09-12 19:53 PDT, Joshua Cranmer [:jcranmer]
standard8: review+
Details | Diff | Splinter Review
Part 3: Use the JS parser in the fakeserver code (21.23 KB, patch)
2012-09-12 19:54 PDT, Joshua Cranmer [:jcranmer]
no flags Details | Diff | Splinter Review
Part 1: Add the JS parser (47.49 KB, patch)
2012-10-08 16:44 PDT, Joshua Cranmer [:jcranmer]
irving: review+
neil: superreview+
Details | Diff | Splinter Review
Part 3: Use the JS parser in the fakeserver code (24.51 KB, patch)
2013-01-22 07:39 PST, Joshua Cranmer [:jcranmer]
irving: review+
Details | Diff | Splinter Review

Description Joshua Cranmer [:jcranmer] 2012-04-16 21:04:10 PDT
Created attachment 615610 [details] [diff] [review]
Preliminary cut

Okay, I was heavily tempted to summarize this bug as "Infect libmime with the JS virus" [1], but since this isn't adding any users, I couldn't justify it is a title. The part I have right now is nowhere near ready for primetime; parts of the comment blocks are Blatant Lies™, other parts I got too lazy to document (when 1/3 of the file is contained in just two comment blocks, this seems a bit hard to understand), it doesn't support qp or b64 decoding yet, and the C++ glue is very much not-functional (programming JSAPI makes me go into sad land).

Still, it's beneficial to have a copy that isn't on my computer and doesn't need to be pastebinned every time I go into existential crisis over it.


[1] This comes from the following comment in mimei.h:
  The definition of these classes is somewhat idiosyncratic, since I defined
  my own small object system, instead of giving the C++ virus another foothold.
Comment 1 :aceman 2012-04-17 05:05:37 PDT
Maybe you could elaborate more on what is the gain from this patch?
It is some addition? Or is the goal to replace the current parser? Why? Will this be faster or more correct?
Comment 2 Jonathan Protzenko [:protz] 2012-04-17 05:20:29 PDT
https://mail.mozilla.org/pipermail/tb-planning/2012-January/001522.html has a good summary.
Comment 3 :aceman 2012-04-17 06:43:52 PDT
I don't think it answers any of my questions but at least it is some doc for context:)
Comment 4 Joshua Cranmer [:jcranmer] 2012-04-17 07:17:51 PDT
(In reply to :aceman from comment #1)
> Maybe you could elaborate more on what is the gain from this patch?
> It is some addition? Or is the goal to replace the current parser? Why? Will
> this be faster or more correct?

The long-long-long-term goal is to replace the current libmime implementation, since it is a quagmire of an ad-hoc object system, a dangerous variant of an unsafe language (non-null-terminated char*, IIRC), and an engineering black hole that does three separate tasks at the same time. In other words, the current implementation is 100% unmaintainable.

The reason for implementing in JS is part of a long-long-long-long-long-long-term ideal to move more things to JS (would make it easier for, e.g., B2G to leverage), and also that JS enables us to have nicer extension hooks.
Comment 5 Joshua Cranmer [:jcranmer] 2012-04-19 20:11:55 PDT
Looking for "easy-to-replace" components in libmime that could probably use this quickly:
1. nsIMimeHeaders [ Update from IRC conversation: ACString conversion should preserve \x00-\xff <-> \u0000-\u000FF mappings, and probably IDL string as well ]
2. NNTP fakeserver "header" parsing
3. IMAP fakeserver bodystructure/envelope/FETCH helpers. (Although note that we use $ to hunt for message/* subparts right now, which isn't in IMAP, so...)
Comment 6 Joshua Cranmer [:jcranmer] 2012-05-05 20:03:13 PDT
Created attachment 621372 [details] [diff] [review]
Less preliminary cut

QP/base64 added, some tests added, some comments fixed. Works well enough to replace the hacky mime parsing in nntpd.js (haven't tested imapd.js yet). I dropped the JSAPI access stuff into a separate patch since I don't have the energy to push on that.

Still needs work:
1. Stripping line continuations in headers
2. Structured header parsing cleanup (it looks like, in particular, Content-Type and extraneous comments/whitespace are not going to get along).
3. More tests, particularly for body parsing and packetization issues.
4. JS typed arrays, charset conversion
5. Comment cleanup, JSM access cleaned up
Comment 7 Joshua Cranmer [:jcranmer] 2012-05-15 09:48:49 PDT
Created attachment 624081 [details] [diff] [review]
Version 0.8

I have more tests, they're just not in this patch (they end up using the torture test, which is ~1.5M of changes by itself). At this point, I'm missing tests for packetization issues, non-CRLF line endings, and things I haven't implemented (particularly strformat). I have local patches that replace the implementation of nsIMimeHeaders and all message parsing in nntpd.js and imapd.js with this implementation, which provides more indirect tests of stuff (particularly the extractParameters function).

The only major thing I have left to do is to implement strformat; the parsing of other types of header fields and CFWS/2231/2047-decoding isn't necessary for any uses so far.
Comment 8 Joshua Cranmer [:jcranmer] 2012-05-18 08:31:29 PDT
Created attachment 625118 [details] [diff] [review]
Version 0.9

A more up-to-date version of this patch.
Comment 9 David :Bienvenu 2012-05-29 14:33:53 PDT
Comment on attachment 625118 [details] [diff] [review]
Version 0.9

while it's true that data coming from servers will tend to be line-oriented, we don't tend to pass that data straight to the mime parser, (e.g., code that wants to mime parse a message after it has been received, like message display or snippet generation), so I'm not sure about this comment:

+  // delivered. It's likely that our callers our already doing line-buffering,
+  // (e.g., if a protocol uses dot-stuffing), so our basic model is a twist on

I think the mime header parsing stuff is much more straightforward than the mime message parsing (and lives in necko, iirc).

In any case, it's great to see the progress you've made on this.
Comment 10 Joshua Cranmer [:jcranmer] 2012-05-30 19:52:17 PDT
Roughly speaking, my plan of action is the following right now:
1. Flesh out some more parser tests
2. Temporarily remove advanced header parsing and charset-decoding support from my patches. Charsets are a big hairy ball that I don't want to go through right now, so I will leave it be.
3. Post the following as separate patches:
- Core parser code and README
- Direct parser tests
- Replace the imapd.js and nntpd.js header parsers with the new one
- Replace the nsMimeHeaders.cpp implementation with a JS-based one backed by this code.

I want this to land earlyish in the next cycle (so after Monday). The biggest piece I have that's potentially confusing is that the MIME parser strips continuations right now, which may or may not be desirable (probably best to make it an option?).
Comment 11 Mark Banner (:standard8) (afk until 26th July) 2012-05-31 02:11:23 PDT
+This directory consists of two MIME parsers, the older of which is implemented
+in a C++-like format and is being phased out over time.

I'm not quite sure of the transition plan here, but would we be able to roll this out in a pref on/off manner? I'm just thinking that if we find issues when we hit aurora/beta we might want to get users to compare before & after, or just pref off completely and allow time for some more testing.
Comment 12 Joshua Cranmer [:jcranmer] 2012-05-31 04:10:47 PDT
(In reply to Mark Banner (:standard8) from comment #11)
> +This directory consists of two MIME parsers, the older of which is
> implemented
> +in a C++-like format and is being phased out over time.
> 
> I'm not quite sure of the transition plan here, but would we be able to roll
> this out in a pref on/off manner? I'm just thinking that if we find issues
> when we hit aurora/beta we might want to get users to compare before &
> after, or just pref off completely and allow time for some more testing.

Right now, my plans are to not control reimplementation of auxiliary interfaces (nsIMimeConverter, nsIMimeHeaders, nsIMsgHeaderParser) by preferences, while larger components (e.g., gloda's mimemsg.js helpers or replacing the stream converter) would be pref-controlled.
Comment 13 Daniel Veditz [:dveditz] 2012-07-10 22:37:51 PDT
drive-by typo nit:

> +X RFC 4480 -- OpenPGP

openPGP is rfc 4880 (double the obsolete 2440 just for fun)

PGP/MIME might as well be in the list for completeness, rfc 3156. We'd most likely support both or neither.
Comment 14 Joshua Cranmer [:jcranmer] 2012-07-30 07:13:25 PDT
Created attachment 647150 [details] [diff] [review]
Part 1: Add the JS parser

[There's not a lot of good reviewers for this patch, are there?]
Comment 15 Joshua Cranmer [:jcranmer] 2012-07-30 07:14:25 PDT
Created attachment 647153 [details] [diff] [review]
Part 2: JS parser tests

I split the tests out in a separate patch.
Comment 16 Joshua Cranmer [:jcranmer] 2012-07-30 07:17:38 PDT
Created attachment 647154 [details] [diff] [review]
Part 3: Use the JS parser in the fakeserver code

The fakeserver code isn't release-critical code, and there's a lot of bugs in the ad-hoc mime parser right now (see how often "undefined" crops up output :-P).

Note to reviewers: feel free to rerequest review from other people you think might be better.
Comment 17 Kent James (:rkent) 2012-07-30 09:06:36 PDT
I thought that thread.processNextEvent was not considered to be an acceptable workaround for async->sync conversion. IIRC for example I tried to use it as a workaround in our sync message folder search code in an addon, and I gave up because it seemed to cause mysterious crashes. I also vaguely recall seeing a posting in m.d.platform outlining all of the common ways it can fail. As an addon editor I certainly did not allow it as a lazy way to avoid normal async js processing, except in certain rare cases.
Comment 18 Joshua Cranmer [:jcranmer] 2012-07-30 13:16:40 PDT
(In reply to Kent James (:rkent) from comment #17)
> I thought that thread.processNextEvent was not considered to be an
> acceptable workaround for async->sync conversion. IIRC for example I tried
> to use it as a workaround in our sync message folder search code in an
> addon, and I gave up because it seemed to cause mysterious crashes. I also
> vaguely recall seeing a posting in m.d.platform outlining all of the common
> ways it can fail. As an addon editor I certainly did not allow it as a lazy
> way to avoid normal async js processing, except in certain rare cases.

AFAIK, there is no other way to make an underlying asynchronous framework act as if it were synchronous. The thread I found about the processNextEvent issues is <https://groups.google.com/forum/?fromgroups#!searchin/mozilla.dev.platform/processNextEvent/mozilla.dev.platform/CzGDcfwjEMU/HuB-SXAYDOoJ>, which is more specifically discussing the issues with faking an asynchronous API from a synchronous one. Looking at how, for example, Necko implements the synchronous nsIInputStream API from an nsIStreamListener, it works via ProcessNextEvent as well.

I do explain in the comment that non-string inputs are parsed by spinning the event loop, so I would hope that people would avoid using it unless it is really necessary.
Comment 19 Andrew Sutherland [:asuth] 2012-07-30 13:37:20 PDT
(In reply to Joshua Cranmer [:jcranmer] from comment #18)
> AFAIK, there is no other way to make an underlying asynchronous framework
> act as if it were synchronous.

Right.  But that doesn't mean that it's okay.

> I do explain in the comment that non-string inputs are parsed by spinning
> the event loop, so I would hope that people would avoid using it unless it
> is really necessary.

People do not reliably read comments, and your comment does not convey the depth of the bad-ideaness of spinning the event loop.  Even assuming someone makes a reasonable call to use the event-loop spinning, the next person who simply copies and pastes the code or calls that method may not realize the implications.  For example, you provide a helper method, extractHeaders, which makes no mention of the event loop spinning.  This is how we ended up with Firefox Sync spinning a nested event loop.  It does not appear to have blown up the world yet, but there's no need for it.

If there is unit testing code that is greatly simplified by spinning the nested event loop, then have that code live in the unit tests or the unit test frameworks, but I can't think of any other reasonable case to do so.

Along those lines, why not just make more of the parsing methods fundamentally async?
Comment 20 Kent James (:rkent) 2012-07-30 13:46:37 PDT
(In reply to Andrew Sutherland (:asuth) from comment #19)

> Along those lines, why not just make more of the parsing methods
> fundamentally async?

There are places in existing code where it would be really great to have a sync mime processor.

One example is in traditional search, like in filters. Filter searches have no support for async search, so something as simple as "search on whether this email has an attachment or not" cannot be done currently, as the MIME implementation required to get that information is only available async.

So I was looking forward to a sync MIME until I saw the event loop spinning.
Comment 21 Andrew Sutherland [:asuth] 2012-07-30 14:50:01 PDT
(In reply to Kent James (:rkent) from comment #20)
> (In reply to Andrew Sutherland (:asuth) from comment #19)
> 
> > Along those lines, why not just make more of the parsing methods
> > fundamentally async?
> 
> There are places in existing code where it would be really great to have a
> sync mime processor.
> 
> One example is in traditional search, like in filters. Filter searches have
> no support for async search, so something as simple as "search on whether
> this email has an attachment or not" cannot be done currently, as the MIME
> implementation required to get that information is only available async.

Yes, that would be super-useful.  But the problem there is how the filtering system's control flow operates, not with the MIME parser.  Fix filters to support filters providing their notifications asynchronously and everything becomes okay.
Comment 22 Florian Bender 2012-07-30 15:11:51 PDT
(In reply to Andrew Sutherland (:asuth) from comment #21)
> […]  Fix filters to support filters providing their notifications asynchronously 
> and everything becomes okay.

+1

E. g., see how node-lazy does it, which deals with the mentioned filter problem on async data: https://github.com/pkrumins/node-lazy
Comment 23 neil@parkwaycc.co.uk 2012-07-31 07:43:50 PDT
Comment on attachment 647150 [details] [diff] [review]
Part 1: Add the JS parser

>+// Load the core MIME parser. Since it doesn't define EXPORTED_SYMBOLS, we must
>+// use the subscript loader instead.
>+Services.scriptloader.loadSubScript("resource:///modules/mimeParserCore.js");
[Any reason why not?]

>+      var parser = new Parser(emitter, opts);
>+      parser.deliverData(input);
>+      parser.deliverEOF();
[Slightly inefficient if input ends in a CR.]

>+    var emitter = Object.create(ExtractHeadersEmitter);
[What's the difference between Object.create and new?]

>+/// This function returns [string that ends in CRLF, rest of string]
>+function conditionToEndOnCRLF(buffer) {
This seems overcomplex. Here's what I think it should say:
function conditionToEndOnCRLF(buffer) {
  // Ignore the last character if it is a '\r' -
  // there may be an '\n' to follow in the next data block.
  let lastCR = buffer.lastIndexOf('\r', buffer.length - 1);
  let lastLF = buffer.lastIndexOf('\n');
  let end = lastLF > lastCR ? lastLF : lastCR;
  return [buffer.substring(0, end + 1), buffer.substring(end + 1)];
}

>+Parser.prototype.deliverEOF = function Parser_deliverEOF() {
>+  // Start of input buffered too long? Call start message now.
>+  if (!this._triggeredCall) {
>+    this._triggeredCall = true;
>+    this._callEmitter("startMessage");
>+  }
[I'd put this in _dispatchData, but then what happens if there is no data?]

>+  if (this._options["pruneat"]) {
>+    let match = this._options["pruneat"];
[Why not this._options.pruneat?]

>+    let result = /(?:^(?:\r\n|[\r\n]))|(\r\n|[\r\n])\1/.exec(this._headerData);
The \1 seems wrong, since it means "match exactly what the ()s captured".
Can you not use /(?:^|\r\n|[\r\n])(\r\n|[\r\n])/?

>+  let defaultContentType = this._defaultContentType ? this._defaultContentType :
>+    'text/plain';
this._defaultContentType || "text/plain";

>+    buffer = sanitize.substring(sanitize.length - excess);
Avoid substring; use substr if your second argument is a length and slice if it is a point (including negative points). So in this case, you could use sanitize.slice(-excess) if you know that excess > 0.
Comment 24 Joshua Cranmer [:jcranmer] 2012-07-31 08:01:21 PDT
(In reply to neil@parkwaycc.co.uk from comment #23)
> >+    var emitter = Object.create(ExtractHeadersEmitter);
> [What's the difference between Object.create and new?]

Object.create makes a new object whose prototype is passed into the object, whereas it looks like |new Object(x) === x| holds true.

> >+    let result = /(?:^(?:\r\n|[\r\n]))|(\r\n|[\r\n])\1/.exec(this._headerData);
> The \1 seems wrong, since it means "match exactly what the ()s captured".
> Can you not use /(?:^|\r\n|[\r\n])(\r\n|[\r\n])/?

That regex matches just a plain \r\n byitself (make the first group match \r and the second group match \n). I originally had the regex closer to the one you recommend, until I added tests and found them failing for that exact reason. I'll add a note as to why the complexity here is necessary.

[I'll look through the other comments later].
Comment 25 Joshua Cranmer [:jcranmer] 2012-08-01 07:09:15 PDT
(In reply to neil@parkwaycc.co.uk from comment #23)
> >+// Load the core MIME parser. Since it doesn't define EXPORTED_SYMBOLS, we must
> >+// use the subscript loader instead.
> >+Services.scriptloader.loadSubScript("resource:///modules/mimeParserCore.js");
> [Any reason why not?]

mimeParserCore.js is intended to be code that relies on no codebase, so using it requires some shims which are ultimately provided by mimeParser.jsm. Omitting EXPORTED_SYMBOLS means that Components.utils.import doesn't work properly, so it provides a weak layer of footgun-protection. It's not something I have a astrong opinion one way or the other.

> >+      var parser = new Parser(emitter, opts);
> >+      parser.deliverData(input);
> >+      parser.deliverEOF();
> [Slightly inefficient if input ends in a CR.]

Hopefully, \r-delimited files are very rare-to-nonexistent. Unfortunately, there's not much that can be done about the fact that we need to look to the next character to distinguish between CR-delimited files and CRLF-delimited.

> >+Parser.prototype.deliverEOF = function Parser_deliverEOF() {
> >+  // Start of input buffered too long? Call start message now.
> >+  if (!this._triggeredCall) {
> >+    this._triggeredCall = true;
> >+    this._callEmitter("startMessage");
> >+  }
> [I'd put this in _dispatchData, but then what happens if there is no data?]

_dispatchData is also used internally for triggering recursive part handling via subparsers, which makes putting the "first data processed" callback there dangerous. Far better, IMO, to just duplicate the startMessage call the first time we send data from the external hooks to the internal ones.
Comment 26 Mark Banner (:standard8) (afk until 26th July) 2012-08-20 04:48:04 PDT
Comment on attachment 647153 [details] [diff] [review]
Part 2: JS parser tests

Review of attachment 647153 [details] [diff] [review]:
-----------------------------------------------------------------

I think you should include a readme file or something with mime-torture to explain what is in the file and where it comes from originally.

::: mailnews/mime/test/unit/test_parser.js
@@ +1,1 @@
> +Components.utils.import("resource:///modules/mimeParser.jsm");

This file should get a license header and description.

@@ +33,5 @@
> +    contents = contents.slice(start - 1, end - 1);
> +  }
> +  return contents.join('\r\n');
> +}
> +var file_cache = {};

nit: I think you should define file_cache before the function or at the start of the file.

@@ +52,5 @@
> +  }
> +  return [test, msgcontents, opts, results];
> +}
> +
> +let mpart_complex1 = [['1', 8, 10], ['2', 14, 16], ['3.1', 22, 24],

Magic numbers! We need an explanation of this array and where these numbers come from please.

@@ +58,5 @@
> +
> +// Format of tests:
> +// entry[0] = name
> +// entry[1] = message (a string or an array of packets)
> +// entry[2] = options to pass in

What sort of options? This appears to be an object by its own right, so what can this contain?

@@ +61,5 @@
> +// entry[1] = message (a string or an array of packets)
> +// entry[2] = options to pass in
> +// entry[3] = A checker result:
> +//            either a {partnum: header object} (to check headers)
> +//            or a [[partnum body], [partnum body], ...] (to check bodies)

These numbers definitely need a bit more explanation as to where they come from/what they are. Possibly just shows I'm not up to date on mime, but there again, we need people not so familiar to be able to read these as well.
Comment 27 Joshua Cranmer [:jcranmer] 2012-08-20 12:41:17 PDT
(In reply to neil@parkwaycc.co.uk from comment #23)
> Comment on attachment 647150 [details] [diff] [review]
> Part 1: Add the JS parser
> 
> >+/// This function returns [string that ends in CRLF, rest of string]
> >+function conditionToEndOnCRLF(buffer) {
> This seems overcomplex. Here's what I think it should say:

The documentation on lastIndexOf is bad, you want lastIndexOf('\r', buffer.length - 2) instead of lastIndexOf('\r', buffer.length - 1).

Anyways, part of the reason for the complexity is how the function evolved--it was originally just holding '\r' so that I could see if the next char was going to be '\n', and then I think I had to introduce the CRLF-ending condition later. The other part comes from a desire to do as little as possible in one of the common cases (that all packets end with '\r\n' or '\n').

Now that I think about, in the common case of '\r\n', both searches will stop on the first character they try. We penalize '\n'-only files, but we can make a separate fast-path if '\n' is the last character.

> >+  if (this._options["pruneat"]) {
> >+    let match = this._options["pruneat"];
> [Why not this._options.pruneat?]

As far as the JS parser is concerned, the two forms are exactly equivalent; since I view options as a dictionary, I slightly prefer the _options["pruneat"] form, but I'm ambivalent in this regard.
Comment 28 :Irving Reid (No longer working on Firefox) 2012-09-07 05:06:08 PDT
Comment on attachment 647150 [details] [diff] [review]
Part 1: Add the JS parser

Review of attachment 647150 [details] [diff] [review]:
-----------------------------------------------------------------

Really good over all, I like the way you structured this. Once we get it into production it would be interesting to do some profiling and see if there are any performance hotspots, but I can't see any obvious wasted string copies that could be eliminated without making the code significantly more complex.

I'm going to leave it r? until we finish talking about the open questions.

::: mailnews/mime/jsmime/mimeParser.jsm
@@ +13,5 @@
> +var EXPORTED_SYMBOLS = ["MimeParser"];
> +
> +// Emitter helpers, for internal functions later on.
> +var ExtractHeadersEmitter = {
> +  startPart: function (partNum, headers) {

Give all your functions names (e.g. startPart: function EHE_startPart(partNum, headers)...) so that they don't show up as anonymous in stack dumps (most of your declarations already have them)

@@ +49,5 @@
> +   * @param opts    A set of options for the parser.
> +   */
> +  parseAsync: function MimeParser_parseAsync(input, emitter, opts) {
> +    // Normalize the input into an input stream.
> +    if (typeof input == "string") {

My inclination would be to provide two APIs: one that does a sync parse of a string, and one that does an async parse of an InputStream. Callers that want to parse other inputs synchronously are welcome to either slurp their entire input into a string or code their own read loop and push the data directly into parser.deliverData().

::: mailnews/mime/jsmime/mimeParserCore.js
@@ +654,5 @@
> +// Content transfer decoders
> +var ContentDecoders = {};
> +ContentDecoders['quoted-printable'] = function decode_qp(buffer, more) {
> +  // Quoted-printable works on a per-line basis, so we do not need to worry
> +  // about the possibility that we have =<packet boundary>A1

Officially, yes, but we have bugs filed with examples of q-p data that have a bare = at the end of the part. In that case I'd be inclined to discard the final '=' (see bug 697448)
Comment 29 Andrew Sutherland [:asuth] 2012-09-07 10:39:27 PDT
(In reply to Irving Reid (:irving) from comment #28)
> Give all your functions names (e.g. startPart: function
> EHE_startPart(partNum, headers)...) so that they don't show up as anonymous
> in stack dumps (most of your declarations already have them)

Trunk automatically names functions now, although their names may look stupid. See bug 433529.
Comment 30 Joshua Cranmer [:jcranmer] 2012-09-09 01:40:09 PDT
(In reply to Irving Reid (:irving) from comment #28)
> My inclination would be to provide two APIs: one that does a sync parse of a
> string, and one that does an async parse of an InputStream. Callers that
> want to parse other inputs synchronously are welcome to either slurp their
> entire input into a string or code their own read loop and push the data
> directly into parser.deliverData().

My original intent was to make all of the APIs to make the input accepted common to all ways of doing things. Given the distinction between synchronous and asynchronous parsing, this may not be the optimal situation. My original intention was to support fairly maximal/magical interfaces, although I'm willing to drop this as desired.
 
> ::: mailnews/mime/jsmime/mimeParserCore.js
> @@ +654,5 @@
> > +// Content transfer decoders
> > +var ContentDecoders = {};
> > +ContentDecoders['quoted-printable'] = function decode_qp(buffer, more) {
> > +  // Quoted-printable works on a per-line basis, so we do not need to worry
> > +  // about the possibility that we have =<packet boundary>A1
> 
> Officially, yes, but we have bugs filed with examples of q-p data that have
> a bare = at the end of the part. In that case I'd be inclined to discard the
> final '=' (see bug 697448)

I don't know what I was thinking when I wrote that comment, since it doesn't justify the conclusion properly. The main point, I think, is that, unlike base64 decoding, we can decode QP on a line-by-line basis without needing to hold to see what follows. Looking at the bug you've cited, my parser should be ignoring the last `=', since it treats <EOF> and <CRLF> as equivalent.
Comment 31 Joshua Cranmer [:jcranmer] 2012-09-12 19:51:42 PDT
Created attachment 660670 [details] [diff] [review]
Part 1: Add the JS parser

This patch has a few differences from the old one, but the most notable are the following:

1. The headers object has been replaced with a JS Map instead of a JS object. Judging from poor recollection of tests, this may be slightly faster (test_parser.js feels faster than before). It's also safer.

2. The options dictionary now has an onerror method that allows users of the parser to customize what happens if an emitter throws an error. By default, the error is ignored; but the test uses a custom handler that rethrows the error and causes the parser to fail. (Useful for debugging!)

3. Functionality for parseSync and parseAsync have been stripped down. Now parseSync only takes a string as input and parseAsync only takes an input stream as input. Documentation has been tweaked to explain some of the guarantees better.

My original intent was to provide a uniform definition of input that would grow to accept fairly large diverse sets of objects (e.g., headers, files, URIs, etc.), but for the sake of simplicity, I'll strip it down to the smallest set that matters: input streams/stream listeners for async and strings for sync. People who really want sync parsing badly can grab the parser and drive it themselves, and if it's really desired to grab this low-level construct with more objects, they can be added in later.

*mutters about having the conflicts that latter patches in his queue has.*

asuth (or protz): if you have any feedback about the API, please feel free to comment.
Comment 32 Joshua Cranmer [:jcranmer] 2012-09-12 19:53:21 PDT
Created attachment 660672 [details] [diff] [review]
Part 2: JS parser tests

I think I fixed all the review comments here. I thought about adding another large torture test, but that one focused more on things like cid: and friends which isn't as useful for the progress I have so far.
Comment 33 Joshua Cranmer [:jcranmer] 2012-09-12 19:54:29 PDT
Created attachment 660673 [details] [diff] [review]
Part 3: Use the JS parser in the fakeserver code

This is really just the old part 3 migrated to the new API.
Comment 34 Joshua Cranmer [:jcranmer] 2012-09-12 20:11:36 PDT
Ah yes, one design point I was wary of (putting up bug 790852 reminded me of this) was that my header values strips the newlines in continuations right off the bat. This ultimately leads to a change of semantics in nsIMimeHeaders of bug 790852 when part 2 is applied, as the current interface does not strip the newlines (see the test changes).

I made this decision on the basis that removing newlines from header values makes some naive users slightly safer and that the unfolded form is semantically the more meaningful. Quoting from RFC 5322:
   The process of moving from this folded multiple-line representation
   of a header field to its single line representation is called
   "unfolding".  Unfolding is accomplished by simply removing any CRLF
   that is immediately followed by WSP.  Each header field should be
   treated in its unfolded form for further syntactic and semantic
   evaluation.  An unfolded header field has no length restriction and
   therefore may be indeterminately long.
Comment 35 :Irving Reid (No longer working on Firefox) 2012-09-18 07:48:42 PDT
Comment on attachment 660670 [details] [diff] [review]
Part 1: Add the JS parser

Review of attachment 660670 [details] [diff] [review]:
-----------------------------------------------------------------

Great overall, the usual couple of minor things.

::: mailnews/mime/jsmime/mimeParserCore.js
@@ +115,5 @@
> +    }
> +
> +  // Set the default error function in the options.
> +  if (!("onerror" in options) || typeof options.onerror != "function")
> +    this._options["onerror"] = function swallow(error) {};

I'd prefer to see the default onerror() say something so we have a hope of diagnosing problems. A message in the error console feels like the best option to me, but I'm open to suggestions.

And while I'm at it, we need to warn (again, error console unless you have a better idea) if a caller passes a non-function for the onerror option. Nothing more frustrating than spending hours investigating why your code didn't work, only to find your mistake was hidden by a silent "there I fixed it" far away down the call stack...

I'm OK with excluding strings-that-eval-to-functions, because I want to train JS programmers not to do that ;->

log4moz strikes me as too heavyweight for this, unless you decide to add log4moz for other purposes as well.

@@ +185,5 @@
> +
> +  // Signal the beginning, if we haven't done so.
> +  if (!this._triggeredCall) {
> +    this._callEmitter("startMessage");
> +    this._triggeredCall = true;

Do we really need to delay the call to startMessage() until the first data arrives? It feels like we're making the code more complicated to optimize a rare case (setting up a parser and then receiving no data).
Comment 36 Joshua Cranmer [:jcranmer] 2012-09-22 13:46:37 PDT
(In reply to Irving Reid (:irving) from comment #35)
> I'd prefer to see the default onerror() say something so we have a hope of
> diagnosing problems. A message in the error console feels like the best
> option to me, but I'm open to suggestions.

I really want mimeParserCore.js to be completely agnostic about the exact JS environment it is operating in. Would it be acceptable to you to have mimeParser.jsm override the options object to default onerror to Components.utils.reportError?

> And while I'm at it, we need to warn (again, error console unless you have a
> better idea) if a caller passes a non-function for the onerror option.
> Nothing more frustrating than spending hours investigating why your code
> didn't work, only to find your mistake was hidden by a silent "there I fixed
> it" far away down the call stack...

I can make the parser constructor throw if onerror isn't a function, which should cause the error to bubble up somewhere visible.

> @@ +185,5 @@
> > +
> > +  // Signal the beginning, if we haven't done so.
> > +  if (!this._triggeredCall) {
> > +    this._callEmitter("startMessage");
> > +    this._triggeredCall = true;
> 
> Do we really need to delay the call to startMessage() until the first data
> arrives? It feels like we're making the code more complicated to optimize a
> rare case (setting up a parser and then receiving no data).

I'm following the model here of nsIStreamListener: when nsIChannel.asyncOpen is called, the API requires that none of the stream listener callbacks will be called before asyncOpen completes. Similarly, none of the functions in the emitter will be called before the constructor (or resetParser) completes. The only methods we have available to shove data into the caller is via dispatchData/dispatchEOF.

It's possible to have another public member signalStart that calls back startMessage, but that strikes me as more error-prone, especially since not all streaming APIs have nsIStreamListener's convenient onStart/onData/onEnd callbacks.
Comment 37 :Irving Reid (No longer working on Firefox) 2012-09-24 05:44:24 PDT
(In reply to Joshua Cranmer [:jcranmer] from comment #36)
> I really want mimeParserCore.js to be completely agnostic about the exact JS
> environment it is operating in. Would it be acceptable to you to have
> mimeParser.jsm override the options object to default onerror to
> Components.utils.reportError?

OK.

> I can make the parser constructor throw if onerror isn't a function, which
> should cause the error to bubble up somewhere visible.

OK.

> I'm following the model here of nsIStreamListener: when nsIChannel.asyncOpen
> is called, the API requires that none of the stream listener callbacks will
> be called before asyncOpen completes. Similarly, none of the functions in
> the emitter will be called before the constructor (or resetParser)
> completes. The only methods we have available to shove data into the caller
> is via dispatchData/dispatchEOF.
> 
> It's possible to have another public member signalStart that calls back
> startMessage, but that strikes me as more error-prone, especially since not
> all streaming APIs have nsIStreamListener's convenient onStart/onData/onEnd
> callbacks.

OK, leave it the way it is, since there's a good reason for it to be that way.
Comment 38 Joshua Cranmer [:jcranmer] 2012-10-08 16:44:14 PDT
Created attachment 669330 [details] [diff] [review]
Part 1: Add the JS parser

Updated version of part 1, hopefully addressing irving's comments.
Comment 39 :Irving Reid (No longer working on Firefox) 2012-10-10 12:51:05 PDT
Comment on attachment 669330 [details] [diff] [review]
Part 1: Add the JS parser

Review of attachment 669330 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/mime/jsmime/mimeParser.jsm
@@ +32,5 @@
> +
> +const Ci = Components.interfaces;
> +const Cc = Components.classes;
> +
> +/// Sets appropriate default options for chrome-priveleged environments

s/privelege/privilege/  (I worked for lawyers for a while - they don't let you misspell that word)
Comment 40 Mark Banner (:standard8) (afk until 26th July) 2012-10-29 15:30:37 PDT
Comment on attachment 660672 [details] [diff] [review]
Part 2: JS parser tests

I've not re-run these, but the interdiff looks good.
Comment 41 neil@parkwaycc.co.uk 2012-12-19 12:53:20 PST
Comment on attachment 669330 [details] [diff] [review]
Part 1: Add the JS parser

>+var ExtractHeadersEmitter = {
>+  startPart: function (partNum, headers) {
>+    if (partNum == '') {
>+      this.headers = headers;
You no longer clone the headers object. Was this intentional?

>+  // Create the helper property so that it is not enumerable.
>+  Object.defineProperty(headers, "rawHeaderText", {value: this._headerData});
Is it still necessary to do this?
Comment 42 Joshua Cranmer [:jcranmer] 2013-01-08 12:32:37 PST
(In reply to neil@parkwaycc.co.uk from comment #41)
> Comment on attachment 669330 [details] [diff] [review]
> Part 1: Add the JS parser
> 
> >+var ExtractHeadersEmitter = {
> >+  startPart: function (partNum, headers) {
> >+    if (partNum == '') {
> >+      this.headers = headers;
> You no longer clone the headers object. Was this intentional?

I realized that the cloning was brittle and would be prone to failure if headers gained non-enumerable properties that might be useful.

> >+  // Create the helper property so that it is not enumerable.
> >+  Object.defineProperty(headers, "rawHeaderText", {value: this._headerData});
> Is it still necessary to do this?

I was concerned that enumerable properties would still funge the interation of Map objects, but I tested in Firefox's Scratchpad and it appears that this is no longer necessary, so I'll change that.
Comment 43 neil@parkwaycc.co.uk 2013-01-10 12:19:25 PST
Comment on attachment 669330 [details] [diff] [review]
Part 1: Add the JS parser

OK, so sr=me with that fixed.
Comment 44 Joshua Cranmer [:jcranmer] 2013-01-12 12:19:57 PST
Parts 1 and 2 have been checked in:
<https://hg.mozilla.org/comm-central/pushloghtml?changeset=d5ac1b0f26d8>.
Comment 45 Joshua Cranmer [:jcranmer] 2013-01-22 07:39:19 PST
Created attachment 704886 [details] [diff] [review]
Part 3: Use the JS parser in the fakeserver code

Updated patch.
Comment 46 :Irving Reid (No longer working on Firefox) 2013-01-22 11:37:20 PST
Comment on attachment 704886 [details] [diff] [review]
Part 3: Use the JS parser in the fakeserver code

Review of attachment 704886 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine, passes tests.

::: mailnews/test/fakeserver/imapd.js
@@ +42,5 @@
>  // + Messages: A message is represented internally as an annotated URI.       //
>  ////////////////////////////////////////////////////////////////////////////////
> +
> +if (!("MimeParser" in this))
> +  Components.utils.import("resource:///modules/mimeParser.jsm");

Do you want ...import("resource...", this); here to get the exported symbols into this?
Comment 47 Joshua Cranmer [:jcranmer] 2013-01-25 08:59:59 PST
Part 3 pushed:
<https://hg.mozilla.org/comm-central/rev/0d02d0b5c30c>.

Note You need to log in before you can comment on or make changes to this bug.