Parsing ~40MB of minified source code causes us to allocate something like 1.5GB of ParseNodes. We can fix it by writing a custom parser for asm.js. Luke thinks we can parse, typecheck, and emit MIR in a single forward pass.
I'm hacking on something similar at http://turtlescript.github.cscott.net/docco/asm-llvm.html. It would help if the asm.js spec were updated to reflect the new call coercions in bug 864600 (hint, hint).
In a single-forward-pass typechecker, you need to infer the types of function tables and local functions which you haven't seen yet. I assume that +f(...) implies that f returns doublish, f(...)|0 implies f returns intish, and f(...); as an expression statement or in a comma operator, etc implies void. The "implies void" part is a bit tricky -- even if I plan to throw the value of f away, I have to write "+f(....);" or else f will be inferred to be void. (This is at odds with bits of the text in bug 864600 which seem to indicate that function type inference is at work only if the value is "used".) It's probably easiest to implement by just having the unary + propagate doublish forward, and ExpressionStatement/comma operator propagate void forward, and assuming intish if neither doublish nor void is being sent forward. That avoids having to lookahead past the call site to parse the | 0 before compiling/type-checking the function.
In the spec, the change (that hopefully we'll be making soon) would be that a call to a function that is declared to return double (i.e., by ending with a double-returning statement) would be typed as doublish and analogously int to intish, void to void. Thus, if there is no coercion, the callee function can return any type. In a single-pass algorithm, you'd implement this accumulating, for each function, how it had been coerced. Then, when type checking a function, if there were either (1) no calls to the function or (2) only calls that ignored the returned value, no coercion would be recorded and the function return type could be whatever. If there is a coercion, it has to be the same as all previous coercions (you can't have +f() and f()|0 in the same module) and, when type-checking f, the return value has to match the coercion.
Another wrinkle: currently SpiderMonkey parses the left-hand operand of * / % with Use::ToNumber and parses the left-hand side of >>> | & ^ << >> with Use::ToInt32. This is hard to implement in a single-forward-pass typechecker, since we don't know what the operator will be until after we've parsed the left hand operand. Consider the following cases: +foo() +(foo()) +(foo() * bar) +(foo() >>> bar) +(foo() | 0) +((((foo())))|0) In the first, there's an unambigious Use::ToNumber. In the second, we don't know until we see the closing paren that it's not one of the other two cases. The third case has a Use::ToNumber in the current code. The fourth and fifth cases have Use::ToInt32 in the current code. The sixth case illustrates that arbitrary lookahead may be required during parsing. I think it would be safest to restrict Use::ToNumber to simple unary application. This still makes it difficult to distinguish Use::ToInt32 and Use::Void when parsing. For example: foo(); foo() | 0; Until I've seen the trailing semicolon, I can't distinguish between 'returns void' context and 'returns int32'. One solution is to say that "foo();" implies that foo returns an intish value, and define 'void' as a subtype of intish. After all, 'undefined' is intish. The new simple syntax rule is that calls to functions returning double must always appear as "+foo(...)". If a call to a forward-defined local function or function table reference is encountered anywhere other than as the direct child of a unary plus, it will be assumed to return intish. If the later definition reveals that the function returns void, the declared type will still be consistent with the inferred type (since void is a subtype of intish). Internally, functions which return void will probably be compiled as if they return 0. This is what I'm going to implement for http://turtlescript.github.cscott.net/docco/asm-llvm.html -- suggestions on good tests cases would be welcome (or other ways around this issue I might implement).
Ok, https://github.com/cscott/TurtleScript/blob/2e5f4c8da54961707025649d4c8c6840ae32deb0/asm-llvm.js now does the parsing and type checking in a single pass using the strategy described in comment 4. Now to find a better set of test cases to find out what I (must have) overlooked.
That is a nice analysis of the challenge in comment 4 and an interesting proposed solution. I'm a little uncomfortable with the changes to the asm.js type system though. What I was thinking is that, keeping the same type rules, a possible implementation would be: when type checking a call, conservatively assume it returns void. At all places where we pass Use::ToInt32/ToNumber, instead call a function CoerceInputs(lhsDef, &lhsType, Use::ToInt32) on the result of the CheckExpr. If lhsDef->isAsmJSCall(), CoerceInputs would: 1. assert lhsType.isVoid()) for sanity 2. record the use and report an error if it conflicts with another use 3. *lhsType == Type::Int32/Double 4. lhsDef->setResultType(MIRType_Int32/MIRType_Double). Without changing the visible behavior, I think it'd be possible to make this change now (removing 'Use' altogether).
* "insist that calls (including calls via function table)" -> "insist that local calls (including calls via function table)" We can already distinguish between local calls and FFI/stdlib calls, so no need to coerce arguments or return values there. (This is obvious, I'm just trying to be super clear.) And note that the "save the void" and "use site coercion" issues can be decoupled to some degree. If you remove void, then the remaining coercions are just int/double, and codegen only has to be deferred until you are certain about the use context instead of until you are certain the return value is never used anywhere else in the program. If you remove the implicit coercions (allowing grammar-based distinction of calls to void, intish, and doublish functions), then you again don't have to defer codegen indefinitely, and there are fewer spots to insert your CoerceInputs() call which makes the propagation/fixup of "this is maybe void" simpler. But obviously I feel the most bang for the buck is found in combining the two proposals, which lets you determine the function type and codegen the function call immediately when you encounter it.
One more bugfix on my comment (!): "Unary "~~" is defined to operate only on double, but implies int at a use site (!) which guarantees a type check failure (probably just a bug?)." Sorry, this is a spec issue (and a bug in asm-llvm). http://asmjs.org/spec/latest/#unaryexpression says "A UnaryExpression node of the form '~~arg:UnaryExpression' validates as type signed if arg validates as a subtype of double." but it (presumably, and rather unclearly) *also* allows ~~arg:UnaryExpression to type check as ~(~arg:UnaryExpression), using the "(intish) → signed" signature given for ~ in the table at the end. The AsmJS.cpp code implies ToInt32, which won't "guarantee a type check failure", but still seems to be at odds with the intended "CoerceToInt" meaning of the ~~ operator. But my point still stands: for operators which take more than one type, the use coercions pick a single type (or no coercion at all) somewhat arbitrarily.
Per azakai's recommendation, I tested against the asm.js benchmarks from https://github.com/dvander/arewefastyet/tree/master/benchmarks/asmjs-apps. This uncovered two buglets in the asm.js spec (https://github.com/dherman/asm.js/issues/63 and https://github.com/dherman/asm.js/issues/64), but otherwise box2d and zlib parse fine. Bullet has some issues with a function table returning double which is used first in a void context. Ie, first: b$[c[(c[k >> 2] | 0) + 12 >> 2] & 2047](k, w, 1); but later: j = +b$[c[(c[h >> 2] | 0) + 12 >> 2] & 2047](h, b, d); So I'd propose that that first call site be rewritten to prefix the unary plus operator, in either the "no void" or "no crazy coercions" subpart of my proposal. In SpiderMonkey's code generation framework, calling a "function returning a double" and calling a "function returning void" are the same, but they're not necessarily the same in every calling convention, and they are not the same in LLVM. On these platforms, we can't generate code for the first statement until we've seen the second.
luke and I hashed this out on IRC. I think the proposal is to reintroduce void, but to require local calls and FFI calls to appear in one of the following contexts "+f(...)" (=> return type of function is double), "f(...)|0" (=> return type of function is signed), or every other "f(....)" (=> return type of function is void). This restriction allows us to derive a precise type for a local call at its first use site, and also prevents doing a ToInt32() coercion on an FFI value which isn't used (since the coercion is potentially observable via Object.valueOf()). The "f(...)|0" can be done with a little bit of parser hackery --- since we typically have at least one token of lookahead, we can lookahead at the token following the ')' and conclude that the return type should be signed if we see the pipe character.
Alright, here is a tentative summary from irc discussions: - tweak the type system to recognize 3 syntactic forms: +f(), f()|0, f() - these 3 forms return types double, signed, and void, resp., and require the callee return exactly this type (even void requires the callee return void) - the syntactic forms (as always) ignore parentheses A few rationales: - For having |0 instead of defaulting to intish: ffi calls need to know they are coerced so they can coerce on the callee side - For void requiring the callee return void (instead of returning whatever): if one were to compile asm.js to a stack-based machine, you'd want to know whether there was anything to pop (e.g., the x87 FPU stack) - For ignoring parens: if you validate asm.js with an existing parser (as we do now), parens are automatically thrown away, so it'd be bad for an asm.js parser to be stricter; would lead to subtle incompatibility. I think this plan makes sense and avoids the scattered CoerceInputs/Use::* currently required. Still want to get a bit more feedback (and get existing asm.js codes converted to validate with these new rules) before converting Odin, though.
Oops, mid-air collision.
Looking at the current state of the spec, it isn't clear about something that I think is important: that parentheses should be meaningless for expressions. This should be added to the rules in Section 4, and it should be respected here as well. While this affects one-pass algorithms, I don't believe it should be too complex or expensive to track paren nesting depth and do the lookahead past any closing-parens. And I believe it's important that asm.js should not be sensitive to parenthesization. In particular, any implementation of asm.js that builds off an existing real-world JS parser will not be sensitive to parenthesization, and it would make ahead-of-time compilation brittle to what should be a meaningless transformation. So I agree with Luke's plan in Comment 12. Dave
If we're changing things, using 0|x instead of x|0 would make the whole "arbitrary lookahead required" thing moot.
Having spent some time trying to implement the "ignore parens" rule in a single-pass parser, let me just note that it's a pain in the neck. I will respectfully suggest that the spec be written to be strict, even if spidermonkey is currently "liberal in what it accepts". Since the whole point of asm.js is speed, it seems somewhat counterproductive to require a bunch of parenthesis-processing and parser lookahead/fixup to handle cases which no self-respecting asm.js-emitter would produce. It also introduces the likelihood of subtle differences between implementations, since these weird corner cases (say, ~~x vs ~(~x), and -(-(4294967296)) ) are almost never encountered in practice. In fact, properly writing test cases just for all the possible combinations of parenthesization is a daunting task. My humble suggestion is that the spec is written to *permit* an asm.js implementation to accept x|0 and 0|x as coercions and arbitrary parenthesization, but not to *require* this. Then the spidermonkey implementation can be gradually transitioned (first to accept 0|x, then later to emit a warning, then eventually with the landing of the single-pass compiler, to no longer accept x|0 and nontrivial parenthesization). I'm willing to contribute a fast strict type-checker which can generate appropriate deprecation warnings, in the interm before spidermonkey's implementation is changed. My concern is that the spec will hardwire these annoying quirks into asm.js and we'll be stuck with them (and with subtle bugs/inconsistencies between implementations) forever.
Quick note: ~~ is another annoying case -- we have to push the "leading tilde" status through all layers of the parseExpression() stack in order to ensure we don't have a type check failure trying to parse the inner "~double". If asm.js insisted that ~~ must appear as a single token (no intervening spaces, no parentheses between the ~), parsing gets much simpler/faster.
General comment: Making it simpler to write a one-pass parser makes sense, but other goals include making it easy to write compilers that generate asm.js, keeping asm.js compact, and making it somewhat reasonable to write small amounts of hand-written asm.js.
Sure. But using 0|x instead of x|0 is a wash on the 'other goals' list, and http://paste.debian.net/9099/ shows that the practical overhead of adding coercions to void callsite is very small (10 extra characters in the 2.4M bullet.js; and the other benchmarks required no additional characters).
Before making any final decisions, I'd like to see how if there are any problems with the CoerceInputs solution mentioned above which would allow us to keep the current type system rules. It may be a little less elegant to clutter use sites but it seems pretty insignificant in the grand scheme of things.
https://github.com/cscott/TurtleScript/commit/acddbd4461945143103fb186f701f41c9aa677a6 implements the "parser hackery" described in comment 11.
(In reply to Luke Wagner [:luke] from comment #20) > Before making any final decisions, I'd like to see how if there are any > problems with the CoerceInputs solution mentioned above which would allow us > to keep the current type system rules. The main problem is that you will not know the type (maybe-void but actually intish, maybe-void but actually doublish, actually void) of the local function when you generate the code for the call site. That can't be solved without a spec change (the implementation strategy, whether CoerceInputs or otherwise, is irrelevant). Everything else is just a size/complexity of spec/implementation issue. You can follow the commits in https://github.com/cscott/TurtleScript to get an idea for how much complexity the current type system rules entail.
(In reply to cscott from comment #22) > The main problem is that you will not know the type (maybe-void but actually > intish, maybe-void but actually doublish, actually void) of the local > function when you generate the code for the call site. That can't be solved > without a spec change (the implementation strategy, whether CoerceInputs or > otherwise, is irrelevant). Yes, but, fortunately, we don't need this information.
bullet.js also contains the following code: a_ = aY & mT(y, aR, aU, c[aS + 36 + (aZ << 2) >> 2] | 0, d[aZ + (aS + 56) | 0] | 0, t); where mT returns intish. This relies on use-site propagation; it can't be checked using the strategy of comment 12.
lua-binarytrees.js also contains a number of examples of the form: b[aw >> 1] = he(av, q) & 65535; These can be accomodated by simply checking for '|' or '&' in the trailing context, rather than insisting on the '|0' form.
Ok, I'm putting this project down for a while. https://github.com/cscott/TurtleScript/ implements the solution from comment 12, except (a) I actually allow f() to return the 'maybe void' type, and (b) I allow f() | NumericLiteral and f() & NumericLiteral to imply int, which works around the issue in comment 25. So the set of recognized call site types are: +f() => f returns doublish, f()|N or f()&N => f returns intish, everything else => f returns maybe-void. There are still a few issues with existing emscripten code such as comment 24. Note that if you decide not to make any of the changes described here, you need to add a full description of "use site coercion" to the spec, and describe for each operator/grammar production what use site coercion applies (as described in comment 7). You should still fix the strange use site coercions currently applied for ! and ~~, as described in comment 7 and comment 9.
I posted my code as an online asm.js verification service, at http://turtlescript.github.cscott.net/asmjs.html
The problem in bug 883175 suggests that we probably do want the special f()|0, +f call forms after all.
Noted, and bug 883175 (and http://turtlescript.github.cscott.net/asmjs.html) updated. To accomodate emscripten, the set of valid call site forms are now: +f() => f returns doublish f()|N or f()&N or f()>>N or f()>>>N => f returns intish (where N is an integer literal possibly preceded with unary minus) everything else => f returns maybe-void (Note that, if we're breaking things, I still recommend getting rid of the 'maybe void' type and just making that final form imply that f returns void. That just requires coercion on calls to local non-void functions even if you don't plan to use the result. It simplifies the spec and the type system by not requiring fixup for 'maybe void' functions. That said, the asm-llvm verifier does current implements the 'maybe void' rules.)
To be clear to anyone just reading this: an asm.js parser would not be semantically visible; it would fall back to the regular parser on any validation error.