Closed Bug 835587 Opened 11 years ago Closed 11 years ago

Add syntax only mode to parser

Categories

(Core :: JavaScript Engine, defect)

Other Branch
x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla22

People

(Reporter: bhackett1024, Assigned: bhackett1024)

References

Details

Attachments

(2 files, 3 obsolete files)

Ecmascript requires that syntax errors be generated as soon as code is encountered, rather than at some later point.  This is problematic for lazy bytecode generation, which defers generation of the syntax tree until a script first executes.  So a separate pass to detect these syntax errors is needed, and rather than write a separate parser for this (an approach which seems crazy), it seems reasonable to adapt the current parser for this purpose.

This would templatize Parser, so that there is a Parser<FullParse> which generates the AST, and a Parser<SyntaxParse> which does not generate any parse nodes and just returns true if the code has valid syntax, false if it *might* have invalid syntax.  When a wad of code is initially loaded, do a syntax parse.  If it succeeds, do a full parse of individual scripts when they first run.  If it fails, do a full parse to generate any errors --- the syntax parser doesn't need to keep enough information around to generate good error messages itself.

Per IRC discussion with jorendorff, some features of the language require parse trees to some degree in order to generate syntax errors.  Some of these are not-yet-standard features like destructuring, array comprehensions and generators, for which the syntax parse can just fail and fall back to the full parse.  Others are fundamental features like assignment and labels; the syntax parser needs to retain enough information to fail on code like 'f(x) = 3' and 'x.f: 0'.  This seems doable.

This approach will only work if doing a syntax parse is much faster than doing a full parse and bytecode compilation.  Otherwise the syntax parse will just be a drag on performance.  Also, eventually the syntax parse will also need to compute information about variable declarations and scopes, so that aliased variables in a parent script can be identified without doing a full parse of its children.  Leaving that for followup, however.
Attached patch WIP (obsolete) — Splinter Review
WIP, runs the syntax parser before a full parse and reports whether it succeeded and the amount of time taken for syntax vs. full parsing.

Testing the minified jquery in octane's code-load benchmark, just doing a syntax parse is about 3.8x faster than fully parsing and emitting the source (3.4ms vs. 13ms).  Using the numbers on executed bytecode in bug 678037 comment 33, this suggests that on a typical browser workload we'd spend about 63% as much time parsing/emitting by doing things lazily as eagerly, and 40% on code-load (which runs less code than is typical).

The syntax parse doesn't compute binding information or detect parse errors related to bindings (like duplicate arguments in strict mode code), but I don't expect that adding binding stuff will affect these numbers significantly.
Assignee: general → bhackett1024
Attached patch patch (052d2de29f8f) (obsolete) — Splinter Review
This templatizes Parser so it can do either a full parse or a syntax parse.  Syntax parses only happen through the new syntaxParse(src) shell method, though for testing I got this working with all parses we do at the top level or within an eval in jit-tests and jstests.  "Working" here just means "doesn't crash".  It's valid for a syntax parse to fail but the real parse to succeed, when dealing with features like destructuring and generators, or for the full parse to fail but the syntax parse to succeed, because this doesn't yet deal with errors similar to the duplicate arguments one mentioned above.

A goal in followup will be to eliminate the second category, so that an error during full parsing implies an error during syntax parsing, at least for errors mandated by the spec.  Some errors only reported with JSOPTION_STRICT would be very painful to report eagerly, such as for HasFinalReturn.
Attachment #708554 - Attachment is obsolete: true
Attachment #708672 - Flags: review?(jorendorff)
Attached patch rebaseSplinter Review
Rebase across de-e4x.
Attachment #708672 - Attachment is obsolete: true
Attachment #708672 - Flags: review?(jorendorff)
Attachment #709039 - Flags: review?(jorendorff)
Comment on attachment 709039 [details] [diff] [review]
rebase

Review of attachment 709039 [details] [diff] [review]:
-----------------------------------------------------------------

Skimming through this patch, I thought "this isn't nearly as invasive as I thought it would be"... and then I got to the Parser.cpp changes.  There sure are a lot of |if (!fullParse()) {| conditions.  I guess there's no other sensible way to do it.  What happens if you forget one of these conditions?
(In reply to Nicholas Nethercote [:njn] from comment #4)
> Comment on attachment 709039 [details] [diff] [review]
> rebase
> 
> Review of attachment 709039 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Skimming through this patch, I thought "this isn't nearly as invasive as I
> thought it would be"... and then I got to the Parser.cpp changes.  There
> sure are a lot of |if (!fullParse()) {| conditions.  I guess there's no
> other sensible way to do it.  What happens if you forget one of these
> conditions?

Yeah, the fullParse() testing everywhere is pretty annoying.  This could be avoided with more abstraction of parse node interaction, as is done in a few places like Parser::isName, but given the current state of the parser this would be almost a total rewrite.

Missing a fullParse() will lead to a crash near NULL, as during the syntax parse each ParseNode* is actually a small enum value.
Comment on attachment 709039 [details] [diff] [review]
rebase

Review of attachment 709039 [details] [diff] [review]:
-----------------------------------------------------------------

My first impression is this isn't the best way to do it.

Instead, how about this. We split AST generation from parsing. The Parser *just* parses, and when it finds an ExpressionStatement, for example, it passes that off to a handler object that either builds the parse tree or not. The Parser would communicate with the handler object via a set of methods, one per grammar production, either because the Handler base class has a bunch of pure virtual methods, or by doing 'template <class Handler> class Parser' --probably the latter, for speed.

That would be more invasive, but less of a hack. It would separate concerns better. The same stuff could be used to implement Reflect.parse.

What do you think?
I originally started with something similar to the approach you describe, but gave up quickly because of the almost complete lack of abstraction when manipulating parse nodes throughout the parser.  Looking back though, it's not really a good long term strategy to work around some horribleness by making things even more horrible.

This was good to test the viability and speed gains of the approach, but I'll toss it out and start on something similar to what you describe.  It will, yeah, need to be templated.
Attachment #709039 - Flags: review?(jorendorff)
Attached patch WIP (8cf5d7935060) (obsolete) — Splinter Review
WIP on new Parser refactoring, compiles and links but untested.

This is, yeah, rather more invasive than the earlier patch, but cleaner.  Instead of having a ParseNode* represent the values of an enum, this uses the enum explicitly during syntax parsing --- Parser is now Parser<ParseHandler> and manipulates ParseHandle::Node values instead of ParseNode*.  During full parsing this is a ParseNode*, during syntax parsing this is an enum, no casting craziness so no potential to crash on dereferencing an enum.

The majority of parse methods use the handler exclusively to construct and manipulate the AST, though there are exceptions.  Some areas of the parser use template specialization so that they can manipulate the parse tree directly; in these areas there is a much simpler specialization for syntax parsing.  Most of these are when dealing with features not expected in content JS like destructuring and generators, for which the syntax parser can just bail out.  A notable exception is Parser::forStatement.  I think this is the most complicated function in the entire VM, mostly owing to the swiss army knife of standard and non-standard features which 'for' loops in our JS implementation supports.  For now the syntax parser just bails out on 'for' loops, but a followup patch will provide a simpler alternative to ::forStatement which supports just the 'for' loop forms we will typically see.
Attached patch patchSplinter Review
Updated patch.  Full parser passes jit-tests and jstests.  Syntax parser tested somewhat, doesn't crash and is able to detect strict mode correctly.
Attachment #712253 - Attachment is obsolete: true
Attachment #712785 - Flags: review?(jorendorff)
(In reply to Brian Hackett (:bhackett) from comment #7)
> I originally started with something similar to the approach you describe,
> but gave up quickly because of the almost complete lack of abstraction when
> manipulating parse nodes throughout the parser.  Looking back though, it's
> not really a good long term strategy to work around some horribleness by
> making things even more horrible.

My hero.
Comment on attachment 712785 [details] [diff] [review]
patch

Review of attachment 712785 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.h
@@ +412,2 @@
>    public:
> +    StrictModeGetter(void *p) : parser(p) { }

LOL.  It would be good to have a brief comment explaining (a) what the actual type is, and (b) why you're using void*.

::: js/src/tests/ecma/extensions/errorcolumnblame.js
@@ +13,3 @@
>          f();
>      } catch (e) {
> +      print(e);

Should this change still be here?
Comment on attachment 712785 [details] [diff] [review]
patch

Review of attachment 712785 [details] [diff] [review]:
-----------------------------------------------------------------

I've reviewed everything except Parser.{h,cpp} and FullParseHandler.{h,cpp}. Notes so far:

::: js/src/frontend/BytecodeCompiler.cpp
@@ -185,5 @@
>  
>          if (!EmitTree(cx, &bce, pn))
>              return UnrootedScript(NULL);
> -
> -        parser.freeTree(pn);

njn just finished showing me in bug 831097 that ParseNode recycling reduces maximum memory usage significantly, at least for some real-world scripts. I think it is mainly this freeTree() call that does it.

Should be easy to keep this.

@@ +195,5 @@
>      if (callerFrame && callerFrame.isFunctionFrame() && callerFrame.fun()->hasRest()) {
>          HandlePropertyName arguments = cx->names().arguments;
>          for (AtomDefnRange r = pc.lexdeps->all(); !r.empty(); r.popFront()) {
>              if (r.front().key() == arguments) {
> +                JS_ReportErrorNumber(cx, js_GetErrorMessage, NULL, JSMSG_ARGUMENTS_AND_REST);

Does this change (here and elsewhere) cause us not to get the right lineNumber on the exception?

The reportError() method was inline-- why get rid of it?

@@ +242,5 @@
> +    }
> +
> +    TokenStream &tokenStream = parser.tokenStream;
> +    for (;;) {
> +        TokenKind tt = tokenStream.peekToken(TSF_OPERAND);

Nit: The temporary reference seems unnecessary.

::: js/src/frontend/ParseNode-inl.h
@@ +58,1 @@
>  struct ParseContext;

Instead just delete this forward declaration; it compiles fine without it (has to, given the use of pc members in the inline function immediately following this).

::: js/src/frontend/ParseNode.cpp
@@ +338,5 @@
>  }
>  
>  // Nb: unlike most functions that are passed a Parser, this one gets a
>  // SharedContext passed in separately, because in this case |pc| may not equal
>  // |parser->pc|.

The comment is still basically correct but some names have changed. Also (pre-existing nit) the right spelling is "N.B." or better yet "Note:" not "Nb:".

@@ +370,5 @@
>   * binding context as the original tree.
>   */
> +template <>
> +ParseNode *
> +Parser<FullParseHandler>::cloneParseTree(ParseNode *opn)

Can cloneParseTree and cloneLeftHandSide be methods of the handler instead?

::: js/src/frontend/Parser.h
@@ +503,5 @@
> +    friend class CompExprTransplanter;
> +    friend class GenexpGuard<ParseHandler>;
> +    friend struct BindData<ParseHandler>;
> +
> +    // XXX fix these backdoors.

If they can be fixed simply by making Parser<>::handler public, I'm for it. (But your taste is as good as mine.)

::: js/src/frontend/TokenStream.h
@@ +414,2 @@
>  
>      bool get() const;

Two maybe not-quite-so-hacky ways to do this:

1. Give Parser<T> a non-template base class BaseParser that can answer the strictMode() question, and use that type here. OR,

2. Make this a pure virtual method and require each Parser to subclass it. I am assuming it's not hot. (Actually it might only be used in error cases; not sure.)

But a comment would suffice!

::: js/src/tests/ecma/extensions/errorcolumnblame.js
@@ +8,5 @@
>  
>  function test(f, col) {
>      var caught = false;
>      try {
> +      print(f);

Looks like some debugging stuff you can revert now.
I'm still going, btw. Still going to finish tonight. So far it looks much better and cleaner than I expected.
Comment on attachment 712785 [details] [diff] [review]
patch

Review of attachment 712785 [details] [diff] [review]:
-----------------------------------------------------------------

OK, I know I said I was going to finish this but it's enormous and the changes are less mechanical than I thought. I'll be back on it tomorrow morning.

::: js/src/frontend/Parser.cpp
@@ +6219,4 @@
>  
>              /*
>               * Check for duplicate property names.  Duplicate data properties
>               * only conflict in strict mode.  Duplicate getter or duplicate

We talked on IRC about strict-mode errors for stuff like this, and vars/args named 'eval', 'arguments', and duplicate argument names. Right now we make no attempt to find this kind of stuff in SyntaxParse mode. The standard requires us to throw an early SyntaxError. It can be hooked up later, and I imagine the tests for all those cases are already in place, so no worries.

::: js/src/frontend/Parser.h
@@ +227,1 @@
>  struct BindData;

Instead of a template, BindData could be a member of ParseHandler. But that'd be even more intrusive. Better to do it this way for now...

@@ +407,5 @@
> +    Node mulExpr1n();
> +    Node unaryExpr();
> +    Node memberExpr(bool allowCallSyntax);
> +    Node primaryExpr(TokenKind tt);
> +    Node parenExpr(bool *genexp = NULL);

Headline: "Mozilla JS Engine Switches To Node"
Bugzilla just threw away a bunch of my review work, I think because I had Splinter open in multiple windows at once. Little bit discouraged. I'm not going to finish before Bugzilla goes down for maintenance.
Comment on attachment 712785 [details] [diff] [review]
patch

Review of attachment 712785 [details] [diff] [review]:
-----------------------------------------------------------------

OK, I think this is everything important that I lost. Still looking good.

::: js/src/frontend/Parser.cpp
@@ +79,5 @@
>  StrictModeGetter::get() const
>  {
> +    JS_ASSERT(&((Parser<FullParseHandler> *) parser)->pc->sc->strict ==
> +              &((Parser<SyntaxParseHandler> *) parser)->pc->sc->strict);
> +    return ((Parser<FullParseHandler> *) parser)->pc->sc->strict;

:-O

@@ +115,5 @@
> +Parser<FullParseHandler>::expr();
> +
> +template <>
> +bool
> +Parser<FullParseHandler>::setAssignmentLhsOps(ParseNode *pn, JSOp op);

Can we put these declarations in a header, so the definition and the call sites see the same thing?

@@ +263,5 @@
>      }
>  }
>  
> +template void
> +ParseContext<FullParseHandler>::updateDecl(JSAtom *atom, Node pn);

This is just instantiating the member function, right? Would it be possible to just say:
  template class ParseContext<FullParseHandler>;

to instantiate all the methods at once? Perhaps at the bottom of the file to make sure it happens after all method specializations?

@@ +343,5 @@
>  }
>  
> +template <typename ParseHandler>
> +Parser<ParseHandler>::Parser(JSContext *cx, const CompileOptions &options,
> +                          StableCharPtr chars, size_t length, bool foldConstants)

Micro-nit: Indentation looks a bit off here.

@@ +521,1 @@
>      traceListHead->trace(trc);

Remove the static assertion? This code would work fine even if that were false, since this is a template.

@@ +570,5 @@
>   * that contains an early return e1 will get a strict warning.  Similarly for
>   * iloops: while (true){...} is treated as though ... returns.
>   */
>  #define ENDS_IN_OTHER   0
>  #define ENDS_IN_RETURN  1

If you feel like making this an enum, go for it.

@@ +691,5 @@
>  
> +static int
> +HasFinalReturn(SyntaxParseHandler::Node pn)
> +{
> +    return true;

Should return ENDS_IN_RETURN.

@@ +2225,5 @@
>  }
>  
> +template <>
> +/* static */ bool
> +Parser<SyntaxParseHandler>::BindLet(JSContext *cx, BindData<SyntaxParseHandler> *data,

Style nit: Lowercase method name (bindLet) here and elsewhere.

@@ +2880,4 @@
>      if (!block)
> +        return null();
> +
> +    Node pnlet = handler.newBinary(PNK_LET, vars, block);

Unfortunately moving this affects the position of the node, because BinaryNode::create uses the current read position of the tokenStream.

I moved node allocation around in my patch, too, and I ended up making local TokenPos variables. I made a method Parser::pos() that returns tokenStream->currentToken().pos. I went ahead and made all the parse-node-creating methods take an extra const TokenPos &pos argument.

::: js/src/frontend/SyntaxParseHandler.h
@@ +43,5 @@
> +            return true;
> +        if (kind == ParseStrictError)
> +            return !strict;
> +        return false;
> +    }

I don't think there's any need to put error reporting in the ParseHandler interface.

Maybe I'm missing the reason for this. But I would make the parser report errors on Parser::context --which is what it already does now. My thinking is, the ParseHandler is provided by the caller, and all callers actually want the same error behavior: an exception on cx.

If the problem is error location information, then as a temporary hack add a ParseHandler::getNodePosition(Node node) method (so that it works just like it used to in FullParse mode) and file a follow-up bug for me to get rid of that. I'll replace it with code that just stores the position in local variables in Parser methods as needed to report errors, instead of relying on pn_pos. It shouldn't be a huge deal. (And with that change, error-reporting convenience methods would fit nicely in a non-template ParserBase class.)

Another issue is how to report cases where SyntaxParseHandler simply can't cope with the source code. For that, set a flag in the SyntaxParseHandler, and check it when superficial parsing fails to decide whether to propagate the SyntaxError or reparse using FullParseHandler.

::: js/src/shell/js.cpp
@@ +3259,5 @@
> +
> +    bool succeeded = parser.parse(NULL);
> +    cx->clearPendingException();
> +
> +    args.rval().setBoolean(succeeded);

I dunno, why not just let the exception propagate? That's what parse() does.
Comment on attachment 712785 [details] [diff] [review]
patch

Review of attachment 712785 [details] [diff] [review]:
-----------------------------------------------------------------

Oh, I noticed you started putting 'TokenPtr begin;' local variables in there. That's the right thing. But I think you may have missed a few.

There are at least places like in Parser<H>::statement() where the resulting positions will be *different*, making Reflect.parse return mildly wrong-ish values, so that for example the position of

    if (x) f();

will be reported as just the part that says "x) f()". I think. I don't think it will break anything other than Reflect.parse (although--who knows).

::: js/src/frontend/Parser.cpp
@@ +3071,4 @@
>      }
>  
> +    if (!handler.checkListLength(caseList, JS_BIT(16), JSMSG_TOO_MANY_CASES))
> +        return null();

This can be removed from the interface too--just count them in a local variable (like we do with setDefault).

...Though I have no idea why we bomb out at 64K cases. Is there a reason for that? Surely it's not something to do with bytecode format...?

@@ +3514,5 @@
> +SyntaxParseHandler::Node
> +Parser<SyntaxParseHandler>::forStatement()
> +{
> +    // XXX bug 835587 for statement needs a rewrite for syntax parsing.
> +    return SyntaxParseHandler::NodeFailure;

LOL yeah.
OK, I'm not going to play chicken with Bugzilla downtime, so I'll finish this up on Monday. Only 2300 lines to go.
Comment on attachment 712785 [details] [diff] [review]
patch

Review of attachment 712785 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with comments addressed.

Note that the review comments span from comment 12 to this one.

::: js/src/frontend/Parser.cpp
@@ +4584,2 @@
>  bool
> +Parser<FullParseHandler>::setAssignmentLhsOps(ParseNode *pn, JSOp op)

(no changes requested here, just a comment) This is one of those places where we do syntax-checking in a second pass. It'll be hard to do this if we've thrown away the parse tree. Need to do it on the first pass, locally to the parser, which will be a pain. But involving the handler in syntax-checking seems much worse to me.

@@ +4829,4 @@
>          pn2 = memberExpr(true);
>          if (!pn2)
> +            return null();
> +        pn = handler.newUnary((tt == TOK_INC) ? PNK_PREINCREMENT : PNK_PREDECREMENT, pn2);

Source coordinates are off here, I think.

@@ +4844,5 @@
> +
> +        if (!checkDeleteExpression(&pn2))
> +            return null();
> +
> +        pn = handler.newUnary(PNK_DELETE, pn2);

Here too.

@@ +5474,5 @@
> +template <>
> +bool
> +Parser<SyntaxParseHandler>::arrayInitializerComprehensionTail(Node pn)
> +{
> +    return false;

Returning false without an error reported.

@@ +5732,5 @@
>          if (tokenStream.matchToken(TOK_LP) && !argumentList(lhs))
> +            return null();
> +
> +        if (!handler.checkListLength(lhs, ARGC_LIMIT, JSMSG_TOO_MANY_CON_ARGS))
> +            return null();

I think that check belongs in the emitter. Same goes for the place where we check the number of switch-statement cases. Hopefully get rid of checkListLength from the ParseHandler interface.

@@ +5769,5 @@
>               */
>              if (foldConstants && !FoldConstants(context, &propExpr, this))
> +                return null();
> +
> +            PropertyName *name = foldPropertyByValue(propExpr);

The previous code had three possible outcomes: success with name non-null, success with name NULL, and OOM. So the new code should be like

    PropertyName *name;
    if (!foldPropertyByValue(propExpr, &name))
        return null();

@@ +5801,5 @@
>                  /* Select JSOP_FUNAPPLY given foo.apply(...). */
> +                if (atom == context->names().apply)
> +                    handler.setOp(nextMember, JSOP_FUNAPPLY);
> +                else if (atom == context->names().call)
> +                    handler.setOp(nextMember, JSOP_FUNCALL);

This opcode selection stuff could go in the emitter too, right? And maybe get rid of isGetProp. Followup bug would be fine, you can assign it to me if you like.

@@ -6035,5 @@
> -        if (!compileAndGo) {
> -            if (!JSObject::clearParent(context, reobj))
> -                return NULL;
> -            if (!JSObject::clearType(context, reobj))
> -                return NULL;

Where did this code go? I don't see it in newRegExp. (If it's just totally unnecessary, great; but I'd be more comfortable deleting it in a separate patch...)

@@ -6097,5 @@
>  #if JS_HAS_GENERATOR_EXPRS
>      if (tokenStream.matchToken(TOK_FOR)) {
>          if (!guard.checkValidBody(pn))
> -            return NULL;
> -        JS_ASSERT(!pn->isKind(PNK_YIELD));

Why remove the assertion? Is it just too obvious?
Attachment #712785 - Flags: review?(jorendorff) → review+
cloneParseTree creates new function boxes so unfortunately it has to be in the parser.  I added back the clearParent/clearType calls to newRegExp (they are, indeed, unnecessary now that we have compartment per global).

https://hg.mozilla.org/integration/mozilla-inbound/rev/c92816f3028c
Blocks: 845404
Blocks: 845596
https://hg.mozilla.org/mozilla-central/rev/c92816f3028c
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: