Closed Bug 546087 Opened 14 years ago Closed 14 years ago

[narcissus] make the scanner less insanely slow

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: gal, Assigned: pcwalton)

Details

Attachments

(1 file, 3 obsolete files)

This patch shortcuts parsing numbers if the first digit rules out a number. This is a 2.5x speedup (total parse time) when parsing jsparse.js.
Attached patch patch (obsolete) — Splinter Review
Assignee: general → gal
Attachment #426883 - Attachment is patch: true
Attachment #426883 - Attachment mime type: application/octet-stream → text/plain
Comment on attachment 426883 [details] [diff] [review]
patch

Not super pretty I admit but the regexps are seriously slow. What do you think?
Attachment #426883 - Flags: review?(brendan)
Comment on attachment 426883 [details] [diff] [review]
patch

>diff --git a/js/narcissus/jsparse.js b/js/narcissus/jsparse.js
>--- a/js/narcissus/jsparse.js
>+++ b/js/narcissus/jsparse.js
>@@ -149,20 +149,23 @@ Tokenizer.prototype = {
>         this.tokenIndex = (this.tokenIndex + 1) & 3;
>         token = this.tokens[this.tokenIndex];
>         if (!token)
>             this.tokens[this.tokenIndex] = token = {};
> 
>         if (!input)
>             return token.type = END;
> 
>-        if ((match = fpRegExp(input))) {
>+        var c0 = input[0];
>+        var number = c0 >= '0' && c0 <= '9';

Ok, this does read poorly (number line order please), and it can cost an extra test/jump (no GCC optimizer to lean on here :-P).

How about this:

          var number = (+c0 == c0);

Is it as fast or faster?

>+
>+        if ((number || (c0 == '.')) && (match = fpRegExp(input))) {
>             token.type = NUMBER;
>             token.value = parseFloat(match[0]);
>-        } else if ((match = /^0[xX][\da-fA-F]+|^0[0-7]*|^\d+/(input))) {
>+        } else if (number && (match = /^0[xX][\da-fA-F]+|^0[0-7]*|^\d+/(input))) {

Looks good otherwise.

/be
Assignee: gal → pwalton
Attached patch Proposed patch. (obsolete) — Splinter Review
This patch replaces the lexer with a handwritten charAt()-based one, to avoid creating a new substring with every token (reducing lexing from quadratic to linear time). Parsing jQuery is 35x faster in Fx 3.6.4.

Additionally, the "GLOBAL" alias has been removed, and the usage of "eval" has been lessened. These changes reduced TraceMonkey aborts on trunk and resulted in an additional 40% speedup in V8.
Attachment #426883 - Attachment is obsolete: true
Attachment #444206 - Flags: review?
Attachment #426883 - Flags: review?(brendan)
Attachment #444206 - Flags: review? → review?(gal)
Status: NEW → ASSIGNED
Comment on attachment 444206 [details] [diff] [review]
Proposed patch.


>+            if (this.scanNewlines && next.lineno != this.lineno)

{} this

>+                tt = NEWLINE;
>+            else
>+                tt = next.type;

{} this too, easier to read. If there is an else block or a multi line consequence, always use {}.

>+        } else {
>+            tt = this.get();
>+            this.unget();
>+        }
>+        return tt;
>+    },
>+
>+    peekOnSameLine: function () {
>+        this.scanNewlines = true;
>+        var tt = this.peek();
>+        this.scanNewlines = false;
>+        return tt;
>+    },
>+
>+    // Eats comments and whitespace.
>+    skip: function () {
>+        var input = this.source;
>+        for (;;) {
>+            var ch = input[this.cursor++];
>+            var next = input[this.cursor];
>+            if (ch === '\n' && !this.scanNewlines) {
>+                this.lineno++;
>+            } else if (ch === '/' && next === '*') {
>+                this.cursor++;
>+                for (;;) {
>+                    ch = input[this.cursor++];
>+                    if (ch === undefined) {
>+                        throw this.newSyntaxError("Unterminated comment");
>+                    } else if (ch === '*') {
>+                        next = input[this.cursor];
>+                        if (next === '/') {
>+                            this.cursor++;
>+                            break;
>+                        }
>+                    } else if (ch === '\n') {
>+                        this.lineno++;
>+                    }
>+                }
>+            } else if (ch === '/' && next === '/') {
>+                this.cursor++;
>+                for (;;) {
>+                    ch = input[this.cursor++];
>+                    if (ch === undefined) {
>+                        return;
>+                    } else if (ch === '\n') {
>+                        this.lineno++;
>+                        break;
>+                    }
>+                }
>+            } else if (ch !== ' ' && ch !== '\t') {
>+                this.cursor--;
>+                return;
>+            }
>+        }
>+    },
>+
>+    // Lexes the exponential part of a number, if present. Returns true iff an
>+    // exponential part was found.
>+    lexExponent: function() {
>+        var input = this.source;
>+        var next = input[this.cursor];
>+        if (next === 'e' || next === 'E') {
>+            this.cursor++;
>+            ch = input[this.cursor++];
>+            if (ch === '+' || ch === '-')
>+                ch = input[this.cursor++];
>+
>+            if (ch < '0' || ch > '9')
>+                throw this.newSyntaxError("Missing exponent");
>+
>+            do
>+                ch = input[this.cursor++];

{} here too.

>+            while (ch >= '0' && ch <= '9');
>+            this.cursor--;
>+
>+            return true;
>+        }
>+
>+        return false;
>+    },
>+
>+    lexZeroNumber: function (ch) {
>+        var token = this.token, input = this.source;
>+        token.type = NUMBER;
>+
>+        ch = input[this.cursor++];
>+        if (ch === '.') {
>+            do

{}

>+                ch = input[this.cursor++];
>+            while (ch >= '0' && ch <= '9');
>+            this.cursor--;
>+
>+            this.lexExponent();
>+            token.value = parseFloat(token.start, this.cursor);
>+        } else if (ch === 'x' || ch === 'X') {
>+            do
>+                ch = input[this.cursor++];

{}

>+            while ((ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') ||
>+                    (ch >= 'A' && ch <= 'F'));

; in the next line to make it clear its a no-body loop;

>+            this.cursor--;
>+
>+            token.value = parseInt(input.substring(token.start,
>+                this.cursor));

indent

>+        } else if (ch >= '0' && ch <= '7') {
>+            do
>+                ch = input[this.cursor++];

{}

>+            while (ch >= '0' && ch <= '7');
>+            this.cursor--;
>+
>+            token.value = parseInt(input.substring(token.start,
>+                this.cursor));

indent

>+        } else {
>+            this.cursor--;
>+            this.lexExponent();     // 0E1, &c.
>+            token.value = 0;
>+        }
>+    },
>+
>+    lexNumber: function (ch) {
>+        var token = this.token, input = this.source;
>+        token.type = NUMBER;
>+
>+        var floating = false;
>+        do {
>+            ch = input[this.cursor++];
>+            if (ch === '.' && !floating) {
>+                floating = true;
>+                ch = input[this.cursor++];
>+            }
>+        } while (ch >= '0' && ch <= '9');
>+
>+        this.cursor--;
>+
>+        var exponent = this.lexExponent();
>+        floating = floating || exponent;
>+
>+        var str = input.substring(token.start, this.cursor);
>+        if (floating)

{}

>+            token.value = parseFloat(str);
>+        else

{}

>+            token.value = parseInt(str);
>+    },
>+
>+    lexDot: function (ch) {
>+        var token = this.token, input = this.source;
>+        var next = input[this.cursor];
>+        if (next >= '0' && next <= '9') {
>+            do

dito

>+                ch = input[this.cursor++];
>+            while (ch >= '0' && ch <= '9');
>+            this.cursor--;
>+
>+            this.lexExponent();
>+
>+            token.type = NUMBER;
>+            token.value = parseFloat(token.start, this.cursor);
>+        } else {
>+            token.type = DOT;
>+            token.assignOp = null;
>+            token.value = '.';
>+        }
>+    },
>+
>+    lexString: function (ch) {
>+        var token = this.token, input = this.source;
>+        token.type = STRING;
>+
>+        var hasEscapes = false;
>+        var delim = ch;
>+        ch = input[this.cursor++];
>+        while (ch !== delim) {
>+            if (ch === '\\') {
>+                hasEscapes = true;
>+                this.cursor++;
>+            }
>+            ch = input[this.cursor++];
>+        }
>+
>+        if (hasEscapes) {
>+            var str = input.substring(token.start, this.cursor);
>+            token.value = eval(str);
>+        } else {
>+            token.value = input.substring(token.start + 1, this.cursor - 1);
>+        }
>+    },
>+
>+    lexRegExp: function (ch) {
>+        var token = this.token, input = this.source;
>+        token.type = REGEXP;
>+
>+        do {
>+            ch = input[this.cursor++];
>+            if (ch === '\\') {
>+                this.cursor++;
>+            } else if (ch === '[') {
>+                do {
>+                    if (ch === undefined) {
>+                        throw this.newSyntaxError("Unterminated " +
>+                            "character class");

I would make this one line.

>+                    } else if (ch === '\\') {
>+                        this.cursor++;
>+                    }
>+
>+                    ch = input[this.cursor++];
>+                } while (ch !== ']');
>+            } else if (ch === undefined) {
>+                throw this.newSyntaxError("Unterminated regex");
>+            }
>+        } while (ch !== '/');
>+
>+        do
>+            ch = input[this.cursor++];

{}

>+        while (ch >= 'a' && ch <= 'z');
>+
>+        this.cursor--;
>+
>+        token.value = eval(input.substring(token.start, this.cursor));
>+    },
>+
>+    lexOp: function (ch) {
>+        var token = this.token, input = this.source;
>+
>+        // A bit ugly, but it seems wasteful to write a trie lookup routine
>+        // for only 3 characters...
>+        var node = opTokens[ch];
>+        var next = input[this.cursor];
>+        if (next in node) {
>+            node = node[next];
>+            this.cursor++;
>+            next = input[this.cursor];
>+            if (next in node) {
>+                node = node[next];
>+                this.cursor++;
>+                next = input[this.cursor];
>+            }
>+        }
>+
>+        var op = node.op;
>+        if (assignOps[op] && input[this.cursor] === '=') {
>+            this.cursor++;
>+            token.type = ASSIGN;
>+            token.assignOp = tokenIds[opTypeNames[op]];
>+            op += '=';
>+        } else {
>+            token.type = tokenIds[opTypeNames[op]];
>+            if (this.scanOperand) {
>+                switch (token.type) {
>+                case PLUS:  token.type = UNARY_PLUS;    break;
>+                case MINUS: token.type = UNARY_MINUS;   break;

indent case by 2 (half-indent, don't ask, its a brendan thing).

>+                }
>+            }
>+
>+            token.assignOp = null;
>+        }
>+
>+        token.value = op;
>+    },
>+
>+    // FIXME: Unicode escape sequences
>+    // FIXME: Unicode identifiers
>+    lexIdent: function (ch) {
>+        var token = this.token, input = this.source;
>+
>+        do {
>+            ch = input[this.cursor++];
>+        } while ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
>+            (ch >= '0' && ch <= '9') || ch === '$' || ch === '_');

indent

>+
>+        this.cursor--;  // put the non-word character back
>+
>+        var id = input.substring(token.start, this.cursor);
>+        token.type = keywords[id] || IDENTIFIER;
>+        token.value = id;
>+    },
>+
>+    get: function () {
>+        var token;
>+        while (this.lookahead) {
>+            --this.lookahead;
>+            this.tokenIndex = (this.tokenIndex + 1) & 3;
>+            token = this.tokens[this.tokenIndex];
>+            if (token.type != NEWLINE || this.scanNewlines)
>+                return token.type;
>+        }
>+
>+        this.skip();
>+
>+        this.tokenIndex = (this.tokenIndex + 1) & 3;
>+        token = this.tokens[this.tokenIndex];
>+        if (!token)
>+            this.tokens[this.tokenIndex] = token = {};
>+
>+        var input = this.source;
>+        if (this.cursor === input.length)
>+            return token.type = END;
>+
>+        token.start = this.cursor;
>+        token.lineno = this.lineno;
>+
>+        var ch = input[this.cursor++];
>+        if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
>+                ch === '$' || ch === '_') {
>+            this.lexIdent(ch);
>+        } else if (this.scanOperand && ch === '/') {
>+            this.lexRegExp(ch);
>+        } else if (ch in opTokens) {
>+            this.lexOp(ch);
>+        } else if (ch === '.') {
>+            this.lexDot(ch);
>+        } else if (ch >= '1' && ch <= '9') {
>+            this.lexNumber(ch);
>+        } else if (ch === '0') {
>+            this.lexZeroNumber(ch);
>+        } else if (ch === '"' || ch === "'") {
>+            this.lexString(ch);
>+        } else if (this.scanNewlines && ch === '\n') {
>+            token.type = NEWLINE;
>+            token.value = '\n';
>+            this.lineno++;
>+        } else {
>+            throw this.newSyntaxError("Illegal token");
>+        }
>+
>+        token.end = this.cursor;
>+        return token.type;
>+    },
>+
>+    unget: function () {
>+        if (++this.lookahead == 4) throw "PANIC: too much lookahead!";
>+        this.tokenIndex = (this.tokenIndex - 1) & 3;
>+    },
>+
>+    newSyntaxError: function (m) {
>+        var e = new SyntaxError(m, this.filename, this.lineno);
>+        e.source = this.source;
>+        e.cursor = this.cursor;
>+        return e;
>+    }
>+};
>+
>diff --git a/js/narcissus/jsparse.js b/js/narcissus/jsparse.js
>--- a/js/narcissus/jsparse.js
>+++ b/js/narcissus/jsparse.js
>@@ -38,179 +38,9 @@
> /*
>  * Narcissus - JS implemented in JS.
>  *
>- * Lexical scanner and parser.
>+ * Parser.
>  */
> 
>-// Build a regexp that recognizes operators and punctuators (except newline).
>-var opRegExpSrc = "^";
>-for (i in opTypeNames) {
>-    if (i == '\n')
>-        continue;
>-    if (opRegExpSrc != "^")
>-        opRegExpSrc += "|^";
>-    opRegExpSrc += i.replace(/[?|^&(){}\[\]+\-*\/\.]/g, "\\$&");
>-}
>-var opRegExp = new RegExp(opRegExpSrc);
>-
>-// A regexp to match floating point literals (but not integer literals).
>-var fpRegExp = /^\d+\.\d*(?:[eE][-+]?\d+)?|^\d+(?:\.\d*)?[eE][-+]?\d+|^\.\d+(?:[eE][-+]?\d+)?/;
>-
>-// A regexp to match regexp literals.
>-var reRegExp = /^\/((?:\\.|\[(?:\\.|[^\]])*\]|[^\/])+)\/([gimy]*)/;
>-
>-function Tokenizer(s, f, l) {
>-    this.cursor = 0;
>-    this.source = String(s);
>-    this.tokens = [];
>-    this.tokenIndex = 0;
>-    this.lookahead = 0;
>-    this.scanNewlines = false;
>-    this.scanOperand = true;
>-    this.filename = f || "";
>-    this.lineno = l || 1;
>-}
>-
>-Tokenizer.prototype = {
>-    get input() {
>-        return this.source.substring(this.cursor);
>-    },
>-
>-    get done() {
>-        return this.peek() == END;
>-    },
>-
>-    get token() {
>-        return this.tokens[this.tokenIndex];
>-    },
>-
>-    match: function (tt) {
>-        return this.get() == tt || this.unget();
>-    },
>-
>-    mustMatch: function (tt) {
>-        if (!this.match(tt))
>-            throw this.newSyntaxError("Missing " + tokens[tt].toLowerCase());
>-        return this.token;
>-    },
>-
>-    peek: function () {
>-        var tt, next;
>-        if (this.lookahead) {
>-            next = this.tokens[(this.tokenIndex + this.lookahead) & 3];
>-            if (this.scanNewlines && next.lineno != this.lineno)
>-                tt = NEWLINE;
>-            else
>-                tt = next.type;
>-        } else {
>-            tt = this.get();
>-            this.unget();
>-        }
>-        return tt;
>-    },
>-
>-    peekOnSameLine: function () {
>-        this.scanNewlines = true;
>-        var tt = this.peek();
>-        this.scanNewlines = false;
>-        return tt;
>-    },
>-
>-    get: function () {
>-        var token;
>-        while (this.lookahead) {
>-            --this.lookahead;
>-            this.tokenIndex = (this.tokenIndex + 1) & 3;
>-            token = this.tokens[this.tokenIndex];
>-            if (token.type != NEWLINE || this.scanNewlines)
>-                return token.type;
>-        }
>-
>-        for (;;) {
>-            var input = this.input;
>-            var match = (this.scanNewlines ? /^[ \t]+/ : /^\s+/)(input);
>-            if (match) {
>-                var spaces = match[0];
>-                this.cursor += spaces.length;
>-                var newlines = spaces.match(/\n/g);
>-                if (newlines)
>-                    this.lineno += newlines.length;
>-                input = this.input;
>-            }
>-
>-            if (!(match = /^\/(?:\*(?:.|\n)*?\*\/|\/.*)/(input)))
>-                break;
>-            var comment = match[0];
>-            this.cursor += comment.length;
>-            newlines = comment.match(/\n/g);
>-            if (newlines)
>-                this.lineno += newlines.length
>-        }
>-
>-        this.tokenIndex = (this.tokenIndex + 1) & 3;
>-        token = this.tokens[this.tokenIndex];
>-        if (!token)
>-            this.tokens[this.tokenIndex] = token = {};
>-
>-        if (!input)
>-            return token.type = END;
>-
>-        if ((match = fpRegExp(input))) {
>-            token.type = NUMBER;
>-            token.value = parseFloat(match[0]);
>-        } else if ((match = /^0[xX][\da-fA-F]+|^0[0-7]*|^\d+/(input))) {
>-            token.type = NUMBER;
>-            token.value = parseInt(match[0]);
>-        } else if ((match = /^[$_\w]+/(input))) {       // FIXME no ES3 unicode
>-            var id = match[0];
>-            token.type = keywords[id] || IDENTIFIER;
>-            token.value = id;
>-        } else if ((match = /^"(?:\\.|[^"])*"|^'(?:\\.|[^'])*'/(input))) { //"){
>-            token.type = STRING;
>-            token.value = eval(match[0]);
>-        } else if (this.scanOperand && (match = reRegExp(input))) {
>-            token.type = REGEXP;
>-            token.value = new RegExp(match[1], match[2]);
>-        } else if ((match = opRegExp(input))) {
>-            var op = match[0];
>-            if (assignOps[op] && input[op.length] == '=') {
>-                token.type = ASSIGN;
>-                token.assignOp = GLOBAL[opTypeNames[op]];
>-                match[0] += '=';
>-            } else {
>-                token.type = GLOBAL[opTypeNames[op]];
>-                if (this.scanOperand &&
>-                    (token.type == PLUS || token.type == MINUS)) {
>-                    token.type += UNARY_PLUS - PLUS;
>-                }
>-                token.assignOp = null;
>-            }
>-            token.value = op;
>-        } else if (this.scanNewlines && (match = /^\n/(input))) {
>-            token.type = NEWLINE;
>-        } else {
>-            throw this.newSyntaxError("Illegal token");
>-        }
>-
>-        token.start = this.cursor;
>-        this.cursor += match[0].length;
>-        token.end = this.cursor;
>-        token.lineno = this.lineno;
>-        return token.type;
>-    },
>-
>-    unget: function () {
>-        if (++this.lookahead == 4) throw "PANIC: too much lookahead!";
>-        this.tokenIndex = (this.tokenIndex - 1) & 3;
>-    },
>-
>-    newSyntaxError: function (m) {
>-        var e = new SyntaxError(m, this.filename, this.lineno);
>-        e.source = this.source;
>-        e.cursor = this.cursor;
>-        return e;
>-    }
>-};
>-
> function CompilerContext(inFunction) {
>     this.inFunction = inFunction;
>     this.stmtStack = [];
>@@ -690,7 +520,7 @@
> 
> // Map operator type code to precedence.
> for (i in opPrecedence)
>-    opPrecedence[GLOBAL[i]] = opPrecedence[i];
>+    opPrecedence[tokenIds[i]] = opPrecedence[i];
> 
> var opArity = {
>     COMMA: -2,
>@@ -715,7 +545,7 @@
> 
> // Map operator type code to arity.
> for (i in opArity)
>-    opArity[GLOBAL[i]] = opArity[i];
>+    opArity[tokenIds[i]] = opArity[i];
> 
> function Expression(t, x, stop) {
>     var n, id, tt, operators = [], operands = [];
Attachment #444206 - Flags: review?(gal) → review+
Comment on attachment 444206 [details] [diff] [review]
Proposed patch.

>+            if (this.scanNewlines && next.lineno != this.lineno)
>+                tt = NEWLINE;
>+            else
>+                tt = next.type;

I see no need to overbrace here, but we prefer to common the "tt = " like so:

               tt = (...) ? NEWLINE : next.type;

>+                    if (ch === undefined) {
>+                        throw this.newSyntaxError("Unterminated comment");
>+                    } else if (ch === '*') {

else after throw non-sequitur.

>+                    if (ch === undefined) {
>+                        return;
>+                    } else if (ch === '\n') {

else after return non-sequitur.

>+        if (hasEscapes) {
>+            var str = input.substring(token.start, this.cursor);
>+            token.value = eval(str);
>+        } else {
>+            token.value = input.substring(token.start + 1, this.cursor - 1);

Long enough, and the var str compounds the issue, that using ?: to common the "token.value = " may not be worth it, but it could be. Multiline ?: expressions split with the ? and : stacked under the ( before the condition. Why the str temporary?

>+                    if (ch === undefined) {
>+                        throw this.newSyntaxError("Unterminated " +
>+                            "character class");
>+                    } else if (ch === '\\') {

else after throw non-sequitur.

>+        this.cursor--;
>+
>+        token.value = eval(input.substring(token.start, this.cursor));

No str temp here, grist for my mill above ;-).

>+                switch (token.type) {
>+                case PLUS:  token.type = UNARY_PLUS;    break;
>+                case MINUS: token.type = UNARY_MINUS;   break;

Half-indent (two spaces) the case labels; likewise default: and "goto" (break and continue) labels.

Andreas caught the do-while braceless issue -- we follow K&R, where IIRC Kernighan pointed out that do-while should always be braced even if it fits on three lines total, because otherwise the while is too easily misread as the start of a while loop.

/be
Attached patch Proposed patch, version 2. (obsolete) — Splinter Review
And here's a second version of the patch with the style fixes.
Attachment #444206 - Attachment is obsolete: true
Attachment #445212 - Flags: review?(gal)
Comment on attachment 445212 [details] [diff] [review]
Proposed patch, version 2.

>+            var lineBreak = next.lineno != this.lineno;
>+            tt = (this.scanNewlines && lineBreak) ? NEWLINE : next.type;

Single-use temporaries seem unnecessary, and here it defeats short-circuiting when !this.scanNewlines.

>+        token.value = (floating) ? parseFloat(str) : parseInt(str);

Style uber-nit: no need to paren ?: condition if it's a primary or member expression.

/be
So glad to see this work happen, for the big speedup and (who cares about source lines? regexps are just unclear) clarity win. Thanks, Patrick!

/be
Style issues addressed. (BTW, I'm assuming Narcissus source is 80 characters wide based on the vim modeline.)
Attachment #445212 - Attachment is obsolete: true
Attachment #445231 - Flags: review?(brendan)
Attachment #445212 - Flags: review?(gal)
Comment on attachment 445231 [details] [diff] [review]
Proposed patch, version 3.

Sure, thanks -- no need for re-review on small nit-picks like this.

/be
Attachment #445231 - Flags: review?(brendan) → review+
Attachment #445231 - Flags: superreview?
Comment on attachment 445231 [details] [diff] [review]
Proposed patch, version 3.

No superreview required for this.
Attachment #445231 - Flags: superreview?
Comment on attachment 445231 [details] [diff] [review]
Proposed patch, version 3.

No need for superreview in js/{src,narcissus}.

/be
Pushed as changeset 0fa1b00e3a68.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: