Closed
Bug 546087
Opened 14 years ago
Closed 14 years ago
[narcissus] make the scanner less insanely slow
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: gal, Assigned: pcwalton)
Details
Attachments
(1 file, 3 obsolete files)
21.93 KB,
patch
|
brendan
:
review+
|
Details | Diff | Splinter Review |
This patch shortcuts parsing numbers if the first digit rules out a number. This is a 2.5x speedup (total parse time) when parsing jsparse.js.
Reporter | ||
Comment 1•14 years ago
|
||
Assignee: general → gal
Reporter | ||
Updated•14 years ago
|
Attachment #426883 -
Attachment is patch: true
Attachment #426883 -
Attachment mime type: application/octet-stream → text/plain
Reporter | ||
Comment 2•14 years ago
|
||
Comment on attachment 426883 [details] [diff] [review] patch Not super pretty I admit but the regexps are seriously slow. What do you think?
Attachment #426883 -
Flags: review?(brendan)
Comment 3•14 years ago
|
||
Comment on attachment 426883 [details] [diff] [review] patch >diff --git a/js/narcissus/jsparse.js b/js/narcissus/jsparse.js >--- a/js/narcissus/jsparse.js >+++ b/js/narcissus/jsparse.js >@@ -149,20 +149,23 @@ Tokenizer.prototype = { > this.tokenIndex = (this.tokenIndex + 1) & 3; > token = this.tokens[this.tokenIndex]; > if (!token) > this.tokens[this.tokenIndex] = token = {}; > > if (!input) > return token.type = END; > >- if ((match = fpRegExp(input))) { >+ var c0 = input[0]; >+ var number = c0 >= '0' && c0 <= '9'; Ok, this does read poorly (number line order please), and it can cost an extra test/jump (no GCC optimizer to lean on here :-P). How about this: var number = (+c0 == c0); Is it as fast or faster? >+ >+ if ((number || (c0 == '.')) && (match = fpRegExp(input))) { > token.type = NUMBER; > token.value = parseFloat(match[0]); >- } else if ((match = /^0[xX][\da-fA-F]+|^0[0-7]*|^\d+/(input))) { >+ } else if (number && (match = /^0[xX][\da-fA-F]+|^0[0-7]*|^\d+/(input))) { Looks good otherwise. /be
Updated•14 years ago
|
Assignee: gal → pwalton
Assignee | ||
Comment 4•14 years ago
|
||
This patch replaces the lexer with a handwritten charAt()-based one, to avoid creating a new substring with every token (reducing lexing from quadratic to linear time). Parsing jQuery is 35x faster in Fx 3.6.4. Additionally, the "GLOBAL" alias has been removed, and the usage of "eval" has been lessened. These changes reduced TraceMonkey aborts on trunk and resulted in an additional 40% speedup in V8.
Attachment #426883 -
Attachment is obsolete: true
Attachment #444206 -
Flags: review?
Attachment #426883 -
Flags: review?(brendan)
Assignee | ||
Updated•14 years ago
|
Attachment #444206 -
Flags: review? → review?(gal)
Assignee | ||
Updated•14 years ago
|
Status: NEW → ASSIGNED
Reporter | ||
Comment 5•14 years ago
|
||
Comment on attachment 444206 [details] [diff] [review] Proposed patch. >+ if (this.scanNewlines && next.lineno != this.lineno) {} this >+ tt = NEWLINE; >+ else >+ tt = next.type; {} this too, easier to read. If there is an else block or a multi line consequence, always use {}. >+ } else { >+ tt = this.get(); >+ this.unget(); >+ } >+ return tt; >+ }, >+ >+ peekOnSameLine: function () { >+ this.scanNewlines = true; >+ var tt = this.peek(); >+ this.scanNewlines = false; >+ return tt; >+ }, >+ >+ // Eats comments and whitespace. >+ skip: function () { >+ var input = this.source; >+ for (;;) { >+ var ch = input[this.cursor++]; >+ var next = input[this.cursor]; >+ if (ch === '\n' && !this.scanNewlines) { >+ this.lineno++; >+ } else if (ch === '/' && next === '*') { >+ this.cursor++; >+ for (;;) { >+ ch = input[this.cursor++]; >+ if (ch === undefined) { >+ throw this.newSyntaxError("Unterminated comment"); >+ } else if (ch === '*') { >+ next = input[this.cursor]; >+ if (next === '/') { >+ this.cursor++; >+ break; >+ } >+ } else if (ch === '\n') { >+ this.lineno++; >+ } >+ } >+ } else if (ch === '/' && next === '/') { >+ this.cursor++; >+ for (;;) { >+ ch = input[this.cursor++]; >+ if (ch === undefined) { >+ return; >+ } else if (ch === '\n') { >+ this.lineno++; >+ break; >+ } >+ } >+ } else if (ch !== ' ' && ch !== '\t') { >+ this.cursor--; >+ return; >+ } >+ } >+ }, >+ >+ // Lexes the exponential part of a number, if present. Returns true iff an >+ // exponential part was found. >+ lexExponent: function() { >+ var input = this.source; >+ var next = input[this.cursor]; >+ if (next === 'e' || next === 'E') { >+ this.cursor++; >+ ch = input[this.cursor++]; >+ if (ch === '+' || ch === '-') >+ ch = input[this.cursor++]; >+ >+ if (ch < '0' || ch > '9') >+ throw this.newSyntaxError("Missing exponent"); >+ >+ do >+ ch = input[this.cursor++]; {} here too. >+ while (ch >= '0' && ch <= '9'); >+ this.cursor--; >+ >+ return true; >+ } >+ >+ return false; >+ }, >+ >+ lexZeroNumber: function (ch) { >+ var token = this.token, input = this.source; >+ token.type = NUMBER; >+ >+ ch = input[this.cursor++]; >+ if (ch === '.') { >+ do {} >+ ch = input[this.cursor++]; >+ while (ch >= '0' && ch <= '9'); >+ this.cursor--; >+ >+ this.lexExponent(); >+ token.value = parseFloat(token.start, this.cursor); >+ } else if (ch === 'x' || ch === 'X') { >+ do >+ ch = input[this.cursor++]; {} >+ while ((ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') || >+ (ch >= 'A' && ch <= 'F')); ; in the next line to make it clear its a no-body loop; >+ this.cursor--; >+ >+ token.value = parseInt(input.substring(token.start, >+ this.cursor)); indent >+ } else if (ch >= '0' && ch <= '7') { >+ do >+ ch = input[this.cursor++]; {} >+ while (ch >= '0' && ch <= '7'); >+ this.cursor--; >+ >+ token.value = parseInt(input.substring(token.start, >+ this.cursor)); indent >+ } else { >+ this.cursor--; >+ this.lexExponent(); // 0E1, &c. >+ token.value = 0; >+ } >+ }, >+ >+ lexNumber: function (ch) { >+ var token = this.token, input = this.source; >+ token.type = NUMBER; >+ >+ var floating = false; >+ do { >+ ch = input[this.cursor++]; >+ if (ch === '.' && !floating) { >+ floating = true; >+ ch = input[this.cursor++]; >+ } >+ } while (ch >= '0' && ch <= '9'); >+ >+ this.cursor--; >+ >+ var exponent = this.lexExponent(); >+ floating = floating || exponent; >+ >+ var str = input.substring(token.start, this.cursor); >+ if (floating) {} >+ token.value = parseFloat(str); >+ else {} >+ token.value = parseInt(str); >+ }, >+ >+ lexDot: function (ch) { >+ var token = this.token, input = this.source; >+ var next = input[this.cursor]; >+ if (next >= '0' && next <= '9') { >+ do dito >+ ch = input[this.cursor++]; >+ while (ch >= '0' && ch <= '9'); >+ this.cursor--; >+ >+ this.lexExponent(); >+ >+ token.type = NUMBER; >+ token.value = parseFloat(token.start, this.cursor); >+ } else { >+ token.type = DOT; >+ token.assignOp = null; >+ token.value = '.'; >+ } >+ }, >+ >+ lexString: function (ch) { >+ var token = this.token, input = this.source; >+ token.type = STRING; >+ >+ var hasEscapes = false; >+ var delim = ch; >+ ch = input[this.cursor++]; >+ while (ch !== delim) { >+ if (ch === '\\') { >+ hasEscapes = true; >+ this.cursor++; >+ } >+ ch = input[this.cursor++]; >+ } >+ >+ if (hasEscapes) { >+ var str = input.substring(token.start, this.cursor); >+ token.value = eval(str); >+ } else { >+ token.value = input.substring(token.start + 1, this.cursor - 1); >+ } >+ }, >+ >+ lexRegExp: function (ch) { >+ var token = this.token, input = this.source; >+ token.type = REGEXP; >+ >+ do { >+ ch = input[this.cursor++]; >+ if (ch === '\\') { >+ this.cursor++; >+ } else if (ch === '[') { >+ do { >+ if (ch === undefined) { >+ throw this.newSyntaxError("Unterminated " + >+ "character class"); I would make this one line. >+ } else if (ch === '\\') { >+ this.cursor++; >+ } >+ >+ ch = input[this.cursor++]; >+ } while (ch !== ']'); >+ } else if (ch === undefined) { >+ throw this.newSyntaxError("Unterminated regex"); >+ } >+ } while (ch !== '/'); >+ >+ do >+ ch = input[this.cursor++]; {} >+ while (ch >= 'a' && ch <= 'z'); >+ >+ this.cursor--; >+ >+ token.value = eval(input.substring(token.start, this.cursor)); >+ }, >+ >+ lexOp: function (ch) { >+ var token = this.token, input = this.source; >+ >+ // A bit ugly, but it seems wasteful to write a trie lookup routine >+ // for only 3 characters... >+ var node = opTokens[ch]; >+ var next = input[this.cursor]; >+ if (next in node) { >+ node = node[next]; >+ this.cursor++; >+ next = input[this.cursor]; >+ if (next in node) { >+ node = node[next]; >+ this.cursor++; >+ next = input[this.cursor]; >+ } >+ } >+ >+ var op = node.op; >+ if (assignOps[op] && input[this.cursor] === '=') { >+ this.cursor++; >+ token.type = ASSIGN; >+ token.assignOp = tokenIds[opTypeNames[op]]; >+ op += '='; >+ } else { >+ token.type = tokenIds[opTypeNames[op]]; >+ if (this.scanOperand) { >+ switch (token.type) { >+ case PLUS: token.type = UNARY_PLUS; break; >+ case MINUS: token.type = UNARY_MINUS; break; indent case by 2 (half-indent, don't ask, its a brendan thing). >+ } >+ } >+ >+ token.assignOp = null; >+ } >+ >+ token.value = op; >+ }, >+ >+ // FIXME: Unicode escape sequences >+ // FIXME: Unicode identifiers >+ lexIdent: function (ch) { >+ var token = this.token, input = this.source; >+ >+ do { >+ ch = input[this.cursor++]; >+ } while ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || >+ (ch >= '0' && ch <= '9') || ch === '$' || ch === '_'); indent >+ >+ this.cursor--; // put the non-word character back >+ >+ var id = input.substring(token.start, this.cursor); >+ token.type = keywords[id] || IDENTIFIER; >+ token.value = id; >+ }, >+ >+ get: function () { >+ var token; >+ while (this.lookahead) { >+ --this.lookahead; >+ this.tokenIndex = (this.tokenIndex + 1) & 3; >+ token = this.tokens[this.tokenIndex]; >+ if (token.type != NEWLINE || this.scanNewlines) >+ return token.type; >+ } >+ >+ this.skip(); >+ >+ this.tokenIndex = (this.tokenIndex + 1) & 3; >+ token = this.tokens[this.tokenIndex]; >+ if (!token) >+ this.tokens[this.tokenIndex] = token = {}; >+ >+ var input = this.source; >+ if (this.cursor === input.length) >+ return token.type = END; >+ >+ token.start = this.cursor; >+ token.lineno = this.lineno; >+ >+ var ch = input[this.cursor++]; >+ if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || >+ ch === '$' || ch === '_') { >+ this.lexIdent(ch); >+ } else if (this.scanOperand && ch === '/') { >+ this.lexRegExp(ch); >+ } else if (ch in opTokens) { >+ this.lexOp(ch); >+ } else if (ch === '.') { >+ this.lexDot(ch); >+ } else if (ch >= '1' && ch <= '9') { >+ this.lexNumber(ch); >+ } else if (ch === '0') { >+ this.lexZeroNumber(ch); >+ } else if (ch === '"' || ch === "'") { >+ this.lexString(ch); >+ } else if (this.scanNewlines && ch === '\n') { >+ token.type = NEWLINE; >+ token.value = '\n'; >+ this.lineno++; >+ } else { >+ throw this.newSyntaxError("Illegal token"); >+ } >+ >+ token.end = this.cursor; >+ return token.type; >+ }, >+ >+ unget: function () { >+ if (++this.lookahead == 4) throw "PANIC: too much lookahead!"; >+ this.tokenIndex = (this.tokenIndex - 1) & 3; >+ }, >+ >+ newSyntaxError: function (m) { >+ var e = new SyntaxError(m, this.filename, this.lineno); >+ e.source = this.source; >+ e.cursor = this.cursor; >+ return e; >+ } >+}; >+ >diff --git a/js/narcissus/jsparse.js b/js/narcissus/jsparse.js >--- a/js/narcissus/jsparse.js >+++ b/js/narcissus/jsparse.js >@@ -38,179 +38,9 @@ > /* > * Narcissus - JS implemented in JS. > * >- * Lexical scanner and parser. >+ * Parser. > */ > >-// Build a regexp that recognizes operators and punctuators (except newline). >-var opRegExpSrc = "^"; >-for (i in opTypeNames) { >- if (i == '\n') >- continue; >- if (opRegExpSrc != "^") >- opRegExpSrc += "|^"; >- opRegExpSrc += i.replace(/[?|^&(){}\[\]+\-*\/\.]/g, "\\$&"); >-} >-var opRegExp = new RegExp(opRegExpSrc); >- >-// A regexp to match floating point literals (but not integer literals). >-var fpRegExp = /^\d+\.\d*(?:[eE][-+]?\d+)?|^\d+(?:\.\d*)?[eE][-+]?\d+|^\.\d+(?:[eE][-+]?\d+)?/; >- >-// A regexp to match regexp literals. >-var reRegExp = /^\/((?:\\.|\[(?:\\.|[^\]])*\]|[^\/])+)\/([gimy]*)/; >- >-function Tokenizer(s, f, l) { >- this.cursor = 0; >- this.source = String(s); >- this.tokens = []; >- this.tokenIndex = 0; >- this.lookahead = 0; >- this.scanNewlines = false; >- this.scanOperand = true; >- this.filename = f || ""; >- this.lineno = l || 1; >-} >- >-Tokenizer.prototype = { >- get input() { >- return this.source.substring(this.cursor); >- }, >- >- get done() { >- return this.peek() == END; >- }, >- >- get token() { >- return this.tokens[this.tokenIndex]; >- }, >- >- match: function (tt) { >- return this.get() == tt || this.unget(); >- }, >- >- mustMatch: function (tt) { >- if (!this.match(tt)) >- throw this.newSyntaxError("Missing " + tokens[tt].toLowerCase()); >- return this.token; >- }, >- >- peek: function () { >- var tt, next; >- if (this.lookahead) { >- next = this.tokens[(this.tokenIndex + this.lookahead) & 3]; >- if (this.scanNewlines && next.lineno != this.lineno) >- tt = NEWLINE; >- else >- tt = next.type; >- } else { >- tt = this.get(); >- this.unget(); >- } >- return tt; >- }, >- >- peekOnSameLine: function () { >- this.scanNewlines = true; >- var tt = this.peek(); >- this.scanNewlines = false; >- return tt; >- }, >- >- get: function () { >- var token; >- while (this.lookahead) { >- --this.lookahead; >- this.tokenIndex = (this.tokenIndex + 1) & 3; >- token = this.tokens[this.tokenIndex]; >- if (token.type != NEWLINE || this.scanNewlines) >- return token.type; >- } >- >- for (;;) { >- var input = this.input; >- var match = (this.scanNewlines ? /^[ \t]+/ : /^\s+/)(input); >- if (match) { >- var spaces = match[0]; >- this.cursor += spaces.length; >- var newlines = spaces.match(/\n/g); >- if (newlines) >- this.lineno += newlines.length; >- input = this.input; >- } >- >- if (!(match = /^\/(?:\*(?:.|\n)*?\*\/|\/.*)/(input))) >- break; >- var comment = match[0]; >- this.cursor += comment.length; >- newlines = comment.match(/\n/g); >- if (newlines) >- this.lineno += newlines.length >- } >- >- this.tokenIndex = (this.tokenIndex + 1) & 3; >- token = this.tokens[this.tokenIndex]; >- if (!token) >- this.tokens[this.tokenIndex] = token = {}; >- >- if (!input) >- return token.type = END; >- >- if ((match = fpRegExp(input))) { >- token.type = NUMBER; >- token.value = parseFloat(match[0]); >- } else if ((match = /^0[xX][\da-fA-F]+|^0[0-7]*|^\d+/(input))) { >- token.type = NUMBER; >- token.value = parseInt(match[0]); >- } else if ((match = /^[$_\w]+/(input))) { // FIXME no ES3 unicode >- var id = match[0]; >- token.type = keywords[id] || IDENTIFIER; >- token.value = id; >- } else if ((match = /^"(?:\\.|[^"])*"|^'(?:\\.|[^'])*'/(input))) { //"){ >- token.type = STRING; >- token.value = eval(match[0]); >- } else if (this.scanOperand && (match = reRegExp(input))) { >- token.type = REGEXP; >- token.value = new RegExp(match[1], match[2]); >- } else if ((match = opRegExp(input))) { >- var op = match[0]; >- if (assignOps[op] && input[op.length] == '=') { >- token.type = ASSIGN; >- token.assignOp = GLOBAL[opTypeNames[op]]; >- match[0] += '='; >- } else { >- token.type = GLOBAL[opTypeNames[op]]; >- if (this.scanOperand && >- (token.type == PLUS || token.type == MINUS)) { >- token.type += UNARY_PLUS - PLUS; >- } >- token.assignOp = null; >- } >- token.value = op; >- } else if (this.scanNewlines && (match = /^\n/(input))) { >- token.type = NEWLINE; >- } else { >- throw this.newSyntaxError("Illegal token"); >- } >- >- token.start = this.cursor; >- this.cursor += match[0].length; >- token.end = this.cursor; >- token.lineno = this.lineno; >- return token.type; >- }, >- >- unget: function () { >- if (++this.lookahead == 4) throw "PANIC: too much lookahead!"; >- this.tokenIndex = (this.tokenIndex - 1) & 3; >- }, >- >- newSyntaxError: function (m) { >- var e = new SyntaxError(m, this.filename, this.lineno); >- e.source = this.source; >- e.cursor = this.cursor; >- return e; >- } >-}; >- > function CompilerContext(inFunction) { > this.inFunction = inFunction; > this.stmtStack = []; >@@ -690,7 +520,7 @@ > > // Map operator type code to precedence. > for (i in opPrecedence) >- opPrecedence[GLOBAL[i]] = opPrecedence[i]; >+ opPrecedence[tokenIds[i]] = opPrecedence[i]; > > var opArity = { > COMMA: -2, >@@ -715,7 +545,7 @@ > > // Map operator type code to arity. > for (i in opArity) >- opArity[GLOBAL[i]] = opArity[i]; >+ opArity[tokenIds[i]] = opArity[i]; > > function Expression(t, x, stop) { > var n, id, tt, operators = [], operands = [];
Attachment #444206 -
Flags: review?(gal) → review+
Comment 6•14 years ago
|
||
Comment on attachment 444206 [details] [diff] [review] Proposed patch. >+ if (this.scanNewlines && next.lineno != this.lineno) >+ tt = NEWLINE; >+ else >+ tt = next.type; I see no need to overbrace here, but we prefer to common the "tt = " like so: tt = (...) ? NEWLINE : next.type; >+ if (ch === undefined) { >+ throw this.newSyntaxError("Unterminated comment"); >+ } else if (ch === '*') { else after throw non-sequitur. >+ if (ch === undefined) { >+ return; >+ } else if (ch === '\n') { else after return non-sequitur. >+ if (hasEscapes) { >+ var str = input.substring(token.start, this.cursor); >+ token.value = eval(str); >+ } else { >+ token.value = input.substring(token.start + 1, this.cursor - 1); Long enough, and the var str compounds the issue, that using ?: to common the "token.value = " may not be worth it, but it could be. Multiline ?: expressions split with the ? and : stacked under the ( before the condition. Why the str temporary? >+ if (ch === undefined) { >+ throw this.newSyntaxError("Unterminated " + >+ "character class"); >+ } else if (ch === '\\') { else after throw non-sequitur. >+ this.cursor--; >+ >+ token.value = eval(input.substring(token.start, this.cursor)); No str temp here, grist for my mill above ;-). >+ switch (token.type) { >+ case PLUS: token.type = UNARY_PLUS; break; >+ case MINUS: token.type = UNARY_MINUS; break; Half-indent (two spaces) the case labels; likewise default: and "goto" (break and continue) labels. Andreas caught the do-while braceless issue -- we follow K&R, where IIRC Kernighan pointed out that do-while should always be braced even if it fits on three lines total, because otherwise the while is too easily misread as the start of a while loop. /be
Assignee | ||
Comment 7•14 years ago
|
||
And here's a second version of the patch with the style fixes.
Attachment #444206 -
Attachment is obsolete: true
Assignee | ||
Updated•14 years ago
|
Attachment #445212 -
Flags: review?(gal)
Comment 8•14 years ago
|
||
Comment on attachment 445212 [details] [diff] [review] Proposed patch, version 2. >+ var lineBreak = next.lineno != this.lineno; >+ tt = (this.scanNewlines && lineBreak) ? NEWLINE : next.type; Single-use temporaries seem unnecessary, and here it defeats short-circuiting when !this.scanNewlines. >+ token.value = (floating) ? parseFloat(str) : parseInt(str); Style uber-nit: no need to paren ?: condition if it's a primary or member expression. /be
Comment 9•14 years ago
|
||
So glad to see this work happen, for the big speedup and (who cares about source lines? regexps are just unclear) clarity win. Thanks, Patrick! /be
Assignee | ||
Comment 10•14 years ago
|
||
Style issues addressed. (BTW, I'm assuming Narcissus source is 80 characters wide based on the vim modeline.)
Attachment #445212 -
Attachment is obsolete: true
Attachment #445231 -
Flags: review?(brendan)
Attachment #445212 -
Flags: review?(gal)
Comment 11•14 years ago
|
||
Comment on attachment 445231 [details] [diff] [review] Proposed patch, version 3. Sure, thanks -- no need for re-review on small nit-picks like this. /be
Attachment #445231 -
Flags: review?(brendan) → review+
Assignee | ||
Updated•14 years ago
|
Attachment #445231 -
Flags: superreview?
Reporter | ||
Comment 12•14 years ago
|
||
Comment on attachment 445231 [details] [diff] [review] Proposed patch, version 3. No superreview required for this.
Attachment #445231 -
Flags: superreview?
Comment 13•14 years ago
|
||
Comment on attachment 445231 [details] [diff] [review] Proposed patch, version 3. No need for superreview in js/{src,narcissus}. /be
Assignee | ||
Comment 14•14 years ago
|
||
Pushed as changeset 0fa1b00e3a68.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•