Closed
Bug 638034
Opened 13 years ago
Closed 13 years ago
Make scanning safer
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
People
(Reporter: n.nethercote, Assigned: n.nethercote)
References
Details
(Whiteboard: fixed-in-tracemonkey)
Attachments
(1 file, 1 obsolete file)
28.48 KB,
patch
|
brendan
:
review+
|
Details | Diff | Splinter Review |
This patch makes scanning safer in various ways. - Removes TSF_NEWLINES. I don't like having a token kind that is produced or not depending on a flag. In particular, it makes ungetting and regetting tokens weird. Also, getting the begin/end position of a TOK_EOL token right is tricky because EOL sequences can be one or two chars, and by the time you call newToken() you're on a new line, which is unlike other token kinds. Really, what we want is ancillary notification if a newline was hit while getting a token, rather than returning a weird special token. So I added the TSF_EOL flag, which is for exactly that. It's queried by peekTokenSameLine(), which returns TOK_EOL if it occurs. peekTokenSameLine() is now the only place that produces TOK_EOL. getToken() no longer needs to check for TOK_EOL when regetting a token. This patch also adds a no-flags variant of matchToken(), which helped avoid any slowdown. - Encapsulates TokenBuf. This makes TokenStream's code easier to read, because there aren't any direct pointer manipulations. Also, moving findEOL() into TokenBuf simplifies it -- there's no need to restore TokenStream state like 'lineno'. - Fixes problems with reporting errors in bad multi-line tokens, whereby reportCompileErrorNumberVA() used a mixture of (a) state from the token, and (b) state from TokenStream, which weren't necessarily in sync and could lead to eg. the index being wrong. Eg. for this code: x = "foo\ bar\ baz we'd produce this output: b.js:3: SyntaxError: unterminated string literal: b.js:3: baz b.js:3: ....^ The caret refers to the index of the first line, but TokenStream's state at error-reporting time is on the third line, which can lead to problems of the sort that required bug 634444 to be backed out. With the patch, if the token line and the TokenStream current state line don't match (and this includes all cases where the erroneous token is multi-line) we don't try to print context; it's a rare case and not worth the pain. This simplifies the use of 'index' later in the function. - Adds IsTokenSane(), which is called on every generated token. Suggestions for extending it are welcome. - Fixes the 'adjust' factor in newToken() for EOF. - Initializes tp->pos.end.index for error tokens. Previously it was left as garbage, which makes me very nervous. Erroneous tokens now have begin/end positions that make them look like 1-char tokens, which seems a reasonable fallback. - Poisons userbuf after an erroneous token occurs. This establishes an invariant that userbuf isn't accessed again. This is good to know, since we may have abandoned scanning mid-token and so the userbuf state is no longer reliable. - Fixes the context shown for these bogus lines: var x = 0x var x = 3e Previously the context showed a blank line instead of the line containing the bad token.
Attachment #516210 -
Flags: review?(brendan)
Assignee | ||
Comment 1•13 years ago
|
||
Just like the first patch, but with a couple of unnecessary |#include "jsscan.h"| lines removed. One consequence is that JaegerMonkey now isn't rebuilt every time jsscan.h changes :)
Attachment #516210 -
Attachment is obsolete: true
Attachment #516210 -
Flags: review?(brendan)
Attachment #516409 -
Flags: review?(brendan)
Comment 2•13 years ago
|
||
Comment on attachment 516409 [details] [diff] [review] patch v2 (against 62858:f85299200bd3) Great work. You are reforming some of the oldest code in the engine. I wrote what became jsscan.cpp in May 1995, it was among the first source files created (mo_scan.c, if memory serves -- mo_ for Mocha, following Netscape 8.3-constrained filename conventions). TSF_NEWLINES was one of my hasty hacks from that create-JS-in-ten-days sprint. Glad to see it go. Nits only below, apart from the const issue. >- userbuf.base = (jschar *)base; >- userbuf.limit = (jschar *)base + length; >- userbuf.ptr = (jschar *)base; Must we cast away const? From skimming it seems to me you are *this* close to being able to constipate everything. It's scary to cast away const (enabling writes to readonly memory). > if (c == '\n') { > #ifdef DEBUG >- int32 c2 = *userbuf.ptr; >- JS_ASSERT(c2 == '\n' || c2 == '\r' || c2 == LINE_SEPARATOR || c2 == PARA_SEPARATOR); >+ int32 c2 = userbuf.peekRawChar(); >+ JS_ASSERT(TokenBuf::isRawEOLChar(c2)); > #endif >- if (userbuf.ptr > userbuf.base && *(userbuf.ptr - 1) == '\r') >- userbuf.ptr--; /* also unget the \r in a \r\n sequence */ Blank line here. >+ /* if it's a \r\n sequence, also unget the \r */ >+ if (!userbuf.atStart() && userbuf.matchRawCharBackwards('\r')) > JS_ASSERT(prevLinebase); /* we should never get more than one EOL char */ Indentation off here. >+ jschar *tmp = ptr ? ptr : ptrOnPoisoning; Somehow "OnPoisoning" rubs me the wrong way -- "WhenPoisoned"? Not sure, something about the -ing implying ongoing tense... >+#else >+ jschar *tmp = ptr; >+#endif Could always declare jschar *tmp = ptr; once before #ifdef DEBUG that does if (!ptr) ptr = ...; >+ tp = pn >+ ? &pn->pn_pos >+ : (TokenPos *)¤tToken().pos; Prevailing style wants ? and : underhanging first char of condition, not = here -- but it all fits on one line and seems readable enough (modulo that cast). About that cast: that is telling you to change tp to a const TokenPos *. >+ /* >+ * Given a token, T, that we want to complain about: if T's (starting) I am stifling complaints about English spacing (two spaces after full stop) but there's no reason for two spaces after : or (later in the patch) ; in any style manual I've read. >+bool >+IsTokenSane(Token *tp) Brilliant sanity checking (but kill the trailing space on the last line shown above, and grep for more like it to trim). > if (!JS7_ISDEC(c)) { >+ ungetChar(c); Ancient bug, perhaps one of the oldest. Seeing back almost to the big bang! Thanks. > error: >- tt = TOK_ERROR; >+ /* >+ * For erroneous multi-line tokens we won't have changed end.lineno (it'll >+ * still be equal to begin.lineno) so we revert end.index to be equal to >+ * begin.index + 1 (as if it's a 1-char token) to avoid having ..12345678901234567890123456789012345678901234567890123456789012345678901234567890 >+ * inconsistent begin/end positions. end.index isn't used in error >+ * messages anyway. "inconsistent" fits on the previous line and then the comment wraps more prettily (I have a sense for this!). >+#ifdef DEBUG >+ /* >+ * Poisoning userbuf on error establishes an invariant: once an erroneous >+ * token has been seen, userbuf will not be consulted again. This is true >+ * because the parser will either (a) deal with the TOK_ERROR token by >+ * aborting parsing immediately; or (b) if the TOK_ERROR token doesn't Space doubling after : and ; purple alert! >+ * match what it expected, it will unget the token, and the next ..12345678901234567890123456789012345678901234567890123456789012345678901234567890 >+ * getToken() call will immediately return the just-gotten TOK_ERROR token >+ * again without consulting userbuf, thanks to the lookahead buffer. "getToken()" fits on previous line even with the non-obligatory "()" -- rewrap. >+/* Unicode separators that are treated as line terminators, in addition to \n, \r */ >+#define LINE_SEPARATOR 0x2028 >+#define PARA_SEPARATOR 0x2029 May as well make these enum members in the closest type-scope? Macros are so 1995. > TokenKind peekTokenSameLine(uintN withFlags = 0) { >- Flagger flagger(this, withFlags); > if (!onCurrentLine(currentToken().pos)) > return TOK_EOL; >- TokenKind tt = peekToken(TSF_NEWLINES); >+ >+ if (lookahead != 0) { >+ JS_ASSERT(lookahead == 1); >+ return tokens[(cursor + lookahead) & ntokensMask].type; >+ } Blank line here. >+ /* >+ * This is the only place TOK_EOL is produced. No token with TOK_EOL >+ * is created, just a TOK_EOL TokenKind is returned. >+ */ >+ flags &= ~TSF_EOL; >+ TokenKind tt = getToken(withFlags); >+ if (flags & TSF_EOL) { >+ tt = TOK_EOL; >+ flags &= ~TSF_EOL; >+ } >+ ungetToken(); >+ Kill this blank line, or at least move it up one. Many thanks again. Beautiful work, I am getting verklempt! /be
Attachment #516409 -
Flags: review?(brendan) → review+
Assignee | ||
Comment 3•13 years ago
|
||
Pushed with all nits fixed, including the extra const decls which were pretty easy -- all the pointers in TokenBuf are now to const, as is Token::ptr. http://hg.mozilla.org/tracemonkey/rev/3035bb782013
Whiteboard: fixed-in-tracemonkey
Comment 4•13 years ago
|
||
http://hg.mozilla.org/mozilla-central/rev/3035bb782013
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•