Closed Bug 638034 Opened 13 years ago Closed 13 years ago

Make scanning safer

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 1 obsolete file)

This patch makes scanning safer in various ways.

- Removes TSF_NEWLINES.  I don't like having a token kind that is produced or
  not depending on a flag.  In particular, it makes ungetting and regetting
  tokens weird.  Also, getting the begin/end position of a TOK_EOL token right
  is tricky because EOL sequences can be one or two chars, and by the time you
  call newToken() you're on a new line, which is unlike other token kinds.

  Really, what we want is ancillary notification if a newline was hit while
  getting a token, rather than returning a weird special token.  So I added
  the TSF_EOL flag, which is for exactly that.  It's queried by
  peekTokenSameLine(), which returns TOK_EOL if it occurs.
  peekTokenSameLine() is now the only place that produces TOK_EOL.  getToken()
  no longer needs to check for TOK_EOL when regetting a token.

  This patch also adds a no-flags variant of matchToken(), which helped avoid
  any slowdown.

- Encapsulates TokenBuf.  This makes TokenStream's code easier to read,
  because there aren't any direct pointer manipulations.  Also, moving
  findEOL() into TokenBuf simplifies it -- there's no need to restore
  TokenStream state like 'lineno'.

- Fixes problems with reporting errors in bad multi-line tokens, whereby
  reportCompileErrorNumberVA() used a mixture of (a) state from the token,
  and (b) state from TokenStream, which weren't necessarily in sync and
  could lead to eg. the index being wrong.  Eg. for this code:

    x = "foo\
    bar\
    baz

  we'd produce this output:

    b.js:3: SyntaxError: unterminated string literal:
    b.js:3: baz
    b.js:3: ....^

  The caret refers to the index of the first line, but TokenStream's state
  at error-reporting time is on the third line, which can lead to problems
  of the sort that required bug 634444 to be backed out.

  With the patch, if the token line and the TokenStream current state line
  don't match (and this includes all cases where the erroneous token is
  multi-line) we don't try to print context;  it's a rare case and not worth
  the pain.  This simplifies the use of 'index' later in the function.

- Adds IsTokenSane(), which is called on every generated token.  Suggestions
  for extending it are welcome.

- Fixes the 'adjust' factor in newToken() for EOF.

- Initializes tp->pos.end.index for error tokens.  Previously it was left as
  garbage, which makes me very nervous.  Erroneous tokens now have begin/end
  positions that make them look like 1-char tokens, which seems a reasonable
  fallback.

- Poisons userbuf after an erroneous token occurs.  This establishes an
  invariant that userbuf isn't accessed again.  This is good to know, since
  we may have abandoned scanning mid-token and so the userbuf state is no
  longer reliable.

- Fixes the context shown for these bogus lines:

    var x = 0x
    var x = 3e

  Previously the context showed a blank line instead of the line containing
  the bad token.
Attachment #516210 - Flags: review?(brendan)
Blocks: 634444
Just like the first patch, but with a couple of unnecessary |#include "jsscan.h"| lines removed.  One consequence is that JaegerMonkey now isn't rebuilt every time jsscan.h changes :)
Attachment #516210 - Attachment is obsolete: true
Attachment #516210 - Flags: review?(brendan)
Attachment #516409 - Flags: review?(brendan)
Blocks: 634800
Blocks: 636654
Blocks: 639420
Comment on attachment 516409 [details] [diff] [review]
patch v2 (against 62858:f85299200bd3)

Great work. You are reforming some of the oldest code in the engine. I wrote what became jsscan.cpp in May 1995, it was among the first source files created (mo_scan.c, if memory serves -- mo_ for Mocha, following Netscape 8.3-constrained filename conventions). TSF_NEWLINES was one of my hasty hacks from that create-JS-in-ten-days sprint. Glad to see it go.

Nits only below, apart from the const issue.

>-    userbuf.base = (jschar *)base;
>-    userbuf.limit = (jschar *)base + length;
>-    userbuf.ptr = (jschar *)base;

Must we cast away const? From skimming it seems to me you are *this* close to being able to constipate everything. It's scary to cast away const (enabling writes to readonly memory).

>     if (c == '\n') {
> #ifdef DEBUG
>-        int32 c2 = *userbuf.ptr;
>-        JS_ASSERT(c2 == '\n' || c2 == '\r' || c2 == LINE_SEPARATOR || c2 == PARA_SEPARATOR);
>+        int32 c2 = userbuf.peekRawChar();
>+        JS_ASSERT(TokenBuf::isRawEOLChar(c2));
> #endif
>-        if (userbuf.ptr > userbuf.base && *(userbuf.ptr - 1) == '\r')
>-            userbuf.ptr--;          /* also unget the \r in a \r\n sequence */

Blank line here.

>+        /* if it's a \r\n sequence, also unget the \r */
>+        if (!userbuf.atStart() && userbuf.matchRawCharBackwards('\r'))
>         JS_ASSERT(prevLinebase);    /* we should never get more than one EOL char */

Indentation off here.

>+    jschar *tmp = ptr ? ptr : ptrOnPoisoning;

Somehow "OnPoisoning" rubs me the wrong way -- "WhenPoisoned"? Not sure, something about the -ing implying ongoing tense...

>+#else
>+    jschar *tmp = ptr;
>+#endif

Could always declare jschar *tmp = ptr; once before #ifdef DEBUG that does if (!ptr) ptr = ...;

>+    tp = pn
>+       ? &pn->pn_pos
>+       : (TokenPos *)&currentToken().pos;

Prevailing style wants ? and : underhanging first char of condition, not = here -- but it all fits on one line and seems readable enough (modulo that cast).

About that cast: that is telling you to change tp to a const TokenPos *.

>+    /*
>+     * Given a token, T, that we want to complain about:  if T's (starting)

I am stifling complaints about English spacing (two spaces after full stop) but there's no reason for two spaces after : or (later in the patch) ; in any style manual I've read.

>+bool
>+IsTokenSane(Token *tp) 

Brilliant sanity checking (but kill the trailing space on the last line shown above, and grep for more like it to trim).

>                 if (!JS7_ISDEC(c)) {
>+                    ungetChar(c);

Ancient bug, perhaps one of the oldest. Seeing back almost to the big bang! Thanks.

>   error:
>-    tt = TOK_ERROR;
>+    /*
>+     * For erroneous multi-line tokens we won't have changed end.lineno (it'll
>+     * still be equal to begin.lineno) so we revert end.index to be equal to
>+     * begin.index + 1 (as if it's a 1-char token) to avoid having 
..12345678901234567890123456789012345678901234567890123456789012345678901234567890
>+     * inconsistent begin/end positions.  end.index isn't used in error
>+     * messages anyway.

"inconsistent" fits on the previous line and then the comment wraps more prettily (I have a sense for this!).

>+#ifdef DEBUG
>+    /*
>+     * Poisoning userbuf on error establishes an invariant:  once an erroneous
>+     * token has been seen, userbuf will not be consulted again.  This is true
>+     * because the parser will either (a) deal with the TOK_ERROR token by
>+     * aborting parsing immediately;  or (b) if the TOK_ERROR token doesn't

Space doubling after : and ; purple alert!

>+     * match what it expected, it will unget the token, and the next
..12345678901234567890123456789012345678901234567890123456789012345678901234567890
>+     * getToken() call will immediately return the just-gotten TOK_ERROR token
>+     * again without consulting userbuf, thanks to the lookahead buffer.

"getToken()" fits on previous line even with the non-obligatory "()" -- rewrap.

>+/* Unicode separators that are treated as line terminators, in addition to \n, \r */
>+#define LINE_SEPARATOR  0x2028
>+#define PARA_SEPARATOR  0x2029

May as well make these enum members in the closest type-scope? Macros are so 1995.

>     TokenKind peekTokenSameLine(uintN withFlags = 0) {
>-        Flagger flagger(this, withFlags);
>         if (!onCurrentLine(currentToken().pos))
>             return TOK_EOL;
>-        TokenKind tt = peekToken(TSF_NEWLINES);
>+
>+        if (lookahead != 0) {
>+            JS_ASSERT(lookahead == 1);
>+            return tokens[(cursor + lookahead) & ntokensMask].type;
>+        }

Blank line here.

>+        /*
>+         * This is the only place TOK_EOL is produced.  No token with TOK_EOL
>+         * is created, just a TOK_EOL TokenKind is returned.
>+         */
>+        flags &= ~TSF_EOL;
>+        TokenKind tt = getToken(withFlags);
>+        if (flags & TSF_EOL) {
>+            tt = TOK_EOL;
>+            flags &= ~TSF_EOL;
>+        }
>+        ungetToken();
>+

Kill this blank line, or at least move it up one.

Many thanks again. Beautiful work, I am getting verklempt!

/be
Attachment #516409 - Flags: review?(brendan) → review+
Pushed with all nits fixed, including the extra const decls which were pretty easy -- all the pointers in TokenBuf are now to const, as is Token::ptr.

http://hg.mozilla.org/tracemonkey/rev/3035bb782013
Whiteboard: fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/3035bb782013
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: