Last Comment Bug 747831 - Record buffer index in ParseNodes
: Record buffer index in ParseNodes
Status: RESOLVED FIXED
[MemShrink:P2]
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla22
Assigned To: Nicholas Nethercote [:njn]
:
: Jason Orendorff [:jorendorff]
Mentors:
: 748201 (view as bug list)
Depends on: 842438 854137 CVE-2013-5619
Blocks: 626932
  Show dependency treegraph
 
Reported: 2012-04-22 22:32 PDT by Nicholas Nethercote [:njn]
Modified: 2013-09-23 14:47 PDT (History)
9 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch (49.78 KB, patch)
2012-04-26 19:05 PDT, Nicholas Nethercote [:njn]
no flags Details | Diff | Splinter Review
patch v2 (2.86 KB, patch)
2012-04-26 21:43 PDT, Nicholas Nethercote [:njn]
no flags Details | Diff | Splinter Review
(DRAFT) Record buffer index in ParseNodes. (54.12 KB, patch)
2013-03-09 02:35 PST, Nicholas Nethercote [:njn]
no flags Details | Diff | Splinter Review
Record buffer index in ParseNodes instead of line number and column index. (62.94 KB, patch)
2013-03-11 17:59 PDT, Nicholas Nethercote [:njn]
jorendorff: review+
Details | Diff | Splinter Review

Description Nicholas Nethercote [:njn] 2012-04-22 22:32:59 PDT
In order to implement decompilation and lazy bytecode compilation, we need to record the source code for functions.  I.e. given a ParseNode for a function, we need to be able to copy the source code chars from the original buffer.  That doesn't sound difficult.

However, currently the beginning and end of each ParseNode is recorded with a {lineno, index} pair, where |index| is the char index in the line specified by |lineno|.  This is fine for error message but no good for copying chars from the original buffer.

There's a partial patch in bug 626932 that started towards replacing |lineno| and |index| with a single offset into the original buffer, which will make this easy.  (It will also shrink ParseNode if I play my cards right, which is a good thing because we can have 10s of MBs of them, albeit short-lived.)  I will try to resurrect this patch.
Comment 1 Nicholas Nethercote [:njn] 2012-04-26 19:05:30 PDT
Created attachment 618894 [details] [diff] [review]
patch

This patch:

- Changes TokenPtr to just be a uint32_t, which is an index into the source
  code buffer.

- Added TokenStream::SourceCoords, a class for mapping between buffer
  indices and line/column numbers.

Consequences:

- Having to look up a table to get a line/column number is obviously more
  complex than just having it attached to the Token/ParseNode.  But there is
  a complexity win, which is that we don't have the possibility of
  inconsistent lineno/index values in a TokenPtr;  the existing code sets
  these in different places (e.g. for multi-line tokens).  This simplifies
  things in a couple of places.

- I carefully optimized the lookups.  Cachegrind tells me the increase in
  instructions executed on SunSpider/V8/Kraken is negligible (i.e. less than
  1 part in 1000).  When parsing and byte-code compiling parsemark the
  increase is about 1%, which is higher than I'd like.

Things I'm uncertain about:

- I haven't handled //@line comments.  For SourceCoords lookups to be fast 
  it relies heavily on line numbers increasing monotonically, and
  //@line comments can break that.  (There's an "njn" comment
  in the relevant part of the patch about this.)

- I haven't done anything w.r.t. source maps.  They're not implemented fully
  enough for me to tell how my changes interact with them.

- I used 1-indexing for the column numbers, because that seemed obviously
  right, and it's what vim uses, GCC uses, etc.  But yesterday I learnt that
  Reflect.parse and emacs both use 0-indexed column numbers (but 1-indexed
  line numbers, of course).  Sigh.
Comment 2 Nicholas Nethercote [:njn] 2012-04-26 21:43:32 PDT
Created attachment 618926 [details] [diff] [review]
patch v2

This version inlines the crucial isOnThisLine() function.  Perf on parsemark -- a much tougher test than SS/V8/Kraken because there's no code execution time to dilute costs added during parsing/bytecode-compilation -- is now unchanged on average:

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | compiled (may overestimate)  |
---------------------------------------------------------------
|    16.534    16.614 (0.995x) |         0         0 (------) | 280slides-comb
|    65.168    65.131 (1.001x) |         0         0 (------) | ball-pool-comb
|    33.581    33.457 (1.004x) |         0         0 (------) | dojo
|    60.508    60.720 (0.997x) |         0         0 (------) | effectgames-ga
|    15.534    15.495 (1.003x) |         0         0 (------) | ga
|   354.401   353.013 (1.004x) |         0         0 (------) | gmail-combined
|   100.657   100.465 (1.002x) |         0         0 (------) | gravity-combin
|    22.697    22.634 (1.003x) |         0         0 (------) | jquery-min
|    43.582    43.830 (0.994x) |         0         0 (------) | jsgb-combined
|    62.129    62.353 (0.996x) |         0         0 (------) | mochikit
|   310.067   308.866 (1.004x) |         0         0 (------) | pipio-combined
|    12.622    12.778 (0.988x) |         0         0 (------) | processing-twi
|    54.844    55.115 (0.995x) |         0         0 (------) | sunspider-comb
|    19.806    19.835 (0.999x) |         0         0 (------) | tetris-combine
|    76.899    76.980 (0.999x) |         0         0 (------) | twitter-combin
|    81.912    82.916 (0.988x) |         0         0 (------) | v8-combined
|     9.928     9.904 (1.002x) |         0         0 (------) | yui-min
|   466.954   467.718 (0.998x) |         0         0 (------) | zimbra-combine
-------
|  1807.832  1807.831 (------) |         0         0 (------) | all


So that neutralizes the performance concerns, which is good.  The //@line comments are still a problem.
Comment 3 Nicholas Nethercote [:njn] 2012-05-07 16:11:05 PDT
jimb: 19-day feedback ping.  This is blocking lazy bytecode generation and decompiler removal and so I'd like to keep it moving.  Thanks.
Comment 4 Nicholas Nethercote [:njn] 2012-05-28 16:39:40 PDT
Looks like I missed this feedback request's 1 month birthday, which was two days ago.  Sorry kiddo, I'm a bad father.
Comment 5 :Benjamin Peterson 2012-06-05 11:42:19 PDT
*** Bug 761723 has been marked as a duplicate of this bug. ***
Comment 6 :Benjamin Peterson 2012-06-05 11:45:11 PDT
Wouldn't it be easiest to just record the original buffer index at Parser::functionDef, stick it in funbox, and be done with it?
Comment 7 Nicholas Nethercote [:njn] 2012-06-05 16:44:45 PDT
(In reply to Benjamin Peterson from comment #6)
> Wouldn't it be easiest to just record the original buffer index at
> Parser::functionDef, stick it in funbox, and be done with it?

Hmm, good question.  Quite possibly.
Comment 8 :Benjamin Peterson 2012-06-07 23:47:15 PDT
(In reply to Nicholas Nethercote [:njn] from comment #7)
> (In reply to Benjamin Peterson from comment #6)
> > Wouldn't it be easiest to just record the original buffer index at
> > Parser::functionDef, stick it in funbox, and be done with it?
> 
> Hmm, good question.  Quite possibly.

FWIW, I'm taking this approach for bug 791723.
Comment 9 :Benjamin Peterson 2012-06-07 23:47:37 PDT
Uh, I meant 761723.
Comment 10 Nicholas Nethercote [:njn] 2012-06-20 21:06:11 PDT
> > Wouldn't it be easiest to just record the original buffer index at
> > Parser::functionDef, stick it in funbox, and be done with it?
> 
> Hmm, good question.  Quite possibly.

Yeah, that's much simpler and avoids all the difficulty with //@line comments and source maps.
Comment 11 Jim Blandy :jimb 2013-02-21 21:55:01 PST
As I said in bug 626932 comment 1: I love this idea.
Comment 12 Nicholas Nethercote [:njn] 2013-03-09 02:35:56 PST
Created attachment 723056 [details] [diff] [review]
(DRAFT) Record buffer index in ParseNodes.

Not yet ready for review, but here's a version that passes jit-tests and
jstests for Luke to play with.
Comment 13 Nicholas Nethercote [:njn] 2013-03-11 17:59:08 PDT
Created attachment 723735 [details] [diff] [review]
Record buffer index in ParseNodes instead of line number and column index.

Luke, are you happy to review this?  If not, I'll ask jorendorff.

This patch replaces Token's begin/end line number and column index with a
begin/end buffer index, and adds a SourceCoords table on the side that allows
a line/column lookup from the buffer index.

On Parsemark it increase the number of instructions executed by 1.002x, but on
a big asm.js file it reduces it by 1.005x.  But I have some good parsing
speed-ups forthcoming in bug 849367 anyway.

Note that I fixed an incorrect |end| in the TOK_CONTINUE case in Parser.cpp.
This was filed as bug 748201, but it was easy to do it here.

One thing I found was that |tp->pos.end| was sometimes being read (in
reportCompileErrorNumberVA) before being set in getTokenInternal().  It was
just for an assertion, but I fixed it nonetheless, and added a Valgrind
annotation to catch similar problems in the future.
Comment 14 Luke Wagner [:luke] 2013-03-11 18:04:32 PDT
(In reply to Nicholas Nethercote [:njn] from comment #13)
> Luke, are you happy to review this?  If not, I'll ask jorendorff.

I'm pretty swamped in the short-term, so better to not.
Comment 15 Nicholas Nethercote [:njn] 2013-03-11 23:03:57 PDT
*** Bug 748201 has been marked as a duplicate of this bug. ***
Comment 16 Jason Orendorff [:jorendorff] 2013-03-18 20:46:33 PDT
Comment on attachment 723735 [details] [diff] [review]
Record buffer index in ParseNodes instead of line number and column index.

Review of attachment 723735 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/BytecodeEmitter.cpp
@@ +88,5 @@
>  
>  BytecodeEmitter::BytecodeEmitter(BytecodeEmitter *parent,
>                                   Parser<FullParseHandler> *parser, SharedContext *sc,
>                                   HandleScript script, HandleScript evalCaller, bool hasGlobalScope,
> +                                 unsigned lineNum, bool selfHostingMode)

uint32_t here too?

::: js/src/frontend/FullParseHandler.h
@@ +142,3 @@
>          return new_<ContinueStatement>(label, begin, end);
>      }
>      ParseNode *newDebuggerStatement(const TokenPos &pos) {

Hmm. You changed a bunch of places that take TokenPos arguments from const references to values, but not others (including this one). What was the rule?

::: js/src/frontend/TokenStream.cpp
@@ +115,5 @@
> +    // The first line begins at index 0.  MAX_PTR is the sentinel.
> +    // The appends cannot fail because |lineStartPtrs_| has
> +    // statically-allocated elements.
> +    JS_ASSERT(lineStartPtrs_.capacity() >= 2);
> +    (void)lineStartPtrs_.reserve(2);

Is this reserve() call necessary?

@@ +152,5 @@
> +
> +    if (lineStartPtrs_[lastLineIndex_] <= tptr) {
> +        // If we reach here, tptr is on a line the same as or higher than last
> +        // time.  Check first for the +0, +1, +2 cases, because they typically
> +        // cover 85--98% of cases.

Wow. This really improves performance? Huh.

@@ -410,4 @@
>  
>      // Poison other members.
>      cur->type = TOK_ERROR;
> -    cur->ptr = NULL;

Micro-nit: Plural "members" in the comment isn't true anymore.

@@ +622,5 @@
>      /*
>       * Given a token, T, that we want to complain about: if T's (starting)
>       * lineno doesn't match TokenStream's lineno, that means we've scanned past
>       * the line that T starts on, which makes it hard to print some or all of
>       * T's (starting) line for context.

Um. Is it still hard?

::: js/src/frontend/TokenStream.h
@@ +176,5 @@
>      return tt == TOK_VAR;
>  #endif
>  }
>  
> +typedef uint32_t TokenPtr;

Can we just replace this with uint32_t everywhere? That'll be less confusing to people reading the code later. (Especially anyone who doesn't see this patch go in and would assume that TokenPtr is still a struct.)

The things called "tptr" and such could just be called "offset".

@@ +622,5 @@
>       */
>      bool checkForKeyword(const jschar *s, size_t length, TokenKind *ttp, JSOp *topp);
>  
> +    // This class maps a userbuf index (which is 0-indexed) to a line number
> +    // (which is 1-indexed) and a column index (which is 0-indexed).

Great comment.
Comment 17 Nicholas Nethercote [:njn] 2013-03-19 12:25:45 PDT
> Hmm. You changed a bunch of places that take TokenPos arguments from const
> references to values, but not others (including this one). What was the rule?

Hmm... I didn't mean to change those.  I must have mixed them up with TokenPtr arguments.  I'll revert the changes.


> Is this reserve() call necessary?

Yep.  If you call infallibleAppend() without a prior reserve() you get assertion failures.


> > +    if (lineStartPtrs_[lastLineIndex_] <= tptr) {
> > +        // If we reach here, tptr is on a line the same as or higher than last
> > +        // time.  Check first for the +0, +1, +2 cases, because they typically
> > +        // cover 85--98% of cases.
> 
> Wow. This really improves performance? Huh.

Not by a huge amount, but it does -- I just remeasured several configurations, and checking for +0, +1, +2 was the best.


> @@ +622,5 @@
> >      /*
> >       * Given a token, T, that we want to complain about: if T's (starting)
> >       * lineno doesn't match TokenStream's lineno, that means we've scanned past
> >       * the line that T starts on, which makes it hard to print some or all of
> >       * T's (starting) line for context.
> 
> Um. Is it still hard?

Probably not.  Want to file a follow-up?


> Can we just replace this with uint32_t everywhere? That'll be less confusing
> to people reading the code later. (Especially anyone who doesn't see this
> patch go in and would assume that TokenPtr is still a struct.)
> 
> The things called "tptr" and such could just be called "offset".

Ok.
Comment 18 Nicholas Nethercote [:njn] 2013-03-19 12:39:13 PDT
Landed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d1b71de5bbc1

jorendorff, in bug 846406 you modified peekTokenSameLine() in a way that made things difficult for this patch.  I changed it back to what it was previously.  The tests still pass (https://tbpl.mozilla.org/?tree=Try&rev=fcf77bb5b425) so I figured it was ok.
Comment 19 Ryan VanderMeulen [:RyanVM] 2013-03-19 13:09:55 PDT
(In reply to Nicholas Nethercote [:njn] from comment #18)
> Landed:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/d1b71de5bbc1

Backed out for Windows bustage.
https://hg.mozilla.org/integration/mozilla-inbound/rev/cad5306d569e

https://tbpl.mozilla.org/php/getParsedLog.php?id=20839159&tree=Mozilla-Inbound

TokenStream.obj : error LNK2005: "private: static unsigned int const js::frontend::TokenStream::SourceCoords::MAX_PTR" (?MAX_PTR@SourceCoords@TokenStream@frontend@js@@0IB) already defined in jsapi.obj
Comment 20 Nicholas Nethercote [:njn] 2013-03-19 17:48:45 PDT
Second attempt.  My build-on-all-platforms try run was green.

https://hg.mozilla.org/integration/mozilla-inbound/rev/1f3587e02361
Comment 21 Ed Morley [:emorley] 2013-03-20 04:51:40 PDT
https://hg.mozilla.org/mozilla-central/rev/1f3587e02361

Note You need to log in before you can comment on or make changes to this bug.