Closed Bug 747831 Opened 12 years ago Closed 11 years ago

Record buffer index in ParseNodes

Categories

(Core :: JavaScript Engine, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla22

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Whiteboard: [MemShrink:P2])

Attachments

(2 files, 2 obsolete files)

In order to implement decompilation and lazy bytecode compilation, we need to record the source code for functions.  I.e. given a ParseNode for a function, we need to be able to copy the source code chars from the original buffer.  That doesn't sound difficult.

However, currently the beginning and end of each ParseNode are recorded as {lineno, index} pairs, where |index| is the char index in the line specified by |lineno|.  This is fine for error messages but no good for copying chars from the original buffer.

There's a partial patch in bug 626932 that started towards replacing |lineno| and |index| with a single offset into the original buffer, which will make this easy.  (It will also shrink ParseNode if I play my cards right, which is a good thing because we can have 10s of MBs of them, albeit short-lived.)  I will try to resurrect this patch.
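
To make the idea concrete, here is a minimal sketch of the change in
representation; the names are illustrative only, not the actual SpiderMonkey
definitions:

  // Before: each end of a Token/ParseNode position carries a line/index pair.
  struct OldTokenPtr {
      uint32_t lineno;   // 1-indexed line number
      uint32_t index;    // char index within that line
  };

  // After: a single 0-indexed offset into the original source buffer.
  // Line/column numbers are recovered on demand via a side table
  // (see the SourceCoords sketch further down).
  typedef uint32_t NewTokenPtr;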
Attached patch patch (obsolete) — Splinter Review
This patch:

- Changes TokenPtr to just be a uint32_t, which is an index into the source
  code buffer.

- Adds TokenStream::SourceCoords, a class for mapping between buffer
  indices and line/column numbers (a rough sketch of the idea appears at the
  end of this comment).

Consequences:

- Having to consult a table to get a line/column number is obviously more
  complex than having it attached directly to the Token/ParseNode.  But there
  is also a complexity win: inconsistent lineno/index values in a TokenPtr are
  no longer possible; the existing code sets these in different places
  (e.g. for multi-line tokens).  This simplifies things in a couple of places.

- I carefully optimized the lookups.  Cachegrind tells me the increase in
  instructions executed on SunSpider/V8/Kraken is negligible (i.e. less than
  1 part in 1000).  When parsing and byte-code compiling parsemark the
  increase is about 1%, which is higher than I'd like.

Things I'm uncertain about:

- I haven't handled //@line comments.  Fast SourceCoords lookups rely
  heavily on line numbers increasing monotonically, and //@line comments
  can break that.  (There's an "njn" comment in the relevant part of the
  patch about this.)

- I haven't done anything w.r.t. source maps.  They're not implemented fully
  enough for me to tell how my changes interact with them.

- I used 1-indexing for the column numbers, because that seemed obviously
  right, and it's what vim uses, GCC uses, etc.  But yesterday I learnt that
  Reflect.parse and emacs both use 0-indexed column numbers (but 1-indexed
  line numbers, of course).  Sigh.
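
Here is a rough, self-contained sketch of the SourceCoords lookup strategy
described above (a table of line-start offsets, a last-line cache with
+0/+1/+2 checks, and a binary-search fallback).  It uses std::vector and
made-up names; the real class uses SpiderMonkey's Vector and differs in
detail:

  #include <cstdint>
  #include <vector>

  class SourceCoordsSketch {
      // lineStarts_[i] is the buffer offset at which line (i + 1) begins.
      // Offsets are 0-indexed; line numbers are 1-indexed.
      std::vector<uint32_t> lineStarts_;
      size_t lastLineIndex_;   // index of the line found by the previous lookup

      bool isOnThisLine(size_t lineIndex, uint32_t offset) const {
          uint32_t start = lineStarts_[lineIndex];
          uint32_t end = (lineIndex + 1 < lineStarts_.size())
                             ? lineStarts_[lineIndex + 1]
                             : UINT32_MAX;   // last line runs to the buffer end
          return start <= offset && offset < end;
      }

    public:
      SourceCoordsSketch() : lineStarts_(1, 0), lastLineIndex_(0) {}

      // The tokenizer calls this with the offset of each new line's first char.
      // Offsets must arrive in increasing order -- which is exactly what
      // //@line comments can violate.
      void addLineStart(uint32_t offset) { lineStarts_.push_back(offset); }

      // Map a buffer offset to a 1-indexed line number.
      uint32_t lineNumFromOffset(uint32_t offset) {
          // Fast path: most lookups hit the same line as last time or one of
          // the next couple of lines, so check +0, +1, +2 before searching.
          for (size_t i = lastLineIndex_;
               i < lastLineIndex_ + 3 && i < lineStarts_.size(); i++) {
              if (isOnThisLine(i, offset)) {
                  lastLineIndex_ = i;
                  return uint32_t(i + 1);
              }
          }
          // Slow path: binary search for the last line start <= offset.
          size_t lo = 0, hi = lineStarts_.size();
          while (hi - lo > 1) {
              size_t mid = lo + (hi - lo) / 2;
              if (lineStarts_[mid] <= offset)
                  lo = mid;
              else
                  hi = mid;
          }
          lastLineIndex_ = lo;
          return uint32_t(lo + 1);
      }

      // Map a buffer offset to a 0-indexed column within its line.
      uint32_t columnFromOffset(uint32_t offset) {
          return offset - lineStarts_[lineNumFromOffset(offset) - 1];
      }
  };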
Attachment #618894 - Flags: feedback?(jimb)
Attached patch patch v2Splinter Review
This version inlines the crucial isOnThisLine() function.  Perf on parsemark -- a much tougher test than SS/V8/Kraken because there's no code execution time to dilute costs added during parsing/bytecode-compilation -- is now unchanged on average:

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | compiled (may overestimate)  |
---------------------------------------------------------------
|    16.534    16.614 (0.995x) |         0         0 (------) | 280slides-comb
|    65.168    65.131 (1.001x) |         0         0 (------) | ball-pool-comb
|    33.581    33.457 (1.004x) |         0         0 (------) | dojo
|    60.508    60.720 (0.997x) |         0         0 (------) | effectgames-ga
|    15.534    15.495 (1.003x) |         0         0 (------) | ga
|   354.401   353.013 (1.004x) |         0         0 (------) | gmail-combined
|   100.657   100.465 (1.002x) |         0         0 (------) | gravity-combin
|    22.697    22.634 (1.003x) |         0         0 (------) | jquery-min
|    43.582    43.830 (0.994x) |         0         0 (------) | jsgb-combined
|    62.129    62.353 (0.996x) |         0         0 (------) | mochikit
|   310.067   308.866 (1.004x) |         0         0 (------) | pipio-combined
|    12.622    12.778 (0.988x) |         0         0 (------) | processing-twi
|    54.844    55.115 (0.995x) |         0         0 (------) | sunspider-comb
|    19.806    19.835 (0.999x) |         0         0 (------) | tetris-combine
|    76.899    76.980 (0.999x) |         0         0 (------) | twitter-combin
|    81.912    82.916 (0.988x) |         0         0 (------) | v8-combined
|     9.928     9.904 (1.002x) |         0         0 (------) | yui-min
|   466.954   467.718 (0.998x) |         0         0 (------) | zimbra-combine
-------
|  1807.832  1807.831 (------) |         0         0 (------) | all


So that neutralizes the performance concerns, which is good.  The //@line comments are still a problem.
Attachment #618894 - Attachment is obsolete: true
Attachment #618894 - Flags: feedback?(jimb)
Attachment #618926 - Flags: feedback?(jimb)
jimb: 19-day feedback ping.  This is blocking lazy bytecode generation and decompiler removal and so I'd like to keep it moving.  Thanks.
Looks like I missed this feedback request's 1 month birthday, which was two days ago.  Sorry kiddo, I'm a bad father.
Wouldn't it be easiest to just record the original buffer index at Parser::functionDef, stick it in funbox, and be done with it?
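
For illustration, the suggestion amounts to something like the following; the
field and accessor names here are hypothetical, not the real Parser/FunctionBox
API:

  // Hypothetical fields on the function box:
  struct FunctionBoxSketch {
      uint32_t bufStart;   // offset of the function's first source char
      uint32_t bufEnd;     // offset just past its last source char
  };

  // In the functionDef equivalent (pseudo-code, made-up accessor names):
  //   funbox->bufStart = currentBufferOffset();   // at the 'function' keyword
  //   ... parse the function body ...
  //   funbox->bufEnd = currentBufferOffset();     // just past the closing '}'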
Blocks: savesource
(In reply to Benjamin Peterson from comment #6)
> Wouldn't it be easiest to just record the original buffer index at
> Parser::functionDef, stick it in funbox, and be done with it?

Hmm, good question.  Quite possibly.
(In reply to Nicholas Nethercote [:njn] from comment #7)
> (In reply to Benjamin Peterson from comment #6)
> > Wouldn't it be easiest to just record the original buffer index at
> > Parser::functionDef, stick it in funbox, and be done with it?
> 
> Hmm, good question.  Quite possibly.

FWIW, I'm taking this approach for bug 791723.
Uh, I meant 761723.
Attachment #618926 - Flags: feedback?(jimb)
> > Wouldn't it be easiest to just record the original buffer index at
> > Parser::functionDef, stick it in funbox, and be done with it?
> 
> Hmm, good question.  Quite possibly.

Yeah, that's much simpler and avoids all the difficulty with //@line comments and source maps.
No longer blocks: LazyBytecode, 718969, savesource
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Depends on: 842438
As I said in bug 626932 comment 1: I love this idea.
Not yet ready for review, but here's a version that passes jit-tests and
jstests for Luke to play with.
Luke, are you happy to review this?  If not, I'll ask jorendorff.

This patch replaces Token's begin/end line number and column index with a
begin/end buffer index, and adds a SourceCoords table on the side that allows
a line/column lookup from the buffer index.

On Parsemark it increases the number of instructions executed by 1.002x, but on
a big asm.js file it reduces them by 1.005x.  But I have some good parsing
speed-ups forthcoming in bug 849367 anyway.

Note that I fixed an incorrect |end| in the TOK_CONTINUE case in Parser.cpp.
This was filed as bug 748201, but it was easy to do it here.

One thing I found was that |tp->pos.end| was sometimes being read (in
reportCompileErrorNumberVA) before being set in getTokenInternal().  It was
just for an assertion, but I fixed it nonetheless, and added a Valgrind
annotation to catch similar problems in the future.
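
The annotation is along these lines (a sketch, not the exact patch code);
marking the field as undefined makes Valgrind's Memcheck flag any read that
happens before the tokenizer writes it:

  #include <stdint.h>
  #include <valgrind/memcheck.h>

  struct TokenSketch {                  // illustrative, not the real Token
      struct { uint32_t begin, end; } pos;
  };

  static void poisonTokenEnd(TokenSketch *cur) {
      // Any read of cur->pos.end before it is set will now be reported by
      // Memcheck as a use of uninitialized memory.
      VALGRIND_MAKE_MEM_UNDEFINED(&cur->pos.end, sizeof(cur->pos.end));
  }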
Attachment #723735 - Flags: review?(luke)
Attachment #723056 - Attachment is obsolete: true
(In reply to Nicholas Nethercote [:njn] from comment #13)
> Luke, are you happy to review this?  If not, I'll ask jorendorff.

I'm pretty swamped in the short-term, so better to not.
Attachment #723735 - Flags: review?(luke) → review?(jorendorff)
Comment on attachment 723735 [details] [diff] [review]
Record buffer index in ParseNodes instead of line number and column index.

Review of attachment 723735 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/BytecodeEmitter.cpp
@@ +88,5 @@
>  
>  BytecodeEmitter::BytecodeEmitter(BytecodeEmitter *parent,
>                                   Parser<FullParseHandler> *parser, SharedContext *sc,
>                                   HandleScript script, HandleScript evalCaller, bool hasGlobalScope,
> +                                 unsigned lineNum, bool selfHostingMode)

uint32_t here too?

::: js/src/frontend/FullParseHandler.h
@@ +142,3 @@
>          return new_<ContinueStatement>(label, begin, end);
>      }
>      ParseNode *newDebuggerStatement(const TokenPos &pos) {

Hmm. You changed a bunch of places that take TokenPos arguments from const references to values, but not others (including this one). What was the rule?

::: js/src/frontend/TokenStream.cpp
@@ +115,5 @@
> +    // The first line begins at index 0.  MAX_PTR is the sentinel.
> +    // The appends cannot fail because |lineStartPtrs_| has
> +    // statically-allocated elements.
> +    JS_ASSERT(lineStartPtrs_.capacity() >= 2);
> +    (void)lineStartPtrs_.reserve(2);

Is this reserve() call necessary?

@@ +152,5 @@
> +
> +    if (lineStartPtrs_[lastLineIndex_] <= tptr) {
> +        // If we reach here, tptr is on a line the same as or higher than last
> +        // time.  Check first for the +0, +1, +2 cases, because they typically
> +        // cover 85--98% of cases.

Wow. This really improves performance? Huh.

@@ -410,4 @@
>  
>      // Poison other members.
>      cur->type = TOK_ERROR;
> -    cur->ptr = NULL;

Micro-nit: Plural "members" in the comment isn't true anymore.

@@ +622,5 @@
>      /*
>       * Given a token, T, that we want to complain about: if T's (starting)
>       * lineno doesn't match TokenStream's lineno, that means we've scanned past
>       * the line that T starts on, which makes it hard to print some or all of
>       * T's (starting) line for context.

Um. Is it still hard?

::: js/src/frontend/TokenStream.h
@@ +176,5 @@
>      return tt == TOK_VAR;
>  #endif
>  }
>  
> +typedef uint32_t TokenPtr;

Can we just replace this with uint32_t everywhere? That'll be less confusing to people reading the code later. (Especially anyone who doesn't see this patch go in and would assume that TokenPtr is still a struct.)

The things called "tptr" and such could just be called "offset".

@@ +622,5 @@
>       */
>      bool checkForKeyword(const jschar *s, size_t length, TokenKind *ttp, JSOp *topp);
>  
> +    // This class maps a userbuf index (which is 0-indexed) to a line number
> +    // (which is 1-indexed) and a column index (which is 0-indexed).

Great comment.
Attachment #723735 - Flags: review?(jorendorff) → review+
> Hmm. You changed a bunch of places that take TokenPos arguments from const
> references to values, but not others (including this one). What was the rule?

Hmm... I didn't mean to change those.  I must have mixed them up with TokenPtr arguments.  I'll revert the changes.


> Is this reserve() call necessary?

Yep.  If you call infallibleAppend() without a prior reserve() you get assertion failures.
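
In other words, the pattern has to look something like this (sketched here
with mozilla::Vector; per the above, the Vector used in the patch asserts the
same way):

  #include <stdint.h>
  #include "mozilla/Vector.h"

  static bool initLineTable(mozilla::Vector<uint32_t, 16> &lineStarts)
  {
      // infallibleAppend() asserts (in debug builds) that room was explicitly
      // reserved, so the reserve() call is required even though the inline
      // storage already has space for these two elements.
      if (!lineStarts.reserve(2))
          return false;
      lineStarts.infallibleAppend(0);            // line 1 starts at offset 0
      lineStarts.infallibleAppend(UINT32_MAX);   // sentinel (MAX_PTR in the patch)
      return true;
  }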


> > +    if (lineStartPtrs_[lastLineIndex_] <= tptr) {
> > +        // If we reach here, tptr is on a line the same as or higher than last
> > +        // time.  Check first for the +0, +1, +2 cases, because they typically
> > +        // cover 85--98% of cases.
> 
> Wow. This really improves performance? Huh.

Not by a huge amount, but it does -- I just remeasured several configurations, and checking for +0, +1, +2 was the best.


> @@ +622,5 @@
> >      /*
> >       * Given a token, T, that we want to complain about: if T's (starting)
> >       * lineno doesn't match TokenStream's lineno, that means we've scanned past
> >       * the line that T starts on, which makes it hard to print some or all of
> >       * T's (starting) line for context.
> 
> Um. Is it still hard?

Probably not.  Want to file a follow-up?


> Can we just replace this with uint32_t everywhere? That'll be less confusing
> to people reading the code later. (Especially anyone who doesn't see this
> patch go in and would assume that TokenPtr is still a struct.)
> 
> The things called "tptr" and such could just be called "offset".

Ok.
Landed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d1b71de5bbc1

jorendorff, in bug 846406 you modified peekTokenSameLine() in a way that made things difficult for this patch.  I changed it back to what it was previously.  The tests still pass (https://tbpl.mozilla.org/?tree=Try&rev=fcf77bb5b425) so I figured it was ok.
Whiteboard: [MemShrink]
(In reply to Nicholas Nethercote [:njn] from comment #18)
> Landed:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/d1b71de5bbc1

Backed out for Windows bustage.
https://hg.mozilla.org/integration/mozilla-inbound/rev/cad5306d569e

https://tbpl.mozilla.org/php/getParsedLog.php?id=20839159&tree=Mozilla-Inbound

TokenStream.obj : error LNK2005: "private: static unsigned int const js::frontend::TokenStream::SourceCoords::MAX_PTR" (?MAX_PTR@SourceCoords@TokenStream@frontend@js@@0IB) already defined in jsapi.obj
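
For reference, this class of MSVC error typically arises when a static const
member gets both an in-class initializer and an out-of-class definition in a
header, so every translation unit that includes the header defines the symbol.
A hypothetical reconstruction (not the actual TokenStream.h code):

  #include <stdint.h>

  // SourceCoordsExample.h, included by both TokenStream.cpp and jsapi.cpp:
  class SourceCoordsExample {
      static const uint32_t MAX_PTR = UINT32_MAX;   // in-class initializer: fine
  };
  const uint32_t SourceCoordsExample::MAX_PTR;      // defined in a header: LNK2005

  // Fix: keep only the in-class initializer in the header and put the single
  // out-of-class definition in exactly one .cpp file:
  //     // SourceCoordsExample.cpp
  //     const uint32_t SourceCoordsExample::MAX_PTR;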
Whiteboard: [MemShrink] → [MemShrink:P2]
Second attempt.  My build-on-all-platforms try run was green.

https://hg.mozilla.org/integration/mozilla-inbound/rev/1f3587e02361
https://hg.mozilla.org/mozilla-central/rev/1f3587e02361
Status: REOPENED → RESOLVED
Closed: 12 years ago11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
Depends on: CVE-2013-5619