Experiment with a binary AST format

(Reporter: Yoric, Assigned: Yoric)

Let's run an experiment to find out whether we could come up with a binary format for delivering JS source code that would support faster parsing.
Assignee: nobody → dteller
A few random ideas that could make the binary format more efficient:
- extracting string constants, making sure that there is a single canonical representation for them;
- extracting all identifiers, representing them by numbers.

A big question: what about toSource()?
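The two ideas above (a single canonical string table, identifiers as numbers) can be sketched with a small interning routine. This is purely illustrative, not the actual format; all names here are made up:

```python
def intern(table, index, value):
    """Return the canonical numeric id for `value`, adding it if new."""
    if value not in index:
        index[value] = len(table)
        table.append(value)
    return index[value]

table, index = [], {}
# A repeated identifier costs one table entry; each use is just a number.
ids = [intern(table, index, name) for name in ["foo", "bar", "foo", "foo"]]
# ids == [0, 1, 0, 0]; table == ["foo", "bar"]
```

Every occurrence of an identifier then becomes a small integer, which both deduplicates storage and gives the parser a single canonical representation to atomize.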
Out of curiosity, which aspects of parsing take the most time?  I have heard "scope analysis" in the past.

Not to shoot this suggestion down (all speedups are great), just remember that unless there's a very close correspondence between the token streams of source JS and the binary ast format you'll effectively have *two* programming languages to support instead of one.  Almost every optimization / simplification performed by the translator will cause the introduction in the AST language of some phrase that almost, but not quite, corresponds to JS semantics (the new phrase is usually more general); people programming directly in the AST language will then be able to create programs that could not have been expressed in JS, which can be a challenge for the consumer.  Avoiding this trap while getting worthwhile speedups is probably somewhat difficult.  And the AST language has to be checked for syntax and semantics when it's parsed, too.
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #0)
> Let's run an experiment to find out whether we could come up with a binary
> format for delivering JS source code and that would support faster parsing.

I think looking at WebAssembly structure would be a good starting point, as they share similar goals, in terms of network size and in terms of getting to the interpretation/compilation quickly.

If this format is delivering JS source code, I guess we would ideally prefer to have a 1-to-1 mapping with JS.  Maybe we could limit that to a subset of JS semantics instead (no "with" keyword, for example, or statically typed)?

Side note: one idea which was raised 5 years ago was about not having a bytecode interpreter as we have today, but an AST interpreter, to avoid the BytecodeEncoder.  If we want to add a new format, this might be the next question in line: should we change our interpreter too?
(In reply to Lars T Hansen [:lth] from comment #2)
> Out of curiosity, which aspects of parsing take the most time?  I have heard
> "scope analysis" in the past.

Good question. I haven't seen anything that looks like scope analysis in profiles but I haven't checked closely.

> Not to shoot this suggestion down (all speedups are great), just remember
> that unless there's a very close correspondence between the token streams of
> source JS and the binary ast format you'll effectively have *two*
> programming languages to support instead of one.  Almost every optimization
> / simplification performed by the translator will cause the introduction in
> the AST language of some phrase that almost, but not quite, corresponds to
> JS semantics (the new phrase is usually more general); people programming
> directly in the AST language will then be able to create programs that could
> not have been expressed in JS, which can be a challenge for the consumer. 
> Avoiding this trap while getting worthwhile speedups is probably somewhat
> difficult.  And the AST language has to be checked for syntax and semantics
> when it's parsed, too.

Good point. For the moment, the idea is just to start experimenting, though. As suggested by nbp, I believe that a subset of JS would be interesting, rather than a superset.

(In reply to Nicolas B. Pierron [:nbp] from comment #3)
> (In reply to David Teller [:Yoric] (please use "needinfo") from comment #0)
> > Let's run an experiment to find out whether we could come up with a binary
> > format for delivering JS source code and that would support faster parsing.
> 
> I think looking at WebAssembly structure would be a good starting point, as
> they share similar goals, in terms of network size and in terms of getting
> to the interpretation/compilation quickly.

I think the wasm format of basic-block trees of stack bytecode is not comparable to the goal here: we need a surface structure. I agree with lth that it is in fact rather unclear what the net speedups are, given the required verification. It is also simply true that anything that does not preserve a 1-1 correspondence between JS text source and a binary format is a nonstarter, due to standardization alone.

> 
> If this format is delivering JS source code, I guess we would ideally prefer
> to have a 1-to-1 mapping with JS.  Maybe we could limit that to a subset of
> JS semantic instead (no with keyword for example / statically typed)?

I strongly, strongly prefer not subsetting and tacking on different semantics. I will also veto any kind of typing pipe dream right now. This should be a delivery format only.

> 
> Side note, one idea which was raised 5 years ago, was about not having a
> bytecode interpreter as we have today, but having an AST interpreter to
> avoid the BytecodeEncoder.  If we want to add a new format this might be the
> next question in row, should we change our interpreter too?

Not right now, no, that'd be too much scope creep. It might open new opportunities though, for sure.

(In reply to Lars T Hansen [:lth] from comment #2)
> Out of curiosity, which aspects of parsing take the most time?  I have heard
> "scope analysis" in the past.

I'm not sure I can point to any particular component that is the obvious culprit. Tokenizing is slow, parsing + bytecode compilation as 2 passes isn't fast either. Some of the present things that are slow will remain slow, I believe, even in a binary format world: name analysis of binding names, atomizing strings, etc.

> 
> Not to shoot this suggestion down (all speedups are great), just remember
> that unless there's a very close correspondence between the token streams of
> source JS and the binary ast format you'll effectively have *two*
> programming languages to support instead of one.  Almost every optimization
> / simplification performed by the translator will cause the introduction in
> the AST language of some phrase that almost, but not quite, corresponds to
> JS semantics (the new phrase is usually more general); people programming
> directly in the AST language will then be able to create programs that could
> not have been expressed in JS, which can be a challenge for the consumer. 
> Avoiding this trap while getting worthwhile speedups is probably somewhat
> difficult.

This is an astute point. I am confident that any format that requires spec authors to spec both a text and a binary grammar is a nonstarter. 1-1 correspondence to spec grammar is a diamond-hard constraint.

> And the AST language has to be checked for syntax and semantics
> when it's parsed, too.

Yeah, this is one of the things we need to think through. The verification needs to be single-pass and, ideally, piggy-backed onto bytecode generation in an understandable way.
I was going to comment that it would probably be best if this were about producing an *algorithm* for mechanically translating the grammar productions of the ECMAScript spec into optimized notation.  That way we won't ever have two diverging languages with distinct syntaxes.

But then I thought a little bit more about it, and I'm not sure that's quite the case, at least not if we look to history.  Consider, for example, that ES3 had all manner of *NoIn grammar productions in it, which all went away in the Great Grammar Parametrization of ES6.  A sort of braindead translation of grammar productions to numeric constants for embedding in a compact binary format would probably have quickly broken on that.  It's possible to *imagine* that that could be papered over.  But it seems doubtful an automatic translation process could ever have handled that in an anticipatory manner.

Which is a long way of saying that, despite my thinking this might be possible with minimal invasion into the spec process when I first heard about it late last year, I'm not sure this will be feasible long-term without it becoming an additional constraint on spec evolution.  A constraint along a very different axis from most syntax-related constraints, but a constraint nonetheless.

Anyway.  Two cents, not stop energy, just that this is probably more complicated than I'd originally thought it might be.
You're right, we need to be on the lookout for future-proofing any format we design.

A few additional ideas, btw. With a carefully-crafted format, we could achieve better laziness. We could for instance have a table of symbols, with offsets that let us jump to the definition of each symbol without even having to look at the code during a first pass. That's assuming that we can come up with a different semantics for SyntaxError for binary-formatted code.
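A minimal sketch of that symbol-table idea (illustrative only; the real layout is undecided here): a header of (offset, length) pairs per symbol, followed by the concatenated encoded bodies, so a consumer can jump straight to one definition without scanning the rest.

```python
def build_symbol_table(symbols):
    """symbols: list of (name, encoded_body) pairs.
    Returns a {name: (offset, length)} table plus the concatenated bodies."""
    blob = b""
    table = {}
    for name, body in symbols:
        table[name] = (len(blob), len(body))
        blob += body
    return table, blob

table, blob = build_symbol_table([("init", b"\x01\x02"),
                                  ("main", b"\x03\x04\x05")])
# Jump directly to "main" without looking at "init" at all.
offset, length = table["main"]
assert blob[offset:offset + length] == b"\x03\x04\x05"
```

The names and the exact pair layout are hypothetical; the point is that a first pass only reads the table, never the code.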
Created attachment 8859329 [details]
Size change

Attaching a few numbers that show how the size currently changes when going from the source format to the binary format. Numbers are taken from Facebook's snapshot test (file names are strings of random chars, so not meaningful). This is prior to any optimization of integer lengths / variable-length encoding. All numbers are taken *after* gzip.
Created attachment 8859331 [details]
Parsing times

These are the duration numbers for parsing our snapshot of Facebook's JS code. I have attempted to measure only the creation of the AST. The code hasn't been optimized yet; in particular it checks a lot of useless headers/footers and performs a lot of useless string copies.
Attachment #8859324 - Attachment is obsolete: true
Attachment #8859325 - Attachment is obsolete: true
Attachment #8859327 - Attachment is obsolete: true
Comment on attachment 8859329 [details]
Size change

With the latest patch, I get the following results:

- Facebook: *1.05
- Webpacked Devtools: *0.47
Attachment #8859329 - Attachment is obsolete: true
Great work Yoric.  Good to see us getting to parity on the size issue.  What are the major updates you did to get to that?
With the latest patch, I get the following result:

- Facebook: *0.96
- Webpacked Devtools: *0.38

The list of changes is in the commit messages, but basically:
- using variable length numbers almost everywhere;
- removing some redundant data;
- atoms now use null-terminated strings (with a special constant \0\0 for "no atom") instead of a present/absent flag + a length-prefixed string;
- atoms are now extracted to a table of strings and are represented inline as a variable length identifier;
- taking advantage of variable length string identifiers to ensure that the most common strings are represented with a single byte.

Also, the table of strings is designed so that we can either read it lazily or start extracting atoms before the buffer has been received in its entirety.

And I'm currently out of ideas for file size :)
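Two of the items above, variable-length numbers and one-byte representations for the most common strings, can be sketched with a LEB128-style encoder. This is an assumption for illustration; the patch's exact encoding isn't shown in this bug:

```python
def encode_varint(n):
    """Encode an unsigned integer LEB128-style: 7 payload bits per byte,
    high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit
        else:
            out.append(byte)
            return bytes(out)

# String-table indices below 128 fit in a single byte, so the most common
# atoms stay compact; rarer ones spill into two or more bytes.
assert encode_varint(5) == b"\x05"
assert encode_varint(300) == b"\xac\x02"
```

If the string table is sorted by frequency, the hottest atoms land at the low indices and thus cost one byte per occurrence.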
Nice, nice, nice.  This is awesome!  Overall I think the "proof of concept" aspect of this is basically complete.

As some in Canada have been known to say, "A proof is a proof. And when you have a good proof, it's because it's proven."
I have just implemented an approximation of lazy parsing, which cuts functions beyond a given depth.

We don't really have the data structures to store lazy parsing data, so for the sake of benchmarking, I just heap-allocate some data to represent the cost of storing this lazy parsing data.

With this patch, we get the following speedups on the Facebook source code snapshot (2.5 MB gzipped), measuring the AST creation phase (which takes ~500-800 ms from source):
* no skipping: *0.31
* skipping all nested functions: *0.12
* skipping all functions: *0.04

All measurements are done in an optimized build.

So, the question is whether these gains are sufficient to proceed.
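The "cut functions beyond a given depth" approximation can be modeled with a toy tree walk; a function deeper than the threshold is skipped wholesale, standing in for a body left unparsed until first call. Names and the tree shape are illustrative, not the real data structures:

```python
def parse(node, depth, max_depth, parsed):
    """Toy depth-limited parse: `node` is (name, children).
    Records which functions were fully parsed; the rest are skipped."""
    name, children = node
    if depth > max_depth:
        return  # body skipped; in the real engine it stays lazy
    parsed.append(name)
    for child in children:
        parse(child, depth + 1, max_depth, parsed)

tree = ("top", [("outer", [("inner", [])])])
parsed = []
parse(tree, 0, max_depth=1, parsed=parsed)
# "inner" (depth 2) was skipped.
```

In the real format, skipping requires knowing the encoded length of each function body up front, which is what the lazy-parsing data mentioned above would store.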
I have done some quick experiments to see if lazy loading of strings would be useful, but it doesn't seem to affect performance on Facebook, one way or the other.
> * no skipping: *0.31
> * skipping all nested functions: *0.12
> * skipping all functions: *0.04

I take these numbers to mean that with no skipping, we're 3x faster.  With nested function skipping we're ~8x faster?
> I take these numbers to mean that with no skipping, we're 3x faster.  With nested function skipping we're ~8x faster?

Exactly. On the subset we're measuring, of course.
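For readers following along: the reported ratios are fractions of the baseline parse time, so the speedup factor is simply the reciprocal (a rough reading of the numbers above):

```python
# Each ratio is new_time / baseline_time, so speedup = 1 / ratio.
ratios = {"no skipping": 0.31,
          "skip nested functions": 0.12,
          "skip all functions": 0.04}
speedups = {k: round(1 / v, 1) for k, v in ratios.items()}
# {'no skipping': 3.2, 'skip nested functions': 8.3, 'skip all functions': 25.0}
```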
I realized that there was still debugging code on the hot path, so I removed it, improving the speed a bit further. I have also implemented a pseudo syntax-parse, which doesn't allocate anything.

Finally, I have also attempted to optimize parsing of atoms, which doesn't seem to have any measurable effect.

Updated numbers vs syntax parse/vs full parse, using the usual Facebook snapshot:

* no skipping: *0.35/*0.29;
* skipping all nested functions: *0.14/*0.12;
* skipping all functions: *0.08/*0.06


And the same without allocating nodes:

* no skipping: *0.28/*0.23;
* skipping all nested functions: *0.12/*0.10;
* skipping all functions: *0.08/*0.06


I'm a bit disappointed that removing allocations doesn't improve performance all that much, but profiling seems to confirm that memory allocations have little to no impact on the total runtime. This might be an artifact of the fact that I'm performing each benchmark with a fresh jsshell.
By the way, there are several performance gains that we haven't discussed and that I cannot measure for the time being:

* as this is a binary format, we don't need to transcode first;
* this format is designed so that we can start parsing without having received the entire data;
* should we decide to head in this direction, we could possibly hand the parsing of specific subtrees to worker threads.
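The second point can be sketched as follows. Length-prefixed entries are an assumption used here for simplicity, not the actual format; the property that matters is that entries are self-delimiting, so a decoder can consume whatever prefix of the buffer has arrived and resume later:

```python
def decode_available(buf):
    """Decode as many complete length-prefixed entries as `buf` contains.
    Returns the entries plus the number of bytes consumed, so decoding
    can resume when the next network chunk arrives."""
    entries, pos = [], 0
    while pos < len(buf):
        length = buf[pos]
        if pos + 1 + length > len(buf):
            break  # entry still incomplete; wait for more data
        entries.append(bytes(buf[pos + 1:pos + 1 + length]))
        pos += 1 + length
    return entries, pos

# First network chunk: one complete entry ("ab") and a truncated one.
entries, consumed = decode_available(b"\x02ab\x03xy")
assert entries == [b"ab"] and consumed == 3
```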
Depends on: 1377007
Depends on: 1409815
Depends on: 1409840
Depends on: 1417850
Depends on: 1417851
Depends on: 1417852
Depends on: 1417853
Depends on: 1417854
Depends on: 1417858
Depends on: 1417859
Depends on: 1417886
See Also: → bug 1420263
Depends on: 1439855
Depends on: 1437004
Priority: -- → P1
Depends on: 1444956
Depends on: 1455547
Priority: P1 → P2
Depends on: 1459555
Depends on: 1460805
Depends on: 1495611
Depends on: 1498101
Depends on: 1499044
Depends on: 1500853
Depends on: 1504595
Depends on: 1505267
Depends on: 1505269
Depends on: 1519302