Bug 964253 — Opened 11 years ago, Closed 11 years ago

Untangle TokenStream::getTokenInternal

Categories: Core :: JavaScript Engine, defect
Status: RESOLVED WONTFIX
People: Reporter: jandem; Assigned: jandem
Attachments: 8 files, 1 obsolete file

I was experimenting with some TokenStream::getTokenInternal optimizations, but the control flow is really complicated due to backwards gotos into the middle of other cases, etc. So I thought it'd be good to refactor the code a bit first.
Moves the code for scanning string literals into its own method, getStringToken.
Attachment #8365939 - Flags: review?(n.nethercote)
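For readers unfamiliar with the shape of this refactoring: the idea is to pull the string-scanning loop out of the big getTokenInternal() switch into its own method. The sketch below is a hypothetical, much-simplified stand-in (MiniLexer and its members are illustrative names, not SpiderMonkey code), showing the factored-out method's structure.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, simplified sketch of factoring string-literal scanning out of
// a monolithic tokenizer switch into its own method, in the spirit of the
// patch's getStringToken(). Names and error handling are illustrative only.
struct MiniLexer {
    std::string src;
    std::size_t pos = 0;

    // Returns the next source character, or -1 at end of input.
    int getChar() { return pos < src.size() ? (unsigned char)src[pos++] : -1; }

    // Scans a string literal body after the opening quote has been consumed.
    // Returns false on an unterminated literal.
    bool getStringToken(char quote, std::string* out) {
        out->clear();
        for (;;) {
            int c = getChar();
            if (c == -1)
                return false;           // unterminated literal
            if (c == quote)
                return true;            // closing quote: token complete
            if (c == '\\') {            // minimal escape handling for the sketch
                int e = getChar();
                if (e == -1)
                    return false;
                c = (e == 'n') ? '\n' : e;
            }
            out->push_back(char(c));
        }
    }
};
```

With this shape, the caller in the tokenizer's switch reduces to a single call per quote character, which is what makes the surrounding control flow easier to follow.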
This one was a bit nasty due to |goto decimal_dot;| and could use a careful review.
Attachment #8365940 - Flags: review?(n.nethercote)
Attachment #8365941 - Flags: review?(n.nethercote)
Attachment #8365942 - Flags: review?(n.nethercote)
Attachment #8365943 - Flags: review?(n.nethercote)
Factor out the regexp literal scanning. Also some minor "badchar" cleanup.
Attachment #8365945 - Flags: review?(n.nethercote)
Attached patch Part 7 - Miscellaneous changes (obsolete) — Splinter Review
Some minor changes. I saw TokenStream::getChar() show up in Instruments, so I added some JS_ALWAYS_INLINEs.
Attachment #8365949 - Flags: review?(n.nethercote)
njn, I flagged you as reviewer since you also touched getTokenInternal; feel free to reassign.

I tested Citadel's UDKGame-Browser-Shipping.js (23 MB) with the run() call commented out, with/without asm.js (without asm.js to ignore compilation time). Perf seems to be either the same or maybe a tiny bit faster with these patches.

Same for Parsemark and Octane-Codeload. Unfortunately all these benchmarks are pretty noisy :(
Oops, part 7 didn't apply cleanly; this one does.
Attachment #8365949 - Attachment is obsolete: true
Attachment #8365949 - Flags: review?(n.nethercote)
Attachment #8365951 - Flags: review?(n.nethercote)
Attached patch Combined patch — Splinter Review
Should apply to inbound revision b2b1b90f5082.

It'd be great if somebody could verify these patches don't affect performance. As I said, though, things look OK on OS X.
I have a script that runs the JS shell on parsemark (bug 839450), under Cachegrind. Here are the results for the combined patch:

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | compiled (may overestimate)  |
---------------------------------------------------------------
|    58.700    58.819 (0.998x) |     6.003     6.003 (------) | 280slides-comb
|    96.493    96.765 (0.997x) |    20.021    20.021 (------) | ball-pool-comb
|    85.716    85.950 (0.997x) |    11.192    11.192 (------) | dojo
|    97.969    98.333 (0.996x) |    23.664    23.664 (------) | effectgames-ga
|    48.730    48.864 (0.997x) |     3.743     3.743 (------) | ga
|   338.461   339.860 (0.996x) |    92.427    92.427 (------) | gmail-combined
|   126.102   126.586 (0.996x) |    31.001    31.001 (------) | gravity-combin
|    56.304    56.441 (0.998x) |     7.190     7.190 (------) | jquery-min
|    83.715    83.761 (0.999x) |    15.170    15.170 (------) | jsgb-combined
|    97.343    97.659 (0.997x) |    21.122    21.122 (------) | mochikit
|   410.809   412.300 (0.996x) |   100.694   100.694 (------) | pipio-combined
|    65.201    65.232 (------) |     4.650     4.650 (------) | processing-twi
|    78.898    79.969 (0.987x) |     1.389     1.389 (------) | sunspider-comb
|    52.990    53.126 (0.997x) |     5.480     5.480 (------) | tetris-combine
|   110.808   110.795 (------) |    29.972    29.972 (------) | twitter-combin
|   132.599   131.901 (1.005x) |    35.653    35.653 (------) | v8-combined
|    45.329    45.441 (0.998x) |     2.821     2.821 (------) | yui-min
|   767.127   769.733 (0.997x) |   176.807   176.807 (------) | zimbra-combine
-------
|  2753.304  2761.546 (0.997x) |   589.009   589.009 (------) | all

So overall it's a slight regression. I'll try the patches individually as well.

In general, patches 1, 3, 5, 6 and 7 look fine. I'm a bit uncertain about 2 and 4; both of them cause us to inline a big function in three places, which isn't ideal... what are the optimizations you have in mind? I'll be pleased if you can get speed-ups here, because I've previously removed every last ounce of fat I could find in this function :)
More instruction counts from Cachegrind.

        clang               gcc         
orig    2753.298 (1.000x)   2608.813
part1   2753.216 (1.000x)   2610.726 (0.999x)
part2   2754.158 (1.000x)   2612.656 (0.999x)
part3   2754.482 (1.000x)   2616.232 (0.997x)
part4   2776.504 (0.992x)   2626.606 (0.993x)
part5   2776.429 (0.992x)   2640.561 (0.992x)
part6   2779.920 (0.990x)   2626.653 (0.993x)
part7   2761.552 (0.997x)   2617.431 (0.997x)
only-
 part7  2758.066 (0.998x)   2616.173 (0.997x)

So, a bit here, a bit there. Even part 7 is a net regression when applied without the others.

As for parts 1 to 6... does the current structure actively block the optimizations you were planning?
Nicholas, thanks a lot for the numbers! Using Cachegrind is a good idea; I should probably set that up as well... These 0.003x regressions are pretty much impossible to measure without it.

Also interesting that gcc regresses slightly more than clang, that's what I feared.

0.003x is tiny, but it's a regression, so for now I think I'll just drop these patches.
Looking at h4writer's splendid tracelogging numbers[1], we spend roughly two thirds of our time on SunSpider in parsing. Does that mean that SunSpider truly is as fast as it will ever get, or are there other parts of the parsing phase that can still be sped up?

If the former, how does IE manage to be 30% faster, if their numbers[2] are to be believed[3]?


[1]: https://tl-beta.paas.allizom.org/
[2]: http://ie.microsoft.com/testDrive/Benchmarks/SunSpider/Default.html
[3]: I couldn't find any independent results corroborating them, which might be a good sign as it might indicate that fewer people do SunSpider comparisons
h4writer cleared up the mystery here: the tracelogging results are a combination of running all sunspider sub-tests individually, starting a new shell each time. That means that all parsing times include parsing the self-hosting code, making parsing heavily over-represented.

As a short-term solution to that, and to get better numbers here, I'll change things so that tracelogging only starts after self-hosting has been initialized.
Just ran SunSpider fully, getting the following results:

{
    "engineOverview": {
        "parsing script": 0.0790143,
        "Interpreter": 0.0903848,
        "GC": 0.0241764,
        "Baseline": 0.37226,
        "IonMonkey": 0.358923,
        "Lazy parsing": 0.0197462,
        "Ion Compilation": 0.00268987,
        "Yarr execution": 0.0528035
    }
}

So 80%+ is spent in execution ;). Only 9% in parsing.
Parsing certainly is important -- think of all those enormous scripts that websites use. But the lexer and parser have both enjoyed (suffered?) several rounds of optimization over the past couple of years, with the help of Parsemark (bug 839450). I don't think there's much fat to be removed, especially in the lexer. But I'd love to be proven wrong, which is why I am interested to hear Jan's ideas.
Attachment #8365939 - Flags: review?(n.nethercote)
Attachment #8365940 - Flags: review?(n.nethercote)
Attachment #8365941 - Flags: review?(n.nethercote)
Attachment #8365942 - Flags: review?(n.nethercote)
Attachment #8365943 - Flags: review?(n.nethercote)
Attachment #8365945 - Flags: review?(n.nethercote)
Attachment #8365951 - Flags: review?(n.nethercote)
(In reply to Nicholas Nethercote [:njn] from comment #17)
> I don't think there's much fat to be removed,
> especially in the lexer. But I'd love to be proven wrong, which is why I am
> interested to hear Jan's ideas.

I was looking at an Octane-codeload profile and noticed quite a lot of time spent atomizing identifier strings. We could try to do something similar to what JSC does: add a PropertyName *nameCache_[128] to TokenStream. Then, when you have a name with char[0] < 128 and length > 1 (length == 1 is very hot and should probably be inlined in getTokenInternal), you look up the cached PropertyName*, check whether the length + chars match the current one, and avoid the whole AtomizeChars path (including the atom table lock!). I hacked this up last week and, combined with the length == 1 fast path, it avoided the AtomizeChars call for 75% of all identifiers IIRC.

Also, integer literals are way more common than double literals (imagine all those x|0 expressions, etc.), so it would be nice if we could avoid doubles completely. Again, not sure if this wins much, but maybe worth measuring.
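The nameCache_ idea described above can be sketched as a small direct-mapped cache consulted before the atomization path. This is a hypothetical model only: AtomCache, atomize, and the std::unordered_set standing in for the atom table are illustrative, not SpiderMonkey APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <unordered_set>

// Hypothetical sketch of the identifier cache: a 128-entry table indexed by
// the first character, checked before falling back to the (locked, in the real
// engine) atom-table lookup.
struct AtomCache {
    static const std::size_t kSize = 128;
    const std::string* entries[kSize] = {nullptr};
    std::unordered_set<std::string> atoms;  // stand-in for the atom table

    const std::string* atomize(const char* chars, std::size_t len) {
        unsigned char c0 = (unsigned char)chars[0];
        if (len > 1 && c0 < kSize) {
            const std::string* hit = entries[c0];
            if (hit && hit->size() == len && memcmp(hit->data(), chars, len) == 0)
                return hit;  // cache hit: skip the atom-table path entirely
        }
        // Slow path: consult the atom table (pointers to set elements stay
        // valid across rehashing, so caching them is safe here).
        const std::string* atom = &*atoms.insert(std::string(chars, len)).first;
        if (len > 1 && c0 < kSize)
            entries[c0] = atom;  // remember the most recent name per first char
        return atom;
    }
};
```

Since identifiers repeat heavily in real code (loop variables, property names), even this one-entry-per-character scheme gets a high hit rate, which is consistent with the ~75% figure quoted above.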
Anyway, this one can be closed as WONTFIX.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
The PropertyName idea sounds good.

> Also, integer literals are way more common than double literals (imagine all
> these x|0 expressions etc), it would be nice if we could avoid doubles
> completely.

We already do that: we use GetDecimalInteger() to compute the value of integer literals and js_strtod() to compute the value of double literals (bug 639420, patch 3 introduced this). It was a ~10% win on integer-heavy inputs such as some of the Kraken ones.
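The integer fast path described here can be sketched as follows. This is a simplified model of the idea, not the real GetDecimalInteger (which, among other things, has to guard against values too large for exact accumulation): for a run of decimal digits with no '.' or exponent, accumulate the value directly instead of going through a general strtod-style conversion.

```cpp
#include <cassert>

// Simplified sketch of a GetDecimalInteger-style fast path: the value of a
// pure run of decimal digits is accumulated directly, avoiding the much more
// expensive general double-parsing routine used for literals like 1.5e3.
static double getDecimalInteger(const char* start, const char* end) {
    double d = 0;
    for (const char* p = start; p < end; p++)
        d = d * 10 + (*p - '0');  // exact for values within double's 2^53 range
    return d;
}
```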

Or were you thinking of something else?