Closed Bug 813381 • Opened 13 years ago • Closed 13 years ago

Optimize CheckStackRoots

Categories: Core :: JavaScript Engine, defect
Status: RESOLVED FIXED
Target Milestone: mozilla20
People: (Reporter: sfink, Assigned: sfink)
Whiteboard: [js:t]

Attachments (2 files)
Attachment 1: 13.58 KB, patch (terrence: review+, sfink: checkin+)
Attachment 2: 4.28 KB, patch (terrence: review+, sfink: checkin+)

Description
Currently, CheckStackRoots has an optimization: it takes a snapshot of the stack every time it runs and scans the snapshot on the next call, to avoid re-poisoning the portion of the stack that hasn't changed since then.
In my testing, that optimization sped up the full js1_8_5 test run from 347 seconds to 257 seconds.
However, it means that we have to scan the whole unchanged portion of the stack on every call, so that time depends on the current stack depth (as well as on the number of live rooters).
Comment 1 (Assignee) • 13 years ago
This patch removes the previous optimization. Instead, it exploits the fact that the Rooted<T> structs are lying about on the stack, giving us a place to insert a marker that records whether we've scanned that portion of the stack yet. The marker is set to false when the Rooted<T> is instantiated, set to true when the Rooted<T> is scanned, and forgotten when the Rooted<T> is popped (because we have a linked list of all Rooted<T>s, there is no possibility of looking at a stale Rooted<T> on the stack).
This eliminates the time dependence on the current stack depth. (It adds a full scan of all Rooted<T>s, but we already do that once *per addressable gcthing word* on the stack.)
In addition, all current Rooted<T>s *contain* the pointer they are rooting, which means that we don't need to scan any Rooted<T> lying outside the portion of the stack that we're scanning. By constructing a list of all Rooted<T>s at the beginning of CheckStackRoots, we can discard everything that cannot possibly keep a gcthing stack pointer from being poisoned by the analysis. Since all Rooted<T>s are scanned for each potential Cell* on the stack, this makes the analysis time largely dependent only on the amount of changed data on the stack.
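As a rough sketch of the scheme just described (the names StackRoot, ScanNewRoots, and the exact layout are illustrative, not the actual SpiderMonkey code): each root carries a `scanned` flag set on construction and cleared only by popping, and the analysis walks the intrusive list of roots, stopping once it reaches already-scanned territory.

```cpp
#include <cassert>

// Illustrative stand-in for Rooted<T>: an intrusive linked list of stack
// roots, each with a flag saying whether the analysis has covered it yet.
struct StackRoot {
    StackRoot** stackTop;   // head pointer of the per-thread root list
    StackRoot* previous;    // next-older root on the stack
    void* ptr;              // the GC thing being rooted
    bool scanned;           // has the analysis already scanned this region?

    StackRoot(StackRoot** top, void* p)
      : stackTop(top), previous(*top), ptr(p), scanned(false)
    {
        *stackTop = this;             // push onto the list
    }
    ~StackRoot() {
        *stackTop = previous;         // pop; a popped root is never revisited
    }
};

// Walk from the newest root downward; everything below the first
// already-scanned root was covered by an earlier CheckStackRoots call.
static int ScanNewRoots(StackRoot* top) {
    int newlyScanned = 0;
    for (StackRoot* r = top; r; r = r->previous) {
        if (r->scanned)
            break;                    // older roots were scanned last time
        r->scanned = true;
        ++newlyScanned;
    }
    return newlyScanned;
}
```

The key property is that the scan cost tracks the number of roots pushed since the previous call, not the total stack depth.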
One place where this optimization could lose is if we are in a loop that pushes and pops the same set of Rooted<T> each time. The previous optimization might notice that the actual values on the stack don't change across these calls; ours will conservatively assume that because the Rooted<T> have been popped and pushed in between CheckStackRoots() calls, we need to rescan that part of the stack. I don't know if this is a problem in practice, because it is likely that some loop counter variable will be saved on the stack anyway. Anyway, that's just speculation.
The win from this optimization is far less than I had hoped. I go from 257 seconds + 4 timeouts with the previous optimization, to 197 seconds + 3 timeouts with this patch applied. (Timeouts can hide unbounded extra time, but they probably don't affect the ratio much.)
I had intermediate, incorrect versions of this optimization that showed a much larger win (e.g. 79 seconds total with no timeouts), so it's possible that there are larger gains to be had. But I think I'd better stop messing with this for now.
Speculation on further work:
I have collected stats on how many <pointer,rooter> pairs are checked, and the effect of this optimization is massive as compared to the completely unoptimized version (2 orders of magnitude improvement for all stack words, 1 order of magnitude for valid gcthing stack words), so it seems like it ought to have more effect. It is possible that there's just something dumb going on.
I haven't compared times using -j. When there are timeouts, I suspect the total runtime is strongly influenced by the order that those tests appear -- if you start one of those at the very end, you won't be able to overlap it with anything else.
I also think the time with -j1 is mostly determined by a few tests that take way longer with the analysis. Targeting those specific tests is likely to be more useful than speeding things up across the board.
One probabilistic improvement would be to remember the addresses of Rooted<T>s across CheckStackRoots invocations and, most of the time, skip poisoning the regions of the stack where they haven't changed. This might help those slow tests.
Another would be to revive the original optimization, but ignore changes in values on the stack that aren't gcthing pointers. This is more likely to ignore legitimate changes, but might make the loops much faster. (This could be applied on top of this new optimization, checking only the portion of the stack that had any possibility of changing.)
Comment 2 (Assignee) • 13 years ago
Comment on attachment 683381 [details] [diff] [review]
Optimize CheckStackRoots
Oops, I left the r? unset at the last minute because I hit some regressions with the patch, but it was just because my patch stack was out of order (I have some rooting fixes that are necessary for the test suite to pass right now.)
Oh, and now I'm wondering if my timings *were* with parallel tests -- does jstests.py default to -j1, or to -jN where N is magically determined and > 1?
-j6 is giving me 178 sec, a slight improvement.
Attachment #683381 - Flags: review?(terrence)
Comment 3 (Assignee) • 13 years ago
Oh, and I should note that this patch also contains an unrelated optimization: the only thing that needed a tracer object was MarkIfGCThing, and we're not marking anything, so I ripped it out and just used IsAddressableGCThing.
Comment 4 • 13 years ago
(In reply to Steve Fink [:sfink] from comment #2)
> Oh, and now I'm wondering if my timings *were* with parallel tests -- does
> jstests.py default to -j1, or to -jN where N is magically determined and > 1?
>
> -j6 is giving me 178 sec, a slight improvement.
It defaults to whatever multiprocessing.cpu_count() returns, or 2 if multiprocessing is not available on your system or Python. Really, it just makes the 20% speedup even more impressive. :-)
Comment 5 • 13 years ago
Comment on attachment 683381 [details] [diff] [review]
Optimize CheckStackRoots
Review of attachment 683381 [details] [diff] [review]:
-----------------------------------------------------------------
I think this is self-contained enough and clean enough to justify taking it for a 20% speedup on rooting analysis.
::: js/src/gc/Root.h
@@ +667,5 @@
> +
> +#if defined(JSGC_ROOT_ANALYSIS)
> + public:
> + /* Has the rooting analysis ever scanned this Rooted's stack location? */
> + bool scanned;
I wonder if it would be faster to embed this in the bottom bit of |ptr|? Ugh, never mind, it's too gross to even consider. In fact, I think I'll break out the bit I added in Unrooted<T> into a separate bool too.
::: js/src/jsgc.cpp
@@ +9,5 @@
>
> #include "mozilla/Attributes.h"
> #include "mozilla/Util.h"
>
> +#include <algorithm>
Let's not start depending on std:: for just Min and Swap.
@@ +5093,5 @@
> for (ion::IonActivationIterator ion(rt); ion.more(); ++ion) {
> uintptr_t *ionMin, *ionEnd;
> ion.ionStackRange(ionMin, ionEnd);
>
> + uintptr_t *upto = std::min(ionMin, end);
Use jsutil.h's |Min(T t1, T t2)|.
@@ +5103,5 @@
> }
> #endif
>
> /* The topmost Ion activiation may be beyond our prior top. */
> + if (i < end)
|<| instead of |<=| caused me to crash when I added AndSkipIon, but I think that may have been a bug in the IonActivationIterator. If it doesn't crash for you now, awesome.
@@ +5203,5 @@
> +#else
> + if (p->rooter <= firstScanned)
> +#endif
> + {
> + std::swap(*firstToScan, *p);
There is a very nice MoveRef-aware |Swap| in jsutil.h. I don't think you'll be able to eke out any more performance here with MoveRefs since it's just POD, but no point adding a dependence on std:: for a 3-liner.
Attachment #683381 - Flags: review?(terrence) → review+
Comment 6 (Assignee) • 13 years ago
To fix the problem of repeated invocations with the same call stack, this patch computes a checksum of the last J rooters' addresses and suppresses the CheckStackRoots scan if those same addresses have been observed recently (within the last K distinct CheckStackRoots calls, where distinctness is judged by stack checksum).
The idea is that if you have a loop with a couple of CheckStackRoots calls within the loop, then all but the first of those is largely useless because it's the same static location over and over again. (Theoretically, a different path and therefore a different set of live variables could be accessed based on data differences, so this isn't 100% correct. But removing the timeouts results in better coverage, so overall this is still a win.) The stack will not be identical throughout that loop, because there'll at least be a loop iterator or something on the stack, but the data types on the stack should be identical. Using the address of the most recent handful of rooters is a proxy for that, because if a different stack frame were on the stack then the addresses of rooters would be unlikely to be identical.
Overall, this optimization skips 79% of the CheckStackRoots scans, at the cost of some sorting and bookkeeping. That percentage is expected to vary widely from test to test. For example, many of the savings come from a clone test that clones something like 100k objects.
In my testing, this dropped the time required for a test run from 205 seconds + 3 timeouts with just the previous patch, to 155 seconds + 1 timeout with both patches applied. All in all, this halves the root-analysis overhead as compared to the version with neither patch applied -- so I am comparing to the previous stack-copying-and-scanning optimization. As compared to an unoptimized version, these patches provide a 2.6x speedup (new overhead = 0.38 times unoptimized overhead).
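A minimal sketch of the suppression test described above (the constants J and K, the FNV-style hash, and the function names are assumptions; the real patch uses mozilla::HashGeneric and SpiderMonkey containers): sort the rooter addresses, hash the newest few, and skip the scan if the same checksum appeared in any of the last K distinct invocations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

static const size_t J = 4;   // how many of the newest rooter addresses to hash
static const size_t K = 8;   // depth of the checksum memory (ring buffer)

// FNV-1a over the last J (post-sort) addresses; a stand-in for HashGeneric.
static uint32_t HashAddresses(const std::vector<uintptr_t>& addrs) {
    uint32_t h = 2166136261u;
    size_t start = addrs.size() > J ? addrs.size() - J : 0;
    for (size_t i = start; i < addrs.size(); ++i) {
        for (size_t b = 0; b < sizeof(uintptr_t); ++b) {
            h ^= uint32_t((addrs[i] >> (8 * b)) & 0xff);
            h *= 16777619u;
        }
    }
    return h;
}

// Returns true if this stack configuration was seen recently, meaning the
// expensive CheckStackRoots scan can be skipped.
static bool SuppressCheckRoots(std::vector<uintptr_t> rooterAddrs) {
    static std::vector<uint32_t> seen;   // last K checksums, ring buffer
    static size_t pos = 0;
    std::sort(rooterAddrs.begin(), rooterAddrs.end());
    uint32_t h = HashAddresses(rooterAddrs);
    for (size_t i = 0; i < seen.size(); ++i) {
        if (seen[i] == h)
            return true;
    }
    if (seen.size() < K) {
        seen.push_back(h);
    } else {
        seen[pos] = h;                   // evict the oldest checksum
        pos = (pos + 1) % K;
    }
    return false;
}
```

Keeping K > 1 matters because, as the patch comment notes, recursive functions can alternate between a couple of stacks rather than just repeating one.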
Attachment #685396 - Flags: review?(terrence)
Comment 7 • 13 years ago
Comment on attachment 685396 [details] [diff] [review]
Further CheckStackRoots optimization - suppress repeated checks of the same stack configuration
Review of attachment 685396 [details] [diff] [review]:
-----------------------------------------------------------------
I love it.
::: js/src/jsgc.cpp
@@ +5127,5 @@
> + * alternate between a couple of stacks rather than just repeating the same one
> + * over and over, so we need more than a depth-1 memory.
> + */
> +static bool
> +SuppressCheckRoots(Vector< Rooter, 0, SystemAllocPolicy>& rooters)
Remove the extra space after < and move the & next to rooters.
@@ +5143,5 @@
> + // partitioning pass, perhaps.
> + qsort(rooters.begin(), rooters.length(), sizeof(Rooter), CompareRooters);
> +
> + // Compute the hash of the current stack
> + uint32_t hash = mozilla::HashGeneric(&hash);
I had no idea that was even possible. I'm not sure I want it to be possible.
Could you add a note that we hash the location of |hash| to ensure that this is the same CheckStackRoots caller? (Assuming I'm interpreting it correctly, of course.)
Attachment #685396 - Flags: review?(terrence) → review+
Comment 8 (Assignee) • 13 years ago
(In reply to Terrence Cole [:terrence] from comment #7)
> @@ +5143,5 @@
> > + // partitioning pass, perhaps.
> > + qsort(rooters.begin(), rooters.length(), sizeof(Rooter), CompareRooters);
> > +
> > + // Compute the hash of the current stack
> > + uint32_t hash = mozilla::HashGeneric(&hash);
>
> I had no idea that was even possible. I'm not sure I want it to be possible.
>
> Could you add a note that we hash the location of |hash| to ensure that this
> is the same CheckStackRoots caller? (Assuming I'm interpreting it correctly,
> of course.)
Oh, sorry, I'll comment it. Thought process was roughly "I should make sure the stack depth is identical too. I could take the address of a parameter for that. Oh, wait, my only parameter is a reference. Taking its address is not generally guaranteed to give me a recent stack location, since it's really just passing a pointer, so that's kind of confusing. I guess I could use the address of a local variable for that. Hey, I have a local variable: 'hash'! Its value may not be initialized yet, but its address has to be determined."
Now that I've written the rest of the code and have more local vars to choose from, maybe I'll just switch to declaring |pos| before |hash| and using &pos instead. It *is* kind of wonky to initialize a variable with a function of its own address, given that the variable itself isn't really in scope on the rhs of its initializer.
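For illustration of the trick being debated here, hashing the address of a local variable folds the current stack depth into the checksum, so calls from different frames produce different checksums. HashPointer below is a hypothetical stand-in for mozilla::HashGeneric, not the real API:

```cpp
#include <cassert>
#include <cstdint>

// Fold a pointer's bits into 32 bits (a simple mixing stand-in for
// mozilla::HashGeneric; the exact mixing function is an assumption).
static uint32_t HashPointer(const void* p) {
    uintptr_t v = reinterpret_cast<uintptr_t>(p);
    uint32_t h = uint32_t(v) ^ uint32_t(v >> 16);
    return h * 2654435761u;   // Knuth-style multiplicative mixing
}

// The wonky-but-legal pattern under discussion: |hash| is initialized from
// its own address. The address is valid before initialization completes,
// and it identifies the current stack frame.
static uint32_t StackConfigChecksum() {
    uint32_t hash = HashPointer(&hash);
    return hash;
}
```

As the comment thread suggests, using the address of some other, earlier-declared local (e.g. |pos|) gives the same effect without the self-referential initializer.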
Updated (Assignee) • 13 years ago
Attachment #683381 - Flags: checkin+
Comment 9 (Assignee) • 13 years ago
Comment 10 • 13 years ago
Backed out because of build bustage: https://hg.mozilla.org/integration/mozilla-inbound/rev/82e61145a1dc
Updated (Assignee) • 13 years ago
Attachment #685396 - Flags: checkin+
Comment 11 (Assignee) • 13 years ago
Comment 12 • 13 years ago
https://hg.mozilla.org/mozilla-central/rev/5bb43720c04c
https://hg.mozilla.org/mozilla-central/rev/6c24bf926afc
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Updated • 13 years ago
Whiteboard: [js:t]