Closed Bug 1078437 Opened 10 years ago Closed 8 years ago

78.6% regression in octane-splay on Flame Android

Categories

(Core :: JavaScript Engine, defect)

Platform: ARM / Android
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED INVALID
Target Milestone: mozilla36
People

(Reporter: h4writer, Assigned: sfink)

References

Details

Attachments

(2 files)

There was a regression of 78.6% on Flame Android:
arewefastyet.com/#machine=11&view=breakdown&suite=kraken

I could narrow the regression down to:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=9e3d649b80a2&tochange=818f353b7aa6

I'll try to see if I can narrow it down even more...
I think this is:
d948d19639ab	Steve Fink — Bug 1068399 - Enable GGC on the mobile browser, r=ted

Causing a 78.6% regression on Splay
http://arewefastyet.com/#machine=27&view=single&suite=octane&subtest=SplayLatency&start=1408649334&end=1412028700

And 50.7% on Splay Latency 
http://arewefastyet.com/#machine=27&view=single&suite=octane&subtest=Splay&start=1408649334&end=1412028700

This seems much worse than the x86 regression on splay when GGC landed...
Flags: needinfo?(terrence)
Flags: needinfo?(sphink)
Blocks: 1068399
Blocks: 1079952
What is the plan of attack here? It is a very big regression and decreases our Octane score a lot. This will now go into Aurora as of today...

The other issue is the variation between runs. We are mostly slow, but in about 1 out of 10 runs (or something like that) we are clearly really quick, faster than before. Can we find out why we are quicker in those cases and always be that fast?
Still working on it.

If you run this with --nursery-size=1 (which is 1MB, or 4 chunks), both splay and splay-latency are substantially faster than pre-GGC. Here are splay results on my Flame for a range of sizes (note that the first column is in number of chunks, which can only be set to a multiple of 4 with the current code):

NurseryChunks    Splay Score    #Minor GCs     #Major GCs
      0              933            0             1
      1             2529           54             1
      2             2442           28             1
      3             1940           18             1
      4             1726           14             1
      5             1558           11             1
      6             1231            9             1
      7              968            8             1
      8              316            6             0
    256              295            6             0

That last really really bad result? That's the current default nursery size.

I haven't really figured out what this means, though. The difference between 0 and 1 chunk confuses me.
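
For anyone trying to reproduce the sweep, here is the arithmetic implied by the note above as a tiny sketch. It is illustrative only, not SpiderMonkey's actual option handling; the 256KB chunk size is inferred from "1MB, or 4 chunks".

```cpp
#include <cstddef>

// Assumptions from the note above: 1MB of nursery equals 4 chunks (so 256KB
// per chunk), and --nursery-size is specified in whole MB, which is why the
// chunk count reachable from the command line is always a multiple of 4.
static const size_t NurseryChunkBytes = 256 * 1024;
static const size_t MB = 1024 * 1024;

size_t nurseryChunksFromMB(size_t sizeInMB) {
    return (sizeInMB * MB) / NurseryChunkBytes;  // e.g. 1MB -> 4 chunks
}
```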
Flags: needinfo?(sphink)
Hm, I have some bad news. I was assuming my Flame would be giving about the same numbers as the awfy runs at http://arewefastyet.com/#machine=26&view=single&suite=octane&subtest=Splay&start=1410832513&end=1410997963 but I am seeing *way* faster non-GGC numbers: 3077 for splay. So even the 1-chunk speed is slower than non-GGC.

It's even faster (3208) if I run with --no-threads. Interestingly, there is no value of --thread-count that will match that score (0, 1, and 4 all made no difference to the score.)
What is happening is store buffer compaction thrashing. It piles a whole bunch of stuff into the store buffer, sees that it's running low, and compacts it. That frees up a bunch more space, so it repeats, only now it has less room in the store buffer and it will compact sooner. Eventually, it gets to a point where pretty much everything you put in the store buffer can be compacted (it's a duplicate of something already in there), so you end up being just under the compaction trigger threshold (MinAvailableSize), you insert one more entry, you compact and it goes away, you see that you're now below the trigger threshold, and you repeat. The end result is a full compaction (which takes about 3ms on a Flame) for *every* store buffer insertion. And splay has a lot of those.

It would be far better to give up and just do a minor GC.

This is a problem on desktop too. It is almost certainly the source of bug 1022090.

Currently, MinAvailableSize is set to LifoAllocBlockSize / 16, where LifoAllocBlockSize is 64KB so it is 4KB. If I create two thresholds: MinAvailableSize to trigger a compaction, and MinRunwaySize to decide whether to trigger a minor GC after compaction, then I get *much* better results.
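
To make the idea concrete, here is a minimal standalone sketch of the two-threshold logic. This is not the actual patch; the buffer model and the compact()/requestMinorGC() stubs are stand-ins, and only the constant names, their values, and the overall decision come from the description above.

```cpp
#include <cstddef>

// Values taken from the description above; everything else is a stand-in.
static const size_t LifoAllocBlockSize = 64 * 1024;
static const size_t MinAvailableSize   = LifoAllocBlockSize / 16; // 4KB: compaction trigger
static const size_t MinRunwaySize      = LifoAllocBlockSize / 4;  // 16KB: post-compaction minor-GC trigger

struct MonoTypeBufferSketch {
    size_t capacity = LifoAllocBlockSize; // bytes in the current storage block
    size_t used = 0;                      // bytes occupied by entries

    size_t available() const { return capacity - used; }

    void compact() { /* stand-in: deduplicate entries in place, shrinking 'used' */ }
    void requestMinorGC() { /* stand-in: schedule a nursery collection, which empties the buffer */ }

    // Called after inserting an entry.
    void maybeCompactOrCollect() {
        if (available() >= MinAvailableSize)
            return;                 // plenty of room left, nothing to do
        compact();
        // If compaction freed so little that we'd be right back here after a
        // handful of insertions, stop thrashing and do a minor GC instead.
        if (available() < MinRunwaySize)
            requestMinorGC();
    }
};
```

The reason MinRunwaySize is larger than MinAvailableSize is that a compaction only pays for itself if it buys a reasonable amount of room to run before the next one; otherwise the cost degenerates into one full compaction per insertion, which is the thrashing described above.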

Leaving MinAvailableSize the same and setting MinRunwaySize to LifoAllocBlockSize/8 (so 8KB), the score with the default nursery size goes from 295 -> 2336, which is near the peak score of 2554 for various nursery sizes. With MinRunwaySize of LifoAllocBlockSize/4, it does even better. The default nursery size now jumps to 2901 and is the new peak value for my test run (so it's faster than any of the small nursery sizes). 2901 is getting pretty close to 3107, which is what I just got for a non-GGC build (so no GGC write barriers.)

I named it MinRunwaySize because it's the amount of space available after a compaction in which to run and generate more store buffer entries -- it's how much "runway" you have left before you hit the next compaction speed bump.

I'm not sure how much sense there is in basing these on LifoAllocBlockSize. MinAvailableSize should really be set based on how many store buffer entries you could possibly generate before an interrupt request is serviced (maybe to a 99% confidence or something.) MinRunwaySize balances the overhead of compaction vs the cost of a nursery collection (the latter cost is spread out among a bunch of different things, since premature nursery collections result in more cross-generational edges.) Neither of these has much to do with the size of a lifo alloc block. I wouldn't really care, except that the score difference between blocksize/8 vs blocksize/4 is pretty large, so MinRunwaySize at least may be a fairly important parameter.

I also wonder whether always compacting the whole store buffer is the right thing. I would guess that compaction mostly finds matches in the newer portion of the buffer, and the older "packed" part doesn't see as many hits. So you might do better by only compacting from partway through the buffer up to the end, at least part of the time. But that's perhaps more complex than is really needed here.
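
As a speculative sketch of what that could look like (nothing like this is in the tree; the Entry type and the container are stand-ins), the buffer could remember where the previous compaction stopped and only deduplicate the tail, tolerating duplicates that straddle the boundary, since duplicate store buffer entries should only mean redundant work at the next minor GC:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

using Entry = uintptr_t;  // stand-in for one remembered-set edge

struct PartialCompactor {
    std::vector<Entry> entries;
    size_t packedEnd = 0;  // entries[0, packedEnd) were packed by the previous pass

    // Deduplicate only the "new" tail of the buffer. Duplicates of entries in
    // the already-packed prefix are left alone; they just get processed twice
    // at the next minor GC.
    void compactTail() {
        std::unordered_set<Entry> seen;
        size_t out = packedEnd;
        for (size_t i = packedEnd; i < entries.size(); i++) {
            if (seen.insert(entries[i]).second)  // keep only the first occurrence
                entries[out++] = entries[i];
        }
        entries.resize(out);
        packedEnd = out;  // the next pass starts after the newly packed region
    }
};
```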

I only modified the MonoTypeBuffer. The GenericBuffer does things differently. I don't *think* it has the thrashing problem, but I'm not 100% sure.
Ok, strange. I just ran the full Octane benchmark with my MinRunwaySize=LifoAllocBlockSize/4 build. The bad news is that Splay did not demonstrate anywhere near the gain that I was seeing with isolated runs. The good news is that a couple of other benchmarks showed some gains. The ones that changed substantially, showing non-GGC -> current code -> new code:

RayTrace: 2738 -> 5861 -> 5568
Splay: 2586 -> 952 -> 1508
SplayLatency: 2823 -> 873 -> 1241
PdfJS: 1238 -> 1030 -> 1241
CodeLoad: 2146 -> 1941 -> 2006
Box2D: 1822 -> 2315 -> 2221
Typescript: 3214 -> 2806 -> 2923

So we still have a large hit in Splay and SplayLatency, RayTrace and Box2D lost a little of their GGC gains, PdfJS erased its GGC losses, and CodeLoad and Typescript recovered some GGC losses but are still behind non-GGC. I have no idea how much variance to expect in these scores. The overall score goes 2327 -> 2195 -> 2322, which surprises me because I thought we had an overall win from GGC. But I guess I was only paying attention to desktop.

Rerunning a few times, I definitely see wildly different behavior on Splay when running as part of the full benchmark vs running it alone. (And SplayLatency is bimodal, with the current code usually around 1427 but sometimes dropping to 783. Splay is awful, way worse than the 952 above, usually at about 150.) A second run of the full suite gets pretty much the same numbers as the first run.

It seems that after this change, when running the full benchmark we're better off but Splay/SplayLatency are still the major things dragging the score down. It is probably worth landing my current code just to keep GGC on in the tree, as it erases any perf loss from GGC even if it fails to show a gain.
Flags: needinfo?(terrence)
Assignee: nobody → sphink
Status: NEW → ASSIGNED
Thanks for the insight. Since it is on Aurora, I agree the focus is on neutralizing the performance drop. Later we can investigate how to get the performance gain you were pursuing.

As far as I can see you are running on Firefox OS? Can you create an Android try build? That way I can report the numbers for that build. (The regression on Android is worse than the FFOS one.)
Comment on attachment 8504526 [details] [diff] [review]
Prevent store buffer compaction from thrashing

Review of attachment 8504526 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/StoreBuffer.h
@@ +117,5 @@
>          bool isAboutToOverflow() const {
>              return !storage_->isEmpty() && storage_->availableInCurrentChunk() < MinAvailableSize;
>          }
>  
> +        bool hasEnoughRunway() const {

Shouldn't this be "isOutOfRunway" since we've filled past the runway marker? And the usage of triggering a GC when hasEnoughRunway seems backwards.
Attachment #8504526 - Flags: review?(terrence) → review+
Looking at the results on the AWFY Android slave, there are a few noticeable things:
1) We still have these bi-modal runs on Splay
2) Best score is still the same best score on Splay (448)
3) Worse code is still the same worse score on Splay (2841)
4) We now hit the best score more often than before on Splay. (1 of 10 to 5 of 10)
5) Splay latency didn't improve (785, while before ggc 1812)

Are there other things we can do? Is there anything I can help by debugging or running some diagnostics or something on the android phone?
https://hg.mozilla.org/mozilla-central/rev/713b9fa30f62
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
(In reply to Hannes Verschore [:h4writer] from comment #13)
> Looking at the results on the AWFY Android slave, there are a few noticeable things:
> 1) We still have these bi-modal runs on Splay
> 2) Best score is still the same best score on Splay (448)
> 3) Worse code is still the same worse score on Splay (2841)

2841 is best, 448 is worst, right?

> 4) We now hit the best score more often than before on Splay. (1 of 10 to 5
> of 10)
> 5) Splay latency didn't improve (785, while before ggc 1812)
> 
> Are there other things we can do? Is there anything I can help by debugging
> or running some diagnostics or something on the android phone?

Yes, there is more to do. No, I don't think independent runs would help, because b2g on my Flame already shows large and serious problems that I'm trying to figure out. Once those are ironed out, it would be useful to check on the state of Android.

Specifically, my change here almost completely removes the GGC perf hit on Splay -- but *only* when running run-splay.js individually. When running the full Octane suite, my change barely helps Splay at all. It just happens to improve various *other* Octane tests enough that the overall Octane score regains its pre-GGC numbers on Flame on b2g. The specific problem of Splay is still unresolved, and I'm still working on it. (It ought to be easy, darn it -- it's such a large hit. But it's no longer thrashing the store buffer, and for all I know maybe it never was when running the full suite.)
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
I should also note that I can't get to awfy right now to see what *it* thinks the effect of the patch was. My description above is still based on my manual shell runs on a b2g Flame.
(In reply to Steve Fink [:sfink] from comment #15)
> (In reply to Hannes Verschore [:h4writer] from comment #13)
> > Looking at the results on the AWFY Android slave, there are a few noticeable things:
> > 1) We still have these bi-modal runs on Splay
> > 2) Best score is still the same best score on Splay (448)
> > 3) Worse code is still the same worse score on Splay (2841)
> 
> 2841 is best, 448 is worst, right?

Oh yeah sorry. 2841 is indeed best.

(In reply to Steve Fink [:sfink] from comment #16)
> I should also note that I can't get to awfy right now to see what *it*
> thinks the effect of the patch was. My description above is still based on
> my manual shell runs on a b2g Flame.

What issue do you have to view awfy?
Comment on attachment 8504526 [details] [diff] [review]
Prevent store buffer compaction from thrashing

For the record, this seems to have produced a 15% improvement in Mac x86 octane-SplayLatency and 5% in Mac x86 octane-Typescript, according to AWFY.
My awfy problem was local. It seems like the Mozilla VPN grabs its IP range, which doesn't work. If I drop that route and let it go directly, it works fine.

(In reply to Hannes Verschore [:h4writer] from comment #13)
> Looking at the results on the AWFY Android slave, there are a few noticeable things:
> 1) We still have these bi-modal runs on Splay

So I think what's going on is that Octane does a warmup run before each Splay run. The bimodality is caused by us sometimes getting lucky and doing the GC during the warmup run instead of the timed run.

Which really disturbs me. I hadn't realized that Octane had a carpet that we could sweep GCs under. I'm guessing that we only see this on Android and B2G because of tighter memory constraints? So on desktop, we'll just never do a major GC during Splay? I can't prove it, but it seems plausible.
Here's my output from a full octane run with splay repeated half a dozen times. Notice that the Splay score is either close to 3000 or close to 2000, depending on whether we spent time in GC during that run. (There's quite a bit of variance within those two groups too, but that's a lesser effect.)

The "Within timed region:" chunks are output just before the relevant test, and are only done for Splay runs. (For no particular reason, as it turns out.)
(In reply to Steve Fink [:sfink] from comment #19)
> I'm guessing that we only see this on Android and B2G
> because of tighter memory constraints? So on desktop, we'll just never do a
> major GC during Splay? I can't prove it, but it seems plausible.

No, it's because the desktop is fast enough that it will always see at least a dozen GCs, so one more or less isn't a huge difference.
This patch was made unnecessary by bug 1100652.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 8 years ago
Resolution: --- → INVALID
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open