Closed Bug 1078437 Opened 10 years ago Closed 8 years ago

78.6% regression in octane-splay on Flame Android

Categories

(Core :: JavaScript Engine, defect)

Platform: ARM / Android
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED INVALID
Target Milestone: mozilla36
People

(Reporter: h4writer, Assigned: sfink)

References

Details

Attachments

(2 files)

There was a regression of 78.6% on Flame Android:
arewefastyet.com/#machine=11&view=breakdown&suite=kraken

I could narrow the regression down to:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=9e3d649b80a2&tochange=818f353b7aa6

I'll try to see if I can narrow it down even more...
I think this is:
d948d19639ab	Steve Fink — Bug 1068399 - Enable GGC on the mobile browser, r=ted

Causing a 78.6% regression on Splay
http://arewefastyet.com/#machine=27&view=single&suite=octane&subtest=SplayLatency&start=1408649334&end=1412028700

And 50.7% on Splay Latency 
http://arewefastyet.com/#machine=27&view=single&suite=octane&subtest=Splay&start=1408649334&end=1412028700

This seems much worse than the x86 regression on splay when GGC landed...
Flags: needinfo?(terrence)
Flags: needinfo?(sphink)
Blocks: 1068399
Blocks: 1079952
What is the plan of attack here? It is a very big regression and decreases our Octane score a lot. This will now go into Aurora as of today...

The other issue is the variation between runs. We are mostly slow, but in about 1 out of 10 runs (or something like that) we are clearly really quick, faster than before. Can we find out why we are quicker in those cases and always be that fast?
Still working on it.

If you run this with --nursery-size=1 (which is 1MB, or 4 chunks), both splay and splay-latency are substantially faster than pre-GGC. Here are splay results on my Flame for a range of sizes (note that the first column is in number of chunks, which can only be set to a multiple of 4 with the current code):

NurseryChunks    Splay Score    #Minor GCs     #Major GCs
      0              933            0             1
      1             2529           54             1
      2             2442           28             1
      3             1940           18             1
      4             1726           14             1
      5             1558           11             1
      6             1231            9             1
      7              968            8             1
      8              316            6             0
    256              295            6             0

That last really really bad result? That's the current default nursery size.

I haven't really figured out what this means, though. The difference between 0 and 1 chunk confuses me.
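
For anyone trying to reproduce the sweep, here is the arithmetic implied by the note above as a tiny sketch. It is illustrative only, not SpiderMonkey's actual option handling; the 256KB chunk size is inferred from "1MB, or 4 chunks".

```cpp
#include <cstddef>

// Assumptions from the note above: 1MB of nursery equals 4 chunks (so 256KB
// per chunk), and --nursery-size is specified in whole MB, which is why the
// chunk count reachable from the command line is always a multiple of 4.
static const size_t NurseryChunkBytes = 256 * 1024;
static const size_t MB = 1024 * 1024;

size_t nurseryChunksFromMB(size_t sizeInMB) {
    return (sizeInMB * MB) / NurseryChunkBytes;  // e.g. 1MB -> 4 chunks
}
```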
Flags: needinfo?(sphink)
Hm, I have some bad news. I was assuming my Flame would be giving about the same numbers as the awfy runs at http://arewefastyet.com/#machine=26&view=single&suite=octane&subtest=Splay&start=1410832513&end=1410997963 but I am seeing *way* faster non-GGC numbers: 3077 for splay. So even the 1-chunk speed is slower than non-GGC.

It's even faster (3208) if I run with --no-threads. Interestingly, there is no value of --thread-count that will match that score (0, 1, and 4 all made no difference to the score.)
What is happening is store buffer compaction thrashing. It piles a whole bunch of stuff into the store buffer, sees that it's running low, and compacts it. That frees up a bunch more space, so it repeats, only now it has less room in the store buffer and it will compact sooner. Eventually, it gets to a point where pretty much everything you put in the store buffer can be compacted (it's a duplicate of something already in there), so you end up being just under the compaction trigger threshold (MinAvailableSize), you insert one more entry, you compact and it goes away, you see that you're now below the trigger threshold, and you repeat. The end result is a full compaction (which takes about 3ms on a Flame) for *every* store buffer insertion. And splay has a lot of those.

It would be far better to give up and just do a minor GC.

This is a problem on desktop too. It is almost certainly the source of bug 1022090.

Currently, MinAvailableSize is set to LifoAllocBlockSize / 16, where LifoAllocBlockSize is 64KB so it is 4KB. If I create two thresholds: MinAvailableSize to trigger a compaction, and MinRunwaySize to decide whether to trigger a minor GC after compaction, then I get *much* better results.
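
To make the idea concrete, here is a minimal standalone sketch of the two-threshold logic. This is not the actual patch; the buffer model and the compact()/requestMinorGC() stubs are stand-ins, and only the constant names, their values, and the overall decision come from the description above.

```cpp
#include <cstddef>

// Values taken from the description above; everything else is a stand-in.
static const size_t LifoAllocBlockSize = 64 * 1024;
static const size_t MinAvailableSize   = LifoAllocBlockSize / 16; // 4KB: compaction trigger
static const size_t MinRunwaySize      = LifoAllocBlockSize / 4;  // 16KB: post-compaction minor-GC trigger

struct MonoTypeBufferSketch {
    size_t capacity = LifoAllocBlockSize; // bytes in the current storage block
    size_t used = 0;                      // bytes occupied by entries

    size_t available() const { return capacity - used; }

    void compact() { /* stand-in: deduplicate entries in place, shrinking 'used' */ }
    void requestMinorGC() { /* stand-in: schedule a nursery collection, which empties the buffer */ }

    // Called after inserting an entry.
    void maybeCompactOrCollect() {
        if (available() >= MinAvailableSize)
            return;                 // plenty of room left, nothing to do
        compact();
        // If compaction freed so little that we'd be right back here after a
        // handful of insertions, stop thrashing and do a minor GC instead.
        if (available() < MinRunwaySize)
            requestMinorGC();
    }
};
```

The reason MinRunwaySize is larger than MinAvailableSize is that a compaction only pays for itself if it buys a reasonable amount of room to run before the next one; otherwise the cost degenerates into one full compaction per insertion, which is the thrashing described above.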

Leaving MinAvailableSize the same and setting MinRunwaySize to LifoAllocBlockSize/8 (so 8KB), the score with the default nursery size goes from 295 -> 2336, which is near the peak score of 2554 for various nursery sizes. With MinRunwaySize of LifoAllocBlockSize/4, it does even better. The default nursery size now jumps to 2901 and is the new peak value for my test run (so it's faster than any of the small nursery sizes). 2901 is getting pretty close to 3107, which is what I just got for a non-GGC build (so no GGC write barriers.)

I named it MinRunwaySize because it's the amount of space available after a compaction in which to run and generate more store buffer entries -- it's how much "runway" you have left before you hit the next compaction speed bump.

I'm not sure how much sense there is in basing these on LifoAllocBlockSize. MinAvailableSize should really be set based on how many store buffer entries you could possibly generate before an interrupt request is serviced (maybe to a 99% confidence or something.) MinRunwaySize balances the overhead of compaction vs the cost of a nursery collection (the latter cost is spread out among a bunch of different things, since premature nursery collections result in more cross-generational edges.) Neither of these has much to do with the size of a lifo alloc block. I wouldn't really care, except that the score difference between blocksize/8 vs blocksize/4 is pretty large, so MinRunwaySize at least may be a fairly important parameter.

I also wonder whether always compacting the whole store buffer is the right thing. I would guess that compaction mostly finds matches in the newer portion of the buffer, and the older "packed" part doesn't see as many hits. So you might do better by only compacting from partway through the buffer up to the end, at least part of the time. But that's perhaps more complex than is really needed here.
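
As a speculative sketch of what that could look like (nothing like this is in the tree; the Entry type and the container are stand-ins), the buffer could remember where the previous compaction stopped and only deduplicate the tail, tolerating duplicates that straddle the boundary, since duplicate store buffer entries should only mean redundant work at the next minor GC:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

using Entry = uintptr_t;  // stand-in for one remembered-set edge

struct PartialCompactor {
    std::vector<Entry> entries;
    size_t packedEnd = 0;  // entries[0, packedEnd) were packed by the previous pass

    // Deduplicate only the "new" tail of the buffer. Duplicates of entries in
    // the already-packed prefix are left alone; they just get processed twice
    // at the next minor GC.
    void compactTail() {
        std::unordered_set<Entry> seen;
        size_t out = packedEnd;
        for (size_t i = packedEnd; i < entries.size(); i++) {
            if (seen.insert(entries[i]).second)  // keep only the first occurrence
                entries[out++] = entries[i];
        }
        entries.resize(out);
        packedEnd = out;  // the next pass starts after the newly packed region
    }
};
```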

I only modified the MonoTypeBuffer. The GenericBuffer does things differently. I don't *think* it has the thrashing problem, but I'm not 100% sure.
Ok, strange. I just ran the full Octane benchmark with my MinRunwaySize=LifoAllocBlockSize/4 build. The bad news is that Splay did not demonstrate anywhere near the gain that I was seeing with isolated runs. The good news is that a couple of other benchmarks showed some gains. The ones that changed substantially, showing non-GGC -> current code -> new code:

RayTrace: 2738 -> 5861 -> 5568
Splay: 2586 -> 952 -> 1508
SplayLatency: 2823 -> 873 -> 1241
PdfJS: 1238 -> 1030 -> 1241
CodeLoad: 2146 -> 1941 -> 2006
Box2D: 1822 -> 2315 -> 2221
Typescript: 3214 -> 2806 -> 2923

So we still have a large hit in Splay and SplayLatency, RayTrace and Box2D lost a little of their GGC gains, PdfJS erased its GGC losses, and CodeLoad and Typescript recovered some GGC losses but are still behind non-GGC. I have no idea how much variance to expect in these scores. The overall score goes 2327 -> 2195 -> 2322, which surprises me because I thought we had an overall win from GGC. But I guess I was only paying attention to desktop.

Rerunning a few times, I definitely see wildly different behavior on Splay when running as part of the full benchmark vs running it alone. (And SplayLatency is bimodal, with the current code usually around 1427 but sometimes dropping to 783. Splay is awful, way worse than the 952 above, usually at about 150.) A second run of the full suite gets pretty much the same numbers as the first run.

It seems that after this change, when running the full benchmark we're better off but Splay/SplayLatency are still the major things dragging the score down. It is probably worth landing my current code just to keep GGC on in the tree, as it erases any perf loss from GGC even if it fails to show a gain.
Flags: needinfo?(terrence)
Assignee: nobody → sphink
Status: NEW → ASSIGNED
Thanks for the insight. Since it is on Aurora, I agree the focus is on neutralizing the performance drop. Later we can investigate how to get the performance gain you were pursuing.

As far as I can see you are running on Firefox OS? Can you create an Android try build? That way I can report the numbers for that build. (The regression on Android is worse than the FFOS one.)
Comment on attachment 8504526 [details] [diff] [review]
Prevent store buffer compaction from thrashing

Review of attachment 8504526 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/StoreBuffer.h
@@ +117,5 @@
>          bool isAboutToOverflow() const {
>              return !storage_->isEmpty() && storage_->availableInCurrentChunk() < MinAvailableSize;
>          }
>  
> +        bool hasEnoughRunway() const {

Shouldn't this be "isOutOfRunway" since we've filled past the runway marker? And the usage of triggering a GC when hasEnoughRunway seems backwards.
Attachment #8504526 - Flags: review?(terrence) → review+
Looking at the results on the AWFY Android slave, there are a few noticeable things:
1) We still have these bi-modal runs on Splay
2) Best score is still the same best score on Splay (448)
3) Worse code is still the same worse score on Splay (2841)
4) We now hit the best score more often than before on Splay. (1 of 10 to 5 of 10)
5) Splay latency didn't improve (785, while before ggc 1812)

Are there other things we can do? Is there anything I can help by debugging or running some diagnostics or something on the android phone?
https://hg.mozilla.org/mozilla-central/rev/713b9fa30f62
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
(In reply to Hannes Verschore [:h4writer] from comment #13)
> Looking at the results on the AWFY Android slave, there are a few noticeable things:
> 1) We still have these bi-modal runs on Splay
> 2) Best score is still the same best score on Splay (448)
> 3) Worse code is still the same worse score on Splay (2841)

2841 is best, 448 is worst, right?

> 4) We now hit the best score more often than before on Splay. (1 of 10 to 5
> of 10)
> 5) Splay latency didn't improve (785, while before ggc 1812)
> 
> Are there other things we can do? Is there anything I can help by debugging
> or running some diagnostics or something on the android phone?

Yes, there is more to do. No, I don't think independent runs would help, because b2g on my Flame already shows large and serious problems that I'm trying to figure out. Once those are ironed out, it would be useful to check on the state of Android.

Specifically, my change here almost completely removes the GGC perf hit on Splay -- but *only* when running run-splay.js individually. When running the full Octane suite, my change barely helps Splay at all. It just happens to improve various *other* Octane tests enough that the overall Octane score regains its pre-GGC numbers on Flame on b2g. The specific problem of Splay is still unresolved, and I'm still working on it. (It ought to be easy, darn it -- it's such a large hit. But it's no longer thrashing the store buffer, and for all I know maybe it never was when running the full suite.)
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
I should also note that I can't get to awfy right now to see what *it* thinks the effect of the patch was. My description above is still based on my manual shell runs on a b2g Flame.
(In reply to Steve Fink [:sfink] from comment #15)
> (In reply to Hannes Verschore [:h4writer] from comment #13)
> > Looking at the results on the AWFY Android slave, there are a few noticeable things:
> > 1) We still have these bi-modal runs on Splay
> > 2) Best score is still the same best score on Splay (448)
> > 3) Worse code is still the same worse score on Splay (2841)
> 
> 2841 is best, 448 is worst, right?

Oh yeah sorry. 2841 is indeed best.

(In reply to Steve Fink [:sfink] from comment #16)
> I should also note that I can't get to awfy right now to see what *it*
> thinks the effect of the patch was. My description above is still based on
> my manual shell runs on a b2g Flame.

What issue do you have to view awfy?
Comment on attachment 8504526 [details] [diff] [review]
Prevent store buffer compaction from thrashing

For the record, this seems to have produced a 15% improvement in Mac x86 octane-SplayLatency and 5% in Mac x86 octane-Typescript, according to AWFY.
My awfy problem was local. It seems like the Mozilla VPN grabs its IP range, which doesn't work. If I drop that route and let it go directly, it works fine.

(In reply to Hannes Verschore [:h4writer] from comment #13)
> Looking at the results on the AWFY Android slave, there are a few noticeable things:
> 1) We still have these bi-modal runs on Splay

So I think what's going on is that Octane does a warmup run before each Splay run. The bimodality is caused by us sometimes getting lucky and doing the GC during the warmup run instead of the timed run.

Which really disturbs me. I hadn't realized that Octane had a carpet that we could sweep GCs under. I'm guessing that we only see this on Android and B2G because of tighter memory constraints? So on desktop, we'll just never do a major GC during Splay? I can't prove it, but it seems plausible.
Here's my output from a full octane run with splay repeated half a dozen times. Notice that the Splay score is either close to 3000 or close to 2000, depending on whether we spent time in GC during that run. (There's quite a bit of variance within those two groups too, but that's a lesser effect.)

The "Within timed region:" chunks are output just before the relevant test, and are only done for Splay runs. (For no particular reason, as it turns out.)
(In reply to Steve Fink [:sfink] from comment #19)
> I'm guessing that we only see this on Android and B2G
> because of tighter memory constraints? So on desktop, we'll just never do a
> major GC during Splay? I can't prove it, but it seems plausible.

No, it's because the desktop is fast enough that it will always see at least a dozen GCs, so one more or less isn't a huge difference.
This patch was made unnecessary by bug 1100652.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 8 years ago
Resolution: --- → INVALID
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open