Regression on Spinning Balls

Status: RESOLVED DUPLICATE of bug 707162
Component: Core :: JavaScript Engine
Reporter: dmandelin (Unassigned)
Version: Trunk
Whiteboard: [Snappy]

(Reporter)

Description

Several people have noticed a regression. The history is complicated; so far I've come up with these scores in nightly builds:

08/08    181
  (regression?)
08/24    140
  (regression?)
08/30    108
09/08    103
10/06    102
  (slight regression??)
11/05     90
  (regression?)
12/02     69
  (most regression went away?)
12/06    128
12/12    126
  (even more improvement?)
12/16    151
  (huge regression)
12/17     65
12/18     80
  (regression?)
12/19     44

Only the last two deltas have 1-day ranges; the others need more bisection. Even after that, I still want to know the exact changesets.
We should also figure out what exactly the score means. I found it to be fairly reproducible, where "fairly" means that I got about the same score when running it twice each on Firefox 8, 9, and 10 (not twice in a row, but intermixed with running various versions).
OS: Windows 7 → All
Hardware: x86_64 → All
Whiteboard: [Snappy]
FYI, when I looked into this before, every pause spike in the display at the bottom, which is presumably what drives the score, seemed to correspond to GCs.
Blocks: 704716
The score is the number of frames displayed in the 60-second run, divided by the sum of squares of the amounts by which pauses exceed 20ms (and scaled by 1000). Here's the relevant code:

function updateStats(pause) {
  numberOfFrames++;
  // Only the portion of a pause beyond 20ms is penalized, and quadratically.
  if (pause > 20) {
    sumOfSquaredPauses += (pause - 20) * (pause - 20);
  }
  // Histogram pauses into 10ms buckets; `|= 0` turns an undefined bucket
  // entry into 0 before the increment.
  pauseDistribution[Math.floor(pause / 10)] |= 0;
  pauseDistribution[Math.floor(pause / 10)]++;
}

function renderStats() {
  ...
  msg.innerHTML = "Score " +
    Math.round(numberOfFrames * 1000 / sumOfSquaredPauses);
  ...
}
When I compared 8, 9 and 10, the number of frames was about the same (decreasing slightly as version number went up), so the massive swings may be explained mostly by pauses.
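
To make the weighting concrete, here is a hypothetical back-of-the-envelope check (the score() helper and the numbers are invented for illustration; only the formula comes from the benchmark code above). With the frame count held fixed, a few long pauses cost far more score than many slightly-late frames:

function score(numberOfFrames, pauses) {
  var sumOfSquaredPauses = 0;
  for (var i = 0; i < pauses.length; i++) {
    if (pauses[i] > 20) {
      var excess = pauses[i] - 20;            // only the part beyond 20ms counts
      sumOfSquaredPauses += excess * excess;
    }
  }
  return Math.round(numberOfFrames * 1000 / sumOfSquaredPauses);
}

var manySmall = [];
for (var i = 0; i < 400; i++) manySmall.push(25);  // 400 pauses of 25ms each
var fewLarge = [100, 100, 100, 100, 100];          // 5 pauses of 100ms each

score(3600, manySmall);  // 3600 * 1000 / (400 * 5 * 5)   => 360
score(3600, fewLarge);   // 3600 * 1000 / (5 * 80 * 80)   => 113

So a run with five 100ms pauses scores roughly a third of a run with four hundred 25ms pauses, which fits the observation that the swings track pauses rather than frame counts.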
(Reporter)

Comment 5

Filtering out the noise, it looks like there was a regression on 8/26 from the compartment check landing, which then got removed in early Dec. 

The only other thing I found was that there seemed to be a regression with bug 711095. I think the regression was an increase in GC times. It may be just that there are more outliers, but I think the pause times might actually be a bit longer. From the description of bug 711095, it sounds like the patch might have (temporarily?) disabled some optimizations. I don't think it's a huge deal because that bug was to enable IGC, and IGC should help a lot with our pause times anyway.

I think it would still be good to check on bug 711095, to find out whether an increase in pause times was expected, and if not, whether there actually was an increase in pause times.

Comment 6

(In reply to David Mandelin from comment #5)
> The only other thing I found was that there seemed to be a regression with
> bug 711095.

That bug changed the order of object traversal, and it also increased mark stack usage so that the stack grew beyond our fixed space. That may well have adversely influenced the result. It should be easy to fix.
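
For context, here is a minimal sketch of that mechanism (an illustration in JavaScript, not SpiderMonkey's actual collector; the object shape and function names are invented). Marking drains a fixed-capacity stack, and once the stack overflows the collector has to make extra whole-heap passes to recover objects that were marked but never scanned:

// Sketch only: heap objects look like { marked: false, children: [...] }.
function mark(heap, roots, capacity) {
  var stack = [];
  var overflowed = false;

  function visit(obj) {
    if (obj.marked) return;
    obj.marked = true;
    if (stack.length < capacity) stack.push(obj);  // queued for scanning
    else overflowed = true;  // marked, but its children won't be scanned
  }

  roots.forEach(visit);
  for (;;) {
    while (stack.length) stack.pop().children.forEach(visit);
    if (!overflowed) break;
    overflowed = false;
    // Overflow fallback: walk the whole heap looking for children that
    // still need marking. Every extra pass like this lengthens the pause.
    heap.forEach(function (obj) {
      if (obj.marked) obj.children.forEach(visit);
    });
  }
}

A change in traversal order changes how deep the stack grows at any moment, and therefore how often the overflow path is taken.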
Could we also look at using Peptest to do some automatic regression analysis?  This could also be a good test to get them to add, as it is fairly simple.

Comment 8

(In reply to David Mandelin from comment #0)
> Several people have noticed a regression. The history is complicated; so far
> I've come up with these scores in nightly builds:

This is on Windows, right?

I compiled a build with MOZ_GCTIMER support and this is what I got for GC timing and benchmark results for the December 16 and 17 builds (the left column is the score and the pause histogram; the three columns on the right are the full/mark/finalize times of successive GCs, in ms):


2011-12-16:         gc time
                    full      mark      finalize
                    22.716000 22.113000 0.563000 
Score 204           30.025000 26.545000 2.954000
0-10ms  => 1        27.802000 26.815000 0.947000
10-20ms => 3131     28.144000 27.111000 0.961000
20-30ms => 464      27.740000 26.764000 0.937000
30-40ms => 1        28.218000 27.237000 0.933000
40-50ms => 10       27.903000 26.945000 0.920000
50-60ms => 2        27.638000 26.659000 0.932000
70-80ms => 1        23.676000 23.166000 0.460000
90-100ms => 1       31.486000 30.507000 0.942000
                    28.124000 27.137000 0.949000
                    28.100000 27.106000 0.945000
                    27.649000 26.559000 1.047000

2011-12-17:         gc time
                    full      mark      finalize
                    21.210000 20.630000 0.539000
Score 165           27.237000 24.411000 2.387000
0-10ms  => 2        26.256000 25.257000 0.959000
10-20ms => 3103     26.103000 24.947000 0.925000
20-30ms => 490      26.592000 25.598000 0.943000
30-40ms => 2        26.119000 24.998000 0.922000
40-50ms => 11       25.806000 24.852000 0.917000
70-80ms => 1        26.790000 25.807000 0.919000
110-120ms => 1      25.751000 24.795000 0.918000
                    26.329000 25.363000 0.915000
                    26.396000 25.411000 0.930000
                    25.779000 24.846000 0.881000
                    22.024000 21.341000 0.481000

This shows that the GC marking time improved. The only change in that area between the two builds is bug 711095, consistent with the results reported in that bug for other benchmarks. So I think bug 711095 cannot be responsible for the massive score drop in the results from comment 0.

Comment 9

I observed that the score result is extremely noisy. For example, for 3 separate browser runs like:

  firefox http://v8.googlecode.com/svn/branches/bleeding_edge/benchmarks/spinning-balls/index.html

on a Linux box I got scores of 204, 145, and 315. Are the results on Windows more consistent?

Another observation based on the comment 8 results is that the GC timing is rather consistent, staying at 25-30 ms during the benchmark run; that can explain the 15 cases in the above results where a pause went beyond 30 ms. However, it cannot explain the couple of outliers that go beyond 60 ms.

So something else is contributing to those big pauses.

Comment 10

I ran the benchmark on a quiet Linux box for 20000 seconds (about 6 hours) against mozilla-central tip. During the benchmark the GC time stayed within the 22-30 ms range and there were about 3400 GC runs. Here are the resulting numbers:

Score 789
0-10ms    => 462
10-20ms   => 1013517
20-30ms   => 141176
30-40ms   => 2337
40-50ms   => 644
50-60ms   => 9
60-70ms   => 74
70-80ms   => 6
110-120ms => 7
120-130ms => 2

Comment 11

As yet another confirmation that the regression is caused by the GC scheduling changes, I see that the patch from bug 707162 noticeably improved the benchmark scores. In particular, it reduced the number of particularly bad pauses and increased the score from 150-300 to 300-450 for a 60-second run.

Comment 12

On Windows I have not observed any consistency in the benchmark results for 60-second runs. For example, for the nightly build from 2011-12-26 I got scores in the range 30-38, or over 25% variation. In all cases this was with a newly started browser.

So I presume the benchmark can be used only as a very rough estimate of GC pause times.
(Reporter)

Comment 13

Thanks for checking out bug 711095, Igor. Yes, the results are very noisy. We could probably modify the scoring to make it more repeatable. 
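
One hypothetical way to do that (my sketch, not a change that landed anywhere): run the benchmark several times and report the median score, so a single noisy run cannot dominate the reported number.

// Aggregate repeated runs; the median resists outliers better than the mean.
function medianScore(scores) {
  var sorted = scores.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  return sorted.length % 2
    ? sorted[mid]
    : Math.round((sorted[mid - 1] + sorted[mid]) / 2);
}

medianScore([204, 145, 315]);  // the three Linux runs from comment 9 => 204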

For now, it looks like this bug, if it exists at all, is just a dup of 707162, so I'm closing it out.
Status: NEW → RESOLVED
Resolution: --- → DUPLICATE
Duplicate of bug: 707162