SunSpider numbers not in agreement with 'time', shark or cachegrind

RESOLVED INCOMPLETE

Status

()

RESOLVED INCOMPLETE
9 years ago
8 months ago

People

(Reporter: jseward, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

9 years ago
As a result of profiling the patch in bug 513407, I came across some
anomalies in SunSpider's output, which gives me cause to wonder how
trustworthy it is.  Given that SS is the metric for a lot of
performance tuning work, this seems to me to be an important question.

These numbers are from a quiet Mac Mini w/ OSX 10.5.8 and no power
management features enabled, afaics.

I did a 1000-pass SS run with and without said patch.  For 1000 runs
on this machine, the noise level is very low, so the following change
was marked as "significant".  In fact I did this twice just to be sure
(4000 runs in total):

nsieve-bits:   *1.060x as slow*    33.1ms +/- 0.1%     35.1ms +/- 0.1%
nsieve-bits:   *1.064x as slow*    32.6ms +/- 0.1%     34.6ms +/- 0.1%

So I went looking for a 6% slowdown.  Unfortunately failed to find it:

* 1000 runs directly with 'time':

  spinup && time for ((i = 0; i < 1000; i++)) ; \
     do ../CMv1/js/src/BuildR/js -j tests/bitops-nsieve-bits.js ; done

  spinup is a loop which runs for ~ 2 seconds, so as to ensure the
  cpu is at its highest clock rate.
  
  result: baseline 79.5s, patched 79.4s


* shark, 20us resolution:

  shark -1 -i -I 20u ../CMv1/js/src/BuildR/js -j \
    tests/bitops-nsieve-bits.js

  result: baseline 3072 samples, patched 3080 samples.  A slowdown
  but definitely not 6%.
  

* cachegrind, counting insn reads (Ir), data accesses (Drw), D1 misses
  (D1m) and mispredicts (Bm):

  baseline
     Ir/Drw/D1m/Bm = 148,322,803 / 77,680,465 / 451,295 / 376,740
  patched
     Ir/Drw/D1m/Bm = 148,324,668 / 77,681,421 / 451,302 / 375,379

-----------

SS also shows the following speedup, which is equally unaccountable-for:

3bit-bits-in-byte: 1.049x as fast       2.3ms +/- 1.2%      2.2ms +/- 1.1%
3bit-bits-in-byte: 1.058x as fast       2.3ms +/- 1.2%      2.1ms +/- 1.0%

* 1000 runs directly with 'time':

  baseline 49.8s, patched 49.8s

* shark, 20us resolution

  baseline 1354 samples, patched 1355 samples

* cachegrind:

  baseline
     Ir/Drw/D1m/Bm = 56,380,653 / 24,394,260 / 390,572 / 235,233
  patched
     Ir/Drw/D1m/Bm = 56,380,574 / 24,394,264 / 390,588 / 235,581
I'm seeing something similar with a patch I'm working on.  I'm getting a 6--7% (ie. huge) slowdown for 3d-morph, which is surprising because the patch only affects nanojit's compile-time (not the generated code) and 3d-morph doesn't involve much JIT compilation.  If I run 3d-cube and 3d-morph within the SS harness I get the slowdown:

   cube:  1.011x as fast    38.9ms +/- 0.2%   38.5ms +/- 0.3%     significant
   morph: *1.068x as slow*  27.3ms +/- 0.4%   29.2ms +/- 0.3%     significant

But If I just run 3d-morph, I get a much smaller slowdown:

   morph: *1.009x as slow*  28.5ms +/- 0.4%   28.7ms +/- 0.3%     significant

If you look at the actual numbers, it seems that running 3d-morph after 3d-cube is a win with the original code, but a loss with my patch applied.

Running 3d-morph alone, outside the harness (ie. just using 'time' in the shell) I get little or no slowdown.

Just to confuse things, Cachegrind says there is a conditional branch misprediction spike in 'sin$fenv_access_off', which is in /usr/lib/libSystem.B.dylib,  with my patch applied.  It goes from 93,000 mispredicts to 292,000 in that function, and from 379,000 to 584,000 overall.  (It's worth noting that 3d-morph spends over 20% of total run-time in sin(), according to Shark;  I'm not sure what the exact relationship betwen sin() and 'sin$fenv_access_off' is, presumably they're the same thing or closely related.)

I don't know if the branch mispredicts are a red herring.  However, it is clear that the position of a benchmark within the overall SS suite is significant.  And AFAICT SunSpider is run as a single 'js' invocation (with a gc() between each test), so this makes sense.
I have seen this before myself. In my experience a 5% perf difference on a single SS test is often not explainable. I don't really know why "fake" perf differences occur, but Andreas suggested that maybe GCC just compiles things a bit differently in arbitrary ways for some patches. I have seen some evidence of that, and I haven't thought of much of anything better. 

I think the main routes for digging further on this kind of thing are:

 - increase the workload size to try to get a larger absolute perf difference. The first example had a 2ms slowdown. Increasing workload by 10-100x may give a slowdown large enough to be measured easily, or make the slowdown go away.

 - look at cycle counts and also instruction counts and stalls using Shark/VTune/oprofile. If instruction counts and stalls are not increased, then either the perf difference is not real or it is caused by OS-level events, which should be easier to check on.

 - Delta debug between the baseline and the slowed-down version. It may be possible to find a smaller patch that exhibits the same perf effect.
(Reporter)

Comment 3

9 years ago
> maybe GCC just compiles things a bit differently

Numbers from comment #0 don't support that.  nsieve-bits is listed as
running 6% slower w/ test patch, but instruction counts, mem ref
counts, etc are practically unchanged.
(Reporter)

Comment 4

9 years ago
(In reply to comment #1)
> If I run 3d-cube and 3d-morph within the SS harness I get the slowdown:
> 
>  cube:  1.011x as fast    38.9ms +/- 0.2%   38.5ms +/- 0.3%     significant
>  morph: *1.068x as slow*  27.3ms +/- 0.4%   29.2ms +/- 0.3%     significant
> 
> But If I just run 3d-morph, I get a much smaller slowdown:
> 
>  morph: *1.009x as slow*  28.5ms +/- 0.4%   28.7ms +/- 0.3%     significant

I can reproduce that, yes.  But stranger even is this.  As part of a
group (all-tests) run, I get this for raytrace:

  raytrace:  1.007x as fast   107.3ms +/- 0.0%  106.5ms +/- 0.0%   significant

but by itself:

  raytrace:  1.006x as fast   100.9ms +/- 0.0%  100.3ms +/- 0.0%   significant

the ratio is the same, but it runs 7% faster when not part of the 
suite.

---------

The SS driver code (resources/sunspider-standalone-driver.js) runs
each test once, all in one big jsshell invokation.  And the driver
perl script runs sunspider-standalone-driver.js the relevant number of
times.  It isn't the case that each test is run individually.

The driver loop is this:

  for (var j = 0; j < tests.length; j++) {
      var testName = "tests/" + tests[j] + ".js";
      var startTime = new Date;
      load(testName);
      times[j] = new Date() - startTime;
      gc();
  }

I guess the gc() is intended to get the runtime in a known state
before each loop iteration.  But whether it really succeeds, from the
point of view of the cost of subsequent GCs, and/or how much the
mutator can run before it GC runs again, is a different question.  Any
suggestions?

Basically what the loop does is to measure the time of each test in
the situation where the runtime has been preconditioned by running all
previous tests.
(In reply to comment #3)
> > maybe GCC just compiles things a bit differently
> 
> Numbers from comment #0 don't support that.  nsieve-bits is listed as
> running 6% slower w/ test patch, but instruction counts, mem ref
> counts, etc are practically unchanged.

Numbers from comment #0 don't address the question, because AFAICT they don't reproduce the perf difference. I suspect that the difference shows up only in an SS run. I now do think it's unlikely to be a GCC issue, because the time spent in C code is less than 2ms total. 

But I tried to reproduce it personally and I couldn't show any reliable difference at all with the standard version of the test, either standalone or in SS. Changing the test to call 'nsieve' 100 times in the loop in a loop yields a 2% speedup (2% faster with patch than without).

Comment 6

9 years ago
how do you do in a full browser build?
I tried a little more measurement on my 2% speedup. I found no difference in CPU cycles or instructions retired. I did find a difference of roughly the right size in VM page faults and zero fills using the Shark standard system call profile configuration.

Another trick that often helps with this kind of thing is to measure specific intervals of execution with rdtsc and bisect to find a small interval of program (e.g. one trace) that contains the perf difference.
Assignee: general → nobody
Closing stale performance bug for no longer relevant Sunspider benchmark as INCOMPLETE.
Status: NEW → RESOLVED
Last Resolved: 8 months ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.