Closed Bug 580532 Opened 14 years ago Closed 2 years ago

Improve visibility of small wins in JS engine (reduce SunSpider noise)

Categories

(Core :: JavaScript Engine, defect)

x86
Linux
Priority: Not set
Severity: normal

Tracking

RESOLVED INACTIVE

People

(Reporter: paul.biggar, Unassigned)

References

Details

Attachments

(2 files, 11 obsolete files)

2.66 KB, text/plain
1.55 KB, patch
Wins of less than 1-2% are hard to see. On my machine, I see variance of 2-3% between runs of SunSpider. This bug collects a few experiments to decrease that variance.
This reads the test file, and _then_ starts a timer, then evals the code. 

Note that calling eval changes the test results significantly (see bug 580529), so instead I call the 'snarl' function, which does the same thing; I'll attach it in the next patch.

This is a patch against sunspider 63804.
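In outline, the change is just this reordering (a sketch, not the patch itself; the patch calls snarl rather than eval, read() is assumed to be the shell's file-reading builtin, and testName is a placeholder):

var source = read("tests/" + testName + ".js");   // I/O happens here...
var start = new Date();
eval(source);                                     // ...but only execution is timed
var elapsed = new Date() - start;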
Attached patch (good) The snarl function (obsolete) — Splinter Review
Does the same as eval, except without a massive slowdown from bug 580529.
The sunspider script does not give a useful result for variance in the test cases.
I would guess that it gives _overall_ variance in the test cases, meaning that +1ms in one test cancels out -1ms in another.

calc_benchmark_variance.js works out the stddev of each test individually. It also averages them to give an overall number, though I think this would earn a look of disapproval from someone who knows anything about stats.

These numbers give a better indication of the variance of one or more test runs. This is primarily useful for seeing if some change reduces variability in the test, which is the point of this bug.
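Roughly, the script does something like this, assuming `results` maps each test name to its array of per-run times (the attached file may differ in detail):

function stddev(xs) {
  var mean = xs.reduce(function (a, b) { return a + b; }, 0) / xs.length;
  var sqSum = xs.reduce(function (a, x) { return a + (x - mean) * (x - mean); }, 0);
  return Math.sqrt(sqSum / xs.length);
}

var sum = 0, count = 0;
for (var test in results) {
  var sd = stddev(results[test]);
  print(test + ": " + sd);
  sum += sd;
  count++;
}
print("average of per-test stddevs: " + (sum / count));   // the dubious overall number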
(In reply to comment #1)
> This reads the test file, and _then_ starts a timer, then evals the code. 

This reduces variability significantly in my tests (from 0.8 to 0.5), measured using attachment 458949 [details].
sstangl suggests a RAMdisk (see https://bugzilla.mozilla.org/show_bug.cgi?id=580447#c1).
V8 and SunSpider both have some randomness in them, so it may help to seed the random number generator to make the tests reproducible.

tests/v8-v4/v8-crypto.js
1411:  while(rng_pptr < rng_psize) {  // extract some randomness from Math.random()
1412:    t = Math.floor(65536 * Math.random());
1468:// PKCS#1 (type 2, random) pad input string s to n bytes, and return a bigint
1480:  while(n > 2) { // random non-zero pad
1542:// Undo PKCS#1 (type 2, random) padding and, if valid, return the plaintext
1585:// Generate a new random private key B bits long, using public expt E

tests/v8-v4/v8-earley-boyer.js
1949:           (peephole (hole 1 "Math.floor(Math.random()*" 'n ")")))
1951:function sc_random(n) {
1952:    return Math.floor(Math.random()*n);

tests/v8-v4/v8-splay.js
60:  // The benchmark framework guarantees that Math.random is
62:  return Math.random();

tests/sunspider-0.9.1/string-validate-input.js
67:      var l = Math.floor(26*Math.random());
78:      var l = Math.floor(9*Math.random());

tests/sunspider-0.9.1/string-base64.js
122:        str += String.fromCharCode( (25 * Math.random()) + 97 );
This has no effect (or maybe a slight negative effect).
I checked out what V8 does, to see if we can learn from them.

- they do a few more warm up runs.
- they use this Math.random implementation:

// To make the benchmark results predictable, we replace Math.random
// with a 100% deterministic alternative.
Math.random = (function() {
  var seed = 49734321;
  return function() {
    // Robert Jenkins' 32 bit integer hash function.
    seed = ((seed + 0x7ed55d16) + (seed << 12))  & 0xffffffff;
    seed = ((seed ^ 0xc761c23c) ^ (seed >>> 19)) & 0xffffffff;
    seed = ((seed + 0x165667b1) + (seed << 5))   & 0xffffffff;
    seed = ((seed + 0xd3a2646c) ^ (seed << 9))   & 0xffffffff;
    seed = ((seed + 0xfd7046c5) + (seed << 3))   & 0xffffffff;
    seed = ((seed ^ 0xb55a4f09) ^ (seed >>> 16)) & 0xffffffff;
    return (seed & 0xfffffff) / 0x10000000;
  };
})();
This has no effect. In v8-the-engine's test script, they run the test multiple times without restarting the benchmark, so more warm-up is probably better. However, we use JSC's test scripts, which means the tests are run in separate shell invocations.

I think changing the SunSpider tests to duplicate this would move the goalposts, so it's probably a bad idea.
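For reference, the V8-style approach would amount to something like this within a single shell invocation (source and WARMUP_RUNS are placeholders, not taken from the attached patch):

var WARMUP_RUNS = 3;                  // assumed count
for (var i = 0; i < WARMUP_RUNS; i++)
  eval(source);                       // warm-up, timings discarded
var start = new Date();
eval(source);
var elapsed = new Date() - start;     // only the warmed-up run is reported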
Attachment #459277 - Attachment description: Bad: Run sunspider tests multiple times → (bad) Run sunspider tests multiple times
Attachment #458939 - Attachment description: Remove IO overhead from sunspider → (good) Remove IO overhead from sunspider
Attachment #458945 - Attachment description: The snarl function → (good) The snarl function
Attachment #458949 - Attachment description: Script to calculate variance of sunspider results → (good) Script to calculate variance of sunspider results
Attachment #459270 - Attachment description: Seed the shell's random number generator. → (bad) Seed the shell's random number generator.
This comes from v8, but it doesn't measurably help variability.
Disabling address-space-randomization seems to help a lot. Combined with the other random things above, I'm getting results which barely change in their overall running time. (That said, I'm still getting the same level of overall variance from the calc_benchmark_variance script, so this might just be luck).

On Linux:

setarch i686 -Rl bash
nicing the process doesn't help, and actually seems to hinder, though I may be running this wrong. I tried both of:

nice -20 sunsp
sudo /usr/bin/nice -20 sunsp

(sunsp is a bash script I use to run sunspider)
Attachment #459281 - Attachment description: (bad) Replace Math.random with a deterministic function → (neutral) Replace Math.random with a deterministic function
Attachment #459277 - Attachment description: (bad) Run sunspider tests multiple times → (neutral) Run sunspider tests multiple times
Attachment #459270 - Attachment description: (bad) Seed the shell's random number generator. → (neutral) Seed the js shell's random number generator.
Date.now() truncates the time to an integer number of milliseconds; this patch keeps the full double, which makes really short tests significantly more stable.
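Roughly, the difference is this (runTest() is a placeholder, and dateNow() stands in for whatever sub-millisecond timer the patch uses):

var t0 = Date.now();               // integer milliseconds
runTest();
print(Date.now() - t0);            // e.g. 2 -- the fraction is lost

var t1 = dateNow();                // keeps the double
runTest();
print(dateNow() - t1);             // e.g. 2.384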
Relevant literature:

- "Statistically Rigorous Java Performance Evaluation", Andy Georges et al.

- "Producing Wrong Data Without Doing Anything Obviously Wrong!", Todd Mytkowicz et al. (aka the most mind-blowing paper of all time).
Had a chat with dmandelin about some rudimentary stats. It seems that sunspider is applying tools used for the normal distribution to a distribution which is probably not normal, which is probably why the "significant/not significant" results are not very good.

Anyway, he supports the made-up variability measure I'm using in this script, which just sums the absolute differences between each sub-test's mean and its individual results:

sum = 0
foreach i in 1..n
    mean[i] = mean of x_i across the 60 runs
    foreach j in 1..60
        sum += abs(mean[i] - x_i[j])
variability = sum / 60
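As a concrete rendering, the same thing in JS, assuming runs[i][j] holds sub-test i's time on run j and each sub-test has 60 runs (a sketch, not the attached script):

function variability(runs) {
  var sum = 0;
  for (var i = 0; i < runs.length; i++) {
    var mean = runs[i].reduce(function (a, b) { return a + b; }, 0) / runs[i].length;
    for (var j = 0; j < runs[i].length; j++)
      sum += Math.abs(mean - runs[i][j]);
  }
  return sum / 60;   // same divisor as the pseudocode above
}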
Attachment #458949 - Attachment is obsolete: true
This is great stuff!

Bug 534547 is related.  Paul, feel free to mark that as a dup of this bug if you think it's appropriate;  you've certainly dug deeper into this than I have.

Are you planning to file these changes as bugs against SunSpider?  Seems like the best thing to do.

Measuring V8 with the SS harness is kind of bogus; apparently V8 proper ignores some setup/teardown code.  E.g. see bug 562553, where I did an optimization that helps v8-splay.js enormously when measured via the SS harness, but I think it may not have helped when measuring V8 proper.  (Follow the link to the V8 bug for full info.)
BTW, I regularly use Cachegrind to compare SS instruction counts for changes to TM.  The variance is *much* less than you see when timing SS.  So I suspect most of the variance is due to hard-to-handle stuff like cache and branch predictor behaviour changes.

On the other hand, V8 has significant variance even under Cachegrind (at least, V8 when you run the tests as is, without using the V8 harness).
(In reply to comment #17)
> BTW, I regularly use Cachegrind to compare SS instruction counts for changes to TM.

In a previous project (phpcompiler.org), I used Cachegrind for this. However, instruction count doesn't really cut it: there are cache misses and branch mispredictions, at the least. Of course, that makes it so much harder to get a single "is this better or worse" number. It could work if we say "branch misprediction is X cycles, L2 cache miss is Y cycles", etc. What do you think?
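Concretely, I mean something like this, where the per-event costs are made-up numbers (which is of course the hard part):

function estimatedCycles(counts) {
  var BRANCH_MISPREDICT_COST = 15;   // assumed, machine-dependent
  var L2_MISS_COST = 200;            // assumed, machine-dependent
  return counts.instructions
       + BRANCH_MISPREDICT_COST * counts.branchMispredicts
       + L2_MISS_COST * counts.l2Misses;
}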


> On the other hand, V8 has significant variance even under Cachegrind (at least,
> V8 when you run the tests as is, without using the V8 harness).

Does comment 8 help here?
(In reply to comment #16)
> Are you planning to file these changes as bugs against SunSpider?  Seems like
> the best thing to do.

I don't know; I'll look into it in the future. Fixing it may be too big a job, and it would invalidate past results, i.e. those on arewefastyet.com, which may or may not be a big deal.
(In reply to comment #18)
> However,
> instruction count doesn't really cut it: there are cache misses and branch
> mispredictions, at the least.  Of course, that makes it so much harder to get a
> single "is this better or worse" number. It could work if we say "branch
> misprediction is X cycles, L2 cache miss is Y cycles", etc. What do you think?

That's a swamp, believe me.  (I started a PhD on such a topic, and quickly switched to a different topic when I saw how awful it was.)

I know that instruction counts are only part of the picture, but it turns out, for the things I'm doing, they're actually really useful.  One explanation is that an imperfect metric measured precisely can often be more useful than an imperfect metric measured imprecisely.  Another is that most of my changes are attempts to execute less code, and instruction count differences are great for that.

Occasionally I make a change that aims to improve cache or branch prediction behaviour (eg. get rid of a big switch) and Cachegrind gives good (simulated) numbers for that too.

Finally, everyone else measures time, so I find it useful to be the one guy who measures instruction counts;  it just gives me a different view on things.

> > On the other hand, V8 has significant variance even under Cachegrind (at least,
> > V8 when you run the tests as is, without using the V8 harness).
> 
> Does comment 8 help here?

I know that it helps v8-crypto a lot at least;  I've previously hacked it to use a deterministic Math.random().
(In reply to comment #19)
> (In reply to comment #16)
> > Are you planning to file these changes as bugs against SunSpider?  Seems like
> > the best thing to do.
> 
> I don't know; I'll look into it in the future. Fixing it may be too big a job,
> and it would invalidate past results, i.e. those on arewefastyet.com, which may
> or may not be a big deal.

Apple can keep hosting 0.9 or whatever they host forever, but new versions should come up under separate version numbers and URLs. All benchmarks should be annual editions (or every six months, maybe). If you are willing to make the case in a bugs.webkit.org bug, I'll help get you air support.

/be
OK, I've cooked up a proper test. I ran sunspider with --runs=1000 for each
change, and measured the "variability" using the algorithm from comment 15.
Each change builds on the previous one.


Baseline:
Total time: 584319
Total variance: 13660.900000001668
Variance / #scripts: 13.661
Variance * 1000000000 / #scripts / total time: 23379.182


Better date resolution:
Total time: 583266.9450683594
Total variance: 7989.476169433345
Variance / #scripts: 7.989
Variance * 1000000000 / #scripts / total time: 13697.804

Use eval to avoid IO:
Total time: 579124.4201660156
Total variance: 6005.46807714848
Variance / #scripts: 6.005
Variance * 1000000000 / #scripts / total time: 10369.910

Seed random number generator:
Total time: 578326.2463378906
Total variance: 6097.8916064452715
Variance / #scripts: 6.098
Variance * 1000000000 / #scripts / total time: 10544.034

Use deterministic Math.random:
Total time: 584234.15625
Total variance: 5991.305539062612
Variance / #scripts: 5.991
Variance * 1000000000 / #scripts / total time: 10254.973

Warm up "properly":
Total time: 580214.2788085938
Total variance: 6612.213340820201
Variance / #scripts: 6.612
Variance * 1000000000 / #scripts / total time: 11396.158

Address-space randomization:
Total time: 579880.6350097656
Total variance: 6310.679111327958
Variance / #scripts: 6.311
Variance * 1000000000 / #scripts / total time: 10882.721

It seems that seeding the random number generator and warming up "properly" are
not useful, so I'll remove them and run it again. The biggest wins are using
better date resolution (by far!), using eval (big), and
address-space-randomization (not bad).

If you find fault with the numbers, let me know.
I should try it with 30 runs too, since that's what people use. I could try finding a sweet spot in the number of runs, but I think that will vary from machine to machine.
OK, new run, having taken out the patches from above that didn't provide benefits.

Baseline:
Total time: 584454
Total variance: 13775.409999999318
Variance / #scripts: 13.775
Variance * 1000000000 / #scripts / total time: 23569.708

Better date resolution:
Total time: 583279.1579589844
Total variance: 8011.607755859456
Variance / #scripts: 8.012
Variance * 1000000000 / #scripts / total time: 13735.460

Use eval to avoid IO:
Total time: 578957.7702636719
Total variance: 5925.116317871156
Variance / #scripts: 5.925
Variance * 1000000000 / #scripts / total time: 10234.108

Use deterministic Math.random:
Total time: 585289.6984863281
Total variance: 5637.047147460895
Variance / #scripts: 5.637
Variance * 1000000000 / #scripts / total time: 9631.209

Address-space randomization:
Total time: 585399.4216308594
Total variance: 5990.405707519354
Variance / #scripts: 5.990
Variance * 1000000000 / #scripts / total time: 10233.023

Looks like address space randomization isn't so useful; everything else looks good.
njn: I had a look at bug 534547, and I'll fold that into the work I'm doing here.

Bug 534547 notes variance due to the tests using Date functions to add randomness, which obviously makes them completely unreproducible (which is actually worse than a lot of what I was measuring here). The tests which use Date in some fashion are:

3d-raytrace.js 
crypto-aes.js
date-format-tofte.js
date-format-xparb.js
math-cordic.js
string-tagcloud.js


v8-v4 also uses Date in:

v8-earley-boyer.js
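By analogy with the deterministic Math.random above, the fix would presumably be to pin Date as well. A sketch (it ignores constructor arguments and drops the other Date statics, which a real fix would need to handle):

(function () {
  var RealDate = Date;
  var FIXED = 1280000000000;                     // arbitrary fixed epoch
  Date = function () { return new RealDate(FIXED); };
  Date.now = function () { return FIXED; };
  Date.prototype = RealDate.prototype;
})();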
And if we go fixing sunspider, then using the v8 setup/teardown code will fix bug 562553.
Attachment #459281 - Attachment description: (neutral) Replace Math.random with a deterministic function → (good) Replace Math.random with a deterministic function
Apply this patch to significantly increase the consistency of your sunspider runs. You don't need to do anything else.

Note that it contains all of SunSpider.
Assignee: general → pbiggar
Attachment #458939 - Attachment is obsolete: true
Attachment #458945 - Attachment is obsolete: true
Attachment #459270 - Attachment is obsolete: true
Attachment #459277 - Attachment is obsolete: true
Attachment #459281 - Attachment is obsolete: true
Attachment #459554 - Attachment is obsolete: true
Status: NEW → UNCONFIRMED
Ever confirmed: false
Depends on: 585001
I've been filing related bugs to SunSpider. The meta bug is:
https://bugs.webkit.org/show_bug.cgi?id=43253
Attachment #463636 - Attachment is patch: false
Attachment #463636 - Attachment mime type: text/plain → application/bzip2
Fixes a bug caused by rebasing.
Attachment #463636 - Attachment is obsolete: true
Patch to apply to a checkout to make SunSpider much less random. This assumes you have a SunSpider directory in js/src.

(The patch no longer has a full copy of SunSpider, since that bothered people. Most of the other functionality has moved into SpiderMonkey, so is no longer required here.)
Attachment #463665 - Attachment is obsolete: true
I'm finding some significant outliers, and other anomalies. In all cases I ran the benchmark 1000 times.

access-binary-trees has odd pathological cases. 990 times it takes 14-16ms, then it gets weird:

 18.072021484375,
 40.85498046875,
 40.9990234375,
 41.1201171875,
 41.3681640625,
 42.1279296875,
 44.419921875,
 45.10595703125,
 45.52197265625,

This costs me 6 ms when I average it all up.

Same with bitops-bitwise-and. For 970 cases it takes 2-3ms, then BOOM:

    "bitops-bitwise-and": 6.81591796875, 
    "bitops-bitwise-and": 7.798095703125,
    "bitops-bitwise-and": 7.9970703125, 
    "bitops-bitwise-and": 8.06005859375, 
    "bitops-bitwise-and": 8.075927734375, 
    "bitops-bitwise-and": 8.2060546875, 
    "bitops-bitwise-and": 8.875, 


crypto-aes and string-tagcloud get much faster when address-space-randomization is turned on. Run-time (both mean and median) goes from 26ms to 22ms in the former case, and 53 to 47ms in the latter. There aren't bizarre outliers in either case, like in access-binary-trees: every instance speeds up.

Meanwhile, access-fannkuch, math-partial-sums, math-spectral-norm and string-unpack-code go nuts (become very very very noisy) when ASR is turned on. access-binary-trees, bitops-bitwise-and, access-nsieve and bitops-nsieve-bits go a bit nuts, but nothing like the others. I'm not sure exactly what would cause this.
eval() can be used again now (bug 580529), which means the patch is usable on shells too (so long as they support read()).
Attachment #465599 - Attachment is obsolete: true
Summary: Improve visibility of small wins in JS engine → Improve visibility of small wins in JS engine (reduce SunSpider noise)
Now that SunSpider uses a run function, we no longer need to use eval(), and no longer need the submillisecond timing, as they are both subsumed by run().

This updates to apply to trunk SunSpider as of today (Feb 11, 2011).
Attachment #469028 - Attachment is obsolete: true
Assignee: paul.biggar → general
Assignee: general → nobody
Status: UNCONFIRMED → RESOLVED
Closed: 2 years ago
Resolution: --- → INACTIVE