Improve visibility of small wins in JS engine (reduce SunSpider noise)

UNCONFIRMED
Unassigned

Status

()

Core
JavaScript Engine
UNCONFIRMED
8 years ago
3 years ago

People

(Reporter: Paul Biggar, Unassigned)

Tracking

Trunk
x86
Linux
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 11 obsolete attachments)

2.66 KB, text/plain
Details
1.55 KB, patch
Details | Diff | Splinter Review
(Reporter)

Description

8 years ago
Wins of less than 1-2% are hard to witness. On my machine, I see variance of 2-3% between runs of sunspider. This bug throws together a few experiments to decrease variance.
(Reporter)

Comment 1

8 years ago
Created attachment 458939 [details] [diff] [review]
(good) Remove IO overhead from sunspider

This reads the test file, and _then_ starts a timer, then evals the code. 

Note that calling eval changes the test results significantly (see bug 580529), so I've called the 'snarl' function which does the same thing, which I'll attach in the next patch.

This is a patch against sunspider 63804.
(Reporter)

Comment 2

8 years ago
Created attachment 458945 [details] [diff] [review]
(good) The snarl function

Does the same as eval, except without a massive slowdown from bug 580529.
(Reporter)

Comment 3

8 years ago
Created attachment 458949 [details]
(good) Script to calculate variance of sunspider results

The sunspider script does not give a useful result for variance in the test cases.
I would guess that it gives _overall_ variance in the test cases, meaning that +1ms in one test cancels out -1ms in another.

calc_benchmark_variance.js works out the stddev of each test individually. It also averages them to give an overall number, though I think this would earn a look of disapproval from someone who knows anything about stats.

These numbers give a better indication of the variance of one or more test runs. This is primarily useful for seeing if some change reduces variability in the test, which is the  point of this bug.
(Reporter)

Comment 4

8 years ago
(In reply to comment #1)
> This reads the test file, and _then_ starts a timer, then evals the code. 

This reduces variability significantly in my tests (from .8 to .5) using the attachment 458949 [details].
(Reporter)

Comment 5

8 years ago
sstangl suggests a RAMdisk (see https://bugzilla.mozilla.org/show_bug.cgi?id=580447#c1).
(Reporter)

Comment 6

8 years ago
v8 and sunspider both have some randomness in them, so it may help to seed the test to make them reproducible.

tests/v8-v4/v8-crypto.js
1411:  while(rng_pptr < rng_psize) {  // extract some randomness from Math.random()
1412:    t = Math.floor(65536 * Math.random());
1468:// PKCS#1 (type 2, random) pad input string s to n bytes, and return a bigint
1480:  while(n > 2) { // random non-zero pad
1542:// Undo PKCS#1 (type 2, random) padding and, if valid, return the plaintext
1585:// Generate a new random private key B bits long, using public expt E

tests/v8-v4/v8-earley-boyer.js
1949:           (peephole (hole 1 "Math.floor(Math.random()*" 'n ")")))
1951:function sc_random(n) {
1952:    return Math.floor(Math.random()*n);

tests/v8-v4/v8-splay.js
60:  // The benchmark framework guarantees that Math.random is
62:  return Math.random();

tests/sunspider-0.9.1/string-validate-input.js
67:      var l = Math.floor(26*Math.random());
78:      var l = Math.floor(9*Math.random());

tests/sunspider-0.9.1/string-base64.js
122:        str += String.fromCharCode( (25 * Math.random()) + 97 );
(Reporter)

Comment 7

8 years ago
Created attachment 459270 [details] [diff] [review]
(neutral) Seed the js shell's random number generator.

This has no effect (or maybe a slight negative effect).
(Reporter)

Comment 8

8 years ago
I checked out what v8 does, see if we can learn from them.

- they do a few more warm up runs.
- they use this Math.random implementation:

// To make the benchmark results predictable, we replace Math.random
// with a 100% deterministic alternative.
Math.random = (function() {
  var seed = 49734321;
  return function() {
    // Robert Jenkins' 32 bit integer hash function.
    seed = ((seed + 0x7ed55d16) + (seed << 12))  & 0xffffffff;
    seed = ((seed ^ 0xc761c23c) ^ (seed >>> 19)) & 0xffffffff;
    seed = ((seed + 0x165667b1) + (seed << 5))   & 0xffffffff;
    seed = ((seed + 0xd3a2646c) ^ (seed << 9))   & 0xffffffff;
    seed = ((seed + 0xfd7046c5) + (seed << 3))   & 0xffffffff;
    seed = ((seed ^ 0xb55a4f09) ^ (seed >>> 16)) & 0xffffffff;
    return (seed & 0xfffffff) / 0x10000000;
  };
})();
(Reporter)

Comment 9

8 years ago
Created attachment 459277 [details] [diff] [review]
(neutral) Run sunspider tests multiple times

This has no effect. In v8-the-engine's test script, they run the test multiple times without restarting the benchmark, so more warm up is probably better. However, we use JSC's test scripts, which means the tests are run in separate shell invocation. 

I think changing the sunspider tests to duplicate this would move the goalposts, so probably a bad idea.
(Reporter)

Updated

8 years ago
Attachment #459277 - Attachment description: Bad: Run sunspider tests multiple times → (bad) Run sunspider tests multiple times
(Reporter)

Updated

8 years ago
Attachment #458939 - Attachment description: Remove IO overhead from sunspider → (good) Remove IO overhead from sunspider
(Reporter)

Updated

8 years ago
Attachment #458945 - Attachment description: The snarl function → (good) The snarl function
(Reporter)

Updated

8 years ago
Attachment #458949 - Attachment description: Script to calculate variance of sunspider results → (good) Script to calculate variance of sunspider results
(Reporter)

Updated

8 years ago
Attachment #459270 - Attachment description: Seed the shell's random number generator. → (bad) Seed the shell's random number generator.
(Reporter)

Comment 10

8 years ago
Created attachment 459281 [details] [diff] [review]
(good) Replace Math.random with a deterministic function

This comes from v8, but it doesn't measurably help variability.
(Reporter)

Comment 11

8 years ago
Disabling address-space-randomization seems to help a lot. Combined with the other random things above, I'm getting results which barely change in their overall running time. (That said, I'm still getting the same level of overall variance from the calc_benchmark_variance script, so this might just be luck).

On Linux:

setarch i686 -Rl bash
(Reporter)

Comment 12

8 years ago
niceing the process doesn't help, and actually seems to hinder, though I may be running this wrong. I tried both of:

nice -20 sunsp
sudo /usr/bin/nice -20 sunsp

(sunsp is a bash script I use to run sunspider)
(Reporter)

Updated

8 years ago
Attachment #459281 - Attachment description: (bad) Replace Math.random with a deterministic function → (neutral) Replace Math.random with a deterministic function
(Reporter)

Updated

8 years ago
Attachment #459277 - Attachment description: (bad) Run sunspider tests multiple times → (neutral) Run sunspider tests multiple times
(Reporter)

Updated

8 years ago
Attachment #459270 - Attachment description: (bad) Seed the shell's random number generator. → (neutral) Seed the js shell's random number generator.
(Reporter)

Comment 13

8 years ago
Created attachment 459554 [details] [diff] [review]
(good) Use sub-millisecond resolution.

Date.now() truncates the number of milliseconds to an integer. This keeps the double. This makes really short tests significantly more stable.
(Reporter)

Comment 14

8 years ago
Relevant literature:

- Statistically Rigorous Java Performance Evaluation by Andy Georges et al

- Producing wrong data without doing anything obviously wrong! by Todd Mytkowicz et al (aka most mind blowing paper of all time).
(Reporter)

Comment 15

8 years ago
Created attachment 459618 [details]
Script to calculate variance of sunspider results

Had a chat with dmandelin about some rudimentary stats. It seems that sunspider is applying tools used for the normal distribution to a distribution which is probably not normal, which is probably why the "significant/not significant" results are not very good.

Anyway, he supports the made-up variability thing I'm using in this script, which just sums the difference between the each sub-test's mean and its actual result:

sum = 0
foreach i in 1..n
   calculate mean across the 60 runs of x_i
   foreach j in 1..60
       sum += abs(mean[i] - x_i[j])
variability = sum / 60
Attachment #458949 - Attachment is obsolete: true
This is great stuff!

Bug 534547 is related.  Paul, feel free to mark that as a dup of this bug if you think it's appropriate;  you've certainly dug deeper into this than I have.

Are you planning to file these changes as bugs against SunSpider?  Seems like the best thing to do.

Measuring V8 with the SS harness is kind of bogus, apparently V8 proper ignores some setup/teardown code.  Eg. see bug 562553 where I did an optimization that helps v8-splay.js enormously when measured via the SS harness, but I think it may not have helped when measuring V8 proper.  (Follow the link to the V8 bug for full info.)
BTW, I regularly use Cachegrind to compare SS instruction counts for changes to TM.  The variance is *much* less than you see when timing SS.  So I suspect most of the variance is due to hard-to-handle stuff like cache and branch predictor behaviour changes.

On the other hand, V8 has significant variance even under Cachegrind (at least, V8 when you run the tests as is, without using the V8 harness).
(Reporter)

Comment 18

8 years ago
(In reply to comment #17)
> BTW, I regularly use Cachegrind to compare SS instruction counts for changes to TM.

In a previous project (phpcompiler.org), I used cachegrind for this. However, instruction count doesn't really cut it: there is cache misses and branch mispredictions at the least. Of course, that makes it so much harder to get a single "is this better or worse" number. It could work if we say "branch misprediction is X cycles, l2 cache miss is Y cycles", etc. What do you think?


> On the other hand, V8 has significant variance even under Cachegrind (at least,
> V8 when you run the tests as is, without using the V8 harness).

Does comment 8 help here?
(Reporter)

Comment 19

8 years ago
(In reply to comment #16)
> Are you planning to file these changes as bugs against SunSpider?  Seems like
> the best thing to do.

I don't know. I'll look in the future. Fixing it may be too mig a job, and would invalidate past results, IE those in arewefastyet.com, which may or may not be a big deal.
(In reply to comment #18)
> However,
> instruction count doesn't really cut it: there is cache misses and branch
> mispredictions at the least.  Of course, that makes it so much harder to get a
> single "is this better or worse" number. It could work if we say "branch
> misprediction is X cycles, l2 cache miss is Y cycles", etc. What do you think?

That's a swamp, belive me.  (I started a PhD on such a topic, and quickly switched to a different topic when I saw how awful it was.)

I know that instruction counts is only part of the picture, but it turns out, for the things I'm doing, it's actually really useful.  One explanation is that an imperfect metric measured precisely can often be more useful than an imperfect metric measured imprecisely.  Another is that most of my changes are attempts to execute less code, and instruction count differences are great for that.

Occasionally I make a change that aims to improve cache or branch prediction behaviour (eg. get rid of a big switch) and Cachegrind gives good (simulated) numbers for that too.

Finally, everyone else measures time, so I find it useful to be the one guy who measures instruction counts;  it just gives me a different view on things.

> > On the other hand, V8 has significant variance even under Cachegrind (at least,
> > V8 when you run the tests as is, without using the V8 harness).
> 
> Does comment 8 help here?

I know that it helps v8-crypto a lot at least;  I've previously hacked it to use a deterministic Math.random().
(In reply to comment #19)
> (In reply to comment #16)
> > Are you planning to file these changes as bugs against SunSpider?  Seems like
> > the best thing to do.
> 
> I don't know. I'll look in the future. Fixing it may be too mig a job, and
> would invalidate past results, IE those in arewefastyet.com, which may or may
> not be a big deal.

Apple can keep hosting 0.9 or whatever they host forever, but new versions should come up under separate version numbers and URLs. All benchmarks should be annual editions (or every six months, maybe). If you are willing to make the case in a bugs.webkit.org bug, I'll help get you air support.

/be
(Reporter)

Comment 22

8 years ago
Your comment was:

OK, I've cooked up a proper test. I ran sunspider with --runs=1000 for each
change, and measured the "variability" using the algorithm from comment 15.
Each change builds on the previous one.


Baseline:
Total time: 584319
Total variance: 13660.900000001668
Variance / #scripts: 13.661
Variance * 1000000000 / #scripts / total time: 23379.182


Better date resolution:
Total time: 583266.9450683594
Total variance: 7989.476169433345
Variance / #scripts: 7.989
Variance * 1000000000 / #scripts / total time: 13697.804

Use eval to avoid IO:
Total time: 579124.4201660156
Total variance: 6005.46807714848
Variance / #scripts: 6.005
Variance * 1000000000 / #scripts / total time: 10369.910

Seed random number generator:
Total time: 578326.2463378906
Total variance: 6097.8916064452715
Variance / #scripts: 6.098
Variance * 1000000000 / #scripts / total time: 10544.034

Use deterministic Math.random:
Total time: 584234.15625
Total variance: 5991.305539062612
Variance / #scripts: 5.991
Variance * 1000000000 / #scripts / total time: 10254.973

Warm up "properly":
Total time: 580214.2788085938
Total variance: 6612.213340820201
Variance / #scripts: 6.612
Variance * 1000000000 / #scripts / total time: 11396.158

Address-space randomization:
Total time: 579880.6350097656
Total variance: 6310.679111327958
Variance / #scripts: 6.311
Variance * 1000000000 / #scripts / total time: 10882.721

It seems that seeding the random number generator and warming up "properly" are
not useful, so I'll remove them and run it again. The biggest wins are using
better date resolution (by far!), using eval (big), and
address-space-randomization (not bad).

If you have fault with the numbers, let me know.
(Reporter)

Comment 23

8 years ago
I should try it with 30 runs too, since that's what people use. I could try finding a sweet spot in the number of runs, but I think that will vary from machine to machine.
(Reporter)

Comment 24

8 years ago
OK, new run, having taken out the patches from above that didn't provide benefits.

Baseline:
Total time: 584454
Total variance: 13775.409999999318
Variance / #scripts: 13.775
Variance * 1000000000 / #scripts / total time:23569.708

Better date resolution:
Total time: 583279.1579589844
Total variance: 8011.607755859456
Variance / #scripts: 8.012
Variance * 1000000000 / #scripts / total time:13735.460

Use eval to avoid IO:
Total time: 578957.7702636719
Total variance: 5925.116317871156
Variance / #scripts: 5.925
Variance * 1000000000 / #scripts / total time:10234.108

Use deterministic Math.random:
Total time: 585289.6984863281
Total variance: 5637.047147460895
Variance / #scripts: 5.637
Variance * 1000000000 / #scripts / total time:9631.209

Address-space randomization:
Total time: 585399.4216308594
Total variance: 5990.405707519354
Variance / #scripts: 5.990
Variance * 1000000000 / #scripts / total time:10233.023

Looks like address space randomization isn't so useful, everything else looks good.
(Reporter)

Comment 25

8 years ago
njn: I had a look at bug 534547, and I'll fold that into the work I'm doing here.

Bug 534547 notes variance due to the tests using Date functions to add randomness, which obviously makes them completely unreproducible (which is actually worse than a lot of what I was measuring here). The tests which use Date in some fashion are:

3d-raytrace.js 
crypto-aes.js
date-format-tofte.js
date-format-xparb.js
math-cordic.js
string-tagcloud.js


v8-v4 also uses Date in:

v8-earley-boyer.
(Reporter)

Comment 26

8 years ago
And if we go fixing sunspider, then using the v8 setup/teardown code will fix bug 562553.
(Reporter)

Updated

8 years ago
Attachment #459281 - Attachment description: (neutral) Replace Math.random with a deterministic function → (good) Replace Math.random with a deterministic function
(Reporter)

Comment 27

8 years ago
Created attachment 463636 [details]
Single application patch to incorporate all the (good) changes here.

Apply this patch to significantly increase the consistency of your sunspider runs. You don't need to do anything else.

Note that it contains all of SunSpider.
Assignee: general → pbiggar
Attachment #458939 - Attachment is obsolete: true
Attachment #458945 - Attachment is obsolete: true
Attachment #459270 - Attachment is obsolete: true
Attachment #459277 - Attachment is obsolete: true
Attachment #459281 - Attachment is obsolete: true
Attachment #459554 - Attachment is obsolete: true
Status: NEW → UNCONFIRMED
Ever confirmed: false
(Reporter)

Updated

8 years ago
Depends on: 585001
(Reporter)

Comment 28

8 years ago
I've been filing related bugs to SunSpider. The meta bug is:
https://bugs.webkit.org/show_bug.cgi?id=43253
(Reporter)

Updated

8 years ago
Attachment #463636 - Attachment is patch: false
Attachment #463636 - Attachment mime type: text/plain → application/bzip2
(Reporter)

Comment 29

8 years ago
Created attachment 463665 [details]
Single-application patch to make sunspider less random.

Fixes a bug caused by rebasing.
Attachment #463636 - Attachment is obsolete: true
(Reporter)

Comment 30

7 years ago
Created attachment 465599 [details] [diff] [review]
Single-application patch to make sunspider less random.

Patch to apply to checkout to make SunSpider much less random. This assumes you have a SunSpider directory is js/src.

(The patch no longer has a full copy of SunSpider, since that bothered people. Most of the other functionality has moved into SpiderMonkey, so is no longer required here.)
Attachment #463665 - Attachment is obsolete: true
(Reporter)

Comment 31

7 years ago
I'm finding some significant outliers, and other anomalies. In all cases I ran the benchmark 1000 times.

access-binary-trees has odd pathological cases. 990 times it takes 14-16ms, then it gets weird:

 18.072021484375,
 40.85498046875,
 40.9990234375,
 41.1201171875,
 41.3681640625,
 42.1279296875,
 44.419921875,
 45.10595703125,
 45.52197265625,

This costs me 6 ms when I average it all up.

Same with bitops-bitwise-and. For 970 cases it takes 2-3ms, then BOOM:

    "bitops-bitwise-and": 6.81591796875, 
    "bitops-bitwise-and": 7.798095703125,
    "bitops-bitwise-and": 7.9970703125, 
    "bitops-bitwise-and": 8.06005859375, 
    "bitops-bitwise-and": 8.075927734375, 
    "bitops-bitwise-and": 8.2060546875, 
    "bitops-bitwise-and": 8.875, 


crypto-aes and string-tagcloud get much faster when address-space-randomization is turned on. Run-time (both mean and median) goes from 26ms to 22ms in the former case, and 53 to 47ms in the latter. There aren't bizarre outliers in either case, like in access-binary-trees: every instance speeds up.

Meanwhile, access-fannkuch, math-partial-sums, math-spectral-norm and string-unpack-code go nuts (become very very very noisy) when ASR is turned on. access-binary-tree, bitops-bitwise-and, access-nsieve and bitops-nsive-bits go a bit nuts, but nothing like the others. I'm not sure exactly what would cause this.
(Reporter)

Comment 32

7 years ago
Created attachment 469028 [details] [diff] [review]
Single-application patch to make sunspider less random.

eval() can be used again now (bug 580529), which means the patch is usable on shells too (so long as they support read()).
Attachment #465599 - Attachment is obsolete: true

Updated

7 years ago
Summary: Improve visibility of small wins in JS engine → Improve visibility of small wins in JS engine (reduce SunSpider noise)
(Reporter)

Comment 33

7 years ago
Created attachment 511780 [details] [diff] [review]
Single-patch to make sunspider less random (updated for the run() function)

Now that SunSpider uses a run function, we no longer need to use eval(), and no longer need the submillisecond timing, as they are both subsumed by run().

This updates to apply to trunk SunSpider as of today (Feb 11, 2011).
Attachment #469028 - Attachment is obsolete: true
(Reporter)

Updated

6 years ago
Assignee: paul.biggar → general
(Assignee)

Updated

3 years ago
Assignee: general → nobody
You need to log in before you can comment on or make changes to this bug.