Closed Bug 580532 Opened 14 years ago Closed 2 years ago

Improve visibility of small wins in JS engine (reduce SunSpider noise)

Categories

(Core :: JavaScript Engine, defect)

x86
Linux
Priority: Not set
Severity: normal

Tracking

RESOLVED INACTIVE

People

(Reporter: paul.biggar, Unassigned)

References

Details

Attachments

(2 files, 11 obsolete files)

2.66 KB, text/plain
1.55 KB, patch
Wins of less than 1-2% are hard to see. On my machine, I see variance of 2-3% between runs of SunSpider. This bug collects a few experiments to decrease that variance.
This reads the test file, and _then_ starts a timer, then evals the code. 

Note that calling eval changes the test results significantly (see bug 580529), so instead I call the 'snarl' function, which does the same thing; I'll attach it in the next patch.

This is a patch against sunspider 63804.
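In outline, the change is just this reordering (a sketch, not the patch itself; the patch calls snarl rather than eval, read() is assumed to be the shell's file-reading builtin, and testName is a placeholder):

var source = read("tests/" + testName + ".js");   // I/O happens here...
var start = new Date();
eval(source);                                     // ...but only execution is timed
var elapsed = new Date() - start;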
Attached patch (good) The snarl function (obsolete) — Splinter Review
Does the same as eval, except without a massive slowdown from bug 580529.
The sunspider script does not give a useful result for variance in the test cases.
I would guess that it gives _overall_ variance in the test cases, meaning that +1ms in one test cancels out -1ms in another.

calc_benchmark_variance.js works out the stddev of each test individually. It also averages them to give an overall number, though I think this would earn a look of disapproval from someone who knows anything about stats.

These numbers give a better indication of the variance of one or more test runs. This is primarily useful for seeing if some change reduces variability in the test, which is the point of this bug.
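Roughly, the script does something like this, assuming `results` maps each test name to its array of per-run times (the attached file may differ in detail):

function stddev(xs) {
  var mean = xs.reduce(function (a, b) { return a + b; }, 0) / xs.length;
  var sqSum = xs.reduce(function (a, x) { return a + (x - mean) * (x - mean); }, 0);
  return Math.sqrt(sqSum / xs.length);
}

var sum = 0, count = 0;
for (var test in results) {
  var sd = stddev(results[test]);
  print(test + ": " + sd);
  sum += sd;
  count++;
}
print("average of per-test stddevs: " + (sum / count));   // the dubious overall number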
(In reply to comment #1)
> This reads the test file, and _then_ starts a timer, then evals the code. 

This reduces variability significantly in my tests (from 0.8 to 0.5), measured using attachment 458949 [details].
sstangl suggests a RAMdisk (see https://bugzilla.mozilla.org/show_bug.cgi?id=580447#c1).
V8 and SunSpider both have some randomness in them, so it may help to seed the random number generator to make the tests reproducible.

tests/v8-v4/v8-crypto.js
1411:  while(rng_pptr < rng_psize) {  // extract some randomness from Math.random()
1412:    t = Math.floor(65536 * Math.random());
1468:// PKCS#1 (type 2, random) pad input string s to n bytes, and return a bigint
1480:  while(n > 2) { // random non-zero pad
1542:// Undo PKCS#1 (type 2, random) padding and, if valid, return the plaintext
1585:// Generate a new random private key B bits long, using public expt E

tests/v8-v4/v8-earley-boyer.js
1949:           (peephole (hole 1 "Math.floor(Math.random()*" 'n ")")))
1951:function sc_random(n) {
1952:    return Math.floor(Math.random()*n);

tests/v8-v4/v8-splay.js
60:  // The benchmark framework guarantees that Math.random is
62:  return Math.random();

tests/sunspider-0.9.1/string-validate-input.js
67:      var l = Math.floor(26*Math.random());
78:      var l = Math.floor(9*Math.random());

tests/sunspider-0.9.1/string-base64.js
122:        str += String.fromCharCode( (25 * Math.random()) + 97 );
This has no effect (or maybe a slight negative effect).
I checked out what V8 does, to see if we can learn from them.

- they do a few more warm up runs.
- they use this Math.random implementation:

// To make the benchmark results predictable, we replace Math.random
// with a 100% deterministic alternative.
Math.random = (function() {
  var seed = 49734321;
  return function() {
    // Robert Jenkins' 32 bit integer hash function.
    seed = ((seed + 0x7ed55d16) + (seed << 12))  & 0xffffffff;
    seed = ((seed ^ 0xc761c23c) ^ (seed >>> 19)) & 0xffffffff;
    seed = ((seed + 0x165667b1) + (seed << 5))   & 0xffffffff;
    seed = ((seed + 0xd3a2646c) ^ (seed << 9))   & 0xffffffff;
    seed = ((seed + 0xfd7046c5) + (seed << 3))   & 0xffffffff;
    seed = ((seed ^ 0xb55a4f09) ^ (seed >>> 16)) & 0xffffffff;
    return (seed & 0xfffffff) / 0x10000000;
  };
})();
This has no effect. In v8-the-engine's test script, they run the test multiple times without restarting the benchmark, so more warm-up is probably better. However, we use JSC's test scripts, which means the tests are run in separate shell invocations.

I think changing the SunSpider tests to duplicate this would move the goalposts, so it's probably a bad idea.
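For reference, the V8-style approach would amount to something like this within a single shell invocation (source and WARMUP_RUNS are placeholders, not taken from the attached patch):

var WARMUP_RUNS = 3;                  // assumed count
for (var i = 0; i < WARMUP_RUNS; i++)
  eval(source);                       // warm-up, timings discarded
var start = new Date();
eval(source);
var elapsed = new Date() - start;     // only the warmed-up run is reported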
Attachment #459277 - Attachment description: Bad: Run sunspider tests multiple times → (bad) Run sunspider tests multiple times
Attachment #458939 - Attachment description: Remove IO overhead from sunspider → (good) Remove IO overhead from sunspider
Attachment #458945 - Attachment description: The snarl function → (good) The snarl function
Attachment #458949 - Attachment description: Script to calculate variance of sunspider results → (good) Script to calculate variance of sunspider results
Attachment #459270 - Attachment description: Seed the shell's random number generator. → (bad) Seed the shell's random number generator.
This comes from v8, but it doesn't measurably help variability.
Disabling address-space-randomization seems to help a lot. Combined with the other random things above, I'm getting results which barely change in their overall running time. (That said, I'm still getting the same level of overall variance from the calc_benchmark_variance script, so this might just be luck).

On Linux:

setarch i686 -Rl bash
nicing the process doesn't help, and actually seems to hinder, though I may be running this wrong. I tried both of:

nice -20 sunsp
sudo /usr/bin/nice -20 sunsp

(sunsp is a bash script I use to run sunspider)
Attachment #459281 - Attachment description: (bad) Replace Math.random with a deterministic function → (neutral) Replace Math.random with a deterministic function
Attachment #459277 - Attachment description: (bad) Run sunspider tests multiple times → (neutral) Run sunspider tests multiple times
Attachment #459270 - Attachment description: (bad) Seed the shell's random number generator. → (neutral) Seed the js shell's random number generator.
Date.now() truncates the time to an integer number of milliseconds; this patch keeps the full double, which makes really short tests significantly more stable.
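Roughly, the difference is this (runTest() is a placeholder, and dateNow() stands in for whatever sub-millisecond timer the patch uses):

var t0 = Date.now();               // integer milliseconds
runTest();
print(Date.now() - t0);            // e.g. 2 -- the fraction is lost

var t1 = dateNow();                // keeps the double
runTest();
print(dateNow() - t1);             // e.g. 2.384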
Relevant literature:

- "Statistically Rigorous Java Performance Evaluation", Andy Georges et al.

- "Producing Wrong Data Without Doing Anything Obviously Wrong!", Todd Mytkowicz et al. (aka the most mind-blowing paper of all time).
Had a chat with dmandelin about some rudimentary stats. It seems that sunspider is applying tools used for the normal distribution to a distribution which is probably not normal, which is probably why the "significant/not significant" results are not very good.

Anyway, he supports the made-up variability measure I'm using in this script, which just sums the absolute differences between each sub-test's mean and its individual results:

sum = 0
foreach i in 1..n
    mean[i] = mean of x_i across the 60 runs
    foreach j in 1..60
        sum += abs(mean[i] - x_i[j])
variability = sum / 60
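As a concrete rendering, the same thing in JS, assuming runs[i][j] holds sub-test i's time on run j and each sub-test has 60 runs (a sketch, not the attached script):

function variability(runs) {
  var sum = 0;
  for (var i = 0; i < runs.length; i++) {
    var mean = runs[i].reduce(function (a, b) { return a + b; }, 0) / runs[i].length;
    for (var j = 0; j < runs[i].length; j++)
      sum += Math.abs(mean - runs[i][j]);
  }
  return sum / 60;   // same divisor as the pseudocode above
}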
Attachment #458949 - Attachment is obsolete: true
This is great stuff!

Bug 534547 is related.  Paul, feel free to mark that as a dup of this bug if you think it's appropriate;  you've certainly dug deeper into this than I have.

Are you planning to file these changes as bugs against SunSpider?  Seems like the best thing to do.

Measuring V8 with the SS harness is kind of bogus; apparently V8 proper ignores some setup/teardown code.  E.g. see bug 562553, where I did an optimization that helps v8-splay.js enormously when measured via the SS harness, but I think it may not have helped when measuring V8 proper.  (Follow the link to the V8 bug for full info.)
BTW, I regularly use Cachegrind to compare SS instruction counts for changes to TM.  The variance is *much* less than you see when timing SS.  So I suspect most of the variance is due to hard-to-handle stuff like cache and branch predictor behaviour changes.

On the other hand, V8 has significant variance even under Cachegrind (at least, V8 when you run the tests as is, without using the V8 harness).
(In reply to comment #17)
> BTW, I regularly use Cachegrind to compare SS instruction counts for changes to TM.

In a previous project (phpcompiler.org), I used Cachegrind for this. However, instruction count doesn't really cut it: there are cache misses and branch mispredictions, at the least. Of course, that makes it so much harder to get a single "is this better or worse" number. It could work if we say "branch misprediction is X cycles, L2 cache miss is Y cycles", etc. What do you think?
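Concretely, I mean something like this, where the per-event costs are made-up numbers (which is of course the hard part):

function estimatedCycles(counts) {
  var BRANCH_MISPREDICT_COST = 15;   // assumed, machine-dependent
  var L2_MISS_COST = 200;            // assumed, machine-dependent
  return counts.instructions
       + BRANCH_MISPREDICT_COST * counts.branchMispredicts
       + L2_MISS_COST * counts.l2Misses;
}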


> On the other hand, V8 has significant variance even under Cachegrind (at least,
> V8 when you run the tests as is, without using the V8 harness).

Does comment 8 help here?
(In reply to comment #16)
> Are you planning to file these changes as bugs against SunSpider?  Seems like
> the best thing to do.

I don't know; I'll look into it in the future. Fixing it may be too big a job, and it would invalidate past results, i.e. those on arewefastyet.com, which may or may not be a big deal.
(In reply to comment #18)
> However,
> instruction count doesn't really cut it: there are cache misses and branch
> mispredictions, at the least.  Of course, that makes it so much harder to get a
> single "is this better or worse" number. It could work if we say "branch
> misprediction is X cycles, L2 cache miss is Y cycles", etc. What do you think?

That's a swamp, believe me.  (I started a PhD on such a topic, and quickly switched to a different topic when I saw how awful it was.)

I know that instruction counts are only part of the picture, but it turns out, for the things I'm doing, they're actually really useful.  One explanation is that an imperfect metric measured precisely can often be more useful than an imperfect metric measured imprecisely.  Another is that most of my changes are attempts to execute less code, and instruction count differences are great for that.

Occasionally I make a change that aims to improve cache or branch prediction behaviour (eg. get rid of a big switch) and Cachegrind gives good (simulated) numbers for that too.

Finally, everyone else measures time, so I find it useful to be the one guy who measures instruction counts;  it just gives me a different view on things.

> > On the other hand, V8 has significant variance even under Cachegrind (at least,
> > V8 when you run the tests as is, without using the V8 harness).
> 
> Does comment 8 help here?

I know that it helps v8-crypto a lot at least;  I've previously hacked it to use a deterministic Math.random().
(In reply to comment #19)
> (In reply to comment #16)
> > Are you planning to file these changes as bugs against SunSpider?  Seems like
> > the best thing to do.
> 
> I don't know; I'll look into it in the future. Fixing it may be too big a job,
> and it would invalidate past results, i.e. those on arewefastyet.com, which may
> or may not be a big deal.

Apple can keep hosting 0.9 or whatever they host forever, but new versions should come up under separate version numbers and URLs. All benchmarks should be annual editions (or every six months, maybe). If you are willing to make the case in a bugs.webkit.org bug, I'll help get you air support.

/be
OK, I've cooked up a proper test. I ran sunspider with --runs=1000 for each
change, and measured the "variability" using the algorithm from comment 15.
Each change builds on the previous one.


Baseline:
Total time: 584319
Total variance: 13660.900000001668
Variance / #scripts: 13.661
Variance * 1000000000 / #scripts / total time: 23379.182


Better date resolution:
Total time: 583266.9450683594
Total variance: 7989.476169433345
Variance / #scripts: 7.989
Variance * 1000000000 / #scripts / total time: 13697.804

Use eval to avoid IO:
Total time: 579124.4201660156
Total variance: 6005.46807714848
Variance / #scripts: 6.005
Variance * 1000000000 / #scripts / total time: 10369.910

Seed random number generator:
Total time: 578326.2463378906
Total variance: 6097.8916064452715
Variance / #scripts: 6.098
Variance * 1000000000 / #scripts / total time: 10544.034

Use deterministic Math.random:
Total time: 584234.15625
Total variance: 5991.305539062612
Variance / #scripts: 5.991
Variance * 1000000000 / #scripts / total time: 10254.973

Warm up "properly":
Total time: 580214.2788085938
Total variance: 6612.213340820201
Variance / #scripts: 6.612
Variance * 1000000000 / #scripts / total time: 11396.158

Address-space randomization:
Total time: 579880.6350097656
Total variance: 6310.679111327958
Variance / #scripts: 6.311
Variance * 1000000000 / #scripts / total time: 10882.721

It seems that seeding the random number generator and warming up "properly" are
not useful, so I'll remove them and run it again. The biggest wins are using
better date resolution (by far!), using eval (big), and
address-space-randomization (not bad).

If you find fault with the numbers, let me know.
I should try it with 30 runs too, since that's what people use. I could try finding a sweet spot in the number of runs, but I think that will vary from machine to machine.
OK, new run, having taken out the patches from above that didn't provide benefits.

Baseline:
Total time: 584454
Total variance: 13775.409999999318
Variance / #scripts: 13.775
Variance * 1000000000 / #scripts / total time: 23569.708

Better date resolution:
Total time: 583279.1579589844
Total variance: 8011.607755859456
Variance / #scripts: 8.012
Variance * 1000000000 / #scripts / total time: 13735.460

Use eval to avoid IO:
Total time: 578957.7702636719
Total variance: 5925.116317871156
Variance / #scripts: 5.925
Variance * 1000000000 / #scripts / total time: 10234.108

Use deterministic Math.random:
Total time: 585289.6984863281
Total variance: 5637.047147460895
Variance / #scripts: 5.637
Variance * 1000000000 / #scripts / total time: 9631.209

Address-space randomization:
Total time: 585399.4216308594
Total variance: 5990.405707519354
Variance / #scripts: 5.990
Variance * 1000000000 / #scripts / total time: 10233.023

Looks like address space randomization isn't so useful; everything else looks good.
njn: I had a look at bug 534547, and I'll fold that into the work I'm doing here.

Bug 534547 notes variance due to the tests using Date functions to add randomness, which obviously makes them completely unreproducible (which is actually worse than a lot of what I was measuring here). The tests which use Date in some fashion are:

3d-raytrace.js 
crypto-aes.js
date-format-tofte.js
date-format-xparb.js
math-cordic.js
string-tagcloud.js


v8-v4 also uses Date in:

v8-earley-boyer.js
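By analogy with the deterministic Math.random above, the fix would presumably be to pin Date as well. A sketch (it ignores constructor arguments and drops the other Date statics, which a real fix would need to handle):

(function () {
  var RealDate = Date;
  var FIXED = 1280000000000;                     // arbitrary fixed epoch
  Date = function () { return new RealDate(FIXED); };
  Date.now = function () { return FIXED; };
  Date.prototype = RealDate.prototype;
})();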
And if we go fixing sunspider, then using the v8 setup/teardown code will fix bug 562553.
Attachment #459281 - Attachment description: (neutral) Replace Math.random with a deterministic function → (good) Replace Math.random with a deterministic function
Apply this patch to significantly increase the consistency of your sunspider runs. You don't need to do anything else.

Note that it contains all of SunSpider.
Assignee: general → pbiggar
Attachment #458939 - Attachment is obsolete: true
Attachment #458945 - Attachment is obsolete: true
Attachment #459270 - Attachment is obsolete: true
Attachment #459277 - Attachment is obsolete: true
Attachment #459281 - Attachment is obsolete: true
Attachment #459554 - Attachment is obsolete: true
Status: NEW → UNCONFIRMED
Ever confirmed: false
Depends on: 585001
I've been filing related bugs to SunSpider. The meta bug is:
https://bugs.webkit.org/show_bug.cgi?id=43253
Attachment #463636 - Attachment is patch: false
Attachment #463636 - Attachment mime type: text/plain → application/bzip2
Fixes a bug caused by rebasing.
Attachment #463636 - Attachment is obsolete: true
Patch to apply to a checkout to make SunSpider much less random. This assumes you have a SunSpider directory in js/src.

(The patch no longer has a full copy of SunSpider, since that bothered people. Most of the other functionality has moved into SpiderMonkey, so is no longer required here.)
Attachment #463665 - Attachment is obsolete: true
I'm finding some significant outliers, and other anomalies. In all cases I ran the benchmark 1000 times.

access-binary-trees has odd pathological cases. 990 times it takes 14-16ms, then it gets weird:

 18.072021484375,
 40.85498046875,
 40.9990234375,
 41.1201171875,
 41.3681640625,
 42.1279296875,
 44.419921875,
 45.10595703125,
 45.52197265625,

This costs me 6 ms when I average it all up.

Same with bitops-bitwise-and. For 970 cases it takes 2-3ms, then BOOM:

    "bitops-bitwise-and": 6.81591796875, 
    "bitops-bitwise-and": 7.798095703125,
    "bitops-bitwise-and": 7.9970703125, 
    "bitops-bitwise-and": 8.06005859375, 
    "bitops-bitwise-and": 8.075927734375, 
    "bitops-bitwise-and": 8.2060546875, 
    "bitops-bitwise-and": 8.875, 


crypto-aes and string-tagcloud get much faster when address-space-randomization is turned on. Run-time (both mean and median) goes from 26ms to 22ms in the former case, and 53 to 47ms in the latter. There aren't bizarre outliers in either case, like in access-binary-trees: every instance speeds up.

Meanwhile, access-fannkuch, math-partial-sums, math-spectral-norm and string-unpack-code go nuts (become very very very noisy) when ASR is turned on. access-binary-trees, bitops-bitwise-and, access-nsieve and bitops-nsieve-bits go a bit nuts, but nothing like the others. I'm not sure exactly what would cause this.
eval() can be used again now (bug 580529), which means the patch is usable on shells too (so long as they support read()).
Attachment #465599 - Attachment is obsolete: true
Summary: Improve visibility of small wins in JS engine → Improve visibility of small wins in JS engine (reduce SunSpider noise)
Now that SunSpider uses a run function, we no longer need to use eval(), and no longer need the submillisecond timing, as they are both subsumed by run().

This updates to apply to trunk SunSpider as of today (Feb 11, 2011).
Attachment #469028 - Attachment is obsolete: true
Assignee: paul.biggar → general
Assignee: general → nobody
Status: UNCONFIRMED → RESOLVED
Closed: 2 years ago
Resolution: --- → INACTIVE