Open Bug 482638 Opened 14 years ago Updated 9 years ago

Validate Dromaeo benchmark techniques and statistics


(Core :: General, defect)

(Reporter: sayrer, Unassigned)



We're seeing some strange swings in Dromaeo that make us want to examine the approach it uses for each individual test, and the way it creates its overall score.
Some of these swings are recorded in bug 480494, but I can summarize the relevant findings so far:

Short summary: TM, especially today, doesn't trace closures well. TM is also very sensitive to some other scoping issues. Because Dromaeo uses closures extensively, we take a perf hit *and* our Dromaeo perf can change a lot based on relatively small changes to TM.

Long summary:

In SunSpider, the tests are "big dumb programs"--HTML files with a script. The only instrumentation added is date stamping at the beginning and end. The programs themselves are activated by a pageload, and thus are independent from the recording and stats processing code.

In contrast, in Dromaeo, the HTML file for each test produces a list of closures. Some of the closures are "prep" steps that are not timed, while the rest are timed tests. The list is passed to the harness, which runs the closures repeatedly. This allows Dromaeo to time test steps while omitting prep steps, which SunSpider can't easily do.

The primary problem with the Dromaeo approach for Tracemonkey is that we don't trace closures well at this point. The first effect is that we may not score as well as we would on a simple single-script approach. Second, because the closures cause tracing aborts, we end up backing off and blacklisting traces. But blacklisting is "twitchy"--seemingly small changes that affect blacklisting can cause big changes in performance.

The latter effect seems to have caused the regression on sunspider-string-fasta.html in bug 480494. On the standard SunSpider version, we get 6 aborts, of the relatively benign "no compatible inner tree" and "inner tree is trying to grow" types that occur in the startup phase with nested loops. If I increase the "count" parameter from 7 to 70, I get no more aborts. In the Dromaeo version, I get several "returned out of a loop we started tracing/LeaveFrame" aborts, which eventually cause blacklisting and hurt our performance.

I should note that if I take the SS shell test and wrap the bottom 3 tests in a loop that runs 10 times, I also get the LeaveFrame aborts and a 3x perf hit relative to increasing count by 10x. So it's not necessarily purely a closure issue.
So David brings up some technical issues inside of TM that might be causing problems, and that's well and good. But I am hearing through other channels that there might be concerns about the way that Dromaeo is counting and measuring. Robert + Brendan, can you guys talk about your concerns in that area, outside of specific TM bugs?
My concern is that I don't understand how the numbers are calculated. If we could get an explanation of this process, that would be great.
I've also mailed about this, early last December, so none of this is new.

I haven't had time to check, but rate averaging is tricky and the arithmetic mean is wrong. Moreover, I question using exclusively rate-based perf measurement. On the web, compile time does matter, and short bursts of non-repeated computation that are long enough to add up to latency matter too. We need to measure compilation costs as well as exclude them, to isolate factors that can hurt perceived performance.
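To illustrate the rate-averaging pitfall with a made-up example (these are not Dromaeo numbers): averaging rates arithmetically lets the fastest test dominate the score, while the harmonic mean corresponds to whole-suite throughput.

```javascript
// Two hypothetical tests: A completes 1000 runs/sec, B completes 10 runs/sec.
// One run of each takes 1/1000 + 1/10 = 0.101 s of wall-clock time,
// i.e. about 19.8 runs-of-the-whole-suite per second.
const rates = [1000, 10];

// Arithmetic mean: dominated by the fast test.
const arithmetic = rates.reduce((a, b) => a + b, 0) / rates.length; // 505

// Harmonic mean: reciprocal of the mean of reciprocals -- matches the
// "suite runs per second" intuition (~19.8).
const harmonic = rates.length / rates.reduce((acc, r) => acc + 1 / r, 0);

console.log(arithmetic, harmonic.toFixed(1)); // 505 "19.8"
```

A 100x speedup on the already-fast test would roughly double the arithmetic-mean score while barely moving real suite throughput, which is the sense in which the arithmetic mean of rates misleads.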

Also it seems undesirable to chop up standard benchmark code (SunSpider) in order to rate-measure it. But let's settle the rate-based issue first.

Hey, sorry for the delay - was at SXSW and now I'm sick.

As far as I can ascertain from the various comments, there are three issues:

1) How are the scores arrived at?

A score for an individual test (such as a test for MD5 computation) is derived from 5 runs of the test (each run consisting of running the test in a loop for one second). Thus the numbers might be 200, 205, 195, 200, 200. A geometric mean of these numbers is then computed (which yields the 'score'), along with a confidence interval.

The total score for a browser is computed in much the same way: the geometric mean of all the individual means from all of the individual tests (md5, dom, etc.) gives the total score.
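The scoring described above can be sketched as follows (function and variable names are mine, not Dromaeo's actual source; the run numbers are the example figures from the comment):

```javascript
// Geometric mean via logs, to avoid overflow on products of many values.
function geometricMean(values) {
  const logSum = values.reduce((acc, v) => acc + Math.log(v), 0);
  return Math.exp(logSum / values.length);
}

// Per-test score: geometric mean of the 5 one-second runs (runs/sec).
const md5Runs = [200, 205, 195, 200, 200];
const md5Score = geometricMean(md5Runs); // ~199.97

// Total score: geometric mean of the individual test scores.
// (The other scores here are made-up placeholders.)
const testScores = [md5Score, 150, 320]; // md5, dom, etc.
const totalScore = geometricMean(testScores);
```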

2) The averaging is incorrect and needs to be changed.

Right now a geometric mean is used - a harmonic mean was recommended in the thread from last year, and that seems like an acceptable replacement. I initially used the geometric mean because that's what the V8 benchmark uses to arrive at its score. If there's no major disagreement, I can switch to the harmonic mean.
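A sketch of what the switch might look like (my naming, not Dromaeo's source). For tightly clustered per-run numbers the two means barely differ, but when combining widely spread per-test rates the harmonic mean weights the slow tests more heavily:

```javascript
// Harmonic mean: n divided by the sum of reciprocals.
function harmonicMean(values) {
  const invSum = values.reduce((acc, v) => acc + 1 / v, 0);
  return values.length / invSum;
}

// Clustered runs from the example above: harmonic ~199.95 vs geometric ~199.97.
const runs = [200, 205, 195, 200, 200];
console.log(harmonicMean(runs).toFixed(2));

// Widely spread per-test rates (made-up): harmonic ~19.8 vs geometric = 100.
const spread = [1000, 10];
console.log(harmonicMean(spread).toFixed(1)); // "19.8"
```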

3) There needs to be a way to measure compilation performance.

If there's a way to measure this in a reliable, cross-browser manner then I'll definitely be open to it. At first glance I'd think that something like:
  new Function("...code...")
might work - but it's not clear to me how much compilation may occur at that time or may be deferred. Any tips here would be appreciated.
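A rough sketch of that idea (a hypothetical helper, not an existing Dromaeo API; as the caveat above says, engines may lazily compile function bodies, so this may undercount the true cost):

```javascript
// Time repeated compilation via the Function constructor.
// Returns average milliseconds per compile.
function timeCompile(source, iterations) {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) {
    // Vary the source slightly so a compilation cache can't just
    // reuse the previous parse.
    new Function("/* " + i + " */ " + source);
  }
  return (Date.now() - start) / iterations;
}
```

Whether `new Function` forces a full compile or only a parse (with body compilation deferred until first call) is engine-specific, which is exactly the uncertainty raised above; calling the compiled function once inside the loop would be one way to force at least one full compile, at the cost of mixing in execution time.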