Closed Bug 767226 Opened 12 years ago Closed 12 years ago

Create Talos test for V8 benchmarks

Categories

(Testing :: Talos, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: jmaher)

References

Details

Attachments

(2 files)

The V8 benchmarks are here:

  http://v8.googlecode.com/svn/data/benchmarks/v7/run.html

The score is somewhat unstable (tends to vary -/+10%). The score is a geometric mean, so I'm not sure exactly how to aggregate multiple runs. (I think it's actually a geometric mean (over all tests) of geometric means (over 5 seconds of repeated runs per test) of rates.)
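For concreteness, here is a minimal sketch (my reading of the scoring, not the harness's actual code) of the nested geometric-mean structure described above; all names and numbers are illustrative:

import math

def gmean(values):
    # nth root of the product of n positive values
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical rates (runs/second) collected over ~5 seconds for two tests;
# the numbers are made up.
rates_per_test = {
    "TestA": [410.0, 398.0, 405.0],
    "TestB": [120.0, 118.0, 119.0],
}
per_test = [gmean(rates) for rates in rates_per_test.values()]
overall = gmean(per_test)  # geometric mean of geometric means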
Would this be the same as the V8 that we run only on inbound and Ionmonkey, or the V8.2 that we don't run anywhere last I looked, or one of them updated to current V8, or something different?
Ah, something different. Those would be from bug 506772 putting our custom wrapper around V8 version 5, and ending up with (among other things) bug 755633 where the score is one of 10.0, 10.333, 10.167, etc., and this would maybe be us just adding a tiny bit of a patch so that we can report real V8 scores from real V8 runs.
(In reply to Phil Ringnalda (:philor) from comment #2)
> Ah, something different. Those would be from bug 506772 putting our custom
> wrapper around V8 version 5, and ending up with (among other things) bug
> 755633 where the score is one of 10.0, 10.333, 10.167, etc., and this would
> maybe be us just adding a tiny bit of a patch so that we can report real V8
> scores from real V8 runs.

Yes, real V8 scores from real V8 (currently version 7) runs.
ok, time to replace the old v8 with the new v8
This adds v8 version 7 to talos.  It integrates into pageloader and reports raw results correctly.

I would like to resolve which iterations and calculation algorithms we want to perform on this data.  Ideally, the more data points we have, the better off we are.

Getting this landed will allow others to look at it, and it will make future patches to tweak iterations, etc. much smaller.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #637749 - Flags: review?(jhammel)
Comment on attachment 637749 [details] [diff] [review]
add v8 (ver 7) to talos (1.0)

I assume all the JS files are straight copies?  Looks good to me
Attachment #637749 - Flags: review?(jhammel) → review+
everything is verbatim EXCEPT: base.js and run.html.  I added param parsing to run.html, and some raw results and tprecordresults to base.js.
(In reply to Joel Maher (:jmaher) from comment #7)
> everything is verbatim EXCEPT: base.js and run.html.  I added param parsing
> to run.html, and some raw results and tprecordresults to base.js.

Sounds good to me
tests are added to repository:
http://hg.mozilla.org/build/talos/rev/3607fcd7ce14

Still need to hook this up to buildbot and figure out which branches and what number of runs/frequency we want.
If it's worth running, it's worth running everywhere. We don't need to recreate bug 733490.
Blocks: 733490
I am having trouble replicating the number.  Here is what happens.

In each of the test definitions we have a hardcoded number that is the 'reference'.

We run each test and collect a set of results []

At the end of a test we do:
mean = geometric_mean(results)
score = reference / mean
scores.push(score)

* NOTE: when you run v8, the score is the number posted for each 'test'.

The overall metric is reported as geometric_mean(scores).

There are a few problems here:
1) it is not very logical to simply get the reference number as we process results outside of the browser window.
2) we really really really really like to collect results per test (page).  If we do that, then we will have the mean or score for each page.  Graph server just takes an average of the numbers reported for each page.

Do we need to have the v8 score as it is reported, or can we come up with a solution for reporting and stick with it?
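For reference, a minimal Python sketch of the steps described above (the helper and the example values are placeholders, not the actual base.js code or constants):

import math

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def test_score(reference, results):
    # results: the raw values collected for one test
    mean = geometric_mean(results)
    return reference / mean  # the number v8 posts for each 'test'

def overall_metric(scores):
    # the overall v8 metric
    return geometric_mean(scores)

# e.g., with made-up reference and raw values:
# overall_metric([test_score(100000.0, [312.0]), test_score(250000.0, [521.0, 534.0])])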
(In reply to Joel Maher (:jmaher) from comment #11)
> I am having trouble replicating the number.  Here is what happens.
> 
> In each of the test definitions we have a hardcoded number that is the
> 'reference'.
> 
> We run each test and collect a set of results []
> 
> At the end of a test we do:
> mean = geometric_mean(results)
> score = reference / mean
> scores.push(score)
> 
> * NOTE: when you run v8, the score is the number posted for each 'test'.
> 
> The overall metric is reported as geometric_mean(scores).
> 
> There are a few problems here:
> 1) it is not very logical to simply get the reference number as we process
> results outside of the browser window.
> 2) we really really really really like to collect results per test (page). 
> If we do that, then we will have the mean or score for each page.  Graph
> server just takes an average of the numbers reported for each page.
> 
> Do we need to have the v8 score as it is reported, or can we come up with a
> solution for reporting and stick with it?

It's only useful to us to have the V8 score as it's reported by the standard test harness. If that's hard to do, you could leave off V8 for now.
So :jmaher and I talked about this on IRC and there are a few challenges implementation-wise:

1. getting results for graphserver: the current infrastructure is as follows:
- we accumulate N values per page
- we pass these to talos.output:GraphserverOutput
- this calls filter on them

However, since the reference numbers come from the js/page side of the tests, we need to
- gather these reference numbers
- put them somewhere ;)
- make a filter that can use these numbers

This all sounds reasonable.  However, there is a challenge.  The filter architecture in general requires a function that can be called with arguments known at configuration-time.  While this is a fairly general requirement, we violate it here.

One hack around this would be to specify a filter function, e.g.

def divide_by_reference(series):
    return [series[index]/reference[index] for index in range(len(series))]

and *then* you have to stick the reference as a module global into talos.filters :(

This is a horrible hack.  Another horrible hack would just be to bypass everything and hardcode this for now.  Or...maybe we could leverage output another way.
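One alternative sketch (just a suggestion, and it still assumes the reference values are known by the time the filter is built, which is exactly the constraint discussed above): a factory that closes over the reference list instead of storing it as a module global:

def make_divide_by_reference(reference):
    # 'reference' is whatever list of per-page reference numbers we manage
    # to gather from the js/page side; the closure avoids a module global
    # in talos.filters.
    def divide_by_reference(series):
        return [value / ref for value, ref in zip(series, reference)]
    return divide_by_reference

# hypothetical usage once the reference numbers are available:
# f = make_divide_by_reference([1983.0, 1874.0])
# normalized = f(raw_series)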

:jmaher, any inclination?  I'm happy to look at this if you want to pass some of this to me

2. Posting to datazilla: If we can get the page numbers out, we can post them to datazilla.  However, we still have to send the reference numbers if datazilla is going to compute the v8 metric.  Where to put them?  And, in general, how will it work for datazilla to calculate these industry-standard but computed benchmarks?
(So part of this is that we don't really have test classes on the run_tests.py side of talos; we pass around a test_config dictionary rather than test objects.  I've noted this vaguely in bug 770744.)
Another alternative, which I also hate (but I haven't come up with one I like yet), is just taking the static reference numbers and sticking them:

A. in a .py file in talos (e.g. filters)
B. in graphserver...somewhere

I mostly really hate this, but it doesn't seem worse than the alternatives.  AFAICT, the numbers are static to the suite unless I missed something.
Another thing to consider: looking at the raw data from v8, there is 1 data point per test, except for 2 tests which have 2 data points.  It might make sense to just report the score (as defined by "reference / geometric_mean(raw_values)") and not hack around it.

I do have a patch in talos that outputs _tp_aux_data to stdout, and this includes the pagename/testname and the reference data for each test.

:dmandelin, one question for you: are the v8 tests running enough iterations now, so that we don't need to increase the number of runs to get a more statistically sound number?  If we don't need more runs, I propose returning only the score.  If we might want to do more runs, then we should figure something out.
adding Christina for any metrics input
So I looked at the patch today, and I see a number of "interesting" things:

- Crypto and EarleyBoyer both have two tests (Encrypt and Decrypt, and Earley and Boyer, respectively) on a single page.  This mostly defies how Talos pageloader tests work.

- Indeed, we do the geometric mean/reference fun for all tests.  However, except for the above, the reference numbers could be offloaded somewhere.

So we're faced with two dilemmas: what do we want to report to graphserver (and TBPL)?  What do we want to report to datazilla?

My feeling is that we should report the separate metrics separately as the "page name" for Crypto and EarleyBoyer.  That said, I'm not sure how easy this actually is ;) Hopefully not too bad.
The big question becomes: we run a test, Encrypt, and it generates a single number.

For some reason we do: score = reference/geometric_mean(single_number);  Is it useful to retain the single number?  Is that single number useful?  Should we collect >1 number?  What does it really mean?

Ok, my confusion is in writing :)
(In reply to Joel Maher (:jmaher) from comment #19)
> The big question becomes: we run a test, Encrypt, and it generates a single
> number.
> 
> For some reason we do: score = reference/geometric_mean(single_number);
> Is it useful to retain the single number?  Is that single number useful? 
> Should we collect >1 number?  What does it really mean?
> 
> Ok, my confusion is in writing :)

So this leads to more.....fun. (And don't worry, I am also pretty confused)

- For retaining the v8 score, we'll want to use the geometric mean of the two tests in the case where there are two.  For a single test, the geometric mean of one value is just that value, so it adds nothing; but for some reason the test creators decided this is what they want to do, so we have to reproduce it.

- So we have a few choices:

  1. we can report all the tests individually; that is, the two tests with double metrics can each report separately, and we can assemble it all python-side into the v8 numbers

  2. we can take the geometric mean / reference on the JS side and just report one number per page, ignoring the fact that two pages contain two (completely differently timed) tests

  3. other: I don't really have any other brilliant ideas.  I think most anything else involves deeper changes to talos.

FWIW, I *think* I'm in favor of option 2 at the moment.  Although I'd rather have finer-grained control, I fear opening another can of worms with this one.
So, talking to jmaher and dmandelin, it sounds like we want to keep the individual benchmarks (Encrypt + Decrypt vs Crypto) for reporting to datazilla, and report the single v8 score to graphserver via AVERAGE: https://wiki.mozilla.org/Buildbot/Talos/DataFormat

I've retooled how pageloader works so that you can have multiple tests per (self-reporting) page.  The thing that magically produces the v8 benchmark number I'll put into filters, though it will be special-cased to v8, and I'll just copy the (non-changing) reference numbers there as well as a test -> page mapping (not what I would call a great practice, but the best I can do without much delay).

So we'll upload the one value to graphserver and the per-test values to datazilla, which will have to have similar logic to compute the industry-standard score.
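A rough sketch of what the special-cased v8 computation might look like; the reference numbers and the test -> page mapping below are placeholders, not the real constants from the suite:

import math

# Placeholder values; in practice these would be copied from the suite's base.js.
V8_PAGE_REFERENCE = {"Crypto": 100000.0, "Richards": 50000.0}
V8_TEST_TO_PAGE = {"Encrypt": "Crypto", "Decrypt": "Crypto", "Richards": "Richards"}

def gmean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def v8_metric(per_test_raw):
    # per_test_raw: {test name: [raw values]}, the per-test data sent to datazilla.
    # Group raw values by page, compute score = reference / geometric_mean(values)
    # per page, then take the geometric mean of the page scores for graphserver.
    by_page = {}
    for test, values in per_test_raw.items():
        by_page.setdefault(V8_TEST_TO_PAGE[test], []).extend(values)
    page_scores = [V8_PAGE_REFERENCE[page] / gmean(vals) for page, vals in by_page.items()]
    return gmean(page_scores)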
See https://bugzilla.mozilla.org/show_bug.cgi?id=770744#c1 for a discussion of the right and the wrong way to do this.  For expedience, of course, I'm planning on doing it the wrong way.
So this seems to do what I want.  It is giant since it contains more than just v8.  Let me know if you want me to break it up, jmaher

http://k0s.org/mozilla/hg/talos-patches/file/368fb22aad68/bug-767226
(In reply to Jeff Hammel [:jhammel] from comment #23)
> So this seems to do what I want.  It is giant since it contains more than
> just v8.  Let me know if you want me to break it up, jmaher
> 
> http://k0s.org/mozilla/hg/talos-patches/file/368fb22aad68/bug-767226

Oops, that should be

http://k0s.org/mozilla/hg/talos-patches/file/tip/bug-767226
I'm assuming we want more than just Windows 7 64-bit here.
OS: Windows 7 → All
Hardware: x86_64 → All
Jeff, the one thing we need to do in addition to all of this is run the V8 tests as a single page.  IIRC, sunspider and kraken can run with each test being a separate page, but V8 needs to be run all at once.

I mention that as we might need to adjust a bit more.

Regarding the large patch, let's land the kraken bits first, and then the rest of the files will make a much smaller, easier-to-manage patch :)
Let's get this thing done!
Attachment #641996 - Flags: review?(jhammel)
Comment on attachment 641996 [details] [diff] [review]
reporting hack for v8, sunspider, kraken (1.0)

+var gReference = -1;

We probably don't need this anymore

Other than that, looks good
Attachment #641996 - Flags: review?(jhammel) → review+
Great, testing on try server for conformance and stability of existing test numbers.  Things are lining up such that we might be able to get this landed and live next week.
Overall this seems good.  For some reason tdhtml tests always post higher numbers, but nothing in this patch affects that or the reporting portion of it.  I am not concerned, as the numbers are pretty close to the ranges posted historically.
Depends on: 774282
Depends on: 774283
Here is a link to some raw data and links to the graphs for these:
http://people.mozilla.org/~jmaher/sxs/tp5n/jstests.html
(In reply to Joel Maher (:jmaher) from comment #32)
> Here is a link to some raw data and links to the graphs for these:
> http://people.mozilla.org/~jmaher/sxs/tp5n/jstests.html

Why are the sunspider numbers around 11 to 13? I thought a normal range for these is 200ms to 300ms.
Here is the raw data:
NOISE: |i|pagename|runs|
NOISE: |0;3d-cube;22;22;22;22;22;22;22;22;22;22
NOISE: |1;3d-morph;12;12;12;11;12;12;12;12;13;12
NOISE: |2;3d-raytrace;23;23;23;22;22;23;23;23;22;23
NOISE: |3;access-binary-trees;4;3;4;4;6;6;6;5;6;5
NOISE: |4;access-fannkuch;10;10;9;10;9;10;10;10;10;10
NOISE: |5;access-nbody;6;6;7;6;7;6;7;6;7;7
NOISE: |6;access-nsieve;5;5;5;5;5;5;6;5;5;6
NOISE: |7;bitops-3bit-bits-in-byte;1;1;1;1;1;1;1;1;1;1
NOISE: |8;bitops-bits-in-byte;6;6;7;6;6;6;6;6;6;6
NOISE: |9;bitops-bitwise-and;5;5;4;4;4;4;4;4;5;4
NOISE: |10;bitops-nsieve-bits;5;4;5;5;4;5;5;4;5;5
NOISE: |11;controlflow-recursive;3;3;3;3;3;4;4;3;4;3
NOISE: |12;crypto-aes;15;15;15;15;15;15;15;15;15;15
NOISE: |13;crypto-md5;10;10;10;10;10;10;10;10;10;9
NOISE: |14;crypto-sha1;6;5;6;5;5;5;5;5;5;5
NOISE: |15;date-format-tofte;88;90;90;89;89;91;91;91;89;157
NOISE: |16;date-format-xparb;18;19;19;18;18;19;18;18;18;18
NOISE: |17;math-cordic;5;4;5;5;5;4;5;5;5;4
NOISE: |18;math-partial-sums;18;19;18;17;17;18;19;18;18;18
NOISE: |19;math-spectral-norm;4;5;4;4;5;5;4;4;4;4
NOISE: |20;regexp-dna;18;18;19;18;19;19;18;18;19;19
NOISE: |21;string-base64;8;8;9;8;9;9;8;9;9;9
NOISE: |22;string-fasta;11;10;11;10;11;11;10;11;11;11
NOISE: |23;string-tagcloud;32;32;32;32;32;33;32;32;32;32
NOISE: |24;string-unpack-code;73;73;73;73;72;73;73;73;72;72
NOISE: |25;string-validate-input;13;13;13;13;13;14;15;15;15;15


Not sure why we would be getting much lower numbers; we are more than happy to change things if needed.
ok, I have updated talos to do proper calculations for kraken and sunspider.  Please take a look at this page and see if the numbers look right:
http://people.mozilla.org/~jmaher/sxs/tp5n/jstests.html

I want to get these correct before I make these default and unhide them.
(In reply to Joel Maher (:jmaher) from comment #35)
> ok, I have updated talos to do proper calculations for kraken and sunspider.
> Please take a look at this page and see if the numbers look right:
> http://people.mozilla.org/~jmaher/sxs/tp5n/jstests.html
> 
> I want to get these correct before I make these default and unhide them.

Yep, those numbers are similar to AWFY. It's very cool to see multiple platforms being tested! For example, you can see how, on JM, 64-bit is 10% or so slower than 32-bit.
Great, I want to see at least 30 iterations of this running and producing reasonable results before closing the bugs out.  Looking like early next week!
This is live!
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Depends on: 786729
Blocks: 755633