Closed Bug 767226 Opened 13 years ago Closed 13 years ago

Create Talos test for V8 benchmarks

Categories

(Testing :: Talos, defect)

Product:

Component:

Type:

defect

Priority:

Not set

Severity:

normal

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: jmaher)

References

Details

Attachments

(2 files)

add v8 (ver 7) to talos (1.0) 13 years ago Joel Maher ( :jmaher ) (UTC -8) 504.11 KB, patch	k0scist : review+	Details \| Diff \| Splinter Review
reporting hack for v8, sunspider, kraken (1.0) 13 years ago Joel Maher ( :jmaher ) (UTC -8) 18.70 KB, patch	k0scist : review+	Details \| Diff \| Splinter Review

David Mandelin [:dmandelin]

Reporter

Description

•

13 years ago

The V8 benchmarks are here: http://v8.googlecode.com/svn/data/benchmarks/v7/run.html The score is somewhat unstable (tends to vary -/+10%). The score is a geometric mean, so I'm not sure exactly how to aggregate multiple runs. (I think it's actually a geometric mean (over all tests) of geometric means (over 5 seconds of repeated runs per test) of rates.)

Phil Ringnalda (:philor)

Comment 1

•

13 years ago

Would this be the same as the V8 that we run only on inbound and Ionmonkey, or the V8.2 that we don't run anywhere last I looked, or one of them updated to current V8, or something different?

Phil Ringnalda (:philor)

Comment 2

•

13 years ago

Ah, something different. Those would be from bug 506772 putting our custom wrapper around V8 version 5, and ending up with (among other things) bug 755633 where the score is one of 10.0, 10.333, 10.167, etc., and this would maybe be us just adding a tiny bit of a patch so that we can report real V8 scores from real V8 runs.

David Mandelin [:dmandelin]

Reporter

Comment 3

•

13 years ago

(In reply to Phil Ringnalda (:philor) from comment #2) > Ah, something different. Those would be from bug 506772 putting our custom > wrapper around V8 version 5, and ending up with (among other things) bug > 755633 where the score is one of 10.0, 10.333, 10.167, etc., and this would > maybe be us just adding a tiny bit of a patch so that we can report real V8 > scores from real V8 runs. Yes, real V8 scores from real V8 (currently version 7) runs.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 4

•

13 years ago

ok, time to replace the old v8 with the new v8

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 5

•

13 years ago

Attached patch add v8 (ver 7) to talos (1.0) — Details — Splinter Review

This adds v8 version 7 to talos. this works fine as integration into pageloader and reporting raw results. I would like to resolve what iterations and calculation algorithms we want to perform on this data. Ideally the more data points we have the better off we are. Getting this landed will allow others to look at it as well as make future patches to tweak iterations, etc... much smaller.

Assignee: nobody → jmaher

Status: NEW → ASSIGNED

Attachment #637749 - Flags: review?(jhammel)

Comment 6

•

13 years ago

Comment on attachment 637749 [details] [diff] [review] add v8 (ver 7) to talos (1.0) I assume all the JS files are straight copies? Looks good to me

Attachment #637749 - Flags: review?(jhammel) → review+

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 7

•

13 years ago

everything is verbatim EXCEPT: base.js and run.html. I added param parsing to run.html, and some raw results and tprecordresults to base.js.

Comment 8

•

13 years ago

(In reply to Joel Maher (:jmaher) from comment #7) > everything is verbatim EXCEPT: base.js and run.html. I added param parsing > to run.html, and some raw results and tprecordresults to base.js. Sounds good to me

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 9

•

13 years ago

tests are added to repository: http://hg.mozilla.org/build/talos/rev/3607fcd7ce14 still need to hook up to buildbot and figure out what branches and number of runs/frequency we want.

Phil Ringnalda (:philor)

Comment 10

•

13 years ago

If it's worth running, it's worth running everywhere. We don't need to recreate bug 733490.

Blocks: 733490

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 11

•

13 years ago

In trying to replicate the number I am having trouble. Here is what happens. In eahc of the test definitions we have a hardcoded number that is the 'reference'. We run each test and collect a set of results [] At the end of a test we do: mean = geometric_mean(results) score = reference / mean scores.push(score) * NOTE: when you run v8, the score is the number posted for each 'test'. The overall metric is reported as geometric_mean(scores). There are a few problems here: 1) it is not very logical to simply get the reference number as we process results outside of the browser window. 2) we really really really really like to collect results per test (page). If we do that, then we will have the mean or score for each page. Graph server just takes an average of the numbers reported for each page. Do we need to have the v8 score as is reported, or can we come up with a solution for reporting and stick with it.

David Mandelin [:dmandelin]

Reporter

Comment 12

•

13 years ago

(In reply to Joel Maher (:jmaher) from comment #11) > In trying to replicate the number I am having trouble. Here is what happens. > > In eahc of the test definitions we have a hardcoded number that is the > 'reference'. > > We run each test and collect a set of results [] > > At the end of a test we do: > mean = geometric_mean(results) > score = reference / mean > scores.push(score) > > * NOTE: when you run v8, the score is the number posted for each 'test'. > > The overall metric is reported as geometric_mean(scores). > > There are a few problems here: > 1) it is not very logical to simply get the reference number as we process > results outside of the browser window. > 2) we really really really really like to collect results per test (page). > If we do that, then we will have the mean or score for each page. Graph > server just takes an average of the numbers reported for each page. > > Do we need to have the v8 score as is reported, or can we come up with a > solution for reporting and stick with it. It's only useful to us to have the V8 score as it's reported by the standard test harness. If that's hard to do, you could leave off V8 for now.

Comment 13

•

13 years ago

So :jmaher and I talked about this on IRC and there are a few challenges implementation-wise: 1. getting results for graphserver: the current infrastructure is such: - we accumulate N values per page - we pass these to talos.output:GraphserverOutput - this calls filter on them However, since the reference numbers come from the js/page side of the tests, we need to - gather these reference numbers - put them somewhere ;) - make a filter that can use these numbers This all sounds reasonable. However, there is a challenge. The filter architecture in general requires a function that can be called with arguments known at configuration-time. While this is a fairly general requirement, we violate it here. One hack around would be to specify a filter function, e.g. def divide_by_reference(series): return [series[index]/reference[index] for index in range(len(series))] and *then* you have to stick the reference as a module global into talos.filters :( This is a horrible hack. Another horrible hack would just be to bypass everything and hardcode this for now. Or...maybe we could leverage output another way. :jmaher, any inclination? I'm happy to look at this if you want to pass some of this to me 2. Posting to datazilla: If we can get the page numbers out, we can post them to datazilla. However, we still have to send the reference numbers if datazilla is going to compute the v8 metric. Where to put them? And, in general, how will it work for datazilla to calculate these industry-standard but computed benchmarks?

Comment 14

•

13 years ago

(So part of this is that we don't really have test classes on the run_tests.py side of talos and pass around a test_config dictionary versus test objects. I've noted this vaguely in bug 770744)

Comment 15

•

13 years ago

Another alternative, which i also hate but i haven't come up with one i liked yet, is just taking the static reference numbers and sticking them A. in a .py file in talos (e.g. filters) B. in graphserver...somewhere I mostly really hate this but it doesn't seem worse than the alternatives. ABICT, the numbers are static to the suite unless i missed something.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 16

•

13 years ago

Another thing to consider is when looking at the raw data from v8, there is 1 dta point except for 2 tests which have 2 data points. It might make sense to just report the score (as defined by "reference / geometric_mean(raw_values)") and not hack around. I do have a patch in talos that outputs _tp_aux_data to stdout and this includes the pagename/testname and the referencedata for each test. :dmandelin, one question for you would be if the v8 tests are running enough iterations now and we don't want to increase the number of runs to gain a more statistical sound number. If not, I propose returning only the score. If we might want to do more runs, then we should figure something out.

Comment 17

•

13 years ago

adding Christina for any metrics input

Comment 18

•

13 years ago

So I looked at the patch today I see a number of "interesting" things: - Crypto and EarleyBoyer both have two tests (Encrypt and Decrypt, and Earley and Boyer, respectively) on a single page. This mostly defies the way Talos works in its entirety for pageloader tests - Indeed, we do the geometric mean/reference fun for all tests. However, except for the above, the reference numbers could be offloaded somewhere So we're face with two dilemmas: what do we want to report for graphserver (and TBPL)? what do we want to report to datazilla? My feeling is that we should report the separate metrics separately as the "page name" for Crypto and EarleyBoyer. That said, I'm not sure how easy this actually is ;) Hopefully not too bad.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 19

•

13 years ago

the big question becomes we run a test Encrypt and it generates a single number. For some reason we do a: score = reference/geometric_mean(single_number); Is it useful to retain the single number? Is that single number useful? Should we collect >1 number? What does it really mean. Ok, my confusion is in writing :)

Comment 20

•

13 years ago

(In reply to Joel Maher (:jmaher) from comment #19) > the big question becomes we run a test Encrypt and it generates a single > number. > > For some reason we do a: score = reference/geometric_mean(single_number); > Is it useful to retain the single number? Is that single number useful? > Should we collect >1 number? What does it really mean. > > Ok, my confusion is in writing :) So this leads to more.....fun. (And don't worry, I am also pretty confused) - For retaining the v8 score, we'll want to use the geometric mean of the two tests in the case that there are two. For a single test, the geometric_mean is not significant; but for some reason the test creators decided this is what they want to do. So we have to reproduce this. - So we have a few choices: 1 we can report all the test singularly; that is the two tests with double metrics can report each individually and we can assemble this all python-side into the v8 numbers 2 we can take the geometric mean / ref JS side and just report one number per page, ignoring the two pages with two (completely differently timed) tests 3 other: I don't really have any other brilliant ideas. I think most else involves deeper changes to talos. FWIW, i *think* i'm in favor of option 2. at the moment. Although I'd rather have finer-grained control, I fear another can of worms in this one.

Comment 21

•

13 years ago

So talking to jmaher and dmandelin, it sounds like we want to keep the individual benchmarks for reporting (Encrypt + Decrypt vs Crypto) to datazilla, and report the single v8 score to graphserver via AVERAGE: https://wiki.mozilla.org/Buildbot/Talos/DataFormat I've retooled how pageloader works so that you can have multiple tests per (self-reporting page). The thing that magically makes the v8 benchmark number I'll put into filters, though it will be special-cased to v8 and I'll just copy the (non-changing) reference numbers there as well as a test -> page mapping (Not what I would call a great practice, but the best I can do without much delay). So we'll upload the one value to graphserver and the per-test values to datazilla, which will have to have similar logic to compute the industry standard

Comment 22

•

13 years ago

see https://bugzilla.mozilla.org/show_bug.cgi?id=770744#c1 for a discussion about the right and the wrong way to do this. For expedience, of course, I'm planning on doing the wrong way

Comment 23

•

13 years ago

So this seems to do what i want. It is giant since it contains more than just v8. Let me know if you want me to break it up, jmaher http://k0s.org/mozilla/hg/talos-patches/file/368fb22aad68/bug-767226

Comment 24

•

13 years ago

(In reply to Jeff Hammel [:jhammel] from comment #23) > So this seems to do what i want. It is giant since it contains more than > just v8. Let me know if you want me to break it up, jmaher > > http://k0s.org/mozilla/hg/talos-patches/file/368fb22aad68/bug-767226 Oops, that should be http://k0s.org/mozilla/hg/talos-patches/file/tip/bug-767226

Comment 25

•

13 years ago

I'm assuming we want more than just windows 7 64bit here

OS: Windows 7 → All

Hardware: x86_64 → All

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 26

•

13 years ago

Jeff, the one thing we need to do in addition to all of this is run the V8 tests as a single page. iirc, sunspider and kraken can run with each test being a separate page, but V8 needs to be run at once. I mention that as we might need to adjust a bit more. regarding the large patch, lets land the kraken bits and then the rest of the files will make a much smaller, easier to manage patch :)

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 27

•

13 years ago

Attached patch reporting hack for v8, sunspider, kraken (1.0) — Details — Splinter Review

lets get this thing done!

Attachment #641996 - Flags: review?(jhammel)

Comment 28

•

13 years ago

Comment on attachment 641996 [details] [diff] [review] reporting hack for v8, sunspider, kraken (1.0) +var gReference = -1; We probably don't need this anymore Other than that, looks good

Attachment #641996 - Flags: review?(jhammel) → review+

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 29

•

13 years ago

great, testing on try server for conformance and stability of existing test numbers. Things are lining up where we might be able to get this landed and live next week.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 30

•

13 years ago

overall this seems good. For some reason tdhtml tests always post higher numbers, but nothing in this patch affects that or the reporting portion of it. I am not concerned with it as the numbers are pretty close to the ranges posted historically.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 31

•

13 years ago

landed on talos: http://hg.mozilla.org/build/talos/rev/1f8f47452048

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

13 years ago

Depends on: 774282

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

13 years ago

Depends on: 774283

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 32

•

13 years ago

Here is a link to some raw data and links to the graphs for these: http://people.mozilla.org/~jmaher/sxs/tp5n/jstests.html

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 33

•

13 years ago

(In reply to Joel Maher (:jmaher) from comment #32) > Here is a link to some raw data and links to the graphs for these: > http://people.mozilla.org/~jmaher/sxs/tp5n/jstests.html Why are the sunspider number around 11 to 13? I thought a normal range for these is 200ms to 300ms.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 34

•

13 years ago

here is the raw data: NOISE: |i|pagename|runs| NOISE: |0;3d-cube;22;22;22;22;22;22;22;22;22;22 NOISE: |1;3d-morph;12;12;12;11;12;12;12;12;13;12 NOISE: |2;3d-raytrace;23;23;23;22;22;23;23;23;22;23 NOISE: |3;access-binary-trees;4;3;4;4;6;6;6;5;6;5 NOISE: |4;access-fannkuch;10;10;9;10;9;10;10;10;10;10 NOISE: |5;access-nbody;6;6;7;6;7;6;7;6;7;7 NOISE: |6;access-nsieve;5;5;5;5;5;5;6;5;5;6 NOISE: |7;bitops-3bit-bits-in-byte;1;1;1;1;1;1;1;1;1;1 NOISE: |8;bitops-bits-in-byte;6;6;7;6;6;6;6;6;6;6 NOISE: |9;bitops-bitwise-and;5;5;4;4;4;4;4;4;5;4 NOISE: |10;bitops-nsieve-bits;5;4;5;5;4;5;5;4;5;5 NOISE: |11;controlflow-recursive;3;3;3;3;3;4;4;3;4;3 NOISE: |12;crypto-aes;15;15;15;15;15;15;15;15;15;15 NOISE: |13;crypto-md5;10;10;10;10;10;10;10;10;10;9 NOISE: |14;crypto-sha1;6;5;6;5;5;5;5;5;5;5 NOISE: |15;date-format-tofte;88;90;90;89;89;91;91;91;89;157 NOISE: |16;date-format-xparb;18;19;19;18;18;19;18;18;18;18 NOISE: |17;math-cordic;5;4;5;5;5;4;5;5;5;4 NOISE: |18;math-partial-sums;18;19;18;17;17;18;19;18;18;18 NOISE: |19;math-spectral-norm;4;5;4;4;5;5;4;4;4;4 NOISE: |20;regexp-dna;18;18;19;18;19;19;18;18;19;19 NOISE: |21;string-base64;8;8;9;8;9;9;8;9;9;9 NOISE: |22;string-fasta;11;10;11;10;11;11;10;11;11;11 NOISE: |23;string-tagcloud;32;32;32;32;32;33;32;32;32;32 NOISE: |24;string-unpack-code;73;73;73;73;72;73;73;73;72;72 NOISE: |25;string-validate-input;13;13;13;13;13;14;15;15;15;15 Not sure why we would be getting much lower numbers, we are more than happy to change things if needed.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 35

•

13 years ago

ok, I have updated talos to do proper calculations for kraken and sunspider. Please take a look at this page and see if the numbers look right: http://people.mozilla.org/~jmaher/sxs/tp5n/jstests.html I want to get these correct before I make these default and unhide them.

David Mandelin [:dmandelin]

Reporter

Comment 36

•

13 years ago

(In reply to Joel Maher (:jmaher) from comment #35) > ok, I have updated talos to do proper calculations for kraken and sunspider. > Please take a look at this page and see if the numbers look right: > http://people.mozilla.org/~jmaher/sxs/tp5n/jstests.html > > I want to get these correct before I make these default and unhide them. Yep, those numbers are similar to AWFY. It's very cool to see multiple platforms being tested! For example, you can see how on JM 64bit is 10% or so slower than 32bit.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 37

•

13 years ago

great, I want to see at least 30 iterations of this running producing reasonable results before closing the bugs out. Looking like early next week!

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 38

•

13 years ago

this is live!

Status: ASSIGNED → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Matt Brubeck (:mbrubeck)

Updated

•

12 years ago

Depends on: 786729

Matt Brubeck (:mbrubeck)

Updated

•

12 years ago

Blocks: 755633

You need to log in before you can comment on or make changes to this bug.