Closed Bug 572860 Opened 15 years ago Closed 15 years ago

performance runtests improvements tracking bug

Categories

(Tamarin Graveyard :: Tools, defect, P3)

Tracking

(Not tracked)

VERIFIED FIXED
Q3 11 - Serrano

People

(Reporter: cpeyer, Assigned: cpeyer)

References

Details

(Whiteboard: Tracker)

Attachments

(3 files, 5 obsolete files)

Tracking bug for a variety of performance/runtests.py improvements.
Depends on: 572861
Depends on: 572863
Depends on: 572867
Depends on: 572877
Depends on: 573106
Depends on: 568496
Flags: flashplayer-qrb+
Priority: -- → P3
Whiteboard: Tracker
Target Milestone: --- → flash10.2
I didn't intend to have a giant patch for all the runtests improvements, but most of the changes relied on a new data structure used to store results internally, so they slowly merged together as I was working on it. I'm not too concerned about the code change (though feel free to review), but I would like feedback on using the script.

Additions:

Multi-metric support: runtests can now handle multiple metrics being reported in one test. The easiest way to demonstrate that is to run with -m: instead of just memory results, both memory and the regular test results will be displayed. Runtests also supports the output from bug 568933. vprof (perfm) results are also supported, but the --perfm flag must still be used.

Indexing: two new command-line flags have been added, --index and --saveindex; both require an index filename after the flag. "--index index.py" will load index.py, run all of the tests in that index, and report results normalized to the index values. "--saveindex indexname.py" will run the specified tests and save the values to the given index file. Specifying both --index and --saveindex will merge the values from --index into the newly saved index file. (A sketch of what an index file might contain appears after this comment.)

Cumulative score: the new --score option will display a cumulative score for every metric. Really most useful with an index file.

CSV output: "--csv filename" will save run results to a csv file that can be analyzed using Excel pivot tables. Use --csvappend if you want to append the results to the csv file. There is still work to be done on including configuration information in the csv data.

Misc:
- use --raw to display all test values for each vm
- removed the -p/--prettyprint option (I don't think this was used by anybody)
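A minimal sketch of what a saved index file might contain. This is an assumption for illustration only: the bug doesn't show the real file format, just that index files are .py files holding per-test baseline values that later runs are normalized against (so an indexed result of 1.00 means "same as the baseline").

    # hypothetical index file as produced by --saveindex index.py;
    # keys and structure are assumptions, not the actual format
    index = {
        'v8.5/optimized/crypto.as':    {'v8': 4127.0},
        'v8.5/optimized/deltablue.as': {'v8': 390.0},
    }

Normalization then amounts to dividing each measured value by the stored baseline for that test/metric pair.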
Attachment #459674 - Flags: feedback?(fklockii)
Attachment #459674 - Flags: feedback?(edwsmith)
Attachment #459674 - Flags: feedback?(brbaker)
metricInfo.py has also been added. This file defines all of the metrics being used and how to calculate the "best" value for each metric, as well as information on units and a description. See the file for details. Note that metrics can now compute the best value using the mean (or any other function that can be applied to a list, for that matter).
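For illustration, a sketch of what one entry in metricInfo.py might look like, based only on the fields this bug mentions (a best-value function, units, and a description; a largerIsFaster flag is added later in comment 17). The exact key names are assumptions:

    # hypothetical metricInfo.py entry; key names are assumptions
    metricInfo = {
        'time': {
            'best':  min,   # any function that can be applied to a list of values
            'units': 'milliseconds',
            'desc':  'wall-clock time reported by the test',
        },
    }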
Comment on attachment 459674 [details] [diff] [review]
runtest.py mega patch - csv, indexing, cumulative score, updated internal data struct

I know you said you cared more about usage feedback than code feedback, but, well, I started by reading the code (not finished). Initial feedback:

- remove the tabs from the python code (preferably in a preliminary separate refactoring). I don't want to have to worry about Python's notion of spaces/tabs conflicting with my own.
- if we're already allowing a choice between {min, max, mean}, we might add median as well.

Nit: typo, line 4: whos' (the possessive of "who" is "whose")
Nit: I still think the help text should put "--detail" close to "--verbose"
Comment on attachment 459674 [details] [diff] [review]
runtest.py mega patch - csv, indexing, cumulative score, updated internal data struct

Boo: there's something wrong. Behavior before your patch (and presumably the expected behavior):

bash-3.2$ python runtests.py -f -v --detail --avm ../../../tamarin-redux3/objdir-64/shell/avmshell mmgc/gcbench.as
self.detail = True
Tamarin tests started: 2010-07-27 12:33:07.409681
Executing 1 test(s)
avm: ../../../tamarin-redux3/objdir-64/shell/avmshell version: cyclone
iterations: 1
test avm:0
0 running mmgc/gcbench.as
compiling: java -jar /Users/fklockii/Dev/asc/lib/asc.jar -import /Users/fklockii/Dev/Bugz/tamarin-redux/objdir-baseline64/../core/builtin.abc -config CONFIG::desktop=true -import /Users/fklockii/Dev/Bugz/tamarin-redux/objdir-baseline64/../shell/shell_toplevel.abc -optimize -AS3 mmgc/gcbench.as
Metric: time
mmgc/gcbench.as   2841

Behavior after your patch:

bash-3.2$ python runtests.py -f -v --detail --avm ../../../tamarin-redux3/objdir-64/shell/avmshell mmgc/gcbench.as
Tamarin tests started: 2010-07-27 12:33:19.606983
Executing 1 test(s)
avm: ../../../tamarin-redux3/objdir-64/shell/avmshell version: cyclone
iterations: 1
test avm:0
compiling: java -jar /Users/fklockii/Dev/asc/lib/asc.jar -import /Users/fklockii/Dev/Bugz/tamarin-redux/objdir-baseline64/../core/builtin.abc -config CONFIG::desktop=true -import /Users/fklockii/Dev/Bugz/tamarin-redux/objdir-baseline64/../shell/shell_toplevel.abc -optimize -AS3 mmgc/gcbench.as
Runtest Abnormal Exit

Sanity check (that this avm should be able to run that test):

bash-3.2$ ../../../tamarin-redux3/objdir-64/shell/avmshell mmgc/gcbench.abc
Garbage Collector Test
Stretching memory with a binary tree of depth 18
Creating a long-lived binary tree of depth 16
Creating a long-lived array of 500000 doubles
Creating 33825 trees of depth 4
Top down construction took 196msecs
Bottom up construction took 202msecs
Creating 8256 trees of depth 6
Top down construction took 201msecs
Bottom up construction took 190msecs
Creating 2052 trees of depth 8
Top down construction took 208msecs
Bottom up construction took 228msecs
Creating 512 trees of depth 10
Top down construction took 204msecs
Bottom up construction took 198msecs
Creating 128 trees of depth 12
Top down construction took 222msecs
Bottom up construction took 189msecs
Creating 32 trees of depth 14
Top down construction took 198msecs
Bottom up construction took 188msecs
Creating 8 trees of depth 16
Top down construction took 234msecs
Bottom up construction took 213msecs
metric time 3139
(In case it's not clear, buried in all that output: the second run, labelled "Behavior after your patch:", dies with the message "Runtest Abnormal Exit".)
Attached patch Updated runtests mega patch (obsolete) — Splinter Review
- Fixed the issue with a single metric causing an abnormal exit (also added debug code so that a traceback is printed, for now).
- Fixed the tabs issue.
- Added a median function for the metrics calc.
- Fixed the whos' typo.
- Re --detail next to --verbose: since perf runtests is a subclass of runtestBase, I can't mix and match the help strings between the two classes.
Attachment #459674 - Attachment is obsolete: true
Attachment #460570 - Flags: feedback?(fklockii)
Attachment #460570 - Flags: feedback?(edwsmith)
Attachment #460570 - Flags: feedback?(brbaker)
Attachment #459674 - Flags: feedback?(fklockii)
Attachment #459674 - Flags: feedback?(edwsmith)
Attachment #459674 - Flags: feedback?(brbaker)
Attached patch Updated runtests mega patch (obsolete) — Splinter Review
Fix minor SystemExit issue.
Attachment #460570 - Attachment is obsolete: true
Attachment #460571 - Flags: feedback?(fklockii)
Attachment #460571 - Flags: feedback?(edwsmith)
Attachment #460571 - Flags: feedback?(brbaker)
Attachment #460570 - Flags: feedback?(fklockii)
Attachment #460570 - Flags: feedback?(edwsmith)
Attachment #460570 - Flags: feedback?(brbaker)
Comment on attachment 460570 [details] [diff] [review]
Updated runtests mega patch

All code files (including these .py files) should have modelines at the top for vi and emacs. You can crib from core/builtin.py if you like.
(In reply to comment #8)
> Comment on attachment 460570 [details] [diff] [review]
> Updated runtests mega patch
>
> All code files (including these .py files) should have modelines at the top
> for vi and emacs. You can crib from core/builtin.py if you like.

Also, util/addmodeline.sh supports Python (though looking at it now, the modeline it adds is a tiny bit different from the one in core/builtin.py).
Attachment #460571 - Flags: feedback?(edwsmith)
As discussed between Chris and me yesterday in private chat, this patch inverts the spdup %diff, so that improvements yield negative percentages and regressions yield positive ones. I can cope with either convention (though I admit a slight preference for the original version, where minus signs mean badness, as my terminal will color negative numbers red).

When you look into this, you might consider changing the script so that memory reporting is consistent with whatever choice you make: if a negative %diff means a regression, then make sure that a memory regression yields a negative number. (Right now the memory reporting produces a sign that is opposite to the v8/time reporting, so "fixing" one "breaks" the other.)
Code nitpick: if you're storing first-class procedures in tables, I advise against using the identity of a particular procedure to decide control flow. I'm referring in particular to:

# if the best value is the max value, reverse the sign of the spdup
sign = -1 if self.metricInfo[metric]['best'] is max else 1

which relies on the identity of max to decide how to set the sign. This is cute but fragile (what if another best-chooser should also reverse the sign? What if someone uses an instrumenting wrapper around max as the best-chooser?).

(There may also be a separate design issue hiding here: perhaps we should tease apart the two issues of whether LargerIsBetter and what SelectionMechanism (Best? Worst? Average?) the user desires. For example, for memory usage it is not the case that LargerIsBetter (in the usual sense of the word "better"), but I like the current behavior where the script selects the Worst value for presentation and spdup calculation. It seems like that could be indicated explicitly.)
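A minimal Python sketch of the decoupling being suggested here; the field names are hypothetical (the patch later adopts a similar largerIsFaster flag in comment 17):

    # hypothetical: separate "which sample to report" from "which direction is good"
    metric_info = {
        'time':   {'select': min, 'largerIsBetter': False},
        'memory': {'select': max, 'largerIsBetter': False},  # report the worst sample; smaller is still better
    }

    def spdup_sign(metric):
        # the sign now depends on an explicit flag, not on the identity of
        # min/max, so wrapping or replacing the selection function can't
        # silently flip the reported direction
        return 1 if metric_info[metric]['largerIsBetter'] else -1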
I discussed the calculation of sig with Ed recently.

1. He told me he was okay with changing its calculation to use (sumStdDev + 1) in the denominator, which would make sig purely a damping factor on the %diff. So we may want to do that.

2. Ed also mentioned that we could just drop reporting the sig value in the condensed output, leaving only the "+/-/++/--" marker on the line. At first I was a little averse to this, but I just realized that if we did that, it would leave room to put in a %diff for the average of the measurements, in addition to the current %diff of the selected "best" of the measurements.
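A sketch of the damping proposed in point 1, under the assumption that sig is the %diff divided by a noise term; the exact inputs aren't spelled out in this bug, so the names here are hypothetical:

    # hypothetical damped significance calculation
    def sig(pct_diff, sum_std_dev):
        # (sum_std_dev + 1) in the denominator makes sig a pure damping
        # factor on the %diff: noisier measurements shrink the reported value
        return pct_diff / (sum_std_dev + 1)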
In the old script, "--memory" would turn on memory reporting and turn off reporting of the default metric. In the new one, "--memory" adds memory reporting but also continues to report the default metric. Is there a way to get back the previous behavior?
Prototyped the ideas I posted in recent comments:

- reverted to the previous signage when comparing two vms
- put in special handling of the memory metric when calculating signage
- added a %diff for the avg in addition to the current %diff for the best
Attachment #462369 - Flags: feedback?(cpeyer)
Another suggestion: do all environment inspection and checking before initiating any benchmarks. I just ran into this: the runtests script eagerly started running all the compiled asmicro benchmarks, but then, upon finishing those, it died with the message "ERROR: shell .abc does not exist, ..."

I understand that we may not want to unconditionally require that all runs set SHELLABC or pass the --shellabc argument, since it is not necessary in all cases (namely when all the object code has already been compiled, right?). But it should be easy and quick to first do a prepass that checks whether all of the necessary object code is in place, and only if any is missing *then* check whether the environment is properly configured to do a compile. (You could even compile a dummy file to really make sure the whole pipeline is clean.) A sketch of the idea follows.
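A minimal sketch of that prepass, under the assumption that each test's compiled object code lives next to its source; the helper names are hypothetical, not runtests.py API:

    import os

    def abc_path(test):
        # hypothetical mapping: mmgc/gcbench.as -> mmgc/gcbench.abc
        return os.path.splitext(test)[0] + '.abc'

    def prepass_check(tests, shellabc_configured):
        """Fail fast, before any benchmark runs: find missing .abc files,
        and only demand a working compile setup if some are missing."""
        missing = [t for t in tests if not os.path.exists(abc_path(t))]
        if missing and not shellabc_configured:
            raise SystemExit('ERROR: shell .abc does not exist and neither '
                             'SHELLABC nor --shellabc is set; cannot compile: '
                             + ', '.join(missing))
        return missing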
Comment on attachment 460571 [details] [diff] [review]
Updated runtests mega patch

I think I'm done with feedback. I'm really happy with my local hack of outputting the %dAvg in addition to the %dBst; unless you find objections elsewhere, I think you should adopt it or something similar.
Attachment #460571 - Flags: feedback?(fklockii) → feedback+
Merged the patch from Felix and fixed the following issues:

Re comment 11: Added another parameter to the metricInfo.py file called "largerIsFaster". This decouples the selection mechanism from whether larger values are considered faster. Defaults to False.

Re comment 13: --memory is now hard-coded to only display the memory metric when enabled. This would be easy to override with (yet another) command-line switch, but I'll leave that for a future bug if desired.

Re comment 15: Opened bug 584753.

Re the %diff for avg and dropping the sig value (comment 12 and comment 14): I really like this; leaving it in with the slight modification of changing some variable names to be a little clearer.

I believe that this patch should be ready for submission. Additional bugs can be opened afterwards regarding any improvements needed.
Attachment #460571 - Attachment is obsolete: true
Attachment #462369 - Attachment is obsolete: true
Attachment #463200 - Flags: review?(brbaker)
Attachment #460571 - Flags: feedback?(brbaker)
Attachment #462369 - Flags: feedback?(cpeyer)
Attachment #463200 - Flags: review?(brbaker) → review?(dschaffe)
Comment on attachment 463200 [details] [diff] [review]
Updated patch - incorporates feedback from Felix

Reviewed the patch briefly.

- Should verify that --socketlog works correctly so the performance tests in buildbot continue to log into the db.
- I saw some odd results using runtests.py --index=foo.py -i 3 v8.5 where the conf interval was > 100%, e.g.:

Metric: v8 (indexed) custom v8 normalized metric (hardcoded in the test)
v8.5/optimized/crypto.as     1.00  0.99  117.6%  [' 0.99', ' 1.00', ' 0.99']
v8.5/optimized/deltablue.as  1.01  1.00  117.1%  [' 0.99', ' 1.00', ' 1.01']
...

- Also, this was a little unexpected:

$ runtests.py --saveindex=dan.py v8.5/
$ runtests.py --index=dan.py v8.5/optimized

The 2nd run ran all of v8 instead of just v8.5/optimized; when I edited dan.py and removed the other cases, it just ran optimized. You should be able to run a subset of your index.
Attachment #463200 - Flags: review?(dschaffe) → review+
Re comment 18:

- Verified that socketlog is working as expected.
- Added the ability to specify a subset of the index tests to run (will run the union of the index tests and the tests specified).
- The issue with conf interval > 100% has been filed as bug 584891.

Patch pushed: http://hg.mozilla.org/tamarin-redux/rev/9ef6c916002e
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Forgot to check for zero before calculating the speedup. Reviewed by jsudduth via IM.
divide by zero fix pushed to redux: http://hg.mozilla.org/tamarin-redux/rev/d8b1686924e3
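For reference, a minimal sketch of the kind of guard described; the actual fix is in the changeset linked above, and this function name is hypothetical:

    # hypothetical divide-by-zero guard when computing speedup
    def speedup(base, new):
        if base == 0:
            return 0.0  # avoid ZeroDivisionError; report no change
        return 100.0 * (new - base) / base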
Bugfix for comparing two VMs that output different metrics: only display the metrics common to both VMs. Also adds the ability to handle metrics that are not defined in metricInfo.py; undefined metrics default to min for calculating the best value and to largerIsFaster=False. (This was already documented in metricInfo.py, but didn't actually work.)
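A sketch of how that fallback might be applied, assuming the dict-based metricInfo shape used earlier in this bug; the helper name is hypothetical:

    # hypothetical defaults for metrics missing from metricInfo.py
    DEFAULT_METRIC = {'best': min, 'largerIsFaster': False}

    def metric_settings(metric_info, metric):
        # unknown metrics fall back to min / largerIsFaster=False,
        # matching what metricInfo.py documents
        return metric_info.get(metric, DEFAULT_METRIC)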
Attachment #464172 - Flags: review?(brbaker)
Attachment #464172 - Flags: review?(brbaker) → review?(jsudduth)
Attachment #464172 - Attachment is obsolete: true
Attachment #464203 - Flags: review?
Attachment #464172 - Flags: review?(jsudduth)
Attachment #464203 - Flags: review? → review?(jsudduth)
Attachment #464203 - Flags: review?(jsudduth) → review+
(In reply to comment #17)
> re Comment 13:
> --memory now is hard-coded to only display the memory metric when enabled.
> This would be easy to override with (yet another) command line switch, but
> I'll leave that for a future bug if desired

I'll open a bug for this. (I *liked* the option of getting succinct time and memory stats from just one run; I just wanted to know if I could drop the times for those cases where memory usage is the criterion I am focusing on.)
(In reply to comment #24)
> (In reply to comment #17)
> > re Comment 13:
> > --memory now is hard-coded to only display the memory metric when enabled.
> > This would be easy to override with (yet another) command line switch, but
> > I'll leave that for a future bug if desired
>
> I'll open a bug for this. (I *liked* the option of getting succinct time and
> memory stats from just one run; I just wanted to know if I could drop the
> times for those cases where memory usage is the criteria I am focusing on.)

Filed as bug 587060.
Status: RESOLVED → VERIFIED