Closed
Bug 572860
Opened 15 years ago
Closed 15 years ago
performance runtests improvements tracking bug
Categories
(Tamarin Graveyard :: Tools, defect, P3)
Tamarin Graveyard
Tools
Tracking
(Not tracked)
VERIFIED
FIXED
Q3 11 - Serrano
People
(Reporter: cpeyer, Assigned: cpeyer)
References
Details
(Whiteboard: Tracker)
Attachments
(3 files, 5 obsolete files)
82.52 KB,
patch
|
dschaffe
:
review+
|
Details | Diff | Splinter Review |
944 bytes,
patch
|
Details | Diff | Splinter Review | |
2.01 KB,
patch
|
jsudduth
:
review+
|
Details | Diff | Splinter Review |
Tracking bug for a variety of performance/runtests.py improvements.
Flags: flashplayer-qrb+
Priority: -- → P3
Whiteboard: Tracker
Target Milestone: --- → flash10.2
Assignee | ||
Comment 1•15 years ago
|
||
I didn't intend to have a giant patch for all the runtests improvements, but most of the changes relied on a new datastruct used to store results internally so they slowly merged together as I was working on it.
I'm not too concerned about the code change (though feel free to review) but would like for feedback on using the script.
Additions:
multi-metric support:
runtests can now handle multiple metrics being reported in one test.
The easiest way to demonstrate that is to run with -m. Instead of just memory results, both memory and test results will be displayed. Runtests also supports output from 568933.
vprof (permf) results are also supported, but the --perfm flag still must be used
indexing:
Two new command line flags have been added: --index and --saveindex
both require an index filename to be specified after the flag.
"--index index.py" will load index.py and run all of the tests in that index and report results normalized to the index values
"--saveindex indexname.py" will run the specified tests and save the values to the give index file. Specifying both --index and --saveindex will merge the values specified in --index to the new saved indexfile.
cumulative score:
new option of --score will display a cumulative score for every metric. Really most useful with an index file.
csv output:
"--csv filename" will save run results to a csv file that can be analyzed using excel pivottables. Use --csvappend if you want to append the results to the csv file
There is still work to be done with including configuration information into the csv data.
misc:
- use --raw to display all test values for each vm
- removed -p --prettyprint option (I don't think this was used by anybody)
Attachment #459674 -
Flags: feedback?(fklockii)
Attachment #459674 -
Flags: feedback?(edwsmith)
Attachment #459674 -
Flags: feedback?(brbaker)
Assignee | ||
Comment 2•15 years ago
|
||
metricinfo.py has also been added. This file defines all of the metrics being used and how to calculated the "best" value for that metric as well as information on units and a description. see the file for details.
Note that metrics can now compute the best value using the mean (or any other function that can be applied to a list for that matter).
Comment 3•15 years ago
|
||
Comment on attachment 459674 [details] [diff] [review]
runtest.py mega patch - csv, indexing, cumulative score, updated internal data struct
I know you said you cared more about usage feedback than code feedback, but, well, I started by reading the code (not finished).
Initial feedback:
- remove the tabs from the python code (preferably in a preliminary separate refactoring). I don't want to have to worry about Python's notion of spaces/tab conflicting with my own.
- if we're already allowing choice between {min,max,mean}, we might add median as well.
Nit: typo line 4: whos' (the possessive of "who" is "whose")
Nit: I still think help text should put "--detail" close to "--verbose"
Comment 4•15 years ago
|
||
Comment on attachment 459674 [details] [diff] [review]
runtest.py mega patch - csv, indexing, cumulative score, updated internal data struct
Boo: there's something wrong.
Behavior before your patch (and presumably expected behavior):
bash-3.2$ python runtests.py -f -v --detail --avm ../../../tamarin-redux3/objdir-64/shell/avmshell mmgc/gcbench.as
self.detail = True
Tamarin tests started: 2010-07-27 12:33:07.409681
Executing 1 test(s)
avm: ../../../tamarin-redux3/objdir-64/shell/avmshell version: cyclone
iterations: 1
test avm:0
0 running mmgc/gcbench.as
compiling: java -jar /Users/fklockii/Dev/asc/lib/asc.jar -import /Users/fklockii/Dev/Bugz/tamarin-redux/objdir-baseline64/../core/builtin.abc -config CONFIG::desktop=true -import /Users/fklockii/Dev/Bugz/tamarin-redux/objdir-baseline64/../shell/shell_toplevel.abc -optimize -AS3 mmgc/gcbench.as
Metric: time
mmgc/gcbench.as 2841
Behavior after your patch:
bash-3.2$ python runtests.py -f -v --detail --avm ../../../tamarin-redux3/objdir-64/shell/avmshell mmgc/gcbench.as
Tamarin tests started: 2010-07-27 12:33:19.606983
Executing 1 test(s)
avm: ../../../tamarin-redux3/objdir-64/shell/avmshell version: cyclone
iterations: 1
test avm:0
compiling: java -jar /Users/fklockii/Dev/asc/lib/asc.jar -import /Users/fklockii/Dev/Bugz/tamarin-redux/objdir-baseline64/../core/builtin.abc -config CONFIG::desktop=true -import /Users/fklockii/Dev/Bugz/tamarin-redux/objdir-baseline64/../shell/shell_toplevel.abc -optimize -AS3 mmgc/gcbench.as
Runtest Abnormal Exit
Sanity check (that this avm should be able to run that test):
bash-3.2$ ../../../tamarin-redux3/objdir-64/shell/avmshell mmgc/gcbench.abc
Garbage Collector Test
Stretching memory with a binary tree of depth 18
Creating a long-lived binary tree of depth 16
Creating a long-lived array of 500000 doubles
Creating 33825 trees of depth 4
Top down construction took 196msecs
Bottom up construction took 202msecs
Creating 8256 trees of depth 6
Top down construction took 201msecs
Bottom up construction took 190msecs
Creating 2052 trees of depth 8
Top down construction took 208msecs
Bottom up construction took 228msecs
Creating 512 trees of depth 10
Top down construction took 204msecs
Bottom up construction took 198msecs
Creating 128 trees of depth 12
Top down construction took 222msecs
Bottom up construction took 189msecs
Creating 32 trees of depth 14
Top down construction took 198msecs
Bottom up construction took 188msecs
Creating 8 trees of depth 16
Top down construction took 234msecs
Bottom up construction took 213msecs
metric time 3139
Comment 5•15 years ago
|
||
(in case its not clear, buried in all that output, in the second run labelled "Behavior after your patch:", it dies, with the message "Runtest Abnormal Exit.")
Assignee | ||
Comment 6•15 years ago
|
||
- Fixes issue with single metric causing an abnormal exit (also added debug code so that traceback is printed for now).
- fixed tabs issue
- add median function for metrics calc
- fix whos' typo
- re: --detail next to verbose - since perf runtests is a subclass of runtestBase, I can't mix & match the help strings between the two classes.
Attachment #459674 -
Attachment is obsolete: true
Attachment #460570 -
Flags: feedback?(fklockii)
Attachment #460570 -
Flags: feedback?(edwsmith)
Attachment #460570 -
Flags: feedback?(brbaker)
Attachment #459674 -
Flags: feedback?(fklockii)
Attachment #459674 -
Flags: feedback?(edwsmith)
Attachment #459674 -
Flags: feedback?(brbaker)
Assignee | ||
Comment 7•15 years ago
|
||
Fix minor SystemExit issue.
Attachment #460570 -
Attachment is obsolete: true
Attachment #460571 -
Flags: feedback?(fklockii)
Attachment #460571 -
Flags: feedback?(edwsmith)
Attachment #460571 -
Flags: feedback?(brbaker)
Attachment #460570 -
Flags: feedback?(fklockii)
Attachment #460570 -
Flags: feedback?(edwsmith)
Attachment #460570 -
Flags: feedback?(brbaker)
Comment 8•15 years ago
|
||
Comment on attachment 460570 [details] [diff] [review]
Updated runtests mega patch
All code files (including these .py files) should have modelines at the top for vi and emacs. You can crib from core/builtin.py if you like.
Comment 9•15 years ago
|
||
(In reply to comment #8)
> Comment on attachment 460570 [details] [diff] [review]
> Updated runtests mega patch
>
> All code files (including these .py files) should have modelines at the top for
> vi and emacs. You can crib from core/builtin.py if you like.
Also, util/addmodeline.sh supports Python (though looking at that now, the modeline it adds is a tiny bit different than the one in core/builtin.py).
Updated•15 years ago
|
Attachment #460571 -
Flags: feedback?(edwsmith)
Comment 10•15 years ago
|
||
As discussed between Chris and I yesterday in private chat, this patch is inverting the spdup %diff, so that improvements yield negative %'s and regressions yield positive ones. I can cope with either convention (but I admit a slight preference for the original version where minus signs mean badness, as my terminal will color negative numbers red.)
When you look into this, you might consider trying to change the script so that memory reporting is consistent with whatever choice you make: if a negative %diff means a regression, then make sure that a memory regression yields a negative number. (Right now the memory reporting produces a sign that is opposite from the v8/time reporting, so "fixing" one "breaks" the other.)
Comment 11•15 years ago
|
||
Code nitpick: if you're storing first class procedures into tables, I advise against using the identity of a particular procedure to decide control flow. I'm referring in particular to:
# if the best value is the max value, reverse the sign of the spdup
sign = -1 if self.metricInfo[metric]['best'] is max else 1
which is relying on the identity of max to decide how to set the sign. This is cute but fragile (what if another best-chooser should also reverse the sign? What if someone uses an instrumenting wrapper around max as the best-chooser?)
(There also may be a separate design issue hiding here: perhaps we should tease apart the two issues of whether LargerIsBetter, and what SelectionMechanism (Best? Worst? Average?) the user desires. For example, for memory usage it is not the case that LargerIsBetter (in the usual sense of the word "better"), but I like the current behavior where the script is selecting the Worst value for presentation and spdup calculation. It seems like that could be indicated explicitly.)
Comment 12•15 years ago
|
||
I discussed the calculation of sig with Ed recently.
1. He told me he was okay with changing its calculation to use (sumStdDev + 1) in the denominator, which would make the sig purely a damping factor on the %diff. So we may want to do that.
2. Ed also mentioned that we could just drop reporting the sig value at all in the condensed output, leaving only the "+/-/++/--" marker on the line. At first I was a little adverse to this, but I just realized that if we did that, it would leave room for us to put in a %diff for the average of the measurements, in addition to the current %diff of the selected "best" of the measurements.
Comment 13•15 years ago
|
||
In the old script, "--memory" would turn on memory reporting and turn off reporting of the default metric. In the new one, "--memory" adds memory reporting but also continues to report the default metric. Is there a way to get back the previous behavior?
Comment 14•15 years ago
|
||
prototyped ideas I posted in recent comments:
- reverted to the previous signage when comparing two vms
- put in special handling of memory metric when calculating signage
- added %diff for avg in addition to current %diff for best.
Attachment #462369 -
Flags: feedback?(cpeyer)
Comment 15•15 years ago
|
||
Another suggestion: do all environment inspection checking before initiating any benchmarks.
I just ran into this, where the runtests script eagerly started running all the compiled asmicro benchmarks, but then upon finishing that, it died with a message that "ERROR: shell .abc does not exist, ..."
I understand that we may not want to unconditionally require that all runs set SHELLABC or pass --shellabc argument, since it is not necessary in all cases (namely if all the object code has been compiled, right?) But it should be easy and quick to first do a prepass checking if all of the necessary object code is in place, and if any are missing, *then* check if the environment is properly configured to do a compile. (You could even compile a dummy file to really make sure the whole pipeline is clean...)
Comment 16•15 years ago
|
||
Comment on attachment 460571 [details] [diff] [review]
Updated runtests mega patch
I think I'm done with feedback.
I'm really happy with my local hack of output the %dAvg in addition to the %dBst; unless you find objections elsewhere, I think you should adopt it or something similar.
Attachment #460571 -
Flags: feedback?(fklockii) → feedback+
Assignee | ||
Comment 17•15 years ago
|
||
Merged the patch from Felix as well as fixing the following issues:
re Comment 11:
Added another parameter to the metricInfo.py file called "largerIsFaster". This decouples the selectionMechanism from whether larger values are considered faster. Defaults to False
re Comment 13:
--memory now is hard-coded to only display the memory metric when enabled. This would be easy to override with (yet another) command line switch, but I'll leave that for a future bug if desired
re Comment 15: Opened up Bug 584753
re %diff for avg and dropping sig val (comment 12 and comment 14):
I really like this, leaving in with the slight modification of changing some variable names to be a little clearer
I believe that this patch should be ready for submission. Additional bugs can be opened afterwards regarding any improvements needed.
Attachment #460571 -
Attachment is obsolete: true
Attachment #462369 -
Attachment is obsolete: true
Attachment #463200 -
Flags: review?(brbaker)
Attachment #460571 -
Flags: feedback?(brbaker)
Attachment #462369 -
Flags: feedback?(cpeyer)
Assignee | ||
Updated•15 years ago
|
Attachment #463200 -
Flags: review?(brbaker) → review?(dschaffe)
Comment 18•15 years ago
|
||
Comment on attachment 463200 [details] [diff] [review]
Updated patch - incorporates feedback from Felix
reviewed the patch briefly.
- should verify --socketlog works correctly so the performance tests in buildbot continue to log into the db
- I saw some odd results using runtests.py --index=foo.py -i 3 v8.5 where the conf interval was > 100%. e.g.
Metric: v8 (indexed) custom v8 normalized metric (hardcoded in the test)
v8.5/optimized/crypto.as 1.00 0.99 117.6% [' 0.99', ' 1.00', ' 0.99']
v8.5/optimized/deltablue.as 1.01 1.00 117.1% [' 0.99', ' 1.00', ' 1.01']
...
also this was a little unexpected:
$ runtests.py --saveindex=dan.py v8.5/
$ runtests.py --index=dan.py v8.5/optimized
the 2nd run ran all of v8 instead of just v8.5/optimized, when I edited dan.py and removed the other cases it just ran optimized, should be able to run a subset of your index.
Attachment #463200 -
Flags: review?(dschaffe) → review+
Assignee | ||
Comment 19•15 years ago
|
||
re comment 18:
- Verified that socketlog is working as expected.
- added ability to specify a subset of the index tests to run (will run union of index tests and tests specified)
- The issue with conf interval > 100% has been filed as 584891
Patch pushed:
http://hg.mozilla.org/tamarin-redux/rev/9ef6c916002e
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated•15 years ago
|
Assignee | ||
Comment 20•15 years ago
|
||
Forgot to check for zero before calculating speedup. Reviewed by jsudduth via im.
Assignee | ||
Comment 21•15 years ago
|
||
divide by zero fix pushed to redux:
http://hg.mozilla.org/tamarin-redux/rev/d8b1686924e3
Assignee | ||
Comment 22•15 years ago
|
||
Bugfix for when comparing two vm's that output different metrics.
Only display the metrics common to both vm's.
Also adds ability to handle metrics that are not defined in metricInfo.py, undefined metrics default to min for calculating the best value and to largerIsFaster=False. (This was already documented in metricInfo.py, but didn't actually work)
Attachment #464172 -
Flags: review?(brbaker)
Assignee | ||
Updated•15 years ago
|
Attachment #464172 -
Flags: review?(brbaker) → review?(jsudduth)
Assignee | ||
Comment 23•15 years ago
|
||
Attachment #464172 -
Attachment is obsolete: true
Attachment #464203 -
Flags: review?
Attachment #464172 -
Flags: review?(jsudduth)
Assignee | ||
Updated•15 years ago
|
Attachment #464203 -
Flags: review? → review?(jsudduth)
Updated•15 years ago
|
Attachment #464203 -
Flags: review?(jsudduth) → review+
Comment 24•15 years ago
|
||
(In reply to comment #17)
> re Comment 13:
> --memory now is hard-coded to only display the memory metric when enabled.
> This would be easy to override with (yet another) command line switch, but I'll
> leave that for a future bug if desired
I'll open a bug for this. (I *liked* the option of getting succinct time and memory stats from just one run; I just wanted to know if I could drop the times for those cases where memory usage is the criteria I am focusing on.)
Comment 25•15 years ago
|
||
(In reply to comment #24)
> (In reply to comment #17)
> > re Comment 13:
> > --memory now is hard-coded to only display the memory metric when enabled.
> > This would be easy to override with (yet another) command line switch, but I'll
> > leave that for a future bug if desired
>
> I'll open a bug for this. (I *liked* the option of getting succinct time and
> memory stats from just one run; I just wanted to know if I could drop the times
> for those cases where memory usage is the criteria I am focusing on.)
Filed as bug 587060.
Assignee | ||
Updated•15 years ago
|
Status: RESOLVED → VERIFIED
You need to log in
before you can comment on or make changes to this bug.
Description
•