Bug 1021842 (Closed)
Opened 10 years ago · Closed 10 years ago

Talos: improve graphserver's formula for "final test result"

Categories: Testing :: Talos, defect
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: avih, Assigned: jmaher
Keywords: ateam-talos-task
Attachments
(7 files, 5 obsolete files)
1. 2.77 KB, patch
2. 3.86 KB, patch (rhelmer: review+, MattN: feedback+)
3. 1.38 KB, patch (avih: review+)
4. 4.79 KB, patch (MattN: review+, rhelmer: review+)
5. 10.83 KB, patch (catlee: review+)
6. 1.40 KB, patch (rhelmer: review+, avih: feedback+)
7. 9.03 KB, patch (catlee: review+)
+++ This bug was initially created as a clone of Bug #1020677 +++

Bug 1020677 discusses 2 main changes to improve our regression detection for tests where each sub-test is independently meaningful (e.g. yes TART, not TP5):

1. An improved formula for processing the sub-results into a single number.
2. Regression detection which also takes individual sub-results into account.

This bug is about #1 above. Summary of the current formula:

> - A talos run of a test consists of N executions of each sub-test (mostly N==25).
> - Take the median of the N results per sub-test.
> - Ignore the highest of those medians.
> - Average the rest <-- the "final" number by which we calculate regressions
>   and which graphserver displays at its graphs.

The new formula will:
- Use the geometric mean instead of the average,
- Without dropping the highest sub-result.

The formula is used explicitly in one place (when collecting the data, e.g. when talos sends it to graphserver), where graphserver then stores the result of this calculation, and it's also used implicitly at any later time when the graphs are displayed (using this stored pre-calculated value).

So, unfortunately, we can't just modify the formula and expect all the graphs to reflect it retroactively, because the calculation result is cached, and it's probably impractical to literally rewrite the cached history using the new formula (though TBH, it might not be an absolutely terrible idea if we believe the new formula is better).

So, we'll approach the transition from the old to the new formula as follows:
- The old formula will keep producing results for the transition period.
- The new formula will produce results under a new temporary test name for this period.

This will allow us to catch regressions around the switch (using either of the formulas).

Once the transition period ends and we're happy with the new formula, we'll terminate the old formula and report values only with the new formula - under the _old_ names (so effectively also terminating the new temporary names). That way whoever browses the graph doesn't have to switch to a different test (name), and instead may see only some "jump" in values when the formula switched.

We'll also add a dev.tree-management and wiki notice about which tests started using the new formula, and at which date. Anything still missing?
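For illustration, the old and new per-test summaries described in the bullet points above can be sketched as follows (a minimal sketch; the function names are ours, not graphserver's):

```python
import math

def old_summary(subtest_medians):
    # old graphserver formula: drop the single highest median,
    # then take the arithmetic mean of the rest
    kept = sorted(subtest_medians)[:-1]
    return sum(kept) / float(len(kept))

def new_summary(subtest_medians):
    # new formula: geometric mean of all medians, nothing dropped
    logs = [math.log(v) for v in subtest_medians]
    return math.exp(sum(logs) / len(logs))
```

For example, `old_summary([1, 2, 3, 10])` ignores the 10 and returns 2.0, while `new_summary` keeps every sub-result in the summary.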
Reporter
Comment 1•10 years ago
Joel, so re bug 1020677 comment 41-44, it seems like nothing blocks us from going ahead with this. Correct? How do you suggest we continue? I'd prefer if you took this bug, but if you can't, let me know and I'll try to handle it myself. Right now it seems we've decided to continue but.. nothing happens :)
Flags: needinfo?(jmaher)
Assignee
Comment 2•10 years ago
Uploading my WIP patch; it requires a few more changes and a lot of local testing. I won't get to this for at least a week, most likely 2 weeks.
Flags: needinfo?(jmaher)
Reporter
Comment 3•10 years ago
Comment on attachment 8448973 [details] [diff] [review]
initial work done (0.5)

Review of attachment 8448973 [details] [diff] [review]:
-----------------------------------------------------------------

I know it's a WIP patch, but wanted to check it out regardless. What repo is it at (so I could compare the existing formula [invisible in the patch] to the new one)?

::: server/pyfomatic/collect.py
@@ +228,3 @@
>      _updateAverageForTestRun(average, databaseCursor, inputStream, metadata)
> +
> +    # I am concerned that this might fail and we never

Accidentally cut line?

::: sql/data.sql
@@ +4570,4 @@
>  insert into tests values (NULL,"sessionrestore_no_auto_restore","Session Restore no Auto Restore Test",1,1,NULL);
>  
>  insert into tests values (NULL,"ts_paint_cold","TS Paint, Cold Load Time",1,1,NULL);
> +insert into tests values (NULL,"tp5o_scroll_paint","TP5 Scroll",1,1,15);

I'm guessing this is unrelated to this bug?
Assignee
Comment 5•10 years ago
Here is the repo: http://hg.mozilla.org/graphs - you can run it locally without too much trouble. Yes, the patch needs a lot of cleanup :)
Assignee
Comment 6•10 years ago
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #8499099 - Flags: review?(rhelmer)
Attachment #8499099 - Flags: review?(avihpit)
Reporter
Comment 7•10 years ago
Comment on attachment 8499099 [details] [diff] [review]
add geomean field to db and populate it (1.0)

Review of attachment 8499099 [details] [diff] [review]:
-----------------------------------------------------------------

Due to my lack of familiarity with Graph server internals, this is a f+ that it looks roughly right to me (I especially like the calculation part), but it shouldn't be considered a proper review.
Attachment #8499099 - Flags: review?(avihpit) → feedback+
Assignee
Comment 8•10 years ago
this hasn't been tested, a few more code paths for me to explore.
Attachment #8499113 - Flags: feedback?(MattN+bmo)
Assignee
Comment 9•10 years ago
For the core geomean calculation in the first patch, we need to run this on the database:

alter table alerts add geomean FLOAT;

When we get r+, I will request this be done on the graph server DBs. :MattN, this is something you will need to run manually on your local database.
Comment 10•10 years ago
Comment on attachment 8499099 [details] [diff] [review]
add geomean field to db and populate it (1.0)

Review of attachment 8499099 [details] [diff] [review]:
-----------------------------------------------------------------

Unless I am missing something,

::: server/pyfomatic/collect.py
@@ +229,5 @@
> +
> +    try:
> +        databaseCursor.execute("""select exp(sum(log(value))/count(value)) from test_run_values where test_run_id = %s""", (metadata.test_run_id))
> +        geomean = databaseCursor.fetchall()[0][0]
> +        if geomean is None:average

Is "average" supposed to be there ^

@@ +234,5 @@
> +            geomean = values[1]
> +        _updateGeoMeanForTestRun(geomean, databaseCursor, inputStream, metadata)
> +    except Exception, x:
> +        # we don't want to rollback or raise an exception, this should be passively running in the background
> +        pass

Silently eating exceptions makes me sad, although the comment is reasonable. Maybe we could log something?
Attachment #8499099 - Flags: review?(rhelmer) → review-
Assignee
Comment 11•10 years ago
Oops - there was a cut/paste nugget in there :( I have cleaned it up and added print statements. This is running in the context of a webserver, so I assume these print statements show up in the webserver log somewhere?
Attachment #8499099 - Attachment is obsolete: true
Attachment #8499136 - Flags: review?(rhelmer)
Comment 12•10 years ago
Comment on attachment 8499136 [details] [diff] [review]
add geomean field to db and populate it (1.1)

Review of attachment 8499136 [details] [diff] [review]:
-----------------------------------------------------------------

Hmm, so print might not work - there isn't really any logging going on in here already, is there :/ This is running under WSGI now, so it may not work if something is printed to stdout. I will test.
Comment 13•10 years ago
Comment on attachment 8499136 [details] [diff] [review]
add geomean field to db and populate it (1.1)

lgtm - tested print on the dev server and it goes to the Apache error log, which seems fine.
Attachment #8499136 - Flags: review?(rhelmer) → review+
Updated•10 years ago
Attachment #8499113 - Attachment is patch: true
Attachment #8499113 - Attachment mime type: text/html → text/plain
Comment 14•10 years ago
(In reply to Joel Maher (:jmaher) from comment #9)
> alter table alerts add geomean FLOAT;

Note that the correct table is test_runs:

alter table test_runs add geomean FLOAT;
Comment 15•10 years ago
Comment on attachment 8499136 [details] [diff] [review]
add geomean field to db and populate it (1.1)

While testing this patch I got the following error upon upload of a TART result:

Exception: to determine geomean from 'test_run_values' for 1369 - 'long' object is not iterable
Comment 16•10 years ago
Comment on attachment 8499136 [details] [diff] [review]
add geomean field to db and populate it (1.1)

Review of attachment 8499136 [details] [diff] [review]:
-----------------------------------------------------------------

::: server/pyfomatic/collect.py
@@ +228,4 @@
>      _updateAverageForTestRun(average, databaseCursor, inputStream, metadata)
> +
> +    try:
> +        databaseCursor.execute("""select exp(sum(log(value))/count(value)) from test_run_values where test_run_id = %s""", (metadata.test_run_id))

(In reply to Matthew N. [:MattN] from comment #15)
> 'long' object is not iterable

This is from a missing comma after "metadata.test_run_id". The argument should be a tuple like: (metadata.test_run_id,)

With that, the patch works on my server :)
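The root cause here is a classic Python gotcha worth spelling out: parentheses alone do not create a tuple, so the DB-API driver received a bare integer instead of a parameter sequence and tried to iterate it. A minimal illustration (the variable name is ours):

```python
test_run_id = 1369

not_a_tuple = (test_run_id)   # just a parenthesized int
one_tuple = (test_run_id,)    # the trailing comma makes it a 1-tuple

print(type(not_a_tuple).__name__)  # int
print(type(one_tuple).__name__)    # tuple
```

Passing `one_tuple` as the `execute()` parameters argument gives the driver the iterable it expects.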
Attachment #8499136 - Flags: feedback+
Comment 17•10 years ago
Comment on attachment 8499113 [details] [diff] [review]
modify api that graph.html uses to include geomean in addition to everything else (0.9)

Review of attachment 8499113 [details] [diff] [review]:
-----------------------------------------------------------------

I have this being used on my GS instance when "?geo" is the query string: http://graphs.mattn.ca/graph-local.html?geo

I didn't test all of the affected endpoints and I didn't check the math by hand.

::: server/api.py
@@ +322,5 @@
>          for changeset, total in averages.iteritems())
> +
> +    geomeans = dict(
> +        (changeset, total / geo_totals[changeset])
> +        for changeset, total in geomeans.iteritems())

OK, so this is averaging the geometric means for a specific changeset, but it's confusing because geo_totals actually contains a count of the number of geometric means for a changeset, not a total as the name says. num_geos would be easier to understand IMO. Not sure if you want to change that for none, one, or both of the cases (avg & geo).

@@ +328,5 @@
>          'averages': averages,
>          'min': min(row['average'] for row in rows),
>          'max': max(row['average'] for row in rows),
> +        'geomeans': geomeans,
> +        'mingeo': min(row['geomean'] for row in rows),

Not sure how much of a problem this is, but "mingeo" ends up being "null" since there are old rows where geomean wasn't calculated yet. It would be nice for it to exclude null.
Attachment #8499113 - Flags: feedback?(MattN+bmo) → feedback+
Comment 18•10 years ago
I tested that the following two endpoints show the data:

http://graphs.mattn.ca/api/test/runs/revisions?revision=b6c2deadbeef (used by compare-talos)
http://graphs.mattn.ca/api/test/runs?id=293&branchid=1&platformid=13 (used by graphs)
Assignee
Comment 19•10 years ago
Landed this backend change: http://hg.mozilla.org/graphs/rev/77bdf53fae9f

Waiting for the database to be updated before we make this live. Next up is getting the api changes landed.
Reporter
Comment 20•10 years ago
Do we have a bug to backfill the new column with the new calculation result, such that we could examine 2-3+ months of old data with the new calculation?
Assignee
Comment 21•10 years ago
I was not aware of plans for that; I was under the impression the plan was to make this possible and collect it going forward. What benefit would we get from going back in time? If you think it is beneficial, please file a bug and outline what needs to be done.
Reporter
Comment 22•10 years ago
Bug 1020677 comment 41-44 suggested that we should just switch the formula, in which case back-filling (practically, rewriting history) would probably be a bad idea.

However, the approach which was apparently chosen going forward, implied by the patch at this bug, is that we add a new column for the new formula's calculation, in which case this column would be empty for older results, and in that case I think it should be possible to back-fill it.

If we think that the new formula is better, and I think we do, then being able to watch graphs over a few months with the new formula would be better than watching them with the old formula.

So I guess the first question is "would it be possible to back-fill the new column of the new formula?" If yes (and I _think_ it should be a yes), then I'll file a bug for it.
Assignee
Comment 23•10 years ago
Yeah, upon further thought and your comment, we really should backfill as much as possible. I suspect it just requires writing a script to query the data for the page values, then calculate/insert the geomean. That might take some fiddling, but it should be possible. I suspect a custom script that we could run would be the ideal thing to do.
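A backfill script along those lines might look like the sketch below. To be clear, this is an outline under assumptions, not the script that was actually run: it assumes the schema from this bug (a nullable `test_runs.geomean` column and per-page values in `test_run_values` keyed by `test_run_id`) and a generic DB-API connection.

```python
import math

def geomean(values):
    # geometric mean in log space; caller must filter out non-positive values
    return math.exp(sum(math.log(v) for v in values) / len(values))

def backfill(conn):
    cur = conn.cursor()
    # find runs stored before the geomean column existed
    cur.execute("select id from test_runs where geomean is NULL")
    for (run_id,) in cur.fetchall():
        cur.execute("select value from test_run_values where test_run_id = %s",
                    (run_id,))  # note the 1-tuple
        values = [v for (v,) in cur.fetchall() if v > 0]
        if values:
            cur.execute("update test_runs set geomean = %s where id = %s",
                        (geomean(values), run_id))
    conn.commit()
```

The zero-filtering mirrors the treatment discussed later in this bug, where log(0) would otherwise be undefined.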
Reporter
Comment 24•10 years ago
Comment on attachment 8499136 [details] [diff] [review]
add geomean field to db and populate it (1.1)

Review of attachment 8499136 [details] [diff] [review]:
-----------------------------------------------------------------

::: server/pyfomatic/collect.py
@@ +228,4 @@
>      _updateAverageForTestRun(average, databaseCursor, inputStream, metadata)
> +
> +    try:
> +        databaseCursor.execute("""select exp(sum(log(value))/count(value)) from test_run_values where test_run_id = %s""", (metadata.test_run_id))

So what happens if value is 0? log(value) would be -INFINITY, which is probably undesirable (and in a simpler calculation of the Nth root of the product, the product would be 0). I think some (sub)tests do report 0 in cases of error, and while we do want it to affect the result, I don't think we want it to nullify it.

In the simplified calculation, I'd say just use the Nth root of the product of the N-1 values (if it only had a single 0 sub-result), which would be the equivalent of averaging N values where one of them is 0, though I'm not sure how to approach it with the formula you used in this patch.
Reporter
Comment 25•10 years ago
(In reply to Avi Halachmi (:avih) from comment #24)
>
> So what happens if value is 0? log(value) would be -INFINITY, which is
> probably undesirable (and in a simpler calculation of Nth root of the
> product, the product would be 0).

Hmm.. I think there's a simple workaround for this issue, which would also solve the general issue of values lower than 1 (which I don't think we have right now, but it's not impossible to have them).

How about just doing ... sum(log(value + 1))/count(value) ... ?

Such a change would still keep the relative results the same, would hardly affect the absolute values (since most of them are considerably higher than 1 anyway), and would solve the 0 issue and the general "smaller than 1" issue. Joel?
Reporter
Comment 26•10 years ago
Maybe also subtract 1 from the final result (because it's a mean of each value plus 1), clamping it at 0 if it's negative?
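Putting comments 25 and 26 together, the proposed +1/-1 variant could be sketched like this (our naming; graphserver would do the equivalent in SQL):

```python
import math

def geomean_shifted(values):
    # shift every value by +1 before the log so a 0 contributes log(1) = 0
    # instead of -infinity, then shift the result back and clamp at 0
    m = math.exp(sum(math.log(v + 1) for v in values) / len(values)) - 1
    return max(m, 0.0)
```

For example, `geomean_shifted([0, 3])` is 1.0: the zero pulls the summary down without wiping it out, much like a zero does in an arithmetic mean.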
Comment 27•10 years ago
I suggest ignoring the zeros, and using absolute value, just in case:
> select
>   exp(sum(log(abs(value)))/count(value))
> from
>   test_run_values
> where
>   test_run_id = %s and
>   abs(value) > 0
Assignee
Comment 28•10 years ago
Kyle, what would happen if we selected nothing?
Assignee
Comment 29•10 years ago
Matt, thanks for the comment here. I have updated the patch to make the variable names more useful and to default to a value of 0 if we find a non-float type.

Avi, for geomeans I am really taking an average of the geomeans if we end up with >1 value for a given build/test/platform/branch combo. Do you think that is correct?
Attachment #8499113 - Attachment is obsolete: true
Attachment #8500664 - Flags: review?(MattN+bmo)
Flags: needinfo?(avihpit)
Updated•10 years ago
Attachment #8500664 - Flags: review?(rhelmer)
Reporter
Comment 30•10 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #27)
> I suggest ignoring the zeros, and using absolute value, just in case:
>
> > select
> >   exp(sum(log(abs(value)))/count(value))
> > from
> >   test_run_values
> > where
> >   test_run_id = %s and
> >   abs(value) > 0

This would completely ignore the 0's in the geomean calculation. However, I'd think that preferably having 0's at some sub-tests should also be reflected in the summary/average/geomean. The average already reflects it by including 0 in the average (so the average will be lower); my suggestion of adding 1 to each value and then subtracting 1 from the result would be the equivalent when doing a geomean, I think. But if that's hard to express in SQL syntax, then your suggestion is still better than the patch prior to comment 29. Thanks.

(In reply to Joel Maher (:jmaher) from comment #29)
> ...
> Avi, for geomeans I am really taking an average of the geomeans if we end up
> with >1 value for a given build/test/platform/branch combo. Do you think
> that is correct?

Not sure I follow you. The geomean _is_ the replacement for the average in the new calculation, so what does "average of geomeans" mean? Also, I really liked the previous calculation - how is the new one better? (Other than also dealing with 0's, which I think we could have done with the previous calculation instead of replacing it altogether.)
Flags: needinfo?(avihpit)
Comment 31•10 years ago
(In reply to Joel Maher (:jmaher) from comment #28)
> Kyle, what would happen if we selected nothing?

The result would be NULL (well, not even calculated). If there are no values, or all values are zero, then the geomean has no clear definition. Alternatively, I would define geomean([])=1, just for the sake of logical consistency.
Reporter
Comment 32•10 years ago
(In reply to Avi Halachmi (:avih) from comment #30)
> (In reply to Kyle Lahnakoski [:ekyle] from comment #27)
> > I suggest ignoring the zeros, and using absolute value, just in case:
>
> This would completely ignore the 0's in the geomean calculation. However,
> I'd think that preferably having 0's at some sub tests would also be
> reflected in the summary/average/geomean.
> ...

I think there is actually a case where we'd like to completely ignore the value - if there's no value at all (i.e. not even 0). We could get this case if we cancel a page/sub-test, so that its value isn't sent anymore to graphserver but does exist for test runs prior to removing that page. In such a case we should completely ignore it in any calculation (be it average, geomean, or anything else) where it isn't available.

Joel, what happens now if we cancel a page, e.g. like we removed alipay.com from tp5o in bug 1026869? Is it taken into account with the current average calculation for tp5o?

Kyle, what about comment 30 WRT the inclusion of 0's in the geomean to indicate a difference at the summary as well?
Flags: needinfo?(klahnakoski)
Flags: needinfo?(jmaher)
Comment 33•10 years ago
(In reply to Avi Halachmi (:avih) from comment #30)
> (In reply to Kyle Lahnakoski [:ekyle] from comment #27)
> > > select
> > >   exp(sum(log(abs(value)))/count(value))
> > > from
> > >   test_run_values
> > > where
> > >   test_run_id = %s and
> > >   abs(value) > 0
>
> This would completely ignore the 0's at the geomean calculation. However,
> I'd think that preferably having 0's at some sub tests would also be
> reflected at the summary/average/geomean.

Avi,

I am concerned your log(v+1) proposal will swamp legitimate performance values between 0 and 1. I do not know if that even happens, so maybe my concern is unwarranted.

Ignoring zeros can be good, especially if zero is being used to indicate failure. Furthermore, ignoring them will still impact the overall geomean[1], which means the introduction of zeros will be detectable. Finally, the v8 test suite uses the geomean: using log(v+1) we lose the ability to compare.

[1] except in the case when the ignored metric is very close to 1
Flags: needinfo?(klahnakoski)
Comment 34•10 years ago
Comment on attachment 8500664 [details] [diff] [review]
modify api that graph.html uses to include geomean in addition to everything else (1.0)

Review of attachment 8500664 [details] [diff] [review]:
-----------------------------------------------------------------

Looks reasonable to me; I am not set up to test at the moment. I wish we had tests on this codebase :/
Attachment #8500664 - Flags: review?(rhelmer) → review+
Reporter
Comment 35•10 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #33)
> I am concerned your log(v+1) proposal will swamp legitimate performance
> values between 0 and 1. I do not know if that even happens, so maybe my
> concern is unwarranted.

What does "swamp" mean/imply here? Could you please provide a sample set of numbers which would exhibit this issue with the (v+1) calculation? Also note that apart from adding 1 to each value, my suggestion also includes subtracting 1 from the final result (comment 26).

> Ignoring zeros can be good, especially if zero is being used to indicate
> failure. Furthermore, the ignorance will impact the overall geomean[1];
> which means the introduction of zeros will be detectable.

Indeed ignoring 0's will impact the overall result, but IMO less so, and less deterministically so, than including an unlikely low value in the calculation (such as 0 in an average - which we already do now).

> Finally, the v8
> test suite uses geomean: using log(v+1) we loose the ability to compare.

Not sure I understand this. Would you be able to provide a sample set of results which would get hurt by using the v+1 ... -1 calculation?
Comment 36•10 years ago
Comment on attachment 8500664 [details] [diff] [review]
modify api that graph.html uses to include geomean in addition to everything else (1.0)

Flag me for review again once we've settled on the formula and I can test.
Attachment #8500664 - Flags: review?(MattN+bmo)
Assignee
Comment 37•10 years ago
Kyle, I want to get this resolved. Can you help clarify your concerns with Avi's approach of:

exp(sum(log(abs(value + 1)))/count(value)) - 1
Flags: needinfo?(jmaher) → needinfo?(klahnakoski)
Comment 38•10 years ago
Avi,

(In reply to Avi Halachmi (:avih) from comment #35)
> What does "swamp" mean/imply here?

My apologies upfront about being pedantic here. A geometric argument would be easier to digest.

Let
  F0(v) = exp(sum(log(vi))/n)
  F1(v) = exp(sum(log(vi+1))/n) - 1
where v = [v1, v2, v3, ..., vn]

Consider a performance change of E in one of the statistics, such that
  v1' = (1+E)*v1 and v' = [v1', v2, v3, ..., vn]

F1(v') = exp(sum(log(vi'+1))/n)
F1(v') = exp([log(v1'+1) + log(v2+1) + ... + log(vn+1)]/n)
F1(v') = exp([log((1+E)*v1 + 1) + log(v2+1) + ... + log(vn+1)]/n)
F1(v') = exp([log((1+E)*v1 + 1) - log(v1+1) + log(v1+1) + log(v2+1) + ... + log(vn+1)]/n)
F1(v') = exp([log((1+E)*v1 + 1) - log(v1+1) + sum(log(vi+1))]/n)
F1(v') = exp(log((1+E)*v1 + 1)/n - log(v1+1)/n + sum(log(vi+1))/n)
F1(v') = exp(log((1+E)*v1 + 1)/n - log(v1+1)/n) * exp(sum(log(vi+1))/n)
F1(v') = exp(log((1+E)*v1 + 1) - log(v1+1))^(1/n) * F1(v)
F1(v') = (((1+E)*v1 + 1)/(v1+1))^(1/n) * F1(v)
F1(v') = ((v1+1 + v1*E)/(v1+1))^(1/n) * F1(v)

We will use Taylor's theorem near v1 to approximate the slope wrt E:

dF1(v)/dE = (1/n)*(((v1+1 + v1*E)/(v1+1))^((1/n)-1))*v1/(v1+1) * F1(v)
dF1(v)/dE (at E=0) = (1/n)*(1^((1/n)-1))*v1/(v1+1) * F1(v)
dF1(v)/dE (at E=0) = (1/n) * v1/(v1+1) * F1(v)

F1(v') = F1(v) + (1/n) * v1/(v1+1) * F1(v) * E   (at E near 0)

We look at F1(v') - F1(v) to emphasize our slope:

F1(v') - F1(v) = (1/n) * v1/(v1+1) * F1(v) * E   (at E near 0)
F1(v') - F1(v) = v1/(v1+1) * (F1(v) * E)/n       (at E near 0)

Using similar logic, compare this to the behaviour of F0 near E=0:

F0(v') - F0(v) = (F0(v) * E)/n                   (at E near 0)

The F1 function has an extra v1/(v1+1) term; for large values this is near 1, but for small values this number is small, and makes error detection more difficult.

F1(v') - F1(v) = v1/(v1+1) * (F0(v') - F0(v))    (at E near 0)

Suppose we have a statistic around 0.5, and it changes by 10%: How much does F1 change?

F1(v') - F1(v) = 0.5/(0.5+1) * (F0(v') - F0(v))  (at E near 0)
F1(v') - F1(v) = 1/3 * (F0(v') - F0(v))          (at E near 0)

Which is only 1/3 as sensitive as F0. And smaller values are worse, of course. This is what I meant by "swamp". All this is moot if performance values are > 1.
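The v1/(v1+1) sensitivity factor derived above is easy to check numerically. A quick sketch (not part of any patch; F1's slope is predicted using the exp form, since the constant -1 drops out of any difference):

```python
import math

def F0(v):
    # plain geometric mean
    return math.exp(sum(math.log(x) for x in v) / len(v))

def F1(v):
    # +1/-1 variant; the trailing -1 vanishes in differences F1(v') - F1(v)
    return math.exp(sum(math.log(x + 1) for x in v) / len(v)) - 1

v = [0.5, 0.7, 0.9, 1.1]
n = len(v)
E = 1e-6  # tiny relative perturbation of v1
vp = [v[0] * (1 + E)] + v[1:]

G1 = F0([x + 1 for x in v])               # geomean of the shifted values
pred0 = F0(v) * E / n                     # predicted F0 slope
pred1 = (v[0] / (v[0] + 1)) * G1 * E / n  # F1 slope: extra v1/(v1+1) factor

assert abs((F0(vp) - F0(v)) / pred0 - 1) < 1e-3
assert abs((F1(vp) - F1(v)) / pred1 - 1) < 1e-3
```

With v1 = 0.5 the F1 response is indeed about a third of the F0 response, matching the worked 10% example.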
Flags: needinfo?(klahnakoski)
Assignee
Comment 39•10 years ago
Thanks Kyle. What formula do you use in dzalerts?
Comment 40•10 years ago
Avi,

How does F1 compare to F0 in the face of zeros? We start with F1, and compare the new value to the old:

F1(v')/F1(v) = prod(vi'+1)^(1/n) / prod(vi+1)^(1/n)
F1(v')/F1(v) = (prod(vi'+1)/prod(vi+1))^(1/n)
F1(v')/F1(v) = [(0+1)*(v2+1)*(v3+1)*...*(vn+1) / ((v1+1)*(v2+1)*(v3+1)*...*(vn+1))]^(1/n)
F1(v')/F1(v) = [(0+1)/(v1+1)]^(1/n)
F1(v')/F1(v) = [1/(v1+1)]^(1/n)

Same for F0, which ignores the zero (v1' = 0):

F0(v')/F0(v) = [v2 * v3 * ... * vn]^(1/(n-1)) / [v1 * v2 * v3 * ... * vn]^(1/n)
F0(v')/F0(v) = [v2 * v3 * ... * vn]^(n/n(n-1)) / [v1 * v2 * v3 * ... * vn]^((n-1)/n(n-1))
F0(v')/F0(v) = [[v2 * v3 * ... * vn]^n / [v1 * v2 * v3 * ... * vn]^(n-1)]^(1/(n(n-1)))
F0(v')/F0(v) = [[v2 * v3 * ... * vn] / v1^(n-1)]^(1/(n(n-1)))
F0(v')/F0(v) = [v2 * v3 * ... * vn]^(1/(n(n-1))) / v1^(1/n)
F0(v')/F0(v) = (1/v1)^(1/n) * F0(v')^(1/n)

Let K1 = F1(v')/F1(v)
Let K0 = F0(v')/F0(v)

When is K1 > K0?
<=> [1/(v1+1)]^(1/n) > (1/v1)^(1/n) * F0(v')^(1/n)
<=> [1/(v1+1)]^(1/n) / (1/v1)^(1/n) > F0(v')^(1/n)
<=> [(1/(v1+1)) / (1/v1)]^(1/n) > F0(v')^(1/n)
<=> [v1/(v1+1)]^(1/n) > F0(v')^(1/n)
<=> [v1/(v1+1)] > F0(v')

If performance values less than 1 do not exist: then F0(v') > 1, and K1 < K0, and we can conclude F0 is a better zero indicator than F1. If performance values less than 1 do exist, then K1 could perform better given the condition above.

The difference between F0 and F1 is probably inconsequential in practice due to high vi>>1 values and n>3. When F0 and F1 do differ, we must also consider that the noise is similarly different. An analysis of how this affects confidence must be done to be sure F0 still performs better than F1.

In conclusion, I like F0 for its adherence to definition, and its proper treatment of performance values less than one.
Comment 41•10 years ago
Joel,

I use the formal definition of geomean: F0
Comment 42•10 years ago
(In reply to Avi Halachmi (:avih) from comment #35)
> (In reply to Kyle Lahnakoski [:ekyle] from comment #33)
> > Finally, the v8
> > test suite uses geomean: using log(v+1) we loose the ability to compare.
>
> Not sure I understand this. Would you be able to provide a sample set of
> results which would get hurt from using the v+1 ... -1 calculation?

All I am saying here is that the V8 performance statistic uses the formal definition of the geomean to combine the test results. If we use some other aggregate metric, like F1, then we will not be able to compare the official V8 statistic to the one we see in graph server.
Assignee
Comment 43•10 years ago
I thought I had implemented the formal definition to start with. Here is what I have in talos: http://hg.mozilla.org/build/talos/file/b2a3d0493173/talos/filter.py#l61

But we have questions about values <1. Kyle, can you link me to the formal definition you are using?
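For reference, the formal definition being discussed (the nth root of the product of n values) is usually computed in log space to avoid overflow. A minimal sketch along the lines of the talos filter (not a copy of it):

```python
import math

def geometric_mean(series):
    # nth root of the product, computed via logs; undefined for values <= 0,
    # which is exactly the zero problem discussed in this bug
    return math.exp(sum(math.log(v) for v in series) / len(series))
```

For example, geometric_mean([2, 8]) is 4.0, where the arithmetic mean would be 5.0.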
Flags: needinfo?(klahnakoski)
Reporter
Comment 44•10 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #38)
> Avi,
>
> (In reply to Avi Halachmi (:avih) from comment #35)
> > What does "swamp" mean/imply here?
>
> My apologies upfront about being pedantic here. A geometric argument would
> be easier to digest.

Pedantic is appreciated, I just didn't understand what "swamp" implies :) And now I do - it compresses the dynamic range slightly more than the "pure" geometric mean formula, and this compression increases the closer we are to 0. I also never claimed that the two calculations are equivalent - they're clearly not.

However, consider these two "sets of results", which are undeniably much closer to 0 than to 1, and which therefore should "suffer" most from the reduced sensitivity of the +/-1 variant:

var set1 = [.01, .02, .03, .04, .05];
var set2 = [.02, .03, .04, .05, .06];

Their averages are 0.03 and 0.04 respectively.
"Proper" geometric means are 0.026 and 0.037.
Geometric means with the +1/-1 variant are 0.0299 and 0.0399.

Regression from set1 to set2:
Using average: 33.33%
Using proper geometric mean: 43.1%
Using +1/-1 geometric mean: 33.4%

The +/-1 version is indeed less sensitive than the proper geomean. However, even for such small values, the +1/-1 geomean regression calculations will still be at least as sensitive as the average-based regression calculation which we've been using until now. So compared to now, I don't think it will hurt us.

> This is what I meant by "swamp". All this is moot if performance values are
> > 1.

I've never seen results smaller than 1. Maybe Joel did?

(In reply to Kyle Lahnakoski [:ekyle] from comment #40)
> ...
> In conclusion, I like F0 for its adherence to definition, and its proper
> treatment of performance values less than one.

I don't disagree with the preference for choosing the purest form available. But I'm trying to apply a practical approach where this specific deviation from "purest" might have more pros than cons, or where in practice the cons are negligible, as I _think_ they are.

(In reply to Kyle Lahnakoski [:ekyle] from comment #42)
> All I am saying here is that the V8 performance statistic uses the formal
> definition of geomean to combine the test results. If we use some other
> aggregate metric, like F1, then we will not be able to compare the official
> V8 statistic to the one we see in graph server.

I still don't get it. We've been using averages until now, and it never bothered us. Surely geomean +/-1 would still be better than averages? Also, the V8 numbers are more than 15k; the +/-1 variant's difference compared to the "pure" version would be less than negligible.
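The numbers in comment 44 are straightforward to reproduce (a quick verification sketch; set1 and set2 are taken from the comment, the helper names are ours):

```python
import math

def mean(v):
    return sum(v) / len(v)

def geomean(v):
    return math.exp(sum(math.log(x) for x in v) / len(v))

def geomean_shifted(v):
    return math.exp(sum(math.log(x + 1) for x in v) / len(v)) - 1

set1 = [.01, .02, .03, .04, .05]
set2 = [.02, .03, .04, .05, .06]

for name, f in [("average", mean),
                ("geomean", geomean),
                ("geomean +1/-1", geomean_shifted)]:
    change = (f(set2) / f(set1) - 1) * 100
    print("%s: %.1f%% regression" % (name, change))
# average: 33.3%, geomean: 43.1%, geomean +1/-1: 33.4%
```

This confirms that even for values far below 1, the +1/-1 variant tracks the average's sensitivity rather than losing ground to it.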
Comment 45•10 years ago
Joel, The code has implemented the formal definition.
Flags: needinfo?(klahnakoski)
Comment 46•10 years ago
Avi,

We are in agreement.

(In reply to Avi Halachmi (:avih) from comment #44)
> (In reply to Kyle Lahnakoski [:ekyle] from comment #42)
> > All I am saying here is that the V8 performance statistic uses the formal
> > definition of geomean to combine the test results. If we use some other
> > aggregate metric, like F1, then we will not be able to compare the official
> > V8 statistic to the one we see in graph server.
>
> I still don't get it. We've been using averages till now, and it never
> bothered us. Surely geomean +/-1 would be still better than averages?

Using averages was wrong, and geomean +/-1 is way better. Also, geomean +/-1 is negligibly different from the geomean for the purposes of regression detection, given the >>1 performance values we see.
Comment 47•10 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #45)
> Joel,
>
> The code has implemented the formal definition.

Well, except for its treatment of zeros, where it is undefined :)
Assignee
Comment 48•10 years ago
Attachment #8503345 - Flags: review?(avihpit)
Reporter
Comment 49•10 years ago
Comment on attachment 8503345 [details] [diff] [review]
reduce the chance of values = 0 (1.0)

Review of attachment 8503345 [details] [diff] [review]:
-----------------------------------------------------------------

w00t!
Attachment #8503345 - Flags: review?(avihpit) → review+
Assignee
Comment 50•10 years ago
https://hg.mozilla.org/graphs/rev/a1977105ed7c
Assignee
Comment 51•10 years ago
Slight adjustment to the script, accounting for your previous feedback and double-checking the code.

Going from the definitions listed here:
https://wiki.mozilla.org/Perfomatic:API#Get_test_run_information

We have 3 cases where the average is displayed and we should also display the geometric mean:

run info:
desc: https://wiki.mozilla.org/Perfomatic:API#Get_test_run_information
ex: http://graphs.mozilla.org/api/test/runs/info/?id=3794718 (broken)

test runs:
desc: https://wiki.mozilla.org/Perfomatic:API#Get_test_runs_.28data.29
ex: http://graphs.mozilla.org/api/test/runs?id=223&branchid=131&platformid=31&days=7

by revision:
desc: https://wiki.mozilla.org/Perfomatic:API#Get_test_data_by_revision
ex: http://graphs.mozilla.org/api/test/runs/revisions?revision=10d2046d2b64&revision=d16525937c8b

Going through the code for these 3 cases, I believe I have properly added geomean to the API.
Attachment #8500664 - Attachment is obsolete: true
Attachment #8503400 - Flags: review?(MattN+bmo)
Updated•10 years ago
Attachment #8503400 - Attachment is patch: true
Attachment #8503400 - Attachment mime type: text/html → text/plain
Comment 52•10 years ago
My server is now updated with the landed patch and attachment 8503400 [details] [diff] [review]. Testing now.
Comment 53•10 years ago
The mingeo and/or maxgeo line with the min/max is causing a 500 error on the server. I think it's not returning a number in some cases.
Comment 54•10 years ago
Comment on attachment 8503400 [details] [diff] [review]
modify api that graph.html uses to include geomean in addition to everything else (1.1)

Review of attachment 8503400 [details] [diff] [review]:
-----------------------------------------------------------------

::: server/api.py
@@ +329,5 @@
> 'min': min(row['average'] for row in rows),
> 'max': max(row['average'] for row in rows),
> + 'geomeans': geomeans,
> + 'mingeo': min(type(row['geomean']) if type(row['geomean']) == type(1.0) else 0.0 for row in rows),
> + 'maxgeo': max(type(row['geomean']) if type(row['geomean']) == type(1.0) else 0.0 for row in rows),

I think the first |type(…)| on each of these lines is wrong, but I'm no expert in Python. It ends up returning the type instead of the value.
Attachment #8503400 - Flags: review?(MattN+bmo) → review-
Comment 55•10 years ago
(In reply to Matthew N. [:MattN] (behind on reviews - focused on Loop) from comment #54)
> ::: server/api.py
> @@ +329,5 @@
> > 'min': min(row['average'] for row in rows),
> > 'max': max(row['average'] for row in rows),
> > + 'geomeans': geomeans,
> > + 'mingeo': min(type(row['geomean']) if type(row['geomean']) == type(1.0) else 0.0 for row in rows),
> > + 'maxgeo': max(type(row['geomean']) if type(row['geomean']) == type(1.0) else 0.0 for row in rows),
>
> I think the first |type(…)| on each of these lines is wrong but I'm not
> expert in Python. It ends up returning the type instead of the value.

You're right... comparing type(row['geomean']) == type(1.0) is a bit confusing too; that should probably just be e.g.:

'mingeo': min(row['geomean'] if type(row['geomean']) == float else 0.0 for row in rows),
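A minimal repro of the bug under discussion (the sample rows are made up): `type(x)` evaluates to the class object, not the value, so the generator yields types instead of numbers.

```python
rows = [{'geomean': 12.5}, {'geomean': 8.25}]

# The reviewed expression: note the stray type() around the value.
buggy = [type(row['geomean']) if type(row['geomean']) == type(1.0) else 0.0
         for row in rows]
print(buggy)  # [<class 'float'>, <class 'float'>] -- classes, not values

# Dropping the stray type() call yields the intended numbers:
mingeo = min(row['geomean'] if type(row['geomean']) == float else 0.0
             for row in rows)
print(mingeo)  # 8.25
```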
Assignee
Comment 56•10 years ago
Ah:

'mingeo': min(row['geomean'] if isinstance(row['geomean'], float) else 0.0 for row in rows),

Can we assume all values are float? I assume so, but let me double-check.
Assignee
Comment 57•10 years ago
Updated the patch to check for float|int. Tested with mocked-up Python locally.
Attachment #8503400 - Attachment is obsolete: true
Attachment #8504277 - Flags: review?(rhelmer)
Attachment #8504277 - Flags: review?(MattN+bmo)
Comment 58•10 years ago
Comment on attachment 8504277 [details] [diff] [review]
modify api that graph.html uses to include geomean in addition to everything else (2.0)

Review of attachment 8504277 [details] [diff] [review]:
-----------------------------------------------------------------

::: server/api.py
@@ +329,5 @@
> 'min': min(row['average'] for row in rows),
> 'max': max(row['average'] for row in rows),
> + 'geomeans': geomeans,
> + 'mingeo': min(row['geomean'] if isinstance(row['geomean'], float) or isinstance(row['geomean'], int) else 0.0 for row in rows),
> + 'maxgeo': max(row['geomean'] if isinstance(row['geomean'], float) or isinstance(row['geomean'], int) else 0.0 for row in rows),

This seems to work. I would have preferred just checking for the None/Null case (however that gets represented) instead of float/int, but this is better than nothing.
Attachment #8504277 - Flags: review?(MattN+bmo) → review+
Comment 59•10 years ago
Comment on attachment 8504277 [details] [diff] [review]
modify api that graph.html uses to include geomean in addition to everything else (2.0)

Review of attachment 8504277 [details] [diff] [review]:
-----------------------------------------------------------------

::: server/api.py
@@ +329,5 @@
> 'min': min(row['average'] for row in rows),
> 'max': max(row['average'] for row in rows),
> + 'geomeans': geomeans,
> + 'mingeo': min(row['geomean'] if isinstance(row['geomean'], float) or isinstance(row['geomean'], int) else 0.0 for row in rows),
> + 'maxgeo': max(row['geomean'] if isinstance(row['geomean'], float) or isinstance(row['geomean'], int) else 0.0 for row in rows),

Hm, so this could be coerced to float and have a None check instead:

'mingeo': min(float(row['geomean']) if row['geomean'] is not None else 0.0 for row in rows),

That will raise an exception if something is in there that can't be coerced to float. If this is never expected to happen, then raising here might be a good thing.
Attachment #8504277 - Flags: review?(rhelmer) → review+
Comment 60•10 years ago
(In reply to Robert Helmer [:rhelmer] from comment #59)
> Hm, so this could be coerced to float and have a None check instead:
>
> 'mingeo': min(float(row['geomean']) if row['geomean'] is not None
> else 0.0 for row in rows),
>
> That will raise an exception if something is in there that can't be coerced
> to float. If this is never expected to happen, then raising here might be a
> good thing.

(Maybe I should say "cast" instead of "coerce", since it's explicit.)

Anyway, that's more of a response to comment 58; you can take it or leave it. I think the code looks OK to go in as-is. Thanks to MattN for testing this.
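Pulling the review suggestions together, one robust spelling uses an isinstance tuple plus an explicit `is not None` check (note that `isinstance(x, None)` would itself raise TypeError, since `None` is not a type tuple). This is a sketch with made-up rows, not the landed patch:

```python
def safe_geo(row):
    # Treat a missing or non-numeric geomean as 0.0; cast accepted
    # int/float values to float so min()/max() always compare numbers.
    value = row.get('geomean')
    if value is None or not isinstance(value, (int, float)):
        return 0.0
    return float(value)

# Hypothetical rows, including the None case that caused the 500 error:
rows = [{'geomean': 12.5}, {'geomean': None}, {'geomean': 8}]
mingeo = min(safe_geo(row) for row in rows)
maxgeo = max(safe_geo(row) for row in rows)
print(mingeo, maxgeo)  # 0.0 12.5
```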
Assignee
Comment 61•10 years ago
https://hg.mozilla.org/graphs/rev/f8c4ec8f9ac9
Assignee
Comment 62•10 years ago
:rhelmer, now that we have the backend in place, can we get this deployed and start collecting data? We can view the results from a local version of graph.html. Should we file a bug for deployment? Do you have a suggested path forward?
Flags: needinfo?(rhelmer)
Comment 63•10 years ago
(In reply to Joel Maher (:jmaher) from comment #62)
> :rhelmer, now that we have the backend in place, can we get this deployed
> and started on collected data. We can view the results from a local version
> of graph.html. Should we file a bug for deployment? do you have a
> suggested path forward?

I filed bug 1082694 to track this and will work on it now.
Flags: needinfo?(rhelmer)
Assignee
Comment 64•10 years ago
So here is what I believe is left to do:

* look at the data now that we have some and verify it is what we want
* deploy a UI change to production that will view this data
* adjust the alerting to reference the geomean instead of the average field

Am I missing anything?
Comment 65•10 years ago
(In reply to Joel Maher (:jmaher) from comment #64)
> * deploy a UI change to production that will view this data

That's bug 1077177 for graph server, btw.
Assignee
Comment 66•10 years ago
After a brief chat on IRC, the next steps are:

* send geometric mean alerts to dev.tree-alerts (the new newsgroup)
* continue to send average alerts to dev.tree-management
* after a week or two, to ensure all is well, disable average alerts to dev.tree-management

This involves some assumptions, but all very doable:

* analyze_talos.py has enough cpu bandwidth to run two passes
* we are fine not emailing patch authors twice (only email on 'average' alerts for now)
* someone will look at both sources

The problem becomes: what do we do if we find alerts that show up in 'average' only and not in 'geomean'? Will we be fine ignoring those?
Assignee
Comment 67•10 years ago
I am not sure how to modify analysis.cfg.template. I would like to make sure that we have:

# Who should emails be sent to
regression_emails = dev.tree-management@lists.mozilla.org
geomean_regression_emails = dev.tree-alerts@lists.mozilla.org

I honestly have no idea what value is in regression_emails, so figuring out what is there would be a good idea.
Attachment #8526144 - Flags: review?(rhelmer)
Comment 68•10 years ago
Comment on attachment 8526144 [details] [diff] [review]
update scripts / tooling to support geomean (1.0)

I don't actually know where this tool is run... IIRC catlee has the analysis tool running somewhere.
Flags: needinfo?(catlee)
Updated•10 years ago
Attachment #8526144 - Flags: review?(rhelmer) → review+
Assignee
Comment 69•10 years ago
catlee, can you answer this question: I am not sure how to modify analysis.cfg.template. I would like to make sure that we have:

# Who should emails be sent to
regression_emails = dev.tree-management@lists.mozilla.org
geomean_regression_emails = dev.tree-alerts@lists.mozilla.org

I guess making available the analysis.cfg file (sans private keys, etc.) would be a good thing :)
Assignee
Comment 70•10 years ago
Also, if catlee knows more about this script and it isn't run on the normal graph server (i.e. this is run someplace that he has access to), then I would like to know what the procedure is for deploying this.
Comment 71•10 years ago
(In reply to Joel Maher (:jmaher) from comment #69)
> I guess making available the analysis.cfg file (sans private keys, etc.)
> would be a good thing :)

Ideally we could just check it into the repo in the appropriate location, but named as analysis.cfg.example etc.
Comment 72•10 years ago
That's what analysis.cfg.template is supposed to be.
Flags: needinfo?(catlee)
Comment 73•10 years ago
In any case, the deployment process right now is to file a bug in releng (cc me) specifying which settings need to be added or modified.
Assignee
Comment 74•10 years ago
fixed reference to data_type as a value instead of a string.
Attachment #8526144 - Attachment is obsolete: true
Attachment #8532064 - Flags: review?(catlee)
Updated•10 years ago
Attachment #8532064 - Flags: review?(catlee) → review+
Assignee
Comment 75•10 years ago
landed on graphs: http://hg.mozilla.org/graphs/rev/4095308c115a
Assignee
Comment 76•10 years ago
Fixed a typo and verified an alert on tree-alerts:
https://groups.google.com/forum/#!topic/mozilla.dev.tree-alerts/a7L7nh1PJlQ

Caveat: this most likely will only show unique geomean alerts, due to caching of the warnings to prevent duplication of alerts. The talos alert sheriff team will look at both average/geomean on all alerts for the bake-in period until we turn off the dev.tree-management average alerts.
Assignee
Comment 77•10 years ago
On 3 tests (tcanvasmark, v8_7, and kraken) we upload a single value from talos:
http://hg.mozilla.org/build/talos/file/tip/talos/output.py#l253

These are of type 'AVERAGE', not 'VALUES'. In the code to store a geometric mean, we only handle 'VALUES'. This patch stores the single value as the geometric mean for the 'AVERAGE' inputs we get.
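The branch described here might look roughly like this (a sketch with assumed function and constant names, not graphserver's actual code):

```python
import math

def stored_geomean(data_type, values):
    # Replicate-style suites report raw values; compute the geometric mean.
    if data_type == 'VALUES':
        return math.exp(sum(math.log(v) for v in values) / len(values))
    # tcanvasmark, v8_7, and kraken upload one pre-aggregated number;
    # store that single value as the "geomean" directly.
    if data_type == 'AVERAGE':
        return values[0]
    raise ValueError('unknown data type: %r' % data_type)

print(stored_geomean('VALUES', [2.0, 8.0]))   # ~4.0
print(stored_geomean('AVERAGE', [1234.5]))    # 1234.5
```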
Attachment #8537197 - Flags: review?(rhelmer)
Attachment #8537197 - Flags: review?(avihpit)
Reporter
Comment 78•10 years ago
Comment on attachment 8537197 [details] [diff] [review]
store the single value for average as geomean (1.0)

Review of attachment 8537197 [details] [diff] [review]:
-----------------------------------------------------------------

Changing to f+ because I'm not familiar with the graphserver internals, but the fix looks valid. Thanks!
Attachment #8537197 - Flags: review?(avihpit) → feedback+
Updated•10 years ago
Attachment #8537197 - Flags: review?(rhelmer) → review+
Assignee
Comment 79•10 years ago
http://hg.mozilla.org/graphs/rev/c6266505147e

:rhelmer, can you deploy this to the live server?
Comment 80•10 years ago
(In reply to Joel Maher (:jmaher) from comment #79)
> http://hg.mozilla.org/graphs/rev/c6266505147e
>
> :rhelmer, can you deploy this to the live server?

Done in bug 1112387.
Comment 81•10 years ago
(In reply to Joel Maher (:jmaher) from comment #76)
> caveat: this most likely will only show unique geomean alerts due to caching
> of the warnings to prevent duplication of alerts. The talos alert sheriff
> team will look at both average/geomean on all alerts for the bake in period
> until we turn off the dev.tree-management average alerts.

How are the new alerts looking? Are we good to switch off the ones sent to dev.tree-management now? :-)
Flags: needinfo?(jmaher)
Assignee
Comment 82•10 years ago
Yes, we are ready to do it. I will need to figure out a patch and coordinate with catlee on how to deploy it. I verified the data and found an error a couple of weeks ago; now all the data is good.
Flags: needinfo?(jmaher)
Comment 83•10 years ago
That's great, thank you :-)
Assignee
Comment 84•10 years ago
This patch does a couple of things:

1) sets geomean as the default on graph server
2) only runs one analysis, not two, so we will be posting to dev.tree-alerts
3) fixes the upload/reporting to use geomean and passively collect average

Once this is live, we could eventually remove the average code.
Attachment #8544026 - Flags: review?(rhelmer)
Attachment #8544026 - Flags: review?(catlee)
Comment 85•10 years ago
Comment on attachment 8544026 [details] [diff] [review]
run analysis for geomean as default (1.0)

Review of attachment 8544026 [details] [diff] [review]:
-----------------------------------------------------------------

The regression analysis bits look fine to me. Let me know when you want to deploy.
Attachment #8544026 - Flags: review?(catlee) → review+
Assignee
Comment 86•10 years ago
Comment on attachment 8544026 [details] [diff] [review]
run analysis for geomean as default (1.0)

Via IRC chat with rhelmer: he is fine with this patch and will help coordinate deployment.
Attachment #8544026 - Flags: review?(rhelmer)
Assignee
Comment 87•10 years ago
Checked in: http://hg.mozilla.org/graphs/rev/f0a48c84d941

Next steps are to:
1) deploy changes to the live server (:rhelmer)
2) update the analyze_talos machine (:catlee)

I would like to do this in the same general time window. Maybe Monday or Tuesday morning PST, so :rhelmer can deploy, and then midday/afternoon EST :catlee can update the analyze_talos script.

:rhelmer, :catlee - can you guys confirm this update/deployment time sounds good?
Flags: needinfo?(rhelmer)
Flags: needinfo?(catlee)
Comment 88•10 years ago
(In reply to Joel Maher (:jmaher) from comment #87)
> I would like to do this in the same general time window. Maybe Monday or
> Tuesday morning PST so :rhelmer can deploy and then in EST midday/afternoon
> :catlee can update the analyze talos script.
>
> ;rhelmer, :catlee- can you guys confirm this update/deployment time sounds
> good?

WFM!
Flags: needinfo?(rhelmer)
Assignee
Comment 89•10 years ago
this was just deployed about 15 minutes ago.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(catlee)
Resolution: --- → FIXED