Closed Bug 1021842 Opened 10 years ago Closed 9 years ago

Talos: improve graphserver's formula for "final test result"

Categories

(Testing :: Talos, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: avih, Assigned: jmaher)

References

Details

(Keywords: ateam-talos-task)

Attachments

(7 files, 5 obsolete files)

2.77 KB, patch
Details | Diff | Splinter Review
3.86 KB, patch
rhelmer
: review+
MattN
: feedback+
Details | Diff | Splinter Review
1.38 KB, patch
avih
: review+
Details | Diff | Splinter Review
4.79 KB, patch
MattN
: review+
rhelmer
: review+
Details | Diff | Splinter Review
10.83 KB, patch
catlee
: review+
Details | Diff | Splinter Review
1.40 KB, patch
rhelmer
: review+
avih
: feedback+
Details | Diff | Splinter Review
9.03 KB, patch
catlee
: review+
Details | Diff | Splinter Review
+++ This bug was initially created as a clone of Bug #1020677 +++

Bug 1020677 discusses two main changes to improve our regression detection for tests where each sub-test is independently meaningful (e.g. TART yes, TP5 no):

1. Improved formula for processing the sub-results into a single number.
2. Regression detection which also takes individual sub-results into account.

This bug is about #1 above.

Summary of the current formula:

> - A talos run of a test consists of N executions of each sub-test (mostly N==25).
> - Take median for each N results per sub-test.
> - Ignore the highest of those medians.
> - Average the rest <-- the "final" number by which we calculate regressions
>                        and which graphserver displays at its graphs.

The new formula will:
- Use geometric mean instead of average,
- Without dropping the highest sub-result.
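
For illustration, a minimal sketch of both calculations in Python (assuming the N replicates per sub-test have already been collected; names here are hypothetical, not graphserver code):

import math

def old_result(subtest_runs):
    # subtest_runs: one list of N replicates per sub-test
    medians = [sorted(runs)[len(runs) // 2] for runs in subtest_runs]
    medians.remove(max(medians))        # ignore the highest median
    return sum(medians) / len(medians)  # average the rest

def new_result(subtest_runs):
    medians = [sorted(runs)[len(runs) // 2] for runs in subtest_runs]
    # geometric mean of all medians, computed in log space:
    # exp(mean(log(x))) == (x1 * x2 * ... * xk) ** (1.0 / k)
    return math.exp(sum(math.log(m) for m in medians) / len(medians))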

The formula is used explicitly in one place (when the data is collected, i.e. when talos sends it to graphserver), where graphserver then stores the result of this calculation; it's also used implicitly whenever the graphs are displayed later (using this stored pre-calculated value).

So, unfortunately, we can't just modify the formula and expect all the graphs to reflect it retroactively, because the calculation result is cached, and it's probably impractical to literally rewrite the cached history using the new formula (though TBH, it might not be an absolutely terrible idea if we believe the new formula is better).

So, we'll approach this transition from the old to the new formula as follows:
- The old formula will keep producing results for the transition period.
- The new formula will produce results under a new temporary test name for this period.

This will allow us to catch regressions around the switch (using any of the formulas).

Once the transition period ends and we're happy with the new formula, we'll terminate the old formula and report values only with the new formula - under the _old_ names (so effectively also retiring the new temporary names). That way, whoever browses the graphs doesn't have to switch to a different test name, and instead will only see a "jump" in values at the point where the formula switched.

We'll also add a dev.tree-management and wiki notice about tests which started using the new formula, and at which date.

Anything still missing?
Joel, so re bug 1020677 comments 41-44, it seems nothing blocks us from going ahead with this. Correct?

How do you suggest we continue? I'd prefer if you take this bug, but if you can't, let me know and I'll try to handle it myself. Right now it seems we've decided to continue but... nothing is happening :)
Flags: needinfo?(jmaher)
uploading my WIP patch, it requires a few more changes and a lot of local testing.  I won't get to this for at least a week, most likely 2 weeks.
Flags: needinfo?(jmaher)
Comment on attachment 8448973 [details] [diff] [review]
initial work done (0.5)

Review of attachment 8448973 [details] [diff] [review]:
-----------------------------------------------------------------

I know it's a WIP patch, but wanted to check it out regardless.

Which repo is it in (so I can compare the existing formula [not visible in the patch] to the new one)?

::: server/pyfomatic/collect.py
@@ +228,3 @@
>    _updateAverageForTestRun(average, databaseCursor, inputStream, metadata)
> +
> +  # I am concerned that this might fail and we never 

Accidentally cut line?

::: sql/data.sql
@@ +4570,4 @@
>  insert into tests values (NULL,"sessionrestore_no_auto_restore","Session Restore no Auto Restore Test",1,1,NULL);
>  
>  insert into tests values (NULL,"ts_paint_cold","TS Paint, Cold Load Time",1,1,NULL);
> +insert into tests values (NULL,"tp5o_scroll_paint","TP5 Scroll",1,1,15);

I'm guessing this is unrelated to this bug?
Pardon the (way too many) typos on comment 3...
here is the repo:
http://hg.mozilla.org/graphs

you can run it locally without too much trouble.  yes, the patch needs a lot of cleanup :)
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #8499099 - Flags: review?(rhelmer)
Attachment #8499099 - Flags: review?(avihpit)
Comment on attachment 8499099 [details] [diff] [review]
add geomean field to db and populate it (1.0)

Review of attachment 8499099 [details] [diff] [review]:
-----------------------------------------------------------------

Due to my lack of familiarity with graph server internals, this is an f+ meaning it looks roughly right to me (I especially like the calculation part), but it shouldn't be considered a proper review.
Attachment #8499099 - Flags: review?(avihpit) → feedback+
this hasn't been tested, a few more code paths for me to explore.
Attachment #8499113 - Flags: feedback?(MattN+bmo)
for the core geomean calculation in the first patch, we need to run this on the database:
alter table alerts add geomean FLOAT;

when we get r+, I will request this be done on the graph server DBs.  :MattN, this is something you will need to run manually on your local database.
Comment on attachment 8499099 [details] [diff] [review]
add geomean field to db and populate it (1.0)

Review of attachment 8499099 [details] [diff] [review]:
-----------------------------------------------------------------

Unless I am missing something,

::: server/pyfomatic/collect.py
@@ +229,5 @@
> +
> +  try:
> +    databaseCursor.execute("""select exp(sum(log(value))/count(value)) from test_run_values where test_run_id = %s""", (metadata.test_run_id))
> +    geomean = databaseCursor.fetchall()[0][0]
> +    if geomean is None:average

Is "average" supposed to be there ^

@@ +234,5 @@
> +      geomean = values[1]
> +    _updateGeoMeanForTestRun(geomean, databaseCursor, inputStream, metadata)
> +  except Exception, x:
> +    # we don't want to rollback or raise an exception, this should be passively running in the background
> +    pass

Silently eating exceptions makes me sad, although the comment is reasonable. Maybe we could log something?
Attachment #8499099 - Flags: review?(rhelmer) → review-
oops- there was a cut/paste nugget in there :(

I have cleaned it up and added print statements.  This is running in the context of a webserver, so I assume these print statements show up in the webserver log somewhere?
Attachment #8499099 - Attachment is obsolete: true
Attachment #8499136 - Flags: review?(rhelmer)
Comment on attachment 8499136 [details] [diff] [review]
add geomean field to db and populate it (1.1)

Review of attachment 8499136 [details] [diff] [review]:
-----------------------------------------------------------------

Hmm, so print might not work - there isn't really any logging going on in here already, is there :/

This is running under WSGI now, so it may not work if something is printed to stdout; I will test.
Comment on attachment 8499136 [details] [diff] [review]
add geomean field to db and populate it (1.1)

lgtm; tested print on the dev server and it goes to the Apache error log, which seems fine.
Attachment #8499136 - Flags: review?(rhelmer) → review+
Attachment #8499113 - Attachment is patch: true
Attachment #8499113 - Attachment mime type: text/html → text/plain
(In reply to Joel Maher (:jmaher) from comment #9)
> alter table alerts add geomean FLOAT;

Note that the correct table is test_runs:

alter table test_runs add geomean FLOAT;
Comment on attachment 8499136 [details] [diff] [review]
add geomean field to db and populate it (1.1)

While testing this patch I got the following error upon upload of a TART result:

  Exception: to determine geomean from 'test_run_values' for  1369 - 'long' 
  object is not iterable
Comment on attachment 8499136 [details] [diff] [review]
add geomean field to db and populate it (1.1)

Review of attachment 8499136 [details] [diff] [review]:
-----------------------------------------------------------------

::: server/pyfomatic/collect.py
@@ +228,4 @@
>    _updateAverageForTestRun(average, databaseCursor, inputStream, metadata)
> +
> +  try:
> +    databaseCursor.execute("""select exp(sum(log(value))/count(value)) from test_run_values where test_run_id = %s""", (metadata.test_run_id))

(In reply to Matthew N. [:MattN] from comment #15)
> 'long' object is not iterable

This is from a missing comma after "metadata.test_run_id". The argument should be a tuple like:
(metadata.test_run_id,)

With that the patch works on my server :)
Attachment #8499136 - Flags: feedback+
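
For reference, a short sketch of why the trailing comma matters with the Python DB-API (parameters must be a sequence; a parenthesized value is not one):

# (metadata.test_run_id) is just a parenthesized long - not iterable,
# hence the "'long' object is not iterable" error above.
# (metadata.test_run_id,) is a one-element tuple, which execute() accepts:
databaseCursor.execute(
    "select exp(sum(log(value))/count(value)) "
    "from test_run_values where test_run_id = %s",
    (metadata.test_run_id,))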
Comment on attachment 8499113 [details] [diff] [review]
modify api that graph.html uses to include geomean in addition to everything else (0.9)

Review of attachment 8499113 [details] [diff] [review]:
-----------------------------------------------------------------

I have this being used on my GS instance when "?geo" is the query string: http://graphs.mattn.ca/graph-local.html?geo

I didn't test all of the affected endpoints and I didn't check the math by hand.

::: server/api.py
@@ +322,5 @@
>              for changeset, total in averages.iteritems())
> +
> +        geomeans = dict(
> +            (changeset, total / geo_totals[changeset])
> +            for changeset, total in geomeans.iteritems())

OK, so this is averaging the geometric means for a specific changeset, but it's confusing because geo_totals actually contains a count of the number of geometric means for a changeset, not a total as the name says. num_geos would be easier to understand IMO. Not sure if you want to change that for none, one, or both of the cases (avg & geo).

@@ +328,5 @@
>                    'averages': averages,
>                    'min': min(row['average'] for row in rows),
>                    'max': max(row['average'] for row in rows),
> +                  'geomeans': geomeans,
> +                  'mingeo': min(row['geomean'] for row in rows),

Not sure how much of a problem this is but "mingeo" ends up being "null" since there are old rows where mingeo wasn't calculated yet. It would be nice for it to exclude null.
Attachment #8499113 - Flags: feedback?(MattN+bmo) → feedback+
Blocks: 1077177
I tested that the following two endpoints show the data:
http://graphs.mattn.ca/api/test/runs/revisions?revision=b6c2deadbeef (used by compare-talos)
http://graphs.mattn.ca/api/test/runs?id=293&branchid=1&platformid=13 (used by graphs)
Depends on: 1078272
landed this backend change:
http://hg.mozilla.org/graphs/rev/77bdf53fae9f

waiting for the database to be updated before we make this live.  Next up is getting the api changes landed.
Do we have a bug to backfill the new column with the new calculation result? such that we could examine 2-3+ months old data with the new calculation?
I was not aware of that; I was under the impression the plan was to make this possible and collect it going forward.  What benefit would we get from going back in time?  If you think it is beneficial, please file a bug and outline what needs to be done.
Bug 1020677 comments 41-44 suggested that we should just switch the formula, in which case back-filling (practically - rewriting history) would probably be a bad idea.

However, the way which was apparently chosen to go forward, implied by the patch at this bug, is that we add a new column for the new formula calculation, in which case this column would be empty for older results, and in that case I think it should be possible to back-fill it.

If we think that the new formula is better, and I think we do, then being able to watch graphs over a few months with the new formula would be better than watching them with the old formula.

So I guess the first question is "would it be possible to back-fill the new column of the new formula?"

If yes (and I _think_ it should be a yes), then I'll file a bug for it.
yeah, upon further thought and your comment, we really should backfill as much as possible.  I suspect it just requires writing a script to query the data for the page values, then calculate/insert the geomean.

That might take some fiddling, but it should be possible.  I suspect a custom script that we could run would be the ideal thing to do.
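
A rough sketch of what such a backfill script could look like (hypothetical and untested; assumes the test_runs/test_run_values schema from this bug and a MySQLdb connection):

import MySQLdb

# hypothetical connection parameters
db = MySQLdb.connect(host="localhost", user="graphs", passwd="...", db="graphs")
cursor = db.cursor()

# runs that predate the new column have geomean = NULL
cursor.execute("select id from test_runs where geomean is NULL")
run_ids = [row[0] for row in cursor.fetchall()]

for run_id in run_ids:
    # the same calculation collect.py now performs at upload time
    cursor.execute("select exp(sum(log(value))/count(value)) "
                   "from test_run_values where test_run_id = %s", (run_id,))
    geomean = cursor.fetchall()[0][0]
    if geomean is not None:
        cursor.execute("update test_runs set geomean = %s where id = %s",
                       (geomean, run_id))
db.commit()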
Comment on attachment 8499136 [details] [diff] [review]
add geomean field to db and populate it (1.1)

Review of attachment 8499136 [details] [diff] [review]:
-----------------------------------------------------------------

::: server/pyfomatic/collect.py
@@ +228,4 @@
>    _updateAverageForTestRun(average, databaseCursor, inputStream, metadata)
> +
> +  try:
> +    databaseCursor.execute("""select exp(sum(log(value))/count(value)) from test_run_values where test_run_id = %s""", (metadata.test_run_id))

So what happens if value is 0? log(value) would be -INFINITY, which is probably undesirable (and in a simpler calculation of Nth root of the product, the product would be 0).

I think some (sub)tests do report 0 in cases of error, and while we do want that to affect the result, I don't think we want it to null-ify it.

In the simplified calculation, I'd say just use the Nth root of the product of the N-1 values (if it only had a single 0 sub-result), which would be the rough equivalent of averaging N values where one of them is 0, though I'm not sure how to approach it with the formula which you used in this patch.
(In reply to Avi Halachmi (:avih) from comment #24)
>
> So what happens if value is 0? log(value) would be -INFINITY, which is
> probably undesirable (and in a simpler calculation of Nth root of the
> product, the product would be 0).


Hmm.. I think there's a simple workaround for this issue, which would also solve the general issue of values lower than 1 (which I don't think we have right now, but it's not impossible to have them).

How about just using ... sum(log(value + 1))/count(value) ... ?

Such a change would still keep the relative results the same, would hardly affect the absolute values (since most of them are considerably higher than 1 anyway), and would solve the 0 issue and the general "smaller than 1" issue.

Joel?
Maybe also subtract 1 from the final result (because it's a mean of each value plus 1) and clamp it at 0 if it's negative?
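
To make the proposal concrete, a sketch of the plain geomean next to the +1/-1 variant (illustrative naming only, not graphserver code):

import math

def geomean(values):
    # raises a math domain error if any value is 0
    return math.exp(sum(math.log(v) for v in values) / len(values))

def geomean_plus1(values):
    # shift by 1 before the log, shift back afterwards: tolerates 0's,
    # and barely moves results that are well above 1
    m = math.exp(sum(math.log(v + 1) for v in values) / len(values)) - 1
    return max(m, 0.0)

print(geomean_plus1([120.0, 135.0, 0.0, 128.0]))  # works; plain geomean would raise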
I suggest ignoring the zeros, and using absolute value, just in case:

> select 
>     exp(sum(log(abs(value)))/count(value)) 
> from 
>     test_run_values 
> where 
>     test_run_id = %s""" and
>     abs(value) > 0
Kyle, what would happen if we selected nothing?
Matt, thanks for the comment here; I have updated the patch to make the variable names more useful and to default to a value of 0 if we find a non-float type.

Avi, for geomeans I am really taking an average of the geomeans if we end up with >1 value for a given build/test/platform/branch combo.  Do you think that is correct?
Attachment #8499113 - Attachment is obsolete: true
Attachment #8500664 - Flags: review?(MattN+bmo)
Flags: needinfo?(avihpit)
Attachment #8500664 - Flags: review?(rhelmer)
(In reply to Kyle Lahnakoski [:ekyle] from comment #27)
> I suggest ignoring the zeros, and using absolute value, just in case:
> 
> > select 
> >     exp(sum(log(abs(value)))/count(value)) 
> > from 
> >     test_run_values 
> > where 
> >     test_run_id = %s""" and
> >     abs(value) > 0

This would completely ignore the 0's in the geomean calculation. However, I'd think that 0's at some sub-tests should preferably also be reflected in the summary/average/geomean.

The average already reflects it by including 0 in the average (so the average will be lower); my suggestion of adding 1 to each value and then subtracting 1 from the final result would be the geomean equivalent, I think. But if that's hard to express in SQL syntax, then your suggestion is still better than the patch prior to comment 29. Thanks.

(In reply to Joel Maher (:jmaher) from comment #29)
> ...
> Avi, for geomeans I am really taking an average of the geomeans if we end up
> with >1 value for a given build/test/platform/branch combo.  Do you think
> that is correct?

Not sure I follow you. geomean _is_ the replacement for the average at the new calculation, what does "average of geomeans" mean?

Also, I really liked the previous calculation, how's the new one better? (other than also dealing with 0's - which I think we could also have done with the previous calculation instead of replacing it altogether).
Flags: needinfo?(avihpit)
(In reply to Joel Maher (:jmaher) from comment #28)
> Kyle, what would happen if we selected nothing?

The result would be NULL (well, not even calculated).  If there are no values, or all values are zero, then geomean has no clear definition.  Alternatively, I would define geomean([])=1, just for the sake of logical consistency.
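
A sketch of that ignore-zeros approach with the geomean([])=1 convention (again, illustrative naming only):

import math

def geomean_ignore_zeros(values):
    vals = [abs(v) for v in values if v != 0]
    if not vals:
        return 1.0  # geomean([]) defined as 1 for logical consistency
    return math.exp(sum(math.log(v) for v in vals) / len(vals))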
(In reply to Avi Halachmi (:avih) from comment #30)
> (In reply to Kyle Lahnakoski [:ekyle] from comment #27)
> > I suggest ignoring the zeros, and using absolute value, just in case:
> > 
> > > select 
> > >     exp(sum(log(abs(value)))/count(value)) 
> > > from 
> > >     test_run_values 
> > > where 
> > >     test_run_id = %s""" and
> > >     abs(value) > 0
> 
> This would completely ignore the 0's at the geomean calculation. However,
> I'd think that preferably having 0's at some sub tests would also be
> reflected at the summary/average/geomean.
> ...


I think there is actually a case where we'd like to completely ignore the value - when there's no value at all (i.e. not even 0). We could get this case if we cancel a page/sub-test, so that its value isn't sent to graphserver anymore, but it does exist for test runs prior to removing this page. In such a case we should completely ignore it in any calculation (be it average, geomean, or anything else) where it isn't available.

Joel, what happens now if we cancel a page, e.g. as when we removed alipay.com from tp5o in bug 1026869? Is it taken into account with the current average calculation for tp5o?

Kyle, what about comment 30 WRT the inclusion of 0's in the geomean to indicate a difference at the summary as well?
Flags: needinfo?(klahnakoski)
Flags: needinfo?(jmaher)
(In reply to Avi Halachmi (:avih) from comment #30)
> (In reply to Kyle Lahnakoski [:ekyle] from comment #27)
> > > select 
> > >     exp(sum(log(abs(value)))/count(value)) 
> > > from 
> > >     test_run_values 
> > > where 
> > >     test_run_id = %s""" and
> > >     abs(value) > 0
> 
> This would completely ignore the 0's at the geomean calculation. However,
> I'd think that preferably having 0's at some sub tests would also be
> reflected at the summary/average/geomean.

Avi, 

I am concerned that your log(v+1) proposal will swamp legitimate performance values between 0 and 1.  I do not know if that even happens, so maybe my concern is unwarranted.

Ignoring zeros can be good, especially if zero is being used to indicate failure.  Furthermore, ignoring them will impact the overall geomean[1], which means the introduction of zeros will be detectable.  Finally, the v8 test suite uses geomean: using log(v+1) we lose the ability to compare.


[1] except in the case when the ignored metric is very close to 1
Flags: needinfo?(klahnakoski)
Comment on attachment 8500664 [details] [diff] [review]
modify api that graph.html uses to include geomean in addition to everything else (1.0)

Review of attachment 8500664 [details] [diff] [review]:
-----------------------------------------------------------------

Looks reasonable to me, I am not set up to test at the moment. I wish we had tests on this codebase :/
Attachment #8500664 - Flags: review?(rhelmer) → review+
(In reply to Kyle Lahnakoski [:ekyle] from comment #33)
> I am concerned your log(v+1) proposal will swamp legitimate performance
> values between 0 and 1.  I do not know if that even happens, so maybe my
> concern is unwarranted.

What does "swamp" mean/imply here? could you please provide sample set of numbers which would exhibit this issue with the (v+1) calculation? Also note than apart from adding 1 to each value, my suggestion also includes subtracting one from the final result (comment 26).

> Ignoring zeros can be good, especially if zero is being used to indicate
> failure.  Furthermore, the ignorance will impact the overall geomean[1];
> which means the introduction of zeros will be detectable.

Indeed ignoring 0's will impact the overall, but IMO less so and less deterministically so than including an unlikely low value into the calculation (such as 0 into an average - which we do now already).

> Finally, the v8
> test suite uses geomean: using log(v+1) we loose the ability to compare.

Not sure I understand this. Would you be able to provide a sample set of results which would get hurt from using the v+1 ... -1 calculation?
Comment on attachment 8500664 [details] [diff] [review]
modify api that graph.html uses to include geomean in addition to everything else (1.0)

Flag me for review again once we've settled on the formula and I can test.
Attachment #8500664 - Flags: review?(MattN+bmo)
Kyle,

I want to get this resolved, can you help clarify your concerns with avi's approach of:
exp(sum(log(abs(value + 1)))/count(value)) - 1
Flags: needinfo?(jmaher) → needinfo?(klahnakoski)
Avi,

(In reply to Avi Halachmi (:avih) from comment #35)
> What does "swamp" mean/imply here?

My apologies upfront about being pedantic here.  A geometric argument would be easier to digest.

Let
    F0(v)=exp(sum(log(vi))/n)
    F1(v)=exp(sum(log(vi+1))/n)-1
where
    v = [v1, v2, v3, ..., vn]

Consider a performance change of E in one of the statistics, such that v1' = (1+E)*v1
and v' = [v1', v2, v3, ..., vn]


F1(v') = exp(sum(log(vi'+1))/n)
F1(v') = exp([log(v1'+1) + log(v2+1) + ... + log(vn+1)]/n)
F1(v') = exp([log((1+E)*v1 + 1) + log(v2+1) + ... + log(vn+1)]/n)
F1(v') = exp([log((1+E)*v1 + 1) - log(v1+1) + log(v1+1) + log(v2+1) + ... + log(vn+1)]/n)
F1(v') = exp([log((1+E)*v1 + 1) - log(v1+1) + sum(log(vi+1))]/n)
F1(v') = exp(log((1+E)*v1 + 1)/n - log(v1+1)/n + sum(log(vi+1))/n)
F1(v') = exp(log((1+E)*v1 + 1)/n - log(v1+1)/n) * exp(sum(log(vi+1))/n)
F1(v') = [exp(log((1+E)*v1 + 1) - log(v1+1))]^(1/n) * F1(v)
F1(v') = [((1+E)*v1 + 1)/(v1+1)]^(1/n) * F1(v)
F1(v') = ((v1+1 + v1*E)/(v1+1))^(1/n) * F1(v)

We will use Taylor's theorem near E=0 to approximate the slope wrt E:

dF1(v)/dE = (1/n)*(((v1+1 + v1*E)/(v1+1))^((1/n)-1))*v1/(v1+1) * F1(v)
dF1(v)/dE (at E=0) = (1/n)*(1^((1/n)-1))*v1/(v1+1) * F1(v)
dF1(v)/dE (at E=0) = (1/n) * v1/(v1+1) * F1(v)

F1(v') = F1(v) + (1/n) * v1/(v1+1) * F1(v) * E  (at E near 0)

We look at F1(v') - F1(v) to emphasize our slope:

F1(v') - F1(v) = (1/n) * v1/(v1+1) * F1(v) * E  (at E near 0)
F1(v') - F1(v) = v1/(v1+1) * (F1(v) * E)/n  (at E near 0)

Using similar logic, compare this to behaviour of F0 near E=0:

F0(v') - F0(v) = (F0(v) * E)/n  (at E near 0)

The F1 function has an extra v1/(v1+1) term; for large values this is near 1, but for small values this number is small, and makes error detection more difficult.

F1(v') - F1(v) = v1/(v1+1) * (F0(v') - F0(v))  (at E near 0)


Suppose we have a statistic around 0.5, and it changes by 10%:  How much does F1 change? 

F1(v') - F1(v) = 0.5/(0.5+1) * (F0(v') - F0(v))  (at E near 0)
F1(v') - F1(v) = 1/3 * (F0(v') - F0(v))  (at E near 0)

Which is 1/3 less sensitive than F0.  And smaller values are worse, of course.

This is what I meant by "swamp".  All this is moot if performance values are > 1.
Flags: needinfo?(klahnakoski)
thanks Kyle.  what formula do you use in dzalerts?
Avi, 

How does F1 compare to F0 in the face of zeros? We start with F1, and compare the new value to the old:

F1(v')/F1(v) = prod(vi'+1)^(1/n) / prod(vi+1)^(1/n)
F1(v')/F1(v) = (prod(vi'+1)/prod(vi+1))^(1/n)
F1(v')/F1(v) = [(0+1)*(v2+1)*(v3+1)*...*(vn+1) / ((v1+1)*(v2+1)*(v3+1)*...*(vn+1))]^(1/n)
F1(v')/F1(v) = [(0+1)/(v1+1)]^(1/n)
F1(v')/F1(v) = [1/(v1+1)]^(1/n)

Same for F0, which ignores the zero (v1' = 0):

F0(v')/F0(v) = [v2 * v3 * ... * vn]^(1/(n-1)) / [v1 * v2 * v3 * ... * vn]^(1/n)
F0(v')/F0(v) = [v2 * v3 * ... * vn]^(n/n(n-1)) / [v1 * v2 * v3 * ... * vn]^((n-1)/n(n-1))
F0(v')/F0(v) = [[v2 * v3 * ... * vn]^n / [v1 * v2 * v3 * ... * vn]^(n-1)]^(1/(n(n-1)))
F0(v')/F0(v) = [[v2 * v3 * ... * vn] / v1^(n-1)]^(1/(n(n-1)))
F0(v')/F0(v) = [v2 * v3 * ... * vn]^(1/(n(n-1))) / v1^(1/n)
F0(v')/F0(v) = (1/v1)^(1/n) * F0(v')^(1/n)


Let K1 = F1(v')/F1(v) 
Let K0 = F0(v')/F0(v)

When is K1 > K0 ?

<=> [1/(v1+1)]^(1/n) > (1/v1)^(1/n) * F0(v')^(1/n)
<=> [1/(v1+1)]^(1/n) / (1/v1)^(1/n) > F0(v')^(1/n) 
<=> [(1/(v1+1)) / (1/v1)]^(1/n) > F0(v')^(1/n) 
<=> [v1/(v1+1)]^(1/n) > F0(v')^(1/n)
<=> [v1/(v1+1)] > F0(v')

If performance values less than 1 do not exist, then F0(v') > 1 and K1 < K0, and we can conclude F0 is a better zero indicator than F1.  If performance values less than 1 do exist, then K1 could perform better given the condition above.  The difference between F0 and F1 is probably inconsequential in practice due to high vi>>1 values and n>3.

When F0 and F1 do differ, we must also consider that the noise is similarly different.  An analysis of how this affects confidence must be done to be sure F0 still performs better than F1.

In conclusion, I like F0 for its adherence to definition, and its proper treatment of performance values less than one.
Joel,

I use the formal definition of geomean: F0
(In reply to Avi Halachmi (:avih) from comment #35)
> (In reply to Kyle Lahnakoski [:ekyle] from comment #33)
> > Finally, the v8
> > test suite uses geomean: using log(v+1) we lose the ability to compare.
> 
> Not sure I understand this. Would you be able to provide a sample set of
> results which would get hurt from using the v+1 ... -1 calculation?

All I am saying here is that the V8 performance statistic uses the formal definition of geomean to combine the test results.  If we use some other aggregate metric, like F1, then we will not be able to compare the official V8 statistic to the one we see in graph server.
I thought I had implemented the formal definition to start with.  Here is what I have in talos:
http://hg.mozilla.org/build/talos/file/b2a3d0493173/talos/filter.py#l61

but we have questions about values <1.

Kyle, can you link me to the formal definition you are using?
Flags: needinfo?(klahnakoski)
(In reply to Kyle Lahnakoski [:ekyle] from comment #38)
> Avi,
> 
> (In reply to Avi Halachmi (:avih) from comment #35)
> > What does "swamp" mean/imply here?
> 
> My apologies upfront about being pedantic here.  A geometric argument would
> be easier to digest.

Pedantic is appreciated, I just didn't understand what swamp implies :) and now I do - it compresses the dynamic range slightly more than the "pure" geometric mean formula, and this compression increases the closer we are to 0.

I also never claimed that the two calculations are equivalent - they're clearly not.

However, consider these two "sets of results", which are undeniably much closer to 0 than to 1 and therefore should "suffer" reduced sensitivity for detecting a regression between them with the +/-1 variant:

var set1 = [.01, .02, .03, .04, .05];
var set2 =      [.02, .03, .04, .05, .06];

Their averages are 0.03 and 0.04 respectively.
"proper" geometric means are 0.026 and 0.037
geometric means with the +1/-1 variant are 0.0299 and 0.0399

regression from set1 to set2:

Using average: 33.33%
Using proper geometric mean: 43.1%
Using +1/-1 geometric mean: 33.4%

The +/-1 version is indeed less sensitive than the proper geomean.

However, even for such small values, the +1/-1 geomean regression calculations will still be more sensitive than the average-based regression calculation which we've been using till now. So compared to now, I don't think it will hurt us.
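
The numbers above can be reproduced directly with a quick standalone check (an illustration, not part of any patch):

import math

def geo(vals):
    return math.exp(sum(math.log(v) for v in vals) / len(vals))

set1 = [.01, .02, .03, .04, .05]
set2 = [.02, .03, .04, .05, .06]

print(geo(set2) / geo(set1) - 1)  # ~0.431 -> 43.1% regression
g1 = geo([v + 1 for v in set1]) - 1
g2 = geo([v + 1 for v in set2]) - 1
print(g2 / g1 - 1)                # ~0.334 -> 33.4% regression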

> This is what I meant by "swamp".  All this is mute if performance values are
> > 1.

I've never seen results smaller than 1. Maybe Joel did?

(In reply to Kyle Lahnakoski [:ekyle] from comment #40)
> ...
> In conclusion, I like F0 for its adherence to definition, and its proper
> treatment of performance values less than one.

I don't disagree with the preference to choose the purest form available. But I'm trying to apply a practical approach where this specific deviation from "purest" might have more pros than cons, or that in practice the cons are negligible, as I _think_ they are.

(In reply to Kyle Lahnakoski [:ekyle] from comment #42)
> All I am saying here is that the V8 performance statistic uses the formal
> definition of geomean to combine the test results.  If we use some other
> aggregate metric, like F1, then we will not be able to compare the official
> V8 statistic to the one we see in graph server.

I still don't get it. We've been using averages till now, and it never bothered us. Surely geomean +/-1 would still be better than averages?

Also, the V8 numbers are more than 15k; the +/-1 variant's difference compared to the "pure" version would be less than negligible.
Joel, 

The code has implemented the formal definition.
Flags: needinfo?(klahnakoski)
Avi, 

We are in agreement.

(In reply to Avi Halachmi (:avih) from comment #44)
> (In reply to Kyle Lahnakoski [:ekyle] from comment #42)
> > All I am saying here is that the V8 performance statistic uses the formal
> > definition of geomean to combine the test results.  If we use some other
> > aggregate metric, like F1, then we will not be able to compare the official
> > V8 statistic to the one we see in graph server.
> 
> I still don't get it. We've been using averages till now, and it never
> bothered us. Surely geomean +/-1 would be still better than averages?

Using averages was wrong, and geomean +/-1 is way better.  Also geomean +/-1 is negligibly different from the geomean for the purposes of regression detection given the >>1 performance values we see.
(In reply to Kyle Lahnakoski [:ekyle] from comment #45)
> Joel, 
> 
> The code has implemented the formal definition.

well, except for its treatment of zeros, where it is undefined :)
Comment on attachment 8503345 [details] [diff] [review]
reduce the chance of values = 0 (1.0)

Review of attachment 8503345 [details] [diff] [review]:
-----------------------------------------------------------------

w00t!
Attachment #8503345 - Flags: review?(avihpit) → review+
slight adjustment to the script, accounting for your previous feedback and double checking the code.

Going from the definitions listed here:
https://wiki.mozilla.org/Perfomatic:API#Get_test_run_information

We have 3 cases where the average is displayed and we should also display the geometric mean:
run info:
desc: https://wiki.mozilla.org/Perfomatic:API#Get_test_run_information
ex: http://graphs.mozilla.org/api/test/runs/info/?id=3794718 (broken)

test runs:
desc: https://wiki.mozilla.org/Perfomatic:API#Get_test_runs_.28data.29
ex: http://graphs.mozilla.org/api/test/runs?id=223&branchid=131&platformid=31&days=7

by revision:
desc: https://wiki.mozilla.org/Perfomatic:API#Get_test_data_by_revision
ex: http://graphs.mozilla.org/api/test/runs/revisions?revision=10d2046d2b64&revision=d16525937c8b

going through the code for these 3 cases, I believe I have properly handled adding geomean to the api.
Attachment #8500664 - Attachment is obsolete: true
Attachment #8503400 - Flags: review?(MattN+bmo)
Attachment #8503400 - Attachment is patch: true
Attachment #8503400 - Attachment mime type: text/html → text/plain
My server is now updated with the landed patch and attachment 8503400 [details] [diff] [review]. Testing now.
The mingeo and/or maxgeo line with the min/max is causing a 500 error on the server. I think it's not returning a number in some cases.
Comment on attachment 8503400 [details] [diff] [review]
modify api that graph.html uses to include geomean in addition to everything else (1.1)

Review of attachment 8503400 [details] [diff] [review]:
-----------------------------------------------------------------

::: server/api.py
@@ +329,5 @@
>                    'min': min(row['average'] for row in rows),
>                    'max': max(row['average'] for row in rows),
> +                  'geomeans': geomeans,
> +                  'mingeo': min(type(row['geomean']) if type(row['geomean']) == type(1.0) else 0.0 for row in rows),
> +                  'maxgeo': max(type(row['geomean']) if type(row['geomean']) == type(1.0) else 0.0 for row in rows),

I think the first |type(…)| on each of these lines is wrong, but I'm no expert in Python. It ends up returning the type instead of the value.
Attachment #8503400 - Flags: review?(MattN+bmo) → review-
(In reply to Matthew N. [:MattN] (behind on reviews - focused on Loop) from comment #54)
> Comment on attachment 8503400 [details] [diff] [review]
> modify api that graph.html uses to include geomean in addition to everything
> else (1.1)
> 
> Review of attachment 8503400 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: server/api.py
> @@ +329,5 @@
> >                    'min': min(row['average'] for row in rows),
> >                    'max': max(row['average'] for row in rows),
> > +                  'geomeans': geomeans,
> > +                  'mingeo': min(type(row['geomean']) if type(row['geomean']) == type(1.0) else 0.0 for row in rows),
> > +                  'maxgeo': max(type(row['geomean']) if type(row['geomean']) == type(1.0) else 0.0 for row in rows),
> 
> I think the first |type(…)| on each of these lines is wrong but I'm not
> expert in Python. It ends up returning the type instead of the value.

You're right... comparing type(row['geomean']) == type(1.0) is a bit confusing too, that should probably just be e.g.:

'mingeo': min(row['geomean'] if type(row['geomean']) == float else 0.0 for row in rows),
ah:
'mingeo': min(row['geomean'] if isinstance(row['geomean'], float) else 0.0 for row in rows),


can we assume all values are float?  I assume so, but let me double check.
updated patch to check for float|int.  Tested with mocked up python locally.
Attachment #8503400 - Attachment is obsolete: true
Attachment #8504277 - Flags: review?(rhelmer)
Attachment #8504277 - Flags: review?(MattN+bmo)
Comment on attachment 8504277 [details] [diff] [review]
modify api that graph.html uses to include geomean in addition to everything else (2.0)

Review of attachment 8504277 [details] [diff] [review]:
-----------------------------------------------------------------

::: server/api.py
@@ +329,5 @@
>                    'min': min(row['average'] for row in rows),
>                    'max': max(row['average'] for row in rows),
> +                  'geomeans': geomeans,
> +                  'mingeo': min(row['geomean'] if isinstance(row['geomean'], float) or isinstance(row['geomean'], int) else 0.0 for row in rows),
> +                  'maxgeo': max(row['geomean'] if isinstance(row['geomean'], float) or isinstance(row['geomean'], int) else 0.0 for row in rows),

This seems to work. I would have preferred just checking for the None/Null case (however that gets represented) instead of float/int but this is better than nothing.
Attachment #8504277 - Flags: review?(MattN+bmo) → review+
Comment on attachment 8504277 [details] [diff] [review]
modify api that graph.html uses to include geomean in addition to everything else (2.0)

Review of attachment 8504277 [details] [diff] [review]:
-----------------------------------------------------------------

::: server/api.py
@@ +329,5 @@
>                    'min': min(row['average'] for row in rows),
>                    'max': max(row['average'] for row in rows),
> +                  'geomeans': geomeans,
> +                  'mingeo': min(row['geomean'] if isinstance(row['geomean'], float) or isinstance(row['geomean'], int) else 0.0 for row in rows),
> +                  'maxgeo': max(row['geomean'] if isinstance(row['geomean'], float) or isinstance(row['geomean'], int) else 0.0 for row in rows),

Hm so this could be coerced to float and have a None check instead:

'mingeo': min(float(row['geomean']) if row['geomean'] is not None else 0.0 for row in rows),

That will raise an exception if something is in there that can't be coerced to float. If this is never expected to happen then raising here might be a good thing.
Attachment #8504277 - Flags: review?(rhelmer) → review+
(In reply to Robert Helmer [:rhelmer] from comment #59)
> Comment on attachment 8504277 [details] [diff] [review]
> modify api that graph.html uses to include geomean in addition to everything
> else (2.0)
> 
> Review of attachment 8504277 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: server/api.py
> @@ +329,5 @@
> >                    'min': min(row['average'] for row in rows),
> >                    'max': max(row['average'] for row in rows),
> > +                  'geomeans': geomeans,
> > +                  'mingeo': min(row['geomean'] if isinstance(row['geomean'], float) or isinstance(row['geomean'], int) else 0.0 for row in rows),
> > +                  'maxgeo': max(row['geomean'] if isinstance(row['geomean'], float) or isinstance(row['geomean'], int) else 0.0 for row in rows),
> 
> Hm so this could be coerced to float and have a None check instead:
> 
> 'mingeo': min(float(row['geomean']) if row['geomean'] is not None
> else 0.0 for row in rows),
> 
> That will raise an exception if something is in there that can't be coerced
> to float. If this is never expected to happen then raising here might be a
> good thing.

(maybe I should say "cast" instead of "coerce" since it's explicit)

Anyway, that's more of a response to comment 58; you can take it or leave it - I think the code looks OK to go in as-is. Thanks to MattN for testing this.
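
For comparison, a None-filtering variant along the lines discussed above (a sketch against the same rows structure, not the landed code):

rows = [{'geomean': 212.5}, {'geomean': None}, {'geomean': 198.0}]  # sample data

geo_values = [float(row['geomean']) for row in rows
              if row['geomean'] is not None]
# fall back to 0.0 when no row has a geomean yet (old data)
mingeo = min(geo_values) if geo_values else 0.0  # 198.0
maxgeo = max(geo_values) if geo_values else 0.0  # 212.5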
:rhelmer, now that we have the backend in place, can we get this deployed and start collecting data?  We can view the results from a local version of graph.html.  Should we file a bug for deployment?  Do you have a suggested path forward?
Flags: needinfo?(rhelmer)
(In reply to Joel Maher (:jmaher) from comment #62)
> :rhelmer,  now that we have the backend in place, can we get this deployed
> and started on collected data.  We can view the results from a local version
> of graph.html.  Should we file a bug for deployment?  do you have a
> suggested path forward?

I filed bug 1082694 to track this and will work on it now.
Flags: needinfo?(rhelmer)
so here is what I believe is left to do:
* look at the data now that we have some and verify it is what we want
* deploy a UI change to production that will view this data
* adjust the alerting tool to reference the geomean field instead of the average field

am I missing anything?
(In reply to Joel Maher (:jmaher) from comment #64)
> * deploy a UI change to production that will view this data

That's bug 1077177 for graph server btw.
after a brief chat on irc, the next steps are:
* send geometric mean alerts to dev.tree-alerts (the new newsgroup)
* continue to send average alerts to dev.tree-management
* after a week or two to ensure all is well, disable average alerts to dev.tree-management

This involves some assumptions, but all very doable:
* analyze_talos.py has enough cpu bandwidth to run two passes
* we are fine not emailing patch authors twice (only email on 'average' alerts for now)
* someone will look at both sources.

The problem becomes, what do we do if we find alerts that show up in 'average' only and not in 'geomean'?  Will we be fine ignoring those?
I am not sure how to modify the analysis.cfg.template.  I would like to make sure that we have:
# Who should emails be sent to
regression_emails = dev.tree-management@lists.mozilla.org
geomean_regression_emails = dev.tree-alerts@lists.mozilla.org

I honestly have no idea what value is in regression_emails, so figuring out what is there would be a good idea.
Attachment #8526144 - Flags: review?(rhelmer)
Blocks: 1100654
Comment on attachment 8526144 [details] [diff] [review]
update scripts / tooling to support geomean (1.0)

I don't actually know where this tool is run... IIRC catlee has the analysis tool running somewhere.
Flags: needinfo?(catlee)
Attachment #8526144 - Flags: review?(rhelmer) → review+
catlee, can you answer this question:
I am not sure how to modify the analysis.cfg.template.  I would like to make sure that we have:
# Who should emails be sent to
regression_emails = dev.tree-management@lists.mozilla.org
geomean_regression_emails = dev.tree-alerts@lists.mozilla.org

I guess making available the analysis.cfg file (sans private keys, etc.) would be a good thing :)
Also, if catlee knows more about this script and it isn't run on the normal graph server (i.e. this is run someplace that he has access to), then I would like to know what the procedure is for deploying this.
(In reply to Joel Maher (:jmaher) from comment #69)
> I guess making available the analysis.cfg file (sans private keys, etc.)
> would be a good thing :)

Ideally we could just check it into the repo in the appropriate location, but named as analysis.cfg.example etc
That's what analysis.cfg.template is supposed to be.
Flags: needinfo?(catlee)
In any case, the deployment process right now is to file a bug in releng (cc me) specifying which settings need to be added or modified.
fixed reference to data_type as a value instead of a string.
Attachment #8526144 - Attachment is obsolete: true
Attachment #8532064 - Flags: review?(catlee)
Attachment #8532064 - Flags: review?(catlee) → review+
fixed a typo and verified an alert on tree-alerts:
https://groups.google.com/forum/#!topic/mozilla.dev.tree-alerts/a7L7nh1PJlQ

caveat: this will most likely only show unique geomean alerts, due to caching of the warnings to prevent duplication of alerts.  The talos alert sheriff team will look at both average/geomean on all alerts for the bake-in period until we turn off the dev.tree-management average alerts.
On 3 tests (tcanvasmark, v8_7, and kraken) we upload a single value from talos:
http://hg.mozilla.org/build/talos/file/tip/talos/output.py#l253

these are of type 'AVERAGE', not 'VALUES'.  In the code to store a geometric mean, we only handle 'VALUES'.  This patch stores the single value as the geometric mean for the 'AVERAGE' inputs we get.
Attachment #8537197 - Flags: review?(rhelmer)
Attachment #8537197 - Flags: review?(avihpit)
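
A sketch of the branching described above (hypothetical names; the real change lives in collect.py):

import math

def geomean_for_run(data_type, average, values):
    if data_type == 'AVERAGE':
        # tcanvasmark, v8_7 and kraken upload one pre-aggregated number;
        # the geometric mean of a single value is that value itself
        return average
    # 'VALUES': the usual log-space geometric mean over the uploaded values
    return math.exp(sum(math.log(v) for v in values) / len(values))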
Comment on attachment 8537197 [details] [diff] [review]
store the single value for average as geomean (1.0)

Review of attachment 8537197 [details] [diff] [review]:
-----------------------------------------------------------------

Changing to f+ because I'm not familiar with the graphserver internals, but the fix looks valid. Thanks!
Attachment #8537197 - Flags: review?(avihpit) → feedback+
Attachment #8537197 - Flags: review?(rhelmer) → review+
http://hg.mozilla.org/graphs/rev/c6266505147e

:rhelmer, can you deploy this to the live server?
Blocks: 1112387
(In reply to Joel Maher (:jmaher) from comment #79)
> http://hg.mozilla.org/graphs/rev/c6266505147e
> 
> :rhelmer, can you deploy this to the live server?

Done in bug 1112387
(In reply to Joel Maher (:jmaher) from comment #76)
> caveat: this most likely will only show unique geomean alerts due to caching
> of the warnings to prevent duplication of alerts.  The talos alert sheriff
> team will look at both average/geomean on all alerts for the bake in period
> until we turn off the dev.tree-management average alerts.

How are the new alerts looking? Are we good to switch off the ones sent to dev.tree-management now? :-)
Flags: needinfo?(jmaher)
yes, we are ready to do it; I will need to figure out a patch and coordinate with catlee on how to deploy it.  I verified the data, found an error a couple of weeks ago, and now all the data is good.
Flags: needinfo?(jmaher)
That's great, thank you :-)
This patch does a couple things:
1) sets geomean as the default on graph server
2) only runs one analysis, not two, so we will be posting to dev.tree-alerts
3) fixes the upload/reporting to use geomean and passively collect average

Once this is live we could eventually remove the average code.
Attachment #8544026 - Flags: review?(rhelmer)
Attachment #8544026 - Flags: review?(catlee)
Comment on attachment 8544026 [details] [diff] [review]
run analysis for geomean as default (1.0)

Review of attachment 8544026 [details] [diff] [review]:
-----------------------------------------------------------------

the regression analysis bits look fine to me. let me know when you want to deploy.
Attachment #8544026 - Flags: review?(catlee) → review+
Comment on attachment 8544026 [details] [diff] [review]
run analysis for geomean as default (1.0)

via IRC chat with rhelmer: he is fine with this patch and will help coordinate deployment.
Attachment #8544026 - Flags: review?(rhelmer)
checked in:
http://hg.mozilla.org/graphs/rev/f0a48c84d941

Next steps are to:
1) deploy changes to live server (:rhelmer)
2) update analyze_talos machine (:catlee)

I would like to do this in the same general time window.  Maybe Monday or Tuesday morning PST, so :rhelmer can deploy and then midday/afternoon EST :catlee can update the analyze_talos script.

:rhelmer, :catlee - can you guys confirm this update/deployment time sounds good?
Flags: needinfo?(rhelmer)
Flags: needinfo?(catlee)
(In reply to Joel Maher (:jmaher) from comment #87)
> checked in:
> http://hg.mozilla.org/graphs/rev/f0a48c84d941
> 
> Next steps are to:
> 1) deploy changes to live server (:rhelmer)
> 2) update analyze_talos machine (:catlee)
> 
> I would like to do this in the same general time window.  Maybe Monday or
> Tuesday morning PST so :rhelmer can deploy and then in EST midday/afternoon
> :catlee can update the analyze talos script.
> 
> ;rhelmer, :catlee- can you guys confirm this update/deployment time sounds
> good?

WFM!
Flags: needinfo?(rhelmer)
this was just deployed about 15 minutes ago.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Flags: needinfo?(catlee)
Resolution: --- → FIXED