Closed Bug 669316 (orangeseed) Opened 11 years ago Closed 4 years ago

Add tool for estimating when an orange was introduced

Categories

(Tree Management Graveyard :: OrangeFactor, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: justin.lebar+bug, Unassigned)

References

Details

Attachments

(3 files, 1 obsolete file)

In bug 668471 comment 9, I did some analysis to figure out when an orange was likely introduced.

It might be helpful to add this kind of analysis to the WOO site.  Someone should also check that I did it right.  :)
This is a good idea, and will be easy to do.
Yeah, we definitely need to improve our ability to target intermittent oranges when we first notice them, so we can potentially back out the changes that caused them, like we would with a perma-orange.
This is an awesome idea, I just had missed the bugmail!

Has there been any progress on this?
(In reply to Ehsan Akhgari [:ehsan] from comment #3)
> This is an awesome idea, I just had missed the bugmail!
> 
> Has there been any progress on this?

Not yet, I think it's next on our list.
The hard part here will be finding out when the first orange occurred.  But I assume this will be used much more for newer bugs.  We can grab, say, 4 months' worth of data.  If there are oranges within the first few days, we can then grab more data.  If we go back to the start of the records, we'll just display an appropriate error.
I think this is a little more complicated than the algorithm in bug 668471 comment 9, since we run a variable number of testruns on each commit, and each testrun has a chance of manifesting a given orange.

Also, all testruns are not equal...some are opt only (as in Nightly runs), some contain data only for specific platforms or test suites (usually retriggers).

So instead of counting the number of pushes between the first and last occurrences of an orange, we could count the number of testruns, excluding those testruns which don't have any buildtypes or platforms in common with the existing failures.  That is, if a testrun has only opt tests, and the orange in question has only been noted on debug tests, we wouldn't count that testrun for the purposes of this algorithm.  Ditto with a testrun that contained only win7 tests for an orange that has only been noted on linux.

This is still slightly fuzzy.  Consider the case of an orange which is distributed equally (but infrequently) among all platforms.  A testrun that tests all platforms would have a greater chance of manifesting the orange than a testrun that tests only win7, but we'd count both of these testruns equally with this algorithm.

A more accurate approach might be to determine which platforms/buildtypes the orange has been noted on, and then count the number of times that testsuite has been run on those buildtype/platform combinations.  I.e. if an orange occurs in mochitest-1/5 and has been noted on linux32/debug and mac32/debug, then count all instances of mochitest-1/5 on those combinations.  So instead of counting testruns, we'd be counting testsuite executions.
> So instead of counting testruns, we'd be counting testsuite executions.

Yes, that's totally right.  In bug 668471, I assumed that each push to m-c got a full set of test runs and that we never retriggered, but that's of course not correct.

One additional wrinkle to consider is that an orange does not necessarily have the same probability of occurring on each platform.
Duplicate of this bug: 620520
Been working on this, although it does a lot of querying so we have to be careful not to generate too much ES traffic.
Assignee: nobody → mcote
Status: NEW → ASSIGNED
Attached patch Server sideSplinter Review
This implements a new API, api/orangeseed/, which takes a sole parameter, "bugid", e.g. http://localhost:8314/api/orangeseed/?bugid=749890.

It returns a dict with a single member, "oranges", which is a list of dicts, each describing a particular build type/platform/test suite/tree combination of the occurrences of the bug for which we have calculated reasonable "seeds". The list is sorted in decreasing order of the average number of oranges per test run of the given combination. Each dict has a member called "seeds" which is a probability-indexed dict (e.g. "0.25", "0.5", "0.05") containing 

* "timestamp": the timestamp of the seed test run
* "date": the date of the seed test run
* "revision": the revision tested
* "previous_runs": the number of runs this seed occurred before the first recorded occurrence of this bug for this combination

See the comments for more a more detailed explanation of how we arrive at these numbers and which combinations we discard.

This is all somewhat complicated, returned values included, but the UI will simplify it, probably by just indicating the revision and probability for the first (or maybe first two or three) top average oranges/test run.

Please make sure this makes any sort of sense, both from a code perspective and from the overall algorithm, and feel free to try it out on a few bugs. :)
Attachment #633515 - Flags: review?(jgriffin)
Comment on attachment 633515 [details] [diff] [review]
Server side

Review of attachment 633515 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!  I still wonder if there may be value in an aggregate orange seed (not OS-specific ones), but I think we should check this in and use it for a while to see how it does.

::: server/handlers.py
@@ +1016,5 @@
> +                # try to make sure the sort key is unique
> +                self.runs.sort(key=lambda x: str(x['starttime']) +
> +                               x['revision'] + x['testsuite'], reverse=True)
> +                for i, r in enumerate(self.runs):
> +                    if r['starttime'] == self.first:

This should be <= self.first, because we aren't guaranteed to have any data for every os at self.first, and using == self.first causes us to grep for data all the way back to HISTORY_START in most cases.  Ouch!

@@ +1076,5 @@
> +            if 'talos' in record['testsuite']:
> +                return # talos test-run data is not recorded
> +            if record['testsuite'].startswith('mochitests-'):
> +                # strip trailing "/5", which is not stored in test run data
> +                record['testsuite'] = record['testsuite'][:-2]

We should probably make this logic a little smarter, in case we ever get to mochitests-n/10.
Attachment #633515 - Flags: review?(jgriffin) → review+
Okay comments addressed, and I added an aggregate seed, available with the query string option "aggregate_only". It seems fairly accurate, so we can use that as a default and maybe allow more specific seeds if needed.

http://hg.mozilla.org/automation/orangefactor/rev/9fbeb9a70c75
Attached patch UI, a few server changes (obsolete) — Splinter Review
This adds a UI for Orange Seed in the form of a table on the Bug Details view, selectable via the dropdown under the graph. There's a help screen as well.

I made some changes on the back end. The main one is that I got rid of the specific calculations at 95%, 75%, and 50%; instead, we return the complete list of test runs going back to 90% probability and include the probability at each one. This represents the chance that the orange was introduced sometime between (and including) that test run and the first recorded occurrence.

This gives some good results for some bugs, but it's useless for others. If a bug occurs very infrequently, then the set of test runs going back to 90% probability can be huge. I arbitrarily set the cut off to 1000 test runs. I figure if there's a 90% chance that the bug occurred somewhere in a set greater than 1000 test runs in size, that information is not particularly useful.

I'm happy to hear comments about the code, but I'm particularly interested in opinions on how useful this is in its current form, whether the display could be improved, etc.
Attachment #638011 - Flags: review?(bmo)
Comment on attachment 638011 [details] [diff] [review]
UI, a few server changes

Just to let you know, I'm waiting on bug 769701 so I can play about with this locally.
Comment on attachment 638011 [details] [diff] [review]
UI, a few server changes

Ok, finally got VPN and back off PTO - sorry for the delay.

I've tried this locally, few observations:

* On the six or so bugs that I looked at from the current Trunk top 10, all bar one gave the "error: probability range too great: 90% probability goes back more than 1000 runs" message. Was I just unlucky, or is num_preorange_runs not being calculated correctly? (It's been so long since I did stats stuff that I can't tell for sure looking at the code). Edit: After a restart of the server, some of them have started showing results again... strange.

* For the one that did work, it took 80 seconds to load the orangeseed table locally, which seems quite long. Not sure how this translates once in production and the ES lookups are more local.

* Looking at the orangeseed for http://localhost:8080/?display=Bug&bugid=754804&startday=2011-07-13&endday=2012-07-09&tree=trunk shows multiple rows for each changeset, presumably due to the way that mochitest-other is split up. Given that the question being asked is "Which changeset caused this orange", having the same changeset listed multiple times (and even with differing percentages) is slightly confusing (will attach screenshot). Would we perhaps be better off with just one row per changeset? (And combining multiple runs on the same changeset due to retriggers etc)

* I think it would be helpful to have the changeset commit messages listed alongside the revision hash, to save having to open each changeset to see if they could have been responsible.

* I couldn't see from my very cursory skim of the code, whether the seed table would list all changesets, or only the ones for which OrangeFactor/ES knows about. Would changesets that were victim to coalescing (or the use of DONTBUILD when it shouldn't have been used), still show up in the percentages list?

* I'm also not sure how we should deal with multiple trees. Using Trunk view, I get both mozilla-inbound and mozilla-central shown, but I presume the probabilities are going to be messed up, since it would have started on one tree and merged to another? Would it make sense to calculate rate of occurrence based on Trunk but count back the changesets just on the tree where it was first observed?
Attachment #638011 - Flags: review?(bmo)
(In reply to Ed Morley [:edmorley] from comment #15)
> Comment on attachment 638011 [details] [diff] [review]
> UI, a few server changes
> 
> * On the six or so bugs that I looked at from the current Trunk top 10, all
> bar one gave the "error: probability range too great: 90% probability goes
> back more than 1000 runs" message. Was I just unlucky, or is
> num_preorange_runs not being calculated correctly? (It's been so long since
> I did stats stuff that I can't tell for sure looking at the code). Edit:
> After a restart of the server, some of them have started showing results
> again... strange.

I got responses for 4 of the 5 top bugs yesterday: 633, 535, 542, and 555 entries. Not sure why it would have changed when you restarted... anyway, the number of runs is related to the bug's frequency. An infrequent bug will have a lot more runs than a frequent one. So if you see a frequent bug that gives > 1000 runs, let me know.

> * For the one that did work, it took 80 seconds to load the orangeseed table
> locally, which seems quite long. Not sure how this translates once in
> production and the ES lookups are more local.

Yeah, it is indeed really really slow. I'm not sure what I can do about that. There are some open bugs to upgrade ES and re-index the dbs... I am curious as to what effect that will have. I'll see about setting up a temporary staging server in the mean time (probably also on brasstacks).

> * Looking at the orangeseed for
> http://localhost:8080/?display=Bug&bugid=754804&startday=2011-07-
> 13&endday=2012-07-09&tree=trunk shows multiple rows for each changeset,
> presumably due to the way that mochitest-other is split up. Given that the
> question being asked is "Which changeset caused this orange", having the
> same changeset listed multiple times (and even with differing percentages)
> is slightly confusing (will attach screenshot). Would we perhaps be better
> off with just one row per changeset? (And combining multiple runs on the
> same changeset due to retriggers etc)

Yeah maybe that was for a retrigger. The question being answered is really "in what *test run* range could the orange have been introduced", although it's true that the more interesting question is "in what changeset". I admit that it's confusing, but I'm not sure how we would go about combining rows together. I suppose we could just take the highest percentage for each changeset... not sure how accurate that would be. Heh, not that I'm sure how accurate the current implementation is...

> * I think it would be helpful to have the changeset commit messages listed
> alongside the revision hash, to save having to open each changeset to see if
> they could have been responsible.

Good idea.

> * I couldn't see from my very cursory skim of the code, whether the seed
> table would list all changesets, or only the ones for which OrangeFactor/ES
> knows about. Would changesets that were victim to coalescing (or the use of
> DONTBUILD when it shouldn't have been used), still show up in the
> percentages list?

Yeah it's only using test-run and orange-starring data from the ES database. As I mentioned above, the percentages are calculated according to test runs. So yeah, it is possible that an orange was introduced silently before any listed changesets, if no tests were run. There's not much I can do about that with this algorithm though.

> * I'm also not sure how we should deal with multiple trees. Using Trunk
> view, I get both mozilla-inbound and mozilla-central shown, but I presume
> the probabilities are going to be messed up, since it would have started on
> one tree and merged to another? Would it make sense to calculate rate of
> occurrence based on Trunk but count back the changesets just on the tree
> where it was first observed?

Yeah, where there are multiple trees, the probabilities are going to be a bit inaccurate. I don't think that mixing the rate of occurrence from multiple trees with counting of changesets from just one tree is going to be accurate though. One can just change to the mozilla-inbound tree to get a prediction just based on the occurrences in that tree though, which is probably more accurate. I should make this clear somehow, maybe with just a note above the table if Trunk is selected (or, for that matter, All).

So I think the actionable items here are

* determine how fast this will be on brasstacks
* add changeset summaries to the table
* do something about the table having multiple changesets with differing percentages. Perhaps a summary table of changesets and highest revisions to go along with the bigger table of all test runs? I will have to think about this a bit more.

And of course we still have no idea how useful this feature will be...

Btw if you wanted to work on any of the above, I would not stop you. ;) I'll try to get to this later in the week.
(In reply to Mark Côté ( :mcote ) from comment #17)
> I got responses for 4 of the 5 top bugs yesterday: 633, 535, 542, and 555
> entries. Not sure why it would have changed when you restarted... anyway,
> the number of runs is related to the bug's frequency. An infrequent bug will
> have a lot more runs than a frequent one. So if you see a frequent bug that
> gives > 1000 runs, let me know.

They were from the top 5-10 oranges, so pretty frequent.

> > * I couldn't see from my very cursory skim of the code, whether the seed
> > table would list all changesets, or only the ones for which OrangeFactor/ES
> > knows about. Would changesets that were victim to coalescing (or the use of
> > DONTBUILD when it shouldn't have been used), still show up in the
> > percentages list?
> 
> Yeah it's only using test-run and orange-starring data from the ES database.
> As I mentioned above, the percentages are calculated according to test runs.
> So yeah, it is possible that an orange was introduced silently before any
> listed changesets, if no tests were run. There's not much I can do about
> that with this algorithm though.

This would be a pretty big blocker to using orange seed imo. During peak times 50%+ of runs get coalesced, so I would have to resort to copying the start and end changesets from the orangeseed table and pasting into something like:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=<cset_foo>&tochange=<cset_bar>

At the least, a link to that pushlog range from the orangeseed page would be needed; or ideally we could supplement the orangeseed table, with the extra runs from the pushlog. Using something like:
https://hg.mozilla.org/integration/mozilla-inbound/json-pushes?full=1&fromchange=0e2658794e06&tochange=f8c388f622f1

The runs that ran no tests would just show no % probability, but at least would be displayed (along with commit message), so as to allow investigation without leaving the page.

> Btw if you wanted to work on any of the above, I would not stop you. ;) I'll
> try to get to this later in the week.

I'd love to; unfortunately sheriffing plus the infra issues we've had this week, are already giving little time for the TBPL bug filer work. There's no rush anyway, really appreciate your work on this :-D
Unbitrotted patch.
Attachment #638011 - Attachment is obsolete: true
Alias: orangeseed
(Added alias since I can never find this bug in the awesome bar results)
Getting "error: probability range too great: 90% probability goes back more than 1000 runs" on http://localhost:8080/?display=Bug&bugid=777875&entireHistory=true&tree=trunk even though it's the current most frequent orange?
I'm working only off the screenshot (attachment 640316 [details]), since I can't seem to get the probability ranges to show up when I run this locally.  I hope these comments are still useful:

* The site seems to attach probabilities to changesets, but that's the wrong granularity -- it needs to attach probabilities to /pushes/, which are sets of changesets.  But maybe if I click on one of the csets in the screenshot, it takes me to a push; I dunno.  :)

* I'm confused by this UI which attaches a probability to a (tree, rev, testsuite) tuple.

Is this code calculating a different probability for each tree, rev, and testsuite?  I imagine in most cases, I don't care at all about the tree; I'd want those to be merged, to give me a stronger analysis.  Similarly, if a leak is appearing in multiple test suites (as bug 754804 apparently is), that may or may not be due to the same bug, but as an investigator, I may want to see the analysis under the assumption that the leaks in the different test suites all have the same root cause.

I feel like sorting by probability when all the (tree, rev, testsuite) tuples are in the same table doesn't make much sense.  You only want to sort revs by priority, right?  If we're displaying multiple trees and test suites, we're effectively bisecting separate bugs for each (tree, testsuite) pair, and so we should have separate tables.

* It doesn't look like the results are broken down by platform.  That variable doesn't always matter, but it often does.  I think it's important to allow investigators to toggle whether we're aggregating across platforms.  If the orange does occur across platforms with roughly the same probability, then I want to analyze with that assumption in mind, because it gives me more data.  On the other hand, if an orange occurs only on Mac and we aggregate across all platforms, the analysis will be skewed if we ever skip running the test suite on one platform (which we do often).

* It looks like the probability in each row is the output  of the calculations in bug 668471 comment 9?  If so, it needs to be sorted in /increasing/ order (which will correspond with reverse temporal order).  That is, the push with "90% probability" isn't the most likely cause of the orange -- "90%" means that there's a 90% chance that this push /or any of the ones after it/ caused the orange.  So what I want to see is the list of pushes which might have caused the orange, starting with the most first push and moving backwards in time.

I think displaying a CDF, as the screenshot does, as opposed to trying to calculate "what's the probability that /this/ push is the cause of the orange" is reasonable.  But it needs to be called something other than "probability of range" (because there's no range displayed in the row), and one table needs to include only one CDF -- that is, a push should never show up twice in a given table.

* There's a pretty fundamental problem with all this logic, however: Often we want to know not when an orange was first introduced, but which cset caused its frequency to increase.  Ed probably has a better idea, but my sense is that this happens at least as often as the "when did this orange first appear?" question.  The analysis here doesn't help with that at all.

* It would be helpful to know what cases Ed and the other sheriffs need a tool to help them with.  I've pushed my fair share of oranges recently, and the sheriffs have not seemed to have difficulty tracking down mine as the offending push.  Anyway this tool only gives you a probabilistic range -- you still have to do the analysis yourself.  Maybe if I understood better exactly what is difficult for the sheriffs with their current tools, I could give more concrete suggestsions.
(In reply to comment #22)
> * It would be helpful to know what cases Ed and the other sheriffs need a tool
> to help them with.  I've pushed my fair share of oranges recently, and the
> sheriffs have not seemed to have difficulty tracking down mine as the offending
> push.  Anyway this tool only gives you a probabilistic range -- you still have
> to do the analysis yourself.  Maybe if I understood better exactly what is
> difficult for the sheriffs with their current tools, I could give more concrete
> suggestsions.

I have mostly had problem figuring out what increased the frequency of an orange.  Most new intermittent oranges are fairly easy to pinpoint to one or few changesets (although that process is tedious and error-prone and nobody would object to having tools help us there!).
I agree with most of Justin's post.

In response to the part about identifying new orange vs increase of rate of occurrence of existing orange - I'd say maybe it was equal amounts of both (off the top of my head).

In the case of the patches of yours that were backed out, I believe (from memory) either the frequency was high or else the failures were in a newly landed test - which is why they were easier to track down manually. For new, but less frequent failures (/ ones that occur in tests that have been around for a while), I still believe orangeseed will be useful, even if v1 doesn't yet support the use-case of finding regression ranges for oranges becoming more frequent.
I spent a while on this and even had something working, but it really didn't seem that useful in the end.  It's possible that this might work with significantly more effort, but there's no guarantee.
Assignee: mcote → nobody
Status: ASSIGNED → NEW
Product: Testing → Tree Management
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE
Product: Tree Management → Tree Management Graveyard
You need to log in before you can comment on or make changes to this bug.