Open Bug 1285322 Opened 3 years ago Updated 2 years ago

Implement auto-classification of "downstream" alerts

Categories

(Tree Management :: Perfherder, defect, P3)

Tracking

(Not tracked)

People

(Reporter: wlach, Unassigned)

Attachments

(1 file)

A frequent situation in Perfherder is that a regression gets merged to another integration branch and we generate a new "alert" based on that merge commit. To trace these alerts back to their root cause, we have a status called "downstream" that can be assigned to them. Right now we rely on human beings to mark alerts with this status, but since they follow a very regular pattern (% change similar to a previously filed alert, result set of the commit corresponds to a merge), we should be able to detect this situation automatically.

Here's an example of an alert summary with two alerts marked as downstream:

https://treeherder.mozilla.org/perf.html#/alerts?id=1701

Let's start by seeing if we can write a script which can automatically detect this type of situation. I originally thought you might need to download the performance data to be sure, but on further reflection I think the alerts themselves might have sufficient information in them (when correlated with result set info).
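As a rough starting point, the pattern described above (similar % change plus a merge commit) could be sketched like this. The function names, the plain-float inputs, and the 25% tolerance are all assumptions made for illustration; nothing here is taken from Perfherder's code or API.

```python
def similar_pct_change(candidate_pct, upstream_pct, tolerance=0.25):
    """Return True if two % changes have the same sign and a similar size.

    `tolerance` is a hypothetical threshold: the relative difference
    allowed between the two magnitudes.
    """
    if candidate_pct == 0 or upstream_pct == 0:
        return False
    if (candidate_pct > 0) != (upstream_pct > 0):
        return False  # one is a regression, the other an improvement
    larger = max(abs(candidate_pct), abs(upstream_pct))
    smaller = min(abs(candidate_pct), abs(upstream_pct))
    return (larger - smaller) / larger <= tolerance


def looks_downstream(candidate_pct, candidate_is_merge, upstream_pcts,
                     tolerance=0.25):
    """Flag an alert as a downstream candidate.

    The alert must sit on a merge commit and have a % change similar to
    at least one previously filed alert on another branch.
    """
    if not candidate_is_merge:
        return False
    return any(similar_pct_change(candidate_pct, pct, tolerance)
               for pct in upstream_pcts)
```

Running this over the alerts that were already marked downstream by hand would be one way to pick a sensible tolerance.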

We should be able to grab all the alerts (summaries) with a query like this:

[mozilla-inbound]
https://treeherder.mozilla.org/api/performance/alertsummary/?alerts__series_signature__signature_hash=5ec2754e398f6f440316bd82ff738cb21ba9ff70&repository=2

[fx-team]
https://treeherder.mozilla.org/api/performance/alertsummary/?alerts__series_signature__signature_hash=5ec2754e398f6f440316bd82ff738cb21ba9ff70&repository=14
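A small helper for building these queries might look like the following. It only assembles the URL from the two query parameters shown above; the helper name is made up, and the mapping of repository ids to branches (2 for mozilla-inbound, 14 for fx-team) is taken from the examples in this bug.

```python
from urllib.parse import urlencode

ALERT_SUMMARY_ENDPOINT = (
    "https://treeherder.mozilla.org/api/performance/alertsummary/"
)


def alert_summary_url(signature_hash, repository_id):
    """Build an alertsummary query for one signature on one repository."""
    query = urlencode({
        "alerts__series_signature__signature_hash": signature_hash,
        "repository": repository_id,
    })
    return f"{ALERT_SUMMARY_ENDPOINT}?{query}"


# Same signature on both integration branches, as in the examples above.
SIGNATURE = "5ec2754e398f6f440316bd82ff738cb21ba9ff70"
urls = {
    "mozilla-inbound": alert_summary_url(SIGNATURE, 2),
    "fx-team": alert_summary_url(SIGNATURE, 14),
}
```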

You can get resultset information (to determine whether something is a merge commit) by using a query like this (note that alerts/summaries may correspond to a range of result sets so you may need to call this multiple times):

https://treeherder.mozilla.org/api/project/mozilla-inbound/resultset/33974/
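Since an alert may cover a range of result sets, one simple approach is to generate one URL per result set id in the range. This is only a sketch: the helper name is made up, and it assumes result set ids in the range are consecutive integers. Deciding whether a fetched result set is a merge is left out, because the response format is not shown here.

```python
def resultset_urls(project, first_id, last_id):
    """Return one resultset URL per id in the inclusive range."""
    if first_id > last_id:
        first_id, last_id = last_id, first_id  # tolerate swapped bounds
    return [
        f"https://treeherder.mozilla.org/api/project/{project}/resultset/{rid}/"
        for rid in range(first_id, last_id + 1)
    ]
```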

Let me know if this is enough to get started, or if you need more info!
You could use the existing data of annotated alerts to verify that this script is accurate (or to find errors in our manual process)!

There are a few advanced tricks once this original description is met:
1) Don't assume the specified revision range is accurate; in fact, we should often add +-1 push to the range of revisions.
2) Often we don't see alerts on all branches.
3) pgo vs. opt: these are different signatures with different magnitudes.
:wlach, :jmaher

I have a very crude script going right now, and now I have some questions I want to clarify!

1. Are we limiting the number of revisions to display on the API?

Downstream: https://treeherder.mozilla.org/perf.html#/alerts?id=1904
Downstream AlertSummary: https://treeherder.mozilla.org/api/performance/alertsummary/1904/
Downstream resultset: https://treeherder.mozilla.org/api/project/mozilla-inbound/resultset/?count=2&full=true&id__in=34470,34471&offset=0

Upstream: https://treeherder.mozilla.org/perf.html#/alerts?id=1884
Upstream AlertSummary: https://treeherder.mozilla.org/api/performance/alertsummary/1884/
Upstream resultset: https://treeherder.mozilla.org/api/project/autoland/resultset/?count=2&full=true&id__in=604,605&offset=0

The commits that could be responsible for the alerts are 803d0028289a74df1be4220a61dd88802a3563a9, 328df07cbfc206d3be093e422f06a6b21c1f1c53, and a86bfdfc575dd02b02d78f0f6f1ebbcfe45ea6f3.

When I call the upstream resultset, I don't see these commits. I have tried to expand the range by +-1, to no avail :(

2. How can I tell if the revision range specified is accurate? What are the circumstances when I should expand the search?

Thanks for all the help!
Pretty much all downstream regressions will have some type of large commit range as the root cause, since we normally merge 39-150 commits at a time. I am not sure whether the API is limiting the revisions; maybe wlach could help uncover that.

Automatic alerts are a suggestion; I find that in most cases they are off by +-2 pushes (-2 for noise, +2 for not running every job on every commit). There are some exceptions where it is greater than that. But if you stick to the rule of the suspect range +- 2 more pushes (a push can be >1 revision), then most will be successfully identified.

A few caveats here:
* pgo: this is a periodic build, and regressions always have a range (we might not build pgo when we merge, so the alert could be a few pushes later). We could possibly be ok with missing a lot of pgo alerts.
* landings/backouts, or >1 regression in a small window: often we will have >1 change on a test in a small window; that will then get merged and we will see downstream effects. Here our algorithm might mark the wrong root cause as the original; I think that is ok.
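The "suspect range +- 2 pushes" rule above could be sketched roughly as follows. This assumes push ids within one repository are consecutive integers starting at 1, and that the caller has already collected the revisions for each upstream push; the function names and the `upstream_revisions_by_push` mapping are hypothetical.

```python
def expand_push_range(first_push_id, last_push_id, slack=2):
    """Widen an alert's suspect push range by `slack` pushes on each side."""
    return range(max(1, first_push_id - slack), last_push_id + slack + 1)


def merge_contains_upstream(merge_revisions, upstream_revisions_by_push,
                            first_push_id, last_push_id, slack=2):
    """Check whether a merge carries any revision from the widened range.

    merge_revisions: revisions brought in by the downstream merge push.
    upstream_revisions_by_push: dict of upstream push id -> revisions.
    """
    merged = set(merge_revisions)
    for push_id in expand_push_range(first_push_id, last_push_id, slack):
        if merged.intersection(upstream_revisions_by_push.get(push_id, ())):
            return True
    return False
```

With landings and backouts in a small window, this can match more than one upstream alert; as noted above, picking the wrong one as the original is probably acceptable.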
Hmm, I think treeherder might limit the # of revisions ingested. You might need to refer to the json-pushes API to get a complete set of revisions in a push. For the above example you could try something like this for getting the changesets associated with the m-i merge:

https://hg.mozilla.org//integration/mozilla-inbound/json-pushes/?full=1&version=2&changeset=711963e8daa312ae06409f8ab5c06612cb0b8f7b
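A sketch of building that query and pulling the changeset hashes out of a response is below. The response shape is an assumption on my part (a top-level "pushes" object keyed by push id, each with a "changesets" list whose entries carry a "node" hash), so check it against a real response before relying on it. The helper names are made up.

```python
from urllib.parse import urlencode


def json_pushes_url(repo_path, changeset):
    """Build a json-pushes query for the push containing `changeset`."""
    query = urlencode({"full": 1, "version": 2, "changeset": changeset})
    return f"https://hg.mozilla.org/{repo_path}/json-pushes/?{query}"


def changesets_from_pushes(payload):
    """Collect changeset hashes from a decoded json-pushes response.

    Assumed shape: {"pushes": {"<id>": {"changesets": [...]}}}, where each
    changeset is either a dict with a "node" key or a bare hash string.
    """
    nodes = []
    pushes = payload.get("pushes", {})
    for push_id in sorted(pushes, key=int):
        for changeset in pushes[push_id].get("changesets", []):
            if isinstance(changeset, dict):
                nodes.append(changeset["node"])
            else:
                nodes.append(changeset)
    return nodes
```

The resulting list of hashes can then be compared against the revisions of candidate upstream alerts instead of the (possibly truncated) list that treeherder ingested.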
:wlach, :jmaher Here is the result of the latest classification!

https://pastebin.mozilla.org/8885792

``Narrowed upstream`` is a dictionary mapping each alert id to its potential upstream alertsummary ids.
Depends on: 1288530
Hey Roy, do you think you would have time to pick this up again soon? Otherwise we might pick this up where you left off. :)
Flags: needinfo?(crosscent)
(In reply to William Lachance (:wlach) from comment #7)
> Hey Roy, do you think you would have time to pick this up again soon?
> Otherwise we might pick this up where you left off. :)

Hey Will! Sorry for putting it off for this long. I've pushed the newest commit. I am trying to make the database dump you gave me work with the new system, so I can test with actual results. The commit includes tests for most of the functions. I will find you on IRC on Monday.
Flags: needinfo?(crosscent)
Hi Roy! Do you know if you'll be returning to this soon? :-)
Flags: needinfo?(crosscent)
FWIW I think this would be worth picking up even if Roy isn't able to work on it, as it has the potential to save a lot of time/effort with performance sheriffing.
Assignee: crosscent → nobody
Priority: -- → P3
Flags: needinfo?(crosscent)