Closed Bug 1429756 Opened 6 years ago Closed 6 years ago

Split out retriggers/backfilling to be a role of code/intermittent sheriff

Categories

(Testing :: Talos, enhancement)

RESOLVED FIXED

People

(Reporter: igoldan, Assigned: igoldan)

Attachments

(7 files)

We need to establish what the Code Sheriff will do in terms of perf data retriggering/backfilling.

Based on this, we'll update Perfherder's UI to make their job easier and scalable.
We will have these new states - 'cooking', 'faulty cooking', 'cooked' - at the Alert summary level. All current states work like that, as we don't have exact alert-level granularity.
Maybe that 'faulty cooking' state is a bit too much; we could drop it, even though it happens once in a while.
this is a great start.  I would like to use different terminology than 'cooking', although the analogy makes great sense; possibly we could use these states:
untriaged
* merge with other alerts via downstream/reassign
collecting_data
* backfill/retrigger
needs_data
* unable to find specific bug that caused problem (broken builds, merged commit, CI errors, broken tests)
needs_bug
* all data is backfilled/retriggered, needs analysis and bug filed/tracked

I see code sheriffs as using a dashboard of:
untriaged
collecting_data

I see the perf sheriff as using a dashboard with:
needs_data
needs_bug

perfherder would need to support a feature to indicate if collecting_data is complete
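A minimal sketch of how these states and the two dashboards could be modeled (hypothetical Python, not actual Perfherder code; the state names come from the list above, everything else is assumed):

```python
from enum import Enum

class AlertSummaryState(Enum):
    """Proposed alert-summary states (names from the list above)."""
    UNTRIAGED = "untriaged"              # merge with other alerts via downstream/reassign
    COLLECTING_DATA = "collecting_data"  # backfill/retrigger in progress
    NEEDS_DATA = "needs_data"            # could not pin down the bug that caused the problem
    NEEDS_BUG = "needs_bug"              # data is in, needs analysis and a bug filed/tracked

# Hypothetical dashboard split between the two sheriff roles.
CODE_SHERIFF_DASHBOARD = {AlertSummaryState.UNTRIAGED, AlertSummaryState.COLLECTING_DATA}
PERF_SHERIFF_DASHBOARD = {AlertSummaryState.NEEDS_DATA, AlertSummaryState.NEEDS_BUG}
```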
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #3)
> perfherder would need to support a feature to indicate if collecting_data is
> complete

These states suffice. I think we have all we need, because when Code Sheriffs consider no more data is needed, they update the state of an Alert summary to needs_bug or needs_data.
Also, I agree with your terminology.
I followed this up with a new state diagram. I hope it's clearer; it better shows how the 2 sheriffs exchange alerts between them.
Flags: needinfo?(jmaher)
I don't like this for a couple reasons:
1) downstream - if someone starts retriggering/backfilling on autoland, but the failure is really on mozilla-inbound, we just wasted a lot of time.  NOTE: in the future with a single repo this won't be much of a concern.
2) reassign - this might not be as controversial as downstream, but there is a good chance we could be retriggering way too many jobs.  An example is if startup tests fail across the board- we would then be retriggering: other, g5, h1 x linux/windows7/windows10/osx = 12 jobs getting retriggered across many revisions- this could equal 300+ jobs; often when you see a pattern, you can retrigger the 3 jobs on one platform and determine the trend is similar on other platforms/jobs.
2.1) often we know a bug was backed out and generating improvements- this would be a waste of time to retrigger/backfill a bunch of jobs
3) why does needs_bug -> untriaged?  is this a case where you need more data?  If so it should be done by the perf sheriff to ensure we get a bug faster- the goal here is to have 24x7 support for getting more data so bugs are actionable in a shorter time frame- the ideal state is we end up with all perf regression bugs filed in <24 hours, probably realistic state is <36 hours.
4) not enough data -> that is really the case of problems collecting data

While we can come up with a lot of exceptions, we do need to accept the fact that we will have a lot of extra jobs run which are not needed.  Unfortunately we cannot increase our load 2x, let alone 4x+.  So we need to design this in a way to at least reduce the load where reasonable- I would propose:
1) for obvious downstream- set the state
2) unsure about downstream based on the graph- retrigger 2x on both branches for +- 2 revisions / backfill
3) Only retrigger up to 5 alerts in a given summary
4) perf sheriff will still do a lot of backfilling and retriggering

Until we are on a single branch (i.e. autoland only), we will have to deal with confusion in scheduling and accept that we will run extra jobs and accept that we cannot offload this fully.
Flags: needinfo?(jmaher)
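To make the load concern in point 2 and the proposed limits concrete, a rough back-of-the-envelope calculation (hypothetical numbers, purely illustrative):

```python
# Naive approach: retrigger every alerting job on every platform.
suites = ["other", "g5", "h1"]
platforms = ["linux", "windows7", "windows10", "osx"]
job_types = len(suites) * len(platforms)     # 3 * 4 = 12 job types

revisions = 5                                # the alerting push +-2 revisions
retriggers_per_job = 5                       # assumed retrigger count per job
print(job_types * revisions * retriggers_per_job)   # 300 jobs for one summary

# Proposed limits: retrigger 2x on both branches for +-2 revisions,
# and only one platform/job when the pattern is clearly cross-platform.
print(1 * revisions * 2 * 2)                 # 20 jobs
```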
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #7)
> I don't like this for a couple reasons:
> 1) downstream - if someone starts retriggering backfilling on autoland, but
> the failure is really on mozilla-inbound, we just wasted a lot of time. 
> NOTE: in the future with a single repo this won't be much of a big concern.

Then Code Sheriffs will need to be able to identify a downstream and mark it correctly.
We will share this responsibility with them.
We will make sure our Wiki docs cover this, with multiple situations/exceptions
+ images to guide them while ramping up/working.

> 2) reassign - this might not be as controversial as downstream, but there is
> a good chance we could be retriggering way too many jobs.  An example is if
> startup tests fail across the board- we would then be retriggering: other,
> g5, h1 x linux/windows7/windows10/osx = 12 jobs getting retriggered across
> many revisions- this could equal 300+ jobs; often when you see a pattern,
> you can retrigger the 3 jobs on one platform and determine the trend is
> similar on other platforms/jobs.

We will also share this reassignment responsibility with Code Sheriffs.
The idea here is to try to group alerts by the timestamp they appeared at,
branch, platforms and severity.
As this routine will be a permanent one, unlike the downstreaming one, it will
definitely require some Wiki docs: explanations of what to look for and how to do
a proper reassignment + multiple situations/images explained.

> 2.1) often we know a bug was backed out and generating improvements- this
> would be a waste of time to retrigger/backfill a bunch of jobs

This is already covered here [1]. We should ensure this is amongst the first
patterns new sheriffs are able to identify.

> 3) why does needs_bug -> untriaged?  is this a case where you need more
> data?  If so it should be done by the perf sheriff to ensure we get a bug
> faster- the goal here is to have 24x7 support for getting more data so bugs
> are actionable in a shorter time frame- the ideal state is we end up with
> all perf regression bugs filed in <24 hours, probably realistic state is <36
> hours.
> 4) not enough data -> that is really the case of problems collecting data

Yep, you are totally right.

> While we can come up with a lot of exceptions, we do need to accept the fact
> that we will have a lot of extra jobs run which are not needed. 
> Unfortunately we cannot increase our load 2x, let alone 4x+.  So we need to
> design this in a way to at least reduce the load where reasonable- I would
> propose:
> 1) for obvious downstream- set the state
> 2) unsure about downstream based on the graph- retrigger 2x on both branches
> for +- 2 revisions / backfill
> 3) Only retrigger up to 5 alerts in a given summary
> 4) perf sheriff will still do a lot of backfilling and retriggering
> 
> Until we are on a single branch (i.e. autoland only), we will have to deal
> with confusion in scheduling and accept that we will run extra jobs and
> accept that we cannot offload this fully.

Yes, it makes sense.

[1] https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_a_backout
Code Sheriffs will have to distinguish cases like reassign/downstream/backedout.
I updated the diagram.
I think the training required for reassigning is very high - that typically requires making decisions after a backfill/retrigger.  I would like to propose only handling downstream and the initial backfill/retrigger for the first big round here.  That should cover a lot of cases, and I imagine there are so many edge cases that it will take a full quarter to fully understand them.  Of course we could be on a single repo by the end of Q2; if that is true, then it simplifies things and leads into the next training, "reassigning".

We can also propose running all Talos jobs every X pushes; then the only time this fails is if builds/tests fail.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #11)
> I think the training required is very high for reassigning- that typically
> requires making decisions after a backfill/retrigger.

Yes, it is a bit too much for a start. Deferring this is wise.

> I would like to propose only handling downstream and initial backfill/retrigger for the
> first big round here. That should cover a lot of cases and I imagine there
> are so many edge cases that will take a full quarter to fully understand. 

Fair enough. I will keep your idea: #3 Only retrigger up to 5 alerts in a given summary.
It decently limits over-retriggering.

> Of course we could be on a single repo by the end of Q2, if that is true,
> then it simplifies things and leads into the next training "reassigning"
> 
> We can also propose running all talos every X pushes and the only time this
> fails is if builds/tests fail.

This sounds interesting. So it would let us easily narrow down any misbehaving Talos test
or build.
So a Code Sheriff can label an alert summary with either "collecting_data" or "needs_data". If this diagram is ok, I would start adding these new states into Perfherder.
Attachment #8954317 - Flags: feedback?(jmaher)
Comment on attachment 8954317 [details]
Interaction using new states 3.0

overall this is a great breakdown- I want to make things more specific, focusing on:
* making a decision
* thinking it's enough data

the responsibility breakdown is great- but we need to think carefully about the different scenarios that typically come up and exceptions that are seen at least once/month and ensure that we are setting this up for success.

Keeping things simpler is better for the first round; we might get 2 rounds of training/practice in throughout Q2- so let's plan on an iterative approach.

For example- downstream:
* many times it is obvious by looking at the graphs- sometimes it isn't.  Call that out specifically
* what if it is downstream and there is no alert on the other branch, what do you do?
* what if it looks like it is downstream but really isn't (say we are missing data on the other branch)

For thinking it is enough data... a few thoughts:
* we should say that if an alert occurs retrigger 2 revisions forward and 2 revisions backwards (5 total revisions), and backfill in between.  If we say 3 data points for all of the revisions, that would be enough to make a decision.
* ideally that would be enough data, but often it isn't- knowing that a first pass was made to retrigger/backfill will give confidence that we can do 1 more cycle and be ok to make a final decision
* we should keep it simple that it is either downstream if it is obvious, or a single pass of retrigger/backfilling
* possibly we can track when we have to do more retriggering after the first pass so we can document why and what was required (i.e. specific tests/configs, edge cases like broken builds/merges, too much noise, etc.)
Attachment #8954317 - Flags: feedback?(jmaher) → feedback-
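A minimal sketch of the "enough data" rule described above (2 revisions on either side of the alerting push, at least 3 data points per revision); the helper name and input shape are assumptions for illustration, not existing Perfherder code:

```python
REQUIRED_REVISIONS = 5       # the alerting push plus 2 before and 2 after
MIN_POINTS_PER_REVISION = 3  # data points per revision before making a decision

def has_enough_data(points_per_revision):
    """points_per_revision: dict mapping revision -> number of completed data points.
    True when the whole 5-revision window has the minimum number of points."""
    return (len(points_per_revision) >= REQUIRED_REVISIONS and
            all(n >= MIN_POINTS_PER_REVISION for n in points_per_revision.values()))
```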
:rwood- do you have any additional thoughts on this topic- I think we are getting close to something tangible as a first step!
Flags: needinfo?(rwood)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #15)
> :rwood- do you have any additional thoughts on this topic- I think we are
> getting close to something tangible as a first step!

Cool! +1 for making this first pass simple - marking downstream if it's obvious, and/or one set of retriggering/backfilling. Also noting the specific number of retriggers for the +- 2 revisions is important. Once things get going you'll see better what types of questions and circumstances the code sheriffs run into.
Flags: needinfo?(rwood)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #14)
> * what if it looks like it is downstream but really isn't (say we are
> missing data on the other branch)

I think this is too much of an edge case. I reckon I only experienced this once.
I say we don't mark something as downstream unless the originating alert exists in the nearby preceding pushes.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #14)

> For thinking it is enough data... a few thoughts:
> * we should say that if an alert occurs retrigger 2 revisions forward and 2
> revisions backwards (5 total revisions), and backfill in between.  If we say
> 3 data points for all of the revisions, that would be enough to make a
> decision.

Let's make this 4 data points, so we have an odd number of retriggers (3 on top of the original run). I know you talked about 5 data points being enough even on noisy tests.

> * ideally that would be enough data, but often it isn't- knowing that a
> first pass was made to retrigger/backfill will give confidence that we can
> do 1 more cycle and be ok to make a final decision
> * we should keep it simple that it is either downstream if it is obvious, or
> a single pass of retrigger/backfilling
> * possibly we can track when we have to do more retriggering after the first
> pass so we can document why and what was required (i.e. specific
> tests/configs, edge cases like broken builds/merges, too much noise, etc.)

I noticed here the concept of rounds. An alert summary can have one or more rounds (as needed). Each round starts in a pending state, is numbered and ideally has an associated comment stating what the Code/Perf Sheriff expects from it. When the retriggers/backfills all come in, the sheriff can mark that round as complete, whether he is satisfied or not.
If the sheriff didn't reach a conclusion during that round, he creates another one, which gets the previous round's id + 1.
Completed rounds cannot be modified later.
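A minimal sketch of what such a round could look like as a data model (hypothetical Python, not something Perfherder implements):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Round:
    """One retrigger/backfill round on an alert summary."""
    number: int                           # previous round's id + 1
    comment: str                          # what the sheriff expects from this round
    created: datetime = field(default_factory=datetime.utcnow)
    completed: Optional[datetime] = None  # set exactly once; completed rounds stay frozen

    def complete(self) -> None:
        if self.completed is not None:
            raise ValueError("a completed round cannot be modified")
        self.completed = datetime.utcnow()
```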

I started to integrate this into a flow, but I don't like where it's going. It looks like it could complicate the Perfherder code quite a lot; we don't have unit tests for Perfherder's UI, so lots of issues could arise later.

I would rather integrate this into an actual Sheriff daily duty, with very, very minimal changes to Perfherder. I like the idea of using rounds, or at least being aware of them. We can simulate them by using version 3.0 of our currently defined process + bug 1376829, which introduces notes (aka the alert summary journal). That would allow sheriffs to communicate to one another the status of an alert.
By formalizing the content of the alert summary notes, we have lots of flexibility.

Still, I like the concept of rounds and we can still integrate them in our multi-sheriff process.
agreed- let's keep it simple.  Ideally a screenshot showing what to look for would be good enough.  I think the simple case will solve 60-75% of the downstreaming.  Of course, to make things better- the more we document things over time, the easier it will be to identify additional use cases that can be handled by good documentation or tooling.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #19)
> Of course to make things better- the more we document
> things over time, the easier it will be to identify additional use cases
> that can be handled by good documentation or tooling.

+1 I think we already filed a bug for that, or at least discussed this.
I'll check to make sure we have this prepared.
calling it 'rounds' is good- ideally we would never need more than 3, so let's focus on "round 1" for our initial work here.  I like the concept of notes, that is good.  I do think an additional state would be needed:
* untriaged (code sheriff view)
** use notes when doing initial work on it such as retrigger/backfill
* dataready (new state when round1 is complete and perf-sheriff takes over)
* downstream/invalid/reassigned/acknowledged/etc.

the odd piece here is- do we wait for data to show up before switching from untriaged -> dataready?  That can take anywhere from 30 minutes to 6 hours.  Given the workflow of code sheriffs, it is not realistic for the same engineer to validate that the jobs completed.  I would recommend retriggering/backfilling; if it fails to backfill, add a note.

The next puzzle here is we typically change status based on the alert summary, not individual alerts- how do we ensure work has been done on all alerts for a given summary?  Possibly that isn't an issue, but something to be aware of.

For an initial proposal, I recommend:
* do the best you can for the alerts- if at least 1/2 are retriggered then mark the whole summary.  If it is the same test, but different OS/config- only retrigger one
* once downstream/retrigger/backfill is initially done, we go to dataready state- ideally we get there automatically when a note is added
* for downstream- only downstream if an alert in the last 24 hours exists on the other integration branch, otherwise ignore
* for backfill- backfill between all 5 revisions using treeherder tools (up to 4 backfills)
* for retriggers- on entire range of revisions (including backfills) retrigger 3x so we have 4 data points for each revision (enough for round 1)
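A minimal sketch of the round-1 numbers above, just to pin them down (the constants and the downstream helper are assumptions for illustration, not tooling that exists):

```python
REVISIONS_BEFORE = 2           # backfill/retrigger window: 2 pushes before...
REVISIONS_AFTER = 2            # ...and 2 pushes after the alerting push (5 total)
RETRIGGERS_PER_REVISION = 3    # 3 retriggers -> 4 data points per revision
MAX_RETRIGGERED_ALERTS = 5     # only retrigger up to 5 alerts per summary
DOWNSTREAM_WINDOW_HOURS = 24   # downstream only if the other branch alerted recently

def should_mark_downstream(hours_since_other_branch_alert):
    """Only mark a summary as downstream when a matching alert exists on the
    other integration branch within the last 24 hours; otherwise ignore."""
    return (hours_since_other_branch_alert is not None
            and hours_since_other_branch_alert <= DOWNSTREAM_WINDOW_HOURS)
```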


Ideally we can build tools that:
* make it easy to get a view of +-2 revisions for retriggering from a perfherder graph or alert
* detect if we have enough data and show status in the alert
* for a given alert summary recommend jobs to retrigger/backfill- i.e. if damp regresses on all platforms choose one
* auto detect downstream and offer a recommendation in the alert view (only for simple cases)
* suggest similar regressions across alert summaries that might have the same root cause (same test but revisions within small range)

One concern I have is that retriggering all jobs that have an alert on all configurations will cause undue load on our system.  While that can be optimized over time, I believe we should build in some initial recommendations for fewer jobs from the beginning.  For example, I mentioned above that if damp regresses on all platforms we recommend a single platform to retrigger- while this is great, often damp (or any test) will show a regression on different revisions and will show up in multiple alert summaries- that would result in retriggers across many platforms when only one is really needed.

Do we have the same pattern for regressions and improvements?
Depends on: 1376829
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #21)
> calling it 'rounds' is good- ideally we would never need more than 3, so
> lets focus on "round 1" for our initial work here.  I like the concept of
> notes, that is good.  I do think an additional state would be needed:
> * untriaged (code sheriff view)
> ** use notes when doing initial work on it such as retrigger/backfill
> * dataready (new state when round1 is complete and perf-sheriff takes over)
> * downstream/invalid/reassigned/acknowledged/etc.
> 
> the odd piece here is- do we wait for data to show up before switching from
> untriaged->dataready ?  That can take anywhere from 30 minutes -> 6 hours; 
> Given the workflow of code sheriffs, it is not realistic for the same
> engineer to validate the jobs completed.  I would recommend
> retrigger/backfill- if it fails to backfill add a note.

If we're using the "dataready" state, I say we wait for the data to show up _and_
for it to be enough to draw a conclusion. Otherwise, the name is confusing.
These 2 things will be the responsibility of the Code Sheriffs. It's not their _initial_
job to draw conclusions from the final results; they should only be able to see
that "Yes, it's clear something changed here. Let me make sure the Perf Sheriff has
a good view over this.".
My feeling is the "dataready" state isn't enough. What I described in the paragraph above is
a big, time-consuming process. It happens between "untriaged" and "dataready".
We have to name it. Let's stick with "collecting_data".
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #21)
> The next puzzle here is we typically change status based on alert summary,
> not individual alerts- how do we ensure work has been done on all alerts for
> a given summary.  Possibly that isn't an issue, but something to be aware of.

We can add another toggle at the alert level. This allows a Code Sheriff to pick
the most important alerts from a summary. This way, he will quickly find which
alerts he retriggered/backfilled.
are you suggesting we just do "collecting_data" and not "dataready"?  My main goal is to work in a model that is simple to understand without spending weeks or months learning the nuances of the system.  If we go with the model that data needs to be ready, then we need to educate on all the edge cases (I see at least one edge case on any given day I perf sheriff) and build in a state machine that would account for the variability in runtime and machine availability.  For example, if results take between 30 minutes and 6 hours, you would have to have multiple code sheriffs look over the data and each time figure out what needs to be triggered, then realize it is triggered but waiting on results- therefore they just spend XX minutes (I will say at least 10) going in to change a state in the UI.

Would changing the scheduling/branching make this more reliable?
I think a toggle or filter is a good idea- keep in mind we don't want to retrigger jobs for every alert- that will overload our limited pool of machines.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #24)
> are you suggestion we just do "collecting_data" and not "dataready"?  My
> main goal is to work in a model that is simple to understand without
> spending weeks or months learning the nuances of the system.  If we go on
> the model that data needs to be ready, then we need to educate on all the
> edge cases (which I see one edge case on any given day I perf sheriff) and
> build in a state machine that would account for the variability in runtime
> and machine availability.  For example if results take between 30 minutes
> and 6 hours, you would have to have multiple code sheriffs look over the
> data and each time figure out what needs to be triggered, then realize it is
> triggered, but waiting on results- therefore they just spend XX minutes (I
> will say at least 10) going to do change a state in the UI.
> 
> Would changing the scheduling/branching make this more reliable?

Yes, it is complex. Then let's allow the Code Sheriff to change the state to "dataready"
as soon as he is done with retriggering/backfilling, without waiting for the results to come in.
We will coordinate further via the alert summary notes.
good point on the notes- I think we can build a simple protocol for using the notes field.

I still feel there is a bit more to clarify, but the overall plan is coming together nicely.
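One possible shape for such a notes protocol (purely illustrative; no format has been agreed here, and the helper is hypothetical):

```python
def format_note(round_number, author, action, detail=""):
    """Encode one notes entry, e.g.
    '[round 1][code-sheriff] retriggered: +-2 revisions, 3x per revision'."""
    suffix = f": {detail}" if detail else ""
    return f"[round {round_number}][{author}] {action}{suffix}"

print(format_note(1, "code-sheriff", "retriggered", "+-2 revisions, 3x per revision"))
print(format_note(1, "code-sheriff", "dataready"))
```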
I will link a Google doc with all our discussions here, so we can easily see all the things we already discussed about.
:jmaher I've updated the Google doc. Can you take another look at the final changes?
Flags: needinfo?(jmaher)
thanks for the update- I have reviewed it, a few comments added for clarification.

In addition we need to:
1) be clear on what frameworks should be sheriffed (I think talos/awsy)
2) add a bug to include a new state of 'dataready'

Now that we have a plan, we can:
* build a wiki for instructions
* use the untriaged -> dataready state in our roles NOW to ensure it works
* prepare first training session
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #30)
> thanks for the update- I have reviewed it, a few comments added for
> clarification.
> 
> In addition we need to:
> 1) be clear on what frameworks should be sheriffed (I think talos/awsy)

I mentioned in the doc that we'll only use these 2 frameworks, for a start.

> 2) add a bug to include a new state of 'dataready'

I'm defining it right now. Will link it to this one, when done.

> Now that we have a plan, we can:
> * build a wiki for instructions

I will try to fit our docs on this Wiki page https://wiki.mozilla.org/Performance_sheriffing

> * use the untriaged -> dataready state in our roles NOW to ensure it works
> * prepare first training session
See Also: → 1451655
Specifies the whole Perf sheriff ranking system.
Draft for new perf policy, which slightly touches the role of Code sheriffs.
Efforts required to train a Code sheriff team up to (and including) Perf sheriff level.
The procedures have been explicitly defined in the attached documents. Thus, closing this as RESOLVED.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED