Closed Bug 1429756 Opened 6 years ago Closed 6 years ago

Split out retriggers/backfilling to be a role of code/intermittent sheriff

Categories

(Testing :: Talos, enhancement)

RESOLVED FIXED

People

(Reporter: igoldan, Assigned: igoldan)

Attachments

(7 files)

We need to establish what the Code Sheriff will do in terms of perf data retriggering/backfilling.

Based on this, we'll update Perfherder's UI to make their job easier and scalable.
We will have these new states - 'cooking', 'faulty cooking', 'cooked' - at the Alert summary level. All current states work like that, as we don't have exact alert-level granularity.
Maybe that 'faulty cooking' state is a bit too much; we could drop it, even though it happens once in a while.
this is a great start.  I would like to use different terminology than 'cooking', although the analogy makes great sense; possibly we could use these states:
untriaged
* merge with other alerts via downstream/reassign
collecting_data
* backfill/retrigger
needs_data
* unable to find specific bug that caused problem (broken builds, merged commit, CI errors, broken tests)
needs_bug
* all data is backfilled/retriggered, needs analysis and bug filed/tracked

I see code sheriffs as using a dashboard of:
untriaged
collecting_data

I see the perf sheriff as using a dashboard with:
needs_data
needs_bug

perfherder would need to support a feature to indicate if collecting_data is complete
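A minimal sketch of how these states and the two dashboards could be modeled (hypothetical Python, not actual Perfherder code; the state names come from the list above, everything else is assumed):

```python
from enum import Enum

class AlertSummaryState(Enum):
    """Proposed alert-summary states (names from the list above)."""
    UNTRIAGED = "untriaged"              # merge with other alerts via downstream/reassign
    COLLECTING_DATA = "collecting_data"  # backfill/retrigger in progress
    NEEDS_DATA = "needs_data"            # could not pin down the bug that caused the problem
    NEEDS_BUG = "needs_bug"              # data is in, needs analysis and a bug filed/tracked

# Hypothetical dashboard split between the two sheriff roles.
CODE_SHERIFF_DASHBOARD = {AlertSummaryState.UNTRIAGED, AlertSummaryState.COLLECTING_DATA}
PERF_SHERIFF_DASHBOARD = {AlertSummaryState.NEEDS_DATA, AlertSummaryState.NEEDS_BUG}
```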
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #3)
> perfherder would need to support a feature to indicate if collecting_data is
> complete

These states suffice. I think we have all we need, because when Code Sheriffs consider no more data is needed, they update the state of an Alert summary to needs_bug or needs_data.
Also, I agree with your terminology.
I followed this up with a new state diagram. I hope it's clearer; it better shows how the 2 sheriffs exchange alerts between them.
Flags: needinfo?(jmaher)
I don't like this for a couple reasons:
1) downstream - if someone starts retriggering/backfilling on autoland, but the failure is really on mozilla-inbound, we just wasted a lot of time.  NOTE: in the future with a single repo this won't be much of a concern.
2) reassign - this might not be as controversial as downstream, but there is a good chance we could be retriggering way too many jobs.  An example is if startup tests fail across the board- we would then be retriggering: other, g5, h1 x linux/windows7/windows10/osx = 12 jobs getting retriggered across many revisions- this could equal 300+ jobs; often when you see a pattern, you can retrigger the 3 jobs on one platform and determine the trend is similar on other platforms/jobs.
2.1) often we know a bug was backed out and generating improvements- this would be a waste of time to retrigger/backfill a bunch of jobs
3) why does needs_bug -> untriaged?  is this a case where you need more data?  If so it should be done by the perf sheriff to ensure we get a bug faster- the goal here is to have 24x7 support for getting more data so bugs are actionable in a shorter time frame- the ideal state is we end up with all perf regression bugs filed in <24 hours, probably realistic state is <36 hours.
4) not enough data -> that is really the case of problems collecting data

While we can come up with a lot of exceptions, we do need to accept the fact that we will have a lot of extra jobs run which are not needed.  Unfortunately we cannot increase our load 2x, let alone 4x+.  So we need to design this in a way to at least reduce the load where reasonable- I would propose:
1) for obvious downstream- set the state
2) unsure about downstream based on the graph- retrigger 2x on both branches for +- 2 revisions / backfill
3) Only retrigger up to 5 alerts in a given summary
4) perf sheriff will still do a lot of backfilling and retriggering

Until we are on a single branch (i.e. autoland only), we will have to deal with confusion in scheduling and accept that we will run extra jobs and accept that we cannot offload this fully.
Flags: needinfo?(jmaher)
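To make the load concern in point 2 and the proposed limits concrete, a rough back-of-the-envelope calculation (hypothetical numbers, purely illustrative):

```python
# Naive approach: retrigger every alerting job on every platform.
suites = ["other", "g5", "h1"]
platforms = ["linux", "windows7", "windows10", "osx"]
job_types = len(suites) * len(platforms)     # 3 * 4 = 12 job types

revisions = 5                                # the alerting push +-2 revisions
retriggers_per_job = 5                       # assumed retrigger count per job
print(job_types * revisions * retriggers_per_job)   # 300 jobs for one summary

# Proposed limits: retrigger 2x on both branches for +-2 revisions,
# and only one platform/job when the pattern is clearly cross-platform.
print(1 * revisions * 2 * 2)                 # 20 jobs
```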
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #7)
> I don't like this for a couple reasons:
> 1) downstream - if someone starts retriggering backfilling on autoland, but
> the failure is really on mozilla-inbound, we just wasted a lot of time. 
> NOTE: in the future with a single repo this won't be much of a big concern.

Then Code Sheriffs will need to be able to identify a downstream and mark it correctly.
We will share this responsibility with them.
We will make sure our Wiki docs cover this, with multiple situations/exceptions
+ images to guide them while ramping up/working.

> 2) reassign - this might not be as controversial as downstream, but there is
> a good chance we could be retriggering way too many jobs.  An example is if
> startup tests fail across the board- we would then be retriggering: other,
> g5, h1 x linux/windows7/windows10/osx = 12 jobs getting retriggered across
> many revisions- this could equal 300+ jobs; often when you see a pattern,
> you can retrigger the 3 jobs on one platform and determine the trend is
> similar on other platforms/jobs.

We will also share this reassignment responsibility with Code Sheriffs.
The idea here is to try to group alerts by the timestamp they appeared at,
branch, platforms and severity.
As this routine will be a permanent one, unlike the downstreaming one, it will
definitely require some Wiki docs: explanations of what to look for and how to do
a proper reassignment + multiple situations/images explained.

> 2.1) often we know a bug was backed out and generating improvements- this
> would be a waste of time to retrigger/backfill a bunch of jobs

This is already covered here [1]. We should ensure this is amongst the first
patterns new sheriffs are able to identify.

> 3) why does needs_bug -> untriaged?  is this a case where you need more
> data?  If so it should be done by the perf sheriff to ensure we get a bug
> faster- the goal here is to have 24x7 support for getting more data so bugs
> are actionable in a shorter time frame- the ideal state is we end up with
> all perf regression bugs filed in <24 hours, probably realistic state is <36
> hours.
> 4) not enough data -> that is really the case of problems collecting data

Yep, you are totally right.

> While we can come up with a lot of exceptions, we do need to accept the fact
> that we will have a lot of extra jobs run which are not needed. 
> Unfortunately we cannot increase our load 2x, let alone 4x+.  So we need to
> design this in a way to at least reduce the load where reasonable- I would
> propose:
> 1) for obvious downstream- set the state
> 2) unsure about downstream based on the graph- retrigger 2x on both branches
> for +- 2 revisions / backfill
> 3) Only retrigger up to 5 alerts in a given summary
> 4) perf sheriff will still do a lot of backfilling and retriggering
> 
> Until we are on a single branch (i.e. autoland only), we will have to deal
> with confusion in scheduling and accept that we will run extra jobs and
> accept that we cannot offload this fully.

Yes, it makes sense.

[1] https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_a_backout
Code Sheriffs will have to distinguish cases like reassign/downstream/backedout.
I updated the diagram.
I think the training required for reassigning is very high - that typically requires making decisions after a backfill/retrigger.  I would like to propose only handling downstream and the initial backfill/retrigger for the first big round here.  That should cover a lot of cases, and I imagine there are so many edge cases that it will take a full quarter to fully understand them.  Of course we could be on a single repo by the end of Q2; if that is true, then it simplifies things and leads into the next training, "reassigning".

We can also propose running all Talos jobs every X pushes; then the only time this fails is if builds/tests fail.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #11)
> I think the training required is very high for reassigning- that typically
> requires making decisions after a backfill/retrigger.

Yes, it is a bit too much for a start. Deferring this is wise.

> I would like to propose only handling downstream and initial backfill/retrigger for the
> first big round here. That should cover a lot of cases and I imagine there
> are so many edge cases that will take a full quarter to fully understand. 

Fair enough. I will keep your idea: #3 Only retrigger up to 5 alerts in a given summary.
It decently limits over-retriggering.

> Of course we could be on a single repo by the end of Q2, if that is true,
> then it simplifies things and leads into the next training "reassigning"
> 
> We can also propose running all talos every X pushes and the only time this
> fails is if builds/tests fail.

This sounds interesting. So it would let us easily narrow down any misbehaving Talos test
or build.
So a Code Sheriff can label an alert summary with either "collecting_data" or "needs_data". If this diagram is ok, I would start adding these new states into Perfherder.
Attachment #8954317 - Flags: feedback?(jmaher)
Comment on attachment 8954317 [details]
Interaction using new states 3.0

overall this is a great breakdown- I want to make things more specific, focusing on:
* making a decision
* thinking it's enough data

the responsibility breakdown is great- but we need to think carefully about the different scenarios that typically come up and exceptions that are seen at least once/month and ensure that we are setting this up for success.

Keeping things simpler is better for the first round; we might get 2 rounds of training/practice in throughout Q2- so let's plan on an iterative approach.

For example- downstream:
* many times it is obvious by looking at the graphs- sometimes it isn't.  Call that out specifically
* what if it is downstream and there is no alert on the other branch, what do you do?
* what if it looks like it is downstream but really isn't (say we are missing data on the other branch)

For thinking it is enough data... a few thoughts:
* we should say that if an alert occurs retrigger 2 revisions forward and 2 revisions backwards (5 total revisions), and backfill in between.  If we say 3 data points for all of the revisions, that would be enough to make a decision.
* ideally that would be enough data, but often it isn't- knowing that a first pass was made to retrigger/backfill will give confidence that we can do 1 more cycle and be ok to make a final decision
* we should keep it simple that it is either downstream if it is obvious, or a single pass of retrigger/backfilling
* possibly we can track when we have to do more retriggering after the first pass so we can document why and what was required (i.e. specific tests/configs, edge cases like broken builds/merges, too much noise, etc.)
Attachment #8954317 - Flags: feedback?(jmaher) → feedback-
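A minimal sketch of the "enough data" rule described above (2 revisions on either side of the alerting push, at least 3 data points per revision); the helper name and input shape are assumptions for illustration, not existing Perfherder code:

```python
REQUIRED_REVISIONS = 5       # the alerting push plus 2 before and 2 after
MIN_POINTS_PER_REVISION = 3  # data points per revision before making a decision

def has_enough_data(points_per_revision):
    """points_per_revision: dict mapping revision -> number of completed data points.
    True when the whole 5-revision window has the minimum number of points."""
    return (len(points_per_revision) >= REQUIRED_REVISIONS and
            all(n >= MIN_POINTS_PER_REVISION for n in points_per_revision.values()))
```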
:rwood- do you have any additional thoughts on this topic- I think we are getting close to something tangible as a first step!
Flags: needinfo?(rwood)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #15)
> :rwood- do you have any additional thoughts on this topic- I think we are
> getting close to something tangible as a first step!

Cool! +1 for making this first pass simple - marking downstream if it's obvious, and/or one set of retriggering/backfilling. Also noting the specific number of retriggers for the +- 2 revisions is important. Once things get going you'll see better what types of questions and circumstances the code sheriffs run into.
Flags: needinfo?(rwood)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #14)
> * what if it looks like it is downstream but really isn't (say we are
> missing data on the other branch)

I think this is too much of an edge case. I reckon I only experienced this once.
I say we don't mark something as downstream unless the originating alert exists in the nearby preceding pushes.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #14)

> For thinking it is enough data... a few thoughts:
> * we should say that if an alert occurs retrigger 2 revisions forward and 2
> revisions backwards (5 total revisions), and backfill in between.  If we say
> 3 data points for all of the revisions, that would be enough to make a
> decision.

Let's make this 4 data points, so we have an odd number of retriggers (3 on top of the original run). I know you talked about 5 data points being enough even on noisy tests.

> * ideally that would be enough data, but often it isn't- knowing that a
> first pass was made to retrigger/backfill will give confidence that we can
> do 1 more cycle and be ok to make a final decision
> * we should keep it simple that it is either downstream if it is obvious, or
> a single pass of retrigger/backfilling
> * possibly we can track when we have to do more retriggering after the first
> pass so we can document why and what was required (i.e. specific
> tests/configs, edge cases like broken builds/merges, too much noise, etc.)

I noticed here the concept of rounds. An alert summary can have one or more rounds (as needed). Each round starts in a pending state, is numbered and ideally has an associated comment stating what the Code/Perf Sheriff expects from it. When the retriggers/backfills all come in, the sheriff can mark that round as complete, whether he is satisfied or not.
If the sheriff didn't reach a conclusion during that round, he creates another one, which gets the previous round's id + 1.
Completed rounds cannot be modified later.
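A minimal sketch of what such a round could look like as a data model (hypothetical Python, not something Perfherder implements):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Round:
    """One retrigger/backfill round on an alert summary."""
    number: int                           # previous round's id + 1
    comment: str                          # what the sheriff expects from this round
    created: datetime = field(default_factory=datetime.utcnow)
    completed: Optional[datetime] = None  # set exactly once; completed rounds stay frozen

    def complete(self) -> None:
        if self.completed is not None:
            raise ValueError("a completed round cannot be modified")
        self.completed = datetime.utcnow()
```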

I started to integrate this into a flow, but I don't like where it's going. It looks like it could complicate the Perfherder code quite a lot; we don't have unit tests for Perfherder's UI, so lots of issues could arise later.

I would rather integrate this into an actual Sheriff daily duty, with very, very minimal changes to Perfherder. I like the idea of using rounds, or at least being aware of them. We can simulate them by using version 3.0 of our currently defined process + bug 1376829, which introduces notes (aka the alert summary journal). That would allow sheriffs to communicate to one another the status of an alert.
By formalizing the content of the alert summary notes, we have lots of flexibility.

Still, I like the concept of rounds and we can still integrate them in our multi-sheriff process.
agreed- let's keep it simple.  Ideally a screenshot showing what to look for would be good enough.  I think the simple case will solve 60-75% of the downstreaming.  Of course, to make things better- the more we document things over time, the easier it will be to identify additional use cases that can be handled by good documentation or tooling.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #19)
> Of course to make things better- the more we document
> things over time, the easier it will be to identify additional use cases
> that can be handled by good documentation or tooling.

+1 I think we already filed a bug for that, or at least discussed this.
I'll check to make sure we have this prepared.
calling it 'rounds' is good- ideally we would never need more than 3, so let's focus on "round 1" for our initial work here.  I like the concept of notes, that is good.  I do think an additional state would be needed:
* untriaged (code sheriff view)
** use notes when doing initial work on it such as retrigger/backfill
* dataready (new state when round1 is complete and perf-sheriff takes over)
* downstream/invalid/reassigned/acknowledged/etc.

the odd piece here is- do we wait for data to show up before switching from untriaged -> dataready?  That can take anywhere from 30 minutes to 6 hours.  Given the workflow of code sheriffs, it is not realistic for the same engineer to validate that the jobs completed.  I would recommend retriggering/backfilling; if it fails to backfill, add a note.

The next puzzle here is we typically change status based on the alert summary, not individual alerts- how do we ensure work has been done on all alerts for a given summary?  Possibly that isn't an issue, but something to be aware of.

For an initial proposal, I recommend:
* do the best you can for the alerts- if at least 1/2 are retriggered then mark the whole summary.  If it is the same test, but different OS/config- only retrigger one
* once downstream/retrigger/backfill is initially done, we go to dataready state- ideally we get there automatically when a note is added
* for downstream- only downstream if an alert in the last 24 hours exists on the other integration branch, otherwise ignore
* for backfill- backfill between all 5 revisions using treeherder tools (up to 4 backfills)
* for retriggers- on entire range of revisions (including backfills) retrigger 3x so we have 4 data points for each revision (enough for round 1)
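A minimal sketch of the round-1 numbers above, just to pin them down (the constants and the downstream helper are assumptions for illustration, not tooling that exists):

```python
REVISIONS_BEFORE = 2           # backfill/retrigger window: 2 pushes before...
REVISIONS_AFTER = 2            # ...and 2 pushes after the alerting push (5 total)
RETRIGGERS_PER_REVISION = 3    # 3 retriggers -> 4 data points per revision
MAX_RETRIGGERED_ALERTS = 5     # only retrigger up to 5 alerts per summary
DOWNSTREAM_WINDOW_HOURS = 24   # downstream only if the other branch alerted recently

def should_mark_downstream(hours_since_other_branch_alert):
    """Only mark a summary as downstream when a matching alert exists on the
    other integration branch within the last 24 hours; otherwise ignore."""
    return (hours_since_other_branch_alert is not None
            and hours_since_other_branch_alert <= DOWNSTREAM_WINDOW_HOURS)
```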


Ideally we can build tools that:
* make it easy to get a view of +-2 revisions for retriggering from a perfherder graph or alert
* detect if we have enough data and show status in the alert
* for a given alert summary recommend jobs to retrigger/backfill- i.e. if damp regresses on all platforms choose one
* auto detect downstream and offer a recommendation in the alert view (only for simple cases)
* suggest similar regressions across alert summaries that might have the same root cause (same test but revisions within small range)

One concern I have is that retriggering all jobs that have an alert on all configurations will cause undue load on our system.  While that can be optimized over time, I believe we should build in some initial recommendations for fewer jobs from the beginning.  For example, I mentioned above that if damp regresses on all platforms we recommend a single platform to retrigger- while this is great, often damp (or any test) will show a regression on different revisions and will show up in multiple alert summaries- that would result in retriggers across many platforms when only one is really needed.

Do we have the same pattern for regressions and improvements?
Depends on: 1376829
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #21)
> calling it 'rounds' is good- ideally we would never need more than 3, so
> lets focus on "round 1" for our initial work here.  I like the concept of
> notes, that is good.  I do think an additional state would be needed:
> * untriaged (code sheriff view)
> ** use notes when doing initial work on it such as retrigger/backfill
> * dataready (new state when round1 is complete and perf-sheriff takes over)
> * downstream/invalid/reassigned/acknowledged/etc.
> 
> the odd piece here is- do we wait for data to show up before switching from
> untriaged->dataready ?  That can take anywhere from 30 minutes -> 6 hours; 
> Given the workflow of code sheriffs, it is not realistic for the same
> engineer to validate the jobs completed.  I would recommend
> retrigger/backfill- if it fails to backfill add a note.

If we're using the "dataready" state, I say we wait for the data to show up _and_
for it to be enough to draw a conclusion. Otherwise, the name is confusing.
These 2 things will be the responsibility of the Code Sheriffs. It's not their _initial_
job to draw conclusions from the final results; they should only be able to see
that "Yes, it's clear something changed here. Let me make sure the Perf Sheriff has
a good view over this.".
My feeling is the "dataready" state isn't enough. What I described in the paragraph above is
a big, time-consuming process. It happens between "untriaged" and "dataready".
We have to name it. Let's stick with "collecting_data".
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #21)
> The next puzzle here is we typically change status based on alert summary,
> not individual alerts- how do we ensure work has been done on all alerts for
> a given summary.  Possibly that isn't an issue, but something to be aware of.

We can add another toggle at the alert level. This allows a Code Sheriff to pick
the most important alerts from a summary. This way, he will quickly find which
alerts he retriggered/backfilled.
are you suggesting we just do "collecting_data" and not "dataready"?  My main goal is to work in a model that is simple to understand without spending weeks or months learning the nuances of the system.  If we go with the model that data needs to be ready, then we need to educate on all the edge cases (I see at least one edge case on any given day I perf sheriff) and build in a state machine that would account for the variability in runtime and machine availability.  For example, if results take between 30 minutes and 6 hours, you would have to have multiple code sheriffs look over the data and each time figure out what needs to be triggered, then realize it is triggered but waiting on results- therefore they just spend XX minutes (I will say at least 10) going in to change a state in the UI.

Would changing the scheduling/branching make this more reliable?
I think a toggle or filter is a good idea- keep in mind we don't want to retrigger jobs for every alert- that will overload our limited pool of machines.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #24)
> are you suggestion we just do "collecting_data" and not "dataready"?  My
> main goal is to work in a model that is simple to understand without
> spending weeks or months learning the nuances of the system.  If we go on
> the model that data needs to be ready, then we need to educate on all the
> edge cases (which I see one edge case on any given day I perf sheriff) and
> build in a state machine that would account for the variability in runtime
> and machine availability.  For example if results take between 30 minutes
> and 6 hours, you would have to have multiple code sheriffs look over the
> data and each time figure out what needs to be triggered, then realize it is
> triggered, but waiting on results- therefore they just spend XX minutes (I
> will say at least 10) going to do change a state in the UI.
> 
> Would changing the scheduling/branching make this more reliable?

Yes, it is complex. Then let's allow the Code Sheriff to change the state to "dataready"
as soon as he is done with retriggering/backfilling, without waiting for the results to come in.
We will coordinate further via the alert summary notes.
good point on the notes- I think we can build a simple protocol for using the notes field.

I still feel there is a bit more to clarify, but the overall plan is coming together nicely.
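One possible shape for such a notes protocol (purely illustrative; no format has been agreed here, and the helper is hypothetical):

```python
def format_note(round_number, author, action, detail=""):
    """Encode one notes entry, e.g.
    '[round 1][code-sheriff] retriggered: +-2 revisions, 3x per revision'."""
    suffix = f": {detail}" if detail else ""
    return f"[round {round_number}][{author}] {action}{suffix}"

print(format_note(1, "code-sheriff", "retriggered", "+-2 revisions, 3x per revision"))
print(format_note(1, "code-sheriff", "dataready"))
```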
I will link a Google doc with all our discussions here, so we can easily see all the things we already discussed about.
:jmaher I've updated the Google doc. Can you take another look at the final changes?
Flags: needinfo?(jmaher)
thanks for the update- I have reviewed it, a few comments added for clarification.

In addition we need to:
1) be clear on what frameworks should be sheriffed (I think talos/awsy)
2) add a bug to include a new state of 'dataready'

Now that we have a plan, we can:
* build a wiki for instructions
* use the untriaged -> dataready state in our roles NOW to ensure it works
* prepare first training session
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #30)
> thanks for the update- I have reviewed it, a few comments added for
> clarification.
> 
> In addition we need to:
> 1) be clear on what frameworks should be sheriffed (I think talos/awsy)

I mentioned in the doc that we'll only use these 2 frameworks, for a start.

> 2) add a bug to include a new state of 'dataready'

I'm defining it right now. Will link it to this one, when done.

> Now that we have a plan, we can:
> * build a wiki for instructions

I will try to fit our docs on this Wiki page https://wiki.mozilla.org/Performance_sheriffing

> * use the untriaged -> dataready state in our roles NOW to ensure it works
> * prepare first training session
See Also: → 1451655
Specifies the whole Perf sheriff ranking system.
Draft for new perf policy, which slightly touches the role of Code sheriffs.
Efforts required to train a Code sheriff team up to (and including) Perf sheriff level.
The procedures have been explicitly defined in the attached documents. Thus, closing this as RESOLVED.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED