Open Bug 1425849 Opened 2 years ago Updated 11 months ago

[meta] - adjustments to talos triage/sheriffing process

Categories

(Testing :: Talos, defect)

Tracking

(Not tracked)

People

(Reporter: jmaher, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: meta, Whiteboard: [PI:July])

A few things we want to do in Q1:
1) split out retriggers/backfilling to be a role of code/intermittent sheriff
2) modify perfherder alerts to support different states to support #1
3) collect perf profiles before/after each regression bug we file
4) improve documentation:
  * document adding/editing gtest
  * clean up existing wiki pages for easier reading
  * assign all tests to a valid owner
5) confirm that every test still needs to run in today's environment (some tests have been around for years); we want ownership defined as part of #4
6) consider scheduling changes (and downstream tooling changes) to support a more streamlined sheriffing model
:igoldan, is this something you are working on?
Flags: needinfo?(igoldan)
Whiteboard: [PI:January]
Yes, I started working on #1, #2 and #4. I'll add detailed bugs for each of these.
Flags: needinfo?(igoldan)
Whiteboard: [PI:January] → [PI:March]
Following up here: #4/#5 are 98% complete and should be wrapped up soon.

#1/#2 are in progress. This will happen in 2 rounds:
1) simple downstream detection, initial retriggering, and leaving notes to pass state
2) better tools in perfherder for "data complete" and for downstream/auto-detecting, as well as a second round of retriggers, etc.

Round 1 is on schedule to start this month (April) and round 2 will probably start in June or July.

This still leaves:
3) collect perf profiles before/after each regression bug we file
6) consider scheduling changes (and downstream tooling changes) to support a more streamlined sheriffing model

There are some issues with #3, as we are waiting on fixes to perf.html. #6 can take place with round 2 or a possible round 3. I think once we see some perf sheriffing work coming from the code sheriff team, we can mark this as done and file a new bug for round 2.
Whiteboard: [PI:March] → [PI:April]
Whiteboard: [PI:April] → [PI:May]
I believe this is done. :igoldan, if not, can you outline the remaining items here?
Flags: needinfo?(igoldan)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #4)
> I believe this is done. :igoldan, if not, can you outline the remaining items
> here?

#2 and #6 are the remaining items here.
Flags: needinfo?(igoldan)
2) modify perfherder alerts to support different states to support #1
6) consider scheduling changes (and downstream tooling changes) to support a more streamlined sheriffing model

are there bugs for these or more details?  If not, I would like to call this done.
Flags: needinfo?(igoldan)
We chatted a bit on IRC; it looks like there is some possible work for #2, and :igoldan is going to file a bug on that.

On the topic of #6, there are no planned taskcluster scheduling changes.

A new #7 is coming up: it is easy for more than one sheriff to be working on the same alert. There are two risks here:
1) wasted time
2) overwriting work

As for wasted time, I am not too concerned about that. Most of the waste will be starting to look at a graph and then retriggering or filing a bug; I view that as minimal and something that happens naturally even when handing off alerts cleanly to another sheriff. We shouldn't discount this fully, though, and we should document some of the more troublesome cases.

Risk 2 (overwriting work) is more concerning and something new. We could solve this by dividing efforts between test frameworks or between alerts from certain branches. Finding the contention points and patterns will help in figuring out a solution. A few ideas:
1) a server that monitors activity in perfherder and notifies when someone is interacting with an alert
2) when planning to investigate or edit an alert, have a way to 'assign' the summary to yourself.
3) other ideas when we have more data/patterns.
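
To make idea 2 a bit more concrete, here is a minimal sketch of what 'assigning' a summary to yourself could look like. This is not perfherder code: the `AlertSummaryClaims` class, its methods, the summary ids and the sheriff names are all hypothetical, and a real version would need to live in perfherder's own storage rather than in memory. The only point is that the first sheriff to claim a summary wins and a second sheriff finds out before editing it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AlertSummaryClaims:
    """Hypothetical in-memory record of which sheriff claimed which alert summary."""

    _owners: Dict[int, str] = field(default_factory=dict)

    def claim(self, summary_id: int, sheriff: str) -> bool:
        """Claim a summary. Returns True if `sheriff` now owns (or already owned) it."""
        # setdefault only stores `sheriff` when nobody has claimed the summary yet.
        owner = self._owners.setdefault(summary_id, sheriff)
        return owner == sheriff

    def owner_of(self, summary_id: int) -> Optional[str]:
        """Return the current owner, or None if the summary is unclaimed."""
        return self._owners.get(summary_id)

    def release(self, summary_id: int, sheriff: str) -> bool:
        """Release a claim. Only the current owner can release it."""
        if self._owners.get(summary_id) != sheriff:
            return False
        del self._owners[summary_id]
        return True


if __name__ == "__main__":
    claims = AlertSummaryClaims()
    print(claims.claim(101, "sheriff_a"))    # first claim succeeds
    print(claims.claim(101, "sheriff_b"))    # second sheriff is told it is taken
    print(claims.owner_of(101))
    print(claims.release(101, "sheriff_a"))  # owner hands the summary back
```

A check like `claim()` before editing would address the overwriting risk directly; it does little for the wasted-time risk unless the UI also shows the current owner.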

I would like at least 3 months of dealing with the problem so we can understand the odd cases and have more time to brainstorm.
Whiteboard: [PI:May] → [PI:July]
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #6)
> 2) modify perfherder alerts to support different states to support #1
> 6) consider scheduling changes (and downstream tooling changes) to support a
> more streamlined sheriffing model
> 
> are there bugs for these or more details?  If not, I would like to call this
> done.

I filed bug 1468943 for item #2.
Flags: needinfo?(igoldan)