Closed Bug 1277528 Opened 8 years ago Closed 1 year ago

Transition Treeherder to being supported by Operations

Categories

(Tree Management :: Treeherder: Infrastructure, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: kthiessen)

For bug 1277436, a request was made by the sheriffs to #moc to page both Cameron and myself:

04:57 < sheriff> jlaz: can we page camd or emorley ?
04:57 < sheriff> for the tree closure
04:57 < jlaz> Tomcat: for sure, sec
04:58 < sheriff> jlaz: thanks
04:58 < sheriff> sorry was not fully awake in the last hour :) so now have coffee and can drive/monitor from the sheriff side
05:00 < jlaz> Tomcat|sheriffduty: left camd a voicemail
05:00 < sheriff> jlaz: thanks!
05:02 < jlaz> and emorley doesn't have a voicemail to leave a message to :(
05:03 < jlaz> Tomcat|sheriffduty: is wlach someone we can escalate to?
05:03 < sheriff> jlaz: yes
05:03 < sheriff> at least he seems maybe the reviewer of something that changed today and could be the regression
05:05 < sheriff> jlaz: camd responded
05:05 < jlaz> Tomcat|sheriffduty: camd is taking a look right now

For myself it was 0600 and for Cameron it was 2200. My working day is typically 1000 onwards (since it helps with US meetings), so this was pretty early for me. (I don't mind being called if no one else is available/awake, but ideally as a last resort.)

It seems like there are some optimisations we can make:
1) Ensure that the sheriffs know:
 - who to contact if there are issues
 - what timezone each of those people is in (and that they bear it in mind when making requests to #moc)
2) Ensure that the people in #moc:
 - check the official docs (which at the moment are: https://mana.mozilla.org/wiki/display/websites/treeherder.mozilla.org#treeherder.mozilla.org-DeveloperContacts)
 - check timezones when calling people, even if provided with a specific list of names
 - prioritise calling those people that are in waking hours
 - only call the others after repeatedly calling the awake people for 10-15 minutes first (in the case above, I was called less than 2 minutes after Cameron)
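
To make the ordering in point 2 concrete, here is a rough sketch of how a call plan could be derived from a contact list. It is only an illustration, not an existing tool: the names `CONTACTS`, `is_awake` and `call_plan` are hypothetical, the UTC offsets are assumptions picked so that 05:00 UTC corresponds to 0600 and 2200 local time, and the waking-hours window (08:00 to 23:00) is likewise assumed.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contact list. The fixed UTC offsets are illustrative
# assumptions, not taken from any official contact documentation.
CONTACTS = [
    ("emorley", timezone(timedelta(hours=1))),
    ("camd", timezone(timedelta(hours=-7))),
]

WAKING_START = 8   # assumed start of waking hours (local time)
WAKING_END = 23    # assumed end of waking hours (local time, exclusive)
RETRY_WINDOW = timedelta(minutes=15)  # upper end of the 10-15 minute window


def is_awake(now_utc, tz):
    """Return True if the contact's local hour falls within waking hours."""
    local_hour = now_utc.astimezone(tz).hour
    return WAKING_START <= local_hour < WAKING_END


def call_plan(now_utc, contacts=CONTACTS):
    """Return (name, earliest_call_time) pairs, awake contacts first."""
    awake = [name for name, tz in contacts if is_awake(now_utc, tz)]
    asleep = [name for name, tz in contacts if not is_awake(now_utc, tz)]
    plan = [(name, now_utc) for name in awake]
    # People outside waking hours are only called after the retry window,
    # unless nobody is awake, in which case they are called straight away.
    fallback_time = now_utc + RETRY_WINDOW if awake else now_utc
    plan += [(name, fallback_time) for name in asleep]
    return plan


if __name__ == "__main__":
    # Arbitrary example date; only the 05:00 UTC time of day matters here.
    now = datetime(2016, 6, 1, 5, 0, tzinfo=timezone.utc)
    for name, when in call_plan(now):
        print(name, when.isoformat())
```

With these assumed offsets, a request at 05:00 UTC would put camd first and hold off calling emorley until 15 minutes later, which is the behaviour asked for above.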

Given the move to Heroku, I think it's even more important we document the above clearly (likely on readthedocs, we can then just turn the mana page above into a bunch of links to it) and make sure everyone is aware of it.
FYI you can add me to the list of people who can be paged in an emergency, I'm on EST, more or less on a normal schedule.
We should make sure that whatever we're asking the MOC to handle conforms to https://mana.mozilla.org/wiki/display/MOC/How+to+request+MOC+support+for+a+new+production+system+or+service
Component: Treeherder: Docs & Development → Treeherder: Infrastructure
Priority: P3 → P2
Summary: Improve documentation of the Treeherder out of hours contact procedure → Transition Treeherder to being MOC supported
Blocks: 1504990

With Ed gone, this falls to me. With the MOC gone, we're gonna need help from someone on Travis's crew. I've talked with Jeremy Orem on the CloudOps team, and determined that Treeherder doesn't quite fit the CloudOps model.

Next up is probably :fubar and RelOps.

Assignee: nobody → kthiessen
Summary: Transition Treeherder to being MOC supported → Transition Treeherder to being supported by Operations
Type: defect → task
Status: NEW → ASSIGNED
Priority: P2 → P3
No longer blocks: 1504990
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED