Blocker/critical bugs in ServerOps:RelEng should page oncall

RESOLVED WONTFIX

Status

mozilla.org Graveyard
Server Operations
6 years ago
3 years ago

People

(Reporter: joduinn, Unassigned)

Details

per irc in #ops with zandr, justdave.

The tree was closed by bug 714490. The bug was correctly filed as a blocker and assigned to ServerOps:RelEng, but it did not page. Oncall was only notified when joduinn sent email.

Please have blocker/critical bugs in ServerOps:RelEng follow the same oncall pager protocol as other ServerOps components.
Comment 1

6 years ago
(In reply to John O'Duinn [:joduinn] from comment #0)
> per irc in #ops with zandr, justdave.

I didn't say it should do this, I only said it currently didn't.  Don't put words in my mouth.

oncall was notified by philor on IRC a good hour before your email.  philor also said it wasn't worth bothering anyone on NYE, so I didn't do anything.

Personally, I wouldn't want releng stuff coming to oncall; we have no clue what to do with it. I had no clue what to do with it last night, and paged 2 wrong people before we found the right one when you said to go ahead and get someone on it. So we wound up with 4 people out of bed for it when it only really needed one. And I got dragged out of a NYE party just before midnight and almost missed the countdown because of this.

Perhaps there should be *an* oncall for this but probably not Ops oncall.

Comment 2

6 years ago
Sorry, didn't mean to be so short about that, I'm tired at the moment, pager pretty much didn't shut up all night so I haven't really slept :|

Comment 3

6 years ago
@john we should think about this:

* oncall triages Server-Ops (but not any of the sub-components) throughout the day
* other teams triage their own queue on their own schedule

I tend to think all new bugs should be in the main queue.  Thoughts?

Comment 4

6 years ago
Dave is absolutely right, and I don't think his answer was short at all. Checking scrollback, zandr did not say "Server Ops: Releng" blockers should page, either.

Here's how things work now, and have for quite some time:

If there's an emergency that's actually server-ops related, it goes to server-ops oncall, via a server-ops bug.

Relops itself does not have an oncall rotation, and we don't have the staff to build something like that.  Nor is last night an example of somewhere such a thing would be useful (and I can't think of any other examples).

What we *do* have already is a commitment to take care of critical issues as efficiently as possible, and the ability to be paged from IRC (by anyone, not just server-ops oncall).  We also have server-ops oncall clearly labeled in IRC, and available to discuss ambiguous issues like last night's.

In fact, that part of the system worked quite well last night, as justdave describes in comment 1.

Comment 5

6 years ago
Bugs in that queue are triaged by the relops team. They have a different priority and different people than the other server-ops queue. If something needs to be escalated, then it should be moved to the "server-ops" queue and the proper people will be paged.

Paging oncall for bugs in the relops queue will add more noise to on-call.
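
The routing rule described in this thread can be sketched roughly as follows. This is only an illustration, not the actual Bugzilla or pager configuration: the `Bug` class, the `should_page` and `escalate` functions, the component strings, and the assumption that only blocker and critical severities page are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical values: these strings stand in for the real Bugzilla
# component and severity names, which this sketch does not claim to match.
PAGING_COMPONENTS = {"Server Operations"}
PAGING_SEVERITIES = {"blocker", "critical"}


@dataclass
class Bug:
    component: str
    severity: str


def should_page(bug: Bug) -> bool:
    """Return True if the bug would page server-ops oncall under the sketched rule."""
    return bug.component in PAGING_COMPONENTS and bug.severity in PAGING_SEVERITIES


def escalate(bug: Bug) -> Bug:
    """Model escalation as moving the bug to the main server-ops queue."""
    return Bug(component="Server Operations", severity=bug.severity)


releng_bug = Bug(component="Server Operations: RelEng", severity="blocker")
print(should_page(releng_bug))            # False: triaged by relops on their own schedule
print(should_page(escalate(releng_bug)))  # True: now in the queue that pages oncall
```

Under this sketch, a blocker left in the relops queue does not page anyone; it only reaches oncall once someone moves it to the main queue.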
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → mozilla.org Graveyard