Closed
Bug 1033284
Opened 11 years ago
Closed 9 years ago
releng nagios audit: open questions/action items for releng
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: arich, Assigned: selenamarie)
References
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2457] )
As part of the audit of releng nagios, there are some open questions and actions that need to be taken.
Questions:
1) should the jacuzzi allocator be tree closing (because of tree closing reason 8 from https://wiki.mozilla.org/ReleaseEngineering/OverviewArchitectureDiagram ) or something else? do hosts fall back to old config like they do for slavealloc?
2) we have both build and releng contact groups. the latter is just hal's pager, coop, and catlee. the former is "relengapi, buildteam, buildduty, nthomas, bhearsum, coop, armenzg, catlee, asasaki, raliiev, jhopkins, hwine, jwood, catlee, coop" Should we consolidate and remove stale people?
3) githubsync contact group is just asasaki. should those be going to anyone else?
4) build doesn't get alerted for releng-foreman or its databases (it should not be tree closing). Do they want to be alerted?
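For question 2, a minimal sketch of what consolidating the two groups could look like. The contact names are copied from the lists quoted above; the function and variable names are made up for illustration. It only removes repeated entries and cannot tell which people are stale, which still needs a human decision.

```python
# Contact lists as quoted in question 2. Variable names are illustrative only.
build = [
    "relengapi", "buildteam", "buildduty", "nthomas", "bhearsum", "coop",
    "armenzg", "catlee", "asasaki", "raliiev", "jhopkins", "hwine",
    "jwood", "catlee", "coop",
]
releng = ["hal's pager", "coop", "catlee"]


def duplicates(group):
    """Return names that appear more than once, in order of first repeat."""
    seen, dups = set(), []
    for name in group:
        if name in seen and name not in dups:
            dups.append(name)
        seen.add(name)
    return dups


def consolidate(*groups):
    """Merge contact groups, keeping first-seen order and dropping repeats."""
    return list(dict.fromkeys(name for group in groups for name in group))


merged = consolidate(build, releng)
print("repeated in build:", duplicates(build))
print("entries before/after:", len(build) + len(releng), "/", len(merged))
```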
Actions (these are all related to the diagram at https://wiki.mozilla.org/images/f/f3/Releng_flow_onepage_treeclose_reasons.pdf):
1) we are missing things like relengapi, mapper, etc from the releng service diagram
2) schedulerDB is actually the buildbotdb database hosts - buildbot_schedulers database
3) StatusDB is actually the buildbotdb database hosts - buildbot database
4) missing mozpool -> inventory interaction - can close the trees because pandas will be unavailable if mozpool gets incorrect database info from inventory
Non-diagram actions:
5) there are no architectural docs in mana (see https://mana.mozilla.org/wiki/display/IT/Buildbot for an example) or monitoring for vcssync
2)
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2450]
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2450] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2457]
Comment 1•11 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #0)
> As part of the audit of releng nagios, there are some open questions and
> actions that need to be taken.
>
>
> Questions:
>
> 1) should the jacuzzi allocator be tree closing (because of tree closing
> reason 8 from
> https://wiki.mozilla.org/ReleaseEngineering/OverviewArchitectureDiagram ) or
> something else? do hosts fall back to old config like they do for slavealloc?
Yes, the dynamic jacuzzi allocator updates configs in source control (https://github.com/mozilla/releng-jacuzzis/tree/master/v1). If that stops, those configs will not get updated, but the show can go on.
> 2) we have both build and releng contact groups. the latter is just hal's
> pager, coop, and catlee. the former is "relengapi, buildteam, buildduty,
> nthomas, bhearsum, coop, armenzg, catlee, asasaki, raliiev, jhopkins,
> hwine, jwood, catlee, coop" Should we consolidate and remove stale people?
Seems reasonable. Maybe we should separate Dev Services from RelEng though.
> 3) githubsync contact group is just asasaki. should those be going to anyone
> else?
hwine, pmoore. Maybe also Dev Services (gps, fubar, bkero).
> 4) build doesn't get alerted for releng-foreman or its databases (it should
> not be tree closing). Do they want to be alerted?
This means if foreman goes down, we don't get alerted? I think that is ok - we don't depend on foreman I believe since we get individual puppet alerts anyway in the RelEng shared inbox. But feel free to check with others what they think.
>
> Actions (these are all related to the diagram at
> https://wiki.mozilla.org/images/f/f3/Releng_flow_onepage_treeclose_reasons.pdf):
> 1) we are missing things like relengapi, mapper, etc from the releng service
> diagram
Ouch. Yes, failure of either could potentially cause tree closures. Also I see gaia bumper, but b2g bumper (which bumps revs of git.m.o mirrored repos in sources.xml files in gecko) is also missing. Who can update this diagram?
> 2) schedulerDB is actually the buildbotdb database hosts -
> buildbot_schedulers database
> 3) StatusDB is actually the buildbotdb database hosts - buildbot database
> 4) missing mozpool -> inventory interaction - can close the trees because
> pandas will be unavailable if mozpool gets incorrect database info from
> inventory
>
> Non-diagram actions:
> 5) there are no architectural docs in mana (see
> https://mana.mozilla.org/wiki/display/IT/Buildbot for an example) or
> monitoring for vcssync
> 2)
Updated•10 years ago
Assignee: nobody → relops
Component: Other → RelOps
Product: Release Engineering → Infrastructure & Operations
Updated•10 years ago
Assignee: relops → arich
Reporter
Comment 2•10 years ago
We got rid of the people in the build contact list and replaced that with just irc. There were no dev-services people listed in either of the two contact groups I asked about. I left the releng group, because I'm not sure what, if anything, it's being used for. Possibly escalation? hwine, you were working on contact stuff, is that what the releng group is for?
Foreman going down has no impacts on the tree or releasing, it's for informational purposes only.
I'm not sure who, if anyone, is making updates to selena's original diagram to add in the missing pieces..?
Apparently I can't needinfo hwine because he's away. I'll have to come back and do that later.
Flags: needinfo?(sdeckelmann)
Reporter
Updated•10 years ago
Flags: needinfo?(hwine)
QA Contact: arich
Assignee
Comment 3•10 years ago
I am planning to update the doc! I'm going to leave the needinfo here so that I don't forget to let you know when I'm done. Estimating next week sometime.
Comment 4•10 years ago
(In reply to Amy Rich [:arr] [:arich] from comment #2)
> We got rid of the people in the build contact list and replaced that with
> just irc. There were no dev-services people listed in either of the two
> contact groups I asked about. I left the releng group, because I'm not sure
> what, if anything it's being used for. Possibly escalation? hwine, you were
> working on contact stuff, is that what the releng group is for?
Neither am I - afaik, we never set up escalations in Nagios. That might be a good thing, especially for the "blocker bug" check, etc.
Escalations are still a worthy endeavor, but whether it's nagios or some other system doesn't matter to me.
Flags: needinfo?(hwine)
Reporter
Comment 5•10 years ago
hwine: there isn't any escalation set up currently, but we can certainly do that. What would those escalations look like?
To give you an example you might be familiar with, here's what dev-services does (probably back from when they were part of webops):
define serviceescalation{
    hostgroup_name          bug-checks
    service_description     devservices_bugs
    contact_groups          oncall_manager,sysalerts
    first_notification      3
    last_notification       0
    notification_interval   10
    escalation_options      w,c,r
}
define serviceescalation{
    hostgroup_name          bug-checks
    service_description     devservices_bugs
    contact_groups          webops_manager,oncall_manager,sysalerts
    first_notification      5
    last_notification       0
    notification_interval   10
    escalation_options      w,c,r
}
So for two notifications (the third and fourth), it only sends things to unixfairy, oncall and irc. After that, from the fifth notification on, it adds in fox2mike. This is 24x7.
The releng_bugs is only working hours instead of the full 24x7.
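To make the notification ranges concrete, here is a rough Python sketch that models only the two serviceescalation blocks above. The class and function names are invented for illustration and are not part of Nagios. It assumes the documented Nagios behaviour that last_notification 0 means "no upper bound" and that overlapping escalations each contribute their contact groups. It ignores escalation_options, time periods, and the service's normal contacts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Escalation:
    """Simplified stand-in for one serviceescalation definition."""
    contact_groups: tuple
    first_notification: int
    last_notification: int  # 0 is treated as "no upper bound"

    def applies(self, n):
        """True if the n-th notification falls inside this escalation's range."""
        if n < self.first_notification:
            return False
        return self.last_notification == 0 or n <= self.last_notification


# Values copied from the two definitions quoted in this comment.
ESCALATIONS = [
    Escalation(("oncall_manager", "sysalerts"), 3, 0),
    Escalation(("webops_manager", "oncall_manager", "sysalerts"), 5, 0),
]


def escalated_groups(n, escalations=ESCALATIONS):
    """Contact groups reached by escalation for the n-th notification.

    An empty list means no escalation matches, so only the service's
    normal contacts (not modelled here) would be notified.
    """
    groups = []
    for esc in escalations:
        if esc.applies(n):
            for group in esc.contact_groups:
                if group not in groups:
                    groups.append(group)
    return groups


for n in range(1, 7):
    print(n, escalated_groups(n))
```

With these values, notifications 1 and 2 match no escalation, 3 and 4 reach oncall_manager and sysalerts, and 5 onward also reach webops_manager.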
Flags: needinfo?(hwine)
Comment 6•10 years ago
Before we go too far down this path, we should decide if we want nagios or pager duty. (I believe moc is/has moved to pager duty.)
I also believe it needs more thinking than fits in this bug, i.e. which alerts should be escalated (I don't think comment 5 escalates for other nagios alerts, just for blocker bugs).
And "working hours" should probably be ET through NZ? Or something else? So I suggest we settle the business objectives first, then decide how to encode them and in which system.
Flags: needinfo?(hwine)
Reporter
Comment 7•10 years ago
hwine: okay, since this sounds like it's not an easily answered question, I'll wait for an explicit bug and not continue to pursue the question of escalation here.
Reporter
Comment 8•10 years ago
I think the only open question/action item left here is selena saying she's going to update the doc, so I'll hand this bug over to her to finish up.
Assignee: arich → sdeckelmann
Reporter
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(sdeckelmann)
Resolution: --- → INCOMPLETE