Closed
Bug 625474
Opened 14 years ago
Closed 14 years ago
Create meta-check for buildslaves (nagios)
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: zandr, Assigned: arich)
References
Details
(Whiteboard: [triagefollowup])
The buildslave pools should be managed like large clusters. Individual alerts on individual hosts are not terribly useful, but meta-alerts on the number of members of that pool are more valuable.
We need to create some meta-checks that will alert when certain thresholds are reached. This should cause people to go fix slaves, and the information about which slaves need fixing is more readily accessible from a dashboard-like service (nagios or otherwise).
Inputs I'll need to implement this:
- Lists of the 'pools' we care about.
- Lists of the members of each pool.
- WARNING and CRITICAL thresholds for the number of slaves we can afford to lose from each pool.
Until this is done, I'm not willing to disable alerting on individual slaves, regardless of platform.
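For illustration, a minimal sketch of what such a meta-check could look like (the pool names, thresholds, and the way down hosts are supplied are placeholders, not the real implementation):

#!/usr/bin/env python
# Hypothetical sketch of the pool meta-check described above.  Pool names,
# thresholds, and the source of "down" hosts are illustrative only.
import sys

POOLS = {
    # pool name: (member list, warning threshold, critical threshold)
    "n900-production": (["n900-%03d" % i for i in range(1, 65)]
                        + ["n900-%03d" % i for i in range(71, 91)], 10, 25),
    "n900-staging": (["n900-%03d" % i for i in range(65, 71)], 5, 6),
}

def check_pool(members, down, warn, crit):
    """Return (exit code, message) following the Nagios plugin convention."""
    n_down = len([m for m in members if m in down])
    msg = "%d of %d pool members down" % (n_down, len(members))
    if n_down >= crit:
        return 2, "CRITICAL: " + msg
    if n_down >= warn:
        return 1, "WARNING: " + msg
    return 0, "OK: " + msg

if __name__ == "__main__":
    # usage: pool_check.py <pool name> <down host> [<down host> ...]
    members, warn, crit = POOLS[sys.argv[1]]
    code, message = check_pool(members, set(sys.argv[2:]), warn, crit)
    print(message)
    sys.exit(code)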
Comment 1•14 years ago
Yay :)
Comment 2•14 years ago
This sounds awesome!
I'll go first with the n900 testing pool. There are 6 devices in staging and 84 in production.
Pool: production n900 pool
Members: n900-001 through n900-064 and n900-071 through n900-090
Warning: 10 devices down
Critical: 25 devices down
Pool: staging n900 pool
Members: n900-065 through n900-070
Warning: 5 devices down
Critical: 6 devices down
Updated•14 years ago
Assignee: server-ops-releng → zandr
Updated•14 years ago
Assignee: zandr → arich
Updated•14 years ago
No longer blocks: releng-nagios
Comment 3•14 years ago
Okay, I've put up a couple of proof-of-concept checks that include a bunch of downed hosts already. Based on the way we do checks, I'm not sure that I'll be able to use the host cluster check, but I do have the service cluster check working. The one that's checking 335 services is using nagios servicegroups instead of individually listing each host in yet another config file.
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=bm-admin01
Note that each cluster check is going to be on a PER-NAGIOS-SERVER basis since different nagios leaf nodes return information about the hosts they know about. In other words, if you have a linux-ix-slave01 in sjc1 and another linux-ix-slave01 in a different colo, those two hosts are monitored by different servers and cannot be in the same cluster.
After digging into this more, I think that the way to go is:
1) pick what services we want to monitor
2) very clearly define groups of machines that we want as a cluster. As mentioned above, I plan to use the servicegroup mechanism to group them all together so that the check_cluster check is as simple as possible.
3) define how many hosts we're allowed to have down for each cluster before warning and going critical
4) use the service group view and do away with the hostgroup view
This means that we'll only be tracking data in the two .h files that the mozilla home-grown tooling uses to build the hosts.cfg and services.cfg files, and we won't be maintaining separate lists of hosts in hostgroup.cfg or servicegroup.cfg, only using service group definitions (which will be defined in services.h).
I think this is a much cleaner and simpler approach than having yet another script to build clusters.cfg.
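As a rough illustration of the approach, a servicegroup's members can be expanded into a check_cluster invocation at config-generation time, something like this (the plugin path, label, slave list, and thresholds are placeholders; check_cluster and the $SERVICESTATEID:host:service$ on-demand macro are stock Nagios features, but the real generated config may differ):

# Illustrative config-generation sketch only: turn a servicegroup's members
# into a check_cluster command line.  Names, paths, and thresholds here are
# assumptions, not the actual Mozilla config.
def cluster_check_command(label, members, warn, crit):
    """members: list of (hostname, service_description) pairs from a servicegroup."""
    states = ",".join("$SERVICESTATEID:%s:%s$" % (h, s) for h, s in members)
    return ("/usr/lib/nagios/plugins/check_cluster --service -l '%s' "
            "-w %d -c %d -d %s" % (label, warn, crit, states))

# e.g. a buildbot cluster over a hypothetical set of slaves:
print(cluster_check_command(
    "bm-buildbot",
    [("linux-ix-slave%02d" % i, "buildbot") for i in range(1, 43)],
    warn=10, crit=25))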
So what I need from releng:
* a list of services we want to monitor as a cluster (right now I've done all of the slaves running buildbot that we monitor on bm-admin01, so nothing in scl1 yet; I'm sure we'll want to split this up by OS, try vs. build, talos, etc.)
* a list of hosts that we want to monitor each of those services on so I can create service groups
* the warning and critical thresholds for how many service instances can be down in each service group cluster that we monitor
Comment 4•14 years ago
I think this will be a great complement to bug 629698 and its relations - we probably won't have an all-hands panic attack when one slave doesn't reboot for 6 hours, but if 50% of the slaves in a silo have failed to reboot, we *should* panic.
Comment 5•14 years ago
I forgot to mention... any *non*-releng-specific checks are going to be more difficult to add because of the way the generation scripts are written. Other than scl1, we share nagios servers with other groups that use the same template definitions that we do for checks, so things like PING are much more difficult than, say, the tegra_tcp_check, buildbot, hungslave, etc.
Status: NEW → ASSIGNED
Comment 6•14 years ago
(In reply to comment #2)
> This sounds awesome!
>
> I'll go first with the n900 testing pool. There are 6 devices in staging and
> 84 in production.
>
> Pool: production n900 pool
> Members: n900-001 through n900-064 and n900-071 through n900-090
> Warning: 10 devices down
> Critical: 25 devices down
>
> Pool: staging n900 pool
> Members: n900-065 through n900-070
> Warning: 5 devices down
> Critical: 6 devices down
I've set this up along with one to check the tcp service on the tegras (all downtimed right now):
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=mvadm01.mv
I also modified the script that generates the hosts.cfg file so that we can use servicegroups for the PING check as well, so that's no longer a limitation.
Rob may have found an issue while he was working on the ops stuff in that service descriptions with spaces in them appear not to work for him. He's investigating this and may find a workaround. If not, we may need to change the names of some checks (I'm thinking of the hung slave checks) if we want to use those in cluster definitions.
Comment 7•14 years ago
Good news... the bug that Rob was seeing doesn't affect the way I was doing things since I'm using servicegroups (and he found a workaround for his issue as well).
So as things stand right now, I've mocked up a few checks.
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=bm-admin01
bm-buildbot
bm-hungslave
bm-hungslave_win
bm-hungslave_slow
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=mvadm01.mv
n900 production ping
n900 staging ping
tegra tcp
I haven't added things to admin1.infra yet, but that's why I prefixed the buildbot and hungslave checks with bm- on bm-admin01.
Now that I've done proof of concept for various different things and they all seem to be working, we need to come up with groups and other possible services to monitor as a cluster.
Comment 8•14 years ago
(In reply to comment #7)
> Now that I've done proof of concept for various different things and they all
> seem to be working, we need to come up with groups and other possible services
> to monitor as a cluster.
triagefollowup: we need to get back to arr about the build slaves; I am taking care of the test machines.
arr, let me know if this is the information you need for the test slaves:
#### PRODUCTION
Warning: 3 devices down
Critical: 5 devices down
Pool: Fedora production testing pool
Members: 003-009,011-053
Pool: Fedora 64-bit production testing pool
Members: 003-009,011-055
Pool: Windows XP production testing pool
Members: 004-009,011-053
Pool: Windows 7 production testing pool
Members: 004-009,011-039,041-053
Pool: Windows 7 64-bit production testing pool
Members: 003-009,011-050
Pool: Leopard production testing pool
Members: 003-009,011-053
Pool: Snow Leopard production testing pool
Members: 003-009,011-055
#### STAGING
Warning: 2 devices down
Critical: 2 devices down
Pool: Fedora staging testing pool
Members: 001-002,010
Pool: Fedora 64-bit staging testing pool
Members: 001-002,010
Pool: Windows XP staging testing pool
Members: 001-003,010
Pool: Windows 7 staging testing pool
Members: 001-003,010
Pool: Windows 7 64-bit staging testing pool
Members: 001-002,010
Pool: Leopard staging testing pool
Members: 001-002,010
Pool: Snow Leopard staging testing pool
Members: 001-002,010
All information gathered from:
> http://hg.mozilla.org/build/buildbot-configs/file/tip/mozilla-tests/production_config.py
talos-r3-fed-*: 3-9,11-53 --> 7 + 43 = 50 slaves
talos-r3-fed64-*: 3-9,11-55 --> 7 + 45 = 52 slaves
talos-r3-xp-*: 4-9,11-53 --> 6 + 43 = 49 slaves
talos-r3-w7-*: 4-9,11-39,41-53 --> 6 + 29 + 13 = 48 slaves
t-r3-w764-*: 3-9,11-50 --> 7 + 40 = 47 slaves
talos-r3-leopard-*: 3-9,11-53 --> 7 + 43 = 50 slaves
talos-r3-snow-*: 3-9,11-55 --> 7 + 45 = 52 slaves
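(The counts above can be reproduced with a small range-expansion helper; this one is hypothetical, just to double-check the arithmetic:)

# Hypothetical helper to expand the "003-009,011-053" style ranges above
# and count how many slaves each pool contains.
def expand(ranges):
    members = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        members.extend(range(int(lo), int(hi or lo) + 1))
    return members

print(len(expand("3-9,11-53")))        # talos-r3-fed:  7 + 43 = 50
print(len(expand("4-9,11-39,41-53")))  # talos-r3-w7:   6 + 29 + 13 = 48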
Whiteboard: [triagefollowup]
Comment 9•14 years ago
That gives me hosts and thresholds (warning and critical can't be the same, I don't think, so staging will have to be 2,3 or 1,2), but I also need to know what services to monitor. For the talos machines, the possibilities are PING and buildbot-start. Which would you like?
Comment 10•14 years ago
(In reply to comment #9)
> That gives me hosts and thresholds (warning and critical can't be the same,
> I don't think, so staging will have to be 2,3 or 1,2), but I also need to
> know what services to monitor. For the talos machines, the possibilities
> are PING and buildbot-start. Which would you like?
Replied on IRC.
- 1,2
- both checks
arr have I told you today that you are awesome? :)
Comment 11•14 years ago
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=admin1.infra
The talos machines all have cluster checks for PING and buildbot-start now. The checks for buildbot-start are downtimed on the windows machines (not XP) since they don't work there yet. The linux-ix-slave and linux-ix64-slave checks are downtimed because we have so many hosts out at IX that it's going to constantly complain.
I've also taken the downtime off on the n900 checks (only 2 production hosts down), since they're nice and healthy, but left them on the tegra_tcp check since we have a lot of downed services there as well.
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=mvadm01.mv
Other than moving the w764 talos machines over to admin1 to be monitored, I haven't made any changes to bm-admin01 yet:
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=bm-admin01
Updated•14 years ago
Summary: Create meta-check for buildslaves → Create meta-check for buildslaves (nagios)
Comment 12•14 years ago
Hi arr, I will be gone tomorrow so if you have any questions please follow up with catlee and dustin.
The decided checks are ping, buildbot and buildbot_start.
Once buildbot_start is working we can remove the buildbot check.
I am using W/C as short for warning/critical and I am indicating it per colo.
I am also using sl. for slave(s).
I have tried to use the following values depending on the # of slaves per colo for a given OS.
For WARNING I chose round(#slaves/10), and CRITICAL is WARNING + 3.
* 2/5, 3/6, 4/7, 5/8, 6/9
Feel free to propose another option if it makes things easier.
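(For reference, a sketch of that rule; note the W values actually listed below line up with integer division rather than true rounding, so that's what this assumes:)

# Sketch of the threshold rule above: WARNING = #slaves // 10,
# CRITICAL = WARNING + 3.  Pool sizes are samples taken from the list below.
def thresholds(n_slaves):
    warn = n_slaves // 10
    return warn, warn + 3

for n in (35, 51, 136, 95):
    print(n, thresholds(n))   # -> (3, 6), (5, 8), (13, 16), (9, 12)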
Note that I am blurring the lines between staging/production and try vs. build.
All of this data except Win32 has been gathered through slavealloc.
Linux:
W/C: 3/6 - SCL1: 35 sl. linux-ix-slave[01-42] (minus 04, 05, 07, 09, 10, 11, 15)
W/C: 5/8 - SJC1: 51 sl. moz2-linux-slave[01-51]
W/C: 3/6 - MTV1: 30 sl. linux-ix-slave[04,05,07,09,10,11,15]
mv-moz2-linux-ix-slave[01-23]
Linux 64:
W/C: 4/7 - SCL1: 41 sl. linux64-ix-slave[01-41]
W/C: 2/5 - SJC1: 22 sl. moz2-linux64-slave[01-12]
try-linux64-slave[01-10]
Darwin9:
W/C: 13/16 - SJC1: 136 sl. moz2-darwin9-slave[01-72] (minus 04)
try-mac-slave[01-47] (minus 05)
bm-xserve[06-24]
Darwin10:
W/C: 6/9 - SJC1: 63 sl. moz2-darwin10-slave[03-39]
try-mac64-slave[01-26]
W/C: 2/5 - MTV1: 24 sl. moz2-darwin10-slave[01-02/40-56]
try-mac64-slave[27-31]
Win32:
W/C: 2/5 - MTV1: 20 sl. mw32-ix-slave[02-21]
W/C: 4/7 - SCL1: 42 sl. w32-ix-slave[01-42]
W/C: 9/12 - SJC1: 95 sl. win32-slave[01-59]
try-w32-slave[01-36]
Comment 13•14 years ago
Adding dependency on 659134 since there seems to be a desire to redo the implementation of these checks first.
Depends on: 659134
Comment 14•14 years ago
Based on comments here and conversation, this is what the checks now look like. The scl1-linux-slave and scl1-w32-slave checks are set higher because we anticipate that the IX machines will move there after disk replacement (the corresponding values in mtv1 should be lowered once that actually happens).
cluster: <number of hosts in cluster> <non-OK count at which to notify>
n900-ping cluster: 87 45
tegra-ping cluster: 93 45
tegra-tcp cluster: 93 45
mtv1-darwin10-slave-buildbot cluster: 22 10
mtv1-darwin10-slave-buildbot-start cluster: 22 10
mtv1-darwin10-slave-ping cluster: 22 10
mtv1-linux-slave-buildbot cluster: 31 15
mtv1-linux-slave-buildbot-start cluster: 31 15
mtv1-linux-slave-ping cluster: 31 15
mtv1-w32-slave-buildbot-start cluster: 40 20
mtv1-w32-slave-ping cluster: 40 20
scl1-linux-slave-buildbot cluster: 34 20
scl1-linux-slave-buildbot-start cluster: 34 20
scl1-linux-slave-ping cluster: 34 20
scl1-linux64-slave-buildbot cluster: 41 20
scl1-linux64-slave-buildbot-start cluster: 41 20
scl1-linux64-slave-ping cluster: 41 20
scl1-t-r3-w764-buildbot-start cluster: 50 25
scl1-t-r3-w764-ping cluster: 50 25
scl1-talos-r3-fed-buildbot-start cluster: 53 25
scl1-talos-r3-fed-ping cluster: 53 25
scl1-talos-r3-fed64-buildbot-start cluster: 55 25
scl1-talos-r3-fed64-ping cluster: 55 25
scl1-talos-r3-leopard-buildbot-start cluster: 53 25
scl1-talos-r3-leopard-ping cluster: 53 25
scl1-talos-r3-snow-buildbot-start cluster: 55 25
scl1-talos-r3-snow-ping cluster: 55 25
scl1-talos-r3-w7-buildbot-start cluster: 52 25
scl1-talos-r3-w7-ping cluster: 52 25
scl1-talos-r3-xp-buildbot-start cluster: 53 25
scl1-talos-r3-xp-ping cluster: 53 25
scl1-w32-slave-buildbot-start cluster: 27 20
scl1-w32-slave-ping cluster: 27 20
sjc1-darwin10-slave-buildbot cluster: 55 25
sjc1-darwin10-slave-buildbot-start cluster: 55 25
sjc1-darwin10-slave-ping cluster: 55 25
sjc1-darwin9-slave-buildbot cluster: 107 50
sjc1-darwin9-slave-buildbot-start cluster: 107 50
sjc1-darwin9-slave-ping cluster: 107 50
sjc1-linux-slave-buildbot cluster: 80 40
sjc1-linux-slave-ping cluster: 80 40
sjc1-linux64-slave-buildbot cluster: 22 10
sjc1-linux64-slave-buildbot-start cluster: 22 10
sjc1-linux64-slave-ping cluster: 22 10
sjc1-w32-slave-buildbot-start cluster: 79 40
sjc1-w32-slave-ping cluster: 79 40
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 15•14 years ago
Looks perfect, thanks!
Updated•12 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations