Closed Bug 625474 Opened 14 years ago Closed 14 years ago

Create meta-check for buildslaves (nagios)

Categories

(Infrastructure & Operations :: RelOps: General, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zandr, Assigned: arich)

References

Details

(Whiteboard: [triagefollowup])

The buildslave pools should be managed like large clusters. Individual alerts on individual hosts are not terribly useful; meta-alerts on the number of healthy members in each pool are more valuable. We need to create meta-checks that alert when certain thresholds are reached. That should prompt people to go fix slaves, and information on which slaves need fixing is more readily accessible from a dashboard-like service (nagios or otherwise).

Inputs I'll need to implement this:
- A list of the 'pools' we care about.
- The members of each pool.
- WARNING and CRITICAL thresholds for the number of slaves we can afford to lose from each pool.

Until this is done, I'm not willing to disable alerting on individual slaves, regardless of platform.
Yay :)
This sounds awesome! I'll go first with the n900 testing pool. There are 6 devices in staging and 84 in production.

Pool: production n900 pool
Members: n900-001 through n900-064 and n900-071 through n900-090
Warning: 10 devices down
Critical: 25 devices down

Pool: staging n900 pool
Members: n900-065 through n900-070
Warning: 5 devices down
Critical: 6 devices down
Assignee: server-ops-releng → zandr
Assignee: zandr → arich
No longer blocks: releng-nagios
Okay, I've put up a couple of proof-of-concept checks that already include a bunch of downed hosts. Based on the way we do checks, I'm not sure I'll be able to use the host cluster check, but I do have the service cluster check working. The one that's checking 335 services uses nagios servicegroups instead of individually listing each host in yet another config file.

https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=bm-admin01

Note that each cluster check is going to be on a PER-NAGIOS-SERVER basis, since different nagios leaf nodes return information only about the hosts they know about. In other words, if you have linux-ix-slave01 in sjc1 and another linux-ix-slave01 in a different colo, those two hosts are monitored by different servers and cannot be in the same cluster.

After digging into this more, I think the way to go is:

1) Pick which services we want to monitor.
2) Very clearly define the groups of machines that we want as a cluster. As mentioned above, I plan to use the servicegroup mechanism to group them all together so that the check_cluster check is as simple as possible.
3) Define how many hosts we're allowed to have down for each cluster before warning and going critical.
4) Use the service group view and do away with the hostgroup view.

This means we'll only be tracking data in the two .h files that the mozilla home-grown stuff uses to build the hosts.cfg and services.cfg files, and we won't be maintaining separate lists of hosts in hostgroup.cfg or servicegroup.cfg, only using service group definitions (which will be defined in services.h). I think this is a much cleaner and simpler approach than having yet another script to build clusters.cfg.

So what I need from releng:
* A list of services we want to monitor as a cluster (right now I've done all of the slaves running buildbot that we monitor on bm-admin01, so nothing in scl1; I'm sure we want to split this up by OS, try vs. build, talos, etc.)
* A list of hosts on which we want to monitor each of those services, so I can create the service groups.
* How many service instances down in each service group cluster we should use as the warning and critical thresholds.
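For readers who haven't seen the nagios side of this, here is a rough sketch of what the servicegroup + check_cluster arrangement described above might look like in plain nagios object syntax. Everything below is illustrative: the group name and member hosts are borrowed from other comments in this bug, the thresholds and the generic-service template are placeholders, and the expansion of servicegroup members into check_cluster's -d argument is assumed to be handled by the home-grown config-generation scripts (stock nagios has no macro that expands a servicegroup into member state IDs).

  # A servicegroup lists host,service pairs. The generation scripts would
  # emit the full member list from services.h; only two members shown here.
  define servicegroup{
      servicegroup_name   bm-buildbot
      alias               buildbot on build slaves monitored from bm-admin01
      members             moz2-linux-slave01,buildbot,moz2-linux-slave02,buildbot
      }

  # check_cluster is the stock nagios plugin that rolls member states up
  # into a single WARNING/CRITICAL result.
  define command{
      command_name        check_service_cluster
      command_line        $USER1$/check_cluster --service -l $ARG1$ -w $ARG2$ -c $ARG3$ -d $ARG4$
      }

  # One cluster service per pool. The 5/10 warning/critical values are
  # made-up placeholders, and the -d list would normally be much longer.
  define service{
      use                 generic-service
      host_name           bm-admin01
      service_description bm-buildbot cluster
      check_command       check_service_cluster!"bm buildbot"!5!10!$SERVICESTATEID:moz2-linux-slave01:buildbot$,$SERVICESTATEID:moz2-linux-slave02:buildbot$
      }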
I think this will be a great complement to bug 629698 and its relations - we probably won't have an all-hands panic attack when one slave doesn't reboot for 6 hours, but if 50% of the slaves in a silo have failed to reboot, we *should* panic.
I forgot to mention... any *non*-releng-specific checks are going to be more difficult to add because of the way the generation scripts are written. Other than scl1, we share nagios servers with other groups that use the same template definitions that we do for checks. So things like PING are much more difficult than, say, the tegra_tcp_check, buildbot, hungslave, etc.
Status: NEW → ASSIGNED
(In reply to comment #2)
> This sounds awesome! I'll go first with the n900 testing pool. There are 6
> devices in staging and 84 in production.
>
> Pool: production n900 pool
> Members: n900-001 through n900-064 and n900-071 through n900-090
> Warning: 10 devices down
> Critical: 25 devices down
>
> Pool: staging n900 pool
> Members: n900-065 through n900-070
> Warning: 5 devices down
> Critical: 6 devices down

I've set this up along with one to check the tcp service on the tegras (all downtimed right now):

https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=mvadm01.mv

I also modified the script that generates the hosts.cfg file so that we can use servicegroups for the PING check as well, so that's no longer a limitation.

Rob may have found an issue while he was working on the ops stuff, in that service descriptions with spaces in them appear not to work for him. He's investigating this and may find a workaround. If not, we may need to change the names of some checks (I'm thinking of the hung slave checks) if we want to use those in cluster definitions.
Good news... the bug that Rob was seeing doesn't affect the way I was doing things since I'm using servicegroups (and he found a workaround for his issue as well). So as things stand right now, I've mocked up a few checks.

https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=bm-admin01
  bm-buildbot
  bm-hungslave
  bm-hungslave_win
  bm-hungslave_slow

https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=mvadm01.mv
  n900 production ping
  n900 staging ping
  tegra tcp

I haven't added things to admin1.infra yet, but that's why I prefixed the buildbot and hungslave checks with bm- on bm-admin01. Now that I've done proof of concept for various different things and they all seem to be working, we need to come up with groups and other possible services to monitor as a cluster.
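As a concrete (but still hypothetical) illustration, the "n900 production ping" check listed above could be expressed along the same lines as the earlier sketch, using the thresholds from comment 2; the member list is truncated and the -d expansion is again assumed to come from the generation scripts:

  # Full member list would cover n900-001 through n900-064 and
  # n900-071 through n900-090; truncated here.
  define servicegroup{
      servicegroup_name   n900-production-ping
      alias               PING on production n900s
      members             n900-001,PING,n900-002,PING
      }

  # Warning at 10 devices down, critical at 25, per comment 2.
  define service{
      use                 generic-service
      host_name           mvadm01.mv
      service_description n900-ping cluster
      check_command       check_service_cluster!"n900 production ping"!10!25!...
      }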
(In reply to comment #7)
> Now that I've done proof of concept for various different things and they all
> seem to be working, we need to come up with groups and other possible services
> to monitor as a cluster.

triagefollowup: we need to get back to arr for build slaves. I am taking care of test machines.

arr, let me know if this is the information you need for test slaves:

#### PRODUCTION
Warning: 3 devices down
Critical: 5 devices down

Pool: Fedora production testing pool
Members: 003-009,011-053

Pool: Fedora 64-bit production testing pool
Members: 003-009,011-055

Pool: Windows XP production testing pool
Members: 004-009,011-053

Pool: Windows 7 production testing pool
Members: 004-009,011-039,041-053

Pool: Windows 7 64-bit production testing pool
Members: 003-009,011-050

Pool: Leopard production testing pool
Members: 003-009,011-053

Pool: Snow Leopard production testing pool
Members: 003-009,011-055

#### STAGING
Warning: 2 devices down
Critical: 2 devices down

Pool: Fedora staging testing pool
Members: 001-002,010

Pool: Fedora 64-bit staging testing pool
Members: 001-002,010

Pool: Windows XP staging testing pool
Members: 001-003,010

Pool: Windows 7 staging testing pool
Members: 001-003,010

Pool: Windows 7 64-bit staging testing pool
Members: 001-002,010

Pool: Leopard staging testing pool
Members: 001-002,010

Pool: Snow Leopard staging testing pool
Members: 001-002,010

All information gathered from:
> http://hg.mozilla.org/build/buildbot-configs/file/tip/mozilla-tests/production_config.py

talos-r3-fed-*:      3-9,11-53        --> 7 + 43 = 50 slaves
talos-r3-fed64-*:    3-9,11-55        --> 7 + 45 = 52 slaves
talos-r3-xp-*:       4-9,11-53        --> 6 + 43 = 49 slaves
talos-r3-w7-*:       4-9,11-39,41-53  --> 6 + 29 + 13 = 48 slaves
t-r3-w764-*:         3-9,11-50        --> 7 + 40 = 47 slaves
talos-r3-leopard-*:  3-9,11-53        --> 7 + 43 = 50 slaves
talos-r3-snow-*:     3-9,11-55        --> 7 + 45 = 52 slaves
Whiteboard: [triagefollowup]
That gives me hosts and thresholds (warning and critical can't be the same, I don't think, so staging will have to be 2,3 or 1,2), but I also need to know what services to monitor. For the talos machines, the possibilities are PING and buildbot-start. Which would you like?
(In reply to comment #9)
> That gives me hosts and thresholds (warning and critical can't be the same,
> I don't think, so staging will have to be 2,3 or 1,2), but I also need to
> know what services to monitor. For the talos machines, the possibilities
> are PING and buildbot-start. Which would you like?

Replied on IRC.
- 1,2
- both checks

arr, have I told you today that you are awesome? :)
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=admin1.infra

The talos machines all have cluster checks for PING and buildbot-start now. The checks for buildbot-start are downtimed on the windows machines (not XP) since they don't work there yet. The linux-ix-slave and linux-ix64-slave checks are downtimed because we have so many hosts out at IX that they would constantly complain.

I've also taken the downtime off the n900 checks (only 2 production hosts down), since they're nice and healthy, but left it on the tegra_tcp check since we have a lot of downed services there as well:

https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=mvadm01.mv

Other than moving the w764 talos machines over to admin1 to be monitored, I haven't made any changes to bm-admin01 yet:

https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=bm-admin01
Summary: Create meta-check for buildslaves → Create meta-check for buildslaves (nagios)
Hi arr, I will be gone tomorrow, so if you have any questions please follow up with catlee and dustin.

The decided checks are ping, buildbot, and buildbot_start. Once buildbot_start is working we can remove the buildbot check.

I am using W/C as short for warning/critical, and I am indicating it per colo. I am also using sl. as short for slave(s). I have tried to use the following values depending on the number of slaves per colo for a given OS: for WARNING I chose round(#slaves/10), and for CRITICAL I use WARNING + 3.
* 2/5, 3/6, 4/7, 5/8, 6/9
Feel free to propose another option if it makes things easier.

Note that I am blurring the lines between staging/production and try vs. builder. All of this data except Win32 has been gathered through slavealloc.

Linux:
W/C: 3/6 - SCL1: 35 sl. linux-ix-slave[01-42] (minus 04, 05, 07, 09, 10, 11, 15)
W/C: 5/8 - SJC1: 51 sl. moz2-linux-slave[01-51]
W/C: 3/6 - MTV1: 30 sl. linux-ix-slave[04,05,07,09,10,11,15] mv-moz2-linux-ix-slave[01-23]

Linux 64:
W/C: 4/7 - SCL1: 41 sl. linux64-ix-slave[01-41]
W/C: 2/5 - SJC1: 22 sl. moz2-linux64-slave[01-12] try-linux64-slave[01-10]

Darwin9:
W/C: 13/16 - SJC1: 136 sl. moz2-darwin9-slave[01-72] (minus 04) try-mac-slave[01-47] (minus 05) bm-xserve[06-24]

Darwin10:
W/C: 6/9 - SJC1: 63 sl. moz2-darwin10-slave[03-39] try-mac64-slave[01-26]
W/C: 2/5 - MTV1: 24 sl. moz2-darwin10-slave[01-02/40-56] try-mac64-slave[27-31]

Win32:
W/C: 2/5 - MTV1: 20 sl. mw32-ix-slave[02-21]
W/C: 4/7 - SCL1: 42 sl. w32-ix-slave[01-42]
W/C: 9/12 - SJC1: 95 sl. win32-slave[01-59] try-w32-slave[01-36]
Adding dependency on bug 659134 since there seems to be a desire to redo the implementation of these checks first.
Depends on: 659134
Based on comments here and conversation, this is what the checks now look like. The scl1-linux-slave and scl1-w32-slave checks are set higher because we anticipate that the IX machines will move there after disk replacement (the corresponding values in mtv1 should be lowered after this actually happens).

Format: cluster name: number of hosts in cluster, non-OK count at which to notify

n900-ping cluster: 87 45
tegra-ping cluster: 93 45
tegra-tcp cluster: 93 45
mtv1-darwin10-slave-buildbot cluster: 22 10
mtv1-darwin10-slave-buildbot-start cluster: 22 10
mtv1-darwin10-slave-ping cluster: 22 10
mtv1-linux-slave-buildbot cluster: 31
mtv1-linux-slave-buildbot-start cluster: 31 15
mtv1-linux-slave-ping cluster: 31 15
mtv1-w32-slave-buildbot-start cluster: 40 20
mtv1-w32-slave-ping cluster: 40 20
scl1-linux-slave-buildbot cluster: 34 20
scl1-linux-slave-buildbot-start cluster: 34 20
scl1-linux-slave-ping cluster: 34 20
scl1-linux64-slave-buildbot cluster: 41 20
scl1-linux64-slave-buildbot-start cluster: 41 20
scl1-linux64-slave-ping cluster: 41 20
scl1-t-r3-w764-buildbot-start cluster: 50 25
scl1-t-r3-w764-ping cluster: 50 25
scl1-talos-r3-fed-buildbot-start cluster: 53 25
scl1-talos-r3-fed-ping cluster: 53 25
scl1-talos-r3-fed64-buildbot-start cluster: 55 25
scl1-talos-r3-fed64-ping cluster: 55 25
scl1-talos-r3-leopard-buildbot-start cluster: 53 25
scl1-talos-r3-leopard-ping cluster: 53 25
scl1-talos-r3-snow-buildbot-start cluster: 55 25
scl1-talos-r3-snow-ping cluster: 55 25
scl1-talos-r3-w7-buildbot-start cluster: 52 25
scl1-talos-r3-w7-ping cluster: 52 25
scl1-talos-r3-xp-buildbot-start cluster: 53 25
scl1-talos-r3-xp-ping cluster: 53 25
scl1-w32-slave-buildbot-start cluster: 27 20
scl1-w32-slave-ping cluster: 27 20
sjc1-darwin10-slave-buildbot cluster: 55 25
sjc1-darwin10-slave-buildbot-start cluster: 55 25
sjc1-darwin10-slave-ping cluster: 55 25
sjc1-darwin9-slave-buildbot cluster: 107 50
sjc1-darwin9-slave-buildbot-start cluster: 107 50
sjc1-darwin9-slave-ping cluster: 107 50
sjc1-linux-slave-buildbot cluster: 80 40
sjc1-linux-slave-ping cluster: 80 40
sjc1-linux64-slave-buildbot cluster: 22 10
sjc1-linux64-slave-buildbot-start cluster: 22 10
sjc1-linux64-slave-ping cluster: 22 10
sjc1-w32-slave-buildbot-start cluster: 79 40
sjc1-w32-slave-ping cluster: 79 40
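For anyone mapping the list above back to config, one row might translate into a service definition roughly like the following (purely illustrative: the real definitions are generated, the warning value is an assumption since only the notification point is listed above, and the -d member list is elided):

  # scl1-linux-slave-ping cluster: 34 member services, notify at 20 non-OK.
  # The warning value of 15 is a made-up placeholder.
  define service{
      use                 generic-service
      host_name           admin1.infra
      service_description scl1-linux-slave-ping cluster
      check_command       check_service_cluster!"scl1 linux ping"!15!20!...
      }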
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Looks perfect, thanks!
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations