These machines can often be disconnected from buildbot for various reasons, and we don't really care, so there's no reason to alert about them. They are:

linux-ix-slave03, 04, 05
linux64-ix-slave01, 02, 37
moz2-darwin10-slave01, 02, 03, 04, 10
moz2-darwin9-slave03, 08, 10, 68
moz2-linux-slave03, 04, 10, 17, 51
moz2-linux64-slave07, 10
mv-moz2-linux-ix-slave01
mw32-ix-slave01, 19, 21
talos-r3-fed-001, 02, 10
talos-r3-fed64-001, 02, 10
talos-r3-leopard-001, 02, 10
talos-r3-snow-001, 02, 10
talos-r3-w7-001, 02, 03, 10
talos-r3-xp-001, 02, 03, 10
w32-ix-slave01
w64-ix-slave02, 05, 41
We started monitoring these slaves while I was in releng, quite intentionally. Preproduction needs a working pool of slaves to do its preprod tests. These slaves should either be running, or if they are known to be down due to someone using them, downtimed or acked appropriately. IMHO this should be WONTFIX'd.
As the manager handling the relops interface on the releng side, I'll pull coop in here and ask for his input/judgement call. I agree with Dustin that we should monitor buildbot so that we know when they are broken, but if folks in releng are ignoring the checks, then they're not useful. Coop, what say you?
Assignee: server-ops-releng → arich
I was doing some ack-ing of these nagios alerts (pointing at this bug) via IRC late this evening, and it seems that we frequently get flapping:

moz2-linux64-slave10.build.sjc1:buildbot is WARNING: PROCS WARNING: 0 processes with command name twistd, args buildbot.tac
moz2-linux64-slave10.build.sjc1:buildbot is CRITICAL: Connection refused by host
moz2-linux64-slave10.build.sjc1:buildbot is WARNING: PROCS WARNING: 0 processes with command name twistd, args buildbot.tac

which clears the ack at least (I forget if it clears downtime). If downtime also gets cleared when it flaps like that, I would be in favor of dropping the monitoring into the trash.
If preproduction was its own pool of machines, I'd agree that we should have this check enabled on them. However, it's a shared pool and people take them offline for various reasons. These checks are noisy, and having them notify just as loudly as production machines causes them to get in the way of notifications that we *do* need to address promptly.
> if downtime also gets cleared when it would flap like that, I would be in
> favor of dropping the monitoring into the trash.

Downtime does not get cleared when a service flaps.
(In reply to Ben Hearsum [:bhearsum] from comment #4)
> If preproduction was its own pool of machines, I'd agree that we should have
> this check enabled on them. However, it's a shared pool and people take them
> offline for various reasons. These checks are noisy, and having them notify
> just as loudly as production machines causes them to get in the way of
> notifications that we *do* need to address promptly.

Nagios checks should mean something, I agree. At the same time, part of being a conscientious user of our dev/preproduction services should include scheduling downtimes in nagios for any slaves you're taking out of the pool.

To echo comment #1, if we start getting nagios alerts about a slave that (e.g.) I have attached to my master, whoever is on buildduty should first be asking *me* why we're getting an alert. In general, I think we just need to make better use of the Notes field in slavealloc when we take slaves offline for any reason. That would give buildduty an easy way to assign blame.

Ben seems to be most worried about the distraction factor of these alerts. I don't know much about the various reporting options for nagios... is there a way to get a daily roll-up report on just these slaves and avoid the in-channel spam and per-incident email?
Nagios is designed to alert as problems happen, since the expectation is that alerts are actionable and will be acknowledged or fixed. There is no concept of a batch notification at the end of the day. The closest you can get is to set up a cluster check, which we did for a number of things; then people decided those were too noisy because we always had more than the threshold of machines down, and when nagios reloaded it would clear the count and then flap. That may have calmed down somewhat since we have most of the ix machines back online now.

Take a look at the following two URLs to see the cluster checks we already have in place:

https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=admin1.infra
https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=bm-admin01

Right now, all services monitored for build go to the build contact group, which sends out mail and notifies by irc as well. It is possible (but messy if hosts ever change assignment) to split the preprod hosts out into their own group and create a new contact for build that only sends email, but you'll still get the email messages one at a time as each individual machine alerts.

My suggestion would be to downtime the box when you hand it out to someone else (referring to slavealloc in the nagios comment message).
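Scheduling a downtime doesn't have to go through the web UI; nagios also accepts SCHEDULE_HOST_DOWNTIME through its external command file. A minimal sketch of what "downtime the box when you hand it out" could look like from a shell, assuming the default command-file path (adjust for the local install):

```shell
# Sketch: schedule a fixed nagios host downtime via the external command
# file. The CMD_FILE path below is an assumption about the local install.
CMD_FILE="${CMD_FILE:-/var/lib/nagios/rw/nagios.cmd}"

schedule_downtime() {
    # args: hostname, duration in seconds, comment (e.g. slavealloc note)
    host="$1"; duration="$2"; comment="$3"
    now=$(date +%s)
    end=$((now + duration))
    # Format: SCHEDULE_HOST_DOWNTIME;<host>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;%s;%s;%s\n' \
        "$now" "$host" "$now" "$end" "$duration" "$USER" "$comment" \
        >> "$CMD_FILE"
}

# e.g.:
# schedule_downtime moz2-linux-slave03.build.sjc1 86400 "loaned out, see slavealloc notes"
```

Pointing the comment at the slavealloc entry keeps the paper trail in one place, per the suggestion above.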
(In reply to Justin Wood (:Callek) from comment #3)
> I was doing some ack-ing of these nagios alerts (pointing at this bug) via
> IRC late this evening, and it seems that we frequently get flapping:

That's a good example of a slave that needs some TLC. In particular, it's probably rebooting every 2-3 minutes, which is why nagios can't get in to check whether buildbot's running. Likely this is due to available space. Fixing this is definitely a lower priority than fixing a prod box, but it still needs to happen; otherwise the next person to try to use that system will either need to spend time fixing it, or give up and use another system -- leaving this one needlessly consuming cycles rebooting for days.

(OT for this bug, but..) The nagios noise is at about the level it was way back when we started the "make nagios alerts less noisy" tracker (bug 603343). Fixing that involved some amount of nagios reconfiguration, but mostly a few weeks of religiously staying on top of the alerts -- fixing where possible, filing bugs and acking/downtiming where not. Allhands certainly didn't help the noise level, as I expect everyone was too busy to attend to alerts. Ben's been on top of it this week, but it's going to take a few weeks and some adherence to practices (like adding notes in slavealloc before taking things offline) to get it right.
Coop, was there a consensus on a resolution for this bug?
Is it possible to have the nagios alert include a small note as to whether the slave in question is in the production or dev environment? That would allow buildduty to follow-up on production machines that are down and simply ack dev machines if they are too busy.
Currently the check has no concept of which environment the machine is in. They're all monitored exactly the same.
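One way to close that gap without teaching nagios itself about environments would be a thin wrapper around the check that tags its output. This is only a sketch: the map-file path, its "host environment" line format, and the lookup_env/tag_output helper names are all hypothetical, not anything that exists today.

```shell
# Hypothetical wrapper pieces: look up a slave's environment from a local
# map file (one "host environment" pair per line -- an assumed format) and
# prefix the plugin output with it so buildduty can triage at a glance.
ENV_MAP="${ENV_MAP:-/etc/nagios/slave-environments.map}"  # assumed path

lookup_env() {
    # Print the environment recorded for host $1, or "unknown".
    awk -v h="$1" '$1 == h { print $2; found=1 }
                   END { if (!found) print "unknown" }' "$ENV_MAP" 2>/dev/null \
        || echo unknown
}

tag_output() {
    # args: environment, original plugin output
    printf '[%s] %s\n' "$1" "$2"
}

# e.g., wrapping the real buildbot process check:
#   out=$(check_procs -C twistd -a buildbot.tac -c 1:); rc=$?
#   tag_output "$(lookup_env "$HOSTNAME")" "$out"
#   exit $rc
```

The exit code passes through unchanged, so alert severity is untouched; only the text gains a "[dev]"/"[prod]" hint.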
We would have to have slavealloc be aware of any active nagios alerts (IMO)
(In reply to Amy Rich [:arich] from comment #11)
> Currently the check has no concept of which environment the machine is in.
> They're all monitored exactly the same.

Then it sounds like there is not much we can do here aside from fixing the process for grabbing a slave on our side. We'll try to make sure that whoever pulls a slave from the dev environment also creates a nagios downtime for that slave.

(In reply to Mike Taylor [:bear] from comment #12)
> We would have to have slavealloc be aware of any active nagios alerts (IMO)

I don't disagree, but that's out of scope for this bug. Let's start a wiki page with future plans for slavealloc and list all the various data sources we'd like to pull in.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → WONTFIX
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations