It seems we landed some puppet changes on Friday, after which slaves slowly started failing their puppet runs and stopped connecting to buildbot. We should have a way of noticing this before a large number of slaves are down. Maybe we should get an email like the one we get from the masters when there are exceptions. See bug 591803#c3 for more details on what happened. Here is the small fix on puppet that got us into the bad state: http://hg.mozilla.org/build/puppet-manifests/rev/253f67007deb
(In reply to comment #0)
> See bug 591803#c3 for more details on what happened.
> Here is the small fix on puppet that got us into the bad state:
> http://hg.mozilla.org/build/puppet-manifests/rev/253f67007deb

s/into/out of/. The URL looks bogus; did you mean bug 590720?

A brute-force method would be to poll the master for a list of builders and check that each has some minimum number of slaves connected. That would work for the test masters, which are broken down by platform and where all slaves are expected to be connected, but not so well for pm01/03, where we have the full list for the pool and only expect some of them to be connected.
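A minimal sketch of that brute-force check. It assumes we already have a snapshot of connected-slave counts per builder (e.g. parsed out of the master's status pages); the builder names and the shape of the snapshot are illustrative assumptions, not the master's actual API.

```python
# Sketch of the brute-force check described above: flag builders whose
# connected-slave count is below a minimum threshold. The snapshot format
# (builder name -> connected slave count) is a hypothetical simplification.

def find_starved_builders(builders, minimum):
    """Return names of builders with fewer than `minimum` connected slaves.

    `builders` maps builder name -> number of currently connected slaves.
    """
    return sorted(name for name, connected in builders.items()
                  if connected < minimum)

if __name__ == "__main__":
    snapshot = {
        "linux-opt": 5,       # hypothetical builders and counts
        "macosx-debug": 1,
        "win32-opt": 0,
    }
    for name in find_starved_builders(snapshot, minimum=2):
        print("ALERT: builder %s has too few connected slaves" % name)
```

As noted, this only makes sense where every slave is expected to be attached; for pooled masters the threshold would have to be tuned per pool rather than per builder.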
I'm thinking more along the lines of the twisted exception watcher we have on the buildbot masters.
I didn't see this bug when I started working on bug 690590. The proposed solution there specifically monitors for puppetca problems, but it wouldn't be hard to adapt that script to also look for failed puppet runs and other, more general errors. Should we dupe to bug 690590?
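A sketch of what "also look for failed puppet runs" might mean for that script: scan log lines for failure messages. The specific patterns and log format here are assumptions based on common puppet error output, not the actual bug 690590 script.

```python
import re

# Sketch: extend a puppetca-watching log scan to also catch failed puppet
# runs. The failure messages matched below are examples of typical puppet
# client errors; real log paths and formats are assumptions.
FAILURE_PATTERNS = [
    re.compile(r"Could not retrieve catalog"),
    re.compile(r"Failed to apply catalog"),
    re.compile(r"puppetca.*(?:error|failed)", re.IGNORECASE),
]

def failed_run_lines(log_lines):
    """Yield log lines that look like a failed puppet run."""
    for line in log_lines:
        if any(p.search(line) for p in FAILURE_PATTERNS):
            yield line
```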
Found in triage.

(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #3)
> I didn't see this bug when I started working on bug 690590. The proposed
> solution in that bug is specifically monitoring for puppetca problems, but I
> could easily adapt it to also handle more general errors.
>
> It wouldn't be hard to adapt that script to also look for failed puppet runs.
>
> Should we dupe to bug 690590?

Maybe, maybe not. Moving to the correct component for now. Let's see what makes sense to people with more context.
Component: Release Engineering → Release Engineering: Developer Tools
QA Contact: hwine
Product: mozilla.org → Release Engineering
All puppet masters are monitored for basic functionality. We currently get emailed about failed puppet runs, but that's not an alert. Also, transient failures are harmless, as they automatically retry. The events we need to know about are when many hosts fail at once. Foreman will give us a more comprehensive view, so perhaps there's some way to extract a meaningful signal from Foreman into nagios.
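One way to "extract a meaningful signal" along those lines: reduce a count of hosts that failed their last puppet run (however it is pulled from Foreman) to a standard nagios plugin exit code, alerting only when many hosts fail at once. The thresholds below are illustrative assumptions.

```python
# Sketch of a nagios-style check: map "number of hosts whose last puppet
# run failed" (e.g. a count queried from Foreman) to a nagios exit code.
# Nagios plugin convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2

def puppet_failure_state(failed_hosts, warn_at=5, crit_at=20):
    """Map a count of failed hosts to a nagios exit code.

    A handful of failures is expected (transient, auto-retried), so we
    only alert when many hosts fail at once. Thresholds are assumptions.
    """
    if failed_hosts >= crit_at:
        return CRITICAL
    if failed_hosts >= warn_at:
        return WARNING
    return OK
```

The point of the thresholds is exactly the distinction drawn above: single transient failures retry themselves and stay below `warn_at`, while a bad landing that breaks many hosts at once trips CRITICAL.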
Summary: Add monitoring to puppet masters → Monitor for excessive failed puppet runs
Amy, Dustin - are we satisfied with the current level of Puppet monitoring? If so, we should resolve this.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Component: Tools → General
Product: Release Engineering → Release Engineering