At the moment tbpl.m.o is returning 500 errors but nagios isn't reporting any errors. Please investigate whether any checks are configured. If not, we should probably have the usual HTTP checks, an HTTPS cert check, and load balancer node status. On the individual nodes, the usual checks plus database connectivity and disk checks, particularly for the log caching. mstange, swatinem, philor and rhelmer may well have input there too. The alerts would be actioned by Server Ops, but it would be great if they reported to #buildduty too.
tbpl is on the generic cluster so I'd be surprised if there aren't at least some checks. There may not be checks for each individual vhost and database, though (there probably should be).
Perhaps the checks were downtimed for the seamicro update.
No, the checks were all fine and passing the way they were set up - we just don't have a url check for this. iirc, we rushed this to prod and nagios checks are easy to overlook. We will get this Monday.
I've added a standard https check, a check searching for the string 'Tinderboxpushlog' as that is the page's title and also one for the https certificate. The results can be viewed here: https://dp-nagios01.phx.mozilla.com/nagios/cgi-bin/status.cgi?navbarsearch=1&host=generic.zlb.phx.mozilla.net Please let me know if this is sufficient to R/F.
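For readers unfamiliar with how such a string check behaves, here is a minimal Python sketch of the logic. It is not the configuration that was actually deployed; the function names `evaluate_page` and `fetch` are hypothetical, and only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_page(status, body, needle="Tinderboxpushlog"):
    """Map an HTTP status and page body to (exit_code, message)."""
    if status != 200:
        return CRITICAL, f"HTTP CRITICAL: status {status}"
    if needle not in body:
        return CRITICAL, f"HTTP CRITICAL: string '{needle}' not found"
    return OK, f"HTTP OK: status 200, string '{needle}' found"


def fetch(url, timeout=10):
    """Return (status, body). HTTP error statuses are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, ""
```

A wrapper script would call `evaluate_page(*fetch(url))`, print the message, and exit with the returned code so that nagios can pick up the state.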
Looks great, thanks!
So how sophisticated is nagios, what kind of scripting can be used? http check is ok, as far as the general availability goes, but considering it is just a static html page this does not provide as much value. Here are a few use-cases we can consider:
- Check https://tbpl.mozilla.org/php/getHiddenBuilderNames.php?branch=mozilla-central to make sure we have connection to the database at all.
- Check that https://tbpl.mozilla.org/php/getRevisionBuilds.php?branch=mozilla-central&rev=*a rev that is ~ 12 hours old* is not empty to make sure our cron-job does its job correctly and is not blocked. To catch things like Bug 706229.
- Check that https://tbpl.mozilla.org/php/getLogExcerpt.php?id=*recent build id*&type=tinderbox_print works for a *recent* job that *has not yet been run* through the log processor. To catch things like Bug 707553.
Any ideas how to achieve this?
Arpad,
For https://tbpl.mozilla.org/php/getHiddenBuilderNames.php?branch=mozilla-central I can just look in the JSON response for a string; if the string doesn't exist I can return critical.
For https://tbpl.mozilla.org/php/getRevisionBuilds.php?branch=mozilla-central&rev=*a there is just an empty
For https://tbpl.mozilla.org/php/getLogExcerpt.php?id=*recent returns: Unknown run ID.
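The approach described above (inspect the response body and return critical when the expected content is missing) could be sketched in Python roughly as follows. This is only an illustration: the helper names `evaluate_json` and `evaluate_excerpt` and the `required` parameter are hypothetical, and it assumes that getRevisionBuilds.php returns JSON like getHiddenBuilderNames.php does, which the thread does not confirm.

```python
import json

# Standard Nagios plugin exit codes used here.
OK, CRITICAL = 0, 2


def evaluate_json(body, required=None):
    """Return (exit_code, message) for a JSON endpoint body."""
    try:
        data = json.loads(body)
    except ValueError:
        return CRITICAL, "CRITICAL: response is not valid JSON"
    if not data:
        # An empty list/object suggests the data import is not running.
        return CRITICAL, "CRITICAL: response is empty"
    if required is not None and required not in body:
        return CRITICAL, f"CRITICAL: '{required}' not found in response"
    return OK, "OK: response contains data"


def evaluate_excerpt(body):
    """Return (exit_code, message) for a log excerpt body."""
    text = body.strip()
    if not text or text.startswith("Unknown run ID"):
        return CRITICAL, "CRITICAL: no log excerpt returned"
    return OK, "OK: log excerpt returned"
```

Both helpers take the body as a string, so they can be exercised without network access; fetching the URL and choosing a suitable recent rev or run ID is left to the caller.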
(In reply to Arpad Borsos (Swatinem) from comment #6)
> So how sophisticated is nagios, what kind of scripting can be used?
>
> http check is ok, as far as the general availability goes, but considering
> it is just a static html page this does not provide as much value.

This is about as far as I want to go with us checking TBPL. Anything other than a 200 will error and page out. Quite a few of our sites do more than this - and we can have sophisticated checks but we like for a lot of that monitoring to be done within the app. For example, AMO has a monitor that shows the health state, and can return something other than 200 (maybe 500?) when a piece is not working. That way, nagios just has to check for the return status: https://addons.mozilla.org/services/monitor.php

So - I'm happy with the checks as they are. Anything more we need to work in the app to tell us what is wrong.
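To illustrate the in-app monitor pattern described here, a minimal Python sketch follows. It is not how AMO's monitor.php is implemented; `run_monitor` and the check names are hypothetical, and the choice of 500 for a failed check simply follows the guess in the comment above.

```python
def run_monitor(checks):
    """Run each named check and return (http_status, results).

    `checks` maps a component name to a zero-argument callable that
    returns a truthy value when the component is healthy.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed component.
            results[name] = False
    # Assumption: any failed component turns the whole page into a 500.
    status = 200 if all(results.values()) else 500
    return status, results
```

With a page like this in the app, the nagios side stays a plain status check, and the per-component results in the response body show which piece is failing.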
Those urls should have been: https://tbpl.mozilla.org/php/getRevisionBuilds.php?branch=mozilla-central&rev=cb70391c86d9 and https://tbpl.mozilla.org/php/getLogExcerpt.php?id=7750610&type=tinderbox_print or any other rev/id that would indicate a problem in data import or parsing of "not yet cached" logs. I filed bug 707726 to implement our own health checks for tbpl, but it does not have a high priority right now.