We don't have any visibility into failures until a human happens to notice (release automation fails, or someone spots it on treeherder). Examples: wonky hardware (bug 1083156), issues starting the signing processes (bug 1210686), and today mac-v2-signing7 getting overwhelmed. The only recourse is to go spelunking in the log, which is pretty chatty. Some sort of metrics/monitoring would be helpful, assuming this system is going to be around for a while. Nagios is only doing simple checks: ping, disk space, load, ntp, and 3x signing procs running.

Some ideas:
* the signing server processes could use syslog to report failures, which would get carried into papertrail, where we can alert via SNS in #buildduty when failures go over some threshold (either an absolute number or a percentage of jobs failing)
* we could graph lots of things: number of successful/failing/pending signing jobs, I/O and CPU load. Not sure exactly how we'd do that; maybe it's OK to push graphite data if it's one-way.
* teach signing servers to say something like "go away, try another signing server" if they are overwhelmed.
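
A rough sketch of the syslog/threshold idea, in case it helps the discussion. This is not taken from the signing server code: `FailureWindow`, `make_logger`, `report`, the window size, the thresholds and the syslog address are all made-up placeholders. It keeps the outcome of the last N jobs and logs at a higher level once either an absolute count or a failure ratio is crossed, which is the kind of message a papertrail alert could match on.

```python
import logging
import logging.handlers
from collections import deque


class FailureWindow:
    """Remember the outcome of the most recent signing jobs (hypothetical helper)."""

    def __init__(self, size=100, max_failures=10, max_failure_ratio=0.25):
        # True = job succeeded, False = job failed; old entries fall off the left.
        self.outcomes = deque(maxlen=size)
        self.max_failures = max_failures
        self.max_failure_ratio = max_failure_ratio

    def record(self, ok):
        self.outcomes.append(bool(ok))

    def failures(self):
        return sum(1 for ok in self.outcomes if not ok)

    def over_threshold(self):
        """True if either the absolute or the percentage threshold is hit."""
        if not self.outcomes:
            return False
        failed = self.failures()
        ratio = failed / len(self.outcomes)
        return failed >= self.max_failures or ratio >= self.max_failure_ratio


def make_logger(name="signing-server", address=("localhost", 514)):
    """Logger that forwards to syslog over UDP; the address is an assumption."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        logger.addHandler(logging.handlers.SysLogHandler(address=address))
    return logger


def report(window, logger, ok, filename):
    """Record one job result, log failures, and return True when over threshold."""
    window.record(ok)
    if not ok:
        logger.error("signing failed: %s", filename)
    if window.over_threshold():
        logger.critical(
            "signing failure threshold exceeded: %d of last %d jobs failed",
            window.failures(),
            len(window.outcomes),
        )
        return True
    return False
```

The server would call `report(window, logger, ok, filename)` once per finished job. The same `over_threshold()` check could arguably also drive the "go away, try another signing server" response, but that is a guess at how the two would fit together.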
Oh yes, the client side needs to be smarter too. In yesterday's mac repacks the clients kept asking a single server for about 5 hours.
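
For the client side, something along these lines might be enough. It is only a sketch under assumptions: `sign_with_failover`, `ServerBusy` and the `sign(server, filename)` callable are hypothetical names, not the existing signing client API. The point is just that a client tries every known server in random order and backs off between rounds instead of hammering one host.

```python
import random
import time


class ServerBusy(Exception):
    """Hypothetical error for a server answering 'go away, try another server'."""


def sign_with_failover(servers, sign, filename, max_rounds=3, base_delay=1.0,
                       sleep=time.sleep):
    """Try each server once per round, in random order, before giving up.

    `sign` is a caller-supplied function sign(server, filename) that returns
    the signed result or raises ServerBusy/OSError on failure.
    """
    last_error = None
    for round_number in range(max_rounds):
        candidates = list(servers)
        random.shuffle(candidates)  # avoid every client picking the same host
        for server in candidates:
            try:
                return sign(server, filename)
            except (ServerBusy, OSError) as exc:
                last_error = exc  # remember why, then move to the next server
        if round_number < max_rounds - 1:
            # Exponential backoff between rounds: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** round_number)
    raise RuntimeError("all signing servers failed for %s" % filename) from last_error
```

With a bounded number of rounds the job fails loudly after a short while rather than retrying one server for hours.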
Component: Tools → General
Product: Release Engineering → Release Engineering