Implement monitoring of signing servers



Release Engineering
2 years ago
11 months ago


(Reporter: nthomas, Unassigned)


Firefox Tracking Flags

(Not tracked)




2 years ago
We don't have any visibility into failures until a human happens to notice (release automation fails, or someone spots it on Treeherder). Examples so far: wonky hardware (bug 1083156), issues starting the signing processes (bug 1210686), and today mac-v2-signing7 getting overwhelmed. The only recourse is to go spelunking in the log, which is pretty chatty.

Some sort of metrics/monitoring would be helpful, assuming this system is going to be around for a while. Nagios only does simple checks: ping, disk space, load, NTP, and that 3 signing processes are running.

* The signing server processes could log failures to syslog, which gets carried into Papertrail, where we could send an SNS alert to #buildduty when failures go over some threshold (either an absolute number or a percentage of jobs).
* We could graph lots of things: number of successful/failing/pending signing jobs, I/O and CPU load. Not sure exactly how we'd do that; maybe it's OK to push Graphite data if it's one-way.
* Teach signing servers to say something like "go away, try another signing server" if they are overwhelmed.
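
To make the failure-threshold idea a bit more concrete, here is a rough Python sketch of the bookkeeping a signing server process might do. Everything in it is an assumption for illustration: `FailureWindow`, `report`, the window size, and the threshold values are hypothetical names and numbers, not anything the signing server has today. It only decides when to emit an error-level log line; getting that line into syslog (for instance by attaching a `logging.handlers.SysLogHandler` to the logger) and alerting on it downstream are left out.

```python
import logging
from collections import deque


class FailureWindow:
    """Track the most recent signing job outcomes (hypothetical helper)."""

    def __init__(self, size=100, max_failures=10, max_failure_ratio=0.2):
        # Only the last `size` outcomes are kept; True = success, False = failure.
        self.outcomes = deque(maxlen=size)
        self.max_failures = max_failures            # absolute threshold (assumed value)
        self.max_failure_ratio = max_failure_ratio  # fractional threshold (assumed value)

    def record(self, ok):
        self.outcomes.append(bool(ok))

    def failures(self):
        return sum(1 for ok in self.outcomes if not ok)

    def over_threshold(self):
        """True if either the absolute or the fractional threshold is reached."""
        if not self.outcomes:
            return False
        failed = self.failures()
        ratio = failed / len(self.outcomes)
        return failed >= self.max_failures or ratio >= self.max_failure_ratio


def report(window, log):
    """Log one summary line; return True when the window is over threshold."""
    failed, total = window.failures(), len(window.outcomes)
    if window.over_threshold():
        log.error("signing failures over threshold: %d/%d", failed, total)
        return True
    log.info("signing failures: %d/%d", failed, total)
    return False
```

Calling `window.record(ok)` after each job and `report(window, log)` periodically would give one greppable line per interval rather than requiring someone to read the whole chatty log.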

Comment 2

2 years ago
Oh yes, the client side needs to be smarter too. In the Mac repacks yesterday, the clients kept asking a single server for about 5 hours.
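
As a sketch of what "smarter" could look like on the client, assuming the client has a list of signing servers to choose from: rotate to the next server after a failure instead of retrying the same one. The names here (`sign_with_rotation`, `ServerBusy`, the `sign_once` callable) are hypothetical and do not correspond to the existing client code; `ServerBusy` stands in for whatever "go away, try another signing server" response the server might give.

```python
import itertools
import time


class ServerBusy(Exception):
    """Hypothetical signal that a server asked the client to go elsewhere."""


def sign_with_rotation(servers, sign_once, max_attempts=6, delay=0.0,
                       sleep=time.sleep):
    """Try servers round-robin rather than retrying a single one.

    `sign_once(server)` is a caller-supplied function (an assumption here)
    that returns the signed result or raises ServerBusy / OSError.
    """
    errors = []
    # cycle() walks the server list repeatedly; islice() caps total attempts.
    for server in itertools.islice(itertools.cycle(servers), max_attempts):
        try:
            return server, sign_once(server)
        except (ServerBusy, OSError) as exc:
            errors.append((server, exc))
            sleep(delay)  # optional pause before moving to the next server
    raise RuntimeError("all signing attempts failed: %r" % (errors,))
```

With something along these lines, one overwhelmed server costs each job a single failed attempt before the next server is tried, and the attempt cap means a job eventually fails loudly instead of retrying for hours.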


11 months ago
Component: Tools → General
Product: Release Engineering → Release Engineering