We've seen this Nagios alert for most of the day, and it's been causing sadness all around:

21:35 < nagios-sjc1> slavealloc.build.scl1:http_expect - slavealloc.build.scl1 is CRITICAL: (Service Check Timed Out)

Slavealloc has slowed down enough that nginx is not willing to wait for it. The net effect on production is that slaves wait for a while when starting up, until nginx times out and the slave falls back to its old buildbot.tac. I can get rid of the wait by turning slavealloc off.
So the root cause here is slow DNS in scl1 (bug 666487). Two things fixed it:
1. run nscd
2. don't call socket.getfqdn() for every request
Patch for the latter momentarily.
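For illustration only, the second fix can be sketched as looking up the FQDN once and reusing it, so slow DNS costs at most one lookup per process rather than one per request. This is not the attached patch: the class name `CachedFQDN` and its `resolver` parameter are hypothetical, and only `socket.getfqdn()` comes from the comment above.

```python
import socket


class CachedFQDN:
    """Resolve the local FQDN on first use, then reuse the result.

    Hypothetical helper for illustration; not taken from the slavealloc code.
    """

    def __init__(self, resolver=socket.getfqdn):
        # The resolver is injectable so the lookup can be stubbed out in tests.
        self._resolver = resolver
        self._value = None

    def get(self):
        # Only the first call pays the (possibly slow) DNS lookup.
        if self._value is None:
            self._value = self._resolver()
        return self._value


# One shared instance per process; request handlers would call fqdn.get()
# instead of calling socket.getfqdn() directly.
fqdn = CachedFQDN()
```

The trade-off is that a hostname change is not noticed until the process restarts, which seems acceptable for a long-running service on a fixed host.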
Created attachment 541275: m666486-tools-p1-r1.patch
Easy fix
Attachment #541275 - Flags: review?(nrthomas)
Attachment #541275 - Flags: review?(nrthomas) → review+
landed and deployed.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering