nagios alerted about 6K pending Win8 jobs, which is far too many, so I looked at https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-w864-ix which shows around 30 with a reasonable time since their last job, and the rest between 3 and 5 hours idle. Closed all non-try Firefox trees.
Rebooting machines doesn't bring them back online, e.g. t-w864-ix-325. It's as if they never call runslave (i.e. nothing in c:\slave has a modified time after the reboot). I looked in the event viewer and found:

  Information  5/2/2017 8:11:37 PM  GroupPolicy (Microsoft-Windows-GroupPolicy)  1501  None
  Information  5/2/2017 7:27:40 PM  GroupPolicy (Microsoft-Windows-GroupPolicy)  1500  None
  Information  5/2/2017 5:49:39 PM  GroupPolicy (Microsoft-Windows-GroupPolicy)  1500  None
  Information  5/2/2017 4:11:39 PM  GroupPolicy (Microsoft-Windows-GroupPolicy)  1502  None
  Information  5/2/2017 2:33:30 PM  GroupPolicy (Microsoft-Windows-GroupPolicy)  1502  None
  Information  5/2/2017 2:33:13 PM  GroupPolicy (Microsoft-Windows-GroupPolicy)  1501  None
  Information  5/2/2017 2:19:53 PM  GroupPolicy (Microsoft-Windows-GroupPolicy)  1502  None
  Information  5/1/2017 6:36:33 AM  GroupPolicy (Microsoft-Windows-GroupPolicy)  1500  None

At 5/2/2017 2:19:53 PM, 2:33:30 PM, and 4:11:39 PM (PDT) the message was:

  The Group Policy settings for the computer were processed successfully. New settings from 41 Group Policy objects were detected and applied.

Otherwise it was:

  The Group Policy settings for the computer were processed successfully. There were no changes detected since the last successful processing of Group Policy.

The first message hasn't been seen since April 14th, so I think we had some GPO changes today (possibly related to bug 1358307).
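For anyone repeating this check on another host, the event list above can be filtered mechanically. This is only a rough sketch: it assumes the events are pasted as one line each in the column order shown by Event Viewer, and it treats event ID 1502 as the "new settings were detected and applied" case because that is what the three 1502 entries in this comment said. The names parse_event and policy_changes are made up for illustration.

```python
from datetime import datetime

# Event lines as copied from Event Viewer, one per line:
# level, date, time, AM/PM, source (two tokens), event ID, task category.
EVENTS = """\
Information 5/2/2017 8:11:37 PM GroupPolicy (Microsoft-Windows-GroupPolicy) 1501 None
Information 5/2/2017 7:27:40 PM GroupPolicy (Microsoft-Windows-GroupPolicy) 1500 None
Information 5/2/2017 5:49:39 PM GroupPolicy (Microsoft-Windows-GroupPolicy) 1500 None
Information 5/2/2017 4:11:39 PM GroupPolicy (Microsoft-Windows-GroupPolicy) 1502 None
Information 5/2/2017 2:33:30 PM GroupPolicy (Microsoft-Windows-GroupPolicy) 1502 None
Information 5/2/2017 2:33:13 PM GroupPolicy (Microsoft-Windows-GroupPolicy) 1501 None
Information 5/2/2017 2:19:53 PM GroupPolicy (Microsoft-Windows-GroupPolicy) 1502 None
Information 5/1/2017 6:36:33 AM GroupPolicy (Microsoft-Windows-GroupPolicy) 1500 None
"""

# Assumption based on this bug: the 1502 entries were the ones whose message
# said new settings were detected and applied.
CHANGE_EVENT_IDS = {1502}


def parse_event(line):
    """Return (timestamp, event_id) for one pasted Event Viewer line."""
    parts = line.split()
    when = datetime.strptime(" ".join(parts[1:4]), "%m/%d/%Y %I:%M:%S %p")
    event_id = int(parts[-2])  # second-to-last token is the event ID
    return when, event_id


def policy_changes(text):
    """Return the sorted timestamps at which Group Policy changes were applied."""
    events = [parse_event(line) for line in text.splitlines() if line.strip()]
    return sorted(when for when, event_id in events if event_id in CHANGE_EVENT_IDS)


for when in policy_changes(EVENTS):
    print(when.strftime("%m/%d/%Y %I:%M:%S %p"))
```

Lining the resulting timestamps up against the time the machines stopped taking jobs is what pointed at a GPO change here.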
<Q> I had every machine recreate the scheduled task and close locks on the runslave log
<Q> I found a bunch of hosts that could open the log and stopped
<Q> they seem to come back after reboot now
<Q> Tried 2 and they both worked

I'm rebooting the hosts; it seems to take a couple of reboots to get them connected back to a buildbot master.
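A quick way to tell whether a log file is still locked on a given host is to try opening it for append. This is a rough sketch only, not what Q actually ran: is_locked is a made-up helper, the real runslave log path is not given in this bug so the caller has to supply it, and it relies on Windows refusing the open (Python raises PermissionError) while another process holds the file without sharing write access.

```python
import os
import tempfile


def is_locked(path):
    """Return True if an existing file cannot be opened for appending."""
    if not os.path.exists(path):
        # Nothing to lock; also avoids creating the file as a side effect.
        return False
    try:
        with open(path, "a"):
            return False
    except PermissionError:
        # On Windows a sharing violation surfaces as PermissionError.
        return True


# Usage with a throwaway file standing in for the runslave log.
with tempfile.TemporaryDirectory() as tmp:
    sample_log = os.path.join(tmp, "sample.log")
    with open(sample_log, "w") as handle:
        handle.write("started\n")
    print(sample_log, "locked" if is_locked(sample_log) else "not locked")
```

A host where this reports the log as locked after a reboot would be a candidate for the scheduled-task recreation described above.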
Everything (which was stuck as of an hour ago on slavehealth) has had at least one reboot scheduled. I'll check back again in an hour or so.
Rebooted another 6 which needed a second go, and t-w864-ix-322 manually because t-w864-ix-322.build.mozilla.org doesn't exist in DNS (Alin is going to file that separately). Backlog is clearing nicely with the pool running again, over to Tomcat.
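Hosts that are missing from DNS, like t-w864-ix-322 above, can be spotted before scheduling reboots by name. A minimal sketch, assuming plain forward lookups are enough: missing_from_dns is a made-up helper, and the example uses a stub resolver (fake_resolve, with a documentation-range address) so that it runs without network access. In real use the default socket.gethostbyname would do the lookups.

```python
import socket


def missing_from_dns(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames for which the forward lookup fails."""
    missing = []
    for name in hostnames:
        try:
            resolve(name)
        except socket.gaierror:
            missing.append(name)
    return missing


# Stub resolver standing in for real DNS; the address is a placeholder.
KNOWN = {"t-w864-ix-325.build.mozilla.org": "192.0.2.10"}


def fake_resolve(name):
    if name not in KNOWN:
        raise socket.gaierror("name not known: " + name)
    return KNOWN[name]


hosts = ["t-w864-ix-322.build.mozilla.org", "t-w864-ix-325.build.mozilla.org"]
print(missing_from_dns(hosts, resolve=fake_resolve))
```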
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
trees reopen at 1am pacific
Q, do you have any thoughts about the root cause here?
Nothing that I can find other than the runlogs being locked.