Trees closed: most Win8 slaves not running jobs

Status: RESOLVED FIXED
Product: Infrastructure & Operations
Component: CIDuty
Priority: --
Severity: blocker
Opened: a year ago
Last modified: 2 months ago

Reporter: philor
Assignee: Unassigned


Description (Reporter, a year ago)
Nagios alerted about 6K pending Win8 jobs, which is far too many, so I looked at https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-w864-ix which showed around 30 slaves with a reasonable time since their last job, and the rest idle for between 3 and 5 hours.

Closed all non-try Firefox trees.
Rebooting machines doesn't bring them back online, eg t-w864-ix-325. It's like they never call runslave (ie nothing in c:\slave has a modified time after the reboot). I looked in the event viewer and found:

Information	5/2/2017 8:11:37 PM	GroupPolicy (Microsoft-Windows-GroupPolicy)	1501	None
Information	5/2/2017 7:27:40 PM	GroupPolicy (Microsoft-Windows-GroupPolicy)	1500	None
Information	5/2/2017 5:49:39 PM	GroupPolicy (Microsoft-Windows-GroupPolicy)	1500	None
Information	5/2/2017 4:11:39 PM	GroupPolicy (Microsoft-Windows-GroupPolicy)	1502	None
Information	5/2/2017 2:33:30 PM	GroupPolicy (Microsoft-Windows-GroupPolicy)	1502	None
Information	5/2/2017 2:33:13 PM	GroupPolicy (Microsoft-Windows-GroupPolicy)	1501	None
Information	5/2/2017 2:19:53 PM	GroupPolicy (Microsoft-Windows-GroupPolicy)	1502	None
Information	5/1/2017 6:36:33 AM	GroupPolicy (Microsoft-Windows-GroupPolicy)	1500	None

At 5/2/2017 2:19:53 PM, 2:33:30 PM, and 4:11:39 PM (PDT) the message was:
The Group Policy settings for the computer were processed successfully. New settings from 41 Group Policy objects were detected and applied.

otherwise:
The Group Policy settings for the computer were processed successfully. There were no changes detected since the last successful processing of Group Policy.

The first message hasn't been seen since April 14th, so I think we had some GPO changes today (possibly related to bug 1358307).
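For reference, the "new settings applied" vs. "no changes" pattern above can be picked out mechanically. This is a hypothetical sketch (not tooling we actually ran) that filters the tab-separated event-viewer lines pasted above for event ID 1502, which in this log lined up with the "New settings from 41 Group Policy objects were detected and applied" message:

```python
# Sketch: find GroupPolicy events where new GPOs were applied.
# Fields are the tab-separated "Level / Time / Source / EventID / Task"
# layout from the event-viewer paste above.
EVENTS = """\
Information\t5/2/2017 8:11:37 PM\tGroupPolicy (Microsoft-Windows-GroupPolicy)\t1501\tNone
Information\t5/2/2017 7:27:40 PM\tGroupPolicy (Microsoft-Windows-GroupPolicy)\t1500\tNone
Information\t5/2/2017 5:49:39 PM\tGroupPolicy (Microsoft-Windows-GroupPolicy)\t1500\tNone
Information\t5/2/2017 4:11:39 PM\tGroupPolicy (Microsoft-Windows-GroupPolicy)\t1502\tNone
Information\t5/2/2017 2:33:30 PM\tGroupPolicy (Microsoft-Windows-GroupPolicy)\t1502\tNone
Information\t5/2/2017 2:33:13 PM\tGroupPolicy (Microsoft-Windows-GroupPolicy)\t1501\tNone
Information\t5/2/2017 2:19:53 PM\tGroupPolicy (Microsoft-Windows-GroupPolicy)\t1502\tNone
Information\t5/1/2017 6:36:33 AM\tGroupPolicy (Microsoft-Windows-GroupPolicy)\t1500\tNone
"""

def gpo_change_times(raw):
    """Return timestamps of events where new GPO settings were applied."""
    times = []
    for line in raw.splitlines():
        level, stamp, source, event_id, task = line.split("\t")
        if event_id == "1502":  # the ID that matched the "applied" message here
            times.append(stamp)
    return times

print(gpo_change_times(EVENTS))
```

This matches the three timestamps called out above (2:19:53 PM, 2:33:30 PM, 4:11:39 PM).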
Flags: needinfo?(q)
<Q> I had every machine recreate the scheduled task and close locks on the runslave log
<Q> I found a bunch of hosts that couldn't open the log and had stopped
<Q> they seem to come back after reboot now
<Q> Tried 2 and they both worked

I'm rebooting the hosts, it seems to take a couple of reboots to get them connected back to a buildbot master.
Flags: needinfo?(q)
Everything (which was stuck as of an hour ago on slavehealth) has had at least one reboot scheduled. I'll check back again in an hour or so.
Rebooted another 6 which needed a second go, and t-w864-ix-322 manually because t-w864-ix-322.build.mozilla.org doesn't exist in DNS (Alin is going to file that separately).
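The "stuck slave" symptom from comment 0 (nothing in c:\slave modified after the reboot) can be scripted as a health check. A minimal sketch, assuming only a directory path and a boot timestamp as inputs; on the real t-w864-ix hosts the root would be c:\slave and the timestamp the last boot time:

```python
import os

def touched_since(root, since_epoch):
    """Return True if any file under `root` was modified after `since_epoch`.

    Hypothetical sketch of the check from comment 0: if nothing under the
    slave directory has a modified time after the reboot, runslave probably
    never ran.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                if os.path.getmtime(os.path.join(dirpath, name)) > since_epoch:
                    return True
            except OSError:
                pass  # file vanished mid-walk; ignore it
    return False
```

A host where this returns False well after its reboot is a candidate for another reboot (or for the locked-runslave-log fix described above).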

Backlog is clearing nicely with the pool running again, over to Tomcat.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
Trees reopen at 1am Pacific.
Q, do you have any thoughts about the root cause here?
Flags: needinfo?(q)

Comment 7

a year ago
Nothing that I can find other than the runlogs being locked.
Flags: needinfo?(q)

Updated

2 months ago
Product: Release Engineering → Infrastructure & Operations