Closed Bug 1307803 Opened 8 years ago Closed 7 years ago

Windows worker hanging on build failure

Categories

(Taskcluster :: Workers, defect)

Not set
major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: ted, Assigned: pmoore)

Attachments

(2 files)

Attached file Build log
I did a try push here, and all the Windows taskcluster builds failed:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3dc4cbe71997

However, they failed with 'claim-expired', like:
https://tools.taskcluster.net/task-inspector/#NIaiZPtqQvuzrNcCD4WaCw/0

Looking at one that was still running:
https://tools.taskcluster.net/task-inspector/#NIaiZPtqQvuzrNcCD4WaCw/1

I saw that the build had failed:
<...>
13:12:49     INFO -  client.mk:373: recipe for target 'configure' failed
13:12:49     INFO -  mozmake.EXE: *** [configure] Error 1
14:32:49     INFO - Automation Error: mozprocess timed out after 4800 seconds running ['C:\\mozilla-build\\msys\\bin\\bash.exe', 'Z:\\task_1475671472\\build\\src\\mach', '--log-no-times', 'build', '-v']
14:32:49    ERROR - timed out after 4800 seconds of no output
<...>

I'm not sure what's happening here, but something is failing to notice that the build is broken. This shouldn't have hit the mozprocess timeout either; it's just a simple failure. I've attached the full build log that the snippet above was cut from.
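
The timestamps in the snippet are consistent with the process producing no output at all after the configure failure: the gap between the mozmake error and the timeout message matches the 4800-second mozprocess window. A quick check, assuming both timestamps fall on the same day (the function name is illustrative only):

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> int:
    """Return the number of seconds between two HH:MM:SS log timestamps.

    Assumes both timestamps are from the same day.
    """
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Timestamps copied from the log snippet above.
gap = seconds_between("13:12:49", "14:32:49")
print(gap)  # 4800
```

So the last output before the timeout appears to be the mozmake error itself.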
Severity: normal → major
Assignee: nobody → pmoore
I'm not sure what caused this, but I see it is running on generic worker 5.2.0.

There has been a redesign of failure handling in generic worker which has landed this evening in 6.0.0, so my proposal would be that we try first with the refactored worker, to see if that fixes things.

I'll raise a PR against https://github.com/mozilla-releng/OpenCloudConfig for this - then when it merges, the rollout should be automatic, as far as I understand.
Attachment #8798157 - Flags: review?(rthijssen)
(note: the last line of the log also scared me, as it looked like some funky process magic might be going on)
I looked at one of the instances that had an issue with a run of this task: instance i-0471eb89d40b860a7

Looking in papertrail I see the following:

Oct 05 15:57:27 win2012-i-0471eb89d40b860a7 generic-worker: 2016/10/05 15:57:26 Reclaiming task NIaiZPtqQvuzrNcCD4WaCw... 
Oct 05 15:57:28 win2012-i-0471eb89d40b860a7 generic-worker: 2016/10/05 15:57:27 Reclaimed task NIaiZPtqQvuzrNcCD4WaCw successfully. 
Oct 05 15:58:02 win2012-i-0471eb89d40b860a7 HaltOnIdle: Is-Running :: generic-worker is running. 
Oct 05 15:58:02 win2012-i-0471eb89d40b860a7 HaltOnIdle: instance appears to be productive. 
Oct 05 15:59:27 win2012-i-0471eb89d40b860a7 Microsoft-Windows-DSC: Job {B0A72FFE-8B14-11E6-8149-126690B82E1F} :   Configuration is sent from computer NULL by user sid S-1-5-18. 
Oct 05 16:03:01 win2012-i-0471eb89d40b860a7 HaltOnIdle: Is-Running :: generic-worker is running. 
Oct 05 16:03:01 win2012-i-0471eb89d40b860a7 HaltOnIdle: instance appears to be productive. 
Oct 05 16:08:01 win2012-i-0471eb89d40b860a7 HaltOnIdle: Is-Running :: generic-worker is running. 
Oct 05 16:08:01 win2012-i-0471eb89d40b860a7 HaltOnIdle: instance appears to be productive. 


That was the last thing in the papertrail logs from this worker (note I was searching this a few hours after that last log message).

I'm not sure what caused it to stop logging. I no longer see that machine in the AWS console, so clearly it's gone.

Is it possible this machine crashed or was terminated by other means?  Does the generic worker monitor AWS spot kills and terminate gracefully?

When the machines shut down, is there something I could see in the logs that would indicate that?

Sorry Pete, I attempted to determine what was going wrong without much luck.
So after talking with Pete, it appears the generic worker does not monitor the spot termination endpoint AWS provides to determine if it will be spot killed in the next 2 minutes.  I'm not sure if that is the cause of this, but some symptoms point that way.
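
For reference, a minimal sketch of what monitoring that endpoint could look like. This is not generic-worker code; the function names are hypothetical. It assumes the instance metadata service is reachable without a session token (IMDSv1-style access) and relies on the documented behaviour that the `spot/termination-time` path returns 404 until a termination has been scheduled.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

# EC2 instance metadata path for spot termination notices.
# It returns HTTP 404 until AWS has scheduled the instance for termination.
SPOT_TERMINATION_URL = (
    "http://169.254.169.254/latest/meta-data/spot/termination-time"
)


def fetch_termination_time(
    url: str = SPOT_TERMINATION_URL, timeout: float = 2.0
) -> Optional[str]:
    """Return the scheduled termination time, or None if none is scheduled.

    Assumption: token-less (IMDSv1-style) access is enabled on the instance.
    Errors other than HTTP 404 (for example, no network) are propagated.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def should_shut_down(
    fetch: Callable[[], Optional[str]] = fetch_termination_time,
) -> bool:
    """Return True once a spot termination notice has been issued.

    The fetcher is injectable so the check can be exercised off EC2.
    """
    return fetch() is not None
```

A worker would call `should_shut_down()` every few seconds and, once it returns True, stop claiming work and report the running task as interrupted before the instance disappears.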

I've added bug 1308224
Depends on: 1308224
Comment on attachment 8798157
Github Pull Request for OpenCloudConfig

I think we're going to skip 6.0.0 for gecko-1-b-win2012 and gecko-3-b-win2012 and move to the next release.
Attachment #8798157 - Flags: review?(rthijssen) → review-
No longer depends on: 1308054
Component: Worker → Generic-Worker
I believe this no longer occurs - ted, ok for me to close?
Flags: needinfo?(ted)
Sure, I haven't seen this occur in anything I've done lately.
Flags: needinfo?(ted)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
Component: Generic-Worker → Workers