626486 - Talos slaves often get stuck with ProcessExitedAlready

Reporter

Description

•

14 years ago

Here's the traceback:

2011-01-14 20:05:53-0800 [-] command timed out: 600 seconds without output
2011-01-14 20:05:53-0800 [-] self.process has no pid
2011-01-14 20:05:53-0800 [-] trying process.signalProcess('KILL')
2011-01-14 20:05:53-0800 [-] Unhandled Error
    Traceback (most recent call last):
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/application/app.py", line 445, in startReactor
        self.config, oldstdout, oldstderr, self.profiler, reactor)
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/application/app.py", line 348, in runReactorWithLogging
        reactor.run()
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py", line 1166, in run
        self.mainLoop()
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py", line 1175, in mainLoop
        self.runUntilCurrent()
    --- <exception caught here> ---
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py", line 779, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/slave/commands/base.py", line 726, in doTimeout
        self.kill(msg)
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/slave/commands/base.py", line 791, in kill
        self.process.signalProcess(self.KILL)
      File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/process.py", line 333, in signalProcess
        raise ProcessExitedAlready()
    twisted.internet.error.ProcessExitedAlready:

Dustin J. Mitchell [:dustin] (he/him)

Comment 1

•

14 years ago

I'd like to replicate this locally. Any idea how the jetpack builds happen to cause this more than others? Do they quit too quickly?

Armen [:armenzg]

Updated

•

14 years ago

Whiteboard: [stuck]

Chris Cooper [:coop] (he/him)

Updated

•

14 years ago

Priority: -- → P3

Whiteboard: [stuck] → [buildslaves][talos][stuck]

Chris Cooper [:coop] (he/him)

Comment 2

•

14 years ago

Assigning to Dustin based on comment #1.

Assignee: nobody → dustin

Dustin J. Mitchell [:dustin] (he/him)

Comment 3

•

14 years ago

How are we working around these? They seem to leave XP machines in the "shutdown is already in progress" state.

Chris AtLee [:catlee]

Reporter

Comment 4

•

14 years ago

(In reply to comment #3)
> How are we working around these? They seem to leave XP machines in the
> "shutdown is already in progress" state..

I've had luck ssh'ing in and doing 'shutdown -r -t 0'

Armen [:armenzg]

Comment 5

•

14 years ago

(In reply to comment #1)
> I'd like to replicate this locally. Any idea how the jetpack builds happen to
> cause this more than others? Do they quit too quickly?

ctalbert explains it here: https://bugzilla.mozilla.org/show_bug.cgi?id=627070#c4
There is also a log.

Sometimes I have needed to use remote desktop since VNC might be dead because of the already initiated reboot (this is if ssh'ing does not work).

Dustin J. Mitchell [:dustin] (he/him)

Comment 6

•

14 years ago

That bug sounds like the jetpack runner is not exiting, yet Buildbot gets ProcessExitedAlready. Does this exception only happen on Windows? Re-assigning to nobody since I'm still playing catch-up here.

Assignee: dustin → nobody

Chris AtLee [:catlee]

Reporter

Comment 7

•

14 years ago

(In reply to comment #6)
> That bug sounds like the jetpack runner is not exiting, yet Buildbot gets
> ProcessExitedAlready.
>
> Does this exception only happen on Windows?

No, I've seen it on linux and mac slaves too.

Chris AtLee [:catlee]

Reporter

Comment 8

•

14 years ago

See https://bugzilla.mozilla.org/show_bug.cgi?id=629527#c18 for steps to reproduce on linux at least. For me this doesn't block a reconfig from happening though.

Dustin J. Mitchell [:dustin] (he/him)

Comment 9

•

14 years ago

OK, a little bit of research on this topic (noting that there are a lot of only partially related slave-side problems in play here):

An except block for ProcessExitedAlready was added in https://github.com/buildbot/buildbot/commit/91343c2ba3fa840cca3d358087878a72ad8085c4 which isn't released yet; it will be in 0.8.4. I'm working on an improvement to it, as the currently-committed code will try to fire a deferred twice, which is no good.

The exception occurs when either
(a) buildbot tries to kill the process and races against the process dying itself, or
(b) the process has double-forked, with the parent exiting. There are still members of the process group holding file descriptors open, but the original process has already exited.

Twisted's process Protocol classes (wisely) don't call processEnded until the process has exited *and* all of the stdio pipes have run dry. So the process is gone and can't be killed, but processEnded hasn't been called yet.

If usePTY=True, then in this situation the parent process is a session leader, so when it exits the child process gets a SIGHUP, which kills the process by default (and would thus close the stdio pipes and wrap everything up). If usePTY=False, however, the child process holds those file descriptors open but is inaccessible to buildbot since a proper process group has not been set up.

I suspect the latter is the case for Jetpack: the build or test is leaving a child process running, holding file descriptors open, and that child process is either ignoring SIGHUP, or we're not using usePTY on these slaves, or the test process has munged sessions and terminals up enough to not get the SIGHUP.

Failing to catch that exception means simply that the attempt to kill the process fails, and the leftover child is free to keep running. It's as if the buildstep was run with timeout=None. It makes sense that this doesn't block a reconfig, and that it will hang the slave in the step forever.

So that's an analysis of this particular problem. The solution is complicated, not least because things have changed a lot since Buildbot-0.8.0. I'm writing some Buildbot tests to see if I can narrow down failure cases.
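For anyone trying to reproduce case (b) outside Buildbot, here is a minimal sketch using only the Python standard library on a POSIX system. It is not Buildbot or Twisted code; the function name demo_orphaned_pipe and the shell command are illustrative assumptions. It shows that the directly spawned process can exit almost immediately while its stdout pipe stays open because a background child inherited it.

```python
import subprocess
import time


def demo_orphaned_pipe(child_sleep=1):
    """Spawn a shell that backgrounds a sleeper and exits right away.

    Hypothetical helper for illustration only. Returns the time at which
    the shell exited, the time at which stdout reached EOF, and the output.
    """
    start = time.monotonic()
    proc = subprocess.Popen(
        ["sh", "-c", "sleep %d & echo parent-done" % child_sleep],
        stdout=subprocess.PIPE,
    )
    # The shell itself exits as soon as it has echoed its line.
    proc.wait()
    exited_after = time.monotonic() - start
    # The backgrounded sleep inherited the write end of the pipe, so EOF
    # only arrives once that child has exited as well.
    output = proc.stdout.read()
    eof_after = time.monotonic() - start
    proc.stdout.close()
    return exited_after, eof_after, output


if __name__ == "__main__":
    exited_after, eof_after, output = demo_orphaned_pipe(1)
    print("process exited after %.2fs" % exited_after)
    print("stdout hit EOF after %.2fs" % eof_after)
    print("output: %r" % output)
```

In the window between the two timestamps, the original process is already gone, so signalling it cannot succeed, yet a supervisor that waits for the pipes to drain still considers the command unfinished.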

Dustin J. Mitchell [:dustin] (he/him)

Comment 10

•

14 years ago

After a bit more analysis and discussion, it looks like the underlying problem is the inability to kill entire process groups without usePTY. The ProcessExitedAlready exception is problematic in that it prevents the step from finishing, and the fix is easy, so let's do that here -- it's just merging the upstream commit mentioned in comment 9.
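To illustrate what killing an entire process group looks like, here is a rough standard-library sketch for POSIX systems. It is an analogy under stated assumptions, not the upstream Buildbot change: the helper name run_and_kill_group is made up, and start_new_session=True stands in for whatever session or group setup the slave would actually do. The ProcessLookupError handler plays the role that catching ProcessExitedAlready plays in Twisted.

```python
import os
import signal
import subprocess
import time


def run_and_kill_group(grandchild_sleep=30):
    """Kill a leftover background child through its process group.

    Hypothetical helper for illustration only. Returns the captured output
    and how long it took to reach EOF after sending the signal.
    """
    proc = subprocess.Popen(
        ["sh", "-c", "sleep %d & echo parent-done" % grandchild_sleep],
        stdout=subprocess.PIPE,
        start_new_session=True,  # child becomes session and group leader
    )
    # The shell exits quickly; the backgrounded sleep stays in its group
    # and keeps the stdout pipe open.
    proc.wait()
    start = time.monotonic()
    try:
        # The group id equals the pid of the original group leader.
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        # Every member of the group is already gone; nothing left to kill.
        pass
    # With the group dead, the pipe drains and EOF arrives promptly.
    output = proc.stdout.read()
    proc.stdout.close()
    return output, time.monotonic() - start


if __name__ == "__main__":
    output, elapsed = run_and_kill_group()
    print("output: %r, EOF %.2fs after killpg" % (output, elapsed))
```

Without the separate group, only the already-exited shell could be addressed by pid, and the sleeper would hold the pipe open for its full duration.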

Blocks: 631851

Depends on: 631849

Dustin J. Mitchell [:dustin] (he/him)

Comment 11

•

14 years ago

blocked on deciding what version of buildslave to upgrade to (bug 631854)

Dustin J. Mitchell [:dustin] (he/him)

Comment 12

•

14 years ago

And the decision is to upgrade to 0.8.4-pre, which already includes a fix for this problem. Yay! I'm leaving this open until bug 631851 lands.

Dustin J. Mitchell [:dustin] (he/him)

Comment 13

•

14 years ago

Let's call this fixed, just not deployed yet.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Updated

•

14 years ago

Blocks: 629263

Nobody; OK to take it and work on it

Assignee

Updated

•

12 years ago

Product: mozilla.org → Release Engineering

Bugzilla

Talos slaves often get stuck with ProcessExitedAlready

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

People

(Reporter: catlee, Unassigned)

References

Details

(Whiteboard: [buildslaves][talos][stuck])

Security

(public)
