Here's the traceback:

2011-01-14 20:05:53-0800 [-] command timed out: 600 seconds without output
2011-01-14 20:05:53-0800 [-] self.process has no pid
2011-01-14 20:05:53-0800 [-] trying process.signalProcess('KILL')
2011-01-14 20:05:53-0800 [-] Unhandled Error
Traceback (most recent call last):
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/application/app.py", line 445, in startReactor
    self.config, oldstdout, oldstderr, self.profiler, reactor)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/application/app.py", line 348, in runReactorWithLogging
    reactor.run()
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py", line 1166, in run
    self.mainLoop()
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py", line 1175, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py", line 779, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/slave/commands/base.py", line 726, in doTimeout
    self.kill(msg)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/slave/commands/base.py", line 791, in kill
    self.process.signalProcess(self.KILL)
  File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/process.py", line 333, in signalProcess
    raise ProcessExitedAlready()
twisted.internet.error.ProcessExitedAlready:
I'd like to replicate this locally. Any idea how the jetpack builds happen to cause this more than others? Do they quit too quickly?
Assigning to Dustin based on comment #1.
How are we working around these? They seem to leave XP machines in the "shutdown is already in progress" state..
(In reply to comment #3)
> How are we working around these? They seem to leave XP machines in the
> "shutdown is already in progress" state..

I've had luck ssh'ing in and doing 'shutdown -r -t 0'.
(In reply to comment #1)
> I'd like to replicate this locally. Any idea how the jetpack builds happen to
> cause this more than others? Do they quit too quickly?

ctalbert explains it here: https://bugzilla.mozilla.org/show_bug.cgi?id=627070#c4 There is also a log.

Sometimes I have needed to use remote desktop, since VNC might be dead because of the already-initiated reboot (this is when ssh'ing does not work either).
That bug sounds like the jetpack runner is not exiting, yet Buildbot gets ProcessExitedAlready. Does this exception only happen on Windows? Re-assigning to nobody since I'm still playing catch-up here.
(In reply to comment #6)
> That bug sounds like the jetpack runner is not exiting, yet Buildbot gets
> ProcessExitedAlready.
>
> Does this exception only happen on Windows?

No, I've seen it on Linux and Mac slaves too.
See https://bugzilla.mozilla.org/show_bug.cgi?id=629527#c18 for steps to reproduce on linux at least. For me this doesn't block a reconfig from happening though.
OK, a little bit of research on this topic (noting that there are a lot of only partially related slave-side problems in play here):

An except block for ProcessExitedAlready was added in https://github.com/buildbot/buildbot/commit/91343c2ba3fa840cca3d358087878a72ad8085c4, which isn't released yet - it will be in 0.8.4. I'm working on an improvement to it, as the currently-committed code will try to fire a deferred twice - no good.

The exception occurs when either (a) buildbot tries to kill the process and races against the process dying on its own, or (b) the process has double-forked, with the parent exiting. In the latter case there are still members of the process group holding file descriptors open, but the original process has already exited. Twisted's process Protocol classes (wisely) don't call processEnded until the process has exited *and* all of the stdio pipes have run dry. So the process is gone and can't be killed, but processEnded hasn't been called yet.

If usePTY=True, then in this situation the parent process is a session leader, so when it exits the child process gets a SIGHUP, which kills the process by default (and would thus close the stdio pipes and wrap everything up). If usePTY=False, however, the child process holds those file descriptors open but is inaccessible to buildbot, since a proper process group has not been set up.

I suspect the latter is the case for Jetpack: the build or test is leaving a child process running, holding file descriptors open, and that child process is either ignoring SIGHUP, or we're not using usePTY on these slaves, or the test process has munged sessions and terminals up enough not to get the SIGHUP.

Failing to catch that exception means simply that the attempt to kill the process fails, and it is free to keep running - as if the buildstep had been run with timeout=None. That explains why this doesn't block a reconfig, and why it hangs the slave in the step forever.

So that's an analysis of this particular problem.
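For anyone who wants to see the mechanism without Twisted in the picture, here's a small stdlib-only sketch (my own illustration, not buildbot code): a shell backgrounds a sleeper and exits, so the shell's PID is unkillable almost immediately (the stdlib analogue of ProcessExitedAlready), yet its stdout pipe stays open until the orphaned sleeper dies - which is exactly why processEnded waits.

```python
import os
import signal
import subprocess
import time

# A shell that backgrounds a sleeper and exits. The backgrounded job
# inherits the shell's stdout, so the pipe outlives the shell process.
proc = subprocess.Popen(
    ["/bin/sh", "-c", "sleep 2 & exit 0"],
    stdout=subprocess.PIPE,
)
proc.wait()  # the shell itself is gone almost immediately

# Signalling the already-exited (and reaped) shell fails -- the stdlib
# analogue of Twisted raising ProcessExitedAlready:
already_exited = False
try:
    os.kill(proc.pid, signal.SIGKILL)
except ProcessLookupError:
    already_exited = True

# ...yet a reader waiting for EOF (as Twisted does before calling
# processEnded) blocks until the orphaned sleeper closes the pipe.
start = time.monotonic()
leftover = proc.stdout.read()  # blocks ~2s, then returns b""
waited = time.monotonic() - start
```

With usePTY=True the SIGHUP delivered on session-leader exit would normally kill the sleeper and close the pipe right away; with usePTY=False nothing does, which is the hang described above.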
The solution is complicated, not least because things have changed a lot since Buildbot-0.8.0. I'm writing some Buildbot tests to see if I can narrow down failure cases.
After a bit more analysis and discussion, it looks like the underlying problem is the inability to kill entire process groups without usePTY. The ProcessExitedAlready exception is problematic in that it prevents the step from finishing, and the fix is easy, so let's do that here - it's just a matter of merging the upstream commit mentioned in comment 9.
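The shape of that fix is small: wrap the signalProcess call and treat ProcessExitedAlready as a lost (but harmless) race, without firing the step's deferred a second time. A minimal sketch of the pattern - the class names and the local exception stand in for Twisted's and buildbot's real ones, since the actual change lives in buildbot/slave/commands/base.py:

```python
class ProcessExitedAlready(Exception):
    """Stand-in for twisted.internet.error.ProcessExitedAlready."""


class KillableStep:
    """Sketch of a guarded kill path, assuming a Twisted-like process
    object exposing signalProcess(name)."""

    def __init__(self, process):
        self.process = process
        self.killed = False

    def kill(self, signame="KILL"):
        try:
            self.process.signalProcess(signame)
        except ProcessExitedAlready:
            # Lost the race: the process died on its own. processEnded
            # will still fire once the stdio pipes run dry, so do NOT
            # fire the step's deferred here - that would fire it twice.
            return False
        self.killed = True
        return True


# A fake process that has already exited, to exercise the except path.
class ExitedProcess:
    def signalProcess(self, signame):
        raise ProcessExitedAlready()


step = KillableStep(ExitedProcess())
result = step.kill()  # False: nothing to kill, and nothing fired twice
```

Note this only fixes the unhandled exception; it does nothing about the lingering child holding the pipes open, which is the separate process-group problem.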
Blocked on deciding what version of buildslave to upgrade to (bug 631854).
And the decision is to upgrade to 0.8.4-pre, which already includes a fix for this problem. Yay! I'm leaving this open until bug 631851 lands.
Let's call this fixed, just not deployed yet.