Closed Bug 626486 Opened 14 years ago Closed 13 years ago

Talos slaves often get stuck with ProcessExitedAlready

Categories

(Release Engineering :: General, defect, P3)

x86_64
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Unassigned)

References

Details

(Whiteboard: [buildslaves][talos][stuck])

Here's the traceback:

2011-01-14 20:05:53-0800 [-] command timed out: 600 seconds without output
2011-01-14 20:05:53-0800 [-] self.process has no pid
2011-01-14 20:05:53-0800 [-] trying process.signalProcess('KILL')
2011-01-14 20:05:53-0800 [-] Unhandled Error
        Traceback (most recent call last):
          File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/application/app.py", line 445, in startReactor
            self.config, oldstdout, oldstderr, self.profiler, reactor)
          File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/application/app.py", line 348, in runReactorWithLogging
            reactor.run()
          File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py", line 1166, in run
            self.mainLoop()
          File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py", line 1175, in mainLoop
            self.runUntilCurrent()
        --- <exception caught here> ---
          File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py", line 779, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/slave/commands/base.py", line 726, in doTimeout
            self.kill(msg)
          File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/slave/commands/base.py", line 791, in kill
            self.process.signalProcess(self.KILL)
          File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/process.py", line 333, in signalProcess
            raise ProcessExitedAlready()
        twisted.internet.error.ProcessExitedAlready:
I'd like to replicate this locally.  Any idea how the jetpack builds happen to cause this more than others?  Do they quit too quickly?
Whiteboard: [stuck]
Priority: -- → P3
Whiteboard: [stuck] → [buildslaves][talos][stuck]
Assigning to Dustin based on comment #1.
Assignee: nobody → dustin
How are we working around these?  They seem to leave XP machines in the "shutdown is already in progress" state..
(In reply to comment #3)
> How are we working around these?  They seem to leave XP machines in the
> "shutdown is already in progress" state..

I've had luck ssh'ing in and doing 'shutdown -r -t 0'
(In reply to comment #1)
> I'd like to replicate this locally.  Any idea how the jetpack builds happen to
> cause this more than others?  Do they quit too quickly?

ctalbert explains it here: https://bugzilla.mozilla.org/show_bug.cgi?id=627070#c4
There is also a log.


Sometimes I have needed to use remote desktop since VNC might be dead because of the already initiated reboot (this is if ssh'ing does not work).
That bug sounds like the jetpack runner is not exiting, yet Buildbot gets ProcessExitedAlready.

Does this exception only happen on Windows?

Re-assigning to nobody since I'm still playing catch-up here.
Assignee: dustin → nobody
(In reply to comment #6)
> That bug sounds like the jetpack runner is not exiting, yet Buildbot gets
> ProcessExitedAlready.
> 
> Does this exception only happen on Windows?

No, I've seen it on linux and mac slaves too.
See https://bugzilla.mozilla.org/show_bug.cgi?id=629527#c18 for steps to reproduce on linux at least.

For me this doesn't block a reconfig from happening though.
OK, a little bit of research on this topic (noting that there are a lot of only partially related slave-side problems in play here):

An except block for ProcessAlreadyExited was added in
https://github.com/buildbot/buildbot/commit/91343c2ba3fa840cca3d358087878a72ad8085c4
which isn't released yet - it will be in 0.8.4.  I'm working on an improvement to it, as the currently-committed code will try to fire a deferred twice - no good.

The exception occurs when either (a) buildbot tries to kill the process and races against the process dying itself or (b) the process has double-forked, with the parent exiting. There are still members of the process group holding file descriptors open, but the original process has already exited.  Twisted's process Protocol classes (wisely) don't call processEnded until the process has exited *and* all of the stdio pipes have run dry.  So the process is gone and can't be killed, but processEnded hasn't been called yet.

If usePTY=True, then in this situation the parent process is a session leader, so when it exits the child process gets a SIGHUP, which kills the process by default (and would thus close the stdio pipes and wrap everything up).  If usePTY=False, however, the child process holds those file descriptors open but is inaccessible to buildbot since a proper process group has not been set up.

I suspect the latter is the case for Jetpack: the build or test is leaving a child process running, holding file descriptors open, and that child process is either ignoring SIGHUP, or we're not using usePTY on these slaves, or the test process has munged sessions and terminals up enough to not get the SIGHUP.

Faliing to catch that exception means simply that the attempt to kill the process fails, and it is free to keep running.  It's as if the buildstep was run with timeout=None.

It makes sense that this doesn't block a reconfig, and that it will hang the slave in the step forever.

So that's an analysis of this particular problem.  The solution is complicated, not least because things have changed a lot since Buildbot-0.8.0.  I'm writing some Buildbot tests to see if I can narrow down failure cases.
After a bit more analysis and discussion, it looks like the underlying problem is the inability to kill entire process groups without usePTY.  The ProcessExitedAlready exception is problematic in that it prevents the step from finishing, and the fix is easy, so let's do that here -- it's just merging the upstream commit mentioned in the comment 9.
Blocks: 631851
Depends on: 631849
blocked on deciding what version of buildslave to upgrade to (bug 631854)
And the decision is to upgrade to 0.8.4-pre, which already includes a fix for this problem.  Yay!  I'm leaving this open until bug 631851 lands.
Let's call this fixed, just not deployed yet.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.