Closed
Bug 626486
Opened 14 years ago
Closed 14 years ago
Talos slaves often get stuck with ProcessExitedAlready
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Unassigned)
References
Details
(Whiteboard: [buildslaves][talos][stuck])
Here's the traceback:
2011-01-14 20:05:53-0800 [-] command timed out: 600 seconds without output
2011-01-14 20:05:53-0800 [-] self.process has no pid
2011-01-14 20:05:53-0800 [-] trying process.signalProcess('KILL')
2011-01-14 20:05:53-0800 [-] Unhandled Error
Traceback (most recent call last):
File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/application/app.py", line 445, in startReactor
self.config, oldstdout, oldstderr, self.profiler, reactor)
File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/application/app.py", line 348, in runReactorWithLogging
reactor.run()
File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py", line 1166, in run
self.mainLoop()
File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py", line 1175, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/base.py", line 779, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/slave/commands/base.py", line 726, in doTimeout
self.kill(msg)
File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/slave/commands/base.py", line 791, in kill
self.process.signalProcess(self.KILL)
File "/tools/buildbot-0.8.0/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-macosx-10.6-universal.egg/twisted/internet/process.py", line 333, in signalProcess
raise ProcessExitedAlready()
twisted.internet.error.ProcessExitedAlready:
Comment 1•14 years ago
|
||
I'd like to replicate this locally. Any idea how the jetpack builds happen to cause this more than others? Do they quit too quickly?
Updated•14 years ago
|
Whiteboard: [stuck]
Updated•14 years ago
|
Priority: -- → P3
Whiteboard: [stuck] → [buildslaves][talos][stuck]
Comment 3•14 years ago
|
||
How are we working around these? They seem to leave XP machines in the "shutdown is already in progress" state..
Reporter | ||
Comment 4•14 years ago
|
||
(In reply to comment #3)
> How are we working around these? They seem to leave XP machines in the
> "shutdown is already in progress" state..
I've had luck ssh'ing in and doing 'shutdown -r -t 0'
Comment 5•14 years ago
|
||
(In reply to comment #1)
> I'd like to replicate this locally. Any idea how the jetpack builds happen to
> cause this more than others? Do they quit too quickly?
ctalbert explains it here: https://bugzilla.mozilla.org/show_bug.cgi?id=627070#c4
There is also a log.
Sometimes I have needed to use remote desktop since VNC might be dead because of the already initiated reboot (this is if ssh'ing does not work).
Comment 6•14 years ago
|
||
That bug sounds like the jetpack runner is not exiting, yet Buildbot gets ProcessExitedAlready.
Does this exception only happen on Windows?
Re-assigning to nobody since I'm still playing catch-up here.
Assignee: dustin → nobody
Reporter | ||
Comment 7•14 years ago
|
||
(In reply to comment #6)
> That bug sounds like the jetpack runner is not exiting, yet Buildbot gets
> ProcessExitedAlready.
>
> Does this exception only happen on Windows?
No, I've seen it on linux and mac slaves too.
Reporter | ||
Comment 8•14 years ago
|
||
See https://bugzilla.mozilla.org/show_bug.cgi?id=629527#c18 for steps to reproduce on linux at least.
For me this doesn't block a reconfig from happening though.
Comment 9•14 years ago
|
||
OK, a little bit of research on this topic (noting that there are a lot of only partially related slave-side problems in play here):
An except block for ProcessAlreadyExited was added in
https://github.com/buildbot/buildbot/commit/91343c2ba3fa840cca3d358087878a72ad8085c4
which isn't released yet - it will be in 0.8.4. I'm working on an improvement to it, as the currently-committed code will try to fire a deferred twice - no good.
The exception occurs when either (a) buildbot tries to kill the process and races against the process dying itself or (b) the process has double-forked, with the parent exiting. There are still members of the process group holding file descriptors open, but the original process has already exited. Twisted's process Protocol classes (wisely) don't call processEnded until the process has exited *and* all of the stdio pipes have run dry. So the process is gone and can't be killed, but processEnded hasn't been called yet.
If usePTY=True, then in this situation the parent process is a session leader, so when it exits the child process gets a SIGHUP, which kills the process by default (and would thus close the stdio pipes and wrap everything up). If usePTY=False, however, the child process holds those file descriptors open but is inaccessible to buildbot since a proper process group has not been set up.
I suspect the latter is the case for Jetpack: the build or test is leaving a child process running, holding file descriptors open, and that child process is either ignoring SIGHUP, or we're not using usePTY on these slaves, or the test process has munged sessions and terminals up enough to not get the SIGHUP.
Faliing to catch that exception means simply that the attempt to kill the process fails, and it is free to keep running. It's as if the buildstep was run with timeout=None.
It makes sense that this doesn't block a reconfig, and that it will hang the slave in the step forever.
So that's an analysis of this particular problem. The solution is complicated, not least because things have changed a lot since Buildbot-0.8.0. I'm writing some Buildbot tests to see if I can narrow down failure cases.
Comment 10•14 years ago
|
||
After a bit more analysis and discussion, it looks like the underlying problem is the inability to kill entire process groups without usePTY. The ProcessExitedAlready exception is problematic in that it prevents the step from finishing, and the fix is easy, so let's do that here -- it's just merging the upstream commit mentioned in the comment 9.
Comment 11•14 years ago
|
||
blocked on deciding what version of buildslave to upgrade to (bug 631854)
Comment 12•14 years ago
|
||
And the decision is to upgrade to 0.8.4-pre, which already includes a fix for this problem. Yay! I'm leaving this open until bug 631851 lands.
Comment 13•14 years ago
|
||
Let's call this fixed, just not deployed yet.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•12 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•