Closed
Bug 774984
Opened 13 years ago
Closed 12 years ago
Clientproxy needs to start/handle buildbot better (figure out and solve 'buildslave start returned 1')
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: Callek, Unassigned)
References
Details
Attachments
(1 file)
1.37 KB, patch (kmoir: review+)
So across just about our whole tegra pool we have occasionally seen things like:
2012-05-22 15:58:30,526 INFO MainProcess: Running verify code
2012-05-22 15:58:34,996 INFO MainProcess: verify.py has run without issues
2012-05-22 15:58:35,216 INFO MainProcess: buildslave start returned 1
Here the slave is actually not attached to the master, the slave-side twistd.log has nothing after its last disconnect (which is usually long before this), and there is no twistd process running at all for that tegra.
The workaround seems to be a stop_cp.sh/start_cp.sh for that tegra, after which it connects fine! That makes me wonder how the slave/clientproxy can get into that state, though.
This can bring down tegras and make it hard(er) for us to notice.
[I thought I filed this ages ago when I first noticed -- leaving unassigned since I don't yet know when I can grab it, but am marking it critical, since as long as we are using buildbot and clientproxy this is a bad issue]
Comment 1•13 years ago
I'd recommend invoking twistd directly, rather than using 'buildbot start'. One fewer layers.
Reporter
Comment 2•13 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> I'd recommend invoking twistd directly, rather than using 'buildbot start'.
> One fewer layers.
We already do:
http://hg.mozilla.org/build/tools/file/00a21e19f87d/sut_tools/clientproxy.py#l382
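For context, roughly what that code boils down to (an illustrative sketch only; the exact command line and logging live in the linked clientproxy.py and may differ): the slave is started by running twistd directly against the slave's buildbot.tac, and the process exit status is what ends up logged as "buildslave start returned N":

import subprocess
import logging

log = logging.getLogger('clientproxy')

def start_slave(bbpath):
    # Roughly equivalent to "buildslave start", minus the extra layer:
    # run twistd against the buildbot.tac in the slave directory and
    # report its exit code.
    cmd = ['twistd', '--no_save', '--logfile=twistd.log', '--python=buildbot.tac']
    proc = subprocess.Popen(cmd, cwd=bbpath)
    proc.wait()
    log.info('buildslave start returned %d' % proc.returncode)
    return proc.returncode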
bash-3.2$ cat tegra-279/client*.log | grep -U9 "2012-07-17 15:"
2012-07-17 15:26:49,962 INFO dialback: process shutting down
2012-07-17 15:26:50,484 INFO dialback: process exiting with exitcode 0
2012-07-17 15:29:28,737 INFO MainProcess: 90953: ourIP 10.250.48.232 tegra tegra-279 tegraIP 10.250.51.119 bbpath /builds/tegra-279
2012-07-17 15:29:28,739 INFO MainProcess: process shutting down
2012-07-17 15:29:28,740 INFO MainProcess: process shutting down
2012-07-17 15:29:28,741 INFO MainProcess: monitoring started (process pid 90956)
2012-07-17 15:29:28,743 INFO dialback: child process calling self.run()
2012-07-17 15:29:28,744 INFO dialback: Binding to 0.0.0.0 port 42279
2012-07-17 15:30:36,678 INFO MainProcess: heartbeat detected
2012-07-17 15:30:36,680 INFO MainProcess: Running verify code
2012-07-17 15:38:43,537 INFO MainProcess: verify.py has run without issues
2012-07-17 15:40:43,765 INFO MainProcess: buildslave start returned 1
2012-07-17 15:42:23,786 INFO MainProcess: heartbeat detected
And twistd.log:
bash-3.2$ tail tegra-279/twistd.log
2012-07-17 15:29:22-0700 [Broker,client] lost remote
2012-07-17 15:29:22-0700 [Broker,client] lost remote
2012-07-17 15:29:22-0700 [Broker,client] lost remote
2012-07-17 15:29:22-0700 [Broker,client] lost remote
2012-07-17 15:29:22-0700 [Broker,client] lost remote
2012-07-17 15:29:22-0700 [Broker,client] lost remote step
2012-07-17 15:29:22-0700 [Broker,client] Lost connection to buildbot-master20.build.mozilla.org:9201
2012-07-17 15:29:22-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x107a06cb0>
2012-07-17 15:29:22-0700 [-] Main loop terminated.
2012-07-17 15:29:22-0700 [-] Server Shut Down.
bash-3.2$
last lines before the "lost remote":
2012-07-17 15:29:21-0700 [Broker,client] startCommand:shell [id 636335]
2012-07-17 15:29:21-0700 [Broker,client] RunProcess._startCommand
2012-07-17 15:29:21-0700 [Broker,client] python /builds/sut_tools/verify.py
2012-07-17 15:29:21-0700 [Broker,client] in dir /builds/tegra-279/test/build (timeout 1200 secs)
2012-07-17 15:29:21-0700 [Broker,client] watching logfiles {}
2012-07-17 15:29:21-0700 [Broker,client] argv: ['python', '/builds/sut_tools/verify.py']
2012-07-17 15:29:21-0700 [Broker,client] environment: {'__CF_USER_TEXT_ENCODING': '0x1F5:0:0', 'PA$
2012-07-17 15:29:21-0700 [Broker,client] closing stdin
2012-07-17 15:29:21-0700 [Broker,client] using PTY: False
2012-07-17 15:29:22-0700 [-] Received SIGTERM, shutting down.
2012-07-17 15:29:22-0700 [-] stopCommand: halting current command <buildslave.commands.shell.SlaveS$
2012-07-17 15:29:22-0700 [-] command interrupted, killing pid 90950
2012-07-17 15:29:22-0700 [-] trying os.kill(-pid, 9)
2012-07-17 15:29:22-0700 [-] trying process.signalProcess('KILL')
2012-07-17 15:29:22-0700 [-] signal KILL sent successfully
2012-07-17 15:29:22-0700 [Broker,client] lost remote
And of interest:
bash-3.2$ cat tegra-279/twistd.log | grep "2012-07-17 15:29:22-0700" | grep "lost remote" | wc -l
1146
Reporter
Comment 3•13 years ago
Nothing is sticking out at me in an investigation of what could be wrong here: no twistd.log is getting logged, and we toss out [all of] the twistd output from running the slave start command (linked in comment 2).
(In reply to Justin Wood (:Callek) from comment #2)
> (In reply to Dustin J. Mitchell [:dustin] from comment #1)
> > I'd recommend invoking twistd directly, rather than using 'buildbot start'.
> > One fewer layers.
>
> We already do:
>
> http://hg.mozilla.org/build/tools/file/00a21e19f87d/sut_tools/clientproxy.
> py#l382
Sooo, what I am doing short term is turning on --debug clientproxy logging on *all* these foopies, and all the ones I go to restart today will start logging the stdout and stderr of the slave start command to the clientproxy log -- so in theory, the next time this happens, if there is *any* output from twistd we can see it (or at least debug a bit more).
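For the record, the kind of change I mean looks roughly like this (a sketch only, not the eventual patch; the function and logger names here are made up):

import subprocess
import logging

log = logging.getLogger('clientproxy')

def run_slave_start(cmd, cwd):
    # Capture whatever twistd prints instead of throwing it away, so that
    # a non-zero exit leaves some evidence in clientproxy.log.
    proc = subprocess.Popen(cmd, cwd=cwd,
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output, _ = proc.communicate()
    if proc.returncode != 0:
        log.warning('buildslave start returned %d' % proc.returncode)
        for line in output.splitlines():
            log.debug('twistd output: %s' % line)
    return proc.returncode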
I am right now doing another round of cleanup à la bug 774986, because we're down to ~80 tegras up again.
Reporter
Comment 4•13 years ago
(In reply to Justin Wood (:Callek) from comment #2)
> 2012-07-17 15:38:43,537 INFO MainProcess: verify.py has run without issues
> 2012-07-17 15:40:43,765 INFO MainProcess: buildslave start returned 1
> 2012-07-17 15:42:23,786 INFO MainProcess: heartbeat detected
Oh, and to be clear, amy installed ganglia on the foopies yesterday, but we don't have data for this range (next time it happens we might get some data though -- to tell us whether load could be a contributing factor)
Reporter
Comment 5•13 years ago
Interesting... (tegra-145) [sorry for the lengthy comment]
TL;DR: it looks like we are throwing an exception and exiting in one of our subprocesses, so I theorize that it has *something* to do with this.
twistd.log
2012-07-19 20:39:54-0700 [Broker,client] RunProcess._startCommand
2012-07-19 20:39:54-0700 [Broker,client] python /builds/sut_tools/reboot.py 10.250.50.55
2012-07-19 20:39:54-0700 [Broker,client] in dir /builds/tegra-145/test/. (timeout 1800 secs)
2012-07-19 20:39:54-0700 [Broker,client] watching logfiles {}
2012-07-19 20:39:54-0700 [Broker,client] argv: ['python', '/builds/sut_tools/reboot.py', '10.250.50.55']
2012-07-19 20:39:54-0700 [Broker,client] environment: {'__CF_USER_TEXT_ENCODING': '0x1F5:0:0', 'PATH': '/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_NAME': 'tegra-145', 'PWD': '/builds/tegra-145/test', 'SUT_IP': '10.250.50.55'}
2012-07-19 20:39:54-0700 [Broker,client] closing stdin
2012-07-19 20:39:54-0700 [Broker,client] using PTY: False
2012-07-19 20:45:29-0700 [-] sending app-level keepalive
2012-07-19 20:55:29-0700 [-] sending app-level keepalive
2012-07-19 20:59:25-0700 [-] command finished with signal None, exit code 0, elapsedTime: 1171.081281
2012-07-19 20:59:25-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x101b3c290>
2012-07-19 21:05:29-0700 [-] sending app-level keepalive
2012-07-19 21:15:29-0700 [-] sending app-level keepalive
....
2012-07-19 22:21:59-0700 [Broker,client] lost remote
2012-07-19 22:21:59-0700 [Broker,client] Lost connection to buildbot-master20.build.mozilla.org:9201
2012-07-19 22:21:59-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x1015461b8> will retry in 2 seconds
2012-07-19 22:21:59-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:22:02-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:22:02-0700 [-] Connecting to buildbot-master20.build.mozilla.org:9201
2012-07-19 22:22:02-0700 [Uninitialized] Connection to buildbot-master20.build.mozilla.org:9201 failed: Connection Refused
2012-07-19 22:22:02-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x1015461b8> will retry in 8 seconds
2012-07-19 22:22:02-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:22:10-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:22:10-0700 [-] Connecting to buildbot-master20.build.mozilla.org:9201
2012-07-19 22:22:10-0700 [Uninitialized] Connection to buildbot-master20.build.mozilla.org:9201 failed: Connection Refused
2012-07-19 22:22:10-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x1015461b8> will retry in 25 seconds
2012-07-19 22:22:10-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:22:35-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:22:35-0700 [-] Connecting to buildbot-master20.build.mozilla.org:9201
2012-07-19 22:22:35-0700 [Uninitialized] Connection to buildbot-master20.build.mozilla.org:9201 failed: Connection Refused
2012-07-19 22:22:35-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x1015461b8> will retry in 71 seconds
2012-07-19 22:22:35-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:23:46-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:23:46-0700 [-] Connecting to buildbot-master20.build.mozilla.org:9201
2012-07-19 22:23:46-0700 [Uninitialized] Connection to buildbot-master20.build.mozilla.org:9201 failed: Connection Refused
2012-07-19 22:23:46-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x1015461b8> will retry in 202 seconds
2012-07-19 22:23:46-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:25:46-0700 [-] connection attempt timed out (is the port number correct?)
2012-07-19 22:27:09-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:27:09-0700 [-] Connecting to buildbot-master20.build.mozilla.org:9201
2012-07-19 22:27:09-0700 [Uninitialized] Connection to buildbot-master20.build.mozilla.org:9201 failed: Connection Refused
2012-07-19 22:27:09-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x1015461b8> will retry in 341 seconds
2012-07-19 22:27:09-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:29:09-0700 [-] connection attempt timed out (is the port number correct?)
2012-07-19 22:30:28-0700 [-] Received SIGTERM, shutting down.
2012-07-19 22:30:28-0700 [-] Main loop terminated.
2012-07-19 22:30:28-0700 [-] Server Shut Down.
and the most interesting clientproxy.log:
2012-07-19 22:27:09,593 DEBUG MainProcess: bbActive True tegraActive True
2012-07-19 22:27:09,593 DEBUG MainProcess: checking if slave is alive [/builds/tegra-145/twistd.p$
2012-07-19 22:27:09,593 DEBUG MainProcess: EVENT: None
2012-07-19 22:27:57,061 DEBUG dialback: error during dialback loop
2012-07-19 22:27:57,062 DEBUG dialback: Traceback (most recent call last):
2012-07-19 22:27:57,062 DEBUG dialback:
2012-07-19 22:27:57,062 DEBUG dialback: File "clientproxy.py", line 220, in handleDialback
2012-07-19 22:27:57,062 DEBUG dialback: asyncore.loop(timeout=1, count=1)
2012-07-19 22:27:57,062 DEBUG dialback:
2012-07-19 22:27:57,062 DEBUG dialback: File "/opt/local/Library/Frameworks/Python.framework/Ve$
2012-07-19 22:27:57,063 DEBUG dialback: poll_fun(timeout, map)
2012-07-19 22:27:57,063 DEBUG dialback:
2012-07-19 22:27:57,063 DEBUG dialback: File "/opt/local/Library/Frameworks/Python.framework/Ve$
2012-07-19 22:27:57,063 DEBUG dialback: r, w, e = select.select(r, w, e, timeout)
2012-07-19 22:27:57,063 DEBUG dialback:
2012-07-19 22:27:57,063 DEBUG dialback: File "clientproxy.py", line 464, in handleSigTERM
2012-07-19 22:27:57,063 DEBUG dialback: db.close()
2012-07-19 22:27:57,063 DEBUG dialback:
2012-07-19 22:27:57,063 DEBUG dialback: AttributeError: 'Process' object has no attribute 'close'
2012-07-19 22:27:57,064 DEBUG dialback:
2012-07-19 22:27:57,064 DEBUG dialback: Traceback End
2012-07-19 22:27:57,064 INFO dialback: process shutting down
2012-07-19 22:27:57,064 DEBUG dialback: running all "atexit" finalizers with priority >= 0
2012-07-19 22:27:57,064 DEBUG dialback: running the remaining "atexit" finalizers
2012-07-19 22:27:57,064 INFO dialback: process exiting with exitcode 0
2012-07-19 22:28:09,714 DEBUG MainProcess: EVENT: active
2012-07-19 22:28:09,715 DEBUG MainProcess: event active hbFails 0 / 50
2012-07-19 22:28:09,715 DEBUG MainProcess: bbActive True tegraActive True
2012-07-19 22:28:09,715 DEBUG MainProcess: checking if slave is alive [/builds/tegra-145/twistd.pid]
2012-07-19 22:28:09,715 DEBUG MainProcess: EVENT: None
2012-07-19 22:28:57,097 DEBUG MainProcess: hbSocket.recv()
2012-07-19 22:28:57,098 DEBUG MainProcess: Traceback (most recent call last):
2012-07-19 22:28:57,098 DEBUG MainProcess:
2012-07-19 22:28:57,098 DEBUG MainProcess: File "clientproxy.py", line 296, in monitorEvents
2012-07-19 22:28:57,098 DEBUG MainProcess: hbData = hbSocket.recv(4096)
2012-07-19 22:28:57,098 DEBUG MainProcess:
2012-07-19 22:28:57,098 DEBUG MainProcess: File "clientproxy.py", line 464, in handleSigTERM
2012-07-19 22:28:57,099 DEBUG MainProcess: db.close()
2012-07-19 22:28:57,099 DEBUG MainProcess:
2012-07-19 22:28:57,099 DEBUG MainProcess: AttributeError: 'Process' object has no attribute 'close'
2012-07-19 22:28:57,099 DEBUG MainProcess:
2012-07-19 22:28:57,099 DEBUG MainProcess: Traceback End
2012-07-19 22:28:57,099 DEBUG MainProcess: bbActive True tegraActive True
2012-07-19 22:28:57,099 DEBUG MainProcess: checking if slave is alive [/builds/tegra-145/twistd.pid]
.......
2012-07-19 23:09:56,649 INFO MainProcess: Running verify code
2012-07-19 23:09:56,650 DEBUG MainProcess: calling [python /builds/sut_tools/verify.py tegra-145]
2012-07-19 23:24:28,925 DEBUG dialback: error during dialback loop
2012-07-19 23:24:28,926 DEBUG dialback: Traceback (most recent call last):
2012-07-19 23:24:28,926 DEBUG dialback:
2012-07-19 23:24:28,926 DEBUG dialback: File "clientproxy.py", line 220, in handleDialback
2012-07-19 23:24:28,927 DEBUG dialback: asyncore.loop(timeout=1, count=1)
2012-07-19 23:24:28,927 DEBUG dialback:
2012-07-19 23:24:28,927 DEBUG dialback: File "/opt/local/Library/Frameworks/Python.framework/Ve$
2012-07-19 23:24:28,927 DEBUG dialback: poll_fun(timeout, map)
2012-07-19 23:24:28,927 DEBUG dialback:
2012-07-19 23:24:28,927 DEBUG dialback: File "/opt/local/Library/Frameworks/Python.framework/Ve$
2012-07-19 23:24:28,928 DEBUG dialback: r, w, e = select.select(r, w, e, timeout)
2012-07-19 23:24:28,928 DEBUG dialback:
2012-07-19 23:24:28,928 DEBUG dialback: File "clientproxy.py", line 464, in handleSigTERM
2012-07-19 23:24:28,928 DEBUG dialback: db.close()
2012-07-19 23:24:28,928 DEBUG dialback:
2012-07-19 23:24:28,928 DEBUG dialback: AttributeError: 'Process' object has no attribute 'close'
2012-07-19 23:24:28,929 DEBUG dialback:
2012-07-19 23:24:28,929 DEBUG dialback: Traceback End
2012-07-19 23:24:28,929 INFO dialback: process shutting down
2012-07-19 23:24:28,929 DEBUG dialback: running all "atexit" finalizers with priority >= 0
2012-07-19 23:24:28,930 DEBUG dialback: running the remaining "atexit" finalizers
2012-07-19 23:24:28,930 INFO dialback: process exiting with exitcode 0
2012-07-19 23:25:29,036 WARNING MainProcess: verify.py returned with errors
2012-07-19 23:25:29,037 WARNING MainProcess: verify.py did not create '/builds/tegra-145/error.flg'$
2012-07-19 23:25:29,037 INFO MainProcess: /builds/tegra-145/error.flg
2012-07-19 23:25:29,177 INFO MainProcess: Verify.py returned with an unexpected error.
2012-07-19 23:25:30,021 INFO MainProcess: process shutting down
2012-07-19 23:25:30,021 DEBUG MainProcess: running all "atexit" finalizers with priority >= 0
2012-07-19 23:25:30,021 DEBUG MainProcess: telling queue thread to quit
2012-07-19 23:25:30,022 DEBUG MainProcess: feeder thread got sentinel -- exiting
2012-07-19 23:25:30,022 DEBUG MainProcess: running the remaining "atexit" finalizers
2012-07-19 23:25:30,022 DEBUG MainProcess: joining queue thread
2012-07-19 23:25:30,022 DEBUG MainProcess: ... queue thread joined
2012-07-19 23:25:36,622 INFO MainProcess: 39861: ourIP 10.250.48.214 tegra tegra-145 tegraIP 10.250.50.55 bbpath /builds/tegra-145
2012-07-19 23:25:36,625 INFO MainProcess: process shutting down
2012-07-19 23:25:36,627 INFO MainProcess: process shutting down
2012-07-19 23:25:36,645 INFO MainProcess: monitoring started (process pid 39864)
2012-07-19 23:25:36,648 INFO dialback: child process calling self.run()
2012-07-19 23:25:36,649 INFO dialback: Binding to 0.0.0.0 port 42145
2012-07-19 23:26:43,655 INFO MainProcess: Error connecting to data port - sleeping for 5 seconds
2012-07-19 23:26:48,661 INFO MainProcess: Error connecting to data port - sleeping for 5 seconds
2012-07-19 23:26:53,688 INFO MainProcess: heartbeat detected
2012-07-19 23:26:53,691 INFO MainProcess: Running verify code
2012-07-19 23:26:59,815 INFO MainProcess: verify.py has run without issues
2012-07-19 23:27:53,862 INFO MainProcess: heartbeat detected
2012-07-19 23:27:54,404 INFO MainProcess: buildslave start returned 1
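To spell out what the clientproxy tracebacks above are showing (my reading of the log, not confirmed against the source): handleSigTERM() fires while the dialback loop is sitting in select(), and calls db.close(), but db is apparently a multiprocessing.Process, which has no close() on Python 2.6, so the handler itself blows up. A defensive handler would look something like this sketch (the names handleSigTERM/db are taken from the traceback; everything else is assumed):

import signal

db = None  # dialback child (a multiprocessing.Process), set up at startup

def handleSigTERM(signum, frame):
    # multiprocessing.Process has no close() on Python 2.6, so db.close()
    # raises AttributeError inside the signal handler; terminate/join the
    # child instead, and only call close() on objects that actually have it.
    if db is None:
        return
    if hasattr(db, 'close'):
        db.close()
    elif db.is_alive():
        db.terminate()
        db.join(5)

signal.signal(signal.SIGTERM, handleSigTERM)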
Reporter
Comment 6•13 years ago
(In reply to Justin Wood (:Callek) from comment #4)
> (In reply to Justin Wood (:Callek) from comment #2)
> > 2012-07-17 15:38:43,537 INFO MainProcess: verify.py has run without issues
> > 2012-07-17 15:40:43,765 INFO MainProcess: buildslave start returned 1
> > 2012-07-17 15:42:23,786 INFO MainProcess: heartbeat detected
>
> Oh, and to be clear, amy installed ganglia on the foopies yesterday, but we
> don't have data for this range (next time it happens we might get some data
> though -- to tell us whether load could be a contributing factor)
FWIW, checking the ganglia output for the foopy from comment 5, nothing stands out as bad. So I think the Python traceback here is our best way forward.
Reporter
Comment 7•13 years ago
This should start us down the right path, by getting the necessary info into cp's log.
Attachment #694082 - Flags: review?(kmoir)
Updated•13 years ago
Attachment #694082 - Flags: review?(kmoir) → review+
Reporter
Comment 8•13 years ago
This patch is now deployed to the foopies, though I did not restart clientproxy just for this [yet].
Reporter
Comment 9•13 years ago
Soooo... we have an error like this with actual output, for once! Pasting it here; we'll investigate later.
2013-01-16 06:27:42,623 INFO MainProcess: buildslave start returned 1
2013-01-16 06:29:22,633 WARNING MainProcess: No buildbot pidfile found, as expected...
2013-01-16 06:29:22,634 INFO MainProcess: See Output of buildslave start:
2013-01-16 06:29:22,634 INFO MainProcess: ['/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/persisted/sob.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead', ' import os, md5, sys', '/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/python/filepath.py:12: DeprecationWarning: the sha module is deprecated; use the hashlib module instead', ' import sha', 'Traceback (most recent call last):', ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/application/app.py", line 694, in run', ' runApp(config)', ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/scripts/twistd.py", line 23, in runApp', ' _SomeApplicationRunner(config).run()', ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/application/app.py", line 411, in run', ' self.application = self.createOrGetApplication()', ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/application/app.py", line 494, in createOrGetApplication', ' application = getApplication(self.config, passphrase)', '--- <exception caught here> ---', ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/application/app.py", line 505, in getApplication', ' application = service.loadApplication(filename, style, passphrase)', ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/application/service.py", line 390, in loadApplication', " application = sob.loadValueFromFile(filename, 'application', passphrase)", ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/persisted/sob.py", line 214, in loadValueFromFile', ' exec fileObj in d, d', ' File "/builds/tegra-071/buildbot.tac", line 4, in <module>', ' from buildslave.bot import BuildSlave', 'exceptions.ImportError: No module named buildslave.bot', '', 'Failed to load application: No module named buildslave.bot', '']
Reporter
Comment 10•13 years ago
Sooo, we're close to removing clientproxy entirely...
But useful info.
bash-3.2$ echo $PATH
/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
bash-3.2$ which -a twistd
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin/twistd
/usr/bin/twistd
In the cases where this hits, it seems we're using /usr/bin/twistd, not the expected twistd version.
It is unclear how clientproxy loses its PATH.
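One defensive option (purely a sketch of the idea; nothing like this exists in clientproxy.py today) would be to stop trusting PATH and resolve twistd explicitly before starting the slave, failing loudly if the MacPorts install is missing:

import os

# Assumed foopy layout: MacPorts Python 2.6, per the `which -a twistd` output above.
EXPECTED_TWISTD = ('/opt/local/Library/Frameworks/Python.framework'
                   '/Versions/2.6/bin/twistd')
EXPECTED_PATH = ('/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:'
                 '/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin')

def slave_command(bbpath):
    # Never let /usr/bin/twistd (system Python, no buildslave module) win.
    if not os.path.exists(EXPECTED_TWISTD):
        raise RuntimeError('expected twistd missing: %s' % EXPECTED_TWISTD)
    env = dict(os.environ)
    env['PATH'] = EXPECTED_PATH
    return [EXPECTED_TWISTD, '--no_save', '--python=buildbot.tac'], env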
Reporter
Comment 11•13 years ago
So while having to stroll through a lot of devices with this problem, I decided to check what cp was *actually* showing as its env:
Good:
bash-3.2$ ps -p `cat tegra-106/clientproxy.pid` -wwwE
PID TTY TIME CMD
95664 ?? 0:00.02 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python clientproxy.py -b --device=tegra-106 TERM=screen SHELL=/bin/bash SSH_CLIENT=10.26.74.22 57401 22 OLDPWD=/builds SSH_TTY=/dev/ttys000 USER=cltbld TERMCAP=SC|screen|VT 100/ANSI X3.64 virtual terminal:\^J^I:DO=\E[%dB:LE=\E[%dD:RI=\E[%dC:UP=\E[%dA:bs:bt=\E[Z:\^J^I:cd=\E[J:ce=\E[K:cl=\E[H\E[J:cm=\E[%i%d;%dH:ct=\E[3g:\^J^I:do=^J
:nd=\E[C:pt:rc=\E8:rs=\Ec:sc=\E7:st=\EH:up=\EM:\^J^I:le=^H:bl=^G:cr=^M:it#8:ho=\E[H:nw=\EE:ta=^I:is=\E)0:\^J^I:li#51:co#211:am:xn:xv:LP:sr=\EM:al=\E[L:AL=\E[%dL:\^J^I:cs=\E[%i%d;%dr:dl=\E[M:DL=\E[%dM:dc=\E[P:DC=\E[%dP:\^J^I:im=\E[4h:ei=\E[4l:mi:IC=\E[%d@:ks=\E[?1h\E=:\^J^I:ke=\E[?1l\E>:vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l:\^J^I:ti=\E[?1049h:te=\E[?1049l:us=\E[4m:ue=\E[24m:so=\E[3m:\^J^I:se=\E[23m:mb=\E[5m:md=\E[1m:mr=\E[7m:me=\E[m:ms:\^J^I:Co#8:pa
#64:AF=\E[3%dm:AB=\E[4%dm:op=\E[39;49m:AX:\^J^I:vb=\Eg:G0:as=\E(0:ae=\E(B:\^J^I:ac=\140\140aaffggjjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~..--++,,hhII00:\^J^I:po=\E[5i:pf=\E[4i:k0=\E[10~:k1=\EOP:k2=\EOQ:k3=\EOR:\^J^I:k4=\EOS:k5=\E[15~:k6=\E[17~:k7=\E[18~:k9=\E[20~:\^J^I:k;=\E[21~:F1=\E[23~:F2=\E[24~:F3=\E[1;2P:F4=\E[1;2Q:\^J^I:F5=\E[1;2R:F6=\E[1;2S:F7=\E[15;2~:F8=\E[17;2~:\^J^I:F9=\E[18;2~:FA=\E[19;2~:kb=^H:K2=\EOE:kB=\E[Z:\^J^I:kF=\E[1;2B:kR=\E[1
;2A:*4=\E[3;2~:*7=\E[1;2F:#2=\E[1;2H:\^J^I:#3=\E[2;2~:#4=\E[1;2D:%c=\E[6;2~:%e=\E[5;2~:%i=\E[1;2C:\^J^I:kh=\E[1~:@1=\E[1~:kH=\E[4~:@7=\E[4~:kN=\E[6~:kP=\E[5~:\^J^I:kI=\E[2~:kD=\E[3~:ku=\EOA:kd=\EOB:kr=\EOC:kl=\EOD:km: __CF_USER_TEXT_ENCODING=0x1F5:0:0 MAIL=/var/mail/cltbld PATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin STY=13259.ttys000.foopy13 PWD=/builds/tegra-106 SHLVL=3 HOME=/Users/cltbld LOGNAME=cltbld WINDOW=0 SSH_CONNECTION=10.26.74.22 57401 10.250.48.213 22 _=/opt/local/bin/python
Bad:
bash-3.2$ ps -p `cat tegra-108/clientproxy.pid` -wwwE
PID TTY TIME CMD
38416 ?? 2:42.21 python clientproxy.py -b --device=tegra-108 SHELL=/bin/sh OLDPWD=/Users/cltbld USER=cltbld PATH=/usr/bin:/bin:/sbin PWD=/builds HOME=/Users/cltbld SHLVL=2 LOGNAME=cltbld _=/usr/bin/python __CF_USER_TEXT_ENCODING=0x1F5:0:0 VERSIONER_PYTHON_VERSION=2.6 VERSIONER_PYTHON_PREFER_32_BIT=no
Now, I found http://mxr.mozilla.org/build/source/tools/sut_tools/check.py#195, which looks ominous as a place we are restarting clientproxy from, seriously o_O
That is run from a crontab:
http://mxr.mozilla.org/build/source/puppet/modules/foopy/templates/foopy.crontab.erb#8
(also installed manually on the mac foopies)
So I *suspect* this is the root cause of this issue... The new "no clientproxy" code forces status['cp'] to 'active', so we won't hit this block.
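If that check.py block really is the culprit, the fix is roughly one of two things (sketched below with made-up names; the real check.py and crontab differ): either skip the restart when cp is considered active, or force a sane PATH before re-launching clientproxy from cron, since cron only hands us /usr/bin:/bin:/sbin:

import os

SANE_PATH = ('/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:'
             '/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin')

def maybe_restart_clientproxy(status, restart):
    # With the "no clientproxy" code, status['cp'] is forced to 'active',
    # so this block is never reached and cron can't hand clientproxy a
    # crippled environment.
    if status.get('cp') == 'active':
        return
    os.environ['PATH'] = SANE_PATH  # don't inherit cron's minimal PATH
    restart()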
Comment 12•12 years ago
Goodbye, cp.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID
Assignee
Updated•12 years ago
Product: mozilla.org → Release Engineering
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard