Closed Bug 774984 Opened 13 years ago Closed 12 years ago

Clientproxy needs to start/handle buildbot better (figure out and solve 'buildslave start returned 1')

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86_64 Windows 7
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: Callek, Unassigned)

References

Details

Attachments

(1 file)

So across just about our whole tegra pool we have occasionally seen things like:

2012-05-22 15:58:30,526 INFO MainProcess: Running verify code
2012-05-22 15:58:34,996 INFO MainProcess: verify.py has run without issues
2012-05-22 15:58:35,216 INFO MainProcess: buildslave start returned 1

where the slave is actually not attached to the master, the slave-side twistd.log has nothing after its last disconnect (which is usually long before this), and no twistd is running at all for that tegra.

The workaround seems to be a stop_cp.sh/start_cp.sh for that tegra, after which it connects fine! Makes me wonder how the slave/clientproxy can get into that state, though. This can bring down tegras and make it hard(er) for us to notice.

[I thought I filed this ages ago when I first noticed it -- leaving it unassigned since I don't yet know when I can grab it, but marking it critical, since as long as we are using buildbot and clientproxy this is a bad issue]
Blocks: 774986
I'd recommend invoking twistd directly, rather than using 'buildbot start'. One fewer layer.
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> I'd recommend invoking twistd directly, rather than using 'buildbot start'.
> One fewer layer.

We already do:

http://hg.mozilla.org/build/tools/file/00a21e19f87d/sut_tools/clientproxy.py#l382

bash-3.2$ cat tegra-279/client*.log | grep -U9 "2012-07-17 15:"
2012-07-17 15:26:49,962 INFO dialback: process shutting down
2012-07-17 15:26:50,484 INFO dialback: process exiting with exitcode 0
2012-07-17 15:29:28,737 INFO MainProcess: 90953: ourIP 10.250.48.232 tegra tegra-279 tegraIP 10.250.51.119 bbpath /builds/tegra-279
2012-07-17 15:29:28,739 INFO MainProcess: process shutting down
2012-07-17 15:29:28,740 INFO MainProcess: process shutting down
2012-07-17 15:29:28,741 INFO MainProcess: monitoring started (process pid 90956)
2012-07-17 15:29:28,743 INFO dialback: child process calling self.run()
2012-07-17 15:29:28,744 INFO dialback: Binding to 0.0.0.0 port 42279
2012-07-17 15:30:36,678 INFO MainProcess: heartbeat detected
2012-07-17 15:30:36,680 INFO MainProcess: Running verify code
2012-07-17 15:38:43,537 INFO MainProcess: verify.py has run without issues
2012-07-17 15:40:43,765 INFO MainProcess: buildslave start returned 1
2012-07-17 15:42:23,786 INFO MainProcess: heartbeat detected

And twistd.log:

bash-3.2$ tail tegra-279/twistd.log
2012-07-17 15:29:22-0700 [Broker,client] lost remote
2012-07-17 15:29:22-0700 [Broker,client] lost remote
2012-07-17 15:29:22-0700 [Broker,client] lost remote
2012-07-17 15:29:22-0700 [Broker,client] lost remote
2012-07-17 15:29:22-0700 [Broker,client] lost remote
2012-07-17 15:29:22-0700 [Broker,client] lost remote step
2012-07-17 15:29:22-0700 [Broker,client] Lost connection to buildbot-master20.build.mozilla.org:9201
2012-07-17 15:29:22-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x107a06cb0>
2012-07-17 15:29:22-0700 [-] Main loop terminated.
2012-07-17 15:29:22-0700 [-] Server Shut Down.
bash-3.2$

Last lines before the "lost remote":

2012-07-17 15:29:21-0700 [Broker,client] startCommand:shell [id 636335]
2012-07-17 15:29:21-0700 [Broker,client] RunProcess._startCommand
2012-07-17 15:29:21-0700 [Broker,client] python /builds/sut_tools/verify.py
2012-07-17 15:29:21-0700 [Broker,client] in dir /builds/tegra-279/test/build (timeout 1200 secs)
2012-07-17 15:29:21-0700 [Broker,client] watching logfiles {}
2012-07-17 15:29:21-0700 [Broker,client] argv: ['python', '/builds/sut_tools/verify.py']
2012-07-17 15:29:21-0700 [Broker,client] environment: {'__CF_USER_TEXT_ENCODING': '0x1F5:0:0', 'PA$
2012-07-17 15:29:21-0700 [Broker,client] closing stdin
2012-07-17 15:29:21-0700 [Broker,client] using PTY: False
2012-07-17 15:29:22-0700 [-] Received SIGTERM, shutting down.
2012-07-17 15:29:22-0700 [-] stopCommand: halting current command <buildslave.commands.shell.SlaveS$
2012-07-17 15:29:22-0700 [-] command interrupted, killing pid 90950
2012-07-17 15:29:22-0700 [-] trying os.kill(-pid, 9)
2012-07-17 15:29:22-0700 [-] trying process.signalProcess('KILL')
2012-07-17 15:29:22-0700 [-] signal KILL sent successfully
2012-07-17 15:29:22-0700 [Broker,client] lost remote

And of interest:

bash-3.2$ cat tegra-279/twistd.log | grep "2012-07-17 15:29:22-0700" | grep "lost remote" | wc -l
1146
Nothing is sticking out at me in this investigation of what could be wrong here. No twistd.log is getting written, and we toss out [all of] the twistd output from running the slave start command (linked in c#2).

(In reply to Justin Wood (:Callek) from comment #2)
> (In reply to Dustin J. Mitchell [:dustin] from comment #1)
> > I'd recommend invoking twistd directly, rather than using 'buildbot start'.
> > One fewer layer.
>
> We already do:
>
> http://hg.mozilla.org/build/tools/file/00a21e19f87d/sut_tools/clientproxy.py#l382

Sooo, what I am doing short term is turning on --debug clientproxy logging on *all* these foopies, and all the ones I go to restart today will start logging stdout and stderr to the clientproxy log -- so in theory, when this happens next, if there is *any* output from twistd we can see it (or at least debug a bit more).

I am right now doing another round of cleanup à la Bug 774986 because we're down to ~80 tegras up again.
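For illustration only (not clientproxy's actual option handling, which isn't pasted here): a --debug switch like the one mentioned above can be wired to the stdlib logging module. The format string below matches the log lines quoted throughout this bug; the optparse wiring and option name are assumptions.

# Hedged sketch: a --debug flag raising the log level to DEBUG.
# optparse matches the Python 2.6-era codebase; the wiring here is an
# assumption, not clientproxy's actual interface.
import logging
from optparse import OptionParser

parser = OptionParser()
parser.add_option('--debug', action='store_true', default=False,
                  help='log at DEBUG instead of INFO')
options, args = parser.parse_args()

logging.basicConfig(
    level=logging.DEBUG if options.debug else logging.INFO,
    # Matches the 'asctime LEVEL processName: message' lines in this bug.
    format='%(asctime)s %(levelname)s %(processName)s: %(message)s',
)
logging.debug('debug logging enabled')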
(In reply to Justin Wood (:Callek) from comment #2)
> 2012-07-17 15:38:43,537 INFO MainProcess: verify.py has run without issues
> 2012-07-17 15:40:43,765 INFO MainProcess: buildslave start returned 1
> 2012-07-17 15:42:23,786 INFO MainProcess: heartbeat detected

Oh, and to be clear, Amy installed ganglia on the foopies yesterday, but we don't have data for this range (next time it happens we might get some data, though -- to tell us whether load could be a contributing factor).
Interesting... (tegra-145) [sorry for the lengthy comment]

TL;DR: it looks like we are throwing an exception in, and exiting, one of our subprocesses, so I theorize that it has *something* to do with this.

twistd.log:

2012-07-19 20:39:54-0700 [Broker,client] RunProcess._startCommand
2012-07-19 20:39:54-0700 [Broker,client] python /builds/sut_tools/reboot.py 10.250.50.55
2012-07-19 20:39:54-0700 [Broker,client] in dir /builds/tegra-145/test/. (timeout 1800 secs)
2012-07-19 20:39:54-0700 [Broker,client] watching logfiles {}
2012-07-19 20:39:54-0700 [Broker,client] argv: ['python', '/builds/sut_tools/reboot.py', '10.250.50.55']
2012-07-19 20:39:54-0700 [Broker,client] environment: {'__CF_USER_TEXT_ENCODING': '0x1F5:0:0', 'PATH': '/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_NAME': 'tegra-145', 'PWD': '/builds/tegra-145/test', 'SUT_IP': '10.250.50.55'}
2012-07-19 20:39:54-0700 [Broker,client] closing stdin
2012-07-19 20:39:54-0700 [Broker,client] using PTY: False
2012-07-19 20:45:29-0700 [-] sending app-level keepalive
2012-07-19 20:55:29-0700 [-] sending app-level keepalive
2012-07-19 20:59:25-0700 [-] command finished with signal None, exit code 0, elapsedTime: 1171.081281
2012-07-19 20:59:25-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x101b3c290>
2012-07-19 21:05:29-0700 [-] sending app-level keepalive
2012-07-19 21:15:29-0700 [-] sending app-level keepalive
....
2012-07-19 22:21:59-0700 [Broker,client] lost remote
2012-07-19 22:21:59-0700 [Broker,client] Lost connection to buildbot-master20.build.mozilla.org:9201
2012-07-19 22:21:59-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x1015461b8> will retry in 2 seconds
2012-07-19 22:21:59-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:22:02-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:22:02-0700 [-] Connecting to buildbot-master20.build.mozilla.org:9201
2012-07-19 22:22:02-0700 [Uninitialized] Connection to buildbot-master20.build.mozilla.org:9201 failed: Connection Refused
2012-07-19 22:22:02-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x1015461b8> will retry in 8 seconds
2012-07-19 22:22:02-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:22:10-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:22:10-0700 [-] Connecting to buildbot-master20.build.mozilla.org:9201
2012-07-19 22:22:10-0700 [Uninitialized] Connection to buildbot-master20.build.mozilla.org:9201 failed: Connection Refused
2012-07-19 22:22:10-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x1015461b8> will retry in 25 seconds
2012-07-19 22:22:10-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:22:35-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:22:35-0700 [-] Connecting to buildbot-master20.build.mozilla.org:9201
2012-07-19 22:22:35-0700 [Uninitialized] Connection to buildbot-master20.build.mozilla.org:9201 failed: Connection Refused
2012-07-19 22:22:35-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x1015461b8> will retry in 71 seconds
2012-07-19 22:22:35-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:23:46-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:23:46-0700 [-] Connecting to buildbot-master20.build.mozilla.org:9201
2012-07-19 22:23:46-0700 [Uninitialized] Connection to buildbot-master20.build.mozilla.org:9201 failed: Connection Refused
2012-07-19 22:23:46-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x1015461b8> will retry in 202 seconds
2012-07-19 22:23:46-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:25:46-0700 [-] connection attempt timed out (is the port number correct?)
2012-07-19 22:27:09-0700 [-] Starting factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:27:09-0700 [-] Connecting to buildbot-master20.build.mozilla.org:9201
2012-07-19 22:27:09-0700 [Uninitialized] Connection to buildbot-master20.build.mozilla.org:9201 failed: Connection Refused
2012-07-19 22:27:09-0700 [Uninitialized] <twisted.internet.tcp.Connector instance at 0x1015461b8> will retry in 341 seconds
2012-07-19 22:27:09-0700 [Uninitialized] Stopping factory <buildslave.bot.BotFactory instance at 0x101a73368>
2012-07-19 22:29:09-0700 [-] connection attempt timed out (is the port number correct?)
2012-07-19 22:30:28-0700 [-] Received SIGTERM, shutting down.
2012-07-19 22:30:28-0700 [-] Main loop terminated.
2012-07-19 22:30:28-0700 [-] Server Shut Down.

and the most interesting clientproxy.log:

2012-07-19 22:27:09,593 DEBUG MainProcess: bbActive True tegraActive True
2012-07-19 22:27:09,593 DEBUG MainProcess: checking if slave is alive [/builds/tegra-145/twistd.p$
2012-07-19 22:27:09,593 DEBUG MainProcess: EVENT: None
2012-07-19 22:27:57,061 DEBUG dialback: error during dialback loop
2012-07-19 22:27:57,062 DEBUG dialback: Traceback (most recent call last):
2012-07-19 22:27:57,062 DEBUG dialback:
2012-07-19 22:27:57,062 DEBUG dialback:   File "clientproxy.py", line 220, in handleDialback
2012-07-19 22:27:57,062 DEBUG dialback:     asyncore.loop(timeout=1, count=1)
2012-07-19 22:27:57,062 DEBUG dialback:
2012-07-19 22:27:57,062 DEBUG dialback:   File "/opt/local/Library/Frameworks/Python.framework/Ve$
2012-07-19 22:27:57,063 DEBUG dialback:     poll_fun(timeout, map)
2012-07-19 22:27:57,063 DEBUG dialback:
2012-07-19 22:27:57,063 DEBUG dialback:   File "/opt/local/Library/Frameworks/Python.framework/Ve$
2012-07-19 22:27:57,063 DEBUG dialback:     r, w, e = select.select(r, w, e, timeout)
2012-07-19 22:27:57,063 DEBUG dialback:
2012-07-19 22:27:57,063 DEBUG dialback:   File "clientproxy.py", line 464, in handleSigTERM
2012-07-19 22:27:57,063 DEBUG dialback:     db.close()
2012-07-19 22:27:57,063 DEBUG dialback:
2012-07-19 22:27:57,063 DEBUG dialback: AttributeError: 'Process' object has no attribute 'close'
2012-07-19 22:27:57,064 DEBUG dialback:
2012-07-19 22:27:57,064 DEBUG dialback: Traceback End
2012-07-19 22:27:57,064 INFO dialback: process shutting down
2012-07-19 22:27:57,064 DEBUG dialback: running all "atexit" finalizers with priority >= 0
2012-07-19 22:27:57,064 DEBUG dialback: running the remaining "atexit" finalizers
2012-07-19 22:27:57,064 INFO dialback: process exiting with exitcode 0
2012-07-19 22:28:09,714 DEBUG MainProcess: EVENT: active
2012-07-19 22:28:09,715 DEBUG MainProcess: event active hbFails 0 / 50
2012-07-19 22:28:09,715 DEBUG MainProcess: bbActive True tegraActive True
2012-07-19 22:28:09,715 DEBUG MainProcess: checking if slave is alive [/builds/tegra-145/twistd.pid]
2012-07-19 22:28:09,715 DEBUG MainProcess: EVENT: None
2012-07-19 22:28:57,097 DEBUG MainProcess: hbSocket.recv()
2012-07-19 22:28:57,098 DEBUG MainProcess: Traceback (most recent call last):
2012-07-19 22:28:57,098 DEBUG MainProcess:
2012-07-19 22:28:57,098 DEBUG MainProcess:   File "clientproxy.py", line 296, in monitorEvents
2012-07-19 22:28:57,098 DEBUG MainProcess:     hbData = hbSocket.recv(4096)
2012-07-19 22:28:57,098 DEBUG MainProcess:
2012-07-19 22:28:57,098 DEBUG MainProcess:   File "clientproxy.py", line 464, in handleSigTERM
2012-07-19 22:28:57,099 DEBUG MainProcess:     db.close()
2012-07-19 22:28:57,099 DEBUG MainProcess:
2012-07-19 22:28:57,099 DEBUG MainProcess: AttributeError: 'Process' object has no attribute 'close'
2012-07-19 22:28:57,099 DEBUG MainProcess:
2012-07-19 22:28:57,099 DEBUG MainProcess: Traceback End
2012-07-19 22:28:57,099 DEBUG MainProcess: bbActive True tegraActive True
2012-07-19 22:28:57,099 DEBUG MainProcess: checking if slave is alive [/builds/tegra-145/twistd.pid]
.......
2012-07-19 23:09:56,649 INFO MainProcess: Running verify code
2012-07-19 23:09:56,650 DEBUG MainProcess: calling [python /builds/sut_tools/verify.py tegra-145]
2012-07-19 23:24:28,925 DEBUG dialback: error during dialback loop
2012-07-19 23:24:28,926 DEBUG dialback: Traceback (most recent call last):
2012-07-19 23:24:28,926 DEBUG dialback:
2012-07-19 23:24:28,926 DEBUG dialback:   File "clientproxy.py", line 220, in handleDialback
2012-07-19 23:24:28,927 DEBUG dialback:     asyncore.loop(timeout=1, count=1)
2012-07-19 23:24:28,927 DEBUG dialback:
2012-07-19 23:24:28,927 DEBUG dialback:   File "/opt/local/Library/Frameworks/Python.framework/Ve$
2012-07-19 23:24:28,927 DEBUG dialback:     poll_fun(timeout, map)
2012-07-19 23:24:28,927 DEBUG dialback:
2012-07-19 23:24:28,927 DEBUG dialback:   File "/opt/local/Library/Frameworks/Python.framework/Ve$
2012-07-19 23:24:28,928 DEBUG dialback:     r, w, e = select.select(r, w, e, timeout)
2012-07-19 23:24:28,928 DEBUG dialback:
2012-07-19 23:24:28,928 DEBUG dialback:   File "clientproxy.py", line 464, in handleSigTERM
2012-07-19 23:24:28,928 DEBUG dialback:     db.close()
2012-07-19 23:24:28,928 DEBUG dialback:
2012-07-19 23:24:28,928 DEBUG dialback: AttributeError: 'Process' object has no attribute 'close'
2012-07-19 23:24:28,929 DEBUG dialback:
2012-07-19 23:24:28,929 DEBUG dialback: Traceback End
2012-07-19 23:24:28,929 INFO dialback: process shutting down
2012-07-19 23:24:28,929 DEBUG dialback: running all "atexit" finalizers with priority >= 0
2012-07-19 23:24:28,930 DEBUG dialback: running the remaining "atexit" finalizers
2012-07-19 23:24:28,930 INFO dialback: process exiting with exitcode 0
2012-07-19 23:25:29,036 WARNING MainProcess: verify.py returned with errors
2012-07-19 23:25:29,037 WARNING MainProcess: verify.py did not create '/builds/tegra-145/error.flg'$
2012-07-19 23:25:29,037 INFO MainProcess: /builds/tegra-145/error.flg
2012-07-19 23:25:29,177 INFO MainProcess: Verify.py returned with an unexpected error.
2012-07-19 23:25:30,021 INFO MainProcess: process shutting down
2012-07-19 23:25:30,021 DEBUG MainProcess: running all "atexit" finalizers with priority >= 0
2012-07-19 23:25:30,021 DEBUG MainProcess: telling queue thread to quit
2012-07-19 23:25:30,022 DEBUG MainProcess: feeder thread got sentinel -- exiting
2012-07-19 23:25:30,022 DEBUG MainProcess: running the remaining "atexit" finalizers
2012-07-19 23:25:30,022 DEBUG MainProcess: joining queue thread
2012-07-19 23:25:30,022 DEBUG MainProcess: ... queue thread joined
2012-07-19 23:25:36,622 INFO MainProcess: 39861: ourIP 10.250.48.214 tegra tegra-145 tegraIP 10.250.50.55 bbpath /builds/tegra-145
2012-07-19 23:25:36,625 INFO MainProcess: process shutting down
2012-07-19 23:25:36,627 INFO MainProcess: process shutting down
2012-07-19 23:25:36,645 INFO MainProcess: monitoring started (process pid 39864)
2012-07-19 23:25:36,648 INFO dialback: child process calling self.run()
2012-07-19 23:25:36,649 INFO dialback: Binding to 0.0.0.0 port 42145
2012-07-19 23:26:43,655 INFO MainProcess: Error connecting to data port - sleeping for 5 seconds
2012-07-19 23:26:48,661 INFO MainProcess: Error connecting to data port - sleeping for 5 seconds
2012-07-19 23:26:53,688 INFO MainProcess: heartbeat detected
2012-07-19 23:26:53,691 INFO MainProcess: Running verify code
2012-07-19 23:26:59,815 INFO MainProcess: verify.py has run without issues
2012-07-19 23:27:53,862 INFO MainProcess: heartbeat detected
2012-07-19 23:27:54,404 INFO MainProcess: buildslave start returned 1
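An aside on that traceback: 'db' is evidently a multiprocessing.Process, and Process only grew a .close() method in Python 3.7, so on the Python 2.6 in use here the SIGTERM handler itself throws. Below is a minimal standalone reproduction of the failure mode; the terminate()/join() replacement is a hedged guess at the intended behavior, not the actual fix.

# Minimal sketch reproducing the AttributeError above, assuming 'db' is
# a multiprocessing.Process (Process.close() only exists in Python 3.7+,
# so on Python 2.6 the SIGTERM handler itself raises).
import multiprocessing
import signal
import sys
import time

def dialback_loop():
    # Stand-in for clientproxy's dialback child process.
    while True:
        time.sleep(1)

db = multiprocessing.Process(target=dialback_loop)

def handleSigTERM(signum, frame):
    # db.close()      # what clientproxy.py line 464 does -> AttributeError
    db.terminate()    # hedged guess at the intent: stop the child process
    db.join(5)
    sys.exit(0)

if __name__ == '__main__':
    db.start()
    signal.signal(signal.SIGTERM, handleSigTERM)
    db.join()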
(In reply to Justin Wood (:Callek) from comment #4)
> (In reply to Justin Wood (:Callek) from comment #2)
> > 2012-07-17 15:38:43,537 INFO MainProcess: verify.py has run without issues
> > 2012-07-17 15:40:43,765 INFO MainProcess: buildslave start returned 1
> > 2012-07-17 15:42:23,786 INFO MainProcess: heartbeat detected
>
> Oh, and to be clear, Amy installed ganglia on the foopies yesterday, but we
> don't have data for this range (next time it happens we might get some data,
> though -- to tell us whether load could be a contributing factor).

Fwiw, checking ganglia output for the foopy from c#5, nothing stands out as bad. So I think the Python traceback here is our best way forward.
Attached patch [tools] v1 (Splinter Review)
This should start us down the right path by getting the necessary info into cp's log.
Attachment #694082 - Flags: review?(kmoir)
Attachment #694082 - Flags: review?(kmoir) → review+
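The attachment isn't reproduced inline, but the gist of the change is roughly the following sketch (not the actual patch; start_buildslave, its arguments, and the twistd invocation are assumptions). Logging repr(output.splitlines()) would produce exactly the list-of-strings form that shows up in the later log output below.

# Rough sketch of the idea behind the patch: capture stdout/stderr from
# the buildslave start command so a nonzero exit leaves evidence in
# cp's log. start_buildslave and its arguments are hypothetical names.
import subprocess

def start_buildslave(slave_dir, log):
    proc = subprocess.Popen(
        ['twistd', '--no_save', '-y', 'buildbot.tac'],
        cwd=slave_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    output, _ = proc.communicate()
    log.info('buildslave start returned %d' % proc.returncode)
    if proc.returncode != 0:
        log.info('See Output of buildslave start:')
        log.info(repr(output.splitlines()))
    return proc.returncode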
This patch is now deployed to the foopies, though I did not restart clientproxy just for this [yet].
Soooo... we have an error like this, with actual output for once! Pasting it here; we'll investigate later.

2013-01-16 06:27:42,623 INFO MainProcess: buildslave start returned 1
2013-01-16 06:29:22,633 WARNING MainProcess: No buildbot pidfile found, as expected...
2013-01-16 06:29:22,634 INFO MainProcess: See Output of buildslave start:
2013-01-16 06:29:22,634 INFO MainProcess: ['/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/persisted/sob.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead',
 ' import os, md5, sys',
 '/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/python/filepath.py:12: DeprecationWarning: the sha module is deprecated; use the hashlib module instead',
 ' import sha',
 'Traceback (most recent call last):',
 ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/application/app.py", line 694, in run',
 ' runApp(config)',
 ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/scripts/twistd.py", line 23, in runApp',
 ' _SomeApplicationRunner(config).run()',
 ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/application/app.py", line 411, in run',
 ' self.application = self.createOrGetApplication()',
 ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/application/app.py", line 494, in createOrGetApplication',
 ' application = getApplication(self.config, passphrase)',
 '--- <exception caught here> ---',
 ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/application/app.py", line 505, in getApplication',
 ' application = service.loadApplication(filename, style, passphrase)',
 ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/application/service.py", line 390, in loadApplication',
 " application = sob.loadValueFromFile(filename, 'application', passphrase)",
 ' File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/persisted/sob.py", line 214, in loadValueFromFile',
 ' exec fileObj in d, d',
 ' File "/builds/tegra-071/buildbot.tac", line 4, in <module>',
 ' from buildslave.bot import BuildSlave',
 'exceptions.ImportError: No module named buildslave.bot',
 '',
 'Failed to load application: No module named buildslave.bot',
 '']
Sooo, we're close to removing clientproxy entirely... But useful info:

bash-3.2$ echo $PATH
/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
bash-3.2$ which -a twistd
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin/twistd
/usr/bin/twistd

In the cases where this hits, it seems we're using /usr/bin/twistd, not the expected twistd version. It is unclear how clientproxy loses its PATH.
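If clientproxy's inherited PATH can't be trusted, one defensive option is to stop resolving twistd through PATH entirely. A minimal sketch under that assumption; TWISTD below is just the first hit from `which -a twistd` above, and start_buildslave is a hypothetical name, not clientproxy's actual code.

# Sketch of a defensive fix: invoke the known-good twistd by absolute
# path and prepend the MacPorts bin dirs, rather than trusting whatever
# PATH clientproxy happened to inherit.
import os
import subprocess

TWISTD = ('/opt/local/Library/Frameworks/Python.framework/'
          'Versions/2.6/bin/twistd')

def start_buildslave(slave_dir):
    env = dict(os.environ)
    env['PATH'] = '/opt/local/bin:/opt/local/sbin:' + env.get('PATH', '')
    return subprocess.call([TWISTD, '--no_save', '-y', 'buildbot.tac'],
                           cwd=slave_dir, env=env)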
So while having to stroll through a lot of devices with this problem, I decided to check what cp was *actually* displaying as its env:

Good:

bash-3.2$ ps -p `cat tegra-106/clientproxy.pid` -wwwE
  PID TTY           TIME CMD
95664 ??         0:00.02 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python clientproxy.py -b --device=tegra-106 TERM=screen SHELL=/bin/bash SSH_CLIENT=10.26.74.22 57401 22 OLDPWD=/builds SSH_TTY=/dev/ttys000 USER=cltbld TERMCAP=SC|screen|VT 100/ANSI X3.64 virtual terminal:\^J^I:DO=\E[%dB:LE=\E[%dD:RI=\E[%dC:UP=\E[%dA:bs:bt=\E[Z:\^J^I:cd=\E[J:ce=\E[K:cl=\E[H\E[J:cm=\E[%i%d;%dH:ct=\E[3g:\^J^I:do=^J:nd=\E[C:pt:rc=\E8:rs=\Ec:sc=\E7:st=\EH:up=\EM:\^J^I:le=^H:bl=^G:cr=^M:it#8:ho=\E[H:nw=\EE:ta=^I:is=\E)0:\^J^I:li#51:co#211:am:xn:xv:LP:sr=\EM:al=\E[L:AL=\E[%dL:\^J^I:cs=\E[%i%d;%dr:dl=\E[M:DL=\E[%dM:dc=\E[P:DC=\E[%dP:\^J^I:im=\E[4h:ei=\E[4l:mi:IC=\E[%d@:ks=\E[?1h\E=:\^J^I:ke=\E[?1l\E>:vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l:\^J^I:ti=\E[?1049h:te=\E[?1049l:us=\E[4m:ue=\E[24m:so=\E[3m:\^J^I:se=\E[23m:mb=\E[5m:md=\E[1m:mr=\E[7m:me=\E[m:ms:\^J^I:Co#8:pa#64:AF=\E[3%dm:AB=\E[4%dm:op=\E[39;49m:AX:\^J^I:vb=\Eg:G0:as=\E(0:ae=\E(B:\^J^I:ac=\140\140aaffggjjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~..--++,,hhII00:\^J^I:po=\E[5i:pf=\E[4i:k0=\E[10~:k1=\EOP:k2=\EOQ:k3=\EOR:\^J^I:k4=\EOS:k5=\E[15~:k6=\E[17~:k7=\E[18~:k9=\E[20~:\^J^I:k;=\E[21~:F1=\E[23~:F2=\E[24~:F3=\E[1;2P:F4=\E[1;2Q:\^J^I:F5=\E[1;2R:F6=\E[1;2S:F7=\E[15;2~:F8=\E[17;2~:\^J^I:F9=\E[18;2~:FA=\E[19;2~:kb=^H:K2=\EOE:kB=\E[Z:\^J^I:kF=\E[1;2B:kR=\E[1;2A:*4=\E[3;2~:*7=\E[1;2F:#2=\E[1;2H:\^J^I:#3=\E[2;2~:#4=\E[1;2D:%c=\E[6;2~:%e=\E[5;2~:%i=\E[1;2C:\^J^I:kh=\E[1~:@1=\E[1~:kH=\E[4~:@7=\E[4~:kN=\E[6~:kP=\E[5~:\^J^I:kI=\E[2~:kD=\E[3~:ku=\EOA:kd=\EOB:kr=\EOC:kl=\EOD:km: __CF_USER_TEXT_ENCODING=0x1F5:0:0 MAIL=/var/mail/cltbld PATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin STY=13259.ttys000.foopy13 PWD=/builds/tegra-106 SHLVL=3 HOME=/Users/cltbld LOGNAME=cltbld WINDOW=0 SSH_CONNECTION=10.26.74.22 57401 10.250.48.213 22 _=/opt/local/bin/python

Bad:

bash-3.2$ ps -p `cat tegra-108/clientproxy.pid` -wwwE
  PID TTY           TIME CMD
38416 ??         2:42.21 python clientproxy.py -b --device=tegra-108 SHELL=/bin/sh OLDPWD=/Users/cltbld USER=cltbld PATH=/usr/bin:/bin:/sbin PWD=/builds HOME=/Users/cltbld SHLVL=2 LOGNAME=cltbld _=/usr/bin/python __CF_USER_TEXT_ENCODING=0x1F5:0:0 VERSIONER_PYTHON_VERSION=2.6 VERSIONER_PYTHON_PREFER_32_BIT=no

Now, I found http://mxr.mozilla.org/build/source/tools/sut_tools/check.py#195, which looks ominous as a place we are restarting clientproxy from, seriously o_O

That is run from a crontab: http://mxr.mozilla.org/build/source/puppet/modules/foopy/templates/foopy.crontab.erb#8 (also installed manually on mac foopies).

So I *suspect* this is the root cause of this issue... The new "no clientproxy" code forces status['cp'] to be 'active', so we won't hit this block.
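For what it's worth, the "Bad" environment above (PATH=/usr/bin:/bin:/sbin, SHELL=/bin/sh, no SSH_* or screen variables) is exactly what a cron-spawned process inherits, which fits the check.py theory: a bare 'python clientproxy.py' under cron resolves to /usr/bin/python. A minimal sketch of a defensive restart follows; restart_clientproxy is a hypothetical name, and GOOD_PATH is copied from the "Good" process above, not from check.py.

# Sketch: if check.py must restart clientproxy from cron, force the PATH
# the screen-launched ("Good") instance had, so 'python' resolves to the
# MacPorts build instead of /usr/bin/python.
import os
import subprocess

GOOD_PATH = ('/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:'
             '/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin')

def restart_clientproxy(device):
    env = dict(os.environ, PATH=GOOD_PATH)
    subprocess.Popen(
        ['python', 'clientproxy.py', '-b', '--device=%s' % device],
        cwd='/builds',
        env=env,
    )

# e.g. restart_clientproxy('tegra-108')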
Goodbye, cp.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
