Closed Bug 777273 Opened 12 years ago Closed 12 years ago

Only ~50 tegras are running tests; tegra pending test job count is at 1600 and rising

Categories

Product/Component: Infrastructure & Operations Graveyard :: CIDuty
Type: task
Hardware: ARM
OS: Android
Points: Not set
Severity: blocker
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: armenzg)

References

Details

http://build.mozilla.org/builds/pending/running.html

Running test(s) @ Jul 25 02:29:03
tegra (55)
  24 mozilla-beta
  16 mozilla-central
   6 mozilla-aurora
   3 mozilla-inbound
   3 mozilla-esr10
   2 fx-team
   1 birch

http://build.mozilla.org/builds/pending/pending.html

Pending test(s) @ Jul 25 02:29:05
tegra (219)
  106 mozilla-inbound
   56 birch
   30 fx-team
   16 alder
   10 mozilla-central
    1 mozilla-esr10
tegra (1482)
  1482 try

The tegra pending test job count has been rising by 100+ per hour for the last 16 hours. Presumably a number of tegras need a kick, similar to bug 774986.
Blocks: 772458
Running:
tegra (49)
  24 mozilla-central
  21 mozilla-esr10
   2 fx-team
   1 mozilla-inbound
   1 mozilla-beta
tegra (2)
   2 try

Pending test(s) @ Jul 25 04:39:10
tegra (242)
  100 mozilla-inbound
   47 birch
   36 ionmonkey
   26 mozilla-aurora
   16 mozilla-beta
   14 alder
    2 mozilla-central
    1 mozilla-esr10
tegra (1762)
  1762 try
Bit of a sorry story on http://mobile-dashboard.pub.build.mozilla.org/

                                        Production  Staging
Tegra and buildslave online                 39        13
Tegra online but buildslave is not          33         6
Both Tegra and buildslave are offline      136         5
Total                                      208        24
Assignee: nobody → armenzg
It seems last week we had bug 774986 happening. I see ~50 tegras taking jobs and ~50 tegras connected to our three masters. I see tons of tegras that have not been taking jobs for days. I don't see anything that has landed on our repos recently. I remember coop said he was looking into the tegras yesterday (not sure what the outcome was).

There is only one positive side to all this... the masters were so responsive! (note to self: look at ganglia later)

I'm going to shut down all tegras that show the message "INACTIVE active OFFLINE". Starting them again does not seem to help. For now I have stopped these tegras; once I have the fix I will start them again (since stop takes a long time to take effect).

* Connect to the foopies and run:
    tail -n1 tegra*/tegra*.log | grep "INACTIVE active OFFLINE"
  (./check.sh takes forever)
  ** Using the tail technique could be risky, as the state could be out of date.
* Use these commands to stop them (and later start them):
    for i in {191,192,201,276,277,278,279,281,282,283,285}; do ./stop_cp.sh tegra-$i; done
    for i in {248,249,251,253,255,256,257,258,259,260}; do ./stop_cp.sh tegra-$i; done
    for i in {234,236,237,238,240,242,245}; do ./stop_cp.sh tegra-$i; done
    for i in {146,220,221,223,227,229,231,232,233}; do ./stop_cp.sh tegra-$i; done
    for i in {174,178,179,181,182,183,196}; do ./stop_cp.sh tegra-$i; done
    for i in {159,160,162,163,164,165,166,169,170,171}; do ./stop_cp.sh tegra-$i; done
    for i in {159,160,162,163,164,165,166,169,170,171}; do ./stop_cp.sh tegra-$i; done
    for i in {129,133,134,135,136,142,144}; do ./stop_cp.sh tegra-$i; done
  ** ./stop.sh is extremely slow.
* ^^ These are foopies 15 to 24 (except 18 and 21).
* On foopy18 I hit "Cannot find terminfo entry for 'xterm-256color'."
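The tail-and-grep triage can be automated so the brace lists above don't have to be maintained by hand. Below is a hypothetical Python sketch (not part of sut_tools); the status-line layout ("tegra-NNN p <sut> <cp> <bb> ::") and the meaning of the three state columns are assumed from the log excerpts in this bug.

```python
import re

# Hypothetical helper: parse the last status line of each tegra log and
# collect the devices stuck in "INACTIVE active OFFLINE". The three columns
# are assumed to be SUTAgent / clientproxy / buildslave state.
STUCK = ("INACTIVE", "active", "OFFLINE")

def stuck_tegras(last_lines):
    """Return sorted tegra numbers whose last status line matches STUCK."""
    ids = []
    for line in last_lines:
        m = re.search(r"tegra-(\d+)\s+\S+\s+(\S+)\s+(\S+)\s+(\S+)", line)
        if m and (m.group(2), m.group(3), m.group(4)) == STUCK:
            ids.append(int(m.group(1)))
    return sorted(ids)

# e.g. the output of: tail -n1 tegra*/tegra*.log
sample = [
    "2012-07-21 14:10:01 tegra-144 p INACTIVE active OFFLINE ::",
    "2012-07-21 11:52:08 tegra-145 p online active active ::",
    "2012-07-21 14:10:09 tegra-191 p INACTIVE active OFFLINE ::",
]
print("for i in {%s}; do ./stop_cp.sh tegra-$i; done"
      % ",".join(str(i) for i in stuck_tegras(sample)))
# -> for i in {144,191}; do ./stop_cp.sh tegra-$i; done
```

This only looks at the last line of each log, so it has the same staleness caveat as the tail technique itself.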
* foopy21 does not seem to be set up.

I'm looking at tegra-144/clientproxy.log:

2012-07-25 07:24:29,467 DEBUG MainProcess: Traceback End
2012-07-25 07:24:29,467 DEBUG MainProcess: bbActive False tegraActive True
2012-07-25 07:24:29,468 DEBUG MainProcess: EVENT: None
2012-07-25 07:26:29,465 DEBUG MainProcess: hbSocket.recv()
2012-07-25 07:26:29,466 DEBUG MainProcess: Traceback (most recent call last):
2012-07-25 07:26:29,466 DEBUG MainProcess:
2012-07-25 07:26:29,466 DEBUG MainProcess:   File "clientproxy.py", line 296, in monitorEvents
2012-07-25 07:26:29,467 DEBUG MainProcess:     hbData = hbSocket.recv(4096)
2012-07-25 07:26:29,467 DEBUG MainProcess:
2012-07-25 07:26:29,467 DEBUG MainProcess: timeout: timed out

Here is some data from tegra-144/tegra-144_status.log:

2012-07-21 11:32:08 tegra-144 p online active active ::
2012-07-21 11:37:08 tegra-144 p online active active ::
2012-07-21 11:42:22 tegra-144 p INACTIVE active active :: SUTAgent not present;
2012-07-21 11:44:04 tegra-144 p INACTIVE active active :: SUTAgent not present;
2012-07-21 11:47:06 tegra-144 p INACTIVE active active :: SUTAgent not present;
2012-07-21 11:52:08 tegra-144 p online active active ::
2012-07-21 11:57:19 tegra-144 p online active active ::
2012-07-21 12:02:10 tegra-144 p online active active :: REBOOTING
2012-07-21 12:07:17 tegra-144 p online active active :: REBOOTING
2012-07-21 12:12:08 tegra-144 p OFFLINE active active :: PING tegra-144.build.mtv1.mozilla.com (10.250.50.54): 56 data bytes Request timeout for icmp_seq 0;
2012-07-21 12:19:10 tegra-144 p INACTIVE active active ::
...
2012-07-21 13:50:01 tegra-144 p INACTIVE active active ::
2012-07-21 13:55:08 tegra-144 p INACTIVE active active ::
2012-07-21 13:59:57 tegra-144 p INACTIVE active OFFLINE ::
2012-07-21 14:05:01 tegra-144 p INACTIVE active OFFLINE ::
2012-07-21 14:10:01 tegra-144 p INACTIVE active OFFLINE ::

From tegra-191/twistd.log:

2012-07-20 17:23:49-0700 [Broker,client] RunProcess._startCommand
2012-07-20 17:23:49-0700 [Broker,client]  python /builds/sut_tools/reboot.py 10.250.50.101
2012-07-20 17:23:49-0700 [Broker,client]   in dir /builds/tegra-191/test/../talos-data (timeout 1800 secs)
2012-07-20 17:23:49-0700 [Broker,client]  watching logfiles {}
2012-07-20 17:23:49-0700 [Broker,client]  argv: ['python', '/builds/sut_tools/reboot.py', '10.250.50.101']
2012-07-20 17:23:49-0700 [Broker,client]  environment: {'__CF_USER_TEXT_ENCODING': '0x1F5:0:0', 'PATH': '/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_NAME': 'tegra-191', 'PWD': '/builds/tegra-191/talos-data', 'SUT_IP': '10.250.50.101'}
2012-07-20 17:23:49-0700 [Broker,client]  closing stdin
2012-07-20 17:23:49-0700 [Broker,client]  using PTY: False
2012-07-20 17:24:03-0700 [-] sending app-level keepalive
2012-07-20 17:34:03-0700 [-] sending app-level keepalive
2012-07-20 17:44:03-0700 [-] sending app-level keepalive
2012-07-20 17:46:03-0700 [-] command finished with signal 15, exit code None, elapsedTime: 1334.069412
2012-07-20 17:46:03-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x103b2a128>
2012-07-20 17:46:04-0700 [-] Received SIGTERM, shutting down.
2012-07-20 17:46:04-0700 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2012-07-20 17:46:04-0700 [Broker,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
        ]
(the same PBConnectionLost "Unhandled Error" entry repeats twice more for SlaveBuilder.sendUpdate, and once for sendComplete)
2012-07-20 17:46:04-0700 [Broker,client] lost remote
I think one of our code changes might be causing this. I updated to 726a439dc568 (on foopy24), which is prior to 2e38c0d51e81 (bug 771626). Powercycling works after that. /me still investigates

bash-3.2$ python /builds/sut_tools/tegra_powercycle.py 191
10.250.48.232 MainProcess: Reboot Tegra Called for tegra tegra-191
10.250.48.232 MainProcess:   File "/builds/sut_tools/tegra_powercycle.py", line 22, in <module>
10.250.48.232 MainProcess:     sut_lib.reboot_tegra(tegra, debug=True)
10.250.48.232 MainProcess:
10.250.48.232 MainProcess:   File "/builds/tools/sut_tools/sut_lib.py", line 538, in reboot_tegra
10.250.48.232 MainProcess:     logRebootTraceback(tegra)
10.250.48.232 MainProcess:
10.250.48.232 MainProcess:   File "/builds/tools/sut_tools/sut_lib.py", line 252, in logRebootTraceback
10.250.48.232 MainProcess:     for s in traceback.format_list(traceback.extract_stack()):
10.250.48.232 MainProcess:
10.250.48.232 MainProcess: SNMPv2-SMI::enterprises.1718.3.2.3.1.11.1.2.4 = INTEGER: 3
10.250.48.232 MainProcess: process shutting down
10.250.48.232 MainProcess: running all "atexit" finalizers with priority >= 0
10.250.48.232 MainProcess: running the remaining "atexit" finalizers

bash-3.2$ cd /builds/sut_tools
bash-3.2$ hg up -r 726a439dc568
merging sut_tools/clientproxy.py
9 files updated, 1 files merged, 0 files removed, 0 files unresolved
bash-3.2$ cd -
/builds/tegra-191
bash-3.2$ python /builds/sut_tools/tegra_powercycle.py 191
SNMPv2-SMI::enterprises.1718.3.2.3.1.11.1.2.4 = INTEGER: 3
It seems that the fix is to do this:

cd /builds/tools && hg up -C && hg up -r 726a439dc568
for i in {###,###}; do ./stop_cp.sh tegra-$i && python /builds/sut_tools/tegra_powercycle.py $i && ./start_cp.sh tegra-$i; done

/me is still verifying this before doing a pass over all of the foopies
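As a sanity check, the one-liner can be expanded per tegra before it touches a foopy. A minimal dry-run sketch, assuming only the stop_cp.sh / tegra_powercycle.py / start_cp.sh tools named above (the real tegra numbers were elided as ### in the recipe, so the IDs here are examples):

```python
# Hypothetical dry run: expand the stop / powercycle / start sequence for
# each tegra number so the batch can be reviewed before it is executed.
def recovery_commands(tegra_ids):
    cmds = []
    for i in tegra_ids:
        cmds.append("./stop_cp.sh tegra-%d" % i)
        cmds.append("python /builds/sut_tools/tegra_powercycle.py %d" % i)
        cmds.append("./start_cp.sh tegra-%d" % i)
    return cmds

for cmd in recovery_commands([144, 191]):  # example IDs, not the real batch
    print(cmd)
```

Piping the printed lines into sh would reproduce the loop above, minus the && short-circuiting.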
I have run this on foopies 15 to 24 (except 18 & 21). We're up to 65 tegras and the count will soon be increasing. I will check in 30 minutes and then go through foopies 07 to 14.

(This tail command is different because I had run stop_cp.sh in advance.)

tail -n1 tegra*/tegra*.log | grep "INACTIVE OFFLINE OFFLINE"
cd /builds/tools && hg up -C && hg up -r 726a439dc568
for i in {### ###}; do ./stop_cp.sh tegra-$i && python /builds/sut_tools/tegra_powercycle.py $i && ./start_cp.sh tegra-$i; done
All foopies have been tackled. The code has been backed out:
http://hg.mozilla.org/build/tools/rev/fdcb5a3311d8

I'm going to do one last pass to see what I left behind. We're now running 149 jobs:
https://build.mozilla.org/buildapi/running
This is as many tegras as I can recuperate without the typical one-by-one buildduty analysis. The main issue has been dealt with.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
I can't believe I completely forgot about foopy18. It is now running this:

for i in {206,207,208,209,211,212,213,214,215,217}; do ./stop_cp.sh tegra-$i && python /builds/sut_tools/tegra_powercycle.py $i && ./start_cp.sh tegra-$i; done

This hopefully means we're adding another 10 tegras back.
Job started at 2012-07-25 13:53:30 PDT

There might be fallout from this:

17:15 mbrubeck: armenzg: We just had a bunch of Tegra failures with "ImportError: cannot import name logRebootTraceback" -- https://tbpl.mozilla.org/php/getParsedLog.php?id=13849556&tree=Mozilla-Inbound
17:15 mbrubeck: looks related to http://hg.mozilla.org/build/tools/rev/fdcb5a3311d8

Not sure yet.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #10)
> Job started at 2012-07-25 13:53:30 PDT
>
> There might be fallout from this:
...
> Not sure yet.

Yes, that was due to this backout -- and I blame myself for not reminding Armen about the devicemanagerSUT.py change I manually made on all the foopies to support this logging. (Manual only because these files are not in revision control on our foopies at the moment.)

I had kept the original files around as .old, so I just moved them back over. All fixed.
It looks like this is still happening. From https://tbpl.mozilla.org/php/getParsedLog.php?id=13851916&tree=Mozilla-Inbound&full=1 (at 2012-07-25 15:15:07.455598):

calling reboot
Traceback (most recent call last):
  File "/builds/sut_tools/installApp.py", line 190, in <module>
    main(sys.argv)
  File "/builds/sut_tools/installApp.py", line 175, in main
    dm, devRoot = one_time_setup(ip_addr, path_to_main_apk)
  File "/builds/sut_tools/installApp.py", line 148, in one_time_setup
    dm.reboot(proxyIP, proxyPort)
  File "/builds/tools/sut_tools/devicemanagerSUT.py", line 887, in reboot
    from sut_lib import logRebootTraceback
ImportError: cannot import name logRebootTraceback
program finished with exit code 1
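The traceback is a plain import mismatch: the backed-out sut_lib no longer defines logRebootTraceback, but the manually patched devicemanagerSUT.py still imports it, so every job dies. A minimal sketch of the failure mode plus a hypothetical no-op fallback -- this guard is not the fix that was actually applied (that was restoring the .old files), it only shows how the import could have degraded gracefully:

```python
# "from sut_lib import logRebootTraceback" fails when sut_lib lacks the name
# (or the module is absent entirely); catching ImportError and installing a
# no-op fallback turns a job-killing crash into "no extra reboot logging".
try:
    from sut_lib import logRebootTraceback  # absent after the backout
except ImportError:
    def logRebootTraceback(device):
        pass  # hypothetical fallback: skip the extra reboot logging

logRebootTraceback("tegra-144")  # safe either way; no-op under the fallback
```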
mbrubeck caught me on IRC and I double-checked; it turned out one foopy (foopy07) was still bad and not fixed, so this affected 11 tegras at most. Fixed now.
http://mobile-dashboard.pub.build.mozilla.org/ says: Tegra and buildslave online 124 (production) Whereas I believe this was up to 159 last night. Should I file another bug, or is this usual/expected daily attrition that is one of the things that [buildduty] takes care of?
(In reply to Ed Morley [:edmorley] from comment #14) > http://mobile-dashboard.pub.build.mozilla.org/ says: > Tegra and buildslave online 124 (production) > > Whereas I believe this was up to 159 last night. > > Should I file another bug, or is this usual/expected daily attrition that is > one of the things that [buildduty] takes care of? Down to 119. Should I file another bug?
Down to 109. Feels a little like I'm talking to myself... :P
Blocks: 782627
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard