Closed
Bug 777273
Opened 12 years ago
Closed 12 years ago
Only ~50 tegras are running tests; tegra pending test job count is at 1600 and rising
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: armenzg)
References
Details
http://build.mozilla.org/builds/pending/running.html
Running test(s) @ Jul 25 02:29:03
tegra (55)
24 mozilla-beta
16 mozilla-central
6 mozilla-aurora
3 mozilla-inbound
3 mozilla-esr10
2 fx-team
1 birch
http://build.mozilla.org/builds/pending/pending.html
Pending test(s) @ Jul 25 02:29:05
tegra (219)
106 mozilla-inbound
56 birch
30 fx-team
16 alder
10 mozilla-central
1 mozilla-esr10
tegra (1482)
1482 try
Tegra test job pending count has been rising by 100+ per hour for the last 16 hours.
Presume a number of tegras need a kick, similar to bug 774986.
Reporter | Comment 1 • 12 years ago
Running:
tegra (49)
24 mozilla-central
21 mozilla-esr10
2 fx-team
1 mozilla-inbound
1 mozilla-beta
tegra (2)
2 try
Pending test(s) @ Jul 25 04:39:10
tegra (242)
100 mozilla-inbound
47 birch
36 ionmonkey
26 mozilla-aurora
16 mozilla-beta
14 alder
2 mozilla-central
1 mozilla-esr10
tegra (1762)
1762 try
Reporter | Comment 2 • 12 years ago
Bit of a sorry story on http://mobile-dashboard.pub.build.mozilla.org/
                                       Production  Staging
Tegra and buildslave online                    39       13
Tegra online but buildslave is not             33        6
Both Tegra and buildslave are offline         136        5
Total                                         208       24
Assignee | Updated • 12 years ago
Assignee: nobody → armenzg
Assignee | Comment 3 • 12 years ago
It seems last week we had bug 774986 happening.
I see ~50 tegras taking jobs; I see ~50 tegras connected to our three masters.
I see tons of tegras that have not been taking jobs for days.
I don't see anything that has landed on our repos recently.
I remember coop said he was looking into the tegras yesterday (not sure what the outcome was).
There is only one positive side to all this... the masters were so responsive! (note to look at ganglia later).
I'm going to shut down all tegras whose last status line shows "INACTIVE active OFFLINE".
Starting them again does not seem to help.
For now I have stopped these tegras.
Once I have the fix I will start them again (since stopping takes a long time to take effect).
* connect to the foopies
tail -n1 tegra*/tegra*.log | grep "INACTIVE active OFFLINE"
(./check.sh takes forever)
** using the tail technique could be risky as the state could be out of date
* use this command to stop them and start them:
for i in {191,192,201,276,277,278,279,281,282,283,285}; do ./stop_cp.sh tegra-$i; done
for i in {248,249,251,253,255,256,257,258,259,260}; do ./stop_cp.sh tegra-$i; done
for i in {234,236,237,238,240,242,245}; do ./stop_cp.sh tegra-$i; done
for i in {146,220,221,223,227,229,231,232,233}; do ./stop_cp.sh tegra-$i; done
for i in {174,178,179,181,182,183,196}; do ./stop_cp.sh tegra-$i; done
for i in {159,160,162,163,164,165,166,169,170,171}; do ./stop_cp.sh tegra-$i; done
for i in {129,133,134,135,136,142,144}; do ./stop_cp.sh tegra-$i; done
** ./stop.sh is extremely slow
* ^^ These are foopies 15 to 24 (except 18 and 21)
* foopy18 I hit "Cannot find terminfo entry for 'xterm-256color'."
* foopy21 does not seem to be setup
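The detection step above (the `tail -n1 | grep` check on each foopy) can be sketched as a small script. The directory layout, the log file name pattern, and the `builds_dir` default are assumptions based on the commands in this comment; as noted above, the last line of a log can be stale, so treat the result as a hint rather than ground truth.

```python
# Sketch of the detection step: list tegras whose most recent status line
# shows "INACTIVE active OFFLINE", mirroring `tail -n1 tegra*/tegra*.log | grep ...`.
# Paths and the log layout are assumptions, not the actual foopy setup.
import glob
import os

BAD_STATE = "INACTIVE active OFFLINE"

def stuck_tegras(builds_dir="/builds"):
    """Return tegra names whose last status-log line matches BAD_STATE."""
    stuck = []
    for log in glob.glob(os.path.join(builds_dir, "tegra-*", "tegra-*_status.log")):
        with open(log) as f:
            lines = f.read().splitlines()
        if lines and BAD_STATE in lines[-1]:
            # The tegra name is the directory component, e.g. "tegra-144".
            stuck.append(os.path.basename(os.path.dirname(log)))
    return sorted(stuck)
```

Unlike `./check.sh`, this only reads the tail of each log, so it is fast but can report an out-of-date state.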
I'm looking at tegra-144/clientproxy.log:
2012-07-25 07:24:29,467 DEBUG MainProcess: Traceback End
2012-07-25 07:24:29,467 DEBUG MainProcess: bbActive False tegraActive True
2012-07-25 07:24:29,468 DEBUG MainProcess: EVENT: None
2012-07-25 07:26:29,465 DEBUG MainProcess: hbSocket.recv()
2012-07-25 07:26:29,466 DEBUG MainProcess: Traceback (most recent call last):
2012-07-25 07:26:29,466 DEBUG MainProcess:
2012-07-25 07:26:29,466 DEBUG MainProcess: File "clientproxy.py", line 296, in monitorEvents
2012-07-25 07:26:29,467 DEBUG MainProcess: hbData = hbSocket.recv(4096)
2012-07-25 07:26:29,467 DEBUG MainProcess:
2012-07-25 07:26:29,467 DEBUG MainProcess: timeout: timed out
Here is some data from tegra-144/tegra-144_status.log
2012-07-21 11:32:08 tegra-144 p online active active ::
2012-07-21 11:37:08 tegra-144 p online active active ::
2012-07-21 11:42:22 tegra-144 p INACTIVE active active :: SUTAgent not present;
2012-07-21 11:44:04 tegra-144 p INACTIVE active active :: SUTAgent not present;
2012-07-21 11:47:06 tegra-144 p INACTIVE active active :: SUTAgent not present;
2012-07-21 11:52:08 tegra-144 p online active active ::
2012-07-21 11:57:19 tegra-144 p online active active ::
2012-07-21 12:02:10 tegra-144 p online active active :: REBOOTING
2012-07-21 12:07:17 tegra-144 p online active active :: REBOOTING
2012-07-21 12:12:08 tegra-144 p OFFLINE active active :: PING tegra-144.build.mtv1.mozilla.com (10.250.50.54): 56 data bytes Request timeout for icmp_seq 0;
2012-07-21 12:19:10 tegra-144 p INACTIVE active active ::
...
2012-07-21 13:50:01 tegra-144 p INACTIVE active active ::
2012-07-21 13:55:08 tegra-144 p INACTIVE active active ::
2012-07-21 13:59:57 tegra-144 p INACTIVE active OFFLINE ::
2012-07-21 14:05:01 tegra-144 p INACTIVE active OFFLINE ::
2012-07-21 14:10:01 tegra-144 p INACTIVE active OFFLINE ::
From tegra-191/twistd.log
2012-07-20 17:23:49-0700 [Broker,client] RunProcess._startCommand
2012-07-20 17:23:49-0700 [Broker,client] python /builds/sut_tools/reboot.py 10.250.50.101
2012-07-20 17:23:49-0700 [Broker,client] in dir /builds/tegra-191/test/../talos-data (timeout 1800 secs)
2012-07-20 17:23:49-0700 [Broker,client] watching logfiles {}
2012-07-20 17:23:49-0700 [Broker,client] argv: ['python', '/builds/sut_tools/reboot.py', '10.250.50.101']
2012-07-20 17:23:49-0700 [Broker,client] environment: {'__CF_USER_TEXT_ENCODING': '0x1F5:0:0', 'PATH': '/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_NAME': 'tegra-191', 'PWD': '/builds/tegra-191/talos-data', 'SUT_IP': '10.250.50.101'}
2012-07-20 17:23:49-0700 [Broker,client] closing stdin
2012-07-20 17:23:49-0700 [Broker,client] using PTY: False
2012-07-20 17:24:03-0700 [-] sending app-level keepalive
2012-07-20 17:34:03-0700 [-] sending app-level keepalive
2012-07-20 17:44:03-0700 [-] sending app-level keepalive
2012-07-20 17:46:03-0700 [-] command finished with signal 15, exit code None, elapsedTime: 1334.069412
2012-07-20 17:46:03-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x103b2a128>
2012-07-20 17:46:04-0700 [-] Received SIGTERM, shutting down.
2012-07-20 17:46:04-0700 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2012-07-20 17:46:04-0700 [Broker,client] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
]
2012-07-20 17:46:04-0700 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2012-07-20 17:46:04-0700 [Broker,client] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
]
2012-07-20 17:46:04-0700 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2012-07-20 17:46:04-0700 [Broker,client] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
]
2012-07-20 17:46:04-0700 [Broker,client] SlaveBuilder._ackFailed: sendComplete
2012-07-20 17:46:04-0700 [Broker,client] Unhandled Error
Traceback (most recent call last):
Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
]
2012-07-20 17:46:04-0700 [Broker,client] lost remote
Assignee | Comment 4 • 12 years ago
I think one of our code changes might be causing this.
I updated foopy24 to 726a439dc568, which is prior to 2e38c0d51e81 (bug 771626).
Powercycling works after that.
/me still investigates
bash-3.2$ python /builds/sut_tools/tegra_powercycle.py 191
10.250.48.232 MainProcess: Reboot Tegra Called for tegra tegra-191
10.250.48.232 MainProcess: File "/builds/sut_tools/tegra_powercycle.py", line 22, in <module>
10.250.48.232 MainProcess: sut_lib.reboot_tegra(tegra, debug=True)
10.250.48.232 MainProcess:
10.250.48.232 MainProcess: File "/builds/tools/sut_tools/sut_lib.py", line 538, in reboot_tegra
10.250.48.232 MainProcess: logRebootTraceback(tegra)
10.250.48.232 MainProcess:
10.250.48.232 MainProcess: File "/builds/tools/sut_tools/sut_lib.py", line 252, in logRebootTraceback
10.250.48.232 MainProcess: for s in traceback.format_list(traceback.extract_stack()):
10.250.48.232 MainProcess:
SNMPv2-SMI::enterprises.1718.3.2.3.1.11.1.2.4 = INTEGER: 3
10.250.48.232 MainProcess: process shutting down
10.250.48.232 MainProcess: running all "atexit" finalizers with priority >= 0
10.250.48.232 MainProcess: running the remaining "atexit" finalizers
bash-3.2$ cd /builds/sut_tools
bash-3.2$ hg up -r 726a439dc568
merging sut_tools/clientproxy.py
9 files updated, 1 files merged, 0 files removed, 0 files unresolved
bash-3.2$ cd -
/builds/tegra-191
bash-3.2$ python /builds/sut_tools/tegra_powercycle.py 191
SNMPv2-SMI::enterprises.1718.3.2.3.1.11.1.2.4 = INTEGER: 3
Assignee | Comment 5 • 12 years ago
It seems that the fix is to do this:
cd /builds/tools && hg up -C && hg up -r 726a439dc568
for i in {###,###}; do ./stop_cp.sh tegra-$i && python /builds/sut_tools/tegra_powercycle.py $i && ./start_cp.sh tegra-$i; done
/me is still verifying this before doing a pass over all of the foopies
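The stop / powercycle / start sequence above can also be driven from a small Python wrapper, which makes it easy to dry-run the commands before touching production. The script names come from this bug; the `dry_run` hook and the function itself are illustrative, not something that existed on the foopies.

```python
# Sketch of the per-tegra recovery loop from the shell one-liner above.
# Script names are from this bug; dry_run and the function are hypothetical.
import subprocess

def recover(tegra_ids, dry_run=True):
    """Stop, powercycle, and restart each tegra, stopping on the first failure."""
    for i in tegra_ids:
        tegra = "tegra-%s" % i
        steps = [
            ["./stop_cp.sh", tegra],
            ["python", "/builds/sut_tools/tegra_powercycle.py", str(i)],
            ["./start_cp.sh", tegra],
        ]
        for cmd in steps:
            if dry_run:
                print(" ".join(cmd))
            else:
                # check_call raises on non-zero exit, matching the `&&` chaining.
                subprocess.check_call(cmd)
```

For example, `recover(["191"], dry_run=True)` prints the three commands that would run for tegra-191.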
Assignee | Comment 6 • 12 years ago
I have run this on foopies 15 to 24 (except 18 & 21).
We're up to 65 tegras and the count should keep rising.
I will check in 30 minutes and go through foopies 07 to 14.
(this tail command is different because I had run stop_cp.sh in advance)
tail -n1 tegra*/tegra*.log | grep "INACTIVE OFFLINE OFFLINE"
cd /builds/tools && hg up -C && hg up -r 726a439dc568
for i in {### ###}; do ./stop_cp.sh tegra-$i && python /builds/sut_tools/tegra_powercycle.py $i && ./start_cp.sh tegra-$i; done
Assignee | Comment 7 • 12 years ago
All foopies have been tackled.
The code has been backed out:
http://hg.mozilla.org/build/tools/rev/fdcb5a3311d8
I'm going to do one last pass to see what I left behind.
We're now running 149 jobs.
https://build.mozilla.org/buildapi/running
Assignee | Comment 8 • 12 years ago
This is as many tegras as I can recover without the typical one-by-one buildduty analysis.
The main issue has been dealt with.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Assignee | Comment 9 • 12 years ago
I can't believe I forgot completely about foopy18.
It is now running this:
for i in {206,207,208,209,211,212,213,214,215,217}; do ./stop_cp.sh tegra-$i && python /builds/sut_tools/tegra_powercycle.py $i && ./start_cp.sh tegra-$i; done
This hopefully means we're adding another 10 tegras back.
Assignee | Comment 10 • 12 years ago
Job started at 2012-07-25 13:53:30 PDT
There might be fallout from this:
17:15 mbrubeck: armenzg: We just had a bunch of Tegra failures with "ImportError: cannot import name logRebootTraceback" -- https://tbpl.mozilla.org/php/getParsedLog.php?id=13849556&tree=Mozilla-Inbound
17:15 mbrubeck: looks related to http://hg.mozilla.org/build/tools/rev/fdcb5a3311d8
Not sure yet.
Comment 11 • 12 years ago
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #10)
> Job started at 2012-07-25 13:53:30 PDT
>
> There might be a fallout from this:
...
> Not sure yet.
Yeah, that was due to this backout, and I blame myself for not reminding armen about the devicemanagerSUT.py change I manually made on all the foopies to support this logging. (Manually only because these files are not in revision control for our foopies atm.)
I had kept the original files around as .old, so I just moved them back over. All fixed.
Comment 12 • 12 years ago
It looks like this is still happening. From https://tbpl.mozilla.org/php/getParsedLog.php?id=13851916&tree=Mozilla-Inbound&full=1 (at 2012-07-25 15:15:07.455598):
calling reboot
Traceback (most recent call last):
File "/builds/sut_tools/installApp.py", line 190, in <module>
main(sys.argv)
File "/builds/sut_tools/installApp.py", line 175, in main
dm, devRoot = one_time_setup(ip_addr, path_to_main_apk)
File "/builds/sut_tools/installApp.py", line 148, in one_time_setup
dm.reboot(proxyIP, proxyPort)
File "/builds/tools/sut_tools/devicemanagerSUT.py", line 887, in reboot
from sut_lib import logRebootTraceback
ImportError: cannot import name logRebootTraceback
program finished with exit code 1
Comment 13 • 12 years ago
mbrubeck caught me on IRC and I double-checked; it turned out one foopy (foopy07) was still in the bad state and not fixed, so this affected at most 11 tegras. Fixed now.
Reporter | Comment 14 • 12 years ago
http://mobile-dashboard.pub.build.mozilla.org/ says:
Tegra and buildslave online 124 (production)
Whereas I believe this was up to 159 last night.
Should I file another bug, or is this usual/expected daily attrition that is one of the things that [buildduty] takes care of?
Reporter | Comment 15 • 12 years ago
(In reply to Ed Morley [:edmorley] from comment #14)
> http://mobile-dashboard.pub.build.mozilla.org/ says:
> Tegra and buildslave online 124 (production)
>
> Whereas I believe this was up to 159 last night.
>
> Should I file another bug, or is this usual/expected daily attrition that is
> one of the things that [buildduty] takes care of?
Down to 119.
Should I file another bug?
Reporter | Comment 16 • 12 years ago
Down to 109.
Feels a little like I'm talking to myself... :P
Updated • 11 years ago
Product: mozilla.org → Release Engineering
Updated • 7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated • 5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard