Closed Bug 777273 Opened 9 years ago Closed 9 years ago

Only ~50 tegras are running tests; tegra pending test job count is at 1600 and rising

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

ARM
Android
task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: armenzg)

References

Details

http://build.mozilla.org/builds/pending/running.html
  Running test(s) @ Jul 25 02:29:03
    tegra (55)
       24 mozilla-beta
       16 mozilla-central
        6 mozilla-aurora
        3 mozilla-inbound
        3 mozilla-esr10
        2 fx-team
        1 birch

http://build.mozilla.org/builds/pending/pending.html
  Pending test(s) @ Jul 25 02:29:05
    tegra (219)
      106 mozilla-inbound
       56 birch
       30 fx-team
       16 alder
       10 mozilla-central
        1 mozilla-esr10
    tegra (1482)
     1482 try

Tegra test job pending count has been rising by 100+ per hour for the last 16 hours.

Presume a number of tegras need a kick, similar to bug 774986.
Blocks: 772458
Running:
  tegra (49)
   24 mozilla-central
   21 mozilla-esr10
    2 fx-team
    1 mozilla-inbound
    1 mozilla-beta
  tegra (2)
    2 try

Pending test(s) @ Jul 25 04:39:10
 tegra (242)
  100 mozilla-inbound
   47 birch
   36 ionmonkey
   26 mozilla-aurora
   16 mozilla-beta
   14 alder
    2 mozilla-central
    1 mozilla-esr10
 tegra (1762)
  1762 try
Bit of a sorry story on http://mobile-dashboard.pub.build.mozilla.org/

	                            Production	Staging
Tegra and buildslave online	        39	13
Tegra online but buildslave is not	33	6
Both Tegra and buildslave are offline	136	5
Total	                                208	24
Assignee: nobody → armenzg
It seems last week we had bug 774986 happening.

I see ~50 tegras taking jobs; I see ~50 tegras connected to our three masters.
I see tons of tegras that have not been taking jobs for days.
I don't see anything that has landed on our repos recently.
I remember coop said he was looking into the tegras yesterday (not sure what the outcome was).
There is only one positive side to all this... the masters were so responsive! (note to look at ganglia later).

I'm going to shut all tegras that have this message "INACTIVE   active  OFFLINE".
Starting them again does not seem to help.
For now I have stopped these tegras.
Once I have the fix I will start them again (since stop takes a long time to work).

* connect to the foopies
tail -n1 tegra*/tegra*.log | grep "INACTIVE   active  OFFLINE"
(./check.sh takes forever)
** using the tail technique could be risky as the state could be out of date
* use this command to stop them and start them:
for i in {191,192,201,276,277,278,279,281,282,283,285}; do ./stop_cp.sh tegra-$i; done
for i in {248,249,251,253,255,256,257,258,259,260}; do ./stop_cp.sh tegra-$i; done
for i in {234,236,237,238,240,242,245}; do ./stop_cp.sh tegra-$i; done
for i in {146,220,221,223,227,229,231,232,233}; do ./stop_cp.sh tegra-$i; done
for i in {174,178,179,181,182,183,196}; do ./stop_cp.sh tegra-$i; done
for i in {159,160,162,163,164,165,166,169,170,171}; do ./stop_cp.sh tegra-$i; done
for i in {159,160,162,163,164,165,166,169,170,171}; do ./stop_cp.sh tegra-$i; done
for i in {129,133,134,135,136,142,144}; do ./stop_cp.sh tegra-$i; done
** ./stop.sh is extremely slow
* ^^ These are foopies 15 to 24 (except 18 and 21)
* foopy18 I hit "Cannot find terminfo entry for 'xterm-256color'."
* foopy21 does not seem to be setup

I'm looking at tegra-144/clientproxy.log:
2012-07-25 07:24:29,467 DEBUG   MainProcess: Traceback End
2012-07-25 07:24:29,467 DEBUG   MainProcess: bbActive False tegraActive True
2012-07-25 07:24:29,468 DEBUG   MainProcess: EVENT: None
2012-07-25 07:26:29,465 DEBUG   MainProcess: hbSocket.recv()
2012-07-25 07:26:29,466 DEBUG   MainProcess: Traceback (most recent call last):
2012-07-25 07:26:29,466 DEBUG   MainProcess:
2012-07-25 07:26:29,466 DEBUG   MainProcess:   File "clientproxy.py", line 296, in monitorEvents
2012-07-25 07:26:29,467 DEBUG   MainProcess:     hbData = hbSocket.recv(4096)
2012-07-25 07:26:29,467 DEBUG   MainProcess:
2012-07-25 07:26:29,467 DEBUG   MainProcess: timeout: timed out

Here is some data from tegra-144/tegra-144_status.log
2012-07-21 11:32:08 tegra-144 p    online   active   active ::
2012-07-21 11:37:08 tegra-144 p    online   active   active ::
2012-07-21 11:42:22 tegra-144 p  INACTIVE   active   active :: SUTAgent not present;
2012-07-21 11:44:04 tegra-144 p  INACTIVE   active   active :: SUTAgent not present;
2012-07-21 11:47:06 tegra-144 p  INACTIVE   active   active :: SUTAgent not present;
2012-07-21 11:52:08 tegra-144 p    online   active   active ::
2012-07-21 11:57:19 tegra-144 p    online   active   active ::
2012-07-21 12:02:10 tegra-144 p    online   active   active :: REBOOTING
2012-07-21 12:07:17 tegra-144 p    online   active   active :: REBOOTING
2012-07-21 12:12:08 tegra-144 p   OFFLINE   active   active :: PING tegra-144.build.mtv1.mozilla.com (10.250.50.54): 56 data bytes Request timeout for icmp_seq 0;
2012-07-21 12:19:10 tegra-144 p  INACTIVE   active   active ::
...
2012-07-21 13:50:01 tegra-144 p  INACTIVE   active   active ::
2012-07-21 13:55:08 tegra-144 p  INACTIVE   active   active ::
2012-07-21 13:59:57 tegra-144 p  INACTIVE   active  OFFLINE ::
2012-07-21 14:05:01 tegra-144 p  INACTIVE   active  OFFLINE ::
2012-07-21 14:10:01 tegra-144 p  INACTIVE   active  OFFLINE ::

From tegra-191/twistd.log
2012-07-20 17:23:49-0700 [Broker,client] RunProcess._startCommand
2012-07-20 17:23:49-0700 [Broker,client]  python /builds/sut_tools/reboot.py 10.250.50.101
2012-07-20 17:23:49-0700 [Broker,client]   in dir /builds/tegra-191/test/../talos-data (timeout 1800 secs)
2012-07-20 17:23:49-0700 [Broker,client]   watching logfiles {}
2012-07-20 17:23:49-0700 [Broker,client]   argv: ['python', '/builds/sut_tools/reboot.py', '10.250.50.101']
2012-07-20 17:23:49-0700 [Broker,client]  environment: {'__CF_USER_TEXT_ENCODING': '0x1F5:0:0', 'PATH': '/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_NAME': 'tegra-191', 'PWD': '/builds/tegra-191/talos-data', 'SUT_IP': '10.250.50.101'}
2012-07-20 17:23:49-0700 [Broker,client]   closing stdin
2012-07-20 17:23:49-0700 [Broker,client]   using PTY: False
2012-07-20 17:24:03-0700 [-] sending app-level keepalive
2012-07-20 17:34:03-0700 [-] sending app-level keepalive
2012-07-20 17:44:03-0700 [-] sending app-level keepalive
2012-07-20 17:46:03-0700 [-] command finished with signal 15, exit code None, elapsedTime: 1334.069412
2012-07-20 17:46:03-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x103b2a128>
2012-07-20 17:46:04-0700 [-] Received SIGTERM, shutting down.
2012-07-20 17:46:04-0700 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2012-07-20 17:46:04-0700 [Broker,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
        ]

2012-07-20 17:46:04-0700 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2012-07-20 17:46:04-0700 [Broker,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
        ]

2012-07-20 17:46:04-0700 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2012-07-20 17:46:04-0700 [Broker,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
        ]

2012-07-20 17:46:04-0700 [Broker,client] SlaveBuilder._ackFailed: sendComplete
2012-07-20 17:46:04-0700 [Broker,client] Unhandled Error
        Traceback (most recent call last):
        Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
        ]

2012-07-20 17:46:04-0700 [Broker,client] lost remote
I think one of our code changes might be causing this.
I updated to 726a439dc568 (on foopy24) which is prior to 2e38c0d51e81 (bug 771626).
powercycling works after that.
/me still investigates

bash-3.2$ python /builds/sut_tools/tegra_powercycle.py 191
10.250.48.232   MainProcess: Reboot Tegra Called for tegra tegra-191
10.250.48.232   MainProcess:   File "/builds/sut_tools/tegra_powercycle.py", line 22, in <module>
10.250.48.232   MainProcess:     sut_lib.reboot_tegra(tegra, debug=True)
10.250.48.232   MainProcess: 
10.250.48.232   MainProcess:   File "/builds/tools/sut_tools/sut_lib.py", line 538, in reboot_tegra
10.250.48.232   MainProcess:     logRebootTraceback(tegra)
10.250.48.232   MainProcess: 
10.250.48.232   MainProcess:   File "/builds/tools/sut_tools/sut_lib.py", line 252, in logRebootTraceback
10.250.48.232   MainProcess:     for s in traceback.format_list(traceback.extract_stack()):
10.250.48.232   MainProcess: 
SNMPv2-SMI::enterprises.1718.3.2.3.1.11.1.2.4 = INTEGER: 3
10.250.48.232   MainProcess: process shutting down
10.250.48.232   MainProcess: running all "atexit" finalizers with priority >= 0
10.250.48.232   MainProcess: running the remaining "atexit" finalizers
bash-3.2$ cd /builds/sut_tools
bash-3.2$ hg up -r 726a439dc568
merging sut_tools/clientproxy.py
9 files updated, 1 files merged, 0 files removed, 0 files unresolved
bash-3.2$ cd -
/builds/tegra-191
bash-3.2$ python /builds/sut_tools/tegra_powercycle.py 191
SNMPv2-SMI::enterprises.1718.3.2.3.1.11.1.2.4 = INTEGER: 3
It seems that the fix is to do this:

cd /builds/tools && hg up -C && hg up -r 726a439dc568
for i in {###,###}; do ./stop_cp.sh tegra-$i && python /builds/sut_tools/tegra_powercycle.py $i && ./start_cp.sh tegra-$i; done

/me still is verifying this before passing all of the foopies
I have run this on foopies 15 to 24 (except 18 & 21).

We're up to 65 tegras and we will soon be increasing.

I will check in 30 minutes and go through foopies 07 to 14.

(this tail command is different because I had run stop_cp.sh in advance)
tail -n1 tegra*/tegra*.log | grep "INACTIVE  OFFLINE  OFFLINE"
cd /builds/tools && hg up -C && hg up -r 726a439dc568
for i in {### ###}; do ./stop_cp.sh tegra-$i && python /builds/sut_tools/tegra_powercycle.py $i && ./start_cp.sh tegra-$i; done
All foopies have been tackled.

The code has been backed out:
http://hg.mozilla.org/build/tools/rev/fdcb5a3311d8

I'm going to do one last pass to see what I left behind.

We're now running 149 jobs.
https://build.mozilla.org/buildapi/running
This is as many tegras I can recuperate without the typical buildduty one by one analysis.
The main issue has been dealt with.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
I can't believe I forgot completely about foopy18.

It is now running this:
for i in {206,207,208,209,211,212,213,214,215,217}; do ./stop_cp.sh tegra-$i && python /builds/sut_tools/tegra_powercycle.py $i && ./start_cp.sh tegra-$i; done

This hopefully means we're adding another 10 tegras back.
Job started at 2012-07-25 13:53:30 PDT

There might be a fallout from this:
17:15 mbrubeck: armenzg: We just had a bunch of Tegra failures with "ImportError: cannot import name logRebootTraceback" -- https://tbpl.mozilla.org/php/getParsedLog.php?id=13849556&tree=Mozilla-Inbound
17:15 mbrubeck: looks related to http://hg.mozilla.org/build/tools/rev/fdcb5a3311d8

Not sure yet.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #10)
> Job started at 2012-07-25 13:53:30 PDT
> 
> There might be a fallout from this:
...
> Not sure yet.

Yea, that was due to this backout -- and I blame myself for not reminding armen about the devicemanagerSUT.py change I manually made on all the foopies to support this logging. (Manually only because these files are not in revision control for our foopies atm)

I had kept the original files around as .old so I just moved them back over. All fixed
It looks like this is still happening. From https://tbpl.mozilla.org/php/getParsedLog.php?id=13851916&tree=Mozilla-Inbound&full=1 (at 2012-07-25 15:15:07.455598):

calling reboot
Traceback (most recent call last):
  File "/builds/sut_tools/installApp.py", line 190, in <module>
    main(sys.argv)
  File "/builds/sut_tools/installApp.py", line 175, in main
    dm, devRoot = one_time_setup(ip_addr, path_to_main_apk)
  File "/builds/sut_tools/installApp.py", line 148, in one_time_setup
    dm.reboot(proxyIP, proxyPort)
  File "/builds/tools/sut_tools/devicemanagerSUT.py", line 887, in reboot
    from sut_lib import logRebootTraceback
ImportError: cannot import name logRebootTraceback
program finished with exit code 1
mbrubeck caught me on IRC and I double checked, it turned out on one foopy we were bad and not fixed. (foopy07) so this affected 11 tegras max. Fixed now.
http://mobile-dashboard.pub.build.mozilla.org/ says:
Tegra and buildslave online	124 (production)

Whereas I believe this was up to 159 last night.

Should I file another bug, or is this usual/expected daily attrition that is one of the things that [buildduty] takes care of?
(In reply to Ed Morley [:edmorley] from comment #14)
> http://mobile-dashboard.pub.build.mozilla.org/ says:
> Tegra and buildslave online	124 (production)
> 
> Whereas I believe this was up to 159 last night.
> 
> Should I file another bug, or is this usual/expected daily attrition that is
> one of the things that [buildduty] takes care of?

Down to 119.

Should I file another bug?
Down to 109.

Feels a little like I'm talking to myself... :P
Blocks: 782627
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.