Closed Bug 777418 Opened 13 years ago Closed 13 years ago

ensure that the network switches the tegras use have not been changed physically or updated in switch configs

Categories

(Infrastructure & Operations Graveyard :: NetOps, task)

Hardware: ARM
OS: Android
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: jmaher, Unassigned)

Details

We are having a lot of trouble with our android tegras rebooting. After spending a couple of weeks trying to debug this we have not been able to find a problem. Taking a step back, we are looking at everything to ensure we are not overlooking something unrelated to the devices themselves or the test automation code. The questions to answer are:
1) Have any of the switches that the tegras are connected to been physically changed (replaced, blade switched out) since June 1st, 2012?
2) Have any of the switches that the tegras are connected to had their configs updated since June 1st, 2012? If so, can we get a diff?
3) Has the DHCP server that the tegras use been updated, rebooted, or reconfigured since June 1st, 2012?
4) Have any new firewall rules or new DHCP servers been added since June 1st, 2012?
Which switches?
The DHCP servers, name servers, and tegras are all on the same network. There are no firewalls between them. They have been up for 37 days and were down only for the power work that was done back in June. Their configurations did not change, nor are other build hosts experiencing issues. There are several switches that the tegras are attached to; they're in both haxxor and 2/idf.
Can you be more specific about "trouble with our android tegras rebooting"? What isn't working? Link? IP address? Something else? Has anyone looked at DHCP to see whether leases have been exhausted?
To be clear, the power work was in 3/mdf, not in either location where the tegras are.
ravi: they have static leases, so it's not lease exhaustion. And nagios reports that they are up and responding to both ping and the agent port just fine.
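For reference, the up-check described here amounts to an ICMP ping plus a TCP connect to the agent port. A minimal sketch in Python, assuming port 20701 for the agent (the port number is an assumption, not taken from this bug):

# Up-check sketch: ping the host, then try a TCP connect to the agent port.
import socket, subprocess

host = "tegra-253.build.mtv1.mozilla.com"  # any tegra would do
ping_ok = subprocess.call(["ping", "-c", "1", "-W", "2", host],
                          stdout=subprocess.DEVNULL) == 0
try:
    s = socket.create_connection((host, 20701), timeout=5)  # agent port (assumed)
    s.close()
    port_ok = True
except socket.error:
    port_ok = False
print("ping:", "up" if ping_ok else "down",
      "| agent port:", "open" if port_ok else "closed")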
The tegras reboot mid-test, and the reboots appear random: they happen across builds of fennec (XUL or native), across different tests, and at different points within a test or during other cleanup/setup steps. We don't know whether we are mixing up IP addresses and scheduling jobs on the wrong devices. :arich: do you know whether, during the reboot 37 days ago due to the power outage, there were any updates that needed to be done on the DHCP server? Maybe a config change that wasn't persisted? These devices reboot 3-4 times per test vs. the once per test that the normal test slaves do. I suspect the tegras are on a different physical network and use a different DNS and DHCP server than the other test slaves.
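One way to rule out the IP-address mix-up mentioned above is to confirm that forward and reverse DNS agree for every tegra. A minimal sketch, assuming the tegra-NNN.build.mtv1.mozilla.com naming pattern; the number range below is illustrative, not the real pool:

# Sketch: verify forward and reverse DNS agree for each tegra, so jobs
# cannot silently land on the wrong device via a stale mapping.
import socket

for n in range(250, 256):  # adjust to the real tegra numbers
    name = "tegra-%03d.build.mtv1.mozilla.com" % n
    try:
        addr = socket.gethostbyname(name)
        rev = socket.gethostbyaddr(addr)[0]
    except socket.error as e:
        print("%s: lookup failed (%s)" % (name, e))
        continue
    print("%s -> %s: %s" % (name, addr,
          "OK" if rev == name else "MISMATCH, reverse is %s" % rev))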
jmaher: no config setting changes would have been made, no. And I assure you they are on the same VLAN.
ns1a.build.mtv1.mozilla.com has address 10.250.48.17
ns1b.build.mtv1.mozilla.com has address 10.250.48.18
tegra-253.build.mtv1.mozilla.com has address 10.250.51.93 Bcast:10.250.51.255 Mask:255.255.252.0
All of the tegras, foopies, mv-moz2-linux, mw-32-ix, bm-remote-talos-webhost, and remaining moz2-darwin10 machines are on the same VLAN in mtv1. Unless you're seeing problems across all of those, it's extremely unlikely that it's the DHCP or DNS servers.
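The same-subnet claim can be checked from the addresses quoted above: with mask 255.255.252.0 the network is 10.250.48.0/22, which covers 10.250.48.0 through 10.250.51.255 (hence the Bcast of 10.250.51.255), so both name servers and tegra-253 fall inside it. A minimal sketch of that check, using only the addresses from this comment:

# Confirm the quoted hosts all sit inside the 255.255.252.0 (/22) subnet.
import ipaddress

hosts = {
    "ns1a.build.mtv1.mozilla.com": "10.250.48.17",
    "ns1b.build.mtv1.mozilla.com": "10.250.48.18",
    "tegra-253.build.mtv1.mozilla.com": "10.250.51.93",
}
subnet = ipaddress.ip_network("10.250.48.0/22")  # 10.250.48.0 - 10.250.51.255
for name, addr in hosts.items():
    inside = ipaddress.ip_address(addr) in subnet
    print("%-35s %-13s %s" % (name, addr,
          "in 10.250.48.0/22" if inside else "OUTSIDE"))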
Thanks for double checking. I will mark this off my list and move on with further debugging. I really appreciate the fast turnaround on this!
jmaher: a thought... if they're rebooting randomly, are you sure it's not the releng code that reboots idle tegras that's killing them?
I am *pretty* sure it's not, based on manual checks and some logging. We have a SUTAgent debug test coming down the pipe to log every command we send to the tegras and report the IP the command came from. Bear also checked the PDU logs manually a few weeks ago, as well as at the times Joel had noticed some of these reboots, and no dice as far as seeing a PDU-triggered reboot.
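The debug test described here would essentially record, per command, the peer IP and a timestamp. A minimal sketch of such a logger, assuming a line-oriented TCP command protocol and using 20701 as an illustrative listen port (neither detail is taken from this bug):

# Sketch: accept connections and log each command line with its peer address.
import socketserver, time

class LoggingHandler(socketserver.StreamRequestHandler):
    def handle(self):
        peer = self.client_address[0]
        for raw in self.rfile:  # one command per line, per the assumed protocol
            cmd = raw.decode("utf-8", "replace").strip()
            print("%s  %s  %s" % (time.strftime("%Y-%m-%d %H:%M:%S"), peer, cmd))

if __name__ == "__main__":
    socketserver.ThreadingTCPServer(("", 20701), LoggingHandler).serve_forever()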
I believe we are all set with this ticket, Joel, yes? Can I close it?
yes, we can close this. Thanks for following up.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard