Closed
Bug 777418
Opened 13 years ago
Closed 13 years ago
ensure that the network switches the tegras use have not been changed physically or updated in switch configs
Categories
(Infrastructure & Operations Graveyard :: NetOps, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: jmaher, Unassigned)
Details
We are having a lot of trouble with our android tegras rebooting. After spending a couple of weeks trying to debug this, we have not been able to find the problem.
Taking a step back, we are looking at everything to ensure we are not overlooking something unrelated to the devices themselves or the test automation code.
The questions to answer are:
1) Have any of the switches that the tegras are connected to been physically changed (replaced, or had a blade swapped out) since June 1st, 2012?
2) Have any of the switches that the tegras are connected to had their switch configs updated since June 1st, 2012? If so, can we get a diff?
3) Has the DHCP server that the tegras use been updated, rebooted or reconfigured since June 1st, 2012?
4) Have any new firewall rules or new DHCP servers been added since June 1st, 2012?
Comment 1•13 years ago
Which switches?
Comment 2•13 years ago
The DHCP servers, name servers, and tegras are all on the same network. There are no firewalls between them. They have been up for 37 days and were down only due to the power work that was done back in June. Their configurations did not change, nor are other build hosts experiencing issues. There are several switches that the tegras are attached to. They're in haxxor and 2/idf both.
Comment 3•13 years ago
Can you be more specific about "trouble with our android tegras rebooting"?
What isn't working? Link? IP address? Something else?
Has anyone looked at DHCP to confirm that the leases have not been exhausted?
Comment 4•13 years ago
To be clear, the power work was in 3/mdf, not either location where tegras are.
Comment 5•13 years ago
ravi: they have static leases, so it's not lease exhaustion. And nagios reports that they are up and responding to both ping and the agent port just fine.
Reporter
Comment 6•13 years ago
The tegras reboot mid-test, and the reboots appear to be random: across builds of fennec (xul or native), across tests, and at different spots in a test or in other cleanup/setup steps.
We don't know if we are mixing up ip addresses and scheduling jobs on the wrong devices.
:arich: do you know whether any updates needed to be done on the DHCP server during the reboot 37 days ago caused by the power outage? Maybe a config change that wasn't persisted? These devices reboot 3-4 times/test vs. the once/test that the normal test slaves do. I suspect the tegras are on a different physical network and use a different DNS and DHCP server than the other test slaves.
Comment 7•13 years ago
jmaher: no config setting changes that would have been made, no. And I assure you they are on the same VLAN.
ns1a.build.mtv1.mozilla.com has address 10.250.48.17
ns1b.build.mtv1.mozilla.com has address 10.250.48.18
tegra-253.build.mtv1.mozilla.com has address 10.250.51.93
Bcast:10.250.51.255 Mask:255.255.252.0
All of the tegras, foopies, mv-moz2-linux, mw-32-ix, bm-remote-talos-webhost, and remaining moz2-darwin10 machines are on the same VLAN in mtv1. Unless you're seeing problems across all of those, it's extremely unlikely that it's the DHCP or DNS servers.
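The addresses and netmask quoted above can be cross-checked with a few lines of Python. This is only an illustrative sketch, not something that was run as part of this bug: the helper name `subnet_of` is made up here, and it assumes the quoted mask (255.255.252.0) applies to all three hosts. Sharing an IP subnet is consistent with being on the same VLAN but does not prove it on its own.

```python
import ipaddress

# Addresses and netmask as quoted in the comment above.
NETMASK = "255.255.252.0"
HOSTS = {
    "ns1a.build.mtv1.mozilla.com": "10.250.48.17",
    "ns1b.build.mtv1.mozilla.com": "10.250.48.18",
    "tegra-253.build.mtv1.mozilla.com": "10.250.51.93",
}


def subnet_of(address, netmask=NETMASK):
    """Return the IPv4 network containing `address` under `netmask`.

    Hypothetical helper for illustration only.
    """
    return ipaddress.ip_interface(f"{address}/{netmask}").network


# Map each host name to the network it falls in.
networks = {name: subnet_of(addr) for name, addr in HOSTS.items()}

# All hosts share a subnet if only one distinct network shows up.
same_subnet = len(set(networks.values())) == 1

for name, net in networks.items():
    print(f"{name}: {net} (broadcast {net.broadcast_address})")
print("same subnet:", same_subnet)
```

With the quoted mask, all three addresses land in 10.250.48.0/22, whose broadcast address is 10.250.51.255, matching the `Bcast` value above.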
Reporter
Comment 8•13 years ago
Thanks for double checking. I will mark this off my list and move on with further debugging.
I really appreciate the fast turnaround on this!
Comment 9•13 years ago
jmaher: a thought... if they're rebooting randomly, are you sure it's not the releng code that reboots idle tegras that's killing them?
Comment 10•13 years ago
I am *pretty* sure it's not, based on manual checks and some logging. We have a SUTAgent debuggy test coming down the pipe to log every command we send to the tegras and report the IP the command came from.
Bear also manually checked the PDU logs a few weeks ago against the times when Joel had noticed some of these reboots, and no dice as far as seeing a PDU-based reboot.
Comment 11•13 years ago
I believe we are all set with this ticket, Joel, yes? Can I close it?
Reporter
Comment 12•13 years ago
Yes, we can close this. Thanks for following up.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
Updated•12 years ago
Product: mozilla.org → Infrastructure & Operations
Updated•2 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard