Closed Bug 976587 Opened 11 years ago Closed 11 years ago

PDU outlet is wrong in inventory for some tegras

Categories

(Infrastructure & Operations :: DCOps, task)

Platform: ARM, Android
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Unassigned)


While trying to resolve bug 838423, I looked into rebooting the slave because it hadn't reported work in 5 days.
Inventory reports the PDU as:
    pdu2.r201-11.tegra.releng.scl3.mozilla.com:AA19
There is an outlet A19 on that PDU, but no outlet AA19. Any tools trying to reboot that slave via the inventory PDU info (e.g. kittenherder) will be failing.
Not sure if all the outlet info for pdu2 is wrong. On this PDU, none of the outlets are named so it's impossible to tell. How hard is it to populate the outlet names on each PDU directly from inventory?
If I find any more with the wrong outlet while I'm going through tegras, I'll add them here.
tegra-108 is also wrong: listed outlet of AA7 doesn't exist.
Summary: PDU outlet is wrong in inventory for tegra-354 → PDU outlet is wrong in inventory for some tegras
Blocks: tegra-108
Blocks: tegra-114
Coop, I'm not sure this is so much wrong as maybe kittenherder just having invalid assumptions based on the old *bad* listings in inventory. Note:
    [jwood@cruncher.srv.releng.scl3 ~]$ curl -s http://slaveapi-dev1.srv.releng.scl3.mozilla.com:8080/slaves/tegra-354/actions/reboot;
    {
        "reboot": {
            "21926672": {
                "state": 2,
                "text": "Attempting SSH reboot...Failed.\nAttempting PDU reboot...Success!"
            }
        }
    }
Slaveapi *does* verify the device goes down and comes back up as well!
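For tooling that consumes this reply, here is a minimal sketch of pulling out the per-request state and the individual step messages. The JSON is copied from the output above. The meaning of the `state` values is not documented in this bug, so the code passes the number through without interpreting it. `summarize_reboot` is a hypothetical helper name, not part of slaveapi.

```python
import json

# Response body copied from the curl output in this comment.
RESPONSE = r'''{ "reboot": { "21926672": { "state": 2, "text": "Attempting SSH reboot...Failed.\nAttempting PDU reboot...Success!" } } }'''


def summarize_reboot(response_text):
    """Return a list of (request_id, state, steps) tuples.

    `state` is returned as-is; this bug does not say what each value means.
    """
    data = json.loads(response_text)
    summary = []
    for request_id, result in data.get("reboot", {}).items():
        # The "text" field holds one line per reboot method that was tried.
        steps = result.get("text", "").splitlines()
        summary.append((request_id, result.get("state"), steps))
    return summary


if __name__ == "__main__":
    for request_id, state, steps in summarize_reboot(RESPONSE):
        print(request_id, state)
        for step in steps:
            print("  " + step)
```

Splitting `text` on newlines makes it easy to see which method (SSH or PDU) actually succeeded.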
Flags: needinfo?(coop)
Blocks: tegra-345
kinda confused by this bug. every single tegra has the PDU info with AA[port number] because we were told that's how they were supposed to be added into inventory. we usually don't update outlet names AFAIK. we might have done it in the past when extra cycles permitted, but we're currently super swamped.
i just tested tegra-354.tegra.releng.scl3.mozilla.com with http://pdu2.r201-11.tegra.releng.scl3.mozilla.com:AA19 and was able to manually cycle the tegra remotely.
    --- tegra-354.tegra.releng.scl3.mozilla.com ping statistics ---
    108 packets transmitted, 39 received, 63% packet loss, time 107580ms
    rtt min/avg/max/mdev = 1.950/3.002/13.391/1.951 ms
what do you need us to do exactly?
No longer blocks: tegra-108, tegra-345, tegra-114
Blocks: tegra-278
(In reply to Justin Wood (:Callek) from comment #2)
> Coop, I'm not sure this is so much wrong as maybe kittenherder just having
> invalid assumptions based on the old *bad* listings in inventory.
>
> Note:
>
> [jwood@cruncher.srv.releng.scl3 ~]$ curl -s http://slaveapi-dev1.srv.releng.scl3.mozilla.com:8080/slaves/tegra-354/actions/reboot;
> {
>     "reboot": {
>         "21926672": {
>             "state": 2,
>             "text": "Attempting SSH reboot...Failed.\nAttempting PDU reboot...Success!"
>         }
>     }
> }
>
> Slaveapi *does* verify the device goes down and comes back up as well!
Sure, but kittenherder is *still* running trying to reboot tegras based on that data. The code you've shown above doesn't exist in production yet AFAIK.
Flags: needinfo?(coop)
(In reply to Van Le [:van] from comment #3)
> kinda confused by this bug. every single tegra has the PDU info with AA[port
> number] because we were told that's how they were supposed to be added into
> inventory.
If AA19 corresponds to A19 on the PDU outlet control page (e.g. http://pdu1.r201-11.tegra.releng.scl3.mozilla.com/main.html?1,1), then there's nothing to do here and you can close this bug. Provided Callek's code lands in finite time, it won't be an issue.
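To make the AA19 vs. A19 question concrete, here is a minimal sketch of how a reboot tool might split an inventory PDU entry and translate the outlet label. It assumes, as suggested above but not confirmed in this bug, that the inventory label is the PDU's own label with the leading letter doubled. `parse_inventory_pdu` and `inventory_outlet_to_pdu_label` are hypothetical names, not kittenherder or slaveapi functions.

```python
import re

# ASSUMPTION: inventory writes the outlet with its leading letter doubled
# ("AA19"), while the PDU control page shows it once ("A19").
_DOUBLED_LETTER_RE = re.compile(r"^([A-Z])\1(\d+)$")


def parse_inventory_pdu(entry):
    """Split an inventory entry 'pdu-hostname:outlet' into (hostname, outlet)."""
    host, sep, outlet = entry.rpartition(":")
    if not sep or not host or not outlet:
        raise ValueError("expected 'hostname:outlet', got %r" % (entry,))
    return host, outlet


def inventory_outlet_to_pdu_label(outlet):
    """Map an inventory outlet label to the label assumed to be on the PDU.

    Labels that do not match the doubled-letter pattern are returned unchanged.
    """
    match = _DOUBLED_LETTER_RE.match(outlet)
    if match is None:
        return outlet
    return match.group(1) + match.group(2)


if __name__ == "__main__":
    host, outlet = parse_inventory_pdu(
        "pdu2.r201-11.tegra.releng.scl3.mozilla.com:AA19"
    )
    print(host, inventory_outlet_to_pdu_label(outlet))
```

Returning unmatched labels unchanged keeps the helper safe to run on entries that already use the PDU's own naming.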
spoke to :coop on IRC to clarify some things. ok to close per coop.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations