Closed
Bug 976587
Opened 11 years ago
Closed 11 years ago
PDU outlet is wrong in inventory for some tegras
Categories
(Infrastructure & Operations :: DCOps, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: coop, Unassigned)
References
Details
While trying to resolve bug 838423, I looked into rebooting the slave because it hadn't reported work in 5 days.
Inventory reports the PDU as: pdu2.r201-11.tegra.releng.scl3.mozilla.com:AA19
There is an outlet A19 on that PDU, but no outlet AA19. Any tools trying to reboot that slave via the inventory PDU info (e.g. kittenherder) will be failing.
Not sure if all the outlet info for pdu2 is wrong. On this PDU, none of the outlets are named so it's impossible to tell. How hard is it to populate the outlet names on each PDU directly from inventory?
If I find any more with the wrong outlet while I'm going through tegras, I'll note them here.
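For illustration only, the failure described above can be sketched as a simple lookup: split the inventory string into PDU host and outlet, then check whether that outlet is among the labels the PDU exposes. This is not kittenherder's actual code. `split_pdu_entry` and `labels` are hypothetical names, and the label set (A1 through A24) is an assumed stand-in for what the PDU control page lists.

```python
def split_pdu_entry(entry):
    """Split an inventory PDU string such as 'host:AA19' into (host, outlet)."""
    host, sep, outlet = entry.rpartition(":")
    if not sep or not host or not outlet:
        raise ValueError(f"expected 'host:outlet', got {entry!r}")
    return host, outlet


# ASSUMPTION: hypothetical set of outlet labels shown by the PDU (A1..A24).
labels = {f"A{n}" for n in range(1, 25)}

host, outlet = split_pdu_entry("pdu2.r201-11.tegra.releng.scl3.mozilla.com:AA19")

# A tool that uses the inventory outlet verbatim finds no matching label.
print(host, outlet, outlet in labels)
```

Under these assumptions the check fails for AA19 while A19 would pass, which is the mismatch reported here.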
Reporter
Comment 1•11 years ago
tegra-108 is also wrong: listed outlet of AA7 doesn't exist.
Summary: PDU outlet is wrong in inventory for tegra-354 → PDU outlet is wrong in inventory for some tegras
Comment 2•11 years ago
Coop, I'm not sure this is so much wrong as maybe kittenherder just having invalid assumptions based on the old *bad* listings in inventory.
Note:
[jwood@cruncher.srv.releng.scl3 ~]$ curl -s http://slaveapi-dev1.srv.releng.scl3.mozilla.com:8080/slaves/tegra-354/actions/reboot;
{
  "reboot": {
    "21926672": {
      "state": 2,
      "text": "Attempting SSH reboot...Failed.\nAttempting PDU reboot...Success!"
    }
  }
}
Slaveapi *does* verify the device goes down and comes back up as well!
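A rough sketch of what "goes down and comes back up" verification could look like, for readers unfamiliar with the idea. This is not slaveapi's implementation. `wait_for_reboot` and the `is_alive` probe are hypothetical; the probe is whatever reachability check the caller supplies (a single ping, for example).

```python
import time


def wait_for_reboot(is_alive, timeout=300.0, interval=5.0):
    """Return True only if the device is seen down and then up again in time.

    is_alive: caller-supplied callable returning True when the device responds.
    """
    deadline = time.monotonic() + timeout
    seen_down = False
    while time.monotonic() < deadline:
        if not is_alive():
            seen_down = True   # the power cycle took the device offline
        elif seen_down:
            return True        # back up after having been down
        time.sleep(interval)
    return False               # never went down, or never came back


# Usage with a fake probe: up, down, down, up again.
results = iter([True, False, False, True])
print(wait_for_reboot(lambda: next(results), timeout=5, interval=0))
```

Requiring the "down" observation first is the point of the design: a device that simply stays up is not counted as a successful reboot.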
Flags: needinfo?(coop)
Comment 3•11 years ago
kinda confused by this bug. every single tegra has the PDU info with AA[port number] because we were told that's how they were supposed to be added into inventory.
we usually don't update outlet names AFAIK. we might have done it in the past when extra cycles permit but currently super swamped.
i just tested tegra-354.tegra.releng.scl3.mozilla.com with http://pdu2.r201-11.tegra.releng.scl3.mozilla.com:AA19 and was able to manually cycle the tegra remotely.
--- tegra-354.tegra.releng.scl3.mozilla.com ping statistics ---
108 packets transmitted, 39 received, 63% packet loss, time 107580ms
rtt min/avg/max/mdev = 1.950/3.002/13.391/1.951 ms
what do you need us to do exactly?
Reporter
Comment 4•11 years ago
(In reply to Justin Wood (:Callek) from comment #2)
> Coop, I'm not sure this is so much wrong as maybe kittenherder just having
> invalid assumptions based on the old *bad* listings in inventory.
>
> Note:
>
> [jwood@cruncher.srv.releng.scl3 ~]$ curl -s http://slaveapi-dev1.srv.releng.scl3.mozilla.com:8080/slaves/tegra-354/actions/reboot;
> {
>   "reboot": {
>     "21926672": {
>       "state": 2,
>       "text": "Attempting SSH reboot...Failed.\nAttempting PDU reboot...Success!"
>     }
>   }
> }
>
>
> Slaveapi *does* verify the device goes down and comes back up as well!
Sure, but kittenherder is *still* running and trying to reboot tegras based on that data. The code you've shown above doesn't exist in production yet AFAIK.
Flags: needinfo?(coop)
Reporter
Comment 5•11 years ago
(In reply to Van Le [:van] from comment #3)
> kinda confused by this bug. every single tegra has the PDU info with AA[port
> number] because we were told that's how they were supposed to be added into
> inventory.
If AA19 corresponds to A19 on the PDU outlet control page (e.g. http://pdu1.r201-11.tegra.releng.scl3.mozilla.com/main.html?1,1), then there's nothing to do here and you can close this bug. Provided Callek's code lands in finite time, it won't be an issue.
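If that correspondence holds, a tool could translate the inventory value before talking to the PDU. A minimal sketch, assuming the inventory form is two letters plus a number and the PDU label is the second letter plus the number. `inventory_to_pdu_label` is a hypothetical helper, and what the leading letter stands for is not stated in this bug.

```python
import re


def inventory_to_pdu_label(outlet):
    """Map an inventory outlet such as 'AA19' to a PDU label such as 'A19'.

    ASSUMPTION: the first of the two leading letters is a prefix that the
    PDU outlet control page does not display.
    """
    match = re.fullmatch(r"([A-Z])([A-Z])(\d+)", outlet)
    if match is None:
        raise ValueError(f"unexpected inventory outlet format: {outlet!r}")
    _prefix, bank, number = match.groups()
    return f"{bank}{int(number)}"


print(inventory_to_pdu_label("AA19"))
print(inventory_to_pdu_label("AA7"))
```

Anything that does not match the assumed format raises an error rather than guessing, so a genuinely bad inventory entry would still surface.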
Comment 6•11 years ago
spoke to :coop on IRC to clarify some things. ok to close per coop.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Infrastructure & Operations