Closed Bug 623299 Opened 14 years ago Closed 13 years ago

ip power switch reboot solution for tegras

Categories

(Release Engineering :: General, defect, P3)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mozilla, Assigned: bear)

References

Details

The tegras are getting regularly wedged in a state that requires a hard reboot.
Power cycling at the switch gets them back to green.

We're ordering ip power switches for these; we'll need to map power switch ip/plug to each tegra.
Once we detect that a tegra is in such a wedged state, we need an automated way to kick it. Bonus points for setting that buildbot job to retry.
If it's IP-addressable, we should be able to trigger it either from a script run by clientproxy.py or by clientproxy.py doing the socket connect itself.
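A minimal sketch of the lookup clientproxy.py would need, assuming a hypothetical tegra-to-PDU map (the PDU hostnames and outlet numbers below are made up for illustration; the real values would come from inventory once the hardware lands):

```python
# Hypothetical map of tegra name -> (PDU host, outlet number).
# The real mapping has to be filled in from inventory.
TEGRA_PDU_MAP = {
    "tegra-055": ("pdu1.example.mozilla.com", 1),
    "tegra-056": ("pdu1.example.mozilla.com", 2),
}

def pdu_for_tegra(tegra):
    """Return the (PDU host, outlet) pair needed to power-cycle a
    wedged tegra, or None if the tegra isn't in the map yet."""
    return TEGRA_PDU_MAP.get(tegra)
```

With that in place, the wedged-state detection code just looks up the tegra and hands the (host, outlet) pair to whatever actually kicks the PDU.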
Is this now in IT's hands? Are they doing the ordering/follow-up?
Blocks: 610600
The ordering is in IT's hands. This bug is for the scripting solution, and is in our hands, but blocked on hardware.
Didn't we talk about getting these in early December at the all-hands? Is there somebody who can push on IT to get these ordered?
Zandr is in charge of that.
I was given a several-week (~3) ETA a couple weeks ago for a new Tegra server room, since Haxxor is running out of power and the Tegras don't use wifi.  The new location will have the ip-controlled power.  I haven't heard any change in that ETA, nor have I heard any news, but Zandr's cc'ed and I bet he'll chime in.

Having said that, we've largely stabilized the Tegras on the clientproxy side, so manual intervention is needed much less frequently (as opposed to multiple times a day, we're down to once or twice a week if we continue at the current rate). That makes these a nice-to-have, rather than a we-must-have-these-to-stay-running-at-all urgency.
Taking this off my list; we'll need to revisit owner when these arrive.
Assignee: aki → nobody
Priority: -- → P3
Assignee: nobody → bear
Aki: is this a front-burner item now, i.e. do we have the required hardware/networking (I don't see any dependent bugs)? 

Is this still something you want to have bear take on?

What's a reasonable timeframe for getting this work completed?
In theory I have all that I need to start using these - I just have not tested the access yet.  My thought was that I should be able to switch to this later tonight or tomorrow.
Tegras 55-93 are on the PDUs now. I'll get the port names filled in and publish IP info in this bug by the end of the day, and will add the IP addresses for the PDU as the "OOB IP" in inventory.

The PDUs have a web interface; that should be sufficient out of the gate.
(In reply to comment #7)
> Aki: is this a front-burner item now, i.e. do we have the required
> hardware/networking (I don't see any dependent bugs)? 

Afaik yes, we have tegra-055 through tegra-094 plugged into the appropriate network-addressable PDUs.

The documentation was "go to the website" and we don't have a map of PDU IPs/ports <-> tegras yet, afaik.  That full map is a requirement to be able to run this for real, but is not a requirement to be able to write something that addresses the API.

> Is this still something you want to have bear take on?

I would put it at a non-critical priority unless devices going offline is at a high enough sustained level that we need to automate it.  If we're losing a handful a day, or even one or two every other hour, we can work around via bugs or manual network PDU resets instead of using an automated script.

If Bear's interested in this, I'm happy to let him do so.

> What's a reasonable timeframe for getting this work completed?

I don't know enough here to say.
I would think, without having looked at the docs or knowing how the PDUs work, that we can

a) come up with a plan for how the script will work
b) write a proof of concept script
c) get a map of IPs/ports <-> tegras
d) beef up the poc script to work on a pool in staging
e) roll out to production

where (c) can be done in parallel with (a) and (b).
I think how long (b) will take is dependent on (a), which is dependent on how good the docs are.

And, oh, looks like some of this is already addressed by comments that landed while I was typing.
(In reply to comment #10)
> (In reply to comment #7)
> > Aki: is this a front-burner item now, i.e. do we have the required
> > hardware/networking (I don't see any dependent bugs)? 
> 
> Afaik yes, we have tegra-055 through tegra-094 plugged into the appropriate
> network-addressable PDUs.
> 
> The documentation was "go to the website" and we don't have a map of PDU
> IPs/ports <-> tegras yet, afaik.  That full map is a requirement to be able to
> run this for real, but is not a requirement to be able to write something that
> addresses the API.
> 
> > Is this still something you want to have bear take on?
> 
> I would put it at a non-critical priority unless devices going offline is at a
> high enough sustained level that we need to automate it.  If we're losing a
> handful a day, or even one or two every other hour, we can work around via bugs
> or manual network PDU resets instead of using an automated script.
> 
> If Bear's interested in this, I'm happy to let him do so.

yep, interested :)

> 
> > What's a reasonable timeframe for getting this work completed?
> 
> I don't know enough here to say.
> I would think, without having looked at the docs or knowing how the PDUs work,
> that we can
> 
> a) come up with a plan for how the script will work
> b) write a proof of concept script
> c) get a map of IPs/ports <-> tegras
> d) beef up the poc script to work on a pool in staging
> e) roll out to production
> 
> where (c) can be done in parallel with (a) and (b).
> I think how long (b) will take is dependent on (a), which is dependent on how
> good the docs are.

I talked some of this through with zandr last week. We have 4 avenues for using the PDUs:

1) via web interface
2) by a script making POST calls in python to the web interface
3) command-line calls to snmp tools via python
4) scripting via python snmp library

Item 4 is not realistic, as item 3 will give the same result with far less complexity.

Item 3 is where I want to end up, but as Aki mentioned above, item 1 will do for now.
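A hedged sketch of what item 3 (shelling out to snmp command-line tools from python) could look like. The OID, "cycle" action value, and community string below are placeholders, not the real PDU configuration - the actual outlet-control OID has to come from the vendor's MIB for these PDUs:

```python
import subprocess

# Placeholder outlet-control OID and "cycle" action code; the real
# values depend on the PDU vendor's MIB, so treat these as hypothetical.
OUTLET_CONTROL_OID = "1.3.6.1.4.1.99999.1.1"
REBOOT_VALUE = "3"

def build_snmpset_cmd(pdu_host, outlet, community="private"):
    """Build the snmpset argv that would cycle a single outlet."""
    oid = "%s.%d" % (OUTLET_CONTROL_OID, outlet)
    return ["snmpset", "-v1", "-c", community, pdu_host, oid, "i", REBOOT_VALUE]

def reboot_tegra(pdu_host, outlet):
    """Run snmpset for the given PDU/outlet; True if it exited cleanly."""
    return subprocess.call(build_snmpset_cmd(pdu_host, outlet)) == 0
```

Separating command construction from execution keeps the snmp details testable without actually power-cycling anything.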



> 
> And, oh, looks like some of this is already addressed by comments that landed
> while I was typing.
[15:18]	<bear>	aki - i'm inclined to close bug 623299 since we can use the web interface

Stability + network PDU web interface means I'm fine if we WONTFIX or P5+unassign for now.
WONTFIX'ing - the web interface is working just fine
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → Release Engineering