Closed Bug 623299 Opened 14 years ago Closed 13 years ago

ip power switch reboot solution for tegras

Categories

(Release Engineering :: General, defect, P3)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mozilla, Assigned: bear)

References

Details

The tegras are getting regularly wedged in a state that requires a hard reboot.
Power cycling at the switch gets them back to green.

We're ordering ip power switches for these; we'll need to map power switch ip/plug to each tegra.
Once we detect that a tegra is in such a wedged state, we need an automated way to kick it. Bonus points for setting that buildbot job to retry.
If it's IP-addressable, we should be able to trigger it either from a script run by clientproxy.py or by clientproxy.py doing the socket connect itself.
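A minimal sketch of the lookup clientproxy.py would need, assuming a hypothetical tegra-to-PDU map (the PDU hostnames and outlet numbers below are made up for illustration; the real values would come from inventory once the hardware lands):

```python
# Hypothetical map of tegra name -> (PDU host, outlet number).
# The real mapping has to be filled in from inventory.
TEGRA_PDU_MAP = {
    "tegra-055": ("pdu1.example.mozilla.com", 1),
    "tegra-056": ("pdu1.example.mozilla.com", 2),
}

def pdu_for_tegra(tegra):
    """Return the (PDU host, outlet) pair needed to power-cycle a
    wedged tegra, or None if the tegra isn't in the map yet."""
    return TEGRA_PDU_MAP.get(tegra)
```

With that in place, the wedged-state detection code just looks up the tegra and hands the (host, outlet) pair to whatever actually kicks the PDU.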
Is this now in IT's hands? Are they doing the ordering/follow-up?
Blocks: 610600
The ordering is in IT's hands. This bug is for the scripting solution, and is in our hands, but blocked on hardware.
Didn't we talk about getting these in early December at the all-hands? Is there somebody who can push on IT to get these ordered?
Zandr is in charge of that.
I was given a several-week (~3) ETA a couple weeks ago for a new Tegra server room, since Haxxor is running out of power and the Tegras don't use wifi.  The new location will have the ip-controlled power.  I haven't heard any change in that ETA, nor have I heard any news, but Zandr's cc'ed and I bet he'll chime in.

Having said that, we've largely stabilized the Tegras on the clientproxy side, so manual intervention is needed much less frequently (as opposed to multiple times a day, we're down to once or twice a week if we continue at the current rate). That makes these a nice-to-have, rather than a we-must-have-these-to-stay-running-at-all urgency.
Taking this off my list; we'll need to revisit owner when these arrive.
Assignee: aki → nobody
Priority: -- → P3
Assignee: nobody → bear
Aki: is this a front-burner item now, i.e. do we have the required hardware/networking (I don't see any dependent bugs)? 

Is this still something you want to have bear take on?

What's a reasonable timeframe for getting this work completed?
In theory I have all that I need to start using these - I just have not tested the access yet.  My thought was that I should be able to switch to this later tonight or tomorrow.
Tegras 55-93 are on the PDUs now. I'll get the port names filled in and publish IP info in this bug by the end of the day, and will add the IP addresses for the PDU as the "OOB IP" in inventory.

The PDUs have a web interface; that should be sufficient out of the gate.
(In reply to comment #7)
> Aki: is this a front-burner item now, i.e. do we have the required
> hardware/networking (I don't see any dependent bugs)? 

Afaik yes, we have tegra-055 through tegra-094 plugged into the appropriate network-addressable PDUs.

The documentation was "go to the website" and we don't have a map of PDU IPs/ports <-> tegras yet, afaik.  That full map is a requirement to be able to run this for real, but is not a requirement to be able to write something that addresses the API.

> Is this still something you want to have bear take on?

I would put it at a non-critical priority unless devices going offline is at a high enough sustained level that we need to automate it.  If we're losing a handful a day, or even one or two every other hour, we can work around via bugs or manual network PDU resets instead of using an automated script.

If Bear's interested in this, I'm happy to let him do so.

> What's a reasonable timeframe for getting this work completed?

I don't know enough here to say.
I would think, without having looked at the docs or knowing how the PDUs work, that we can

a) come up with a plan for how the script will work
b) write a proof of concept script
c) get a map of IPs/ports <-> tegras
d) beef up the poc script to work on a pool in staging
e) roll out to production

where (c) can be done in parallel with (a) and (b).
I think how long (b) will take is dependent on (a), which is dependent on how good the docs are.

And, oh, looks like some of this is already addressed by comments that landed while I was typing.
(In reply to comment #10)
> (In reply to comment #7)
> > Aki: is this a front-burner item now, i.e. do we have the required
> > hardware/networking (I don't see any dependent bugs)? 
> 
> Afaik yes, we have tegra-055 through tegra-094 plugged into the appropriate
> network-addressable PDUs.
> 
> The documentation was "go to the website" and we don't have a map of PDU
> IPs/ports <-> tegras yet, afaik.  That full map is a requirement to be able to
> run this for real, but is not a requirement to be able to write something that
> addresses the API.
> 
> > Is this still something you want to have bear take on?
> 
> I would put it at a non-critical priority unless devices going offline is at a
> high enough sustained level that we need to automate it.  If we're losing a
> handful a day, or even one or two every other hour, we can work around via bugs
> or manual network PDU resets instead of using an automated script.
> 
> If Bear's interested in this, I'm happy to let him do so.

yep, interested :)

> 
> > What's a reasonable timeframe for getting this work completed?
> 
> I don't know enough here to say.
> I would think, without having looked at the docs or knowing how the PDUs work,
> that we can
> 
> a) come up with a plan for how the script will work
> b) write a proof of concept script
> c) get a map of IPs/ports <-> tegras
> d) beef up the poc script to work on a pool in staging
> e) roll out to production
> 
> where (c) can be done in parallel with (a) and (b).
> I think how long (b) will take is dependent on (a), which is dependent on how
> good the docs are.

I talked some of this through with zandr last week. We have 4 avenues for using the PDUs:

1) via web interface
2) by a script making POST calls in python to the web interface
3) command-line calls to snmp tools via python
4) scripting via python snmp library

Item 4 is not realistic, as item 3 will give the same result with far less complexity.

Item 3 is where I want to end up, but as Aki mentioned above, item 1 will do for now.
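A hedged sketch of what item 3 (shelling out to snmp command-line tools from python) could look like. The OID, "cycle" action value, and community string below are placeholders, not the real PDU configuration - the actual outlet-control OID has to come from the vendor's MIB for these PDUs:

```python
import subprocess

# Placeholder outlet-control OID and "cycle" action code; the real
# values depend on the PDU vendor's MIB, so treat these as hypothetical.
OUTLET_CONTROL_OID = "1.3.6.1.4.1.99999.1.1"
REBOOT_VALUE = "3"

def build_snmpset_cmd(pdu_host, outlet, community="private"):
    """Build the snmpset argv that would cycle a single outlet."""
    oid = "%s.%d" % (OUTLET_CONTROL_OID, outlet)
    return ["snmpset", "-v1", "-c", community, pdu_host, oid, "i", REBOOT_VALUE]

def reboot_tegra(pdu_host, outlet):
    """Run snmpset for the given PDU/outlet; True if it exited cleanly."""
    return subprocess.call(build_snmpset_cmd(pdu_host, outlet)) == 0
```

Separating command construction from execution keeps the snmp details testable without actually power-cycling anything.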



> 
> And, oh, looks like some of this is already addressed by comments that landed
> while I was typing.
[15:18]	<bear>	aki - i'm inclined to close bug 623299 since we can use the web interface

Stability + network PDU web interface means I'm fine if we WONTFIX or P5+unassign for now.
WONTFIX'ing - the web interface is working just fine
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → Release Engineering