Recover remaining pandas after scl1 power outage

RESOLVED FIXED

Status

Priority: P3
Severity: normal
Resolution: RESOLVED FIXED
Reported: 5 years ago
Last modified: 3 months ago

People

(Reporter: Callek, Assigned: Callek)

Details

(Whiteboard: [buildduty][buildslaves][capacity])

(Assignee)

Description

5 years ago
We have a large number of pandas that have not come back up properly after the scl1 power outage and are alerting in nagios

DOWN :CRITICAL: in state failed_pxe_booting

We should powercycle them, and reimage if necessary.

We do have enough capacity back in place to last the weekend though

Comment 1

5 years ago
I don't understand how we (releng) can recover these machines.  In mozpool, if the device is in the failed_pxe_booting state, a request to reboot or reimage it is ignored (looking in the logs).  I tried to relay-reboot a few of them too, and they are still in the failed_pxe_booting state. Cc'ing :dividehex for guidance on how to deal with these machines.
Flags: needinfo?(jwatkins)
Comment 2

5 years ago

Unless this has changed recently, panda boards allocated for Android are not managed by mozpool and therefore should have their state set to 'locked_out' in mozpool.  I see a lot of devices in the lifeguard component that should be locked_out according to their comment field but are not set to that state.

If a device is in a failed_* state, it means the device was NOT locked_out and that mozpool has done ALL it can to try and recover the device (via rebooting/selftest).  You can attempt to force another selftest under the lifeguard component by selecting "Please_self_test" but if the device ends up back in a failed_* state, then it needs to be added to the Bad Panda Bug so I or DCOPS can look at them.
https://bugzilla.mozilla.org/show_bug.cgi?id=bad-panda-log
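A minimal sketch of the triage flow just described (retry the self-test once, then escalate). The helper callbacks here (`force_self_test`, `add_to_bad_panda_bug`) are hypothetical stand-ins for the lifeguard "please_self_test" action and filing on the bad panda bug; this is not mozpool's actual API.

```python
def triage(device, force_self_test, add_to_bad_panda_bug):
    """Return the action taken for a device based on its lifeguard state.

    Illustrative only: device is a plain dict, and the two callbacks
    stand in for lifeguard UI actions.
    """
    if not device["state"].startswith("failed_"):
        # Mozpool is still managing the device; leave it alone.
        return "no_action"
    if not device.get("retried_self_test"):
        # One manual retry via "please_self_test" in the lifeguard UI.
        device["retried_self_test"] = True
        force_self_test(device)
        return "self_test"
    # Back in a failed_* state after the retry: escalate to the
    # bad panda bug so DCOPS can look at the hardware.
    add_to_bad_panda_bug(device)
    return "escalated"
```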
Flags: needinfo?(jwatkins)
Comment 3

5 years ago

I see from https://bugzilla.mozilla.org/show_bug.cgi?id=829211#c37 that Android pandas are now being managed via mozpool.  (That's great!)  We should probably clear all the erroneous comments in mozpool if they are no longer being 'locked_out'.
(Assignee)

Comment 4

5 years ago
This might make more sense as fallout from the scl1 power outage now....

http://mxr.mozilla.org/build/source/mozpool/mozpool/lifeguard/devicemachine.py#145

and from a log of forcing a self-test: http://mobile-imaging-001.p1.releng.scl1.mozilla.com/ui/log.html?device=panda-0734

We are using the *sut* reboot, not actual relayboard rebooting.

we've previously (in our automation) found that rebooting via SUT was failure-prone for pandas, and had to use the relayboard.

I also note that, per the logs, it has tried SUT rebooting far more than 5 times, which is contrary to the self.TRY_RELAY_AFTER_SUT_COUNT clause there, so I defer to you to find out what's up here.
Flags: needinfo?(jwatkins)
Comment 5

5 years ago

It tries to use SUT five times, then cycles the power, as you can see in the log you linked.  It's not surprising that the SUT reboot fails, since the device is probably powered down.  It is surprising that the power cycle fails.
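The escalation just described (SUT reboot up to five times, then a relay power cycle) can be sketched as follows. The function names and return values are illustrative, not mozpool internals; only the constant name comes from the source linked in comment 4.

```python
# Matches the constant referenced in comment 4 (devicemachine.py).
TRY_RELAY_AFTER_SUT_COUNT = 5

def reboot_device(sut_reboot, relay_power_cycle):
    """Attempt SUT reboots first; on repeated failure, cycle power via relay.

    Returns which mechanism succeeded ("sut" or "relay"), or None if both
    failed.  The two callables are hypothetical hooks, not mozpool's API.
    """
    for _ in range(TRY_RELAY_AFTER_SUT_COUNT):
        if sut_reboot():
            return "sut"
    # SUT is unreachable (e.g. the board is powered down), so fall
    # back to cutting power at the relay board.
    if relay_power_cycle():
        return "relay"
    return None
```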

That relay board (panda-relay-066) is responding to HTTP, and the nagios comms checks are good.  The power to panda-0734 appears to be on:

[root@mobile-imaging-008.p8.releng.scl1.mozilla.com ~]# /opt/mozpool/frontend/bin/relay status panda-relay-066.p8.releng.scl1.mozilla.com 1 4
bank 1, relay 4 status: on

and a 'relay powercycle' accomplishes nothing.

The failed_pxe_booting state does not seem limited to any one relay board, and in fact only 28 pandas are in that state.  So I think the best guess is that something's amiss with those pandas, and they should go into the bad panda bug.  That's not a surprising number given that these pandas have not, until now, been managed by mozpool.

I'm not sure how 28 is a 'large number' though.  Note that there are still 254 pandas in the locked_out state.  Probably activating some of those is a better use of time than chasing down the remaining 28.
Flags: needinfo?(jwatkins)
(Assignee)

Comment 6

5 years ago
No it is indeed a large number, since *ALL* pandas have to be rebooted manually via lifeguard.

The ones that are/were in 'ready' state end up in the failed_rebooting state after doing so, until they are rebooted nothing works when connecting and thus we had no capacity.

The sut issue is that it tries sut 5 times, (but note we should not be using sut reboot *ever* on pandas by default, only used when we have no relayboard)

And then it does nothing and then tries to selftest again using sut again.
Comment 7

5 years ago

(In reply to Justin Wood (:Callek) from comment #6)
> No it is indeed a large number, since *ALL* pandas have to be rebooted
> manually via lifeguard.

At least some pandas are working, so I don't think you mean "all".  I'm not sure what you mean by "manually via lifeguard" either - the lifeguard please_power_cycle option tries to reboot via SUT and, failing that, via relay - just as you see in the logs for panda-0734.

> The ones that are/were in 'ready' state end up in the failed_rebooting state
> after doing so, until they are rebooted nothing works when connecting and
> thus we had no capacity.

ITYM failed_pxe_booting.  There is no failed_rebooting state.  If they've failed rebooting, why are they in the ready state?

> The sut issue is that it tries sut 5 times, (but note we should not be using
> sut reboot *ever* on pandas by default, only used when we have no relayboard)

This is incorrect.  Mozpool tries a SUT reboot 5 times if the DB indicates that SUT should be running on the device, and if that fails, then performs a relay power cycle.  As you've seen from the logs, that whole process repeats several times before a final failure.

> And then it does nothing and then tries to selftest again using sut again.

This is incorrect - see the logs.
(Assignee)

Comment 8

5 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> (In reply to Justin Wood (:Callek) from comment #6)
> > No it is indeed a large number, since *ALL* pandas have to be rebooted
> > manually via lifeguard.
> 
> At least some pandas are working, so I don't think you mean "all".  I'm not
> sure what you mean by "manually via lifeguard" either - the lifeguard
> please_power_cycle option tries to reboot via SUT and, failing that, via
> relay - just as you see in the logs for panda-0734.

I incorrectly said ALL... I meant "ALL ... had, and I only did some so far, so many more still need to be done".

And yes, it has to be done via lifeguard, since the request logic of mozpool doesn't kick in until buildbot starts up, which doesn't happen until verify.py passes, which checks that SUT is up and the basics as well.


> > The ones that are/were in 'ready' state end up in the failed_rebooting state
> > after doing so, until they are rebooted nothing works when connecting and
> > thus we had no capacity.
> 
> ITYM failed_pxe_booting.  There is no failed_rebooting state.  If they've
> failed rebooting, why are they in the ready state?

No, I mean ready. When these hosts came up after the power issue in scl1, all devices retained their last-known state, which was 'ready' for many of them. Only after forcing the reboot did the handful that I asked Jake and you about end up in failed_pxe_booting.

> > The sut issue is that it tries sut 5 times, (but note we should not be using
> > sut reboot *ever* on pandas by default, only used when we have no relayboard)
> 
> This is incorrect.  Mozpool tries a SUT reboot 5 times if the DB indicates
> that SUT should be running on the device, and if that fails, then performs a
> relay power cycle.  As you've seen from the logs, that whole process repeats
> several times before a final failure.

By "should" here I mean should based on what releng + ateam came up with as hard requirements, due to inefficient/unsafe rebooting via SUTAgent for pandas. That is why we had to ensure relay board rebooting in automation. So if the logic is flipped in mozpool, we need to correct that.

> > And then it does nothing and then tries to selftest again using sut again.
> 
> This is incorrect - see the logs.

I see no evidence of this on the failed_pxe_booting ones, which don't have any error message from failing to relay-reboot, and do have many attempts at SUT rebooting (but no failure messages from that either).
Comment 9

5 years ago

We sorted this out in IRC.  Salient points:

* Devices are being power-cycled when the SUT reboots fail.  There's less logging about that (only "initiating power cycle") than about the SUT reboots, but it is occurring.

* Devices in the 'ready' state which are not associated with an open request are SUT verified or pinged every 10m (depending on whether or not they have SUT installed).  There's not a lot of logging for the pings.  If the check fails, the device is automatically self-tested.  This took care of starting most of the pandas once the imaging servers were up.

* The decision to prefer relay over sut reboots didn't reach mozpool - bug 888945.

* Kim, Mark, Callek, and I will meet to discuss the -- apparently substantial -- impedance mismatch between Mozpool and the releng automation.
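The 'ready'-state monitoring summarized in the second point can be sketched as one check pass per device: every 10 minutes a device with no open request gets a SUT verify (if it has SUT installed) or a ping, and a failed check triggers an automatic self-test. All names here are hypothetical, not mozpool's real API.

```python
def check_ready_device(device, sut_verify, ping, start_self_test):
    """Run one monitoring pass for a device in the 'ready' state.

    Illustrative only: device is a plain dict, and the callables stand
    in for the SUT-verify, ping, and self-test machinery.
    """
    if device.get("open_request"):
        # Devices tied to an open request are left alone.
        return "skipped"
    # SUT verify when SUT is installed, otherwise just a ping.
    check = sut_verify if device.get("has_sut") else ping
    if check(device):
        return "ok"
    # Failed check -> kick off an automatic self-test.
    start_self_test(device)
    return "self_test"
```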

Over to Callek for any further work necessary on this bug.
Assignee: nobody → bugspam.Callek
(Assignee)

Comment 10

5 years ago
After understanding things more (that mozpool puts them in a self-test mode if there is no image requested...) I chose to image all the remaining ones as Android, which should get us into a sane state.

This bug as filed is resolved.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering

Updated

3 months ago
Product: Release Engineering → Infrastructure & Operations