Bug 834568 (Closed)
Opened 13 years ago • Closed 12 years ago
/failed_.*_pinging/ states should be replaced with a self-test
Categories: Testing Graveyard :: Mozpool, defect
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: dustin; Assignee: dustin
Attachments (1 file)
bug834568.patch (40.91 KB, patch) • mcote: review+
It seems these states are more often the result of a bogus image than of a failure to image, at least for failed_b2g_pinging.
The easy solution is just to transition to 'self_test' instead of the failed state.
The tricky bit is that if mozpool is waiting for the device, it needs to be told what's wrong. If the self-test succeeds, the device will end up in the 'free' state while it still has a request assigned, and that's not a state that will make much sense to Mozpool.
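As a rough illustration of the proposed transition (a sketch only; the Machine stub stands in for mozpool's real state machine driver, and everything except goto_state, which appears in the patch below, is an invented name):

    # Sketch: route ping failures into self_test instead of a
    # terminal failed_* state, and let the self-test decide.
    class Machine(object):
        def goto_state(self, name):
            print('-> %s' % name)

    class b2g_pinging(object):
        def __init__(self, machine):
            self.machine = machine

        def on_ping_failure(self, args):
            # previously: self.machine.goto_state('failed_b2g_pinging')
            self.machine.goto_state('self_test')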
I don't have good answers for the following questions:
- How does lifeguard communicate this event to mozpool? An HTTP call? What if *that* fails?
- What does mozpool do when this occurs?
In the very short term, we'll get alerted of these failures by nagios, and can trigger a self-test manually. But that will get tiring very quickly!
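To make the first question concrete, here is one hedged shape the notification could take (Python 2 urllib2, to match the codebase's era; the URL and payload are hypothetical, not mozpool's actual API), with the "what if *that* fails" case handled only by logging and falling back to mozpool's own timeouts:

    import json
    import logging
    import urllib2

    def notify_mozpool(server, device_name, event):
        # Hypothetical endpoint; mozpool's real API may differ.
        url = 'http://%s/api/device/%s/event/%s/' % (server, device_name, event)
        try:
            urllib2.urlopen(url, data=json.dumps({}), timeout=30)
        except Exception:
            # If the notification itself fails, do nothing more here:
            # the request's 'pending' timeout in mozpool is the backstop.
            logging.exception('failed to notify mozpool of %s for %s',
                              event, device_name)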
Assignee
Comment 1 • 13 years ago
I think I need some advice from the requests guru on this one :)
Assignee: nobody → dustin
Flags: needinfo?(mcote)
Comment 2 • 13 years ago
I think it might be worth adding a power cycle (or two) to try to recover from failed pinging. There are situations where the image is good but the hardware just didn't initialize properly. I've seen instances (on Linux, Android, and B2G) where the Panda board's ethernet driver fails to load and a simple power cycle brings it back, which leads me to believe this issue goes all the way upstream to the kernel itself.
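A sketch of that retry logic, reusing the counter pattern that shows up in the eventual patch below (increment_counter and a failure limit); the state names and the Machine stub are illustrative:

    MAX_POWER_CYCLES = 2  # "a power cycle (or two)"

    class Machine(object):
        # Stand-in for the real state machine driver.
        def __init__(self):
            self.counters = {}
        def increment_counter(self, name):
            self.counters[name] = self.counters.get(name, 0) + 1
            return self.counters[name]
        def goto_state(self, name):
            print('-> %s' % name)

    def on_ping_failure(machine):
        if machine.increment_counter('ping_power_cycles') > MAX_POWER_CYCLES:
            # power cycling didn't help, so suspect the image itself
            machine.goto_state('self_test')
        else:
            machine.goto_state('power_cycling')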
Comment 3 • 13 years ago
Finally mentally digested this; here are my thoughts.
I *think* it was a mistake to have both 'free' and 'ready' device states. The whole reason we get into this problem of not knowing whether to go into 'free' or 'ready' after a self-test (and I believe we've run into this issue before) is that the DeviceMachine now relies on information outside of its scope, namely whether there is a pending request. DeviceMachine shouldn't care about requests.
I think the confusion arose because the idea of the device_requests table wasn't fully formed when I started thinking about request states. This table plus the device state is enough to determine when a device is requested and ready to use. I think we should drop the 'free' device state and only use 'ready'.
The only downside is that the lifeguard UI would need another column or something to denote whether a ready device is assigned (so people don't mess with devices in use). I think this is a further argument for a better, more unified UI that hides the layered implementation from the user.
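For that UI column, the lookup could be as simple as the following sketch (device_requests is named in this comment, but its exact schema, including the device_id column used here, is an assumption):

    def assigned_device_ids(cursor):
        # Given a DB-API cursor, return the ids of devices that have a
        # pending request, so the UI can flag 'ready' devices in use.
        cursor.execute("select device_id from device_requests")
        return set(row[0] for row in cursor.fetchall())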
This doesn't solve this whole bug, but I think it's a necessary step. To address the main problem, if we can somewhat reliably determine that an image is bad (as opposed to a device), I guess we could send an HTTP event to the mozpool layer to transition the request machine to a failure state (bad_image or something).
We can retry it from the DeviceMachine for a while, but I think a failsafe would be for the pending state of RequestMachine to verify that the desired image was installed once the device goes into the ready state. If not, we assume that the imaging process failed for one reason or another, and we fail the request. The ability to try a different device if one fails isn't in mozpool yet, but when it is, we could retry here as well, since this is an ambiguous failure mode. If it is indeed a bad image, presumably sooner or later the bad_image event would get through and we would fail the request properly.
An HTTP request does mean that the DeviceMachine will actually have to know a bit about the requests layer, though, which I wanted to avoid... another approach would be to somehow mark a particular image+pxe_config as bad and have the RequestMachine check for this while polling.
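Putting the two ideas together, a hypothetical polling check for the RequestMachine's pending state might look like the sketch below (every name here, including the request's attributes and the bad-combination registry, is invented for illustration):

    BAD_COMBOS = set()  # invented registry of known-bad (image, pxe_config) pairs

    def poll_pending(request):
        # Fail fast if this image+pxe_config has been marked bad.
        if (request.image, request.pxe_config) in BAD_COMBOS:
            request.goto_state('bad_image')
            return
        if request.device_state() != 'ready':
            return  # keep polling; the pending timeout is the backstop
        if request.device_image() != request.image:
            # Device came back ready without the desired image: imaging
            # failed somehow, so fail (or later, retry on another device).
            request.goto_state('failed')
        else:
            request.goto_state('ready')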
Anyway, does at least the first part make sense? It doesn't break the API, although it does represent a somewhat significant change in that we are removing a state that is exposed publicly.
Flags: needinfo?(mcote)
Assignee
Comment 4 • 12 years ago
This follows bug 837241.
Attachment #721701 - Flags: review?(mcote)
Comment 5 • 12 years ago
Comment on attachment 721701 [details] [diff] [review]
bug834568.patch
Review of attachment 721701 [details] [diff] [review]:
-----------------------------------------------------------------
Groovy. I think this makes a lot of sense.
::: UPGRADING.md
@@ +18,5 @@
> +
> +You will want to move all `free` devices to the `ready` state, and set a timeout for `ready` devices.
> +After shutting down the old version, execute:
> +
> + update devices set state='ready', state_timeout=NOW() where state='free'
I just have to say that I love that this file exists and is up to date. :)
::: mozpool/lifeguard/devicemachine.py
@@ +711,5 @@
> + def on_power_cycle_ok(self, args):
> + self.machine.goto_state('sut_verifying')
> +
> + def on_timeout(self):
> + if (self.machine.increment_counter(self.state_name) > self.PERMANENT_FAILURE_COUNT):
Don't need the parens here.
::: mozpool/mozpool/requestmachine.py
@@ +187,5 @@
> + # note that there's still a small chance of a race here between
> + # mozpool and lifeguard: lifeguard begins self-testing the device
> + # after this check and before it receives the event below; the
> + # device eventually returns to the 'ready' state, but has not seent
> + # he request. The timeout for 'pending' will catch this rare
Typo: "seent he".
Attachment #721701 - Flags: review?(mcote) → review+
Assignee
Comment 6 • 12 years ago
fixed and landed
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated • 9 years ago
Product: Testing → Testing Graveyard