Bug 834568 (Closed)
Opened 13 years ago • Closed 12 years ago
/failed_.*_pinging/ states should be replaced with a self-test
Categories: Testing Graveyard :: Mozpool, defect
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: dustin; Assignee: dustin
Attachments (1 file)
bug834568.patch (40.91 KB, patch) • mcote: review+
It seems these states are more often the result of a bogus image than of a failure to image, at least for failed_b2g_pinging.
The easy solution is just to transition to 'self_test' instead of the failed state.
The tricky bit is that if mozpool is waiting for the device, it needs to be told what's wrong. If the self-test succeeds, the device will end up in the 'free' state while it still has a request assigned, and that's not a state that will make much sense to Mozpool.
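As a rough illustration of the proposed transition (a sketch only; the Machine stub stands in for mozpool's real state machine driver, and everything except goto_state, which appears in the patch below, is an invented name):

    # Sketch: route ping failures into self_test instead of a
    # terminal failed_* state, and let the self-test decide.
    class Machine(object):
        def goto_state(self, name):
            print('-> %s' % name)

    class b2g_pinging(object):
        def __init__(self, machine):
            self.machine = machine

        def on_ping_failure(self, args):
            # previously: self.machine.goto_state('failed_b2g_pinging')
            self.machine.goto_state('self_test')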
I don't have good answers for the following questions:
- How does lifeguard communicate this event to mozpool? An HTTP call? What if *that* fails?
- What does mozpool do when this occurs?
In the very short term, we'll get alerted of these failures by nagios, and can trigger a self-test manually. But that will get tiring very quickly!
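To make the first question concrete, here is one hedged shape the notification could take (Python 2 urllib2, to match the codebase's era; the URL and payload are hypothetical, not mozpool's actual API), with the "what if *that* fails" case handled only by logging and falling back to mozpool's own timeouts:

    import json
    import logging
    import urllib2

    def notify_mozpool(server, device_name, event):
        # Hypothetical endpoint; mozpool's real API may differ.
        url = 'http://%s/api/device/%s/event/%s/' % (server, device_name, event)
        try:
            urllib2.urlopen(url, data=json.dumps({}), timeout=30)
        except Exception:
            # If the notification itself fails, do nothing more here:
            # the request's 'pending' timeout in mozpool is the backstop.
            logging.exception('failed to notify mozpool of %s for %s',
                              event, device_name)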
Assignee
Comment 1 • 13 years ago
I think I need some advice from the requests guru on this one :)
Assignee: nobody → dustin
Flags: needinfo?(mcote)
Comment 2 • 13 years ago
I think it might be worth adding a power cycle (or two) to try to recover from failed pinging. There are situations where the image is good but the hardware just didn't initialize properly. I've seen instances (on Linux, Android, and B2G) where the Panda board's ethernet driver fails to load and a simple power cycle brings it back, which leads me to believe this issue goes all the way upstream to the kernel itself.
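A sketch of that retry logic, reusing the counter pattern that shows up in the eventual patch below (increment_counter and a failure limit); the state names and the Machine stub are illustrative:

    MAX_POWER_CYCLES = 2  # "a power cycle (or two)"

    class Machine(object):
        # Stand-in for the real state machine driver.
        def __init__(self):
            self.counters = {}
        def increment_counter(self, name):
            self.counters[name] = self.counters.get(name, 0) + 1
            return self.counters[name]
        def goto_state(self, name):
            print('-> %s' % name)

    def on_ping_failure(machine):
        if machine.increment_counter('ping_power_cycles') > MAX_POWER_CYCLES:
            # power cycling didn't help, so suspect the image itself
            machine.goto_state('self_test')
        else:
            machine.goto_state('power_cycling')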
Comment 3 • 13 years ago
Finally mentally digested this; here are my thoughts.
I *think* it was a mistake to have both 'free' and 'ready' device states. The whole reason we get into this problem of not knowing whether to go into 'free' or 'ready' after a self-test (and I believe we've run into this issue before) is that the DeviceMachine now relies on information outside of its scope, namely whether there is a pending request. DeviceMachine shouldn't care about requests.
I think the confusion arose because the idea of the device_requests table wasn't fully formed when I started thinking about request states. This table plus the device state is enough to determine when a device is requested and ready to use. I think we should drop the 'free' device state and only use 'ready'.
The only downside is that the lifeguard UI would need another column or something to denote whether a ready device is assigned (so people don't mess with devices in use). I think this is a further argument for a better, more unified UI that hides the layered implementation from the user.
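For that UI column, the lookup could be as simple as the following sketch (device_requests is named in this comment, but its exact schema, including the device_id column used here, is an assumption):

    def assigned_device_ids(cursor):
        # Given a DB-API cursor, return the ids of devices that have a
        # pending request, so the UI can flag 'ready' devices in use.
        cursor.execute("select device_id from device_requests")
        return set(row[0] for row in cursor.fetchall())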
This doesn't solve this whole bug, but I think it's a necessary step. To address the main problem, if we can somewhat reliably determine that an image is bad (as opposed to a device), I guess we could send an HTTP event to the mozpool layer to transition the request machine to a failure state (bad_image or something).
We can retry it from the DeviceMachine for a while, but I think a failsafe would be for the pending state of RequestMachine to verify that the desired image was installed once the device goes into the ready state. If not, we assume that the imaging process failed for one reason or another, and we fail the request. The ability to try a different device if one fails isn't in mozpool yet, but when it is, we could retry here as well, since this is an ambiguous failure mode. If it is indeed a bad image, presumably sooner or later the bad_image event would get through and we would fail the request properly.
An HTTP request does mean that the DeviceMachine will actually have to know a bit about the requests layer, though, which I wanted to avoid... another approach would be to somehow mark a particular image+pxe_config as bad and have the RequestMachine check for this while polling.
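Putting the two ideas together, a hypothetical polling check for the RequestMachine's pending state might look like the sketch below (every name here, including the request's attributes and the bad-combination registry, is invented for illustration):

    BAD_COMBOS = set()  # invented registry of known-bad (image, pxe_config) pairs

    def poll_pending(request):
        # Fail fast if this image+pxe_config has been marked bad.
        if (request.image, request.pxe_config) in BAD_COMBOS:
            request.goto_state('bad_image')
            return
        if request.device_state() != 'ready':
            return  # keep polling; the pending timeout is the backstop
        if request.device_image() != request.image:
            # Device came back ready without the desired image: imaging
            # failed somehow, so fail (or later, retry on another device).
            request.goto_state('failed')
        else:
            request.goto_state('ready')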
Anyway, does at least the first part make sense? It doesn't break the API, although it does represent a somewhat significant change in that we are removing a state that is exposed publicly.
Flags: needinfo?(mcote)
Assignee
Comment 4 • 12 years ago
This follows bug 837241.
Attachment #721701 - Flags: review?(mcote)
Comment 5 • 12 years ago
Comment on attachment 721701 [details] [diff] [review]
bug834568.patch
Review of attachment 721701 [details] [diff] [review]:
-----------------------------------------------------------------
Groovy. I think this makes a lot of sense.
::: UPGRADING.md
@@ +18,5 @@
> +
> +You will want to move all `free` devices to the `ready` state, and set a timeout for `ready` devices.
> +After shutting down the old version, execute:
> +
> + update devices set state='ready', state_timeout=NOW() where state='free'
I just have to say that I love that this file exists and is up to date. :)
::: mozpool/lifeguard/devicemachine.py
@@ +711,5 @@
> + def on_power_cycle_ok(self, args):
> + self.machine.goto_state('sut_verifying')
> +
> + def on_timeout(self):
> + if (self.machine.increment_counter(self.state_name) > self.PERMANENT_FAILURE_COUNT):
Don't need the parens here.
::: mozpool/mozpool/requestmachine.py
@@ +187,5 @@
> + # note that there's still a small chance of a race here between
> + # mozpool and lifeguard: lifeguard begins self-testing the device
> + # after this check and before it receives the event below; the
> + # device eventually returns to the 'ready' state, but has not seent
> + # he request. The timeout for 'pending' will catch this rare
Typo: "seent he".
Attachment #721701 - Flags: review?(mcote) → review+
Assignee
Comment 6 • 12 years ago
fixed and landed
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated • 9 years ago
Product: Testing → Testing Graveyard