[mozdevice] introduce a retry mechanism on shell calls
Categories
(Testing :: Mozbase, enhancement, P3)
Tracking
(Not tracked)
People
(Reporter: tarek, Assigned: gbrown)
References
Details
(Whiteboard: [chase-intermittents])
We're seeing in intermittents a lot of timeouts and errors on adb shell calls, and for some of our tests on bitbar, gracefully retrying the command that failed or timedout could be useful
I would like to introduce in ADBDevice.shell() a retries option, that will retry the same command several times when there's a problem
The new interface would be:
def shell(self, cmd, env=None, cwd=None, timeout=None, root=False,
stdout_callback=None, retry_callback=None):
where retry_callback is a callable with this signature:
def retry_callback(retry, error):
"""retry: the number of times it was retried. the first call is 0
error: the error that was triggered
"""
if the function raises an error (the error that was passed, or another error, it's bubbling up as usual. If the function returns False, no new attempt is made and the shell() function resume its behavior. If the function returns True, a new attempt is made.
The function is in charge of adding a time.sleep() call if it wants.
a default implementation callable is provided to make it easier to use:
def retry_shell_call(attempts= 3, interval=1, errors=(ADBError, ADBTimeout)):
def _retry_shell_call(retry errors):
if error not in errors:
raise error
if retry >= attempts:
return False
time.sleep(interval)
return True
return _retry_shell_call
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Updated•5 years ago
|
Comment 1•5 years ago
|
||
I don't think this will help but am willing to be proved wrong.
If I am correct, ADBTimeoutError is a permanent state until the device is recovered usually manually. You can see this in situations where we do not immediately terminate the test run when an ADBTimeoutError occurs where we attempt several more adb calls before terminating, each of which times out after the default 5 minute time out. I think what you will see is once you hit an ADBTimeout is that each of your retries will also time out resulting in a 15 minute delay attempting to retry 3 times. If we have the situation where we do other adb calls during the clean up and exit phase of the framework, you will hit a 15 minute delay each time.
It may be the case where this might be helpful in some command executed on the device such as for adb forward or adb reverse but I don't think it should be the default action.
The issue on emulators where we were getting intermittent failures were mostly due to failures to detect something during initialization. They were not due to ADBErrors or ADBTimeoutErrors but were due to the emulators lying to us about the existence of files and directories.
Getting the android-hw tests to use the same TBPL Retry mechanism as the android-em tests I think will be a more production solution. I was looking into this a little bit last week and think that the approach where we use run-task to run the script.py at bitbar which runs test-linux.sh hides the retry exit code 4 from taskcluster and prevents the job from being retried.
aerickson: Can you look into investigating and fixing the TBPL Retry issue?
gbrown: Thoughts?
Assignee | ||
Comment 2•5 years ago
|
||
(In reply to Tarek Ziadé (:tarek) from comment #0)
We're seeing in intermittents a lot of timeouts and errors on adb shell calls
Is that mostly bug 1500266?
Assignee | ||
Comment 3•5 years ago
|
||
In my experience (mostly influenced by emulators), ADBTimeoutError is usually permanent: the device is unresponsive and doesn't recover on its own.
I have seen non-permanent cases:
- Android startup is not complete on the initial attempt
- Android reboots or restarts a service in the middle of a test run
- the timeout is too short / mozdevice gives up too soon
For the most part, I think there are better responses to those non-permanent cases than retries -- improve the wait for startup, avoid the conditions that triggered reboot, set an appropriate timeout -- and I would hate to see problems like that papered over by an automatic retry mechanism.
On the other hand, I have no idea what is happening in bug 1500266, and if this sort of retry is demonstrated to be effective, that's very tempting.
Task-level TBPL Retry is another approach which I think would be very effective. We've used that for at least a couple of years on emulators and it has been very useful and trouble free.
Reporter | ||
Comment 4•5 years ago
|
||
Thanks for the detailed feedback all. Looking forward for aerickson feedback. Yeah I was thinking about bug 1500266 and also did some experiments on my g5 at home, so I wanted to propose that approach. Sounds like TBPL retries are a much better approach
Comment 5•5 years ago
|
||
RE: TBPL retries, I posted an update to bug 1573269. TLDR: We haven't deployed our fix to the non-test queues yet. I'm testing the latest revision and hope to deploy this week.
Assignee | ||
Updated•5 years ago
|
Updated•4 years ago
|
Comment 6•2 years ago
|
||
Andrew, what's the reason for bug 1578424 to be on the see also list? Note that this is about the Rust mozdevice crate and not the python module. Thanks.
Comment 7•2 years ago
|
||
Aaah, sorry. I thought we were rewriting/replacing the python code in rust, but I see that they're separate efforts.
Removed.
Updated•2 years ago
|
Assignee | ||
Comment 8•2 years ago
|
||
I think we're in a better place now + task retries are working + concerns raised in comment 1 and comment 3.
Description
•