Closed Bug 1580137 Opened 5 years ago Closed 2 years ago

[mozdevice] introduce a retry mechanism on shell calls

Categories

(Testing :: Mozbase, enhancement, P3)

Version 3
enhancement

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: tarek, Assigned: gbrown)

References

Details

(Whiteboard: [chase-intermittents])

We're seeing in intermittents a lot of timeouts and errors on adb shell calls, and for some of our tests on bitbar, gracefully retrying the command that failed or timedout could be useful

I would like to introduce in ADBDevice.shell() a retries option, that will retry the same command several times when there's a problem

The new interface would be:

def shell(self, cmd, env=None, cwd=None, timeout=None, root=False,
          stdout_callback=None, retry_callback=None):

where retry_callback is a callable with this signature:

def retry_callback(retry, error):
    """retry: the number of times it was retried. the first call is 0
        error: the error that was triggered
    """

if the function raises an error (the error that was passed, or another error, it's bubbling up as usual. If the function returns False, no new attempt is made and the shell() function resume its behavior. If the function returns True, a new attempt is made.

The function is in charge of adding a time.sleep() call if it wants.

a default implementation callable is provided to make it easier to use:

def retry_shell_call(attempts= 3, interval=1, errors=(ADBError, ADBTimeout)):
    def _retry_shell_call(retry errors):
        if error not in errors:
            raise error
        if retry >= attempts:
            return False
        time.sleep(interval)
        return True  
    return _retry_shell_call
Assignee: nobody → tarek
Whiteboard: [chase-intermittents]
Summary: introduce a retry mechanism on shell calls → [mozdevice] introduce a retry mechanism on shell calls

I don't think this will help but am willing to be proved wrong.

If I am correct, ADBTimeoutError is a permanent state until the device is recovered usually manually. You can see this in situations where we do not immediately terminate the test run when an ADBTimeoutError occurs where we attempt several more adb calls before terminating, each of which times out after the default 5 minute time out. I think what you will see is once you hit an ADBTimeout is that each of your retries will also time out resulting in a 15 minute delay attempting to retry 3 times. If we have the situation where we do other adb calls during the clean up and exit phase of the framework, you will hit a 15 minute delay each time.

It may be the case where this might be helpful in some command executed on the device such as for adb forward or adb reverse but I don't think it should be the default action.

The issue on emulators where we were getting intermittent failures were mostly due to failures to detect something during initialization. They were not due to ADBErrors or ADBTimeoutErrors but were due to the emulators lying to us about the existence of files and directories.

Getting the android-hw tests to use the same TBPL Retry mechanism as the android-em tests I think will be a more production solution. I was looking into this a little bit last week and think that the approach where we use run-task to run the script.py at bitbar which runs test-linux.sh hides the retry exit code 4 from taskcluster and prevents the job from being retried.

aerickson: Can you look into investigating and fixing the TBPL Retry issue?

gbrown: Thoughts?

Flags: needinfo?(gbrown)
Flags: needinfo?(aerickson)

(In reply to Tarek Ziadé (:tarek) from comment #0)

We're seeing in intermittents a lot of timeouts and errors on adb shell calls

Is that mostly bug 1500266?

In my experience (mostly influenced by emulators), ADBTimeoutError is usually permanent: the device is unresponsive and doesn't recover on its own.
I have seen non-permanent cases:

  • Android startup is not complete on the initial attempt
  • Android reboots or restarts a service in the middle of a test run
  • the timeout is too short / mozdevice gives up too soon
    For the most part, I think there are better responses to those non-permanent cases than retries -- improve the wait for startup, avoid the conditions that triggered reboot, set an appropriate timeout -- and I would hate to see problems like that papered over by an automatic retry mechanism.

On the other hand, I have no idea what is happening in bug 1500266, and if this sort of retry is demonstrated to be effective, that's very tempting.

Task-level TBPL Retry is another approach which I think would be very effective. We've used that for at least a couple of years on emulators and it has been very useful and trouble free.

Flags: needinfo?(gbrown)

Thanks for the detailed feedback all. Looking forward for aerickson feedback. Yeah I was thinking about bug 1500266 and also did some experiments on my g5 at home, so I wanted to propose that approach. Sounds like TBPL retries are a much better approach

RE: TBPL retries, I posted an update to bug 1573269. TLDR: We haven't deployed our fix to the non-test queues yet. I'm testing the latest revision and hope to deploy this week.

Flags: needinfo?(aerickson)
Priority: -- → P3
Assignee: tarek → nobody
See Also: → 1578424
See Also: → 1615599

Andrew, what's the reason for bug 1578424 to be on the see also list? Note that this is about the Rust mozdevice crate and not the python module. Thanks.

Flags: needinfo?(aerickson)

Aaah, sorry. I thought we were rewriting/replacing the python code in rust, but I see that they're separate efforts.

Removed.

Flags: needinfo?(aerickson)
See Also: 1578424
Severity: normal → S3

I think we're in a better place now + task retries are working + concerns raised in comment 1 and comment 3.

Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.