1580137 - [mozdevice] introduce a retry mechanism on shell calls

Reporter

Description

•

5 years ago

•

We're seeing in intermittents a lot of timeouts and errors on adb shell calls, and for some of our tests on bitbar, gracefully retrying the command that failed or timedout could be useful

I would like to introduce in ADBDevice.shell() a retries option, that will retry the same command several times when there's a problem

The new interface would be:

def shell(self, cmd, env=None, cwd=None, timeout=None, root=False,
          stdout_callback=None, retry_callback=None):

where retry_callback is a callable with this signature:

def retry_callback(retry, error):
    """retry: the number of times it was retried. the first call is 0
        error: the error that was triggered
    """

if the function raises an error (the error that was passed, or another error, it's bubbling up as usual. If the function returns False, no new attempt is made and the shell() function resume its behavior. If the function returns True, a new attempt is made.

The function is in charge of adding a time.sleep() call if it wants.

a default implementation callable is provided to make it easier to use:

def retry_shell_call(attempts= 3, interval=1, errors=(ADBError, ADBTimeout)):
    def _retry_shell_call(retry errors):
        if error not in errors:
            raise error
        if retry >= attempts:
            return False
        time.sleep(interval)
        return True  
    return _retry_shell_call

Tarek Ziadé (:tarek)

Reporter

Updated

•

5 years ago

Assignee: nobody → tarek

Tarek Ziadé (:tarek)

Reporter

Updated

•

5 years ago

Whiteboard: [chase-intermittents]

Tarek Ziadé (:tarek)

Reporter

Updated

•

5 years ago

Summary: introduce a retry mechanism on shell calls → [mozdevice] introduce a retry mechanism on shell calls

Bob Clary [:bc] (inactive)

Comment 1

•

5 years ago

I don't think this will help but am willing to be proved wrong.

If I am correct, ADBTimeoutError is a permanent state until the device is recovered usually manually. You can see this in situations where we do not immediately terminate the test run when an ADBTimeoutError occurs where we attempt several more adb calls before terminating, each of which times out after the default 5 minute time out. I think what you will see is once you hit an ADBTimeout is that each of your retries will also time out resulting in a 15 minute delay attempting to retry 3 times. If we have the situation where we do other adb calls during the clean up and exit phase of the framework, you will hit a 15 minute delay each time.

It may be the case where this might be helpful in some command executed on the device such as for adb forward or adb reverse but I don't think it should be the default action.

The issue on emulators where we were getting intermittent failures were mostly due to failures to detect something during initialization. They were not due to ADBErrors or ADBTimeoutErrors but were due to the emulators lying to us about the existence of files and directories.

Getting the android-hw tests to use the same TBPL Retry mechanism as the android-em tests I think will be a more production solution. I was looking into this a little bit last week and think that the approach where we use run-task to run the script.py at bitbar which runs test-linux.sh hides the retry exit code 4 from taskcluster and prevents the job from being retried.

aerickson: Can you look into investigating and fixing the TBPL Retry issue?

gbrown: Thoughts?

Flags: needinfo?(gbrown)

Flags: needinfo?(aerickson)

Geoff Brown [:gbrown]

Assignee

Comment 2

•

5 years ago

(In reply to Tarek Ziadé (:tarek) from comment #0)

We're seeing in intermittents a lot of timeouts and errors on adb shell calls

Is that mostly bug 1500266?

Geoff Brown [:gbrown]

Assignee

Comment 3

•

5 years ago

In my experience (mostly influenced by emulators), ADBTimeoutError is usually permanent: the device is unresponsive and doesn't recover on its own.
I have seen non-permanent cases:

Android startup is not complete on the initial attempt
Android reboots or restarts a service in the middle of a test run
the timeout is too short / mozdevice gives up too soon
For the most part, I think there are better responses to those non-permanent cases than retries -- improve the wait for startup, avoid the conditions that triggered reboot, set an appropriate timeout -- and I would hate to see problems like that papered over by an automatic retry mechanism.

On the other hand, I have no idea what is happening in bug 1500266, and if this sort of retry is demonstrated to be effective, that's very tempting.

Task-level TBPL Retry is another approach which I think would be very effective. We've used that for at least a couple of years on emulators and it has been very useful and trouble free.

Flags: needinfo?(gbrown)

Tarek Ziadé (:tarek)

Reporter

Comment 4

•

5 years ago

Thanks for the detailed feedback all. Looking forward for aerickson feedback. Yeah I was thinking about bug 1500266 and also did some experiments on my g5 at home, so I wanted to propose that approach. Sounds like TBPL retries are a much better approach

Andrew Erickson [:aerickson]

Comment 5

•

5 years ago

RE: TBPL retries, I posted an update to bug 1573269. TLDR: We haven't deployed our fix to the non-test queues yet. I'm testing the latest revision and hope to deploy this week.

Flags: needinfo?(aerickson)

Geoff Brown [:gbrown]

Assignee

Updated

•

5 years ago

Priority: -- → P3

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

4 years ago

Assignee: nobody → gbrown

Status: NEW → RESOLVED

Closed: 2 years ago

Resolution: --- → WONTFIX

Bugzilla

[mozdevice] introduce a retry mechanism on shell calls

Categories

(Testing :: Mozbase, enhancement, P3)

Tracking

(Not tracked)

People

(Reporter: tarek, Assigned: gbrown)

References

Details

(Whiteboard: [chase-intermittents])

Crash Data

Security

(public)

User Story

Description

Updated

Updated

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Updated

Updated

Updated

Updated

Comment 6

Comment 7

Updated

Comment 8