Write unit tests for retry function exponential backoff


Status

Product: Release Engineering
Component: General Automation
Priority: P3
Severity: minor
Status: RESOLVED FIXED
Reported: 6 years ago
Last modified: 3 years ago

People

(Reporter: jhford, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [retry])

Attachments

(1 attachment)

When we have a failure in release automation, we should have an exponential backoff for retry delay.  This may help with intermittent network or server failures.
The code is here: http://hg.mozilla.org/build/tools/file/tip/lib/python/util/retry.py
Summary: retry logic in release automation should use exponential backoff → retry function should use exponential backoff
(In reply to John Ford [:jhford] from comment #0)
> When we have a failure in release automation, we should have an exponential
> backoff for retry delay.  This may help with intermittent network or server
> failures

An argument against exponential backoff: Retries are cheap, and I'd hate to wait many minutes extra because of it, knowing that a failed service is actually back online.

Suggest at least a reasonable upper limit (like 5 or 10 minutes) + a random jitter so hundreds of servers aren't waiting the exact same amount of time to retry.
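For illustration, a capped, jittered backoff along the lines suggested above might look like the sketch below (the names backoff_delay, base, and cap are made up for this example and are not taken from retry.py):

import random

def backoff_delay(attempt, base=1.0, cap=300.0):
    # Delay before retry number `attempt` (0-based): doubles each time, capped at `cap`.
    delay = min(cap, base * (2 ** attempt))
    # Full jitter, so hundreds of machines retrying at once don't wake up in lockstep.
    return random.uniform(0, delay)

Whether full jitter or only a small random offset is appropriate depends on how synchronized the retrying clients actually are.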
(In reply to John Hopkins (:jhopkins) from comment #2)
> (In reply to John Ford [:jhford] from comment #0)
> > When we have a failure in release automation, we should have an exponential
> > backoff for retry delay.  This may help with intermittent network or server
> > failures
> 
> An argument against exponential backoff: Retries are cheap, and I'd hate to
> wait many minutes extra because of it, knowing that a failed service is
> actually back online.
> 
> Suggest at least a reasonable upper limit (like 5 or 10 minutes) + a random
> jitter so hundreds of servers aren't waiting the exact same amount of time
> to retry.

Hmmm, I'm not sure how I feel about jitter. It really depends on the use case. I will say that an upper bound of 60 seconds by default is preferable, though; waiting 5 or 10 minutes is unreasonably long IMHO.
For a short job, sure, but what about a job that takes multiple hours?  I'd rather wait 5+ minutes to save 4 hours of work.
(In reply to John Ford [:jhford] from comment #4)
> For a short job, sure, but what about a job that takes multiple hours?  I'd
> rather wait 5+ minutes to save 4 hours of work.

Longer wait times don't necessarily give you shorter coverage. You can get an hour's coverage with 12 retries and 5 minutes of wait, or 60 retries and 1 minute of wait. I prefer the latter.
(In reply to Ben Hearsum [:bhearsum] from comment #5)
> (In reply to John Ford [:jhford] from comment #4)
> > For a short job, sure, but what about a job that takes multiple hours?  I'd
> > rather wait 5+ minutes to save 4 hours of work.
> 
> Longer wait times don't necessarily give you shorter coverage. You can get
> an hour's coverage with 12 retries and 5 minutes of wait, or 60 retries and
> 1 minute of wait. I prefer the latter.

I don't think I'm parsing the first sentence properly.

The problem with 60 retries, one a minute, is that it's going to hit whatever server it's trying to reach 60 times, one minute apart.  If the server is lagging behind but still working and is taking longer to respond, hitting it less frequently is advantageous because it gives the server more of a chance to finish processing the request without adding a bunch of extra load.  Because choosing between short intervals and long intervals isn't great, having a backoff is ideal: the first few retries use a short interval, and as time goes on the backoff produces longer intervals, giving the server a chance to process the request fully while keeping the same overall time limit as if we didn't have a backoff.
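To put rough numbers on that argument, compare the fixed schedule from comment 5 with a capped exponential one (illustrative values only, not the defaults in retry.py):

# Fixed schedule: 60 retries, one minute apart.
fixed = [60] * 60

# Capped exponential schedule: 30s, 60s, 120s, 240s, then 300s from there on.
expo, delay = [], 30
while sum(expo) < 60 * 60:
    expo.append(delay)
    delay = min(300, delay * 2)

print(len(fixed), sum(fixed) // 60)  # 60 hits spread over 60 minutes
print(len(expo), sum(expo) // 60)    # 15 hits covering roughly the same hour

Both schedules keep trying for about an hour, but the backoff version hits the struggling server a quarter as often.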
Nobody has picked this up despite all the chatter, so I assert this is P3!
Priority: -- → P3

(In reply to Ben Hearsum [:bhearsum] from comment #5)
> Longer wait times don't necessarily give you shorter coverage. You can get
> an hour's coverage with 12 retries and 5 minutes of wait, or 60 retries and
> 1 minute of wait. I prefer the latter.

Does anyone have renewed thoughts on this, given that we just recently had an hg outage that caused us to retry many failed pulls in rapid succession (bug 733663)?
Component: Release Engineering → Release Engineering: Automation
OS: Linux → All
Priority: P3 → --
QA Contact: release → catlee
Hardware: Other → All
Whiteboard: [retry]
I suggest a truncated binary exponential backoff:

https://en.wikipedia.org/wiki/Exponential_backoff
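For reference, the "truncated" part of that scheme just means the random range stops growing after some number of attempts; a minimal sketch (slot and truncate_after are illustrative names, not from retry.py):

import random

def truncated_binary_backoff(attempt, slot=1.0, truncate_after=10):
    # Wait a random number of slots in [0, 2**k - 1], where k stops growing
    # once `attempt` reaches `truncate_after`.
    k = min(attempt, truncate_after)
    return slot * random.randint(0, 2 ** k - 1)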

Priority: -- → P3
Assignee: nobody → jhopkins
Created attachment 608343 [details] [diff] [review]
proposed patch
Attachment #608343 - Flags: review?(jhford)
Attachment #608343 - Flags: review?(jhford) → review+
Landed in http://hg.mozilla.org/build/tools/rev/d34d4d691277

We still need unit tests to cover the exponential backoff and the new --maxsleeptime argument.  Updating summary to reflect that.
Summary: retry function should use exponential backoff → Write unit tests for retry function exponential backoff
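As a starting point for those tests, one way to check the sleep schedule is to monkeypatch time.sleep and record what the function asks for. The sketch below assumes retry() grew attempts, sleeptime, and max_sleeptime keyword arguments and doubles the delay between attempts with no jitter; adjust to the actual signature and scaling factor of the landed patch:

import unittest
from unittest import mock

from util.retry import retry  # assumed import path

class TestRetryBackoff(unittest.TestCase):
    def test_sleeps_double_and_are_capped(self):
        sleeps = []
        calls = {"n": 0}

        def always_fails():
            calls["n"] += 1
            raise IOError("transient failure")

        # If retry.py does `from time import sleep`, patch that reference
        # (e.g. "util.retry.sleep") instead of "time.sleep".
        with mock.patch("time.sleep", side_effect=sleeps.append):
            self.assertRaises(IOError, retry, always_fails,
                              attempts=6, sleeptime=10, max_sleeptime=40)

        self.assertEqual(calls["n"], 6)
        # Expected if the delay doubles each retry, is capped at max_sleeptime,
        # and no sleep happens after the final failing attempt.
        self.assertEqual(sleeps, [10, 20, 40, 40, 40])

if __name__ == "__main__":
    unittest.main()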
This is not on my radar.  Returning to the pool.
Assignee: jhopkins → nobody
Severity: normal → minor
Product: mozilla.org → Release Engineering
I'm pretty sure redo has tests for this...
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED