Closed Bug 823304 Opened 7 years ago Closed 4 years ago

Fuzz backoff interval on failure

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect, P4)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gps, Unassigned)

Details

Operational consideration to help mitigate the thundering herd problem.

policy.jsm:989
Blocks: 829887
No longer blocks: 718066
We may want to consider expanding scope of this to cover larger backoff-related issues:

* 500 is a hard failure causing immediate 24h backoff.
* Are 15 minutes and 60 minutes the correct backoff intervals?
(In reply to Gregory Szorc [:gps] from comment #1)
> We may want to consider expanding scope of this to cover larger
> backoff-related issues:
> 
> * 500 is a hard failure causing immediate 24h backoff.

I don't think there should be a hard failure this severe.  If the server returns 503+Retry-After we should respect that, otherwise we should treat it like any other error.
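A minimal sketch of the "respect 503+Retry-After" rule, for illustration only. The helper name `retryAfterMS` and its parameters are hypothetical and not taken from policy.jsm. It assumes the header arrives as a string in one of the two standard forms (a delay in seconds or an HTTP-date) and returns null whenever the response should be handled like any other error.

```javascript
// Hypothetical helper, not from policy.jsm.
// Returns the server-requested delay in milliseconds for a 503 that carries
// a usable Retry-After header, or null when the caller should fall back to
// its ordinary error handling.
function retryAfterMS(status, retryAfter, nowMS = Date.now()) {
  if (status !== 503 || typeof retryAfter !== "string") {
    return null;
  }
  const value = retryAfter.trim();

  // Form 1: delay-seconds, a non-negative integer.
  if (/^\d+$/.test(value)) {
    return parseInt(value, 10) * 1000;
  }

  // Form 2: an HTTP-date. Wait until that moment, never a negative delay.
  // Note: Date.parse accepts more than strict HTTP-dates, so a stricter
  // parser may be preferable in real code.
  const when = Date.parse(value);
  if (!Number.isNaN(when)) {
    return Math.max(0, when - nowMS);
  }

  return null;
}
```

A caller would presumably still want to clamp the returned value to some sane maximum before scheduling the next attempt.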

> * Are 15 minutes and 60 minutes the correct backoff intervals?

I think we should use ~30 minutes as an initial base with a combination of fuzzing and progressive backoff.  If we just spike and overload infra, we'll slow down gradually until the system recovers.  If there's a serious issue I'd expect Ops to deal with it more explicitly, but our dual goals here are "don't DoS the infra" and "collect as much data as we can" so I think we want to be aggressive at first.

Here's what I'd want here:

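function getBackoffMS(failureCount) { // wrapper name is illustrative
  let base = 20 * 60 * 1000; // 20m
  let maxBI = 24 * 60 * 60 * 1000; // 24h
  // Grow linearly with each failure, plus up to one base interval of fuzz.
  let backoffMS = base * failureCount + Math.floor(Math.random() * base);
  return Math.min(backoffMS, maxBI); // never wait longer than 24h
}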
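
As a rough check of what the formula above yields, here is a sketch with an injectable random source so the bounds can be computed deterministically. The names `fuzzedBackoffMS` and `rng` are illustrative, not from policy.jsm, and it assumes failureCount is 1 after the first failure. Under that assumption the first retry lands between 20 and 40 minutes out (about 30 minutes on average), and from the 72nd consecutive failure on the result is always exactly 24h, so the fuzz disappears once the cap is reached.

```javascript
const BASE_MS = 20 * 60 * 1000;             // 20m, as in the snippet above
const MAX_BACKOFF_MS = 24 * 60 * 60 * 1000; // 24h cap

// Illustrative wrapper around the proposed formula. rng defaults to
// Math.random but can be swapped for a fixed function in tests.
function fuzzedBackoffMS(failureCount, rng = Math.random) {
  const fuzz = Math.floor(rng() * BASE_MS); // 0 <= fuzz < BASE_MS
  return Math.min(BASE_MS * failureCount + fuzz, MAX_BACKOFF_MS);
}

// Print the smallest and (approximately) largest delay per failure count.
for (const n of [1, 2, 3, 72]) {
  const lo = fuzzedBackoffMS(n, () => 0) / 60000;
  const hi = fuzzedBackoffMS(n, () => 0.999999) / 60000;
  console.log(`failure ${n}: ${lo.toFixed(1)}m to ${hi.toFixed(1)}m`);
}
```

If keeping clients spread out at the cap matters, one option would be to apply the fuzz after clamping rather than before; that is a design choice beyond what the formula above does.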
(In reply to Mike Connor [:mconnor] from comment #2)

> > * 500 is a hard failure causing immediate 24h backoff.
> 
> I don't think there should be a hard failure this severe.  If the server
> returns 503+Retry-After we should respect that, otherwise we should treat it
> like any other error.

I agree… if someone gets paged when a production service returns a 500. 500 means "someone screwed up", with consequences of unknown severity.

If nobody gets paged, then clients should act as if that 500 just brought down a cluster (which it might well have done), and should retreat to the nearest bar as fast as possible.

Note that this can be resolved by having a load balancer (LB) turn all 500s into 503s with long backoffs, of course.

Or phrased differently: sure, there shouldn't be a hard failure this severe. What happens when there is?
Priority: -- → P4
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
Won't fix, based on FHR removal: https://bugzilla.mozilla.org/show_bug.cgi?id=1209088
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX
Product: Firefox Health Report → Firefox Health Report Graveyard