Closed Bug 823304 Opened 7 years ago Closed 4 years ago
Fuzz backoff interval on failure
Operational consideration to help mitigate thundering herd. policy.jsm:989
We may want to consider expanding scope of this to cover larger backoff-related issues:

* 500 is a hard failure causing immediate 24h backoff.
* Are 15 minutes and 60 minutes the correct backoff intervals.
(In reply to Gregory Szorc [:gps] from comment #1)
> We may want to consider expanding scope of this to cover larger
> backoff-related issues:
>
> * 500 is a hard failure causing immediate 24h backoff.

I don't think there should be a hard failure this severe. If the server returns 503+Retry-After we should respect that, otherwise we should treat it like any other error.

> * Are 15 minutes and 60 minutes the correct backoff intervals.

I think we should use ~30 minutes as an initial base with a combination of fuzzing and progressive backoff. If we just spike and overload infra, we'll slow down gradually until the system recovers. If there's a serious issue I'd expect Ops to deal with it more explicitly, but our dual goals here are "don't DoS the infra" and "collect as much data as we can" so I think we want to be aggressive at first.

Here's what I'd want here:

let base = 20 * 60 * 1000; // 20m
let maxBI = 24 * 60 * 60 * 1000; // 24h
let backoffMS = base * failureCount + Math.floor(Math.random() * base);
return Math.min(backoffMS, maxBI);
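For reference, the proposal above can be wrapped into a standalone function roughly as follows. This is only a sketch: the name `computeBackoffMS`, the constant names, and the injectable `random` parameter are hypothetical and do not come from policy.jsm. With one failure the delay falls between 20 and 40 minutes (about 30 minutes on average), and it grows by 20 minutes per additional failure until it reaches the 24h cap.

```javascript
// Hypothetical helper illustrating the proposed formula; not taken from policy.jsm.
const BASE_MS = 20 * 60 * 1000;             // 20 minutes
const MAX_BACKOFF_MS = 24 * 60 * 60 * 1000; // 24 hours

// failureCount: number of consecutive failures so far (1 for the first failure).
// random: source of uniform values in [0, 1); injectable so tests can be deterministic.
function computeBackoffMS(failureCount, random = Math.random) {
  // Progressive part: grows linearly with each consecutive failure.
  const progressive = BASE_MS * failureCount;
  // Fuzz part: up to one extra base interval, so clients do not retry in lockstep.
  const fuzz = Math.floor(random() * BASE_MS);
  // Never wait longer than the cap.
  return Math.min(progressive + fuzz, MAX_BACKOFF_MS);
}
```

Passing a fixed `random` (for example `() => 0.5`) makes the result reproducible, which may help when checking the bounds in a test.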
(In reply to Mike Connor [:mconnor] from comment #2)
> > * 500 is a hard failure causing immediate 24h backoff.
>
> I don't think there should be a hard failure this severe. If the server
> returns 503+Retry-After we should respect that, otherwise we should treat it
> like any other error.

I agree… if someone gets paged when a production service returns a 500.

500 means "someone screwed up", with consequences of unknown severity. If nobody gets paged, then clients should act as if that 500 just brought down a cluster (which it might well have done), and should retreat to the nearest bar as fast as possible.

Note that this can be resolved by having a LB turn all 500s into 503s with long backoffs, of course.

Or phrased differently: sure, there shouldn't be a hard failure this severe. What happens when there is?
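To make the "respect 503+Retry-After, otherwise treat it like any other error" option concrete, here is a minimal sketch of how a client might pick its next delay. All names (`parseRetryAfterMS`, `nextDelayMS`, `RETRY_CAP_MS`) are hypothetical and not from policy.jsm, and capping a server-supplied Retry-After at 24h is an assumption made for this example rather than something agreed in this bug.

```javascript
// Hypothetical sketch; none of these names come from policy.jsm.
const RETRY_CAP_MS = 24 * 60 * 60 * 1000; // assumed upper bound of 24h

// Parse a Retry-After header value (delta-seconds or HTTP-date) into milliseconds.
// Returns null when the header is absent or cannot be parsed.
function parseRetryAfterMS(value, nowMS = Date.now()) {
  if (value === null || value === undefined) return null;
  const text = String(value).trim();
  if (/^\d+$/.test(text)) return Number(text) * 1000; // delta-seconds form
  const dateMS = Date.parse(text);                     // HTTP-date form
  if (Number.isNaN(dateMS)) return null;
  return Math.max(0, dateMS - nowMS);
}

// Choose the next delay: honor 503 + Retry-After when present, otherwise use the
// caller-supplied fallback (for example a fuzzed progressive backoff).
function nextDelayMS(status, retryAfter, fallbackMS, nowMS = Date.now()) {
  if (status === 503) {
    const serverMS = parseRetryAfterMS(retryAfter, nowMS);
    if (serverMS !== null) return Math.min(serverMS, RETRY_CAP_MS);
  }
  return Math.min(fallbackMS, RETRY_CAP_MS);
}
```

Under this sketch a bare 500 gets the same fallback delay as any other error, so the load-balancer approach of rewriting 500s into 503s with a long Retry-After would be what pushes clients into a long backoff.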
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
won't fix based on FHR removal - https://bugzilla.mozilla.org/show_bug.cgi?id=1209088
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX
Product: Firefox Health Report → Firefox Health Report Graveyard