Closed Bug 877050 Opened 12 years ago Closed 11 years ago

Cleanly drop a percentage of sync traffic with a 503, rather than allowing weird behavior under load

Categories

(Cloud Services :: Operations: Miscellaneous, task, P2)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: rfkelly)

Details

(Whiteboard: [qa+])

Users are currently seeing error bars and weird behavior like 400 errors, JavaScript tracebacks, etc. See e.g. Bug 685941 and Bug 749315. As best we understand it, these are caused by timeouts and other load-induced strangeness truncating request/response bodies.

Richard suggested on IRC that it may be better to purposely throttle our traffic with 503s, rather than allow the problems to show up as user-visible error bars. For example, randomly fail 50% of requests to /info/collections with a 503 and a Retry-After of 2 hours.

Advantages:
* clients are better behaved with a clean 503, and produce less error-bar noise
* failing things out at the initial handshake will produce less DB load than allowing them to fail in the middle of a sync

Thoughts?
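For illustration only, the proposal above could be sketched as a small WSGI middleware. This is not the code that was deployed for this bug; the class name `ThrottleMiddleware` and its parameters (`drop_fraction`, `retry_after`, `path_suffix`, `rng`) are hypothetical, and matching on a path suffix is an assumption about how /info/collections requests would be recognised.

```python
import random

# 2 hours, expressed in seconds as the Retry-After header expects.
RETRY_AFTER_SECONDS = 2 * 60 * 60


class ThrottleMiddleware:
    """Hypothetical WSGI middleware: reject a fraction of handshake requests.

    Requests whose path ends with `path_suffix` are rejected with a clean
    503 plus Retry-After with probability `drop_fraction`. Everything else
    is passed through to the wrapped application untouched.
    """

    def __init__(self, app, drop_fraction=0.5,
                 retry_after=RETRY_AFTER_SECONDS,
                 path_suffix="/info/collections",
                 rng=random.random):
        self.app = app
        self.drop_fraction = drop_fraction
        self.retry_after = retry_after
        self.path_suffix = path_suffix
        self.rng = rng  # injectable so the behaviour can be tested

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path.endswith(self.path_suffix) and self.rng() < self.drop_fraction:
            # Fail early, before any storage work is done for this sync.
            body = b"server busy, please retry later\n"
            headers = [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
                ("Retry-After", str(self.retry_after)),
            ]
            start_response("503 Service Unavailable", headers)
            return [body]
        return self.app(environ, start_response)
```

Wrapping an existing WSGI application would then look like `application = ThrottleMiddleware(application, drop_fraction=0.5)`. Rejecting at this first request is what gives the lower DB load mentioned above, since a rejected client never starts the rest of the sync.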
Whiteboard: [qa+]
Blocks: 907479
Something we still want for Sync 1.5?
Priority: -- → P2
Related to 975305
Depends on: 975305
Fixed for Sync 1.5 in Bug 975306. Bob, do you want any action on this bug for Sync 1.1, or should we just close it out?
Assignee: nobody → rfkelly
Status: NEW → ASSIGNED
Flags: needinfo?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #3)
> Fixed for Sync1.5 in Bug 975306.
>
> Bob do you want any action on this bug for sync1.1 or should we just close
> it out?

No, we can put nodes into back-off at the various points of redirection in our Sync 1.1 production, which should be fine.
Flags: needinfo?(bobm)
Great, closing it out then.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Done.
Status: RESOLVED → VERIFIED