Closed Bug 1601127 Opened 1 year ago Closed 1 year ago

Skew on-sync recipe runs to avoid crushing servers


(Firefox :: Normandy Client, enhancement, P1)




Firefox 73
Tracking Status
firefox73 --- fixed


(Reporter: mythmon, Assigned: mythmon)




(1 file)

Today, when a change happens to the Normandy recipes on Remote Settings, approximately every client that is online re-evaluates recipes within 5 minutes. This drives very large traffic spikes to the Classify Client service, which have caused it to become overloaded and fail.

We are exploring server-side fixes to this problem as well, but as a simple mitigation we should skew our response to the event over a few minutes. When the event is received, we should wait a random amount of time before acting on it. I suggest we wait up to 10 minutes. This should be configurable so we can tweak it in the future, and so that QA can test it more quickly.
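A minimal sketch of the proposed skew, written in Python for illustration only (the Normandy client itself is not written in Python). The names `pick_skew_delay` and `DEFAULT_SKEW_SEC` are hypothetical, and a uniform distribution is an assumption; the description above only says "a random amount of time" of up to 10 minutes.

```python
import random
from typing import Optional

# Hypothetical default: 10 minutes, the upper bound suggested above.
DEFAULT_SKEW_SEC = 600.0


def pick_skew_delay(max_skew_sec: float = DEFAULT_SKEW_SEC,
                    rng: Optional[random.Random] = None) -> float:
    """Return how many seconds to wait before reacting to a sync event.

    Assumes a uniform draw from [0, max_skew_sec]. A non-positive maximum
    disables the skew, which is convenient for QA.
    """
    if max_skew_sec <= 0:
        return 0.0
    rng = rng or random.Random()
    return rng.uniform(0.0, max_skew_sec)
```

With the maximum set to 0 the client would react immediately, which is one way a configurable value could make QA quicker.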

Assignee: nobody → mcooper

This will spread the time that clients run updates in response to on-sync
events over an additional 10 minutes. The events are already spread over
about 5 minutes by the machinery of the push infrastructure. This will
serve to cut the overall server load to about 1/3 of its previous level.

The trade-off is that we now have slightly slower response times for
changes on the server. Fifteen minutes is still short enough to be
usefully "real time", while allowing enough time for servers to scale up
gracefully and humans to react to any potential problems.
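The "about 1/3" estimate in the commit message above can be reproduced with a rough calculation, assuming requests are spread evenly across the whole window. The client count and the helper `average_rate` are hypothetical and used only for illustration.

```python
def average_rate(clients: int, window_min: float) -> float:
    """Average requests per minute if `clients` requests are spread
    evenly over a window of `window_min` minutes."""
    if window_min <= 0:
        raise ValueError("window_min must be positive")
    return clients / window_min


clients = 1_000_000  # hypothetical number of online clients

before = average_rate(clients, 5)       # push infrastructure spread only
after = average_rate(clients, 5 + 10)   # plus up to 10 minutes of skew

ratio = after / before  # fraction of the previous average load
```

This is a simplification: it treats arrivals as flat across the full 15 minutes. Combining two independent delays gives a less even distribution, so the reduction in peak load may be smaller than the reduction in average load.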

Thanks for working on this! I agree 10 minutes is a good starting value for app.normandy.onsync_skew_sec. It's good to have that as a preference so we can adjust it if needed down the line.

Pushed by
Skew Normandy on-sync recipe runs to avoid crushing servers r=Gijs
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 73
See Also: → 1594954
See Also: → 1615378