Open Bug 1605091 Opened 6 years ago Updated 4 years ago

Add jitter delay for ping upload

Categories

(Data Platform and Tools :: Glean: SDK, defect, P4)

Tracking

(Not tracked)

People

(Reporter: brizental, Unassigned)

References

(Depends on 1 open bug)

Details

(Whiteboard: [telemetry:glean-rs:backlog])

From Glean SDK architecture changes for Project FOG and Beyond:

e.g. around midnight for “main”-ping like functionality on Desktop

Depends on: 1605077
Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:m16]

Finally this is the last bug on the m16 milestone! I left this for last because this is the one that is least clear to me on what should be done.

For desktop we have what is called a "fuzzing delay" (see: https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/app/TelemetrySend.jsm#657-659). This bug is about implementing such a feature for the Glean SDK upload mechanism.

I could implement the jitter (or fuzzing) focused on the Glean SDK's metrics ping. So, jitter when it is around 4AM (the time we usually send the metrics pings for the Glean SDK). This solution is a bit weird to me, since the 4AM schedule for the metrics pings is defined in the bindings and not in the core. What if a different binding decides not to send the metrics ping at 4AM?

An option would be to let the bindings tell the core when they are scheduling pings based on time of day. The core can then jitter the sending of pings at the times reported.

Pings that are scheduled based on time of day could have their scheduling logic moved to the core. Should this bug be blocked on moving the implementation of these schedules out of the bindings?

I understand this is especially important for project FOG, so I would like to get opinions. My proposed solution is to add jitter around 4AM for ping uploading, passing the 4AM as a "magic number" for now. After doing that, we can determine if the jitter needs to be more generic / customizable. Moving ping scheduling to the core is another problem, to be decided on its own.

Flags: needinfo?(jrediger)
Flags: needinfo?(chutten)
Flags: needinfo?(alessio.placitelli)

(In reply to Beatriz Rizental from comment #1)

> I could implement the jitter (or fuzzing) focused on the Glean SDK's metrics ping. So, jitter when it is around 4AM (the time we usually send the metrics pings for the Glean SDK). This solution is a bit weird to me, since the 4AM schedule for the metrics pings is defined in the bindings and not in the core. What if a different binding decides not to send the metrics ping at 4AM?

Nope: we probably wouldn't want this to be specific to the metrics ping. We'd probably want jitter to be an option you would define in pings.yaml (e.g. upload_jitter: true|false - defaulting to false). It should not be up to the bindings, but handled in the Rust core.

Jitter must not affect collection time, it should only affect upload. E.g. for the metrics ping, collection would still always happen at 4am, but we'd not upload at 4 am.
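
A minimal sketch of the collection/upload split described above. It assumes a hypothetical per-ping `upload_jitter` flag and a one-hour maximum delay; `upload_time` and `MAX_JITTER` are illustrative names, not existing Glean SDK APIs.

```python
import random
from datetime import datetime, timedelta

# Assumption: the longest delay added to an upload. Not a value from the Glean SDK.
MAX_JITTER = timedelta(hours=1)


def upload_time(collected_at, upload_jitter, rng=random):
    """Return when a ping collected at `collected_at` should be uploaded.

    The collection time is never changed; only the upload is delayed.
    """
    if not upload_jitter:
        return collected_at
    # Pick a uniformly random delay between zero and MAX_JITTER.
    delay_seconds = rng.uniform(0, MAX_JITTER.total_seconds())
    return collected_at + timedelta(seconds=delay_seconds)


# Usage: a ping collected at 4AM is uploaded at some point in the following hour.
collected = datetime(2020, 1, 1, 4, 0)
scheduled_upload = upload_time(collected, upload_jitter=True)
```

With `upload_jitter=False` the function returns the collection time unchanged, which matches the proposed default.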

> An option would be to let the bindings tell the core when they are scheduling pings based on time of day. The core can then jitter the sending of pings at the times reported.
>
> Pings that are scheduled based on time of day could have their scheduling logic moved to the core. Should this bug be blocked on moving the implementation of these schedules out of the bindings?

I'm afraid that would be a bigger can of worms, and that's orthogonal to uploads.
Consistent scheduling relies on OS primitives, that might not be available to Rust.

> I understand this is especially important for project FOG, so I would like to get opinions. My proposed solution is to add jitter around 4AM for ping uploading, passing the 4AM as a "magic number" for now. After doing that, we can determine if the jitter needs to be more generic / customizable. Moving ping scheduling to the core is another problem, to be decided on its own.

Nope, I'd push back on this :) For FOG we don't know (well, I don't know!) whether we'd end up using the 'metrics' ping or whether FOG will define its own 'main' ping. I believe we need a proposal for this bug, to at least outline the ideal/general solution. Then we could decide to ship a temporary "hack", but I'd rather not do that unless the general solution is very complex.

Flags: needinfo?(alessio.placitelli)

(In reply to Alessio Placitelli [:Dexter] from comment #2)

> Nope: we probably wouldn't want this to be specific to the metrics ping. We'd probably want jitter to be an option you would define in pings.yaml (e.g. upload_jitter: true|false - defaulting to false). It should not be up to the bindings, but handled in the Rust core.

Thank you, [:Dexter]! This really clarifies the idea behind jitter delay inside the Glean SDK.

I'll wait for the input from the others and write up a proposal document after that.

That brings up the question: do we need to do something now?
glean-core doesn't handle scheduling as you said.
We don't have the requirement for jitter in either Android or iOS yet.
The uploader, which is per binding, could already do its own jitter (by just not asking for a ping upload task).

Maybe this isn't something we need to solve now, but becomes part of whatever schedule we decide for a metrics-equivalent ping on FOG?

Flags: needinfo?(jrediger)

(In reply to Jan-Erik Rediger [:janerik] from comment #4)

> That brings up the question: do we need to do something now?
> glean-core doesn't handle scheduling as you said.
> We don't have the requirement for jitter in either Android or iOS yet.
> The uploader, which is per binding, could already do its own jitter (by just not asking for a ping upload task).

The problem is that the uploader "shouldn't" know about ping types (it could, since we can't hide URLs): by uploader I mean the "dumb" part that comes from outside the language bindings, which performs the actual HTTP request.

> Maybe this isn't something we need to solve now, but becomes part of whatever schedule we decide for a metrics-equivalent ping on FOG?

Note that the schedule of the ping itself isn't the point here: we really care about the upload part itself, which products should not know about. I'm fine with tackling this further on, but only if that's fine for Chris.

The point of a jitter is to reduce the maximum simultaneous load on the pipeline. The load depends on how many documents are inbound and how large they are, not on their doctype, so I don't think a solution needs to be ping-specific.

It also need not be client-time-specific: If you live at UTC-9 your 4AM is probably a pretty quiet time so there's no need for jittering. Instead we might prefer to jitter around midnight (to avoid running afoul of telemetry peaks), 4AM, and 9AM (start of business) in EST, PST, CEST, GMT, IST, CST, JST and whichever other timezones we presently have a peak at in Telemetry data. (and probably only M-F. Weekends are probably different)
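
As a rough sketch of that idea, the check below decides whether a given UTC instant falls near a weekday peak in any of a few timezones. Every constant here is an illustrative assumption (the peak hours, the fixed UTC offsets standing in for named timezones, and the 30-minute window), and `in_peak_window` is a hypothetical helper, not part of any existing code.

```python
from datetime import datetime, timedelta

# All values below are illustrative assumptions, not measured Telemetry peaks.
PEAK_HOURS = (0, 4, 9)                # local wall-clock hours to jitter around
PEAK_UTC_OFFSETS = (-8, -5, 0, 2, 9)  # fixed offsets, roughly PST, EST, GMT, CEST, JST
WINDOW = timedelta(minutes=30)        # how close to a peak counts as "around"


def in_peak_window(now_utc):
    """Return True if `now_utc` (a naive UTC datetime) is near a weekday peak.

    Fixed offsets ignore daylight saving time; a real implementation would
    need proper timezone data.
    """
    for offset in PEAK_UTC_OFFSETS:
        local = now_utc + timedelta(hours=offset)
        midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
        # Check today's and tomorrow's peaks so the window before midnight is covered.
        for day_shift in (0, 1):
            for hour in PEAK_HOURS:
                peak = midnight + timedelta(days=day_shift, hours=hour)
                # weekday() < 5 means Monday through Friday.
                if peak.weekday() < 5 and abs(local - peak) <= WINDOW:
                    return True
    return False
```

A client could apply jitter only when this returns True, although, as discussed below, jittering everything may turn out to be simpler.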

I think we want to talk to Ops to see what would be beneficial to them. I know from prior discussions with :whd that pagers do go off if too many pings are sent at once, but that might be because of alerting more than because we're spending too much money or overflowing buffers. (and these conversations were about custom pings, not scheduled pings, reinforcing that perhaps we want a ping-type-agnostic solution, and perhaps suggesting that we want to jitter everything not just during some rush hours). This can be part of the proposal approval process (good idea, Alessio)

On the client we know[1] what time it is. We can shimmy our pings so that they arrive in a constant flurry instead of over short periods of blizzards interspersed with eerie calms. Maybe this only needs to be for metrics pings (because, well, as this query shows, we need to do something about those if we scale to the 100s of millions of users in Firefox Desktop), but maybe we can get even more benefit from even less work by making it generic?

[1]: Ha-ha, well, fine. Okay. Close enough.

Flags: needinfo?(chutten)

Hey, [:klukas]! Would you mind giving your insight on the upload jittering issue? Or maybe pointing us to people who should be involved in these discussions? Thanks :D

Flags: needinfo?(jklukas)

:whd should be involved as he's the person most heavily involved with the pipeline on the operations side.

I agree with :chutten's summary:

> The load depends on how many documents are inbound and how large they are, not on their doctype, so I don't think a solution needs to be ping-specific.

The GCP pipeline is designed to auto-scale to handle spikes in traffic, so in theory we should be able to handle spikes without spending inordinate amounts of money or having the pipeline fall over. In practice, auto-scaling is imperfect and we have the possibility of experiencing adverse behavior due to large spikes in traffic.

Any pings that are sent as a direct result of user behavior (such as baseline) should be naturally spread out. It's indeed the pings like main or metrics that are sent by many clients at the same time that can cause problematic behavior.

One thing to note is that we do segregate telemetry doctypes to their own instance of the pipeline. Glean pings go through the structured instance of the pipeline, where traffic is currently dominated by activity-stream pings associated with the newtab page in desktop. Glean will become more significant as Fenix rolls out to the full release population over the next month. We will see at that point whether metrics ping behavior causes adverse behavior in the pipeline. My gut says it will probably be fine, but :whd will likely have a more concrete answer. I definitely want to see jitter implemented before we start sending scheduled pings through FOG.

Flags: needinfo?(jklukas)
Flags: needinfo?(whd)

tl;dr ops prefer jitter over an hour for scheduled pings (basically what was implemented for telemetry). Any schedule that does not cause more than a 50% increase in overall traffic to the telemetry endpoint within a 5 minute period should be fine (typically only events involving 100% of the firefox release population can have this amount of swing).

From an operations perspective we primarily care about not exceeding the elasticity of our load balancers and the autoscaling backend systems behind them. These are AWS ELB/ALB with EC2 ASG and GCP GCLB with GKE HPA. AWS recommends not exceeding a 50 percent increase in traffic every five minutes [1]. GCP claims their load balancer scales to millions of requests with no pre-warming, but in practice our backing HPA evaluates its metrics once every 15s [2] and the metrics it's using to evaluate are probably closer in periodicity to 1m. Scaling up with pods in a statefulset also has ramp-up time, such that I would say our ability to autoscale in GCP is roughly the same as in AWS.

So in general, any scheduling mechanism that causes traffic to grow by no more than 50% in 5 minutes should suit our purposes. We have enough of a baseline of traffic between activity stream and telemetry that exceeding capacity to autoscale is unlikely to occur in most circumstances.
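
To make the guideline concrete, here is a small worked sketch that checks a series of 5-minute request counts against the 50% limit, with and without one hour of uniform jitter. The traffic numbers are made up for illustration, and `max_increase_ratio` and `within_guideline` are hypothetical helpers.

```python
def max_increase_ratio(requests_per_window):
    """Largest relative increase between consecutive 5-minute windows."""
    worst = 0.0
    for prev, cur in zip(requests_per_window, requests_per_window[1:]):
        if prev > 0:
            worst = max(worst, (cur - prev) / prev)
    return worst


def within_guideline(requests_per_window, limit=0.5):
    """True if traffic never grows by more than `limit` from one window to the next."""
    return max_increase_ratio(requests_per_window) <= limit


# Made-up numbers: 1,000,000 baseline requests per 5-minute window,
# plus 3,000,000 scheduled pings that would otherwise all arrive at once.
baseline = 1_000_000
scheduled = 3_000_000

# No jitter: every scheduled ping lands in a single window (a 300% jump).
no_jitter = [baseline, baseline + scheduled, baseline]

# One hour of uniform jitter: the pings spread over twelve 5-minute windows,
# so each window sees only 250,000 extra requests (a 25% jump).
one_hour = [baseline] + [baseline + scheduled // 12] * 12 + [baseline]

print(within_guideline(no_jitter), within_guideline(one_hour))
```

Under these assumed numbers the unjittered schedule breaks the guideline while the one-hour spread stays well inside it; real traffic would need to be checked against actual baseline figures.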

The primary circumstance where we can still run into autoscaling issues is with a 100% release population system addon rollout (which causes a change to the environment which causes a subsession split which sends a main ping). I worked with :mythmon earlier this year to get to a place where system addons rollouts with this target population have sufficient jitter (I believe also 1hr) to prevent autoscaling issues from arising.

Overall I'm not too concerned with autoscaling and jitter for newer pings, as long as they match approximately what we have for standard telemetry (under the assumption that one day glean-based traffic will subsume legacy telemetry).

[1] https://aws.amazon.com/articles/best-practices-in-evaluating-elastic-load-balancing/#pre-warming
[2] https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

Flags: needinfo?(whd)

I am deferring this until we come to a conclusion on whether to move ping scheduling to the Rust core.

Depends on: 1651382
Depends on: 1663599
Whiteboard: [telemetry:glean-rs:m16] → [telemetry:glean-rs:backlog]
Priority: P3 → P4