Open Bug 1702052 Opened 4 years ago Updated 2 years ago

Allow to wait for Glean ping to be submitted

Categories

(Toolkit :: Telemetry, enhancement, P3)

People

(Reporter: nalexander, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [telemetry:fog:m?][fidedi-ope] )

Right now Glean's submitPing and Submit are fire-and-forget.

For use in --backgroundtask backgroundupdate (see Bug 1689519), I want to allow the ping a reasonable chance to actually be submitted. I intend to use a persistent data store and the affordance added by Bug 1694505 to have these pings sent somewhat robustly, but it still seems like I should be able to wait for ... something ... to know data transfer was attempted. Or does Glean wire into the shutdown blocker mechanisms so that this happens automatically?

In any case, this ticket tracks a method to do a sensible thing in Bug 1654891.

While we're here, it would be nice to have some token (UUID? URL?) identifying the submitted ping out of this method, so I could stuff it into local logs for easy correlation.

(In reply to Nick Alexander :nalexander [he/him] from comment #0)

For use in --backgroundtask backgroundupdate (see Bug 1689519), I want to allow the ping a reasonable chance to actually be submitted. I intend to use a persistent data store and the affordance added by Bug 1694505 to have these pings sent somewhat robustly, but it still seems like I should be able to wait for ... something ... to know data transfer was attempted.

If you were given the opportunity to wait on submission or, at least, to receive a notification about an error, how would that be handled on your end?

Flags: needinfo?(nalexander)

(In reply to Alessio Placitelli [:Dexter] from comment #2)

(In reply to Nick Alexander :nalexander [he/him] from comment #0)

For use in --backgroundtask backgroundupdate (see Bug 1689519), I want to allow the ping a reasonable chance to actually be submitted. I intend to use a persistent data store and the affordance added by Bug 1694505 to have these pings sent somewhat robustly, but it still seems like I should be able to wait for ... something ... to know data transfer was attempted.

If you were given the opportunity to wait on submission or, at least, to receive a notification about an error, how would that be handled on your end?

A fine question. What I'm mostly worried about is "running off the end": asking to submit a ping, more or less immediately exiting, and having the (asynchronous) submit never actually send because we fast exit or similar.

In the error case, presumably the details would get captured in the persistent Glean DB, potentially getting submitted in a later run? I'm less concerned about knowing the submission succeeded, and more concerned about knowing the submission was attempted.

If ping submission doesn't return control to the invoker until after submission is attempted, then I don't need anything but a comment in my code to that effect.

Flags: needinfo?(nalexander) → needinfo?(alessio.placitelli)

(In reply to Nick Alexander :nalexander [he/him] from comment #3)

A fine question. What I'm mostly worried about is "running off the end": asking to submit a ping, more or less immediately exiting, and having the (asynchronous) submit never actually send because we fast exit or similar.

In the error case, presumably the details would get captured in the persistent Glean DB, potentially getting submitted in a later run? I'm less concerned about knowing the submission succeeded, and more concerned about knowing the submission was attempted.

Got it, thanks for giving us more context. This should be covered already: if you hit submit Glean guarantees you that a collection happens. Our shutdown procedure rejects any new operation after it gets called, but guarantees the execution of the previous call (unless, of course, the caller process ends for different reasons).
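To illustrate the shutdown behaviour described above, here is a minimal sketch of a task queue that rejects new work once shutdown is requested but still runs everything queued before that point. This is not Glean's dispatcher; the `Dispatcher` class and its `launch`/`shutdown` methods are hypothetical names chosen for the example.

```python
import queue
import threading


class Dispatcher:
    """Toy task queue (hypothetical, not Glean's code).

    shutdown() rejects new work but drains what was already queued.
    """

    _STOP = object()  # sentinel marking the end of the queue

    def __init__(self):
        self._tasks = queue.Queue()
        self._lock = threading.Lock()
        self._accepting = True
        self._worker = threading.Thread(target=self._run)
        self._worker.start()

    def launch(self, task):
        """Queue a task; return False if shutdown was already requested."""
        with self._lock:
            if not self._accepting:
                return False
            self._tasks.put(task)
            return True

    def shutdown(self):
        """Stop accepting tasks, then block until queued tasks have run."""
        with self._lock:
            if not self._accepting:
                return
            self._accepting = False
            # Queued under the lock, so the sentinel lands after every
            # task that launch() accepted.
            self._tasks.put(self._STOP)
        self._worker.join()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is self._STOP:
                return
            task()


collected = []
dispatcher = Dispatcher()
accepted_before = dispatcher.launch(lambda: collected.append("background-update"))
dispatcher.shutdown()  # returns only after the queued task has run
accepted_after = dispatcher.launch(lambda: collected.append("too-late"))
```

As in the comment above, none of this helps if the process dies before `shutdown()` gets to run.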

If ping submission doesn't return control to the invoker until after submission is attempted, then I don't need anything but a comment in my code to that effect.

Note that submission in the Legacy telemetry/Glean context means this. So the above procedure applies to the collection: pings are guaranteed to be collected. There's no way to guarantee upload internally or, at least, we don't guarantee upload in our default uploader. However, Glean lets you specify custom uploaders, meaning you can provide your own guarantees on upload, e.g. make the uploader block shutdown :)

Flags: needinfo?(alessio.placitelli)
No longer blocks: 1654891

Sorry for taking so long to answer this.

(In reply to Nick Alexander :nalexander [he/him] from comment #3)

If ping submission doesn't return control to the invoker until after submission is attempted, then I don't need anything but a comment in my code to that effect.

After examining the code and docs and internal guarantee motivations, I can provide for you now the most precise description of what ping submission means:

What we can guarantee is that any instrumentation calls to Glean in the same app session that happen after submit() will not have their data put into the ping that you just submitted. That's pretty much the only thing we can guarantee without assuming that we don't crash at an inopportune moment.

If we assume we have an orderly shutdown instead of a crash (or that we are running for long enough that our threads aren't starved), Glean drains its instrumentation queue (aka "the dispatcher"), which adds the guarantee that Glean will attempt to submit a ping (though if this happens during shutdown, network might have already been torn down, so you might have to wait for next session's startup for the first upload attempt). If we assume disk I/O doesn't explode when trying to write a file to disk, then we can guarantee the ping submit succeeds, which means Glean will try to upload that ping at least as many times as Glean is init'd, and likely more if the app sessions are of sufficient length. Even if disk I/O explodes, we will attempt to upload whatever we have in-memory at least once (again, if this is during shutdown this attempt may fail).

So... barring crashes and disk I/O failures, you should be all set. And even in the face of those, so long as they don't happen at the wrong time, you should be set anyway.

Does this help, :nalexander?

Flags: needinfo?(nalexander)

Sorry for the long delayed response. I think I'm satisfied here. I would still like it to be easier to reason about shutdown, but I'm not particularly concerned about pings getting "lost", since we have the background update process running in the wild and it seems to be working pretty well.

I do wonder if we should add a counter that we bump each time the task starts, so that we can witness missing pings. It would be an easy experiment to understand the robustness of the Glean storage and upload process.

Flags: needinfo?(nalexander)

Come to think of this, there are a couple of pieces of internal consistency instrumentation that the Glean SDK provides for you that are relevant here.

First, we have the dirty_startup-reason "baseline" ping. Most of the "baseline" ping scheduling in FOG doesn't apply to the background update task (since the scheduling is based on user interaction), so all we get are reason active at startup and reason dirty_startup at startups where the Glean SDK didn't completely shut down. The current (Nov 1-4) rate of background update startups that are dirty is fifteen hundredths of a percent (0.15%), so this is consistent with the hypothesis that, even in the wild, the Glean SDK in the Background Update Task has an opportunity to process all pending operations and cleanly shut down almost every time.
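For readers unfamiliar with the dirty-flag idea, a minimal sketch follows. It assumes the flag is a marker file in a data directory; the `DirtyFlag` class and the file name `dirty` are hypothetical and only illustrate the mechanism, not how the Glean SDK actually stores it.

```python
import tempfile
from pathlib import Path


class DirtyFlag:
    """Hypothetical marker-file version of a 'dirty' flag."""

    def __init__(self, data_dir):
        self._path = Path(data_dir) / "dirty"

    def startup(self):
        """Return True if the previous session did not shut down cleanly.

        Also marks the current session dirty until clean_shutdown() runs.
        """
        was_dirty = self._path.exists()
        self._path.touch()
        return was_dirty

    def clean_shutdown(self):
        """Clear the flag; only reached on an orderly shutdown."""
        self._path.unlink(missing_ok=True)  # requires Python 3.8+


with tempfile.TemporaryDirectory() as data_dir:
    results = []

    # Session 1: fresh data dir, exits without a clean shutdown.
    results.append(DirtyFlag(data_dir).startup())

    # Session 2: sees the leftover flag, then shuts down cleanly.
    session = DirtyFlag(data_dir)
    results.append(session.startup())
    session.clean_shutdown()

    # Session 3: previous session was clean.
    results.append(DirtyFlag(data_dir).startup())
```

Here a `True` result from `startup()` corresponds to a startup that would be reported with reason dirty_startup.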

Now, this only ensures we're definitely processing that near-shutdown submit call (and all the calls before it). If you want to be sure that all the pings that are submitted are showing up we have sequence numbers in every ping at ping_info.seq. There should rarely be "holes" in the seq record and you can run an analysis like this one to check.
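A rough sketch of what such a hole check could look like, assuming you have already extracted (client id, ping name, `ping_info.seq`) tuples from the received pings. The function name `find_seq_holes` and the sample records are made up for illustration, and the check assumes seq counts up by one per client and ping type.

```python
from collections import defaultdict


def find_seq_holes(records):
    """Return {(client_id, ping_name): [missing seq, ...]}.

    Only gaps between the smallest and largest observed seq are reported;
    pings missing after the largest observed seq cannot be detected.
    """
    seen = defaultdict(set)
    for client_id, ping_name, seq in records:
        seen[(client_id, ping_name)].add(seq)

    holes = {}
    for key, seqs in seen.items():
        missing = sorted(set(range(min(seqs), max(seqs) + 1)) - seqs)
        if missing:
            holes[key] = missing
    return holes


# Made-up sample data: client-a is missing seq 2, client-b has no holes.
records = [
    ("client-a", "background-update", 0),
    ("client-a", "background-update", 1),
    ("client-a", "background-update", 3),
    ("client-b", "background-update", 5),
    ("client-b", "background-update", 6),
    ("client-a", "background-update", 3),  # duplicate delivery, ignored
]
holes = find_seq_holes(records)
```

A hole does not necessarily mean a ping is lost for good: it may still be sitting in the pending pings list waiting for a later session to upload it.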


Now, all this assumes that Glean's able to persist the dirty flag at startup and the seq value on submit. But if we're having problems with that, then the disk might be broken enough that you'd have trouble persisting your own sequence number to cross check... so I'm not really sure what to do about that.

(In normal continuous operation this should reach eventual consistency even without such a mechanism. But maybe it's a good thing to add, and if so, it may require some SDK work, too.)

Severity: -- → N/A
Priority: -- → P3
Whiteboard: [telemetry:fog:m?]
Whiteboard: [telemetry:fog:m?] → [telemetry:fog:m?][fidedi-ope]

This has come up again, so Jan-Erik and I chatted about this.

We're leaning away from providing a way to wait for a given submitted ping to be uploaded. It wouldn't solve the problem completely (especially if the ping upload failed and needed retry on this or another session), and may encourage folks to block on submission of pings in general when we specifically architected against it.

We considered a blocking shutdown API that these bg tasks could call. This wouldn't be ideal for these specific cases, as the individual tasks might not have enough insight into or control over their shutdowns to time it properly, and it'd be a thing that could be forgotten, resulting in missing data. We can do better.

We're considering a configuration param for FOG at init that tells it to block shutdown until Glean's uploader has a chance to run through its pending pings list once. FOG'd add a shutdown blocker in AppShutdownNetTeardown that it'd release after the Glean SDK signals it's tried uploading its pending pings.
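A toy sketch of the shape of this idea, under heavy assumptions: the uploader makes one pass over its pending pings and sets a signal, and shutdown waits on that signal. `PendingPingUploader`, `block_shutdown_until_attempted`, and the timeout are all hypothetical; none of them are existing FOG or Glean SDK APIs, and the timeout is my own addition rather than something proposed above.

```python
import threading


class PendingPingUploader:
    """Hypothetical uploader that signals after one pass over pending pings."""

    def __init__(self, pending, send):
        self._pending = list(pending)
        self._send = send  # callable(ping) -> True on successful upload
        self.first_pass_done = threading.Event()

    def run_once(self):
        """Attempt every pending ping once, keeping failures for later."""
        still_pending = []
        for ping in self._pending:
            if not self._send(ping):
                still_pending.append(ping)  # retry in a later session
        self._pending = still_pending
        self.first_pass_done.set()

    def pending(self):
        return list(self._pending)


def block_shutdown_until_attempted(uploader, timeout_s):
    """Return True if the pass finished in time, False if we gave up."""
    return uploader.first_pass_done.wait(timeout=timeout_s)


attempted = []


def fake_send(ping):
    attempted.append(ping)
    return ping != "ping-2"  # simulate one failed upload


uploader = PendingPingUploader(["ping-1", "ping-2"], fake_send)
worker = threading.Thread(target=uploader.run_once)
worker.start()
released = block_shutdown_until_attempted(uploader, timeout_s=5.0)
worker.join()
```

Note that the blocker is released once uploads have been attempted, not once they have succeeded: the failed ping stays pending for a later session.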

This requires changes to the SDK to figure out when the pending pings have been attempted to be sent, changes to FOG to register a shutdown blocker, careful synchronization with bug 1777233 to make sure we don't step on our own toes, and changes in the background tasks to opt into this behaviour.

(Firefox Desktop itself might want this behaviour too, but its usage pattern is more likely to produce app sessions long enough for normal-mode uploading, so maybe not. We can look into this if ping latency (specifically upload latency) grows too long.)

That means the first step will be filing and fixing a bug against the SDK to gain and expose the knowledge of whether it has processed its pending pings.

Depends on: 1790702
See Also: → 1749510