FOG is sending an order of magnitude too many "deletion-request" pings
Categories
(Toolkit :: Telemetry, task, P2)
Tracking
People
(Reporter: chutten, Assigned: chutten)
References
Details
https://sql.telemetry.mozilla.org/queries/69026#174211 shows that, starting around the end of January (coincides with Fx 85 hitting Release), FOG (Glean with application id firefox_desktop) stopped sending its previous just-about-the-same-number-as-Firefox-Telemetry number of "deletion-request" pings (20-30k/day) and started sending an outrageous (and growing!) number of "deletion-request" pings (120k to start, nearly 300k/day now).
What is up with that? FOG's been sending "deletion-request" pings for many versions (how many?) before 85. What changed?
Comment 1•4 years ago
That dashboard does get numbers before copy_deduplicate (it queries payload_bytes_decoded.*), but duplicates made up less than 1% of the reported values for the last couple days.
Comment 2•4 years ago
Splitting by OS and country code reveals this to be some seriously weird stuff.
OS
https://sql.telemetry.mozilla.org/queries/79073/source#196482
- Windows picks it up on release day and climbs ever up and to the right.
- Linux gets a share of the problem on release day, but doesn't kick off for real until Feb 1.
- Mac OSX has irregular and gigantic spikes, but only after Valentine's Day.
Geo
https://sql.telemetry.mozilla.org/queries/79076/source#196484
The Mac spikes are 100% attributable to Singapore. The climb in Windows numbers up and to the right? Vietnam (and to a lesser extent Russia). The big spike on Jan 29th? Germany.
What does this mean?
I don't think these patterns are particular to OS or geography. I think these are singular "clients" juicing the data. Perhaps we're getting flooded?
Certainly looking at volumes of distinct client ids we're not seeing anywhere near the same weirdness. Maybe it doesn't have to be attackers?
Questions
- Why are we getting so many duplicate requests to delete the same client ids? How are we sending multiple "deletion-request" pings with different doc ids, but the same client_id? Shouldn't that be impossible?
- By what mechanism are we getting so many client ids so quickly from so few originators? Is there a bug in the pref observer? Why aren't we seeing problems with Firefox Telemetry-sent "deletion-request" pings?
Theories
The Glean SDK has code detecting when the upload pref is disabled between application invocations. This was introduced to handle CLI cases where preferences were changed by text editors between invocations, not by UI present in the application. When this situation is detected, the Glean SDK will send a "deletion-request" ping and set the client_id to the c0ffee canary value.
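The detection described above can be sketched roughly as follows. This is an illustrative Python model, not the actual Rust SDK code; the db layout, function names, and canary literal are assumptions for the sketch.

```python
# Hypothetical model of Glean's init-time check for an upload pref
# that was flipped off between application invocations.
CANARY_CLIENT_ID = "c0ffeec0-ffee-c0ff-eec0-ffeec0ffeec0"  # "c0ffee" canary (assumed format)

def on_init(db, upload_enabled_now, outbox):
    """Compare the persisted upload state with the current pref value."""
    was_enabled = db.get("upload_enabled", True)
    if was_enabled and not upload_enabled_now:
        # Pref was disabled while Glean wasn't running: send a
        # "deletion-request" ping, then replace the client_id with
        # the canary so no further pings carry the real id.
        outbox.append({"ping": "deletion-request", "client_id": db["client_id"]})
        db["client_id"] = CANARY_CLIENT_ID
    db["upload_enabled"] = upload_enabled_now

# Note: a profile cloned *before* this check runs replays it on every
# machine it lands on, so each clone emits one deletion-request with
# the same client_id but a fresh document_id -- matching the theory below.
```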
If a situation like this happens to Firefox Desktop and the db of a Glean SDK is copied into the profile (a profile clone), then we'd see a high number of "deletion-request" pings without a similar increase in the number of clients. If the profile is cloned to many machines once (as part of a deploy, perhaps), we'd see that as a spike. If the profile is cloned every day, or on every reset, or every app session, or similar, we'd see it as a sustained high volume.
This theory is consistent with the observed data: both the high, sustained level of pings and the spikiness. It also explains why we don't see this in Firefox Telemetry's "deletion-request" pings: in the event a pref is changed when Firefox Telemetry isn't init'd to notice, Firefox Telemetry just sets the client_id to c0ffee and doesn't send any pings.
So what do we do?
Comment 3•4 years ago
(In reply to Chris H-C :chutten from comment #2)
I think these are singular "clients" juicing the data. [...] the Glean SDK will send a "deletion-request" ping and set the client_id to the c0ffee canary [...] Firefox Telemetry just sets the client_id to c0ffee and doesn't send any pings. So what do we do?
Is it possible to tell glean to skip the "deletion-request" ping if it hasn't sent any pings? Would that help?
If we make FoG a special-case, we could more aggressively deduplicate deletion request pings by configuring copy_deduplicate to be based on client_id instead of document_id, but then we'd need to be careful to include any shredder-used ids (e.g. activity stream's impression id) in the deduplication.
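A rough Python sketch of what client_id-keyed deduplication would have to do (the real copy_deduplicate step operates in BigQuery and keys on document_id; the field names and structure here are assumptions for illustration):

```python
def dedupe_deletion_requests(pings):
    """Keep one deletion-request per client_id, but union any
    secondary shredder-used ids (e.g. Activity Stream's impression_id)
    across the duplicates so no deletion target is lost.
    Illustrative only, not the pipeline's actual implementation."""
    by_client = {}
    for p in pings:
        key = p["client_id"]
        if key not in by_client:
            by_client[key] = {"client_id": key, "extra_ids": set()}
        # Careless dedup would drop these; we must keep their union.
        by_client[key]["extra_ids"] |= set(p.get("extra_ids", []))
    return list(by_client.values())
```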
If we want to fix this more generally, then we could modify shredder to materialize the query it uses to select ids for deletion, and make that trigger on a certain volume threshold.
Comment 4•4 years ago
:relud's suggestions seem like some good things to consider if we can't find a client-side solution.
As for the client-side, I don't have any objection to just skipping the "preference changed while offline" check for all non-Python (command line) environments. It would be nice to catch this, but it is kind of a corner case that seems like it's causing more harm than help...
Comment 5•4 years ago
(In reply to Daniel Thorn [:relud] from comment #3)
Is it possible to tell glean to skip the "deletion-request" ping if it hasn't sent any pings? Would that help?
I'm not sure it is because of secondary ids. Maybe e.g. Activity Stream has some data associated with impression_id that we do need to delete, even if there's no Glean data to delete.
And I'm not sure it'd help. If this is cloned profiles, the source profile might have already sent one or more pings before it was cloned.
If we make FoG a special-case, we could more aggressively deduplicate deletion request pings by configuring copy_deduplicate to be based on client_id instead of document_id, but then we'd need to be careful to include any shredder-used ids (e.g. activity stream's impression id) in the deduplication.
What's the impact on the pipeline of having a bunch of the same exact client_id (and possibly other ids, too)?
(In reply to Michael Droettboom [:mdroettboom] from comment #4)
As for the client-side, I don't have any objection to just skipping the "preference changed while offline" check for all non-Python (command line) environments. It would be nice to catch this, but it is kind of a corner case that seems like it's causing more harm than help...
Unfortunately for FOG this isn't an option we can really consider. FOG's init happens at some point after the app starts (here's where we ask for it to be scheduled, in a function called scheduleStartupIdleTasks) after "every window has finished being restored by session restore". Conceivably a user could disable the pref via Options between app start and FOG's init, which to Glean is indistinguishable from a pref change between application runs.
We could add a reason to "deletion-request" pings. Bea suggests reason init for ones sent during init, and setUploadEnabled for ones sent as a result of setUploadEnabled(false) after Glean has init. This won't let us ignore any pings, but it does allow us to monitor that setUploadEnabled-reason "deletion-request" pings stay within an expected volume, even if init-reason "deletion-request" pings spike, grow, and generally misbehave.
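The monitoring that a reason field would enable might look something like this sketch (the reason values come from the comment above; the threshold mechanism and function are hypothetical):

```python
def monitor_deletion_requests(pings, expected_max_set_upload):
    """Split deletion-request pings by the proposed `reason` field.
    Alert only when the setUploadEnabled-reason volume is anomalous;
    init-reason pings are expected to spike (cloned profiles, etc.)
    and are excluded from alerting.  Illustrative sketch only."""
    counts = {"init": 0, "set_upload_enabled": 0}
    for p in pings:
        reason = p.get("reason")
        if reason in counts:
            counts[reason] += 1
        # Pings with no reason (older SDKs) are counted in neither bucket.
    alert = counts["set_upload_enabled"] > expected_max_set_upload
    return counts, alert
```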
I'm not real pleased about adding anything to a "deletion-request" ping that isn't an id that tells us what we need to delete, but it would be a way to verify that this is indeed what is happening and offer us an ability to continue monitoring the health of the system by filtering out the unreasonable init-reason "deletion-request" pings.
I have some questions out to KrisWright and mkaply about whether, if this is a Policy Engine or some other instance-wide pref-setting mechanism, we can detect it reliably in FOG and suppress sending these duplicate and unnecessary "deletion-request" pings in Glean. Might need an SDK change to have an init config that means "Yes, upload is disabled. No, you don't have to send a 'deletion-request' ping. Just set the client_id to c0ffee"
Comment 6•4 years ago
According to mkaply on Slack, Policy Engine locks the pref when it disables it, and pref locking isn't exposed to users using any UI. So if the pref is locked, it might be a good sign of a situation not needing a "deletion-request" ping.
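If we went this route, the gate would be a small predicate like the following. This is a hypothetical sketch of the decision, not any existing Glean or FOG API (in Firefox the lock check itself would come from the pref service):

```python
def should_send_deletion_request(pref_disabled, pref_locked):
    """Hypothetical gate: a pref that is disabled *and* locked implies
    an enterprise policy (Policy Engine), not a user opt-out, since
    locking isn't exposed in any UI.  In that case skip the
    "deletion-request" ping and only scrub the client_id."""
    return pref_disabled and not pref_locked
```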
Comment 7•4 years ago
So where do we go from here?
Well, I think I'm going to go ahead and recommend we add a reason to "deletion-request" pings to identify which trigger (init or set_upload_enabled) caused the submission of the ping. It is in line with the reason of the other SDK built-in pings, and will let us determine whether these spikes and volume inflation are indeed due to preference changes that might not have originated from users. Look forward to an SDK bug about that soon.
From there we can evaluate whether or not we want to suppress "deletion-request" pings when the pref has been locked off. Locking is not something that users do, and only users can request self-service data deletion (they are the "self") there. I think information from the distribution of reason values in "deletion-request" pings is necessary to help us with this decision. If it doesn't meet my hypothesis (a core of set_upload_enabled-caused ones approximately matching the number of "deletion-request" pings from Telemetry plus a wildly variable spiky mess of init-caused ones), then we should give this another think.
Gonna leave this bug open to track the work necessary to upgrade the SDK after the SDK bug is fixed and released.
Comment 8•4 years ago
The work to upgrade the SDK was taken care of in bug 1611770.
Preliminary looks at live data suggest that this is working... but not completely. Still getting some NULL reasons in there. Gonna file an analysis bug to look into this after we have a couple days' data.
The NULLs were from earlier SDKs in use on other channels. When I filtered for the proper SDK version (AND client_info.telemetry_sdk_build = '37.0.0') they go away.