Closed Bug 1809900 Opened 2 years ago Closed 8 months ago

Where are all those "deletion-request" pings with `null` and `at_init` coming from?

Categories

(Toolkit :: Telemetry, task, P1)

RESOLVED WONTFIX

People

(Reporter: chutten, Assigned: chutten)

Details

(Whiteboard: [dataquality])

Attachments

(4 files)

Jan-Erik (in bug 1806441) added monitoring for "deletion-request" ping volume by reason to the FOG Monitoring Dashboard.

What I expected was a small number of pings per day, mostly set_upload_enabled with a few at_init sprinkled in. At most, I figured, the two would be roughly equal in volume.

Instead, what I saw was at_init with an order of magnitude more pings per day than set_upload_enabled... and a third line on the graph for a null reason, which shouldn't happen at all since we upgraded to Glean v37 in Firefox 90 (bug 1611770).

What's going on here?

Turns out this is ping volume, not client volume. If we look at client volume, it paints a much more reasonable-looking picture.
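In case it helps, the difference between the two views is roughly the query below. This is a sketch, not the dashboard's actual query, and I'm assuming the usual FOG table layout (moz-fx-data-shared-prod.firefox_desktop.deletion_request, with the reason in ping_info.reason), so take the names as assumptions:

  -- Ping volume vs. client volume, per day and reason (sketch).
  SELECT
    DATE(submission_timestamp) AS submission_date,
    ping_info.reason AS reason,
    COUNT(*) AS ping_volume,                                -- what the dashboard plots
    COUNT(DISTINCT client_info.client_id) AS client_volume  -- the more reasonable picture
  FROM `moz-fx-data-shared-prod.firefox_desktop.deletion_request`
  WHERE DATE(submission_timestamp) >= '2022-11-01'
  GROUP BY submission_date, reason
  ORDER BY submission_date, reason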

The recent spikes are 104.0 at_init, 101.0.1 at_init, and 108.0.2 at_init. null is middle of the pack, an easy two orders of magnitude lower than the monitoring dashboard suggests.

This doesn't explain everything I think is weird, and it raises more questions about why 104, 101.0.x, and 108.0.x feature so prominently here in 2023 (Fx101 was released May 31, Fx104 Aug 23, Fx108 Dec 13). But it does take the severity of figuring this out down from "Oh goodness, is this an Incident?" to "Oh great, not another weird data thing of middling and not-worsening importance".

See Also: → 1809904

"deletion-request" pings with null reason

Prior to Glean v37, introduced to Firefox 90 in bug 1611770, null was the only ping reason. How many of the null-reason pings we're seeing today are due to clients still on those "ancient" versions (Fx90 reached release in July of 2021)?

Turns out pretty much all of them. Since November 1, 2022 we've received 416 "deletion-request" pings with NULL reason from 70 clients. I'm confused that we're getting any, but the volume is so far below the noise floor that I think we can safely leave this alone.
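For the record, that tally comes from something like the query below (same caveat as above: table and column names are my assumption of the layout, not a copy-paste of the exact query):

  -- Count "deletion-request" pings and distinct clients with a NULL reason
  -- since 2022-11-01; this is the shape of query behind the 416 pings / 70 clients figure.
  SELECT
    COUNT(*) AS pings,
    COUNT(DISTINCT client_info.client_id) AS clients
  FROM `moz-fx-data-shared-prod.firefox_desktop.deletion_request`
  WHERE DATE(submission_timestamp) >= '2022-11-01'
    AND ping_info.reason IS NULL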

Conclusion: The null pings are from old versions of Firefox. They should decrease over time, and it is my opinion that we can ignore them.

(Unanswered questions remaining: at_init's relative volume, and those weird spikes starting Jan 5)

volume of "deletion-request" pings with at_init reason far exceeding set_upload_enabled reason

If a user disables upload, we send a "deletion-request" ping. If that disabling happens between when the Glean SDK is initialized and when it is shut down, that ping has the reason set_upload_enabled. If that disabling happens outside of that window, the ping is sent on the next Glean SDK init with reason at_init.

We believe real human user actions can result in pings of either type. If they open the options and disable telemetry upload during browser startup, they can send an at_init ping: Glean is initialized at some convenient time after the busiest parts of startup, so especially on slower machines where Firefox stays in the busy part of startup for longer, users may have plenty of time to turn off upload first.

But more often we suspect at_init is a reason associated with automation. If upload is opted out of by manipulating the profile directly, then on the next (perhaps first) launch of Firefox with that profile, we'll upload an at_init-reason "deletion-request" ping.

As such I'd expect the ratio of at_init-reason "deletion-request" pings to set_upload_enabled-reason "deletion-request" pings to be at most 1.

I'm wrong, of course. Even the ratio of clients sending at_init to set_upload_enabled regularly far exceeds 1 outside of nightly and aurora channels.
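The ratio check is easy enough to sketch per channel (again, assumed table and column names; a sketch only, not the exact query behind the graph):

  -- Distinct clients per reason, and the at_init : set_upload_enabled ratio, by channel.
  SELECT
    normalized_channel,
    COUNT(DISTINCT IF(ping_info.reason = 'at_init', client_info.client_id, NULL))
      AS at_init_clients,
    COUNT(DISTINCT IF(ping_info.reason = 'set_upload_enabled', client_info.client_id, NULL))
      AS set_upload_enabled_clients,
    SAFE_DIVIDE(
      COUNT(DISTINCT IF(ping_info.reason = 'at_init', client_info.client_id, NULL)),
      COUNT(DISTINCT IF(ping_info.reason = 'set_upload_enabled', client_info.client_id, NULL)))
      AS ratio
  FROM `moz-fx-data-shared-prod.firefox_desktop.deletion_request`
  WHERE DATE(submission_timestamp) >= '2022-11-01'
  GROUP BY normalized_channel
  ORDER BY normalized_channel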

This shouldn't have surprised me. I've seen this before in bug 1701985. In my defence, that was nearly 2 years ago.

Conclusion: I was wrong to think that set_upload_enabled volume in clients or pings would exceed at_init volume in Firefox Desktop. Whoops.

(( Unanswered questions remaining: those weird spikes starting Jan 5 ))

See Also: → 1701985

Spikes in "deletion-request" ping and client volume beginning Jan 5, 2023

Starting January 5, 2023, there was a spike in the number of clients sending "deletion-request" pings from release 108, 104, 101, and 100. There were no corresponding spikes in other channels or other versions (aside from a temporary spike on channel Other version 91 in early December... but that's channel Other, which I'm just gonna ignore).

By spike I mean release 104 went from not charting at all to sending us "deletion-request" pings from 2x-3x the number of clients that release 108 (the current version) was typically sending in late 2022. Release 101 did the same, but to 1.3x. Release 108 itself went from normal to 1.7x (though that spike may not have started until Jan 9). Release 100 went from essentially nothing to 0.5x the volume of the current version.

Everything else was noisy but normal.
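If anyone wants to reproduce the per-version view, it's roughly the sketch below (same caveat about assumed table and column names; the major-version filter is just to keep the output small):

  -- Distinct clients sending "deletion-request" per day, split by version,
  -- for the release versions that spiked (100, 101, 104, 108).
  SELECT
    DATE(submission_timestamp) AS submission_date,
    client_info.app_display_version AS version,
    COUNT(DISTINCT client_info.client_id) AS clients
  FROM `moz-fx-data-shared-prod.firefox_desktop.deletion_request`
  WHERE normalized_channel = 'release'
    AND DATE(submission_timestamp) >= '2022-12-01'
    AND SPLIT(client_info.app_display_version, '.')[OFFSET(0)] IN ('100', '101', '104', '108')
  GROUP BY submission_date, version
  ORDER BY submission_date, version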

Jan 5 coincides with the release of Fx108.0.2, which may explain why it's prominent on the map: if you combine the outgoing volume of Fx108.0.1 with the incoming Fx108.0.2, they are extra prominent. It may also explain why Fx108 peaks on Jan 9, as that's the first Monday after the dot release. Fx108 may not actually be "weird" despite being one of the spikes on the initial graph.

There's no such saving grace for 100, 101, and 104. Fx104.0, 104.0.1, 104.0.2 are all spiking (mostly 104.0). And 101.0.1 and 101.0 (mostly 101.0.1). Fx100 is mostly spiking just in 100.0 (100.0.1 and 100.0.2 appear to have similar/same volume to before the spike begins). As far as I can tell, there's nothing about these versions in particular that'd make them prone to this sort of misbehaviour. They're on different versions of the Glean SDK, they were released at different times, only some are dot releases, only some are even numbered... nothing obvious comes to mind, which has me leaning in the "there's a weird system doing this to us" direction and away from "there's something in the code going awry".

Leaning into "weird system", I then looked at metadata.isp.name. This was "fun", as the variety of ISPs really caused sql.tmo to chug. (Reminder to friends: HAVING COUNT(DISTINCT client_info.client_id) > N (for some N) is helpful for focusing on higher-volume combinations of dimensions.) And lo and behold, all of this spike comes from a single ISP: Korea Telecom.
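To spell out that HAVING trick, the ISP breakdown looks something like this (a sketch; N = 100 is arbitrary, and as before the table and column names are my assumptions):

  -- Group the spike window by ISP and keep only the higher-volume ones,
  -- which is what keeps sql.tmo from chugging through thousands of tiny ISPs.
  SELECT
    metadata.isp.name AS isp,
    COUNT(DISTINCT client_info.client_id) AS clients,
    COUNT(*) AS pings
  FROM `moz-fx-data-shared-prod.firefox_desktop.deletion_request`
  WHERE DATE(submission_timestamp) >= '2023-01-05'
    AND normalized_channel = 'release'
  GROUP BY isp
  HAVING COUNT(DISTINCT client_info.client_id) > 100  -- the "> N" filter
  ORDER BY clients DESC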

Going one level deeper to metadata.isp.organization gets us no further resolution (still Korea Telecom, meaning there's no org-level information here). geo.city is mostly null, and geo.subdivision1 is split amongst a few numbered subdivisions (mostly 11, some 41 and 42) which, if they're from this list, suggests mostly Seoul and the northern provinces (which are listed as the most populous, so this might just be a Map of Where People Live rather than anything that meaningfully divides the population sending us these pings).

So, uh... what?

There's been another spike. And lo and behold it's still 100% Korea Telecom. If we hope to get to the bottom of this, we might need someone other than me to look at this... and some luck.

Whiteboard: [data-quality]
Whiteboard: [data-quality] → [dataquality]

Ah, failed to mention we're tracking this as a data Incident (tracking doc). And just today we noticed that the reports have abruptly dropped nearly to normal levels.

:chutten is this still an issue or can this bug be closed?

Flags: needinfo?(chutten)

It's nowhere near as bad as it was, but it's been getting worse since September. Still on the same weird versions.

That being said, I'd entertain arguments for closing the bug on the grounds that it isn't really harming anything and we're stumped about what to do about it.

Flags: needinfo?(chutten)

This is basically not a problem any more, with Korea Telecom not exhibiting the issue since last April, the Viettel Group (in Vietnam) since last September, and Converge ICT (in the Philippines) since last November.

We have monitoring that ought to catch this should it happen again, so WONTFIX-ing until then.

Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → WONTFIX