Open Bug 1866094 Opened 1 year ago Updated 10 months ago

Investigation: clients with duplicated sequence numbers

Categories

(Data Platform and Tools :: Glean: SDK, task, P3)

task

Tracking

(Not tracked)

People

(Reporter: janerik, Unassigned)

In my research on the Glean event timestamps I noticed that some clients re-sent some of their pings.

This is visible as pings with the same (client_id, document_id, seq) tuple, but with different submission_timestamps.
Notably, the submission_timestamps fall on different days (re-submissions on the same day should be caught by the copy_dedupe task).
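
To make the check above concrete, here is a minimal Python sketch of it. It is not the linked query: the dict-shaped ping records and the function name `find_cross_day_resubmissions` are assumptions for illustration only. It groups pings by the (client_id, document_id, seq) tuple and reports the tuples that show up on more than one submission day.

```python
from collections import defaultdict
from datetime import datetime


def find_cross_day_resubmissions(pings):
    """Return {(client_id, document_id, seq): sorted submission dates}
    for every tuple that was submitted on more than one calendar day.

    `pings` is a hypothetical list of dicts with ISO 8601 timestamps.
    """
    days_by_key = defaultdict(set)
    for ping in pings:
        key = (ping["client_id"], ping["document_id"], ping["seq"])
        day = datetime.fromisoformat(ping["submission_timestamp"]).date()
        days_by_key[key].add(day)
    # Same-day repeats collapse into one date and are not reported here.
    return {key: sorted(days) for key, days in days_by_key.items() if len(days) > 1}
```

Same-day repeats are deliberately ignored, mirroring the expectation that those are handled elsewhere.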

Query: https://sql.telemetry.mozilla.org/queries/96015/source?p_app=org_mozilla_fenix&p_channel=nightly&p_days=90&p_ping=baseline#237128

The overall number of clients for which this happens is low for every app (like <1%).
Some of those clients re-submit a ping pretty much every day (like 42 pings in a 60 day window, or 30 pings in a 30 day window from today backwards).

Given the low number of clients this is not very concerning, though it can affect analysis if you inspect specific client data.

For now this bug acts merely as documentation for future-us.

I spoke a bit too soon.
For Fenix release we see about 1.4% of all clients week-by-week send us duplicated sequence numbers.

Query: https://sql.telemetry.mozilla.org/queries/96015/source?p_app=org_mozilla_firefox&p_channel=release&p_days=90&p_ping=events#237128

For Desktop release, on a 1% sample of the data, we get ~0.4% (all of release takes too long to query).
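
For reference, one possible reading of "share of clients per week with duplicated sequence numbers" can be sketched in Python as below. This is an assumption about the definition, not a transcription of the linked query; the input shape `(client_id, seq, submission_date)` and the Monday week start are illustrative choices.

```python
from collections import defaultdict
from datetime import timedelta


def weekly_dup_client_share(pings):
    """Return {week_start: percent of that week's clients with a repeated seq}.

    `pings` is a hypothetical iterable of (client_id, seq, submission_date)
    tuples, where submission_date is a datetime.date.
    """
    counts = defaultdict(lambda: defaultdict(int))  # week -> (client, seq) -> n
    for client_id, seq, day in pings:
        week_start = day - timedelta(days=day.weekday())  # Monday of that week
        counts[week_start][(client_id, seq)] += 1

    shares = {}
    for week_start, per_key in counts.items():
        clients = {client_id for client_id, _ in per_key}
        dup_clients = {client_id for (client_id, _), n in per_key.items() if n > 1}
        shares[week_start] = 100.0 * len(dup_clients) / len(clients)
    return shares
```

Note that bucketing by week only counts a duplicate when both copies land in the same week, so re-submissions spread across weeks would be missed by this particular definition.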

More precisely, this is ping-specific:

We currently have a sharp increase in WAU on Fenix Nightly (measured on events and baseline pings), thus skewing the numbers.

Ping      Timeframe   Dup %
baseline  2023-09-11  2.3%
baseline  2023-11-13  1.0%
metrics   2023-09-11  0.8%
metrics   2023-11-13  0.74%
events    2023-09-11  1.4%
events    2023-11-13  2.3%

On Fenix release:

Ping      Timeframe   Dup %
baseline  2023-09-11  2.26%
baseline  2023-11-13  2.27%
metrics   2023-09-11  0.89%
metrics   2023-11-13  0.95%
events    2023-09-11  1.44%
events    2023-11-13  1.46%

Firefox iOS release is pretty stable, so I only report numbers from November:

Ping      Dup %
baseline  11.6%
metrics   1.7%
events    10.6%

Those baseline and events numbers are shockingly high.
Could that be because we upload in the background and iOS lets that finish, but doesn't give us enough time to clean out the files?
Can we reproduce this?

(Note: take those numbers as "unverified" until I get someone else to look at my queries!)

See Also: → 1854086

I re-ran the numbers today. There's a bit of a downward trend this year (10% -> 7-9%), but we don't know why, or whether that trend will continue.

I think this is worth some work:

  • Validate the analysis, make sure what I'm looking at is valid. Is my "by week" look valid? Is it hiding anything?
    • When/For how long do these dupes happen? Within days? Consistently the same ping over days/weeks from a specific client?
  • Come up with a potential hypothesis why that happens
    • I phrased one above: Could that be because we upload in the background and iOS lets that finish, but doesn't give us enough time to clean out the files?
    • Any way to reproduce that locally?
  • How do we handle this?
    • Can we collect additional information about what pings we try to upload when? e.g. on request for an upload store UUID + timestamp and send that along with everything else?
    • Can we use this stored information to avoid dupes client-side? Or do we need to apply something server-side to delete dupes within a certain window?
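
As a rough illustration of the server-side option in the last bullet, here is a Python sketch of dropping re-submissions within a fixed window. Everything in it is hypothetical: the function name, the dict-shaped pings with `datetime` timestamps, and the 7-day default window are assumptions, not an existing pipeline step.

```python
from datetime import timedelta


def dedupe_within_window(pings, window=timedelta(days=7)):
    """Drop pings whose (client_id, document_id, seq) was already kept
    less than or exactly `window` ago; keep everything else.

    `pings` is a hypothetical list of dicts whose "submission_timestamp"
    is a datetime object.
    """
    last_kept = {}  # key -> timestamp of the most recent kept copy
    kept = []
    for ping in sorted(pings, key=lambda p: p["submission_timestamp"]):
        key = (ping["client_id"], ping["document_id"], ping["seq"])
        ts = ping["submission_timestamp"]
        previous = last_kept.get(key)
        if previous is not None and ts - previous <= window:
            continue  # re-submission inside the window: drop it
        last_kept[key] = ts
        kept.append(ping)
    return kept
```

A copy that arrives after the window has passed is kept by this sketch, so the window size decides how many of the long-running re-submitters would still slip through.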

More questions than answers. Fixing this will require some Swift experience (that's where the uploader is implemented).

Assignee: jrediger → nobody
Priority: P1 → P3

Just adding this to the investigation: I looked at this another way, counting dupes by client_id + sequence number over the last 90 days, and ended up seeing about 15% of clients send us dupes, but this amounts to less than 1% of all pings sent.

ref: https://sql.telemetry.mozilla.org/queries/90512/source
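
To show how a large share of clients can still mean a small share of pings, here is a minimal Python sketch of the two ratios. It is not the referenced query; the `(client_id, seq)` tuple input and the function name `dup_shares` are assumptions for illustration.

```python
from collections import Counter


def dup_shares(pings):
    """Return (percent of clients with any duplicated seq,
    percent of pings that are extra copies).

    `pings` is a hypothetical list of (client_id, seq) tuples.
    """
    counts = Counter(pings)
    if not counts:
        return (0.0, 0.0)
    clients = {client_id for client_id, _ in counts}
    dup_clients = {client_id for (client_id, _), n in counts.items() if n > 1}
    total_pings = sum(counts.values())
    extra_copies = sum(n - 1 for n in counts.values())  # copies beyond the first
    return (100.0 * len(dup_clients) / len(clients),
            100.0 * extra_copies / total_pings)
```

With made-up data of 20 clients sending 100 pings each, where 3 clients each re-send a single ping, the client share is 15% while the extra copies are only 3 out of 2003 pings.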

ni? as a reminder

Flags: needinfo?(pmcmanis)
Assignee: nobody → pmcmanis
Flags: needinfo?(pmcmanis)
Assignee: pmcmanis → nobody