Closed Bug 1586810 Opened 5 years ago Closed 5 years ago

Glean submits pings with duplicate document ids

Categories

(Data Platform and Tools :: Glean: SDK, defect, P3)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdroettboom, Assigned: mdroettboom)

References

Details

(Whiteboard: [telemetry:glean-rs:m8])

Attachments

(2 files)

We should get to the bottom of why these duplicates are occurring.

Assignee: nobody → mdroettboom

I'm struggling to find any pattern in the pings that have duplicate document ids. They occur across all Android SDK versions, 105 different device models, all versions of Fenix and a-c that have run over the past day, and 32 different locales. There also seems to be no relationship to ping duration or to how old the profile is.

The GCP pipeline doesn't retain the rejected pings, so we can't inspect what they might contain. However, from the submission URL we can get the document id that was rejected, and then cross-reference it with the accepted pings in the database to inspect the pings' contents. This relies on the assumption that the contents of the accepted and rejected pings are the same, which is a large and unconfirmed assumption.

Using this technique, there is never a case where a duplicate document id came from different client ids (the client id is retained in the error database), so we can at least rule out UUID collisions across different devices.
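
To make the cross-referencing concrete, here is a minimal Kotlin sketch (hypothetical record shape and function, not the actual query used) that groups the observed pings by document id and flags any id seen under more than one client id:

    // Hypothetical record shape: one row per observed ping (accepted or rejected duplicate).
    data class PingRecord(val documentId: String, val clientId: String)

    // Document ids seen with more than one client id would indicate a UUID
    // collision across devices; for the pings inspected here, this comes back empty.
    fun crossDeviceCollisions(records: List<PingRecord>): List<String> =
        records.groupBy { it.documentId }
            .filterValues { rows -> rows.map { it.clientId }.toSet().size > 1 }
            .keys
            .toList()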

There are some interesting patterns I've found by looking at the difference in submission_timestamp between the accepted ping and the corresponding rejected pings. (See plot below).

(1) In the last day, there were 3 rejected pings whose submission_timestamp was earlier than that of the corresponding accepted ping, but only by at most 36 seconds. I assume this is because the submission_timestamp is generated early in the pipeline and the duplicate document id rejection happens later, with a race condition in between.

(2) There is a fairly normal-looking decay curve on the left of the graph where the delays are short; many of the duplicates are sent within the first five minutes.

(3) There is a peculiar spike between 5h24m and 5h27m that seems much too large to be random. What is meaningful about the ~5h30m mark?

EDIT: All of the pings in (3) came from a single client, and all from a single incident in which a batch of pings was sent within a one-minute period, so it's probably fair to say this is an anomaly related to something that happened on that particular device rather than anything systemic.
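
For reference, a minimal Kotlin sketch of how these deltas can be computed (hypothetical record shapes, not the actual analysis code): join each rejected duplicate to the accepted ping with the same document id and subtract the submission timestamps; negative durations correspond to observation (1).

    import java.time.Duration
    import java.time.Instant

    // Hypothetical shapes: accepted pings plus the rejected duplicates, both carrying
    // the pipeline-assigned submission_timestamp.
    data class AcceptedPing(val documentId: String, val submissionTimestamp: Instant)
    data class RejectedPing(val documentId: String, val submissionTimestamp: Instant)

    // For each rejected duplicate, compute (rejected - accepted) submission_timestamp.
    fun duplicateDelays(accepted: List<AcceptedPing>, rejected: List<RejectedPing>): List<Duration> {
        val acceptedByDocId = accepted.associateBy { it.documentId }
        return rejected.mapNotNull { dup ->
            acceptedByDocId[dup.documentId]?.let { orig ->
                Duration.between(orig.submissionTimestamp, dup.submissionTimestamp)
            }
        }
    }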

It is possible that WorkManager is part of the equation here: a race condition that allowed multiple uploaders to run at the same time would require WorkManager not behaving as advertised with respect to "unique work". We should track the deployment of https://github.com/mozilla-mobile/android-components/pull/4615 (which upgrades to WorkManager 2.2) and see whether the rate of duplicates changes.
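
For context, "unique work" is the WorkManager mechanism that is supposed to prevent two uploaders from being scheduled at the same time. A minimal Kotlin sketch of that pattern (the worker class and work name here are hypothetical, not Glean's actual implementation):

    import android.content.Context
    import androidx.work.ExistingWorkPolicy
    import androidx.work.OneTimeWorkRequestBuilder
    import androidx.work.WorkManager
    import androidx.work.Worker
    import androidx.work.WorkerParameters

    class PingUploadWorker(context: Context, params: WorkerParameters) : Worker(context, params) {
        override fun doWork(): Result {
            // Drain the queue of pending pings and upload them here.
            return Result.success()
        }
    }

    fun enqueuePingUpload(context: Context) {
        val request = OneTimeWorkRequestBuilder<PingUploadWorker>().build()
        WorkManager.getInstance(context).enqueueUniqueWork(
            "glean_ping_upload",      // hypothetical unique work name
            ExistingWorkPolicy.KEEP,  // keep the pending work; do not enqueue a second uploader
            request
        )
    }

With ExistingWorkPolicy.KEEP, enqueueing while work under the same name is still pending should be a no-op, so only one uploader should ever run at a time. If WorkManager does not honor that uniqueness, two uploaders could race and send the same queued ping, which is the scenario the WorkManager 2.2 upgrade might change.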

One important caveat to note here is that duplicate pings seen in the GCP pipeline do not necessarily represent a duplicate ping being received from a client. The AWS to GCP infrastructure uses HTTPS and implements retries the same way a client is expected to, so that a duplicate in GCP can mean either or both of:

  1. The AWS tee received a bad status code when sending the ping to GCP and resent it
  2. The client sent the ping multiple times to AWS (presumably for the same reason: a bad response, or none at all, from AWS)

Therefore we expect a strictly greater duplication rate in GCP compared to AWS. In our validation exercises we measured this "additional duplicate" rate to be less than 0.01%, but it is worth considering.
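
To illustrate why retries produce duplicates at all, here is a hedged Kotlin sketch (not the actual client or tee code) of a retry loop in which the document id is baked into the submission URL, so every attempt reuses it. If the response is lost after the server has already stored the ping, the retry shows up downstream as a duplicate document id:

    import java.io.IOException
    import java.net.HttpURLConnection
    import java.net.URL

    // Hypothetical retry loop; the document id is part of submissionUrl,
    // so all attempts share it.
    fun submitWithRetries(submissionUrl: String, body: ByteArray, maxAttempts: Int = 3): Boolean {
        repeat(maxAttempts) {
            val conn = URL(submissionUrl).openConnection() as HttpURLConnection
            try {
                conn.requestMethod = "POST"
                conn.doOutput = true
                conn.outputStream.use { it.write(body) }
                if (conn.responseCode in 200..299) return true
                // Non-2xx: the server may nevertheless have persisted the ping,
                // so retrying can create a duplicate.
            } catch (e: IOException) {
                // Timeouts and dropped connections are retried the same way.
            } finally {
                conn.disconnect()
            }
        }
        return false
    }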

> The GCP pipeline doesn't retain the rejected pings, so we can't inspect what they might contain.

We do retain the raw pings for some amount of time in a separate table (payload_bytes_raw), to which access is somewhat limited; it can be used for this kind of debugging.

:whd Thanks for the information. This is indeed helpful!

As of today, the duplicate document id rate for Fenix nightly is 0.8%.

We haven't found any client-side cause, and the error rate is close to what's expected in the pipeline. Closing as fixed.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
