Closed Bug 1249308 Opened 8 years ago Closed 8 years ago

Investigate duplicated core pings (same document id)

Categories

(Toolkit :: Telemetry, defect)

All
Android
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox47 --- affected

People

(Reporter: Dexter, Unassigned)

References

Details

The preliminary analysis in bug 1247605 revealed that there are duplicated core pings coming from Firfox for Android.

The duplicated pings have the same content, but different metadata:

- The "hostname" field refers to different IP addresses
- The "Timestamp" is different
- The "geoCity" fields refer to different city names.

A possible reason we could be seeing this is because pings are being sent during  a cellular network handoff.
Blocks: ut-android
I can think of three possible scenarios here:
(1) the client retries sending for some reason somehow (it shouldn't yet, persistence & retry is coming later)
(2) some artifact of the HTTP POST implementation the client uses, which retries for some types of connection closures
(3) the server somehow duplicates in rare circumstances

Any other scenarios?
(In reply to Georg Fritzsche [:gfritzsche] from comment #2)
> I can think of three possible scenarios here:
> (1) the client retries sending for some reason somehow (it shouldn't yet,
> persistence & retry is coming later)
> (2) some artifact of the HTTP POST implementation the client uses, which
> retries for some types of connection closures

Michael, is this possible?
Flags: needinfo?(michael.l.comella)
Alessio, in which part of the ping is the metadata stored (e.g. the HTTP request headers)? fwiw, the example in comment 1 seems to have expired - maybe another example would answer my question.

Also, Alessio, are the docIDs also the same between the duplicate uploads?

(In reply to Georg Fritzsche [:gfritzsche] from comment #2)
> (1) the client retries sending for some reason somehow (it shouldn't yet,
> persistence & retry is coming later)

If the docIDs are the same, then we're uploading twice on the same Intent which seems unlikely to be an issue on the client since we don't permit Intent redelivery [0]. This is more likely to be case #2 (an artifact of the HTTP libraries), case #3, or a network error (comment 0).

However, if the docIDs are different, I see a very unlikely possibility: with the current implementation, there is a small chance that we get killed after starting the Service but before we can store the seq count to disk since we call the store afterwards and it is an async operation [1]. We can move that before we start the service and make it synchronous with the possibility that we may skip sequence values – I filed bug 1250250.

> (2) some artifact of the HTTP POST implementation the client uses, which
> retries for some types of connection closures

I don't know off the top of my head but maybe rnewman does given he's used BaseResource & httpclientandroidlib more than I have.

Richard?

> Any other scenarios?

The scenario mentioned in comment 0 seems feasible.

[0]: https://mxr.mozilla.org/mozilla-central/source/mobile/android/base/java/org/mozilla/gecko/telemetry/TelemetryUploadService.java#48
[1]: https://mxr.mozilla.org/mozilla-central/source/mobile/android/base/java/org/mozilla/gecko/BrowserApp.java#3991
Flags: needinfo?(michael.l.comella)
(In reply to Michael Comella (:mcomella) from comment #4)
> Alessio, in which part of the ping is the metadata stored (e.g. the HTTP
> request headers)? fwiw, the example in comment 1 seems to have expired -
> maybe another example would answer my question.

Here's the refreshed example from comment 1: https://pastebin.mozilla.org/8860748

I think the metadata is extracted both from the submission URL (app name, update channel, app version, etc.) and the HTTP request headers (hostname, ...).

> Also, Alessio, are the docIDs also the same between the duplicate uploads?

Yes, they are.
Flags: needinfo?(rnewman)
Looking at comment 5, the timestamps indicate that the pings were sent/received three seconds apart (assuming the units are nanoseconds since the epoch) and the 'geoCity' tags represent cities that are not geographically close together – this removes the possibility that the user traveled between the two locations between the transmission of the two pings. This does not account for clock skew.
Copied over from Bug 1247605:

---
0.11% dupe rate, if I read the gist right.

Do these documents have the same client ID + document ID?

Do these documents also have the same sequence ID?

If those three things are the same, then we should simply drop the ping on the floor.

If they're not -- same client ID, same sequence, different document ID -- then that might be explained by device backup/restore/clone, emulators, automated test systems.

If only the document ID is the same, and the client ID is different, then there's a surprising lack of randomness in UUID generation.

Can you give more details?
---
Flags: needinfo?(rnewman) → needinfo?(alessio.placitelli)
(In reply to Richard Newman [:rnewman] from comment #7)
> Copied over from Bug 1247605:
> 
> ---
> 0.11% dupe rate, if I read the gist right.

We have 488 duplicated document ids over 4154305 core pings, so 0.00011%. I'll make that clearer in the next iteration of the gist.

> Do these documents have the same client ID + document ID?
> 
> Do these documents also have the same sequence ID?
> 
> If those three things are the same, then we should simply drop the ping on
> the floor.

On 488 pings with duplicated document ids, 466 have the same client ID, document ID and sequence number (see [1], steps 104-110)
 
> If they're not -- same client ID, same sequence, different document ID --
> then that might be explained by device backup/restore/clone, emulators,
> automated test systems.
> 
> If only the document ID is the same, and the client ID is different, then
> there's a surprising lack of randomness in UUID generation.
> 
> Can you give more details?

Sure!

I took some time today to further analyse the duplicated pings. It looks like 22 pings with duplicated document ids (over 488) have the same document ID, the same client ID but different metadata (hostname and geocity) and sequence number. I've redacted a sample and posted it here: https://pastebin.mozilla.org/8861108
Flags: needinfo?(alessio.placitelli)
(In reply to Alessio Placitelli [:Dexter] from comment #8)

> We have 488 duplicated document ids over 4154305 core pings, so 0.00011%.
> I'll make that clearer in the next iteration of the gist.

Wow, that's great.


> On 488 pings with duplicated document ids, 466 have the same client ID,
> document ID and sequence number (see [1], steps 104-110)

That implies redelivery, and I'm happy for us to directly de-dupe those.


> I took some time today to further analyse the duplicated pings. It looks
> like 22 pings with duplicated document ids (over 488) have the same document
> ID, the same client ID but different metadata (hostname and geocity) and
> sequence number. I've redacted a sample and posted it here:
> https://pastebin.mozilla.org/8861108

Those three timestamps are:

Seq 2:  Wed Feb 10 20:20:20 PST 2016   Hollywood
Seq 67: Thu Feb 11 11:52:26 PST 2016   Silver Spring
Seq 76: Thu Feb 11 16:26:45 PST 2016   Hollywood

If I had to guess, I'd say this is an AT&T Samsung Galaxy S4 user in Hollywood, CA, with a VPN to their employer in Maryland, and we're somehow reusing a document identifier. These don't look like redelivered pings.

If the geoip is wrong, and it meant "Silver Lake" instead of "Silver Spring", then this looks like completely normal user behavior (Silver Lake is right by Hollywood, and the user is hopping between cell towers), except for the unchanged document ID over ~20 hours.

That's something that Michael should look into.
Flags: needinfo?(michael.l.comella)
fwiw, during my discussions with Richard, I thought Java was generating the UUID but it wasn't yet because bug 1244295 wasn't uplifted to beta yet, where we're seeing these issues.
Clear NI for now – duplicate IDs are not a critical volume and, while we should fix them if we can, we can alternatively verify they stay low. This could also change after we reupload failed documents (bug 1243585).
Flags: needinfo?(michael.l.comella)
Blocks: core-ping
No longer blocks: ut-android
We're going to validate this against release and since we expect some number of duplicated pings, if the numbers don't look too bad, we'll close this.

NI Georg since we're waiting on release data validation.
Flags: needinfo?(gfritzsche)
We didn't see unacceptable duplicate numbers and can filter in analysis, so i'm closing this.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(gfritzsche)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.