Closed Bug 1268513 Opened 4 years ago Closed 4 years ago

Validate the core ping reuploader implementation

Categories

(Firefox for Android :: General, defect, P1)

All
Android
defect
Points:
2

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: mcomella, Assigned: gfritzsche)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [measurement:client])

Georg, this landed yesterday and should be in Nightly – could you verify this when you have a chance?

I tried to do this in redash but afaict it doesn't let you filter on build ID so I can't.
Flags: needinfo?(gfritzsche)
(In reply to Michael Comella (:mcomella) from comment #1)
> I tried to do this in redash but afaict it doesn't let you filter on build
> ID so I can't.

That is a good point - Mark, the build id is available in the meta data, could it be added to the presto export?
Flags: needinfo?(gfritzsche) → needinfo?(mark.finkle)
(In reply to Michael Comella (:mcomella) from comment #1)
> Georg, this landed yesterday and should be in Nightly – could you verify
> this when you have a chance?

What would we look for, what kind of failures could we expect?
I can think about confirming:
* that the per-client ping volume stays roughly stable
* ordering and ping sequences are not crazy
* ping sequence gaps should be basically gone after landing this
  (but in what scenarios would we expect ping gaps/loss? are there e.g. limits on local ping storage after which we drop pings?)

What is the upload order? Oldest first?
Would we expect out-of-order submissions? (E.g. from: send ping on app launch, later upload (old, saved) pings.)
Flags: needinfo?(michael.l.comella)
(In reply to Georg Fritzsche [:gfritzsche] from comment #3)
> What would we look for, what kind of failures could we expect?
> I can think about confirming:
> * that the per-client ping volume stays roughly stable
> * ordering and ping sequences are not crazy

Sounds good.

> * ping sequence gaps should be basically gone after landing this
>   (but in what scenarios would we expect ping gaps/loss? are there e.g.
> limits on local ping storage after which we drop pings?)

"basically gone" but I added documentation in bug 1269926 for where we can experience data loss.

> What is the upload order? Oldest first?

No guarantees – I currently use the order from File.listFiles, which I'd think is alphabetic by file name.

> Would we expect out-of-order submissions? (E.g. from: send ping on app
> launch, later upload (old, saved) pings.)

Since there are no guarantees above, sounds feasible.

These validations seem reasonable to me.

fyi, I also added bug 1271391 to validate some new fields (e.g. ping creation date) so perhaps we'd like to batch that here.
Flags: needinfo?(michael.l.comella)
FYI: bug 1270213 landed and it restructures how we name pings on disk. This will have the side effect of losing the data of any stored pings that hadn't had the chance to get uploaded yet. Assuming most clients uploaded when they stored, however, I don't expect to lose a lot of data.
(In reply to Georg Fritzsche [:gfritzsche] from comment #2)
> (In reply to Michael Comella (:mcomella) from comment #1)
> > I tried to do this in redash but afaict it doesn't let you filter on build
> > ID so I can't.
> 
> That is a good point - Mark, the build id is available in the meta data,
> could it be added to the presto export?

Done and backfilled:
https://github.com/mfinkle/user-data-analytics/commit/a7f9c1ffcd7cca067f85060178dec6bccc20f409
Flags: needinfo?(mark.finkle)
Alessio, could you validate this? Discussion on what to validate in comment 3 & comment 4. Note that it could be good to also take this opportunity to verify bug 1271391 (and note that bug 1273688 will be ready for validation relatively soon).
Flags: needinfo?(alessio.placitelli)
Georg, should you take this one?
Flags: needinfo?(alessio.placitelli) → needinfo?(gfritzsche)
Yes, i'll look into the validations.
Assignee: nobody → gfritzsche
Flags: needinfo?(gfritzsche)
Priority: -- → P1
Whiteboard: [measurement:client]
I'm running comparisons on this before and after the uploader landing:
https://gist.github.com/georgf/2d8cfccefbec1b89b54dd03f44d3ad2a

Overall this is looking like good improvements, but [23] is puzzling me a bit - missing pings being ~31% the number of the submitted pings before uploader landing and 10% after.
This seems way too high and i don't quite trust this data yet, i have to re-think this on monday.
Would bug 1270963 affect missing ping numbers?
Is this fixed in buildids >=20160509?
Flags: needinfo?(michael.l.comella)
(In reply to Georg Fritzsche [:gfritzsche] from comment #11)
> Would bug 1270963 affect missing ping numbers?
> Is this fixed in buildids >=20160509?

I believe so.

Things that I know of, outside of the expected robustness issues (as seen in the docs), that could drop pings (in decreasing order of probability):
 * bug 1270213 (landed 5/13) - we dropped any pings that were stored on disk to revise the on-disk naming scheme.
 * bug 1270963 (landed 5/9) - crashes when the search engine was null. Unclear when this might happen but I only see 26 crashes on crash-stats [1].
 * bug 1272817 (landed 5/16) - some (I think a very small number) clients tried to read empty files, which crashes
 * bug 1272431 (landed 5/19) - when we get the pings to upload, we get them in an arbitrary order (rather than ascending by modified date, which we do now). In theory, if we exceed our storage quota, these files can get deleted without ever having a chance to be uploaded.

[1]: https://crash-stats.mozilla.com/report/list?product=FennecAndroid&range_unit=days&range_value=28&signature=java.lang.NullPointerException%3A+at+org.mozilla.gecko.telemetry.UploadTelemetryCorePingCallback%241.run%28UploadTelemetryCorePingCallback.java%29
Flags: needinfo?(michael.l.comella)
Points: --- → 2
So, looking at:
* the latest nightly uploader validation: https://gist.github.com/georgf/2d8cfccefbec1b89b54dd03f44d3ad2a
* a cross-check to beta data: https://gist.github.com/georgf/0b33672f9492f07690827c6f2f196b47

From that i see:
* Clients that have gaps:
  * ~35% on beta
  * ~20% on nightly before the changes
  * ~3% after the changes
* Gaps compared to overall ping counts:
  * ~25% on beta
  * ~34% on nightly before the changes
  * ~11% on nightly after the changes

Possible explanations for the current status:
* some bad clients on Nightly lead to real high gap counts
* the client is skipping sequence numbers
* the client is dropping pings
* my analysis is overcounting gaps
(In reply to Georg Fritzsche [:gfritzsche] from comment #13)
...
>   * ~11% on nightly after the changes

Another explanation:
Currently Fennec Nightly is sending Telemetry from automation (with not so normal behavior), which might explain those numbers.
(In reply to Georg Fritzsche [:gfritzsche] from comment #14)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #13)
> ...
> >   * ~11% on nightly after the changes
> 
> Another explanation:
> Currently Fennec Nightly is sending Telemetry from automation (with not so
> normal behavior), which might explain those numbers.

This will be fixed in bug 1270191.
Overall, this improved the numbers we see.
I'm moving digging more into the high missing ping counts we still see to bug 1275929.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME
See Also: → 1275929
You need to log in before you can comment on or make changes to this bug.