Closed Bug 1272395 Opened 8 years ago Closed 8 years ago

Test Pilot UT Pings under reporting compared to other sources

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rweiss, Assigned: mreid)

The Activity Stream active users count numbers in the thousands: https://sql.telemetry.mozilla.org/dashboard/activity-stream-current-active-users

This is roughly the same as what is tracked by GA for downloads.

However, the UT TxP pings show a significantly lower number of active users (whether measured as MAU or DAU). This script (https://gist.github.com/rjweiss/1193b079c3bfaa7038c41ca4c2ceadff) suggests only a few hundred users.

It appears there is underreporting, either in the client or in the pipeline.
Flags: needinfo?(wclouser)
Flags: needinfo?(mreid)
Points: --- → 1
Priority: -- → P1
Assignee: nobody → mreid
This is also being tracked at https://github.com/mozilla/testpilot/issues/815. The add-on has a function that wakes up every 10 minutes, checks whether it has submitted a ping in the last 24 hours, and submits one if it has not: https://github.com/mozilla/testpilot/blob/master/addon/lib/metrics.js#L72
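For anyone following along, here is a minimal sketch of the "wake every 10 minutes, submit if nothing went out in the last 24 hours" logic described above. It is written in Python purely for illustration and is not the add-on's actual code; the names `should_submit` and `simulate` are hypothetical.

```python
from datetime import datetime, timedelta

CHECK_INTERVAL = timedelta(minutes=10)  # how often the check wakes up
SUBMIT_INTERVAL = timedelta(hours=24)   # maximum allowed age of the last submission


def should_submit(last_submitted, now):
    """Return True if no ping was submitted within the last 24 hours."""
    if last_submitted is None:  # nothing has ever been submitted
        return True
    return now - last_submitted >= SUBMIT_INTERVAL


def simulate(start, checks, last_submitted=None):
    """Run `checks` wake-ups starting at `start` and return the submission times."""
    submissions = []
    now = start
    for _ in range(checks):
        if should_submit(last_submitted, now):
            last_submitted = now
            submissions.append(now)
        now += CHECK_INTERVAL
    return submissions
```

Note that a scheme like this only submits if the timer actually fires while the browser is running, so short sessions could plausibly delay or skip a ping.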
Flags: needinfo?(wclouser)
Per email discussion, here is the kind of analysis I had in mind to shed light on how to improve the latency we're seeing:

- For each testpilot ping, grab testpilot install date, ping creation date, submission date, and clientid
- Find the earliest install date (or creation date) per clientid
- Compute the delta between the install/creation date and the submission date
- Look at the distribution in submission latency for clientids we *did* see.

Do the same for testpilottest pings to see how the latency distribution compares.
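The steps above could be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not the notebook code: the field names `client_id`, `install_date`, and `submission_date` are hypothetical, the latency is measured to the earliest submission seen per client (the list above does not say which submission to use), and the ping creation date is left out for brevity.

```python
from datetime import date
from statistics import median


def submission_latency_days(pings):
    """Map each client_id to the days between its earliest install date
    and its earliest submission date."""
    first_install = {}
    first_submission = {}
    for ping in pings:
        cid = ping["client_id"]
        install = ping["install_date"]
        submitted = ping["submission_date"]
        if cid not in first_install or install < first_install[cid]:
            first_install[cid] = install
        if cid not in first_submission or submitted < first_submission[cid]:
            first_submission[cid] = submitted
    return {
        cid: (first_submission[cid] - first_install[cid]).days
        for cid in first_install
    }


# Made-up sample data, only to show the shape of the input.
sample_pings = [
    {"client_id": "a", "install_date": date(2016, 5, 1), "submission_date": date(2016, 5, 2)},
    {"client_id": "a", "install_date": date(2016, 5, 1), "submission_date": date(2016, 5, 9)},
    {"client_id": "b", "install_date": date(2016, 5, 3), "submission_date": date(2016, 5, 10)},
]
latencies = submission_latency_days(sample_pings)
median_latency = median(latencies.values())
```

Running the same function over testpilot and testpilottest pings separately gives two latency distributions that can be compared directly.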

We can't efficiently filter the entire Telemetry corpus for "has testpilot enabled", but we can efficiently take the union of the two clientid sets above and use the main_summary dataset to see what the latency looks like for main pings from those same clientids (and compare it to the background latency for all main pings).

Further, we should check how many testpilottest clientids were not found in the main pings during the same interval.
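As a sketch of the set bookkeeping involved (plain Python sets rather than the real datasets; the function name `compare_client_sets` and the result keys are hypothetical):

```python
def compare_client_sets(testpilot_ids, testpilottest_ids, main_ids):
    """Return the union of Test Pilot clientids and those absent from main pings."""
    testpilot = set(testpilot_ids)
    testpilottest = set(testpilottest_ids)
    main = set(main_ids)
    union = testpilot | testpilottest
    return {
        # every clientid seen in either kind of Test Pilot ping
        "union": union,
        # testpilottest clients with no main ping in the same interval
        "testpilottest_missing_from_main": testpilottest - main,
        # any Test Pilot client with no main ping in the same interval
        "union_missing_from_main": union - main,
    }
```

The `union` set is what would be used to select rows from main_summary, and the size of `testpilottest_missing_from_main` relative to the testpilottest set is the number of interest.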

Some predictions:

If we find that testpilottest contains many clientids that did not report main pings, or that the latency for testpilottest clientids in the main dataset is significantly higher than the background latency, we are probably running up against the throttling behaviour on the client. Follow-up: how many testpilottest pings are reported per clientid per day? Actions here would be to decrease the number of testpilottest pings or to ease up on client throttling for testpilottest pings.

If we find that significantly more testpilottest clientids are present in the main pings than the testpilot pings (and that the latency is not significantly worse than the background rate), it should be safe to increase the frequency of testpilot submission, and that should improve latency.
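The two predictions can be written down as a small decision rule. This is only a sketch: the bug does not define what "significantly" means, so the thresholds `missing_threshold` and `latency_ratio_threshold` below are arbitrary assumptions, and the function and constant names are hypothetical.

```python
THROTTLING = "client throttling suspected"
UNDER_SUBMISSION = "testpilot under-submission; safe to increase frequency"
INCONCLUSIVE = "inconclusive"


def diagnose(testpilottest_ids, testpilot_ids, main_ids,
             tp_main_latency, background_latency,
             missing_threshold=0.2, latency_ratio_threshold=1.5):
    """Apply the two predictions using assumed (not source-given) thresholds."""
    tpt = set(testpilottest_ids)
    if not tpt:
        raise ValueError("need at least one testpilottest clientid")
    # Fraction of testpilottest clients never seen in main / testpilot pings.
    missing_from_main = len(tpt - set(main_ids)) / len(tpt)
    missing_from_testpilot = len(tpt - set(testpilot_ids)) / len(tpt)
    latency_worse = tp_main_latency > latency_ratio_threshold * background_latency

    # Prediction 1: main pings are also affected, so throttling is likely.
    if missing_from_main > missing_threshold or latency_worse:
        return THROTTLING
    # Prediction 2: main pings look healthy but testpilot pings are missing.
    if missing_from_testpilot > missing_threshold:
        return UNDER_SUBMISSION
    return INCONCLUSIVE
```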
Flags: needinfo?(mreid)
I ran the above analysis; here is the notebook with the code and results.
https://gist.github.com/mreid-moz/e007487a0b03f2ee40ad3ccd6b21f44a

From cells 25 and 26, it looks like the lower submission rate for testpilot pings is not reflected in the main telemetry pings for the same set of clientids, so I'm fairly confident we're not hitting client throttling behaviour. We should be safe to fix / increase frequency of the testpilot submissions without having a negative impact on other data reporting.

A client-side fix went in for https://github.com/mozilla/testpilot/issues/815, so I'll re-run the notebook in a few days and see if the symptoms improve.
I've updated the notebook; you can compare the differences with the last run on the "revisions" pane of the gist in comment 3.

The summary is that the number of unique clientids sending 'testpilot' pings has increased dramatically.  Looks like things are on the right track!

The interesting changes are in cells #5, 6, 26, and 27.

The "latency since install" graph (cell 29) also changed significantly, presumably due to previously-unreported clients sending in data after the above client fix.
I've checked again and the number of clientids reporting testpilot pings is fairly stable at the new and improved rate. It would appear that the client-side fix did the trick!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Thanks for the help, everyone.
Product: Cloud Services → Cloud Services Graveyard