Closed Bug 1463410 Opened 6 years ago Closed 6 years ago

Validate "event" ping data for the Nightly channel

Categories

(Toolkit :: Telemetry, enhancement, P1)

RESOLVED FIXED
Tracking Status
firefox62 --- affected

People

(Reporter: chutten, Assigned: chutten)

Similar to Alessio's analyses in bug 1351396 and bug 1351402 we will need to do a broad-spectrum validity analysis on the "event" ping on Nightly after it lands.

Things to check (a non-exhaustive list):
Ping counts
Ping frequency
Ping size
session id link
subsession id link
Do the "main" ping events match the "event" ping events (allowing for them being sent at different times and in different subsessions)?
Is the meta-telemetry (histogram TELEMETRY_EVENT_PING_SENT) sent, and does it correspond correctly to subsessions containing events? Do its counts match the counts of received "event" pings?

(( Note that this analysis is in addition to the preliminary realtime plugin that ensures we're sending pings at all: https://docs.telemetry.mozilla.org/cookbooks/new_ping.html ))
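
The last check in the list (do the TELEMETRY_EVENT_PING_SENT counts match the received "event" pings?) could be sketched roughly as below. This is only an illustration: the flattened ping dictionaries and the field names `client_id` and `event_ping_sent` are assumptions made for the example, not the real ping schema, and the histogram is reduced to a single count per "main" ping.

```python
from collections import Counter


def compare_sent_vs_received(main_pings, event_pings):
    """Return {client_id: (sent, received)} for clients whose counts differ.

    Assumed (hypothetical) shapes:
      main_pings:  dicts with "client_id" and an integer "event_ping_sent"
                   (the TELEMETRY_EVENT_PING_SENT total for that ping).
      event_pings: dicts with "client_id", one dict per received "event" ping.
    """
    sent = Counter()
    for ping in main_pings:
        sent[ping["client_id"]] += ping.get("event_ping_sent", 0)

    received = Counter(ping["client_id"] for ping in event_pings)

    # Counter returns 0 for missing keys, so clients seen on only one
    # side are reported as mismatches too.
    clients = set(sent) | set(received)
    return {c: (sent[c], received[c]) for c in clients if sent[c] != received[c]}
```

An empty result would mean the reported and received counts agree for every sampled client; in practice some slack is expected because the two ping types are sent at different times.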
(In reply to Chris H-C :chutten from comment #0)
> Similar to Alessio's analyses in bug 1351396 and bug 1351402 we will need to
> do a broad-spectrum validity analysis on the "event" ping on Nightly after
> it lands.
> 
> Things to check (a non-exhaustive list):
> Ping counts
> Ping frequency

I think this part should also include:

- Checking that the gap between the creation dates of consecutive pings is > 10 minutes
- Checking how many pings hit the 1000 events limit
- Checking that the ping reason matches the content of the ping/expected behaviour; for example, check that shutdown event pings are sent at the right time (e.g. close to a main ping with reason shutdown?)
- Does the "lost event count" make sense for event pings with reason max and shutdown?
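
As a rough sketch of the first two checks above (the creation-date gap and the 1000-event limit), assuming each ping has been flattened to a dictionary. The field names `id`, `session_id`, `created` and `event_count` are hypothetical and chosen only for this example.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

MIN_GAP = timedelta(minutes=10)  # minimum expected gap between pings
EVENT_LIMIT = 1000               # event limit mentioned in this bug


def pings_sent_too_soon(pings):
    """Return (earlier_id, later_id) pairs of consecutive pings from the
    same session whose creation dates are less than MIN_GAP apart."""
    ordered = sorted(pings, key=itemgetter("session_id", "created"))
    too_soon = []
    for _, group in groupby(ordered, key=itemgetter("session_id")):
        group = list(group)
        for prev, cur in zip(group, group[1:]):
            if cur["created"] - prev["created"] < MIN_GAP:
                too_soon.append((prev["id"], cur["id"]))
    return too_soon


def pings_at_event_limit(pings):
    """Return the ids of pings whose (assumed) event_count reached the limit."""
    return [p["id"] for p in pings if p["event_count"] >= EVENT_LIMIT]
```

Dividing `len(pings_sent_too_soon(pings))` by the number of sampled pings would give the share of pings that violate the 10-minute spacing.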
Priority: -- → P1
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Priority: P1 → P2
Pings are coming in: https://pipeline-cep.prod.mozaws.net/dashboard_output/graphs/analysis.moz_telemetry_doctype_monitor_event.volume.html

I should have enough data to perform validation analysis soon.
Priority: P2 → P1
Here's my validation notebook in Databricks: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/19038/command/19453

ni?Alessio for review.

The conclusions in text for transparency (since there's no path from Databricks to publishing that I can find):

Of the sampled 'event' pings:
* All reasons were valid, and every ping had a reason
* ~0.1% were created within 10 minutes of each other in the same session. (slightly worrisome that this is happening at all, but it's low enough to ignore)
* No pings have reported lost events yet (which is good because there haven't been consecutive 'max'-reason pings yet)
* We are receiving no more than 13 pings per client per day (and usually just 1 or 2)
* There are never more than 1000 events per process in these pings (usually just 1-10)
* Only one 'event' ping (0.002%) had events claiming to have happened outside of the 1-hour interval
* ~99.9% of 'main' pings linked to 'event' pings by subsessionId had TELEMETRY_EVENT_PING_SENT set
* When linking by sessionId, this figure drops to ~69%.

From these points I conclude that 'event' pings are working as designed in Nightly.
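
For readers without access to the notebook, the "pings per client per day" figure could be reproduced along these lines. This is a minimal sketch and not the notebook's code; the field names `client_id` and `submission_date` are assumptions for the example.

```python
from collections import Counter


def pings_per_client_day(pings):
    """Count pings for each (client_id, submission_date) pair.

    Returns (max_count, distribution), where distribution maps
    "pings in one client-day" to "number of client-days with that count".
    """
    per_day = Counter((p["client_id"], p["submission_date"]) for p in pings)
    distribution = Counter(per_day.values())
    return max(per_day.values(), default=0), distribution
```

The first returned value corresponds to the "no more than 13" claim, and the distribution shows how common the usual 1 or 2 pings per day are.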
Flags: needinfo?(alessio.placitelli)
(In reply to Chris H-C :chutten from comment #3)
> Here's my validation notebook in Databricks:
> https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/19038/command/19453
> 
> ni?Alessio for review.

The analysis looks solid to me. I left a few nits as comments there, feel free to address them if you have time/want. They do not compromise the analysis itself, so no rush.

> The conclusions in text for transparency (since there's no path from
> Databricks to publishing that I can find):
> [...]
> * ~99.9% of 'main' pings linked to 'event' pings by subsessionId had
> TELEMETRY_EVENT_PING_SENT set
> * When linking by sessionId, this figure drops to ~69%.

Why do you think it drops to 69%? Any hypothesis?

> From these points I conclude that 'event' pings are working as designed in
> Nightly.
Flags: needinfo?(alessio.placitelli) → needinfo?(chutten)
I presume the 69% is because there can be multiple subsessions per session, and only one of the subsessions might have sent an "event" ping in it.

Think of it this way: you have a long-running Nightly on your machine and open and close the devtools. With the Dataset queries set up the way they are, we'll get the "event" ping for the devtools events (with a 1/10 chance) and the main pings from that entire long-running session. So... one "event" ping, let's say 4 "main" pings, and only one of the "main" pings has TELEMETRY_EVENT_PING_SENT in it.

If we were to match the "event" ping to the "main" ping with the same subsessionId it is ~guaranteed to be the one with TELEMETRY_EVENT_PING_SENT in it. But we also have three other pings from the same session, so it reduces the overall rate.
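
The scenario above can be put in numbers with a small sketch. The field names `session_id`, `subsession_id` and `event_ping_sent` are assumptions for the example, and the resulting rates illustrate the effect only; they are not meant to reproduce the ~69% figure.

```python
def sent_rate(main_pings, event_pings, key):
    """Fraction of "main" pings linked to any "event" ping via `key`
    that have the (assumed) event_ping_sent flag set."""
    linked_keys = {p[key] for p in event_pings}
    linked = [m for m in main_pings if m[key] in linked_keys]
    if not linked:
        return 0.0
    return sum(1 for m in linked if m["event_ping_sent"]) / len(linked)


# One long-running session with four subsessions; only subsession 2
# recorded events and therefore sent an "event" ping.
main_pings = [
    {"session_id": "S", "subsession_id": "S-1", "event_ping_sent": False},
    {"session_id": "S", "subsession_id": "S-2", "event_ping_sent": True},
    {"session_id": "S", "subsession_id": "S-3", "event_ping_sent": False},
    {"session_id": "S", "subsession_id": "S-4", "event_ping_sent": False},
]
event_pings = [{"session_id": "S", "subsession_id": "S-2"}]

by_subsession = sent_rate(main_pings, event_pings, "subsession_id")
by_session = sent_rate(main_pings, event_pings, "session_id")
```

Linking by `subsession_id` selects only the one "main" ping that carries the flag, while linking by `session_id` also pulls in the three other "main" pings from the session, which lowers the rate.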

Make sense?

*takes a look at the notebook for nits*

...there aren't any? Did Databricks eat your comments, or do I have to activate a different mode to see them?
Flags: needinfo?(chutten) → needinfo?(alessio.placitelli)
Ah, there's a "Comments" view. 

Nits addressed, notebook rerun, conclusions unchanged. Marking this FIXED.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(alessio.placitelli)
Resolution: --- → FIXED
Blocks: 1474295