Closed Bug 1134661 Opened 9 years ago Closed 9 years ago

An automated script to compare FHR v2 results and FHR-v4 for a sample of users

Categories

(Firefox Health Report Graveyard :: Data Request, defect)

31 Branch
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1199380

People

(Reporter: benjamin, Unassigned)

References

Details

(Whiteboard: [40b9] [unifiedTelemetry][data-validation])

To validate the new unified (v4) FHR data, we want to compare the matching user records for v2 and v4 for a sample of users. If the records match up, we're in good shape. Any discrepancies will be flagged for manual review by Brendan C. and others.
Based on what we discussed yesterday, here are my recommendations for what we should look at to determine whether or not v2 and v4 data is matching up.

The way I'm thinking about this is that the main objective of this comparison is to make sure that we have captured the same data during the same periods, and to make sure there are not aggregation errors in the new subsession mechanism (missing observations, off-by-one errors, double counts, etc.). But getting the 'same period' part right might be tricky-- I can think of a couple things we'll need to keep an eye on:
1) v2 presumably takes longer to reach the server than v4, so we may have data in the v4 set that we haven't yet received in the v2 record.
2) v2 has appSession.current, which may make things look weird at the end of the v2 series compared to the v4 set
3) v2 sessions can span multiple days. I assume that searches, crashes etc are binned into the correct calendar day, but totalTime and activeTicks will be recorded on the date on which the session started.

Keeping the above in mind, we'll need to correctly filter days and sessions to get comparable data sets. As a first pass, we could try this simple approach: find the interval of days present in both v2 data.days (ignore appSession.current) and any of the v4 pings, and then drop the first and last day of this interval to avoid edge effects. (I think restricting the comparison to this period should avoid most weird boundary effects, but I'm almost certain there will be some that we can't anticipate.)
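
A minimal sketch of that first-pass filter, assuming the v2 days and the v4 ping dates have already been parsed into sets of `datetime.date` objects. The function name and the input shape are hypothetical, not something from the v2 or v4 tooling:

```python
from datetime import date, timedelta

def comparable_days(v2_days, v4_days):
    """Return the days in the shared v2/v4 interval, minus its first and last day."""
    if not v2_days or not v4_days:
        return []
    # The shared interval runs from the later start to the earlier end.
    start = max(min(v2_days), min(v4_days))
    end = min(max(v2_days), max(v4_days))
    # Drop the first and last day of the interval to avoid edge effects.
    start += timedelta(days=1)
    end -= timedelta(days=1)
    n_days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(max(n_days, 0))]
```

If the two records overlap by two days or fewer, this returns an empty list, which seems like the right signal that the client has too little shared history to compare.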

Within this set of days, the following should be *identical*:
* the set of days with any activity
* the number of session starts on each date
* the number of searches (by provider and SAP) on each date
* the number of crashes on each date
* the number of update checks, successes, etc
* any client version changes should fall on the same date
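
As a rough sketch, the per-day checks in this list could share one helper, assuming each metric has already been reduced to a dict mapping day to count for v2 and for v4. The helper name and the dict layout are assumptions for illustration:

```python
def daily_mismatches(v2_counts, v4_counts, days):
    """Return {day: (v2_value, v4_value)} for every day where the counts differ."""
    mismatches = {}
    for day in days:
        # A day missing from one side is treated as a count of zero.
        v2_value = v2_counts.get(day, 0)
        v4_value = v4_counts.get(day, 0)
        if v2_value != v4_value:
            mismatches[day] = (v2_value, v4_value)
    return mismatches
```

An empty result means the metric matched on every day in the window; anything else is a candidate for manual review.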

For each complete reconstructed session starting within this date range (i.e., not the ongoing session in appSession.current or a session for which not all subsessions have been collected), we should also have these match *identically*:
* start up times per session
* totalTime per session
* activeTicks per session

We could also possibly look at places.bookmarks and places.pages-- since the timing of measurements will presumably be different in v2 and v4 (when will these measurements be triggered in v4??) they cannot be expected to match exactly, but they should be "close"-- if we look at the difference between v2 and v4, it should cluster pretty tightly around 0.
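
One way to make "close" concrete, sketched under the assumption that we already have (v2, v4) value pairs for the same client and day. The tolerance is a placeholder to be chosen after looking at the data, and the function name is hypothetical:

```python
from statistics import median

def diff_summary(pairs, tolerance):
    """Summarize v4 - v2 differences for a list of (v2_value, v4_value) pairs."""
    if not pairs:
        raise ValueError("need at least one (v2, v4) pair")
    diffs = [v4_value - v2_value for v2_value, v4_value in pairs]
    within = sum(1 for d in diffs if abs(d) <= tolerance)
    return {
        "median_diff": median(diffs),
        "share_within_tolerance": within / len(diffs),
    }
```

A median near 0 with a high share inside the tolerance would support the "clusters tightly around 0" expectation; a shifted median would suggest a systematic timing or counting difference.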

I guess we should also check that the environments are updating correctly -- that the things in v2 sync.sync, appinfo.appinfo, etc roughly match with environments on the corresponding days. I'm most concerned about the binning of measurements into subsessions, but since the environments mechanism is also new to desktop, it probably deserves some sanity checking too.


Any deviation from the expected identical matches above will need to be investigated by examining the data up close. If it can be explained by an obvious deficiency in the filter, we should update the filter rules and rerun the analysis (posting the new rules publicly in this bug so that we're all on the same page). If we can't trace the discrepancy to a problem with the filter, we still *must* be able to explain the difference, either as a bug in v4 that will be fixed to bring v4 into alignment with v2, or as a concretely identifiable bug in v2. At the time of the go/no-go decision, any remaining discrepancies without a rock solid explanation (and v4 fix if need be) will likely cause the metrics team to vote no-go, so tracking down, explaining, and maybe fixing the causes of any discrepancies will need to be a high priority for the v2 and v4 client teams as well as the server-side teams if we want to hit the target launch date.

(related data quality bug about broken session chains and session trees-- https://bugzilla.mozilla.org/show_bug.cgi?id=1129185 )
Just to verify that I'm understanding the new system correctly: for the purposes of comparing to FHR v2, all submissions with ['payload']['info']['reason']=="idle-daily" should be dropped, right?
A few more questions about the v4 packet:

(0)
Q from comment 2.

(1)
I've been poking around in the v4 data, and I'm not sure where to find most of the metrics mentioned in Comment 1. I've found candidates in some cases, but since the packets we have currently are from Nightly data, there is a ton of extra telemetry stuff that doesn't bear on the FHR comparison, and I don't want to be grabbing the wrong thing from among those many fields. So to avoid any errors, can you let me know the correct paths in the v4 packet to the following:
* the number of searches (by provider and SAP) per session fragment
* the number of update checks, successes, etc
* start up times per session
* totalTime per session
* activeTicks per session


(2)
How would you determine the number of crashes on each date with v4?

(3)
To determine version changes you need to look for changes in the environment between subsessions, right? That is, there's not something that I should be looking for *within* an individual payload that flags a version change, is there?

(4)
Minor question: In ['payload']['info']['sessionStartDate'] and ['payload']['info']['subsessionStartDate'], the dates have the form "2015-02-27T00:00:00". Is there any reason for the "T00:00:00"? I thought we decided that a date stamp was enough?
Flags: needinfo?(benjamin)
Flags: needinfo?(gfritzsche)
(In reply to brendan c from comment #2)
> Just to verify that I'm understanding the new system correctly: for the
> purposes of comparing to FHR v2, all submissions with
> ['payload']['info']['reason']=="idle-daily" should be dropped, right?

Yes, we will kill idle-daily.
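
For the comparison script, dropping those submissions could look like the sketch below, assuming each ping has been loaded as a nested dict. The function name is hypothetical; the `['payload']['info']['reason']` path is the one quoted above:

```python
def drop_idle_daily(pings):
    """Return the pings whose payload.info.reason is not "idle-daily"."""
    kept = []
    for ping in pings:
        reason = ping.get("payload", {}).get("info", {}).get("reason")
        if reason != "idle-daily":
            kept.append(ping)
    return kept
```

Pings with no reason field at all are kept here, so that malformed submissions stay visible instead of being silently dropped.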
(In reply to brendan c from comment #3)
> (1)
> I've been poking around in the v4 data, and I'm not sure where to find most
> of the metrics mentioned in Comment 1. I've found candidates in some cases,
> but since the packets we have currently are from Nightly data, there is a
> ton of extra telemetry stuff that doesn't bear on the FHR comparison, and I
> don't want to be grabbing the wrong thing from among those many fields. So
> to avoid any errors, can you let me know the correct paths in the v4 packet
> to the following:
> * the number of searches (by provider and SAP) per session fragment

ping.payload.keyedHistograms.SEARCH_COUNTS["searchProvider"].sum
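
A sketch of totalling those sums across a set of pings, assuming each ping is a nested dict. The histogram keys are passed through unchanged, since their exact format (how provider and SAP are encoded) is not spelled out here; the function name is hypothetical:

```python
from collections import Counter

def search_counts(pings):
    """Total SEARCH_COUNTS[key].sum per key across the given pings."""
    totals = Counter()
    for ping in pings:
        keyed = ping.get("payload", {}).get("keyedHistograms", {})
        for key, histogram in keyed.get("SEARCH_COUNTS", {}).items():
            totals[key] += histogram.get("sum", 0)
    return totals
```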

> * the number of update checks, successes, etc

Update metrics will be ported in bug 1121018 (WIP).

> * start up times per session

Can you be more specific?

> * totalTime per session

ping.payload.simpleMeasurements.totalTime

> * activeTicks per session

ping.payload.simpleMeasurements.activeTicks
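
A small sketch of pulling both values out of one ping, assuming the ping is a nested dict. A missing field comes back as `None` so that gaps are visible in the comparison; the function name is hypothetical:

```python
def session_measurements(ping, names=("totalTime", "activeTicks")):
    """Read the named fields from payload.simpleMeasurements, None if absent."""
    simple = ping.get("payload", {}).get("simpleMeasurements", {})
    return {name: simple.get(name) for name in names}
```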

> (2)
> How would you determine the number of crashes on each date with v4?

Pending bug 1121013.

> (3)
> To determine version changes you need to look for changes in the environment
> between subsessions, right? That is, there's not something that I should be
> looking for *within* an individual payload that flags a version change, is
> there?

What version changes? Updates to Firefox?
Flags: needinfo?(gfritzsche)
Thanks Georg! To clarify--

> > * start up times per session
> 
> Can you be more specific?

Nm, I found these (was looking for main, sessionrestore and firstpaint, found them in simple measurements)

> > (3)
> > To determine version changes you need to look for changes in the environment
> > between subsessions, right? That is, there's not something that I should be
> > looking for *within* an individual payload that flags a version change, is
> > there?
> 
> What version changes? Updates to Firefox?

Yes, updates to Firefox.
(In reply to brendan c from comment #6)
> > > (3)
> > > To determine version changes you need to look for changes in the environment
> > > between subsessions, right? That is, there's not something that I should be
> > > looking for *within* an individual payload that flags a version change, is
> > > there?
> > 
> > What version changes? Updates to Firefox?
> 
> Yes, updates to Firefox.

Correct. We will have "upgrade" pings later to specifically address that scenario per bug 1120372.
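
Until those pings exist, detecting an update by diffing consecutive subsessions could be sketched as follows. This assumes the pings for one client are already in subsession order, and it takes the version lookup as a caller-supplied function because the location of the version inside the environment is not stated in this bug. The function name is hypothetical:

```python
def version_changes(pings, get_version):
    """Return (subsessionStartDate, old_version, new_version) for each change.

    `pings` must already be sorted in subsession order for a single client.
    `get_version` is a caller-supplied function mapping a ping to its version.
    """
    changes = []
    previous = None
    for ping in pings:
        current = get_version(ping)
        if previous is not None and current != previous:
            start = ping["payload"]["info"]["subsessionStartDate"]
            changes.append((start, previous, current))
        previous = current
    return changes
```

The date reported is the start of the first subsession that carries the new version, which is what we would compare against the v2 date of the version change.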
Flags: needinfo?(benjamin)
(In reply to brendan c from comment #3)
> (4)
> Minor question: In ['payload']['info']['sessionStartDate'] and
> ['payload']['info']['subsessionStartDate'], the dates have the form
> "2015-02-27T00:00:00". Is there any reason for the "T00:00:00"? I thought we
> decided that a date stamp was enough?

This is incidental, we are just not removing those parts. We can probably remove that later as an improvement?
(In reply to Georg Fritzsche [:gfritzsche] [away Feb 27 - March 15] from comment #8)
> (In reply to brendan c from comment #3)
> > (4)
> > Minor question: In ['payload']['info']['sessionStartDate'] and
> > ['payload']['info']['subsessionStartDate'], the dates have the form
> > "2015-02-27T00:00:00". Is there any reason for the "T00:00:00"? I thought we
> > decided that a date stamp was enough?
> 
> This is incidental, we are just not removing those parts. We can probably
> remove that later as an improvement?

Probably better to remove them sooner than later so that we won't need scripts that handle both forms.
What's the difference for the script? Both are valid forms of the ISO date strings and would presumably use the same facilities?
Yeah, I think you're right. For some reason I thought I'd remembered a common parsing tool in R or python that supported YYYY-mm-dd but not the full timestamp, but I think I was mistaken.
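
For what it's worth, a quick sketch in Python (3.7 or later, where `datetime.fromisoformat` is available) that accepts both forms with the same call. The helper name is hypothetical:

```python
from datetime import datetime

def parse_start_date(value):
    """Parse "2015-02-27" or "2015-02-27T00:00:00" into a datetime.date."""
    return datetime.fromisoformat(value).date()
```

Both `parse_start_date("2015-02-27")` and `parse_start_date("2015-02-27T00:00:00")` give the same date, so a script does not need a separate code path if the time part is removed later.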
Depends on: 1149664
Depends on: 1149666
Depends on: 1154113
Depends on: 1167456
Blocks: 1169103
Whiteboard: [b5] [unifiedTelemetry]
Does this ticket need to be updated? Needinfo to Brendan.
Flags: needinfo?(bcolloran)
(In reply to thuelbert from comment #12)

> Does this ticket need to be updated? Needinfo to Brendan.

Possibly at some point? I'm not sure what Benjamin had in mind originally, but this has kind of become a tracking bug (along with https://bugzilla.mozilla.org/show_bug.cgi?id=1169103 ).

However, we could take the title of the bug seriously and consolidate all of the checks that currently exist in various IPython notebooks into a single script for the purpose of ongoing automated checking.

That would probably be useful, but would take some doing.
Flags: needinfo?(bcolloran)
Whiteboard: [b5] [unifiedTelemetry] → [b5] [unifiedTelemetry][data-validation]
Whiteboard: [b5] [unifiedTelemetry][data-validation] → [40b9] [unifiedTelemetry][data-validation]
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
Product: Firefox Health Report → Firefox Health Report Graveyard