Here is a first set of criteria that must be met before FHR v4 can replace v2 on Release. Please add others as they come up. Some of these have bugs; others will need new bugs created. I probably cannot get to all of these as quickly as we might like, so it would be helpful to have some other folks jump in and grab a few of these.

FHR v4 internal consistency
- any pings submitted under a single pingId must have entirely identical data ( https://bugzilla.mozilla.org/show_bug.cgi?id=1167456 )
- any pings submitted with the same subsessionId must have entirely identical data
- (deduped) subsession chains per client must not have repeated values for profileSubsessionCounter
- (deduped) subsession chains per client must not have any gaps from more than X days before the most recent submission (i.e., if we have a subsession recorded on day D, then there must not be any gaps in the subsession chain before day D-X. This is to allow for the possibility of submission lag).
- subsession chain branching must be quantified, and:
-- subsessions with more than one descendant subsession must be "rare" (profile branching can happen, but if session chains branch all the time then there is a problem)
-- subsessions must never have more than one ancestor subsession

consistency between v2 and v4 (see https://bugzilla.mozilla.org/show_bug.cgi?id=1134661 ):
- within the set of comparable days in v2 and v4, the following should be identical:
-- the set of days with any activity
-- the number of session starts on each date ( https://bugzilla.mozilla.org/show_bug.cgi?id=1154113 )
-- the set of session signatures must match ( https://bugzilla.mozilla.org/show_bug.cgi?id=1150105 )
-- the number of searches (by provider and SAP) on each date
-- the number of crashes on each date
-- the number of update checks, successes, etc
-- any client version changes should fall on the same date
- for each complete reconstructed session starting within this date range (i.e., not the ongoing session in appSession.current or a session for which not all subsessions have been collected), we should also have these match identically:
-- start up times per session
-- totalTime per session
-- activeTicks per session
- investigate geoIP lookup ( https://bugzilla.mozilla.org/show_bug.cgi?id=1132660 )
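The chain criteria above (no repeated profileSubsessionCounter, no gaps before day D-X, quantified branching) could be checked per client with something like the following. This is only a rough sketch, not an agreed implementation: the record keys `counter`, `id`, `prev` and `day` are hypothetical names for data extracted from already-deduped pings, and reading "gap before day D-X" as "a missing counter value between two subsessions that are both older than D-X" is one possible interpretation.

```python
from collections import Counter
from datetime import date, timedelta


def check_chain(subsessions, lag_days):
    """Check one client's deduped subsessions against the chain criteria.

    Each subsession is a dict with hypothetical keys:
      'counter' - profileSubsessionCounter
      'id'      - subsessionId
      'prev'    - id of the ancestor subsession, or None
      'day'     - datetime.date on which the subsession was recorded
    """
    # Criterion: no repeated profileSubsessionCounter values.
    counts = Counter(s["counter"] for s in subsessions)
    repeated = sorted(c for c, n in counts.items() if n > 1)

    # Criterion: no gaps before day D - X, where D is the latest day seen.
    # Only subsessions older than the cutoff are considered, so a missing
    # counter that could still arrive late is not flagged.
    cutoff = max(s["day"] for s in subsessions) - timedelta(days=lag_days)
    old = sorted({s["counter"] for s in subsessions if s["day"] < cutoff})
    gaps = [(a, b) for a, b in zip(old, old[1:]) if b - a > 1]

    # Branching: ancestors that have more than one descendant.
    children = Counter(s["prev"] for s in subsessions if s["prev"] is not None)
    branches = sorted(p for p, n in children.items() if n > 1)

    return {"repeated": repeated, "gaps": gaps, "branches": branches}
```

Running this over every client and counting how many return a non-empty `branches` list would give one way to quantify how "rare" branching actually is.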
What does the "consistency" requirement actually mean? We know that there are going to be differences, e.g. UTC versus local days. Are you expressing this in terms of manual QA testing or field measurements? If we're talking about field measurements we need realistic error bars, since 100% matching isn't realistic. Also we're still going to allow for error margins. Internal consistency beyond 99% may not be realistic, and I seriously doubt we have it for FHR v2 currently.
Sure, we will allow minute disagreements. The level of disagreement allowed will depend on the measure.

As far as internal consistency is concerned: I think the metrics team wants to see a lot of nines here. This is in part informed by the experience of FHR v2, which was a giant cluster that *continues* to bite us to this day (see also: the month-long data outage that we just suffered). We want to be very, very sure that the new system is operating as expected and designed before swapping out the mess we know for a system that lacks some of the things about v2 that we think make it robust in the long run. Correctly coalescing subsessions on the client and being able to reconstruct them into sensible session chains on the server is essential for this. We already have evidence that this stuff is hard, in the form of bugs that have already come up. So to have confidence that we have a solid implementation of a difficult design, and that the design is well understood, we want very strong evidence that these foundational bits are working correctly.

Without getting into confidential numbers, let's say we have on the order of 10^8 profiles of interest. Just off the top of my head, I think it would be allowable to have an error condition (a deviation from the conditions laid out above) in 1 out of every 10^5 profiles. (The exception is subsession graph branching, which might well be more prevalent.) IMO, we want to be really rigorous about making sure this layer of the apparatus works correctly.

WRT consistency with v2 -- there may well be cases where FHR v2 is far off the mark, and so it doesn't provide a valid yardstick against which to measure v4 -- but as we've said before, we will want a clear identification of the v2 bug that caused the v2 numbers to be incorrect before certifying that the v4 number is better. But certainly there is wiggle room. IMO, for most of the numbers, if we're off by 1% we'll need to dig into it, but if we're off by, say, 0.01% then it'll be fine to let it slide. But nailing down exactly what is acceptable in terms of error is not up to me -- perhaps you and John (and maybe others?) should meet to discuss the tradeoffs between the business requirements and the technical challenges.
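To make the rough figures above concrete, here is a small sketch of the arithmetic. The numbers are the off-the-top-of-my-head values from this comment, not agreed thresholds; the function names are made up for illustration, and treating v2 as the baseline for the relative difference is an assumption. Nothing has been decided for disagreements between 0.01% and 1%, so the sketch labels that band "undecided".

```python
def allowed_error_profiles(total_profiles, one_in):
    """Number of profiles allowed to show an error condition."""
    return total_profiles // one_in


def classify_disagreement(v2_value, v4_value, ok_below=0.0001, investigate_at=0.01):
    """Bucket a v2/v4 disagreement by relative difference (v2 as baseline).

    ok_below=0.0001 is 0.01% and investigate_at=0.01 is 1%; both are
    illustrative defaults taken from the discussion, not agreed limits.
    """
    if v2_value == 0:
        # No baseline to divide by: only exact agreement passes.
        return "ok" if v4_value == 0 else "investigate"
    relative = abs(v4_value - v2_value) / abs(v2_value)
    if relative >= investigate_at:
        return "investigate"
    if relative <= ok_below:
        return "ok"
    return "undecided"
```

With 10^8 profiles and an allowance of 1 in 10^5, `allowed_error_profiles(10**8, 10**5)` comes to 1000 profiles.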
Flags: needinfo?(bcolloran) → needinfo?(jjensen)
(In reply to brendan c from comment #2)
> But nailing down exactly what is acceptable in terms of error is not up to
> me-- perhaps you and John (and maybe others?) should meet to discuss the
> tradeoffs between the business requirements and the technical challenges.

As per my comment in another related bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1142153#c15), I think we are not yet at a stage where we can agree on numbers for this. The best way forward is IMHO for us to agree on some rough principles regarding the issue and then apply them as best we can to come up with a definition.
See also this Google doc, which has basically the same content as this bug but is a little more readable (for me at least) and will have other notes: https://docs.google.com/document/d/1KpcQy_QEfizd6Q4MFvOt5rMCL32Ef49_2T5yXNqWIWw/edit#
I ran a simple test comparing pings with common subsessionIds; they appear to be identical: http://nbviewer.ipython.org/gist/SamPenrose/1a8133d2ef6d251addfc In the notebook you'll see 17K examined in the period May 7-14; the period April 30-May 6 found similar results. The key function is reduce_multi_subsession(), which does pairwise comparison of the pings in question aside from the "meta" blob.
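For readers who don't want to open the notebook, the general shape of such a check might look like the sketch below. This is not the notebook's reduce_multi_subsession(); it is an independent illustration that assumes each ping is a plain dict with a top-level "meta" key to ignore, and that the grouping id can be pulled out by a caller-supplied `key` function (the default top-level "subsessionId" lookup is a simplification).

```python
from collections import defaultdict
from itertools import combinations


def strip_meta(ping):
    """Return a copy of the ping without its top-level 'meta' blob."""
    return {k: v for k, v in ping.items() if k != "meta"}


def find_mismatches(pings, key=lambda p: p["subsessionId"]):
    """Return the ids whose pings are not all identical outside 'meta'."""
    groups = defaultdict(list)
    for ping in pings:
        groups[key(ping)].append(strip_meta(ping))

    mismatched = []
    for group_id, group in groups.items():
        # Pairwise comparison of every ping sharing this id.
        if any(a != b for a, b in combinations(group, 2)):
            mismatched.append(group_id)
    return sorted(mismatched)
```

Passing a different `key` (for example one that returns the pingId) would apply the same comparison to the other "identical data" criterion.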
After consulting with Georg, I should retest this. He believes a specific checkin fixed this: https://hg.mozilla.org/mozilla-central/rev/032fa84aba3d He would like me to test later builds, and also to test earlier builds and hopefully find problems with them (therefore validating that my test actually does its job).

(In reply to Sam Penrose from comment #5)
> I ran a simple test comparing pings with common subsessionIds; they appear
> to be identical:
>
> http://nbviewer.ipython.org/gist/SamPenrose/1a8133d2ef6d251addfc
>
> In the notebook you'll see 17K examined in the period May 7-14; the period
> April 30-May 6 found similar results.
>
> The key function is reduce_multi_subsession(), which does pairwise
> comparison of the pings in question aside from the "meta" blob.
(In reply to brendan c from comment #0)
> -- subsessions must never have more than one ancestor subsession

This criterion doesn't make sense for v4; let's drop it.
Whiteboard: [unifiedTelemetry] → [unifiedTelemetry][b5][data-validation]
Priority: -- → P1
Whiteboard: [unifiedTelemetry][b5][data-validation] → [unifiedTelemetry][40b9][data-validation]
using tag based searches
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → INVALID
Product: Firefox Health Report → Firefox Health Report Graveyard