Closed Bug 1362161 Opened 7 years ago Closed 7 years ago

Add any missing high-value fields ahead of Main Summary backfill

Categories

(Data Platform and Tools :: General, enhancement, P2)

x86
macOS
enhancement
Points:
13

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Unassigned)

It costs about the same to backfill the data whether we add one new field or fifty, so we might as well take a moment to think about other fields that might be useful to add before we embark on the backfill in bug 1362134.

The plan is to keep this bug open for a week or two, add fields as they are requested, then run the backfill to populate them all at once.

Please nominate fields here that would be useful for analyses based on the main_summary dataset.
Blocks: 1362134
bsmedberg recommended `subsessionCounter`.
Depends on: 1360177
(In reply to Frank Bertsch [:frank] from comment #1)
> bsmedberg recommended `subsessionCounter`.

I'd like to have `subsessionCounter`, `info.profileSubsessionCounter` and `creationDate`, which are useful for chaining subsessions longitudinally per-profile.

I'd also like to get the complete contents of `environment.settings.defaultSearchEngineData` (currently we only have the `name` field). These fields are essential for monitoring search hijacking and reset, which we want to move towards doing better.

There are a bunch of new opt-out search histograms that landed recently, and I'd love to have all of those - I'll provide the list.

However, I want to raise the question: should we have more of the opt-out measurements? All opt-out histograms? A subset? Do I recall correctly that we automatically get all scalars/keyedScalars?
(In reply to Dave Zeber [:dzeber] from comment #2)

> However, I want to raise the question: should we have more of the opt-out
> measurements? All opt-out histograms? A subset?

This would be fantastic. We've talked about adding this, but it would be quite a bit of engineering work to implement (see also: longitudinal), and probably also require yet another schema change as we would presumably normalize on some naming process for histograms. It would really be about finding someone to take on the workload.

> Do I recall correctly that we automatically get all scalars/keyedScalars?

We will be getting all scalars and keyed scalars, but (presumably) only for the parent process. I may have time to add them for the content/gpu processes as well.
Thanks for this. This should be a twice-a-quarter process. It is very useful.
I would like (maybe already in?):

- activeticks
- shutdown_kills (backfilled)
- total_uri_count
- tab_open_event_count
- attribution
- iswow64

And I'll add more in the next two weeks. Thanks again.
(In reply to Frank Bertsch [:frank] from comment #3)
> (In reply to Dave Zeber [:dzeber] from comment #2)
> 
> > However, I want to raise the question: should we have more of the opt-out
> > measurements? All opt-out histograms? A subset?
> 
> This would be fantastic. We've talked about adding this, but it would be
> quite a bit of engineering work to implement (see also: longitudinal), and
> probably also require yet another schema change as we would presumably
> normalize on some naming process for histograms. It would really be about
> finding someone to take on the workload.

This dataset is not meant to be a full-fidelity representation of the main pings - I think if we want to really scale up the coverage of the main ping contents, we should approach the problem differently. We could use the direct-to-parquet output, for example, or a variation of the longitudinal dataset. The intention for main_summary is to be a "narrower", simplified version of the main ping.
(In reply to "Saptarshi Guha[:joy]" from comment #4)
> - activeticks
already included in v4

> - shutdown_kills (backfilled)
where does this appear in the main ping?

> - total_uri_count
already included in v4 (all scalars have been added)

> - tab_open_event_count
already included in v4

> - attribution
already included (though doesn't seem to appear in the docs)

> - iswow64
already included
(In reply to Mark Reid [:mreid] from comment #6)
> (In reply to "Saptarshi Guha[:joy]" from comment #4)
> > - activeticks
> already included in v4
> 
> > - shutdown_kills (backfilled)
> where does this appear in the main ping?
> 

Thanks. It's from here: https://bugzilla.mozilla.org/show_bug.cgi?id=1344274
see https://github.com/mozilla/telemetry-batch-view/pull/186/commits/01e3b6a42a59c397a8e82eedf6ba93f0cc8faaf9#diff-3f03f83249e76f7f26f75ef9a93f8623R684

hsum(keyedHistograms \ "SUBPROCESS_KILL_HARD" \ "ShutDownKill")

HTH
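
For readers without the Scala source at hand, the lookup above can be approximated in Python. This is only a sketch, not the telemetry-batch-view implementation: it assumes `hsum` returns the histogram's `sum` field (check the linked commit to confirm), and the function name `keyed_histogram_sum` and the sample payload are hypothetical.

```python
from typing import Any, Mapping, Optional


def keyed_histogram_sum(payload: Mapping[str, Any], name: str, key: str) -> Optional[int]:
    """Return the `sum` of one key of a keyed histogram, or None if absent.

    Assumption: this mirrors what `hsum` does in the linked PR; verify
    against the Scala source before relying on it.
    """
    histogram = payload.get("keyedHistograms", {}).get(name, {}).get(key)
    if not isinstance(histogram, Mapping):
        return None
    total = histogram.get("sum")
    return total if isinstance(total, int) else None


# Hypothetical payload fragment, for illustration only.
sample_payload = {
    "keyedHistograms": {
        "SUBPROCESS_KILL_HARD": {
            "ShutDownKill": {"values": {"0": 0, "1": 2, "2": 0}, "sum": 2}
        }
    }
}

shutdown_kills = keyed_histogram_sum(sample_payload, "SUBPROCESS_KILL_HARD", "ShutDownKill")
missing = keyed_histogram_sum(sample_payload, "SUBPROCESS_KILL_HARD", "NoSuchKey")
```

Returning `None` rather than 0 for an absent key keeps "no histogram recorded" distinguishable from "recorded with a sum of zero".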
Ok, thanks. All the fields already present in main_summary will be backfilled, so this is more about identifying new fields to add to the table before we kick off the backfill job.
A measure of memory installed would be nice
Points: --- → 13
Priority: -- → P2
also need all the engagement.navigation.* fields described here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1303333
Added and backfilled.
(In reply to brendan c from comment #10)
> also need all the engagement.navigation.* fields described here:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1303333
> Added and backfilled.

The backfill will include all scalars.
(In reply to Frank Bertsch [:frank] from comment #11)
> (In reply to brendan c from comment #10)
> > also need all the engagement.navigation.* fields described here:
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1303333
> > Added and backfilled.
> 
> The backfill will include all scalars.

Cool, thanks Frank. I think most of those are keyed scalars, in case that makes any difference.
I'd like to see isWebExtension in environment.addons in the backfill. This is only in Nightly and Beta at the moment.
(In reply to "Saptarshi Guha[:joy]" from comment #9)
> A measure of memory installed would be nice

Let's add environment/system/memoryMB for this (as used by the hardware survey)
(In reply to Ben Miroglio from comment #13)
> I'd like to see isWebExtension in environment.addons in the backfill. This
> is only in Nightly and Beta at the moment.

This field will be added in bug 1360177.
(In reply to Dave Zeber [:dzeber] from comment #2)
> (In reply to Frank Bertsch [:frank] from comment #1)
> > bsmedberg recommended `subsessionCounter`.
> 
> I'd like to have `subsessionCounter`, `info.profileSubsessionCounter` and
> `creationDate`, which are useful for chaining subsessions longitudinally
> per-profile.

Is `creationDate` useful given that `subsessionStartDate` is already available?
Flags: needinfo?(dzeber)
(In reply to Mark Reid [:mreid] from comment #16)
> Is `creationDate` useful given that `subsessionStartDate` is already
> available?

I've found `creationDate` really useful for chaining together event sequences, as it gives millisecond-resolution timestamps. It's probably best used as a diagnostic - for example, checking that Shield/TxP participants are sending the right types of pings in the right order. It can be more reliable than `profileSubsessionCounter`. It's something I'd want to have available when querying for experiment participants via the HBase API.
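
The ordering check described above might look roughly like the following. It is a sketch under stated assumptions: `creationDate` is taken to be a UTC string with fractional seconds and a trailing `Z`, the helper names and sample pings are hypothetical, and because the timestamp comes from the client clock the resulting order is only as good as that clock.

```python
from datetime import datetime, timezone
from typing import Dict, List


def parse_creation_date(value: str) -> datetime:
    """Parse a timestamp such as '2017-05-10T12:34:56.789Z' as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def order_by_creation_date(pings: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return pings sorted by creationDate, oldest first."""
    return sorted(pings, key=lambda ping: parse_creation_date(ping["creationDate"]))


# Hypothetical pings from a single profile, deliberately out of order.
sample_pings = [
    {"type": "main", "creationDate": "2017-05-10T12:34:56.900Z"},
    {"type": "shield-study", "creationDate": "2017-05-10T12:34:56.120Z"},
    {"type": "main", "creationDate": "2017-05-10T11:02:03.004Z"},
]

ordered_types = [ping["type"] for ping in order_by_creation_date(sample_pings)]
```

The two pings created within the same second are still ordered correctly, which is the benefit of the millisecond resolution.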
Flags: needinfo?(dzeber)
(In reply to Dave Zeber [:dzeber] from comment #17)
> (In reply to Mark Reid [:mreid] from comment #16)
> > Is `creationDate` useful given that `subsessionStartDate` is already
> > available?
> 
> I've found `creationDate` really useful for chaining together event
> sequences, as it gives millisecond-resolution timestamps. It's probably best
> used as a diagnostic - for example, checking that Shield/TxP participants
> are sending the right types of pings in the right order. It can be more
> reliable than `profileSubsessionCounter`. It's something I'd want to have
> available when querying for experiment participants via the HBase API.

Hey Dave, I'm really curious about the circumstances in which creationDate is more reliable than profileSubsessionCounter.

The two gold standards for ordering sessions that we built into FHRv4 right at the start are:
1) profileSubsessionCounter, and
2) previousSubsessionId (which can be used to reconstruct a hopefully unambiguous session tree even in the case of branching profile histories)

Have you checked into whether the cases in which the timestamps work but profileSubsessionCounter fails are instances of history branching? If it's something else, we may need to file other bugs.

Also, if accurate subsession ordering is an important use case, then probably profileSubsessionCounter and previousSubsessionId should be added to main_summary as well.
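
To make the session-tree idea concrete, here is a minimal sketch of linking one profile's pings through `previousSubsessionId`. The field names follow the ones used in this comment; the function name, the flat dict layout, and the sample IDs are hypothetical, and this is not how any existing Mozilla job does it.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def link_subsessions(pings: List[dict]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return (roots, branch_points) for one profile's pings.

    roots: subsession IDs whose predecessor is absent from the data, either
        because it is the first subsession or because a ping was not received.
    branch_points: parent IDs with more than one child, i.e. places where
        the profile history appears to fork.
    """
    known_ids = {ping["subsessionId"] for ping in pings}
    children: Dict[str, List[str]] = defaultdict(list)
    roots: List[str] = []
    for ping in pings:
        parent = ping.get("previousSubsessionId")
        if parent is not None and parent in known_ids:
            children[parent].append(ping["subsessionId"])
        else:
            roots.append(ping["subsessionId"])
    branch_points = {p: kids for p, kids in children.items() if len(kids) > 1}
    return roots, branch_points


# Hypothetical data: one fork after "b", and "d" was never received.
sample = [
    {"subsessionId": "a", "previousSubsessionId": None},
    {"subsessionId": "b", "previousSubsessionId": "a"},
    {"subsessionId": "c", "previousSubsessionId": "b"},
    {"subsessionId": "c2", "previousSubsessionId": "b"},
    {"subsessionId": "e", "previousSubsessionId": "d"},
]

roots, branch_points = link_subsessions(sample)
```

More than one root for a single profile means the chain cannot be joined end to end from the IDs alone, and any entry in `branch_points` is a candidate for branching profile history.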
Flags: needinfo?(dzeber)
(In reply to brendan c from comment #18)
> Have you checked into whether the cases in which the timestamps work but
> profileSubsessionCounter fails are instances of history branching? If it's
> something else, we may need to file other bugs.

One of the main things is that profileSubsessionCounter restarts at 1 when the profile gets reset. This is probably the correct behaviour, but a non-negligible proportion of profiles have been reset. I've also seen some things like gaps in the count (as if subsessions were missing) and subsessions whose counter is out of sequence wrt subsessionStartDate (although this could also be clock weirdness). I've looked at this briefly in cases where I wanted longitudinal sequencing for an analysis, but I haven't done an in-depth study into prevalence or subsession ID chaining.
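
The three symptoms mentioned here (resets, gaps, out-of-sequence counters) can be told apart with a simple pass over the counter values. The following is a heuristic sketch only: the function name and labels are hypothetical, the input is assumed to be one profile's `profileSubsessionCounter` values already sorted by whatever timestamp the analyst trusts, and a value of 1 is treated as a reset even though it could also be a misordered first subsession.

```python
from typing import List, Tuple


def counter_anomalies(counters: List[int]) -> List[Tuple[int, str]]:
    """Label suspicious steps in a profileSubsessionCounter sequence.

    Returns (index, label) pairs for each step that is not a plain +1
    increment over the previous value.
    """
    anomalies = []
    for i in range(1, len(counters)):
        prev, cur = counters[i - 1], counters[i]
        if cur == prev + 1:
            continue
        if cur == 1:
            anomalies.append((i, "reset"))  # counter restarted at 1
        elif cur > prev + 1:
            anomalies.append((i, "gap"))  # subsessions appear to be missing
        else:
            anomalies.append((i, "out_of_order"))  # went backwards or repeated
    return anomalies


# Hypothetical counter sequence for one profile, in timestamp order.
observed = [1, 2, 3, 5, 6, 1, 2, 4, 3]
flags = counter_anomalies(observed)
```

Counting the labels across many profiles would give a rough estimate of how common each problem is, although an "out_of_order" step cannot by itself separate a counter problem from a clock problem.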

> Also, if accurate subsession ordering is an important use case, then
> probably profileSubsessionCounter and previousSubsessionId should be added
> to main_summary as well.

profileSubsessionCounter is already there. I think that and creationDate should be sufficient. One thing with previousSubsessionId is that if we happen to be missing a subsession submission, the chain is broken.
Flags: needinfo?(dzeber)
(In reply to Dave Zeber [:dzeber] from comment #19)

Gotcha, makes sense re: profile resets.

> and
> subsessions whose counter is out of sequence wrt subsessionStartDate
> (although this could also be clock weirdness).

Hmm, given all the clock things we've seen over the years, I'd suspect clock weirdness before I'd cast suspicion on profileSubsessionCounter.

> One thing with previousSubsessionId is that if we happen to be
> missing a subsession submission, the chain is broken.

Yep, absolutely true. If someone wanted to look at chaining/branching in more depth, I guess they could use `longitudinal`.
Not sure when the request window is closing, but I had another one: `environment.settings.searchCohort`. AFAIK this is not yet in main_summary. It's used to identify search partner test participants, which are typically small groups for which we need 100% resolution.
When this PR lands:
https://github.com/mozilla/telemetry-batch-view/pull/238

I think that will take care of all the additions requested in this bug. Thanks all!
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: Main Summary → General