Add any missing high-value fields ahead of Main Summary backfill

RESOLVED FIXED

Status

Product: Data Platform and Tools
Component: Datasets: Main Summary
Priority: P2
Severity: normal
Status: RESOLVED FIXED
Reported: 6 months ago
Modified: 4 months ago

People

(Reporter: mreid, Unassigned)

(Reporter)

Description

6 months ago
It costs about the same to backfill the data whether we add one new field or fifty, so we might as well take a moment to think about other fields that might be useful to add before we embark on the backfill in bug 1362134.

The plan is to keep this bug open for a week or two, add fields as they are requested, then run the backfill to populate them all at once.

Please nominate fields here that would be useful for analyses based on the main_summary dataset.
(Reporter)

Updated

6 months ago
Blocks: 1362134

Comment 1

6 months ago
bsmedberg recommended `subsessionCounter`.
(Reporter)

Updated

6 months ago
Depends on: 1360177

Comment 2

6 months ago
(In reply to Frank Bertsch [:frank] from comment #1)
> bsmedberg recommended `subsessionCounter`.

I'd like to have `subsessionCounter`, `info.profileSubsessionCounter` and `creationDate`, which are useful for chaining subsessions longitudinally per-profile.

I'd also like to get the complete contents of `environment.settings.defaultSearchEngineData` (currently we only have the `name` field). These fields are essential for monitoring search hijacking and reset, which we want to move towards doing better.

There are a bunch of new opt-out search histograms that landed recently, and I'd love to have all of those - I'll provide the list.

However, I want to raise the question: should we have more of the opt-out measurements? All opt-out histograms? A subset? Do I recall correctly that we automatically get all scalars/keyedScalars?

Comment 3

6 months ago
(In reply to Dave Zeber [:dzeber] from comment #2)

> However, I want to raise the question: should we have more of the opt-out
> measurements? All opt-out histograms? A subset?

This would be fantastic. We've talked about adding this, but it would be quite a bit of engineering work to implement (see also: longitudinal), and probably also require yet another schema change as we would presumably normalize on some naming process for histograms. It would really be about finding someone to take on the workload.

> Do I recall correctly that we automatically get all scalars/keyedScalars?

We will be getting all scalars and keyed scalars, but (presumably) only for the parent process. I may have time to add them for the content/gpu processes as well.

Comment 4

6 months ago
Thanks for this. This should be a twice-a-quarter process; it is very useful.
I would like the following (some may already be in):

- activeticks
- shutdown_kills (backfilled)
- total_uri_count
- tab_open_event_count
- attribution
- iswow64

I'll add more in the next two weeks. Thanks again.
(Reporter)

Comment 5

6 months ago
(In reply to Frank Bertsch [:frank] from comment #3)
> (In reply to Dave Zeber [:dzeber] from comment #2)
> 
> > However, I want to raise the question: should we have more of the opt-out
> > measurements? All opt-out histograms? A subset?
> 
> This would be fantastic. We've talked about adding this, but it would be
> quite a bit of engineering work to implement (see also: longitudinal), and
> probably also require yet another schema change as we would presumably
> normalize on some naming process for histograms. It would really be about
> finding someone to take on the workload.

This dataset is not meant to be a full-fidelity representation of the main pings - I think if we want to really scale up the coverage of the main ping contents, we should approach the problem differently. We could use the direct-to-parquet output, for example, or a variation of the longitudinal dataset. The intention for main_summary is to be a "narrower", simplified version of the main ping.
(Reporter)

Comment 6

6 months ago
(In reply to "Saptarshi Guha[:joy]" from comment #4)
> - activeticks
already included in v4

> - shutdown_kills (backfilled)
where does this appear in the main ping?

> - total_uri_count
already included in v4 (all scalars have been added)

> - tab_open_event_count
already included in v4

> - attribution
already included (though doesn't seem to appear in the docs)

> - iswow64
already included

Comment 7

6 months ago
(In reply to Mark Reid [:mreid] from comment #6)
> (In reply to "Saptarshi Guha[:joy]" from comment #4)
> > - activeticks
> already included in v4
> 
> > - shutdown_kills (backfilled)
> where does this appear in the main ping?
> 

Thanks. It's from here: https://bugzilla.mozilla.org/show_bug.cgi?id=1344274
see https://github.com/mozilla/telemetry-batch-view/pull/186/commits/01e3b6a42a59c397a8e82eedf6ba93f0cc8faaf9#diff-3f03f83249e76f7f26f75ef9a93f8623R684

hsum(keyedHistograms \ "SUBPROCESS_KILL_HARD" \ "ShutDownKill")
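
For readers who don't know the Scala helper, here is a minimal Python sketch of what that expression appears to compute. It assumes `hsum` reads the `sum` field of a serialized histogram, which is an assumption about the helper and is not confirmed in this bug. The function names and the sample data are hypothetical and only illustrate the lookup path.

```python
from typing import Any, Mapping, Optional


def hsum(histogram: Optional[Mapping[str, Any]]) -> Optional[int]:
    """Return the histogram's `sum` field, or None if it is missing.

    Assumption: the Scala `hsum` helper behaves this way.
    """
    if not histogram:
        return None
    value = histogram.get("sum")
    return int(value) if isinstance(value, (int, float)) else None


def shutdown_kills(keyed_histograms: Mapping[str, Any]) -> Optional[int]:
    """Look up SUBPROCESS_KILL_HARD -> ShutDownKill and summarize it."""
    keyed = keyed_histograms.get("SUBPROCESS_KILL_HARD", {})
    return hsum(keyed.get("ShutDownKill"))


# Made-up keyed histogram fragment, for illustration only.
sample = {
    "SUBPROCESS_KILL_HARD": {
        "ShutDownKill": {"sum": 3, "values": {"0": 3, "1": 0}},
    }
}
print(shutdown_kills(sample))
```

Returning `None` rather than 0 when the histogram is absent keeps "never recorded" distinguishable from "recorded zero kills".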

HTH
(Reporter)

Comment 8

6 months ago
Ok, thanks. All the fields already present in main_summary will be backfilled, so this is more about identifying new fields to add to the table before we kick off the backfill job.

Comment 9
A measure of memory installed would be nice

Updated

5 months ago
Points: --- → 13
Priority: -- → P2

Comment 10

5 months ago
also need all the engagement.navigation.* fields described here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1303333
Added and backfilled.

Comment 11

5 months ago
(In reply to brendan c from comment #10)
> also need all the engagement.navigation.* fields described here:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1303333
> Added and backfilled.

The backfill will include all scalars.

Comment 12

5 months ago
(In reply to Frank Bertsch [:frank] from comment #11)
> (In reply to brendan c from comment #10)
> > also need all the engagement.navigation.* fields described here:
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1303333
> > Added and backfilled.
> 
> The backfill will include all scalars.

Cool, thanks Frank. I think most of those are keyed scalars, in case that makes any difference.

Comment 13

5 months ago
I'd like to see isWebExtension in environment.addons in the backfill. This is only in Nightly and Beta at the moment.
(Reporter)

Comment 14

5 months ago
(In reply to "Saptarshi Guha[:joy]" from comment #9)
> A measure of memory installed would be nice

Let's add environment/system/memoryMB for this (as used by the hardware survey)
(Reporter)

Comment 15

5 months ago
(In reply to Ben Miroglio from comment #13)
> I'd like to see isWebExtension in environment.addons in the backfill. This
> is only in Nightly and Beta at the moment.

This field will be added in bug 1360177.
(Reporter)

Comment 16

5 months ago
(In reply to Dave Zeber [:dzeber] from comment #2)
> (In reply to Frank Bertsch [:frank] from comment #1)
> > bsmedberg recommended `subsessionCounter`.
> 
> I'd like to have `subsessionCounter`, `info.profileSubsessionCounter` and
> `creationDate`, which are useful for chaining subsessions longitudinally
> per-profile.

Is `creationDate` useful given that `subsessionStartDate` is already available?
Flags: needinfo?(dzeber)

Comment 17

5 months ago
(In reply to Mark Reid [:mreid] from comment #16)
> Is `creationDate` useful given that `subsessionStartDate` is already
> available?

I've found `creationDate` really useful for chaining together event sequences, as it gives millisecond-resolution timestamps. It's probably best used as a diagnostic - for example, checking that Shield/TxP participants are sending the right types of pings in the right order. It can be more reliable than `profileSubsessionCounter`. It's something I'd want to have available when querying for experiment participants via the HBase API.
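
A minimal sketch of the kind of ordering described here, assuming each ping is a dict whose `creationDate` is a UTC ISO 8601 string with fractional seconds (for example `2017-05-01T12:00:00.075Z`). The helper names and the sample records are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def parse_creation_date(value: str) -> datetime:
    """Parse a UTC timestamp of the assumed form YYYY-MM-DDTHH:MM:SS.fffZ."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def order_pings(pings: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the pings sorted by creationDate, oldest first."""
    return sorted(pings, key=lambda p: parse_creation_date(p["creationDate"]))


# Made-up records: two pings created 175 ms apart within the same second.
pings = [
    {"id": "b", "creationDate": "2017-05-01T12:00:00.250Z"},
    {"id": "a", "creationDate": "2017-05-01T12:00:00.075Z"},
]
print([p["id"] for p in order_pings(pings)])
```

Because these timestamps come from the client clock, an ordering built this way is only as trustworthy as that clock.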
Flags: needinfo?(dzeber)

Comment 18

5 months ago
(In reply to Dave Zeber [:dzeber] from comment #17)
> (In reply to Mark Reid [:mreid] from comment #16)
> > Is `creationDate` useful given that `subsessionStartDate` is already
> > available?
> 
> I've found `creationDate` really useful for chaining together event
> sequences, as it gives millisecond-resolution timestamps. It's probably best
> used as a diagnostic - for example, checking that Shield/TxP participants
> are sending the right types of pings in the right order. It can be more
> reliable than `profileSubsessionCounter`. It's something I'd want to have
> available when querying for experiment participants via the HBase API.

Hey Dave, I'm really curious about the circumstances in which creationDate is more reliable than profileSubsessionCounter.

The two gold standards for ordering sessions that we built into FHRv4 right at the start are:
1) profileSubsessionCounter, and
2) previousSubsessionId (which can be used to reconstruct a hopefully unambiguous session tree even in the case of branching profile histories)

Have you checked into whether the cases in which the timestamps work but profileSubsessionCounter fails are instances of history branching? If it's something else, we may need to file other bugs.

Also, if accurate subsession ordering is an important use case, then profileSubsessionCounter and previousSubsessionId should probably be added to main_summary as well.
Flags: needinfo?(dzeber)

Comment 19

5 months ago
(In reply to brendan c from comment #18)
> Have you checked into whether the cases in which the timestamps work but
> profileSubsessionCounter fails are instances of history branching? If it's
> something else, we may need to file other bugs.

One of the main things is that profileSubsessionCounter restarts at 1 when the profile gets reset. This is probably the correct behaviour, but a non-negligible proportion of profiles have been reset. I've also seen some things like gaps in the count (as if subsessions were missing) and subsessions whose counter is out of sequence wrt subsessionStartDate (although this could also be clock weirdness). I've looked at this briefly in cases where I wanted longitudinal sequencing for an analysis, but I haven't done an in-depth study into prevalence or subsession ID chaining.
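
The anomalies listed above can be flagged with a simple pass over one profile's counters. This is a minimal sketch, assuming the counters have already been put in submission order by some other field. The function name and labels are hypothetical, and treating a drop back to 1 as a profile reset is a heuristic rather than a confirmed signal.

```python
from typing import List


def classify_counter_steps(counters: List[int]) -> List[str]:
    """Label each step between consecutive profileSubsessionCounter values."""
    labels = []
    for prev, cur in zip(counters, counters[1:]):
        if cur == prev + 1:
            labels.append("ok")            # expected increment
        elif cur == 1 and prev > 1:
            labels.append("reset")         # counter restarted, likely a profile reset
        elif cur > prev + 1:
            labels.append("gap")           # one or more subsessions missing
        else:
            labels.append("out_of_order")  # duplicate or decreasing counter
    return labels


print(classify_counter_steps([1, 2, 3, 1, 2, 5, 4]))
```

Counting the labels per profile would give a rough idea of how prevalent each issue is, without distinguishing clock problems from genuinely missing data.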

> Also, if accurate subsession ordering is an important use case, then
> probably profileSubsessionCounter, and previousSubsessionId should be added
> to main_summary as well.

profileSubsessionCounter is already there. I think that and creationDate should be sufficient. One thing with previousSubsessionId is that if we happen to be missing a subsession submission, the chain is broken.
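
To illustrate the broken-chain concern, here is a minimal sketch that links subsessions through `previousSubsessionId`. It assumes each record is a dict with `subsessionId` and `previousSubsessionId` keys; the function name and the sample records are hypothetical. A record whose predecessor was never received starts a new chain, and a predecessor with several successors (a branch) ends the current chain.

```python
from collections import defaultdict
from typing import Any, Dict, List


def build_chains(subsessions: List[Dict[str, Any]]) -> List[List[str]]:
    """Group subsession IDs into chains linked by previousSubsessionId."""
    ids = {s["subsessionId"] for s in subsessions}
    children = defaultdict(list)  # predecessor ID -> successor IDs
    starts = []                   # IDs whose predecessor is absent
    for s in subsessions:
        prev = s.get("previousSubsessionId")
        if prev in ids:
            children[prev].append(s["subsessionId"])
        else:
            starts.append(s["subsessionId"])

    chains = []
    while starts:
        chain = [starts.pop(0)]
        # Follow the chain while there is exactly one successor.
        while len(children[chain[-1]]) == 1:
            chain.append(children[chain[-1]][0])
        # On a branch, each successor starts its own chain.
        starts.extend(children[chain[-1]])
        chains.append(chain)
    return chains


# Made-up history a -> b -> c -> d where c was never received.
records = [
    {"subsessionId": "a", "previousSubsessionId": None},
    {"subsessionId": "b", "previousSubsessionId": "a"},
    {"subsessionId": "d", "previousSubsessionId": "c"},
]
print(build_chains(records))
```

With `c` missing, the sketch yields two separate chains, so a second field such as the counter or a timestamp is still needed to place `d` after `b`.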
Flags: needinfo?(dzeber)

Comment 20

5 months ago
(In reply to Dave Zeber [:dzeber] from comment #19)

Gotcha, makes sense re: profile resets.

> and
> subsessions whose counter is out of sequence wrt subsessionStartDate
> (although this could also be clock weirdness).

Hmm, given all the clock things we've seen over the years, I'd suspect clock weirdness before I'd cast suspicion on profileSubsessionCounter.

> One thing with previousSubsessionId is that if we happen to be
> missing a subsession submission, the chain is broken.

Yep, absolutely true. If someone wanted to look at chaining/branching in more depth, I guess they could use `longitudinal`.

Comment 21

5 months ago
Not sure when the request window is closing, but I had another one: `environment.settings.searchCohort`. AFAIK this is not yet in main_summary. It's used to identify search partner test participants, which are typically small groups for which we need 100% resolution.
(Reporter)

Comment 22

5 months ago
When this PR lands:
https://github.com/mozilla/telemetry-batch-view/pull/238

I think that will take care of all the additions requested in this bug. Thanks all!
(Reporter)

Updated

4 months ago
Status: NEW → RESOLVED
Last Resolved: 4 months ago
Resolution: --- → FIXED