1362161 - Add any missing high-value fields ahead of Main Summary backfill

Reporter

Description

7 years ago

It costs about the same to backfill the data whether we add one new field or fifty, so we might as well take a moment to think about other fields that might be useful to add before we embark on the backfill in bug 1362134.

The plan is to keep this bug open for a week or two, add fields as they are requested, then run the backfill to populate them all at once.

Please nominate fields here that would be useful for analyses based on the main_summary dataset.

Mark Reid [:mreid]

Reporter

Updated

7 years ago

Blocks: 1362134

Frank Bertsch [:frank]

Comment 1

7 years ago

bsmedberg recommended `subsessionCounter`.

Mark Reid [:mreid]

Reporter

Updated

7 years ago

Depends on: 1360177

Dave Zeber [:dzeber]

Comment 2

7 years ago

(In reply to Frank Bertsch [:frank] from comment #1)
> bsmedberg recommended `subsessionCounter`.

I'd like to have `subsessionCounter`, `info.profileSubsessionCounter` and `creationDate`, which are useful for chaining subsessions longitudinally per-profile.

I'd also like to get the complete contents of `environment.settings.defaultSearchEngineData` (currently we only have the `name` field). These fields are essential for monitoring search hijacking and reset, which we want to move towards doing better.

There are a bunch of new opt-out search histograms that landed recently, and I'd love to have all of those - I'll provide the list.

However, I want to raise the question: should we have more of the opt-out measurements? All opt-out histograms? A subset? Do I recall correctly that we automatically get all scalars/keyedScalars?

Frank Bertsch [:frank]

Comment 3

7 years ago

(In reply to Dave Zeber [:dzeber] from comment #2)

> However, I want to raise the question: should we have more of the opt-out
> measurements? All opt-out histograms? A subset?

This would be fantastic. We've talked about adding this, but it would be quite a bit of engineering work to implement (see also: longitudinal), and probably also require yet another schema change as we would presumably normalize on some naming process for histograms. It would really be about finding someone to take on the workload.

> Do I recall correctly that we automatically get all scalars/keyedScalars?

We will be getting all scalars and keyed scalars, but (presumably) only for the parent process. I may have time to add them for the content/gpu processes as well.

"Saptarshi Guha[:joy]"

Comment 4

7 years ago

Thanks for this. This should be a twice-a-quarter process; it is very useful.
I would like (maybe already in?):

- activeticks
- shutdown_kills (backfilled)
- total_uri_count
- tab_open_event_count
- attribution
- iswow64

And I'll add more in the next two weeks. Thanks again.

Mark Reid [:mreid]

Reporter

Comment 5

7 years ago

(In reply to Frank Bertsch [:frank] from comment #3)
> (In reply to Dave Zeber [:dzeber] from comment #2)
> 
> > However, I want to raise the question: should we have more of the opt-out
> > measurements? All opt-out histograms? A subset?
> 
> This would be fantastic. We've talked about adding this, but it would be
> quite a bit of engineering work to implement (see also: longitudinal), and
> probably also require yet another schema change as we would presumably
> normalize on some naming process for histograms. It would really be about
> finding someone to take on the workload.

This dataset is not meant to be a full-fidelity representation of the main pings - I think if we want to really scale up the coverage of the main ping contents, we should approach the problem differently. We could use the direct-to-parquet output, for example, or a variation of the longitudinal dataset. The intention for main_summary is to be a "narrower", simplified version of the main ping.

Mark Reid [:mreid]

Reporter

Comment 6

7 years ago

(In reply to "Saptarshi Guha[:joy]" from comment #4)
> - activeticks
already included in v4

> - shutdown_kills (backfilled)
where does this appear in the main ping?

> - total_uri_count
already included in v4 (all scalars have been added)

> - tab_open_event_count
already included in v4

> - attribution
already included (though doesn't seem to appear in the docs)

> - iswow64
already included

"Saptarshi Guha[:joy]"

Comment 7

7 years ago

(In reply to Mark Reid [:mreid] from comment #6)
> (In reply to "Saptarshi Guha[:joy]" from comment #4)
> > - activeticks
> already included in v4
> 
> > - shutdown_kills (backfilled)
> where does this appear in the main ping?
> 

Thanks. It's from here: https://bugzilla.mozilla.org/show_bug.cgi?id=1344274
see https://github.com/mozilla/telemetry-batch-view/pull/186/commits/01e3b6a42a59c397a8e82eedf6ba93f0cc8faaf9#diff-3f03f83249e76f7f26f75ef9a93f8623R684

hsum(keyedHistograms \ "SUBPROCESS_KILL_HARD" \ "ShutDownKill")

HTH
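
For anyone reading the expression above outside the Scala job, here is a minimal Python sketch of the same extraction. It assumes that `hsum` adds up the bucket counts in a histogram's `values` map and that the payload has already been parsed into a dict; the helper name `hsum_keyed` and the sample payload are hypothetical and not part of telemetry-batch-view.

```python
from typing import Any, Mapping, Optional


def hsum_keyed(keyed_histograms: Mapping[str, Any], name: str, key: str) -> Optional[int]:
    """Sum the bucket counts for one key of a keyed histogram, or None if absent."""
    by_key = keyed_histograms.get(name)
    if not isinstance(by_key, Mapping):
        return None
    histogram = by_key.get(key)
    if not isinstance(histogram, Mapping):
        return None
    values = histogram.get("values")
    if not isinstance(values, Mapping):
        return None
    # Assumption: the summary value is the total of all bucket counts.
    return sum(int(count) for count in values.values())


# Hypothetical fragment of payload.keyedHistograms, for illustration only.
sample = {
    "SUBPROCESS_KILL_HARD": {
        "ShutDownKill": {"values": {"0": 3, "1": 0}, "sum": 3},
    }
}
shutdown_kills = hsum_keyed(sample, "SUBPROCESS_KILL_HARD", "ShutDownKill")
missing = hsum_keyed(sample, "SUBPROCESS_KILL_HARD", "NoSuchKey")
```

Returning `None` for a missing histogram or key keeps "no data" distinct from a real count of zero.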

Mark Reid [:mreid]

Reporter

Comment 8

7 years ago

Ok, thanks. All the fields already present in main_summary will be backfilled, so this is more about identifying new fields to add to the table before we kick off the backfill job.

"Saptarshi Guha[:joy]"

Comment 9

7 years ago

A measure of memory installed would be nice

Thomas Huelbert

Updated

7 years ago

Points: --- → 13

Priority: -- → P2

brendan c

Comment 10

7 years ago

also need all the engagement.navigation.* fields described here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1303333
Added and backfilled.

Frank Bertsch [:frank]

Comment 11

7 years ago

(In reply to brendan c from comment #10)
> also need all the engagement.navigation.* fields described here:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1303333
> Added and backfilled.

The backfill will include all scalars.

brendan c

Comment 12

7 years ago

(In reply to Frank Bertsch [:frank] from comment #11)
> (In reply to brendan c from comment #10)
> > also need all the engagement.navigation.* fields described here:
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1303333
> > Added and backfilled.
> 
> The backfill will include all scalars.

Cool, thanks Frank. I think most of those are keyed scalars, in case that makes any difference.

Ben Miroglio [:bmiroglio]

Comment 13

7 years ago

I'd like to see isWebExtension in environment.addons in the backfill. This is only in Nightly and Beta at the moment.

Mark Reid [:mreid]

Reporter

Comment 14

7 years ago

(In reply to "Saptarshi Guha[:joy]" from comment #9)
> A measure of memory installed would be nice

Let's add environment/system/memoryMB for this (as used by the hardware survey)

Mark Reid [:mreid]

Reporter

Comment 15

7 years ago

(In reply to Ben Miroglio from comment #13)
> I'd like to see isWebExtension in environment.addons in the backfill. This
> is only in Nightly and Beta at the moment.

This field will be added in bug 1360177.

Mark Reid [:mreid]

Reporter

Comment 16

7 years ago

(In reply to Dave Zeber [:dzeber] from comment #2)
> (In reply to Frank Bertsch [:frank] from comment #1)
> > bsmedberg recommended `subsessionCounter`.
> 
> I'd like to have `subsessionCounter`, `info.profileSubsessionCounter` and
> `creationDate`, which are useful for chaining subsessions longitudinally
> per-profile.

Is `creationDate` useful given that `subsessionStartDate` is already available?

Flags: needinfo?(dzeber)

Dave Zeber [:dzeber]

Comment 17

7 years ago

(In reply to Mark Reid [:mreid] from comment #16)
> Is `creationDate` useful given that `subsessionStartDate` is already
> available?

I've found `creationDate` really useful for chaining together event sequences, as it gives millisecond-resolution timestamps. It's probably best used as a diagnostic - for example, checking that Shield/TxP participants are sending the right types of pings in the right order. It can be more reliable than `profileSubsessionCounter`. It's something I'd want to have available when querying for experiment participants via the HBase API.

Flags: needinfo?(dzeber)

brendan c

Comment 18

7 years ago

(In reply to Dave Zeber [:dzeber] from comment #17)
> (In reply to Mark Reid [:mreid] from comment #16)
> > Is `creationDate` useful given that `subsessionStartDate` is already
> > available?
> 
> I've found `creationDate` really useful for chaining together event
> sequences, as it gives millisecond-resolution timestamps. It's probably best
> used as a diagnostic - for example, checking that Shield/TxP participants
> are sending the right types of pings in the right order. It can be more
> reliable than `profileSubsessionCounter`. It's something I'd want to have
> available when querying for experiment participants via the HBase API.

Hey Dave, I'm really curious about the circumstances in which creationDate is more reliable than profileSubsessionCounter.

The two gold standards for ordering sessions that we built into FHRv4 right at the start are:
1) profileSubsessionCounter, and
2) previousSubsessionId (which can be used to reconstruct a hopefully unambiguous session tree even in the case of branching profile histories)

Have you checked into whether the cases in which the timestamps work but profileSubsessionCounter fails are instances of history branching? If it's something else, we may need to file other bugs.

Also, if accurate subsession ordering is an important use case, then profileSubsessionCounter and previousSubsessionId should probably be added to main_summary as well.

Flags: needinfo?(dzeber)

Dave Zeber [:dzeber]

Comment 19

7 years ago

(In reply to brendan c from comment #18)
> Have you checked into whether the cases in which the timestamps work but
> profileSubsessionCounter fails are instances of history branching? If it's
> something else, we may need to file other bugs.

One of the main things is that profileSubsessionCounter restarts at 1 when the profile gets reset. This is probably the correct behaviour, but a non-negligible proportion of profiles have been reset. I've also seen some things like gaps in the count (as if subsessions were missing) and subsessions whose counter is out of sequence wrt subsessionStartDate (although this could also be clock weirdness). I've looked at this briefly in cases where I wanted longitudinal sequencing for an analysis, but I haven't done an in-depth study into prevalence or subsession ID chaining.
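
The sequencing checks described above can be sketched as follows. This is a minimal Python illustration, not the analysis that was actually run: the column names `creation_date` and `profile_subsession_counter`, the function name, and the sample rows are assumptions, and timestamps are assumed to be fixed-width UTC ISO 8601 strings so that string order matches time order.

```python
from typing import Dict, List, Tuple


def find_counter_anomalies(pings: List[Dict]) -> List[Tuple[str, int, int]]:
    """Order one profile's pings by creation date and flag counter irregularities."""
    ordered = sorted(pings, key=lambda p: p["creation_date"])
    anomalies = []
    for prev, cur in zip(ordered, ordered[1:]):
        a = prev["profile_subsession_counter"]
        b = cur["profile_subsession_counter"]
        if b == a + 1:
            continue  # the expected case: the counter advanced by one
        if b == 1:
            kind = "reset"  # counter restarted, e.g. after a profile reset
        elif b > a + 1:
            kind = "gap"  # one or more subsessions appear to be missing
        else:
            kind = "out_of_order"  # counter repeated or went backwards
        anomalies.append((kind, a, b))
    return anomalies


# Hypothetical rows for a single client, deliberately unsorted.
rows = [
    {"creation_date": "2017-05-01T10:00:00.000Z", "profile_subsession_counter": 1},
    {"creation_date": "2017-05-03T09:30:00.250Z", "profile_subsession_counter": 4},
    {"creation_date": "2017-05-02T08:15:00.000Z", "profile_subsession_counter": 2},
    {"creation_date": "2017-05-04T12:00:00.000Z", "profile_subsession_counter": 1},
]
found = find_counter_anomalies(rows)
```

Because the ordering comes from the timestamp, a skewed client clock would surface here as an "out_of_order" entry, so the two sources of error cannot be separated by this check alone.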

> Also, if accurate subsession ordering is an important use case, then
> probably profileSubsessionCounter, and previousSubsessionId should be added
> to main_summary as well.

profileSubsessionCounter is already there. I think that and creationDate should be sufficient. One thing with previousSubsessionId is that if we happen to be missing a subsession submission, the chain is broken.
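
To illustrate the broken-chain point, here is a rough Python sketch that walks `previousSubsessionId` links backwards from the latest subsession. The snake_case field names, the function name, and the sample data are hypothetical; the only claim is that a single missing submission stops the walk.

```python
from typing import Dict, List, Optional


def walk_back(latest_id: str, pings_by_id: Dict[str, Dict]) -> List[str]:
    """Follow previous_subsession_id links until a ping is missing or the chain ends."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = latest_id
    while current is not None and current in pings_by_id and current not in seen:
        seen.add(current)  # guard against accidental cycles in bad data
        chain.append(current)
        current = pings_by_id[current].get("previous_subsession_id")
    return chain


# Hypothetical history a <- b <- c <- d, where the ping for "b" was never received.
received = {
    "a": {"previous_subsession_id": None},
    "c": {"previous_subsession_id": "b"},
    "d": {"previous_subsession_id": "c"},
}
reachable = walk_back("d", received)
```

In this sample the walk stops at "c", so subsession "a" is never reached even though its ping was received. Sorting by `creationDate` would still place it.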

Flags: needinfo?(dzeber)

brendan c

Comment 20

7 years ago

(In reply to Dave Zeber [:dzeber] from comment #19)

Gotcha, makes sense re: profile resets.

> and
> subsessions whose counter is out of sequence wrt subsessionStartDate
> (although this could also be clock weirdness).

Hmm, given all the clock things we've seen over the years, I'd suspect clock weirdness before I'd cast suspicion on profileSubsessionCounter.

> One thing with previousSubsessionId is that if we happen to be
> missing a subsession submission, the chain is broken.

Yep, absolutely true. If someone wanted to look at chaining/branching in more depth, I guess they could use `longitudinal`.

Dave Zeber [:dzeber]

Comment 21

7 years ago

Not sure when the request window is closing, but I had another one: `environment.settings.searchCohort`. AFAIK this is not yet in main_summary. It's used to identify search partner test participants, which are typically small groups for which we need 100% resolution.

Mark Reid [:mreid]

Reporter

Comment 22

7 years ago

When this PR lands:
https://github.com/mozilla/telemetry-batch-view/pull/238

I think that will take care of all the additions requested in this bug. Thanks all!

Mark Reid [:mreid]

Reporter

Updated

7 years ago

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Assignee

Updated

2 years ago

Component: Datasets: Main Summary → General