Remove opt-in histograms from longitudinal dataset

RESOLVED FIXED

Status

Product: Cloud Services
Component: Metrics: Pipeline
Priority: P1
Severity: normal
Status: RESOLVED FIXED
Reported: 2 years ago
Modified: a year ago

People

(Reporter: rvitillo, Assigned: harter)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

User Story

Given that the longitudinal dataset is used as a representative dataset of our user population and that it's only a 1% sample, I think we should remove the opt-in measurements as they don't add value.

Attachments

(1 attachment)

(Assignee)

Updated

2 years ago
Assignee: nobody → rharter
Points: --- → 3
Priority: -- → P2
(Assignee)

Updated

2 years ago
Priority: P2 → P1
(Assignee)

Comment 1

2 years ago
Hey Alessio,

Looking at Histograms.json [1], I only see two histograms explicitly marked as opt-in [2]. Both of these appear to be for testing. If a histogram is not explicitly marked as opt-out, is it opt-in?

[1] https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Histograms.json
[2] https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Histograms.json#5796
Flags: needinfo?(alessio.placitelli)

Comment 2

(In reply to Ryan Harter [:harter] from comment #1)
> [...] If a histogram is not explicitly marked as opt-out is it opt-in?

Hey Ryan! Yes, if not specified, you can safely assume a histogram is "opt-in" (see [1]).

[1] - https://dxr.mozilla.org/mozilla-central/rev/86f702229e32c6119d092e86431afee576f033a1/toolkit/components/telemetry/histogram_tools.py#130
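
To make the default concrete, here is a minimal sketch of how definitions could be split by collection policy. It assumes each entry may carry a `releaseChannelCollection` key set to "opt-in" or "opt-out", and treats a missing key as "opt-in", per the answer above. The histogram names and the helper `split_by_collection` are hypothetical and not taken from Histograms.json or histogram_tools.py.

```python
import json

# Assumption: a definition without "releaseChannelCollection" is opt-in.
DEFAULT_COLLECTION = "opt-in"


def split_by_collection(definitions):
    """Return (opt_in_names, opt_out_names) for a dict of histogram definitions."""
    opt_in, opt_out = [], []
    for name, definition in sorted(definitions.items()):
        policy = definition.get("releaseChannelCollection", DEFAULT_COLLECTION)
        if policy == "opt-out":
            opt_out.append(name)
        else:
            opt_in.append(name)
    return opt_in, opt_out


# Hypothetical definitions for illustration only.
sample = json.loads("""
{
  "EXAMPLE_OPT_OUT": {"kind": "boolean", "releaseChannelCollection": "opt-out"},
  "EXAMPLE_EXPLICIT_OPT_IN": {"kind": "count", "releaseChannelCollection": "opt-in"},
  "EXAMPLE_UNMARKED": {"kind": "count"}
}
""")

opt_in, opt_out = split_by_collection(sample)
print("opt-in:", opt_in)
print("opt-out:", opt_out)
```

In this sketch the unmarked entry lands in the opt-in list alongside the explicitly marked one, which is why only two explicit "opt-in" markers show up when searching the file.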
Flags: needinfo?(alessio.placitelli)

Comment 3

(Commenting on User Story)
> Given that the longitudinal dataset is used as a representative dataset of
> our user population and that it's only a 1% sample, I think we should remove
> the opt-in measurements as they don't add value.

This seems like a potentially disruptive change. Shouldn't reasonable advance notice at least be given for this?
On the pre-release channels (where the opt-in measurements are collected from everyone by default) I would expect them to be used.
(Reporter)

Comment 4

2 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #3)
> (Commenting on User Story)
> > Given that the longitudinal dataset is used as a representative dataset of
> > our user population and that it's only a 1% sample, I think we should remove
> > the opt-in measurements as they don't add value.
> 
> This seems like a potentially disruptive change, there should probably at
> least reasonable advance notice be given for this? 

We should announce the intent of doing this on fhr-dev and fx-data-platform to see if there are any objections.

> On the pre-release channels (where the opt-in measurements are collected
> from everyone by default) i would expect them to be used.

In my experience we can't generally make statistically meaningful claims using only 1% of pre-release.
(Assignee)

Comment 5

2 years ago
> > This seems like a potentially disruptive change, there should probably at
> > least reasonable advance notice be given for this? 
> 
> We should announce the intent of doing this on fhr-dev and fx-data-platform
> to see if there are any objections.

I'll send out an email this afternoon.

> > On the pre-release channels (where the opt-in measurements are collected
> > from everyone by default) i would expect them to be used.
> 
> In my experience we can't generally make statistically meaningful claims
> using only 1% of pre-release.

Looks like the current dataset has ~250k clients in pre-release. What types of claims do we try to make with these data? It seems like we should be able to answer some questions with that number of users.
Flags: needinfo?(rvitillo)
(Reporter)

Comment 6

2 years ago
(In reply to Ryan Harter [:harter] from comment #5)
> Looks like the current dataset has ~250k clients in pre-release. What types
> of claims do we try to make with these data? It seems like we should be able
> to answer some questions with that number of users.

Right, but our users usually apply some filtering on top of that, which can quickly bring the number of eligible users down to something that isn't very interesting.

Furthermore, mixing opt-in and opt-out measurements makes self-serve analysis more error-prone: one could easily run a query based on an opt-in measure without knowing it's opt-in, and mistakenly conclude that the result applies to our population as a whole.
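
As a rough illustration of the filtering point, the sketch below starts from the ~250k pre-release clients mentioned in comment 5 and applies a few filters. The filter fractions are hypothetical, chosen only to show how quickly the count shrinks. The margin of error uses the standard normal approximation for a proportion, which is an assumption about the kind of claim being made.

```python
import math


def remaining_clients(total, filter_fractions):
    """Apply each filter's pass-through fraction in turn and round the result."""
    n = float(total)
    for fraction in filter_fractions:
        n *= fraction
    return round(n)


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a 95% normal-approximation interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


clients = 250_000            # pre-release clients in the 1% sample (comment 5)
filters = [0.10, 0.20, 0.05]  # hypothetical: e.g. one channel, one OS, one feature

n = remaining_clients(clients, filters)
print("clients after filtering:", n)
print("margin of error before: +/- %.2f%%" % (100 * margin_of_error(clients)))
print("margin of error after:  +/- %.2f%%" % (100 * margin_of_error(n)))
```

With these made-up fractions, 250k clients drop to a few hundred, and the worst-case margin of error grows from roughly 0.2 to roughly 6 percentage points.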
Flags: needinfo?(rvitillo)
(Assignee)

Comment 7

2 years ago
Keeping this bug updated: we've identified all users and queries that depend on these histograms. I've emailed these users and am waiting on a response.

Comment 8

Jumping in to say that I agree with this choice in the general case.  I think it would likely be safer for those of us creating queries to not mistakenly use opt-in probes. 

I do have certain probes which are only useful to me in Nightly or Aurora because I'm tracking developer tools there.  But I believe I'm going to be the outsider here and I can look for another solution.  Hopefully someone can help me. :)  Given the size of the Nightly and Aurora population it would actually be nice to have a larger sample set anyway.

Comment 9

2 years ago
Note that the original plan for longitudinal was to have separate longitudinals for certain subgroups, so I'd like us to continue to explore having a nightly-longitudinal with 100% of nightly, beta-longitudinal with 10% of beta, etc. I agree we should make this change for the current longitudinal because it's a footgun.
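
One way to picture per-channel longitudinals is a sampling rule keyed on channel. The sketch below is illustrative only: the nightly and beta rates come from this comment, the "default" entry stands in for the current 1% sample, and hashing the client ID with CRC32 modulo 100 is an assumption made for the example, not a description of how the pipeline samples. The names `SAMPLE_PERCENT`, `sample_bucket`, and `in_longitudinal` are hypothetical.

```python
import zlib

# Sampling rates in percent. "default" is a stand-in for any other channel.
SAMPLE_PERCENT = {"nightly": 100, "beta": 10, "default": 1}


def sample_bucket(client_id):
    """Map a client ID to a stable bucket in the range [0, 100)."""
    return zlib.crc32(client_id.encode("utf-8")) % 100


def in_longitudinal(client_id, channel):
    """Decide whether a client belongs in its channel's longitudinal sample."""
    percent = SAMPLE_PERCENT.get(channel, SAMPLE_PERCENT["default"])
    return sample_bucket(client_id) < percent


# Hypothetical client IDs, used only to show the relative sample sizes.
ids = ["client-%d" % i for i in range(10_000)]
for channel in ("nightly", "beta", "release"):
    kept = sum(in_longitudinal(c, channel) for c in ids)
    print(channel, kept)
```

Because the bucket depends only on the client ID, a client kept at a lower rate is also kept at every higher rate, so the samples nest.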
(Assignee)

Comment 10

2 years ago
Created attachment 8814944
https://github.com/mozilla/telemetry-batch-view/pull/148

Sounds like we want to move forward with this change and pursue the 100% pre-release longitudinal set. 

PR attached.
Attachment #8814944 - Flags: review?(rvitillo)
(Reporter)

Updated

2 years ago
Attachment #8814944 - Flags: review?(rvitillo) → review+
(Reporter)

Updated

2 years ago
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED

Comment 11

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #4)
> In my experience we can't generally make statistically meaningful claims
> using only 1% of pre-release.

Any particular reason we kept opt-in scalars? This argument applies to those as well.

Comment 12

(In reply to Frank Bertsch [:frank] from comment #11)
> (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #4)
> > In my experience we can't generally make statistically meaningful claims
> > using only 1% of pre-release.
> 
> Any particular reason we kept opt-in scalars? This argument applies to those
> as well.

Opt-in scalars should be removed as well.