Introduce a default_value property for scalars

Status

Product: Cloud Services
Component: Metrics: Pipeline
Status: RESOLVED WONTFIX
Reported: 8 months ago
Modified: 7 months ago

People

(Reporter: Dexter, Unassigned)

Tracking

(Blocks: 1 bug)

Version: unspecified
Points: ---

Firefox Tracking Flags

(firefox55 affected)

Details

(Whiteboard: [measurement:client:tracking])

(Reporter)

Description

8 months ago
As bug 1356181 shows, there's interest in having a default value for certain scalars to support use cases such as:

> "The intent is to be able to see what proportion of Firefox sessions sees at least one legacy `isindex` submission. I'm not sure if the telemetry usage allows comparison with the `true` case with all submissions considering that the non-`true` case of this scalar doesn't get submitted."

Without a default value it would be tricky, if possible at all, to get a proportion of users using feature X on TMO.

We should consider:

- Introducing the optional 'default_value' property in the scalar definition (its type depending on the underlying scalar type, e.g. boolean's default_value would be of boolean type, etc.)
- If a 'default_value' is present for a particular Scalar, send this if the scalar was not set.

If we add support for 'default_value', we should explicitly mention in the docs that using it means we're sending the scalar in EVERY ping, even if it wasn't set, and discourage its use unless there's a compelling reason to do so.
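
A minimal sketch of how the proposal could behave on the client side, assuming scalar definitions are available as a plain dictionary. The scalar names, the `KIND_TYPES` table, and both helper functions are hypothetical and only illustrate the two bullets above; they are not the actual Telemetry implementation.

```python
# Hypothetical scalar definitions; 'default_value' is the optional property
# proposed in this bug. Names and structure are assumptions for illustration.
SCALAR_DEFINITIONS = {
    "example.isindex_used": {"kind": "boolean", "default_value": False},
    "example.tab_count": {"kind": "uint"},  # no default: omitted when unset
}

# Assumed mapping from scalar kind to the Python type its default must have.
KIND_TYPES = {"boolean": bool, "uint": int, "string": str}


def validate_definitions(definitions):
    """Check that every default_value matches the type of its scalar kind."""
    for name, definition in definitions.items():
        if "default_value" not in definition:
            continue
        expected = KIND_TYPES[definition["kind"]]
        value = definition["default_value"]
        # bool is a subclass of int in Python, so reject True/False for uint.
        wrong_bool = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or wrong_bool:
            raise TypeError(f"default_value for {name} must be {expected.__name__}")


def build_scalar_payload(definitions, recorded):
    """Return the scalars to send: recorded values plus defaults for unset ones."""
    payload = dict(recorded)
    for name, definition in definitions.items():
        if name not in payload and "default_value" in definition:
            payload[name] = definition["default_value"]
    return payload
```

With these definitions, a session that never set either scalar would still send `example.isindex_used` as `False` in every ping, which is exactly the extra data volume the note above warns about.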
(Reporter)

Updated

8 months ago
Blocks: 1275517
Points: --- → 2
Priority: -- → P2
Whiteboard: [measurement:client]
(In reply to Alessio Placitelli [:Dexter] from comment #0)
> As bug 1356181 shows, there's interest in having a default value for certain
> scalars to support use cases such as:
> 
> > "The intent is to be able to see what proportion of Firefox sessions sees at least one legacy `isindex` submission. I'm not sure if the telemetry usage allows comparison with the `true` case with all submissions considering that the non-`true` case of this scalar doesn't get submitted."
> 
> Without a default value it would be tricky, if possible at all, to get a
> proportion of users using feature X on TMO.

I think we don't need to send more data to solve this; instead, this seems like a shortcoming of the current TMO model.
Just like we don't send a value when a count is never increased (implicit `0`), we can define an implicit default for bool scalars (`false`).

I think the solution should not be to send default values for everything, but to either:
- have TMO present the aggregates proportionally or
- have a standard measure to correlate this against (which is not necessarily exclusive of the above)
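
To illustrate the implicit-default idea, here is a rough sketch of an analysis-side computation that treats a missing boolean scalar as `false`. The ping layout (a `scalars` dictionary per ping) and the function name are assumptions made for this example.

```python
def proportion_true(pings, scalar_name):
    """Share of pings in which a boolean scalar is true.

    A ping that does not carry the scalar counts as an implicit False,
    so no default value has to be sent by the client.
    """
    if not pings:
        return 0.0
    hits = sum(
        1 for ping in pings
        if ping.get("scalars", {}).get(scalar_name, False) is True
    )
    return hits / len(pings)
```

The denominator here is the total number of pings, not just the pings that recorded the scalar, which is the proportional view the first option asks for.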

Is there currently anything blocking the use-cases on using re:dash or custom analysis?
Flags: needinfo?(alessio.placitelli)
(Reporter)

Comment 2

7 months ago
(In reply to Georg Fritzsche [:gfritzsche] [away Apr 13 - 18] from comment #1)
> (In reply to Alessio Placitelli [:Dexter] from comment #0)
> > As bug 1356181 shows, there's interest in having a default value for certain
> > scalars to support use cases such as:
> > 
> > > "The intent is to be able to see what proportion of Firefox sessions sees at least one legacy `isindex` submission. I'm not sure if the telemetry usage allows comparison with the `true` case with all submissions considering that the non-`true` case of this scalar doesn't get submitted."
> > 
> > Without a default value it would be tricky, if possible at all, to get a
> > proportion of users using feature X on TMO.
> 
> I think we don't need to send more data to solve this; instead, this seems
> like a shortcoming of the current TMO model.
> Just like we don't send a value when a count is never increased (implicit
> `0`), we can define an implicit default for bool scalars (`false`).
> 
> I think the solution should not be to send default values for everything,
> but to either:
> - have TMO present the aggregates proportionally or
> - have a standard measure to correlate this against (which is not
> necessarily exclusive of the above)
> 
> Is there currently anything blocking the use-cases on using re:dash or
> custom analysis?

No, nothing prevents consumers from using re:dash or custom jobs; however, both require some amount of work to get the same answers that one could easily get on TMO.

Frank, how difficult would it be to add the implicit value for boolean scalars on TMO (Are you the right person to ask?)?
Flags: needinfo?(alessio.placitelli) → needinfo?(fbertsch)
> Frank, how difficult would it be to add the implicit value for boolean
> scalars on TMO (Are you the right person to ask?)?

Hmm, it would be a bit of work, but doable. It seems to me, though, that this should happen earlier in the ingestion pipeline, i.e. Hindsight fills in implicit defaults before storing to S3.
Flags: needinfo?(fbertsch)

Comment 4

7 months ago
I would really prefer not to store them with defaults, because rewriting during ingestion can make future analysis harder.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> I would really prefer not to store them with defaults, because rewriting
> during ingestion can make future analysis harder.

I'm not sure what you mean. Analysis would remain unchanged; all this would add is a new option for future scalars (not existing scalars) to have a default value. Any analysis would then be able to correctly parse the statistics for these. Without stored defaults, every analysis needs to fill in the defaults on its own; with stored defaults, analysis should actually be easier.

In general, we have shied away from having aggregates do anything special with the data. We want it to be a straight aggregation of the raw data, because if it isn't, then when people dive in to answer questions using STMO/ATMO after looking at TMO, they get confused when the results don't line up. It just creates another footgun and "special case" for our pipeline.
Note that rewriting incoming data might have storage impact.
AFAIK, so far we have treated incoming data as immutable and left semantic changes to ETL or analysis jobs.

While this bug specifically discusses scalars, the underlying problem seems to be that TMO uses the model of "sessions this measurement was actively recorded in".
This problem exists for all probe types.

Can we take a step back and look at:
- what we need right now to solve current requests
- how to solve this better in the future for TMO
> - what we need right now to solve current requests

I don't think there should be any "immediate" action that differs from long-term action. My rationale is that this information *is* available in STMO, with a pretty straightforward query on Longitudinal (and soon, main_summary - I'm auto-adding all scalars there). I'm more than willing to help people write these queries.

> - how to solve this better in the future for TMO

We could add a pseudo-probe to aggregates.t.m.o that is simply the total number of pings seen along each set of dimensions. It would then be up to the application (e.g. the TMO front-end) to use it when displaying data, either as a denominator or as a count in its own right. The only caveat is that its name must not clash with any other probe.
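
A rough sketch of the pseudo-probe idea, assuming pings are dictionaries with dimension fields and a `scalars` map. The reserved name `__ping_count__`, the dimension names, and both functions are hypothetical; this is not how aggregates.t.m.o is actually implemented.

```python
from collections import Counter, defaultdict

# Hypothetical reserved name for the pseudo-probe; real probes must not use it.
PING_COUNT_PROBE = "__ping_count__"


def aggregate(pings, dimensions=("channel", "version")):
    """Count, per set of dimensions, all pings seen and the pings in which
    each boolean scalar was recorded as true."""
    aggregates = defaultdict(Counter)
    for ping in pings:
        key = tuple(ping.get(dimension) for dimension in dimensions)
        # The pseudo-probe is incremented for every ping, recorded scalar or not.
        aggregates[key][PING_COUNT_PROBE] += 1
        for name, value in ping.get("scalars", {}).items():
            if name == PING_COUNT_PROBE:
                raise ValueError("probe name clashes with the ping-count pseudo-probe")
            if value is True:
                aggregates[key][name] += 1
    return aggregates


def proportion(aggregates, key, probe):
    """Use the pseudo-probe as the denominator, as a front-end might."""
    total = aggregates[key][PING_COUNT_PROBE]
    return aggregates[key][probe] / total if total else 0.0
```

The aggregation stays a straight count of the raw data; only the front-end decides whether to divide by the ping count.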

Would be interesting to think about other probes we could create based on that idea - client counts, session hours, etc. But that's for a different day :)
(Reporter)

Comment 8

7 months ago
I moved this bug to the "Pipeline" component, as per comment 1.
Points: 2 → ---
Component: Telemetry → Metrics: Pipeline
Priority: P2 → --
Product: Toolkit → Cloud Services
Whiteboard: [measurement:client] → [measurement:client:tracking]
Version: Trunk → unspecified
Closing this in favor of bug 1353105. We will be able to query for opt-in and opt-out scalars there.
Status: NEW → RESOLVED
Last Resolved: 7 months ago
Resolution: --- → WONTFIX