Closed Bug 1274008 Opened 8 years ago Closed 8 years ago

In longitudinal, there are probably-incorrect results in settings.user_prefs

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Assigned: harter)

References

Details

https://sql.telemetry.mozilla.org/queries/386/source This is a probably-incorrect result. In particular it's very unlikely that many people have customized browser.startup.homepage but absolutely nobody has customized browser.startup.page. Similarly, dougt was trying to look for the frequency of privacy.trackingprotection.enabled and that showed zero users either, which is not reasonable. I've verified that my pings in about:telemetry include browser.startup.page customizations in the settings.userPrefs block.
Flags: needinfo?(rvitillo)
Priority: -- → P2
Flags: needinfo?(rvitillo)
Anthony, could you please look into this?
Flags: needinfo?(azhang9)
In previous versions, avro-parquet didn't support unions in map values, so the derived streams only kept track of settings that have string values. `browser.startup.page` is generally an integer, and therefore doesn't show up. Since we upgraded it recently, here's a PR that adds support for non-string pref values: https://github.com/mozilla/telemetry-batch-view/pull/72
Flags: needinfo?(azhang9)
Ryan, can you please take this? It's coming up as a recurring footgun for developers.
Flags: needinfo?(rharter)
Assignee: nobody → rharter
Flags: needinfo?(rharter)
Sure, let me take a look.
Priority: P2 → P1
Points: --- → 3
I got a chance to review this issue and I have a path forward. However, I will have to whitelist each userPref field with a type. Identifying the types will take some analysis. I'll keep this thread updated with my progress, expect an update within a week.
Where will your list be kept? Because as we add/change the list of prefs that are sent, we really should avoid having people update another file somewhere else. Would it be worth the complexity to make an EnvironmentPrefs.yaml file to drive both TelemetryEnvironment collection and validation in longitudinal?
> Where will your list be kept? I'll create a struct in the LongitudianlView code[0]. The struct will enumerate a subset of all userPref fields, roughly similar to this default_search_engine_data struct[1]. > Because as we add/change the list of prefs that > are sent, we really should avoid having people update another file somewhere > else. If we want to dedicate some resources to a more elegant solution, we should go through the parquet-avro upgrade blocking this PR [2]. This would allow us to import all fields automagically, without the need for a whitelist. This will probably take some time, but it would save us from some boilerplate code. > Would it be worth the complexity to make an EnvironmentPrefs.yaml file to > drive both TelemetryEnvironment collection and validation in longitudinal? Where would that yaml file be kept? Adding a dependency on an external yaml file makes me uneasy. For example, a developer making changes to that EnvironmentPrefs.yaml file now needs to be aware that they can break the telemetry-batch-view build. [0] https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/Longitudinal.scala#L259 [1] https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/Longitudinal.scala#L243 [2] https://github.com/mozilla/telemetry-batch-view/pull/72
Flags: needinfo?(benjamin)
Presumably in mozilla-central next to Histograms.json and Scalars.yaml. Those are also imported and used by telemetry-batch-view, right?
Flags: needinfo?(benjamin)
Yes, you are correct. Let me take a look and see how that would work.
On the client-side, longer-term, i would like to record the user prefs (and most of the environment) in Scalars using the Scalars API. A EnvironmentPrefs.yaml or so would be a good step in that direction, i'd be happy to help with that.
Alternatively we could treat all values as strings, and use casts at query time, while this data moves under Scalars and/or Spark starts using a recent parquet-avro library.
I don't mind storing these as scalars, but I do think that many of them should continue to trigger a subsession split, hence the reason they are in the environment block in the first place.
The plan is to not move anything to a different part of the ping. This is just about having consistent semantics for the data. The first step here would be to use parse_scalars.py [0] (with possibly more restrictive settings) from another script to parse a EnvironmentPrefs.yaml and generate a JS or JSON file from that. 0: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/parse_scalars.py
I have an immediate fix available here[0]. Currently, the longitudinal table expects all user pref fields to be strings and discards all non-string values. This change converts numerics and bools to strings which should allow the missing fields to flow through. [0] https://github.com/mozilla/telemetry-batch-view/pull/142
This should now be fixed. Let's open a separate bug if we want to pursue moving these measures into the Scalars API.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
See Also: → 1330856
Product: Cloud Services → Cloud Services Graveyard
You need to log in before you can comment on or make changes to this bug.