Some release clients are sending prerelease probes

RESOLVED FIXED in Firefox 66

Status

()

defect
RESOLVED FIXED
7 months ago
3 months ago

People

(Reporter: tdsmith, Assigned: chutten)

Tracking

63 Branch
mozilla67
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox65 wontfix, firefox66 fixed, firefox67 fixed)

Details

Attachments

(2 attachments)

:wbeard noticed that some clients in the release channel are sending the MEMORY_TOTAL probe, which was (before Bug 1511918, which won't land until 66) supposed to be limited to prerelease only.

We're consistently receiving a few hundred pings a day from clients that represent to be part of the release population but are sending us values for MEMORY_TOTAL and other prerelease-only probes.

dbfs:/tdsmith/weird_pings.json contains release channel pings containing MEMORY_TOTAL from a single day during the 63 release, extracted by https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/52325/command/52339.

:chutten pointed out that we used to see O(millions) of these clients/day, so the step down to O(hundreds) is itself interesting: https://mzl.la/2Q7aEYC

This doesn't block anything but is a little unnerving!
We have a pref for overriding the reported channel (toolkit.telemetry.overrideUpdateChannel) and another test preference (toolkit.telemetry.testing.overridePreRelease) for overriding the release/prerelease split, so maybe they're some possible culprits.

P1 so it shows in my list at the top of 2019.
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Priority: -- → P1

Giving 2 points for the analysis work.

Points: --- → 2

The change from O(millions) to O(hundreds) of pings containing MEMORY_TOTAL per day on release happened in Firefox 58. Not coincidentally, that's when bug 1406391 locked extended data collection to prerelease channels only.

(Prior to that users could opt in to sending us more data on release, and the proportion of users who did so roughly lines up with the numbers we see on TMO.)

But as for those pesky hundreds of users... Well, we know from past bugs that enabling MEMORY_* probes on release channel isn't as simple as changing the definition. There's also a runtime check to ensure that you're able to record the extended data set.

So whatever's happening, it isn't some weird burp around release channels and it isn't a fault in how the definitions aren't being properly applied. No, Telemetry.canRecordExtended must be true at the point when the recording happens.

Now, the logic about when canRecordExtended is true is a little complex. If MOZ_UPDATE_CHANNEL is one of [nightly, aurora, beta] then it's true. If it's an unofficial build (MOZILLA_OFFICIAL is falsey) then we also allow the dev default channel. ("official" builds with default channel exist for some Linux distros. We want to exclude them because we don't know if they're something we can consider "pre-release" or not). If MOZ_UPDATE_CHANNEL is release but the channel you get updates from is beta then you're running a Release Candidate, which is pre-release, so it's true then, too. We also have a bool pref toolkit.telemetry.testing.overridePreRelease which, if true, means canRecordExtended is true.

We also have a pref toolkit.telemetry.overrideUpdateChannel that can be set to any string you want us to report as your channel. So if someone has that set to "release", but is actually running a pre-release build, we'll report "release" but allow it to record pre-release data.

Also, someone could just go in and type Telemetry.canRecordExtended = true into a privileged console and that'd do it, too. Most commonly this is done in tests.

So, in short, the ways you can report "release" and still submit extended data like MEMORY_TOTAL are:

  1. Be on "release" and set canRecordExtended to true manually or as part of a test suite.
  2. Be on not-release and either
    1. Set toolkit.telemetry.testing.overridePreRelease to true
    2. Set toolkit.telemetry.overrideUpdateChannel to release

Unfortunately we don't currently report the state of these prefs in Telemetry so we can't check if those hundreds of profiles are setting these test prefs. We can, though, and I think that's the obvious next step for this bug.

(( I wonder if there's any way to determine if canRecordExtended is being set during a test. I suppose we could have a scalar that counted how often that subsession SetCanRecordExtended was called... ))

Oh, in the pile of "definitely not the problem" is if a profile collected the data in one channel then switched channels before sending the ping. "main" pings always have the channel of the collecting session, not the sending session.

(( If the measurements were on the "crash" ping then we'd need to consider this and account for the value of hasCrashEnvironment. ))

We're seeing some weirdness in the data that might be explained by these
prefs being set to some non-default value.

Attachment #9038047 - Flags: review?(rrayborn)
Comment on attachment 9038047 [details]
data collection review request

General Notes: Seems low risk and straightforward, good sleuthing here!

1) Is there or will there be **documentation** that describes the schema for the ultimate data set available publicly, complete and accurate?
Yes, pretty clear in Bug 1514031 plus TelemetryEnvironment.jsm is quite parsable

2) Is there a control mechanism that allows the user to turn the data collection on and off?
Yes, Telemetry Defaults

3) If the request is for permanent data collection, is there someone who will monitor the data over time?
6 months collection by chutten

4) Using the category system of data types (https://wiki.mozilla.org/Firefox/Data_Collection), what collection type of data do the requested measurements fall under?
Category 1, only what 2 low risk prefs are user set

5) Is the data collection request for default-on or default-off?
Telemetry Defaults

6) Does the instrumentation include the addition of any *new* identifier* (whether anonymous or otherwise; e.g., username, random IDs, etc.  See the appendix for more details)?
No

7) Is the data collection covered by the existing Firefox privacy notice?
Yes, standard Telemetry

8) Does there need to be a check-in in the future to determine whether to renew the data?
No
Attachment #9038047 - Flags: review?(rrayborn) → review+
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cd4f8f60d10c
Record the value of some telemetry testing prefs. r=janerik
Status: ASSIGNED → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67

Comment on attachment 9038042 [details]
Bug 1514031 - Record the value of some telemetry testing prefs. r?janerik

Beta/Release Uplift Approval Request

Feature/Bug causing the regression

None

User impact if declined

We'll have to wait longer to get the information about whether or not these prefs account for a data anomaly. (see bug 1521597)

Is this code covered by automated tests?

No

Has the fix been verified in Nightly?

Yes

Needs manual test from QE?

No

If yes, steps to reproduce

List of other uplifts needed

None

Risk to taking this patch

Low

Why is the change risky/not risky? (and alternatives if risky)

Only adding two prefs to the well-understood userPrefs Telemetry Environment block.

String changes made/needed

Attachment #9038042 - Flags: approval-mozilla-beta?

Comment on attachment 9038042 [details]
Bug 1514031 - Record the value of some telemetry testing prefs. r?janerik

This should help the telemetry team clarify some anomalies.
OK for uplift to beta 7.

Attachment #9038042 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Analysis in bug 1521597 showed 0.0003% of pings from Release 66 sending prerelease data, none having their channels overridden by these test prefs.

Reopening, clearing fields so it shows in triage.

Assignee: chutten → nobody
Status: RESOLVED → REOPENED
Points: 2 → ---
Priority: P1 → --
Resolution: FIXED → ---

Moving this investigation to the experts with bug 1544028

Status: REOPENED → RESOLVED
Closed: 6 months ago3 months ago
Resolution: --- → FIXED
Assignee: nobody → chutten
You need to log in before you can comment on or make changes to this bug.