Closed Bug 1594035 Opened 5 years ago Closed 4 years ago

The "unenroll" telemetry events are no longer registered in about:telemetry if a client unenrolls after updating the browser to an incremented version

Categories

(Firefox :: Normandy Client, defect, P1)

Desktop
All
defect

Tracking

()

VERIFIED FIXED
Firefox 72
Tracking Status
firefox-esr68 --- unaffected
firefox70 --- unaffected
firefox71 blocking verified
firefox72 --- verified

People

(Reporter: cmuntean, Assigned: mythmon)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [rca - coding])

Attachments

(1 file, 2 obsolete files)

[Notes]:

  • The issue is reproducible with all Normandy actions.
  • The issue is reproducible if a client unenrolls after updating Firefox Beta 70.0bx to the latest FIrefox Beta 71.0bx.
  • The issue is reproducible if a client unenrolls after updating Firefox Nightly 70.0a1 to the latest FIrefox Nightly 72.0a1.
  • The issue is reproducible if a client unenrolls after updating Firefox Nightly 71.0a1 to the latest FIrefox Nightly 72.0a1.
  • The issue is not reproducible when updating from Firefox Release 69 to 70.
  • The issue is also not reproducible when updating through beta versions or dot releases.
  • It seems to be a regression, but unfortunately we cannot find the regression range due to the fact that we need a Firefox update with a version change to reproduce the issue.

[Affected Platforms]:

  • All Windows
  • All Mac
  • All Linux

[Affected Versions]:

  • Firefox Beta 71.0b6
  • Firefox Nightly 71.0a1
  • Firefox Nightly 72.a1

[Prerequisites]:

  • Have a Firefox Beta or Nightly 70 version installed.
  • Have access to the stage delivery console.
  • Have an active multi-preferance recipe.
  • Have the security.content.signature.root_hash pref set to DB:74:CE:58:E4:F9:D0:9E:E0:42:36:BE:6C:C5:C4:F6:6A:E7:74:7D:C0:21:42:7A:03:BC:2F:57:0C:8B:9B:90.
  • Have the app.normandy.api_url pref set to https://stage.normandy.nonprod.cloudops.mozgcp.net/api/v1.
  • Have the app.normandy.dev_mode pref set to true.
  • Have the app.normandy.logging.level pref set to 0.
  • Have the services.settings.server pref set to https://settings.stage.mozaws.net/v1.
  • Create a new pref named cosmin and set it as true.

[Steps to reproduce]:

  1. Open the browser with the profile from prerequisites.
  2. Update the browser to the latest version.
  3. Navigate to "about:studies" page and remove the active study.
  4. Navigate to "about:telemetry#events-tab" page.
  5. Search for an unenroll event.

[Expected result]:

  • An “unenroll” event is successfully registered.

[Actual result]:

  • No “unenroll” event is registered.

[Additional Notes]:

  • The logs are correctly displayed in the browser console, but the events are not registered in "about:telemetry" page.
  • The issue is reproducible for all the unenroll events: "recipe-not-seen", "user-preference-changed", "individual-opt-out", “user-preference-changed-sideload" and "general-opt-out".
  • We are not sure if the telemetry event is sent to the server and it is not shown in about:telemetry page or the event is not sent at all.
  • Here is a link with the screen recording of the issue: link.

If you remove the upgrade from the equation, and only run one of the problematic versions, does this still happen? That is, if you enroll and unenroll on Beta 71 or Nightly 72, do you get an unenroll event? Do you get the enroll event? Since Normandy doesn't have any specific code to handle upgrades, I suspect that the upgrade is not necessary for the STR, which would help testing this.

Flags: needinfo?(cosmin.muntean)

I think I've figured this out. The issue is that enrollments that begin on Fx <= 71 don't include enrollmentIds, and the code in Fx >= 72 expects enrollmentIds. In particular, Telemetry doesn't like us sending enrollmentId: undefined.

This wasn't caught in tests because the Telemetry APIs in my experience never fail by throwing an exception, instead they print any problems to the console and return.

My proposed solution (patch up soon) is to make sure that existing experiments get a dummy enrollment ID that Telemetry will be satisfied with but it will be clear to us that it isn't a valid enrollment ID. I expect our Telemetry analysis tooling will be configured the same. The filler value I plan to use is __NO_ENROLLMENT_ID__, which matches a pattern we've used for other filler IDs.

Flags: needinfo?(cosmin.muntean)

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression
Priority: -- → P1
See Also: → 1594201
Regressed by: 1555176

Having null or undefined values for enrollment IDs causes trouble with
Telemetry. The easiest approach is to assign all objects that could have an
enrollment ID but don't a filler value. This value will be ignored by tooling
and should be fairly obvious to humans reading the code.

Any newly created studies and rollouts will have enrollment IDs set. The
migrations introduced here will ensure that any old studies and rollouts get a
value. Therefor all of them will have an enrollment ID that is non-null. This
simplifies the code that deals with enrollment IDs, instead of scattering null
checks all around the code.

Assignee: nobody → mcooper
Status: NEW → ASSIGNED
Attachment #9107016 - Attachment description: Bug 1594035 - Add filler enrollment IDs to old Normandy studies and rollouts. r=Gijs → Bug 1594035 - Add filler enrollment IDs to old Normandy studies and rollouts. r=Gijs!
Pushed by mcooper@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0c672cce8019
Add filler enrollment IDs to old Normandy studies and rollouts. r=Gijs
Flags: needinfo?(mcooper)

Without this patch, the UI finished listener is only registered after the
migrations are completely run. If the UI finishes intializating before this
happens, then our listener will never fire, and we'll miss the event. This
causes initialization to never proceed.

With this patch, we register the listener synchronously (before the first
await) and so are guaranteed to receive the event, no matter how long the
migrations take to run.

Attachment #9107016 - Attachment description: Bug 1594035 - Add filler enrollment IDs to old Normandy studies and rollouts. r=Gijs! → Bug 1594035 p2 - Add filler enrollment IDs to old Normandy studies and rollouts. r=Gijs!
Pushed by mcooper@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6f93019be0d6
p1 - Register UI listener sooner in Normandy init to avoid race conditions r=Gijs
https://hg.mozilla.org/integration/autoland/rev/066674b8313d
p2 - Add filler enrollment IDs to old Normandy studies and rollouts. r=Gijs
Attachment #9107016 - Attachment description: Bug 1594035 p2 - Add filler enrollment IDs to old Normandy studies and rollouts. r=Gijs! → Bug 1594035 p2 - Add filler enrollment IDs to old Normandy studies and rollouts. r=Gijs
Pushed by mcooper@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ab451796f43a
p1 - Register UI listener sooner in Normandy init to avoid race conditions r=Gijs
https://hg.mozilla.org/integration/autoland/rev/b4da4b5e0781
p2 - Add filler enrollment IDs to old Normandy studies and rollouts. r=Gijs

Backed out for xpcshell failures at toolkit/modules/tests/xpcshell/test_firstStartup.js

Backout: https://hg.mozilla.org/integration/autoland/rev/20560f3195c88bad767fb1fb844605588425a73c

Failure push: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=b4da4b5e07816cb653efe702d3f1819999b2fc08

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=278105742&repo=autoland&lineNumber=5925

task 2019-11-25T23:47:32.812Z] 23:47:32 INFO - TEST-PASS | toolkit/modules/tests/xpcshell/test_firstStartup.js | test_timeout - [test_timeout : 46] 2 == 2
[task 2019-11-25T23:47:32.812Z] 23:47:32 INFO - (xpcshell/head.js) | test run_next_test 1 finished (2)
[task 2019-11-25T23:47:32.812Z] 23:47:32 INFO - (xpcshell/head.js) | test run_next_test 2 pending (2)
[task 2019-11-25T23:47:32.812Z] 23:47:32 INFO - (xpcshell/head.js) | test test_timeout finished (2)
[task 2019-11-25T23:47:32.812Z] 23:47:32 INFO - "CONSOLE_MESSAGE: (warn) [JavaScript Warning: "firstStartup.elapsed - Truncating float/double number."]"
[task 2019-11-25T23:47:32.812Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: Ignoring duplicate observer.: file z:/build/build/src/modules/libpref/Preferences.cpp, line 2696
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - (xpcshell/head.js) | test run_next_test 2 finished (1)
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - exiting test
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: Ignoring duplicate observer.: file z:/build/build/src/modules/libpref/Preferences.cpp, line 2696
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: Ignoring duplicate observer.: file z:/build/build/src/modules/libpref/Preferences.cpp, line 2696
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: Ignoring duplicate observer.: file z:/build/build/src/modules/libpref/Preferences.cpp, line 2696
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: Ignoring duplicate observer.: file z:/build/build/src/modules/libpref/Preferences.cpp, line 2696
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: This method is lossy. Use GetCanonicalPath !: file z:/build/build/src/xpcom/io/nsLocalFileWin.cpp, line 3140
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: This method is lossy. Use GetCanonicalPath !: file z:/build/build/src/xpcom/io/nsLocalFileWin.cpp, line 3140
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: This method is lossy. Use GetCanonicalPath !: file z:/build/build/src/xpcom/io/nsLocalFileWin.cpp, line 3140
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - "CONSOLE_MESSAGE: (error) [JavaScript Error: "NetworkError when attempting to fetch resource." {file: "resource://services-settings/Utils.jsm" line: 109}]
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - fetchLatestChanges@resource://services-settings/Utils.jsm:109:28
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - sync@resource://services-settings/RemoteSettingsClient.jsm:401:37
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - get@resource://services-settings/RemoteSettingsClient.jsm:354:22
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - asyncloadRecipes@resource://normandy/lib/RecipeRunner.jsm:344:51
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - run@resource://normandy/lib/RecipeRunner.jsm:296:35
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - asyncinit@resource://normandy/lib/RecipeRunner.jsm:120:18
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - asyncfinishInit@resource://normandy/Normandy.jsm:129:24
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - asyncinit/<@resource://normandy/Normandy.jsm:65:54
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - init@resource://gre/modules/FirstStartup.jsm:59:19
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - test_success@Z:/task_1574722640/build/tests/xpcshell/tests/toolkit/modules/tests/xpcshell/test_firstStartup.js:14:16
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - run_next_test/_run_next_test/<@Z:\task_1574722640\build\tests\xpcshell\head.js:1567:22
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - _run_next_test@Z:\task_1574722640\build\tests\xpcshell\head.js:1567:38
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - run@Z:\task_1574722640\build\tests\xpcshell\head.js:735:9
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - _do_main@Z:\task_1574722640\build\tests\xpcshell\head.js:246:6
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - _execute_test@Z:\task_1574722640\build\tests\xpcshell\head.js:573:5
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - @-e:1:1
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - "
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: Workers don't support the 'mem.mem.' preference!: file z:/build/build/src/dom/workers/RuntimeService.cpp, line 547
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - "CONSOLE_MESSAGE: (error) [JavaScript Error: "IndexedDB UnknownErr: ActorsParent.cpp:12712"]"
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - "CONSOLE_MESSAGE: (error) [JavaScript Error: "IndexedDB UnknownErr: ActorsParent.cpp:12712"]"
[task 2019-11-25T23:47:32.815Z] 23:47:32 WARNING - TEST-UNEXPECTED-FAIL | toolkit/modules/tests/xpcshell/test_firstStartup.js | test_timeout - [test_timeout : 265] A promise chain failed to handle a rejection: A request was aborted, for example through a call to IDBTransaction.abort. - stack: Transaction/this._completionPromise</transaction.onabort@resource://gre/modules/IndexedDB.jsm:254:15
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - _execute_test@Z:\task_1574722640\build\tests\xpcshell\head.js:661:19
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - @-e:1:1
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - Rejection date: Mon Nov 25 2019 23:47:32 GMT+0000 (Greenwich Mean Time) - false == true
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - resource://testing-common/PromiseTestUtils.jsm:assertNoUncaughtRejections:265
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - Z:\task_1574722640\build\tests\xpcshell\head.js:_execute_test:668
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - -e:null:1
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - exiting test

Pascal, I want to flag this bug as a problem for Firefox 71 release. If this makes it to release, any active Normandy experiments will be broken, and we can't fix it with a dot release. This is because the data for the experiment will fail to be sent, and we'll have a large gap in any experiment affected. Given the problems I've had landing the above patch, I'm going to work on another strategy that should be less invasive and easier to land.

Once the new patch is ready, I'm going to request to uplift that patch to 71. I hope to have it ready by early tomorrow. I realize this is extremely late, and I'm sorry about that.

Flags: needinfo?(pascalc)

(In reply to Michael Cooper [:mythmon] from comment #12)

Pascal, I want to flag this bug as a problem for Firefox 71 release. If this makes it to release, any active Normandy experiments will be broken, and we can't fix it with a dot release. This is because the data for the experiment will fail to be sent, and we'll have a large gap in any experiment affected. Given the problems I've had landing the above patch, I'm going to work on another strategy that should be less invasive and easier to land.

Once the new patch is ready, I'm going to request to uplift that patch to 71. I hope to have it ready by early tomorrow. I realize this is extremely late, and I'm sorry about that.

OK, that means a RC2. We need the uplift request with a patch that grafts cleanly to mozilla-release as soon as possible so as to build it. Ideally tomorrow. Thanks.

Flags: needinfo?(pascalc)

Can we backout bug 1555176 instead of rushing a fix?

The new patch (comment 15) I've prepared is much simpler than the previous one, and I think it is much easier to land than the previous patches in this bug. It's targeted at release, and should apply cleanly to release, beta, and central. I've started a try run with the release version of the patch.

(In reply to Pascal Chevrel:pascalc from comment #14)

Can we backout bug 1555176 instead of rushing a fix?

I think that backing out bug 1555176 would fix the problem. I think that it is a similar amount of work and chance of success to back it out or to take the patch above. Personally I'd prefer the new patch instead of a backout.

The problems with the original patch (Bug 1594035 p2 - Add filler enrollment IDs to old Normandy studies and rollouts) was a race condition between the UI starting and Normandy's migrations running. With either a backout or the new patch (Bug 1594035 - Don't send null enrollmentIds for Normandy telemetry), that underlying problem still exists, it just doesn't show up in our test suite. I'll be filing a separate bug to fix that, regardless of if we back out feature or take the new patch.

Flags: needinfo?(mcooper)

Comment on attachment 9111770 [details]
Bug 1594035 - Don't send null enrollmentIds for Normandy telemetry r=rhelmer

Beta/Release Uplift Approval Request

  • User impact if declined: Experiments currently active in Firefox 70 will stop sending Telemetry data, effectively breaking all active experiments. Users will be enrolled in experiments for no reason.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: The STR from comment 0 still apply
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): The change itself is small and simple. If it is validated by manual testing, it is likely to work in the real world. On the other hand, it is very late in the train cycle, and so any change is risky.
  • String changes made/needed: None
Attachment #9111770 - Flags: approval-mozilla-release?
Pushed by mcooper@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ab751affd303
Don't send `null` enrollmentIds for Normandy telemetry r=rhelmer
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 72
Flags: qe-verify+
QA Whiteboard: [qa-triaged]

We have verified that the unenroll pings are now correctly displayed in about:telemetry for multi-preference-experiment, branched-addon-study, preference rollback, and addon rollback if users are unenrolled after a version change. The "__NO_ENROLLMENT_ID__" is displayed as the enrollmentId for each of the pings. We have verified this on Windows 10 and macOS 10.14.6.

However, we have not managed to verify the scenario from comment 0 due to the fact that the Nighty build where the fix has landed is not yet available for a regular update. We have created profiles using Firefox Release 70, enrolled in the experiments/rollouts, opened the same profile with a Nightly build downloaded from comment 18 and unenrolled in that build.

I'm marking the issue as verified in order to move things along, we will verify using the natural update method when the build will be available (most likely tomorrow) and follow-up once this is done.

Status: RESOLVED → VERIFIED
Flags: qe-verify+

Comment on attachment 9111770 [details]
Bug 1594035 - Don't send null enrollmentIds for Normandy telemetry r=rhelmer

This is our only blocker for the release, the patch has tests and is on Nightly, let's take it in a RC2. Thanks.

Attachment #9111770 - Flags: approval-mozilla-release? → approval-mozilla-release+

Since the fix landed in the latest Nightly build, we have also verified that the unenroll events are now correctly displayed in about:telemetry after updating from an older version of Firefox Nightly to the latest. We have verified this on Windows 10 and macOS 10.14.6.

We will verify this once again as soon as an RC build with the fix is available.

We have also verified that the unenroll events are now correctly displayed in about:telemetry after updating from an older version of Firefox Beta to the latest version from the release-localtest channel. We have verified this on Windows 10 and macOS 10.14.6.

See Also: → 1601111
Attachment #9110121 - Attachment is obsolete: true
Attachment #9107016 - Attachment is obsolete: true

This bug has been identified as part of a pilot on determining root causes of blocking and dot release drivers.

It needs a root-cause set for it. Please see the list at https://docs.google.com/document/d/1FFEGsmoU8T0N8R9kk-MXWptOPtXXXRRIe4vQo3_HgMw/.

Add the root cause as a whiteboard tag in the form [rca - <cause> ] and remove the rca-needed keyword.

If you have questions, please contact :tmaity.

Keywords: rca-needed

I've added the root cause of "coding". I think that the root cause here is an incompatibility between the design of Normandy and the design of Telemetry. Normandy's process treats missing-or-incorrect telemetry as something that should cause tests to fail, and the Telemetry system does not. Both of these are by design, and they sometimes conflict.

Normandy's tests often mock out Telemetry in order to verify that the correct event is being generated, because it is simpler than reading the actual generated events. In this case, none of the tests that handled the case where an enrollment ID was missing both a) did not mock Telemetry, and b) verified the event that was actually sent. If this had been the case, we would have noticed this as a test failure, since the event was rejected by Telemetry.

From a Normandy-point of view, it would helpful to be able to put Telemetry in a mode where invalid Telemetry would throw and error instead of logging a message. This could be used in Normandy's test suite to help verify our expectations. Alternatively, perhaps Normandy could check for the messages that Telemetry logs.

Keywords: rca-needed
Whiteboard: [rca - coding]

From a Normandy-point of view, it would helpful to be able to put Telemetry in a mode where invalid Telemetry would throw and error instead of logging a message.

Tagging in a bug about this that I'd stumbled across before.

See Also: → 1542891
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: