The "unenroll" telemetry events are no longer registered in about:telemetry if a client unenrolls after updating the browser to an incremented version
Categories
(Firefox :: Normandy Client, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr68 | --- | unaffected |
firefox70 | --- | unaffected |
firefox71 | blocking | verified |
firefox72 | --- | verified |
People
(Reporter: cmuntean, Assigned: mythmon)
References
(Regression)
Details
(Keywords: regression, Whiteboard: [rca - coding])
Attachments
(1 file, 2 obsolete files)
47 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-release+
|
Details | Review |
[Notes]:
- The issue is reproducible with all Normandy actions.
- The issue is reproducible if a client unenrolls after updating Firefox Beta 70.0bx to the latest FIrefox Beta 71.0bx.
- The issue is reproducible if a client unenrolls after updating Firefox Nightly 70.0a1 to the latest FIrefox Nightly 72.0a1.
- The issue is reproducible if a client unenrolls after updating Firefox Nightly 71.0a1 to the latest FIrefox Nightly 72.0a1.
- The issue is not reproducible when updating from Firefox Release 69 to 70.
- The issue is also not reproducible when updating through beta versions or dot releases.
- It seems to be a regression, but unfortunately we cannot find the regression range due to the fact that we need a Firefox update with a version change to reproduce the issue.
[Affected Platforms]:
- All Windows
- All Mac
- All Linux
[Affected Versions]:
- Firefox Beta 71.0b6
- Firefox Nightly 71.0a1
- Firefox Nightly 72.a1
[Prerequisites]:
- Have a Firefox Beta or Nightly 70 version installed.
- Have access to the stage delivery console.
- Have an active multi-preferance recipe.
- Have the
security.content.signature.root_hash
pref set toDB:74:CE:58:E4:F9:D0:9E:E0:42:36:BE:6C:C5:C4:F6:6A:E7:74:7D:C0:21:42:7A:03:BC:2F:57:0C:8B:9B:90
. - Have the
app.normandy.api_url
pref set tohttps://stage.normandy.nonprod.cloudops.mozgcp.net/api/v1
. - Have the
app.normandy.dev_mode
pref set totrue
. - Have the
app.normandy.logging.level
pref set to0
. - Have the
services.settings.server
pref set tohttps://settings.stage.mozaws.net/v1
. - Create a new pref named
cosmin
and set it astrue
.
[Steps to reproduce]:
- Open the browser with the profile from prerequisites.
- Update the browser to the latest version.
- Navigate to "about:studies" page and remove the active study.
- Navigate to "about:telemetry#events-tab" page.
- Search for an unenroll event.
[Expected result]:
- An “unenroll” event is successfully registered.
[Actual result]:
- No “unenroll” event is registered.
[Additional Notes]:
- The logs are correctly displayed in the browser console, but the events are not registered in "about:telemetry" page.
- The issue is reproducible for all the unenroll events: "recipe-not-seen", "user-preference-changed", "individual-opt-out", “user-preference-changed-sideload" and "general-opt-out".
- We are not sure if the telemetry event is sent to the server and it is not shown in about:telemetry page or the event is not sent at all.
- Here is a link with the screen recording of the issue: link.
Assignee | ||
Comment 1•5 years ago
|
||
If you remove the upgrade from the equation, and only run one of the problematic versions, does this still happen? That is, if you enroll and unenroll on Beta 71 or Nightly 72, do you get an unenroll event? Do you get the enroll event? Since Normandy doesn't have any specific code to handle upgrades, I suspect that the upgrade is not necessary for the STR, which would help testing this.
Assignee | ||
Comment 2•4 years ago
|
||
I think I've figured this out. The issue is that enrollments that begin on Fx <= 71 don't include enrollmentIds, and the code in Fx >= 72 expects enrollmentIds. In particular, Telemetry doesn't like us sending enrollmentId: undefined
.
This wasn't caught in tests because the Telemetry APIs in my experience never fail by throwing an exception, instead they print any problems to the console and return.
My proposed solution (patch up soon) is to make sure that existing experiments get a dummy enrollment ID that Telemetry will be satisfied with but it will be clear to us that it isn't a valid enrollment ID. I expect our Telemetry analysis tooling will be configured the same. The filler value I plan to use is __NO_ENROLLMENT_ID__
, which matches a pattern we've used for other filler IDs.
Comment 3•4 years ago
|
||
Bugbug thinks this bug is a regression, but please revert this change in case of error.
Updated•4 years ago
|
Assignee | ||
Comment 4•4 years ago
|
||
Having null or undefined values for enrollment IDs causes trouble with
Telemetry. The easiest approach is to assign all objects that could have an
enrollment ID but don't a filler value. This value will be ignored by tooling
and should be fairly obvious to humans reading the code.
Any newly created studies and rollouts will have enrollment IDs set. The
migrations introduced here will ensure that any old studies and rollouts get a
value. Therefor all of them will have an enrollment ID that is non-null. This
simplifies the code that deals with enrollment IDs, instead of scattering null
checks all around the code.
Assignee | ||
Updated•4 years ago
|
Updated•4 years ago
|
Pushed by mcooper@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0c672cce8019 Add filler enrollment IDs to old Normandy studies and rollouts. r=Gijs
Comment 6•4 years ago
|
||
Backed out changeset 0c672cce8019 (bug 1594035) for Browser-chrome failures in browser/components/tests/browser/browser_urlbar_matchBuckets_migration60.js. CLOSED TREE
Log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=275657013&repo=autoland&lineNumber=10006
Push with failures:
https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&revision=0c672cce8019ad294204a0a2efb16cd7035dbde3
Backout:
https://hg.mozilla.org/integration/autoland/rev/586eb027063318b9e6cea97e96bcd7ff291c231d
Assignee | ||
Comment 7•4 years ago
|
||
Without this patch, the UI finished listener is only registered after the
migrations are completely run. If the UI finishes intializating before this
happens, then our listener will never fire, and we'll miss the event. This
causes initialization to never proceed.
With this patch, we register the listener synchronously (before the first
await
) and so are guaranteed to receive the event, no matter how long the
migrations take to run.
Updated•4 years ago
|
Pushed by mcooper@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6f93019be0d6 p1 - Register UI listener sooner in Normandy init to avoid race conditions r=Gijs https://hg.mozilla.org/integration/autoland/rev/066674b8313d p2 - Add filler enrollment IDs to old Normandy studies and rollouts. r=Gijs
Comment 9•4 years ago
|
||
Backed out 2 changesets (bug 1594035) for Browser-chrome failiures on normandy/test/browser/browser_Normandy.js. CLOSED TREE
Log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=277534997&repo=autoland&lineNumber=5358
Push with failures:
https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&revision=066674b8313dd20240e6a22fb70e55172a9ae7f1
Backout:
https://hg.mozilla.org/integration/autoland/rev/26a1ba8ed665dc7cd7254f6e287b3fdf12516c82
Updated•4 years ago
|
Comment 10•4 years ago
|
||
Pushed by mcooper@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ab451796f43a p1 - Register UI listener sooner in Normandy init to avoid race conditions r=Gijs https://hg.mozilla.org/integration/autoland/rev/b4da4b5e0781 p2 - Add filler enrollment IDs to old Normandy studies and rollouts. r=Gijs
Comment 11•4 years ago
|
||
Backed out for xpcshell failures at toolkit/modules/tests/xpcshell/test_firstStartup.js
Backout: https://hg.mozilla.org/integration/autoland/rev/20560f3195c88bad767fb1fb844605588425a73c
Failure push: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=b4da4b5e07816cb653efe702d3f1819999b2fc08
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=278105742&repo=autoland&lineNumber=5925
task 2019-11-25T23:47:32.812Z] 23:47:32 INFO - TEST-PASS | toolkit/modules/tests/xpcshell/test_firstStartup.js | test_timeout - [test_timeout : 46] 2 == 2
[task 2019-11-25T23:47:32.812Z] 23:47:32 INFO - (xpcshell/head.js) | test run_next_test 1 finished (2)
[task 2019-11-25T23:47:32.812Z] 23:47:32 INFO - (xpcshell/head.js) | test run_next_test 2 pending (2)
[task 2019-11-25T23:47:32.812Z] 23:47:32 INFO - (xpcshell/head.js) | test test_timeout finished (2)
[task 2019-11-25T23:47:32.812Z] 23:47:32 INFO - "CONSOLE_MESSAGE: (warn) [JavaScript Warning: "firstStartup.elapsed - Truncating float/double number."]"
[task 2019-11-25T23:47:32.812Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: Ignoring duplicate observer.: file z:/build/build/src/modules/libpref/Preferences.cpp, line 2696
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - (xpcshell/head.js) | test run_next_test 2 finished (1)
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - exiting test
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: Ignoring duplicate observer.: file z:/build/build/src/modules/libpref/Preferences.cpp, line 2696
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: Ignoring duplicate observer.: file z:/build/build/src/modules/libpref/Preferences.cpp, line 2696
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: Ignoring duplicate observer.: file z:/build/build/src/modules/libpref/Preferences.cpp, line 2696
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: Ignoring duplicate observer.: file z:/build/build/src/modules/libpref/Preferences.cpp, line 2696
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: This method is lossy. Use GetCanonicalPath !: file z:/build/build/src/xpcom/io/nsLocalFileWin.cpp, line 3140
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: This method is lossy. Use GetCanonicalPath !: file z:/build/build/src/xpcom/io/nsLocalFileWin.cpp, line 3140
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: This method is lossy. Use GetCanonicalPath !: file z:/build/build/src/xpcom/io/nsLocalFileWin.cpp, line 3140
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - "CONSOLE_MESSAGE: (error) [JavaScript Error: "NetworkError when attempting to fetch resource." {file: "resource://services-settings/Utils.jsm" line: 109}]
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - fetchLatestChanges@resource://services-settings/Utils.jsm:109:28
[task 2019-11-25T23:47:32.813Z] 23:47:32 INFO - sync@resource://services-settings/RemoteSettingsClient.jsm:401:37
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - get@resource://services-settings/RemoteSettingsClient.jsm:354:22
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - asyncloadRecipes@resource://normandy/lib/RecipeRunner.jsm:344:51
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - run@resource://normandy/lib/RecipeRunner.jsm:296:35
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - asyncinit@resource://normandy/lib/RecipeRunner.jsm:120:18
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - asyncfinishInit@resource://normandy/Normandy.jsm:129:24
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - asyncinit/<@resource://normandy/Normandy.jsm:65:54
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - init@resource://gre/modules/FirstStartup.jsm:59:19
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - test_success@Z:/task_1574722640/build/tests/xpcshell/tests/toolkit/modules/tests/xpcshell/test_firstStartup.js:14:16
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - run_next_test/_run_next_test/<@Z:\task_1574722640\build\tests\xpcshell\head.js:1567:22
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - _run_next_test@Z:\task_1574722640\build\tests\xpcshell\head.js:1567:38
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - run@Z:\task_1574722640\build\tests\xpcshell\head.js:735:9
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - _do_main@Z:\task_1574722640\build\tests\xpcshell\head.js:246:6
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - _execute_test@Z:\task_1574722640\build\tests\xpcshell\head.js:573:5
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - @-e:1:1
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - "
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - PID 7356 | [7356, Main Thread] WARNING: Workers don't support the 'mem.mem.' preference!: file z:/build/build/src/dom/workers/RuntimeService.cpp, line 547
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - "CONSOLE_MESSAGE: (error) [JavaScript Error: "IndexedDB UnknownErr: ActorsParent.cpp:12712"]"
[task 2019-11-25T23:47:32.814Z] 23:47:32 INFO - "CONSOLE_MESSAGE: (error) [JavaScript Error: "IndexedDB UnknownErr: ActorsParent.cpp:12712"]"
[task 2019-11-25T23:47:32.815Z] 23:47:32 WARNING - TEST-UNEXPECTED-FAIL | toolkit/modules/tests/xpcshell/test_firstStartup.js | test_timeout - [test_timeout : 265] A promise chain failed to handle a rejection: A request was aborted, for example through a call to IDBTransaction.abort. - stack: Transaction/this._completionPromise</transaction.onabort@resource://gre/modules/IndexedDB.jsm:254:15
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - _execute_test@Z:\task_1574722640\build\tests\xpcshell\head.js:661:19
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - @-e:1:1
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - Rejection date: Mon Nov 25 2019 23:47:32 GMT+0000 (Greenwich Mean Time) - false == true
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - resource://testing-common/PromiseTestUtils.jsm:assertNoUncaughtRejections:265
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - Z:\task_1574722640\build\tests\xpcshell\head.js:_execute_test:668
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - -e:null:1
[task 2019-11-25T23:47:32.815Z] 23:47:32 INFO - exiting test
Assignee | ||
Comment 12•4 years ago
|
||
Pascal, I want to flag this bug as a problem for Firefox 71 release. If this makes it to release, any active Normandy experiments will be broken, and we can't fix it with a dot release. This is because the data for the experiment will fail to be sent, and we'll have a large gap in any experiment affected. Given the problems I've had landing the above patch, I'm going to work on another strategy that should be less invasive and easier to land.
Once the new patch is ready, I'm going to request to uplift that patch to 71. I hope to have it ready by early tomorrow. I realize this is extremely late, and I'm sorry about that.
Comment 13•4 years ago
|
||
(In reply to Michael Cooper [:mythmon] from comment #12)
Pascal, I want to flag this bug as a problem for Firefox 71 release. If this makes it to release, any active Normandy experiments will be broken, and we can't fix it with a dot release. This is because the data for the experiment will fail to be sent, and we'll have a large gap in any experiment affected. Given the problems I've had landing the above patch, I'm going to work on another strategy that should be less invasive and easier to land.
Once the new patch is ready, I'm going to request to uplift that patch to 71. I hope to have it ready by early tomorrow. I realize this is extremely late, and I'm sorry about that.
OK, that means a RC2. We need the uplift request with a patch that grafts cleanly to mozilla-release as soon as possible so as to build it. Ideally tomorrow. Thanks.
Updated•4 years ago
|
Comment 14•4 years ago
|
||
Can we backout bug 1555176 instead of rushing a fix?
Assignee | ||
Comment 15•4 years ago
|
||
Assignee | ||
Comment 16•4 years ago
|
||
The new patch (comment 15) I've prepared is much simpler than the previous one, and I think it is much easier to land than the previous patches in this bug. It's targeted at release, and should apply cleanly to release, beta, and central. I've started a try run with the release version of the patch.
(In reply to Pascal Chevrel:pascalc from comment #14)
Can we backout bug 1555176 instead of rushing a fix?
I think that backing out bug 1555176 would fix the problem. I think that it is a similar amount of work and chance of success to back it out or to take the patch above. Personally I'd prefer the new patch instead of a backout.
The problems with the original patch (Bug 1594035 p2 - Add filler enrollment IDs to old Normandy studies and rollouts) was a race condition between the UI starting and Normandy's migrations running. With either a backout or the new patch (Bug 1594035 - Don't send null
enrollmentIds for Normandy telemetry), that underlying problem still exists, it just doesn't show up in our test suite. I'll be filing a separate bug to fix that, regardless of if we back out feature or take the new patch.
Assignee | ||
Comment 17•4 years ago
|
||
Comment on attachment 9111770 [details]
Bug 1594035 - Don't send null
enrollmentIds for Normandy telemetry r=rhelmer
Beta/Release Uplift Approval Request
- User impact if declined: Experiments currently active in Firefox 70 will stop sending Telemetry data, effectively breaking all active experiments. Users will be enrolled in experiments for no reason.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce: The STR from comment 0 still apply
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): The change itself is small and simple. If it is validated by manual testing, it is likely to work in the real world. On the other hand, it is very late in the train cycle, and so any change is risky.
- String changes made/needed: None
Comment 18•4 years ago
|
||
Pushed by mcooper@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ab751affd303 Don't send `null` enrollmentIds for Normandy telemetry r=rhelmer
Comment 19•4 years ago
|
||
bugherder |
Updated•4 years ago
|
Updated•4 years ago
|
Comment 20•4 years ago
|
||
We have verified that the unenroll
pings are now correctly displayed in about:telemetry
for multi-preference-experiment, branched-addon-study, preference rollback, and addon rollback if users are unenrolled after a version change. The "__NO_ENROLLMENT_ID__"
is displayed as the enrollmentId
for each of the pings. We have verified this on Windows 10 and macOS 10.14.6.
However, we have not managed to verify the scenario from comment 0 due to the fact that the Nighty build where the fix has landed is not yet available for a regular update. We have created profiles using Firefox Release 70, enrolled in the experiments/rollouts, opened the same profile with a Nightly build downloaded from comment 18 and unenrolled in that build.
I'm marking the issue as verified in order to move things along, we will verify using the natural update method when the build will be available (most likely tomorrow) and follow-up once this is done.
Comment 22•4 years ago
|
||
Comment on attachment 9111770 [details]
Bug 1594035 - Don't send null
enrollmentIds for Normandy telemetry r=rhelmer
This is our only blocker for the release, the patch has tests and is on Nightly, let's take it in a RC2. Thanks.
Comment 23•4 years ago
|
||
bugherder uplift |
Comment 24•4 years ago
|
||
Since the fix landed in the latest Nightly build, we have also verified that the unenroll
events are now correctly displayed in about:telemetry
after updating from an older version of Firefox Nightly to the latest. We have verified this on Windows 10 and macOS 10.14.6.
We will verify this once again as soon as an RC build with the fix is available.
Comment 25•4 years ago
|
||
We have also verified that the unenroll
events are now correctly displayed in about:telemetry
after updating from an older version of Firefox Beta to the latest version from the release-localtest
channel. We have verified this on Windows 10 and macOS 10.14.6.
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 26•4 years ago
|
||
This bug has been identified as part of a pilot on determining root causes of blocking and dot release drivers.
It needs a root-cause set for it. Please see the list at https://docs.google.com/document/d/1FFEGsmoU8T0N8R9kk-MXWptOPtXXXRRIe4vQo3_HgMw/.
Add the root cause as a whiteboard
tag in the form [rca - <cause> ]
and remove the rca-needed
keyword.
If you have questions, please contact :tmaity.
Assignee | ||
Comment 27•4 years ago
|
||
I've added the root cause of "coding". I think that the root cause here is an incompatibility between the design of Normandy and the design of Telemetry. Normandy's process treats missing-or-incorrect telemetry as something that should cause tests to fail, and the Telemetry system does not. Both of these are by design, and they sometimes conflict.
Normandy's tests often mock out Telemetry in order to verify that the correct event is being generated, because it is simpler than reading the actual generated events. In this case, none of the tests that handled the case where an enrollment ID was missing both a) did not mock Telemetry, and b) verified the event that was actually sent. If this had been the case, we would have noticed this as a test failure, since the event was rejected by Telemetry.
From a Normandy-point of view, it would helpful to be able to put Telemetry in a mode where invalid Telemetry would throw and error instead of logging a message. This could be used in Normandy's test suite to help verify our expectations. Alternatively, perhaps Normandy could check for the messages that Telemetry logs.
Comment 28•4 years ago
|
||
From a Normandy-point of view, it would helpful to be able to put Telemetry in a mode where invalid Telemetry would throw and error instead of logging a message.
Tagging in a bug about this that I'd stumbled across before.
Updated•2 years ago
|
Description
•