Closed Bug 1953664 Opened 2 months ago Closed 27 days ago

Intermittent toolkit/components/nimbus/test/browser/browser_experimentapi_child.js | Uncaught exception in test bound testGetExperimentFromChildExistingEnrollment - at chrome://mochitests/content/browser/toolkit/components/nimbus/test/browser/

Categories

(Firefox :: Nimbus Desktop Client, defect, P5)

defect

Tracking

()

RESOLVED FIXED
139 Branch
Tracking Status
firefox-esr128 --- unaffected
firefox136 --- unaffected
firefox137 --- unaffected
firefox138 --- fixed
firefox139 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: beth)

References

(Regression)

Details

(Keywords: intermittent-failure, regression, test-verify-fail)

Attachments

(2 files)

Filed by: chorotan [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=498879246&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/DBGAtAZUQAKV0w_t-P4S-w/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/DBGAtAZUQAKV0w_t-P4S-w/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1


[task 2025-03-13T00:48:46.076Z] 00:48:46     INFO - TEST-PASS | toolkit/components/nimbus/test/browser/browser_experimentapi_child.js | Experiment is null - null == null - 
[task 2025-03-13T00:48:46.076Z] 00:48:46     INFO - Leaving test bound testGetExperimentFromChildNewEnrollment
[task 2025-03-13T00:48:46.076Z] 00:48:46     INFO - Entering test bound testGetExperimentFromChildExistingEnrollment
[task 2025-03-13T00:48:46.076Z] 00:48:46     INFO - when calling ExperimentAPI.getExperiment with slug:
[task 2025-03-13T00:48:46.076Z] 00:48:46     INFO - Buffered messages finished
[task 2025-03-13T00:48:46.076Z] 00:48:46     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/nimbus/test/browser/browser_experimentapi_child.js | Uncaught exception in test bound testGetExperimentFromChildExistingEnrollment - at chrome://mochitests/content/browser/toolkit/components/nimbus/test/browser/browser_experimentapi_child.js:212 - TypeError: can't access property "slug", rv is null
[task 2025-03-13T00:48:46.076Z] 00:48:46     INFO - Stack trace:
[task 2025-03-13T00:48:46.076Z] 00:48:46     INFO - @chrome://mochitests/content/browser/toolkit/components/nimbus/test/browser/browser_experimentapi_child.js:212:20
[task 2025-03-13T00:48:46.076Z] 00:48:46     INFO - async*execute@resource://testing-common/SpecialPowersSandbox.sys.mjs:139:12
[task 2025-03-13T00:48:46.076Z] 00:48:46     INFO - _spawnTask@resource://testing-common/SpecialPowersChild.sys.mjs:1626:15
[task 2025-03-13T00:48:46.076Z] 00:48:46     INFO - receiveMessage@resource://testing-common/SpecialPowersChild.sys.mjs:257:21
[task 2025-03-13T00:48:46.076Z] 00:48:46     INFO - JSActor query*receiveMessage@resource://testing-common/SpecialPowersParent.sys.mjs:1384:14
[task 2025-03-13T00:48:46.077Z] 00:48:46     INFO - JSActor query*spawn@resource://testing-common/SpecialPowersChild.sys.mjs:1555:17
[task 2025-03-13T00:48:46.077Z] 00:48:46     INFO - testGetExperimentFromChildExistingEnrollment@chrome://mochitests/content/browser/toolkit/components/nimbus/test/browser/browser_experimentapi_child.js:192:23
[task 2025-03-13T00:48:46.077Z] 00:48:46     INFO - async*handleTask@chrome://mochikit/content/browser-test.js:1170:26
[task 2025-03-13T00:48:46.077Z] 00:48:46     INFO - _runTaskBasedTest@chrome://mochikit/content/browser-test.js:1242:18
[task 2025-03-13T00:48:46.077Z] 00:48:46     INFO - async*Tester_execTest@chrome://mochikit/content/browser-test.js:1383:14
[task 2025-03-13T00:48:46.077Z] 00:48:46     INFO - nextTest/<@chrome://mochikit/content/browser-test.js:1153:14
[task 2025-03-13T00:48:46.077Z] 00:48:46     INFO - SimpleTest.waitForFocus/<@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:1058:13
[task 2025-03-13T00:48:46.077Z] 00:48:46     INFO - Leaving test bound testGetExperimentFromChildExistingEnrollment
[task 2025-03-13T00:48:46.077Z] 00:48:46     INFO - GECKO(12708) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to /tmp/tmp7vri4om8.mozrunner/runtests_leaks_tab_pid13166.log
[task 2025-03-13T00:48:46.113Z] 00:48:46     INFO - GECKO(12708) | MEMORY STAT | vsize 11606MB | residentFast 515MB | heapAllocated 252MB
[task 2025-03-13T00:48:46.114Z] 00:48:46     INFO - TEST-OK | toolkit/components/nimbus/test/browser/browser_experimentapi_child.js | took 1874ms
[task 2025-03-13T00:48:46.115Z] 00:48:46     INFO - Not taking screenshot here: see the one that was previously logged
[task 2025-03-13T00:48:46.115Z] 00:48:46     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/nimbus/test/browser/browser_experimentapi_child.js | Found an unexpected tab at the end of test run: https://example.com/ - 
[task 2025-03-13T00:48:46.191Z] 00:48:46     INFO - checking window state
[task 2025-03-13T00:48:46.321Z] 00:48:46     INFO - GECKO(12708) | Completed ShutdownLeaks collections in process 12708
[task 2025-03-13T00:48:46.322Z] 00:48:46     INFO - TEST-START | Shutdown
[task 2025-03-13T00:48:46.322Z] 00:48:46     INFO - Browser Chrome Test Summary
[task 2025-03-13T00:48:46.323Z] 00:48:46     INFO - Passed:  31
[task 2025-03-13T00:48:46.323Z] 00:48:46     INFO - Failed:  2
[task 2025-03-13T00:48:46.324Z] 00:48:46     INFO - Todo:    0
[task 2025-03-13T00:48:46.324Z] 00:48:46     INFO - Mode:    e10s
[task 2025-03-13T00:48:46.324Z] 00:48:46     INFO - *** End BrowserChrome Test Results ***
[task 2025-03-13T00:48:46.348Z] 00:48:46     INFO - GECKO(12708) | Exiting due to channel error.
[task 2025-03-13T00:48:46.350Z] 00:48:46     INFO - GECKO(12708) | Exiting due to channel error.
[task 2025-03-13T00:48:46.352Z] 00:48:46     INFO - GECKO(12708) | Exiting due to channel error.
[task 2025-03-13T00:48:46.354Z] 00:48:46     INFO - GECKO(12708) | Exiting due to channel error.
[task 2025-03-13T00:48:46.355Z] 00:48:46     INFO - GECKO(12708) | Exiting due to channel error.
[task 2025-03-13T00:48:46.359Z] 00:48:46     INFO - GECKO(12708) | Exiting due to channel error.
[task 2025-03-13T00:48:46.361Z] 00:48:46     INFO - GECKO(12708) | Exiting due to channel error.
[task 2025-03-13T00:48:46.363Z] 00:48:46     INFO - GECKO(12708) | Exiting due to channel error.
[task 2025-03-13T00:48:46.369Z] 00:48:46     INFO - TEST-INFO | Main app process: exit 0
Flags: needinfo?(brennie)

Set release status flags based on info from the regressing bug 1943422

Summary: Intermittent TV toolkit/components/nimbus/test/browser/browser_experimentapi_child.js | Uncaught exception in test bound testGetExperimentFromChildExistingEnrollment - at chrome://mochitests/content/browser/toolkit/components/nimbus/test/browser/ → Intermittent toolkit/components/nimbus/test/browser/browser_experimentapi_child.js | Uncaught exception in test bound testGetExperimentFromChildExistingEnrollment - at chrome://mochitests/content/browser/toolkit/components/nimbus/test/browser/
See Also: → 1956248

See this phab comment:

I think updating the data from the parent to already existing children is just intrinsically racy (and always was) and that is what we see here.

testGetExperimentFromChildExistingEnrollment relies on ExperimentAPI.ready but if we happen to run in a pre-existing content process where we already did ExperimentAPI.ready(), this becomes basically a no-op and the data written by the parent must arrive via a "change" notification to the ExperimentStore in the child, IIUC. And that is racy.

I assume this is not a big deal for normal usage, as the data will eventually arrive, but this test definitely needs some re-thinking.

Assignee: nobody → brennie
Flags: needinfo?(brennie)

So using forceNewProcess alone seems to not solve the problem locally to me using

MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_SHUTDOWN=/home/jens/src/mozilla-unified/profile10.dat MOZ_CHAOSMODE=true ./mach mochitest toolkit/components/nimbus/test/browser --repeat 10

on an opt build. But at least from local profiling it definitely seems to ensure, that each execution gets a new process (note that this profile is running the below change without failure).

It appears to me that ExperimentStore.addEnrollment() does not ensure that _syncToChildren is called with flush: true. I ignore the details of the underlying Services.ppmm.sharedData but it sounds likely that we need to force a flush even before we want to poll the data from the child.

Locally this seems to do the trick on my machine, a try push is coming. There might be better places to enforce the flush, maybe directly inside ExperimentStore.addEnrollment. Or maybe we actually do not want that flush to happen always but only for testing?

(In reply to Jens Stutte [:jstutte] from comment #5)

It appears to me that ExperimentStore.addEnrollment() does not ensure that _syncToChildren is called with flush: true. I ignore the details of the underlying Services.ppmm.sharedData but it sounds likely that we need to force a flush even before we want to poll the data from the child.

If I clicked right in searchfox I think everything boils down to this idle dispatch without flushing. I think until the idle event runs, nothing is visible to child processes but everything is buffered in mChangedKeys.

So I think we should definitely flush in the test. Probably for normal operations, things are just fine.

This test is trying to test the scenarios of (a) syncing a new
enrollment to an existing child process and (b) syncing an existing
enrollment to a new child process, so we need to create new processes in
each case. Otherwise, we can run into a race because
ExperimentAPI.ready() in a re-used child process will resolve
immediately and we actually want to wait for the initial sync from the
parent to the child.

Additionally, we force the SharedMap to flush to child processes after
we modify our enrollments so that child processes will immedaiately see
the changes instead of waiting for idle dispatch to re-serialize the
contents of the enrollment store.

Pushed by brennie@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/96f2725a0e71 Always use a new process for child process in browser_experimentapi_child.js r=jstutte,nimbus-reviewers,chumphreys
Status: NEW → RESOLVED
Closed: 27 days ago
Resolution: --- → FIXED
Target Milestone: --- → 139 Branch

The patch landed in nightly and beta is affected.
:beth, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox138 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(brennie)

This test is trying to test the scenarios of (a) syncing a new
enrollment to an existing child process and (b) syncing an existing
enrollment to a new child process, so we need to create new processes in
each case. Otherwise, we can run into a race because
ExperimentAPI.ready() in a re-used child process will resolve
immediately and we actually want to wait for the initial sync from the
parent to the child.

Additionally, we force the SharedMap to flush to child processes after
we modify our enrollments so that child processes will immedaiately see
the changes instead of waiting for idle dispatch to re-serialize the
contents of the enrollment store.

Original Revision: https://phabricator.services.mozilla.com/D243530

Attachment #9476709 - Flags: approval-mozilla-beta?

beta Uplift Approval Request

  • User impact if declined: None. Test only
  • Code covered by automated testing: yes
  • Fix verified in Nightly: yes
  • Needs manual QE test: no
  • Steps to reproduce for manual QE testing: N/A
  • Risk associated with taking this patch: None
  • Explanation of risk level: test only
  • String changes made/needed: no
  • Is Android affected?: no
Attachment #9476709 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: needinfo?(brennie)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: