Enabling the profiler with the JS feature to profile tests leaks windows until shutdown
Categories
(Core :: Gecko Profiler, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr78 | --- | unaffected |
firefox86 | --- | unaffected |
firefox87 | --- | wontfix |
firefox88 | --- | fix-optional |
firefox89 | --- | fix-optional |
People
(Reporter: florian, Assigned: canova)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: regression, Whiteboard: [fxp])
Attachments
(1 obsolete file)
Step to reproduce:
MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_STARTUP_FEATURES="stackwalk,js" ./mach test browser/components/customizableui/test/browser_885530_showInPrivateBrowsing.js --headless
Result:
ERROR TEST-UNEXPECTED-FAIL | browser/components/customizableui/test/browser_885530_showInPrivateBrowsing.js | leaked 8 window(s) until shutdown [url = chrome://browser/content/browser.xhtml]
ERROR TEST-UNEXPECTED-FAIL | browser/components/customizableui/test/browser_885530_showInPrivateBrowsing.js | leaked 8 window(s) until shutdown [url = about:blank]
This behavior is not specific to the browser_885530_showInPrivateBrowsing.js
test, many tests show this bug, but I've not verified if all of them reproduce. Some tests also leak docShells.
I'm using the MOZ_PROFILER_STARTUP_FEATURES
environment variable here to reduce the feature set used, because using the I/O interposer triggers another leak.
Here are the various shutdown outcomes depending on the features used:
"nostacksampling,js"
: Clean shutdown."js"
: Shutdown leak of windows."stackwalk"
,"nostacksampling"
: Shutdown crash on an assertion:MOZ_ASSERT(!mBufferPositionWhenReceivedJSContext) (JSContext should have been cleared before the thread was unregistered)
Reporter | ||
Comment 1•4 years ago
|
||
A local backout of bug 1691504 fixes (or hides) this problem.
Updated•4 years ago
|
Comment 2•4 years ago
|
||
Set release status flags based on info from the regressing bug 1691504
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Reporter | ||
Comment 3•3 years ago
|
||
I can still reproduce, and it's annoying as it makes the debugging automated tests with the profiler in debug builds very painful.
The "using the I/O interposer triggers another leak." part from comment 0 was fixed in bug 1699683.
Reporter | ||
Comment 4•3 years ago
|
||
Actually, the leak also exists in non-debug builds, and almost made me blame the test I was profiling for a leak.
Profile with the default features: https://share.firefox.dev/3J2Fv2H an increasingly large amount of time is spent GC/CC'ing as the test (which uses a lot of windows) progresses.
Profile without the JS feature: https://share.firefox.dev/3QT6LCJ The same test now takes 30s instead of 46, and most of the GC/CC time is gone.
Assignee | ||
Comment 5•3 years ago
|
||
I spent some time to investigate this leak. Here are my findings:
It looks like the leak is not entirely Bug 1691504's fault. Before that bug was fixed, we weren't profiling the JIT samples properly, that's why probably all the JIT frames were missing before that patch.
After going down on that code path a bit, I realized that before that bug this branch was always executed instead of the code below. entry->isSampled(*rangeStart)
was always producing false
because of the wrong sample position in buffer. And since we negate that, it was always resulting in true
and going in there.
So, it looks like it's about not expiring some of the JitcodeGlobalEntries properly. When I looked where we expire everything after the profiling session, I came across the profiler_clear_js_context
API. This API calls JitcodeGlobalTable::setAllEntriesAsExpired
method eventually by following the path below:
profiler_clear_js_context -> (StopJSSampling + PollJSSampling) -> EnableContextProfilingStack(mJSContext, false) -> GeckoProfilerRuntime::enable(false) -> JitcodeGlobalTable::setAllEntriesAsExpired
.
In the end, my guess it is that either we are not calling the profiler_clear_js_context
for some cases, or we are not calling it early enough during the shutdown.
When I looked at some JSContext destructors, I found that we are not calling this function inside nsJSContext::~nsJSContext
. After adding it there, it seems like the leak is fixed!
But it's good to mention that my JS engine knowledge is really not that deep. I would like it if someone who knows JS engine better than me verifies that (probably I will send a WIP patch and ping people there).
Assignee | ||
Comment 6•3 years ago
|
||
During shutdown profiling, test suites were reporting window leaks on shutdown.
After investigating, I realized that we are not setting some
JitcodeGlobalEntries as expired. PROFILER_CLEAR_JS_CONTEXT sets these entries
as expired but it wasn't called in this case where we are destructing
nsJSContext.
Updated•3 years ago
|
Comment 8•3 years ago
|
||
Backed out for causing failures at browser_test_feature_jsallocations.js.
Backout link: https://hg.mozilla.org/integration/autoland/rev/58ce88d4f5dab93d267c396fe01cd0b2ac706a57
Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&revision=409858b7bb4aa75debf041720f8a76dfd63bef92
Failure log: https://treeherder.mozilla.org/logviewer?job_id=405925394&repo=autoland&lineNumber=7292
Comment hidden (obsolete) |
Updated•3 years ago
|
Updated•3 years ago
|
Comment 10•3 years ago
|
||
There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:canova, could you have a look please?
If you still have some work to do, you can add an action "Plan Changes" in Phabricator.
For more information, please visit auto_nag documentation.
Reporter | ||
Updated•3 years ago
|
Updated•3 years ago
|
Updated•2 years ago
|
Comment 11•1 year ago
|
||
This is blocking Florian's idea of enabling the profiler for all mochitests. So it's not an urgent issue, but it's high-value.
Assignee | ||
Comment 12•1 year ago
|
||
I realized that I haven't explained why my previous patch hasn't worked. I will try to explain it as much as possible from my memory and from my quick reading up again. But before starting to work on it again, we need to double check these.
The previous patch was calling PROFILER_CLEAR_JS_CONTEXT
during nsJSContext::Destroy
. This was partly working and fixing our shutdown memory leaks but also was breaking the JS profiling dramatically for some cases.
If I'm not mistaken nsJSContext
is tied to a single window proxy. It means that there could be multiple nsJSContext
s per process / main thread if there are multiple websites at that moment. There is only a single JSContext
though. So multiple nsJSContext
s can point to a single JSContext
.
Because of this, it was fine if we had a single website in a single web content process. But as soon as we have multiple websites tied to a web content process it gets messy. Let's assume that we have two tabs of mozilla.org that live in the same content process. If we close the first tab, we clear the whole JSContext
that is also responsible of the second tab. At that moment, if you continue profiling the second tab, you would lose ALL the capabilities of JS profiling in that second tab and would lose all the JS frames.
So that's why I didn't want to move forward with that patch and then stopped working on it at some point. It would be good if we can understand this issue in a better way. I believe it would be nice to ask some people from the JS team to see if we can do something about it.
Looks like Bug 1896068 removed PollJSSampling
and replaced it with PollJSSamplingForCurrentThread
. Maybe calling PollJSSamplingForCurrentThread
only instead would solve it without removing the whole JSContext from the profiler? Or maybe we need a more targeted API that marks only the entries that belong to that window expired.
Assignee | ||
Comment 13•1 year ago
|
||
I have experimented with adding PollJSSamplingForCurrentThread
, but unfortunately it didn't work. So I think we need some JS help.
Description
•