Continuous memory leak, with animated theme and at least one visible and one occluded window
Categories: Core :: Graphics: WebRender, defect
People: Reporter: n6fpblsjcva7, Assigned: tnikkel
References: Regression
Keywords: regression
Attachments: 2 files, 2 obsolete files
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/112.0
Steps to reproduce:
Updated Firefox to the latest version.
Actual results:
Every ~4 seconds, the memory that Firefox is using increases by 0.02 to 0.4 GB. This continues until it is out of memory.
Expected results:
Memory usage should stay roughly the same.
Comment 1•2 years ago
The Bugbug bot thinks this bug should belong to the 'Core::Performance' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.
Comment 2•2 years ago
This bug was moved into the Performance component.
:n6fpblsjcva7, could you make sure the following information is on this bug?
- For slowness or high CPU usage, capture a profile with http://profiler.firefox.com/, upload it and share the link here.
✅ For memory usage issues, capture a memory dump from about:memory and attach it to this bug.
- Troubleshooting information: Go to about:support, click "Copy raw data to clipboard", paste it into a file, save it, and attach the file here.
If the requested information is already in the bug, please confirm it is recent.
Thank you.
Comment 3•2 years ago
That's a lot of shmem.
1,446.02 MB ── resident
2,286.36 MB ── resident-peak
16,480.96 MB ── resident-phys-footprint
1,163.64 MB ── resident-unique
14,639.53 MB ── shmem-allocated
14,703.05 MB ── shmem-mapped
I'm going to move this to Graphics under the assumption that it is the most likely cause of a continuous leak like that.
Do you have any windows minimized? I see that you have the Dark Space addon. Do you have a lightweight theme with an animation?
Maybe this is another instance of bug 1820841, which seemed to be a kind of variation of bug 1828587 showing up on MacOS.
Comment 4•2 years ago
While this likely represents a bug in Firefox that we need to fix, you might be able to avoid this leak by disabling the animated theme.
Comment 5•2 years ago
:ahale, this sounds similar to recent issues you've worked on WRT animated themes causing problems. Is this a dupe, maybe?
Comment 6•2 years ago
(In reply to Erich Gubler [:ErichDonGubler] from comment #5)
> :ahale, this sounds similar to recent issues you've worked on WRT animated themes causing problems. Is this a dupe, maybe?
It sure does look that way; I can't see anything to distinguish this bug from that one other than this being MacOS rather than Windows.
I'm not sure if this is after the fix landed; the reporter's version is 112.0, so it may be before the fix. If it is after the fix, it would imply that the mThrottled check in https://hg.mozilla.org/mozilla-central/rev/6f2d406b412b#l1.102 is not working?
:bradwerth, would you mind attempting repro on this on MacOS? It looks easy to repro, and we need to take action on this if the bug is still around. Repro steps in https://bugzilla.mozilla.org/show_bug.cgi?id=1828587#c71 should still work.
Assignee
Comment 7•2 years ago
It's pretty easy to reproduce by using the Dark Space theme, opening a bunch of windows, and then minimizing them (we don't know that the reporter of this bug did that, but it's quite possible). This is already known, as Andrew commented in bug 1820841, comment 9. It reproduces in nightlies at least as far back as 110, so it predates Brad's work in bug 1768495, which landed during the 112 cycle.
So I guess minimized windows on mac aren't throttled, and aren't painting either.
Comment 8•2 years ago
The bug is marked as tracked for firefox113 (release), tracked for firefox114 (beta) and tracked for firefox115 (nightly). However, the bug still isn't assigned.
:bhood, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit BugBot documentation.
Comment 9•2 years ago
(In reply to Timothy Nikkel (:tnikkel) from comment #7)
> So I guess minimized windows on mac aren't throttled, and aren't painting either.
They should be tho...
Comment 10•2 years ago
Brad, can you poke at this and see if you can come up with any ideas?
Comment 11•2 years ago
Tim, does this repro in an even older nightly (82)? macOS effectively paused the compositor before already, see bug 1580117.
Comment 12•2 years ago
I'm able to reproduce using Tim's approach in comment 7. I'll try to figure out what's going on.
Assignee
Comment 13•2 years ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #11)
> Tim, does this repro in an even older nightly (82)? macOS effectively paused the compositor before already, see bug 1580117.
Yeah, I can reproduce using whatever nightly mozregression launches when I ask for 82.
Assignee
Comment 14•2 years ago
I just checked with printfs in a local build: we do throttle the browser.xhtml parent process and content process refresh drivers in windows that are minimized on macOS. So there must be something more going on here.
Comment 15•2 years ago
When I reproduce and diff the before-and-after memory reports, the growth is in textures/render-texture-hosts.
Comment 16•2 years ago
With an animated theme active, this only reproduces when I have at least one window minimized and one window not minimized. When that is true, calls to RenderThread::RegisterExternalImage continue but there are no corresponding calls to RenderThread::UnregisterExternalImage. This is what's building up the memory reported as textures/render-texture-hosts.
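To illustrate how that register/unregister imbalance turns into the render-texture-hosts growth, here is a hypothetical, simplified sketch; the class, method names, and sizes are invented for illustration and are not the real RenderThread code:

```cpp
// Hypothetical sketch only: unmatched register/unregister calls keep entries
// (and their backing memory) alive in a registry, which is what a
// textures/render-texture-hosts style memory report would show growing.
#include <cstdint>
#include <cstdio>
#include <memory>
#include <unordered_map>
#include <vector>

struct TextureHost {
  std::vector<uint8_t> pixels;  // stand-in for the shared surface's memory
  explicit TextureHost(size_t aBytes) : pixels(aBytes) {}
};

class TextureRegistry {
 public:
  void RegisterExternalImage(uint64_t aId, size_t aBytes) {
    mHosts.emplace(aId, std::make_unique<TextureHost>(aBytes));
  }
  void UnregisterExternalImage(uint64_t aId) {
    mHosts.erase(aId);  // releases the backing memory
  }
  size_t BytesHeld() const {
    size_t total = 0;
    for (const auto& entry : mHosts) {
      total += entry.second->pixels.size();
    }
    return total;
  }

 private:
  std::unordered_map<uint64_t, std::unique_ptr<TextureHost>> mHosts;
};

int main() {
  TextureRegistry registry;
  // Every animation tick registers a new frame, but the matching unregister
  // never arrives while an occluded window still references the old one.
  for (uint64_t frameId = 1; frameId <= 100; ++frameId) {
    registry.RegisterExternalImage(frameId, 4 * 1024 * 1024);
    // registry.UnregisterExternalImage(frameId - 1);  // never happens: leak
  }
  std::printf("bytes held by texture hosts: %zu\n", registry.BytesHeld());
  return 0;
}
```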
Assignee
Comment 17•2 years ago
We keep hitting this case (https://searchfox.org/mozilla-central/rev/e77d89c414eac63a295a2124600045775a6f4715/image/imgFrame.cpp#279) when we are trying to recycle a frame from the animated image. Seems like it is an unexpected interaction with the animated image frame recycling code.
Comment 18•2 years ago
I don't completely understand this yet, but here's what I've discovered about the behavior with an animated theme, at least one visible window, and at least one occluded window:
1. While decoding an animated GIF, we sometimes reuse an external image id. When we do that, we create a ScheduleSharedSurfaceRelease for the old image, which gets posted as a transaction to the Renderer. We also (somewhere) create a transaction to register the new image and, I believe, increment a use count for the external image. I believe that transaction gets posted to all Renderers using that image, including those which might be associated with occluded windows. Through some call chain that I don't understand yet, this transaction is executed on all those Renderers.
2. The ScheduleSharedSurfaceRelease transaction created in Step 1 will only be executed on the Renderer instances when Renderer.update is called and the AppendNotificationRequests message is processed. But Renderer.update has a long causation chain which leads back to WebRenderBridgeParent::MaybeGenerateFrame. In that function, we early exit if the compositor is paused.
3. Since the register transaction from Step 1 is executed on all Renderers and the release transaction is only executed on unoccluded Renderers, the usage counts are never decremented to 0, and the external image is never released until the occluded Renderer no longer has a paused compositor (it becomes unoccluded) and processes all of its pending transactions.
The ideal fix, I think, would be to prevent the register transaction from being posted (or the usage counter from being incremented) on occluded windows. But I don't yet understand that call chain. If somebody does, please post to this bug how that causation works. An alternative fix is to change the behavior of MaybeGenerateFrame so that, when it detects a paused compositor, it calls the Update method on the Renderer before the early exit. Update, unlike UpdateAndRender, should process the pending transactions without generating a frame. I'll try to build a fix that does this.
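To make that alternative fix concrete, here is a minimal sketch of the proposed control flow, using made-up stand-in types rather than the actual WebRenderBridgeParent/Renderer code; the key assumption is that pending transactions can be safely processed without generating a frame:

```cpp
// Illustrative stand-ins only; not the mozilla-central implementation.
struct Renderer {
  void Update() {
    // Process pending transactions (including queued surface-release
    // notifications) without producing a frame.
  }
  void UpdateAndRender() {
    Update();
    // ...then also generate and present a frame.
  }
};

struct Bridge {
  Renderer mRenderer;
  bool mPaused = false;

  void MaybeGenerateFrame() {
    if (mPaused) {
      // Current behavior: early return, so queued releases pile up while the
      // window is occluded. Proposed: drain the queue first, then bail out.
      mRenderer.Update();
      return;
    }
    mRenderer.UpdateAndRender();
  }
};

int main() {
  Bridge bridge;
  bridge.mPaused = true;        // window is minimized / occluded
  bridge.MaybeGenerateFrame();  // pending releases still get processed
  return 0;
}
```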
Comment 19•2 years ago
I also have no understanding why this is an issue in macOS and not on other platforms. It's possible that the image decode step is handled differently and somehow the register transaction is permitted on macOS and not on other platforms.
Comment 20•2 years ago
Comment 21•1 years ago
(In reply to Brad Werth [:bradwerth] from comment #19)
> I also have no understanding why this is an issue in macOS and not on other platforms. It's possible that the image decode step is handled differently and somehow the register transaction is permitted on macOS and not on other platforms.
Isn't this the same issue as reported here? https://bugzilla.mozilla.org/show_bug.cgi?id=1828587
Comment 22•1 years ago
> I also have no understanding why this is an issue in macOS and not on other platforms.
(I'm not a developer & not sure.)
Don't Mac and Wayland have a different kind of vsync than Windows and X11? bug 1829493 (114), bug 1797978 (108: sounds similar), bug 1786247 (107: sounds similar), bug 1765399 (102), bug 1759234 (100), bug 1645528 (86: "widget-local vsync source"), bug 1542808 (72)
Only Mac and Linux have gfx.canvas.accelerated=true:
bug 1831548 (async RemoteTexture = Early Beta or earlier) makes Linux run out of file descriptors, which sometimes crashes Firefox.
Assignee
Comment 23•1 years ago
A little bit of context on what I've figured out, in case that helps. There is one RasterImage object for the animated image in the theme; it is shared by all of the windows. So when the refresh driver ticks in the one visible window, we advance the frame of that one image object that those minimized windows are pointing to as well. We share the image frame with the gpu process so it can paint it. Then the gpu process is supposed to send a message back to the process that sent it the animated image frame after it is done with it. The gpu process doesn't send the "finished with frame n" msg. I'm fuzzy on the exact mechanism this uses, but it's something to do with the gpu process knowing those minimized windows contain the same animated image and still need to present that image frame (and the logic doesn't take paused compositors into account in some way).
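A hypothetical model of that handshake (not the real IPC code) may help: the decoding process keeps a frame alive until every window that was handed the frame reports it is finished with it, so a window whose compositor never composites never sends its ack and the frame is never released.

```cpp
// Hypothetical model of the "finished with frame n" handshake; names and
// structure are invented for illustration.
#include <cstdint>
#include <map>
#include <set>

class FrameTracker {
 public:
  // Record which windows were handed this frame; each must acknowledge it.
  void SendFrameTo(uint64_t aFrameId, const std::set<int>& aWindowIds) {
    mOutstanding[aFrameId] = aWindowIds;
  }

  // A window reports it is done presenting the frame.
  void FinishedWithFrame(uint64_t aFrameId, int aWindowId) {
    auto it = mOutstanding.find(aFrameId);
    if (it == mOutstanding.end()) {
      return;
    }
    it->second.erase(aWindowId);
    if (it->second.empty()) {
      mOutstanding.erase(it);  // all acks received: the frame can be freed
    }
  }

  size_t FramesStillHeld() const { return mOutstanding.size(); }

 private:
  std::map<uint64_t, std::set<int>> mOutstanding;
};

int main() {
  FrameTracker tracker;
  tracker.SendFrameTo(/*frame*/ 1, /*windows*/ {0, 1});  // window 1 is minimized
  tracker.FinishedWithFrame(1, /*window*/ 0);  // only the visible window acks
  // Window 1's compositor is paused and never acks, so frame 1 stays held.
  return static_cast<int>(tracker.FramesStillHeld());  // 1
}
```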
Comment 24•1 years ago
FYI our last beta builds tomorrow and ships on Friday. We build our RC on Monday; I would take mitigation patches in the RC.
Comment 25•1 years ago
Some insight into why we're allocating a frame in this case. When we have an animated theme and at least one window visible and one occluded, we always choose a recycled frame that is the restore frame. I'm not sure why -- not familiar with this logic. This causes us to allocate a new frame for each frame, thus the memory leak. I'll try to figure out what we can do in this case. Tim, do you understand why we are always hitting this code path in this case?
Comment 26•1 years ago
I think we should probably back out bug 1768495 again from beta. It's unfortunate, but it's way less risky than adding a whole new code path to mitigate this IMO, especially close to RC.
Assignee
Comment 27•1 years ago
(In reply to Brad Werth [:bradwerth] from comment #25)
> Some insight into why we're allocating a frame in this case. When we have an animated theme and at least one window visible and one occluded, we always choose a recycled frame that is the restore frame. I'm not sure why -- not familiar with this logic. This causes us to allocate a new frame for each frame, thus the memory leak. I'll try to figure out what we can do in this case. Tim, do you understand why we are always hitting this code path in this case?
Hmm, that's not quite the behaviour that I'm observing. In the code you linked, blocked is false for me and we go into InitForDecoderRecycle. However, InitForDecoderRecycle fails, so blocked does end up true and we still go on to allocate a new frame.
InitForDecoderRecycle fails because we can't recycle a frame if it is still being held onto by someone. Similar to comment 17, at this point we have one ref for each open window (since each open window points to the same image), so we can't recycle the old frame because those minimized windows still have a reference to the frame.
Recycling of frames is not really the cause, because I can still reproduce with the pref image.animated.decode-on-demand.recycle set to false, which should disable recycling.
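A hedged illustration of why a frame that is still referenced forces a fresh allocation on every animation tick; this uses std::shared_ptr as a stand-in for the real frame ownership and is not the actual recycling code:

```cpp
// Illustration only: a frame can only be recycled if nothing else still
// holds it; otherwise a brand-new frame is allocated each tick.
#include <cstdint>
#include <memory>
#include <vector>

struct Frame {
  std::vector<uint8_t> pixels;
  explicit Frame(size_t aBytes) : pixels(aBytes) {}
};

std::shared_ptr<Frame> RecycleOrAllocate(const std::shared_ptr<Frame>& aOld,
                                         size_t aBytes) {
  // Recycling in place is only safe when the caller holds the sole reference.
  if (aOld && aOld.use_count() == 1) {
    return aOld;
  }
  return std::make_shared<Frame>(aBytes);  // still held elsewhere: allocate
}

int main() {
  std::vector<std::shared_ptr<Frame>> heldByMinimizedWindows;
  std::shared_ptr<Frame> current;
  for (int tick = 0; tick < 10; ++tick) {
    current = RecycleOrAllocate(current, 4 * 1024 * 1024);
    // A minimized window keeps a reference, so the next tick cannot recycle.
    heldByMinimizedWindows.push_back(current);
  }
  // 10 distinct frames were allocated; none could be recycled.
  return 0;
}
```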
Assignee
Comment 28•1 years ago
And we allocate a new frame because the image in question is large enough that we don't keep around every frame of the animated image. The code that handles this operates on the assumption that old frames will get dropped when we are finished with them, so it doesn't expect to wrap around and start decoding new copies of frames that are still being held somewhere.
Comment 29•1 years ago
Thanks for correcting me on the code path through Decoder::AllocateFrameInternal. Indeed, we are failing there for the reason you detail in comment 28, and not due to matching the restore frame, as I proposed in comment 25.
(In reply to Timothy Nikkel (:tnikkel) from comment #27)
> Recycling of frames is not really the cause, because I can still reproduce with the pref image.animated.decode-on-demand.recycle set to false, which should disable recycling.
If this pref is off, it seems that the correct fix is D178551. Since that fix also handles the case when the pref is on, I think we really should land it, and treat the behavior in comment 28 as expected and correct.
Comment 30•1 years ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #26)
> I think we should probably back out bug 1768495 again from beta. It's unfortunate, but it's way less risky than adding a whole new code path to mitigate this IMO, especially close to RC.
(In reply to Pascal Chevrel:pascalc from comment #24)
> FYI our last beta builds tomorrow and ships on Friday. We build our RC on Monday; I would take mitigation patches in the RC.
Pascal, let's back out Bug 1768495 from beta, as Emilio proposes. That will give us more time to find the right fix.
Assignee
Comment 31•1 years ago
I'm moving Pascal's needinfo to bug 1834246, which we will use to handle the backout from beta. This bug can continue to be used to track a long-term fix.
Comment 32•1 years ago
Shared surfaces are given a refcount of 1 at the point of creation, and
then that refcount is increased for each window that displays the surface.
This change ensures that if the compositor associated with a window is
paused, we don't increase the refcount because that paused compositor will
never display the surface and decrease the refcount.
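A sketch of the guard this patch summary describes, with invented names standing in for the real WebRender shared-surface code; the assumption is that a paused compositor will never present the surface, so taking a reference on its behalf only strands the refcount:

```cpp
// Invented names; this only sketches the refcounting rule described above.
#include <atomic>
#include <vector>

struct SharedSurface {
  std::atomic<int> mRefCount{1};  // starts at 1 at the point of creation
};

struct Window {
  bool mCompositorPaused;
};

void AddSurfaceToWindows(SharedSurface& aSurface,
                         const std::vector<Window>& aWindows) {
  for (const Window& window : aWindows) {
    if (window.mCompositorPaused) {
      // A paused compositor will never display the surface, so it would
      // never drop a reference either; don't take one for it.
      continue;
    }
    aSurface.mRefCount.fetch_add(1, std::memory_order_relaxed);
  }
}

int main() {
  SharedSurface surface;
  std::vector<Window> windows = {{false}, {true}};  // second window is occluded
  AddSurfaceToWindows(surface, windows);
  // Refcount is 2 (creator + visible window), not 3; the occluded window is
  // skipped because it will never composite while paused.
  return 0;
}
```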
Comment 33•1 years ago
I think D179121 solves the problem correctly, but Bug 1835127 is confusing the issue since it's very easy to crash the browser while testing it.
Assignee
Comment 34•1 years ago
Comment 35•1 years ago
Tim, is D179393 to be landed in conjunction with the earlier patch D179121, or instead of it?
Assignee
Comment 36•1 years ago
D179393 solves the memory use increase by itself in my testing. D179393 stops the updates from being sent from the content process, so we avoid any work in the gpu process.
D179121 could prevent any updates that do slip through and get to the compositor through different paths (any that exist currently that we aren't aware of, or that are created in the future) from causing problems.
Comment 37•1 years ago
Thank you, let's land them both. I'll take them out of WIP when Bug 1835127 is resolved.
Comment 38•1 years ago
:bradwerth, 115 goes to beta next week (Monday 2023-06-05).
Any concerns about the timing for landing the fix in Bug 1835127 and then here in Bug 1830753?
The regressor, Bug 1768495, introduced in Fx112, was backed out of Fx113 beta and Fx114 beta. I was wondering if we'll need to do the same for Fx115.
Comment 39•1 years ago
To hasten this along, I'm going to remove Bug 1835127 as a blocker. It's a related issue that I believe has no effect outside of debug builds. Let's try to get the patches in this Bug approved and landed.
Comment 40•1 years ago
Setting 115 to fixed; the initial patch that introduced the regression was backed out of beta by uplifting Bug 1836699.
Setting 116 to affected and tracked as blocking; the patches from Bug 1768495 are still in central.
As mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1836699#c3, at this point we should consider also backing out in central until we have everything needed in terms of landing the fixes.
Comment 41•1 years ago
Tim has the last surviving patch for this bug, so it should be assigned to Tim.
Comment 42•1 years ago
Assignee
Comment 43•1 years ago
I filed bug 1838111 to explore a more general solution.
Comment 44•1 years ago
bugherder
Comment 45•1 year ago
I think we want QA to do another round of testing on 116 with this patch now landed.