Continuous memory leak, with animated theme and at least one visible and one occluded window
Categories: Core :: Graphics: WebRender, defect
People: Reporter: n6fpblsjcva7, Assigned: tnikkel
References: Regression
Keywords: regression
Attachments: 2 files, 2 obsolete files
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/112.0
Steps to reproduce:
Updated Firefox to the latest version.
Actual results:
Every ~4 seconds, the memory that Firefox is using increases by 0.02 to 0.4 GB. This continues until it is out of memory.
Expected results:
Memory usage should stay roughly the same.
Comment 1•2 years ago
The Bugbug bot thinks this bug should belong to the 'Core::Performance' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.
Comment 2•2 years ago
This bug was moved into the Performance component.
:n6fpblsjcva7, could you make sure the following information is on this bug?
- For slowness or high CPU usage, capture a profile with http://profiler.firefox.com/, upload it and share the link here.
✅ For memory usage issues, capture a memory dump from about:memory and attach it to this bug.
- Troubleshooting information: Go to about:support, click "Copy raw data to clipboard", paste it into a file, save it, and attach the file here.
If the requested information is already in the bug, please confirm it is recent.
Thank you.
Comment 3•2 years ago
That's a lot of shmem.
1,446.02 MB ── resident
2,286.36 MB ── resident-peak
16,480.96 MB ── resident-phys-footprint
1,163.64 MB ── resident-unique
14,639.53 MB ── shmem-allocated
14,703.05 MB ── shmem-mapped
I'm going to move this to Graphics under the assumption that it is the most likely cause of a continuous leak like that.
Do you have any windows minimized? I see that you have the Dark Space addon. Do you have a lightweight theme with an animation?
Maybe this is another instance of bug 1820841, which seemed to be a kind of variation of bug 1828587 showing up on MacOS.
Comment 4•2 years ago
While this likely represents a bug in Firefox that we need to fix, you might be able to avoid this leak by disabling the animated theme.
Comment 5•2 years ago
:ahale, this sounds similar to recent issues you've worked on WRT animated themes causing problems. Is this a dupe, maybe?
Comment 6•2 years ago
(In reply to Erich Gubler [:ErichDonGubler] from comment #5)
> :ahale, this sounds similar to recent issues you've worked on WRT animated themes causing problems. Is this a dupe, maybe?
It sure does look that way; I can't see anything to distinguish this bug from that one other than this being MacOS rather than Windows.
I'm not sure if this is after the fix landed; the reporter's version is 112.0, so it may be before the fix. If it is after the fix, it would imply that the mThrottled check in https://hg.mozilla.org/mozilla-central/rev/6f2d406b412b#l1.102 is not working?
:bradwerth, would you mind attempting repro on this on MacOS? It looks easy to repro, and we need to take action on this if the bug is still around. Repro steps in https://bugzilla.mozilla.org/show_bug.cgi?id=1828587#c71 should still work.
Assignee
Comment 7•2 years ago
It's pretty easy to reproduce by using the Dark Space theme, opening a bunch of windows, and then minimizing them (we don't know that the reporter of this bug did that, but it's quite possible). This is already known, as Andrew commented in bug 1820841, comment 9. It reproduces in nightlies at least as far back as 110, so it predates Brad's work in bug 1768495, which landed during the 112 cycle.
So I guess minimized windows on mac aren't throttled, and aren't painting either.
Comment 8•2 years ago
The bug is marked as tracked for firefox113 (release), tracked for firefox114 (beta) and tracked for firefox115 (nightly). However, the bug still isn't assigned.
:bhood, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit BugBot documentation.
Comment 9•2 years ago
(In reply to Timothy Nikkel (:tnikkel) from comment #7)
> So I guess minimized windows on mac aren't throttled, and aren't painting either.
They should be tho...
Comment 10•2 years ago
Brad, can you poke at this and see if you can come up with any ideas?
Comment 11•2 years ago
Tim, does this repro in an even older nightly (82)? macOS effectively paused the compositor before already, see bug 1580117.
Comment 12•2 years ago
I'm able to reproduce using Tim's approach in comment 7. I'll try to figure out what's going on.
Assignee
Comment 13•2 years ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #11)
> Tim, does this repro in an even older nightly (82)? macOS effectively paused the compositor before already, see bug 1580117.
Yeah, I can reproduce using whatever nightly mozregression launches when I ask for 82.
Assignee
Comment 14•2 years ago
I just checked with printfs in a local build: we do throttle the browser.xhtml parent process and content process refresh drivers in windows that are minimized on macOS. So there must be something more going on here.
Comment 15•2 years ago
When I reproduce and diff the before-and-after memory reports, the growth is in textures/render-texture-hosts.
Comment 16•2 years ago
With an animated theme active, this only reproduces when I have at least one window minimized and one window not minimized. When that is true, calls to RenderThread::RegisterExternalImage continue but there are no corresponding calls to RenderThread::UnregisterExternalImage. This is what's building up the memory reported as textures/render-texture-hosts.
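To illustrate how that register/unregister imbalance turns into the render-texture-hosts growth, here is a hypothetical, simplified sketch; the class, method names, and sizes are invented for illustration and are not the real RenderThread code:

```cpp
// Hypothetical sketch only: unmatched register/unregister calls keep entries
// (and their backing memory) alive in a registry, which is what a
// textures/render-texture-hosts style memory report would show growing.
#include <cstdint>
#include <cstdio>
#include <memory>
#include <unordered_map>
#include <vector>

struct TextureHost {
  std::vector<uint8_t> pixels;  // stand-in for the shared surface's memory
  explicit TextureHost(size_t aBytes) : pixels(aBytes) {}
};

class TextureRegistry {
 public:
  void RegisterExternalImage(uint64_t aId, size_t aBytes) {
    mHosts.emplace(aId, std::make_unique<TextureHost>(aBytes));
  }
  void UnregisterExternalImage(uint64_t aId) {
    mHosts.erase(aId);  // releases the backing memory
  }
  size_t BytesHeld() const {
    size_t total = 0;
    for (const auto& entry : mHosts) {
      total += entry.second->pixels.size();
    }
    return total;
  }

 private:
  std::unordered_map<uint64_t, std::unique_ptr<TextureHost>> mHosts;
};

int main() {
  TextureRegistry registry;
  // Every animation tick registers a new frame, but the matching unregister
  // never arrives while an occluded window still references the old one.
  for (uint64_t frameId = 1; frameId <= 100; ++frameId) {
    registry.RegisterExternalImage(frameId, 4 * 1024 * 1024);
    // registry.UnregisterExternalImage(frameId - 1);  // never happens: leak
  }
  std::printf("bytes held by texture hosts: %zu\n", registry.BytesHeld());
  return 0;
}
```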
Assignee
Comment 17•2 years ago
We keep hitting this case (https://searchfox.org/mozilla-central/rev/e77d89c414eac63a295a2124600045775a6f4715/image/imgFrame.cpp#279) when we are trying to recycle a frame from the animated image. Seems like it is an unexpected interaction with the animated image frame recycling code.
Comment 18•2 years ago
I don't completely understand this yet, but here's what I've discovered about the behavior with an animated theme, at least one visible window, and at least one occluded window:
1. While decoding an animated GIF, we sometimes reuse an external image id. When we do that, we create a ScheduleSharedSurfaceRelease for the old image, which gets posted as a transaction to the Renderer. We also (somewhere) create a transaction to register the new image and, I believe, increment a use count for the external image. I believe that transaction gets posted to all Renderers using that image, including those which might be associated with occluded windows. Through some call chain that I don't understand yet, this transaction is executed on all those Renderers.
2. The ScheduleSharedSurfaceRelease transaction created in Step 1 will only be executed on the Renderer instances when Renderer.update is called and the AppendNotificationRequests message is processed. But Renderer.update has a long causation chain which leads back to WebRenderBridgeParent::MaybeGenerateFrame. In that function, we early exit if the compositor is paused.
3. Since the register transaction from Step 1 is executed on all Renderers and the release transaction is only executed on unoccluded Renderers, the usage counts are never decremented to 0, and the external image is never released until the occluded Renderer no longer has a paused compositor (it becomes unoccluded) and processes all of its pending transactions.
The ideal fix, I think, would be to prevent the register transaction from being posted (or the usage counter from being incremented) on occluded windows. But I don't yet understand that call chain. If somebody does, please post to this bug how that causation works. An alternative fix is to change the behavior of MaybeGenerateFrame so that, when it detects a paused compositor, it calls the Update method on the Renderer before the early exit. Update, unlike UpdateAndRender, should process the pending transactions without generating a frame. I'll try to build a fix that does this.
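To make that alternative fix concrete, here is a minimal sketch of the proposed control flow, using made-up stand-in types rather than the actual WebRenderBridgeParent/Renderer code; the key assumption is that pending transactions can be safely processed without generating a frame:

```cpp
// Illustrative stand-ins only; not the mozilla-central implementation.
struct Renderer {
  void Update() {
    // Process pending transactions (including queued surface-release
    // notifications) without producing a frame.
  }
  void UpdateAndRender() {
    Update();
    // ...then also generate and present a frame.
  }
};

struct Bridge {
  Renderer mRenderer;
  bool mPaused = false;

  void MaybeGenerateFrame() {
    if (mPaused) {
      // Current behavior: early return, so queued releases pile up while the
      // window is occluded. Proposed: drain the queue first, then bail out.
      mRenderer.Update();
      return;
    }
    mRenderer.UpdateAndRender();
  }
};

int main() {
  Bridge bridge;
  bridge.mPaused = true;        // window is minimized / occluded
  bridge.MaybeGenerateFrame();  // pending releases still get processed
  return 0;
}
```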
Comment 19•2 years ago
I also have no understanding why this is an issue in macOS and not on other platforms. It's possible that the image decode step is handled differently and somehow the register transaction is permitted on macOS and not on other platforms.
Comment 20•2 years ago
Comment 21•1 years ago
(In reply to Brad Werth [:bradwerth] from comment #19)
> I also have no understanding why this is an issue in macOS and not on other platforms. It's possible that the image decode step is handled differently and somehow the register transaction is permitted on macOS and not on other platforms.
Isn't this the same issue as reported here? https://bugzilla.mozilla.org/show_bug.cgi?id=1828587
Comment 22•1 years ago
> I also have no understanding why this is an issue in macOS and not on other platforms.
(I'm not a developer & not sure.)
Don't Mac and Wayland have a different kind of vsync than Windows and X11? bug 1829493 (114), bug 1797978 (108: sounds similar), bug 1786247 (107: sounds similar), bug 1765399 (102), bug 1759234 (100), bug 1645528 (86: "widget-local vsync source"), bug 1542808 (72)
Only Mac and Linux have gfx.canvas.accelerated=true:
bug 1831548 (async RemoteTexture = Early Beta or earlier) makes Linux run out of file descriptors, which sometimes crashes Firefox.
Assignee
Comment 23•1 years ago
A little bit of context on what I've figured out, in case that helps. There is one RasterImage object for the animated image in the theme; it is shared by all of the windows. So when the refresh driver ticks in the one visible window, we advance the frame of that one image object that those minimized windows are pointing to as well. We share the image frame with the gpu process so it can paint it. Then the gpu process is supposed to send a message back to the process that sent it the animated image frame after it is done with it. The gpu process doesn't send the "finished with frame n" msg. I'm fuzzy on the exact mechanism this uses, but it's something to do with the gpu process knowing those minimized windows contain the same animated image and still need to present that image frame (and the logic doesn't take paused compositors into account in some way).
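A hypothetical model of that handshake (not the real IPC code) may help: the decoding process keeps a frame alive until every window that was handed the frame reports it is finished with it, so a window whose compositor never composites never sends its ack and the frame is never released.

```cpp
// Hypothetical model of the "finished with frame n" handshake; names and
// structure are invented for illustration.
#include <cstdint>
#include <map>
#include <set>

class FrameTracker {
 public:
  // Record which windows were handed this frame; each must acknowledge it.
  void SendFrameTo(uint64_t aFrameId, const std::set<int>& aWindowIds) {
    mOutstanding[aFrameId] = aWindowIds;
  }

  // A window reports it is done presenting the frame.
  void FinishedWithFrame(uint64_t aFrameId, int aWindowId) {
    auto it = mOutstanding.find(aFrameId);
    if (it == mOutstanding.end()) {
      return;
    }
    it->second.erase(aWindowId);
    if (it->second.empty()) {
      mOutstanding.erase(it);  // all acks received: the frame can be freed
    }
  }

  size_t FramesStillHeld() const { return mOutstanding.size(); }

 private:
  std::map<uint64_t, std::set<int>> mOutstanding;
};

int main() {
  FrameTracker tracker;
  tracker.SendFrameTo(/*frame*/ 1, /*windows*/ {0, 1});  // window 1 is minimized
  tracker.FinishedWithFrame(1, /*window*/ 0);  // only the visible window acks
  // Window 1's compositor is paused and never acks, so frame 1 stays held.
  return static_cast<int>(tracker.FramesStillHeld());  // 1
}
```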
Comment 24•1 years ago
FYI our last beta builds tomorrow and ships on Friday. We build our RC on Monday; I would take mitigation patches in the RC.
Comment 25•1 years ago
Some insight into why we're allocating a frame in this case. When we have an animated theme and at least one window visible and one occluded, we always choose a recycled frame that is the restore frame. I'm not sure why -- not familiar with this logic. This causes us to allocate a new frame for each frame, thus the memory leak. I'll try to figure out what we can do in this case. Tim, do you understand why we are always hitting this code path in this case?
Comment 26•1 years ago
I think we should probably back out bug 1768495 again from beta. It's unfortunate, but it's way less risky than adding a whole new code path to mitigate this IMO, especially close to RC.
Assignee
Comment 27•1 years ago
(In reply to Brad Werth [:bradwerth] from comment #25)
> Some insight into why we're allocating a frame in this case. When we have an animated theme and at least one window visible and one occluded, we always choose a recycled frame that is the restore frame. I'm not sure why -- not familiar with this logic. This causes us to allocate a new frame for each frame, thus the memory leak. I'll try to figure out what we can do in this case. Tim, do you understand why we are always hitting this code path in this case?
Hmm, that's not quite the behaviour that I'm observing. In the code you linked, blocked is false for me and we go into InitForDecoderRecycle. However, InitForDecoderRecycle fails, so blocked does end up true and we still go on to allocate a new frame.
InitForDecoderRecycle fails because we can't recycle a frame if it is still being held onto by someone. Similar to comment 17, at this point we have one ref for each open window (since each open window points to the same image), so we can't recycle the old frame because those minimized windows still have a reference to the frame.
Recycling of frames is not really the cause, because I can still reproduce with the pref image.animated.decode-on-demand.recycle set to false, which should disable recycling.
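A hedged illustration of why a frame that is still referenced forces a fresh allocation on every animation tick; this uses std::shared_ptr as a stand-in for the real frame ownership and is not the actual recycling code:

```cpp
// Illustration only: a frame can only be recycled if nothing else still
// holds it; otherwise a brand-new frame is allocated each tick.
#include <cstdint>
#include <memory>
#include <vector>

struct Frame {
  std::vector<uint8_t> pixels;
  explicit Frame(size_t aBytes) : pixels(aBytes) {}
};

std::shared_ptr<Frame> RecycleOrAllocate(const std::shared_ptr<Frame>& aOld,
                                         size_t aBytes) {
  // Recycling in place is only safe when the caller holds the sole reference.
  if (aOld && aOld.use_count() == 1) {
    return aOld;
  }
  return std::make_shared<Frame>(aBytes);  // still held elsewhere: allocate
}

int main() {
  std::vector<std::shared_ptr<Frame>> heldByMinimizedWindows;
  std::shared_ptr<Frame> current;
  for (int tick = 0; tick < 10; ++tick) {
    current = RecycleOrAllocate(current, 4 * 1024 * 1024);
    // A minimized window keeps a reference, so the next tick cannot recycle.
    heldByMinimizedWindows.push_back(current);
  }
  // 10 distinct frames were allocated; none could be recycled.
  return 0;
}
```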
Assignee
Comment 28•1 years ago
And we allocate a new frame because the image in question is large enough that we don't keep around every frame of the animated image. The code that handles this operates on the assumption that old frames will get dropped when we are finished with them, so it doesn't expect to wrap around and start decoding new copies of frames that are still being held somewhere.
Comment 29•1 years ago
Thanks for correcting me on the code path through Decoder::AllocateFrameInternal. Indeed, we are failing there for the reason you detail in comment 28, and not due to matching the restore frame, as I proposed in comment 25.
(In reply to Timothy Nikkel (:tnikkel) from comment #27)
> Recycling of frames is not really the cause, because I can still reproduce with the pref image.animated.decode-on-demand.recycle set to false, which should disable recycling.
If this pref is off, it seems that the correct fix is D178551. Since that fix also handles the case when the pref is on, I think we really should land it, and treat the behavior in comment 28 as expected and correct.
Comment 30•1 years ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #26)
> I think we should probably back out bug 1768495 again from beta. It's unfortunate, but it's way less risky than adding a whole new code path to mitigate this IMO, especially close to RC.
(In reply to Pascal Chevrel:pascalc from comment #24)
> FYI our last beta builds tomorrow and ships on Friday. We build our RC on Monday; I would take mitigation patches in the RC.
Pascal, let's back out Bug 1768495 from beta, as Emilio proposes. That will give us more time to find the right fix.
Assignee
Comment 31•1 years ago
I'm moving Pascal's needinfo to bug 1834246, which we will use to handle the backout from beta. This bug can continue to be used to track a long-term fix.
Comment 32•1 years ago
Shared surfaces are given a refcount of 1 at the point of creation, and
then that refcount is increased for each window that displays the surface.
This change ensures that if the compositor associated with a window is
paused, we don't increase the refcount because that paused compositor will
never display the surface and decrease the refcount.
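A sketch of the guard this patch summary describes, with invented names standing in for the real WebRender shared-surface code; the assumption is that a paused compositor will never present the surface, so taking a reference on its behalf only strands the refcount:

```cpp
// Invented names; this only sketches the refcounting rule described above.
#include <atomic>
#include <vector>

struct SharedSurface {
  std::atomic<int> mRefCount{1};  // starts at 1 at the point of creation
};

struct Window {
  bool mCompositorPaused;
};

void AddSurfaceToWindows(SharedSurface& aSurface,
                         const std::vector<Window>& aWindows) {
  for (const Window& window : aWindows) {
    if (window.mCompositorPaused) {
      // A paused compositor will never display the surface, so it would
      // never drop a reference either; don't take one for it.
      continue;
    }
    aSurface.mRefCount.fetch_add(1, std::memory_order_relaxed);
  }
}

int main() {
  SharedSurface surface;
  std::vector<Window> windows = {{false}, {true}};  // second window is occluded
  AddSurfaceToWindows(surface, windows);
  // Refcount is 2 (creator + visible window), not 3; the occluded window is
  // skipped because it will never composite while paused.
  return 0;
}
```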
Comment 33•1 years ago
I think D179121 solves the problem correctly, but Bug 1835127 is confusing the issue since it's very easy to crash the browser while testing it.
Assignee
Comment 34•1 years ago
Comment 35•1 years ago
Tim, is D179393 to be landed in conjunction with the earlier patch D179121, or instead of it?
Assignee
Comment 36•1 years ago
D179393 solves the memory use increase by itself in my testing. D179393 stops the updates from being sent from the content process, so we avoid any work in the gpu process.
D179121 could prevent any updates that do slip through and get to the compositor through different paths (any that exist currently that we aren't aware of, or that are created in the future) from causing problems.
Comment 37•1 years ago
Thank you, let's land them both. I'll take them out of WIP when Bug 1835127 is resolved.
Comment 38•1 years ago
:bradwerth, 115 goes to beta next week (Monday 2023-06-05).
Any concerns about the timing for landing the fix in Bug 1835127 and then here in Bug 1830753?
The regressor, Bug 1768495, introduced in Fx112, was backed out of Fx113 beta and Fx114 beta. I was wondering if we'll need to do the same for Fx115.
Comment 39•1 years ago
To hasten this along, I'm going to remove Bug 1835127 as a blocker. It's a related issue that I believe has no effect outside of debug builds. Let's try to get the patches in this Bug approved and landed.
Comment 40•1 years ago
Setting 115 to fixed; the initial patch that introduced the regression was backed out of beta by uplifting Bug 1836699.
Setting 116 to affected and tracked as blocking; the patches from Bug 1768495 are still in central.
As mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1836699#c3, at this point we should consider also backing out in central until we have everything needed in terms of landing the fixes.
Comment 41•1 years ago
Tim has the last surviving patch for this bug, so it should be assigned to Tim.
Comment 42•1 years ago
Assignee
Comment 43•1 years ago
I filed bug 1838111 to explore a more general solution.
Comment 44•1 years ago
bugherder
Comment 45•1 year ago
I think we want QA to do another round of testing on 116 with this patch now landed.