Closed Bug 1741997 Opened 11 days ago Closed 10 days ago

DMABufSurface::FenceDelete should not require GL context to close mSyncFd handle

Categories

(Core :: Widget: Gtk, defect, P1)

Desktop
Linux
defect

Tracking

()

RESOLVED FIXED
96 Branch
Tracking Status
relnote-firefox --- 94+
firefox-esr91 --- fixed
firefox94 + fixed
firefox95 + fixed
firefox96 --- verified

People

(Reporter: aosmond, Assigned: aosmond)

References

Details

Attachments

(1 file)

We might not have a GLContext set in DMABufSurface::mGL if we create it via the DMABufSurface::ImportSurfaceDescriptor method. Since we have a check for a null mGL before freeing the handle, this could result in a leak:

https://searchfox.org/mozilla-central/rev/3881c4ca80d1b4b2f43be695438ecaf90ee4f86c/widget/gtk/DMABufSurface.cpp#214

I was able to confirm this happens sometimes (but not always) in a debug build on Linux + Intel just by moving the close earlier and printing the GL context pointer value (it was sometimes null).

We can create DMABufSurface objects via deserialization without ever
setting a GL context. We must ensure we close the file handle even if we
don't have one.

Pushed by aosmond@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/80edabb0ac9e
DMABufSurface::FenceDelete should not require GL context to close mSyncFd handle. r=stransky

This reproduces readily for me by opening a WebGL demo (e.g. https://webglsamples.org/aquarium/aquarium.html) and then opening a new tab, backgrounding the demo. A whole bunch of DMABufSurface objects with valid mSyncFd handles were released without a GL context.

Blocks: 1739924

Bug 1739924 comment 61 confirmed the patches stops the file handle leaks.

Status: NEW → RESOLVED
Closed: 10 days ago
Resolution: --- → FIXED
Target Milestone: --- → 96 Branch

(In reply to Andrew Osmond [:aosmond] (he/him) from comment #4)

Bug 1739924 comment 61 confirmed the patches stops the file handle leaks.

I can confirm that too.

[Tracking Requested - why for this release]: Reports in bug 1739924 indicate that this fixes that bug, so copying from there (see bug 1739924 comment 5): Seems like a pretty serious regression in Linux stability. There are a remarkable number of comments on these crash reports, more than I've seen on crashes in a long time, and lots of them are talking about how Firefox started crashing very frequently for them, in the last week or so.

Severity: S3 → S2

Tracking for 94 and 95. We have a 94 planned dot release on Monday so that can't make this one but for 95, I would definitely take an uplift in our last betas next week.

Changing the priority to p1 as the bug is tracked by a release manager for the current release.
See What Do You Triage for more information

Priority: P3 → P1

Comment on attachment 9251475 [details]
Bug 1741997 - DMABufSurface::FenceDelete should not require GL context to close mSyncFd handle.

Beta/Release Uplift Approval Request

  • User impact if declined: Users with EGL and DMABuf enabled (all Wayland and X11+EGL users which the majority likely get with the release of 94) will experience file handle leaks when background tabs have WebGL canvas instances. Eventually they will run out and the process will crash.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The code would be covered by automated tests except for the fact that our Mesa version for our Linux instances is too old and so we don't take this path. The CI team is aware that we need a newer image for Linux to test various features (distros had already chosen of their own accord to ship Wayland in newer releases for example).

I don't believe there is any meaningful risk to accepting this patch. I changed the order of releasing some resources to ensure the file handle in question is always closed. There is no dependency between them as they are completely independent. The leak only happens when the resource was not used for rendering (hence why the tab needs to be backgrounded), and thus the buffer state is very simple -- it has some file handles that need to be closed and that's about it. We only release the resources when we are done with the buffer.

The alternative for users is to disable WebGL + DMABuf via a pref change (set widget.dmabuf-webgl.enabled to false). This will cause us to fall back to reading the pixels out of the WebGL canvas into a shared memory buffer. Performance is not as good, but we lived with it up until 94. I am not sure how many users were testing EGL without WebGL + DMABuf however since EGL more or less implies DMABuf for the users we are shipping to, and there is a runtime check to see if DMABuf is supported (we don't have any telemetry on this). There is no reason to believe it will cause problems however since that is our fallback path for all platforms.

  • String changes made/needed:
Attachment #9251475 - Flags: approval-mozilla-release?
Attachment #9251475 - Flags: approval-mozilla-beta?

Comment on attachment 9251475 [details]
Bug 1741997 - DMABufSurface::FenceDelete should not require GL context to close mSyncFd handle.

Approved for 95 Beta 10. P1/S2 regression, approved and verified on nightly. Thanks.

Attachment #9251475 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
See Also: → 1725245

Comment on attachment 9251475 [details]
Bug 1741997 - DMABufSurface::FenceDelete should not require GL context to close mSyncFd handle.

Approved for 94.0.2, thanks.

Attachment #9251475 - Flags: approval-mozilla-release? → approval-mozilla-release+

(Martin Stránský [:stransky] (ni? me) from comment #6)

(In reply to Andrew Osmond [:aosmond] (he/him) from comment #4)

Bug 1739924 comment 61 confirmed the patches stops the file handle leaks.

I can confirm that too.

Also verified with Nightly 20211119093910.
Clicking on "Restart to apply update" in the Main menu does not seem to fully close the old Nightly, therefore old file handles keep open.
It is necessary to close and re-open Nightly and then it's fine.

Andrew, should we mention that fix in 94.0.2 release notes? Any suggested wording? Thanks

Flags: needinfo?(aosmond)

(In reply to Pascal Chevrel:pascalc from comment #16)

Andrew, should we mention that fix in 94.0.2 release notes?

I know I'm not the Andrew you meant (haha) but I think we should. I saw a lot of comments in the crash reports about people saying they were going to have to switch to Chromium because of how much Firefox was crashing. It makes the browser unusable enough that it would be good that we could signal to those people having this issue that Firefox is usable again.

Please consider uplifting bug 1735905 as well to serve all downstream builds equally.

(In reply to Pascal Chevrel:pascalc from comment #16)

Andrew, should we mention that fix in 94.0.2 release notes? Any suggested wording? Thanks

Given how many encountered it, probably. I'm not sure how we typically phrase crashes, but something like the lines of "Resolved general instability/crashes on Linux caused by a file descriptor leak when backgrounding tabs using WebGL" is accurate.

Flags: needinfo?(aosmond)

(In reply to Andrew Osmond [:aosmond] (he/him) from comment #19)

(In reply to Pascal Chevrel:pascalc from comment #16)

Andrew, should we mention that fix in 94.0.2 release notes? Any suggested wording? Thanks

Given how many encountered it, probably. I'm not sure how we typically phrase crashes, but something like the lines of "Resolved general instability/crashes on Linux caused by a file descriptor leak when backgrounding tabs using WebGL" is accurate.

Thanks, I am using this wording.

Comment on attachment 9251475 [details]
Bug 1741997 - DMABufSurface::FenceDelete should not require GL context to close mSyncFd handle.

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Some Firefox packages appear to turn on EGL and DMABuf by default on ESR, and we are seeing this reflected in the crash rates. Uplifting this will reduce the noise in the crash stats on Linux.
  • User impact if declined: If they are using one of these Firefox ESR snaps in question, they will encounter general instability and difficult to trace crashes due to the lack of file descriptors.
  • Fix Landed on Version: 96 (uplifted to 94 and 95)
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): See comment 10. It may not apply cleanly due to a formatting change in DMABufSurface::FenceDelete, but should be trivial to resolve.
  • String or UUID changes made by this patch: N/A
Attachment #9251475 - Flags: approval-mozilla-esr91?

Comment on attachment 9251475 [details]
Bug 1741997 - DMABufSurface::FenceDelete should not require GL context to close mSyncFd handle.

Approved for 91.4esr.

Attachment #9251475 - Flags: approval-mozilla-esr91? → approval-mozilla-esr91+

Hello. Perfherder has detected this improvement alert. I am not sure if this bug caused it or Bug 1740924. Can you provide some information about the changes in order to determine the right bug for the improvement? I think you worked on both changes that might be the cause for this improvement (4f21ae5601ae and cb44e5e99ad4). Thanks.

Flags: needinfo?(pchevrel)

Andrew, do you think this could have caused base content explicit memory to improve? Given that this fixes some kind of leak it seems somewhat plausible. More so than the other bug. I'll look at the about:memory diffs to see what the improvement is, if I can.

Flags: needinfo?(pchevrel) → needinfo?(aosmond)

Looking at the diff in about:memory for the child process, I'm not really sure what the difference is about. Maybe churn got reduced so there's less fragmentation or something? Very odd.

-3.50 MB (100.0%) -- explicit
├──-3.20 MB (91.42%) ++ heap-overhead
├──-0.18 MB (05.22%) ── heap-unclassified
├──-0.13 MB (03.71%) -- threads
│  ├──-0.13 MB (03.57%) ++ stacks
│  └──-0.00 MB (00.14%) ── overhead/wrappers [8]
└───0.01 MB (-00.35%) ++ (5 tiny)

Parent process memory went up by 50MB, mostly in heap-unclassified, which is odd (but this test doesn't track that).

I suppose the file descriptor is keeping a GL resource alive and releasing it could impact memory usage in some minor way. The allocation would exist in the content process on Linux for WebGL instances since we don't use OOP right now. I wouldn't expect there to be any new interaction with the parent process.

Flags: needinfo?(aosmond)
You need to log in before you can comment on or make changes to this bug.