Open Bug 1918983 Opened 1 month ago Updated 10 days ago

ABORT: failed to re-open freezeable shm: Too many open files: file /builds/worker/checkouts/gecko/ipc/chromium/src/base/shared_memory_posix.cc:345)

Categories

(Core :: IPC, defect)


People

(Reporter: whimboo, Unassigned)

References

(Blocks 1 open bug)

Details

Seen in a webdriver web-platform test for Firefox 128 when triggering a navigation to https://web-platform.test:8443/webdriver/tests/bidi/network/support/empty.html :

https://treeherder.mozilla.org/logviewer?job_id=473050010&repo=mozilla-esr128&lineNumber=164533-164583

[task 2024-09-05T13:07:47.613Z] 13:07:47     INFO - PID 1602 | [Parent 1625, Main Thread] ###!!! ABORT: failed to re-open freezeable shm: Too many open files: file /builds/worker/checkouts/gecko/ipc/chromium/src/base/shared_memory_posix.cc:345
[task 2024-09-05T13:07:47.613Z] 13:07:47     INFO - STDOUT: Initializing stack-fixing for the first stack frame, this may take a while...
[task 2024-09-05T13:08:11.273Z] 13:08:11     INFO - PID 1602 | #01: NS_DebugBreak [xpcom/base/nsDebugImpl.cpp:469]
[task 2024-09-05T13:08:11.273Z] 13:08:11     INFO - PID 1602 | #02: base::SharedMemory::CreateInternal(unsigned long, bool) [ipc/chromium/src/base/shared_memory_posix.cc:0]
[task 2024-09-05T13:08:11.274Z] 13:08:11     INFO - PID 1602 | #03: mozilla::ipc::MemMapSnapshot::Init(unsigned long) [dom/ipc/MemMapSnapshot.cpp:19]
[task 2024-09-05T13:08:11.275Z] 13:08:11     INFO - PID 1602 | #04: mozilla::dom::ipc::WritableSharedMap::Serialize() [dom/ipc/SharedMap.cpp:307]
[task 2024-09-05T13:08:11.275Z] 13:08:11     INFO - PID 1602 | #05: mozilla::dom::ipc::WritableSharedMap::BroadcastChanges() [dom/ipc/SharedMap.cpp:364]
[task 2024-09-05T13:08:11.276Z] 13:08:11     INFO - PID 1602 | #06: mozilla::dom::MozWritableSharedMap_Binding::flush(JSContext*, JS::Handle<JSObject*>, void*, JSJitMethodCallArgs const&) [s3:gecko-generated-sources:dd0ac67eb9a51c28b612af09a8cd4f73062fd999b56abf09c458191f32d43a6adaaed15f1b85827b297020094a288a1601a4fcc0ccbaad657920e4f7add05328/dom/bindings/MozSharedMapBinding.cpp::1538]
[task 2024-09-05T13:08:11.277Z] 13:08:11     INFO - PID 1602 | #07: mozilla::dom::binding_detail::GenericMethod<mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::binding_detail::ThrowExceptions>(JSContext*, unsigned int, JS::Value*) [dom/bindings/BindingUtils.cpp:3270]

Given that this causes a hang in Firefox, I wonder if it might be related to bug 1832294, where we see similar hangs once in a while when navigating via the WebDriver BiDi browsingContext.navigate command.

See Also: → 1832294
Component: DOM: Content Processes → IPC

Hmm, though, I'm not sure if this is really about IPC, but bug 1463587 was filed there.

This is an exception raised due to file descriptor exhaustion, which does line up with the other "Failed to duplicate file handle for current process!" errors before and after this line.

If this is happening reliably, there's a chance that we have some kind of file descriptor leak in that test case, which would be interesting to isolate and track down. Given that we don't see similar logs in other hangs, it seems unlikely that the other timeouts are due to fd exhaustion.

These exact logs are unlikely to show up on Linux, though, as we use a slightly different shared memory backend there, so perhaps the errors could end up looking slightly different.
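As a starting point for isolating this, one option would be to sample the open fd counts of the Firefox processes while the test loops and watch whether they climb steadily. A minimal sketch, assuming psutil is installed and matching processes by name (both of which are assumptions on my part, not part of the wpt harness):

import time

import psutil  # assumption: psutil is available in the local environment

def watch_firefox_fds(interval=5.0, duration=600):
    """Print the open fd count of every process whose name contains 'firefox'."""
    end = time.time() + duration
    while time.time() < end:
        for proc in psutil.process_iter(["name"]):
            try:
                name = proc.info["name"] or ""
                if "firefox" in name.lower():
                    print(f"{time.strftime('%H:%M:%S')} pid={proc.pid} fds={proc.num_fds()}")
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass
        time.sleep(interval)

if __name__ == "__main__":
    watch_firefox_fds()

If the parent process's count climbs toward the soft limit over repeated navigations, that would point at a leak in the browser rather than in the harness.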

Severity: -- → S3

So maybe it's related to https://github.com/web-platform-tests/wpt/issues/27072. It's not clear to me where exactly in the stack the file handle exhaustion originates. Maybe it's wptserve, given that this is the tool that runs the whole time when executing web-platform tests.

For me it's fairly easy to reproduce, and I mentioned the steps here; it basically comes down to running the following mach command and letting it run for a while (assuming the test doesn't fail due to another assertion):

mach wpt --webdriver-binary=target/debug/geckodriver --webdriver-arg=-vv testing/web-platform/tests/webdriver/tests/switch_to_frame/switch.py --repeat-until-unexpected

I would appreciate some feedback on how to figure out where exactly we actually leak file handles.
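One thing that might help narrow it down (a sketch, assuming lsof is available; the pid would be the parent process id from the log, e.g. 1625 above): snapshot lsof for that pid and tally the TYPE column, then compare a snapshot from early in the run against one taken after many iterations to see which kind of descriptor is growing.

import subprocess
import sys
from collections import Counter

def fd_types(pid):
    # lsof's default columns are COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME,
    # so the TYPE column is the fifth whitespace-separated field.
    out = subprocess.run(["lsof", "-p", str(pid)],
                         capture_output=True, text=True).stdout
    counts = Counter()
    for line in out.splitlines()[1:]:
        fields = line.split()
        if len(fields) > 4:
            counts[fields[4]] += 1
    return counts

if __name__ == "__main__":
    for fd_type, count in fd_types(int(sys.argv[1])).most_common():
        print(f"{count:6d}  {fd_type}")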

Blocks: 1906583

Jed, if you could give some advice that would be great. I see quite a lot of these failures for Wd jobs in CI. Thanks.

Flags: needinfo?(jld)

I just ran lsof on my Mac, and the firefox process has a lot of file descriptors listed with type rte, which the man page says are AF_ROUTE sockets. I wouldn't expect us to need more than one, let alone >400 of them, so maybe Necko has a leak? Indeed, it looks like the socket opened here (thanks, searchfox) is never closed. I'll file a bug.

Also, I don't think the wpt github issue is related — that's reporting fd exhaustion in a Python process that runs Firefox, and it's the error code for exceeding the per-process limit rather than the systemwide limit (so excessive fd use in Firefox wouldn't contribute to it).
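As an aside on the per-process vs. systemwide distinction: the limit a single process runs into ("Too many open files", EMFILE) is RLIMIT_NOFILE, which can be inspected from Python with the standard resource module. This is purely illustrative; it isn't something the wpt harness is known to adjust:

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"per-process fd limit: soft={soft} hard={hard}")

# The soft limit can be raised up to the hard limit without extra privileges;
# 4096 here is an arbitrary illustrative target, capped at the hard limit.
target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    print(f"raised soft limit to {target}")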

Flags: needinfo?(jld)