Instantiating a large number of wasm library sandboxes crashes due to overuse of file descriptors
Categories
(Core :: Security: Process Sandboxing, defect, P1)
Tracking
()
People
(Reporter: gsvelto, Assigned: shravanrn)
References
Details
(Keywords: crash)
Crash Data
Attachments
(1 file)
Crash report: https://crash-stats.mozilla.org/report/index/32813feb-7bfc-4cc1-84dc-b52720201013
MOZ_CRASH Reason: called `Result::unwrap()` on an `Err` value: Sys(EMFILE)
Top 9 frames of crashing thread:
0 XUL RustMozCrash mozglue/static/rust/wrappers.cpp:17
1 XUL core::ops::function::Fn::call src/libcore/ops/function.rs:72
2 XUL std::panicking::rust_panic_with_hook src/libstd/panicking.rs:474
3 XUL rust_begin_unwind src/libstd/panicking.rs:378
4 XUL core::panicking::panic_fmt src/libcore/panicking.rs:85
5 XUL core::option::expect_none_failed src/libcore/option.rs:1211
6 XUL lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc third_party/rust/lucet-wasi-wasmsbx/src/ctx.rs:127
7 XUL lucet_wasi::ctx::WasiCtxBuilder::new third_party/rust/lucet-wasi-wasmsbx/src/ctx.rs:23
8 XUL lucet_load_module third_party/rust/rlbox_lucet_sandbox/src/create.rs:28
It seems that Lucet sandbox initialization is often running out of file descriptors on macOS. Could we be leaking some?
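One quick way to check the leak hypothesis outside the crash reporter is to count the process's open descriptors before and after a sandbox create/destroy cycle. A minimal sketch (illustrative, not Firefox code), assuming the `/dev/fd` listing available on both Linux and macOS:

```rust
use std::fs;

// Count this process's open file descriptors by listing /dev/fd
// (available on both Linux and macOS). Reading the directory itself
// temporarily opens one fd, which we subtract out.
fn open_fd_count() -> usize {
    fs::read_dir("/dev/fd")
        .map(|entries| entries.count().saturating_sub(1))
        .unwrap_or(0)
}

fn main() {
    let before = open_fd_count();
    // ... create and destroy a sandbox here ...
    let after = open_fd_count();
    println!("fds before: {}, after: {}", before, after);
    assert_eq!(before, after, "descriptor leak suspected");
}
```

If the count grows across cycles, descriptors are outliving the sandboxes that created them.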
Comment 1•5 years ago
Tom, is this something you know about? Or maybe gcp knows who knows about rlbox?
(Separately, is there a better component than Core: General for this kind of bug?)
Updated•5 years ago
Updated•5 years ago
Assignee
Comment 2•5 years ago
Apologies for the late response. Let me look through the code and see what this is doing. Will follow up with an update shortly.
Assignee
Comment 3•5 years ago
Hmm... at first glance it does look at least possible. I'll do some further debugging to see if I can repro the issue.
Assignee
Comment 4•5 years ago
So, on testing the rlbox code in isolation on Linux, I am unable to reproduce the issue --- all file resources are cleaned up appropriately on the destruction of the sandbox.
I am curious though if there is any way to get more of the backtrace --- this unfortunately ends after 8 frames. Right now RLBox is deployed in a couple of locations in Firefox --- graphite and ogg. I am curious which one is causing the above issue.
That said, I am trying to think through ways we can debug this.
@tom - do you have any thoughts/could you cc folks who may be able to offer any debugging strategies?
Reporter
Comment 5•5 years ago
I believe the stack trace is the way it is because of the sandbox, so I doubt we can get a better one. However, the URLs all seem to point to the same site (or its Facebook page):
I've opened a bunch of reports and all have Flash installed, so I guess those pages are using it - and maybe it's indirectly causing the bug. All the crashing installations are on ESR on macOS, so maybe this is already fixed in our release channel.
Assignee
Comment 6•5 years ago
Thanks for the links. I'll do some testing on those sites specifically as well.
The rlbox wasm sandbox does not use a separate stack, so the backtrace should be fine. Just as a sanity check, is there a way to confirm that there are no limitations on backtrace collection on Macs?
Assignee
Comment 7•5 years ago
I was able to repro the issue on a Mac. Will continue investigating tomorrow.
Assignee
Comment 8•5 years ago
Sorry for the delay, I have been having some issues with getting a hold of a working Mac. But I am all set now, and I am looking into this over the next couple of days.
Assignee
Updated•5 years ago
Assignee
Comment 9•5 years ago
I believe I have identified a fix for this issue. I will be testing and submitting a patch for this shortly.
Assignee
Updated•5 years ago
Assignee
Comment 10•5 years ago
I also identified that this affects both ESR and regular versions of Firefox, on both macOS and Linux, and have updated the bug accordingly.
@tom, @bholley: I am not too familiar with how patches get picked up for ESR, but this would need to be included. Could you advise on the process here / cc any relevant folks?
Comment 11•5 years ago
If the patch applies cleanly to both -central and ESR, just submit the patch via phab normally. If it needs a separate patch, you can attach it as a file to the bug.
Assignee
Updated•5 years ago
Assignee
Updated•5 years ago
Assignee
Comment 12•5 years ago
Assignee
Comment 13•5 years ago
Have submitted a patch that fixes this. Adding a bit more detail to this bug.
Instantiating a wasm library sandbox duplicates a file descriptor for /dev/null three times, to be used as the input, output, and error streams for the wasm sandboxed code. When many sandboxes are created and destroyed, many descriptors are duplicated and closed. While this should be fine, POSIX does not seem to be happy with the opening and closing of so many file descriptors --- this could perhaps be some strange interaction with Firefox's seccomp filters and cross-process file descriptor handling, as it is difficult to repro outside of Firefox.
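The straightforward way such an EMFILE arises is when per-sandbox duplicates outlive the sandbox. A hedged sketch of that failure mode (the actual bug was subtler, since the descriptors appeared to be closed): dup a /dev/null descriptor repeatedly, as one per-sandbox stdio stream would, and never release the duplicates.

```rust
use std::fs::File;
use std::io;

// Simulate sandboxes that each dup /dev/null but never close the
// duplicate: File::try_clone() dups the underlying descriptor, and
// pushing the clone into `leaked` keeps it open. Returns how many
// dups succeeded, plus the error (EMFILE on most systems) if the
// per-process fd limit was hit within `max_attempts`.
fn leak_until_exhausted(max_attempts: usize) -> (usize, Option<io::Error>) {
    let dev_null = File::open("/dev/null").expect("open /dev/null");
    let mut leaked = Vec::with_capacity(1024);
    for _ in 0..max_attempts {
        match dev_null.try_clone() {
            Ok(dup) => leaked.push(dup),
            // All leaked fds are closed here as `leaked` is dropped.
            Err(err) => return (leaked.len(), Some(err)),
        }
    }
    (leaked.len(), None) // limit not reached within max_attempts
}

fn main() {
    let (count, err) = leak_until_exhausted(1 << 20);
    println!("stopped after {} duplicated descriptors: {:?}", count, err);
}
```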
However, the simpler fix here was to eliminate the duplication of /dev/null and instead return an error when the input, output, or error streams are accessed by wasm sandboxed code. This means calls to printf will fail, but no code I know of actually checks the int error code returned by printf, so this change is compatible with existing sandboxed components.
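A hypothetical sketch of the shape of that fix (the names and the fd-table layout are illustrative, not the actual lucet-wasi patch): the stdio slots in the WASI fd table hold no host descriptor at all, so nothing is dup'd per sandbox, and reads/writes on them fail with a WASI errno.

```rust
// Hypothetical fd-table sketch (not the actual lucet-wasi patch).
// Stdio slots hold no host descriptor, so nothing is dup'd per
// sandbox and nothing can leak; accessing them returns an error.

const WASI_EBADF: u16 = 8; // WASI's "bad file descriptor" errno

enum FdEntry {
    Host(std::fs::File), // a real host-backed descriptor
    Unavailable,         // stdin/stdout/stderr under this fix
}

fn fd_write(table: &[FdEntry], fd: usize, buf: &[u8]) -> Result<usize, u16> {
    match table.get(fd) {
        // Real I/O elided; pretend the whole buffer was written.
        Some(FdEntry::Host(_)) => Ok(buf.len()),
        // printf inside the sandbox sees a negative return value,
        // which almost no C code checks, so components keep working.
        _ => Err(WASI_EBADF),
    }
}

fn main() {
    let table = vec![
        FdEntry::Unavailable, // stdin
        FdEntry::Unavailable, // stdout
        FdEntry::Unavailable, // stderr
    ];
    assert_eq!(fd_write(&table, 1, b"hello\n"), Err(WASI_EBADF));
}
```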
Comment 14•5 years ago
The severity field is not set for this bug.
:gcp, could you have a look please?
For more information, please visit auto_nag documentation.
Updated•5 years ago
Comment 15•5 years ago
Comment 16•5 years ago
On Linux, a sandboxed open call will briefly use two fds (while sending the broker request, and immediately after receiving a successful response), but after it returns there is only the returned fd, and it also holds only one fd while blocked on the broker's reply. On macOS, there's no difference with sandboxing; the filesystem policy is interpreted by the kernel.
It's interesting that we see crash reports only on macOS: the per-process fd limit we request there should be at least as large as on Linux, if I recall correctly. There's a systemwide fd limit that's relatively low compared to other Unixes, but in that case we should see ENFILE rather than EMFILE, and we don't. (However, macOS may not be using error codes correctly.) I don't have a good explanation for this.
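The two limits being distinguished here can be inspected directly. A Linux-only sketch (macOS exposes these via getrlimit/sysctl instead), assuming the standard /proc paths:

```rust
use std::fs;

// Linux-only: the per-process limit (exceeding it yields EMFILE)
// appears in /proc/self/limits as the "Max open files" row; the
// systemwide limit (exceeding it yields ENFILE) is the single
// number in /proc/sys/fs/file-max.
fn per_process_fd_limit_line() -> Option<String> {
    fs::read_to_string("/proc/self/limits").ok().and_then(|contents| {
        contents
            .lines()
            .find(|line| line.starts_with("Max open files"))
            .map(str::to_string)
    })
}

fn systemwide_fd_limit() -> Option<u64> {
    fs::read_to_string("/proc/sys/fs/file-max")
        .ok()
        .and_then(|s| s.trim().parse().ok())
}

fn main() {
    println!("per-process: {:?}", per_process_fd_limit_line());
    println!("system-wide: {:?}", systemwide_fd_limit());
}
```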
Comment 17•5 years ago
bugherder
Updated•5 years ago
Assignee
Comment 18•5 years ago
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #16)
> On Linux, a sandboxed open call will briefly use two fds (while sending the broker request, and immediately after receiving a successful response), but after it returns there is only the returned fd, and it also holds only one fd while blocked on the broker's reply. On macOS, there's no difference with sandboxing; the filesystem policy is interpreted by the kernel.
> It's interesting that we see crash reports only on macOS: the per-process fd limit we request there should be at least as large as on Linux, if I recall correctly. There's a systemwide fd limit that's relatively low compared to other Unixes, but in that case we should see ENFILE rather than EMFILE, and we don't. (However, macOS may not be using error codes correctly.) I don't have a good explanation for this.
Ah interesting! Thanks for the details here. I think this may actually be useful for some additional considerations of how best to handle wasm runtimes and file descriptors going forward. My larger takeaway is that, ideally, wasm runtimes should avoid creating file descriptors in the first place, as that sidesteps this scalability pitfall.
Comment 19•5 years ago
FYI, this patch doesn't graft cleanly to ESR78. If you intend to request uplift approval for it, you'll need to attach a rebased patch that applies cleanly.
Updated•5 years ago
Assignee
Comment 21•5 years ago
Sorry for the long delay here. It looks like some intervening commits are missing from ESR, which results in this change not applying cleanly.
I wanted to confirm: is it OK to create a single commit for ESR that includes all of the necessary changes and submit it here?
Comment 22•5 years ago
What commits would you need to include?
Comment 23•4 years ago
I took another look at this and the only conclusion I can draw is that trying to manually backport the required Rust libraries is going to be a big can of worms. While the crash volume for this specific signature is higher than we'd prefer, it's also not high enough to justify the risk of such a messy backport IMO.