Closed Bug 1670878 Opened 5 years ago Closed 5 years ago

Instantiating a large number of wasm library sandboxes crashes due to overuse of file descriptors

Categories

(Core :: Security: Process Sandboxing, defect, P1)

Unspecified
All
defect

Tracking


RESOLVED FIXED
85 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox83 --- wontfix
firefox84 --- wontfix
firefox85 --- fixed

People

(Reporter: gsvelto, Assigned: shravanrn)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/32813feb-7bfc-4cc1-84dc-b52720201013

MOZ_CRASH Reason: called `Result::unwrap()` on an `Err` value: Sys(EMFILE)

Top 9 frames of crashing thread:

0 XUL RustMozCrash mozglue/static/rust/wrappers.cpp:17
1 XUL core::ops::function::Fn::call src/libcore/ops/function.rs:72
2 XUL std::panicking::rust_panic_with_hook src/libstd/panicking.rs:474
3 XUL rust_begin_unwind src/libstd/panicking.rs:378
4 XUL core::panicking::panic_fmt src/libcore/panicking.rs:85
5 XUL core::option::expect_none_failed src/libcore/option.rs:1211
6 XUL lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc third_party/rust/lucet-wasi-wasmsbx/src/ctx.rs:127
7 XUL lucet_wasi::ctx::WasiCtxBuilder::new third_party/rust/lucet-wasi-wasmsbx/src/ctx.rs:23
8 XUL lucet_load_module third_party/rust/rlbox_lucet_sandbox/src/create.rs:28

It seems that Lucet sandbox initialization is often running out of file descriptors on macOS. Could we be leaking some?

Tom, is this something you know about? Or maybe gcp knows who knows about rlbox?

(Separately, is there a better component than Core: General for this kind of bug?)

Flags: needinfo?(tom)
Flags: needinfo?(gpascutto)
Component: General → Security: Process Sandboxing
Flags: needinfo?(tom) → needinfo?(shravanrn)
Flags: needinfo?(gpascutto)

Apologies for the late response. Let me look through the code and see what this is doing. Will follow up with an update shortly.

Hmm... at first glance it does look at least possible. I'll do some further debugging to see if I can repro the issue.

Flags: needinfo?(shravanrn)

So, on testing the rlbox code in isolation on Linux, I am unable to reproduce the issue --- all file resources are cleaned up appropriately on the destruction of the sandbox.

I am curious though if there is any way to get more of the backtrace --- this unfortunately ends after 8 frames. Right now RLBox is deployed in a couple of locations in Firefox --- graphite and ogg. I am curious which one is causing the above issue.

That said, I am trying to think through ways we can debug this.

@tom - do you have any thoughts/could you cc folks who may be able to offer any debugging strategies?

Flags: needinfo?(tom)

I believe the stack trace is the way it is because of the sandbox, so I doubt we can get a better one. However, the URLs all seem to point to the same site (or its Facebook page):

I've opened a bunch of reports and all of them have Flash installed, so I guess those pages are using it, and maybe it's indirectly causing the bug. All the crashing installations are on ESR on macOS, so maybe this is already fixed in our release channel.

Thanks for the links. I'll do some testing on those sites specifically as well.

The rlbox wasm sandbox does not use a separate stack, so the backtrace should be fine. Just as a sanity check, is there a way to confirm that there are no limitations on backtrace collection on Macs?

I was able to repro the issue on a Mac. Will continue investigating tomorrow.

Sorry for the delay; I had some trouble getting hold of a working Mac. But I am all set now and am looking into this over the next couple of days.

Assignee: nobody → shravanrn

I believe I have identified a fix for this issue. I will be testing and submitting a patch for this shortly.

Flags: needinfo?(tom)

I also identified that this affects both ESR and regular versions of Firefox, on both macOS and Linux, and I have updated the bug accordingly.

@tom, @bholley: I am not too familiar with how patches get picked up for ESR, but this fix would need to be included. Could you advise on the process here / cc any relevant folks?

Flags: needinfo?(tom)
Flags: needinfo?(bholley)

If the patch applies cleanly to both -central and ESR, just submit the patch via phab normally. If it needs a separate patch, you can attach it as a file to the bug.

Flags: needinfo?(tom)
Flags: needinfo?(bholley)
Summary: Crash in [@ core::option::expect_none_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc] → Instantiating large a number of wasm library sandboxes crashes due to overuse of file descriptors
Summary: Instantiating large a number of wasm library sandboxes crashes due to overuse of file descriptors → Instantiating a large number of wasm library sandboxes crashes due to overuse of file descriptors

Have submitted a patch that fixes this. Adding a bit more detail to this bug.

Instantiating a wasm library sandbox duplicates a file descriptor for /dev/null three times, to be used as the input, output, and error streams for the wasm-sandboxed code. When a lot of sandboxes are created and destroyed, a lot of descriptors are duplicated and closed. While this should be fine, the OS does not seem too happy with the opening and closing of so many file descriptors --- this could perhaps be some strange interaction with Firefox's seccomp filters and cross-process file descriptor handling, as it is difficult to repro this outside of Firefox.
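For illustration, here is a minimal sketch of that pattern (using the libc crate; this is not the actual lucet-wasi code): each sandbox instantiation duplicates the /dev/null descriptor once per standard stream, and if the duplicates pile up faster than they are released, dup() fails with EMFILE, which is the error in the crash report above.

use std::fs::File;
use std::os::unix::io::AsRawFd;

// Sketch only: mimics the per-sandbox duplication described above. Each call
// consumes three slots in the process's fd table; once the RLIMIT_NOFILE soft
// limit is reached, dup() returns -1 with errno set to EMFILE.
fn make_stdio_fds(dev_null: &File) -> std::io::Result<[libc::c_int; 3]> {
    let mut fds = [-1; 3];
    for slot in fds.iter_mut() {
        let fd = unsafe { libc::dup(dev_null.as_raw_fd()) };
        if fd < 0 {
            // Surfaces as Sys(EMFILE) when unwrapped, as in the crash signature.
            return Err(std::io::Error::last_os_error());
        }
        *slot = fd;
    }
    Ok(fds)
}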

However, the simpler fix here was to just eliminate the duplication of /dev/null and return an error when the input, output, or error streams are accessed by wasm-sandboxed code. This means calls to printf will fail, but no code I know of actually checks the int error code returned by printf, and this change is compatible with the existing sandboxed components.
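A rough sketch of what that looks like (a simplified, hypothetical fd table for illustration, not the actual lucet-wasi patch): the standard streams simply have no host-side entry, so guest I/O on them gets a WASI error code back instead of consuming duplicated descriptors.

use std::collections::HashMap;
use std::fs::File;
use std::io::Write;

// WASI errno values, assumed here for illustration.
const WASI_EBADF: u16 = 8;
const WASI_EIO: u16 = 29;

// Hypothetical fd table: host files backing guest fds. Guest fds 0-2 are
// never inserted, so no /dev/null descriptors are duplicated per sandbox.
struct WasiFdTable {
    entries: HashMap<u32, File>,
}

impl WasiFdTable {
    fn write(&self, guest_fd: u32, buf: &[u8]) -> Result<usize, u16> {
        match self.entries.get(&guest_fd) {
            Some(mut file) => file.write(buf).map_err(|_| WASI_EIO),
            // A guest printf to stdout/stderr lands here and just gets an
            // error code back, which callers rarely check.
            None => Err(WASI_EBADF),
        }
    }
}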

The severity field is not set for this bug.
:gcp, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(gpascutto)
Severity: -- → S3
Flags: needinfo?(gpascutto)
Priority: -- → P1
Pushed by cbrindusan@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c790eed9beab
Instantiating a large number of wasm library sandboxes crashes due to overuse of file descriptors r=tjr

On Linux, a sandboxed open call will briefly use two fds (while sending the broker request, and immediately after receiving a successful response), but after it returns there is only the returned fd, and it also holds only one fd while blocked on the broker's reply. On macOS, there's no difference with sandboxing; the filesystem policy is interpreted by the kernel.

It's interesting that we see crash reports only on macOS: the per-process fd limit we request there should be at least as large as on Linux, if I recall correctly. There's a systemwide fd limit that's relatively low compared to other Unixes, but in that case we should see ENFILE rather than EMFILE, and we don't. (However, macOS may not be using error codes correctly.) I don't have a good explanation for this.
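As a point of reference, a small diagnostic sketch (using the libc crate; not part of the patch) for reading the per-process limit being discussed: hitting that soft limit yields EMFILE, while exhausting the system-wide table would yield ENFILE instead.

use libc::{getrlimit, rlimit, RLIMIT_NOFILE};

// Returns the (soft, hard) limits on open file descriptors for this process.
fn fd_limits() -> std::io::Result<(u64, u64)> {
    let mut lim = rlimit { rlim_cur: 0, rlim_max: 0 };
    if unsafe { getrlimit(RLIMIT_NOFILE, &mut lim) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok((lim.rlim_cur as u64, lim.rlim_max as u64))
}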

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 85 Branch

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #16)

> On Linux, a sandboxed open call will briefly use two fds (while sending the broker request, and immediately after receiving a successful response), but after it returns there is only the returned fd, and it also holds only one fd while blocked on the broker's reply. On macOS, there's no difference with sandboxing; the filesystem policy is interpreted by the kernel.
>
> It's interesting that we see crash reports only on macOS: the per-process fd limit we request there should be at least as large as on Linux, if I recall correctly. There's a systemwide fd limit that's relatively low compared to other Unixes, but in that case we should see ENFILE rather than EMFILE, and we don't. (However, macOS may not be using error codes correctly.) I don't have a good explanation for this.

Ah, interesting! Thanks for the details here. This may be useful for thinking about how best to handle wasm runtimes and file descriptors going forward. My larger takeaway is that, ideally, we should avoid creating file descriptors in the wasm runtimes in the first place, as that sidesteps this scalability pitfall entirely.

FYI, this patch doesn't graft cleanly to ESR78. If you intend to request uplift approval for it, you'll need to attach a rebased patch that applies cleanly.

Flags: needinfo?(shravanrn)
Crash Signature: [@ core::option::expect_none_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc] → [@ core::option::expect_none_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc] [@ core::result::unwrap_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc]

Sorry for the long delay here. It looks like some other intervening commits are missing from ESR, which results in this change not applying cleanly.

I wanted to confirm: is it OK to create a single commit for ESR that includes all of the necessary changes and submit it here?

Crash Signature: [@ core::option::expect_none_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc] [@ core::result::unwrap_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc] → [@ core::option::expect_none_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc] [@ core::result::unwrap_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc]
Flags: needinfo?(shravanrn) → needinfo?(ryanvm)

What commits would you need to include?

Flags: needinfo?(ryanvm) → needinfo?(shravanrn)

I took another look at this and the only conclusion I can draw is that trying to manually backport the required Rust libraries is going to be a big can of worms. While the crash volume for this specific signature is higher than we'd prefer, it's also not high enough to justify the risk of such a messy backport IMO.

Flags: needinfo?(shravanrn)