Closed Bug 1670878 Opened 5 years ago Closed 5 years ago

Instantiating a large number of wasm library sandboxes crashes due to overuse of file descriptors

Categories

(Core :: Security: Process Sandboxing, defect, P1)

Unspecified
All
defect

Tracking


RESOLVED FIXED
85 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox83 --- wontfix
firefox84 --- wontfix
firefox85 --- fixed

People

(Reporter: gsvelto, Assigned: shravanrn)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/32813feb-7bfc-4cc1-84dc-b52720201013

MOZ_CRASH Reason: called `Result::unwrap()` on an `Err` value: Sys(EMFILE)

Top 9 frames of crashing thread:

0 XUL RustMozCrash mozglue/static/rust/wrappers.cpp:17
1 XUL core::ops::function::Fn::call src/libcore/ops/function.rs:72
2 XUL std::panicking::rust_panic_with_hook src/libstd/panicking.rs:474
3 XUL rust_begin_unwind src/libstd/panicking.rs:378
4 XUL core::panicking::panic_fmt src/libcore/panicking.rs:85
5 XUL core::option::expect_none_failed src/libcore/option.rs:1211
6 XUL lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc third_party/rust/lucet-wasi-wasmsbx/src/ctx.rs:127
7 XUL lucet_wasi::ctx::WasiCtxBuilder::new third_party/rust/lucet-wasi-wasmsbx/src/ctx.rs:23
8 XUL lucet_load_module third_party/rust/rlbox_lucet_sandbox/src/create.rs:28

It seems that Lucet sandbox initialization is often running out of file descriptors on macOS. Could we be leaking some?

Tom, is this something you know about? Or maybe gcp knows who knows about rlbox?

(Separately, is there a better component than Core: General for this kind of bug?)

Flags: needinfo?(tom)
Flags: needinfo?(gpascutto)
Component: General → Security: Process Sandboxing
Flags: needinfo?(tom) → needinfo?(shravanrn)
Flags: needinfo?(gpascutto)

Apologies for the late response. Let me look through the code and see what this is doing. Will follow up with an update shortly.

Hmm... at first glance it does look at least possible. I'll do some further debugging to see if I can repro the issue.

Flags: needinfo?(shravanrn)

So, on testing the rlbox code in isolation on Linux, I am unable to reproduce the issue --- all file resources are cleaned up appropriately on the destruction of the sandbox.

I am curious though if there is any way to get more of the backtrace --- this unfortunately ends after 8 frames. Right now RLBox is deployed in a couple of locations in Firefox --- graphite and ogg. I am curious which one is causing the above issue.

That said, I am trying to think through ways we can debug this.

@tom - do you have any thoughts/could you cc folks who may be able to offer any debugging strategies?

Flags: needinfo?(tom)

I believe the stack trace is the way it is because of the sandbox, so I doubt we can get a better one. However, the URLs all seem to point to the same site (or its Facebook page):

I've opened a bunch of reports and all of them have Flash installed, so I guess those pages are using it, and maybe it's indirectly causing the bug. All the crashing installations are on ESR on macOS, so maybe this is already fixed in our release channel.

Thanks for the links. I'll do some testing on those sites specifically as well.

The rlbox wasm sandbox does not use a separate stack, so the backtrace should be fine. Just as a sanity check, is there a way to confirm that there are no limitations on backtrace collection on Macs?

I was able to repro the issue on a Mac. Will continue investigating tomorrow.

Sorry for the delay; I had some trouble getting hold of a working Mac. But I am all set now and am looking into this over the next couple of days.

Assignee: nobody → shravanrn

I believe I have identified a fix for this issue. I will be testing and submitting a patch for this shortly.

Flags: needinfo?(tom)

I also identified that this affects both ESR and regular versions of Firefox, on both macOS and Linux, and I have updated the bug accordingly.

@tom, @bholley: I am not too familiar with how patches get picked up for ESR, but this fix would need to be included. Could you advise on the process here / cc any relevant folks?

Flags: needinfo?(tom)
Flags: needinfo?(bholley)

If the patch applies cleanly to both -central and ESR, just submit the patch via phab normally. If it needs a separate patch, you can attach it as a file to the bug.

Flags: needinfo?(tom)
Flags: needinfo?(bholley)
Summary: Crash in [@ core::option::expect_none_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc] → Instantiating large a number of wasm library sandboxes crashes due to overuse of file descriptors
Summary: Instantiating large a number of wasm library sandboxes crashes due to overuse of file descriptors → Instantiating a large number of wasm library sandboxes crashes due to overuse of file descriptors

Have submitted a patch that fixes this. Adding a bit more detail to this bug.

Instantiating a wasm library sandbox duplicates a file descriptor for /dev/null three times, to be used as the input, output, and error streams for the wasm-sandboxed code. When a lot of sandboxes are created and destroyed, a lot of descriptors are duplicated and closed. While this should be fine, the OS does not seem too happy with the opening and closing of so many file descriptors --- this could perhaps be some strange interaction with Firefox's seccomp filters and cross-process file descriptor handling, as it is difficult to repro this outside of Firefox.
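For illustration, here is a minimal sketch of that pattern (using the libc crate; this is not the actual lucet-wasi code): each sandbox instantiation duplicates the /dev/null descriptor once per standard stream, and if the duplicates pile up faster than they are released, dup() fails with EMFILE, which is the error in the crash report above.

use std::fs::File;
use std::os::unix::io::AsRawFd;

// Sketch only: mimics the per-sandbox duplication described above. Each call
// consumes three slots in the process's fd table; once the RLIMIT_NOFILE soft
// limit is reached, dup() returns -1 with errno set to EMFILE.
fn make_stdio_fds(dev_null: &File) -> std::io::Result<[libc::c_int; 3]> {
    let mut fds = [-1; 3];
    for slot in fds.iter_mut() {
        let fd = unsafe { libc::dup(dev_null.as_raw_fd()) };
        if fd < 0 {
            // Surfaces as Sys(EMFILE) when unwrapped, as in the crash signature.
            return Err(std::io::Error::last_os_error());
        }
        *slot = fd;
    }
    Ok(fds)
}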

However, the simpler fix here was to just eliminate the duplication of /dev/null and return an error when the input, output, or error streams are accessed by wasm-sandboxed code. This means calls to printf will fail, but no code I know of actually checks the int error code returned by printf, and this change is compatible with the existing sandboxed components.
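A rough sketch of what that looks like (a simplified, hypothetical fd table for illustration, not the actual lucet-wasi patch): the standard streams simply have no host-side entry, so guest I/O on them gets a WASI error code back instead of consuming duplicated descriptors.

use std::collections::HashMap;
use std::fs::File;
use std::io::Write;

// WASI errno values, assumed here for illustration.
const WASI_EBADF: u16 = 8;
const WASI_EIO: u16 = 29;

// Hypothetical fd table: host files backing guest fds. Guest fds 0-2 are
// never inserted, so no /dev/null descriptors are duplicated per sandbox.
struct WasiFdTable {
    entries: HashMap<u32, File>,
}

impl WasiFdTable {
    fn write(&self, guest_fd: u32, buf: &[u8]) -> Result<usize, u16> {
        match self.entries.get(&guest_fd) {
            Some(mut file) => file.write(buf).map_err(|_| WASI_EIO),
            // A guest printf to stdout/stderr lands here and just gets an
            // error code back, which callers rarely check.
            None => Err(WASI_EBADF),
        }
    }
}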

The severity field is not set for this bug.
:gcp, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(gpascutto)
Severity: -- → S3
Flags: needinfo?(gpascutto)
Priority: -- → P1
Pushed by cbrindusan@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c790eed9beab
Instantiating a large number of wasm library sandboxes crashes due to overuse of file descriptors r=tjr

On Linux, a sandboxed open call will briefly use two fds (while sending the broker request, and immediately after receiving a successful response), but after it returns there is only the returned fd, and it also holds only one fd while blocked on the broker's reply. On macOS, there's no difference with sandboxing; the filesystem policy is interpreted by the kernel.

It's interesting that we see crash reports only on macOS: the per-process fd limit we request there should be at least as large as on Linux, if I recall correctly. There's a systemwide fd limit that's relatively low compared to other Unixes, but in that case we should see ENFILE rather than EMFILE, and we don't. (However, macOS may not be using error codes correctly.) I don't have a good explanation for this.
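As a point of reference, a small diagnostic sketch (using the libc crate; not part of the patch) for reading the per-process limit being discussed: hitting that soft limit yields EMFILE, while exhausting the system-wide table would yield ENFILE instead.

use libc::{getrlimit, rlimit, RLIMIT_NOFILE};

// Returns the (soft, hard) limits on open file descriptors for this process.
fn fd_limits() -> std::io::Result<(u64, u64)> {
    let mut lim = rlimit { rlim_cur: 0, rlim_max: 0 };
    if unsafe { getrlimit(RLIMIT_NOFILE, &mut lim) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok((lim.rlim_cur as u64, lim.rlim_max as u64))
}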

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 85 Branch

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #16)

> On Linux, a sandboxed open call will briefly use two fds (while sending the broker request, and immediately after receiving a successful response), but after it returns there is only the returned fd, and it also holds only one fd while blocked on the broker's reply. On macOS, there's no difference with sandboxing; the filesystem policy is interpreted by the kernel.
>
> It's interesting that we see crash reports only on macOS: the per-process fd limit we request there should be at least as large as on Linux, if I recall correctly. There's a systemwide fd limit that's relatively low compared to other Unixes, but in that case we should see ENFILE rather than EMFILE, and we don't. (However, macOS may not be using error codes correctly.) I don't have a good explanation for this.

Ah, interesting! Thanks for the details here. This may be useful for thinking about how best to handle wasm runtimes and file descriptors going forward. My larger takeaway is that, ideally, we should avoid creating file descriptors in the wasm runtimes in the first place, as that sidesteps this scalability pitfall entirely.

FYI, this patch doesn't graft cleanly to ESR78. If you intend to request uplift approval for it, you'll need to attach a rebased patch that applies cleanly.

Flags: needinfo?(shravanrn)
Crash Signature: [@ core::option::expect_none_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc] → [@ core::option::expect_none_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc] [@ core::result::unwrap_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc]

Sorry for the long delay here. It looks like some other intervening commits are missing from ESR, which results in this change not applying cleanly.

I wanted to confirm: is it OK to create a single commit for ESR that includes all of the necessary changes and submit it here?

Crash Signature: [@ core::option::expect_none_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc] [@ core::result::unwrap_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc] → [@ core::option::expect_none_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc] [@ core::result::unwrap_failed | lucet_wasi::ctx::WasiCtxBuilder::fd_dup_for_io_desc]
Flags: needinfo?(shravanrn) → needinfo?(ryanvm)

What commits would you need to include?

Flags: needinfo?(ryanvm) → needinfo?(shravanrn)

I took another look at this and the only conclusion I can draw is that trying to manually backport the required Rust libraries is going to be a big can of worms. While the crash volume for this specific signature is higher than we'd prefer, it's also not high enough to justify the risk of such a messy backport IMO.

Flags: needinfo?(shravanrn)