Closed Bug 1575985 Opened 8 months ago Closed 4 months ago

RLBox - Safely load dynamic library dealing with content sandbox policies

Categories

(Core :: Security: Process Sandboxing, enhancement, P3)

Desktop
Unspecified
enhancement

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox71 --- wontfix
firefox72 --- wontfix
firefox73 --- fixed

People

(Reporter: shravanrn, Assigned: froydnj)

References

Details

Attachments

(5 files)

Project description - See Bug 1554268

RLBox modifies Firefox to use wasm sandboxed versions of third party or otherwise risky libraries. The first target is libGraphite (font library), which is used in libThebes.

Currently, wasm sandboxed libraries must be compiled and used as shared object binaries, that must be dynamically loaded and thus cannot be statically or dynamically linked.

To this end, the code in libThebes must be able to dynamically load (dlopen) the shared object. However, since libThebes is in a content process, the content sandbox prevents this.

I'm hoping to get some info on the best options to get past this. Is there a way to allow read only to a set of some safe files from the content process?

Perhaps the content sandbox already permits access to some files in the Firefox binary folder?

Any information or examples of how to get past the content sandbox for this specific use case would be great.

Flags: needinfo?(nfroyd)
Blocks: 1569370
No longer blocks: 1566238

Have you encountered this problem on specific platforms? (I added :haik and :jld for OSX/Linux.)

On Windows, I don't believe we have any current restriction that prevents this from occurring in the Content Process; but we want to in the future so it may make sense to future-proof it. Presently, our best approach for doing so is to preload the library; that is, LoadLibrary() it before we lower the sandbox. Later on, a subsequent call to LoadLibrary will be a no-op, but will successfully return the handle of the already-loaded library.

Component: General → Security: Process Sandboxing
Flags: needinfo?(nfroyd)

On Linux, there are no restrictions on filesystem access before the sandbox start point (e.g., ContentChild::RecvSetProcessSandbox). After the sandbox is started, content processes can continue to load libraries; GMP and RDD processes cannot.

In general, the advice for this is to load the library earlier (assuming the library itself is from a trusted source); see the Windows GMP code for an example.

Note that the Mac sandbox is started relatively early, so that might be causing problems there.

On Mac, once the content sandbox has started, libraries from within the main Firefox.app should still be able to be loaded.

Is there a way to allow read only to a set of some safe files from the content process?

Yes, there is a way to do this if the libraries can't be delivered in the .app bundle.

If you could provide some more details about where you're loading from now and what failures you're seeing, that would help.

Thanks a bunch for the info! So from your comments, I believe one option is that we should perform an early load of the library, prior to sandbox creation. I think this makes sense, and I will investigate this option.

I think the above solution makes sense for the current example of sandboxing the font library. In the future when we move to sandboxing rarely used libraries, I think we would not necessarily want to preload sandboxed libraries. For instance, if we sandbox a library that handles a rarely used image format, we want to defer loading until we hit a page that has an image in this format.

On Linux ... content processes can continue to load libraries
On Mac ... libraries from within the main Firefox.app should still be able to be loaded
On Windows, our best approach... is to preload the library

From your comments, above, it seems that deferred loading is possible at least on Linux/Mac. So it sounds like dlopen would work.
Question(Linux/Mac) - Does this mean that syscalls like open would also continue to work?

On Windows, restriction ... in the Content Process ... in the future

Question(Win) - Once this is enabled, would a LoadLibrary of a non pre-loaded dll be permitted, at least for dlls in the Firefox binary folder? We can definitely punt on this question if the details of the Windows content sandbox are unclear at this time.

Have you encountered this problem on specific platforms?

Sorry, I should have mentioned this earlier. The initial plan for wasm sandboxing is Linux 64. So my testing and this bug description is for Linux 64. However the goal is to keep this general, so that we can expand to other platforms. Mac 64 is a relatively easy next step in the roadmap. Win64 is a longer term goal as Lucet does not currently support it.

Yes, there is a way to do this if the libraries can't be delivered in the .app bundle. If you could provide some more details about where you're loading from now and what failures you're seeing, that would help.

Ok, this makes sense now. The current prototype is a proof of concept and is a bit hacky - it would load from a path outside the Firefox binary folder/app bundle. After cleanup, I am hoping to set it up so that the load is from the same folder (app bundle?) as libxul.
Question(Mac) - From your comments, it seems that this deferred library load would be permitted?
Caveat - I am not very familiar with the app bundles on Mac, so I apologize if the above description is flawed.

Follow up investigation from my side - The code that dynamically loads the wasm sandbox (RLBox_Lucet + a customized Lucet compiler runtime), uses 2 file related system calls prior to loading

  1. dlopen/LoadLibrary
  2. File open, read, close syscalls - to parse some additional metadata from the shared library file that is not included in the dlopen above

I will check to see if we can eliminate #2 above, in case that helps simplify working with the content sandbox.

Flags: needinfo?(tom)
Flags: needinfo?(jld)
Flags: needinfo?(haftandilian)

(In reply to Shravan Narayan from comment #4)

In the future when we move to sandboxing rarely used libraries, I think we would not necessarily want to preload sandboxed libraries. For instance, if we sandbox a library that handles a rarely used image format, we want to defer loading until we hit a page that has an image in this format.

Question(Win) - Once this is enabled, would a LoadLibrary of a non pre-loaded dll be permitted, at least for dlls in the Firefox binary folder? We can definitely punt on this question if the details of the Windows content sandbox are unclear at this time.

No; there are several restrictions in place here that would prevent it. The sandboxed process can't open the file to load it - because it can't open any files. Additionally, any dll not signed by Microsoft would be prevented from loading.

Preloading the library may be disadvantageous for performance reasons; but I'm not worried about it otherwise. It's possible that by the time we got the Windows sandbox restricted in this way we could have a solution that maps libraries into the process' address space from the parent.

(Also, not a big deal; but tom@ritter.vg is an old/person bugzilla account; better to use tom@mozilla)

Flags: needinfo?(tom)

Question(Linux/Mac) - Does this mean that syscalls like open would also continue to work?

On Mac, open syscalls are still permitted, but it depends on the path. We have SandboxPolicyContent.h in the tree which uses a scheme program to specify the sandbox rules and you can get an idea of what is allowed from that.

After cleanup, I am hoping to set it up so that the load is from the same folder (app bundle?) as libxul.
Question(Mac) - From your comments, it seems that this deferred library load would be permitted?
Caveat - I am not very familiar with the app bundles on Mac, so I apologize if the above description is flawed.

For Mac, that sounds fine and there are several other .dylibs in the same directory (Firefox.app/Contents/MacOS). This library would be like other libraries we ship and be signed by Mozilla.

Flags: needinfo?(haftandilian)

(In reply to Shravan Narayan from comment #4)

Question(Linux/Mac) - Does this mean that syscalls like open would also continue to work?

Strictly speaking, in the Linux content sandbox the open syscall doesn't work: it raises SIGSYS, the signal handler communicates with an unsandboxed broker, and the register state is modified to reflect the simulated syscall return value; the real open syscall can't be invoked.

But from the caller's point of view, it's similar to the Mac answer above: open appears to work, but restricted by a policy. The policy is constructed in the awkwardly named SandboxBrokerPolicyFactory.cpp; note the parser for ld.so.conf, so that we can allow loading system libraries, but also the section with NS_GRE_DIR which should cover anything that's shipped as part of Firefox.

Setting MOZ_SANDBOX_LOGGING in the environment may be helpful for understanding what's going on.

Flags: needinfo?(jld)

Thanks a bunch to everyone for the info! Will investigate the above options and keep this thread updated.

Hmm... Seeing some super odd behaviors here. Hoping to get some help with below

I am currently testing Linux 64. The dynamic library being loaded is libGraphite. libGraphite.so is now output in the same directory as libxul.so as part of the build. libGraphite.so is also mentioned in package-manifest.in, so that it looks like a normal binary shipped with Firefox.

I start by calling dlopen on libGraphite.so prior to enabling of the content sandbox, and then trying to dlopen libGraphite.so at the relevant location. For preloading, I dlopen'd the library here.

Question - is this a good location for the preload. I'm trying to make sure there are no races, or this isn't too late etc.

This seems to work when running Firefox say using ./mach run layout/reftests/text/graphite-01.html. However, the second dlopen seems to be failing on mach tests... For instance running ./mach test layout/reftests/text/graphite-01.html crashes with content sandbox violations.

To pin down what is happening, I enabled export MOZ_SANDBOX_LOGGING=1. Both mach run and mach test produce

0:00.97 GECKO(2480) Sandbox: policy for /home/USER/mozilla-central/obj-x86_64-pc-linux-gnu/dist/bin/: 1 -> 35
0:00.97 GECKO(2480) Sandbox: policy for /home/USER/mozilla-central/obj-x86_64-pc-linux-gnu/dist/bin: 1 -> 3
0:08.80 GECKO(2480) Sandbox: policy for /home/USER/mozilla-central/obj-x86_64-pc-linux-gnu/dist/bin/: 1 -> 35
0:08.80 GECKO(2480) Sandbox: policy for /home/USER/mozilla-central/obj-x86_64-pc-linux-gnu/dist/bin: 1 -> 3
0:47.68 GECKO(2838) Sandbox: policy for /home/USER/mozilla-central/obj-x86_64-pc-linux-gnu/dist/bin/: 1 -> 35
0:47.68 GECKO(2838) Sandbox: policy for /home/USER/mozilla-central/obj-x86_64-pc-linux-gnu/dist/bin: 1 -> 3
0:55.63 GECKO(2838) Sandbox: policy for /home/USER/mozilla-central/obj-x86_64-pc-linux-gnu/dist/bin/: 1 -> 35
0:55.63 GECKO(2838) Sandbox: policy for /home/USER/mozilla-central/obj-x86_64-pc-linux-gnu/dist/bin: 1 -> 3

After this mach run outputs

Sandbox: Recording mapping /home/USER/mozilla-central/libGraphite.so -> /home/USER/mozilla-central/obj-x86_64-pc-linux-gnu/dist/bin/libGraphite.so

while mach test outputs

1:23.59 GECKO(2838) Sandbox: SandboxBroker: denied op=stat rflags=400000 perms=0 path=/home for pid=3001
1:23.59 GECKO(2838) Sandbox: Failed errno -13 op stat flags 0400000 path /home

Question - Do mach tests have different policies on content sandbox? Is there any reason the above works for mach run but doesn't work for mach test?

Syscalls
I also wanted to be more complete about the set of syscalls used. I have removed the need for the open syscall from the code. So we just use dlopen. However, there are also some calls to the RUST API cannonicalize to standardize file paths... According to the docs this is equivalent to realpath which I imagine has some syscalls to resolve symlinks...

Flags: needinfo?(jld)
Flags: needinfo?(jld) → needinfo?(gpascutto)

I start by calling dlopen on libGraphite.so prior to enabling of the content sandbox, and then trying to dlopen libGraphite.so at the relevant location. For preloading, I dlopen'd the library here.

Note that ContentParent runs in the parent/chrome process. This is probably not what you want?

Do mach tests have different policies on content sandbox?

No. There are some differences between running a normal (packaged) Firefox and running the build output. If you're running the build output, the entire source repo gets explicitly whitelisted, because some JS stuff doesn't correctly copy its files into the objdir but leaves them in random places in the filesystem. See for example https://searchfox.org/mozilla-central/source/security/sandbox/linux/broker/SandboxBrokerPolicyFactory.cpp#427

I'm not sure if this is different for mach run vs mach test. Looking at MOZ_SANDBOX_LOGGING output will tell you if the expected paths were whitelisted.

Sandbox: Recording mapping

FYI what's happening here is that the sandbox broker notices that this library isn't actually in $OBJDIR/dist/bin but that it's a symlink to another location. It will remember this so it knows that opening the library in the original path is actually OK.

Is there any reason the above works for mach run but doesn't work for mach test?

There's nothing relevant in the excerpts you posted related to opening libGraphite.so. Or maybe there is, but we can't tell from so little info.

However, there are also some calls to the RUST API cannonicalize to standardize file paths... According to the docs this is equivalent to realpath which I imagine has some syscalls to resolve symlinks...

Yes, it will stat/lstat() paths along the way. These are brokered, and the broker will refuse access to some paths because they are outside the allowed paths. This is likely what you saw in the log snippet you posted, i.e. the stat() call on /home is refused.

Flags: needinfo?(gpascutto)

FWIW, uses of nsIFile::Normalize(), which is the platform-independent way of calling realpath() and likely the same as canonicalize(), typically have their return values ignored in the content process: https://searchfox.org/mozilla-central/search?q=symbol:_ZN11nsLocalFile9NormalizeEv%2C_ZN7nsIFile9NormalizeEv&redirect=false

Note for example this code: https://searchfox.org/mozilla-central/source/dom/media/gmp/GMPChild.cpp#147 which shows that it's best effort but continues even if it fails.

Note that ContentParent runs in the parent/chrome process. This is probably not what you want?

Oh you're right... that definitely seems like the wrong place. Would you have any thoughts on the appropriate location in the code to do this preloading? I'm hoping to find the place in source which runs in the content process prior to the content sandbox enforcement.

There are some differences between running a normal (packaged) Firefox and running the build output.

This is good to know! For the moment, we are still in the phase where we are testing on build output only. I will look definitely into this when we start testing the packaged builds.

I'm not sure if this is different for mach run vs mach test. Looking at MOZ_SANDBOX_LOGGING output will tell you if the expected paths were whitelisted.

Thanks, I will try this out.

typically have their return values ignored in the content process

Unfortunately the call to canonicalize is in third party rust code, which does abort on failure... From your description, it seems possible that this call to canonicalize may be one of the sources of problems for mach test... I will investigate and check if this needs to be removed.

Flags: needinfo?(gpascutto)

(In reply to Shravan Narayan from comment #12)

Note that ContentParent runs in the parent/chrome process. This is probably not what you want?

Oh you're right... that definitely seems like the wrong place. Would you have any thoughts on the appropriate location in the code to do this preloading? I'm hoping to find the place in source which runs in the content process prior to the content sandbox enforcement.

Trace your way backwards from https://searchfox.org/mozilla-central/rev/e5327b05c822cdac24e233afa37d72c0552dbbaf/security/sandbox/linux/Sandbox.cpp#521 which turns the sandbox on and you'll find the code that happens right before it

Flags: needinfo?(gpascutto)

Hmm... I'm still having a lot of trouble with this. In short mach run works, but mach test does not...

Description

  • I am preloading the dynamic library in ContentChild.cpp
  • The only call is to dlopen, there are no calls to APIs like canonicalize (realpath)

Would anyone have any approaches to debugging this?

What I have tried

  • I have already looked through the logs of mach run and mach test with MOZ_SANDBOX_LOGGING enabled but this is not illuminating... mach run runs successfully, but mach test fails and produces the following log statements just prior to the crash
 1:26.13 GECKO(24069) Sandbox: SandboxBroker: denied op=open rflags=2000000 perms=0 path=/dev/null for pid=24240
 1:26.13 GECKO(24069) Sandbox: Failed errno -13 op open flags 02000000 path /dev/null
 1:26.13 GECKO(24069) Hit MOZ_CRASH(failed to open /dev/null: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }) at src/libcore/result.rs:999

There is no mention of '/dev/null' in the logs of mach run

  • I tried attaching the debugger to mach test (both gdb and rr), but this doesn't seem to be supported

Questions

  • froydnj: Is there a way to hook the rr debugger on to mach test... I can try to trace the source of the code that needs /dev/null
  • gcp, tjr: Is there any other way to investigate this other than the above? Any other logging mechanism? Alternately rather than preloading, should we be considering explicitly whitelisting the path to the dynamic library? Also, if it helps I can attach the logs of mach run and mach test

Thoughts welcome!

Flags: needinfo?(tom)
Flags: needinfo?(nfroyd)
Flags: needinfo?(gpascutto)

Alternately rather than preloading, should we be considering explicitly whitelisting the path to the dynamic library?

No, we already established that the build process puts it in a whitelisted directory, and if you preload it then the broker doesn't even come into play wrt to that. So I don't see what doing that would achieve.

You should figure out what is trying to open /dev/null. If that's code that is hard to modify, we can whitelist /dev/null. But removing the access would be even better, because there's no sane reason anything would need it and the less access, the better...

Whatever code is doing that access seems to be Rust code...

Flags: needinfo?(gpascutto)
Flags: needinfo?(tom)

gcp: oh nice find! That looks plausible. Will investigate.

So, it wasn't that particular piece of code, but I did identify where we rely on /dev/null. This comes from an upstream dependency "lucet" we rely on for library sandboxing. We do maintain a fork of lucet for our purposes, so I can remove the offending code.

That said, the inconsistency of mach run allowing /dev/null and mach test not doing so is a bit strange. I'm wondering if this behavior is a separate content sandbox bug

Further

  • when I look at the logs of mach run /dev/null is never mentioned
  • when I look at SandboxBrokerPolicyFactory.cpp, /dev/null is not mentioned
  • when I look at list of open files in Firefox (to check if /dev/null was opened before the sandbox was enabled), /dev/null is not mentioned

However, the same code that uses /dev/null runs without issue here... So, not super sure what's going on here... The only other explanations I can think of seem unlikely...

  • somehow /dev/null is symlinked to something that is allowed, which was opened
  • The source repo whitelist on when running from the build folder, somehow allow /dev/null

Any thoughts on whether I should open a new content sandbox bug to track this?

That said, the inconsistency of mach run allowing /dev/null and mach test not doing so is a bit strange. I'm wondering if this behavior is a separate content sandbox bug

Are you sure that this is actually what happens? As far as I can tell we don't allow /dev/null anywhere.

Any thoughts on whether I should open a new content sandbox bug to track this?

I see no evidence there is any issue with the content sandbox.

The only other explanations I can think of seem unlikely...

I can think of at least one other one: "lucet" or some other code is trying to open /dev/null when running tests but not when doing a normal run. For some reason.

"lucet" or some other code is trying to open /dev/null when running tests but not when doing a normal run. For some reason.

I was wondering about this as well and that was my initial theory. So I did some printf debugging to make sure the same code is executing... While it is the same code that's running, the results of the code here are confusing with how it could interact with the content sandbox. Given that you believe this is not a content sandbox issue, I will investigate the code in question (described below) more carefully.

As far as I can tell we don't allow /dev/null anywhere.

Good to know. This definitely seems to indicate its not a content sandbox issue. Let me investigate further from my side.

Also, here is a quick description of the code I'm looking at. I have some theories about what is going wrong. (Sharing here in case its something obvious that I'm missing :) )

Code that's executing
The lucet code executes these 2 functions

Briefly, these 2 functions

  • Create 3 new file descriptors fds[0], fds[1], fds[2]
  • Open file /dev/null, duplicates the descriptor and assigns to all 3 fds
  • Reassign fds by duplicating stdin, stdout and stderr

Possible execution paths for this code

  1. If the firefox process' stdin, stdout and stderr are left as default, the code would attempt to open /dev/null briefly, but never actually read or write to it.
  2. If stdin, stdout and stderr are /dev/null during launch, this code would attempt to open /dev/null, duplicate the file descriptor pointing to open /dev/null and read from/write to it.

Investigations from my side
I am planning to check if one or both execution paths described above work. If only one works while the other fails, I will see if mach run and mach test cause different execution paths

Have you watched procfs and see where they point to during execution?

It would actually make a lot of sense that during tests the stdout/stderr/stdin are redirected, but not during a normal run.

I'm still investigating, but thought I'd post my progress

I did some checks on the procfs, by calling readlink on file descriptors 0, 1 and 2 in the code

  • For mach run, stdin, stdout and stderr point to the terminal.
  • For mach test, stdin points to the terminal, stdout and stderr are piped somewhere (readlink returns strings that look like "pipe:[255774]" - I haven't yet figured out a way to decode what that is, as the id in the string is not a PID)

It seems like this could be the issue... Here are some of the next tests I tried

To see if "/dev/null" was the culprit, I did confirm that adding /dev/null to SandboxBrokerPolicyFactory.cpp allows mach test to run, so there is light at the end of the tunnel :)

To see if I can get mach run to crash with a similar message, I executed mach run with input and output pointing to "/dev/null" directly, with the below command. However, this worked fine without a crash.

2>/dev/null 1>/dev/null /home/USER/mozilla-central/obj-x86_64-pc-linux-gnu/dist/bin/firefox layout/reftests/text/graphite-01.html -no-remote -profile /home/USER/mozilla-central/obj-x86_64-pc-linux-gnu/tmp/profile-default

Did the same with the output and error pointing to a log file on my Desktop (which is a path not permitted by the sandbox broker), this too works fine.

I am doing some follow up to see of calling dup on the pipe is somehow the issue... AFAICT, the linux manpages don't explicitly disallow this, so not sure if this is the issue...

Confirmed that dup is not the issue - dup followed by write works correctly on the stdin, stdout and stderr file descriptors even in mach test... At this point unless there is a way to attach rr on to mach test, I think I have exhausted most other avenues/approaches to explore this...

To this end, I think it may be time to consider if we can allow read write access to /dev/null in the linux sandbox

gcp, tjr: Do you think this is an option? Since I'm unfamiliar with the process, I also wanted to ask if there are additional considerations i.e. does this require additional sign-offs, approvals or security reviews?

Flags: needinfo?(tom)
Flags: needinfo?(gpascutto)

Not my call; probably gcp's or Jed's.

Flags: needinfo?(tom) → needinfo?(jld)

(In reply to Shravan Narayan from comment #23)

(readlink returns strings that look like "pipe:[255774]" - I haven't yet figured out a way to decode what that is, as the id in the string is not a PID)

That's an inode number. You can try to find the other end with something like ls -l /proc/*/fd/* 2>/dev/null | grep -w 255774. (Both ends of a pipe use the same inode; contrast socketpairs, which have an adjacent pair of inode numbers. Credit for this trick goes to Chromium.)

But in this case, the question seems to be whether the WASI virtual fds 0-2 point to the same file as the real fds 0-2.

I don't see any problem with adding /dev/null to the policy. Right now we for sure allow much riskier stuff.

But I still think it is unneeded - may just not be worth the effort to work around/fix whatever funnies the rust stdlib does.

Given that it's only needed for mach test, how about using the whitelist preference (security.sandbox.content.read_path_whitelist) to inject it during tests?

Flags: needinfo?(gpascutto)

Adding /dev/null shouldn't be a problem, given that we already allow /dev/urandom (and allow inheriting tty fds).

But it would be good to know why this happens only with mach test and not mach run. According to comment #20 this isn't the Rust stdlib; it's this line, and there doesn't seem to be anything in that module that would construct a WasiCtx without calling it (although that could be changed easily enough). If the sandbox is allowing access to a file and we don't know why, that's kind of a problem.

Instrumenting the broker to log when accesses are allowed (instead of just denials) might help, or using strace, or adding a sandbox policy rule with the CRASH_INSTEAD flag and backtracing in a debugger.

As for rr, it should be possible to record the entire mach test run, select the process of interest in rr ps, and replay that process.

Flags: needinfo?(jld)

But in this case, the question seems to be whether the WASI virtual fds 0-2 point to the same file as the real fds 0-2.

This is the question I am currently investigating, although I should point out that I'm not fully sure of what is going on here yet, so this may stiil be a red herring...

may just not be worth the effort to work around/fix whatever funnies the rust stdlib does.

Yeah this is sort of the stage I'm approaching, given the debugger issues I am seeing (described below)

Given that it's only needed for mach test, how about using the whitelist preference

I am sort of worried about doing this... Clearly there is some sort of configuration that can get Firefox to run that causes this code to require /dev/null which is somehow triggerred in mach test. Given this, there is probably some way to get mach run or stock firefox to also operate in this configuration... If some user does actually do this, this would cause a crash, so adding an exception to mach test only feels like a bug waiting to happen

Fwiw, I have unsuccessfully tried to get firefox to crash with the same error in mach run when

  • by piping output and err to another program
  • by redirecting output and err to /dev/null
  • by redirecting output in C, with dup2 + exec(firefox) (in case C was needed to trigger the pipe)
  • by setting up pipes and executing firefox from a C program (in case C + pipes was needed to trigger the pipe)

All of the above cases work without crashing.

Re using rr

Unfortunately have run into a couple of blocking bugs when trying to use rr

  • unlike mach run, mach test doesn't seem to use the "--debugger" flag (it allows it, but then ignores it) to set the debugger only on the Firefox process
  • if i try to record all of mach test, I run into an rr bug, where the replay just causes rr to crash with a backtrace (attached below if you are curious)...

I spoke to @froydnj offline, and he suggested I open bugs for both of these, but this does unfortunately make it harder to understand why mach run and mach test differ

Using gdb directly on mach test is also infeasibile as there are about 50 fork/exec's here before the content process is launched... Following the correct forks and execs to get to the right one would be too tedious

Next steps

Instrumenting the broker to log when accesses are allowed (instead of just denials) might help, or using strace, or adding a sandbox policy rule with the CRASH_INSTEAD flag and backtracing in a debugger.

  1. These are good suggestions... Will investigate!

That's an inode number. You can try to find the other end with ...

  1. Oh nice! will investigate the inode numbers to see if this sheds some light on why this is happening.

rr backtrace

[ERROR /build/rr-ViolGa/rr-5.1.0/src/Registers.cc:256:maybe_print_reg_mismatch()] r9 0x7ffcc7be38c0 != 0x7ffcc7be3ac0 (replaying vs. recorded)
[FATAL /build/rr-ViolGa/rr-5.1.0/src/Registers.cc:366:compare_register_files()] 
 (task 15563 (rec:13231) at time 24770)
 -> Assertion `!bail_error || match' failed to hold. Fatal register mismatch (ticks/rec:216752702/216752713)
=== Start rr backtrace:
rr(_ZN2rr13dump_rr_stackEv+0x41)[0x561ed216ddd1]
rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0x5c5)[0x561ed20a0135]
rr(_ZN2rr21EmergencyDebugOstreamD1Ev+0x11e)[0x561ed20ad6be]
rr(_ZN2rr9Registers22compare_register_filesEPNS_10ReplayTaskEPKcRKS0_S4_S6_NS_16MismatchBehaviorE+0xe3)[0x561ed2110a93]
rr(_ZN2rr10ReplayTask13validate_regsEj+0x26c)[0x561ed212a46c]
rr(_ZN2rr13ReplaySession13enter_syscallEPNS_10ReplayTaskERKNS0_15StepConstraintsE+0x575)[0x561ed2117d05]
rr(_ZN2rr13ReplaySession18try_one_trace_stepEPNS_10ReplayTaskERKNS0_15StepConstraintsE+0xdb)[0x561ed211a73b]
rr(_ZN2rr13ReplaySession11replay_stepERKNS0_15StepConstraintsE+0x128)[0x561ed211b3b8]
rr(_ZN2rr14ReplayTimeline19replay_step_forwardENS_10RunCommandEl+0xdf)[0x561ed2132a5f]
rr(_ZN2rr9GdbServer12serve_replayERKNS0_15ConnectionFlagsE+0x81)[0x561ed209f281]
rr(+0xfdbcf)[0x561ed2111bcf]
rr(_ZN2rr13ReplayCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x570)[0x561ed2112d40]
rr(main+0x1da)[0x561ed2051ffa]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f9019caab97]
rr(_start+0x2a)[0x561ed20521ca]
=== End rr backtrace
Flags: needinfo?(nfroyd)

(Ignore the last attachment - hit send too early)

Aha, managed to figure out how to get gdb attached to the right process in mach run and mach test!

I can confirm that the thing causing the crash is indeed the lucet rust code... Further the crash has nothing to do with the calls to dup or pipes. The crash definitely occurs here which is simply a call in rust to open("/dev/null"). When this line of code is executed in mach run, the SYSSIG handler in SandboxBrokerClient.cpp is able to successfully request opening of /dev/null here (returns a descriptor that is not -1), while the same request is rejected for mach test (returns -1).

This also leads to a better framing of the issue I'm seeing. The question is not why mach test fails, but rather why does mach run pass? For the purposes of the code I am working on, the "correct" behavior should require explicitly allowing open and write access to /dev/null in the broker.

Barring any objections, I believe adding the exception seems to be right way forward. In the interest of scoping the work I'm doing - I do believe the fact that mach run allows this without the explicit rule feels like an orthogonal bug that may need separate investigation.

Here are a couple of other tests I did to see if I can get more info on why mach run passes.

  1. Test: Added a policy of CRASH_INSTEAD on /dev/null
    Result: this causes both mach run and mach test to fail on this line
    Conclusion: apparently, without CRASH_INSTEAD, mach run seems to permit /dev/null
  2. Test: Tried opening /dev/null in C++ code
    Result: this fails in both mach run and mach test
    Conclusion: This result is very confusing. Somehow rust code can open /dev/null, but C++ code cannot?

For completeness, here are the results of the next steps from my previous post

  • strace - did this. Wasn't too illuminating as both have calls to open dev/null... one succeeds the other one deosn't
  • CRASH_INSTEAD flag - mach run and mach test fails on the rust code listed above, if this policy is included.
  • inode number investigation - Wasn't the issue, as the crash is on the open, not on dup or the pipes...

gcp, jld: Please let me know if you have objections to the above conclusion i.e. adding the /dev/null exception + tracking any other part of this as a separate bug with different owners.

Flags: needinfo?(jld)
Flags: needinfo?(gpascutto)

Added a policy of CRASH_INSTEAD on /dev/null
Result: this causes both mach run and mach test to fail on this line
Conclusion: apparently, without CRASH_INSTEAD, mach run seems to permit /dev/null

Do you have a backtrace here? The log you give above is for another file.

This result is very confusing. Somehow rust code can open /dev/null, but C++ code cannot?

This isn't really possible as far as we understand. For this to work the calls have to be either in a different process (i.e. a non-sandboxed one for rust) or before the sandbox is initialized.

gcp, jld: Please let me know if you have objections to the above conclusion i.e. adding the /dev/null exception + tracking any other part of this as a separate bug with different owners.

You can split the bug to do whatever you originally wanted to do here, and to deal with figuring out why /dev/null is being accessed (and to get rid of it again, potentially) in another bug. I have no idea why the second bug would have to have another owner, though :-)

When this line of code is executed in mach run, the SYSSIG handler in SandboxBrokerClient.cpp is able to successfully request opening of /dev/null here (returns a descriptor that is not -1), while the same request is rejected for mach test (returns -1).

A MOZ_SANDBOX_LOGGING=1 output of a run where that happens (i.e. it being allowed) might reveal a lot.

Flags: needinfo?(gpascutto)

Is there some way for us to reproduce the testing that you are doing?

If necessary dump the WIP onto a github repo or something...

Do you have a backtrace here? The log you give above is for another file.

Have attached the 2 logs

  • mach run with the current policy unchanged (i.e. no crash instead) - this allows the open of /dev/null
  • mach run with the crash instead on /dev/null policy - this crashed on the open of /dev/null

In both logs, you can look for the string "!!!!!!!!!About to call the Graphite setup" which is printed just prior to the call to open "/dev/null". If the call to open /dev/null is successful in rust, the rust code prints "!!!!!!!!!!!!!!!!!!!!/dev/null opennig was ok" - this is present in the first log.

/dev/null is being accessed (and to get rid of it again, potentially) in another bug

Actually, we can't get rid of this for 2 reasons...

  1. My original confusion was actually because I thought I did not need a policy for open... I had thought that I needed a policy only on read and write. However, given that I need a policy for open as well, the only solution here is to add /dev/null
  2. We also do need write access for the library sandboxing use case. While we initially were Ok with redirecting output from the sandboxed library to the terminal, we subsequently found this is actually something that needs be context dependent or configurable... If we choose to disallow output to the terminal based on the config, this would mean redirecting all writes to /dev/null...

Thanks to the logs, I think I know what's going on here: the test is being loaded from a file: URL in the mach run case. File URLs are loaded in a dedicated content process which has read access to the entire filesystem (see FILE_REMOTE_TYPE in the source, and the policy for /: 1 -> 35 line in the logs), and File::open is a read-only open (despite being used for virtual stdout/stderr, which might be a bug).

Flags: needinfo?(jld)

File URLs are loaded in a dedicated content process which has read access to the entire filesystem

Oh, nice find! That is definitely it! The whole thing makes sense.

  • mach test loads these files from a local test server, which is why it was failing
  • the upstream lucet rust code is opening /dev/null read only (which as you pointed out looks like a bug given that they are using this for writing)
  • my C++ test of open /dev/null crashed even in mach run as in my test, I was opening with write access (which is what the upstream code should have been doing anyway)

Thanks so much for your help! This definitely clarifies everything.

I think just to summarize the long thread, I think the conclusions are

  • default policy is that /dev/null is inaccessible by default (except when handling file urls, where the entire file system is read only by design)
  • since we need read/write access to /dev/null, we need a sandbox broker policy here
  • upstream lucet has a bug where they are opening /dev/null read only for some reason
Type: task → enhancement
Summary: RLBox - Require info on Content sandbox policies → RLBox - Safely load dynamic library dealing with content sandbox policies
Attachment #9092869 - Attachment description: Bug 1575985 - Safely load RLBox dynamic library dealing with content sandbox policies r=froydnj → Bug 1575985 part 1 - Preload RLBox dynamic library before content sandbox to allow subsequent loads r=froydnj
Attachment #9093198 - Attachment description: Bug 1575985 part2 - Allow RW access to /dev/null in content sandbox, needed by lucet to run WASM sandboxed libraries r=froydnj,jld,gcp → Bug 1575985 part2 - Allow RW access to /dev/null in content sandbox r=froydnj,jld,gcp
Attachment #9093198 - Attachment description: Bug 1575985 part2 - Allow RW access to /dev/null in content sandbox r=froydnj,jld,gcp → Bug 1575985 part 2 - Allow RW access to /dev/null in content sandbox r=froydnj,jld,gcp
Pushed by tritter@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/eeaa7ecf70e3
part 2 - Allow RW access to /dev/null in content sandbox r=gcp
Status: NEW → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71

Sorry there's still Part 1 outstanding.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0309a3ff69a0
part 1 - Preload RLBox dynamic library before content sandbox to allow subsequent loads r=froydnj
Status: REOPENED → RESOLVED
Closed: 6 months ago4 months ago
Resolution: --- → FIXED

Since the status are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.

You need to log in before you can comment on or make changes to this bug.