Closed Bug 1347710 Opened 7 years ago Closed 2 years ago

Enable sandbox protections for the Windows GPU process

Categories

(Core :: Security: Process Sandboxing, task, P2)


Tracking


RESOLVED FIXED
108 Branch
Tracking Status
firefox76 --- wontfix
firefox108 --- fixed

People

(Reporter: jimm, Assigned: cmartin)

References

(Depends on 1 open bug)

Details

(Whiteboard: sb+)

Attachments

(10 files, 5 obsolete files)

11.55 KB, patch
Details | Diff | Splinter Review
122.08 KB, text/html
Details
17.62 KB, text/html
Details
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
2.75 KB, text/plain
chutten
: data-review+
Details
This process will eventually need native API access to most of what the current content process consumes, plus the APIs associated with painting to an HWND using Direct3D/GDI. We should get this process locked down as much as possible now so that we are testing Quantum graphics changes with a sandbox in place.
Assignee: nobody → davidp99
Whiteboard: sb? → sbwc2
Attached patch Enable GPU process sandbox (obsolete) — Splinter Review
The GPU process seems to have been able to handle pretty high sandbox settings for most of the categories.  Everything is turned up to 11 except:

* Job Level -- anything above JOB_NONE hits an ASSERT
* Alternative Desktop -- same ASSERT

The ASSERT is triggered by a call to IsWindow [1] trying to get info on an HWND from the parent:

	xul.dll!MOZ_ReportAssertionFailure(const char * aStr, const char * aFilename, int aLine) Line 154	C++
 	xul.dll!mozilla::widget::WinCompositorWidget::WinCompositorWidget(const mozilla::widget::CompositorWidgetInitData & aInitData, const mozilla::layers::CompositorOptions & aOptions) Line 32	C++
 	xul.dll!mozilla::widget::CompositorWidgetParent::CompositorWidgetParent(const mozilla::widget::CompositorWidgetInitData & aInitData, const mozilla::layers::CompositorOptions & aOptions) Line 16	C++
 	xul.dll!mozilla::layers::CompositorBridgeParent::AllocPCompositorWidgetParent(const mozilla::widget::CompositorWidgetInitData & aInitData) Line 1762	C++
 	xul.dll!mozilla::layers::PCompositorBridgeParent::OnMessageReceived(const IPC::Message & msg__) Line 598	C++
 	xul.dll!mozilla::ipc::MessageChannel::DispatchAsyncMessage(const IPC::Message & aMsg) Line 1872	C++
 	xul.dll!mozilla::ipc::MessageChannel::DispatchMessageW(IPC::Message && aMsg) Line 1809	C++
 	xul.dll!mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask & aTask) Line 1681	C++
 	xul.dll!mozilla::ipc::MessageChannel::MessageTask::Run() Line 1714	C++
 	xul.dll!MessageLoop::RunTask(already_AddRefed<mozilla::Runnable> aTask) Line 359	C++
 	xul.dll!MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask && pending_task) Line 369	C++
 	xul.dll!MessageLoop::DoWork() Line 441	C++
 	xul.dll!base::MessagePumpForUI::DoRunLoop() Line 212	C++
 	xul.dll!base::MessagePumpWin::RunWithDispatcher(base::MessagePump::Delegate * delegate, base::MessagePumpWin::Dispatcher * dispatcher) Line 58	C++
 	xul.dll!base::MessagePumpWin::Run(base::MessagePump::Delegate * delegate) Line 80	C++
 	xul.dll!MessageLoop::RunInternal() Line 239	C++
 	xul.dll!MessageLoop::RunHandler() Line 232	C++
 	xul.dll!MessageLoop::Run() Line 212	C++
 	xul.dll!base::Thread::ThreadMain() Line 182	C++
 	xul.dll!`anonymous namespace'::ThreadFunc(void * closure) Line 29	C++
 	[External Code]	

I haven't poked too hard to see how serious this is, but I would be surprised if we could get much further: I'm pretty sure we still use Windows messaging in parts of the graphics engine, and that's exactly what JOB_RESTRICTED+ and the alternative desktop are intended to block.  We could possibly remote that messaging if we chose to, but my guess is that this would be pretty involved.

I've set it to use level 1 (similar to content process level 1) outside of Nightly builds (in case of uplift).  Nightly is set to use level 2, which is what I described above.  This is just a suggestion.
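For reference, the shape of the level-to-settings mapping in the patch can be sketched as below. The enums are hypothetical local mirrors of the Chromium sandbox types (the real ones live under security/sandbox); only the level >= 2 values come from this patch, and the level 1 values are placeholders:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirrors of the Chromium sandbox enums; the real definitions
// live in the vendored sandbox code. Names chosen to match the patch hunk.
enum JobLevel { JOB_LOCKDOWN, JOB_RESTRICTED, JOB_NONE };
enum TokenLevel { USER_LOCKDOWN, USER_LIMITED, USER_NON_ADMIN };
enum IntegrityLevel {
  INTEGRITY_LEVEL_UNTRUSTED,
  INTEGRITY_LEVEL_LOW,
  INTEGRITY_LEVEL_MEDIUM
};

struct GpuSandboxSettings {
  JobLevel job;
  TokenLevel token;
  IntegrityLevel initialIntegrity;
  IntegrityLevel delayedIntegrity;
};

// Sketch of the mapping in SetSecurityLevelForGPUProcess: a higher pref
// value selects more restrictive settings.
GpuSandboxSettings SettingsForLevel(int32_t aSandboxLevel) {
  if (aSandboxLevel >= 2) {
    // From the attached patch: everything "turned up" except the job level,
    // which stays at JOB_NONE because anything stronger hits the ASSERT.
    return {JOB_NONE, USER_LOCKDOWN, INTEGRITY_LEVEL_LOW,
            INTEGRITY_LEVEL_UNTRUSTED};
  }
  // Level 1: weaker, uplift-safe settings (placeholder values).
  return {JOB_NONE, USER_NON_ADMIN, INTEGRITY_LEVEL_MEDIUM,
          INTEGRITY_LEVEL_LOW};
}
```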

Note that the GPU process can't automatically guarantee that it can run using the content sandbox because the GPU process contains a fair amount of code that was never sandboxed at all (e.g. the Compositor).

Tests: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2759d1b6267a5a2ecd67da04e9ac9546f0177cac

[1] https://msdn.microsoft.com/en-us/library/windows/desktop/ms633528(v=vs.85).aspx
Attachment #8851709 - Flags: review?(bobowencode)
Comment on attachment 8851709 [details] [diff] [review]
Enable GPU process sandbox

Review of attachment 8851709 [details] [diff] [review]:
-----------------------------------------------------------------

It doesn't look like we're calling the method to lower the sandbox to the delayed settings anywhere for the GPU process:
  mozilla::SandboxTarget::Instance()->StartSandbox();

You'll probably want to discuss where in the process initialisation we put this call with dvander.
If it makes sense they might be able to initialise some extra things before we call it, to enable stronger settings after, but we'll want to do it before we can process any untrusted data.
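The ordering invariant here can be modeled with a toy class (hypothetical, not the real mozilla::SandboxTarget): the process launches with the looser initial settings so it can finish initialization, StartSandbox() drops it to the tighter delayed settings, and untrusted data must only be handled after that drop:

```cpp
#include <cassert>

// Toy model of the two-phase sandbox. The real StartSandbox() applies the
// delayed token/integrity settings configured by the broker; here we just
// track whether the drop has happened.
class SandboxTargetModel {
 public:
  void StartSandbox() { mLoweredToDelayed = true; }
  // Returns false if untrusted data would be handled before the lowering,
  // i.e. the invariant discussed above would be violated.
  bool MayProcessUntrustedData() const { return mLoweredToDelayed; }

 private:
  bool mLoweredToDelayed = false;
};
```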

::: security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp
@@ +274,5 @@
> +void
> +SandboxBroker::SetSecurityLevelForGPUProcess(int32_t aSandboxLevel)
> +{
> +  // The GPU sandbox is based on the content sandbox as it looked when the
> +  // GPU process was split from the content process.

As I understand it, much of the GPU process code comes from the master process, so I'm not sure we need this comment.

@@ +305,5 @@
> +  } else if (aSandboxLevel >= 2) {
> +    jobLevel = sandbox::JOB_NONE;
> +    accessTokenLevel = sandbox::USER_LOCKDOWN;
> +    initialIntegrityLevel = sandbox::INTEGRITY_LEVEL_LOW;
> +    delayedIntegrityLevel = sandbox::INTEGRITY_LEVEL_UNTRUSTED;

I'd be a little surprised if we can get things running at this level.
We need to make sure it works with and without acceleration as well.

I'm assuming we must be passing the HWND from the GPU process to the master process as we would even fall foul of low integrity the other way around I think.

I think that the chromium GPU process currently runs with USER_LIMITED, low integrity, JOB_LIMITED_USER (but with exceptions for nearly everything [1]).

I suspect that unlike the content process, where you might get some issues with higher settings, the GPU process will either work or it won't.
It makes sense to have the conservative settings as 1 in case of uplift for the moment, but after that I think we should probably not bother with the complication of levels for the GPU process. Similar to the GMP process.

[1] https://cs.chromium.org/chromium/src/content/browser/gpu/gpu_process_host.cc?type=cs&l=273

@@ +356,5 @@
> +  MOZ_RELEASE_ASSERT(sandbox::SBOX_ALL_OK == result,
> +                     "Invalid flags for SetDelayedProcessMitigations.");
> +
> +  if (aSandboxLevel >= 4) {
> +    result = mPolicy->SetAlternateDesktop(true);

Let's remove this completely.
I don't think we'll ever be able to use this while we need shared access to HWNDs etc.
Attachment #8851709 - Flags: review?(bobowencode) → review-
Right, we should make sure Direct3D acceleration still works in the GPU process after adding the sandbox. I can't think of anywhere better to initialize the sandbox; it's possible that doing it after creating a D3D11 device would work, but then we might have to kill the process to get a new device.
(That is, if we use an access level that does not permit getting a D3D11 device.)
Attached patch Enable GPU process sandbox (obsolete) — Splinter Review
(In reply to Bob Owen (:bobowen) from comment #2)
> It doesn't look like we're calling the method to lower the sandbox to the
> delayed settings anywhere for the GPU process:
>   mozilla::SandboxTarget::Instance()->StartSandbox();
> 
> You'll probably want to discuss where in the process initialisation we put
> this call with dvander.

(In reply to David Anderson [:dvander] from comment #3)
> Right, we should make sure Direct3D acceleration still works in the GPU
> process after adding the sandbox. I can't think of anywhere it would be
> better to initialize the sandbox, it's possible that doing it after creating
> a D3D11 device would work, but then we'd maybe have to kill the process to
> get a new device.

I've put it in GPUProcessImpl::Init, which is about as early as possible and mimics e.g. the plugin process behavior.

> I'd be a little surprised if we can get things running at this level.

I had to reduce the intensity.  Level 1 is still the same and works fine, but I had to reduce the delayed integrity level (to LOW) and the access token (to USER_LIMITED).  This is certainly less protection.  I think there may be some low-hanging fruit in the job level, which at least initially fails due to some restricted Win32 APIs that are not essential.

> We need to make sure it works with and without acceleration as well.

I am checking it with about:support.  Web pages (even WebGL pages) look 'fine' when the sandbox forces the GPU to fail, but that's because it falls back to the Basic compositor.  This is easy to see in about:support.  With this patch, levels 1 and 2 look good in about:support.

---

Tests: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5c138f0726c19f7a903d64d1f0eba08f98df7377
Attachment #8851709 - Attachment is obsolete: true
Attachment #8859789 - Flags: review?(bobowencode)
Comment on attachment 8859789 [details] [diff] [review]
Enable GPU process sandbox

Review of attachment 8859789 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks handyman.
I think we should probably try and get this uplifted to Beta.

Can you file a follow-up bug on looking into changing the job level, please.

Once we're uplifted (or if it doesn't happen), I think we should hard code at least level 1 similar to how we do for NPAPI [1].
Can you file a follow-up for that as well, please.

[1] https://hg.mozilla.org/mozilla-central/file/40e828d5ac43/dom/plugins/base/nsPluginTags.cpp#l441

::: security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp
@@ +303,5 @@
> +  sandbox::IntegrityLevel delayedIntegrityLevel;
> +
> +  // The setting of these levels is pretty arbitrary, but they are a useful (if
> +  // crude) tool while we are tightening the policy. Gaps are left to try and
> +  // avoid changing their meaning.

nit: I don't think we need this comment given that we're using both levels that are set.
Attachment #8859789 - Flags: review?(bobowencode) → review+
Something else that's worth checking is that WebRender still works OK with the sandboxing. It's off by default in Windows Nightly builds; you can enable it by setting gfx.webrender.enabled=true and restarting the browser. I don't expect there to be any problems, but you never know.
We decided to change the patch so that it sets the sandbox to level 1 on all builds.  We'll push that if it's fine on Nightly for a week or so (which it should be).  We will probably then quickly upgrade the Nightly setting to level 2.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #7)
> Something else that's worth checking is that webrender still works ok with
> the sandboxing. It's off by default in Windows nightly builds, you can
> enable it by setting gfx.webrender.enabled=true and restarting the browser.

Checked and it looks good as well.

Tests with the weaker sandbox:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7fb961f345235af0233a7d14bbcf87ef4e69c318&selectedJob=93807046
Attachment #8859789 - Attachment is obsolete: true
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/69a988036095
Enable sandbox protections for the Windows GPU process. r=bobowen
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/69a988036095
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
This has broken WebVR in Nightly - See Bug 1359460

The GPU process accesses the SteamVR API and the Oculus API through LoadLibrary calls.  Perhaps this is being blocked by the sandboxing?
(In reply to :kip (Kearwood Gilbert) from comment #11)
> This has broken WebVR in Nightly - See Bug 1359460
> 
> The GPU process accesses the SteamVR API and the Oculus API through
> LoadLibrary calls.  Perhaps this is being blocked by the sandboxing?

The libraries that we load can change at any time and may do things that are not allowed in the sandbox.  I would expect to see activity related to IPC/shmem, file access, and other graphics APIs.
Depends on: 1359460
Kip and Daosheng,

Can you please look at the Vive/Rift with sandbox violation logging on and add the reports here to see if this can be loosened or otherwise modified for 55?
Flags: needinfo?(kgilbert)
Flags: needinfo?(dmu)
I have started investigating the interaction of the sandbox with the Oculus runtime.  My findings so far are in the comments on bug 1359460.
Flags: needinfo?(kgilbert)
I have updated my findings on bug 1359460. For OpenVR, it looks like the sandbox does not allow OpenVR to send IPC messages to its processes.
Flags: needinfo?(dmu)
I stopped getting D3D11 yesterday and narrowed it down to this change. Something about having Visual Studio or maybe the debug SDK for D3D11 causes sandboxing to break it. See bug 1245309. It's extremely rare in Telemetry so it's not actually worth worrying about, but in case any other Gecko developers encounter this problem, the sandbox level has to be 0.
Depends on: 1245309
Depends on: 1361719
Once Bug 1362578 lands (Move VRManager to its own process), WebVR will no longer preclude the GPU sandboxing.

I have also added Bug 1362579 (Add Sandboxing to the VR process) to track the sandboxing efforts for this VR process, which will have different rules applied.
Depends on: 1362578
See Also: → 1362579
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: sbwc2 → sbwc3
Priority: -- → P3
Whiteboard: sbwc3 → sb+
Target Milestone: mozilla55 → ---
Depends on: 1404689
Blocks: 1476092
I think this is the other way round, this bug is blocked on the VR process.
No longer blocks: 1476092
Depends on: 1476092

We have already enabled the VR process in Nightly (bug 1516554). I am willing to help enable the GPU sandbox; please ping me if you need any help.

Trying to re-enable now that :daoshengmu has moved VR service to its own process.

Unfortunately, ran into this assertion on a Try build with security.sandbox.gpu.level=1:

Assertion failure: mCompositorWnd, at z:/build/build/src/widget/windows/WinCompositorWidget.cpp:304

Tested on my workstation, and it Works on My Machine™, so I'm going to try to reproduce on my laptop that has an Nvidia card and see if that helps. Have more info soon hopefully.

Depends on: 1524591

Previously, the GPU sandbox was enabled and caused the VR service to break.

The VR service has now been moved into its own process, and now the GPU
sandbox should be able to work fine. We will initially apply sandbox level '1'
for a time, and then increase to sandbox level '2' if everything works just
fine.

There are 2 things to note for this change:

  1. There is a pref to enable the separate process for the VR service. If
    that pref is disabled, VR will be running in the GPU process and we can't
    enable the GPU sandbox
  2. Currently, WebRender causes the GPU sandbox to fail because it tries to
    make a GPU-process window the child of a browser-process window, which
    violates the Windows Integrity Level mechanism. This can be removed when
    WebRender changes this behavior.
Keywords: checkin-needed

Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/60f9b60f1f4f
Re-enable GPU sandbox level 1 r=bobowen

Keywords: checkin-needed
Status: REOPENED → RESOLVED
Closed: 7 years ago → 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67
Depends on: 1526661
Backout by dvarga@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/14ee0b7ecef3
Backed out changeset 60f9b60f1f4f for causing bug 1526661. a=backout
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla67 → ---

This also caused a performance regression:
== Change summary for alert #19273 (as of Mon, 11 Feb 2019 06:19:43 GMT) ==

Regressions:

133% sessionrestore windows10-64-qr opt e10s stylo 378.96 -> 884.58

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=19273

The backout fixed the issue

== Change summary for alert #19269 (as of Sat, 09 Feb 2019 19:54:15 GMT) ==

Regressions:

4% ts_paint windows10-64-qr opt e10s stylo 327.83 -> 341.75
4% ts_paint_webext windows10-64-qr opt e10s stylo 328.71 -> 340.92

Improvements:

62% sessionrestore windows10-64-qr opt e10s stylo 885.92 -> 333.92
5% sessionrestore_no_auto_restore windows10-64 opt e10s stylo 336.21 -> 320.50
5% sessionrestore windows10-64 opt e10s stylo 332.33 -> 317.25
4% sessionrestore_no_auto_restore windows7-32 opt e10s stylo 348.42 -> 332.83
4% ts_paint windows10-64 opt e10s stylo 338.25 -> 324.92
3% ts_paint windows7-32 opt e10s stylo 352.67 -> 340.58

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=19269

Tracking for 67 as it caused major regressions on landing.

Changing the priority to p2 as the bug is tracked by a release manager for the current nightly.
See How Do You Triage for more information

Priority: P3 → P2

Updating to P1, actively being worked on by Chris.

Assignee: davidp99 → cmartin
Priority: P2 → P1

We have enabled the WebVR process in FF 67 Nightly. At the beginning of FF 68, we are going to enable the VR process by default in FF stable as well.

Attached file dump-15456-26-0.html

Attached is a dump of the painted layer when the menu is 'broken' by this change. It just shows that the layer is still being painted properly, so the issue definitely lies somewhere between painting and compositing.

Type: defect → task
Attached file after_composite.html

This is the surface after compositing is finished for the LayerManager that draws the hamburger menu. It looks like the issue is not between painting and compositing - It is that somehow the composited image is not being drawn on to the screen.

As I was debugging, I noticed that the hamburger menu has its own nsWindow and LayerManager. I'm now thinking the issue is more related to windowing problems - maybe Z-order, rectangle invalidation, etc. Going to try and narrow that now.

Gian-Carlo had a thought that at one point, the hamburger menu probably wasn't broken by this change - Back when David first submitted this, people were complaining about VR but not about the hamburger menu breaking, so it's likely that this behaviour is a regression.

We suspected that it might have been this changelist:

https://hg.mozilla.org/integration/mozilla-inbound/rev/19142efa84a5

But alas, it was not - The hamburger menu is still broken even without this commit.

My next experiment will be to do a little binary search and see if I can figure out what CL might have caused the regression.

It looks like the root issue is this: For any window that uses transparency, the GPU process calls the Win32 UpdateLayeredWindow() API (here) to present the in-memory Device Context it finished painting to. Unfortunately, the HWND it passes is for a window owned by the main process and can't be used by the low-integrity GPU process, so the call fails with Error Code 5 - Access Denied. (Note that all of the caller code swallows this error, which is why there was no assertion. I will fix that as part of this bug)
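The failure mode, and the fix of surfacing the error instead of swallowing it, can be sketched with portable stand-ins (the stub below is hypothetical; the real calls are UpdateLayeredWindow() and GetLastError()):

```cpp
#include <cassert>

// Portable stand-ins so this sketch runs anywhere. In the real code, BOOL,
// ERROR_ACCESS_DENIED, and the last-error state come from the Win32 API.
using BOOL = int;
constexpr unsigned long ERROR_ACCESS_DENIED = 5;
static unsigned long gLastError = 0;

// Models UpdateLayeredWindow(): presenting to an HWND owned by a
// higher-integrity process fails with Access Denied.
BOOL StubUpdateLayeredWindow(bool aCallerOwnsHwnd) {
  if (!aCallerOwnsHwnd) {
    gLastError = ERROR_ACCESS_DENIED;
    return 0;
  }
  return 1;
}

// The error-handling fix described above: check the return value and
// report the error code instead of silently continuing.
bool PresentLayeredWindow(bool aCallerOwnsHwnd, unsigned long* aErrorOut) {
  if (!StubUpdateLayeredWindow(aCallerOwnsHwnd)) {
    *aErrorOut = gLastError;  // Would feed an assertion/log in Gecko.
    return false;
  }
  return true;
}
```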

I have already tried a few things:

  1. Disabling OMTC for transparent windows
    • Fixes the hamburger menu, but any addons that have doorhangers are broken
  2. Creating a separate independent window in GPU process
    • Transparency doesn't work. Entire window becomes opaque
  3. Moving the UpdateLayeredWindow() call to the main process
    • Unfortunately, Win32 doesn't allow sharing a Device Context between processes, so I got an Invalid Parameter error

However, I finally found something that works (based on a suggestion from Bob Owen): Create a child window in the low-integrity GPU process, ask the main process to establish a "parent-child" relationship between the widget and this new window, and then resize the child window so that it occupies the entire client area of the parent window.

It looks like something similar to this is already done by DirectComposite, so there's already precedent for it. And it might end up being quite nice actually because it will unify the DirectComposite path with the other paths.

It seems like it also makes more sense from a "Supported Behaviour" perspective. Right now, the GPU process is making all kinds of Win32 calls on a higher-integrity window that it doesn't own, and hoping that Win32 will incidentally support it. After this change, these calls will instead be made on a window the GPU process owns.

I am working on a patch right now to implement this fix. It should be up for review fairly soon.
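The geometry of the child-window approach is simple and can be sketched portably (RectModel is a stand-in for Win32 RECT; the real code would use GetClientRect()/MoveWindow() after the SetParent-style reparenting):

```cpp
#include <cassert>

// Stand-in for the Win32 RECT/size bookkeeping in this sketch.
struct RectModel {
  int x, y, width, height;
};

// The GPU process's child window covers the parent widget's entire client
// area. Client coordinates start at (0, 0), so the child simply spans the
// full client size and must track it on every parent resize.
RectModel ChildRectForParentClient(int aClientWidth, int aClientHeight) {
  return RectModel{0, 0, aClientWidth, aClientHeight};
}
```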

(In reply to Chris Martin [:cmartin] from comment #33)

Create a child window in the low-integrity GPU process, ask the main process to establish a "parent-child" relationship between the widget and this new window, and then resize the child window so that it occupies the entire client area of the parent window.

This is bad because it will synchronize the input queues between the GPU process and the parent process. If one of them stops pumping messages, the other one will hang.

(In reply to Aaron Klotz [:aklotz] from comment #34)

(In reply to Chris Martin [:cmartin] from comment #33)

Create a child window in the low-integrity GPU process, ask the main process to establish a "parent-child" relationship between the widget and this new window, and then resize the child window so that it occupies the entire client area of the parent window.

This is bad because it will synchronize the input queues between the GPU process and the parent process. If one of them stops pumping messages, the other one will hang.

I think we raised similar concerns about the main compositor HWND that we draw to, but the GPU process has a dedicated WinCompositorWindowThread, which I think alleviates these concerns (at least for the GPU process hanging the main one).
So, I guess we could use the same thread, by adding to that class (if needed).

Just keep in mind that this synchronization is transitive, so if anything else on the system is synced up with our browser main thread (third-party code, perhaps), that will also affect the GPU process.

(In reply to Aaron Klotz [:aklotz] from comment #34)

(In reply to Chris Martin [:cmartin] from comment #33)

Create a child window in the low-integrity GPU process, ask the main process to establish a "parent-child" relationship between the widget and this new window, and then resize the child window so that it occupies the entire client area of the parent window.

This is bad because it will synchronize the input queues between the GPU process and the parent process. If one of them stops pumping messages, the other one will hang.

Thanks for pointing this out! I played around with things a little bit, and it looks like I can un-synchronize the queues by calling AttachThreadInput(tid1, tid2, FALSE); I have tested it already with a toy example to see if it works, and it appears that yes - it does make it so the two windows go back to having individual input queues.

Furthermore, I think it will help that the child window is created with WS_DISABLED, so hopefully it never even receives input events in the first place. We don't really need it to receive events at all - It's literally just there to be a texture over top of the widget that has all the smarts in it.

Personally, I think the "gold-standard" way to solve this problem is probably to follow after what video game engines do and move all windowing and swapchain/presentation logic into the main UI thread and have the GPU only operate on a surface (aka buffer) it 'borrows' from the UI thread. This is more-or-less how Vulkan was designed to be used.

But that seems like it would involve a fair amount of refactoring of the different GPU compositing paths (D3D11, D2D, OGL, DirectComposite, or CPU-based Cairo/Skia), so I don't know that I should consider that unless you/Bob/Jimm/etc think it's something I should seriously consider.

Yes, AttachThreadInput can fix this, though since it is not possible to query these attachments from user mode (see this repo for my efforts to make it happen with a kernel driver), it can be a bit challenging to validate.

I do think that the refactoring is the best thing to do in the long term, but for short to medium term, we probably need somebody above our pay grade to make that call.

I'm at a point where my patch is getting close to ready for review, but I'm investigating 2 issues that I've recently noticed as I've been testing my change more.

  1. The hamburger menu works just fine, but it looks like WebExt doorhangers aren't receiving mouse clicks or keystrokes. The child window in the parent-child relationship is supposed to be using WS_DISABLED, so the inputs should just be going right through it, but they aren't. I don't think it's anything too difficult though - Probably something simple like another piece of code is re-setting the window style somewhere and removing the WS_DISABLED flag.
  2. The path where DX11 compositing is used, but without WebRender (which is no longer the default and requires config options to recreate) just displays a blank white window. I suspect that DX11 is just targeting the wrong window and I missed a spot where I need to hand it the new "child" window handle instead of the old "parent" one.

I've run into several fairly serious issues with this parent-child thing that I need to work through to continue with this patch:

  • I was wrong about the lack of WS_DISABLED being the reason that the clicks aren't getting through to WebExt doorhangers. It also happens both with GPU sandbox on and off. I'm going to investigate the message queues for both parent and child and see why the child may be stopping messages from passing through to the parent - That shouldn't be the case when the child is disabled.
  • The child window is having trouble getting size change notifications every time its parent's size changes. This manifests itself in weird problems, like a situation where the Awesome Bar's search suggestions weren't being drawn after there were more than a few suggestions in it (I'm guessing that it's drawn with a default size and resized larger if needed)
  • When the GPU sandbox is enabled, I noticed from logs that there is a failure to create the compositor on the 1st try. I'm going to have to investigate what's going on with that now too.
  • I tried detaching the two windows' threads using AttachThreadInput(..., FALSE), and inputs stopped working everywhere in Firefox altogether. That is not great.

Luckily I still have several ideas that I'm going to try to see if I can break through these issues, but it definitely seems like this is going to be more complicated than I had initially thought.

I took a look at Chromium's code a while ago, and it seems like the way that they handle this situation is by having a memory device context owned by the main UI process that is backed by a bitmap buffer shared with the GPU process.

The GPU process then renders into that shared memory using Skia, notifying the UI process when it's done. Upon receiving that notification, the UI process presents the buffer by either doing a BitBlt() or UpdateLayeredWindow(), depending on whether the window is layered or not.

I've moved forward with trying to get a similar solution working in our code. I think it has several advantages over the parent-child thing I was attempting earlier: It doesn't rely on undocumented Win32 GDI behaviour, it doesn't require linking together the input queues and the issues that brings, it already has precedent with Chromium, and it's also conceptually similar to the way that most 3D APIs expect to be used: Your main UI thread owns the window so it can directly process input messages and whatnot, and when it's time to render it "borrows" a surface from the swapchain and hands it to the rendering thread who draws into it and hands it back to the UI thread when it's done for presentation.
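The Chromium-style handoff described above can be modeled portably. The "shared buffer" here is a plain vector standing in for a shared memory section mapped into both processes, and the present-path enum stands in for the BitBlt()/UpdateLayeredWindow() choice:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// UI-side decision: layered (per-pixel transparent) windows need
// UpdateLayeredWindow(); opaque ones can use a plain BitBlt().
enum class PresentPath { BitBlt, UpdateLayeredWindow };

PresentPath PickPresentPath(bool aWindowIsLayered) {
  return aWindowIsLayered ? PresentPath::UpdateLayeredWindow
                          : PresentPath::BitBlt;
}

// GPU-side: render into the shared buffer. Here we just write a test
// pattern; the real GPU process would composite with Skia.
void RenderIntoSharedBuffer(std::vector<uint32_t>& aSharedPixels) {
  for (auto& px : aSharedPixels) {
    px = 0xFF00FF00;  // Placeholder opaque pixel value.
  }
}
```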

I initially thought the easiest way to go about this would be to hook StartRemoteDrawing and EndRemoteDrawing to request a shared memory buffer from the UI process to render into.

Unfortunately, that did not go well. I didn't realize that it would be outstandingly difficult to send synchronous messages from the GPU process back to the UI process. It's not directly allowed by IPDL, since the GPU process has the parent and the UI process has the child, and synchronous messages from parent-to-child are "verboten".

Gian-Carlo and I thought maybe there would be a way to create a "reverse protocol" that would allow synchronous messages to flow the other way, but after talking to Jed we got the impression that it may be very difficult to do, and even more difficult to do well. It would likely involve spinning up a new thread to allow the synchronous message to happen without deadlocking the main IPC threads of both processes, and it's likely that shutdown of the new protocol would give us headaches.

It looks like the easiest way to do this is to set up the shared memory buffer before the IPC call is made to the GPU process to start compositing, and present it once the GPU process indicates that it's done.

I spent a fair amount of time this week looking through the GPU compositing code to try and understand how it all fits together so I could understand where the right places are to "borrow" the buffer in the UI process before GPU compositing starts, and then where to present the buffer after compositing ends.

I believe now that the right places to do this are:

  • Right before the ClientLayerManager calls ShadowLayerForwarder::EndTransaction to begin the process of sending the layer tree updates to the GPU process for compositing with a "transaction id" it generates above and stores in mLatestTransactionId.
  • After the UI process receives the LayerManager::DidComposite message from the GPU process, specifically this function where the same ClientLayerManager from above is invoked with the same "transaction id" as above, letting it know that it's fine to retire the buffer.

I am currently writing up code to test this. I should be ready to test soon, and hopefully it works! fingers crossed

Currently, the GPU process renders directly to windows that are owned by the
UI process using the windows' Device Context.

Once the GPU sandbox is enabled, this represents a low-integrity process
trying to render into a higher integrity one. This is an access violation,
and results in transparent windows being rendered as an invisible black
rectangle (RGBA 0x00000000)

It seems like the best way forward is to have the GPU process composite into
a shared memory buffer and then request the UI process to "present" the
memory buffer using either a BitBlt() or UpdateLayeredWindow().

It's been a while since I've updated this ticket - but here goes.

This changelist above is a very rough WIP (in fact, I'm still changing quite a lot of things about it). However, it does work :)

With the GPU sandbox enabled, all the windows render properly including the hamburger menu and WebExt doorhangers. The idea was to have everything render to a shared memory buffer and then "present" that shared memory buffer whenever the UI process gets an asynchronous "Done compositing" message.

But there are a number of things I need to clean up: using the FF primitives for shared memory instead of directly using the Win32 API (insecure), naming things better, only using the shared memory buffer when remote compositing is enabled, using some of the other already-built pieces for shared surfaces, and a bunch of other things.

I'm getting close to having something a bit more ready for review, but :gcp had mentioned that :jimm might want to have a look at some of the stuff I've been trying.

Depends on: 1604412

Currently, attempting to register a peer process a second time returns an
error code. That triggers a warning to be printed for any child process that
is sandboxed with a JOB_NONE job level.

Without a job object to track the child process, the sandbox code requires the
host code to deal with the lifetime of launched processes. Since the host code
still expects to be able to duplicate handles, it still adds the new process
to its list of peer processes.

However, Gecko always adds every non-plugin child as a peer process, which
causes a double-add in this case.

This change simply allows double-adds, and will just report success without
doing anything if a peer is already registered.

Attachment #9101720 - Attachment is obsolete: true

When the GPU sandbox is enabled, access to most of the filesystem is blocked.

The GPU process uses a directory, "%profiledir%/shader-cache", to cache
compiled shaders for performance reasons. Not allowing access to that
directory results in a HUGE performance regression when the sandbox is
turned on.

Attachment #9133846 - Attachment description: Bug 1347710 - Suppress warning if peer process is registered multiple times → Bug 1347710 - Don't call AddTargetPeer if process is sandboxed
Pushed by cmartin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/721bbaf32929
Don't call AddTargetPeer if process is sandboxed r=bobowen,handyman
Status: REOPENED → RESOLVED
Closed: 5 years ago → 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla76
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Keywords: leave-open
Pushed by cmartin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2ce6f7489ea2
Make GPU sandbox allow access to shader cache r=bobowen
Attachment #9041919 - Attachment description: Bug 1347710 - Re-enable GPU sandbox level 1 → Bug 1347710 - Enable sandbox protections for the Windows GPU process
Pushed by cmartin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a6904ec3d1e0
Enable sandbox protections for the Windows GPU process r=bobowen
Regressions: 1630860
Backout by dvarga@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/48cefa8cc058
Backed out changeset a6904ec3d1e0 for causing Bug 1630860 a=backout

== Change summary for alert #25797 (as of Thu, 30 Apr 2020 18:56:20 GMT) ==

Improvements:

2% tp5o_scroll windows7-32-shippable opt e10s stylo 1.37 -> 1.34

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=25797

Currently, there is an outstanding issue where enabling the GPU sandbox breaks
mouse-wheel scrolling on laptops with Intel GPUs.

This will enable the GPU sandbox on Nightly for non-Intel GPUs, so we can keep
catching sandbox regressions while we try to figure out the scrolling issue.

See Bug 1630860 for more info

Pushed by cmartin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2cedf945f040
Enable Windows GPU sandbox for supported hardware r=gcp

This regressed bug 1630860 again, but if possible I'd like us to hold off on backing it out until we can get a bit more info from the affected people in that bug. Unfortunately, we've been completely unable to reproduce the issue, even with similar hardware and drivers.

Okay, we got 3 sets of data in bug 1630860 now. We'll back this out once again pending a root cause analysis.

Unfortunately I don't think we're going to be able to fix this in a timely manner.

Please back out changeset 2cedf945f040

Flags: needinfo?(sheriffs)
Priority: P1 → P2
Depends on: 1651670
Keywords: stalled
Attachment #9145915 - Attachment is obsolete: true
Attachment #9041919 - Attachment is obsolete: true
Severity: normal → S3

Depends on D160275

Depends on D160278

Comment on attachment 9300216 [details]
Bug 1347710 - Add GPU sandbox to crash reporter annotations

Request for data collection review form

  1. What questions will you answer with this data?

Whether a user whose Firefox has just crashed had the GPU sandbox enabled or not

  1. Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements?

It helps us debug issues that may arise as a result of the GPU sandbox being enabled, which is a security feature that will benefit end-users

  1. What alternative methods did you consider to answer these questions? Why were they not sufficient?

The only alternative is to not report the GPU sandboxing level, which would leave us without an important piece of information for debugging GPU process crashes

  1. Can current instrumentation answer these questions?

In a more indirect and involved way... yes... which actually further reinforces the idea that this isn't providing us with any information we didn't already have.

  1. List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories found on the Mozilla wiki.

Note that the data steward reviewing your request will characterize your data collection based on the highest (and most sensitive) category.

+-------------------------------+-------------------------------+------------------+
| Measurement Description       |  Data Collection Category     |  Tracking Bug #  |
+-------------------------------+-------------------------------+------------------+
| GPU Process Sandboxing Level  |  Category 1 “Technical data”  |  1347710         |
+-------------------------------+-------------------------------+------------------+
  1. Please provide a link to the documentation for this data collection which describes the ultimate data set in a public, complete, and accurate way.

Here is a link to the changelist that is being proposed. The "CrashAnnotations.yaml" has a description field, and the change itself is similar to previous fields that are already collected.

  1. How long will this data be collected? Choose one of the following:

I want to permanently monitor this data. (cmartin)

  1. What populations will you measure?

Every user on every channel

  1. If this data collection is default on, what is the opt-out mechanism for users?

They can disable crash reporting

  1. Please provide a general description of how you will analyze this data.

It will be used to aid in crash analysis.

  1. Where do you intend to share the results of your analysis?

It will be displayed along with other crash stats, subject to the usual privacy policies therein

  1. Is there a third-party tool (i.e. not Glean or Telemetry) that you are proposing to use for this data collection?

No.

Attachment #9300216 - Flags: data-review?(chutten)
Keywords: stalled

Data Review Requests fit into the stewardship process most nicely when they are text attachments.

Attachment #9300416 - Flags: data-review?(chutten)
Attachment #9300216 - Flags: data-review?(chutten)

Comment on attachment 9300416 [details]
data collection review request

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is part of Crash Reporting so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, :cmartin is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default off (Crash Reporting is opt-in)

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does the data collection use a third-party collection tool?

No.


Result: datareview+

Attachment #9300416 - Flags: data-review?(chutten) → data-review+

Thanks, Chutten! Time to land this bird :)

Pushed by cmartin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/79f44b04c7aa
Add GPU sandbox level to about:support r=handyman,mossop,flod
https://hg.mozilla.org/integration/autoland/rev/cc9dba140361
Change sandbox.gpu to a static pref r=handyman
https://hg.mozilla.org/integration/autoland/rev/b66ed3a13a98
Add GPU sandbox to crash reporter annotations r=handyman,gsvelto
Pushed by cmartin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3525211fa094
Re-enable GPU sandbox on Windows Nightly r=jrmuizel
Regressions: 1797887

In Firefox Nightly 108.0a1 (2022-10-27), the GPU sandbox is enabled by default (security.sandbox.gpu.level = 1), but unfortunately it causes WebRender to fall back to software rendering (see about:support). When I set it to 0, hardware acceleration works.

Tom25519, can you file a separate bug for that?

Flags: needinfo?(Tom25519)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #81)

Tom25519, can you file a separate bug for that?

OK, I have filed a bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1798091

Flags: needinfo?(Tom25519)
Regressions: 1798091
Regressions: 1798014

Should this bug still be open?

Status: REOPENED → ASSIGNED
Flags: needinfo?(cmartin)
Target Milestone: mozilla76 → ---
Depends on: 1797887
Flags: needinfo?(cmartin)
Depends on: 1799470

@RyanVM -- I think it's best to leave this bug open to track the handful of regressions that have occurred (and I'm sure more are on the way once this hits the Release channel).

We're tracking (and fixing) those in new bugs, though. This bug was just for getting the feature enabled.

Fair enough! I'll close it.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago → 2 years ago
Resolution: --- → FIXED
Keywords: leave-open
Target Milestone: --- → 108 Branch
Blocks: 1809519
Regressions: 1810878