Closed Bug 1987906 Opened 9 months ago Closed 9 months ago

GPU process sandboxing can be circumvented by crashing the GPU process 6 times

Tracking

Status:

RESOLVED FIXED

Milestone:

145 Branch

Tracking Flags:

Tracking

Status

firefox-esr115

---

wontfix

firefox-esr140

---

wontfix

firefox143

---

wontfix

firefox144

---

wontfix

firefox145

fixed

People

(Reporter: simonf, Assigned: aosmond)

References

(Blocks 1 open bug)

Details

(Keywords: reporter-external, sec-want, Whiteboard: [adv-main145-] )

Attachments

(1 file)

(secure) 9 months ago Andrew Osmond [:aosmond] (he/him) 48 bytes, text/x-phabricator-request

Simon Friedberger [:simonf]

Reporter

Description

•

9 months ago

It seems layers.gpu-process.max_restarts allows deactivating the GPU process sandbox by crashing the GPU process a few times. This mostly defeats the purpose of the sandbox.

This was reported as part of Bug 1984825.
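
To make the reported behavior concrete, here is a minimal model of a crash counter that turns the GPU process off once a limit is reached. It is only a sketch: `GpuProcessModel`, `crash_limit`, and the `>=` comparison are hypothetical stand-ins, and the real relationship between layers.gpu-process.max_restarts and the number of crashes needed is not modeled.

```python
# Hypothetical model of the behavior described in this report; not Gecko code.
from dataclasses import dataclass


@dataclass
class GpuProcessModel:
    crash_limit: int      # assumed stand-in for the limit derived from the pref
    crashes: int = 0
    enabled: bool = True  # True while graphics work runs in the GPU process

    def on_gpu_process_crash(self) -> None:
        """Record one crash and disable the GPU process at the limit."""
        self.crashes += 1
        if self.crashes >= self.crash_limit:
            # After this point the model treats the GPU process (and its
            # sandbox) as no longer in use.
            self.enabled = False


def crashes_needed_to_disable(crash_limit: int) -> int:
    """Count how many crashes it takes before the model disables the process."""
    model = GpuProcessModel(crash_limit=crash_limit)
    while model.enabled:
        model.on_gpu_process_crash()
    return model.crashes
```

In this model, anything able to trigger `on_gpu_process_crash` repeatedly controls whether the GPU process stays enabled, which is the concern raised here.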

Simon Friedberger [:simonf]

Reporter

Updated

•

9 months ago

Summary: GPU process sandboxing can be circumvented by crashing the GPU process 3 times → GPU process sandboxing can be circumvented by crashing the GPU process 6 times

Andrew McCreight [:mccr8]

Updated

•

9 months ago

Group: core-security → gfx-core-security

Andrew McCreight [:mccr8]

Updated

•

9 months ago

Keywords: reporter-external

Andrew McCreight [:mccr8]

Comment 1

•

9 months ago

I think this is like a "sec-want". It would provide additional hardening for Windows users, but we already need to support the fallback path on macOS and Linux.

Improving the situation here would be tricky. The goal of this fallback is to improve stability for users with flaky graphics drivers. Is it possible to distinguish "this user went to a website with fancy Canvas2D stuff for the first time and their drivers are bad" from "this user is under attack from a web site they have not visited before"?

Keywords: sec-want

Lee Salzman [:lsalzman]

Comment 2

•

9 months ago

•

Edited

I think it is inaccurate to say that we are circumventing the sandbox from the GPU process, for the reason that the GPU process is not enabled on all platforms or for all users.

Half our platforms do not have it by default, and on those platforms that do, we don't always guarantee the user even gets it to start with. So for a sizeable portion of our users, there is no sandbox beyond the separation of content and parent process.

The GPU process was intended to defend against drivers that are intermittently buggy, so that the GPU can go down without taking out the parent process and with it the entire browser. For drivers that are so indefensibly buggy that it makes no sense to continue acceleration, we allow the acceleration to fall back to software rendering outside the GPU process.

A secondary intention was to move privileged access to some OS APIs (i.e. Win32) out of the content process. However, it was more of an unintended benefit that this sometimes moved that access to the GPU process rather than the parent process.

The architecture wasn't really meant to work as an isolation mechanism in the way that content processes are isolated from the parent; it was more about keeping intermittently buggy drivers from taking down the browser.

Right now there are many assumptions built into the very core of Gecko that if you have the GPU process you get acceleration, but if you don't have acceleration, we must disable the GPU process to provide that fallback.

This assumption makes sense for a fallback in that fallback/software Canvas rendering actually takes place in the content process; it does not suddenly happen in the parent process, so if anything, the "sandbox", or at least the isolation between parent and content processes, is partially strengthened, not weakened, when the GPU process goes away.

This falls down for WebRender in that we still remote that to the parent when the GPU process isn't available. The same problem is also true of WebGL, which can't operate in the content process either. Due to moving OS API access out of the content processes, we have no choice but to use the parent process for these sensitive tasks when there is no GPU process possible. Here is where the problem really lies.
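
As a rough summary of the placement described so far, the sketch below maps each workload to the process that would run it. The function name, the workload strings, and the assumption that all three workloads run in the GPU process when it is available are illustrative only, not Gecko's actual dispatch logic.

```python
# Illustrative summary of the placement described above; not Gecko code.
KNOWN_WORKLOADS = ("canvas2d", "webrender", "webgl")


def process_for(workload: str, gpu_process_available: bool) -> str:
    """Return the process assumed to run `workload` in this simplified model."""
    if workload not in KNOWN_WORKLOADS:
        raise ValueError(f"unknown workload: {workload}")
    if gpu_process_available:
        # Assumption: with a GPU process, all three workloads run there.
        return "gpu"
    if workload == "canvas2d":
        # Software Canvas fallback renders in the content process.
        return "content"
    # WebRender and WebGL are remoted to the parent process instead.
    return "parent"
```

Only the last branch moves work into the parent process, which is where the concern in this bug applies.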

I don't know that we have adequately addressed all the performance assumptions of doing software fallback rendering and composition for WebRender within the GPU process.

And we would still need to move into a world where all platforms and users get the GPU process for this to make sense, rather than have the GPU process only intermittently available by default. That is tricky on some platforms.

I say all this to point out that what we are discussing here is not actually a "bug" in the GPU process design. We are instead discussing how to rearchitect the GPU process into something else entirely, which is conceivably possible, but not a simple fix.

Teodor Tanasoaia [:teoxoy]

Updated

•

9 months ago

Severity: -- → S3

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 3

•

9 months ago

I'll look at reworking the logic to avoid this. Ideally that pref is only to be used when we first are launching the GPU process at startup. Once we determine we can have a stable GPU process, we should never deviate from that, but there are corner cases like the above where this is not the case. It might have been intentional at the time I tightened up the logic but now with 99.7% of users sticking with a GPU process, I think we can afford to be more aggressive here.
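
A minimal sketch of that intent, assuming a hypothetical `GpuLaunchPolicy` with a `stable` flag that is set once a launch succeeds. The names, the returned strings, and the `>=` comparison are made up for illustration and do not reflect the actual patch.

```python
# Hypothetical sketch: the restart limit only applies before the GPU process
# has been shown to be stable. Not the actual Gecko logic.
from dataclasses import dataclass


@dataclass
class GpuLaunchPolicy:
    max_restarts: int          # stands in for layers.gpu-process.max_restarts
    stable: bool = False       # set once a launch has succeeded
    startup_failures: int = 0

    def on_launch_succeeded(self) -> None:
        self.stable = True

    def on_crash(self) -> str:
        """Return "restart" or "disable" for the crash that just happened."""
        if self.stable:
            # Once stable, never fall back to running without a GPU process.
            return "restart"
        self.startup_failures += 1
        if self.startup_failures >= self.max_restarts:
            return "disable"
        return "restart"
```

With this shape, crashes caused by content after startup can no longer count toward disabling the GPU process.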

Assignee: nobody → aosmond

Status: NEW → ASSIGNED

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 4

•

9 months ago

There are a few solutions, all of which have drawbacks:

1. Restart the GPU process indefinitely -- this is simple enough to achieve, but if the content is doing something to cause the crash very quickly, malicious or not, then I wonder how responsive the UI will be from a user's perspective.
2. Crash the parent process -- this is also simple enough to achieve, it prevents the content spamming possible in option 1, and the tabs won't automatically reload, breaking the cycle until the problematic tab is reloaded.
3. Crash the content process -- we could inform the parent process of the most recent processes to send over display lists/WebGL/WebGPU/AC2D canvas commands, and blame them. Even if our guess is wrong, presumably whack-a-mole will eventually take it down. This is more complicated to implement than the other options, with the potential benefit of minimizing user data loss from option 2, but perhaps that is insufficient to justify the extra complexity.
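
The three options can be compared side by side in a small sketch. Everything here is hypothetical: the enum, the action strings, and in particular the choice to fall back to restarting the GPU process when option 3 has no content process to blame are assumptions for illustration.

```python
# Hypothetical comparison of the three options; not the actual patch.
from enum import Enum
from typing import Optional, Sequence, Tuple


class CrashLimitPolicy(Enum):
    RESTART_INDEFINITELY = 1  # option 1
    CRASH_PARENT = 2          # option 2
    CRASH_CONTENT = 3         # option 3


def action_after_limit(
    policy: CrashLimitPolicy,
    recent_content_pids: Sequence[int],
) -> Tuple[str, Optional[int]]:
    """Return (action, target pid) once the GPU process crash limit is hit.

    `recent_content_pids` lists content processes that recently sent
    graphics commands, oldest first.
    """
    if policy is CrashLimitPolicy.RESTART_INDEFINITELY:
        return ("restart_gpu_process", None)
    if policy is CrashLimitPolicy.CRASH_PARENT:
        return ("crash_parent_process", None)
    if recent_content_pids:
        # Blame the most recent sender; a wrong guess is retried on the
        # next crash ("whack-a-mole").
        return ("crash_content_process", recent_content_pids[-1])
    # Assumption: with nobody to blame, keep the GPU process and restart it.
    return ("restart_gpu_process", None)
```

For example, `action_after_limit(CrashLimitPolicy.CRASH_CONTENT, [101, 202])` blames process 202 in this model. The extra bookkeeping needed for `recent_content_pids` is the added complexity mentioned under option 3.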

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 5

•

9 months ago

It appears Chrome has gone for option 2:
https://source.chromium.org/chromium/chromium/src/+/main:content/browser/gpu/gpu_data_manager_impl_private.cc;l=1671;drc=89f6321d4c72ccc4b16de1d3e700e66b878e624b

Jeff Muizelaar [:jrmuizel]

Comment 6

•

9 months ago

I think 2 is the best option right now.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 7

•

9 months ago

Attached file (secure)

Pulsebot

Comment 8

•

9 months ago

Pushed by aosmond@mozilla.com: https://github.com/mozilla-firefox/firefox/commit/b77e723fc778 https://hg.mozilla.org/integration/autoland/rev/4bea02dfe77c r=lsalzman

Ryan VanderMeulen [:RyanVM]

Comment 9

•

9 months ago

https://hg-edge.mozilla.org/mozilla-central/rev/4bea02dfe77c

Group: gfx-core-security → core-security-release

Status: ASSIGNED → RESOLVED

Closed: 9 months ago

status-firefox143: --- → wontfix

status-firefox144: --- → wontfix

status-firefox145: --- → fixed

status-firefox-esr115: --- → wontfix

status-firefox-esr140: --- → affected

tracking-firefox145: --- → +

tracking-firefox-esr140: --- → 145+

Resolution: --- → FIXED

Target Milestone: --- → 145 Branch

Ryan VanderMeulen [:RyanVM]

Updated

•

9 months ago

Regressions: 1992573

Andrew Osmond [:aosmond] (he/him)

Assignee

Updated

•

8 months ago

Blocks: always-gpu-process

Andrei Vaida [:avaida]

Updated

•

8 months ago

QA Whiteboard: [sec] [qa-triage-done-c146/b145]

Ryan VanderMeulen [:RyanVM]

Comment 10

•

8 months ago

Please nominate this for ESR140 approval when you get a chance.

Flags: needinfo?(aosmond)

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 11

•

8 months ago

For the moment, we've elected to not ship parent process crashing on anything but nightly. There are tons of GPU process management related changes riding the trains, with more coming, where I hope to either provide clarification (that we can crash the parent process that much without risking too many users) or a partial solution (for example, on Android we are working towards not disabling the GPU process when Android is backgrounded).

Flags: needinfo?(aosmond)

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 12

•

8 months ago

Bug 1992430 and bug 1992856 help mitigate this concern by disabling AC2D and WebGPU before disabling the GPU process itself. Those two features are what prompted the initial concerns, as they have a wide risk spread in terms of the functionality content can access. Those patches have ridden the appropriate trains.
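
The ordering described here can be sketched as a staged fallback. The `GraphicsFeatures` type and `next_fallback` function are hypothetical, and the sketch assumes AC2D and WebGPU are turned off together in a single step; what triggers each step is not modeled.

```python
# Hypothetical staged fallback; not the code from the referenced bugs.
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphicsFeatures:
    ac2d: bool = True
    webgpu: bool = True
    gpu_process: bool = True


def next_fallback(features: GraphicsFeatures) -> GraphicsFeatures:
    """Return the feature set after one more fallback step."""
    if features.ac2d or features.webgpu:
        # First step: drop the features with the widest exposure to content
        # while keeping the GPU process itself.
        return GraphicsFeatures(ac2d=False, webgpu=False,
                                gpu_process=features.gpu_process)
    # Later step: only now give up the GPU process.
    return GraphicsFeatures(ac2d=False, webgpu=False, gpu_process=False)
```

In this model the GPU process is only given up after AC2D and WebGPU are already off.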

Ryan VanderMeulen [:RyanVM]

Updated

•

8 months ago

status-firefox-esr140: affected → wontfix

tracking-firefox-esr140: 145+ → ---

Tom Schuster

Updated

•

7 months ago

Whiteboard: [adv-main145-]

Daniel Veditz [:dveditz]

Updated

•

1 month ago

Group: core-security-release
