GPU process sandboxing can be circumvented by crashing the GPU process 6 times
Categories
(Core :: Graphics: Canvas2D, defect)
Tracking
()
People
(Reporter: simonf, Assigned: aosmond)
References
(Blocks 1 open bug)
Details
(Keywords: reporter-external, sec-want, Whiteboard: [adv-main145-] )
Attachments
(1 file)
It seems layers.gpu-process.max_restarts allows deactivating the GPU process sandbox by crashing the GPU process a few times. This mostly defeats the purpose of the sandbox.
This was reported as part of Bug 1984825.
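To make the report concrete, here is a minimal sketch of the crash-count fallback as described above. It is a toy model, not Gecko code: the class and function names are hypothetical, the value 5 is illustrative rather than a confirmed default, and the strict `>` comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GpuProcessModel:
    """Toy model of the crash-count fallback; names are hypothetical."""

    max_restarts: int  # stands in for layers.gpu-process.max_restarts
    crashes: int = 0
    gpu_process_enabled: bool = True

    def on_gpu_process_crash(self) -> None:
        # Each crash uses up one restart. Once the budget is exceeded
        # (assumed to be a strict '>'), the GPU process is disabled.
        self.crashes += 1
        if self.crashes > self.max_restarts:
            self.gpu_process_enabled = False


def crashes_needed(max_restarts: int) -> int:
    """Count how many crashes it takes before the model gives up."""
    model = GpuProcessModel(max_restarts=max_restarts)
    while model.gpu_process_enabled:
        model.on_gpu_process_crash()
    return model.crashes
```

Under these assumptions, a budget of 5 restarts means the sixth crash disables the GPU process, which lines up with the "6 times" in the bug title.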
Comment 1•9 months ago
I think this is like a "sec-want". It would provide additional hardening for Windows users, but we already need to support the fallback path on macOS and Linux.
Improving the situation here would be tricky. The goal of this fallback is to improve stability for users with flaky graphics drivers. Is it possible to distinguish "this user went to a website with fancy Canvas2D stuff for the first time and their drivers are bad" from "this user is under attack from a website they have not visited before"?
Comment 2•9 months ago
I think it is inaccurate to say that we are circumventing the sandbox from the GPU process, for the reason that the GPU process is not enabled on all platforms or for all users.
Half our platforms do not have it by default, and on those platforms that do, we don't always guarantee the user even gets it to start with. So for a sizeable portion of our users, there is no sandbox beyond the separation of content and parent process.
The GPU process was intended to defend against drivers that are intermittently buggy, so that the GPU can go down without taking out the parent process and with it the entire browser. For drivers that are so indefensibly buggy that it makes no sense to continue acceleration, we allow the acceleration to fall back to software rendering outside the GPU process.
The other, secondary intention was to move privileged access to some OS APIs (i.e. Win32) out of the content process. However, it was more of an unintended benefit that this happened to move it to the GPU process sometimes, if not the parent process.
The architecture wasn't really meant to work as an isolation mechanism in the way that content processes isolate from parent, more about the intermittent bugginess of drivers not taking down the browser.
Right now there are many assumptions built into the very core of Gecko that if you have the GPU process you get acceleration, but if you don't have acceleration, we must disable the GPU process to provide that fallback.
This assumption makes sense for a fallback in that fallback/software Canvas rendering actually takes place in the content process; it does not suddenly happen in the parent process, so if anything, the "sandbox", or at least the isolation between parent and content processes, is partially strengthened, not weakened, when the GPU process goes away.
This falls down for WebRender in that we still remote that to the parent when the GPU process isn't available. The same problem is also true of WebGL, which can't operate in the content process either. Because OS API access was moved out of the content processes, we have no choice but to use the parent process for these sensitive tasks when no GPU process is possible. Here is where the problem really lies.
I don't know that we have adequately addressed all the performance assumptions of doing software fallback rendering and composition for WebRender within the GPU process.
And we would still need to move into a world where all platforms and users get the GPU process for this to make sense, rather than have the GPU process only intermittently available by default. That is tricky on some platforms.
I say all this to point out that what we are discussing here is not actually a "bug" in the GPU process design. We are instead discussing how to rearchitect the GPU process into something else entirely, which is conceivably possible, but not a simple fix.
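As a rough summary of the comment above, the sketch below maps each workload to the process that ends up handling it. It is a simplification for illustration only: the function name and the string labels are hypothetical, and it assumes all three workloads run in the GPU process whenever one is available.

```python
def process_for(workload: str, gpu_process_available: bool) -> str:
    """Return which process handles a workload in this simplified model."""
    if gpu_process_available:
        # Simplifying assumption: with a GPU process, all three run there.
        return "gpu"
    # Without a GPU process, per the comment above: software Canvas2D runs
    # in the content process, while WebRender and WebGL go to the parent.
    fallback = {
        "canvas2d": "content",
        "webrender": "parent",
        "webgl": "parent",
    }
    return fallback[workload]
```

In this model the concern is limited to the workloads that land in the parent process once the GPU process is gone.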
| Assignee | ||
Comment 3•9 months ago
I'll look at reworking the logic to avoid this. Ideally that pref is only to be used when we are first launching the GPU process at startup. Once we determine we can have a stable GPU process, we should never deviate from that, but there are corner cases like the above where this is not the case. It might have been intentional at the time I tightened up the logic, but now, with 99.7% of users sticking with a GPU process, I think we can afford to be more aggressive here.
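One way to picture the rework described above is a policy where the restart budget only applies until the GPU process has proven stable. This is a hedged sketch, not the actual patch: the class, its method names, and the "keep"/"disable" results are hypothetical, and how a stable GPU process recovers from later crashes is deliberately not modeled.

```python
class StartupOnlyBudget:
    """Restart budget that is only consulted before stability is reached."""

    def __init__(self, max_restarts: int) -> None:
        self.max_restarts = max_restarts  # stands in for the pref value
        self.startup_crashes = 0
        self.stable = False

    def on_stable_launch(self) -> None:
        # Hypothetical hook: called once the GPU process is deemed stable.
        self.stable = True

    def on_crash(self) -> str:
        if self.stable:
            # Once stable, crashes never disable the GPU process.
            return "keep"
        self.startup_crashes += 1
        if self.startup_crashes > self.max_restarts:
            return "disable"
        return "keep"
```

The point of the design is that content can no longer spend the budget after startup, because the counter stops being consulted.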
| Assignee | ||
Comment 4•9 months ago
There are a few solutions, all of which have drawbacks:
1. Restart the GPU process indefinitely -- this is simple enough to achieve, but if the content is doing something to cause the crash very quickly, malicious or not, then I wonder how responsive the UI will be from a user's perspective.
2. Crash the parent process -- this is also simple enough to achieve, it prevents the content spamming of option 1, and the tabs won't automatically reload, breaking the cycle until the problematic tab is reloaded.
3. Crash the content process -- we could inform the parent process of the most recent processes to send over display lists/WebGL/WebGPU/AC2D canvas commands, and blame them. Even if our guess is wrong, presumably whack-a-mole will eventually take it down. This is more complicated to implement than the other options, with the potential benefit of minimizing the user data loss of option 2, but perhaps that is insufficient to justify the extra complexity.
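The three options listed above, taken in order, can be sketched as a small decision function. Everything here is illustrative: the enum, the action strings, and the `recent_senders` list of content process IDs are hypothetical, and falling back to a restart when there is nobody to blame is an assumption not stated in the comment.

```python
from enum import Enum
from typing import List, Optional, Tuple


class Policy(Enum):
    RESTART_INDEFINITELY = 1  # first option
    CRASH_PARENT = 2          # second option
    CRASH_CONTENT = 3         # third option


def on_budget_exhausted(
    policy: Policy, recent_senders: List[int]
) -> Tuple[str, Optional[int]]:
    """Return (action, blamed content process id or None)."""
    if policy is Policy.RESTART_INDEFINITELY:
        return ("restart_gpu_process", None)
    if policy is Policy.CRASH_PARENT:
        return ("crash_parent_process", None)
    # Third option: blame the content process that most recently sent
    # display lists / WebGL / WebGPU / AC2D commands.
    if recent_senders:
        return ("crash_content_process", recent_senders[-1])
    # Assumption: with no candidate to blame, just restart.
    return ("restart_gpu_process", None)
```

Repeated calls with the blamed process removed from `recent_senders` would give the whack-a-mole behavior mentioned for the third option.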
| Assignee | ||
Comment 5•9 months ago
It appears Chrome has gone for option 2:
https://source.chromium.org/chromium/chromium/src/+/main:content/browser/gpu/gpu_data_manager_impl_private.cc;l=1671;drc=89f6321d4c72ccc4b16de1d3e700e66b878e624b
Comment 6•9 months ago
I think 2 is the best option right now.
| Assignee | ||
Comment 7•9 months ago
Comment 9•9 months ago
Comment 10•8 months ago
Please nominate this for ESR140 approval when you get a chance.
| Assignee | ||
Comment 11•8 months ago
For the moment, we've elected to not ship parent process crashing on anything but Nightly. There are tons of GPU process management related changes riding the trains, with more coming, where I hope to either provide clarification (that we can crash the parent process that much without risking too many users) or a partial solution (for example, on Android we are working towards not disabling the GPU process when Android is backgrounded).
| Assignee | ||
Comment 12•8 months ago
Bug 1992430 and bug 1992856 help mitigate this concern by disabling AC2D and WebGPU before disabling the GPU process itself. This is what prompted the initial concerns, as they have a wide risk spread in terms of functionality content can access. Those patches have ridden the appropriate trains.
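The mitigation described above amounts to a staged fallback, which can be sketched as follows. This is only an illustration: the function and feature labels are hypothetical, and the relative order of AC2D and WebGPU, as well as disabling them in separate steps, are assumptions. The comment only says both go before the GPU process itself.

```python
from typing import FrozenSet

# Assumed order; only "GPU process last" comes from the comment above.
FALLBACK_ORDER = ("ac2d", "webgpu", "gpu_process")


def degrade(enabled: FrozenSet[str]) -> FrozenSet[str]:
    """Return the enabled set after one more fallback step."""
    for feature in FALLBACK_ORDER:
        if feature in enabled:
            return enabled - {feature}
    return enabled  # nothing left to disable
```

Starting from all three enabled, two calls to `degrade` leave only the GPU process, and a third call is needed before the GPU process itself is given up.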