D3D11CreateDevice fails with E_FAIL with low integrity content sandbox

RESOLVED WONTFIX

3 years ago
a year ago

People

(Reporter: dvander, Unassigned)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [sb+])

Attachments

(3 attachments)

On my Desktop, Firefox is not using hardware acceleration for content when E10s is enabled. The D3D11CreateDevice() call is returning E_FAIL instead of S_OK. If I change "security.sandbox.content.level" to 0 then it succeeds. Setting it to "1" does not work.

Normally this call only returns E_FAIL when it can't activate the debug layer, so it could be that something is weird with my system, because I can't reproduce this on my other Windows 10 machine.

Unfortunately, there is no way to actually tell this happens without logging something in gfxWindowsPlatform.cpp. We don't have logging for this call failing, and about:support only reports features in the parent process. We also wouldn't have any way to see this in Telemetry. I'll try to address these shortcomings soon.
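
As a rough illustration of the kind of logging that is missing, here is a minimal sketch of turning the result of the device-creation call into a log line. It uses a stand-in integer type instead of the real HRESULT so that it builds without the Windows headers; the numeric values are the ones from the Windows SDK, but `DescribeDeviceCreation` and the `k...` names are hypothetical and are not taken from gfxWindowsPlatform.cpp.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Stand-in for the Windows HRESULT type so the sketch builds anywhere.
using HResult = std::uint32_t;

// Numeric values as defined in the Windows SDK headers.
constexpr HResult kS_OK = 0x00000000u;
constexpr HResult kE_FAIL = 0x80004005u;
constexpr HResult kE_INVALIDARG = 0x80070057u;
constexpr HResult kDXGI_ERROR_UNSUPPORTED = 0x887A0004u;

// An HRESULT signals failure when its top (severity) bit is set.
inline bool Failed(HResult hr) { return (hr & 0x80000000u) != 0; }

// Hypothetical helper: format a device-creation result for a log.
std::string DescribeDeviceCreation(HResult hr) {
  const char* name = "unknown";
  switch (hr) {
    case kS_OK: name = "S_OK"; break;
    case kE_FAIL: name = "E_FAIL"; break;
    case kE_INVALIDARG: name = "E_INVALIDARG"; break;
    case kDXGI_ERROR_UNSUPPORTED: name = "DXGI_ERROR_UNSUPPORTED"; break;
    default: break;
  }
  char buf[96];
  std::snprintf(buf, sizeof(buf), "D3D11CreateDevice %s: %s (0x%08X)",
                Failed(hr) ? "failed" : "succeeded", name,
                static_cast<unsigned>(hr));
  return buf;
}
```

In a real build the value would come straight from the D3D11CreateDevice() return, and the string would go to whatever logging or telemetry channel the content process has available.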
Summary: D3D11CreateDevice fails with E_FAIL → D3D11CreateDevice fails with E_FAIL in sandbox

Comment 1

3 years ago
The calls using sD3D11CreateDeviceFn in gfxWindowsPlatform.cpp that I can see on my Windows 7 desktop seem to be working fine.

Was it a specific call or were they all failing?

I'll try on my Windows 10 laptop.
Flags: needinfo?(dvander)
(In reply to Bob Owen (:bobowen) from comment #1)
> The calls using sD3D11CreateDeviceFn in gfxWindowsPlatform.cpp that I can
> see on my Windows 7 desktop seem to be working fine.
> 
> Was it a specific call or were they all failing?
> 
> I'll try on my Windows 10 laptop.

https://dxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxWindowsPlatform.cpp#2289

This specific call is failing in the child process, unless I disable the sandbox. I can't reproduce it on my other Windows machine - I don't know what's different either.

I'll get some telemetry in to see if this is a one-off thing or whether it might affect more than just me.
Flags: needinfo?(dvander)

Comment 3

3 years ago
This also seems to work fine on my Windows 10 Laptop.
What graphics card and driver version does this happen on?
Flags: needinfo?(dvander)
(In reply to Gian-Carlo Pascutto [:gcp] from comment #4)
> What graphics card and driver version does this happen on?

Adapter Description: NVIDIA GeForce GTX 960
Device ID: 0x1401
Driver Date: 11-5-2015
Driver Version: 10.18.13.5891
Flags: needinfo?(dvander)
Created attachment 8721496 [details]
about:support

I'm also getting perma d2d disabled in the content process.
Bob, is there a way we can find out what the sandbox is blocking from the nvidia driver?
Flags: needinfo?(bobowen.code)
From the 3 reporters, all are on nvidia GPUs on Windows 10. On my Lenovo Laptop with an Intel HD 5500 with Windows 10, device creation is successful and the crash in bug 1249600 does not happen.

Comment 9

3 years ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #7)
> Bob, is there a way we can find out what the sandbox is blocking from the
> nvidia driver?

The only logging we have is for things that you can add sandbox policy rules for, like file or registry access.
There is a pref or you can run with env var MOZ_WIN_SANDBOX_LOGGING=1.

After that it's a matter of debugging / tracing through system calls.
It tends to be some sort of access denied error code, but of course sometimes these happen anyway.

Unfortunately, there isn't any system logging of which I know.

(In reply to Mason Chang [:mchang] from comment #8)
> From the 3 reporters, all are on nvidia GPUs on Windows 10. On my Lenovo
> Laptop with an Intel HD 5500 with Windows 10, device creation is successful
> and the crash in bug 1249600 does not happen.

My Windows 10 laptop has an nvidia Quadro K2100M and a built in Intel HD 4600 one, so maybe it isn't using the nvidia one ... is there a way I can tell?

I thought that our content sandbox wasn't any stronger than the chromium gpu one, but just looking again, maybe our job level is causing this.

If you set security.sandbox.content.level=1 and restart, does it start working?
Flags: needinfo?(bobowen.code)

Updated

3 years ago
Flags: needinfo?(mchang)
Created attachment 8722079 [details]
Sandbox Log

(In reply to Bob Owen (:bobowen) from comment #9)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #7)
> > Bob, is there a way we can find out what the sandbox is blocking from the
> > nvidia driver?
> 
> The only logging we have is for things that you can add sandbox policy rules
> for, like file or registry access.
> There is a pref or you can run with env var MOZ_WIN_SANDBOX_LOGGING=1.
> 
> After that it's a matter of debugging / tracing through system calls.
> It tends to be some sort of access denied error code, but of course
> sometimes these happen anyway.
> 
> Unfortunately, there isn't any system logging of which I know.

Here's a sandbox log when running with MOZ_WIN_SANDBOX. gfxInit is already failing at line 33 before I see any errors with the sandbox. The system call that's happening is at [1].


> (In reply to Mason Chang [:mchang] from comment #8)
> > From the 3 reporters, all are on nvidia GPUs on Windows 10. On my Lenovo
> > Laptop with an Intel HD 5500 with Windows 10, device creation is successful
> > and the crash in bug 1249600 does not happen.
> 
> My Windows 10 laptop has an nvidia Quadro K2100M and a built in Intel HD
> 4600 one, so maybe it isn't using the nvidia one ... is there a way I can
> tell?

You can go to about:Support. What do you see in the "Adapter Description" and "WebGL Renderer"?
> 
> I thought that our content sandbox wasn't any stronger than the chromium gpu
> one, but just looking again, maybe our job level is causing this.
> 
> If you set security.sandbox.content.level=1 and restart, does it start
> working?

No, it does not work. From comment 0, this does not seem to work for :dvander either.

If you load this site with e10s, nightly, nvidia GPU on Windows 10 x64 are you crashing?
https://dl.dropboxusercontent.com/u/40949268/emcc/ShadowMap_novsync/ShadowMap_cpuprofiler.html

[1] https://dxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxWindowsPlatform.cpp?case=true&from=gfxWindowsPlatform.cpp#2165
Flags: needinfo?(mchang) → needinfo?(bobowen.code)

Comment 11

3 years ago
(In reply to Mason Chang [:mchang] from comment #10)
> Created attachment 8722079 [details]
 
> Here's a sandbox log when running with MOZ_WIN_SANDBOX. gfxInit is already
> failing at line 33 before I see any errors with the sandbox. The system call
> that's happening is at [1].

Right, I guess it will be some other system call further down the stack, which is causing the actual problem.
 
> You can go to about:Support. What do you see in the "Adapter Description"
> and "WebGL Renderer"?

Adapter Description	Intel(R) HD Graphics 4600
Adapter Description (GPU #2)	NVIDIA Quadro K2100M 
WebGL Renderer	Google Inc. -- ANGLE (Intel(R) HD Graphics 4600 Direct3D11 vs_5_0 ps_5_0)

So, I guess it is using the Intel, can I force it to the nvidia one?

> > I thought that our content sandbox wasn't any stronger than the chromium gpu
> > one, but just looking again, maybe our job level is causing this.
> > 
> > If you set security.sandbox.content.level=1 and restart, does it start
> > working?
> 
> No it does not work. From comment 0, this does not seem to work for :dvander
> as well.

Oh yeah. :-)
Is the chromium code doing something different?
 
> If you load this site with e10s, nightly, nvidia GPU on Windows 10 x64 are
> you crashing?
> https://dl.dropboxusercontent.com/u/40949268/emcc/ShadowMap_novsync/
> ShadowMap_cpuprofiler.html

So unsurprisingly this works for me.
Flags: needinfo?(bobowen.code)
(In reply to Bob Owen (:bobowen) from comment #11)
> (In reply to Mason Chang [:mchang] from comment #10)
> > Created attachment 8722079 [details]
>  
> > Here's a sandbox log when running with MOZ_WIN_SANDBOX. gfxInit is already
> > failing at line 33 before I see any errors with the sandbox. The system call
> > that's happening is at [1].
> 
> Right, I guess it will be some other system call further down the stack,
> which is causing the actual problem.
>  
> > You can go to about:Support. What do you see in the "Adapter Description"
> > and "WebGL Renderer"?
> 
> Adapter Description	Intel(R) HD Graphics 4600
> Adapter Description (GPU #2)	NVIDIA Quadro K2100M 
> WebGL Renderer	Google Inc. -- ANGLE (Intel(R) HD Graphics 4600 Direct3D11
> vs_5_0 ps_5_0)
> 
> So, I guess it is using the Intel, can I force it to the nvidia one?

Can you force your whole system to use the NVidia one? That should force the NVidia GPU for us. See something like this - http://gpu.userbenchmark.com/Faq/How-to-force-Optimus-or-Switchable-discrete-GPUs/97 
or somewhere in the NVidia control panel.

> > > I thought that our content sandbox wasn't any stronger than the chromium gpu
> > > one, but just looking again, maybe our job level is causing this.
> > > 
> > > If you set security.sandbox.content.level=1 and restart, does it start
> > > working?
> > 
> > No it does not work. From comment 0, this does not seem to work for :dvander
> > as well.
> 
> Oh yeah. :-)
> Is the chromium code doing something different?

Do you mean google chromium? Or our chrome process? Our chrome process should be doing roughly the same thing.
Flags: needinfo?(bobowen.code)

Comment 13

3 years ago
(In reply to Mason Chang [:mchang] from comment #12)
> (In reply to Bob Owen (:bobowen) from comment #11)

> Can you force your whole system to use the NVidia one? That should force the
> NVidia GPU for us. See something like this -
> http://gpu.userbenchmark.com/Faq/How-to-force-Optimus-or-Switchable-discrete-
> GPUs/97 
> or somewhere in the NVidia control panel.

Thought I'd tried something similar to that before, but seems to be working now.
I now get:

WebGL Renderer	Google Inc. -- ANGLE (NVIDIA Quadro K2100M Direct3D11 vs_5_0 ps_5_0)

But that URL still works fine, haven't checked that the call is working yet.
 
> Do you mean google chromium? Or our chrome process? Our chrome process
> should be doing roughly the same thing.

Google Chromium gpu process, assuming that it's making a similar call.
Because it also has a low integrity sandbox.

It is possible that they are making the call before lowering the sandbox, but I thought they started the process at low integrity anyway.
Flags: needinfo?(bobowen.code)
(In reply to Bob Owen (:bobowen) from comment #13)
> (In reply to Mason Chang [:mchang] from comment #12)
> > (In reply to Bob Owen (:bobowen) from comment #11)
> 
> > Can you force your whole system to use the NVidia one? That should force the
> > NVidia GPU for us. See something like this -
> > http://gpu.userbenchmark.com/Faq/How-to-force-Optimus-or-Switchable-discrete-
> > GPUs/97 
> > or somewhere in the NVidia control panel.
> 
> Thought I'd tried something similar to that before, but seems to be working
> now.
> I now get:
> 
> WebGL Renderer	Google Inc. -- ANGLE (NVIDIA Quadro K2100M Direct3D11 vs_5_0
> ps_5_0)
> 
> But that URL still works fine, haven't checked that the call is working yet.
>  

Maybe if you also try updating your drivers? This is on a Windows 10 machine?

> > Do you mean google chromium? Or our chrome process? Our chrome process
> > should be doing roughly the same thing.
> 
> Google Chromium gpu process, assuming that it's making a similar call.
> Because it also has a low integrity sandbox.
> 
> It is possible that they are making the call before lowering the sandbox,
> but I thought they started the process at low integrity anyway.

I'm not sure what Chromium is doing, I don't have any experience with it. Jeff, do you know?
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bobowen.code)

Comment 15

3 years ago
(In reply to Mason Chang [:mchang] from comment #14)
> Maybe if you also try updating your drivers? This is on a Windows 10 machine?

Yes this is on Windows 10, just updated and that URL still works.

Here's the GPU #2 stuff from about:support before and after:

Adapter Description (GPU #2)	NVIDIA Quadro K2100M
Adapter Drivers (GPU #2)	nvd3dumx,nvwgf2umx,nvwgf2umx,nvwgf2umx nvd3dum,nvwgf2um,nvwgf2um,nvwgf2um
Adapter RAM (GPU #2)	2048
Device ID (GPU #2)	0x11fc
Driver Date (GPU #2)	7-22-2015
Driver Version (GPU #2)	10.18.13.5362
GPU #2 Active	false
Subsys ID (GPU #2)	221a17aa
Vendor ID (GPU #2)	0x10de


Adapter Description (GPU #2)	NVIDIA Quadro K2100M
Adapter Drivers (GPU #2)	nvd3dumx,nvwgf2umx,nvwgf2umx,nvwgf2umx nvd3dum,nvwgf2um,nvwgf2um,nvwgf2um
Adapter RAM (GPU #2)	2048
Device ID (GPU #2)	0x11fc
Driver Date (GPU #2)	2-8-2016
Driver Version (GPU #2)	10.18.13.6191
GPU #2 Active	false
Subsys ID (GPU #2)	221a17aa
Vendor ID (GPU #2)	0x10de
WebGL Renderer	Google Inc. -- ANGLE (NVIDIA Quadro K2100M Direct3D11 vs_5_0 ps_5_0)


I notice that it says GPU #2 active false.
Flags: needinfo?(bobowen.code)
This should be an E10S blocker, if the sandboxing behaviour is expected to be the same in release.
Flags: needinfo?(blassey.bugs)
(In reply to Milan Sreckovic [:milan] from comment #17)
> This should be an E10S blocker, if the sandboxing behaviour is expected to
> be the same in release.

Sandbox and e10s are decoupled, so this won't block e10s
Flags: needinfo?(blassey.bugs)

Comment 19

3 years ago
(In reply to Milan Sreckovic [:milan] from comment #17)
> This should be an E10S blocker, if the sandboxing behaviour is expected to
> be the same in release.

What Brad said.

I'll set this to block bug 1246505, which is for letting the low integrity sandbox ride the trains.
Blocks: 1246505

Comment 20

3 years ago
Installed Windows 10 on my other machine that has an nvidia card and that URL works fine.

Adapter Description: NVIDIA GeForce GT 720
Adapter Drivers: nvd3dumx,nvwgf2umx,nvwgf2umx,nvwgf2umx nvd3dum,nvwgf2um,nvwgf2um,nvwgf2um
Adapter RAM: 2048
Asynchronous Pan/Zoom: wheel input enabled; touch input enabled
Device ID: 0x1288
Direct2D Enabled: true
DirectWrite Enabled: true (10.0.10240.16430)
Driver Date: 2-8-2016
Driver Version: 10.18.13.6191
GPU #2 Active: false
GPU Accelerated Windows: 1/1 Direct3D 11 (OMTC)
Subsys ID: 00000000
Supports Hardware H264 Decoding: Yes
Vendor ID: 0x10de
WebGL Renderer: Google Inc. -- ANGLE (NVIDIA GeForce GT 720 Direct3D11 vs_5_0 ps_5_0)
windowLayerManagerRemote: true
AzureCanvasBackend: direct2d 1.1
AzureContentBackend: direct2d 1.1
AzureFallbackCanvasBackend: cairo
AzureSkiaAccelerated: 0
(In reply to Mason Chang [:mchang] from comment #14)
> (In reply to Bob Owen (:bobowen) from comment #13)
> > It is possible that they are making the call before lowering the sandbox,
> > but I thought they started the process at low integrity anyway.
> 
> I'm not sure what Chromium is doing, I don't have any experience with it.
> Jeff, do you know?

I do not.
Flags: needinfo?(jmuizelaar)

Comment 22

3 years ago
Running a debug version of chromium I've only seen the D3D11CreateDevice at [1] hit in the GPU process.

Presumably this would be the call below if not a debug build.

The driver type was: D3D_DRIVER_TYPE_HARDWARE
Feature levels were: {D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_10_1, D3D_FEATURE_LEVEL_10_0}

It is running at low integrity.
I see that the parameters differ from ours a bit.
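
For reference, a small sketch of how the feature-level array is meant to be read: the runtime walks the array in order and uses the first level it can create. The model below uses the numeric values of the D3D_FEATURE_LEVEL enum but is otherwise a simplification; `PickFeatureLevel` is a hypothetical helper, and it assumes an adapter that supports a given level also supports every lower one.

```cpp
#include <cstdint>
#include <vector>

// Numeric values match the D3D_FEATURE_LEVEL enum in the Windows SDK.
enum FeatureLevel : std::uint32_t {
  kLevel9_1 = 0x9100,
  kLevel9_2 = 0x9200,
  kLevel9_3 = 0x9300,
  kLevel10_0 = 0xa000,
  kLevel10_1 = 0xa100,
  kLevel11_0 = 0xb000,
};

// Simplified model of the selection rule: walk the requested array in
// order and return the first level the adapter can provide.
// Returns 0 when none of the requested levels is available.
std::uint32_t PickFeatureLevel(const std::vector<FeatureLevel>& requested,
                               FeatureLevel adapterMax) {
  for (FeatureLevel level : requested) {
    if (level <= adapterMax) {
      return level;
    }
  }
  return 0;
}

// The list observed in the Chromium GPU process, in the order passed.
std::vector<FeatureLevel> ChromiumLevels() {
  return std::vector<FeatureLevel>{kLevel11_0, kLevel10_1, kLevel10_0};
}
```

With this list, an adapter capped at 10_1 would end up on 10_1, and one capped at 9_3 would get nothing, because no 9_x level was requested.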

mchang- Does this call work for you in chromium? (not sure how you tell other than debugging, although there's a ton of information in about:gpu)


[1] https://code.google.com/p/chromium/codesearch#chromium/src/third_party/angle/src/libANGLE/renderer/d3d/d3d11/Renderer11.cpp&q=D3D11CreateDevice&sq=package:chromium&type=cs&l=698
Flags: needinfo?(mchang)
(In reply to Bob Owen (:bobowen) from comment #22)
> Running a debug version of chromium I've only seen the D3D11CreateDevice at
> [1] hit in the GPU process.
> 
> Presumably this would be the call below if not a debug build.
> 
> The driver type was: D3D_DRIVER_TYPE_HARDWARE
> Feature levels were: {D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_10_1,
> D3D_FEATURE_LEVEL_10_0}
> 
> It is running at low integrity.
> I see that the parameters differ from ours a bit.
> 
> mchang- Does this call work for you in chromium? (not sure how you tell
> other than debugging, although there's a ton of information in about:gpu)
> 
> 
> [1]
> https://code.google.com/p/chromium/codesearch#chromium/src/third_party/angle/
> src/libANGLE/renderer/d3d/d3d11/Renderer11.
> cpp&q=D3D11CreateDevice&sq=package:chromium&type=cs&l=698

I don't have access to that machine at the moment. I'll check later this week. 

Also FYI, bug 1250669 landed, which should prevent the crash from happening on the test site. The only way to test if the E_FAIL happens is to actually debug it now. Please let the other members on the sandbox team know. Thanks!
Flags: needinfo?(mchang)
Flags: needinfo?(dvander)
Flags: needinfo?(mchang)

Comment 24

3 years ago
dvander - are you able to see if the call in Chrome/Chromium is working for you?
See comment 22.

(In reply to Mason Chang [:mchang] from comment #23)

> Also FYI, bug 1250669 landed, which should prevent the crash from happening
> on the test site. The only way to test if the E_FAIL happens is to actually
> debug it now. Please let the other members on the sandbox team know. Thanks!

I assume we could also tell from the telemetry and warnings that dvander added in bug 1247539.
Flags: needinfo?(dvander)
(In reply to Mason Chang [:mchang] from comment #6)
> Created attachment 8721496 [details]
> about:support
> 
> I'm also getting perma d2d disabled in the content process.

Direct2D Enabled: true
DirectWrite Enabled: true (10.0.10586.0)
GPU Accelerated Windows: 1/1 Direct3D 11 (OMTC)
AzureSkiaAccelerated: 0

Is the AzureSkiaAccelerated: 0 the relevant part?

I got this:

Direct2D Enabled	true
DirectWrite Enabled	true (10.0.10586.0)
Driver Date	8-7-2015
Driver Version	10.18.13.5382
GPU #2 Active	false
GPU Accelerated Windows	1/1 Direct3D 11 (OMTC)
Subsys ID	00000000
Supports Hardware H264 Decoding	Yes
Vendor ID	0x10de
WebGL Renderer	Google Inc. -- ANGLE (NVIDIA GeForce GTX 750 Direct3D11 vs_5_0 ps_5_0)
windowLayerManagerRemote	true
AzureCanvasAccelerated	0                    <---- is this the relevant part?
AzureCanvasBackend	direct2d 1.1
AzureContentBackend	direct2d 1.1
AzureFallbackCanvasBackend	cairo
Argh, should've read the first comment!
(In reply to David Anderson [:dvander] from comment #5)
> (In reply to Gian-Carlo Pascutto [:gcp] from comment #4)
> > What graphics card and driver version does this happen on?
> 
> Adapter Description: NVIDIA GeForce GTX 960
> Device ID: 0x1401
> Driver Date: 11-5-2015
> Driver Version: 10.18.13.5891

I updated my desktop to Windows 10, updated the Nvidia drivers to this exact version, but it's working for me (checking the actual calls in the debugger). 

I've got a GTX 750 which is Maxwell 1 vs the Maxwell 2 in the GTX 960, but the other reporter has a 775M which seems to be a Kepler based GPU, so I would be surprised if that is the issue/relevant difference.

It's not clear to me what difference in the environment causes these differences.
I provided a build that MOZ_CRASHES on failure to a friend with a GTX970, and that produced this:
https://crash-stats.mozilla.com/report/index/9bc121d0-3b7a-4dee-b0f3-e91152160302

Failure also happens on the latest 362.00 Nvidia drivers.

about:gpu from Chrome on the same configuration:
https://pastebin.mozilla.org/8861956
Do you have the symbols for that test build?
Flags: needinfo?(gpascutto)
As discussed on IRC: yes, but note that it was patched to specifically crash when we hit the failure case from this bug.
Flags: needinfo?(gpascutto)
(In reply to Gian-Carlo Pascutto [:gcp] from comment #28)
> I provided a build that MOZ_CRASHES on failure to a friend with a GTX970,
> and that produced this:
> https://crash-stats.mozilla.com/report/index/9bc121d0-3b7a-4dee-b0f3-
> e91152160302
> 
> Failure also happens on the latest 362.00 Nvidia drivers.
> 
> about:gpu from Chrome on the same configuration:
> https://pastebin.mozilla.org/8861956

So it looks like from that about:gpu, the call is also failing on Chrome. What's also interesting is the "Sandboxed       false". But all the graphics features at the top say hardware accelerated. Does Chrome just disable the sandbox if the device fails to create?
Flags: needinfo?(mchang) → needinfo?(gpascutto)
I don't know as I can't reproduce the problem. Chrome shows the same on my AMD card though, where that call works.
Flags: needinfo?(gpascutto)
On Chrome, are you getting "Sandboxed: true" in about:gpu? Or can you also attach your chrome's about:gpu here?
Flags: needinfo?(gpascutto)
I'm not convinced Chrome is reporting this value correctly, I haven't seen it report true. It might be better to use Process Explorer to inspect the process privileges.
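
When reading the integrity level off a process, it helps to know how the raw value maps to a name. The sketch below uses the mandatory-label RID values from the Windows SDK; the function names are hypothetical, and on a real system the RID would have to be read from the process token (for example via GetTokenInformation with TokenIntegrityLevel), which this portable sketch does not do.

```cpp
#include <cstdint>
#include <string>

// Mandatory-label RIDs as defined in the Windows SDK
// (SECURITY_MANDATORY_*_RID).
constexpr std::uint32_t kUntrustedRid = 0x0000;
constexpr std::uint32_t kLowRid = 0x1000;
constexpr std::uint32_t kMediumRid = 0x2000;
constexpr std::uint32_t kHighRid = 0x3000;
constexpr std::uint32_t kSystemRid = 0x4000;

// Map a RID to the level name; in-between values (such as 0x2100,
// "medium plus") are reported as the level below them.
std::string IntegrityLevelName(std::uint32_t rid) {
  if (rid >= kSystemRid) return "System";
  if (rid >= kHighRid) return "High";
  if (rid >= kMediumRid) return "Medium";
  if (rid >= kLowRid) return "Low";
  return "Untrusted";
}

// Integrity SIDs have the form S-1-16-<RID in decimal>.
std::string IntegritySid(std::uint32_t rid) {
  return "S-1-16-" + std::to_string(rid);
}
```

A low integrity process therefore carries the label S-1-16-4096, which is the value to look for when checking whether the GPU or content process really is sandboxed.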
Created attachment 8726328 [details]
Chrome's about:gpu on an AMD GPU
Flags: needinfo?(gpascutto)
(In reply to Bob Owen (:bobowen) from comment #22)
> Running a debug version of chromium I've only seen the D3D11CreateDevice at
> [1] hit in the GPU process.
> 
> Presumably this would be the call below if not a debug build.
> 
> The driver type was: D3D_DRIVER_TYPE_HARDWARE
> Feature levels were: {D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_10_1,
> D3D_FEATURE_LEVEL_10_0}
> 
> It is running at low integrity.
> I see that the parameters differ from ours a bit.
> 
> mchang- Does this call work for you in chromium? (not sure how you tell
> other than debugging, although there's a ton of information in about:gpu)
> 
> 
> [1]
> https://code.google.com/p/chromium/codesearch#chromium/src/third_party/angle/
> src/libANGLE/renderer/d3d/d3d11/Renderer11.
> cpp&q=D3D11CreateDevice&sq=package:chromium&type=cs&l=698

So I found out that this call does not work. I found out by debugging, and it actually shows up in about:gpu:

[3672:576:0303/160033:ERROR:angle_platform_impl.cc(33)] : ANGLE Display::initialize error 4: Could not create D3D11 device.
[3672:576:0303/160033:ERROR:gl_surface_egl.cc(586)] : eglInitialize D3D11 failed with error EGL_NOT_INITIALIZED, trying next display type

Chrome tries d3d 11, it fails, and then it falls back to d3d 9 which succeeds on this machine. If I explicitly disable the sandbox with --no-sandbox [1], d3d11 initialization passes.

[1] https://www.chromium.org/developers/how-tos/debugging-gpu-related-code
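
The behaviour described above (try d3d11, then fall back to d3d9) can be sketched as a simple ordered probe. This is not Chromium's or ANGLE's code; `Backend`, `InitFirstWorkingBackend` and `AffectedMachine` are hypothetical names, and the probe results are hard-coded to mirror what was observed on this machine.

```cpp
#include <functional>
#include <string>
#include <vector>

struct Backend {
  std::string name;
  std::function<bool()> init;  // returns true when initialization succeeds
};

// Try each backend in order and return the name of the first one that
// initializes; report "software" if none of them does.
std::string InitFirstWorkingBackend(const std::vector<Backend>& backends) {
  for (const Backend& backend : backends) {
    if (backend.init()) {
      return backend.name;
    }
  }
  return "software";
}

// Hypothetical probe results for the affected machine: d3d11 only
// initializes without the sandbox, d3d9 always does.
std::vector<Backend> AffectedMachine(bool sandboxed) {
  std::vector<Backend> backends;
  backends.push_back(Backend{"d3d11", [sandboxed] { return !sandboxed; }});
  backends.push_back(Backend{"d3d9", [] { return true; }});
  return backends;
}
```

Under these assumptions the sandboxed run ends up on d3d9 and the `--no-sandbox` run on d3d11, which matches the two outcomes reported here.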
Flags: needinfo?(dvander)

Comment 37

3 years ago
(In reply to Mason Chang [:mchang] from comment #36)

> > [1]
> > https://code.google.com/p/chromium/codesearch#chromium/src/third_party/angle/
> > src/libANGLE/renderer/d3d/d3d11/Renderer11.
> > cpp&q=D3D11CreateDevice&sq=package:chromium&type=cs&l=698
> 
> So I found out that this call does not work. I found out by debugging, and
> it actually shows up in about:gpu:
> 
> [3672:576:0303/160033:ERROR:angle_platform_impl.cc(33)] : ANGLE
> Display::initialize error 4: Could not create D3D11 device.
> [3672:576:0303/160033:ERROR:gl_surface_egl.cc(586)] : eglInitialize D3D11
> failed with error EGL_NOT_INITIALIZED, trying next display type
> 
> Chrome tries d3d 11, it fails, and then it falls back to d3d 9 which
> succeeds on this machine. If I explicitly disable the sandbox with
> --no-sandbox [1], d3d11 initialization passes.
> 
> [1] https://www.chromium.org/developers/how-tos/debugging-gpu-related-code

Thanks Mason.

I also found those errors right at the bottom of the about:gpu from the other person gcp found who could reproduce this (not the one attached, which was gcp's own).

Out of interest what would happen if we added all the other d3d9 feature levels, would it fall back to one of those? (I think we already pass D3D_FEATURE_LEVEL_9_3)

Or is it something about this API that is sometimes broken with the sandbox and Chrome uses a different API to fall back?

Would a similar fallback be acceptable for us?


Trying to read the runes from the telemetry this looks pretty rare on Nightly.
Taken from the same date range (2016/02/17 to 2016/03/02), just on Win7+ as that's the only place we've seen the error.

Failed to create a gfx content device. 0=content d3d11, 1=image d3d11, 2=d2d1.
Start	End	GFX_CONTENT_FAILED_TO_ACQUIRE_DEVICE Count
0	1	90 (28.3%)
1	2	144 (45.28%)
2	3	84 (26.42%)

Successful telemetry submission
Start	End	TELEMETRY_SUCCESS Count
0	1	525.29k (20.69%)
1	2	2.01M (79.31%)


I don't know if we could dig further into the failure reports to get correlations on other graphics data.
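
As a sanity check on the numbers above, here is a small sketch that recomputes the per-bucket shares and a rough failure rate. Dividing the failure count by the number of telemetry submissions assumes the two histograms count comparable things (roughly one entry per session), which is an assumption and not something the telemetry itself guarantees. The helper names are hypothetical.

```cpp
#include <numeric>
#include <vector>

// Share of each bucket in percent of the total count.
std::vector<double> BucketShares(const std::vector<long>& counts) {
  const long total = std::accumulate(counts.begin(), counts.end(), 0L);
  std::vector<double> shares;
  for (long count : counts) {
    shares.push_back(total ? 100.0 * count / total : 0.0);
  }
  return shares;
}

// Failures as a percentage of submissions (see the assumption above).
double FailureRatePercent(long failures, double submissions) {
  return 100.0 * failures / submissions;
}
```

With the counts 90, 144 and 84 this reproduces the 28.3%, 45.28% and 26.42% shares. The 318 failures against roughly 2.54M submissions (525.29k + 2.01M) come to about 0.0125% under the stated assumption.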
Flags: needinfo?(mchang)

Updated

3 years ago
Whiteboard: [sb?]
(In reply to Bob Owen (:bobowen) from comment #37)
> (In reply to Mason Chang [:mchang] from comment #36)
> 
> > > [1]
> > > https://code.google.com/p/chromium/codesearch#chromium/src/third_party/angle/
> > > src/libANGLE/renderer/d3d/d3d11/Renderer11.
> > > cpp&q=D3D11CreateDevice&sq=package:chromium&type=cs&l=698
> > 
> > So I found out that this call does not work. I found out by debugging, and
> > it actually shows up in about:gpu:
> > 
> > [3672:576:0303/160033:ERROR:angle_platform_impl.cc(33)] : ANGLE
> > Display::initialize error 4: Could not create D3D11 device.
> > [3672:576:0303/160033:ERROR:gl_surface_egl.cc(586)] : eglInitialize D3D11
> > failed with error EGL_NOT_INITIALIZED, trying next display type
> > 
> > Chrome tries d3d 11, it fails, and then it falls back to d3d 9 which
> > succeeds on this machine. If I explicitly disable the sandbox with
> > --no-sandbox [1], d3d11 initialization passes.
> > 
> > [1] https://www.chromium.org/developers/how-tos/debugging-gpu-related-code
> 
> Thanks Mason.
> 
> Also found those errors right at the bottom of the other person's about:gpu
> that gcp found who could reproduce (not the one attached, that was gcp's).
> 
> Out of interest what would happen if we added all the other d3d9 feature
> levels, would it fall back to one of those? (I think we already pass
> D3D_FEATURE_LEVEL_9_3)

I tried this, adding D3D_FEATURE_LEVEL_9_2 and 9_1. I still get an E_FAIL, so it looks like we don't fall back to d3d9 in either case.

> 
> Or is it something about this API that is sometimes broken with the sandbox
> and Chrome uses a different API to fall back?
> 

They use a different API by actually using the IDirect3D9::CreateDevice API. [1]

> Would a similar fallback be acceptable for us?

I think we actually have to support the configuration where device resets cause the device creation to fail on some hardware. We've seen this in the wild. I think our fallback will be to just fall back to software and not try to use d3d 9.

However, we should still block content sandboxing on figuring out what's going on with this bug. Is there some kind of piecemeal sandboxing we could do to see which part of the driver is getting blocked?

[1] https://code.google.com/p/chromium/codesearch#chromium/src/third_party/angle/src/libANGLE/renderer/d3d/d3d9/Renderer9.cpp&sq=package:chromium&type=cs&q=Renderer9.cpp&l=280
Flags: needinfo?(mchang) → needinfo?(bobowen.code)
(In reply to Mason Chang [:mchang] from comment #38)
> I think our fallback will be to just fallback to software and not try
> to use d3d 9.

You mean that anyone affected will be running without HW acceleration?

I'm concerned about this. The Telemetry shows it's very exceptional to hit this...but two Mozilla developers hit it, despite it being totally non-obvious that there's a problem so you actually have to specifically look for it. I sent my build to two non-Mozilla people and got 1 hit.

That makes me question whether the Telemetry figure of 0.02% affected people is right.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #39)
> (In reply to Mason Chang [:mchang] from comment #38)
> > I think our fallback will be to just fallback to software and not try
> > to use d3d 9.
> 
> You mean that anyone affected will be running without HW acceleration?

Correct.

> I'm concerned about this. The Telemetry shows it's very exceptional to hit
> this...but two Mozilla developers hit it, despite it being totally
> non-obvious that there's a problem so you actually have to specifically look
> for it. I sent my build to two non-Mozilla people and got 1 hit.
> 
> That makes me question whether the Telemetry figure of 0.02% affected people
> is right.

@Dvander - oped?
Flags: needinfo?(dvander)
Telemetry figure is probably right, but we'll learn more when the instrumentation hits Aurora next week.

It's not quite right to say people affected will be running without HW acceleration. The D3D11 compositor still works fine since it's in the parent process. Only content processes are affected.

General browsing would probably have no noticeable difference. Canvas 2D perf would take a hit, and WebGL would have to fall back to D3D9 (which means no WebGL2 I think). Video perf will take a hit, but we plan to move that to the compositor in the medium-term future anyway.

That doesn't mean we shouldn't figure out why this is happening, but it's not as severe as "everything is now software".
Flags: needinfo?(dvander)

Comment 42

3 years ago
(In reply to Mason Chang [:mchang] from comment #38)

> > Would a similar fallback be acceptable for us?
> 
> I think we actually have to support the configuration where device resets
> cause the device creation to fail on some hardware. We've seen this in the
> wild. I think our fallback will be to just fallback to software and not try
> to use d3d 9.

Right, I was wondering if we could attempt to fall back to d3d9 before software and whether that would mitigate some of the performance hit.
 
> However, we should still block content sandboxing on figuring out what's
> going on with this bug. Is there some kind of piece meal sandboxing we could
> do to see which part of the driver is getting blocked? 

Well the rate seems pretty low to block on and it means we won't get more data from Aurora/Beta.

I've just realised that the process level mitigations are turned on between level 0 and level 1 as well, so there's a slim chance it is one of those causing the issue and not low integrity.

Here's a try build with the mitigations only turned on at level 2+ and the sandbox level defaulted to 1:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b0a00d1e3c70

The only difference between 0 (where it works) and 1 in this build should be low integrity.
Can you test this when you get a chance please.
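
To make the difference between the regular build and the try build explicit, here is a minimal model of what each sandbox level turns on. It only encodes what is stated in this comment; `ContentSandboxPolicy` and `PolicyFor` are hypothetical names, and treating levels above 1 as keeping low integrity is an assumption.

```cpp
struct ContentSandboxPolicy {
  bool lowIntegrity;        // process runs at low integrity
  bool processMitigations;  // process-level mitigations are enabled
};

// mitigationsFromLevel is the lowest level at which the process
// mitigations apply: 1 in the regular build, 2 in the try build.
ContentSandboxPolicy PolicyFor(int level, int mitigationsFromLevel) {
  ContentSandboxPolicy policy;
  policy.lowIntegrity = level >= 1;  // assumption: kept for levels > 1
  policy.processMitigations = level >= mitigationsFromLevel;
  return policy;
}
```

In the try build, `PolicyFor(1, 2)` differs from `PolicyFor(0, 2)` only in `lowIntegrity`, so a failure at level 1 there points at low integrity and not at the mitigations.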

(In reply to David Anderson [:dvander] from comment #41)
> Telemetry figure is probably right, but we'll learn more when the
> instrumentation hits Aurora next week.

As I mentioned above this won't move to Aurora yet as I've not landed this patch.
It'll be interesting to see if we still get some failures even without the sandbox.

I think I might change this patch so it will just roll out to Aurora and then land and uplift.
Flags: needinfo?(bobowen.code) → needinfo?(mchang)
(In reply to Bob Owen (:bobowen) from comment #42)
> (In reply to Mason Chang [:mchang] from comment #38)
> 
> > > Would a similar fallback be acceptable for us?
> > 
> > I think we actually have to support the configuration where device resets
> > cause the device creation to fail on some hardware. We've seen this in the
> > wild. I think our fallback will be to just fallback to software and not try
> > to use d3d 9.
> 
> Right, I was wondering if we could attempt to fall back to d3d9 before
> software and whether that would mitigate some of the performance hit.

After talking with :dvander some more, the biggest hits would be on canvas and webgl, and maybe video. The graphics team as a whole isn't convinced that the gpu helps with normal web content (e.g. nytimes.com), but acceleration does help with canvas/webgl. It's not the best situation, but not as dire as when I said "no acceleration".

> > However, we should still block content sandboxing on figuring out what's
> > going on with this bug. Is there some kind of piece meal sandboxing we could
> > do to see which part of the driver is getting blocked? 
> 
> Well the rate seems pretty low to block on and it means we won't get more
> data from Aurora/Beta.
> 
> I've just realised that the process level mitigations are turned on between
> level 0 and level 1 as well, so there's a slim chance it is one of those
> causing the issue and not low integrity.
> 
> Here's a try build with the mitigations only turned on at level 2+ and the
> sandbox level defaulted to 1:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=b0a00d1e3c70
> 
> The only difference between 0 (where it works) and 1 in this build should be
> low integrity.
> Can you test this when you get a chance, please?
> 

This is still failing :(
Flags: needinfo?(mchang)

Comment 44

3 years ago
currently doesn't block rollout.
No longer blocks: 1246505
Whiteboard: [sb?] → [sb+]

Comment 45

3 years ago
(In reply to Jim Mathies [:jimm] from comment #44)
> currently doesn't block rollout.

Doesn't block rollout of the content process sandbox, would block rollout of a sandboxed gpu process.
(In reply to Jim Mathies [:jimm] from comment #45)
> (In reply to Jim Mathies [:jimm] from comment #44)
> > currently doesn't block rollout.
> 
> Doesn't block rollout of the content process sandbox, would block rollout of
> a sandboxed gpu process.

We currently use the gpu in the content process, so this should block rollout of the content process sandbox.

Comment 47

3 years ago
From what I understand:

1) according to telemetry this impacts a very small percentage of nightly users (.2%)
2) we have a clean fallback
3) we have one machine in house that reproduces it (dvander)

I'll renom for the sandbox team to see if we can collect more information on aurora (mentioned in comment 42).

Milan, can the gfx team put some effort into diagnosing this? The plan is to roll the low integrity sandbox out right behind the first rollout of content processes.
Flags: needinfo?(milan)
Summary: D3D11CreateDevice fails with E_FAIL in sandbox → D3D11CreateDevice fails with E_FAIL with low integrity content sandbox
Whiteboard: [sb+] → [sb?]

Updated

3 years ago
Flags: needinfo?(milan)
I've got some updated Telemetry numbers, now that we've had the instrumentation on Aurora for a few weeks.

In a sample of 77,407 sessions, 66,606 had a D3D11 compositor, and of those, 13 failed to create a device in a content process. Of *those*, there are two groups of failures: ones where the initial device creation fails (which suggests a sandbox failure), and ones where only d2d/video failed. Here are the adapters/drivers that were correlated to a possible sandbox failure:

 Intel HD Graphics 2500, 8.15.10.2639, 2-1-2012
 NVIDIA GeForce GTX 970, 10.18.13.5900, 11-13-2015
 NVIDIA GeForce GTX 770, 10.18.13.5582, 8-25-2015
 NVIDIA GeForce GT 650M, 10.18.13.6191, 2-8-2016
 NVIDIA GeForce GTX 980 Ti, 10.18.13.6200, 2-23-2016
 NVIDIA GeForce GTX 960, 10.18.13.6143, 12-16-2015

Here are the ones *NOT* correlated:
 Intel HD Graphics 3000, 8.15.10.2361, 4-10-2011
 Iris pro 5200, 10.18.10.3621, 5-17-2014
 Intel 2nd Gen, 8.15.10.2342, 3-25-2011
 Intel 2nd Gen, 8.15.10.2538, 9-26-2011
 Intel HD Graphics 3000, 8.15.10.2342, 3-25-2011
 Intel(R) HD Graphics 4000, 10.18.10.4252, 7-10-2015
 NVIDIA GeForce 9800 GT, 9.18.13.4195, 1-29-2016

The pattern seems to be that this only affects high-end NVIDIA graphics cards running relatively recent drivers on Windows 10. There is one outlier, which is probably just a normal driver failure given how old the configuration is (and how that fits into the other general failures, which also have very old hardware).

My guess is that some kind of developer tool or SDK feature causes the problem, but I (or someone else who can reproduce it) would have to reinstall Windows and install all the developer tools one-by-one again to narrow down which one. 

At any rate, even if we consider all of these failures as sandbox-related, that is still only 0.02% of sessions affected. It is probably not worth worrying about for rollout, but we may want to uplift the telemetry and see what beta looks like.
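
For reference, the 0.02% figure can be reproduced from the counts quoted above. This is a minimal sketch in Python; the variable names and the `percent` helper are only for illustration, and the numbers are the ones from this comment.

```python
# Counts quoted in this comment (Aurora sample).
total_sessions = 77_407    # all sessions in the sample
d3d11_sessions = 66_606    # sessions with a D3D11 compositor
content_failures = 13      # failed to create a device in a content process


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Share of sessions that had a D3D11 compositor at all.
print(f"D3D11 compositor: {percent(d3d11_sessions, total_sessions):.1f}%")

# Failure rate, measured against D3D11 sessions and against all sessions.
print(f"Failures / D3D11 sessions: {percent(content_failures, d3d11_sessions):.3f}%")
print(f"Failures / all sessions:   {percent(content_failures, total_sessions):.3f}%")
```

Either denominator rounds to the same 0.02%, so the conclusion does not depend on which one is used.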

Comment 49

3 years ago
(In reply to David Anderson [:dvander] from comment #48)
> I've got some updated Telemetry numbers, now that we've had the
> instrumentation on Aurora for a few weeks.

Just to be clear, the Windows content sandbox is only on Nightly at the moment; some printing issues have come up since this bug.

So, from the Nightly failures, it appears that the content sandbox makes this failure rate worse, but we'll have to wait until we start to roll out to get the figures.

It seems like this shouldn't block rolling out to Beta, but we should make sure it is checked and signed off for release.

With that said, as we don't really have a detailed idea of why this is failing, any further debugging would be useful because we don't really have any ideas for how we could fix this if it does turn out to be a bigger problem.

Would it be possible to ask any contacts we have at NVIDIA to see if they can reproduce?
David, could you put together the telemetry patch for Beta, as in "...we may want to uplift the telemetry and see what beta looks like..."?
Flags: needinfo?(dvander)
It sounds like sandboxing will ride the trains; if that's the case, then the telemetry is ahead and we don't have to do anything.
Flags: needinfo?(dvander)

Updated

3 years ago
Whiteboard: [sb?] → [sb+]

Updated

2 years ago
Blocks: 1347710

Updated

a year ago
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → WORKSFORME
This still doesn't work for me. I have the content and gpu process sandboxes disabled on my machine. But it seems reasonable to WONTFIX this.
Resolution: WORKSFORME → WONTFIX