Closed Bug 1183910 Opened 4 years ago Closed 4 years ago

Don't re-test d3d11 features on each process

Categories

(Core :: Graphics, defect)

Unspecified
Windows
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla43
Tracking Status
firefox42 --- fixed
firefox43 --- fixed

People

(Reporter: dvander, Assigned: dvander)

References

(Blocks 1 open bug)

Details

(Whiteboard: [gfx-noted])

Attachments

(10 files, 8 obsolete files)

11.44 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
10.06 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
39.03 KB, patch
bas.schouten
: review+
Details | Diff | Splinter Review
4.58 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
5.66 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
1.15 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
23.68 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
19.71 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
5.71 KB, patch
jrmuizel
: review+
Details | Diff | Splinter Review
13.20 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
Currently, the d3d11 initialization code has a few idiosyncrasies. After a TDR we can technically switch from WARP to non-WARP and vice versa, because we retest whether or not we should be using WARP. Ideally once we land in WARP we should never try non-WARP again.

Second, the child process will execute the same steps as the parent process, including creating a compositor device. It could technically create an incompatible device during this process. (See also, bug 1100510).

I'd like to simplify this so that the parent process determines the graphics configuration exactly once. After a TDR, we should re-use the old configuration rather than re-test. Similarly, child processes should request the configuration from the parent process, and should not create devices they don't need.
There may or may not be some coordination required with Benoit on this one.
Non-accelerated to accelerated change should probably not be allowed.  The other way around, it may be necessary.
Flags: needinfo?(bgirard)
Whiteboard: [gfx-noted]
I filed bug 1183789 today which may be similar. David do you know if this is the primary cause of us switching our minds about gfx features in the appnotes?
1) If we're sure we should dupe bug 1183789 to this one.
2) If we're unsure but think it's the primary cause then we may want to solve this first and the re-evaluate where that leaves bug 1183789.
3) If we're totally unsure we may just want to have me look at bug 1183789.
Flags: needinfo?(bgirard) → needinfo?(dvander)
(In reply to Benoit Girard (:BenWa) from comment #2)
> I filed bug 1183789 today which may be similar. David do you know if this is
> the primary cause of us switching our minds about gfx features in the
> appnotes?
> 1) If we're sure we should dupe bug 1183789 to this one.
> 2) If we're unsure but think it's the primary cause then we may want to
> solve this first and the re-evaluate where that leaves bug 1183789.
> 3) If we're totally unsure we may just want to have me look at bug 1183789.

I'm not sure, to be honest. I'd been looking at the device setup code for inclusion in telemetry data, and it just seemed too unreliable. My plan here was to,
 (1) Read d3d/d2d1 prefs in gfxPlatform, once, on initialization.
 (2) Only initialize d3d11 if we will use it for layers acceleration.
 (3) Only initialize d2d if we will be using d3d11.
 (4) If the initialization process fails, do not try it again.
 (5) If it succeeds, re-use that configuration after TDRs.
 (6) Do not specify LAYERS_D3D11 as a backend if we do not have a D3D11 device.
 (7) Have the child process request the configuration from the parent process.
 (8) This will let us avoid creating a compositor device in the child process.

Another weird thing is that, right now we can technically switch compositor backends. For example if we fail to get a device after a TDR we will create a basic compositor after having earlier chosen an accelerated one. I want to fix that, either by (a) falling back to non-OMTC in that situation, or (b) supporting fallbacks to the basic compositor in Gecko. What we're doing now is basically (b) except that we're pretty sure it doesn't work.
Flags: needinfo?(dvander)
Attached patch part 1, initial refactoring (obsolete) — Splinter Review
This is a big cleanup of how we get D3D11 devices. All initialization now happens in gfxWindowsPlatform's constructor, and it follows a pretty simple series of decisions. I'm including the commit message below since that describes the full list of changes better, but the takeaway is:

We no longer create D3D11 devices *at all* if layers acceleration is disabled or D3D9 is preferred, and subsequently do not initialize D2D *at all* if we do not plan on using a D3D11 compositor.

The structure of the code has been refactored such that we can start re-using previous decisions after TDRs and across processes.

Full changes:
 (1) We no longer attempt to use D3D11 if acceleration is disabled or D3D9 is preferred. This is a departure from our previous behavior, where we would construct these devices but then not use them as a compositor backend.
 (2) D3D11 startup no longer creates a content device (this is reserved for D2D initialization).
 (3) D2D is only attempted if we managed to create a D3D11 compositor device. This is a departure from previous behavior where if D3D11 was not used for compositing, we could still create its machinery to use D2D as a content backend.
 (4) D2D 1.1 initialization is now directly responsible for creating a D3D11 content device.
 (5) D2D 1.0 and 1.1 logic have been disentangled for clarity.
 (6) UpdateRenderMode() has been split up, so we can update backend prefs out of band with device resets.
 (7) mUseGDIFonts and mUseDirectWrite have been removed as their state was confusing. Instead, D2D now depends on DWrite initialization succeeding. If later we fail to get a DWrite font list, we revert our decision to use Direct2D.
 (8) Device resets now clear a little more state, including the devices set in Moz2D Factory.
 (9) We no longer create a DWrite text analyzer as it was unused.
Attachment #8635051 - Flags: review?(matt.woodrow)
Attached patch part 1, initial refactoring (obsolete) — Splinter Review
Whoops, correct patch.

Another issue I found while working on this is that various places in Moz2D have a (mDevice == Factory::GetDirect3DxDevice()) check to see whether they're still valid. This is susceptible to an A-B-A problem, but I don't know whether that matters.

There are also places in Moz2D that use Factory::GetDirect3D10Device() without a null check, but that would only matter if you retained some objects across a TDR and then attempted to use them.
Attachment #8635051 - Attachment is obsolete: true
Attachment #8635051 - Flags: review?(matt.woodrow)
Attachment #8635052 - Flags: review?(matt.woodrow)
Attached patch part 1, initial refactoring (obsolete) — Splinter Review
Errr, *this* is the correct patch.
Attachment #8635052 - Attachment is obsolete: true
Attachment #8635052 - Flags: review?(matt.woodrow)
Attachment #8635053 - Flags: review?(matt.woodrow)
Attached patch part 1, initial refactoring (obsolete) — Splinter Review
This version fixes a few asserts that would trigger after TDRs.

(Also, I was wrong about ABA problems, since they're held alive by RefPtr.)
Attachment #8635053 - Attachment is obsolete: true
Attachment #8635053 - Flags: review?(matt.woodrow)
Attachment #8635085 - Flags: review?(matt.woodrow)
Attached patch part 2, add a pref to test TDRs (obsolete) — Splinter Review
This is a small bit of code to trigger TDRs for testing. When you set the pref, the next call to GetDeviceRemovedReason will simulate a TDR and unset the pref. It doesn't work with e10s, which I'll fix in the final part of this bug.

I used this to make sure part 1 didn't regress TDR handling. After a few fixes, it doesn't, but TDR handling seems pretty shaky right now. We assert in a few places on non-e10s and have a gfxCriticalError on e10s.
Attachment #8635087 - Flags: review?(matt.woodrow)
We're using critical errors to find out more about the TDRs, so that one is probably OK.  Maybe not, I don't know which one we're talking about.

Let's also wait a few days for the aurora/beta bugs with warp preferences to get settled, so that we don't diverge so much (too late already, I imagine, but still) and make testing in nightly more difficult.
Comment on attachment 8635087 [details] [diff] [review]
part 2, add a pref to test TDRs

Review of attachment 8635087 [details] [diff] [review]:
-----------------------------------------------------------------

Cool!

A couple of feature requests:

* Get a menu item for this added to 'Nightly Tester Tools' [1]
* Add a way to break hardware compositing so that we test falling back to software.

[1] https://github.com/mozilla/nightlytt
Attachment #8635087 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8635085 [details] [diff] [review]
part 1, initial refactoring

Review of attachment 8635085 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good, really hard to tell if any behaviour changed though.

As Milan said, wouldn't hurt to hold off landing this for a few days. I'll try make sure Jeff and Bas take a quick look too so they're aware of the changes.

::: gfx/layers/client/TextureClient.cpp
@@ +342,5 @@
>  #ifdef XP_WIN
>    LayersBackend parentBackend = aAllocator->GetCompositorBackendType();
>    if (parentBackend == LayersBackend::LAYERS_D3D11 &&
> +      ((aMoz2DBackend == gfx::BackendType::DIRECT2D && Factory::GetDirect3D10Device()) ||
> +       (aMoz2DBackend == gfx::BackendType::DIRECT2D1_1 && Factory::GetDirect3D11Device())) &&

This seems a bit confusing. Letting the callers pass in arbitrary backends, and then having to verify if it's actually actionable is sad.

As far as I can tell, only single caller (CanvasClient) passes something other than NONE for the backend (which gets overwritten with GetContentBacked()), and that uses GetCanvasBackend().

How would you feel about getting rid of the aMoz2DBackend parameter entirely, and just having a 'usage' flag with 'content' and 'canvas' as the options.

That way we could always make the call to GetContentBackend/GetCanvasBackend ourselves, and assert that those only ever return a backend that is actually available.

@@ +347,3 @@
>        aSize.width <= maxTextureSize &&
> +      aSize.height <= maxTextureSize)
> +  {

The style guide won't like this change!

::: gfx/thebes/gfxWindowsPlatform.cpp
@@ +467,4 @@
>    }
>  
> +  mDWriteFactory = factory;
> +  factory->Release();

Wouldn't hurt to use RefPtr<IDWriteFactory>, byRef(factory) to avoid this explicit release.
Attachment #8635085 - Flags: review?(matt.woodrow) → review+
Good ideas, I'll do the pref to fallback to software in a later patch (it's easier to do once we can sticky the feature status across device initialization attempts).

For now here's a new one with e10s support. We capture the pref in the parent process and propagate it to children. True TDRs don't necessarily work that way, but by the end of this patch queue they sort of will (since we'll block on the parent process attempting to re-initialize its devices).
Attachment #8635087 - Attachment is obsolete: true
Attachment #8635841 - Flags: review?(matt.woodrow)
This removes D3D11Status since it can be covered by FeatureStatus. It also inverts the "use WARP" directive to be "use hardware", which flows better.

The motivation for this and the next two patches are to separate parent and child process checks apart.
Attachment #8635895 - Flags: review?(matt.woodrow)
w/ Matt's comments addressed, except for the TextureClient stuff which I'll do as a follow-up. Also moved mRenderMode setting to UpdateRenderMode(). r?'ing bas for this round, see comment #4 for the full change list.
Attachment #8635085 - Attachment is obsolete: true
Attachment #8635905 - Flags: review?(bas)
Similar to part 3, factor D2D checks into separate functions, to make parent/child process checks easier.
Attachment #8635906 - Flags: review?(matt.woodrow)
This makes FeatureStatus values persist after they are initialized. After a TDR, we no longer re-try features if they previously didn't work. This also makes mIsWARP persist across InitializeDevices (that is, we don't go from WARP to non-WARP after a TDR).
Attachment #8635907 - Flags: review?(matt.woodrow)
Not sure if there was a reason behind this, but it looks like a bug. We cache the result of texture sharing checks which means we don't get an updated value after device resets.
Attachment #8635909 - Flags: review?(matt.woodrow)
Comment on attachment 8635905 [details] [diff] [review]
part 1, initial refactoring

Review of attachment 8635905 [details] [diff] [review]:
-----------------------------------------------------------------

Still looking, but so far I'm really happy with this work.
This patch adds a new sync IPC call so the child process can request device preferences (such as whether or not to use d3d11, WARP, d2d, etc). We now also force content processes to initialize gfx right after XPCOM, to ensure the sync message doesn't happen at a random weird time in the content process.

It seemed like there are a few approaches to reworking gfxWindowsPlatform for this. The approach here short-cuts the parent process checks if we're in a content process. This is nice since the previous patches separated out "is this feature supported" checks from "try to initialize this feature", and now we get to re-use all the initialization logic.

On the other hand, it does make the device logic more complicated and it may be simpler to just duplicate the entire initialization process in a separate, content-only path, since creating the devices is pretty straightforward.

It seems hard to know for sure until we need another feature or d3d11/d2d restriction.
Attachment #8636827 - Flags: review?(matt.woodrow)
Content processes don't need a d3d11 compositor. It would be hard to remove this without the e10s part of this bug, because we wouldn't reliably know whether or not to create a WARP device (or whether or not texture sharing works). But now we can just skip the device creation and copy the device prefs from the parent process.

The only tricky part is that we do still need to check if we can get a DXGI adapter, so I moved that into the device creation functions.
Attachment #8636831 - Flags: review?(bas)
Err, fixing a stupid bug.
Attachment #8636831 - Attachment is obsolete: true
Attachment #8636831 - Flags: review?(bas)
Attachment #8636832 - Flags: review?(bas)
As suggested in comment #11, this removes the aMoz2DBackend parameter from TextureClient::CreateForDrawing. It also removes the Factory::GetD3D*Device() checks since they are guaranteed to be valid based on the gfxPlatform backend list.
Attachment #8636962 - Flags: review?(matt.woodrow)
Something worth considering is that different processes can see different capabilities. e.g. in optimus the gpu that you can get can depend on the name of the executable.
Comment on attachment 8636832 [details] [diff] [review]
part 8, don't create compositors in content processes

Review of attachment 8636832 [details] [diff] [review]:
-----------------------------------------------------------------

I'm going to r+ this for now, but I'm a little concerned about the way we're managing mDoesD3D11TextureSharingWork in general. (maybe your other patches fix this in which case my concern is moot, but for now let's do this)
Attachment #8636832 - Flags: review?(bas) → review+
Comment on attachment 8635841 [details] [diff] [review]
part 2, add a pref to test TDRs

Review of attachment 8635841 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentChild.cpp
@@ +2896,5 @@
>  
> +bool
> +ContentChild::RecvTestGraphicsDeviceReset(const uint32_t& aResetReason)
> +{
> +#if defined(XP_WIN)

We can skip this #ifdef since TestDeviceReset is already a no-op for non-windows platforms.
Attachment #8635841 - Flags: review?(matt.woodrow) → review+
Attachment #8635895 - Flags: review?(matt.woodrow) → review+
Attachment #8635906 - Flags: review?(matt.woodrow) → review+
Attachment #8635907 - Flags: review?(matt.woodrow) → review+
Attachment #8635909 - Flags: review?(matt.woodrow) → review+
Attachment #8636962 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8635905 [details] [diff] [review]
part 1, initial refactoring

Review of attachment 8635905 [details] [diff] [review]:
-----------------------------------------------------------------

Still not supergreat code to read but it's a huge improvement, sounds good.
Attachment #8635905 - Flags: review?(bas) → review+
OS: Unspecified → Windows
Matt, given our discussion in IRC, part 7 should still be okay to review. It still re-computes texture sharing in the child process. In fact that patch in theory) should be able to land standalone.

Part 8 is where things went wrong, so I've fixed up that instead. There were a few places that relied on the value of texture sharing:

 (1) ClientCanvasLayer. We reasoned this only cares about the compositor device in the parent process, so DoesD3D11TextureSharingWork has been renamed to CompositorD3D11TextureSharingWorks().
 (2) CanUseHardwareVideoDecoding(). This cares about both the compositor and image bridge device's texture sharing. We already destroy the image bridge device if it can't share textures. This patch adds an additional check: we now avoid creating the image bridge device entirely if the compositor can't share textures. This means the presence of the image bridge device is enough to check for hardware video decoding.
 (3) Direct2D. This cares about the compositor device, but we don't have one in content processes. Since content processes can bind to a different GPU, this patch simply adds a texture sharing check against the content D3D11 device.

I'll try to address the dual GPU problem in a final patch.
Attachment #8636832 - Attachment is obsolete: true
Attachment #8641343 - Flags: review?(matt.woodrow)
This patch doesn't let content processes use D3D11 devices if their description doesn't match the one in the parent process.

Note: I'm not totally sure if the LUID is a good thing to include in this check, or what any of this means with SLI. I don't have a machine with two GPUs to test on.
Attachment #8641493 - Flags: review?(jmuizelaar)
Comment on attachment 8641493 [details] [diff] [review]
part 10, don't mix adapters

Review of attachment 8641493 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxWindowsPlatform.cpp
@@ -1811,5 @@
>  
>      keyedMutex->ReleaseSync(0);
> -
> -    // It seems like this may only happen when we're using the NVIDIA gpu
> -    CheckForAdapterMismatch(device);

Can we keep this around for a while longer?

@@ +2048,5 @@
> +gfxWindowsPlatform::ContentAdapterIsParentAdapter(ID3D11Device* device)
> +{
> +  DXGI_ADAPTER_DESC desc;
> +  if (!GetDxgiDesc(device, &desc)) {
> +    // Just assume things are okay.

Feels like it would be safer to assume that things are not ok.
Attachment #8641493 - Flags: review?(jmuizelaar) → review+
Comment on attachment 8636827 [details] [diff] [review]
part 7, the e10s part

Review of attachment 8636827 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxWindowsPlatform.cpp
@@ +375,5 @@
>  
> +// Device init data should only be used on child processes, so we protect it
> +// behind a getter here.
> +static DeviceInitData sDeviceInitDataDoNotUseDirectly;
> +static inline DeviceInitData&

Can this be const?
Attachment #8636827 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8641343 [details] [diff] [review]
part 8, don't create compositors in content processes

Review of attachment 8641343 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxWindowsPlatform.cpp
@@ +451,5 @@
>  bool
>  gfxWindowsPlatform::CanUseHardwareVideoDecoding()
>  {
> +  if (!gfxPrefs::LayersPreferD3D9() && !mD3D11ImageBridgeDevice) {
> +    // The image bridge device is only created if we don't have WARP, we can

Is this true? We agreed that we still want image bridge even without h/w accel, since we can send async video using shmem.
Address comment #39 - don't make hw video decoding depend on the image bridge device being present.
Attachment #8641343 - Attachment is obsolete: true
Attachment #8641343 - Flags: review?(matt.woodrow)
Attachment #8642208 - Flags: review?(matt.woodrow)
Attachment #8642208 - Flags: review?(matt.woodrow) → review+
sorry backed out for bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=12565550&repo=mozilla-inbound

 10:08:10 INFO - ^
10:08:10 INFO - In file included from Unified_cpp_dom_media_webrtc0.cpp:29:0:
10:08:10 INFO - /home/worker/workspace/gecko/dom/media/webrtc/MediaEngineGonkVideoSource.cpp: In member function 'void mozilla::MediaEngineGonkVideoSource::RotateImage(mozilla::layers::Image*, uint32_t, uint32_t)':
10:08:10 ERROR - /home/worker/workspace/gecko/dom/media/webrtc/MediaEngineGonkVideoSource.cpp:757:100: error: no matching function for call to 'mozilla::layers::TextureClientRecycleAllocator::CreateOrRecycleForDrawing(mozilla::gfx::SurfaceFormat, mozilla::gfx::IntSize, mozilla::gfx::BackendType, mozilla::layers::TextureFlags, mozilla::layers::TextureAllocationFlags)'
10:08:10 INFO - layers::ALLOC_DISALLOW_BUFFERTEXTURECLIENT);
10:08:10 INFO - ^
Flags: needinfo?(dvander)
Flags: needinfo?(dvander)
Keywords: leave-open
Whoops, the push in comment #51 had the wrong bug number.

part 10: http://hg.mozilla.org/mozilla-central/rev/9fda40f7a0d8
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Comment on attachment 8641493 [details] [diff] [review]
part 10, don't mix adapters

Approval Request Comment
[Feature/regressing bug #]: Graphics stability work
[User impact if declined]: Potential crashes on dual GPU systems
[Describe test coverage new/current, TreeHerder]: Nightly
[Risks and why]: This was part of a series of patches that all landed on Firefox 42. Unfortunately, this last patch landed a few hours after the merge. It was intended to be included on 42 as well.
Attachment #8641493 - Flags: approval-mozilla-aurora?
Comment on attachment 8641493 [details] [diff] [review]
part 10, don't mix adapters

Sure, let's take it!
Attachment #8641493 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Target Milestone: --- → mozilla43
Duplicate of this bug: 1115352
Depends on: 1208234
You need to log in before you can comment on or make changes to this bug.