Closed Bug 1145106 Opened 5 years ago Closed 4 years ago

crash in atidxx32.dll@0x196f7

Categories

(Core :: Graphics, defect, critical)

x86
Windows NT


RESOLVED WORKSFORME

People

(Reporter: bogdan_maris, Unassigned)

Details

(Keywords: crash)

Crash Data

Attachments

(3 files)

This bug was filed from the Socorro interface and is report bp-003957b4-d7d5-4adc-91d1-4b2302150319.
=============================================================

STR:
1. Start Aurora
2. Change the loop.server pref to https://loop-dev.stage.mozaws.net/v0
3. Click the Hello icon
4. Click 'Start a conversation'
5. Copy the conversation link and paste it in the same tab
6. Select 'Share my tabs' from the 'Share your screen' button
7. Click 'Got it' in the infobar

Notes:
1. Only reproduced on Windows platform
2. It may not crash from the first try.

Another crash from a different machine:
bp-286443d4-97be-40e4-9f67-81de82150319
Flags: needinfo?(mreavy)
These crashes seem to be deep in graphics drivers, with no obvious connection to Hello/WebRTC (though perhaps this was indirectly triggered by the tab-sharing's "render page to canvas" request). I think this needs to go over to graphics/layers (though I'm keeping it in WebRTC:audio/video for now). The other possibility is that there's some ownership issue with the data being manipulated. Possibly related to Bug 1145102.

Milan -- I'd like someone with the right knowledge (I believe it'd be someone on your team) to look at this bug and Bug 1145102 together to help analyze how concerning these are and what we can do to fix them.  They are impacting the stability of tab sharing, which is very important to Hello and WebRTC.  (Tab sharing for Hello is scheduled to ship to release in Fx38.)  Thanks.
Rank: 10
Component: WebRTC → WebRTC: Audio/Video
Flags: needinfo?(mreavy) → needinfo?(milan)
Priority: -- → P1
Start with Bas, just because of the Windows expertise.
Flags: needinfo?(milan) → needinfo?(bas)
(In reply to Milan Sreckovic [:milan] from comment #2)
> Start with Bas, just because of the Windows expertise.

There does not appear to be any obvious connection between the two crashes mentioned in comment 0, judging from the crashes themselves. (If I understand correctly, both were triggered with the same STR?) Other than that, both are invalid memory reads at a seemingly not-unreasonable address on the heap.

The first crash, however, does have an interesting stack as well as an interesting crash address. It appears to me that the crash address may very well be related to the image data being uploaded. I'll attempt to catch this in a debugger, but in case someone beats me to it, it'd be interesting to know whether the address where the read access violation occurs has any relation to the address of the image data that we're trying to upload.

That would explain this crashing in a variety of locations as depending on the driver version the upload would occur in different locations inside of the DLL. If my suspicion is correct this would not be a driver bug but possibly more of an issue in our image bridge code. Nical, what do you think?
Flags: needinfo?(bas) → needinfo?(nical.bugzilla)
In the first report, an ImageContainer contains a TextureClient that has been used on the main thread and has a D2D texture which, I assume, is using the main thread's D2D device. In this stack we are on the ImageBridge thread, which uses a different D2D device for its texture uploads (as of recently).

So my guess is that this crash comes from trying to copy from a texture into another with the textures not belonging to the same D2D device.

I remember reviewing the patch that introduced the separate D2D device for the ImageBridge thread but I don't have it handy. We could either try to back the patch out, or read the texture back on the CPU and reupload it. The latter would be sorta horrible but safe. The risk with the former is that we landed the patch to avoid a deadlock inside of D2D if you do certain things with a D2D device from several threads. IIRC we landed that patch and another one which was attempting to work around the deadlock in a different way, so maybe we can just back it out. Matt, what do you think (as you wrote both the patches for the deadlock issue)?
Flags: needinfo?(nical.bugzilla) → needinfo?(matt.woodrow)
My changes only affected uploads for D3D9SurfaceImage, so this crash will still be using the 'content' d3d device for uploading.

It looks like the main thread is within D2D as well, so we're probably crashing because both threads are using the same device at the same time.

We could try using SetMultithreadProtected on the d3d device (see the gfxWindowsPlatform changes in bug 875247).

Alternatively we could create and upload the TextureClient when we construct the Image (on the main-thread) rather than waiting for the ImageBridge thread to do it.
Flags: needinfo?(matt.woodrow)
Here are some more crashes with a slightly different signature. 

Attached is a gif showing how Firefox behaves before crashing (ugly sight...). It's quite easy for me to crash now.

bp-f9f1fb7c-2428-4022-bfef-efb892150423
bp-e280c757-710d-4817-9c09-d01d92150423
Is there a try build we can produce with some of these suggestions, perhaps more instrumentation (I'd like to know if any TDRs are happening, for instance), and pass it to Bogdan to test with?
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(bas)
Sure, here's a patch for the first suggestion.

Try builds: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5b9f787a0bb1

The second suggestion I had doesn't actually make much sense, because we allocate the CairoImage off the main thread as well.

The other solution would be to have a separate d3d device for the image bridge thread (we could probably use the existing 'media' one), and have TextureClientD3D11::AllocateSurface choose the appropriate one given the current thread.

That might be preferable, since setting MultiThreadProtected on the content device might hurt performance.
Flags: needinfo?(matt.woodrow)
Working try builds:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1b6324cbf4a

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e11d0a5bebbe

Bogdan, could you please try those builds and see if the crashes are fixed?
Flags: needinfo?(bogdan.maris)
Flags: needinfo?(nical.bugzilla)
1) Build https://treeherder.mozilla.org/#/jobs?repo=try&revision=e11d0a5bebbe:

 - Windows 7 64-bit (AMD Radeon HD6450) - machine where I originally crashed with this signature
bp-b7eae424-d5ef-40c0-a2fc-01a3d2150424

We also tested on a second machine:
 - Windows 7 64-bit (NVidia Geforce 210) - machine where it originally crashed with signature from bug 1154271
bp-41721af3-829b-45c5-9536-8817f2150424

2) Build https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1b6324cbf4a:

Looks good with both machines from above, was unable to crash the build. 

Could this build contain the fix for bug 1154271 as well, and bug 1154771? (I don't have the setup to verify that crash.)
Flags: needinfo?(bogdan.maris) → needinfo?(matt.woodrow)
Bogdan, could you clarify comment 11 a bit? Did you mean that build 1) still crashes for you and build 2) doesn't?
Flags: needinfo?(bogdan.maris)
Yes

Build 1) - I can still crash Firefox
Build 2) - Works fine, unable to crash Firefox
Flags: needinfo?(bogdan.maris)
Bogdan, do you have any idea why the stacks changed between comment #1 and comment #6?

Bug 1154271 and bug 1154771 both have very similar stacks to comment #1, where the crash is coming from a thread with CreateLayeredChild on the stack.

The stacks in comment #6 and comment #11 seem different and I'm not sure why.

Bas: The newer crashes aren't obviously on the ImageBridge thread (unlike the earlier ones), and avoiding using the content d3d device isn't fixing the crash. I guess that means that yet another thread has access to our content device, but I can't find anywhere that would happen.
Flags: needinfo?(matt.woodrow) → needinfo?(bogdan.maris)
It's possible that a gecko code change caused the different stacks, if so, getting a regression range would be really helpful.
(In reply to Matt Woodrow (:mattwoodrow) from comment #14)
> Bogdan, do you have any idea why the stacks changed between comment #1 and
> comment #6?
> 
> Bug 1154271 and bug 1154771 both have very similar stacks to comment #1,
> where the crash is coming from a thread with CreateLayeredChild on the stack.
> 
> The stacks in comment #6 and comment #11 seem different and I'm not sure why.

Using the initial build where I got the signature from comment 1 ([@ atidxx32.dll@0x196f7]), now I get the signature from comment 6 ([@ atidxx32.dll@0x49299e ]).
bp-20eba869-da76-4309-8ec4-2a2a42150428

Since mozregression was of no help here, I did a manual regression and tested builds from the first time I got this crash until nightly from 2015.04.23.

Here are the results:

2015.03.19 -  [@ atidxx32.dll@0x49299e ]
bp-38d4d3af-5c0b-4d3b-9c21-ae3b52150428

2015.03.20 - [@ atidxx32.dll@0x49299e ]
bp-af376f30-9bde-493f-9182-9cb712150428

2015.03.28 - [@ atidxx32.dll@0x49299e ]
bp-97dacb01-c60d-4698-a020-277402150428

2015.04.06 - [@ atidxx64.dll@0x57d7bb ]
bp-9223dfa9-7853-43fa-9eed-6305f2150428

2015.04.23 - [@ atidxx32.dll@0x49299e ]

Latest Nightly:
2015.04.28 - [@ atidxx32.dll@0x49299e ]
bp-20825854-6c12-4af9-9fc5-c08bd2150428

Note: 
- I did not actually test every build from 2015.03.18 to 2015.04.28; I bisected just as mozregression would.

I don't know what has changed, but I can only get the signature from comment 6 on almost all builds, see above. I hope this helps...
Flags: needinfo?(bogdan.maris) → needinfo?(matt.woodrow)
I encountered the first signature again using latest Nightly and 38.0.5beta1 build 2:
- [@ atidxx32.dll@0x196f7 ]
bp-17baf221-d1f7-4122-a5c9-76d582150512
bp-88a80823-de88-4d1f-bf20-8332d2150512
bp-40a112bb-d067-46fb-90a9-fae702150512

Sometimes I get other signatures:
- [@ afhook32.dll@0x54de ] bp-3c07bf93-f9b4-45de-8e07-16db32150512
- [@ atidxx32.dll@0x509db0 ] bp-aaef48b2-87fc-4d57-8a01-c29c72150512

Steps:
1. Make a tab sharing call.
2. Send the link and open it on another PC.
3. Resize Firefox on the first PC.
It looks like our D2DFactory is created with multithreading, so I'm not sure what else could be causing this.

I guess we should just go ahead with marking the D3D device as multithreaded unless Bas has a better idea.
Flags: needinfo?(matt.woodrow)
backlog: --- → webRTC+
-> GFX, because it is.  Waiting on bas to reply to matt I guess.
backlog: webRTC+ → ---
Rank: 10
Component: WebRTC: Audio/Video → Graphics
Priority: P1 → --
Comment on attachment 8596962 [details] [diff] [review]
multi-thread-protect

Review of attachment 8596962 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxWindowsPlatform.cpp
@@ +2015,5 @@
>  
>      mD3D11ContentDevice->SetExceptionMode(0);
>  
> +    nsRefPtr<ID3D10Multithread> multi;
> +    mD3D1ContentDevice->QueryInterface(__uuidof(ID3D10Multithread), getter_AddRefs(multi));

I'm assuming this is a typo, and it should be mD3D11ContentDevice?
Attachment #8596962 - Flags: feedback?(bas)
Yes, it's spelt correctly in the try push :)
Comment on attachment 8596969 [details] [diff] [review]
Choose d3d11 device based on thread

This patch didn't fix the crash. I'm not sure why.
Attachment #8596969 - Flags: review?(bas)
Comment on attachment 8596962 [details] [diff] [review]
multi-thread-protect

Review of attachment 8596962 [details] [diff] [review]:
-----------------------------------------------------------------

If this helps, let's do it, although we really need to fix the fact that we have this bug in the first place; there's no way this will totally fix things.
Attachment #8596962 - Flags: feedback?(bas) → feedback+
Is the patch still needed? It looks like there have been zero reports of this crash in at least 28 days.
Flags: needinfo?(bas)
I suspect this was actually fixed by that bug.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME