shutdownhang in mozilla::layers::ImageBridgeChild::ShutDown()

RESOLVED FIXED in Firefox 39

Status: RESOLVED FIXED
Type: defect
Priority: --
Severity: critical
Opened: 4 years ago
Closed: 4 years ago
Reporter: kairo; Assigned: mattwoodrow
Keywords: crash
Version: Trunk
Target Milestone: mozilla39
Hardware: x86 Windows NT
Tracking Flags: firefox36 wontfix, firefox37+ wontfix, firefox38 affected, firefox39 fixed
Attachments: 2

This bug was filed from the Socorro interface and is report bp-fd72c856-19b6-44f3-972b-6677a2150323.
=============================================================

This is a new topcrash / shutdownhang in 37.0b7 (it exists at a very low level in other versions, but only spiked with this one), which interestingly is a very similar picture to bug 1146313.

Those seem to be the significant frames (in Thread 0):
5 	nss3.dll 	PR_Wait 	nsprpub/pr/src/threads/prmon.c
6 	xul.dll 	mozilla::ReentrantMonitor::Wait(unsigned int) 	xpcom/glue/ReentrantMonitor.h
7 	xul.dll 	mozilla::layers::ImageBridgeChild::ShutDown() 	gfx/layers/ipc/ImageBridgeChild.cpp
8 	xul.dll 	gfxPlatform::ShutdownLayersIPC() 	gfx/thebes/gfxPlatform.cpp
9 	xul.dll 	ScopedXPCOMStartup::~ScopedXPCOMStartup() 	toolkit/xre/nsAppRunner.cpp


Almost all of those shutdownhangs seem to be in Win7, with at least Intel and AMD graphics adapters in the stats.

This is the #3 crash with 2.9% of all crashes in 37.0b7, and also #3 with ~10% when looking at only crashes with youtube.com URLs.

You can find more reports and stats by clicking on the link in the Crash Signature field in this bug.

Bas/nical: Any ideas what's up here?
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(bas)
ni kats and Jeff as well because we're scheduled to build the 37 RC today and have very little time to figure out our next step. My immediate thought is to backout whatever we think caused this crash to appear in b7.
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bugmail.mozilla)
Do we know if some other crash correspondingly reduced in volume? If it's just a matter of crashes changing signatures then I don't think we should be doing a backout. If it's a new crash while the other existing crashes remain at the same volume then yeah a backout seems warranted. Also fwiw this is a shutdown crash which arguably is better and less user-noticeable than a crash while actively using the browser.
Flags: needinfo?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> Do we know if some other crash correspondingly reduced in volume? If it's
> just a matter of crashes changing signatures then I don't think we should be
> doing a backout. If it's a new crash while the other existing crashes remain
> at the same volume then yeah a backout seems warranted. 

Kairo tells me that the overall crash rate has increased in b7. I don't know whether this crash specifically is a result of a change in signature.

> Also fwiw this is a
> shutdown crash which arguably is better and less user-noticeable than a
> crash while actively using the browser.

You're right and I agree. We should review the alternative and make a call on whether it is better to take a higher volume of shutdown crashes.
Thread 25's backtrace is interesting: it is the ImageBridgeChild thread that the main thread is waiting on, and it is blocked by something inside the D3D code; in particular, see the symbol "RtlpWaitCouldDeadlock". The stack in this thread belongs to some code added recently by Matt, related to doing texture uploads on the ImageBridgeChild thread so as to get acceptable 4k video playback performance in bug 1138967.
I am not 100% certain Matt's patch is at fault, but the stack is pointing to it, so it's worth taking a closer look.
Flags: needinfo?(nical.bugzilla) → needinfo?(matt.woodrow)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> Do we know if some other crash correspondingly reduced in volume?

Nothing that really corresponds with the volume of this.

(In reply to Nicolas Silva [:nical] from comment #4)
> The stack in this thread belongs to some code added recently by Matt,
> related to doing texture uploads on the ImageBridgeChild thread so as to get
> acceptable 4k video playback performance in bug 1138967.

This would make sense as we took this in exactly the affected build, 37.0b7.
I'm looking at the following two pages to get my numbers here:

https://crash-stats.mozilla.com/topcrasher/products/Firefox/versions/37.0b6
https://crash-stats.mozilla.com/topcrasher/products/Firefox/versions/37.0b7

On it, I see the js::GCMarker::processMarkStack (for which there were no related changes between b6 and b7) went down from 2.14% to 1.87% from b6 to b7, giving us a baseline multiplier of ~0.874 to indicate "no change". In other words, if a crash was at P% in b6, then we would expect to see it at 0.874P% in b7 just because other crashes around it got moved around and changed in volume.

Now, also in the b6 -> b7 comparison, I see a shutdownhang at 2.87% (bug 1138520) in b6 disappear entirely in b7, while this shutdownhang (bug 1146315) appears in b7 at a volume of 2.47%. The multiplier here is 0.861, which I think is close enough to the baseline - it could be that the shutdown hang "fixed" in bug 1138520 by backing out bug 1103036 just morphed signatures and turned into this one with no actual change in volume.

However, the total percent of shutdownhang crashes in b6 is 8.13% and it went up to 8.80% in b7. It should have been 7.10% using the baseline multiplier I calculated above, so it definitely did increase, but I'm not convinced that this one is really a "new" issue.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #6)
> On it, I see the js::GCMarker::processMarkStack (for which there were no
> related changes between b6 and b7) went down from 2.14% to 1.87% from b6 to
> b7, giving us a baseline multiplier of ~0.874 to indicate "no change".

Where do you get the idea that the percentage of that total would change in that way?
In any case, I'm also looking at volume comparisons by crashes/ADI across the 37 train, and my above assessment comes from that.

> Now, also in the b6 -> b7 comparison I see a shutdownhang at 2.87% (bug
> 1138520) in b6 disappear entirely in b7.

We know where that one came from, and it had a specific backout in bug 1103036 that we know fixed it. Also, the two paths that those shutdownhangs are waiting on are completely different.

> However, the total percent of shutdownhang crashes in b6 is 8.13% and it
> went up to 8.80% in b7.

Yes, because of the bug here and bug 1127270 rising significantly.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #7)
> (In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #6)
> > On it, I see the js::GCMarker::processMarkStack (for which there were no
> > related changes between b6 and b7) went down from 2.14% to 1.87% from b6 to
> > b7, giving us a baseline multiplier of ~0.874 to indicate "no change".
> 
> Where do you get the idea that the percentage of that total would change in
> that way?

Math? Let's say there were 100000 crashes over a week of b6, of which 2140 (2.14%) were processMarkStack. b7 has only been out for a couple of days, so let's say there were 10000 crashes in b7, of which 187 (1.87%) are processMarkStack. Assuming no changes were made that affect the processMarkStack crashes directly, we should expect to see the same volume (2140) of those after a week. If we assume the first couple of days of b7 are representative of the week, then after a week of b7 (when we hit 2140 processMarkStack crashes) we should have ~114439 total crashes. This is 14439 crashes more than b6 had over an equivalent period, and this increase can be attributed to changes made between b6 and b7.

Now consider any crash that (like processMarkStack) was not affected by these changes. Its volume in b6 should remain the same in b7, so if we had X crashes in b6, we should get X crashes in b7. As a percent of total crashes, X would be X/1000 in b6 and X/1144.39 in b7, so the percentage in b7 would be 0.874 times the percentage value in b6.

My argument is a little circular because I have to assume the bug 1138520 shutdownhang was unaffected in volume in order to apply this multiplier, but what I'm getting at is that the numbers are consistent with the hypothesis.
1) The first two days of any release, including a beta, are not entirely representative of what a week of crashes looks like - even more so over the weekend, when we usually have a crash rate spike and slightly different user behavior.
2) Even taking the weekend spike into account, 37.0b7 has a 10-20% higher crash rate than what we had before. There need to be reasons for that. I looked at my explosiveness reports, which are based on crashes/ADI data, as well as searches, and found that the rising shutdown crashes, led by this bug and bug 1127270, have more volume than the decreasing ones like bug 1138520. The main increase is due to probably-TDRs (the signature of bug 1116812), and bug 1146313 also plays a role.
3) Nicolas already found that the shutdownhang here is connected to code added by bug 1138967, which is the riskiest patch we took for 37.0b7.
Ok, fair enough. Sounds like we might want to back out bug 1138967 then as it seems to be the most likely culprit for both this bug and bug 1146313.
https://msdn.microsoft.com/en-us/library/windows/desktop/ee417025.aspx#Multithreading_and_DXGI

It seems that ImageBridgeChild::Shutdown blocking the main thread while we wait for textures to be destroyed may be the issue here, as DXGI can block waiting on the main thread.

It's not really clear how D3D is determining the 'application window' for this device, as we never intentionally associate it with any window (it's only used for creating resources).

It's possible that creating the device on the ImageBridge thread rather than the main thread would help, but I'm not sure.
Flags: needinfo?(matt.woodrow)
I think we can post messages to the main thread to delete our d3d textures, instead of releasing the final ref from the ImageBridgeChild thread.

The other solution would be to spin the event loop while waiting for ImageBridge shutdown, instead of waiting on the monitor, but that sounds fairly awful.

Any thoughts nical?

Why do we block when shutting down image bridge, but not the compositor? I think we'd potentially have the same problems with releasing textures from the compositor thread otherwise.
Flags: needinfo?(nical.bugzilla)
> 
> It's possible that creating the device on the ImageBridge thread rather than
> the main thread would help, but I'm not sure.

Worth trying. I don't know how much d3d/dxgi expects its objects to be used on the thread on which they are created, but we are less likely to run into 3rd party bugs if we don't use these objects across multiple threads anyway.

(In reply to Matt Woodrow (:mattwoodrow) from comment #12)
> I think we can post messages to the main thread to delete our d3d textures,
> instead of releasing the final ref from the ImageBridgeChild thread.

We can try this too.

> 
> The other solution would be to spin the event loop while waiting for
> ImageBridge shutdown, instead of waiting on the monitor, but that sounds
> fairly awful.

Sounds quite awful indeed!

> Why do we block when shutting down image bridge, but not the compositor? I
> think we'd potentially have the same problems with releasing textures from
> the compositor thread otherwise.

The shutdown of CompositorChild and ImageBridgeChild is similar, except that the ImageBridgeChild's shutdown has to be proxied from the main thread, since the IPDL protocol belongs to the ImageBridgeChild thread and can't be used from anywhere else. In contrast, CompositorChild lives on the main thread, so we can just do things from there.
The reason we have to block the main thread until the ImageBridge is shut down is that this has to finish before the shutdown of XPCOM threads, which happens shortly after.
We'd need NS_ShutdownXPCOM to be completely async in order to not wait for the ImageBridge thread.
Flags: needinfo?(nical.bugzilla)
Assignee: nobody → matt.woodrow
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bas)
Attachment #8582838 - Flags: review?(nical.bugzilla)
Attachment #8582837 - Flags: review?(nical.bugzilla) → review+
Attachment #8582838 - Flags: review?(nical.bugzilla) → review+
https://hg.mozilla.org/mozilla-central/rev/0da613f9e4af
https://hg.mozilla.org/mozilla-central/rev/d1569a99baea
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
38 is marked as affected. What do you think about uplifting to Beta?
Flags: needinfo?(matt.woodrow)
38 shouldn't be affected, so there's no need to uplift right now.
Flags: needinfo?(matt.woodrow)
Ok, my patch is specific to code that only exists on 39, so we'll need a new patch if we decide this is worth fixing.
See Also: → 1174592