shutdownhang in mozilla::layers::ImageBridgeChild::ShutDown()

RESOLVED FIXED in Firefox 39

Status: RESOLVED FIXED
Type: defect
Priority: --
Severity: critical
Opened: 4 years ago
Closed: 4 years ago
Reporter: kairo; Assigned: mattwoodrow
Keywords: crash
Version: Trunk
Target Milestone: mozilla39
Hardware: x86 Windows NT
Tracking Flags: firefox36 wontfix, firefox37+ wontfix, firefox38 affected, firefox39 fixed
Attachments: 2

This bug was filed from the Socorro interface and is report bp-fd72c856-19b6-44f3-972b-6677a2150323.
=============================================================

This is a new topcrash / shutdownhang in 37.0b7 (it exists at a very low level in other versions, but only spiked with this one), which interestingly is a very similar picture to bug 1146313.

Those seem to be the significant frames (in Thread 0):
5 	nss3.dll 	PR_Wait 	nsprpub/pr/src/threads/prmon.c
6 	xul.dll 	mozilla::ReentrantMonitor::Wait(unsigned int) 	xpcom/glue/ReentrantMonitor.h
7 	xul.dll 	mozilla::layers::ImageBridgeChild::ShutDown() 	gfx/layers/ipc/ImageBridgeChild.cpp
8 	xul.dll 	gfxPlatform::ShutdownLayersIPC() 	gfx/thebes/gfxPlatform.cpp
9 	xul.dll 	ScopedXPCOMStartup::~ScopedXPCOMStartup() 	toolkit/xre/nsAppRunner.cpp


Almost all of those shutdownhangs seem to be in Win7, with at least Intel and AMD graphics adapters in the stats.

This is the #3 crash with 2.9% of all crashes in 37.0b7, and also #3 with ~10% when looking at only crashes with youtube.com URLs.

You can find more reports and stats by clicking on the link in the Crash Signature field in this bug.

Bas/nical: Any ideas what's up here?
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(bas)
ni kats and Jeff as well because we're scheduled to build the 37 RC today and have very little time to figure out our next step. My immediate thought is to backout whatever we think caused this crash to appear in b7.
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bugmail.mozilla)
Do we know if some other crash correspondingly reduced in volume? If it's just a matter of crashes changing signatures then I don't think we should be doing a backout. If it's a new crash while the other existing crashes remain at the same volume then yeah a backout seems warranted. Also fwiw this is a shutdown crash which arguably is better and less user-noticeable than a crash while actively using the browser.
Flags: needinfo?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> Do we know if some other crash correspondingly reduced in volume? If it's
> just a matter of crashes changing signatures then I don't think we should be
> doing a backout. If it's a new crash while the other existing crashes remain
> at the same volume then yeah a backout seems warranted. 

Kairo tells me that the overall crash rate has increased in b7. I don't know whether this crash specifically is a result of a change in signature.

> Also fwiw this is a
> shutdown crash which arguably is better and less user-noticeable than a
> crash while actively using the browser.

You're right and I agree. We should review the alternative and make a call on whether it is better to take a higher volume of shutdown crashes.
Thread 25's backtrace is interesting: it is the ImageBridgeChild thread that the main thread is waiting on, and it is blocked by something inside the D3D code; in particular, see the symbol "RtlpWaitCouldDeadlock". The stack in this thread belongs to some code added recently by Matt, related to doing texture uploads on the ImageBridgeChild thread so as to get acceptable 4k video playback performance in bug 1138967.
I am not 100% certain Matt's patch is at fault, but the stack is pointing to it, so it's worth taking a closer look.
Flags: needinfo?(nical.bugzilla) → needinfo?(matt.woodrow)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> Do we know if some other crash correspondingly reduced in volume?

Nothing that really corresponds with the volume of this.

(In reply to Nicolas Silva [:nical] from comment #4)
> The stack in this thread belongs to some code added recently by Matt,
> related to doing texture uploads on the ImageBridgeChild thread so as to get
> acceptable 4k video playback performance in bug 1138967.

This would make sense as we took this in exactly the affected build, 37.0b7.
I'm looking at the following two pages to get my numbers here:

https://crash-stats.mozilla.com/topcrasher/products/Firefox/versions/37.0b6
https://crash-stats.mozilla.com/topcrasher/products/Firefox/versions/37.0b7

On it, I see the js::GCMarker::processMarkStack (for which there were no related changes between b6 and b7) went down from 2.14% to 1.87% from b6 to b7, giving us a baseline multiplier of ~0.874 to indicate "no change". In other words, if a crash was at P% in b6, then we would expect to see it at 0.874P% in b7 just because other crashes around it got moved around and changed in volume.

Now, also in the b6 -> b7 comparison, I see a shutdownhang at 2.87% (bug 1138520) in b6 disappear entirely in b7, while this shutdownhang (bug 1146315) appears in b7 at a volume of 2.47%. The multiplier here is 0.861, which I think is close enough to the baseline - it could be that the shutdown hang "fixed" in bug 1138520 by backing out bug 1103036 just morphed signatures and turned into this one with no actual change in volume.

However, the total percent of shutdownhang crashes in b6 is 8.13% and it went up to 8.80% in b7. It should have been 7.10% using the baseline multiplier I calculated above, so it definitely did increase, but I'm not convinced that this one is really a "new" issue.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #6)
> On it, I see the js::GCMarker::processMarkStack (for which there were no
> related changes between b6 and b7) went down from 2.14% to 1.87% from b6 to
> b7, giving us a baseline multiplier of ~0.874 to indicate "no change".

Where do you get the idea that the percentage of that total would change in that way?
In any case, I'm also looking at volume comparisons by crashes/ADI across the 37 train, and my above assessment comes from that.

> Now, also in the b6 -> b7 comparison I see a shutdownhang at 2.87% (bug
> 1138520) in b6 disappear entirely in b7.

We know where that one came from, and it had a specific backout in bug 1103036 that we know fixed it. Also, the two paths that those shutdownhangs are waiting on are completely different.

> However, the total percent of shutdownhang crashes in b6 is 8.13% and it
> went up to 8.80% in b7.

Yes, because of the bug here and bug 1127270 rising significantly.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #7)
> (In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #6)
> > On it, I see the js::GCMarker::processMarkStack (for which there were no
> > related changes between b6 and b7) went down from 2.14% to 1.87% from b6 to
> > b7, giving us a baseline multiplier of ~0.874 to indicate "no change".
> 
> Where do you get the idea that the percentage of that total would change in
> that way?

Math? Let's say there were 100000 crashes over a week of b6, of which 2140 (2.14%) were processMarkStack. b7 has only been out for a couple of days, so let's say there were 10000 crashes in b7, of which 187 (1.87%) are processMarkStack. Assuming no changes were made that affect the processMarkStack crashes directly, we should expect to see the same volume (2140) of those after a week. If we assume the first couple of days of b7 are representative of the week, then after a week of b7 (when we hit 2140 processMarkStack crashes) we should have ~114439 total crashes. This is 14439 crashes more than b6 had over an equivalent period, and this increase can be attributed to changes made between b6 and b7.

Now consider any crash that (like processMarkStack) was not affected by these changes. Its volume in b6 should remain the same in b7, so if we had X crashes in b6, we should get X crashes in b7. As a percent of total crashes, X would be X/1000 in b6 and X/1144.39 in b7, so the percentage in b7 would be 0.874 times the percentage value in b6.

My argument is a little circular because I have to assume the bug 1138520 shutdownhang was unaffected in volume in order to apply this multiplier, but what I'm getting at is that the numbers are consistent with the hypothesis.
1) The first two days of any release, including a beta, are not entirely representative of what a week of crashes looks like - even more so over the weekend, when we usually have a crash rate spike and slightly different user behavior.
2) Even taking the weekend spike into account, 37.0b7 has a 10-20% higher crash rate than what we had before. There need to be reasons for that. I looked at my explosiveness reports, which are based on crashes/ADI data, as well as searches, and found that the rising shutdown crashes, led by this bug and bug 1127270, have more volume than the decreasing ones like bug 1138520. The main increase is due to probably-TDRs (the signature of bug 1116812), and bug 1146313 also plays a role.
3) Nicolas already found that the shutdownhang here is connected to code added by bug 1138967, which is the riskiest patch we took for 37.0b7.
Ok, fair enough. Sounds like we might want to back out bug 1138967 then as it seems to be the most likely culprit for both this bug and bug 1146313.
https://msdn.microsoft.com/en-us/library/windows/desktop/ee417025.aspx#Multithreading_and_DXGI

It seems that ImageBridgeChild::Shutdown blocking the main thread while we wait for textures to be destroyed may be the issue here, as DXGI can block waiting on the main thread.

It's not really clear how D3D is determining the 'application window' for this device, as we never intentionally associate it with any window (it's only used for creating resources).

It's possible that creating the device on the ImageBridge thread rather than the main thread would help, but I'm not sure.
Flags: needinfo?(matt.woodrow)
I think we can post messages to the main thread to delete our d3d textures, instead of releasing the final ref from the ImageBridgeChild thread.

The other solution would be to spin the event loop while waiting for ImageBridge shutdown, instead of waiting on the monitor, but that sounds fairly awful.

Any thoughts nical?

Why do we block when shutting down image bridge, but not the compositor? I think we'd potentially have the same problems with releasing textures from the compositor thread otherwise.
Flags: needinfo?(nical.bugzilla)
> 
> It's possible that creating the device on the ImageBridge thread rather than
> the main thread would help, but I'm not sure.

Worth trying. I don't know how much d3d/dxgi expects its objects to be used on the thread on which they are created, but we are less likely to run into 3rd party bugs if we don't use these objects across multiple threads anyway.

(In reply to Matt Woodrow (:mattwoodrow) from comment #12)
> I think we can post messages to the main thread to delete our d3d textures,
> instead of releasing the final ref from the ImageBridgeChild thread.

We can try this too.

> 
> The other solution would be to spin the event loop while waiting for
> ImageBridge shutdown, instead of waiting on the monitor, but that sounds
> fairly awful.

Sounds quite awful indeed!

> Why do we block when shutting down image bridge, but not the compositor? I
> think we'd potentially have the same problems with releasing textures from
> the compositor thread otherwise.

The shutdown of CompositorChild and ImageBridgeChild is similar, except that the ImageBridgeChild's shutdown has to be proxied from the main thread, since the IPDL protocol belongs to the ImageBridgeChild thread and can't be used from anywhere else. In contrast, CompositorChild lives on the main thread, so we can just do things from there.
The reason we have to block the main thread until the ImageBridge is shut down is that this has to finish before the shutdown of XPCOM threads, which happens shortly after.
We'd need NS_ShutdownXPCOM to be completely async in order to not wait for the ImageBridge thread.
Flags: needinfo?(nical.bugzilla)
Assignee: nobody → matt.woodrow
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bas)
Attachment #8582838 - Flags: review?(nical.bugzilla)
Attachment #8582837 - Flags: review?(nical.bugzilla) → review+
Attachment #8582838 - Flags: review?(nical.bugzilla) → review+
https://hg.mozilla.org/mozilla-central/rev/0da613f9e4af
https://hg.mozilla.org/mozilla-central/rev/d1569a99baea
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
38 is marked as affected. What do you think about uplifting to Beta?
Flags: needinfo?(matt.woodrow)
38 shouldn't be affected, so there's no need to uplift right now.
Flags: needinfo?(matt.woodrow)
Ok, my patch is specific to code that only exists on 39, so we'll need a new patch if we decide this is worth fixing.
See Also: → 1174592