1244883 - crash in RtlEnterCriticalSection | PR_EnterMonitor | mozilla::layers::ImageContainer::ClearImagesFromImageBridge

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Description

•

9 years ago

This bug was filed from the Socorro interface and is report bp-8d6f2243-3ef6-4a22-980e-493e72160127. ============================================================= No particular correlation I could find. Here's one from 45: https://crash-stats.mozilla.com/report/index/7df17f96-c760-457a-b9ca-365192160130

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 1

•

9 years ago

Timothy, anything that looks interesting here?

Flags: needinfo?(tnikkel)

Whiteboard: [gfx-noted]

Timothy Nikkel (:tnikkel)

Comment 2

•

9 years ago

Probably not likely an imagelib issue. ImageContainer's are also used for video, and all of the code in the crash stack lives under gfx/layers.

Component: ImageLib → Graphics: Layers

Flags: needinfo?(tnikkel)

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 3

•

9 years ago

Most of the crash reports involve media, but I'm not sure yet whether that's relevant. On MediaDecoderStateMachine shutdown (which indirectly owns the ImageContainer), ImageBridgeChild::FlushAllImages is called. This queues a sync-ish (using AsyncTransactionWaiter) task on the ImageBridge thread which calls ImageBridgeChild::FlushAllImagesSync. This function acts on the ImageContainer, which stays alive while FlushAllImages is waiting. AsyncTransactionWaiter has a timeout of 50 seconds which it's hitting, AFAICT. So when FlushAllImagesSync finally gets a chance to run, the ImageContainer is already dead. I can't get much more information out of these crash reports, since the real cause happens before the crash. I think should add a gfxDevCrash on AsyncTransactionWaiter timeout to find the actual cause.

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Updated

•

9 years ago

Assignee: nobody → edwin

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 4

•

9 years ago

Attached patch 1244883-assert.patch (obsolete) — Details — Splinter Review

Attachment #8733888 - Flags: review?(nical.bugzilla)

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Updated

•

9 years ago

Keywords: leave-open

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 5

•

9 years ago

Attached patch 1244883-assert.patch — Details — Splinter Review

oops.

Attachment #8733888 - Attachment is obsolete: true

Attachment #8733888 - Flags: review?(nical.bugzilla)

Attachment #8733912 - Flags: review?(nical.bugzilla)

Nicolas Silva [:nical]

Updated

•

9 years ago

Attachment #8733912 - Flags: review?(nical.bugzilla) → review+

Nicolas Silva [:nical]

Comment 6

•

9 years ago

cc :sotaro who is rewriting things related to this code in another bug (I think).

Pulsebot

Comment 7

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/c3c85d59ec72

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 8

•

9 years ago

Comment on attachment 8733912 [details] [diff] [review] 1244883-assert.patch Review of attachment 8733912 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/ipc/AsyncTransactionTracker.cpp @@ +35,5 @@ > printf_stderr("Timeout of waiting transaction complete."); > } > + > + if (count == maxCount) { > + gfxDevCrash(LogReason::AsyncTransactionTimeout) << "Bug 1244883: AsyncTransactionWaiter timed out."; Would it be useful to have mWaitCount in the crash report?

Carsten Book [:Tomcat]

Comment 9

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/c3c85d59ec72

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 10

•

9 years ago

We're going to get this backed out (https://treeherder.mozilla.org/#/jobs?repo=ash&revision=24c5fbde4488&group_state=expanded&filter-searchStr=win%20debug) - we may have just gotten the information we needed.

Ryan VanderMeulen [:RyanVM]

Comment 11

•

9 years ago

Ryan VanderMeulen [:RyanVM]

Comment 12

•

9 years ago

backout bugherder

Merge of backout: https://hg.mozilla.org/mozilla-central/rev/0b1daf1da80a

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 13

•

9 years ago

Edwin, does the data from those try runs help?

Flags: needinfo?(edwin)

Ryan VanderMeulen [:RyanVM]

Updated

•

9 years ago

Depends on: 1259630

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 14

•

9 years ago

More data can't hurt! I find it odd that those tests didn't hit this bug previously, though. Will look today.

Flags: needinfo?(edwin)

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 15

•

9 years ago

Managed to get a dump of the parent process. What I see happening is the compositor process trying to shut down the content process. It eventually reaches EnsureProcessTerminated and waits indefinitely *on the I/O thread* for the content process to die, but doesn't attempt to terminate it [1]. On the content side, when the media stuff shuts down, MediaFormatReader::ReleaseMediaResources() calls ImageContainer::ClearAllImages(), which calls ImageBridgeChild::FlushAllImages. This adds a FlushAllImagesSync() task to the ImageBridge thread and waits on an AsyncTransactionWaiter. [2] On the ImageBridge thread, FlushAllImagesSync() indirectly calls ImageClient::RemoveTextureWithWaiter(). This creates an AsyncTransactionTracker which increments the wait count on the above waiter. [3] That tracker and the TextureClient being destroyed is then passed to ImageBridgeClient::RemoveTextureFromCompositableAsync(). That function sends a RemoveTextureAsync op to the compositor process and makes sure the TextureClient stays alive until we receive a reply from the compositor process by calling CompositableClient::HoldUntilComplete. [4] The compositor process doesn't get the RemoveTextureAsync message before content dies because its I/O thread is blocked until then. Sadness ensues. [1] https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/ipc/chromium/src/chrome/common/process_watcher_win.cc#77 [2] https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/gfx/layers/ipc/ImageBridgeChild.cpp#673 [3] https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/gfx/layers/client/ImageClient.cpp#94 [4] https://dxr.mozilla.org/mozilla-central/rev/68c0b7d6f16ce5bb023e08050102b5f2fe4aacd8/gfx/layers/ipc/ImageBridgeChild.cpp#1240

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 16

•

9 years ago

Oh! Oh... hmm... All of the dumps from Ash are in XPCOM shutdown. XPCOM shutdown usually (always?) happens on the child only once the Content channel is closed. On the parent side, the Content channel closing starts the wait for the child to die (via PContentParent::OnChannelClose -> PContentParent::DestroySubtree -> ContentParent::ActorDestroy -> DelayedDeleteSubprocess -> EnsureProcessTerminated). Ultimately, if we're deleting textures at XPCOM shutdown we have already lost the game. Unfortunately that's when media shutdown happens, which holds onto an ImageContainer until the very end.

Ryan VanderMeulen [:RyanVM]

Comment 17

•

9 years ago

Any chance bug 1215265 would have helped here? Also, the "Force e10s on" patch has had to change a bit. Below is a link to one that'll work on current trunk. https://hg.mozilla.org/try/rev/8d6ed373efa8fc656264c2e0b15daa948040e176

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 18

•

9 years ago

Yes, that looks like it might fix it -- thanks for the heads up! I've updated my m-c and pushed a try run. If green, I'll re-land the diagnostic patch as I'm pretty certain that the "in-the-wild" reports have a different cause (for one thing, they don't even have e10s enabled).

Nicolas Silva [:nical]

Comment 19

•

9 years ago

(In reply to Edwin Flores [:eflores] [:edwin] from comment #16) > XPCOM shutdown usually (always?) happens on the child only once the Content > channel is closed. I hope not, because it doesn a whole lot of stuff that relies on IPC still being up. We should make sure that if the channel is closed when we run the shutdown stuff (that would be because of an abnormal shutdown I assume), we can cancel anything that waits on ipc related stuff. > > Ultimately, if we're deleting textures at XPCOM shutdown we have already > lost the game. Unfortunately that's when media shutdown happens, which holds > onto an ImageContainer until the very end. That's when all of the shutdown of pretty much anything happens and I don't think we'll be able to shut things down earlier. We definitely need ipc to still be up in ShutdownXPCOM in the case of a normal shtudown, and maybe just kill the process without shutting down XPCOM for abnormal shutdowns. (In reply to Edwin Flores [:eflores] [:edwin] from comment #18) > Yes, that looks like it might fix it -- thanks for the heads up! Ok good, although this stuff is not enabled on Windows yet.

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 20

•

9 years ago

(In reply to Edwin Flores [:eflores] [:edwin] from comment #18) > Yes, that looks like it might fix it -- thanks for the heads up! > > I've updated my m-c and pushed a try run. If green, I'll re-land the > diagnostic patch as I'm pretty certain that the "in-the-wild" reports have a > different cause (for one thing, they don't even have e10s enabled). Hmm, nope. Still happens.

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 21

•

9 years ago

(In reply to Nicolas Silva [:nical] from comment #19) > (In reply to Edwin Flores [:eflores] [:edwin] from comment #16) > > XPCOM shutdown usually (always?) happens on the child only once the Content > > channel is closed. > > I hope not, because it doesn a whole lot of stuff that relies on IPC still > being up. We should make sure that if the channel is closed when we run the > shutdown stuff (that would be because of an abnormal shutdown I assume), we > can cancel anything that waits on ipc related stuff. Looks like I was wrong about "usually": this depends on NS_FREE_PERMANENT_DATA being set. That's why it's only showing up in debug builds (although I haven't quite figured out how it's meant to work without that #defined -- it appears we just hard kill ALL THE THINGS and forgo XPCOM shutdown?). The way it is shutting down in this case (based on gratuitous addition of printfs) is, in glorious high detail: On the parent side: - ContentParent::NotifyTabDestroyed is called - That dispatches a call to ShutDownProcess(SEND_SHUTDOWN_MESSAGE) - That sends SendShutdown to the child On the child: - RecvShutdown is received - That does some light shutdown stuff, including firing a notification with subject "content-child-shutdown" -- we could move the shutdown to here, maybe? - That sends SendFinishShutdown to the parent. On the parent again: - RecvFinishShutdown is received - That calls ShutDownProcess(CLOSE_CHANNEL) - That calls Close() The channel is now closed. Simultaneously-ish: - PContentChild/Parent::OnChannelClose is eventually called - That calls DestroySubtree - Which calls ContentChild/Parent::ActorDestroy On the child: - ContentChild::ActorDestroy calls XRE_ShutdownChildProcess if defined(NS_FREE_PERMANENT_DATA) - This tells the message loop to quit. On the parent, around the same time: - ContentParent::ActorDestroy does some things -- lots of things. Among these things is dispatching a call to DelayedDeleteSubprocess for our content process - That then dispatches a DeleteTask to the IO message loop for that content process - That destroys our GeckoChildProcessHost - ~GeckoChildProcessHost calls ProcessWatcher::EnsureProcessTerminated with force = !defined(NS_FREE_PERMANENT_DATA) - The parent patiently waits for the child to die. [1] Meanwhile, on the child main thread in XRE_InitChildProcess: - The UI message loop returns from Run() - ContentProcess::CleanUp is called - That calls ScopedXREEmbed::Stop - Which calls XRE_TermEmbedding - Which calls NS_ShutdownXPCOM - Which finally calls ShutdownXPCOM TL;DR -- ShutdownXPCOM happens. Content channel is closed. [1] https://dxr.mozilla.org/mozilla-central/rev/29d5a4175c8b74f45482276a53985cf2568b4be2/ipc/chromium/src/chrome/common/process_watcher_win.cc#77

Sotaro Ikeda [:sotaro]

Updated

•

9 years ago

Comment 22

•

9 years ago

Note that Sotaro just removed the call to FlushAllImagesSync in bug 1209724. PContent "opens" but doesn't "manage" PCompositorBridge. Are they actually using the same channel? If they do I'd expect this issue to have showed up a lot more. PContent also opens PImageBridge and I am certain that PCompositorBridge and PImageBridge do not use the same channel, so we may be trying to fix a problem with PContent we don't have. Moving the gfx shutdown earlier is tricky because the earlier we shut gfx down, the earlier everything that depends on gfx needs to be shut down as well, and that pulls an uncomfortable amount of dependencies. I would rather make it so we call SendFinishShutdown from ShutdownXPCOM (if we have to, I still think we don't).

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 23

•

9 years ago

(In reply to Nicolas Silva [:nical] from comment #22) > PContent "opens" but doesn't "manage" PCompositorBridge. Are they actually > using the same channel? If they do I'd expect this issue to have showed up a > lot more. > PContent also opens PImageBridge and I am certain that PCompositorBridge and > PImageBridge do not use the same channel, so we may be trying to fix a > problem with PContent we don't have. They don't use the same channel. But all IPC goes through the I/O thread (at least on the parent).

Nicolas Silva [:nical]

Comment 24

•

9 years ago

(In reply to Nicolas Silva [:nical] from comment #22) > Note that Sotaro just removed the call to FlushAllImagesSync in bug 1209724. Correction: he is trying to remove it but that hasn't landed yet.

Nicolas Silva [:nical]

Comment 25

•

9 years ago

(In reply to Edwin Flores [:eflores] [:edwin] from comment #23) > They don't use the same channel. But all IPC goes through the I/O thread (at > least on the parent). Then what we need to fix is the DeleteTask on the parent IO loop being sent too early.

Nicolas Silva [:nical]

Comment 26

•

9 years ago

(In reply to Nicolas Silva [:nical] from comment #22) > Note that Sotaro just removed the call to FlushAllImagesSync in bug 1209724. Correction (bis): I meant bug 1254029.

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Updated

•

9 years ago

Depends on: 1262898

Pulsebot

Comment 27

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/57e1f54bc53a

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 28

•

9 years ago

Bug 1262898 will have fixed the shutdown problems. Re-landing diagnostic patch.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 29

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/57e1f54bc53a

Jim Mathies [:jimm]

Updated

•

9 years ago

Blocks: e10s-crashes

Jim Mathies [:jimm]

Comment 30

•

9 years ago

hey edwin, any new information here?

Flags: needinfo?(edwin)

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 31

•

9 years ago

This crash is very low volume in nightly and aurora. I don't expect we'll have any new information until Beta 48.

Flags: needinfo?(edwin)

Jim Mathies [:jimm]

Comment 32

•

9 years ago

should we uplift the diagnostics to beta 47?

Flags: needinfo?(edwin)

BugBot [:suhaib / :marco/ :calixte]

Comment 33

•

8 years ago

Crash volume for signature 'RtlEnterCriticalSection | PR_EnterMonitor | mozilla::layers::ImageContainer::ClearImagesFromImageBridge': - nightly (version 50): 0 crash from 2016-06-06. - aurora (version 49): 0 crash from 2016-06-07. - beta (version 48): 0 crash from 2016-06-06. - release (version 47): 5285 crashes from 2016-05-31. - esr (version 45): 87 crashes from 2016-04-07. Crash volume on the last weeks: Week N-1 Week N-2 Week N-3 Week N-4 Week N-5 Week N-6 Week N-7 - nightly 0 0 0 0 0 0 0 - aurora 0 0 0 0 0 0 0 - beta 0 0 0 0 0 0 0 - release 879 742 893 835 771 584 240 - esr 5 12 8 12 16 15 6 Affected platform: Windows

status-firefox-esr45: --- → affected

Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin]

Assignee

Comment 34

•

8 years ago

Looks good these days. This signature hasn't been reported since 47, and neither has the diagnostic crash.

Status: NEW → RESOLVED

Closed: 8 years ago

Flags: needinfo?(edwin)

Resolution: --- → WORKSFORME

Firefox Bug Husbandry Bot

Comment 35

•

7 years ago

Removing leave-open keyword from resolved bugs, per :sylvestre.

Keywords: leave-open

Ryan VanderMeulen [:RyanVM]

Updated

•

2 years ago

status-firefox47: affected → wontfix

status-firefox-esr45: affected → wontfix

1244883-assert.patch 9 years ago Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin] 1.51 KB, patch		Details \| Diff \| Splinter Review
1244883-assert.patch 9 years ago Edwin Flores [inactive from 2016-12-01] [:eflores] [:edwin] 1.55 KB, patch	nical : review+	Details \| Diff \| Splinter Review