Closed Bug 1223193 Opened 4 years ago Closed 4 years ago

crash in mozilla::layers::ImageBridgeChild::RemoveTextureFromCompositableAsync

Categories

(Core :: Graphics: Layers, defect, critical)

43 Branch
All
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla46
Tracking Status
firefox42 --- unaffected
firefox43 + wontfix
firefox44 + fixed
firefox45 + fixed
firefox46 --- fixed
b2g-v2.5 --- fixed

People

(Reporter: ashughes, Unassigned)

References

Details

(Keywords: crash, topcrash-win)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-fb40c0ff-9da5-4289-bd13-285512151108.
=============================================================
0 	xul.dll 	mozilla::layers::ImageBridgeChild::RemoveTextureFromCompositableAsync(mozilla::layers::AsyncTransactionTracker*, mozilla::layers::CompositableClient*, mozilla::layers::TextureClient*) 	gfx/layers/ipc/ImageBridgeChild.cpp
1 	xul.dll 	mozilla::layers::ImageClient::RemoveTextureWithWaiter(mozilla::layers::TextureClient*, mozilla::layers::AsyncTransactionWaiter*) 	gfx/layers/client/ImageClient.cpp
2 	xul.dll 	mozilla::layers::ImageClientSingle::FlushAllImages(mozilla::layers::AsyncTransactionWaiter*) 	gfx/layers/client/ImageClient.cpp
3 	xul.dll 	mozilla::layers::FlushAllImagesSync 	gfx/layers/ipc/ImageBridgeChild.cpp
4 	xul.dll 	RunnableFunction<void (*)(mozilla::layers::ImageClient*, mozilla::layers::ImageContainer*, mozilla::layers::AsyncTransactionWaiter*), Tuple3<mozilla::layers::ImageClient*, mozilla::layers::ImageContainer*, RefPtr<mozilla::layers::AsyncTransactionWaiter> > >::Run() 	ipc/chromium/src/base/task.h
5 	xul.dll 	base::MessagePumpDefault::Run(base::MessagePump::Delegate*) 	ipc/chromium/src/base/message_pump_default.cc
6 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc
7 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc
8 	xul.dll 	base::Thread::ThreadMain() 	ipc/chromium/src/base/thread.cc
9 	xul.dll 	`anonymous namespace'::ThreadFunc(void*) 	ipc/chromium/src/base/platform_thread_win.cc
10 	kernel32.dll 	BaseThreadInitThunk 	
11 	ntdll.dll 	RtlUserThreadStart 	
12 	kernel32.dll 	BasepReportFault 	
13 	kernel32.dll 	BasepReportFault 	
=============================================================
More reports: https://crash-stats.mozilla.com/report/list?product=Firefox&signature=mozilla%3A%3Alayers%3A%3AImageBridgeChild%3A%3ARemoveTextureFromCompositableAsync
[Tracking Requested - why for this release]: this crash first shows up in Firefox 43 and is currently the #20 crash in Beta. There are no strong device/driver correlations. Facebook.com, about:home, and about:newtab rank high in the URLs.
OK, tracking for 43. Looks like this first was reported on 2015-10-07. So the regression may have been around then.
This is still frequent in 43 late betas, around the #20 topcrash. There are several other high volume crash signatures containing "mozilla::layers::ImageBridgeChild", maybe related?   If they are related then this is probably our top crash on beta.
Depends on: 1207220
The crash reports show that the crash happens during or after gfx is shut(ting) down (from another thread). The fix for this and the long tail of similar shutdown crashes is to ensure that media shuts down completely before gfx (which can't be shut down any later because the destruction of things gfx relies on happens right after gfx's shut down).
Component: Graphics: Layers → Audio/Video: Playback
[Tracking Requested - why for this release]: this is now the 9th topcrash in Release accounting for 1.06% of crashes reported against Firefox 43.0.1.
Thanks Anthony. We should probably track for 44+ but it's too late for a fix for 43. 
jya is this something you could have a look at? Or can you help us find an owner for this bug?
Flags: needinfo?(jyavenard)
from comment 5, looks like gfx has a different opinion on the matter, so likely this bug will just bounce back and forth between teams :)
is there a way to test if gfx is shut down? would be the easiest work around ensuring gfx isn't called when that happens
Flags: needinfo?(nical.bugzilla)
Upon investigating, the relation between ImageClient, TextureClient and ImageBridgeChild isn't threadsafe either and a race is possible.
It's when things are being shutdown while tasks are pending that an issue can occur (not just because the media stack could continue to use gfx after shutdown).

I can't see any easy fix here, maybe a quick work around.
bug 1209724 didn't paper over the issue unfortunately. ImageBridgeChild::sImageBridgeChildSingleton is deleted too late in the process.

(btw access to sImageBridgeChildSingleton isn't thread-safe either)
See Also: → 1209724
Fundamentally unnecessary, because this member is only used for debug build, but I had thought of using this value to detect if we are shut down or not. so seeing that it doesn't hurt..
Attachment #8702804 - Flags: review?(sotaro.ikeda.g)
Attachment #8702804 - Flags: review?(nical.bugzilla)
This really is the same logic as used in bug 1209724. The two bugs really are identical just crashing differently depending on the timing of when the race occurred.

By the time sImageBridgeChildThread is cleared, shutdown has already started and if a proxy call is made on the ImageBridgeChild at this time you will get this crash.

We can’t use sImageBridgeChildSingleton->mShuttingDown to detect if shutting down is happening for two reasons:
1- Accessing sImageBridgeChildSingleton itself isn’t thread-safe (this needs to be addressed separately)
2- ImageBridgeChild::mShuttingDown is set too late in the process, it's really ImageBridgeChild::mShutDown.

I could have replaced !ImageBridgeChild::IsCreated() with ImageBridgeChild::IsShutDown() entirely, but I felt that it didn't matter and it's clearer that way.

Seeing that everything is being shut down at the same time (be it gfx, media or what else), and we can't really easily alter the timing of everything; this is the easiest way to proceed at this stage.

Nical, Sotaro whichever wants to review this first.
Too bad it won't be in 43, the problem could have been foreseen.
Flags: needinfo?(nical.bugzilla)
Attachment #8702805 - Flags: review?(sotaro.ikeda.g)
Attachment #8702805 - Flags: review?(nical.bugzilla)
(In reply to Jean-Yves Avenard [:jya] from comment #14)
> 1- Accessing sImageBridgeChildSingleton itself isn’t thread-safe (this needs
> to be addressed separately)

It doesn't matter. sImageBridgeChildSingleton is assigned at startup before users of ImageBridgeChild are created and cleared at shutdown after all users of ImageBridge *must* have been destoyed. While the latter doesn't hold true we have plenty of other races conditions so adding synchronization primitives around the singleton pointer just paper over some of the issue.

> Seeing that everything is being shut down at the same time (be it gfx, media
> or what else), and we can't really easily alter the timing of everything;
> this is the easiest way to proceed at this stage.

I'm fine with r+ing this, but it is really papering over part of a larger issue. I am the author of many similar shutdown workarounds, I have been piling them up for a long while and new problems keep showing up. All of these issues would not exist if code that depends on gfx (especially the ipc stuff) would be entirely shut down before gfx.

> 
> Nical, Sotaro whichever wants to review this first.
> Too bad it won't be in 43, the problem could have been foreseen.

The problem has been foreseen for quite a while, but my limited knowledge of MediaShutdownManager hasn't been enough for me to figure out why most of the media resources are synchronously destroyed when XPCOM_SHUTDOWN_OBSERVER_ID is fired (which is exactly what we need), but not all of it (where the problems arise). Hard to get help on IRC because timezones and everybody being busy (for very good reasons like EME, etc).
Honestly, my biggest mistake was to start papering over shutdown races years ago, thinking a quick fix in gfx would be easier and sufficient while I should have put some strict assertions in place before things got out of hand.
I am sorry to make it look like I am bouncing gfx issues toward other teams, but after trying to make gfx stuff not crash when several threads are poking at it while it is shutting down for a long while without great success, I think we really need to look for another approach.
Attachment #8702804 - Flags: review?(nical.bugzilla) → review+
Attachment #8702805 - Flags: review?(nical.bugzilla) → review+
Attachment #8702804 - Flags: review?(sotaro.ikeda.g)
Attachment #8702805 - Flags: review?(sotaro.ikeda.g)
> It doesn't matter. sImageBridgeChildSingleton is assigned at startup before
> users of ImageBridgeChild are created and cleared at shutdown after all
> users of ImageBridge *must* have been destoyed. While the latter doesn't
> hold true we have plenty of other races conditions so adding synchronization
> primitives around the singleton pointer just paper over some of the issue.

Well, it is accessed :)

I guess I will have to disagree with you in regards to "harmless" data races.
We can't know what the compiler will generate as the C++ behaviour is undefined.

> I'm fine with r+ing this, but it is really papering over part of a larger
> issue. I am the author of many similar shutdown workarounds, I have been
> piling them up for a long while and new problems keep showing up. All of
> these issues would not exist if code that depends on gfx (especially the ipc
> stuff) would be entirely shut down before gfx.
> 
> > 
> > Nical, Sotaro whichever wants to review this first.
> > Too bad it won't be in 43, the problem could have been foreseen.
> 
> The problem has been foreseen for quite a while, but my limited knowledge of
> MediaShutdownManager hasn't been enough for me to figure out why most of the
> media resources are synchronously destroyed when XPCOM_SHUTDOWN_OBSERVER_ID
> is fired (which is exactly what we need), but not all of it (where the
> problems arise). Hard to get help on IRC because timezones and everybody
> being busy (for very good reasons like EME, etc).

I don't know anything about MediaShutdownManager either :)

The media stack readers (in those backtrace, mostly MediaFormatReader) are working asynchronously. They return a MozPromise when shutting down which will be resolved once shutdown has completed. Assuming the media is the only thing left accessing gfx, we will want to complete gfx shutdown only once all those MozPromises have been resolved.

But having said that, we have multiple threads working concurrently within gfx, all using weak references to their parameters. Everything relies on things happening in a particular order when really there's no guarantee that it will happen in that order.

The gfx elements used by the media stack are ref-counted, so it's possible for the gfx stack to know that there are elements still alive. Unfortunately, as almost everything uses weak ptr, it becomes impossible to track what is freed and what isn't.

> Honestly, my biggest mistake was to start papering over shutdown races years
> ago, thinking a quick fix in gfx would be easier and sufficient while I
> should have put some strict assertions in place before things got out of
> hand.

indeed.
Didn't you say you had started a complete rewrite of this?
The whole TextureClient/TextureServer, the recycling of D3D objects, all of this appears racy to me.
The use of multiple threads with no synchronisation or monitors; plenty of objects accessed concurrently: makes it all fragile.

> I am sorry to make it look like I am bouncing gfx issues toward other teams,
> but after trying to make gfx stuff not crash when several threads are poking
> at it while it is shutting down for a long while without great success, I
> think we really need to look for another approach.

Yeah, we need a different approach so we can guarantee that gfx will be the last thing to shut down. Or maybe make gfx the owner of all gfx related objects (which from the media point of view it already looks that way. Media never creates a gfx element (be it VideoFrameContainer, layers backend etc..), it is only ever passed a reference to them.
It never destroys them either, at worse it calls VideoFrameContainer::ClearCurrentFrame ; seeing that the gfx is the owner, should we stop calling this?
https://hg.mozilla.org/mozilla-central/rev/327f2836d54a
https://hg.mozilla.org/mozilla-central/rev/7fd666d36af5
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Nical, should we uplift this fix to Beta44? I looked at the last few days of data on this crash signature but it doesn't seem to occur on Nightly46 so we don't have a verification to go by. However, given that this is a top crash on 44.0b2/b4, it seems like uplifting is still a good idea.
Flags: needinfo?(nical.bugzilla)
I have managed to reproduce another shutdown race in a VM rather consistently.
Depends on: 1236384
Bug 1236384 is fixing the underlying cause for all those problems, which as Nicolas described is due to the improper handling of the MediaShutdownManager.

So Nicolas my apologies, this issue is really a Audio/Video playback one.
Comment on attachment 8702804 [details] [diff] [review]
0001-Bug-1223193-P1.-Make-ImageBridgeChild-mShuttingDown-.patch

Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]: shutdown crashes
[Describe test coverage new/current, TreeHerder]:
[Risks and why]: low risk, only affect shutdown in cases where we are already in a bad situation today.
[String/UUID change made/needed]:
Flags: needinfo?(nical.bugzilla)
Attachment #8702804 - Flags: approval-mozilla-beta?
Attachment #8702804 - Flags: approval-mozilla-aurora?
Comment on attachment 8702805 [details] [diff] [review]
0002-Bug-1223193-P2.-Cancel-ImageBridge-proxy-functions-i.patch

Approval Request Comment
Same as the previous patch.
Attachment #8702805 - Flags: approval-mozilla-beta?
Attachment #8702805 - Flags: approval-mozilla-aurora?
(In reply to Jean-Yves Avenard [:jya] from comment #21)
> So Nicolas my apologies, this issue is really a Audio/Video playback one.

No problem, a million thanks for your patch in bug 1207220!
Comment on attachment 8702804 [details] [diff] [review]
0001-Bug-1223193-P1.-Make-ImageBridgeChild-mShuttingDown-.patch

Top crash fix, Beta44+, Aurora45+
Attachment #8702804 - Flags: approval-mozilla-beta?
Attachment #8702804 - Flags: approval-mozilla-beta+
Attachment #8702804 - Flags: approval-mozilla-aurora?
Attachment #8702804 - Flags: approval-mozilla-aurora+
Comment on attachment 8702805 [details] [diff] [review]
0002-Bug-1223193-P2.-Cancel-ImageBridge-proxy-functions-i.patch

Beta44+, Aurora45+
Attachment #8702805 - Flags: approval-mozilla-beta?
Attachment #8702805 - Flags: approval-mozilla-beta+
Attachment #8702805 - Flags: approval-mozilla-aurora?
Attachment #8702805 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.