Closed Bug 1205559 Opened 4 years ago Closed 4 years ago

Access to TextureClient / TextureChild not thread-safe

Categories

(Core :: Graphics: Layers, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla44
Tracking Status
firefox43 --- affected
firefox44 --- fixed

People

(Reporter: jya, Assigned: nical)

References

Details

Attachments

(1 file, 1 obsolete file)

Investigation of bug 1072313 indicates that we have a race occurring upon freeing the mKeep object in TextureChild .

In https://treeherder.mozilla.org/#/jobs?repo=try&revision=750259489f71 ; logging was added to find leaks on the KeepAlive object.

In 30% of the MacIOSurface leak we see:
TEST-UNEXPECTED-FAIL | leakcheck | default process: 88 bytes leaked (KeepAlive, MacIOSurface)

The simple attempt to protect access to mKeep with a monitor:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b109b5578839

We sees that we have no more KeepAlive leaks (though still plenty of MacIOSurface leaks)

Now, obviously protecting mKeep with a Monitor isn't the solution as we still have TextureClient::mActor and TextureChild::mTextureClient that are accessed in a non-threadsafe manner.

It still indicates a problem.

The fix is likely non-trivial as we probably don't want to use a monitor there which would likely have severe performance penalties.

I have some more logs and assert to prove that points coming shortly.
Things like:
http://mxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#611

  // Always make a temporary strong reference to the actor before we use it,
  // in case TextureChild::ActorDestroy might null mActor concurrently.
  RefPtr<TextureChild> actor = mActor;

This is certainly not thread-safe ...

TextureChild::ActorDestroy accesses both TextureChild::mKeep and TextureClient::mActor and can do so concurrently to ~TextureClient
Assignee: nobody → nical.bugzilla
Flags: needinfo?(nical.bugzilla)
Depends on: 949347
The two problems are:
 - We should not have to manually call TextureClient::ForceRemove (which we when shutting down the ImageBridge protocol for all textures that are still alive (and there is no sane reason for any texture to be kept alive that late after half of gfx has already been shut down). Right now ther can be a race between this manual call to ForceRemove and the TextureClient actually starting its late deallocation on another thread.
 - A race when code touches the ipdl actor outside the ipdl thread (and ~TextureClient can run off the ipdl thread so it should not do things like KeppUntilDeallocation). There's a patch in bug 1072313 which adds a hook to run destruction code on the ipdl thread.

ForceRemove and ActorDestroy can only run in the ipdl thread so they can't race with each other.
Flags: needinfo?(nical.bugzilla)
I think this is the easiest way to solve the problem without a complete rethink of the threading model.
Attachment #8662354 - Flags: review?(nical.bugzilla)
(In reply to Nicolas Silva [:nical] from comment #2)
> ForceRemove and ActorDestroy can only run in the ipdl thread so they can't
> race with each other.

Are you sure?

ActorDestroy I thought I saw it being called from the main thread from time to time.

If it is ; then running KeppUntilDeallocation would be a solution.

However, what about the other methods accessing TextureClient::mActor ? Are they all always running on the ipdl thread?
If not; you still have a race seeing that mActor can be cleared.
Flags: needinfo?(nical.bugzilla)
I added an assert on ActorDestroy testing that we're never on the main thread:

it hits it instantly.

Assertion failure: !NS_IsMainThread(), at /Users/jyavenard/Work/Mozilla/mozilla-central/gfx/layers/client/TextureClient.cpp:170
#01: mozilla::layers::TextureChild::ActorDestroy(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x17c6599]
#02: mozilla::layers::PTextureChild::DestroySubtree(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0xcfe2cb]
#03: mozilla::layers::PTextureChild::OnMessageReceived(IPC::Message const&)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0xcfdf4e]
#04: mozilla::layers::PCompositorChild::OnMessageReceived(IPC::Message const&)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0xed4073]
#05: mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x873d5d]
#06: mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message const&)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x8730ed]
#07: mozilla::ipc::MessageChannel::OnMaybeDequeueOne()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x86ef32]
#08: void DispatchToMethod<mozilla::ipc::MessageChannel, bool (mozilla::ipc::MessageChannel::*)()>(mozilla::ipc::MessageChannel*, bool (mozilla::ipc::MessageChannel::*)(), Tuple0 const&)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x893443]
#09: RunnableMethod<mozilla::ipc::MessageChannel, bool (mozilla::ipc::MessageChannel::*)(), Tuple0>::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x8932ee]
#10: mozilla::ipc::MessageChannel::RefCountedTask::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x88f908]
#11: mozilla::ipc::MessageChannel::DequeueTask::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x88f774]
#12: MessageLoop::RunTask(Task*)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7e3930]
#13: MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7e3e9f]
#14: MessageLoop::DoWork()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x7e40c4]
#15: mozilla::ipc::DoWorkRunnable::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x876e35]
#16: nsThread::ProcessNextEvent(bool, bool*)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x198b55]
#17: NS_ProcessNextEvent(nsIThread*, bool)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x217e97]
#18: nsThread::Shutdown()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x198373]
#19: void nsRunnableMethodArguments<>::apply<nsIThread, nsresult (nsIThread::*)()>(nsIThread*, nsresult (nsIThread::*)())[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x1c1323]
#20: nsRunnableMethodImpl<nsresult (nsIThread::*)(), true>::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x1c1156]
#21: nsThread::ProcessNextEvent(bool, bool*)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x198b55]
#22: NS_ProcessPendingEvents(nsIThread*, unsigned int)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x217d07]
#23: nsBaseAppShell::NativeEventCallback()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x3d15506]
#24: nsAppShell::ProcessGeckoEvents(void*)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x3da56fd]
#25: __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__[/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation +0xaa621]
#26: __CFRunLoopDoSources0[/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation +0x89e1c]
#27: __CFRunLoopRun[/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation +0x8933f]
#28: CFRunLoopRunSpecific[/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation +0x88d38]
#29: RunCurrentEventLoopInMode[/System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/HIToolbox.framework/Versions/A/HIToolbox +0x30d55]
#30: ReceiveNextEventCommon[/System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/HIToolbox.framework/Versions/A/HIToolbox +0x30a97]
#31: _BlockUntilNextEventMatchingListInModeWithFilter[/System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/HIToolbox.framework/Versions/A/HIToolbox +0x309cf]
#32: _DPSNextEvent[/System/Library/Frameworks/AppKit.framework/Versions/C/AppKit +0x49236]
#33: -[NSApplication _nextEventMatchingEventMask:untilDate:inMode:dequeue:][/System/Library/Frameworks/AppKit.framework/Versions/C/AppKit +0x48665]
#34: -[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:][/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x3da4237]
#35: -[NSApplication run][/System/Library/Frameworks/AppKit.framework/Versions/C/AppKit +0x3d1c8]
#36: nsAppShell::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x3da60ec]
#37: nsAppStartup::Run()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x4df6541]
#38: XREMain::XRE_mainRun()[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x4ecde4c]
#39: XREMain::XRE_main(int, char**, nsXREAppData const*)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x4ece6f9]
#40: XRE_main[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x4ecebc7]
#41: do_main(int, char**, nsIFile*)[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/firefox +0x297f]
#42: main[/Users/jyavenard/Work/Mozilla/mozilla-central/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/firefox +0x1da6]
(In reply to Jean-Yves Avenard [:jya] from comment #4)
> Are you sure?
> 
> ActorDestroy I thought I saw it being called from the main thread from time
> to time.

Yes, and you are also right: what I called the ipdl thread is the thread that the texture's ipdl actor belongs to, and that can be either the ImageBridge thread (if the texture is created for async video playback), or the main thread (general case). Video stuff is what tends to keep references to the textures outside of the ipdl thread (ImageBridge in this case).

> However, what about the other methods accessing TextureClient::mActor ? Are
> they all always running on the ipdl thread?

TextureClient::mActor is only cleared after the last reference to the TextureClient is gone because TextureClient's reference counting is what triggers the awful and convoluted destruction sequence (I am trying to clean that in another bug but I discover new things with every attempt), unless the texture has survived longer than the ImageBridge (in which case we force the destruction on the ImageBridge thread which can cause the remaining race). When that happens we are basically in a state where a lot of bad things can happen because a lot of components that gfx use have already been destroyed or and the remaining ones are about to be destroyed in a bunch of cycles. a leak is probably one of the least preoccupying thing that can happen in this situation, so I would prefer that we figure out how to make sure things like video decoders let go of all their textures before we get there, rather than try to make things work in this situation.

I don't know what's holding the texture for that long, but it could be that it needs to be shut down earlier. For example if it is being shut down on the event NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID, it should probably observe NS_XPCOM_SHUTDOWN_OBSERVER_ID instead. ImageBridge is basically destroyed as late as possible (after that there is NS_XPCOM_SHUTDOWN_THREADS and then all threads must be gone).
Flags: needinfo?(nical.bugzilla)
I did a very crude patch that would cause an assert should any members of TextClient be accessed by two threads concurrently.

I got an assert there:
xul.dll!mozilla::layers::TextureClient::SetRecycleAllocator(mozilla::layers::TextureClientRecycleAllocator *) [TextureClient.cpp:baaa7fdcb26a : 275 + 0xa]


logs:
https://treeherder.mozilla.org/logviewer.html#?job_id=11636252&repo=try

Now I haven't examined what SetRecycleAllocator is doing it could very well do like TextureClient::ForceRemove dispatch a sync task to another TextureClient method which would trigger an assert ; but from the stack trace, the call to SetRecycleAllocator is initiated from WMFMediaDataDecoder.

Tomorrow I'll make it print the backtrace of the thread concurrently accessing the TextureClient ; will give us more information (like there's another crash in RecycleTexture, but this one does seem to be due to a sync dispatch)
(In reply to Jean-Yves Avenard [:jya] from comment #7)
> I did a very crude patch that would cause an assert should any members of
> TextClient be accessed by two threads concurrently.
> 
> I got an assert there:
> xul.dll!mozilla::layers::TextureClient::SetRecycleAllocator(mozilla::layers::
> TextureClientRecycleAllocator *) [TextureClient.cpp:baaa7fdcb26a : 275 + 0xa]
> 
> 

looks like this one just got fixed by bug 1197534.

At least it's comforting to see that the patch mentioned in comment 7 does its job and found a genuine race!
Depends on: 1197534
Still some races / assertion failure, only on windows 7 / 8:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bb4c876a83f7

On debug build:
Assertion failure: !mActor->mKeep, at c:/builds/moz2_slave/try-w64-d-00000000000000000000/build/src/gfx/layers/client/TextureClient.cpp:573
This is TextureClient::KeepUntilFullDeallocation

On opt build this hits the thread safety crash:
which is due to a concurrent access to TextureClient::RecycleTexture.
A pity the printf aren't shown ; it would print which other functions are being access simultaneously.

From earlier backtraces; this appears related to the recycling of D3D textures.
Flags: needinfo?(matt.woodrow)
Yes, it's the D3D9 recycling that appears racy: the concurrent access is in:

05:05:16     INFO -  8  xul.dll!mozilla::layers::TextureClient::SetRecycleAllocator(mozilla::layers::TextureClientRecycleAllocator *) [TextureClient.cpp:bb4c876a83f7 : 282 + 0xa]
05:05:16     INFO -     rsp = 0x000000cb8f0cf2e0   rip = 0x000007fc2df8fe0a
05:05:16     INFO -     Found by: call frame info
05:05:16     INFO -  9  xul.dll!mozilla::layers::TextureClientRecycleAllocator::CreateOrRecycle(mozilla::gfx::SurfaceFormat,mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits>,mozilla::layers::BackendSelector,mozilla::layers::TextureFlags,mozilla::layers::TextureAllocationFlags) [TextureClientRecycleAllocator.cpp:bb4c876a83f7 : 129 + 0xb]
05:05:16     INFO -     rsp = 0x000000cb8f0cf310   rip = 0x000007fc2df8a94b
05:05:16     INFO -     Found by: call frame info
05:05:16     INFO - 10  xul.dll!mozilla::layers::D3D9RecycleAllocator::CreateOrRecycleClient(mozilla::gfx::SurfaceFormat,mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const &) [D3D9SurfaceImage.cpp:bb4c876a83f7 : 228 + 0x1b]
05:05:16     INFO -     rsp = 0x000000cb8f0cf3c0   rip = 0x000007fc2df563df
05:05:16     INFO -     Found by: call frame info
05:05:16     INFO - 11  xul.dll!mozilla::layers::D3D9SurfaceImage::SetData(mozilla::layers::D3D9SurfaceImage::Data const &) [D3D9SurfaceImage.cpp:bb4c876a83f7 : 60 + 0x20]
05:05:16     INFO -     rsp = 0x000000cb8f0cf420   rip = 0x000007fc2df59ce5
05:05:16     INFO -     Found by: call frame info
05:05:16     INFO - 12  xul.dll!mozilla::D3D9DXVA2Manager::CopyToImage(IMFSample *,mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const &,mozilla::layers::ImageContainer *,mozilla::layers::Image * *) [DXVA2Manager.cpp:bb4c876a83f7 : 429 + 0x4e]
05:05:16     INFO -     rsp = 0x000000cb8f0cf530   rip = 0x000007fc2e866e92
05:05:16     INFO -     Found by: call frame info
05:05:16     INFO - 13  xul.dll!mozilla::WMFVideoMFTManager::CreateD3DVideoFrame(IMFSample *,__int64,mozilla::VideoData * *) [WMFVideoMFTManager.cpp:bb4c876a83f7 : 540 + 0x2e]
05:05:16     INFO -     rsp = 0x000000cb8f0cf5c0   rip = 0x000007fc2e867538
05:05:16     INFO -     Found by: call frame info
05:05:16     INFO - 14  xul.dll!mozilla::WMFVideoMFTManager::Output(__int64,nsRefPtr<mozilla::MediaData> &) [WMFVideoMFTManager.cpp:bb4c876a83f7 : 606 + 0x15]
05:05:16     INFO -     rsp = 0x000000cb8f0cf670   rip = 0x000007fc2e869eab
05:05:16     INFO -     Found by: call frame info
05:05:16     INFO - 15  xul.dll!mozilla::WMFMediaDataDecoder::ProcessOutput() [WMFMediaDataDecoder.cpp:bb4c876a83f7 : 153 + 0x13]
05:05:16     INFO -     rsp = 0x000000cb8f0cf6d0   rip = 0x000007fc2e86a096
05:05:16     INFO -     Found by: call frame info
05:05:16     INFO - 16  xul.dll!mozilla::WMFMediaDataDecoder::ProcessDecode(mozilla::MediaRawData *) [WMFMediaDataDecoder.cpp:bb4c876a83f7 : 144 + 0xc]
05:05:16     INFO -     rsp = 0x000000cb8f0cf700   rip = 0x000007fc2e869f9f
05:05:16     INFO -     Found by: call frame info
Depends on: 1207220
Milan - can you prod nical?
Flags: needinfo?(milan)
Is the patch still relevant?
Flags: needinfo?(milan)
The patch is no longer applicable since bug 1072313. However the races are still there. (in particular the allocation / recycling of the D3D textures) ; they are allocated/recycled on multiple threads with no locking in place.
Comment on attachment 8662354 [details] [diff] [review]
Make TextureChild/TextureClient thread-safe.

Review of attachment 8662354 [details] [diff] [review]:
-----------------------------------------------------------------

We don't need to lock around mKeep anymore. It's worth taking the rest of the patch though, even if I haven't seen reports of crashes or leaks related to the other members. That stuff will get rewritten as part of 1200595 and followups but it's taking longer than I hoped.
Attached patch rebased patchSplinter Review
Jean-Yves, this is a rebased version of your patch without the synchronization around mKeep.
Attachment #8662354 - Attachment is obsolete: true
Attachment #8662354 - Flags: review?(nical.bugzilla)
Flags: needinfo?(matt.woodrow)
Attachment #8673520 - Flags: review?(jyavenard)
Comment on attachment 8673520 [details] [diff] [review]
rebased patch

Review of attachment 8673520 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not sure this will be enough to resolve the issues on Windows where a DXVA surface may be used on multiple threads at once.
Attachment #8673520 - Flags: review?(jyavenard) → review+
https://hg.mozilla.org/mozilla-central/rev/09de9d59910d
https://hg.mozilla.org/mozilla-central/rev/cbbc1719a22b
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
Depends on: 1231228
You need to log in before you can comment on or make changes to this bug.