Closed Bug 988713 Opened 11 years ago Closed 11 years ago

Race condition when destroying TextureClient & TextureChild.

Categories

(Core :: Graphics: Layers, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla31
blocking-b2g 1.4+
Tracking Status
firefox29 --- wontfix
firefox30 --- fixed
firefox31 --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: jhlin, Assigned: pchang)

References

Details

Attachments

(1 file, 2 obsolete files)

Scenario 1: TextureClient::Finalize() and TextureChild::ActorDestroy() running simultaneously. ActorDestroy(): if (mTextureClient) { <-- true Finalize(): actor->mTextureClient = nullptr; ActorDestroy(): mTextureClient->mActor = nullptr; <-- null mTextureClient Scenrio 2: TextureClient holds the last ref. to TextureChild. ActorDestroy(): mTextureClient->mActor = nullptr; <-- ref count 1 ==> 0. ~TextureChild() mWaitForRecycle = nullptr; <-- invalid mWaitForRecycle
John, can you explain actual STRs by user to cause this problem?
Flags: needinfo?(jolin)
Apply the patch in bug 911046, crash happened in H264 video call.
(In reply to John Lin[:jolin][:jhlin] from comment #0) > Scenario 1: TextureClient::Finalize() and TextureChild::ActorDestroy() > running simultaneously. > ActorDestroy(): > if (mTextureClient) { <-- true > Finalize(): > actor->mTextureClient = nullptr; > ActorDestroy(): > mTextureClient->mActor = nullptr; <-- null mTextureClient > > Scenrio 2: TextureClient holds the last ref. to TextureChild. > ActorDestroy(): > mTextureClient->mActor = nullptr; <-- ref count 1 ==> 0. ~TextureChild() > mWaitForRecycle = nullptr; <-- invalid mWaitForRecycle I'm trying to check scenario 2. At the same time, I would like to redefine the lifecycle of tclient and tchild to avoid crash with multithread condition. Now the tclient and tchild have the reference cycles in their definition. And it is possible that TextureClient::Finalize and TextureChild::ActorDestroy invoked by different threads. Inside these two functions, they are trying to break the reference cycles. From the behavior point of view, the tclient's lifecycle should be longer than tchild. The plan is to add the condition wait inside TextureClient::Finalize[1] to wait the tchild(mActor) and mWaitForRecycle(reference of tclient) destroyed[2]. After TextureClient::Finalize finish, we could make sure tchild is destroyed. [1] http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#510 [2] http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#191 AtomicRefCountedWithFinalize::Release-> TextureClient::Finalize-> ShadowLayerForwarder::RemoveTexture-> TextureClient::ForceRemove()-> mActor->SendRemoveTexture() ...cross IPC... TextureParent::RecvRemoveTexture() or TextureHost::SendDeleteIPDLActor-> PTextureParent::Send__delete__(actor) ...cross IPC... TextureChild OnMessageReceived with Msg___delete____ID ->(actor)->DestroySubtree(Deletion) ->ActorDestroy(Deletion) ->TextureChild::ActorDestroy
Flags: needinfo?(jolin)
(In reply to peter chang[:pchang][:peter] from comment #3) > (In reply to John Lin[:jolin][:jhlin] from comment #0) > The plan is to add the condition wait inside TextureClient::Finalize[1] to > wait the tchild(mActor) and mWaitForRecycle(reference of tclient) > destroyed[2]. After TextureClient::Finalize finish, we could make sure > tchild is destroyed. This is not a good idea because of performance concern. Correct some info from comment 2. Tclient and tchild are reference recycle only if mWaitForRecycle has value. For video playback case, mWaitForRecycle is always nullptr. Besides mWaitForRecycle, tchild owns the raw pointer of tclient. That caused the race condition problem when TextureClient::Finalize and TextureChild::ActorDestroy were invoked by different threads. Tclient holds RefPtr of TextureChild and is going to destroy itself after TextureClient::Finalize. Under this condition, we don't need to free mActor[1] because ~TextureClient() will do. I will attach my WIP patch and the verification result later. [1]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#189
Attached patch 0001-Bug-988713.patch (obsolete) — Splinter Review
Attached the WIP patch and try to prepare environment for verification.
Assignee: nobody → pchang
Blocks: 911046
attachment 8398404 [details] [diff] [review] seems not strong enough to timing dependent defect. It seems better to fix the problem by more timing robust way. Most good way seems to use thread safe weak pointer. But it is not provided now :-(
If we want to fix the problem without using thread safe weak pointer, we need to think about how to fix the problem more wider area.
(In reply to peter chang[:pchang][:peter] from comment #3) > From the behavior point of view, the tclient's lifecycle should be longer > than tchild. In the general case this is not true: TextureClient::Finalize() is what causes TextureChild to send the asyncRemoveTexture message, and in response to that the parent side sends __delete__ which will trigger ActorDestroy on the child side. So in the general case TextureChild is supposed to outlive TextureClient. Here, we seem to be in a case where ActorDestroy is called before/during Finalize. This must mean that the parent protocol (ImageBridgeChild, or LayerTransactionChild) is being destroyed before the texture client. This looks somewhat similar to bug 966284. Bug 966284 probably won't fix the race entirely if several threads are involved but it's best to know about it to not step on each other's toes.
Note that the mWaitForRecycle stuff is a recent addition that came in with "simple tiling" which is off by default, so if it causes thread safety issues it should be easier to change this logic than rethink completely TextureClient's shutdown. This was introduced as a way to implement texture recycling, and we have another way to implement said recycling in the other tiling implementation so we'll have to decide and which way we want to stick with and make sure it's thread safe.
"ActorDestroy called before/during Finalize" does not happen in normal use case. Therefore it seems better to analyze why it happens.
There is not actual STR to reproduce the crash. Can we have an actual STR to reproduce the crash?
Nical, in current gecko's PTexture implementation Send__delete__() is always sent from parent side. Is there a reason to do it from parent side. I feel Send__delete__() from child side seems better.
Flags: needinfo?(nical.bugzilla)
(In reply to Sotaro Ikeda [:sotaro] from comment #12) > Is there a reason to do it from parent side. Correction: Is there a reason to do it from parent side?
(In reply to Sotaro Ikeda [:sotaro] from comment #13) > Is there a reason to do it from parent side? Yes, if we want to be able to send messages from the compositor to the parent to the child, the only way to make sure that they don't cross (while keeping shutdown asynchronous) is to have the side that decides when to destroy the texture (the child side) send a message (RemoveTexture) which triggers the other side to send __delete__. This way each side knows if it can send a message. If we don't have this kind of hand-shake we can't have the parent side send messages to the child.
Flags: needinfo?(nical.bugzilla)
Thanks! I understand the reason.
(In reply to Sotaro Ikeda [:sotaro] from comment #6) > attachment 8398404 [details] [diff] [review] seems not strong enough to > timing dependent defect. It seems better to fix the problem by more timing > robust way. Most good way seems to use thread safe weak pointer. But it is > not provided now :-( Another way to fix the problem is "always call TextureClient::Finalize()'s implementation in TexureClient's IPC thread". It serializes ActorDestroy() and Finalize(). Almost all TextureClient::Finalize()s are called in IPC thread. Therefore it should not add much delay.
(In reply to Sotaro Ikeda [:sotaro] from comment #11) > There is not actual STR to reproduce the crash. Can we have an actual STR to > reproduce the crash? Please see comment 2 and you can refer below link. https://bugzilla.mozilla.org/show_bug.cgi?id=911046#c6 But I did see the race condition between TextureClient::Finalize() and TextureChild::ActorDestroy().
The following logs are reproduced by video playback. TexureClient::Finalize() was called by Media StateMachine thread. app_534 534 379 91688 30656 ffffffff 00000000 S /system/b2g/plugin-container app_534 540 534 91688 30656 ffffffff 00000000 S ImageBridgeChil app_534 601 534 91688 30656 ffffffff 00000000 S Binder Thread # app_534 619 534 91688 30656 ffffffff 00000000 S Media S~hine #1 03-28 11:29:47.119 534 540 I Gecko : TexureClient::Finalize called this 0x40302ef0 texture child 0x431fb080 03-28 11:29:47.119 534 540 I Gecko : TexureClient::ForceRemove this 0x40302ef0 flag 0x02000000 texturechild 0x431fb080 03-28 11:29:47.139 534 540 I Gecko : TexureClient::Finalize called this 0x40302e80 texture child 0x431fb0c0 03-28 11:29:47.139 534 540 I Gecko : TexureClient::ForceRemove this 0x40302e80 flag 0x02000000 texturechild 0x431fb0c0 03-28 11:29:47.199 534 619 I Gecko : TexureClient::Finalize called this 0x40302ef0 texture child 0x431fb080 03-28 11:29:47.199 534 540 I Gecko : TexureClient::ForceRemove this 0x40302ef0 flag 0x02000000 texturechild 0x431fb080 03-28 11:29:47.199 534 540 I Gecko : TextureChild::ActorDestroy this 0x431fb080 texture client 0x40302ef0 03-28 11:29:47.229 534 540 I Gecko : TexureClient::Finalize called this 0x40302e80 texture child 0x431fb300
(In reply to Sotaro Ikeda [:sotaro] from comment #16) > (In reply to Sotaro Ikeda [:sotaro] from comment #6) > > attachment 8398404 [details] [diff] [review] seems not strong enough to > > timing dependent defect. It seems better to fix the problem by more timing > > robust way. Most good way seems to use thread safe weak pointer. But it is > > not provided now :-( > > Another way to fix the problem is "always call TextureClient::Finalize()'s > implementation in TexureClient's IPC thread". It serializes ActorDestroy() > and Finalize(). Almost all TextureClient::Finalize()s are called in IPC > thread. Therefore it should not add much delay. Another idea is to move [1] to line 514. Then we could avoid the race condition caused by actor's raw pointer. http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#519
(In reply to Sotaro Ikeda [:sotaro] from comment #11) > There is not actual STR to reproduce the crash. Can we have an actual STR to > reproduce the crash? When working on bug 911046 I met this problem after apply the patches there. My test env. is a desktop peer (set up as described in bug 911046 comment 6) and a Nexus 5 with latest m-c Gecko + my patches installed, then connect both to local nightly-gupshup (https://github.com/jesup/nightly-gupshup) server and start a conversation. Usually the device will crash after several minutes.
Attached patch another approach (obsolete) — Splinter Review
With this patch, I could run the STR from comment 20 for 1 hr without crash. And from below log, thread 1995 may crash at ActorDestroy[1] if textureclient got destroyed. [1]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#189 02:35:03.480 1989 2511 I Gecko : TextureClient::Finalize() before RemoveTexture this 0xb1a23ac0 02:35:03.480 1989 2511 I Gecko : TextureClient::Finalize() after RemoveTexture this 0xb1a23ac0 02:35:03.480 1989 1995 I Gecko : TextureChild::ActorDestroy this 0xb011c400 02:35:03.480 1989 2511 I Gecko : ~TextureClient this 0xb1a23ac0 02:35:03.480 1989 2511 I Gecko : ~TextureChild this 0xb011c400
Attachment #8398404 - Attachment is obsolete: true
Attachment #8399193 - Attachment is obsolete: true
Attachment #8399258 - Flags: review?(nical.bugzilla)
Attachment #8399258 - Flags: review?(nical.bugzilla) → review+
Sotaro pointed out to me that even with the patch, we can still have a race condition. We should take an hour during the workweek to discuss about TextureClient's shutdown sequence and fixing this in a more bullet-proof way. We can still take this patch but I am marking the bug "leave open" so that we don't close it before we have a solid solution.
Whiteboard: [leave open]
My point is that attachment 8399258 [details] [diff] [review] is not atomic operation in multi threaded environment, so it could cause the very very rare race condition. If a phone has multiple cpus, this risk becomes higher than one cpu.
Actually we looked at it more closely with Sotaro and Peter and this patch appears to fix all the race scenarios of this bug. Let's land it!
Whiteboard: [leave open]
blocking-b2g: --- → 1.4?
blocking-b2g: 1.4? → backlog
All tests passed except gaia integration testing from B2G Desktop Linux x64 Opt. Checking... https://tbpl.mozilla.org/?tree=Try&rev=038d66beec7c&jobname=b2g_ubuntu64_vm%20try%20opt%20test%20gaia-integration
re-run new try server with latest inbound and gaia integration pass. https://tbpl.mozilla.org/?tree=Try&rev=48db580e7b41
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Blocks: 993736
Nominate to "1.4+", because this bug blocks Bug 993736.
blocking-b2g: backlog → 1.4?
blocking-b2g: 1.4? → 1.4+
Blocks: 985681
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: