988713 - Race condition when destroying TextureClient & TextureChild.

Reporter

Description

•

11 years ago

Scenario 1: TextureClient::Finalize() and TextureChild::ActorDestroy() running simultaneously. ActorDestroy(): if (mTextureClient) { <-- true Finalize(): actor->mTextureClient = nullptr; ActorDestroy(): mTextureClient->mActor = nullptr; <-- null mTextureClient Scenrio 2: TextureClient holds the last ref. to TextureChild. ActorDestroy(): mTextureClient->mActor = nullptr; <-- ref count 1 ==> 0. ~TextureChild() mWaitForRecycle = nullptr; <-- invalid mWaitForRecycle

Sotaro Ikeda [:sotaro]

Comment 1

•

11 years ago

John, can you explain actual STRs by user to cause this problem?

Flags: needinfo?(jolin)

u459114

Comment 2

•

11 years ago

Apply the patch in bug 911046, crash happened in H264 video call.

Peter Chang[:pchang]

Assignee

Comment 3

•

11 years ago

(In reply to John Lin[:jolin][:jhlin] from comment #0) > Scenario 1: TextureClient::Finalize() and TextureChild::ActorDestroy() > running simultaneously. > ActorDestroy(): > if (mTextureClient) { <-- true > Finalize(): > actor->mTextureClient = nullptr; > ActorDestroy(): > mTextureClient->mActor = nullptr; <-- null mTextureClient > > Scenrio 2: TextureClient holds the last ref. to TextureChild. > ActorDestroy(): > mTextureClient->mActor = nullptr; <-- ref count 1 ==> 0. ~TextureChild() > mWaitForRecycle = nullptr; <-- invalid mWaitForRecycle I'm trying to check scenario 2. At the same time, I would like to redefine the lifecycle of tclient and tchild to avoid crash with multithread condition. Now the tclient and tchild have the reference cycles in their definition. And it is possible that TextureClient::Finalize and TextureChild::ActorDestroy invoked by different threads. Inside these two functions, they are trying to break the reference cycles. From the behavior point of view, the tclient's lifecycle should be longer than tchild. The plan is to add the condition wait inside TextureClient::Finalize[1] to wait the tchild(mActor) and mWaitForRecycle(reference of tclient) destroyed[2]. After TextureClient::Finalize finish, we could make sure tchild is destroyed. [1] http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#510 [2] http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#191 AtomicRefCountedWithFinalize::Release-> TextureClient::Finalize-> ShadowLayerForwarder::RemoveTexture-> TextureClient::ForceRemove()-> mActor->SendRemoveTexture() ...cross IPC... TextureParent::RecvRemoveTexture() or TextureHost::SendDeleteIPDLActor-> PTextureParent::Send__delete__(actor) ...cross IPC... TextureChild OnMessageReceived with Msg___delete____ID ->(actor)->DestroySubtree(Deletion) ->ActorDestroy(Deletion) ->TextureChild::ActorDestroy

Flags: needinfo?(jolin)

Peter Chang[:pchang]

Assignee

Comment 4

•

11 years ago

(In reply to peter chang[:pchang][:peter] from comment #3) > (In reply to John Lin[:jolin][:jhlin] from comment #0) > The plan is to add the condition wait inside TextureClient::Finalize[1] to > wait the tchild(mActor) and mWaitForRecycle(reference of tclient) > destroyed[2]. After TextureClient::Finalize finish, we could make sure > tchild is destroyed. This is not a good idea because of performance concern. Correct some info from comment 2. Tclient and tchild are reference recycle only if mWaitForRecycle has value. For video playback case, mWaitForRecycle is always nullptr. Besides mWaitForRecycle, tchild owns the raw pointer of tclient. That caused the race condition problem when TextureClient::Finalize and TextureChild::ActorDestroy were invoked by different threads. Tclient holds RefPtr of TextureChild and is going to destroy itself after TextureClient::Finalize. Under this condition, we don't need to free mActor[1] because ~TextureClient() will do. I will attach my WIP patch and the verification result later. [1]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#189

Peter Chang[:pchang]

Assignee

Comment 5

•

11 years ago

Attached patch 0001-Bug-988713.patch (obsolete) — Details — Splinter Review

Attached the WIP patch and try to prepare environment for verification.

Peter Chang[:pchang]

Assignee

Updated

•

11 years ago

Assignee: nobody → pchang

Peter Chang[:pchang]

Assignee

Updated

•

11 years ago

Blocks: 911046

Sotaro Ikeda [:sotaro]

Comment 6

•

11 years ago

attachment 8398404 [details] [diff] [review] seems not strong enough to timing dependent defect. It seems better to fix the problem by more timing robust way. Most good way seems to use thread safe weak pointer. But it is not provided now :-(

Sotaro Ikeda [:sotaro]

Comment 7

•

11 years ago

If we want to fix the problem without using thread safe weak pointer, we need to think about how to fix the problem more wider area.

Nicolas Silva [:nical]

Comment 8

•

11 years ago

(In reply to peter chang[:pchang][:peter] from comment #3) > From the behavior point of view, the tclient's lifecycle should be longer > than tchild. In the general case this is not true: TextureClient::Finalize() is what causes TextureChild to send the asyncRemoveTexture message, and in response to that the parent side sends __delete__ which will trigger ActorDestroy on the child side. So in the general case TextureChild is supposed to outlive TextureClient. Here, we seem to be in a case where ActorDestroy is called before/during Finalize. This must mean that the parent protocol (ImageBridgeChild, or LayerTransactionChild) is being destroyed before the texture client. This looks somewhat similar to bug 966284. Bug 966284 probably won't fix the race entirely if several threads are involved but it's best to know about it to not step on each other's toes.

Nicolas Silva [:nical]

Comment 9

•

11 years ago

Note that the mWaitForRecycle stuff is a recent addition that came in with "simple tiling" which is off by default, so if it causes thread safety issues it should be easier to change this logic than rethink completely TextureClient's shutdown. This was introduced as a way to implement texture recycling, and we have another way to implement said recycling in the other tiling implementation so we'll have to decide and which way we want to stick with and make sure it's thread safe.

Sotaro Ikeda [:sotaro]

Comment 10

•

11 years ago

"ActorDestroy called before/during Finalize" does not happen in normal use case. Therefore it seems better to analyze why it happens.

Sotaro Ikeda [:sotaro]

Comment 11

•

11 years ago

There is not actual STR to reproduce the crash. Can we have an actual STR to reproduce the crash?

Sotaro Ikeda [:sotaro]

Comment 12

•

11 years ago

Nical, in current gecko's PTexture implementation Send__delete__() is always sent from parent side. Is there a reason to do it from parent side. I feel Send__delete__() from child side seems better.

Flags: needinfo?(nical.bugzilla)

Sotaro Ikeda [:sotaro]

Comment 13

•

11 years ago

(In reply to Sotaro Ikeda [:sotaro] from comment #12) > Is there a reason to do it from parent side. Correction: Is there a reason to do it from parent side?

Nicolas Silva [:nical]

Comment 14

•

11 years ago

(In reply to Sotaro Ikeda [:sotaro] from comment #13) > Is there a reason to do it from parent side? Yes, if we want to be able to send messages from the compositor to the parent to the child, the only way to make sure that they don't cross (while keeping shutdown asynchronous) is to have the side that decides when to destroy the texture (the child side) send a message (RemoveTexture) which triggers the other side to send __delete__. This way each side knows if it can send a message. If we don't have this kind of hand-shake we can't have the parent side send messages to the child.

Flags: needinfo?(nical.bugzilla)

Sotaro Ikeda [:sotaro]

Comment 15

•

11 years ago

Thanks! I understand the reason.

Sotaro Ikeda [:sotaro]

Comment 16

•

11 years ago

(In reply to Sotaro Ikeda [:sotaro] from comment #6) > attachment 8398404 [details] [diff] [review] seems not strong enough to > timing dependent defect. It seems better to fix the problem by more timing > robust way. Most good way seems to use thread safe weak pointer. But it is > not provided now :-( Another way to fix the problem is "always call TextureClient::Finalize()'s implementation in TexureClient's IPC thread". It serializes ActorDestroy() and Finalize(). Almost all TextureClient::Finalize()s are called in IPC thread. Therefore it should not add much delay.

Peter Chang[:pchang]

Assignee

Comment 17

•

11 years ago

(In reply to Sotaro Ikeda [:sotaro] from comment #11) > There is not actual STR to reproduce the crash. Can we have an actual STR to > reproduce the crash? Please see comment 2 and you can refer below link. https://bugzilla.mozilla.org/show_bug.cgi?id=911046#c6 But I did see the race condition between TextureClient::Finalize() and TextureChild::ActorDestroy().

Peter Chang[:pchang]

Assignee

Comment 18

•

11 years ago

The following logs are reproduced by video playback. TexureClient::Finalize() was called by Media StateMachine thread. app_534 534 379 91688 30656 ffffffff 00000000 S /system/b2g/plugin-container app_534 540 534 91688 30656 ffffffff 00000000 S ImageBridgeChil app_534 601 534 91688 30656 ffffffff 00000000 S Binder Thread # app_534 619 534 91688 30656 ffffffff 00000000 S Media S~hine #1 03-28 11:29:47.119 534 540 I Gecko : TexureClient::Finalize called this 0x40302ef0 texture child 0x431fb080 03-28 11:29:47.119 534 540 I Gecko : TexureClient::ForceRemove this 0x40302ef0 flag 0x02000000 texturechild 0x431fb080 03-28 11:29:47.139 534 540 I Gecko : TexureClient::Finalize called this 0x40302e80 texture child 0x431fb0c0 03-28 11:29:47.139 534 540 I Gecko : TexureClient::ForceRemove this 0x40302e80 flag 0x02000000 texturechild 0x431fb0c0 03-28 11:29:47.199 534 619 I Gecko : TexureClient::Finalize called this 0x40302ef0 texture child 0x431fb080 03-28 11:29:47.199 534 540 I Gecko : TexureClient::ForceRemove this 0x40302ef0 flag 0x02000000 texturechild 0x431fb080 03-28 11:29:47.199 534 540 I Gecko : TextureChild::ActorDestroy this 0x431fb080 texture client 0x40302ef0 03-28 11:29:47.229 534 540 I Gecko : TexureClient::Finalize called this 0x40302e80 texture child 0x431fb300

Peter Chang[:pchang]

Assignee

Comment 19

•

11 years ago

(In reply to Sotaro Ikeda [:sotaro] from comment #16) > (In reply to Sotaro Ikeda [:sotaro] from comment #6) > > attachment 8398404 [details] [diff] [review] seems not strong enough to > > timing dependent defect. It seems better to fix the problem by more timing > > robust way. Most good way seems to use thread safe weak pointer. But it is > > not provided now :-( > > Another way to fix the problem is "always call TextureClient::Finalize()'s > implementation in TexureClient's IPC thread". It serializes ActorDestroy() > and Finalize(). Almost all TextureClient::Finalize()s are called in IPC > thread. Therefore it should not add much delay. Another idea is to move [1] to line 514. Then we could avoid the race condition caused by actor's raw pointer. http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#519

John Lin [:jhlin][:jolin]

Reporter

Comment 20

•

11 years ago

(In reply to Sotaro Ikeda [:sotaro] from comment #11) > There is not actual STR to reproduce the crash. Can we have an actual STR to > reproduce the crash? When working on bug 911046 I met this problem after apply the patches there. My test env. is a desktop peer (set up as described in bug 911046 comment 6) and a Nexus 5 with latest m-c Gecko + my patches installed, then connect both to local nightly-gupshup (https://github.com/jesup/nightly-gupshup) server and start a conversation. Usually the device will crash after several minutes.

Peter Chang[:pchang]

Assignee

Comment 21

•

11 years ago

Attached patch another approach (obsolete) — Details — Splinter Review

Peter Chang[:pchang]

Assignee

Comment 22

•

11 years ago

Attached patch fix race condition between TextureClient::Finalize() and TextureChild::ActorDestroy() — Details — Splinter Review

With this patch, I could run the STR from comment 20 for 1 hr without crash. And from below log, thread 1995 may crash at ActorDestroy[1] if textureclient got destroyed. [1]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#189 02:35:03.480 1989 2511 I Gecko : TextureClient::Finalize() before RemoveTexture this 0xb1a23ac0 02:35:03.480 1989 2511 I Gecko : TextureClient::Finalize() after RemoveTexture this 0xb1a23ac0 02:35:03.480 1989 1995 I Gecko : TextureChild::ActorDestroy this 0xb011c400 02:35:03.480 1989 2511 I Gecko : ~TextureClient this 0xb1a23ac0 02:35:03.480 1989 2511 I Gecko : ~TextureChild this 0xb011c400

Attachment #8398404 - Attachment is obsolete: true

Attachment #8399193 - Attachment is obsolete: true

Attachment #8399258 - Flags: review?(nical.bugzilla)

Nicolas Silva [:nical]

Updated

•

11 years ago

Attachment #8399258 - Flags: review?(nical.bugzilla) → review+

Nicolas Silva [:nical]

Comment 23

•

11 years ago

Sotaro pointed out to me that even with the patch, we can still have a race condition. We should take an hour during the workweek to discuss about TextureClient's shutdown sequence and fixing this in a more bullet-proof way. We can still take this patch but I am marking the bug "leave open" so that we don't close it before we have a solid solution.

Whiteboard: [leave open]

Sotaro Ikeda [:sotaro]

Comment 24

•

11 years ago

My point is that attachment 8399258 [details] [diff] [review] is not atomic operation in multi threaded environment, so it could cause the very very rare race condition. If a phone has multiple cpus, this risk becomes higher than one cpu.

Nicolas Silva [:nical]

Comment 25

•

11 years ago

Actually we looked at it more closely with Sotaro and Peter and this patch appears to fix all the race scenarios of this bug. Let's land it!

Whiteboard: [leave open]

Peter Chang[:pchang]

Assignee

Updated

•

11 years ago

blocking-b2g: --- → 1.4?

Peter Chang[:pchang]

Assignee

Comment 26

•

11 years ago

try server link https://tbpl.mozilla.org/?tree=Try&rev=038d66beec7c

Preeti Raghunath(:Preeti)

Updated

•

11 years ago

blocking-b2g: 1.4? → backlog

Peter Chang[:pchang]

Assignee

Comment 27

•

11 years ago

All tests passed except gaia integration testing from B2G Desktop Linux x64 Opt. Checking... https://tbpl.mozilla.org/?tree=Try&rev=038d66beec7c&jobname=b2g_ubuntu64_vm%20try%20opt%20test%20gaia-integration

Peter Chang[:pchang]

Assignee

Comment 28

•

11 years ago

re-run new try server with latest inbound and gaia integration pass. https://tbpl.mozilla.org/?tree=Try&rev=48db580e7b41

Peter Chang[:pchang]

Assignee

Comment 29

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/e2db1c06a933

Carsten Book [:Tomcat]

Comment 30

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/e2db1c06a933

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla31

Sotaro Ikeda [:sotaro]

Updated

•

11 years ago

Blocks: 993736

Sotaro Ikeda [:sotaro]

Comment 31

•

11 years ago

Nominate to "1.4+", because this bug blocks Bug 993736.

blocking-b2g: backlog → 1.4?

Andrew Overholt [:overholt]

Updated

•

11 years ago

blocking-b2g: 1.4? → 1.4+

Sotaro Ikeda [:sotaro]

Updated

•

11 years ago

Blocks: 985681

Ryan VanderMeulen [:RyanVM]

Comment 32

•

11 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/567e247059e0

status-b2g-v1.4: --- → fixed

status-b2g-v2.0: --- → fixed

status-firefox29: --- → wontfix

status-firefox30: --- → fixed

status-firefox31: --- → fixed

0001-Bug-988713.patch 11 years ago Peter Chang[:pchang] 1.42 KB, patch		Details \| Diff \| Splinter Review
another approach 11 years ago Peter Chang[:pchang] 748 bytes, patch		Details \| Diff \| Splinter Review
fix race condition between TextureClient::Finalize() and TextureChild::ActorDestroy() 11 years ago Peter Chang[:pchang] 1.31 KB, patch	nical : review+	Details \| Diff \| Splinter Review