Closed
Bug 988713
Opened 11 years ago
Closed 11 years ago
Race condition when destroying TextureClient & TextureChild.
Categories
(Core :: Graphics: Layers, defect)
Tracking
()
People
(Reporter: jhlin, Assigned: pchang)
References
Details
Attachments
(1 file, 2 obsolete files)
Scenario 1: TextureClient::Finalize() and TextureChild::ActorDestroy() running simultaneously.
ActorDestroy():
if (mTextureClient) { <-- true
Finalize():
actor->mTextureClient = nullptr;
ActorDestroy():
mTextureClient->mActor = nullptr; <-- null mTextureClient
Scenrio 2: TextureClient holds the last ref. to TextureChild.
ActorDestroy():
mTextureClient->mActor = nullptr; <-- ref count 1 ==> 0. ~TextureChild()
mWaitForRecycle = nullptr; <-- invalid mWaitForRecycle
Comment 1•11 years ago
|
||
John, can you explain actual STRs by user to cause this problem?
Flags: needinfo?(jolin)
Apply the patch in bug 911046, crash happened in H264 video call.
| Assignee | ||
Comment 3•11 years ago
|
||
(In reply to John Lin[:jolin][:jhlin] from comment #0)
> Scenario 1: TextureClient::Finalize() and TextureChild::ActorDestroy()
> running simultaneously.
> ActorDestroy():
> if (mTextureClient) { <-- true
> Finalize():
> actor->mTextureClient = nullptr;
> ActorDestroy():
> mTextureClient->mActor = nullptr; <-- null mTextureClient
>
> Scenrio 2: TextureClient holds the last ref. to TextureChild.
> ActorDestroy():
> mTextureClient->mActor = nullptr; <-- ref count 1 ==> 0. ~TextureChild()
> mWaitForRecycle = nullptr; <-- invalid mWaitForRecycle
I'm trying to check scenario 2.
At the same time, I would like to redefine the lifecycle of tclient and tchild to avoid crash with multithread condition.
Now the tclient and tchild have the reference cycles in their definition. And it is possible that TextureClient::Finalize and TextureChild::ActorDestroy invoked by different threads.
Inside these two functions, they are trying to break the reference cycles.
From the behavior point of view, the tclient's lifecycle should be longer than tchild.
The plan is to add the condition wait inside TextureClient::Finalize[1] to wait the tchild(mActor) and mWaitForRecycle(reference of tclient) destroyed[2]. After TextureClient::Finalize finish, we could make sure tchild is destroyed.
[1] http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#510
[2] http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#191
AtomicRefCountedWithFinalize::Release->
TextureClient::Finalize->
ShadowLayerForwarder::RemoveTexture->
TextureClient::ForceRemove()->
mActor->SendRemoveTexture()
...cross IPC...
TextureParent::RecvRemoveTexture() or TextureHost::SendDeleteIPDLActor->
PTextureParent::Send__delete__(actor)
...cross IPC...
TextureChild OnMessageReceived with Msg___delete____ID
->(actor)->DestroySubtree(Deletion)
->ActorDestroy(Deletion)
->TextureChild::ActorDestroy
Flags: needinfo?(jolin)
| Assignee | ||
Comment 4•11 years ago
|
||
(In reply to peter chang[:pchang][:peter] from comment #3)
> (In reply to John Lin[:jolin][:jhlin] from comment #0)
> The plan is to add the condition wait inside TextureClient::Finalize[1] to
> wait the tchild(mActor) and mWaitForRecycle(reference of tclient)
> destroyed[2]. After TextureClient::Finalize finish, we could make sure
> tchild is destroyed.
This is not a good idea because of performance concern.
Correct some info from comment 2.
Tclient and tchild are reference recycle only if mWaitForRecycle has value. For video playback case, mWaitForRecycle is always nullptr.
Besides mWaitForRecycle, tchild owns the raw pointer of tclient. That caused the race condition problem when TextureClient::Finalize and TextureChild::ActorDestroy were invoked by different threads.
Tclient holds RefPtr of TextureChild and is going to destroy itself after TextureClient::Finalize. Under this condition, we don't need to free mActor[1] because ~TextureClient() will do.
I will attach my WIP patch and the verification result later.
[1]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#189
| Assignee | ||
Comment 5•11 years ago
|
||
Attached the WIP patch and try to prepare environment for verification.
| Assignee | ||
Updated•11 years ago
|
Assignee: nobody → pchang
Comment 6•11 years ago
|
||
attachment 8398404 [details] [diff] [review] seems not strong enough to timing dependent defect. It seems better to fix the problem by more timing robust way. Most good way seems to use thread safe weak pointer. But it is not provided now :-(
Comment 7•11 years ago
|
||
If we want to fix the problem without using thread safe weak pointer, we need to think about how to fix the problem more wider area.
Comment 8•11 years ago
|
||
(In reply to peter chang[:pchang][:peter] from comment #3)
> From the behavior point of view, the tclient's lifecycle should be longer
> than tchild.
In the general case this is not true: TextureClient::Finalize() is what causes TextureChild to send the asyncRemoveTexture message, and in response to that the parent side sends __delete__ which will trigger ActorDestroy on the child side.
So in the general case TextureChild is supposed to outlive TextureClient.
Here, we seem to be in a case where ActorDestroy is called before/during Finalize. This must mean that the parent protocol (ImageBridgeChild, or LayerTransactionChild) is being destroyed before the texture client. This looks somewhat similar to bug 966284. Bug 966284 probably won't fix the race entirely if several threads are involved but it's best to know about it to not step on each other's toes.
Comment 9•11 years ago
|
||
Note that the mWaitForRecycle stuff is a recent addition that came in with "simple tiling" which is off by default, so if it causes thread safety issues it should be easier to change this logic than rethink completely TextureClient's shutdown. This was introduced as a way to implement texture recycling, and we have another way to implement said recycling in the other tiling implementation so we'll have to decide and which way we want to stick with and make sure it's thread safe.
Comment 10•11 years ago
|
||
"ActorDestroy called before/during Finalize" does not happen in normal use case. Therefore it seems better to analyze why it happens.
Comment 11•11 years ago
|
||
There is not actual STR to reproduce the crash. Can we have an actual STR to reproduce the crash?
Comment 12•11 years ago
|
||
Nical, in current gecko's PTexture implementation Send__delete__() is always sent from parent side. Is there a reason to do it from parent side. I feel Send__delete__() from child side seems better.
Flags: needinfo?(nical.bugzilla)
Comment 13•11 years ago
|
||
(In reply to Sotaro Ikeda [:sotaro] from comment #12)
> Is there a reason to do it from parent side.
Correction:
Is there a reason to do it from parent side?
Comment 14•11 years ago
|
||
(In reply to Sotaro Ikeda [:sotaro] from comment #13)
> Is there a reason to do it from parent side?
Yes, if we want to be able to send messages from the compositor to the parent to the child, the only way to make sure that they don't cross (while keeping shutdown asynchronous) is to have the side that decides when to destroy the texture (the child side) send a message (RemoveTexture) which triggers the other side to send __delete__.
This way each side knows if it can send a message. If we don't have this kind of hand-shake we can't have the parent side send messages to the child.
Flags: needinfo?(nical.bugzilla)
Comment 15•11 years ago
|
||
Thanks! I understand the reason.
Comment 16•11 years ago
|
||
(In reply to Sotaro Ikeda [:sotaro] from comment #6)
> attachment 8398404 [details] [diff] [review] seems not strong enough to
> timing dependent defect. It seems better to fix the problem by more timing
> robust way. Most good way seems to use thread safe weak pointer. But it is
> not provided now :-(
Another way to fix the problem is "always call TextureClient::Finalize()'s implementation in TexureClient's IPC thread". It serializes ActorDestroy() and Finalize(). Almost all TextureClient::Finalize()s are called in IPC thread. Therefore it should not add much delay.
| Assignee | ||
Comment 17•11 years ago
|
||
(In reply to Sotaro Ikeda [:sotaro] from comment #11)
> There is not actual STR to reproduce the crash. Can we have an actual STR to
> reproduce the crash?
Please see comment 2 and you can refer below link.
https://bugzilla.mozilla.org/show_bug.cgi?id=911046#c6
But I did see the race condition between TextureClient::Finalize() and TextureChild::ActorDestroy().
| Assignee | ||
Comment 18•11 years ago
|
||
The following logs are reproduced by video playback.
TexureClient::Finalize() was called by Media StateMachine thread.
app_534 534 379 91688 30656 ffffffff 00000000 S /system/b2g/plugin-container
app_534 540 534 91688 30656 ffffffff 00000000 S ImageBridgeChil
app_534 601 534 91688 30656 ffffffff 00000000 S Binder Thread #
app_534 619 534 91688 30656 ffffffff 00000000 S Media S~hine #1
03-28 11:29:47.119 534 540 I Gecko : TexureClient::Finalize called this 0x40302ef0 texture child 0x431fb080
03-28 11:29:47.119 534 540 I Gecko : TexureClient::ForceRemove this 0x40302ef0 flag 0x02000000 texturechild 0x431fb080
03-28 11:29:47.139 534 540 I Gecko : TexureClient::Finalize called this 0x40302e80 texture child 0x431fb0c0
03-28 11:29:47.139 534 540 I Gecko : TexureClient::ForceRemove this 0x40302e80 flag 0x02000000 texturechild 0x431fb0c0
03-28 11:29:47.199 534 619 I Gecko : TexureClient::Finalize called this 0x40302ef0 texture child 0x431fb080
03-28 11:29:47.199 534 540 I Gecko : TexureClient::ForceRemove this 0x40302ef0 flag 0x02000000 texturechild 0x431fb080
03-28 11:29:47.199 534 540 I Gecko : TextureChild::ActorDestroy this 0x431fb080 texture client 0x40302ef0
03-28 11:29:47.229 534 540 I Gecko : TexureClient::Finalize called this 0x40302e80 texture child 0x431fb300
| Assignee | ||
Comment 19•11 years ago
|
||
(In reply to Sotaro Ikeda [:sotaro] from comment #16)
> (In reply to Sotaro Ikeda [:sotaro] from comment #6)
> > attachment 8398404 [details] [diff] [review] seems not strong enough to
> > timing dependent defect. It seems better to fix the problem by more timing
> > robust way. Most good way seems to use thread safe weak pointer. But it is
> > not provided now :-(
>
> Another way to fix the problem is "always call TextureClient::Finalize()'s
> implementation in TexureClient's IPC thread". It serializes ActorDestroy()
> and Finalize(). Almost all TextureClient::Finalize()s are called in IPC
> thread. Therefore it should not add much delay.
Another idea is to move [1] to line 514. Then we could avoid the race condition caused by actor's raw pointer.
http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#519
| Reporter | ||
Comment 20•11 years ago
|
||
(In reply to Sotaro Ikeda [:sotaro] from comment #11)
> There is not actual STR to reproduce the crash. Can we have an actual STR to
> reproduce the crash?
When working on bug 911046 I met this problem after apply the patches there.
My test env. is a desktop peer (set up as described in bug 911046 comment 6) and a Nexus 5 with latest m-c Gecko + my patches installed, then connect both to local nightly-gupshup (https://github.com/jesup/nightly-gupshup) server and start a conversation. Usually the device will crash after several minutes.
| Assignee | ||
Comment 21•11 years ago
|
||
| Assignee | ||
Comment 22•11 years ago
|
||
With this patch, I could run the STR from comment 20 for 1 hr without crash.
And from below log, thread 1995 may crash at ActorDestroy[1] if textureclient got destroyed.
[1]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#189
02:35:03.480 1989 2511 I Gecko : TextureClient::Finalize() before RemoveTexture this 0xb1a23ac0
02:35:03.480 1989 2511 I Gecko : TextureClient::Finalize() after RemoveTexture this 0xb1a23ac0
02:35:03.480 1989 1995 I Gecko : TextureChild::ActorDestroy this 0xb011c400
02:35:03.480 1989 2511 I Gecko : ~TextureClient this 0xb1a23ac0
02:35:03.480 1989 2511 I Gecko : ~TextureChild this 0xb011c400
Attachment #8398404 -
Attachment is obsolete: true
Attachment #8399193 -
Attachment is obsolete: true
Attachment #8399258 -
Flags: review?(nical.bugzilla)
Updated•11 years ago
|
Attachment #8399258 -
Flags: review?(nical.bugzilla) → review+
Comment 23•11 years ago
|
||
Sotaro pointed out to me that even with the patch, we can still have a race condition. We should take an hour during the workweek to discuss about TextureClient's shutdown sequence and fixing this in a more bullet-proof way. We can still take this patch but I am marking the bug "leave open" so that we don't close it before we have a solid solution.
Whiteboard: [leave open]
Comment 24•11 years ago
|
||
My point is that attachment 8399258 [details] [diff] [review] is not atomic operation in multi threaded environment, so it could cause the very very rare race condition. If a phone has multiple cpus, this risk becomes higher than one cpu.
Comment 25•11 years ago
|
||
Actually we looked at it more closely with Sotaro and Peter and this patch appears to fix all the race scenarios of this bug. Let's land it!
Whiteboard: [leave open]
| Assignee | ||
Updated•11 years ago
|
blocking-b2g: --- → 1.4?
| Assignee | ||
Comment 26•11 years ago
|
||
try server link
https://tbpl.mozilla.org/?tree=Try&rev=038d66beec7c
Updated•11 years ago
|
blocking-b2g: 1.4? → backlog
| Assignee | ||
Comment 27•11 years ago
|
||
All tests passed except gaia integration testing from B2G Desktop Linux x64 Opt.
Checking...
https://tbpl.mozilla.org/?tree=Try&rev=038d66beec7c&jobname=b2g_ubuntu64_vm%20try%20opt%20test%20gaia-integration
| Assignee | ||
Comment 28•11 years ago
|
||
re-run new try server with latest inbound and gaia integration pass.
https://tbpl.mozilla.org/?tree=Try&rev=48db580e7b41
| Assignee | ||
Comment 29•11 years ago
|
||
Comment 30•11 years ago
|
||
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Comment 31•11 years ago
|
||
Nominate to "1.4+", because this bug blocks Bug 993736.
blocking-b2g: backlog → 1.4?
Updated•11 years ago
|
blocking-b2g: 1.4? → 1.4+
Comment 32•11 years ago
|
||
status-b2g-v1.4:
--- → fixed
status-b2g-v2.0:
--- → fixed
status-firefox29:
--- → wontfix
status-firefox30:
--- → fixed
status-firefox31:
--- → fixed
You need to log in
before you can comment on or make changes to this bug.
Description
•