Closed Bug 997367 Opened 6 years ago Closed 6 years ago

Crash in mozilla::layers::CrossProcessCompositorParent::GetCompositionManager

Categories

(Core :: Graphics: Layers, defect, critical)

30 Branch
ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla32
blocking-b2g 1.4+
Tracking Status
firefox30 --- wontfix
firefox31 --- wontfix
firefox32 --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: tkundu, Assigned: pchang)

References

Details

(Keywords: crash, Whiteboard: [caf-crash 166][caf priority: p3][CR 649627][b2g-crash])

Crash Data

Attachments

(5 files, 6 obsolete files)

Attached file stack trace
STR:

1) Run orangutan monkey test on v1.4 msm8226. We don't have any specific STR for this.
Crash Signature: [@ mozilla::layers::CrossProcessCompositorParent::GetCompositionManager(mozilla::layers::LayerTransactionParent*) | mozilla::layers::LayerTransactionParent::RecvUpdate(nsTArraymozilla::layers::Edit const&, mozilla::layers::TargetConfig const&, bool const&…
Attached file .extra file
*.extra file for crash
Blocks: 984663
blocking-b2g: --- → 1.4?
Whiteboard: [CR 649627]
Component: General → Graphics: Layers
Keywords: crash
Product: Firefox OS → Core
Whiteboard: [CR 649627] → [CR 649627][b2g-crash]
Version: unspecified → 30 Branch
CS blocker - so blocking+
blocking-b2g: 1.4? → 1.4+
Milan, sorry to ping you again. Because the component is graphics::layers, do you know if there has anyone who can work on this bug? Thanks.
Flags: needinfo?(milan)
Are you getting this on all monkey tests or just that once?

Nical, Sotaro, anything obvious from the stack?  Just looking inside of CrossProcessCompositorParent::GetCompositionManager, and noting a SEGV at 0x0:
* aLayerTree should not be null, that's "this" from the caller LayerTransactionParent
* can the aLayerTree id be missing from the map, or perhaps just removed?  racing somewhere?

Hard to make this actionable without STR, but if something stands out from the stack or you think we should add more protection against failures in that call, or more logging, let me know.
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(milan)
For CS triage - this is not likely for 4/21 unless something trivial is discovered today.
Inder,

Based on Comment #4 can we get anything attached - STRs/ logs?
Flags: needinfo?(ikumar)
From attachment 8407763 [details], the crash seems to be happened because of "sIndirectLayerTrees[id].mParent is nullptr".

AsyncCompositionManager*
CrossProcessCompositorParent::GetCompositionManager(LayerTransactionParent* aLayerTree)
{
  uint64_t id = aLayerTree->GetId();
  return sIndirectLayerTrees[id].mParent->GetCompositionManager(aLayerTree);
}
Flags: needinfo?(sotaro.ikeda.g)
"sIndirectLayerTrees[id].mParent" is registered by CompositorParent::NotifyChildCreated(). It is triggered by PCompositorChild::SendNotifyChildCreated() called from RenderFrameParent::Init(). SendNotifyChildCreated() is async transaction call.

http://mxr.mozilla.org/mozilla-central/source/layout/ipc/RenderFrameParent.cpp#746

http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp#966
Function call sequence from TabChild::InitRenderingState(). It is focused around CompositorParent::NotifyChildCreated() and LayerTransactionParent creation.
"sIndirectLayerTrees[id].mParent" is raw pointer. And the value is updated only by CompositorParent::NotifyChildCreated(). Updated value is always pointer of CompositorParent.
From Comment 8, Comment 9 and Comment 10, it might be possible that CompositorParent::RecvNotifyChildCreated() is deferred by async transaction.
If comment 11 is correct, this patch fix the problem. But it might degrade the app startup performance.

I am not sure if there is other way to fix the problem.
Comment on attachment 8408548 [details] [diff] [review]
patch - Change NotifyChildCreated() to sync

nical, can I have a feedback to the patch?
Attachment #8408548 - Flags: feedback?(nical.bugzilla)
(In reply to Sotaro Ikeda [:sotaro] from comment #11)
> From Comment 8, Comment 9 and Comment 10, it might be possible that
> CompositorParent::RecvNotifyChildCreated() is deferred by async transaction.

The following two API calls will be processed by compositor thread.
Even SendNotifyChildCreated is async API, compositor thread will still process NotifyChildCreated before than SendPLayerTransactionConstructor because they are using the same IPC protocol.
Therefore, I think mParent is valid if content process sends out NotifyChildCreated first.

aysnc PCompositorChild::SendNotifyChildCreated
sync  PCompositorChild::SendPLayerTransactionConstructor

04-17 22:10:18.619  1290  1314 I Gecko   : pchang CompositorParent::NotifyChildCreated child 1157064576 mparent 0x4
04-17 22:10:18.619  1290  1314 I Gecko   : pchang CrossProcessCompositorParent::AllocPLayerTransactionParent called
04-17 22:10:19.409  1290  1314 I Gecko   : pchang CrossProcessCompositorParent::GetCompositionManager child 0 parent 0x4
I tried to disassemble the GetCompositionManaager func[1] and looks like the crash[2] happened at [3].
So it implies the LayerTransactionParent is invalid or nullptr.
But it is a little strange because GetCompositionManager was called from LayerTransactionParent::RecvUpdate and all of them were processed by compositor thread.

[1]disassemble
Dump of assembler code for function _ZN7mozilla6layers28CrossProcessCompositorParent21GetCompositionManagerEPNS0_22LayerTransactionParentE:
1281	{
   0x413902b0 <+0>:	push	{lr}
   0x413902b2 <+2>:	vpush	{d8}
   0x413902b6 <+6>:	sub	sp, #28
   0x413902b8 <+8>:	str	r0, [sp, #12]
   0x413902ba <+10>:	str	r1, [sp, #8]

1282	  uint64_t id = aLayerTree->GetId();
=> 0x413902bc <+12>:	ldr	r0, [sp, #8]
   0x413902be <+14>:	bl	0x4138da5c <_ZNK7mozilla6layers22LayerTransactionParent5GetIdEv>
   0x413902c2 <+18>:	vmov	d16, r0, r1
   0x413902c6 <+22>:	vstr	d16, [sp, #16]

[2] crash stack
CrossProcessCompositorParent::GetCompositionManager(mozilla::layers::LayerTransactionParent*) [CompositorParent.cpp : 1250 + 0xc

[3]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp?from=CrossProcessCompositorParent&case=true#1291
sotaro, I will take this bug if comment 14 and comment 15 are right.
Assignee: nobody → pchang
(In reply to Preeti Raghunath(:Preeti) from comment #6)
> Inder,
> 
> Based on Comment #4 can we get anything attached - STRs/ logs?

Passing ni to Tapas.
Preeti - for quick turn around please ping submitter directly
Flags: needinfo?(ikumar) → needinfo?(tkundu)
(In reply to Milan Sreckovic [:milan] from comment #4)
> Are you getting this on all monkey tests or just that once?
> 
We observed it only once so far. 


> Hard to make this actionable without STR, but if something stands out from
> the stack or you think we should add more protection against failures in
> that call, or more logging, let me know.

We already attached logcat/stacktrace/*.extra file. We don't have specific STR as it has happened during orangutan monkey test. I can attach another set of logcat/stacktrac/*.extra file if it happens again.  

Please be more specific if you want me to enable some logging from gecko.
Flags: needinfo?(tkundu)
Status: NEW → ASSIGNED
(In reply to peter chang[:pchang][:peter] from comment #14)
> (In reply to Sotaro Ikeda [:sotaro] from comment #11)
> > From Comment 8, Comment 9 and Comment 10, it might be possible that
> > CompositorParent::RecvNotifyChildCreated() is deferred by async transaction.
> 
> The following two API calls will be processed by compositor thread.
> Even SendNotifyChildCreated is async API, compositor thread will still
> process NotifyChildCreated before than SendPLayerTransactionConstructor
> because they are using the same IPC protocol.
> Therefore, I think mParent is valid if content process sends out
> NotifyChildCreated first.

Is it correct? IIRC, a sync message could be received ahead by a recipient than an async message, even when the async message is sent ahead.
Nical, how do you think about Comment 19?
Flags: needinfo?(nical.bugzilla)
> 0 libxul.so!mozilla::layers::CrossProcessCompositorParent::GetCompositionManager(mozilla::layers::LayerTransactionParent*) [CompositorParent.cpp : 1250 + 0xc]

I think there seems confusion about the above. The code from last week say that line 1250 is the following.
>   return sIndirectLayerTrees[id].mParent->GetCompositionManager(aLayerTree);

Most recent code say that 1250 is the following.

>   uint64_t id = aLayerTree->GetId();
> Crash reason:  SIGSEGV
> Crash address: 0x0

The point of the crash is that the crash address is 0x0.
I got disassembled GetCompositionManager() by gdb from b2g_QRD8x28. It is slightly different from Comment 15.

> (gdb) disassemble
> Dump of assembler code for function mozilla::layers::CrossProcessCompositorParent::GetCompositionManager(mozilla::layers::LayerTransactionParent*):
>    0xb50c4458 <+0>:	push	{r0, r1, r5, lr}
>    0xb50c445a <+2>:	mov	r5, r1
>    0xb50c445c <+4>:	ldrd	r2, r3, [r1, #88]	; 0x58
>    0xb50c4460 <+8>:	add	r1, sp, #8
> => 0xb50c4462 <+10>:	ldr	r0, [pc, #28]	; (0xb50c4480 <mozilla::layers::CrossProcessCompositorParent::GetCompositionManager(mozilla::layers::LayerTransactionParent*)+40>)
>    0xb50c4464 <+12>:	strd	r2, r3, [r1, #-8]!
>    0xb50c4468 <+16>:	add	r0, pc
>    0xb50c446a <+18>:	mov	r1, sp
>    0xb50c446c <+20>:	bl	0xb50c43b8 <std::map<unsigned long long, mozilla::layers::CompositorParent::LayerTreeState, std::less<unsigned long long>, std::allocator<std::pair<unsigned long long const, mozilla::layers::CompositorParent::LayerTreeState> > >::operator[]<unsigned long long>(unsigned long long const&)>
>    0xb50c4470 <+24>:	mov	r1, r5
>    0xb50c4472 <+26>:	ldr	r0, [r0, #8]
>    0xb50c4474 <+28>:	ldr	r3, [r0, #0]
>    0xb50c4476 <+30>:	ldr.w	r3, [r3, #224]	; 0xe0
>    0xb50c447a <+34>:	blx	r3
>    0xb50c447c <+36>:	pop	{r2, r3, r5, pc}
>    0xb50c447e <+38>:	nop
>    0xb50c4480 <+40>:	smlaltteq	r2, r6, r8, r2
In Comment 23, only the following part seems to cause "Crash address: 0x0". I confirmed it with :jrmuizel. It says that mParent is null.

>    0xb50c4474 <+28>:	ldr	r3, [r0, #0]
(In reply to Sotaro Ikeda [:sotaro] from comment #19)
> (In reply to peter chang[:pchang][:peter] from comment #14)
> > (In reply to Sotaro Ikeda [:sotaro] from comment #11)
> > > From Comment 8, Comment 9 and Comment 10, it might be possible that
> > > CompositorParent::RecvNotifyChildCreated() is deferred by async transaction.
> > 
> > The following two API calls will be processed by compositor thread.
> > Even SendNotifyChildCreated is async API, compositor thread will still
> > process NotifyChildCreated before than SendPLayerTransactionConstructor
> > because they are using the same IPC protocol.
> > Therefore, I think mParent is valid if content process sends out
> > NotifyChildCreated first.
> 
> Is it correct? IIRC, a sync message could be received ahead by a recipient
> than an async message, even when the async message is sent ahead.

Do you means that our IO thread will not send the package in order?
I think that the "async" is only no need to wait the reply, but it will be sent before the second sync message.
I will check this with Cervantes.
From Comment 24, "mParent is null" seems to cause the crash as in Comment 7.
(In reply to Jerry Shih[:jerry] from comment #25)
> > 
> > Is it correct? IIRC, a sync message could be received ahead by a recipient
> > than an async message, even when the async message is sent ahead.
> 
> Do you means that our IO thread will not send the package in order?
> I think that the "async" is only no need to wait the reply, but it will be
> sent before the second sync message.
> I will check this with Cervantes.

I just heard that in the past. Sorry, I do not know well ipc area.
In this case, top level protocols are also different. MessageChannel is created for each top level protocol. It might also affected to the bug.
(In reply to Sotaro Ikeda [:sotaro] from comment #26)
> From Comment 24, "mParent is null" seems to cause the crash as in Comment 7.

If mParent is null, than the problem may be id query failed.
For that case, we can simply add a NS_RUNTIME_FAILURE or MOZ_ASSERT here to check sIndirectLayerTrees.count(id) must be 1. And log out some readable log with the target id and try to repro this.
(In reply to Chiajung Hung [:chiajung] from comment #29)
> (In reply to Sotaro Ikeda [:sotaro] from comment #26)
> > From Comment 24, "mParent is null" seems to cause the crash as in Comment 7.
> 
> If mParent is null, than the problem may be id query failed.
> For that case, we can simply add a NS_RUNTIME_FAILURE or MOZ_ASSERT here to
> check sIndirectLayerTrees.count(id) must be 1. And log out some readable log
> with the target id and try to repro this.

I think this case can be easily tested by sending some random IPC message on PCompositor.
Flags: needinfo?(bjacob)
(In reply to Sotaro Ikeda [:sotaro] from comment #27)
> (In reply to Jerry Shih[:jerry] from comment #25)
> > > 
> > > Is it correct? IIRC, a sync message could be received ahead by a recipient
> > > than an async message, even when the async message is sent ahead.
> > 
> > Do you means that our IO thread will not send the package in order?
> > I think that the "async" is only no need to wait the reply, but it will be
> > sent before the second sync message.
> > I will check this with Cervantes.
> 
> I just heard that in the past. Sorry, I do not know well ipc area.

The following shows there is only one queue to handle IPC message, therefore, the messages should be sent in order no matter sync or async.

http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#749

I think we may need more info to investigate this issue, like the detail IPC log. But I'm not sure it is efficient to enable IPC log during this monkey test.
(In reply to Chiajung Hung [:chiajung] from comment #29)
> (In reply to Sotaro Ikeda [:sotaro] from comment #26)
> > From Comment 24, "mParent is null" seems to cause the crash as in Comment 7.
> 
> If mParent is null, than the problem may be id query failed.
> For that case, we can simply add a NS_RUNTIME_FAILURE or MOZ_ASSERT here to
> check sIndirectLayerTrees.count(id) must be 1. And log out some readable log
> with the target id and try to repro this.
Based on the assumption here, it may be a multi threading problem, as CompositorParent::DeallocateLayerTreeId is statically called:
http://dxr.mozilla.org/mozilla-central/source/layout/ipc/RenderFrameParent.cpp#945

So it make b2g::main thr queue a task into compositor thread first => compositor IPC recv some other message from IO thread => problem occurs.
(In reply to Chiajung Hung [:chiajung] from comment #32)
> (In reply to Chiajung Hung [:chiajung] from comment #29)
> > (In reply to Sotaro Ikeda [:sotaro] from comment #26)
> > > From Comment 24, "mParent is null" seems to cause the crash as in Comment 7.
> > 
> > If mParent is null, than the problem may be id query failed.
> > For that case, we can simply add a NS_RUNTIME_FAILURE or MOZ_ASSERT here to
> > check sIndirectLayerTrees.count(id) must be 1. And log out some readable log
> > with the target id and try to repro this.
> Based on the assumption here, it may be a multi threading problem, as
> CompositorParent::DeallocateLayerTreeId is statically called:
> http://dxr.mozilla.org/mozilla-central/source/layout/ipc/RenderFrameParent.
> cpp#945
> 
> So it make b2g::main thr queue a task into compositor thread first =>
> compositor IPC recv some other message from IO thread => problem occurs.

Good catch.

I just force trigger RenderFrameParent::Destroy in [1] and then got the crash at CrossProcessCompositorParent::GetCompositionManage.

[1]http://dxr.mozilla.org/mozilla-central/source/dom/ipc/TabChild.cpp#1975
Attached patch v1 (obsolete) — Splinter Review
As mentioned in comment 32, there is a race condition to process sIndirectLayerTrees map between RenderFrameParent::ActorDestroy[1](main thread) and CrossProcessCompositorParent::GetCompositionManager[2](compositorthread).

  If ActorDestroy happens first, it will cause crash at GetCompositionManager [2] because map element is removed already.

  Inside CrossProcessCompositorParent::DeallocPLayerTransactionParent[3], it will remove the same elements from sIndirectLayerTree as ActorDestroy[1] does.

  To fix this problem properly, I simply the erase of sIndirectLayerTree elements only invokes by CompositorParents[2] which invokes by compositor thread.

[1]http://dxr.mozilla.org/mozilla-central/source/layout/ipc/RenderFrameParent.cpp#945

[2]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp#1292

[3]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp#1233
Attachment #8410159 - Flags: review?(roc)
(In reply to Sotaro Ikeda [:sotaro] from comment #19)
> Is it correct? IIRC, a sync message could be received ahead by a recipient
> than an async message, even when the async message is sent ahead.

With IPDL we have the guarantee that, within a channel, messages are received in the same order as they are sent.
Flags: needinfo?(nical.bugzilla)
Sotaro, looking at the comment, my understanding is that Peter took over the bug and has a fix that takes a different route than yours, so I am cancelling the needinfo. Don't hesitate to to put it back if you still want some more feedbacks on that. I don't mind having a few sync messages that happen only during the app's startup or shut down (if that was your only concern).
Flags: needinfo?(nical.bugzilla)
Attachment #8408548 - Flags: feedback?(nical.bugzilla)
(In reply to Nicolas Silva [:nical] from comment #36)
> Sotaro, looking at the comment, my understanding is that Peter took over the
> bug and has a fix that takes a different route than yours, so I am
> cancelling the needinfo. Don't hesitate to to put it back if you still want
> some more feedbacks on that. I don't mind having a few sync messages that
> happen only during the app's startup or shut down (if that was your only
> concern).

Thanks, it was my misunderstanding. I just believed it without a confirmation since heard about it.
(In reply to Chiajung Hung [:chiajung] from comment #30)
> (In reply to Chiajung Hung [:chiajung] from comment #29)
> > (In reply to Sotaro Ikeda [:sotaro] from comment #26)
> > > From Comment 24, "mParent is null" seems to cause the crash as in Comment 7.
> > 
> > If mParent is null, than the problem may be id query failed.
> > For that case, we can simply add a NS_RUNTIME_FAILURE or MOZ_ASSERT here to
> > check sIndirectLayerTrees.count(id) must be 1. And log out some readable log
> > with the target id and try to repro this.
> 
> I think this case can be easily tested by sending some random IPC message on
> PCompositor.

Yes, this does look like the kind of crashes that one can find with our IPC fuzzer, although I can't guarantee that you'll hit a particular crash before trying. https://intranet.mozilla.org/User:Bjacob@mozilla.com/Gfx_IPC_fuzzing has some instructions to set up for fuzzing Gfx IPC protocols. I'm also available to help if you have speficic problems. I haven't run the fuzzer myself over the past month.
Flags: needinfo?(bjacob)
...well, actually I have; the last time that I ran the fuzzer myself was probably around March 31, just before the Taipei work week.
(In reply to peter chang[:pchang][:peter] from comment #31)
> (In reply to Sotaro Ikeda [:sotaro] from comment #27)
> > (In reply to Jerry Shih[:jerry] from comment #25)
> > > > 
> > > > Is it correct? IIRC, a sync message could be received ahead by a recipient
> > > > than an async message, even when the async message is sent ahead.
> > > 
> > > Do you means that our IO thread will not send the package in order?
> > > I think that the "async" is only no need to wait the reply, but it will be
> > > sent before the second sync message.
> > > I will check this with Cervantes.
> > 
> > I just heard that in the past. Sorry, I do not know well ipc area.
> 
> The following shows there is only one queue to handle IPC message,
> therefore, the messages should be sent in order no matter sync or async.

Thanks, it was my misunderstanding.
(In reply to Benoit Jacob [:bjacob] from comment #39)
> ...well, actually I have; the last time that I ran the fuzzer myself was
> probably around March 31, just before the Taipei work week.

One interesting I noticed from this bug is: Does the IPC fuzzer fire IPC on 1-channel a time only or multiple channel with random message at random order?

As comment 32 stated, this is a problem that multiple IPC channels access the same data structure in a semi-threading safe way: the access calls *run on the same thread*, so no need to use Mutex in general, however, it still is a multi-threading bug.
The IPC fuzzer (in "pickle" mode, which is what I am using) randomly corrupts pickles i.e. message parameter values. It does not reorder messages, or affect the timing of their delivery. If the present bug is not related to particular values of IPC message parameters, then the IPC fuzzer (at least, as I am using it) is not going to help.

I fuzzed today, and found two bugs already (bug 999686 and bug 999697) which is sad because it means we did regress recently, but I haven't reproduced the present bug so far.

For more questions about the fuzzer, like to know if it could help reproducing the particular bug pattern, you could ask its author, :cdiehl.
Attached patch v2(carry r+) (obsolete) — Splinter Review
a. update reviewer
b. remove unused func CompositorParent::DeallocateLayerTreeId
Attachment #8410159 - Attachment is obsolete: true
Attachment #8410715 - Flags: review+
try server result is positive. Request for checkin.
Keywords: checkin-needed
backed this out for test failures like https://tbpl.mozilla.org/php/getParsedLog.php?id=38383866&tree=Mozilla-Inbound - seems in the try run were opt builds while this test failure happens on debug builds.
(In reply to Carsten Book [:Tomcat] from comment #47)
> backed this out for test failures like
> https://tbpl.mozilla.org/php/getParsedLog.php?id=38383866&tree=Mozilla-
> Inbound - seems in the try run were opt builds while this test failure
> happens on debug builds.

It seems there are some memory leak about my patch on OSX.

18:14:44     INFO -  nsTraceRefcnt::DumpStatistics: 1498 entries
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 1 Compositor (64 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 1 CompositorOGL (328 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 2 CondVar (64 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 1 ContainerLayerComposite (840 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 1 CrossProcessCompositorParent (768 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 2 Layer (912 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 1 LayerManager (136 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 1 LayerTransactionParent (200 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 2 Mutex (48 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 1 PCompositorParent (736 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 1 PImageBridgeParent (752 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 1 PLayerTransactionParent (80 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 2 RefCountedMonitor (128 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 2 RefCountedTask (32 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 1 TextRenderer (48 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 1 ThebesLayerComposite (672 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 2 WeakReference<MessageListener> (32 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 2 ipc::MessageChannel (1040 bytes)
18:14:44     INFO -  TEST-INFO | leakcheck | leaked 13 nsTArray_base (104 bytes)
18:14:44  WARNING -  TEST-UNEXPECTED-FAIL | leakcheck | 6984 bytes leaked (Compositor, CompositorOGL, CondVar, ContainerLayerComposite, CrossProcessCompositorParent, ...)
18:14:44     INFO -  runtests.py | Running tests: end.
I'm able to reproduce mochitest failed on OSX and it was failed during the shutdown stage.

[fail cmd]
./mach mochitest-plain layout/base/tests/test_remote_frame.html layout/base/tests/test_remote_passpointerevents.html test_scroll_event_ordering.html
By disabling the followng MOZ_ASSERT, I could pass the testing.
http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_loop.cc#406
(In reply to peter chang[:pchang][:peter] from comment #49)
> I'm able to reproduce mochitest failed on OSX and it was failed during the
> shutdown stage.
> 
> [fail cmd]
> ./mach mochitest-plain layout/base/tests/test_remote_frame.html
> layout/base/tests/test_remote_passpointerevents.html
> test_scroll_event_ordering.html

(In reply to peter chang[:pchang][:peter] from comment #50)
> By disabling the followng MOZ_ASSERT, I could pass the testing.
> http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/
> message_loop.cc#406

Ehsan, I hit the MOZ_ASSERT[1] you added from bug 976363.
I don't know why my patch hits the assertion in message_loop.cc.
The mochitest pass if I removed the MOZ_ASSERT.

My modification[2] is to remove one message posting from main thread to compositor thread to avoid the race condition.
[1]http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_loop.cc#406
[2]http://dxr.mozilla.org/mozilla-central/source/layout/ipc/RenderFrameParent.cpp#945

Do I hit the known issue or this problem was caused by my patch?
Flags: needinfo?(ehsan)
(In reply to Sotaro Ikeda [:sotaro] from comment #37)
> (In reply to Nicolas Silva [:nical] from comment #36)
> > Sotaro, looking at the comment, my understanding is that Peter took over the
> > bug and has a fix that takes a different route than yours, so I am
> > cancelling the needinfo. Don't hesitate to to put it back if you still want
> > some more feedbacks on that. I don't mind having a few sync messages that
> > happen only during the app's startup or shut down (if that was your only
> > concern).
> 
> Thanks, it was my misunderstanding. I just believed it without a
> confirmation since heard about it.

I recalled the situation of IPC message order change.

when a IPC client is sending sync IPC, the IPC client does not receive async messages from the clitnt's peer until the sync IPC complete, exit current thread task and IPC messages receiving task is called. Pending receiving async messages are queued in MessageChannel in the following code.

http://mxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#536

In this situation, the client's peers sync response message could be received earlier than previously sent async messages by the client. If the client send consecutive a lot of sync ipc messages, the async messages receipt could be re-ordered(delayed) a lot.
That assertion is a sign that a task posted to the MessageLoop was not run until the MessageLoop (and presumably its owner thread) died.  In order to debug this, attach gdb and reproduce the assertion, and look at the type of the task(s) stored in work_queue_.  Once you get an idea about the type of the C++ object stored in the queue you can get a better sense of where you need to look to figure out why the work was not scheduled before the thread shut down.
Flags: needinfo?(ehsan)
Depends on: 986113
Attached patch v3 (obsolete) — Splinter Review
Attachment 8410715 [details] [diff] is a better solution which removes the redundant message post and also fix the race condition between main thread and compositor thread.

But it caused the mochitest on OSX debug build from comment 51/comment53. And bug 986113 is trying to fix this mochitest fail caused by non-empty work queue during content process shutdown.

Since this is a 1.4 blocker,  I propose another patch that keeps the message post and fix the race condition part. With this patch, the mochitest was passed in my OSX debug build. 

After bug 986113 is fixed, I will re-land attachment 8410715 [details] [diff] [review].
Attachment #8415300 - Flags: feedback?(cku)
Comment on attachment 8415300 [details] [diff] [review]
v3

Review of attachment 8415300 [details] [diff] [review]:
-----------------------------------------------------------------

Hi Peter,
In Attachment 8410715 [details] [diff] [diff]

Why this change hits that assertion by "preventing" injecting EraseLayerState into Compositor thread?
http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_loop.cc#406
If you introduce a new injection, hit this assertion after apply the patch make sense to me. But it's just the opposite

Back to the patch here, if it works(prevent that assertion and original crash), I am ok with it.

::: gfx/layers/ipc/CompositorParent.cpp
@@ +984,5 @@
> +  /** Erase mParent of the indirect shadow tree by |aId|.
> +   *  Later the map element of this aId will be removed
> +   *  by RemoveIndirectTree(aId).
> +   */
> +  sIndirectLayerTrees[aId].mParent = nullptr;

Don't need this change

@@ +1243,5 @@
>  CrossProcessCompositorParent::RecvNotifyChildCreated(const uint64_t& child)
>  {
> +  if (!sIndirectLayerTrees[child].mParent) {
> +    return false;
> +  }

bool
CrossProcessCompositorParent::RecvNotifyChildCreated(const uint64_t& child)
{
  LayerTreeMap::iterator it = sIndirectLayerTrees.find(child);
  if (it != sIndirectLayerTrees.end()) {
    (*it).mParent->NotifyChildCreated(child);
  }

  return true; // Since you take this solution as workaround,
               // don't return false, do error handle inside 
               // this function only
}
Attachment #8415300 - Flags: feedback?(cku) → feedback+
Comment on attachment 8415300 [details] [diff] [review]
v3

Review of attachment 8415300 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +984,5 @@
> +  /** Erase mParent of the indirect shadow tree by |aId|.
> +   *  Later the map element of this aId will be removed
> +   *  by RemoveIndirectTree(aId).
> +   */
> +  sIndirectLayerTrees[aId].mParent = nullptr;

I still need this change because this func will be called by compositor thread which is generated from the main thread post msg.

Before my patch, the map element will be removed and it may crash at the following GetCompositionManager[1] because the element is already gone. Then it will add new map element with same id but with nullptr mParent.

Therefore, my v3 patch still keeps the map element but set mParent as nullptr.

[1]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp#1292

@@ +1243,5 @@
>  CrossProcessCompositorParent::RecvNotifyChildCreated(const uint64_t& child)
>  {
> +  if (!sIndirectLayerTrees[child].mParent) {
> +    return false;
> +  }

This func won't be invoked by anyone and it should be invoked at early stage(before EraseLayerState) if happened. 

Adding error handle here is keeping the same behavior with other modification.
(In reply to C.J. Ku[:CJKu] from comment #55)
> Comment on attachment 8415300 [details] [diff] [review]
> v3
> 
> Review of attachment 8415300 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Hi Peter,
> In Attachment 8410715 [details] [diff]
> 
> Why this change hits that assertion by "preventing" injecting
> EraseLayerState into Compositor thread?
> http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/
> message_loop.cc#406
> If you introduce a new injection, hit this assertion after apply the patch
> make sense to me. But it's just the opposite
> 
> Back to the patch here, if it works(prevent that assertion and original
> crash), I am ok with it.

About attachment 8410715 [details] [diff] [review], it reduces the IPC message between mainthread(RenderFrameParent) and compositorthread(CrossProcessCompositor) to fix the race condition and also makes sIndirectLayerState still valid when RenderFrameParent::ActorDestroy[1] happened. 

Therefore, message handled by main thread[2] may process longer with my patch and cause some pending messages in queue. Then it has higher possibility to cause assertion[3] happened.

ShadowLayersUpdated(main) -> CrossProcessCompositorParent::ShadowLayersUpdated
[1]http://dxr.mozilla.org/mozilla-central/source/layout/ipc/RenderFrameParent.cpp#945
[2]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/LayerTransactionParent.cpp#548
[3]http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_loop.cc#406
Attachment #8415300 - Flags: review?(roc)
Comment on attachment 8415300 [details] [diff] [review]
v3

Review of attachment 8415300 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +984,5 @@
> +  /** Erase mParent of the indirect shadow tree by |aId|.
> +   *  Later the map element of this aId will be removed
> +   *  by RemoveIndirectTree(aId).
> +   */
> +  sIndirectLayerTrees[aId].mParent = nullptr;

Yes, I already know why you did that change. I understand you logic here totally.

Please see what I wrote bellow.

@@ +1243,5 @@
>  CrossProcessCompositorParent::RecvNotifyChildCreated(const uint64_t& child)
>  {
> +  if (!sIndirectLayerTrees[child].mParent) {
> +    return false;
> +  }

This change should apply to all functions that you modified in this patch. I am a little bit lazy to modify this change to all functions, please take it as a template.

And, see the next comment

@@ +1309,5 @@
>  {
>    uint64_t id = aLayerTree->GetId();
> +  if (!sIndirectLayerTrees[id].mParent) {
> +    return nullptr;
> +  }

static void
EraseLayerState(uint64_t aId)
{
  // You can still erase this item from map, with the following change
  // in GetCompositionManager.
  sIndirectLayerTrees.erase(aId);
}

AsyncCompositionManager*
CrossProcessCompositorParent::GetCompositionManager(LayerTransactionParent* aLayerTree)
{
  uint64_t id = aLayerTree->GetId();

  // Use map::find, instead of map::operator[]. This is the 
  // main reason that I think you don't need to change 
  // EraseLayerState at all.
  LayerTreeMap::iterator it = sIndirectLayerTrees.find(id);
  
  // if "id" have been removed form sIndirectLayerTrees by
  // EraseLayerState function, this change can prevent the
  // crash.
  // if (sIndirectLayerTrees[id]) << this is not work, 
  // I think you may already know the reason.
  if (it == sIndirectLayerTrees.end()) {
    return nullptr;
  }
  
  return (*it).mParent->GetCompositionManager(aLayerTree);
}
BTW, 
1. This bug is not caused by race condition, if we always access sIndirectLayerTrees in compositor thread. The root cause is because of we try to access a value(LayerTreeState) which is already be erased from  sIndirectLayerTrees map.

2. To fix this problem, you can either do error prevention(Attachment 8410715 [details] [diff]) or error detection(attachment 8415300 [details] [diff] [review]). You may combine this two patches.

3. All change in attachment 8415300 [details] [diff] [review] is to prevent use a removed item in sIndirectLayerTrees. My suggestion is to use map<K, T>::find to access map value instead of map<K, T>::operator[], which always return a default LayerTreeState even if key(ID) is not matched. Instead of keep value in map and use mParent of that value to do validation.
(In reply to C.J. Ku[:cjku] from comment #60)
> BTW, 
> 1. This bug is not caused by race condition, if we always access
> sIndirectLayerTrees in compositor thread. The root cause is because of we
> try to access a value(LayerTreeState) which is already be erased from 
> sIndirectLayerTrees map.
> 
> 2. To fix this problem, you can either do error prevention(Attachment
> 8410715 [details] [diff]) or error detection(attachment 8415300 [details] [diff] [review]
> [diff] [review]). You may combine this two patches.
> 
> 3. All change in attachment 8415300 [details] [diff] [review] is to prevent
> use a removed item in sIndirectLayerTrees. My suggestion is to use map<K,
> T>::find to access map value instead of map<K, T>::operator[], which always
> return a default LayerTreeState even if key(ID) is not matched. Instead of
> keep value in map and use mParent of that value to do validation.

If we want to keep the erase logic, then use map::count is better. But it requires to add more checking logic into CompsitorParents.cpp. That's why I wrote the patch v2(attachment 8410715 [details] [diff] [review]).
And current v3 is a temp solution to avoid mochitest failed until bug 986113 is fixed.

I don't know when bug 986113 will be fixed. Maybe make a patch for error handling is useful under current stage.
(In reply to peter chang[:pchang][:peter] from comment #61)
> (In reply to C.J. Ku[:cjku] from comment #60)
> > BTW, 
> > 1. This bug is not caused by race condition, if we always access
> > sIndirectLayerTrees in compositor thread. The root cause is because of we
> > try to access a value(LayerTreeState) which is already be erased from 
> > sIndirectLayerTrees map.
> > 
> > 2. To fix this problem, you can either do error prevention(Attachment
> > 8410715 [details] [diff]) or error detection(attachment 8415300 [details] [diff] [review]
> > [diff] [review]). You may combine this two patches.
> > 
> > 3. All change in attachment 8415300 [details] [diff] [review] is to prevent
> > use a removed item in sIndirectLayerTrees. My suggestion is to use map<K,
> > T>::find to access map value instead of map<K, T>::operator[], which always
> > return a default LayerTreeState even if key(ID) is not matched. Instead of
> > keep value in map and use mParent of that value to do validation.
> 
> If we want to keep the erase logic, then use map::count is better. But it
> requires to add more checking logic into CompsitorParents.cpp. That's why I
> wrote the patch v2(attachment 8410715 [details] [diff] [review]).
> And current v3 is a temp solution to avoid mochitest failed until bug 986113
> is fixed.
> 
> I don't know when bug 986113 will be fixed. Maybe make a patch for error
> handling is useful under current stage.
You eventually need to call map::operator[], which takes a O(log N) and is equal to map::find(), since you need to reference mParent.

In V2
if (it == sIndirectLayerTrees.end())
In V1
MOZ_ASSERT(it != sIndirectLayerTrees.end())
if anyone do things wrong in the future, a assertion crash make him know this problem quicker.
Comment on attachment 8415300 [details] [diff] [review]
v3

Review of attachment 8415300 [details] [diff] [review]:
-----------------------------------------------------------------

Please document where you declare sIndirectLayerTrees that it may have elements whose mParent is null, and explain what that means for those elements. Also explain why we need to keep those elements in the map, because I don't know!
Attachment #8415300 - Flags: review?(roc) → review-
Attached patch v4 (obsolete) — Splinter Review
Add the error handling when the map element of sIndirectLayerTrees is not existed.
Attachment #8415300 - Attachment is obsolete: true
Comment on attachment 8417212 [details] [diff] [review]
v4

Review of attachment 8417212 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +1331,5 @@
>  CrossProcessCompositorParent::GetCompositionManager(LayerTransactionParent* aLayerTree)
>  {
>    uint64_t id = aLayerTree->GetId();
> +  const CompositorParent::LayerTreeState* state = CompositorParent::GetIndirectShadowTree(id);
> +  if (!state) {

Here I didn't add the MOZ_ASSERT(sIndirectLayerTrees.end() == cit) inside GetIndirectShadowTree.
Because [1](trigger from main thread) will try to remove map element and here may fail to find the element by id.

[1]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp#1000
Attachment #8417212 - Flags: feedback?(cku)
Comment on attachment 8417212 [details] [diff] [review]
v4

Review of attachment 8417212 [details] [diff] [review]:
-----------------------------------------------------------------

Much better. Logic here make sense to me, thanks
Attachment #8417212 - Flags: feedback?(cku) → feedback+
Comment on attachment 8417212 [details] [diff] [review]
v4

Review of attachment 8417212 [details] [diff] [review]:
-----------------------------------------------------------------

This v4 patch is keeping original behavior but adding the error handling when the map element of sIndirectLayerTree got destroyed.
Attachment #8417212 - Flags: review?(roc)
(In reply to peter chang[:pchang][:peter] from comment #67)
> Comment on attachment 8417212 [details] [diff] [review]
> v4
> 
> Review of attachment 8417212 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This v4 patch is keeping original behavior but adding the error handling
> when the map element of sIndirectLayerTree got destroyed.

  RenderFrameParent::ActorDestroy[1] will post message to compositor thread to remove the map element. But the map element could be queried from CrossProcessCompositorParent::GetCompositionManager[2](compositorthread).

[1]http://dxr.mozilla.org/mozilla-central/source/layout/ipc/RenderFrameParent.cpp#945
[2]http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorParent.cpp#1306
Comment on attachment 8417212 [details] [diff] [review]
v4

Review of attachment 8417212 [details] [diff] [review]:
-----------------------------------------------------------------

Much better. But, please add comments explaining why the entry may not be found.
Attachment #8417212 - Flags: review?(roc) → review-
Attached patch v5 (obsolete) — Splinter Review
Add comment about why the element of sIndirectLayerTree got removed
Attachment #8417212 - Attachment is obsolete: true
Attachment #8418510 - Flags: review?(roc)
Attachment #8418510 - Flags: feedback+
Attachment #8410715 - Attachment is obsolete: true
Comment on attachment 8418510 [details] [diff] [review]
v5

Review of attachment 8418510 [details] [diff] [review]:
-----------------------------------------------------------------

minor suggestion of comment

::: gfx/layers/ipc/CompositorParent.cpp
@@ +997,5 @@
>  CompositorParent::DeallocateLayerTreeId(uint64_t aId)
>  {
>    MOZ_ASSERT(NS_IsMainThread());
> +  // Here main thread notifies compositor thread to remove
> +  // the element from sIndirectLayerTrees.

// an element from sIndirectLayerTrees.

@@ +998,5 @@
>  {
>    MOZ_ASSERT(NS_IsMainThread());
> +  // Here main thread notifies compositor thread to remove
> +  // the element from sIndirectLayerTrees.
> +  // It is possible the removed element will be queried soon

// It is possible that an item(LayerTreeState) be queried after been 
// removed from sIndirectLayerTrees.
// Always check whether an item is still in sIndirectLayerTrees and
// do error handling accordingly.
Comment on attachment 8418510 [details] [diff] [review]
v5

Review of attachment 8418510 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +997,5 @@
>  CompositorParent::DeallocateLayerTreeId(uint64_t aId)
>  {
>    MOZ_ASSERT(NS_IsMainThread());
> +  // Here main thread notifies compositor thread to remove
> +  // the element from sIndirectLayerTrees.

Will change

@@ +998,5 @@
>  {
>    MOZ_ASSERT(NS_IsMainThread());
> +  // Here main thread notifies compositor thread to remove
> +  // the element from sIndirectLayerTrees.
> +  // It is possible the removed element will be queried soon

// It is possible that an removed item will be queried soon.
// Checking the elements of sIndirectLayerTress exist or not
// before using

I would prefer above simple comment.
Attached patch v5 (obsolete) — Splinter Review
update comment based on suggestion
Attachment #8418510 - Attachment is obsolete: true
Attachment #8418510 - Flags: review?(roc)
Attachment #8419153 - Flags: review?(roc)
Attachment #8419153 - Flags: feedback+
Add reviewer
Attachment #8419153 - Attachment is obsolete: true
Attachment #8420767 - Flags: review+
(In reply to Carsten Book [:Tomcat] from comment #76)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/a170dabfd12b

also had to back this out since the original commit message had a wrong bugnummer 997365 vs 997367, so it need checkin again
(In reply to Carsten Book [:Tomcat] from comment #77)
> (In reply to Carsten Book [:Tomcat] from comment #76)
> > https://hg.mozilla.org/integration/mozilla-inbound/rev/a170dabfd12b
> 
> also had to back this out since the original commit message had a wrong
> bugnummer 997365 vs 997367, so it need checkin again

fixed the commit message and relanded as https://hg.mozilla.org/integration/mozilla-inbound/rev/9c4376b338f0 :)
https://hg.mozilla.org/mozilla-central/rev/9c4376b338f0
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Depends on: 1009129
No longer depends on: 1009129
Whiteboard: [CR 649627][b2g-crash] → [caf priority: p3][CR 649627][b2g-crash]
Observed on: 

Device: 
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.5.01.04.00.113.070
Moz BuildID: 20140331000202
B2G Version: 1.4
Gecko Version: 30.0a2
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=4c3b2f57f4229c5f36f0d8fd399e65f4db88f104
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=d02b9250ef7fedafc6c709dfcc899844b8624ab6
Whiteboard: [caf priority: p3][CR 649627][b2g-crash] → [caf-crash 166][caf priority: p3][CR 649627][b2g-crash]
Flags: in-moztrap?(bzumwalt)
Unable to create new test case for bug with current STR.
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Flags: in-moztrap?(bzumwalt)
Flags: in-moztrap-
You need to log in before you can comment on or make changes to this bug.