Closed Bug 1007284 Opened 6 years ago Closed 5 years ago

Intermittent e10s mochitest-2 PROCESS-CRASH | Shutdown | application crashed [@ linux-gate.so + 0x424] or application crashed [@ libc-2.15.so + 0x36445] with Experiments.jsm errors in the log prior and libpthread/libc on the stack

Categories

(Core :: IPC, defect)

All
Linux
defect
Not set

Tracking


RESOLVED FIXED
mozilla33
Tracking Status
e10s + ---
firefox30 --- unaffected
firefox31 --- unaffected
firefox32 --- fixed
firefox33 --- fixed
firefox-esr24 --- unaffected

People

(Reporter: RyanVM, Assigned: bjacob)

References

(Blocks 1 open bug)

Details

(Keywords: assertion, crash, intermittent-failure)

Attachments

(6 files, 1 obsolete file)

That this is happening on e10s mochitest runs seems pertinent.

https://tbpl.mozilla.org/php/getParsedLog.php?id=39221652&tree=Mozilla-Inbound

Ubuntu VM 12.04 x64 mozilla-inbound opt test mochitest-e10s-2 on 2014-05-07 11:04:23 PDT for push 85924e72f778
slave: tst-linux64-spot-460

11:11:07     INFO -  6362 INFO TEST-START | Shutdown
11:11:07     INFO -  6363 INFO Passed:  180635
11:11:07     INFO -  6364 INFO Failed:  0
11:11:07     INFO -  6365 INFO Todo:    24207
11:11:07     INFO -  6366 INFO Slowest: 46455ms - /tests/dom/imptests/editing/conformancetest/test_runtest.html
11:11:07     INFO -  6367 INFO SimpleTest FINISHED
11:11:07     INFO -  6368 INFO TEST-INFO | Ran 1 Loops
11:11:07     INFO -  6369 INFO SimpleTest FINISHED
11:11:07     INFO -  ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
11:11:07     INFO -  ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
11:11:07     INFO -  ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
11:11:07     INFO -  ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
11:11:07     INFO -  ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
11:11:07     INFO -  1399486267654	Browser.Experiments.Experiments	TRACE	PreviousExperimentProvider #0::shutdown()
11:11:07     INFO -  1399486267660	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: started
11:11:07     INFO -  1399486267666	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: finished with _loadTask
11:11:07     INFO -  1399486267667	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: no previous shutdown
11:11:07     INFO -  1399486267668	Browser.Experiments.Experiments	TRACE	Experiments #0::Unregistering instance with Addon Manager.
11:11:07     INFO -  1399486267668	Browser.Experiments.Experiments	TRACE	Experiments #0::Unregistering previous experiment add-on provider.
11:11:07     INFO -  1399486267670	Browser.Experiments.Experiments	TRACE	PreviousExperimentProvider #0::shutdown()
11:11:07     INFO -  1399486267670	addons.manager	ERROR	Exception calling provider shutdown: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://app/modules/experiments/Experiments.jsm :: this.Experiments.PreviousExperimentProvider.prototype<.shutdown :: line 2071"  data: no] Stack trace: this.Experiments.PreviousExperimentProvider.prototype<.shutdown()@resource://app/modules/experiments/Experiments.jsm:2071 < callProvider()@resource://gre/modules/AddonManager.jsm:192 < AMI_unregisterProvider()@resource://gre/modules/AddonManager.jsm:848 < AMP_unregisterProvider()@resource://gre/modules/AddonManager.jsm:2326 < Experiments.Experiments.prototype._unregisterWithAddonManager()@resource://app/modules/experiments/Experiments.jsm:496 < Experiments.Experiments.prototype.uninit<()@resource://app/modules/experiments/Experiments.jsm:442 < TaskImpl_run()@resource://gre/modules/Task.jsm:282 < TaskImpl_handleResultValue()@resource://gre/modules/Task.jsm:338 < TaskImpl_run()@resource://gre/modules/Task.jsm:290 < TaskImpl()@resource://gre/modules/Task.jsm:247 < createAsyncFunction/asyncFunction()@resource://gre/modules/Task.jsm:224 < Spinner.prototype.observe()@resource://gre/modules/AsyncShutdown.jsm:320 < <file:unknown>
11:11:07     INFO -  1399486267673	Browser.Experiments.Experiments	INFO	Experiments #0::Completed uninitialization.
11:11:07     INFO -  firefox: tpp.c:63: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed.
11:11:08     INFO -  TEST-INFO | Main app process: killed by SIGIOT
11:11:08  WARNING -  TEST-UNEXPECTED-FAIL | Shutdown | application terminated with exit code 6
11:11:08     INFO -  INFO | runtests.py | Application ran for: 0:05:11.881747
11:11:08     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmp2ecMbtpidlog
11:11:08     INFO -  ==> process 2437 launched child process 2477
11:11:08     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2477
11:11:08     INFO -  mozcrash INFO | Downloading symbols from: https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1399484774/firefox-32.0a1.en-US.linux-x86_64.crashreporter-symbols.zip
11:11:17  WARNING -  PROCESS-CRASH | Shutdown | application crashed [@ libc-2.15.so + 0x36445]
11:11:17     INFO -  Crash dump filename: /tmp/tmpyoZjqK/minidumps/06af5307-14a4-89dd-6324fbe4-3f59ec5b.dmp
11:11:17     INFO -  Operating system: Linux
11:11:17     INFO -                    0.0.0 Linux 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64
11:11:17     INFO -  CPU: amd64
11:11:17     INFO -       family 6 model 62 stepping 4
11:11:17     INFO -       1 CPU
11:11:17     INFO -  Crash reason:  SIGABRT
11:11:17     INFO -  Crash address: 0x3e800000985
11:11:17     INFO -  Thread 3 (crashed)
11:11:17     INFO -   0  libc-2.15.so + 0x36445
11:11:17     INFO -      rbx = 0x00007f2539d30000   r12 = 0x00007f2548543e30
11:11:17     INFO -      r13 = 0x00007f2548543f00   r14 = 0x00000000000008f2
11:11:17     INFO -      r15 = 0x00000000ffffffff   rip = 0x00007f254758d445
11:11:17     INFO -      rsp = 0x00007f2537d50838   rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: given as instruction pointer in context
11:11:17     INFO -   1  libc-2.15.so + 0x39baa
11:11:17     INFO -      rip = 0x00007f2547590bab   rsp = 0x00007f2537d50840
11:11:17     INFO -      rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: stack scanning
11:11:17     INFO -   2  libpthread-2.15.so + 0x11e2f
11:11:17     INFO -      rip = 0x00007f2548543e30   rsp = 0x00007f2537d50848
11:11:17     INFO -      rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: stack scanning
11:11:17     INFO -   3  libc-2.15.so + 0x17bb1f
11:11:17     INFO -      rip = 0x00007f25476d2b20   rsp = 0x00007f2537d50850
11:11:17     INFO -      rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: stack scanning
11:11:17     INFO -   4  libc-2.15.so + 0x6df51
11:11:17     INFO -      rip = 0x00007f25475c4f52   rsp = 0x00007f2537d50870
11:11:17     INFO -      rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: stack scanning
11:11:17     INFO -   5  libpthread-2.15.so + 0x11e2f
11:11:17     INFO -      rip = 0x00007f2548543e30   rsp = 0x00007f2537d50878
11:11:17     INFO -      rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: stack scanning
11:11:17     INFO -   6  libc-2.15.so + 0x179b14
11:11:17     INFO -      rip = 0x00007f25476d0b15   rsp = 0x00007f2537d508d0
11:11:17     INFO -      rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: stack scanning
I believe that the error in question is  __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed.

I doubt that this has anything to do with the experiments code. The NS_ERROR_FAILURE from http://hg.mozilla.org/mozilla-central/annotate/417acde736e7/browser/experiments/Experiments.jsm#l2071 is an interesting failure but not the cause of the crash.

The content process should have committed suicide at "###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID", and that's a bug, but not the proximate cause of the crash.

The chrome-process crash is an abort() from the I/O thread. The main thread is shutting down in ParentImpl::ShutdownBackgroundThread.

The I/O thread is at ProcessLink::OnChannelError -> MessageChannel::PostErrorNotifyTask -> MessageLoop::PostTask_Helper.

PostTask_Helper appears to be locking bogus memory. Hard to say much more than that. Might be a good candidate for `rr` debugging if it happens much.
Component: Client: Desktop → IPC
Product: Firefox Health Report → Core
https://tbpl.mozilla.org/php/getParsedLog.php?id=39228865&tree=Mozilla-Inbound

Looks like I'm going to have to start bisecting for a culprit now...
Summary: Intermittent e10s Shutdown | application crashed [@ libc-2.15.so + 0x36445] with Experiments.jsm errors in the log prior and libpthread on the stack → Intermittent e10s Shutdown | application crashed [@ libc-2.15.so + 0x36445][@ linux-gate.so + 0x424] with Experiments.jsm errors in the log prior and libpthread on the stack
Summary: Intermittent e10s Shutdown | application crashed [@ libc-2.15.so + 0x36445][@ linux-gate.so + 0x424] with Experiments.jsm errors in the log prior and libpthread on the stack → Intermittent e10s mochitest-2 Shutdown | application crashed [@ libc-2.15.so + 0x36445][@ linux-gate.so + 0x424] with Experiments.jsm errors in the log prior and libpthread on the stack
And BTW, why do I have this sinking feeling that bug 924622 is somehow involved?
Bug 880864 is another possibility. Retriggers running.
Retriggers conclusively point to bug 924622 as the cause. Backed out.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Bill: Nical is trying to fix bug 924622 but his fix hits this e10s mochitest crash. Do you have any suggestions?
Flags: needinfo?(wmccloskey)
I'm occasionally able to reproduce this in Linux debug by running an e10s browser (setting browser.tabs.remote.autostart and restarting), browsing around for a while, and quitting. I'll try to find a way to make it reproduce more consistently.
Flags: needinfo?(wmccloskey)
I found some interesting things today while trying to debug this. One thing I noticed is that I still crash at shutdown even if I disable async video. That means that we have bugs in shutdown that have nothing to do with the image bridge, even with the patch in bug 924622 applied.

It looks like the child process never deletes its CompositorChild instance. That's the first problem.

In the parent, the CrossProcessCompositorParent that's associated with the leaked CompositorChild gets its ActorDestroy method called when we get a channel error, which presumably happens when the child has exited. However, it looks like we might have already killed off the compositor thread by then: we increment the compositor thread's refcount in the CompositorParent constructor, but not in the CrossProcessCompositorParent constructor. If the compositor thread has been destroyed, I think we'll crash when we try to handle the channel error. That's the second problem.

So it seems like we need to have a proper way to shut down the CompositorChild in the child process, and then we need to wait until that's finished before we delete the compositor thread in the parent. Nical, would you be able to work on this by any chance? I still don't understand the code very well. It should make it a lot easier to land bug 924622, too.
Attached patch debuggingSplinter Review
This patch seems to make crashes more frequent because of the sleep call. My STR is to open Firefox in an e10s profile and then close it once about:home has loaded. It crashes about half the time.
Forgot to set needinfo.
Flags: needinfo?(nical.bugzilla)
Taking.

But... why is the status set to RESOLVED FIXED?
Flags: needinfo?(nical.bugzilla) → needinfo?(ryanvm)
The bug was filed as a regression from bug 924622 landing. When that was backed out, this bug was closed. We still want to fix them both though.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: needinfo?(ryanvm)
This removes the crash and lets shutdown complete normally. Just FYI: I'm still very unfamiliar with this code and don't know whether this patch makes sense. I just thought that MessageChannel should not be induced into such internal use-after-frees, even in the face of abnormal usage by the compositor code.

I haven't looked at this bug from the Compositor perspective yet, will do that next.
Forgot a hg add.
Attachment #8427972 - Attachment is obsolete: true
Attachment #8427979 - Attachment description: Strawman: avoid use-after-free of MessageChannel::mWorkerLoop → Expanded debugging patch (same sleep(3) call, more logging)
Attachment #8427969 - Attachment is obsolete: true
Attachment #8427972 - Attachment is obsolete: false
This is unaffected by the recent landing of bug 924622:
 - this shutdown crash persists after the landing of bug 924622;
 - the 'strawman' fix here continues to avert the crash.
Some comments inside this patch should clarify what it is doing. It's sad to be leaking in MessageChannel::Clear() when it's called on the wrong thread, but I don't see an immediate fix for this --- see comments. Ideas welcome!

Ben: the issue that this intends to fix is the use-after-free that is apparent in the above logs --- MessageChannel::mWorkerLoop becomes a dangling pointer during shutdown as the thread finishes.

This is only a stopgap fix. The deeper issue is that we have a bunch of code in ipc/glue, especially MessageChannel and MessageLink, that holds raw MessageLoop* pointers, but the MessageLoop is typically a stack variable in the thread main function, sitting at the bottom of the thread's call stack... so having all this code use these raw MessageLoop* pointers, both on this thread and on another thread (the link thread), is perhaps a bit dangerous.

We could either

  1) stick to this approach, but rationalize a way for it to be non-crashy, which would have to involve some synchronization during shutdown (sad); or

  2) give up on this approach, and modify ipc/glue code to avoid directly touching the worker MessageLoop from other threads (as it currently does in MessageChannel::PostErrorNotifyTask); or

  3) make MessageLoop refcounted and allocate it on the heap so that it could survive its thread, so that other threads directly referencing the MessageLoop could avoid crashing.
Attachment #8428885 - Flags: review?(wmccloskey)
Attachment #8428885 - Flags: review?(bent.mozilla)
Flags: needinfo?(wmccloskey)
Flags: needinfo?(bent.mozilla)
(In reply to Bill McCloskey (:billm) from comment #8)
> I found some interesting things today while trying to debug this. One thing
> I noticed is that I still crash at shutdown even if I disable async video.
> That means that we have bugs in shutdown that have nothing to do with the
> image bridge, even with the patch in bug 924622 applied.
> 
> It looks like the child process never deletes its CompositorChild instance.
> That's the first problem.

That's true and confirmed by XPCOM_MEM_LEAK_LOG, as CompositorChild has MOZ_COUNT_CTOR.

I also set a breakpoint on PCompositorChild constructor in the child process, and got it called by ContentChild::AllocPCompositorChild.

I then wondered, why is "DeallocPCompositorChild" not called?

It turns out that there is no DeallocPCompositorChild method at all.

I then wondered, how does this compile, as presumably IPDL-generated code must be trying to call a DeallocPCompositorChild method?

It turns out it's not. It also seems that PCompositorChild is not unique in this respect: a look at ContentChild.h shows other AllocP* methods not matched with a DeallocP* method, and these seem to be the top-level protocols.

This part of ipc/ipdl/ipdl/lower.py seems to be the one determining that:

http://hg.mozilla.org/mozilla-central/file/b5bdc1aaf378/ipc/ipdl/ipdl/lower.py#l2821

        for actor in channelOpenedActors:
            # add the Alloc interface for actors created when a
            # new channel is opened
            actortype = _cxxBareType(actor.asType(), actor.side)
            self.cls.addstmt(StmtDecl(MethodDecl(
                _allocMethod(actor.ptype, actor.side).name,
                params=[ Decl(Type('Transport', ptr=1), 'aTransport'),
                         Decl(Type('ProcessId'), 'aOtherProcess') ],
                ret=actortype,
                virtual=1, pure=1)))

Thus it seems to be intentional that such actors are only created, but not destroyed, by the IPDL-generated code.

Is there, then, anything that we should do here? Are we supposed to manually destroy objects (top-level protocol actors) that were automatically constructed by IPDL code, or is this leak fully intentional?

Note that, IIUC, the reason we started to care about this shutdown leak is that it was related to the present crash; but the above patch to ipc/glue/MessageChannel.* fixes this use-after-free crash, and shows that it was an ipc/glue bug not specific to compositor code. Given that, do we want to do anything else here?
This doesn't seem like the right approach to me. The IPC code is assuming that, when a MessageChannel is connected to a MessageLoop, then the MessageChannel should be closed before the MessageLoop is destroyed. That seems like a reasonable assumption to me. Even if we did change MessageChannel so that it doesn't crash if the MessageLoop is destroyed, it's still just ignoring messages from the channel once the worker dies, and that seems bad.

I do think it would make sense to take a version of your patch that asserts that we never delete a MessageLoop before a MessageChannel that's attached to it. That would make errors like this a lot easier to spot.

My understanding is that top-level protocols like PContent, PCompositor, and PImageBridge have to be destroyed manually. We're pretty careful about how this happens for ImageBridge. I think we just need to do the same thing for PCompositor. Then we need to make sure that the compositor thread stays alive until PCompositor has been fully shut down. If we do that, then everything should work.
Flags: needinfo?(wmccloskey)
Flags: needinfo?(bent.mozilla)
Attachment #8428885 - Flags: review?(wmccloskey)
Attachment #8428885 - Flags: review?(bent.mozilla)
Thanks for the great feedback. So here is why we're leaking the CompositorChild. It's supposed to be destroyed by CompositorChild::Destroy(), called by nsBaseWidget::DestroyCompositor(). But there,

Breakpoint 1, nsBaseWidget::DestroyCompositor (this=0x7ff9995d8e80)
    at /hack/mozilla-central/widget/xpwidgets/nsBaseWidget.cpp:174
174	  LayerScope::DestroyServerSocket();
(gdb) l
169	    aCompositorChild->Release();
170	}
171	
172	void nsBaseWidget::DestroyCompositor()
173	{
174	  LayerScope::DestroyServerSocket();
175	
176	  if (mCompositorChild) {
177	    mCompositorChild->SendWillStop();
178	    mCompositorChild->Destroy();
(gdb) n
176	  if (mCompositorChild) {
(gdb) p mCompositorChild
$1 = {mRawPtr = 0x0}

mCompositorChild is null, so we don't destroy any compositor.
And indeed, that nsBaseWidget code only tries to destroy the compositor that it may have created manually in nsBaseWidget::CreateCompositor() (not using IPC), but in e10s, compositor creation goes through PContent instead:

Breakpoint 1, mozilla::layers::CompositorChild::CompositorChild (
    this=0x7f690c9a3c00, aLayerManager=0x0)
    at /hack/mozilla-central/gfx/layers/ipc/CompositorChild.cpp:36
36	{
(gdb) bt 8
#0  mozilla::layers::CompositorChild::CompositorChild (this=0x7f690c9a3c00, 
    aLayerManager=0x0)
    at /hack/mozilla-central/gfx/layers/ipc/CompositorChild.cpp:36
#1  0x00007f69232c261d in mozilla::layers::CompositorChild::Create (
    aTransport=0x7f690ec32530, aOtherProcess=18085)
    at /hack/mozilla-central/gfx/layers/ipc/CompositorChild.cpp:76
#2  0x00007f6923d8baef in mozilla::dom::ContentChild::AllocPCompositorChild (
    this=0x7f691691d830, aTransport=0x7f690ec32530, aOtherProcess=18085)
    at /hack/mozilla-central/dom/ipc/ContentChild.cpp:818
#3  0x00007f69229145e6 in mozilla::dom::PContentChild::OnMessageReceived (
    this=0x7f691691d830, __msg=...) at ./PContentChild.cpp:4648
#4  0x00007f69227eb6a7 in mozilla::ipc::MessageChannel::DispatchAsyncMessage (
    this=0x7f691691d890, aMsg=...)
    at /hack/mozilla-central/ipc/glue/MessageChannel.cpp:1164
#5  0x00007f69227ea9ff in mozilla::ipc::MessageChannel::DispatchMessage (
    this=0x7f691691d890, aMsg=...)
    at /hack/mozilla-central/ipc/glue/MessageChannel.cpp:1078
#6  0x00007f69227e697e in mozilla::ipc::MessageChannel::OnMaybeDequeueOne (
    this=0x7f691691d890)
    at /hack/mozilla-central/ipc/glue/MessageChannel.cpp:1061
There is a bunch of trivially-implemented methods on CrossProcessCompositorParent saying:

  // FIXME/bug 774388: work out what shutdown protocol we need.
  virtual bool RecvRequestOverfill() MOZ_OVERRIDE { return true; }
  virtual bool RecvWillStop() MOZ_OVERRIDE { return true; }
  virtual bool RecvStop() MOZ_OVERRIDE { return true; }
  virtual bool RecvPause() MOZ_OVERRIDE { return true; }
  virtual bool RecvResume() MOZ_OVERRIDE { return true; }

So I'm realizing we're not just talking about fixing things --- we're talking about _implementing_ things :-)
Depends on: 774388
The patch on bug 774388 comment 25 fixes this for me locally and it seems on tryserver. Work continues on bug 774388, not here.
This started on inbound again today after bug 774388 landed. I'm also about to file another frequent mochitest-e10s-2 shutdown crash that also started after that push.
Actually, maybe I can repurpose this bug for it.
Summary: Intermittent e10s mochitest-2 Shutdown | application crashed [@ libc-2.15.so + 0x36445][@ linux-gate.so + 0x424] with Experiments.jsm errors in the log prior and libpthread on the stack → Intermittent e10s mochitest-2 PROCESS-CRASH | Shutdown | application crashed [@ linux-gate.so + 0x424] or application crashed [@ libc-2.15.so + 0x36445] with Experiments.jsm errors in the log prior and libpthread/libc on the stack
https://tbpl.mozilla.org/php/getParsedLog.php?id=43074746&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=43078331&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=43079088&tree=Mozilla-Inbound

I'm afraid that at the current frequency, we're going to need to consider a backout :(
Assignee: nobody → bjacob
Flags: needinfo?(bjacob)
Hardware: x86_64 → All
Target Milestone: mozilla32 → ---
Flags: needinfo?(bjacob)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #31)
> Actually, maybe I can repurpose this bug for it.

I re-landed today and this is now happening again about 25% of the time on e10s mochitest-2.

I am hesitant about backing out. Aside from this, my push, with lots of retriggers for everything that was intermittent, is now solid green. There are two or three big intermittents that we have specific good reason to think are fixed for good by this push, including bug 924622.

https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=d2e7bd70dd95
A 25% failure rate means that mochitest-e10s-2 doesn't meet our job visibility requirements (i.e. we'd move it to hidden by default). I guess that's something to discuss with billm.
Flags: needinfo?(wmccloskey)
I guess I'm okay with temporarily disabling e10s M-2. However, the purpose of all this work is to increase e10s test coverage, not decrease it. So we should only disable it if it gets us closer to that goal.
Flags: needinfo?(wmccloskey)
Yeah, I understand.

The other thing I forgot to mention is that I think the bug really changed, i.e. the issue we're seeing at the moment is no longer the one that was originally investigated here.

Both manifest in the same way: this crash stack occurs when a thread dies sooner than expected and a MessageChannel is still referencing its message loop as its mWorkerLoop.

That used to happen here with the compositor thread dying too soon, and Bill and I could reproduce it with Bill's above debugging patch, adding a sleep call.

Now I cannot reproduce it at all anymore in this way (even with a longer sleep time), and we also have an excellent reason to believe that the compositor thread is no longer dying too soon.

But if _any_ other thread is still dying too soon during shutdown, then we will still get similar crashes with the same stack and, at the moment, no way to tell from the logs what thread or MessageChannel is involved.

So my best understanding is that there are other such issues remaining in the e10s shutdown sequence and my patches landed today expose them by reordering the shutdown sequence.
(In reply to Bill McCloskey (:billm) from comment #41)
> I guess I'm okay with temporarily disabling e10s M-2. However, the purpose
> of all this work is to increase e10s test coverage, not decrease it. So we
> should only disable it if it gets us closer to that goal.

Thanks for considering that --- I think we're in one of these tough spots where we might have to take one step back to get out of a bad situation.

Though if we disable e10s M-2, we should probably at the same time think about how to make this actionable (by landing patches making logs more specific, etc.).

(Incidentally, we already know of a list of threads that do not have a proper shutdown sequence, as was discovered in bug 1008254, another big intermittent that was recently fixed. See the list of threads in bug 1008254 comment 395 or, equivalently, the list of blockers of bug 1033577.)
Bill: or we could take another look at the patch in comment 19 and consider landing a variant of it... it's not great to hide actual bugs, but that might be better than having mochitest-2 disabled...
Flags: needinfo?(wmccloskey)
Also, there might be useful information in the fact that this always seems to occur just after Browser.Experiments.Experiments tests...
Nevermind comment 45, this is a shutdown crash.

Regarding comment 44, here is a try push with the patch from comment 19:
https://tbpl.mozilla.org/?tree=Try&rev=5c7d85eb3bd1
FWIW, hidden by default isn't the same as disabled :). The jobs will still be running and results visible by adding &showall=1 to the TBPL URL. That said, nobody actively watches that, so any bustage won't be caught.
Yeah, I understand. That's why I think that landing a variant of the comment 19 patch would be a better compromise than hiding. Let's see what the comment 47 try run says.
(In reply to Benoit Jacob [:bjacob] from comment #49)
> Yeah, I understand. That's why I think that landing a variant of the comment
> 19 patch would be a better compromise than hiding. Lets see what the comment
> 47 try run says.

This is solid green as expected: https://tbpl.mozilla.org/?tree=Try&rev=5c7d85eb3bd1 . So we already have at least this possible way out (though it needs some work to be landable).

Next, here is another try push to determine whether the thread that is dying too soon is the compositor thread (as it was when this bug was filed) or not: https://tbpl.mozilla.org/?tree=Try&rev=ab376245700f
https://tbpl.mozilla.org/?tree=Try&rev=ab376245700f shows two things:
 - the use-after-free'd MessageLoop is *still* the compositor loop
 - the CompositorParent.cpp code that is supposed to be the only place destroying the compositor thread has *not* run.

Something is killing our compositor thread, and we don't know what....

but one thing is for sure, I should back out my patches of today, because they should prevent such a possibility, so they are not doing their job right.
Flags: needinfo?(wmccloskey)
Ah no, I missed a line of the log. We *are* going through the only place destroying the compositor thread in CompositorParent.cpp. Whew!

So the problem becomes one of figuring out what else we should be blocking on before we proceed to destroy the compositor thread. Looking.
Got it. In GDB, I printed the CompositorLoop and set conditional breakpoints in MessageChannel.cpp where mWorkerLoop is set, firing only when it is set to point to the CompositorLoop. Each time one fired, I set a conditional breakpoint in MessageChannel::~MessageChannel for when the 'this' pointer equals what it was when mWorkerLoop was set to point to the CompositorLoop.

In this way, during normal shutdown, we can determine the complete list of protocols that have ever used the CompositorLoop as their mWorkerLoop. As expected, we got [CrossProcess]CompositorParent actors... the unexpected thing was that we also got ImageBridgeParent. I didn't do anything to block the destruction of the compositor loop on ImageBridge. That's got to be our problem here.
https://tbpl.mozilla.org/?tree=Try&rev=caee2cd3ff6c is solid green. So this definitely was the issue and we might be just one r+ away from relanding. Going to make a full try push with lots of retriggers though.
https://tbpl.mozilla.org/?tree=Try&rev=81a08416e5a5 is solid green even after lots of retriggers. Will reland as soon as I get a r+ on the last patch on bug 774388.
Bug 774388 has relanded and is looking solid green.
Status: REOPENED → RESOLVED
Closed: 6 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33