Closed
Bug 1007284
Opened 10 years ago
Closed 10 years ago
Intermittent e10s mochitest-2 PROCESS-CRASH | Shutdown | application crashed [@ linux-gate.so + 0x424] or application crashed [@ libc-2.15.so + 0x36445] with Experiments.jsm errors in the log prior and libpthread/libc on the stack
Categories
(Core :: IPC, defect)
Tracking
()
RESOLVED
FIXED
mozilla33
Tracking | Status | |
---|---|---|
e10s | + | --- |
firefox30 | --- | unaffected |
firefox31 | --- | unaffected |
firefox32 | --- | fixed |
firefox33 | --- | fixed |
firefox-esr24 | --- | unaffected |
People
(Reporter: RyanVM, Assigned: bjacob)
References
(Blocks 1 open bug)
Details
(Keywords: assertion, crash, intermittent-failure)
Attachments
(6 files, 1 obsolete file)
11.02 KB,
patch
|
Details | Diff | Splinter Review | |
1.97 KB,
patch
|
Details | Diff | Splinter Review | |
240.44 KB,
text/plain
|
Details | |
200.76 KB,
text/plain
|
Details | |
51.94 KB,
patch
|
Details | Diff | Splinter Review | |
6.03 KB,
patch
|
Details | Diff | Splinter Review |
This happening on an e10s mochitest runs seems pertinent. https://tbpl.mozilla.org/php/getParsedLog.php?id=39221652&tree=Mozilla-Inbound Ubuntu VM 12.04 x64 mozilla-inbound opt test mochitest-e10s-2 on 2014-05-07 11:04:23 PDT for push 85924e72f778 slave: tst-linux64-spot-460 11:11:07 INFO - 6362 INFO TEST-START | Shutdown 11:11:07 INFO - 6363 INFO Passed: 180635 11:11:07 INFO - 6364 INFO Failed: 0 11:11:07 INFO - 6365 INFO Todo: 24207 11:11:07 INFO - 6366 INFO Slowest: 46455ms - /tests/dom/imptests/editing/conformancetest/test_runtest.html 11:11:07 INFO - 6367 INFO SimpleTest FINISHED 11:11:07 INFO - 6368 INFO TEST-INFO | Ran 1 Loops 11:11:07 INFO - 6369 INFO SimpleTest FINISHED 11:11:07 INFO - ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID 11:11:07 INFO - ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID 11:11:07 INFO - ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID 11:11:07 INFO - ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID 11:11:07 INFO - ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID 11:11:07 INFO - 1399486267654 Browser.Experiments.Experiments TRACE PreviousExperimentProvider #0::shutdown() 11:11:07 INFO - 1399486267660 Browser.Experiments.Experiments TRACE Experiments #0::uninit: started 11:11:07 INFO - 1399486267666 Browser.Experiments.Experiments TRACE Experiments #0::uninit: finished with _loadTask 11:11:07 INFO - 1399486267667 Browser.Experiments.Experiments TRACE Experiments #0::uninit: no previous shutdown 11:11:07 INFO - 1399486267668 Browser.Experiments.Experiments TRACE Experiments #0::Unregistering instance with Addon Manager. 11:11:07 INFO - 1399486267668 Browser.Experiments.Experiments TRACE Experiments #0::Unregistering previous experiment add-on provider. 11:11:07 INFO - 1399486267670 Browser.Experiments.Experiments TRACE PreviousExperimentProvider #0::shutdown() 11:11:07 INFO - 1399486267670 addons.manager ERROR Exception calling provider shutdown: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: resource://app/modules/experiments/Experiments.jsm :: this.Experiments.PreviousExperimentProvider.prototype<.shutdown :: line 2071" data: no] Stack trace: this.Experiments.PreviousExperimentProvider.prototype<.shutdown()@resource://app/modules/experiments/Experiments.jsm:2071 < callProvider()@resource://gre/modules/AddonManager.jsm:192 < AMI_unregisterProvider()@resource://gre/modules/AddonManager.jsm:848 < AMP_unregisterProvider()@resource://gre/modules/AddonManager.jsm:2326 < Experiments.Experiments.prototype._unregisterWithAddonManager()@resource://app/modules/experiments/Experiments.jsm:496 < Experiments.Experiments.prototype.uninit<()@resource://app/modules/experiments/Experiments.jsm:442 < TaskImpl_run()@resource://gre/modules/Task.jsm:282 < TaskImpl_handleResultValue()@resource://gre/modules/Task.jsm:338 < TaskImpl_run()@resource://gre/modules/Task.jsm:290 < TaskImpl()@resource://gre/modules/Task.jsm:247 < createAsyncFunction/asyncFunction()@resource://gre/modules/Task.jsm:224 < Spinner.prototype.observe()@resource://gre/modules/AsyncShutdown.jsm:320 < <file:unknown> 11:11:07 INFO - 1399486267673 Browser.Experiments.Experiments INFO Experiments #0::Completed uninitialization. 11:11:07 INFO - firefox: tpp.c:63: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed. 11:11:08 INFO - TEST-INFO | Main app process: killed by SIGIOT 11:11:08 WARNING - TEST-UNEXPECTED-FAIL | Shutdown | application terminated with exit code 6 11:11:08 INFO - INFO | runtests.py | Application ran for: 0:05:11.881747 11:11:08 INFO - INFO | zombiecheck | Reading PID log: /tmp/tmp2ecMbtpidlog 11:11:08 INFO - ==> process 2437 launched child process 2477 11:11:08 INFO - INFO | zombiecheck | Checking for orphan process with PID: 2477 11:11:08 INFO - mozcrash INFO | Downloading symbols from: https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1399484774/firefox-32.0a1.en-US.linux-x86_64.crashreporter-symbols.zip 11:11:17 WARNING - PROCESS-CRASH | Shutdown | application crashed [@ libc-2.15.so + 0x36445] 11:11:17 INFO - Crash dump filename: /tmp/tmpyoZjqK/minidumps/06af5307-14a4-89dd-6324fbe4-3f59ec5b.dmp 11:11:17 INFO - Operating system: Linux 11:11:17 INFO - 0.0.0 Linux 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 11:11:17 INFO - CPU: amd64 11:11:17 INFO - family 6 model 62 stepping 4 11:11:17 INFO - 1 CPU 11:11:17 INFO - Crash reason: SIGABRT 11:11:17 INFO - Crash address: 0x3e800000985 11:11:17 INFO - Thread 3 (crashed) 11:11:17 INFO - 0 libc-2.15.so + 0x36445 11:11:17 INFO - rbx = 0x00007f2539d30000 r12 = 0x00007f2548543e30 11:11:17 INFO - r13 = 0x00007f2548543f00 r14 = 0x00000000000008f2 11:11:17 INFO - r15 = 0x00000000ffffffff rip = 0x00007f254758d445 11:11:17 INFO - rsp = 0x00007f2537d50838 rbp = 0x00007f2548543e24 11:11:17 INFO - Found by: given as instruction pointer in context 11:11:17 INFO - 1 libc-2.15.so + 0x39baa 11:11:17 INFO - rip = 0x00007f2547590bab rsp = 0x00007f2537d50840 11:11:17 INFO - rbp = 0x00007f2548543e24 11:11:17 INFO - Found by: stack scanning 11:11:17 INFO - 2 libpthread-2.15.so + 0x11e2f 11:11:17 INFO - rip = 0x00007f2548543e30 rsp = 0x00007f2537d50848 11:11:17 INFO - rbp = 0x00007f2548543e24 11:11:17 INFO - Found by: stack scanning 11:11:17 INFO - 3 libc-2.15.so + 0x17bb1f 11:11:17 INFO - rip = 0x00007f25476d2b20 rsp = 0x00007f2537d50850 11:11:17 INFO - rbp = 0x00007f2548543e24 11:11:17 INFO - Found by: stack scanning 11:11:17 INFO - 4 libc-2.15.so + 0x6df51 11:11:17 INFO - rip = 0x00007f25475c4f52 rsp = 0x00007f2537d50870 11:11:17 INFO - rbp = 0x00007f2548543e24 11:11:17 INFO - Found by: stack scanning 11:11:17 INFO - 5 libpthread-2.15.so + 0x11e2f 11:11:17 INFO - rip = 0x00007f2548543e30 rsp = 0x00007f2537d50878 11:11:17 INFO - rbp = 0x00007f2548543e24 11:11:17 INFO - Found by: stack scanning 11:11:17 INFO - 6 libc-2.15.so + 0x179b14 11:11:17 INFO - rip = 0x00007f25476d0b15 rsp = 0x00007f2537d508d0 11:11:17 INFO - rbp = 0x00007f2548543e24 11:11:17 INFO - Found by: stack scanning
Comment 1•10 years ago
|
||
I believe that the error in question is __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed. I doubt that this has anything to do with the experiments code. The NS_ERROR_FAILURE from http://hg.mozilla.org/mozilla-central/annotate/417acde736e7/browser/experiments/Experiments.jsm#l2071 is an interesting failure but not the cause of the crash. The content process should have committed suicide at ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID and that's a bug, but not the proximate cause of the crash. The chrome-process crash is an abort() from the I/O thread. The main thread is shutting down in ParentImpl::ShutdownBackgroundThread The I/O thread is at ProcessLink::OnChannelError -> MessageChannel::PostErrorNotifyTask -> MessageLook::PostTask_Helper. PostTask_Helper appears to be locking bogus memory. Hard to say much more than that. Might be a good candidate for `rr` debugging if it happens much.
Component: Client: Desktop → IPC
Product: Firefox Health Report → Core
Reporter | ||
Comment 2•10 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=39228865&tree=Mozilla-Inbound Looks like I'm going to have to start bisecting for a culprit now...
Summary: Intermittent e10s Shutdown | application crashed [@ libc-2.15.so + 0x36445] with Experiments.jsm errors in the log prior and libpthread on the stack → Intermittent e10s Shutdown | application crashed [@ libc-2.15.so + 0x36445][@ linux-gate.so + 0x424] with Experiments.jsm errors in the log prior and libpthread on the stack
Reporter | ||
Updated•10 years ago
|
Summary: Intermittent e10s Shutdown | application crashed [@ libc-2.15.so + 0x36445][@ linux-gate.so + 0x424] with Experiments.jsm errors in the log prior and libpthread on the stack → Intermittent e10s mochitest-2 Shutdown | application crashed [@ libc-2.15.so + 0x36445][@ linux-gate.so + 0x424] with Experiments.jsm errors in the log prior and libpthread on the stack
Reporter | ||
Comment 3•10 years ago
|
||
And BTW, why do I have this sinking feeling that bug 924622 is somehow involved?
Reporter | ||
Comment 4•10 years ago
|
||
Bug 880864 is another possibility. Retriggers running.
Reporter | ||
Comment 5•10 years ago
|
||
Retriggers conclusively point to bug 924622 as the cause. Backed out.
Status: NEW → RESOLVED
Closed: 10 years ago
status-firefox30:
--- → unaffected
status-firefox31:
--- → unaffected
status-firefox32:
--- → fixed
status-firefox-esr24:
--- → unaffected
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Comment 6•10 years ago
|
||
Bill: Nical is trying to fix bug 924622 but his fix hits this e10s mochitest crash. Do you have any suggestions?
Flags: needinfo?(wmccloskey)
I'm occasionally able to reproduce this in Linux debug by running an e10s browser (setting browser.tabs.remote.autostart and restarting), browsing around for a while, and quitting. I'll try to find a way to make it reproduce more consistently.
Flags: needinfo?(wmccloskey)
I found some interesting things today while trying to debug this. One thing is noticed is that I still crash at shutdown even if I disable async video. That means that we have bugs in shutdown that have nothing to do with the image bridge, even with the patch in bug 924622 applied. It looks like the child process never deletes its CompositorChild instance. That's the first problem. In the parent, the CrossProcessCompositorParent that's associated with the leaked CompositorChild gets its ActorDestroy method called when we get a channel error, which presumably happens when the child has exited. However, it looks like we might have already killed off the compositor thread by then: we increment the compositor thread's refcount in the CompositorParent constructor, but not in the CrossProcessCompositorParent constructor. If the compositor thread has been destroyed, I think we'll crash when we try to handle the channel error. That's the second problem. So it seems like we need to have a proper way to shut down the CompositorChild in the child process, and then we need to wait until that's finished before we delete the compositor thread in the parent. Nical, would you be able to work on this by any chance? I still don't understand the code very well. It should make it a lot easier to land bug 924622, too.
This patch seems to make crashes more frequent because of the sleep call. My STR is to open Firefox in an e10s profile and then close it once about:home has loaded. It crashes about half the time.
Forgot to set needinfo.
Flags: needinfo?(nical.bugzilla)
Blocks: e10s-gfx
Assignee | ||
Comment 11•10 years ago
|
||
Taking. But... why is the status set to RESOLVED FIXED?
Flags: needinfo?(nical.bugzilla) → needinfo?(ryanvm)
The bug was filed as a regression from bug 924622 landing. When that was backed out, this bug was closed. We still want to fix them both though.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter | ||
Updated•10 years ago
|
Flags: needinfo?(ryanvm)
Assignee | ||
Comment 13•10 years ago
|
||
Assignee | ||
Comment 14•10 years ago
|
||
This removes the crash, and lets shutdown complete normally. Just FYI. I'm still very unfamiliar with this code and don't know whether this patch makes sense. I just thought that MessageChannel should not be induced into doing such internal use-after-free's, even in the face of abnormal usage by the compositor code. I haven't looked at this bug from the Compositor perspective yet, will do that next.
Assignee | ||
Comment 15•10 years ago
|
||
Assignee | ||
Comment 16•10 years ago
|
||
Assignee | ||
Comment 17•10 years ago
|
||
Forgot a hg add.
Attachment #8427972 -
Attachment is obsolete: true
Assignee | ||
Updated•10 years ago
|
Attachment #8427979 -
Attachment description: Strawman: avoid use-after-free of MessageChannel::mWorkerLoop → Expanded debugging patch (same sleep(3) call, more logging)
Assignee | ||
Updated•10 years ago
|
Attachment #8427969 -
Attachment is obsolete: true
Assignee | ||
Updated•10 years ago
|
Attachment #8427972 -
Attachment is obsolete: false
Assignee | ||
Comment 18•10 years ago
|
||
This is unaffected by the recent landing of bug 924622: - this shutdown crash persists after the landing of bug 924622; - the 'strawman' fix here continues to avert the crash.
Assignee | ||
Comment 19•10 years ago
|
||
Some comments inside this patch should clarify what this is doing. It's sad to be leaking in MessageChannel::Clear() called on wrong thread, but I don't see an immediate fix for this --- see comments. Ideas welcome! Ben: the issue that this intends to fix is the use-after-free that is apparent in above logs --- MessageChannel::mWorkerLoop becomes a dangling pointer during shutdown as the thread finishes. This is only a stopgap fix. The deeper issue is that we have a bunch of code in ipc/glue, especially MessageChannel and MessageLink, that are holding raw MessageLoop* pointers, but the MessageLoop is typically a stack variable in the thread main function which sits at the bottom of the thread's call stack... so having all this code use these raw MessageLoop* pointers, both on this thread and on another thread (the link thread), is perhaps a bit dangerous. We could either 1) stick to this approach, but rationalize a way for it to be non-crashy, which would have to involve some synchronization during shutdown (sad); or 2) give up on this approach, and modify ipc/glue code to avoid directly touching the worker MessageLoop from other threads (as it currently does in MessageChannel::PostErrorNotifyTask); or 3) Make MessageLoop refcounted and allocate it on the heap so that it could survive its thread, so that other threads directly referencing the MessageLoop could avoid crashing.
Attachment #8428885 -
Flags: review?(wmccloskey)
Attachment #8428885 -
Flags: review?(bent.mozilla)
Flags: needinfo?(wmccloskey)
Flags: needinfo?(bent.mozilla)
Assignee | ||
Comment 20•10 years ago
|
||
(In reply to Bill McCloskey (:billm) from comment #8) > I found some interesting things today while trying to debug this. One thing > is noticed is that I still crash at shutdown even if I disable async video. > That means that we have bugs in shutdown that have nothing to do with the > image bridge, even with the patch in bug 924622 applied. > > It looks like the child process never deletes its CompositorChild instance. > That's the first problem. That's true and confirmed by XPCOM_MEM_LEAK_LOG, as CompositorChild has MOZ_COUNT_CTOR. I also set a breakpoint on PCompositorChild constructor in the child process, and got it called by ContentChild::AllocPCompositorChild. I then wondered, why is "DeallocPCompositorChild" not called? It turns out that there is no DeallocPCompositorChild method at all. I then wondered, how does this compile, as presumably IPDL-generated code must be trying to call a DeallocPCompositorChild method? It turns out it's not. It also seems that PCompositorChild is not unique in this respect: a look at ContentChild.h shows other AllocP* methods not matched with a DeallocP* method, and these seem to be the top-level protocols. This part of ipc/ipdl/ipdl/lower.py seems to be the one determining that: http://hg.mozilla.org/mozilla-central/file/b5bdc1aaf378/ipc/ipdl/ipdl/lower.py#l2821 for actor in channelOpenedActors: # add the Alloc interface for actors created when a # new channel is opened actortype = _cxxBareType(actor.asType(), actor.side) self.cls.addstmt(StmtDecl(MethodDecl( _allocMethod(actor.ptype, actor.side).name, params=[ Decl(Type('Transport', ptr=1), 'aTransport'), Decl(Type('ProcessId'), 'aOtherProcess') ], ret=actortype, virtual=1, pure=1))) Thus it seems to be intentional that such actors are only created, but not destroyed, by the IPDL-generated code. Is there, then, anything that we should do here? Are we supposed to manually destroyed objects (top-level protocol actors) that were automatically constructed by IPDL code, or is this leak fully intentional? Note that IIUC the reason why we started to care about this shutdown leak is that it was related to the present crash; but the above patch to ipc/glue/MessageChannel.* fixes this use-after-free crash, and shows that it was a ipc/glue bug not specific to compositor code. Given that, do we want to do anything else here?
This doesn't seem like the right approach to me. The IPC code is assuming that, when a MessageChannel is connected to a MessageLoop, then the MessageChannel should be closed before the MessageLoop is destroyed. That seems like a reasonable assumption to me. Even if we did change MessageChannel so that it doesn't crash if the MessageLoop is destroyed, it's still just ignoring messages from the channel once the worker dies, and that seems bad. I do think it would make sense to take a version of your patch that asserts that we never delete a MessageLoop before a MessageChannel that's attached to it. That would make errors like this a lot easier to spot. My understanding is that top-level protocols like PContent, PCompositor, and PImageBridge have to be destroyed manually. We're pretty careful about how this happens for ImageBrige. I think we just need to do the same thing for PCompositor. Then, we need to make sure that the compositor thread stays alive until PCompositor has been fully shut down. If we do that, then everything should work.
Flags: needinfo?(wmccloskey)
Assignee | ||
Updated•10 years ago
|
Flags: needinfo?(bent.mozilla)
Assignee | ||
Updated•10 years ago
|
Attachment #8428885 -
Flags: review?(wmccloskey)
Attachment #8428885 -
Flags: review?(bent.mozilla)
Assignee | ||
Comment 22•10 years ago
|
||
Thanks for the great feedback. So here is why we're leaking the CompositorChild. It's supposed to be destroyed by CompositorChild::Destroy(), called by nsBaseWidget::DestroyCompositor(). But there, Breakpoint 1, nsBaseWidget::DestroyCompositor (this=0x7ff9995d8e80) at /hack/mozilla-central/widget/xpwidgets/nsBaseWidget.cpp:174 174 LayerScope::DestroyServerSocket(); (gdb) l 169 aCompositorChild->Release(); 170 } 171 172 void nsBaseWidget::DestroyCompositor() 173 { 174 LayerScope::DestroyServerSocket(); 175 176 if (mCompositorChild) { 177 mCompositorChild->SendWillStop(); 178 mCompositorChild->Destroy(); (gdb) n 176 if (mCompositorChild) { (gdb) p mCompositorChild $1 = {mRawPtr = 0x0} mCompositorChild is null, so we don't destroy any compositor.
Assignee | ||
Comment 23•10 years ago
|
||
And indeed, that nsBaseWidget code only tries to destroy the compositor that it may have created manually in nsBaseWidget::CreateCompositor() (not using IPC), but in e10s, compositor creation goes through PContent instead: Breakpoint 1, mozilla::layers::CompositorChild::CompositorChild ( this=0x7f690c9a3c00, aLayerManager=0x0) at /hack/mozilla-central/gfx/layers/ipc/CompositorChild.cpp:36 36 { (gdb) bt 8 #0 mozilla::layers::CompositorChild::CompositorChild (this=0x7f690c9a3c00, aLayerManager=0x0) at /hack/mozilla-central/gfx/layers/ipc/CompositorChild.cpp:36 #1 0x00007f69232c261d in mozilla::layers::CompositorChild::Create ( aTransport=0x7f690ec32530, aOtherProcess=18085) at /hack/mozilla-central/gfx/layers/ipc/CompositorChild.cpp:76 #2 0x00007f6923d8baef in mozilla::dom::ContentChild::AllocPCompositorChild ( this=0x7f691691d830, aTransport=0x7f690ec32530, aOtherProcess=18085) at /hack/mozilla-central/dom/ipc/ContentChild.cpp:818 #3 0x00007f69229145e6 in mozilla::dom::PContentChild::OnMessageReceived ( this=0x7f691691d830, __msg=...) at ./PContentChild.cpp:4648 #4 0x00007f69227eb6a7 in mozilla::ipc::MessageChannel::DispatchAsyncMessage ( this=0x7f691691d890, aMsg=...) at /hack/mozilla-central/ipc/glue/MessageChannel.cpp:1164 #5 0x00007f69227ea9ff in mozilla::ipc::MessageChannel::DispatchMessage ( this=0x7f691691d890, aMsg=...) at /hack/mozilla-central/ipc/glue/MessageChannel.cpp:1078 #6 0x00007f69227e697e in mozilla::ipc::MessageChannel::OnMaybeDequeueOne ( this=0x7f691691d890) at /hack/mozilla-central/ipc/glue/MessageChannel.cpp:1061
Assignee | ||
Comment 24•10 years ago
|
||
There is a bunch of trivially-impl CrossProcessCompositorParent saying: // FIXME/bug 774388: work out what shutdown protocol we need. virtual bool RecvRequestOverfill() MOZ_OVERRIDE { return true; } virtual bool RecvWillStop() MOZ_OVERRIDE { return true; } virtual bool RecvStop() MOZ_OVERRIDE { return true; } virtual bool RecvPause() MOZ_OVERRIDE { return true; } virtual bool RecvResume() MOZ_OVERRIDE { return true; } So I'm realizing, we're not just talking of fixing things --- we're talking of _implementing_ things :-)
Assignee | ||
Comment 25•10 years ago
|
||
The patch on bug 774388 comment 25 fixes this for me locally and it seems on tryserver. Work continues on bug 774388, not here.
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Reporter | ||
Comment 30•10 years ago
|
||
This started on inbound again today after bug 774388 landed. I'm also about to file another frequent mochitest-e10s-2 shutdown crash that also started after that push.
Reporter | ||
Comment 31•10 years ago
|
||
Actually, maybe I can repurpose this bug for it.
Summary: Intermittent e10s mochitest-2 Shutdown | application crashed [@ libc-2.15.so + 0x36445][@ linux-gate.so + 0x424] with Experiments.jsm errors in the log prior and libpthread on the stack → Intermittent e10s mochitest-2 PROCESS-CRASH | Shutdown | application crashed [@ linux-gate.so + 0x424] or application crashed [@ libc-2.15.so + 0x36445] with Experiments.jsm errors in the log prior and libpthread/libc on the stack
Comment hidden (Legacy TBPL/Treeherder Robot) |
Reporter | ||
Comment 33•10 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=43074746&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=43078331&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=43079088&tree=Mozilla-Inbound I'm afraid that at the current frequency, we're going to need to consider a backout :(
Assignee: nobody → bjacob
status-firefox33:
--- → affected
Flags: needinfo?(bjacob)
Hardware: x86_64 → All
Target Milestone: mozilla32 → ---
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Updated•10 years ago
|
Flags: needinfo?(bjacob)
Assignee | ||
Comment 39•10 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #31) > Actually, maybe I can repurpose this bug for it. I re-landed today and this is now again happening about 25% of the time on e10s mochitest-2. I am hesitating whether to back out. Aside from this, my push with lots of retriggers for all what was intermittent, is now solid green. There are two or three big intermittents that we have a specific good reason to think are fixed for good by this push, including bug 924622. https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=d2e7bd70dd95
Reporter | ||
Comment 40•10 years ago
|
||
A 25% failure rate means that mochitest-e10s-2 doesn't meet our job visibility requirements (i.e. we'd move it to hidden by default). I guess that's something to discuss with billm.
Flags: needinfo?(wmccloskey)
I guess I'm okay with temporarily disabling e10s M-2. However, the purpose of all this work is to increase e10s test coverage, not decrease it. So we should only do disable if it gets us closer to that goal.
Flags: needinfo?(wmccloskey)
Assignee | ||
Comment 42•10 years ago
|
||
Yeah, I understand. The other thing I forgot to mention is I think the bug really changed i.e. the issue we're seeing at the moment isn't anymore the one that was originally investigated here. They manifest in the same way: this crash stack occurs when a thread dies sooner than expected and a MessageChannel is still referencing its message loop as its mWorkerLoop. That used to happen here with the compositor thread dying too soon, and Bill and me could reproduce it with Bill's above debugging patch, adding a sleep call. Now I cannot reproduce at all anymore in this way (even with a longer sleep time) and we also have an excellent reason to believe that the compositor thread is no longer dying too soon. But it _any_ other thread is still dying too soon during shutdown, then we will still get similar crashes with the same stack, and, at the moment, no way to tell from the logs what thread or messageChannel is involved. So my best understanding is that there are other such issues remaining in the e10s shutdown sequence and my patches landed today expose them by reordering the shutdown sequence.
Assignee | ||
Comment 43•10 years ago
|
||
(In reply to Bill McCloskey (:billm) from comment #41) > I guess I'm okay with temporarily disabling e10s M-2. However, the purpose > of all this work is to increase e10s test coverage, not decrease it. So we > should only do disable if it gets us closer to that goal. Thanks for considering that --- I think we're in one of these tough spots where we might have to take one step back to get out of a bad situation. Though if we disable e10s m2, we should probably at the same time think of how to make this actionable (by landing patches making logs more specific, etc). (Incidentally, we do know already of a list of threads that do not have a proper shutdown sequence, as was discovered on bug 1008254. (Another big intermittent that was recently fixed. See the list of threads in bug 1008254 comment 395 or equivalently the list of blockers of Bug 1033577.)
Assignee | ||
Comment 44•10 years ago
|
||
Bill: or we could take another look at the patch in comment 19 and consider landing a variant of it... it's not great to hide actual bugs, but that might be better than having mochitest-2 disabled...
Assignee | ||
Updated•10 years ago
|
Flags: needinfo?(wmccloskey)
Assignee | ||
Comment 45•10 years ago
|
||
Also, there might be useful information in the fact that this always seems to occur just after Browser.Experiments.Experiments tests...
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 47•10 years ago
|
||
Nevermind comment 45, this is a shutdown crash. Regarding comment 44, here is a try push with the patch from comment 19: https://tbpl.mozilla.org/?tree=Try&rev=5c7d85eb3bd1
Reporter | ||
Comment 48•10 years ago
|
||
FWIW, hidden by default isn't the same as disabled :). The jobs will still be running and results visible by adding &showall=1 to the TBPL URL. That said, nobody actively watches that, so any bustage won't be caught.
Assignee | ||
Comment 49•10 years ago
|
||
Yeah, I understand. That's why I think that landing a variant of the comment 19 patch would be a better compromise than hiding. Lets see what the comment 47 try run says.
Assignee | ||
Comment 50•10 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #49) > Yeah, I understand. That's why I think that landing a variant of the comment > 19 patch would be a better compromise than hiding. Lets see what the comment > 47 try run says. This is solid green as expected. https://tbpl.mozilla.org/?tree=Try&rev=5c7d85eb3bd1 . So we have already at least this possible way out. (though this needs some work to be landable) Next, here is another try push to determine whether the thread that is dying too soon is the compositor thread (as it was when this bug was filed ) or not: https://tbpl.mozilla.org/?tree=Try&rev=ab376245700f
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 52•10 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=ab376245700f shows two things: - the used-after-free MessageLoop is *still* the compositorloop - the CompositorParent.cpp code that is supposed to be the only place destroying the compositor thread, has *not* run. Something is killing our compositor thread, and we don't know what.... but one thing is for sure, I should back out my patches of today, because they should prevent such a possibility, so they are not doing their job right.
Flags: needinfo?(wmccloskey)
Assignee | ||
Comment 53•10 years ago
|
||
Backed out bug 774388 and bug 1028383.
Assignee | ||
Comment 54•10 years ago
|
||
Ah no, I missed a line of the log. We *are* going through the only place destroying the compositor thread in CompositorParent.cpp. Whew! So the problem becomes one of figuring what else we should be blocking on, before we proceed to destroy the compositor thread. Looking.
Assignee | ||
Comment 55•10 years ago
|
||
Got it. In GDB, printed the CompositorLoop and set conditional breakpoints in MessageChannel.cpp where mWorkerLoop is set, if it is set to point to the CompositorLoop. Everytime, I set a conditional breakpoint in MessageChannel::~MessageChannel if the 'this' pointer equals what it was when mWorkerLoop was set to point to the CompositorLoop. In this way, during normal shutdown, we can tell the complete list of protocols having ever used the CompositorLoop as their mWorkerLoop. As expected, we got [CrossProcess]CompositorParent actors.... the unexpected thing was that we also got ImageBridgeParent. I didn't do anything to block the destruction of the compositor loop on imagebridge. That's got to be our problem here.
Assignee | ||
Comment 56•10 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=caee2cd3ff6c is solid green. So this definitely was the issue and we might be just one r+ away from relanding. Going to make a full try push with lots of retriggers though.
Assignee | ||
Comment 57•10 years ago
|
||
Big try push: https://tbpl.mozilla.org/?tree=Try&rev=a7dfecd93344
Comment hidden (Legacy TBPL/Treeherder Robot) |
Assignee | ||
Comment 59•10 years ago
|
||
Was missing a #include. New try: https://tbpl.mozilla.org/?tree=Try&rev=81a08416e5a5
Assignee | ||
Comment 60•10 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=81a08416e5a5 is solid green even after lots of retriggers. Will reland as soon as I get a r+ on the last patch on bug 774388.
Assignee | ||
Comment 61•10 years ago
|
||
But 774388 has relanded, and is looking solid green.
Updated•10 years ago
|
tracking-e10s:
--- → ?
Updated•10 years ago
|
Blocks: e10s-tests
Assignee | ||
Updated•10 years ago
|
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Reporter | ||
Updated•10 years ago
|
Target Milestone: --- → mozilla33
You need to log in
before you can comment on or make changes to this bug.
Description
•