Closed Bug 1007284 Opened 6 years ago Closed 5 years ago

Intermittent e10s mochitest-2 PROCESS-CRASH | Shutdown | application crashed [@ linux-gate.so + 0x424] or application crashed [@ libc-2.15.so + 0x36445] with Experiments.jsm errors in the log prior and libpthread/libc on the stack

Categories

(Core :: IPC, defect)

All
Linux
defect
Not set

Tracking


RESOLVED FIXED
mozilla33
Tracking Status
e10s + ---
firefox30 --- unaffected
firefox31 --- unaffected
firefox32 --- fixed
firefox33 --- fixed
firefox-esr24 --- unaffected

People

(Reporter: RyanVM, Assigned: bjacob)

References

(Blocks 1 open bug)

Details

(Keywords: assertion, crash, intermittent-failure)

Attachments

(6 files, 1 obsolete file)

That this is happening on e10s mochitest runs seems pertinent.

https://tbpl.mozilla.org/php/getParsedLog.php?id=39221652&tree=Mozilla-Inbound

Ubuntu VM 12.04 x64 mozilla-inbound opt test mochitest-e10s-2 on 2014-05-07 11:04:23 PDT for push 85924e72f778
slave: tst-linux64-spot-460

11:11:07     INFO -  6362 INFO TEST-START | Shutdown
11:11:07     INFO -  6363 INFO Passed:  180635
11:11:07     INFO -  6364 INFO Failed:  0
11:11:07     INFO -  6365 INFO Todo:    24207
11:11:07     INFO -  6366 INFO Slowest: 46455ms - /tests/dom/imptests/editing/conformancetest/test_runtest.html
11:11:07     INFO -  6367 INFO SimpleTest FINISHED
11:11:07     INFO -  6368 INFO TEST-INFO | Ran 1 Loops
11:11:07     INFO -  6369 INFO SimpleTest FINISHED
11:11:07     INFO -  ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
11:11:07     INFO -  ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
11:11:07     INFO -  ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
11:11:07     INFO -  ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
11:11:07     INFO -  ###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
11:11:07     INFO -  1399486267654	Browser.Experiments.Experiments	TRACE	PreviousExperimentProvider #0::shutdown()
11:11:07     INFO -  1399486267660	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: started
11:11:07     INFO -  1399486267666	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: finished with _loadTask
11:11:07     INFO -  1399486267667	Browser.Experiments.Experiments	TRACE	Experiments #0::uninit: no previous shutdown
11:11:07     INFO -  1399486267668	Browser.Experiments.Experiments	TRACE	Experiments #0::Unregistering instance with Addon Manager.
11:11:07     INFO -  1399486267668	Browser.Experiments.Experiments	TRACE	Experiments #0::Unregistering previous experiment add-on provider.
11:11:07     INFO -  1399486267670	Browser.Experiments.Experiments	TRACE	PreviousExperimentProvider #0::shutdown()
11:11:07     INFO -  1399486267670	addons.manager	ERROR	Exception calling provider shutdown: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"  location: "JS frame :: resource://app/modules/experiments/Experiments.jsm :: this.Experiments.PreviousExperimentProvider.prototype<.shutdown :: line 2071"  data: no] Stack trace: this.Experiments.PreviousExperimentProvider.prototype<.shutdown()@resource://app/modules/experiments/Experiments.jsm:2071 < callProvider()@resource://gre/modules/AddonManager.jsm:192 < AMI_unregisterProvider()@resource://gre/modules/AddonManager.jsm:848 < AMP_unregisterProvider()@resource://gre/modules/AddonManager.jsm:2326 < Experiments.Experiments.prototype._unregisterWithAddonManager()@resource://app/modules/experiments/Experiments.jsm:496 < Experiments.Experiments.prototype.uninit<()@resource://app/modules/experiments/Experiments.jsm:442 < TaskImpl_run()@resource://gre/modules/Task.jsm:282 < TaskImpl_handleResultValue()@resource://gre/modules/Task.jsm:338 < TaskImpl_run()@resource://gre/modules/Task.jsm:290 < TaskImpl()@resource://gre/modules/Task.jsm:247 < createAsyncFunction/asyncFunction()@resource://gre/modules/Task.jsm:224 < Spinner.prototype.observe()@resource://gre/modules/AsyncShutdown.jsm:320 < <file:unknown>
11:11:07     INFO -  1399486267673	Browser.Experiments.Experiments	INFO	Experiments #0::Completed uninitialization.
11:11:07     INFO -  firefox: tpp.c:63: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed.
11:11:08     INFO -  TEST-INFO | Main app process: killed by SIGIOT
11:11:08  WARNING -  TEST-UNEXPECTED-FAIL | Shutdown | application terminated with exit code 6
11:11:08     INFO -  INFO | runtests.py | Application ran for: 0:05:11.881747
11:11:08     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmp2ecMbtpidlog
11:11:08     INFO -  ==> process 2437 launched child process 2477
11:11:08     INFO -  INFO | zombiecheck | Checking for orphan process with PID: 2477
11:11:08     INFO -  mozcrash INFO | Downloading symbols from: https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1399484774/firefox-32.0a1.en-US.linux-x86_64.crashreporter-symbols.zip
11:11:17  WARNING -  PROCESS-CRASH | Shutdown | application crashed [@ libc-2.15.so + 0x36445]
11:11:17     INFO -  Crash dump filename: /tmp/tmpyoZjqK/minidumps/06af5307-14a4-89dd-6324fbe4-3f59ec5b.dmp
11:11:17     INFO -  Operating system: Linux
11:11:17     INFO -                    0.0.0 Linux 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64
11:11:17     INFO -  CPU: amd64
11:11:17     INFO -       family 6 model 62 stepping 4
11:11:17     INFO -       1 CPU
11:11:17     INFO -  Crash reason:  SIGABRT
11:11:17     INFO -  Crash address: 0x3e800000985
11:11:17     INFO -  Thread 3 (crashed)
11:11:17     INFO -   0  libc-2.15.so + 0x36445
11:11:17     INFO -      rbx = 0x00007f2539d30000   r12 = 0x00007f2548543e30
11:11:17     INFO -      r13 = 0x00007f2548543f00   r14 = 0x00000000000008f2
11:11:17     INFO -      r15 = 0x00000000ffffffff   rip = 0x00007f254758d445
11:11:17     INFO -      rsp = 0x00007f2537d50838   rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: given as instruction pointer in context
11:11:17     INFO -   1  libc-2.15.so + 0x39baa
11:11:17     INFO -      rip = 0x00007f2547590bab   rsp = 0x00007f2537d50840
11:11:17     INFO -      rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: stack scanning
11:11:17     INFO -   2  libpthread-2.15.so + 0x11e2f
11:11:17     INFO -      rip = 0x00007f2548543e30   rsp = 0x00007f2537d50848
11:11:17     INFO -      rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: stack scanning
11:11:17     INFO -   3  libc-2.15.so + 0x17bb1f
11:11:17     INFO -      rip = 0x00007f25476d2b20   rsp = 0x00007f2537d50850
11:11:17     INFO -      rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: stack scanning
11:11:17     INFO -   4  libc-2.15.so + 0x6df51
11:11:17     INFO -      rip = 0x00007f25475c4f52   rsp = 0x00007f2537d50870
11:11:17     INFO -      rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: stack scanning
11:11:17     INFO -   5  libpthread-2.15.so + 0x11e2f
11:11:17     INFO -      rip = 0x00007f2548543e30   rsp = 0x00007f2537d50878
11:11:17     INFO -      rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: stack scanning
11:11:17     INFO -   6  libc-2.15.so + 0x179b14
11:11:17     INFO -      rip = 0x00007f25476d0b15   rsp = 0x00007f2537d508d0
11:11:17     INFO -      rbp = 0x00007f2548543e24
11:11:17     INFO -      Found by: stack scanning
I believe that the error in question is  __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed.

I doubt that this has anything to do with the experiments code. The NS_ERROR_FAILURE from http://hg.mozilla.org/mozilla-central/annotate/417acde736e7/browser/experiments/Experiments.jsm#l2071 is an interesting failure but not the cause of the crash.

The content process should have committed suicide at "###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID", and that's a bug, but not the proximate cause of the crash.

The chrome-process crash is an abort() from the I/O thread. The main thread is shutting down in ParentImpl::ShutdownBackgroundThread.

The I/O thread is at ProcessLink::OnChannelError -> MessageChannel::PostErrorNotifyTask -> MessageLoop::PostTask_Helper.

PostTask_Helper appears to be locking bogus memory. Hard to say much more than that. Might be a good candidate for `rr` debugging if it happens much.
Component: Client: Desktop → IPC
Product: Firefox Health Report → Core
https://tbpl.mozilla.org/php/getParsedLog.php?id=39228865&tree=Mozilla-Inbound

Looks like I'm going to have to start bisecting for a culprit now...
Summary: Intermittent e10s Shutdown | application crashed [@ libc-2.15.so + 0x36445] with Experiments.jsm errors in the log prior and libpthread on the stack → Intermittent e10s Shutdown | application crashed [@ libc-2.15.so + 0x36445][@ linux-gate.so + 0x424] with Experiments.jsm errors in the log prior and libpthread on the stack
Summary: Intermittent e10s Shutdown | application crashed [@ libc-2.15.so + 0x36445][@ linux-gate.so + 0x424] with Experiments.jsm errors in the log prior and libpthread on the stack → Intermittent e10s mochitest-2 Shutdown | application crashed [@ libc-2.15.so + 0x36445][@ linux-gate.so + 0x424] with Experiments.jsm errors in the log prior and libpthread on the stack
And BTW, why do I have this sinking feeling that bug 924622 is somehow involved?
Bug 880864 is another possibility. Retriggers running.
Retriggers conclusively point to bug 924622 as the cause. Backed out.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Bill: Nical is trying to fix bug 924622 but his fix hits this e10s mochitest crash. Do you have any suggestions?
Flags: needinfo?(wmccloskey)
I'm occasionally able to reproduce this in Linux debug by running an e10s browser (setting browser.tabs.remote.autostart and restarting), browsing around for a while, and quitting. I'll try to find a way to make it reproduce more consistently.
Flags: needinfo?(wmccloskey)
I found some interesting things today while trying to debug this. One thing I noticed is that I still crash at shutdown even if I disable async video. That means that we have bugs in shutdown that have nothing to do with the image bridge, even with the patch in bug 924622 applied.

It looks like the child process never deletes its CompositorChild instance. That's the first problem.

In the parent, the CrossProcessCompositorParent that's associated with the leaked CompositorChild gets its ActorDestroy method called when we get a channel error, which presumably happens when the child has exited. However, it looks like we might have already killed off the compositor thread by then: we increment the compositor thread's refcount in the CompositorParent constructor, but not in the CrossProcessCompositorParent constructor. If the compositor thread has been destroyed, I think we'll crash when we try to handle the channel error. That's the second problem.

So it seems like we need to have a proper way to shut down the CompositorChild in the child process, and then we need to wait until that's finished before we delete the compositor thread in the parent. Nical, would you be able to work on this by any chance? I still don't understand the code very well. It should make it a lot easier to land bug 924622, too.
Attached patch debuggingSplinter Review
This patch seems to make crashes more frequent because of the sleep call. My STR is to open Firefox in an e10s profile and then close it once about:home has loaded. It crashes about half the time.
Forgot to set needinfo.
Flags: needinfo?(nical.bugzilla)
Taking.

But... why is the status set to RESOLVED FIXED?
Flags: needinfo?(nical.bugzilla) → needinfo?(ryanvm)
The bug was filed as a regression from bug 924622 landing. When that was backed out, this bug was closed. We still want to fix them both though.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: needinfo?(ryanvm)
This removes the crash and lets shutdown complete normally. Just FYI: I'm still very unfamiliar with this code and don't know whether this patch makes sense. I just thought that MessageChannel should not be induced into such internal use-after-frees, even in the face of abnormal usage by the compositor code.

I haven't looked at this bug from the Compositor perspective yet, will do that next.
Forgot a hg add.
Attachment #8427972 - Attachment is obsolete: true
Attachment #8427979 - Attachment description: Strawman: avoid use-after-free of MessageChannel::mWorkerLoop → Expanded debugging patch (same sleep(3) call, more logging)
Attachment #8427969 - Attachment is obsolete: true
Attachment #8427972 - Attachment is obsolete: false
This is unaffected by the recent landing of bug 924622:
 - this shutdown crash persists after the landing of bug 924622;
 - the 'strawman' fix here continues to avert the crash.
Some comments inside this patch should clarify what it is doing. It's sad to be leaking in MessageChannel::Clear() when it's called on the wrong thread, but I don't see an immediate fix for this --- see comments. Ideas welcome!

Ben: the issue that this intends to fix is the use-after-free that is apparent in the above logs --- MessageChannel::mWorkerLoop becomes a dangling pointer during shutdown as the thread finishes.

This is only a stopgap fix. The deeper issue is that we have a bunch of code in ipc/glue, especially MessageChannel and MessageLink, that holds raw MessageLoop* pointers, but the MessageLoop is typically a stack variable in the thread main function, sitting at the bottom of the thread's call stack... so having all this code use these raw MessageLoop* pointers, both on this thread and on another thread (the link thread), is perhaps a bit dangerous.

We could either

  1) stick to this approach, but rationalize a way for it to be non-crashy, which would have to involve some synchronization during shutdown (sad); or

  2) give up on this approach, and modify ipc/glue code to avoid directly touching the worker MessageLoop from other threads (as it currently does in MessageChannel::PostErrorNotifyTask); or

  3) make MessageLoop refcounted and allocate it on the heap so that it could survive its thread, so that other threads directly referencing the MessageLoop could avoid crashing.
Attachment #8428885 - Flags: review?(wmccloskey)
Attachment #8428885 - Flags: review?(bent.mozilla)
Flags: needinfo?(wmccloskey)
Flags: needinfo?(bent.mozilla)
(In reply to Bill McCloskey (:billm) from comment #8)
> I found some interesting things today while trying to debug this. One thing
> I noticed is that I still crash at shutdown even if I disable async video.
> That means that we have bugs in shutdown that have nothing to do with the
> image bridge, even with the patch in bug 924622 applied.
> 
> It looks like the child process never deletes its CompositorChild instance.
> That's the first problem.

That's true and confirmed by XPCOM_MEM_LEAK_LOG, as CompositorChild has MOZ_COUNT_CTOR.

I also set a breakpoint on PCompositorChild constructor in the child process, and got it called by ContentChild::AllocPCompositorChild.

I then wondered, why is "DeallocPCompositorChild" not called?

It turns out that there is no DeallocPCompositorChild method at all.

I then wondered, how does this compile, as presumably IPDL-generated code must be trying to call a DeallocPCompositorChild method?

It turns out it's not. It also seems that PCompositorChild is not unique in this respect: a look at ContentChild.h shows other AllocP* methods not matched with a DeallocP* method, and these seem to be the top-level protocols.

This part of ipc/ipdl/ipdl/lower.py seems to be the one determining that:

http://hg.mozilla.org/mozilla-central/file/b5bdc1aaf378/ipc/ipdl/ipdl/lower.py#l2821

        for actor in channelOpenedActors:
            # add the Alloc interface for actors created when a
            # new channel is opened
            actortype = _cxxBareType(actor.asType(), actor.side)
            self.cls.addstmt(StmtDecl(MethodDecl(
                _allocMethod(actor.ptype, actor.side).name,
                params=[ Decl(Type('Transport', ptr=1), 'aTransport'),
                         Decl(Type('ProcessId'), 'aOtherProcess') ],
                ret=actortype,
                virtual=1, pure=1)))

Thus it seems to be intentional that such actors are only created, but not destroyed, by the IPDL-generated code.

Is there, then, anything that we should do here? Are we supposed to manually destroy objects (top-level protocol actors) that were automatically constructed by IPDL code, or is this leak fully intentional?

Note that, IIUC, the reason we started to care about this shutdown leak is that it was related to the present crash; but the above patch to ipc/glue/MessageChannel.* fixes this use-after-free crash, and shows that it was an ipc/glue bug not specific to compositor code. Given that, do we want to do anything else here?
This doesn't seem like the right approach to me. The IPC code is assuming that, when a MessageChannel is connected to a MessageLoop, then the MessageChannel should be closed before the MessageLoop is destroyed. That seems like a reasonable assumption to me. Even if we did change MessageChannel so that it doesn't crash if the MessageLoop is destroyed, it's still just ignoring messages from the channel once the worker dies, and that seems bad.

I do think it would make sense to take a version of your patch that asserts that we never delete a MessageLoop before a MessageChannel that's attached to it. That would make errors like this a lot easier to spot.

My understanding is that top-level protocols like PContent, PCompositor, and PImageBridge have to be destroyed manually. We're pretty careful about how this happens for ImageBridge. I think we just need to do the same thing for PCompositor. Then we need to make sure that the compositor thread stays alive until PCompositor has been fully shut down. If we do that, then everything should work.
Flags: needinfo?(wmccloskey)
Flags: needinfo?(bent.mozilla)
Attachment #8428885 - Flags: review?(wmccloskey)
Attachment #8428885 - Flags: review?(bent.mozilla)
Thanks for the great feedback. So here is why we're leaking the CompositorChild. It's supposed to be destroyed by CompositorChild::Destroy(), called by nsBaseWidget::DestroyCompositor(). But there,

Breakpoint 1, nsBaseWidget::DestroyCompositor (this=0x7ff9995d8e80)
    at /hack/mozilla-central/widget/xpwidgets/nsBaseWidget.cpp:174
174	  LayerScope::DestroyServerSocket();
(gdb) l
169	    aCompositorChild->Release();
170	}
171	
172	void nsBaseWidget::DestroyCompositor()
173	{
174	  LayerScope::DestroyServerSocket();
175	
176	  if (mCompositorChild) {
177	    mCompositorChild->SendWillStop();
178	    mCompositorChild->Destroy();
(gdb) n
176	  if (mCompositorChild) {
(gdb) p mCompositorChild
$1 = {mRawPtr = 0x0}

mCompositorChild is null, so we don't destroy any compositor.
And indeed, that nsBaseWidget code only tries to destroy the compositor that it may have created manually in nsBaseWidget::CreateCompositor() (not using IPC), but in e10s, compositor creation goes through PContent instead:

Breakpoint 1, mozilla::layers::CompositorChild::CompositorChild (
    this=0x7f690c9a3c00, aLayerManager=0x0)
    at /hack/mozilla-central/gfx/layers/ipc/CompositorChild.cpp:36
36	{
(gdb) bt 8
#0  mozilla::layers::CompositorChild::CompositorChild (this=0x7f690c9a3c00, 
    aLayerManager=0x0)
    at /hack/mozilla-central/gfx/layers/ipc/CompositorChild.cpp:36
#1  0x00007f69232c261d in mozilla::layers::CompositorChild::Create (
    aTransport=0x7f690ec32530, aOtherProcess=18085)
    at /hack/mozilla-central/gfx/layers/ipc/CompositorChild.cpp:76
#2  0x00007f6923d8baef in mozilla::dom::ContentChild::AllocPCompositorChild (
    this=0x7f691691d830, aTransport=0x7f690ec32530, aOtherProcess=18085)
    at /hack/mozilla-central/dom/ipc/ContentChild.cpp:818
#3  0x00007f69229145e6 in mozilla::dom::PContentChild::OnMessageReceived (
    this=0x7f691691d830, __msg=...) at ./PContentChild.cpp:4648
#4  0x00007f69227eb6a7 in mozilla::ipc::MessageChannel::DispatchAsyncMessage (
    this=0x7f691691d890, aMsg=...)
    at /hack/mozilla-central/ipc/glue/MessageChannel.cpp:1164
#5  0x00007f69227ea9ff in mozilla::ipc::MessageChannel::DispatchMessage (
    this=0x7f691691d890, aMsg=...)
    at /hack/mozilla-central/ipc/glue/MessageChannel.cpp:1078
#6  0x00007f69227e697e in mozilla::ipc::MessageChannel::OnMaybeDequeueOne (
    this=0x7f691691d890)
    at /hack/mozilla-central/ipc/glue/MessageChannel.cpp:1061
There is a bunch of trivially-implemented methods on CrossProcessCompositorParent saying:

  // FIXME/bug 774388: work out what shutdown protocol we need.
  virtual bool RecvRequestOverfill() MOZ_OVERRIDE { return true; }
  virtual bool RecvWillStop() MOZ_OVERRIDE { return true; }
  virtual bool RecvStop() MOZ_OVERRIDE { return true; }
  virtual bool RecvPause() MOZ_OVERRIDE { return true; }
  virtual bool RecvResume() MOZ_OVERRIDE { return true; }

So I'm realizing we're not just talking about fixing things --- we're talking about _implementing_ things :-)
Depends on: 774388
The patch on bug 774388 comment 25 fixes this for me locally and it seems on tryserver. Work continues on bug 774388, not here.
This started on inbound again today after bug 774388 landed. I'm also about to file another frequent mochitest-e10s-2 shutdown crash that also started after that push.
Actually, maybe I can repurpose this bug for it.
Summary: Intermittent e10s mochitest-2 Shutdown | application crashed [@ libc-2.15.so + 0x36445][@ linux-gate.so + 0x424] with Experiments.jsm errors in the log prior and libpthread on the stack → Intermittent e10s mochitest-2 PROCESS-CRASH | Shutdown | application crashed [@ linux-gate.so + 0x424] or application crashed [@ libc-2.15.so + 0x36445] with Experiments.jsm errors in the log prior and libpthread/libc on the stack
https://tbpl.mozilla.org/php/getParsedLog.php?id=43074746&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=43078331&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=43079088&tree=Mozilla-Inbound

I'm afraid that at the current frequency, we're going to need to consider a backout :(
Assignee: nobody → bjacob
Flags: needinfo?(bjacob)
Hardware: x86_64 → All
Target Milestone: mozilla32 → ---
Flags: needinfo?(bjacob)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #31)
> Actually, maybe I can repurpose this bug for it.

I re-landed today and this is now happening again about 25% of the time on e10s mochitest-2.

I am hesitant about backing out. Aside from this, my push, with lots of retriggers for everything that was intermittent, is now solid green. There are two or three big intermittents that we have specific good reason to think are fixed for good by this push, including bug 924622.

https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=d2e7bd70dd95
A 25% failure rate means that mochitest-e10s-2 doesn't meet our job visibility requirements (i.e. we'd move it to hidden by default). I guess that's something to discuss with billm.
Flags: needinfo?(wmccloskey)
I guess I'm okay with temporarily disabling e10s M-2. However, the purpose of all this work is to increase e10s test coverage, not decrease it. So we should only disable it if it gets us closer to that goal.
Flags: needinfo?(wmccloskey)
Yeah, I understand.

The other thing I forgot to mention is that I think the bug really changed, i.e. the issue we're seeing at the moment is no longer the one that was originally investigated here.

Both manifest in the same way: this crash stack occurs when a thread dies sooner than expected and a MessageChannel is still referencing its message loop as its mWorkerLoop.

That used to happen here with the compositor thread dying too soon, and Bill and I could reproduce it with Bill's above debugging patch, adding a sleep call.

Now I cannot reproduce it at all anymore in this way (even with a longer sleep time), and we also have an excellent reason to believe that the compositor thread is no longer dying too soon.

But if _any_ other thread is still dying too soon during shutdown, then we will still get similar crashes with the same stack and, at the moment, no way to tell from the logs what thread or MessageChannel is involved.

So my best understanding is that there are other such issues remaining in the e10s shutdown sequence and my patches landed today expose them by reordering the shutdown sequence.
(In reply to Bill McCloskey (:billm) from comment #41)
> I guess I'm okay with temporarily disabling e10s M-2. However, the purpose
> of all this work is to increase e10s test coverage, not decrease it. So we
> should only disable it if it gets us closer to that goal.

Thanks for considering that --- I think we're in one of these tough spots where we might have to take one step back to get out of a bad situation.

Though if we disable e10s M-2, we should probably at the same time think about how to make this actionable (by landing patches making logs more specific, etc.).

(Incidentally, we already know of a list of threads that do not have a proper shutdown sequence, as was discovered in bug 1008254, another big intermittent that was recently fixed. See the list of threads in bug 1008254 comment 395 or, equivalently, the list of blockers of bug 1033577.)
Bill: or we could take another look at the patch in comment 19 and consider landing a variant of it... it's not great to hide actual bugs, but that might be better than having mochitest-2 disabled...
Flags: needinfo?(wmccloskey)
Also, there might be useful information in the fact that this always seems to occur just after Browser.Experiments.Experiments tests...
Nevermind comment 45, this is a shutdown crash.

Regarding comment 44, here is a try push with the patch from comment 19:
https://tbpl.mozilla.org/?tree=Try&rev=5c7d85eb3bd1
FWIW, hidden by default isn't the same as disabled :). The jobs will still be running and results visible by adding &showall=1 to the TBPL URL. That said, nobody actively watches that, so any bustage won't be caught.
Yeah, I understand. That's why I think that landing a variant of the comment 19 patch would be a better compromise than hiding. Let's see what the comment 47 try run says.
(In reply to Benoit Jacob [:bjacob] from comment #49)
> Yeah, I understand. That's why I think that landing a variant of the comment
> 19 patch would be a better compromise than hiding. Lets see what the comment
> 47 try run says.

This is solid green as expected: https://tbpl.mozilla.org/?tree=Try&rev=5c7d85eb3bd1 . So we already have at least this possible way out (though it needs some work to be landable).

Next, here is another try push to determine whether the thread that is dying too soon is the compositor thread (as it was when this bug was filed) or not: https://tbpl.mozilla.org/?tree=Try&rev=ab376245700f
https://tbpl.mozilla.org/?tree=Try&rev=ab376245700f shows two things:
 - the use-after-free'd MessageLoop is *still* the compositor loop
 - the CompositorParent.cpp code that is supposed to be the only place destroying the compositor thread has *not* run.

Something is killing our compositor thread, and we don't know what....

but one thing is for sure, I should back out my patches of today, because they should prevent such a possibility, so they are not doing their job right.
Flags: needinfo?(wmccloskey)
Ah no, I missed a line of the log. We *are* going through the only place destroying the compositor thread in CompositorParent.cpp. Whew!

So the problem becomes one of figuring out what else we should be blocking on before we proceed to destroy the compositor thread. Looking.
Got it. In GDB, I printed the CompositorLoop and set conditional breakpoints in MessageChannel.cpp where mWorkerLoop is set, firing only when it is set to point to the CompositorLoop. Each time one fired, I set a conditional breakpoint in MessageChannel::~MessageChannel for when the 'this' pointer equals what it was when mWorkerLoop was set to point to the CompositorLoop.

In this way, during normal shutdown, we can determine the complete list of protocols that have ever used the CompositorLoop as their mWorkerLoop. As expected, we got [CrossProcess]CompositorParent actors... the unexpected thing was that we also got ImageBridgeParent. I didn't do anything to block the destruction of the compositor loop on ImageBridge. That's got to be our problem here.
https://tbpl.mozilla.org/?tree=Try&rev=caee2cd3ff6c is solid green. So this definitely was the issue and we might be just one r+ away from relanding. Going to make a full try push with lots of retriggers though.
https://tbpl.mozilla.org/?tree=Try&rev=81a08416e5a5 is solid green even after lots of retriggers. Will reland as soon as I get a r+ on the last patch on bug 774388.
Bug 774388 has relanded and is looking solid green.
Status: REOPENED → RESOLVED
Closed: 6 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33