Closed Bug 1049275 Opened 10 years ago Closed 10 years ago

Frequent hangs with e10s enabled

Categories

(Core :: IPC, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla35
Tracking Status
e10s m2+ ---

People

(Reporter: blassey, Assigned: billm)

References

Details

Attachments

(2 files)

Attached file Sample of Nightly.txt
the stack implicates a deadlock with CPOWs

    +2286 js::jit::DoGetPropFallback(JSContext*, js::jit::BaselineFrame*, js::jit::ICGetProp_Fallback*, JS::MutableHan
dle<JS::Value>, JS::MutableHandle<JS::Value>)  (in XUL) + 712  [0x104350c18]
    +  2286 js::proxy_GetGeneric(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>, JS::Handle<jsid>, JS::Mutab
leHandle<JS::Value>)  (in XUL) + 9  [0x10458ec79]
    +    2286 js::Proxy::get(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>, JS::Handle<jsid>, JS::MutableHa
ndle<JS::Value>)  (in XUL) + 446  [0x10458d58e]
    +      2286 js::CrossCompartmentWrapper::get(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>, JS::Handle<
jsid>, JS::MutableHandle<JS::Value>) const  (in XUL) + 356  [0x1045aa9e4]
    +        2286 js::proxy_GetGeneric(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>, JS::Handle<jsid>, JS:
:MutableHandle<JS::Value>)  (in XUL) + 9  [0x10458ec79]
    +          2286 js::Proxy::get(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>, JS::Handle<jsid>, JS::Mut
ableHandle<JS::Value>)  (in XUL) + 446  [0x10458d58e]
    +            2286 CPOWProxyHandler::get(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>, JS::Handle<jsid>
, JS::MutableHandle<JS::Value>) const  (in XUL) + 93  [0x1021c4b4d]
    +              2286 mozilla::jsipc::WrapperOwner::get(JSContext*, JS::Handle<JSObject*>, JS::Handle<JSObject*>, JS
::Handle<jsid>, JS::MutableHandle<JS::Value>)  (in XUL) + 248  [0x1021c4c58]
    +                2286 mozilla::jsipc::JavaScriptBase<mozilla::jsipc::PJavaScriptParent>::CallGet(unsigned long lon
g const&, mozilla::jsipc::ObjectVariant const&, nsString const&, mozilla::jsipc::ReturnStatus*, mozilla::jsipc::JSVariant*)  (in XUL) + 13  [0x1021c884d]
    +                  2286 mozilla::jsipc::PJavaScriptParent::CallGet(unsigned long long const&, mozilla::jsipc::Obje
ctVariant const&, nsString const&, mozilla::jsipc::ReturnStatus*, mozilla::jsipc::JSVariant*)  (in XUL) + 608  [0x1020ae110]
    +                    2286 mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*)  (in XUL) + 21  [0x101ef9375]
    +                      2286 mozilla::ipc::MessageChannel::UrgentCall(IPC::Message*, IPC::Message*)  (in XUL) + 262  [0x101ef8e96]
    +                        2286 mozilla::ipc::MessageChannel::SendAndWait(IPC::Message*, IPC::Message*)  (in XUL) + 200  [0x101ef8cd8]
    +                          2286 mozilla::ipc::MessageChannel::WaitForSyncNotify()  (in XUL) + 69  [0x101ef8ff5]
    +                            2286 PR_WaitCondVar  (in libnss3.dylib) + 105  [0x101904219]
    +                              2286 _pthread_cond_wait  (in libsystem_pthread.dylib) + 727  [0x7fff982d0c3b]
    +                                2286 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff8f9f6716]
Next time it happens can you also get a stack for the child process?
Also, is this a recent thing or has it always happened? Can you try to see if a particular add-on is triggering it?
(In reply to Bill McCloskey (:billm) from comment #2)
> Also, is this a recent thing or has it always happened? Can you try to see
> if a particular add-on is triggering it?

I've been noticing 3 or 4 hangs a day for the last few days.

(In reply to Bill McCloskey (:billm) from comment #1)
> Next time it happens can you also get a stack for the child process?

will do
sampled both this time. The parent looks roughly the same (same last few frames anyway). Here's what the Child process looks like:

2439 nsListControlFrame::Reflow(nsPresContext*, nsHTMLReflowMetrics&, nsHTMLReflowState const&, unsigned int&)  (in XUL) + 399  [0x101b7d47f]
   2439 nsListControlFrame::ReflowAsDropdown(nsPresContext*, nsHTMLReflowMetrics&, nsHTMLReflowState const&, unsigned int&)  (in XUL) + 654  [0x101b7db2e]
     2439 nsComboboxControlFrame::GetAvailableDropdownSpace(int*, int*, nsPoint*)  (in XUL) + 124  [0x101b73fbc]
       2439 nsDeviceContext::GetClientRect(nsRect&)  (in XUL) + 37  [0x1009bcd75]
         2439 nsDeviceContext::ComputeClientRectUsingScreen(nsRect*)  (in XUL) + 102  [0x1009bcde6]
           2439 nsScreenManagerProxy::ScreenForNativeWidget(void*, nsIScreen**)  (in XUL) + 122  [0x1014df5ca]
             2439 mozilla::dom::PScreenManagerChild::SendScreenForBrowser(mozilla::dom::PBrowserChild*, mozilla::dom::ScreenDetails*, bool*)  (in XUL) + 483  [0x1005efb23]
               2439 mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*)  (in XUL) + 197  [0x1003910e5]
                 2439 mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) const  (in XUL) + 813  [0x10038ff5d]
                   2439 NS_DebugBreak  (in XUL) + 1180  [0x1000719fc]
                     2439 CrashReporter::AppendAppNotesToCrashReport(nsACString_internal const&)  (in XUL) + 315  [0x10205fb6b]
                       2439 mozilla::dom::PCrashReporterChild::SendAppendAppNotes(nsCString const&)  (in XUL) + 441  [0x10048b179]
                        2439 mozilla::ipc::MessageChannel::Send(IPC::Message*)  (in XUL) + 84  [0x100390b74]
                          2439 PR_Lock  (in libnss3.dylib) + 14  [0x107d44f4e]
                            2439 _pthread_mutex_lock  (in libsystem_pthread.dylib) + 372  [0x7fff982d1779]
                              2439 __psynch_mutexwait  (in libsystem_kernel.dylib) + 10  [0x7fff8f9f6746]
So, the child process has crashed and is trying to send crash data to the parent, but the parent is issuing an urgent message. From an irc conversation with billm, the child process needs to "win" in this situation.
I was able to reproduce this while debugging something else. It turns out that we're not waiting on an IPC message; we're just waiting for a lock. It happens like this:

take a lock in the IPC code
do something that causes the IPC code to assert
try to send crash reporter data to the parent, invoking IPC
try to take the same IPC lock, which is already locked

I'm still unsure how to address this. It looks like we don't send any critical data over the IPC channel--just some textual annotations. What's even stranger is that these annotations look like they get attached to the parent process's crashes. I might be wrong about that though--I'll need to look more into how child process crash dumps are generated.
Gavin and Brad think this bug should block M2.
Attached patch avoid-ipc-hangSplinter Review
Please see the explanation in comment 6. This patch just turns off the crash annotations when we abort. I was wrong about these annotations getting attached to the parent--they are attached to the child minidump. However, they don't seem critical. And right now we're not getting any crash reports because of this, so we really need to fix it.
Attachment #8488331 - Flags: review?(benjamin)
The other comments here are way above my head.

I found my way here by searching the E10S bugs for the string 'hang' in the title.

I performed that search because that's what was happening.  It seemed to happen when I'd open a link or bookmark in a new tab.  I've disabled E10S in the meantime.

If I can gather more information, please tell me how.  What tool do you use to generate those stack traces?

I am a Java developer, so I know how to produce a stack trace or a thread dump--in my world, but not this one. :)
Pardon me: re: Comment 10: Whoops, this bug is for OSX.  I am on Windows 7, 64bit.
This might be the cause of your crash, but we won't know until the bug is fixed.
OS: Mac OS X → All
er, your hangs, sorry.
Attachment #8488331 - Flags: review?(benjamin) → review+
Bill: is your r+'d avoid-ipc-hang patch ready to land? We're trying to close out our remaining M2 bugs today.
Flags: needinfo?(wmccloskey)
https://hg.mozilla.org/mozilla-central/rev/24a8469c2ff6
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Small followup. I just compiled from changeset 205234:bf5fcc0c4b27 and the hang is fixed. I now get a tab crash page.

After the crash, the browser is quite slow reloading the other tabs. It probably took a minute to get back to normal. I have no idea if that's normal.
(In reply to Jack Moffitt [:jack] from comment #17)
> Small followup. I just compiled from changeset 205234:bf5fcc0c4b27 and the
> hang is fixed. I now get a tab crash page.
> 
> After the crash, the browser is quite slow reloading the other tabs. It
> probably took a minute to get back to normal. I have no idea if that's
> normal.

Usually session restore reloads tabs on demand. But the tab crashed page actually loads them all right away. Maybe that's what you're seeing.
(In reply to Jack Moffitt [:jack] from comment #17)
> After the crash, the browser is quite slow reloading the other tabs. It
> probably took a minute to get back to normal. I have no idea if that's
> normal.

Jack, you are probably seeing session restore bug 1037179.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: