Closed Bug 812880 Opened 8 years ago Closed 8 years ago

Deadlock in hal::DisableSwitchNotifications at shutdown

Categories

(Core :: Hardware Abstraction Layer (HAL), defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla20
blocking-basecamp +
Tracking Status
firefox18 --- fixed
firefox19 --- fixed
firefox20 --- fixed

People

(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)

Details

Attachments

(1 file)

I observed the following deadlock on my device.

>#0  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:182
>#1  0x40080254 in _normal_lock (mutex=0x418542f4) at bionic/libc/bionic/pthread.c:951
>#2  pthread_mutex_lock (mutex=0x418542f4) at bionic/libc/bionic/pthread.c:1041
>#3  0x40f54b7e in LockImpl::Lock (this=0x418542f4) at ../../../../ff-git2/src/ipc/chromium/src/base/lock_impl_posix.cc:41
>#4  0x40e27954 in Lock::Acquire (aId=mozilla::ipc::BrowserProcessSubThread::IO) at ../../../../ff-git2/src/ipc/chromium/src/base/lock.h:16
>#5  AutoLock (aId=mozilla::ipc::BrowserProcessSubThread::IO) at ../../../../ff-git2/src/ipc/chromium/src/base/lock.h:43
>#6  mozilla::ipc::BrowserProcessSubThread::GetMessageLoop (aId=mozilla::ipc::BrowserProcessSubThread::IO) at ../../../../ff-git2/src/ipc/glue/BrowserProcessSubThread.cpp:91
>#7  0x40769d0e in XRE_GetIOMessageLoop () at ../../../../ff-git2/src/toolkit/xre/nsEmbedFunctions.cpp:502
>#8  0x40ebbd8c in mozilla::hal_impl::DisableSwitchNotifications (aDevice=1099252468) at ../../../ff-git2/src/hal/gonk/GonkSwitch.cpp:408
>#9  0x40eb7602 in mozilla::hal::DisableSwitchNotifications (aDevice=mozilla::hal::SWITCH_USB) at ../../../ff-git2/src/hal/Hal.cpp:717
>#10 0x40eb88fa in mozilla::hal::UnregisterSwitchObserver (aDevice=<value optimized out>, aObserver=0x43cd11d8) at ../../../ff-git2/src/hal/Hal.cpp:776
>#11 0x40bcd85c in ~UsbCableObserver (this=0x43cd11d8, __in_chrg=<value optimized out>) at ../../../../../ff-git2/src/dom/system/gonk/AutoMounter.cpp:514
>#12 0x40bcd86c in ~UsbCableObserver (this=0x1, __in_chrg=<value optimized out>) at ../../../../../ff-git2/src/dom/system/gonk/AutoMounter.cpp:515
>#13 0x408609be in mozilla::RefCounted<mozilla::gfx::GradientStops>::Release (this=0x418523a8, __in_chrg=<value optimized out>) at ../../dist/include/mozilla/RefPtr.h:64
>#14 mozilla::RefPtr<mozilla::gfx::GradientStops>::unref (this=0x418523a8, __in_chrg=<value optimized out>) at ../../dist/include/mozilla/RefPtr.h:159
>#15 ~RefPtr (this=0x418523a8, __in_chrg=<value optimized out>) at ../../dist/include/mozilla/RefPtr.h:109
>#16 0x4008cc34 in __cxa_finalize (dso=0x0) at bionic/libc/stdlib/atexit.c:158
>#17 0x4008cfd0 in exit (status=42) at bionic/libc/stdlib/exit.c:57
>#18 0x4008cfd0 in exit (status=42) at bionic/libc/stdlib/exit.c:57

This is important if and only if this atexit hang will prevent us from restarting the b2g process if it crashes.  I think this is the case, so nom'ing for blocking.
blocking-basecamp: --- → ?
I don't see how releasing a GradientStops objct should cause us to run ~UsbCableObserver.  I'm going to guess that this stack is incorrect.

I think the issue is that we keep a static RefPtr<UsbCableObserver> instead of a static StaticRefPtr<UsbCableObserver>.
Attached patch Patch, v1Splinter Review
Attachment #682880 - Flags: review?(jones.chris.g)
Now my build dies with

>Program received signal SIGSEGV, Segmentation fault.
>android::SharedBuffer::release (this=0xfffffff0, flags=0) at frameworks/base/include/utils/SharedBuffer.h:139
>139         return (mRefs == 1);
>(gdb) bt
>#0  android::SharedBuffer::release (this=0xfffffff0, flags=0) at frameworks/base/include/utils/SharedBuffer.h:139
>#1  0x41a1f0ac in android::terminate_string16 () at frameworks/base/libs/utils/String16.cpp:55
>#2  0x41a1e3a8 in ~LibUtilsFirstStatics (this=0xfffffff0, __in_chrg=<value optimized out>) at frameworks/base/libs/utils/Static.cpp:38
>#3  0x40089c34 in __cxa_finalize (dso=0x0) at bionic/libc/stdlib/atexit.c:158
>#4  0x40089fd0 in exit (status=42) at bionic/libc/stdlib/exit.c:57
>#5  0x40089fd0 in exit (status=42) at bionic/libc/stdlib/exit.c:57
>Backtrace stopped: previous frame identical to this frame (corrupt stack?)

which, though not great, seems much better (at least the process is actually dead).
Assignee: nobody → justin.lebar+bug
Attachment #682880 - Flags: review?(jones.chris.g) → review+
FWIW I think I was seeing this deadlock because I was calling exit() manually (I'd forgotten about this call).

So I don't know if we actually need this on branches.  But this deadlock is bad enough and the fix is simple enough that I'd want it there.
https://hg.mozilla.org/mozilla-central/rev/23bcb80441fa
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
So yeah. I don't think that StaticRefPtr existed when I wrote AutoMounter.cpp.

And I think that when they're declared as static RefPtr then I would expect things to get destructed when exit is called. I see StaticRefPtr has no destructor. The original code expected things to get destructed when xpcom-shutdown was received.

If the destructors were called from the wrong thread context, I could see that potentially causing the deadlock issue. And since StaticRefPtr doesn't destruct, that also sidesteps the issue.

I also noticed that the compiler does certain optimizations when templates and lots of inlined stuff is being used, and sometimes the optimizations are incorrect.

bug 694594 is an example where I saw this happening in the 4.4.3 compiler (we're using 4.6.3), but I've investigated these types of issues several times over the years in gcc. So when you were seeing UsbCable observer in the stack crawl, it may have been from this type of optimization. We'd want to produced a disassembled libxul.so to confirm this.
Target Milestone: mozilla20 → ---
Target Milestone: --- → mozilla20
Comment on attachment 682880 [details] [diff] [review]
Patch, v1

[Approval Request Comment]
This code is used only on b2g.
Attachment #682880 - Flags: approval-mozilla-beta?
Attachment #682880 - Flags: approval-mozilla-aurora?
blocking-basecamp: ? → +
Comment on attachment 682880 [details] [diff] [review]
Patch, v1

This is blocking-basecamp+, so you are OK to land with a=blocking-basecamp for another week (see the extension in https://etherpad.mozilla.org/b2g-convergence-landing). Approving regardless.
Attachment #682880 - Flags: approval-mozilla-beta?
Attachment #682880 - Flags: approval-mozilla-beta+
Attachment #682880 - Flags: approval-mozilla-aurora?
Attachment #682880 - Flags: approval-mozilla-aurora+
(In reply to Justin Lebar [:jlebar] from comment #0)
> I observed the following deadlock on my device.

Yeah - it looks like what's happening is this:

1 - exit is called (presumably on the main thread)
2 - static RefPtr<UsbCableObserver> gets destructed and eventually calls DisableSwitchNotifications
3 - DisableSwitchNotifications does a PostTask to the IOThread to perform the work, and it waits for the IOThread to do its work before continuing.

My guess is that the IOThread is no longer running, and this in turn causes the hang.
I filed bug 813842 to fix GonkSwitch properly
You need to log in before you can comment on or make changes to this bug.