Closed Bug 1457501 Opened 2 years ago Closed 2 years ago

Mac Crash deadlock triggered by CrashReporter::GetFlatThreadAnnotation() lock acquisition

Categories

(Toolkit :: Crash Reporting, defect, P1)

57 Branch
Unspecified
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: haik, Assigned: haik)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

While running the test testing/marionette/harness/marionette_harness/tests/unit/test_crash.py with a change to make the test wait indefinitely for the browser to cleanly exit (see attached), I hit test hangs where Firefox has exited but a content process is stuck in a hung state. This hang is a deadlock similar to bug 1395504, but in a breakpad callback. The problem is that in CrashReporter::GetFlatThreadAnnotation() which is run from the exception handler thread when all other threads have been suspended, we acquire a lock. Acquiring the lock triggers a memory allocation which blocks on a memory allocator lock. The lock is held by another thread which is suspended, hence the deadlock. And without the allocation, we could still deadlock if another thread held the mutex. So we shouldn't try to get the lock in exception context on Mac. This bug is specific to Mac because Windows and Linux implementations of breakpad work differently and don't suspend all threads while still trying to run in the child.

Here are some stacks from an instance of the problem.

Exception handler thread, blocked trying to allocate memory:
  thread #5
    frame #0 libsystem_kernel.dylib`syscall_thread_switch() 
    frame #1 libsystem_platform.dylib`_OSSpinLockLockYield() 
    frame #2 libmozglue.dylib`Mutex::Lock() at Mutex.h:66
    frame #3 libmozglue.dylib`AutoLock<Mutex>::AutoLock() at Mutex.h:129
    frame #4 libmozglue.dylib`AutoLock<Mutex>::AutoLock() at Mutex.h:127
    frame #5 libmozglue.dylib`arena_t::MallocSmall() at mozjemalloc.cpp:2940
    frame #6 libmozglue.dylib`arena_t::Malloc() at mozjemalloc.cpp:3000
    frame #7 libmozglue.dylib`BaseAllocator::calloc() at mozjemalloc.cpp:4179
    frame #8 libmozglue.dylib`Allocator<MozJemallocBase>::calloc() at malloc_decls.h:38
    frame #9 libmozglue.dylib`Allocator<ReplaceMallocBase>::calloc() at malloc_decls.h:38
    frame #10 libmozglue.dylib`::calloc(size_t, size_t)() at malloc_decls.h:38
    frame #11 libnss3.dylib`PR_Calloc() at prmem.c:443
    frame #12 libnss3.dylib`pt_AttachThread() at ptthread.c:265
    frame #13 libnss3.dylib`PR_GetCurrentThread() at ptthread.c:622
    frame #14 libnss3.dylib`PR_GetThreadPrivate() at prtpd.c:204
    frame #15 XUL`mozilla::BlockingResourceBase::ResourceChainFront() at BlockingResourceBase.h:157
    frame #16 XUL`mozilla::BlockingResourceBase::CheckAcquire() at BlockingResourceBase.cpp:279
    frame #17 XUL`mozilla::OffTheBooksMutex::Lock() at BlockingResourceBase.cpp:384
    frame #18 XUL`mozilla::StaticMutex::Lock() at StaticMutex.h:44
    frame #19 XUL`mozilla::BaseAutoLock<mozilla::StaticMutex>::BaseAutoLock() at Mutex.h:166
    frame #20 XUL`mozilla::BaseAutoLock<mozilla::StaticMutex>::BaseAutoLock() at Mutex.h:163
    frame #21 XUL`CrashReporter::GetFlatThreadAnnotation() at ThreadAnnotation.cpp:235
    frame #22 XUL`CrashReporter::PrepareChildExceptionTimeAnnotations() at nsExceptionHandler.cpp:1300
    frame #23 XUL`CrashReporter::ChildFilter() at nsExceptionHandler.cpp:1426
    frame #24 XUL`google_breakpad::ExceptionHandler::WriteMinidumpWithException() at exception_handler.cc:373
    frame #25 XUL`google_breakpad::ExceptionHandler::WaitForMessage() at exception_handler.cc:573
    frame #26 libsystem_pthread.dylib`_pthread_body() 
    frame #27 libsystem_pthread.dylib`_pthread_start() 
    frame #28 libsystem_pthread.dylib`thread_start()

Main thread, suspended by exception handler, holding allocator lock:
  thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0 libmozglue.dylib`RedBlackTree<arena_chunk_map_t, ArenaRunTreeTrait>::TreeNode::Left() at rb.h:172
    frame #1 libmozglue.dylib`RedBlackTree<arena_chunk_map_t, ArenaRunTreeTrait>::Remove() at rb.h:577
    frame #2 libmozglue.dylib`RedBlackTree<arena_chunk_map_t, ArenaRunTreeTrait>::Remove() at rb.h:144
    frame #3 libmozglue.dylib`arena_t::GetNonFullBinRun() at mozjemalloc.cpp:2790
    frame #4 libmozglue.dylib`arena_t::MallocSmall() at mozjemalloc.cpp:2943
    frame #5 libmozglue.dylib`arena_t::Malloc() at mozjemalloc.cpp:3000
    frame #6 libmozglue.dylib`BaseAllocator::calloc() at mozjemalloc.cpp:4179
    frame #7 libmozglue.dylib`Allocator<MozJemallocBase>::calloc() at malloc_decls.h:38
    frame #8 libmozglue.dylib`Allocator<ReplaceMallocBase>::calloc() at malloc_decls.h:38
    frame #9 libmozglue.dylib`::calloc(size_t, size_t)() at malloc_decls.h:38
    frame #10 libmozglue.dylib`zone_calloc() at zone.c:143
    frame #11 libsystem_malloc.dylib`malloc_zone_calloc() 
    frame #12 libsystem_malloc.dylib`calloc() 
    frame #13 ColorSync`CMMMemMgr::New() 
    frame #14 ColorSync`CMMBase::NewInternal() 
    frame #15 ColorSync`CMMProfile::MakeTag() 
    frame #16 ColorSync`CMMProfile::GetTag() 
    frame #17 ColorSync`CMMProfile::Usable() 
    frame #18 ColorSync`DoValidateProfile() 
    frame #19 ColorSync`AppleCMMValidateProfile() 
    frame #20 ColorSync`create() 
    frame #21 ColorSync`ColorSyncVerifySRGBData() 
    frame #22 CoreGraphics`CGColorSpaceCreateWithICCData() 
    frame #23 SkyLight`CGSColorSpaceRegistryCopyColorSpace() 
    frame #24 SkyLight`SLSCopyDisplayColorSpace() 
    frame #25 AppKit`-[NSCGSDisplay initWithDisplayID:flipOffset:]() 
    frame #26 AppKit`___NSCGSCreateDisplaysFromDisplayIDsUsingPredicate_block_invoke() 
    frame #27 AppKit`_NSCGSCreateArrayUsingBlock() 
    frame #28 AppKit`+[NSCGSDisplay uniqueDisplays]() 
    frame #29 AppKit`_NSScreenConfigurationUpdateSharedInfoForReason() 
    frame #30 AppKit`___NSScreenConfigurationEnsureInitialUpdateOccurred_block_invoke() 
    frame #31 libdispatch.dylib`_dispatch_client_callout() 
    frame #32 libdispatch.dylib`dispatch_once_f() 
    frame #33 AppKit`-[NSApplication(ScreenHandling) _registerForDisplayChangedNotifications]() 
    frame #34 AppKit`-[NSApplication init]() 
    frame #35 AppKit`+[NSApplication sharedApplication]() 
    frame #36 XUL`nsAppShell::Init() at nsAppShell.mm:314
    frame #37 XUL`nsAppShellInit() at nsAppShellSingleton.h:45
    frame #38 XUL`nsComponentManagerImpl::KnownModule::Load() at nsComponentManager.cpp:726
    frame #39 XUL`nsFactoryEntry::GetFactory() at nsComponentManager.cpp:1748
    frame #40 XUL`nsComponentManagerImpl::CreateInstance() at nsComponentManager.cpp:964
    frame #41 XUL`nsComponentManagerImpl::GetService() at nsComponentManager.cpp:1209
    frame #42 XUL`CallGetService() at nsComponentManagerUtils.cpp:56
    frame #43 XUL`nsGetServiceByCID::operator()() at nsComponentManagerUtils.cpp:253
    frame #44 XUL`nsCOMPtr<nsIAppShell>::assign_from_gs_cid() at nsCOMPtr.h:1198
    frame #45 XUL`nsCOMPtr<nsIAppShell>::nsCOMPtr() at nsCOMPtr.h:559
    frame #46 XUL`nsCOMPtr<nsIAppShell>::nsCOMPtr() at nsCOMPtr.h:556
    frame #47 XUL`XRE_RunAppShell() at nsEmbedFunctions.cpp:850
    frame #48 XUL`mozilla::ipc::MessagePumpForChildProcess::Run() at MessagePump.cpp:269
    frame #49 XUL`MessageLoop::RunInternal() at message_loop.cc:326
    frame #50 XUL`MessageLoop::RunHandler() at message_loop.cc:319
    frame #51 XUL`MessageLoop::Run() at message_loop.cc:299
    frame #52 XUL`XRE_InitChildProcess() at nsEmbedFunctions.cpp:719
    frame #53 XUL`mozilla::BootstrapImpl::XRE_InitChildProcess() at Bootstrap.cpp:69
    frame #54 plugin-container`content_process_main() at plugin-container.cpp:50
    frame #55 plugin-container`main() at MozillaRuntimeMain.cpp:25
    frame #56 plugin-container`start()
Assignee: nobody → haftandilian
Priority: -- → P1
More specifically: it's not safe to malloc in the exception handler, because one of the things that can cause a crash is heap corruption, so trying to malloc could cause us to crash again! It's unfortunate that `PR_GetCurrentThread` can malloc.

I wonder if we could change BlockingResourceBase to use a `static MOZ_THREAD_LOCAL(BlockingResourceBase*) sResourceAcqnChainFront` member instead of `PR_{Set,Get}ThreadPrivate`?

That would involve changing this:
https://dxr.mozilla.org/mozilla-central/rev/63a0e2f626febb98d87d2543955ab99a653654ff/xpcom/threads/BlockingResourceBase.h#297

To look more like this:
https://searchfox.org/mozilla-central/rev/78dbe34925f04975f16cb9a5d4938be714d41897/dom/script/ScriptSettings.cpp#29

and replacing the `PR_{Get,Set}CurrentThread` usage with `sResourceAcqnChainFront.{get,set}`.
I'm not super familiar with how our thread local stuff works, but we're using `MOZ_THREAD_LOCAL` in plenty of other places and it seems like the thing we need here.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #3)
> I'm not super familiar with how our thread local stuff works, but we're
> using `MOZ_THREAD_LOCAL` in plenty of other places and it seems like the
> thing we need here.

Even if acquiring the StaticMutex didn't trigger allocations, it's still not safe to do in exception context because a suspended thread may hold the lock. That would cause deadlock because the exception thread would be waiting for a suspended thread to release a lock which would never happen.
See Also: → 1457545
Comment on attachment 8971644 [details]
Bug 1457501 - Part 1 - Mac Crash deadlock triggered by CrashReporter::GetFlatThreadAnnotation() lock acquisition

https://reviewboard.mozilla.org/r/240414/#review246608

Looks good to me. While I don't like seeing more platform-specific code being added to these code paths I don't see an alternative method of fixing this.
Attachment #8971644 - Flags: review?(gsvelto) → review+
Comment on attachment 8971645 [details]
Bug 1457501 - Part 2 - Remove unused type DeleteWithLock

https://reviewboard.mozilla.org/r/240416/#review246610

Nice catch, the code that used this was backed out and never reintroduced. LGTM.
Attachment #8971645 - Flags: review?(gsvelto) → review+
Pushed by haftandilian@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bba71ee5fb46
Part 1 - Mac Crash deadlock triggered by CrashReporter::GetFlatThreadAnnotation() lock acquisition r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/911c930bc055
Part 2 - Remove unused type DeleteWithLock r=gsvelto
Backed out 2 changesets (bug 1457501) for build bustage. CLOSED TREE

Log:
https://treeherder.mozilla.org/logviewer.html#?job_id=176362508&repo=autoland&lineNumber=24665

[task 2018-04-30T22:45:34.310Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:82:8: error: 'MacCrashReporterLock' does not name a type
[task 2018-04-30T22:45:34.310Z] 22:45:34     INFO -   static MacCrashReporterLock sMutex;
[task 2018-04-30T22:45:34.310Z] 22:45:34     INFO -          ^~~~~~~~~~~~~~~~~~~~
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp: In function 'void CrashReporter::{anonymous}::ThreadLocalDestructor(void*)':
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:210:30: error: 'sMutex' was not declared in this scope
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -     CrashReporterAutoLock lock(sMutex);
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -                                ^~~~~~
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp: In destructor 'CrashReporter::{anonymous}::ThreadAnnotationSpan::~ThreadAnnotationSpan()':
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:222:3: error: 'sMutex' was not declared in this scope
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -     sMutex.AssertCurrentThreadOwns();
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -     ^~~~~~
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp: In function 'void CrashReporter::InitThreadAnnotation()':
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:233:30: error: 'sMutex' was not declared in this scope
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -     CrashReporterAutoLock lock(sMutex);
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -                                ^~~~~~
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp: In function 'void CrashReporter::SetCurrentThreadName(const char*)':
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:258:30: error: 'sMutex' was not declared in this scope
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -     CrashReporterAutoLock lock(sMutex);
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -                                ^~~~~~
[task 2018-04-30T22:45:34.311Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp: In function 'void CrashReporter::GetFlatThreadAnnotation(const std::function<void(const char*)>&, bool)':
[task 2018-04-30T22:45:34.312Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:294:5: error: 'sMutex' was not declared in this scope
[task 2018-04-30T22:45:34.312Z] 22:45:34     INFO -       sMutex.Lock();
[task 2018-04-30T22:45:34.312Z] 22:45:34     INFO -       ^~~~~~
[task 2018-04-30T22:45:34.312Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:305:5: error: 'sMutex' was not declared in this scope
[task 2018-04-30T22:45:34.312Z] 22:45:34     INFO -       sMutex.Unlock();
[task 2018-04-30T22:45:34.312Z] 22:45:34     INFO -       ^~~~~~
[task 2018-04-30T22:45:34.312Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp: In function 'void CrashReporter::ShutdownThreadAnnotation()':
[task 2018-04-30T22:45:34.312Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/toolkit/crashreporter/ThreadAnnotation.cpp:311:30: error: 'sMutex' was not declared in this scope
[task 2018-04-30T22:45:34.312Z] 22:45:34     INFO -     CrashReporterAutoLock lock(sMutex);
[task 2018-04-30T22:45:34.312Z] 22:45:34     INFO -                                ^~~~~~
[task 2018-04-30T22:45:34.312Z] 22:45:34     INFO -  At global scope:
[task 2018-04-30T22:45:34.312Z] 22:45:34     INFO -  cc1plus: error: unrecognized command line option '-Wno-implicit-fallthrough' [-Werror]
[task 2018-04-30T22:45:34.313Z] 22:45:34     INFO -  cc1plus: all warnings being treated as errors
[task 2018-04-30T22:45:34.313Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/config/rules.mk:1024: recipe for target 'Unified_cpp_crashreporter0.o' failed
[task 2018-04-30T22:45:34.313Z] 22:45:34     INFO -  make[4]: *** [Unified_cpp_crashreporter0.o] Error 1
[task 2018-04-30T22:45:34.313Z] 22:45:34     INFO -  make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/toolkit/crashreporter'
[task 2018-04-30T22:45:34.313Z] 22:45:34     INFO -  /builds/worker/workspace/build/src/config/recurse.mk:73: recipe for target 'toolkit/crashreporter/target' failed
[task 2018-04-30T22:45:34.313Z] 22:45:34     INFO -  make[3]: *** [toolkit/crashreporter/target] Error 2
[task 2018-04-30T22:45:34.313Z] 22:45:34     INFO -  make[3]: *** Waiting for unfinished jobs....
[task 2018-04-30T22:45:34.313Z] 22:45:34     INFO -  make[4]: Entering directory '/builds/worker/workspace/build/src/obj-firefox/media/mtransport/third_party/nICEr/nicer_nicer'
[task 2018-04-30T22:45:34.313Z] 22:45:34     INFO -  make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/media/mtransport/third_party/nICEr/nicer_nicer'

Push with fails:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=911c930bc0558545ef3b49294b81fa3ec0237b9a

Backout:
https://hg.mozilla.org/integration/autoland/rev/072c882dccce90b2b2b3e13442dab17752855947
Flags: needinfo?(haftandilian)
(In reply to Dorel Luca [:dluca] from comment #12)
> Backed out 2 changesets (bug 1457501) for build bustage. CLOSED TREE

Sorry, this fails to compile on Linux due to a typo. The fix is below. I'll fix this and test the build on Linux/Mac/Windows and then re-land.

$ hg diff
diff --git a/toolkit/crashreporter/ThreadAnnotation.cpp b/toolkit/crashreporter/ThreadAnnotation.cpp
--- a/toolkit/crashreporter/ThreadAnnotation.cpp
+++ b/toolkit/crashreporter/ThreadAnnotation.cpp
@@ -74,17 +74,17 @@ typedef mozilla::BaseAutoLock<MacCrashRe
 typedef MacCrashReporterLock CrashReporterLockType;
 #else /* !XP_MACOSX */
 // Use StaticMutex for locking
 typedef StaticMutexAutoLock CrashReporterAutoLock;
 typedef StaticMutex CrashReporterLockType;
 #endif /* XP_MACOSX */
 
 // Protects access to sInitialized and sThreadAnnotations.
-static MacCrashReporterLock sMutex;
+static CrashReporterLockType sMutex;
Flags: needinfo?(haftandilian)
hg error in cmd: hg pull upstream: pulling from https://hg.mozilla.org/integration/autoland
searching for changes
abort: HTTP Error 500: Internal Server Error
Pushed by haftandilian@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c287079aaf7f
Part 1 - Mac Crash deadlock triggered by CrashReporter::GetFlatThreadAnnotation() lock acquisition r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/e4216d889836
Part 2 - Remove unused type DeleteWithLock r=gsvelto
https://hg.mozilla.org/mozilla-central/rev/c287079aaf7f
https://hg.mozilla.org/mozilla-central/rev/e4216d889836
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
You need to log in before you can comment on or make changes to this bug.