Closed Bug 1651102 Opened 4 years ago Closed 4 years ago

Deadlock when calling profiler_get_backtrace from mozilla::PresShell::EnsureStyleFlush()

Categories

(Core :: Gecko Profiler, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla80
Tracking Status
firefox80 --- fixed

People

(Reporter: florian, Assigned: mozbugz)

References

Details

Attachments

(4 files)

I was profiling with the nativeallocations feature enabled.
Unfortunately I don't know more than the stack lldb gave me:

* thread #1, name = 'MainThread', queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff73edc062 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff73f9a917 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 83
    frame #2: 0x00007fff73f98937 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 222
    frame #3: 0x000000010a0e7c4a libmozglue.dylib`mozilla::detail::MutexImpl::lock() at Mutex_posix.cpp:118:3 [opt]
    frame #4: 0x000000010a0e7c42 libmozglue.dylib`mozilla::detail::MutexImpl::lock(this=0x0000000114bc0678) at Mutex_posix.cpp:163 [opt]
    frame #5: 0x000000010fb5cbbc XUL`profiler_get_backtrace() [inlined] PSMutex::Lock(this=0x0000000114bc0678) at platform.cpp:298:35 [opt]
    frame #6: 0x000000010fb5cb9e XUL`profiler_get_backtrace() [inlined] PSAutoLock::PSAutoLock(this=<unavailable>, aMutex=0x0000000114bc0678) at platform.cpp:374 [opt]
    frame #7: 0x000000010fb5cb9e XUL`profiler_get_backtrace() [inlined] PSAutoLock::PSAutoLock(this=<unavailable>, aMutex=0x0000000114bc0678) at platform.cpp:374 [opt]
    frame #8: 0x000000010fb5cb9e XUL`profiler_get_backtrace() at platform.cpp:5193 [opt]
    frame #9: 0x000000010d134bd9 XUL`mozilla::PresShell::EnsureStyleFlush() [inlined] mozilla::PresShell::SetNeedStyleFlush(this=0x000000012dbfb000) at PresShellInlines.h:43:19 [opt]
    frame #10: 0x000000010d134bc9 XUL`mozilla::PresShell::EnsureStyleFlush(this=0x000000012dbfb000) at PresShellInlines.h:51 [opt]
    frame #11: 0x000000010d173dd0 XUL`mozilla::dom::NoteDirtyElement(aElement=0x000000012d87dd80, aBits=524288) at Element.cpp:4164:16 [opt]
    frame #12: 0x000000010ebaf896 XUL`mozilla::RestyleManager::SnapshotFor(this=<unavailable>, aElement=0x000000012d87dd80) at RestyleManager.cpp:2926:12 [opt]
    frame #13: 0x000000010ebb06f2 XUL`mozilla::RestyleManager::TakeSnapshotForAttributeChange(this=0x000000012d918880, aElement=0x000000012d87dd80, aNameSpaceID=0, aAttribute=0x0000000112cdfc24) at RestyleManager.cpp:3287:36 [opt]
    frame #14: 0x000000010eb9a573 XUL`mozilla::PresShell::AttributeWillChange(mozilla::dom::Element*, int, nsAtom*, int) [inlined] mozilla::RestyleManager::AttributeWillChange(this=<unavailable>, aElement=0x000000012d87dd80, aNameSpaceID=0, aAttribute=0x0000000112cdfc24, aModType=<unavailable>) at RestyleManager.cpp:3254:3 [opt]
    frame #15: 0x000000010eb9a565 XUL`mozilla::PresShell::AttributeWillChange(this=0x000000012dbfb000, aElement=0x000000012d87dd80, aNameSpaceID=0, aAttribute=0x0000000112cdfc24, aModType=<unavailable>) at PresShell.cpp:4350 [opt]
    frame #16: 0x000000010d1a551e XUL`mozilla::dom::MutationObservers::NotifyAttributeWillChange(aElement=0x000000012d87dd80, aNameSpaceID=0, aAttribute=0x0000000112cdfc24, aModType=2) at MutationObservers.cpp:151:3 [opt]
    frame #17: 0x000000010d16b06f XUL`mozilla::dom::Element::SetAttr(this=<unavailable>, aNamespaceID=0, aName=0x0000000112cdfc24, aPrefix=<unavailable>, aValue=<unavailable>, aSubjectPrincipal=0x0000000000000000, aNotify=<unavailable>) at Element.cpp:2205:5 [opt]
    frame #18: 0x000000010ea10be2 XUL`nsMenuX::MenuOpened() [inlined] mozilla::dom::Element::SetAttr(this=<unavailable>, aNameSpaceID=<unavailable>, aName=<unavailable>, aPrefix=<unavailable>, aValue=<unavailable>, aNotify=<unavailable>) at Element.h:889:12 [opt]
    frame #19: 0x000000010ea10bbc XUL`nsMenuX::MenuOpened() [inlined] mozilla::dom::Element::SetAttr(this=<unavailable>, aNameSpaceID=<unavailable>, aName=<unavailable>, aValue=<unavailable>, aNotify=<unavailable>) at Element.h:885 [opt]
    frame #20: 0x000000010ea10bbc XUL`nsMenuX::MenuOpened(this=0x000000012ea47f90) at nsMenuX.mm:318 [opt]
    frame #21: 0x000000010ea18003 XUL`UpdateMenu(aMenu=0x000000012ea47f90) at nsStandaloneNativeMenu.mm:51:10 [opt]
    frame #22: 0x000000010ea18072 XUL`non-virtual thunk to nsStandaloneNativeMenu::MenuWillOpen(bool*) [inlined] nsStandaloneNativeMenu::MenuWillOpen(this=<unavailable>, aResult=0x00007ffee5b695af) at nsStandaloneNativeMenu.mm:69:3 [opt]
    frame #23: 0x000000010ea18069 XUL`non-virtual thunk to nsStandaloneNativeMenu::MenuWillOpen(this=<unavailable>, aResult=0x00007ffee5b695af) at nsStandaloneNativeMenu.mm:0 [opt]
    frame #24: 0x000000010fe00ae4 XUL`-[MacApplicationDelegate applicationDockMenu:](self=<unavailable>, _cmd=<unavailable>, sender=<unavailable>) at MacApplicationDelegate.mm:253:18 [opt]
    frame #25: 0x00007fff3739c426 AppKit`-[NSApplication _createDockMenu:] + 176
    frame #26: 0x00007fff3739c343 AppKit`__44-[NSApplication _copyPublicPersistentUIInfo]_block_invoke + 42
    frame #27: 0x00007fff37642b70 AppKit`___NSMainRunLoopPerformBlockInModes_block_invoke + 25
    frame #28: 0x00007fff39f36a7c CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_BLOCK__ + 12
    frame #29: 0x00007fff39f369c4 CoreFoundation`__CFRunLoopDoBlocks + 379
    frame #30: 0x00007fff39f358bb CoreFoundation`__CFRunLoopRun + 912
    frame #31: 0x00007fff39f34ece CoreFoundation`CFRunLoopRunSpecific + 462
    frame #32: 0x00007fff38b63abd HIToolbox`RunCurrentEventLoopInMode + 292
    frame #33: 0x00007fff38b637d5 HIToolbox`ReceiveNextEventCommon + 584
    frame #34: 0x00007fff38b63579 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64
    frame #35: 0x00007fff371ab829 AppKit`_DPSNextEvent + 883
    frame #36: 0x00007fff371aa070 AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 1352
    frame #37: 0x000000010e9e898d XUL`-[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:](self=0x000000010a3507c0, _cmd=<unavailable>, mask=18446744073709551615, expiration=0x00007fff917bd4b8, mode="kCFRunLoopDefaultMode", flag=YES) at nsAppShell.mm:171:24 [opt]
    frame #38: 0x00007fff3719bd7e AppKit`-[NSApplication run] + 658
    frame #39: 0x000000010e9e9921 XUL`nsAppShell::Run(this=0x0000000122bb28b0) at nsAppShell.mm:692:5 [opt]
    frame #40: 0x000000010fd20357 XUL`nsAppStartup::Run(this=0x0000000122d510b0) at nsAppStartup.cpp:270:30 [opt]
    frame #41: 0x000000010fdf698d XUL`XREMain::XRE_mainRun(this=0x00007ffee5b6afb0) at nsAppRunner.cpp:4717:22 [opt]
    frame #42: 0x000000010fdf70f6 XUL`XREMain::XRE_main(this=0x00007ffee5b6afb0, argc=<unavailable>, argv=<unavailable>, aConfig=<unavailable>) at nsAppRunner.cpp:4866:8 [opt]
    frame #43: 0x000000010fdf7458 XUL`XRE_main(argc=<unavailable>, argv=<unavailable>, aConfig=<unavailable>) at nsAppRunner.cpp:4920:21 [opt]
    frame #44: 0x000000010a095d18 firefox`main [inlined] do_main(argc=<unavailable>, argv=0x00007ffee5b6b580, envp=0x00007ffee5b6b5b0) at nsBrowserApp.cpp:217:22 [opt]
    frame #45: 0x000000010a095a91 firefox`main(argc=<unavailable>, argv=<unavailable>, envp=0x00007ffee5b6b5b0) at nsBrowserApp.cpp:331 [opt]
    frame #46: 0x00007fff73d98cc9 libdyld.dylib`start + 1

You can use bt all in lldb to get stacks for all threads.

Native allocations is the gift that keeps on giving! 😱

Yes, it would be very helpful if you could capture other threads. (Thank you Markus for the hint.)

I'm guessing this could be the same/similar issue as for bug 1642726 and/or bug 1649776.
If I'm correct:
The solution there was to "try-lock" from allocations, which was acceptable because allocations are already statistically sampled so losing a few that way didn't matter.
But in this case here, losing requested backtraces would not be acceptable. And I'm thinking this kind of deadlock could happen with other mutex-locking profiler functions.

So maybe I should really work on the more generic solution I was thinking of in bug 1642726 comment 4:

[...] somehow we shouldn't end up dispatching a task [to handle a profiler buffer chunk update] while holding any profiler lock.
So we'd need a way to save this data somewhere (without using blocking synchronization) and later find it and deal with it. It feels quite complex to me at the moment, I'll need to think more about the technical details.

Severity: -- → S3
Flags: needinfo?(florian)
Priority: -- → P2

On a related note: Do we really need to lock the profiler mutex from profiler_get_backtrace?
Naively, I would hope that it's not needed when walking the stack from the stack-owning thread.
We already have other stack walkers that don't use the profiler lock.
And the periodic sampler should still be able to do its work because it's suspending the target thread and just reading from its stack.

What do you think Markus, any strong reason for this lock to exist?
If you agree it's not strictly needed, I'll file a separate bug for it. (And this would help with bug 899764 and bug 1280077.)

Flags: needinfo?(mstange.moz)

I see quite a few uses of the LockRef inside profiler_get_backtrace. For example, DoSharedSample checks ActivePS::FeatureStackWalk, which could be mutated on the profiler controller thread.

Flags: needinfo?(mstange.moz)

Thanks. This doesn't sound like a hard blocker, so I've filed bug 1651259 to explore this properly... 😉

Attached file deadlock bt all

I reproduced a deadlock again, it's not the same stack on the main thread, but the thread #29 (Compositor) is also locked on trying to add a marker.

Flags: needinfo?(florian)
Assignee: nobody → gsquelart
Status: NEW → ASSIGNED

Profile buffer updates could be triggered from a number of locations, including scopes where profiler and/or system locks are held, making deadlocks possible if profiler and/or system function are called.
So instead of dispatching updates to the main thread (which may use OS task queue functions), we fold updates into a static storage. When child updates arrive, we can safely handle parent updates as well.

Child updates are assumed to arrive regularly enough to properly handle parent updates.
But in the worst case, if no updates came from children, it should mean nothing is happening, and in any case we wouldn't know how the memory is used elsewhere. Note that the chunk manager still enforces local limits automatically, so memory usage would still get limited.

Depends on D83745

Profile buffer updates could be triggered from a number of locations, including scopes where profiler and/or system locks are held, making deadlocks possible if profiler and/or system function are called.
So instead of dispatching updates to the main thread (which may use OS task queue functions), we fold updates into a static storage. The profiler sampler loop regularly triggers processing of these pending updates.

Depends on D83746

Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/437241fc572f
Remove unused ProfileBufferGlobalController members - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/343987668ae8
Safely delay handling of parent profile buffer updates - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/c1f35101d8a6
Safely delay handling of child profile buffer updates - r=canaltinova
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla80
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: