Open Bug 1706500 Opened 4 years ago Updated 3 years ago

Crash in [@ shutdownhang | __futex_abstimed_wait_common64]

Categories

(Core :: XPCOM, defect)

Firefox 90
x86_64
Linux
defect

Tracking

()

REOPENED
Tracking Status
firefox-esr78 --- affected
firefox-esr91 --- affected
firefox88 --- affected
firefox89 --- affected
firefox90 --- affected
firefox93 --- affected
firefox94 --- affected
firefox95 --- affected
firefox96 --- affected

People

(Reporter: matt.fagnani, Unassigned)

References

Details

(Keywords: hang, Whiteboard: [QA-not-reproducible][tbird crash])

Crash Data

I updated to Firefox Nightly 90.0a1 (2021-4-20) in a Fedora 34 KDE Plasma installation. I started it with the Wayland backend in a Plasma 5.21.4 Wayland session. Due to the errors I reported at https://bugzilla.mozilla.org/show_bug.cgi?id=1706452, Firefox started on X. I closed Firefox. I tried to open Firefox again, but a message stating that Firefox was still running was shown. The crash reporter appeared about a minute after I closed Firefox. The reason for the crash was "Shutdown hanging after all known phases and workers finished." This crash usually doesn't happen. I've only seen a shutdown crash with that reason once.

Maybe Fission related. (DOMFissionEnabled=1)

Crash report: https://crash-stats.mozilla.org/report/index/7520dc48-86ec-465d-8a90-2a4150210420

MOZ_CRASH Reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)

Top 10 frames of crashing thread:

0 libpthread.so.0 __futex_abstimed_wait_common64 /usr/src/debug/glibc-2.33/sysdeps/nptl/futex-internal.c:74
1 libpthread.so.0 __pthread_cond_wait /usr/src/debug/glibc-2.33/nptl/pthread_cond_wait.c:619
2 firefox-bin mozilla::detail::ConditionVariableImpl::wait mozglue/misc/ConditionVariable_posix.cpp:108
3 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1093
4 libxul.so NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:548
5 libxul.so nsThreadManager::Shutdown xpcom/threads/nsThreadManager.cpp:420
6 libxul.so mozilla::ShutdownXPCOM xpcom/build/XPCOMInit.cpp:653
7 libxul.so ScopedXPCOMStartup::~ScopedXPCOMStartup toolkit/xre/nsAppRunner.cpp:1668
8 libxul.so XREMain::XRE_main toolkit/xre/nsAppRunner.cpp:5556
9 libxul.so XRE_main toolkit/xre/nsAppRunner.cpp:5598

I'm unable to reproduce this, since I only have Ubuntu 20. I updated to Nightly 90.0a1 and saw no issues.

Whiteboard: QA-not-reproducible

This is not a new crash signature, but its crash volume has increased starting in 88 and 89:

https://crash-stats.mozilla.org/search/?signature=~__futex_abstimed_wait_common64&product=Firefox&date=%3E%3D2020-10-27T22%3A09%3A00.000Z&date=%3C2021-04-27T22%3A09%3A00.000Z&_facets=signature&_facets=version&_facets=dom_fission_enabled&_facets=platform&_facets=cpu_arch&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version

The Fission Nightly experiment increased from ~20% Fission to ~40% Fission in 88 Nightly. We expect to see more shutdown hangs and ShutDownKills with Fission, not because Fission is causing more but simply because Fission launches (and shuts down) more content processes. So I suspect this increase in crash volume is correlated with Fission, but not caused by Fission.

Crash Signature: [@ shutdownhang | __futex_abstimed_wait_common64] → [@ shutdownhang | __futex_abstimed_wait_common64] [@ IPCError-browser | ShutDownKill | __futex_abstimed_wait_common64]
Whiteboard: QA-not-reproducible → QA-not-reproducible [not-a-fission-bug]

The crash reports being referred to in this bug have unfortunately expired, and we never got around to analyzing them. I'm closing this one out as INCOMPLETE.

Status: UNCONFIRMED → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE
Flags: needinfo?(cpeterson)

(In reply to Mike Kaply [:mkaply] from comment #4)

Volume is really low. cpeterson is this worth worrying about or should we keep closed as incomplete?

I think we should keep a crash bug open as long as we are still receiving crash reports for it, since the cause hasn't been resolved. Here's a crash report from Nightly 96:

bp-094ff203-d415-4e5a-b866-99aea0211103

I see the crash reason is MOZ_CRASH(Shutdown hanging after all known phases and workers finished.), which is the same as an intermittent test crash, bug 1719400. The users and tests might be hanging for different reasons, so these bugs aren't necessarily duplicates.

Status: RESOLVED → REOPENED
Ever confirmed: true
Flags: needinfo?(cpeterson)
Keywords: hang
Product: Firefox → Core
Resolution: INCOMPLETE → ---
See Also: → 1719400
Whiteboard: QA-not-reproducible [not-a-fission-bug] → QA-not-reproducible
Component: General → XPCOM

I looked at my crash reports and found an old one which has the signature [@ shutdownhang | __futex_abstimed_wait_common64 ] linked to this bug, see here:
https://crash-stats.mozilla.org/report/index/4c04f911-696a-47c7-8f10-a40f10210517

It doesn't list DOMFissionEnabled = 1 in the crash annotations and the telemetry environment says fissionEnabled: false, so I guess this bug isn't related to Fission and you might want to change the whiteboard tag.

I assume that the report will soon be deleted, so here is some of its information:

MOZ_CRASH Reason (Sanitized): MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)

Top 10 frames of crashing thread:

0 	libpthread.so.0 	__futex_abstimed_wait_common64
1 	libpthread.so.0 	__pthread_cond_wait
2 	firefox-bin 	mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) 	mozglue/misc/ConditionVariable_posix.cpp:108
3 	libxul.so 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1093
4 	libxul.so 	nsThreadManager::Shutdown() 	xpcom/threads/nsThreadManager.cpp:420
5 	libxul.so 	mozilla::ShutdownXPCOM(nsIServiceManager*) 	xpcom/build/XPCOMInit.cpp:655
6 	libxul.so 	ScopedXPCOMStartup::~ScopedXPCOMStartup() 	toolkit/xre/nsAppRunner.cpp:1674
7 	libxul.so 	XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:5582
8 	libxul.so 	XRE_main(int, char**, mozilla::BootstrapConfig const&) 	toolkit/xre/nsAppRunner.cpp:5624
9 	firefox-bin 	main 	browser/app/nsBrowserApp.cpp:351 

Still very rare...

Thunderbird 101.0a1 shutdownhang | __futex_abstimed_wait_common64 bp-67ba2198-fbfe-41a0-82e8-680ce0220419
Firefox 101.0a1 shutdownhang | __futex_abstimed_wait_common64 bp-3f94d5fc-6a22-4b3a-97db-5f1aa0220430

Whiteboard: QA-not-reproducible → [QA-not-reproducible][tbird crash]

Unfortunately this stack is quite bad, as the __futex_abstimed_wait_common64 stack doesn't tell us anything about why the futex is being waited on. There are a bunch of different failures like bug 1782445, which also have bad signatures.

:jstutte, do we have a common bug we can/should be duping these to? We might also want to add these calls to the prefix list so that they're split into more useful hang stacks.
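To illustrate the prefix-list idea mentioned above, here is a minimal sketch of how a signature could keep extending past generic frames. It is a simplified model and not the real signature generator: the function name `build_signature` and the contents of `PREFIX_FRAMES` are assumptions made for this example only. The frame names are taken from the stack in this bug.

```python
# Hypothetical prefix list: frames too generic to identify a hang by themselves.
PREFIX_FRAMES = {
    "__futex_abstimed_wait_common64",
    "__pthread_cond_wait",
    "mozilla::detail::ConditionVariableImpl::wait",
}


def build_signature(frames, prefix_frames=PREFIX_FRAMES, label="shutdownhang"):
    """Join frames into a signature, continuing past any frame in prefix_frames."""
    parts = [label]
    for frame in frames:
        parts.append(frame)
        if frame not in prefix_frames:
            break  # the first non-generic frame ends the signature
    return " | ".join(parts)


# Top frames of the crashing thread reported in this bug (arguments omitted).
stack = [
    "__futex_abstimed_wait_common64",
    "__pthread_cond_wait",
    "mozilla::detail::ConditionVariableImpl::wait",
    "nsThread::ProcessNextEvent",
    "NS_ProcessNextEvent",
    "nsThreadManager::Shutdown",
]

# With an empty prefix list the signature stops at the futex frame.
current = build_signature(stack, prefix_frames=set())
# With the hypothetical prefix list it runs on to the first non-generic frame.
extended = build_signature(stack)
```

In this model, `current` reproduces the signature of this bug, while `extended` ends at nsThread::ProcessNextEvent. Whether that frame is specific enough to be useful is a separate question.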

Severity: -- → S3
Flags: needinfo?(jstutte)

The most helpful way to facet these is through the xpcom spin event loop stack, I think. Not sure if we can automate this meaningfully, though.

I did not check all reports, but the bug to use is probably bug 1505660, with a signature like

[@ shutdownhang | mozilla::SpinEventLoopUntil | nsThread::Shutdown | nsThreadManager::ShutdownNonMainThreads ] 

:gsvelto, any ideas how we can prefix these better?

Flags: needinfo?(jstutte) → needinfo?(gsvelto)

If xpcom spin event loop stack is a better fit than the actual stack then we could use that for the crash signature for shutdown hangs. That is we could make this crash signature go from [@ shutdownhang | __futex_abstimed_wait_common64] to shutdownhang | default: CompositorThreadHolder::Shutdown and this one similarly turn into shutdownhang | default: ThreadEventTarget::Dispatch. Would this be better? What should we use for crashes where the annotation is empty like this one?
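A rough sketch of the rewrite proposed above, assuming the annotation arrives as a plain string. The function name `shutdownhang_signature` and its parameters are hypothetical. Falling back to the stack-based signature when the annotation is empty is only one possible answer to the open question, not a decision made in this bug.

```python
def shutdownhang_signature(stack_signature, spin_event_loop_stack):
    """Prefer the spin event loop annotation over the native stack, when present."""
    prefix = "shutdownhang | "
    if not stack_signature.startswith(prefix):
        return stack_signature  # not a shutdown hang, leave it untouched
    annotation = (spin_event_loop_stack or "").strip()
    if not annotation:
        # Assumption: keep the stack-based signature when the annotation is empty.
        return stack_signature
    return prefix + annotation


rewritten = shutdownhang_signature(
    "shutdownhang | __futex_abstimed_wait_common64",
    "default: CompositorThreadHolder::Shutdown",
)
unchanged = shutdownhang_signature(
    "shutdownhang | __futex_abstimed_wait_common64",
    "",
)
```

Here `rewritten` becomes shutdownhang | default: CompositorThreadHolder::Shutdown, and `unchanged` keeps the futex-based signature because there is no annotation to use.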

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #10)

That is we could make this crash signature go from [@ shutdownhang | __futex_abstimed_wait_common64] to shutdownhang | default: CompositorThreadHolder::Shutdown and this one similarly turn into shutdownhang | default: ThreadEventTarget::Dispatch. Would this be better? What should we use for crashes where the annotation is empty like this one?

I think we should take care to look only at reports that have nsThread::Shutdown on the stack here and make those point to bug 1505660. The rest are different cases, as you can also see from the shutdown phase in the MOZ_CRASH reason. Those with nsThread::Shutdown should all have a SpinEventLoopUntil on the stack. The rest need further analysis.
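The triage rule described above could be sketched as follows. This is only an illustration: the function name `triage` and the bucket labels are made up for this example, and frames are assumed to be plain function-name strings.

```python
def triage(frames):
    """Bucket a hang report by whether nsThread::Shutdown is on the stack."""
    # startswith() matches "nsThread::Shutdown" and "nsThread::Shutdown()",
    # but not "nsThreadManager::Shutdown".
    has_thread_shutdown = any(f.startswith("nsThread::Shutdown") for f in frames)
    # Reported alongside the bucket so the expectation can be checked.
    has_spin_loop = any("SpinEventLoopUntil" in f for f in frames)
    bucket = "bug 1505660" if has_thread_shutdown else "needs further analysis"
    return bucket, has_spin_loop


# Frames from the signature suggested for bug 1505660.
known = triage([
    "mozilla::SpinEventLoopUntil",
    "nsThread::Shutdown",
    "nsThreadManager::ShutdownNonMainThreads",
])

# Frames from the crashing thread originally reported in this bug.
original = triage([
    "__futex_abstimed_wait_common64",
    "__pthread_cond_wait",
    "mozilla::detail::ConditionVariableImpl::wait",
    "nsThread::ProcessNextEvent",
    "NS_ProcessNextEvent",
    "nsThreadManager::Shutdown",
])
```

With these inputs, `known` lands in the bug 1505660 bucket with a spin event loop frame present, while `original` has no nsThread::Shutdown frame and falls into the group that needs further analysis.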

If we then want to use xpcom spin event loop stack we should revisit also some signatures that already arrive on bug 1505660. So probably the first step is to have a signature that makes it easier to assign them to bug 1505660 to have a common starting point for further analysis.

I can make some changes to "peel off" the crashes that should fall under bug 1505660 from this signature. Those changes would alter the other crashes under this signature too. Here are some examples of the new signatures:

The last one would fall under bug 1505660. Would this work for you?

This looks like a good improvement, thanks!

Depends on: 1806107

Now that bug 1806107 has landed, the signatures here should disappear and break away into separate ones, as I described in comment 12.

The patch in bug 1806107 was insufficient to clean up the signatures, so I'll file another bug.

Depends on: 1810519