Closed Bug 1624171 Opened 4 years ago Closed 2 years ago

Crash in [@ shutdownhang | libpthread.so.0@0xfea2]

Categories

(Core :: mozglue, defect, P5)

76 Branch
Unspecified
Linux
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox76 --- affected

People

(Reporter: matt.fagnani, Unassigned)

Details

Crash Data

This bug is for crash report bp-c7c8d04a-6d48-4a69-98fc-5f0890200322.

Top 7 frames of crashing thread:

0 libpthread.so.0 libpthread.so.0@0xfea2 
1 firefox-bin mozilla::detail::MutexImpl::unlock mozglue/misc/Mutex_posix.cpp:178
2 libpthread.so.0 libpthread.so.0@0xfc2f 
3 libxul.so _fini 
4 firefox-bin <name omitted> mozglue/misc/ConditionVariable_posix.cpp:109
5 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1141
6 firefox-bin arena_t::MallocSmall memory/build/mozjemalloc.cpp:2862

I updated Firefox Nightly 76.0a1 (2020-3-21) on X in Help > About Nightly. I pressed Restart. Firefox didn't restart properly. The crash reporter appeared, and I submitted the report. The report showed a segmentation fault in libpthread.so.0 in glibc-2.31-1.fc32. The segmentation fault might've been a null pointer dereference since the crash address was 0x0. The reason for the crash was MOZ_CRASH(Shutdown hanging before starting.) I'm using the Fedora 32 KDE Plasma spin with Plasma 5.18.3 on Wayland, KF 5.67.0, Qt 5.13.2.

:dmajor, :rpl, Blake and I looked at the crash report for a while, but we weren't able to spot anything conclusive about what would have caused the shutdown hang. Can you spot anything we're missing here, or do you know who might be able to have another look?

Flags: needinfo?(dmajor)

Since this is on Linux, I'm not the best person to take a look. Maybe gsvelto has an idea.

Flags: needinfo?(dmajor) → needinfo?(gsvelto)

Firefox somehow got stuck before it could update; the MOZ_CRASH() reason is "Shutdown hanging before starting." However, it's hard to tell what's going on without symbols. Since this is Fedora 32, I assume it's the beta version, for which we don't have symbols yet. I'll add Fedora beta to my symbol scraping scripts and reprocess the crash once we have proper debug information available. Leaving the NI? for now.

I'm pulling down Fedora 32 packages right now. Extracting debug information and pushing it on our symbol servers is going to take a while but we should be able to re-process the crashes by tomorrow morning.

The raw dump page of the crash report shows the crashing thread is the Shutdown Hang Terminator: https://crash-stats.mozilla.org/report/index/c7c8d04a-6d48-4a69-98fc-5f0890200322#tab-rawdump . I think that after I clicked Restart, the default timeout for shutting down of maybe 5 seconds passed (dom.ipc.tabs.shutdownTimeoutSecs is 5). Then the Shutdown Hang Terminator did something that led to a segmentation fault due to a null pointer dereference, possibly in mozilla::(anonymous namespace)::RunWatchdog(void*) at hg:hg.mozilla.org/mozilla-central:toolkit/components/terminator/nsTerminator.cpp:c4d2ca8f78b7680dc0b199a2cb0e2c6f18cd8963

I reported a similar crash of Firefox Nightly 76.0a1 (2020-3-9) after updating and restarting at https://bugzilla.mozilla.org/show_bug.cgi?id=1621561
That crash had a different trace of the main thread, but it was also a segmentation fault involving a null pointer. The crashing thread was the Shutdown Hang Terminator in mozilla::(anonymous namespace)::RunWatchdog(void*) at hg:hg.mozilla.org/mozilla-central:toolkit/components/terminator/nsTerminator.cpp:268543e53e1b11ce0e468d985ea3777563e7b8a8 according to https://crash-stats.mozilla.org/report/index/c4cafe1f-cd8e-414e-8ca2-dc2530200310#tab-rawdump . The trace included the glibc 2.31 debug info, which I have installed. These crashes might have a common cause. The crashes occurred fewer than 10% of the times I've updated and restarted Firefox Nightly. Fedora 32 is in beta. Thanks.

Thanks Matt. I've scraped all the Fedora 32 symbols but somehow I can't get a better trace out of this one. Do you have the testing updates repositories enabled? Those are the only ones which I haven't scraped yet.

What you're seeing is a deliberate crash triggered because Firefox detected a shutdown hang. If something's causing Firefox to take time to shut down, then it's probably the same issue as in your other report even though the traces are different. I see from the crash report that you have both the Privacy Badger and HTTPS Everywhere add-ons installed, though I don't think either one should affect shutdown. Can you think of any other reason why shutdown might be slow? When you restart Firefox, does it trigger a lot of disk activity?
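To illustrate the general idea of a watchdog that fires when shutdown takes too long, here is a minimal Python sketch. It is not the code in nsTerminator.cpp: the names start_watchdog, shutdown_done and on_hang are hypothetical, and a real terminator would crash the process rather than call a Python callback.

```python
import threading


def start_watchdog(shutdown_done, timeout_secs, on_hang):
    """Start a daemon thread that calls on_hang() if shutdown_done
    (a threading.Event) is not set within timeout_secs seconds.

    Hypothetical helper for illustration only; not Firefox's implementation.
    """
    def watch():
        # Event.wait() returns False when the timeout elapses
        # without the event having been set.
        if not shutdown_done.wait(timeout_secs):
            on_hang()

    thread = threading.Thread(target=watch, name="Sketch Hang Watchdog",
                              daemon=True)
    thread.start()
    return thread
```

In this sketch the main thread would call shutdown_done.set() once shutdown completes; if it is still blocked when the timeout expires, on_hang runs on the watchdog thread, which is why the crashing thread in such a report is the watchdog and not the thread that is actually stuck.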

Flags: needinfo?(gsvelto) → needinfo?(matthew.fagnani)

Yes, I have the Fedora 32 updates-testing repositories enabled. I think the updates-testing repositories are enabled by default in Fedora beta and branched versions. The updates of Nightly sometimes led to slow shutdown/restart times, particularly if I pressed Restart right after the update had finished being applied. Using top, I've occasionally seen the CPU spend a high percentage of time in the waiting state due to data transfer when certain KDE programs became unresponsive and crashed with errors that their Wayland connections broke. Something similar might've been slowing down Firefox from shutting down after updates. Thanks.

Flags: needinfo?(matthew.fagnani)

Marking as a P5 (at least until we get more info or this rises to a more concerning number of crashes) and renewing the needinfo for Gabriele.

Flags: needinfo?(gsvelto)
Priority: -- → P5

I scraped all the RPMs I could find on Fedora servers but still couldn't symbolicate this trace; probably this was generated from a package that was updated before I ran my scripts. I'm afraid there's not much we can do without symbols, but if it happens again you should get a better crash report because now we have symbols covering all the Fedora 32 system libraries, including testing updates.

Flags: needinfo?(gsvelto)

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression

Alright, I finally managed to put symbols on that crash. It turns out there was a bug in the tool I used to dump symbols from the system libraries, and once that was fixed I re-processed all the affected crashes (basically a month's worth of reports) and this was among them. Now I'll have a deeper look because I think I can tell what the problem is from the stack contents.

The shutdown is getting stuck here:

https://searchfox.org/mozilla-central/rev/7fba7adfcd695343236de0c12e8d384c9b7cd237/toolkit/xre/nsXREDirProvider.cpp#1043

There aren't many listeners for that. IIRC only telemetry itself - which should send the telemetry - and a callback that saves the prefs to disk. Could you check the size of your prefs.js file? It's in your profile directory under .mozilla/firefox/Profiles/<profile_name>/prefs.js. We had instances where the prefs.js file grew to a disproportionate size and if that's the case it might take a while to write it out during restart.
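A quick way to check the file size is sketched below in Python. The helper name prefs_size_report and the 1 MB warn_bytes threshold are assumptions chosen for illustration, not values Firefox uses; pass in whatever profile directory applies on your system.

```python
from pathlib import Path


def prefs_size_report(profile_dir, warn_bytes=1_000_000):
    """Return (size_in_bytes, is_suspiciously_large) for prefs.js
    inside profile_dir.

    warn_bytes is an arbitrary illustrative threshold, not a Firefox limit.
    """
    prefs = Path(profile_dir) / "prefs.js"
    size = prefs.stat().st_size  # raises FileNotFoundError if prefs.js is missing
    return size, size > warn_bytes
```

Calling prefs_size_report with the path to your profile directory returns the size in bytes and a flag saying whether it exceeds the threshold.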

Flags: needinfo?(matthew.fagnani)

Gabriele, that prefs.js file is 17 kB in my profile directory. I have had telemetry and studies disabled. I noticed that this crash and the other shutdown crash I mentioned in comment 5 have __pthread_cond_wait at pthread_cond_wait.c:638 from glibc 2.31 and <name omitted> at ConditionVariable_posix.cpp:109 at the top of the main thread traces, though some of the rest of the traces differed. If the main thread was still waiting 5 seconds after the shutdown started, that might have led to the Shutdown Hang Terminator thread doing what it did. Thanks for fixing the debug symbols issue and looking into this problem.

Flags: needinfo?(matthew.fagnani)

The main thread is waiting for something, but it's unclear what that is because it's stuck in JavaScript code. Whatever is happening occurs during the profile-before-change-telemetry event, which is triggered during shutdown. From what I can tell, the only JS listener for that event is Telemetry itself, so we might be somewhere here. Chris, can the telemetry code get stuck somehow during a shutdown for users who have telemetry disabled?

Crash Signature: [@ shutdownhang | libpthread.so.0@0xfea2] → [@ shutdownhang | libpthread.so.0@0xfea2] [@ shutdownhang | __pthread_cond_wait | <name omitted> | NS_InvokeByIndex]
Flags: needinfo?(chutten)

BTW I downloaded a few minidumps from this signature and found the same event in the stack in recent crashes so we might be onto something here.

To my knowledge there's nothing in the shutdown path that should care whether telemetry is disabled or not. But "should" and "does" don't always align...

Flags: needinfo?(chutten)

Bugbug thinks this bug should belong to this component, but please revert this change in case of error.

Component: General → mozglue
Product: Firefox → Core

Closing because no crashes reported for 12 weeks.

Status: UNCONFIRMED → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME