Crash in shutdownhang | NtReadVirtualMemory

RESOLVED FIXED in Firefox 64

Status

()

defect
--
critical
RESOLVED FIXED
7 months ago
2 months ago

People

(Reporter: skywalker333, Assigned: gsvelto)

Tracking

(Blocks 2 bugs, 4 keywords)

64 Branch
mozilla65
All
Windows
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr60 unaffected, firefox62 unaffected, firefox63+ wontfix, firefox64+ fixed, firefox65 fixed)

Details

(crash signature)

Attachments

(2 attachments)

Reporter

Description

7 months ago
This bug was filed from the Socorro interface and is
report bp-ecd70659-392a-4b1b-9322-c099a0181014.
=============================================================

Top 10 frames of crashing thread:

0 ntdll.dll NtReadVirtualMemory 
1 kernelbase.dll ReadProcessMemory 
2 dbghelp.dll Win32LiveSystemProvider::ReadAllVirtual 
3 dbghelp.dll GenReadTlsDirectory 
4 dbghelp.dll GenAllocateModuleObject 
5 dbghelp.dll GenGetProcessInfo 
6 dbghelp.dll MiniDumpProvideDump 
7 dbghelp.dll MiniDumpWriteDump 
8 xul.dll bool google_breakpad::ExceptionHandler::WriteMinidumpWithExceptionForProcess toolkit/crashreporter/breakpad-client/windows/handler/exception_handler.cc:950
9 xul.dll google_breakpad::ExceptionHandler::WriteMinidumpForChild toolkit/crashreporter/breakpad-client/windows/handler/exception_handler.cc:814

=============================================================
Reporter

Comment 1

7 months ago
Top Crashers for Firefox 64.0a1

14      1.1% 	-0.64% 	shutdownhang | NtReadVirtualMemory	46 	46 	0 	0 	46 	0 	2015-01-21 


Signature report for shutdownhang | NtReadVirtualMemory

Windows 10 	35 	76.1%
Windows 8.1 	6 	13.0%
Windows 8 	3 	6.5%
Windows 7 	2 	4.3%

Firefox 	64.0a1 	46 	100.0% 	42

amd64 	36 	78.3%
x86 	10 	21.7%
How are we tracking these shutdown issues, Jim?
Flags: needinfo?(jmathies)
Looks like this might be related to the extension hang associated with fxmonitor, although those tend to have different signatures.
Flags: needinfo?(jmathies)
(In reply to Andrew Overholt [:overholt] from comment #2)
> How are we tracking these shutdown issues, Jim?

A bunch of metas with thousands of bugs under them. :(

Comment 5

7 months ago
Just got hit by this crash.
https://crash-stats.mozilla.com/report/index/c20d5c36-83d9-42b5-b1dc-892090181023
Any chance that these signatures:
[@ shutdownhang | NtQueryDirectoryFileEx ]
[@ shutdownhang | NtCreateFile ]
are related? I'm experiencing a rising number of actual shutdown crashes lately.
Nightly has been very sluggish for me in the past month after a certian time of use, and when it gets to that point and I close Nightly, this is the crash I get ~2 minutes after Nightly closed.
The crash you reference in comment 5 is when we're trying to write a minidump. Ted, do you have thoughts on why we'd crash there?
Flags: needinfo?(ted)

Comment 7

7 months ago
shutdownhangs with a similar stack are regressing since 64.0a1 build 20180919003800, where bug 1463048 was landing (and later uplifted to 63).

the issue is spread out over multiple signatures:
https://crash-stats.mozilla.com/search/?proto_signature=~WriteMinidumpWithExceptionForProcess&build_id=%3E%3D20180919003800&date=%3E%3D2018-09-01&_facets=signature&_facets=version&_facets=platform_pretty_version&_facets=useragent_locale&_facets=release_channel&_facets=cpu_arch#facet-signature
Blocks: 1463048
Crash Signature: [@ shutdownhang | NtReadVirtualMemory] → [@ shutdownhang | NtReadVirtualMemory] [@ shutdownhang | NtCreateFile] [@ shutdownhang | GenImageNtHeader] [@ shutdownhang | GenGetDebugRecord] [@ shutdownhang | NtQueryVirtualMemory] [@ shutdownhang | NtGetContextThread] [@ shutdownhang | ZwReadVir…
Has Regression Range: --- → yes
Flags: needinfo?(gsvelto)
Keywords: regression

Comment 8

7 months ago
this is accounting for 11.4% of browser crashes in 63.0 currently.
Assignee

Comment 9

7 months ago
I've opened a bunch of crashes from the various signatures and they all look like genuine shutdown crashes: what's happening here is that the shutdown hang detector is racing against the timer that force-kills content processes taking too long to shut down. First of all the reason why landing bug 1463048 have caused this to shoot up is that before that bug grabbing minidumps of hung processes during shutdown was borked, so instead of hitting this bug we were hitting that one. With bug 1463048 we now see the real impact of this issue.

So first of all here's the chain of events:

- We commence shutdown by notifying all listeners of the "quit-application" topic [1]

- The shutdown hang detector picks this up and fires its watchdog thread, from now on Firefox has 60s to shutdown, if it doesn't we'll hit a MOZ_CRASH() that kills it for good. Most of the crashes here hit [3] but some hit [4].

- After this initial phase we start closing the content processes via ContentProcess::ShutDownProcess(). This is usually called asynchronously as tabs are closed [5] and later synchronously for processes that still had not tab associated to them [6]. The asynchronous part means we usually start killing more than one process in a short amount of time.

- And here's the issue, when we shut down a process we fire a 5 seconds timer that will kill the process if it hasn't shut down by the time it triggers [7]. This will eventually try to take a minidump of the content process which we assume is hung [8]. What is happening is that we're probably experiencing a slow shutdown, this causes us to kill one content process which in turn causes us to grab a minidump. After the fix for bug 1463048 has landed minidumps are taken synchronously on the main thread so they slow down the shutdown process which most likely has an avalanche effect: more shutdown process timers expire (we fired them all together!) triggering more shutdown process kills and thus more minidumps until the shutdown watchdog kicks in and kills the main process.

As we increase the number of content processes this problem will become more severe so we need to figure out a better way to detect shutdown hangs without *causing* them ourselves. In the short term however increasing both the content shutdown kill timer and the shutdown watchgod timer should give us some breathing room.

[1] https://searchfox.org/mozilla-central/rev/5b3b6b8fd9f90087f618c20382e631451136ed2b/toolkit/components/startup/nsAppStartup.cpp#468
[2] https://searchfox.org/mozilla-central/rev/5b3b6b8fd9f90087f618c20382e631451136ed2b/toolkit/components/terminator/nsTerminator.cpp#508
[3] https://searchfox.org/mozilla-central/rev/5b3b6b8fd9f90087f618c20382e631451136ed2b/toolkit/components/terminator/nsTerminator.cpp#
[4] https://searchfox.org/mozilla-central/rev/5b3b6b8fd9f90087f618c20382e631451136ed2b/toolkit/components/terminator/nsTerminator.cpp#219
[5] https://searchfox.org/mozilla-central/rev/5b3b6b8fd9f90087f618c20382e631451136ed2b/dom/ipc/ContentParent.cpp#2051
[6] https://searchfox.org/mozilla-central/rev/5b3b6b8fd9f90087f618c20382e631451136ed2b/dom/ipc/ContentParent.cpp#1779
[7] https://searchfox.org/mozilla-central/rev/5b3b6b8fd9f90087f618c20382e631451136ed2b/dom/ipc/ContentParent.cpp#1485
[8] https://searchfox.org/mozilla-central/rev/5b3b6b8fd9f90087f618c20382e631451136ed2b/dom/ipc/ContentParent.cpp#3500
Flags: needinfo?(gsvelto)
Assignee

Comment 10

7 months ago
I've just noticed we have a second shutdown hang detector with similar logic but implemented in JavaScript:

https://searchfox.org/mozilla-central/rev/5b3b6b8fd9f90087f618c20382e631451136ed2b/toolkit/components/asyncshutdown/AsyncShutdown.jsm#930

Since it uses the same timeout as the code in nsTerminator.cpp but starts after it I guess it doesn't trigger often.
(In reply to Gabriele Svelto [:gsvelto] from comment #9)
> - And here's the issue, when we shut down a process we fire a 5 seconds
> timer that will kill the process if it hasn't shut down by the time it
> triggers [7]. This will eventually try to take a minidump of the content
> process which we assume is hung [8]. What is happening is that we're

It sounds like the most straightforward fix here would be to just disable this "content process is hung, so capture a minidump" logic during shutdown. If content processes don't shut down in time we could simply kill them and move on.
Flags: needinfo?(ted)
There's not many people who would take this, so given the crash rate, should we at least have an assignee (which is probably Ted or Gabriele)?
Flags: needinfo?(overholt)
Assignee

Comment 13

7 months ago
Taking this, I want to try a mitigation first then we'll see.
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Thank you, Gabriele!
Flags: needinfo?(overholt)
Assignee

Comment 16

7 months ago
This doubles the timeout for killing content processes during shutdown and increases the overall shutdown hang detection timeout by 50%. If this isn't sufficient to reduce the crash rate then we'll have to turn off grabbing minidumps during shutdown.

Comment 17

7 months ago
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/11edba6d53bb
Lengthen timeouts for detecting hung content processes during shutdown r=Ehsan
Assignee

Comment 18

7 months ago
Let's leave this open until we're sure we've reduced the crash rate.
Keywords: leave-open

Comment 19

7 months ago
can we uplift it to beta though to assess the full impact there there as soon as possbile? (there isn't such a clear crash pattern on nightly)
Flags: needinfo?(gsvelto)
Assignee

Comment 20

7 months ago
Comment on attachment 9021930 [details]
Bug 1498942 - Lengthen timeouts for detecting hung content processes during shutdown

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Bug 1463048

User impact if declined: Firefox gets stuck during shutdown and crashes after 10 seconds. This is not caused directly by the fix for bug 1463048 but that fix caused the issue to become widespread.

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): This patch only adjusts a preference to give more time for the content processes to shut down.

String changes made/needed: none
Flags: needinfo?(gsvelto)
Attachment #9021930 - Flags: approval-mozilla-beta?
Comment on attachment 9021930 [details]
Bug 1498942 - Lengthen timeouts for detecting hung content processes during shutdown

ok let's try this on beta, approved for 64.0b8
Attachment #9021930 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment 24

7 months ago
i don't see much of an improvement in the crash pattern in data for 64.0b8. should we look into the other avenue mentioned in comment #16?

in b7 the share of the crashes for this query compared to all browser crashes was 8.7%: 
https://crash-stats.mozilla.com/search/?proto_signature=~WriteMinidumpWithExceptionForProcess&version=64.0b7&process_type=browser&date=%3E%3D2018-09-01

in b8 it's 8.5% so far:
https://crash-stats.mozilla.com/search/?proto_signature=~WriteMinidumpWithExceptionForProcess&version=64.0b8&process_type=browser&date=%3E%3D2018-09-01
Flags: needinfo?(gsvelto)
Assignee

Comment 25

7 months ago
Yeah, the crash rate keeps ticking up. I'll follow Ted's suggestion in comment 11 to turn this functionality off for the time being. I'll file a bug to reintroduce this at a later point with a mechanism that does not cause this kind of avalanche effects.
Flags: needinfo?(gsvelto)
Assignee

Comment 26

6 months ago
This patch also takes the timeout for killing a content process back to its
original value and removes a related but long unused field.

Comment 27

6 months ago
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c1a4345c8103
Don't attempt to take minidumps of hung content processes when shutting down r=Ehsan

Comment 29

6 months ago
hi, could you request uplift to beta again please in order to see how the new approach performs there?
Flags: needinfo?(gsvelto)
Setting 64 back to affected as the first patch didn't decrease the volume of crashed in beta 8 and we have a second patch we may uplift.
Assignee

Comment 31

6 months ago
There are no more crashes on the nightly channel after the 20181114222927 build so the patch is working as expected.
Assignee

Comment 32

6 months ago
Comment on attachment 9024661 [details]
Bug 1498942 - Don't attempt to take minidumps of hung content processes when shutting down

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Bug 1463048

User impact if declined: Firefox gets stuck during shutdown and crashes after 60 seconds. This is not caused directly by the fix for bug 1463048 but that fix caused the issue to become widespread.

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): The patch only prevents minidumps from being generated during shutdown and rolls back the previous change which did not prove useful.

String changes made/needed: None
Flags: needinfo?(gsvelto)
Attachment #9024661 - Flags: approval-mozilla-beta?
Comment on attachment 9024661 [details]
Bug 1498942 - Don't attempt to take minidumps of hung content processes when shutting down

I guess the flip side is that makes it impossible to debug hung content processes?  Anyway, better than making shutdown itself hang.  Approved for 64.0b11
Attachment #9024661 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Assignee

Updated

6 months ago
Blocks: 1508265

Comment 35

6 months ago
the crash pattern looks fixed in beta with the latest patch landing in 64.0b11.
Status: ASSIGNED → RESOLVED
Last Resolved: 6 months ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Wontfix for 63 since 64 is in two weeks.
Resolving this bug seems to have resulted in a huge drop of content shutdown crash submissions (a good thing, if true!), see bug 1515664 for details.

Updated

5 months ago
See Also: → 1515664
Depends on: 1533842
You need to log in before you can comment on or make changes to this bug.