Crash in shutdownhang | ZwCreateUserProcess

RESOLVED FIXED in Firefox 57

Status

()

--
critical
RESOLVED FIXED
a year ago
a year ago

People

(Reporter: BesTo, Assigned: gsvelto)

Tracking

({crash, topcrash, topcrash-win})

45 Branch
mozilla57
All
Windows
crash, topcrash, topcrash-win
Points:
---

Firefox Tracking Flags

(thunderbird_esr45 wontfix, thunderbird_esr52 affected, firefox-esr52 wontfix, firefox54 wontfix, firefox55- wontfix, firefox56- wontfix, firefox57 fixed)

Details

(crash signature, URL)

Attachments

(1 attachment)

(Reporter)

Description

a year ago
This bug was filed from the Socorro interface and is 
report bp-785f93a1-9a25-4268-b0c8-55cbd0170617.
=============================================================

FF56.0a1, 64bit, E10s on Win7:
https://crash-stats.mozilla.com/report/index/785f93a1-9a25-4268-b0c8-55cbd0170617


https://crash-stats.mozilla.com/signature/?product=Firefox&signature=shutdownhang%20%7C%20ZwCreateUserProcess
The date range is from 7 days ago to now.

Operating System:
Windows 7 	23 	88.5%
Windows 8 	1 	3.8%
Windows 8.1 	1 	3.8%
Windows Vista 	1 	3.8%

Product:
Firefox 	56.0a1 		9 	30.0% 	6
Firefox 	55.0b1 		6 	20.0% 	6
Firefox 	55.0a1 		4 	13.3% 	5
Firefox 	53.0.3 		2 	6.7% 	2
Firefox 	47.0 		1 	3.3% 	1
Firefox 	50.0 		1 	3.3% 	1
Firefox 	52.0.2esr 	1 	3.3% 	1
Firefox 	54.0b99 	1 	3.3% 	1
Firefox 	55.0 		1 	3.3% 	1
Thunderbird 	52.1.1 		2 	6.7% 	2
Thunderbird 	45.5.1 		1 	3.3% 	1
Thunderbird 	45.8.0 		1 	3.3% 	1

Architecture:
amd64 	14 	53.8%
x86 	12 	46.2%


[Tracking Requested - why for this release]: topcrash-win, topcrash
(Reporter)

Updated

a year ago
No longer blocks: 1219672

Comment 1

a year ago
This looks like it could be the telemetry pingsender, so gsvelto heads-up. marco can I get some DLL correlations for this? First hunch would be this was related to an antivirus or similar software.
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(gsvelto)
(Assignee)

Comment 2

a year ago
These all look like shutdown hangs but the pingsender should not block shutdown since it's launched asynchronously. I'll try to reproduce this locally tomorrow by making the pingsender take more than 60 seconds to run. Leaving the NI for now.
There are very few reports, so the results are not great, but the DLLs that show 100% of the time look like Windows DLLs.
I will check again tomorrow if there are more reports.

Comment 4

a year ago
Gabriele you'll see from the stack that the Windows API CreateProcessW is top of stack: so I don't think we're blocking on the new process finishing, but we might somehow be blocked on it *starting*.
I only see a few crash reports with this signature. How is it a topcrash? 
We could still take a patch for 55 but I don't think release management needs to track this bug.
status-firefox54: affected → fix-optional
status-firefox55: affected → fix-optional
tracking-firefox55: ? → -
tracking-firefox56: ? → -
(Reporter)

Comment 6

a year ago
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #5)
> I only see a few crash reports with this signature. How is it a topcrash? 

It was in: https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=56.0a1&days=7


Last 7 days:
Windows 7 	54 	96.4%
Windows 8 	2 	3.6%
From: https://crash-stats.mozilla.com/signature/?product=Firefox&signature=shutdownhang%20%7C%20ZwCreateUserProcess
Still no DLL standing out.
Flags: needinfo?(mcastelluccio)
(Assignee)

Comment 8

a year ago
I've done a bit of research and on paper CreateProcess() and friends should never block... but apparently it's taking a long time (>60 seconds). So I've done a bit of research and as it turns out ZwCreateUserProcess is one of the kernel functions that anti-viruses usually hook to scan executables being launched. Some crash reports do have an AV add-on but others don't though I cannot tell if the system is using an AV w/o a specific Firefox add-on. One thing that stands out is that they all seem to come from older machines. I've looked at a dozen reports and most of them had either 1 or 2 GiB of memory.

Based on that I'd speculate that this is the AV being very slow (on older HD-based machines with little RAM) when encountering the pingsender executable for the first time.
Flags: needinfo?(gsvelto)
(Reporter)

Comment 9

a year ago
(In reply to Gabriele Svelto [:gsvelto] from comment #8)
> Based on that I'd speculate that this is the AV being very slow (on older
> HD-based machines with little RAM) when encountering the pingsender
> executable for the first time.

I was thinking the pingsender executable is something new... and the crashes goes back till FF40...
(Assignee)

Comment 10

a year ago
(In reply to Tobias B. Besemer [:BesTo] (QA) from comment #9)
> I was thinking the pingsender executable is something new... and the crashes
> goes back till FF40...

There's an explanation for those too: Firefox may also launch other programs on shutdown like the updater. We might add an annotation with the program(s) being launched to shed some light here.
(Reporter)

Comment 11

a year ago
Maybe I can shed some light on FF56.0a1 ... Plz read bug 1219672 comment 68.

Updated

a year ago
See Also: → bug 1372650

Comment 12

a year ago
(In reply to Gabriele Svelto [:gsvelto] from comment #10)
> There's an explanation for those too: Firefox may also launch other programs
> on shutdown like the updater. We might add an annotation with the program(s)
> being launched to shed some light here.

So does this mean that if firefox is hung or crashing, that while it is trying to shut itself down it is trying to generate a crash report with minidump while ALSO trying to update itself? Could this cause problems?
See Also: → bug 1386760
According to bug 1386760 comment 1, the updater doesn't start processes from JS, so it can't be the updater here.
Also, the stack in bug 1386760 didn't have anything app update related.
Liz, this is now #14 top crasher and the signature from bug 1386760, which is probably the same thing, is #4 at https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=55.0b&days=7.

Should we track 56?
(Assignee)

Comment 16

a year ago
OK, so to recap what's going on:

- We originally had bug 1362203, which was caused by bug 1293656. The reason was that we put a shutdown blocker in place when doing minidump analysis for a crash, the rationale for it is in bug 1293656 comment 41. We assumed that minidump analysis would be fast and wouldn't be an issue during shutdown, as it turned out we were wrong.

- I rewrote that code in bug 1359326, got rid of the shutdown blocker we created in C++ but kept the one we already had in JS code. Hanging on that blocker seems to be the cause for both this bug and bug 1386760.

There's two actionable things here: we can safely close bug 1362203 as it shouldn't appear again and I'll write a patch so that we don't block on minidump analysis during shutdown. I'll see if it would be possible to defer the analysis until the next reboot but if it isn't we'll just have to accept we can't get stack traces for every possible crash.
(Assignee)

Updated

a year ago
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Comment hidden (mozreview-request)
(Assignee)

Comment 18

a year ago
I've done two changes to the affected code: first of all I moved all the operations that can be potentially long outside of the shutdown blocker. We now block on shutdown just for sending the crash ping which was the original behavior. I've also moved error-handling for the various step into the respective functions. This allows the procedure to go ahead even if one step fails (e.g. minidump analysis). In such case we'll still get a crash ping, just with less information.

Comment 19

a year ago
mozreview-review
Comment on attachment 8893586 [details]
Bug 1373958 - Don't block shutdown while analyzing crashes; .mielczarek

https://reviewboard.mozilla.org/r/164542/#review172822
Attachment #8893586 - Flags: review?(ted) → review+

Comment 20

a year ago
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/78404bf21684
Don't block shutdown while analyzing crashes; r=ted.mielczarek
https://hg.mozilla.org/mozilla-central/rev/78404bf21684
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
status-firefox57: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Given the analysis in comment 16, I'm not sure how we're seeing this signature in Fx <53, but the volume is low enough that I'm OK with calling them wontfix anyway (will leave TB52 for the folks who manage that release). We should nominate this for Beta approval when we're feeling comfortable doing so, though.
status-firefox54: fix-optional → wontfix
status-firefox55: fix-optional → wontfix
status-firefox-esr52: affected → wontfix
status-thunderbird_esr45: affected → wontfix
Still seeing recently Nightly crash reports with this signature :(
Flags: needinfo?(gsvelto)
(Assignee)

Comment 24

a year ago
This is unexpected, I'll have a look at Socorro to see what might be amiss. Leaving the NI for now.
(Assignee)

Comment 25

a year ago
So I went through the crash reports of versions that should include this patch and here's what I found:

https://crash-stats.mozilla.com/signature/?build_id=%3E%3D20170812000000&version=57.0a1&signature=shutdownhang%20%7C%20ZwCreateUserProcess&date=%3E%3D2017-08-10T08%3A12%3A00.000Z&date=%3C2017-08-25T08%3A12%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1

- Not all of them are related to crash handling, if mozilla::ipc::CrashReporterHost::NotifyCrashService() isn't on the stack then the crash is unrelated to this

- A few of them are from very, very slow machines with little memory which are probably swapping like crazy

- Quite a few of them seem to come from the same machines

- From what I can tell almost all of the valid crashes have an antivirus installed, a DLL injected by Avast in particular shows up on a lot of stacks

- There is a handful of crashes that come from machines that shouldn't be particularly slow and don't have an antivirus installed, those should not be crashing this way

The latter require some further investigation as those crashes really shouldn't be happening and for the others too they shouldn't have a blocker in place. Right now we have a blocker only for sending the crash ping so I wonder if the nsProcess machinery isn't causing a hang in and by itself (i.e. it's blocking shutdown on its own). I'll check and report back.
Flags: needinfo?(gsvelto)
(Assignee)

Comment 26

a year ago
OK, I've double-checked the code and I think I know what's going on: the first thing I was missing is that the shutdown watchdog triggers an assertion if shutdown takes too long, irrespective of the shutdown blockers. Simply put, it looks like we start a 60 seconds timer when shutting down, and if the process hasn't died by then we crash. There's no way to fully solve this issue, but it can be mitigated:

- I could prevent content crash analysis from running if shutdown has already initiated

- I could kill the minidump analyzer if it started before shutdown but isn't done yet when the shutdown procedure begins

However, this wouldn't fix the issue entirely because there's still the possibility that crash analysis begins *before* shutdown starts and we get stuck in ShellExecuteExW() because of an AV like we see in many of those signatures. Until ShellExecuteExW() we have no way of killing the process so we're basically stuck with it and if it takes too long it will trigger a shutdown crash.

Still I'll file another bug for putting in place the changes I've described above; hopefully we can reduce the number of these crashes further.
(Assignee)

Updated

a year ago
Depends on: 1393716
Would it help mitigate the crash rate (really, the user experience) for 56 if we uplift this fix to beta? Or should we be waiting for whatever you are going to land in bug 1393716?
Flags: needinfo?(gsvelto)
(Assignee)

Comment 28

a year ago
I don't know, looking at this graph it didn't seem to help:

https://crash-stats.mozilla.com/signature/?build_id=%3E%3D20170812000000&version=57.0a1&signature=shutdownhang%20%7C%20ZwCreateUserProcess&date=%3E%3D2017-05-29T19%3A42%3A24.000Z&date=%3C2017-08-29T19%3A42%3A24.000Z#graphs

I thought the problem was the blocker but it isn't because Firefox is stuck in a way that triggers the watchdog instead and the patch here doesn't help with that. I need to write a patch for 1393716 and hope that will reduce the crash rate significantly.
Flags: needinfo?(gsvelto)
status-firefox56: affected → wontfix
You need to log in before you can comment on or make changes to this bug.