Closed Bug 1386760 Opened 7 years ago Closed 3 years ago

Crash in shutdownhang | NtCreateUserProcess

Categories

(External Software Affecting Firefox :: Other, defect)

x86
Windows 8
defect
Not set
critical

Tracking

(firefox-esr60 wontfix, firefox55 wontfix, firefox56 wontfix, firefox57 wontfix, firefox66 wontfix, firefox67 wontfix, firefox68 wontfix, firefox69 fix-optional, firefox70 fix-optional)

RESOLVED INACTIVE
Tracking Status
firefox-esr60 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- fix-optional
firefox70 --- fix-optional

People

(Reporter: marco, Unassigned)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-ebce39e1-4fbc-402b-9034-182cc0170802.
=============================================================

This is a fairly high-volume shutdown crash, happening while we are trying to execute a process from JS.
Several users are mentioning updates.
There is nothing app update related in that stack and app update doesn't execute any processes from JS. Moving to Toolkit -> Startup & Profile system.
Component: Application Update → Startup and Profile System
needinfo'ing Jim Mathies in case he has someone on his team to investigate.
Flags: needinfo?(jmathies)
Looks like this experienced a regression in 55.

I don't have any resources available for this currently.
Flags: needinfo?(jmathies)
Could it be the pingsender? Are we executing it at shutdown?
What else are we executing at shutdown, assuming it isn't an addon?
Flags: needinfo?(gsvelto)
It's not the pingsender because that would have nsProcess::Run() in the stack, not RunAsync():

https://dxr.mozilla.org/mozilla-central/rev/ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/toolkit/components/telemetry/TelemetrySend.jsm#

It could be the minidump-analyzer though:

https://dxr.mozilla.org/mozilla-central/rev/ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/toolkit/components/crashes/CrashService.js#45

This would happen in the case where a content process crashed and the user shut down Firefox before the crash could have been fully processed. Maybe this has something to do with bug 1373958? Or maybe even bug 1362203? If it's the case then we might want to modify content crash handling so that it doesn't block on shutdown but rather defers the analysis to the following run.
Flags: needinfo?(gsvelto)
(In reply to Gabriele Svelto [:gsvelto] from comment #5)
> It's not the pingsender because that would have nsProcess::Run() in the
> stack, not RunAsync():
> 
> https://dxr.mozilla.org/mozilla-central/rev/
> ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/toolkit/components/telemetry/
> TelemetrySend.jsm#
> 
> It could be the minidump-analyzer though:
> 
> https://dxr.mozilla.org/mozilla-central/rev/
> ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/toolkit/components/crashes/
> CrashService.js#45
> 
> This would happen in the case where a content process crashed and the user
> shut down Firefox before the crash could have been fully processed. Maybe
> this has something to do with bug 1373958? Or maybe even bug 1362203? If
> it's the case then we might want to modify content crash handling so that it
> doesn't block on shutdown but rather defers the analysis to the following
> run.

The signature from bug 1373958 and this bug have exactly the same evolution, so I guess they are related.
Bug 1362203 started a few months before.
See Also: → 1373958
Something similar to the STR in bug 1315306 comment 12 may be happening here...
See Also: → 1400294
The crash rate here is ticking up quite a bit, many reports have AV software installed and I tend to think that might be part of the problem. NtCreateUserProcess is a call that gets hooked up by AV scanners to detect new applications being launched so we might get stuck because of that.

That being said it might be our fault anyway and since it's hard to identify the culprit I'll add an annotation that records the last process being launched. It won't be perfect: if two processes are launched at the same time one will overwrite the other's annotation but it's better than nothing.
This has stalled because my patch seem to cause issues on certain tests and I haven't had time to investigate them yet. I'll try to land this today or tomorrow at the latest.
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/48b2aa8dc68f
Add a crash annotation containing the last executable we launched r=froydnj
Taking this and flagging it with leave-open since the patch is only for figuring out what's causing the problem.
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Keywords: leave-open

We've been a bit unlucky with this one. Since the patch landed we haven't had many crashes on nightly and the few that were reported were for previous builds.

I still haven't figured out what's going on here. The new annotation isn't helping much: I found it only in a couple of reports and the contents was "1" which should not be possible! It's odd but maybe I did something wrong.

I went back and sifted through the crashes again. First of all not many have the annotation set, which means we're not catching this scenario very often with it. The second thing that came up is that I've got the annotation wrong too; I should have realized it immediately but I didn't. Instead of passing a reference to an nsACString I'm passing a const char* which the compiler is promptly and quietly converting to a bool, thanks C++!

To get out of the impasse I've downloaded a bunch of raw minidumps from Socorro and I will be inspecting them manually. There's a good chance that the executable name is buried in the stack so I should be able to find it.

I've manually downloaded a lot of raw dumps at random and inspected their stacks: most of them had the pingsender executable path on the main thread stack, the rest had firefox instead. So, we're getting stuck on trying to spawn pingsender or restarting firefox after an update. My guess is that this is happening because the kernel call is hooked by whatever AV software the user has installed and the AV tries to scan the (new) executable it sees.

This is not going to be easy to solve. For pingsender I wonder if it would be possible to launch the program early during shutdown, freeze it, assemble the ping, hand it over to pingsender and then resume its execution. I'll give it a spin ASAP.

A reminder that new data collections are subject to Data Collection Review: https://wiki.mozilla.org/Firefox/Data_Collection

...and apparently by "new" I mean "four months ago when the annotation was added" :S

(In reply to Chris H-C :chutten from comment #18)

A reminder that new data collections are subject to Data Collection Review: https://wiki.mozilla.org/Firefox/Data_Collection

Does that apply to annotations that don't go in the crash ping too? Because by default crash annotations will only go in crash reports explicitly submitted by users but will not be included in the ping.

BTW I was about to remove this one since we don't need it anymore :P.

(In reply to Gabriele Svelto [:gsvelto] from comment #20)

(In reply to Chris H-C :chutten from comment #18)

A reminder that new data collections are subject to Data Collection Review: https://wiki.mozilla.org/Firefox/Data_Collection

Does that apply to annotations that don't go in the crash ping too? Because by default crash annotations will only go in crash reports explicitly submitted by users but will not be included in the ping.

BTW I was about to remove this one since we don't need it anymore :P.

Excitingly, it does! The Data Collection Review applies to all new or changed data collections, not just Telemetry ones.

This would've been an easy one, but since you've already filed a patch to remove it I guess I'll just put a vague assurance here that the collection appears to have conformed to the standards we hold for Data Collections in Firefox. It was documented, opt-in, limited data to answer a very specific question, and it was only collected for a short time.

Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f0b7027aa979
Remove the ExecutableName annotation r=froydnj

(In reply to Chris H-C :chutten from comment #22)

Excitingly, it does! The Data Collection Review applies to all new or changed data collections, not just Telemetry ones.

Thanks, I'll keep it in mind.

This would've been an easy one, but since you've already filed a patch to remove it I guess I'll just put a vague assurance here that the collection appears to have conformed to the standards we hold for Data Collections in Firefox. It was documented, opt-in, limited data to answer a very specific question, and it was only collected for a short time.

Plus it was buggy so it always returned "1" instead of the executable name :)

Marking as FIXED since the patch landed on m-c in comment 25.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

(In reply to Liz Henry (:lizzard) (use needinfo) from comment #26)

Marking as FIXED since the patch landed on m-c in comment 25.

That was a backout of instrumentation to look into the issue. Comment 17 has more details on what we think is going on.

Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---

Gabriel, is there an update on this one?

No, I haven't devised a way to solve this yet.

Component: Startup and Profile System → Other
Product: Toolkit → External Software Affecting Firefox

Quick update: I've been pulled left and right on other more pressing issues but I'm still planning a fix for this.

The leave-open keyword is there and there is no activity for 6 months.
:gsvelto, maybe it's time to close this bug?

Flags: needinfo?(gsvelto)

This is sadly still an issue :-(

Flags: needinfo?(gsvelto)

The leave-open keyword is there and there is no activity for 6 months.
:gsvelto, maybe it's time to close this bug?

Flags: needinfo?(gsvelto)

I opened a few more crashes to refresh my memory and the situation is the same as in comment 17. We're getting stuck while launching pingsender.exe or relaunching firefox.exe. In every case I can find an AV module loaded into memory so the kernel call has been hooked by the AV software and is being slowed down as it "scans" the new executable.

There is no real way to mitigate this as the time that might be spent by the AV scanner is effectively unbounded. In both cases we might tell the shutdown terminator to wait a little longer but there's no guarantee that it will be ever enough. I'm seriously tempted to close this as WONTFIX.

Flags: needinfo?(gsvelto)

Unassigning myself because I don't think there's much more we can do here.

Assignee: gsvelto → nobody

The leave-open keyword is there and there is no activity for 6 months.
:gcp, maybe it's time to close this bug?

Flags: needinfo?(gpascutto)
Flags: needinfo?(gpascutto)
Status: REOPENED → RESOLVED
Closed: 5 years ago3 years ago
Resolution: --- → INACTIVE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: