Crash in shutdownhang | NtCreateUserProcess

REOPENED
Assigned to

Status

()

defect
--
critical
REOPENED
2 years ago
4 days ago

People

(Reporter: marco, Assigned: gsvelto)

Tracking

({crash, leave-open, regression})

unspecified
x86
Windows 8
Points:
---

Firefox Tracking Flags

(firefox-esr60 wontfix, firefox55 wontfix, firefox56 wontfix, firefox57 wontfix, firefox66 wontfix, firefox67 wontfix, firefox68 wontfix, firefox69 affected)

Details

(crash signature)

Attachments

(2 attachments)

Reporter

Description

2 years ago
This bug was filed from the Socorro interface and is 
report bp-ebce39e1-4fbc-402b-9034-182cc0170802.
=============================================================

This is a fairly high-volume shutdown crash, happening while we are trying to execute a process from JS.
Several users are mentioning updates.
There is nothing app update related in that stack and app update doesn't execute any processes from JS. Moving to Toolkit -> Startup & Profile system.
Component: Application Update → Startup and Profile System
needinfo'ing Jim Mathies in case he has someone on his team to investigate.
Flags: needinfo?(jmathies)
Looks like this experienced a regression in 55.

I don't have any resources available for this currently.
Flags: needinfo?(jmathies)
Reporter

Comment 4

2 years ago
Could it be the pingsender? Are we executing it at shutdown?
What else are we executing at shutdown, assuming it isn't an addon?
Flags: needinfo?(gsvelto)
Assignee

Comment 5

2 years ago
It's not the pingsender because that would have nsProcess::Run() in the stack, not RunAsync():

https://dxr.mozilla.org/mozilla-central/rev/ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/toolkit/components/telemetry/TelemetrySend.jsm#

It could be the minidump-analyzer though:

https://dxr.mozilla.org/mozilla-central/rev/ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/toolkit/components/crashes/CrashService.js#45

This would happen in the case where a content process crashed and the user shut down Firefox before the crash could have been fully processed. Maybe this has something to do with bug 1373958? Or maybe even bug 1362203? If it's the case then we might want to modify content crash handling so that it doesn't block on shutdown but rather defers the analysis to the following run.
Flags: needinfo?(gsvelto)
Reporter

Comment 6

2 years ago
(In reply to Gabriele Svelto [:gsvelto] from comment #5)
> It's not the pingsender because that would have nsProcess::Run() in the
> stack, not RunAsync():
> 
> https://dxr.mozilla.org/mozilla-central/rev/
> ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/toolkit/components/telemetry/
> TelemetrySend.jsm#
> 
> It could be the minidump-analyzer though:
> 
> https://dxr.mozilla.org/mozilla-central/rev/
> ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/toolkit/components/crashes/
> CrashService.js#45
> 
> This would happen in the case where a content process crashed and the user
> shut down Firefox before the crash could have been fully processed. Maybe
> this has something to do with bug 1373958? Or maybe even bug 1362203? If
> it's the case then we might want to modify content crash handling so that it
> doesn't block on shutdown but rather defers the analysis to the following
> run.

The signature from bug 1373958 and this bug have exactly the same evolution, so I guess they are related.
Bug 1362203 started a few months before.
See Also: → 1373958

Comment 7

2 years ago
Something similar to the STR in bug 1315306 comment 12 may be happening here...
Assignee

Updated

2 years ago
See Also: → 1400294
Assignee

Comment 8

7 months ago
The crash rate here is ticking up quite a bit, many reports have AV software installed and I tend to think that might be part of the problem. NtCreateUserProcess is a call that gets hooked up by AV scanners to detect new applications being launched so we might get stuck because of that.

That being said it might be our fault anyway and since it's hard to identify the culprit I'll add an annotation that records the last process being launched. It won't be perfect: if two processes are launched at the same time one will overwrite the other's annotation but it's better than nothing.
Assignee

Comment 10

6 months ago
This has stalled because my patch seem to cause issues on certain tests and I haven't had time to investigate them yet. I'll try to land this today or tomorrow at the latest.

Comment 11

6 months ago
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/48b2aa8dc68f
Add a crash annotation containing the last executable we launched r=froydnj
Assignee

Comment 12

6 months ago
Taking this and flagging it with leave-open since the patch is only for figuring out what's causing the problem.
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Keywords: leave-open
Assignee

Comment 14

5 months ago

We've been a bit unlucky with this one. Since the patch landed we haven't had many crashes on nightly and the few that were reported were for previous builds.

Assignee

Comment 15

5 months ago

I still haven't figured out what's going on here. The new annotation isn't helping much: I found it only in a couple of reports and the contents was "1" which should not be possible! It's odd but maybe I did something wrong.

Assignee

Comment 16

3 months ago

I went back and sifted through the crashes again. First of all not many have the annotation set, which means we're not catching this scenario very often with it. The second thing that came up is that I've got the annotation wrong too; I should have realized it immediately but I didn't. Instead of passing a reference to an nsACString I'm passing a const char* which the compiler is promptly and quietly converting to a bool, thanks C++!

To get out of the impasse I've downloaded a bunch of raw minidumps from Socorro and I will be inspecting them manually. There's a good chance that the executable name is buried in the stack so I should be able to find it.

Assignee

Comment 17

2 months ago

I've manually downloaded a lot of raw dumps at random and inspected their stacks: most of them had the pingsender executable path on the main thread stack, the rest had firefox instead. So, we're getting stuck on trying to spawn pingsender or restarting firefox after an update. My guess is that this is happening because the kernel call is hooked by whatever AV software the user has installed and the AV tries to scan the (new) executable it sees.

This is not going to be easy to solve. For pingsender I wonder if it would be possible to launch the program early during shutdown, freeze it, assemble the ping, hand it over to pingsender and then resume its execution. I'll give it a spin ASAP.

A reminder that new data collections are subject to Data Collection Review: https://wiki.mozilla.org/Firefox/Data_Collection

...and apparently by "new" I mean "four months ago when the annotation was added" :S

Assignee

Comment 20

2 months ago

(In reply to Chris H-C :chutten from comment #18)

A reminder that new data collections are subject to Data Collection Review: https://wiki.mozilla.org/Firefox/Data_Collection

Does that apply to annotations that don't go in the crash ping too? Because by default crash annotations will only go in crash reports explicitly submitted by users but will not be included in the ping.

BTW I was about to remove this one since we don't need it anymore :P.

(In reply to Gabriele Svelto [:gsvelto] from comment #20)

(In reply to Chris H-C :chutten from comment #18)

A reminder that new data collections are subject to Data Collection Review: https://wiki.mozilla.org/Firefox/Data_Collection

Does that apply to annotations that don't go in the crash ping too? Because by default crash annotations will only go in crash reports explicitly submitted by users but will not be included in the ping.

BTW I was about to remove this one since we don't need it anymore :P.

Excitingly, it does! The Data Collection Review applies to all new or changed data collections, not just Telemetry ones.

This would've been an easy one, but since you've already filed a patch to remove it I guess I'll just put a vague assurance here that the collection appears to have conformed to the standards we hold for Data Collections in Firefox. It was documented, opt-in, limited data to answer a very specific question, and it was only collected for a short time.

Comment 23

2 months ago
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f0b7027aa979
Remove the ExecutableName annotation r=froydnj
Assignee

Comment 24

2 months ago

(In reply to Chris H-C :chutten from comment #22)

Excitingly, it does! The Data Collection Review applies to all new or changed data collections, not just Telemetry ones.

Thanks, I'll keep it in mind.

This would've been an easy one, but since you've already filed a patch to remove it I guess I'll just put a vague assurance here that the collection appears to have conformed to the standards we hold for Data Collections in Firefox. It was documented, opt-in, limited data to answer a very specific question, and it was only collected for a short time.

Plus it was buggy so it always returned "1" instead of the executable name :)

Marking as FIXED since the patch landed on m-c in comment 25.

Status: ASSIGNED → RESOLVED
Closed: Last month
Resolution: --- → FIXED

(In reply to Liz Henry (:lizzard) (use needinfo) from comment #26)

Marking as FIXED since the patch landed on m-c in comment 25.

That was a backout of instrumentation to look into the issue. Comment 17 has more details on what we think is going on.

Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
You need to log in before you can comment on or make changes to this bug.