Crash in shutdownhang | NtCreateUserProcess
Categories
(External Software Affecting Firefox :: Other, defect)
Tracking
(firefox-esr60 wontfix, firefox55 wontfix, firefox56 wontfix, firefox57 wontfix, firefox66 wontfix, firefox67 wontfix, firefox68 wontfix, firefox69 fix-optional, firefox70 fix-optional)
Tracking | Status | |
---|---|---|
firefox-esr60 | --- | wontfix |
firefox55 | --- | wontfix |
firefox56 | --- | wontfix |
firefox57 | --- | wontfix |
firefox66 | --- | wontfix |
firefox67 | --- | wontfix |
firefox68 | --- | wontfix |
firefox69 | --- | fix-optional |
firefox70 | --- | fix-optional |
People
(Reporter: marco, Unassigned)
References
Details
(Keywords: crash, regression)
Crash Data
Attachments
(2 files)
This bug was filed from the Socorro interface and is report bp-ebce39e1-4fbc-402b-9034-182cc0170802. ============================================================= This is a fairly high-volume shutdown crash, happening while we are trying to execute a process from JS. Several users are mentioning updates.
Comment 1•7 years ago
|
||
There is nothing app update related in that stack and app update doesn't execute any processes from JS. Moving to Toolkit -> Startup & Profile system.
Comment 2•7 years ago
|
||
needinfo'ing Jim Mathies in case he has someone on his team to investigate.
Comment 3•7 years ago
|
||
Looks like this experienced a regression in 55. I don't have any resources available for this currently.
Reporter | ||
Comment 4•7 years ago
|
||
Could it be the pingsender? Are we executing it at shutdown? What else are we executing at shutdown, assuming it isn't an addon?
Comment 5•7 years ago
|
||
It's not the pingsender because that would have nsProcess::Run() in the stack, not RunAsync(): https://dxr.mozilla.org/mozilla-central/rev/ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/toolkit/components/telemetry/TelemetrySend.jsm# It could be the minidump-analyzer though: https://dxr.mozilla.org/mozilla-central/rev/ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/toolkit/components/crashes/CrashService.js#45 This would happen in the case where a content process crashed and the user shut down Firefox before the crash could have been fully processed. Maybe this has something to do with bug 1373958? Or maybe even bug 1362203? If it's the case then we might want to modify content crash handling so that it doesn't block on shutdown but rather defers the analysis to the following run.
Reporter | ||
Comment 6•7 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #5) > It's not the pingsender because that would have nsProcess::Run() in the > stack, not RunAsync(): > > https://dxr.mozilla.org/mozilla-central/rev/ > ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/toolkit/components/telemetry/ > TelemetrySend.jsm# > > It could be the minidump-analyzer though: > > https://dxr.mozilla.org/mozilla-central/rev/ > ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/toolkit/components/crashes/ > CrashService.js#45 > > This would happen in the case where a content process crashed and the user > shut down Firefox before the crash could have been fully processed. Maybe > this has something to do with bug 1373958? Or maybe even bug 1362203? If > it's the case then we might want to modify content crash handling so that it > doesn't block on shutdown but rather defers the analysis to the following > run. The signature from bug 1373958 and this bug have exactly the same evolution, so I guess they are related. Bug 1362203 started a few months before.
Comment 7•7 years ago
|
||
Something similar to the STR in bug 1315306 comment 12 may be happening here...
Comment 8•6 years ago
|
||
The crash rate here is ticking up quite a bit, many reports have AV software installed and I tend to think that might be part of the problem. NtCreateUserProcess is a call that gets hooked up by AV scanners to detect new applications being launched so we might get stuck because of that. That being said it might be our fault anyway and since it's hard to identify the culprit I'll add an annotation that records the last process being launched. It won't be perfect: if two processes are launched at the same time one will overwrite the other's annotation but it's better than nothing.
Comment 9•6 years ago
|
||
Comment 10•5 years ago
|
||
This has stalled because my patch seem to cause issues on certain tests and I haven't had time to investigate them yet. I'll try to land this today or tomorrow at the latest.
Comment 11•5 years ago
|
||
Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/48b2aa8dc68f Add a crash annotation containing the last executable we launched r=froydnj
Comment 12•5 years ago
|
||
Taking this and flagging it with leave-open since the patch is only for figuring out what's causing the problem.
Comment 13•5 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/48b2aa8dc68f
Comment 14•5 years ago
|
||
We've been a bit unlucky with this one. Since the patch landed we haven't had many crashes on nightly and the few that were reported were for previous builds.
Comment 15•5 years ago
|
||
I still haven't figured out what's going on here. The new annotation isn't helping much: I found it only in a couple of reports and the contents was "1" which should not be possible! It's odd but maybe I did something wrong.
Comment 16•5 years ago
|
||
I went back and sifted through the crashes again. First of all not many have the annotation set, which means we're not catching this scenario very often with it. The second thing that came up is that I've got the annotation wrong too; I should have realized it immediately but I didn't. Instead of passing a reference to an nsACString
I'm passing a const char*
which the compiler is promptly and quietly converting to a bool
, thanks C++!
To get out of the impasse I've downloaded a bunch of raw minidumps from Socorro and I will be inspecting them manually. There's a good chance that the executable name is buried in the stack so I should be able to find it.
Comment 17•5 years ago
|
||
I've manually downloaded a lot of raw dumps at random and inspected their stacks: most of them had the pingsender executable path on the main thread stack, the rest had firefox instead. So, we're getting stuck on trying to spawn pingsender or restarting firefox after an update. My guess is that this is happening because the kernel call is hooked by whatever AV software the user has installed and the AV tries to scan the (new) executable it sees.
This is not going to be easy to solve. For pingsender I wonder if it would be possible to launch the program early during shutdown, freeze it, assemble the ping, hand it over to pingsender and then resume its execution. I'll give it a spin ASAP.
Comment 18•5 years ago
|
||
A reminder that new data collections are subject to Data Collection Review: https://wiki.mozilla.org/Firefox/Data_Collection
Comment 19•5 years ago
|
||
...and apparently by "new" I mean "four months ago when the annotation was added" :S
Updated•5 years ago
|
Comment 20•5 years ago
|
||
(In reply to Chris H-C :chutten from comment #18)
A reminder that new data collections are subject to Data Collection Review: https://wiki.mozilla.org/Firefox/Data_Collection
Does that apply to annotations that don't go in the crash ping too? Because by default crash annotations will only go in crash reports explicitly submitted by users but will not be included in the ping.
BTW I was about to remove this one since we don't need it anymore :P.
Comment 21•5 years ago
|
||
Comment 22•5 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #20)
(In reply to Chris H-C :chutten from comment #18)
A reminder that new data collections are subject to Data Collection Review: https://wiki.mozilla.org/Firefox/Data_Collection
Does that apply to annotations that don't go in the crash ping too? Because by default crash annotations will only go in crash reports explicitly submitted by users but will not be included in the ping.
BTW I was about to remove this one since we don't need it anymore :P.
Excitingly, it does! The Data Collection Review applies to all new or changed data collections, not just Telemetry ones.
This would've been an easy one, but since you've already filed a patch to remove it I guess I'll just put a vague assurance here that the collection appears to have conformed to the standards we hold for Data Collections in Firefox. It was documented, opt-in, limited data to answer a very specific question, and it was only collected for a short time.
Comment 23•5 years ago
|
||
Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f0b7027aa979 Remove the ExecutableName annotation r=froydnj
Comment 24•5 years ago
|
||
(In reply to Chris H-C :chutten from comment #22)
Excitingly, it does! The Data Collection Review applies to all new or changed data collections, not just Telemetry ones.
Thanks, I'll keep it in mind.
This would've been an easy one, but since you've already filed a patch to remove it I guess I'll just put a vague assurance here that the collection appears to have conformed to the standards we hold for Data Collections in Firefox. It was documented, opt-in, limited data to answer a very specific question, and it was only collected for a short time.
Plus it was buggy so it always returned "1" instead of the executable name :)
Comment 25•5 years ago
|
||
bugherder |
Updated•5 years ago
|
Comment 26•5 years ago
|
||
Marking as FIXED since the patch landed on m-c in comment 25.
Updated•5 years ago
|
Comment 27•5 years ago
|
||
(In reply to Liz Henry (:lizzard) (use needinfo) from comment #26)
Marking as FIXED since the patch landed on m-c in comment 25.
That was a backout of instrumentation to look into the issue. Comment 17 has more details on what we think is going on.
Updated•5 years ago
|
Comment 28•5 years ago
|
||
Gabriel, is there an update on this one?
Comment 29•5 years ago
|
||
No, I haven't devised a way to solve this yet.
Updated•5 years ago
|
Comment 30•5 years ago
•
|
||
Quick update: I've been pulled left and right on other more pressing issues but I'm still planning a fix for this.
Comment 31•4 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:gsvelto, maybe it's time to close this bug?
Comment 33•3 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:gsvelto, maybe it's time to close this bug?
Comment 34•3 years ago
|
||
I opened a few more crashes to refresh my memory and the situation is the same as in comment 17. We're getting stuck while launching pingsender.exe or relaunching firefox.exe. In every case I can find an AV module loaded into memory so the kernel call has been hooked by the AV software and is being slowed down as it "scans" the new executable.
There is no real way to mitigate this as the time that might be spent by the AV scanner is effectively unbounded. In both cases we might tell the shutdown terminator to wait a little longer but there's no guarantee that it will be ever enough. I'm seriously tempted to close this as WONTFIX.
Comment 35•3 years ago
|
||
Unassigning myself because I don't think there's much more we can do here.
Comment 36•3 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:gcp, maybe it's time to close this bug?
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Description
•