Closed Bug 1541344 Opened 7 months ago Closed 3 months ago

Crash in [@ shutdownhang | mozilla::ipc::GeckoChildProcessHost::WaitUntilConnected]

Categories

(Core :: Plug-ins, defect, P2, critical)

68 Branch
All
Windows
defect

Tracking


RESOLVED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- fixed
firefox67 --- wontfix
firefox68 + wontfix
firefox69 + fixed
firefox70 + fixed

People

(Reporter: skywalker333, Assigned: baku)

References

Details

(4 keywords)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-fb87e717-efa3-44c1-961c-8cb480190402.

Top 10 frames of crashing thread:

0 ntdll.dll NtWaitForKeyedEvent 
1 ntdll.dll RtlSleepConditionVariableSRW 
2 kernel32.dll SleepConditionVariableSRW 
3 mozglue.dll mozilla::detail::ConditionVariableImpl::wait_for mozglue/misc/ConditionVariable_windows.cpp:78
4 xul.dll mozilla::CondVar::Wait xpcom/threads/CondVar.h:70
5 xul.dll mozilla::ipc::GeckoChildProcessHost::WaitUntilConnected ipc/glue/GeckoChildProcessHost.cpp:390
6 xul.dll mozilla::plugins::PluginProcessParent::WaitUntilConnected dom/plugins/ipc/PluginProcessParent.cpp:129
7 xul.dll mozilla::plugins::PluginModuleChromeParent::LoadModule dom/plugins/ipc/PluginModuleParent.cpp:487
8 xul.dll GetNewPluginLibrary dom/plugins/base/nsNPAPIPlugin.cpp:214
9 xul.dll nsNPAPIPlugin::CreatePlugin dom/plugins/base/nsNPAPIPlugin.cpp:230

Component: General → Plug-ins

The priority flag is not set for this bug.
:jimm, could you have a look please?

Flags: needinfo?(jmathies)
Flags: needinfo?(jmathies)
Priority: -- → P3

sorry for hijacking the bug - this signature is regressing in volume in firefox 68 (since nightly build 20190408193006) and is currently accounting for 1.5% of browser crashes in 68.0b; i'm going to mark it accordingly.

Keywords: regression, topcrash
Version: 45 Branch → 68 Branch

jimm - see the spike on beta; reprioritize?

Flags: needinfo?(jmathies)

Seems plugins are having issues getting launched.

Jed, could you check to see if this correlates with any of the async or other ipc process work? This could also be other things, like a flash change.

Flags: needinfo?(jld)

Using the crash graph widget to filter by release channel, we're looking for something that landed on Nightly around 2019-04-08 and merged into Beta with 68, and would affect NPAPI plugins. Bug 1541130 would qualify; it's also the last thing that landed in ipc/ before build 20190409095652, which is the first Nightly build where we see this. But it's a relatively simple refactoring, so I don't see how it would have caused this.

The other thing is: the IPC I/O thread and IPC Launch thread are idle, which means that the process has been launched and we're waiting for it to connect to the named pipe (this seems to be Windows-only) for the initial channel. If it freezes or crashes early in startup, we'll wait for the entire timeout.

Looking at other “core” modules, bug 1541792 also landed at that time and made nontrivial changes to XPCOM component registration. It's also the only other thing I can see that's in build 20190409095652 but not its predecessor 20190408193006 (in Hg, 98b223de0543 % 395a65d512e3) that looks capable of causing this kind of problem.

However: the plugin launch timeout defaults to 45 seconds, but the shutdown hang timeout defaults to 60 seconds. So this shouldn't be causing a shutdown hang in any case… but it is.
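The timeout relationship described above can be sketched as follows. This is an illustrative Python model, not Gecko code; the function name is invented for the example, and seconds are scaled down to milliseconds:

```python
import threading
import time

def timed_launch_wait(launch_timeout):
    """Wait on a condition that is never signalled, modeling
    WaitUntilConnected blocking on a child process that hangs at startup.
    Returns (signalled, elapsed_seconds)."""
    cv = threading.Condition()
    start = time.monotonic()
    with cv:
        # The child never connects, so this always runs out the clock.
        signalled = cv.wait(timeout=launch_timeout)
    return signalled, time.monotonic() - start

# 0.045 stands in for the 45 s plugin-launch timeout,
# 0.060 for the 60 s shutdown-hang (terminator) timeout.
signalled, elapsed = timed_launch_wait(0.045)
assert not signalled    # the wait gave up on its own...
assert elapsed < 0.060  # ...well before the terminator deadline
```

A single well-behaved 45-second wait always returns before the 60-second terminator fires, which is why the shutdown hang is puzzling: either the wait is not honoring its timeout, or it is being entered more than once.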

Flags: needinfo?(jld)

Top Crashers for Firefox 69.0a1 (Nightly) - 7 days ago

#5 1.05% 0.43% shutdownhang | mozilla::ipc::GeckoChildProcessHost::WaitUntilConnected 72 72 0 0 69 0 2017-09-27

Signature report for shutdownhang | mozilla::ipc::GeckoChildProcessHost::WaitUntilConnected
Showing results from 7 days ago

Operating System
Windows 10 474 82.0%
Windows 7 58 10.0%
Windows 8.1 29 5.0%
Windows Vista 15 2.6%
Windows 8 2 0.3%

Product
Firefox 69.0a1 72 12.5% 66
Firefox 68.0b9 32 5.5% 37
Firefox 68.0b8 147 25.4% 99
Firefox 68.0b7 119 20.6% 99
Firefox 68.0b6 29 5.0% 18
Firefox 68.0b5 12 2.1% 14
Firefox 68.0b4 3 0.5% 3
Firefox 68.0b3 1 0.2% 1
Firefox 68.0b1 1 0.2% 1
Firefox 67.0b3 1 0.2% 1
Firefox 67.0.2 7 1.2% 8
Firefox 67.0.1 27 4.7% 38
Firefox 67.0 8 1.4% 6
Firefox 66.0.5 6 1.0% 6
Firefox 66.0.2 2 0.3% 2
Firefox 62.0.2 1 0.2% 1
Firefox 62.0 1 0.2% 1
Firefox 60.7.0esr 22 3.8% 16
Firefox 60.6.3esr 7 1.2% 8
Firefox 60.6.2esr 1 0.2% 1
Firefox 60.6.1esr 2 0.3% 2
Firefox 60.5.2esr 48 8.3% 46
Firefox 60.4.0esr 2 0.3% 2
Firefox 60.2.2esr 1 0.2% 1
Firefox 60.2.0esr 1 0.2% 1
Firefox 60.0 1 0.2% 1
Firefox 58.0.2 1 0.2% 1
Firefox 52.9.0esr 18 3.1% 15
Firefox 52.6.0esr 1 0.2% 1
Firefox 52.5.0esr 1 0.2% 1
Firefox 52.3.0esr 3 0.5% 3

Architecture
amd64 399 69.0%
x86 179 31.0%

Hrm, I got confused. Per comment 2 and following, this is tracking a regression in 68.

Blocks: 1559850

Discussed during triage. We are averaging around 200 crashes per beta build (and this is barely visible on 67 release) - Jed is there anything else we can do here to investigate or improve the crash situation before we ship 68?

Flags: needinfo?(jld)

I can't see anything useful that I didn't already mention in comment #5. There's something going on here that's specific to NPAPI plugins, though, so maybe :jimm can take another look.

Flags: needinfo?(jld)
Flags: needinfo?(jmathies)
Priority: P3 → P2
Duplicate of this bug: 1559850

Still digging, but this might be a flash bug. They released a new rev on 4/9/2019.

(In reply to Jim Mathies [:jimm] from comment #14)

Still digging, but this might be a flash bug. They released a new rev on 4/9/2019.

Hmm, so may be flash related since the start of the crash ties back to a Flash release on 4/9/2019 for 32.0.0.172. However it doesn't impact 68, and impacts 69, so maybe some sort of incompatibility with changes we made. Which means we should be able to track this down on our end if we can figure out how.

FYI to Chris on this, I wonder if Adobe might have some input.

Flags: needinfo?(cpeterson)

(In reply to Jim Mathies [:jimm] from comment #15)

(In reply to Jim Mathies [:jimm] from comment #14)

Still digging, but this might be a flash bug. They released a new rev on 4/9/2019.

Hmm, so may be flash related since the start of the crash ties back to a Flash release on 4/9/2019 for 32.0.0.172. However it doesn't impact 68, and impacts 69, so maybe some sort of incompatibility with changes we made. Which means we should be able to track this down on our end if we can figure out how.

correction, Impacts 68, and not 67.

Hi Jeromie, we see a recent increase in Flash shutdown hangs on Windows. Firefox is waiting for the Flash plugin process to shut down and gives up (force crash) after a minute. The shutdown hangs might be related to some change in Firefox 68 (which will be released on 2019-07-09 next week).

Can an Adobe engineer try testing with Firefox 68 (Beta) for any shutdown problems apparent from the Flash plugin process's side? Thanks. Unfortunately, we don't have steps to reproduce the shutdown hangs, just the crash reports.

Flags: needinfo?(cpeterson) → needinfo?(jeclark)

This is Adobe's annual shutdown week. We won't be back in the office until Monday.

What you're describing sounds like the majority of the latent crashes that we see in Firefox. Without minidumps, I don't hold out a lot of hope for resolving this. We're always happy to arrange to send someone to your office to collaboratively debug, and IMHO, that's the only good path forward. We're also always happy to share symbols (and potentially relevant source) under the existing NDA, etc.

Firefox is a first-class citizen in our testing, so it's already getting the full battery of tests. These issues definitely persist, but infrequently enough that they're rare in lab conditions. That's always been one of the gating factors for progress on this front, and was one of the big initial drivers for establishing a public beta program years ago.

Flags: needinfo?(jeclark)

(In reply to Jeromie Clark from comment #19)

What you're describing sounds like the majority of the latent crashes that we see in Firefox. Without minidumps, I don't hold out a lot of hope for resolving this. We're always happy to arrange to send someone to your office to collaboratively debug, and IMHO, that's the only good path forward. We're also always happy to share symbols (and potentially relevant source) under the existing NDA, etc.

Thanks. Unfortunately, we don't have steps to reproduce. We just have Firefox hang (crash) reports with a stack trace pointing to the NPAPI loading code in Firefox.

Without minidumps, I don't hold out a lot of hope for resolving this.

@ Jim:

  1. Can we get minidumps from the Socorro crash reports? Would they include the plugin processes?

  2. Also, this bug is labeled a "shutdownhang", but the stack trace in comment 0 points to nsNPAPIPlugin::CreatePlugin. So the hang appears to be happening during plugin instantiation, not shutdown?

Flags: needinfo?(jmathies)

(In reply to Chris Peterson [:cpeterson] from comment #20)

(In reply to Jeromie Clark from comment #19)

What you're describing sounds like the majority of the latent crashes that we see in Firefox. Without minidumps, I don't hold out a lot of hope for resolving this. We're always happy to arrange to send someone to your office to collaboratively debug, and IMHO, that's the only good path forward. We're also always happy to share symbols (and potentially relevant source) under the existing NDA, etc.

Thanks. Unfortunately, we don't have steps to reproduce. We just have Firefox hang (crash) reports with a stack trace pointing to the NPAPI loading code in Firefox.

Without minidumps, I don't hold out a lot of hope for resolving this.

@ Jim:

  1. Can we get minidumps from the Socorro crash reports? Would they include the plugin processes?

I'll look, but I doubt it since it's a browser-side hang.

  2. Also, this bug is labeled a "shutdownhang", but the stack trace in comment 0 points to nsNPAPIPlugin::CreatePlugin. So the hang appears to be happening during plugin instantiation, not shutdown?

The plugin fails to connect, then some time later the browser shuts down and we get a hang on the plugin startup.

Flags: needinfo?(jmathies)

Cristian - Since this is fairly prevalent in 69 beta, can you or someone in SV try to reproduce this issue on Windows 10? (84.93% of crashes in this signature are on Windows 10, vs. 30.76% overall; platform_pretty_version = Windows 10.)

Flags: needinfo?(cristian.comorasu)

Hello Marcia,
I tried several builds (Fx 70.0a1, 69.0a1, 69.0b3, 68.0a1 and 68.0b12) with the links and details from the comment section, but to no avail.
Some of the crashes were reported on about:home/on standby, and others while trying to shut down the PC; none of these scenarios reproduced the crash on my station.

Flags: needinfo?(cristian.comorasu)

After the OOM signature, this is the top overall crash on 68 release.

Tracking for 68 since this is such a high volume crash.

Since this is a hang in the plugin process, we don't have a plugin version or a dump from the plugin process. Gabriel, wondering if you might have some ideas on how we could pull more information here, maybe through changes to the way we catch this.

Flags: needinfo?(gsvelto)

67 is impacted but to a lesser extent. Really feels like something we landed since we didn't see a jump in 67 (hence, not a flash issue).

could bug 1532948 play a part here? it landed in the timeframe where the volume of this began to rise in nightly, and the patch would tick both boxes, shutdown and plugins.

(In reply to [:philipp] from comment #28)

could bug 1532948 play a part here? it landed in the timeframe where the volume of this began to rise in nightly, and the patch would tick both boxes, shutdown and plugins.

Oh, that's interesting. Clearing plugin data, maybe? But the plugin process isn't running, so we try to fire it up during shutdown?

Flags: needinfo?(amarchesini)

Almost 10K in crashes on 68 release and this bug is very evident in 69 beta so far. Since we are already at 69.0b6, can we get answers to the needinfos? Thanks in advance.

Flags: needinfo?(gsvelto) → needinfo?(wleung)

I've glanced at some of the minidumps and there's an important bit that must be taken into account: we're crashing in a very specific spot of the shutdown terminator that gets hit only if shutdown commenced but never made any progress.

While it's true that the main thread is stuck waiting for the Flash plugin to come up in nsNPAPIPlugin::CreatePlugin() what's a few frames up in the stack is more interesting. We're attempting to launch the Flash plugin in nsPluginHost::ClearSiteData(). To my knowledge that method is only called from ClearDataService.jsm as part of the cleaner service.

So what seems to be happening is that we're trying to clear plugin data during shutdown (maybe in response to shutdown?) and to do so on the Flash plugin we need it to start in the first place. But it's not starting or it's taking too long. Since we're waiting for it to start on the main thread then shutdown cannot proceed (at all) and the terminator kicks in after a minute and kills Firefox.

ClearDataService.jsm contains a comment that's rather illuminating WRT this issue:

As evidenced in bug 1253204, clearing plugin data can sometimes be
very, very long, for mysterious reasons. Unfortunately, this is not
something actionable by Mozilla, so crashing here serves no purpose.
For this reason, instead of waiting for sanitization to always
complete, we introduce a soft timeout. Once this timeout has
elapsed, we proceed with the shutdown of Firefox.

So it seems we were aware of this problem and we put a workaround in place but either it's not working or the assumptions it was making are not valid anymore.
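The soft-timeout workaround the quoted comment describes can be sketched like this (illustrative Python with hypothetical names; the real ClearDataService.jsm logic is more involved). The pattern only works if the slow cleanup runs off the main thread, which appears to be the assumption being violated here:

```python
import threading
import time

def soft_timeout(work, timeout):
    """Run `work` on a helper thread and stop waiting after `timeout`
    seconds, so the caller can proceed even if `work` never finishes.
    Returns True if the work actually completed in time."""
    t = threading.Thread(target=work, daemon=True)
    t.start()
    t.join(timeout=timeout)
    return not t.is_alive()

def hung_plugin_cleanup():
    time.sleep(10)  # models a plugin launch that never connects

start = time.monotonic()
finished = soft_timeout(hung_plugin_cleanup, timeout=0.05)
elapsed = time.monotonic() - start

assert not finished   # the cleanup never completed...
assert elapsed < 1.0  # ...but the caller moved on instead of hanging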

I'm not familiar with the nsIPluginHost interface. Does it make sense to clean up data only if the plugin is active?
Or maybe just deleting data when nsIPluginHost::siteHasData() returns true would be enough.
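The guard being proposed might look like the following sketch (illustrative Python with invented names, not the actual nsPluginHost API): skip the cleanup entirely when the plugin was never loaded, so shutdown never has to launch a plugin process just to clear its data.

```python
class FakePluginHost:
    """Toy stand-in for the plugin host; names are hypothetical."""
    def __init__(self, loaded):
        self.loaded = loaded
        self.launches = 0

    def launch(self):
        self.launches += 1
        self.loaded = True

    def clear_site_data(self):
        if not self.loaded:
            self.launch()  # the problematic behavior: clearing forces a launch
        return "cleared"

def clear_if_loaded(host):
    """Proposed guard: only touch plugin data if the plugin is already running."""
    if host.loaded:
        return host.clear_site_data()
    return "skipped"

idle = FakePluginHost(loaded=False)
assert clear_if_loaded(idle) == "skipped"
assert idle.launches == 0  # no plugin launch at shutdown, so nothing to hang on
```

The trade-off is that data for a never-loaded plugin is left in place, but as noted below, a crash during shutdown doesn't clear it either.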

Flags: needinfo?(amarchesini) → needinfo?(gsvelto)

I don't know enough about how plugins work to be sure of that. We can certainly try and deploy a fix that only cleans the data if the plugin is running and see if it helps. After all, if we crash we're not cleaning the data either.

Flags: needinfo?(gsvelto)

While reviewing your patch I noticed that nsIPluginHost::SiteHasData() also tries to load the plugin if it's not running yet. So it's likely to get stuck in the same place during shutdown.

Would it be A Very Bad Thing™ to clear the data during shutdown only if the plugin has already been loaded? I also wonder if this is only happening during shutdown or if users are also getting freezes at runtime.

Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/225c129a3f3e
Delete plugin-data only for the sites with data,r=gsvelto
Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
Assignee: nobody → amarchesini

the patch seems to have worked well so far on nightly. could you please request an uplift to beta if it applies there as well and you deem fit to do so?
thank you

Flags: needinfo?(amarchesini)

Comment on attachment 9083030 [details]
Bug 1541344 - Delete plugin-data only for the sites with data,r?gsvelto

Beta/Release Uplift Approval Request

  • User impact if declined: A crash can occur during shutdown when cleaning plugin data, if the plugin process has not been started yet.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch checks if the plugin has been activated before starting the data cleanup. The active/inactive state is a boolean check, added in ClearDataService.jsm. Low risk.
  • String changes made/needed:
Flags: needinfo?(amarchesini)
Attachment #9083030 - Flags: approval-mozilla-beta?

Comment on attachment 9083030 [details]
Bug 1541344 - Delete plugin-data only for the sites with data,r?gsvelto

Will hopefully help with one of our top crashes at the moment. Approved for 69.0b15.

Attachment #9083030 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: needinfo?(wleung)

This is looking good on Nightly and Beta so far, should we uplift this to ESR68 also?

Flags: needinfo?(amarchesini)

Comment on attachment 9083030 [details]
Bug 1541344 - Delete plugin-data only for the sites with data,r?gsvelto

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: This patch fixes a major crash in nightly/beta by simply avoiding the cleanup of data for unloaded plugins.
  • User impact if declined:
  • Fix Landed on Version: 70
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): See the previous beta-approval comments
  • String or UUID changes made by this patch: none
Flags: needinfo?(amarchesini)
Attachment #9083030 - Flags: approval-mozilla-esr68?

Comment on attachment 9083030 [details]
Bug 1541344 - Delete plugin-data only for the sites with data,r?gsvelto

Fixes a frequent shutdown crash. Approved for 68.1esr.

Attachment #9083030 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+