Closed Bug 1637048 Opened 4 years ago Closed 4 years ago

Killing a content process makes the parent process unresponsive, making everything worse

Categories

(Toolkit :: Crash Reporting, defect)

defect

Tracking

()

RESOLVED FIXED
mozilla79
Tracking Status
firefox79 --- fixed

People

(Reporter: mstange, Assigned: gsvelto)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

When the system is generally slow, and we decide to kill a content process because it seems hung, we make the situation worse by dumping a crash on the main thread of the parent process, hanging the entire browser.

This was encountered in bug 1623609 comment 4, bug 1626789 comment 7, and also in an unaired Joy of Profiling episode when we were analyzing the following profile of Firefox startup on the 2017 reference device: https://perfht.ml/2L3uUHD

Apparently the rewrite of the crash reporting system will fix this, according to bug 1623609 comment 7:

(In reply to Gabriele Svelto [:gsvelto] from bug 1623609 comment #7)

We tried moving this OMT but it regressed badly so we gave up on that. The slowness here might be due to bug 1622316 for which I have a fix for waiting for review. That being said all this code is moving completely out-of-process in bug 1588530 which I'm currently working on.

Component: IPC → DOM: Content Processes

Note that this applies only to nightly. Crash generation for hung/slow content process shutdown is disabled in beta/release.

Oh, that is great to hear. I didn't know that.

Component: DOM: Content Processes → Crash Reporting
Product: Core → Toolkit

I'd like to try and bump the timeout significantly, to 30 or possible even 60 seconds. This should shake out the times when content processes are just being slow at shutting down (or don't get a chance because the scheduler doesn't wake them up in time). Hopefully it would leave only the actual hard deadlocks sending us crash dumps.

But there's the downside that now some processes might really take a while to shut down before being killed. Chris, do you think this might be problematic?

Flags: needinfo?(cpeterson)

(In reply to Gabriele Svelto [:gsvelto] from comment #4)

I'd like to try and bump the timeout significantly, to 30 or possible even 60 seconds. This should shake out the times when content processes are just being slow at shutting down (or don't get a chance because the scheduler doesn't wake them up in time). Hopefully it would leave only the actual hard deadlocks sending us crash dumps.

But there's the downside that now some processes might really take a while to shut down before being killed. Chris, do you think this might be problematic?

Nika, this is a question for you. ^

Flags: needinfo?(cpeterson) → needinfo?(nika)

I'm a bit worried about there being a worse experience for our users if a hung content process is kept around for longer than it needs to be, but I don't have a fundamental reason why I'd oppose the change other than that.

Redirecting to jld in case they know more about this.

Flags: needinfo?(nika) → needinfo?(jld)

The number of ShutDownKill signatures I see on nightly is going up a lot. That means content process shutdown is happening more often - something we expect - but not getting any faster. If we don't increase the timeout and we don't improve the shutdown procedure the only other option we have is reverting bug 1543113. That will cause all ShutDownKill signatures to be lumped together again so we'll see just the overall volume.

I don't have any insight here besides what's already been written. I don't have any problems with increasing the timeout; 5 seconds does seem low, thinking about older hardware with several processes going through shutdown at once and maybe swapping.

Flags: needinfo?(jld)

Alright, taking this. Let's bump up the timeout and see if there's any improvement.

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2e5d8bebf8c7
Significantly increase the time we wait before killing a content process that hasn't finished shutting down r=jld

The severity field is not set for this bug.
:gsvelto, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(gsvelto)
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla79
Severity: -- → S3
Flags: needinfo?(gsvelto)
No longer depends on: 1588530
See Also: → 1588530
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: