1386290 - We seem to have no protection about a hung content process

Reporter

Description • 7 years ago

Debugging bug 1384658 yesterday made me realize this. We have the slow script dialog which works OK if we manage to run SpiderMonkey, but if we for example get stuck in a C++ infinite loop in a content process, it seems like that content process is forever gone and we have no mechanism of getting it back!

This seems like a mistake, but I'm not entirely sure that I'm not missing something. Bill, can you confirm my understanding here? If a web page for example calls into method foo() implemented in C++ and that function runs an infinite loop, how would we get our content process back? Is there any existing protection against this? It seems like at the very least we want something like the hung plugin detection here.

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 1 • 7 years ago

That's correct. There are two issues at play here:

1. If the hang has JS on the stack (i.e., if JS called into the hanging method foo), then we should get a JS hang report. However, pressing "Stop It" will not fix the hang. We used to have a separate button to kill the content process, but UX asked that it be removed for simplicity (bug 1119442).

2. If there's no JS on the stack when we hang, then no hang will be reported.

We've talked about trying to detect arbitrary hangs, and it would be pretty easy to do technically. My one concern is that users really hate the hang notification. No matter what we did with the JS hang reporter, we always seem to report hangs when the machine is just being slow or something. Adding another reporter would probably just worsen the situation. To do this "right", we would probably have to invest a fair amount of work into figuring out when to report hangs (doing a better job than the usual "wait 10 seconds" thing) and also recovering from them intelligently (trying to kill script first, and then killing the process if that doesn't work).
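
A minimal sketch of the two-stage recovery mentioned above (try to stop script first, kill the process only if the stall continues) might look like the following. Everything here is an assumption for illustration: `HangEscalator`, `HangAction`, and the two thresholds are hypothetical names, not Gecko code, and the fixed thresholds are exactly the naive "wait N seconds" policy that a real implementation would need to improve on.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical names throughout; this is not Gecko code.
enum class HangAction { None, InterruptScript, KillProcess };

class HangEscalator {
 public:
  using Ms = std::chrono::milliseconds;

  HangEscalator(Ms interruptAfter, Ms killAfter)
      : mInterruptAfter(interruptAfter), mKillAfter(killAfter) {
    // The cheap recovery step must come before the expensive one.
    assert(interruptAfter < killAfter);
  }

  // Called whenever the monitored main thread is seen making progress.
  void NoteProgress(Ms now) {
    mLastProgress = now;
    mInterruptSent = false;
  }

  // Called periodically by a watchdog; returns the action to take now.
  HangAction Check(Ms now) {
    const Ms stalled = now - mLastProgress;
    if (stalled >= mKillAfter) {
      // Interrupting script did not help (or there was no script to stop).
      return HangAction::KillProcess;
    }
    if (stalled >= mInterruptAfter && !mInterruptSent) {
      mInterruptSent = true;  // Ask only once per stall.
      return HangAction::InterruptScript;
    }
    return HangAction::None;
  }

 private:
  Ms mInterruptAfter;
  Ms mKillAfter;
  Ms mLastProgress{0};
  bool mInterruptSent = false;
};
```

Timestamps are passed in rather than read from a clock, so the policy can be exercised without real waiting. Whether a stall should count as a hang at all (as opposed to the machine just being slow) is the hard part and is not addressed by this sketch.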

Flags: needinfo?(wmccloskey)

(no longer active)

Reporter

Comment 2 • 7 years ago

I just wanted to also add that we have no real data about this happening to our users AFAIK, since no crash reports or anything like that are sent in this state...

Naveed Ihsanullah [:naveed]

Comment 3 • 7 years ago

Jim, this might be an FF57 blocker issue. The wedged content process does not report its state in any way and cannot be killed easily.

Flags: needinfo?(jmathies)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 4 • 7 years ago

(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog, Away Aug 7-8) from comment #2)
> I just wanted to also add that we have no real data about this happening to
> our users AFAIK, since no crash reports or anything like that are sent in
> this state...

I believe BHR is supposed to report these issues. They had a very fancy system for handling "permahangs". I don't know the details though.

(no longer active)

Reporter

Comment 5 • 7 years ago

(In reply to Bill McCloskey (:billm) from comment #4)
> (In reply to :Ehsan Akhgari (needinfo please, extremely long backlog, Away
> Aug 7-8) from comment #2)
> > I just wanted to also add that we have no real data about this happening to
> > our users AFAIK, since no crash reports or anything like that are sent in
> > this state...
>
> I believe BHR is supposed to report these issues. They had a very fancy
> system for handling "permahangs". I don't know the details though.

Yeah BHR reports them, and now we have a way to look at the data but only as of recently. My point was that this may have been an issue affecting people in the wild for a long time in the time BHR was down and nobody was looking at it and we may not have realized instances of it were happening. :-/

Jean Gong :jgong

Comment 6 • 7 years ago

[Tracking Requested - why for this release]: The QF team triaged this bug and feels it's an e10s issue that should be looked at. It may have a serious impact on users' browser performance as part of the Quantum effort.

status-firefox57: --- → ?

tracking-firefox57: --- → ?

Whiteboard: [qf]

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 7 • 7 years ago

(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog, Away Aug 7-8) from comment #5)
> Yeah BHR reports them, and now we have a way to look at the data but only as
> of recently. My point was that this may have been an issue affecting people
> in the wild for a long time in the time BHR was down and nobody was looking
> at it and we may not have realized instances of it were happening. :-/

BHR was one of the most important release criteria for e10s. We looked at it a lot in aggregate. If we hadn't been looking at BHR, e10s would have been released at least a month sooner. It's true that BHR didn't collect native stacks, and it's also true that the dashboard was broken and eventually disabled, but we still looked closely at changes in total number of hangs between e10s and non-e10s. Since then we may have stopped watching as carefully, but that can happen to any data we collect.

Jim Mathies [:jimm]

Comment 8 • 7 years ago

(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog, Away Aug 7-8) from comment #0)
> Debugging bug 1384658 yesterday made me realize this. We have the slow
> script dialog which works OK if we manage to run SpiderMonkey, but if we for
> example get stuck in a C++ infinite loop in a content process, it seems like
> that content process is forever gone and we have no mechanism of getting it
> back!
>
> This seems like a mistake, but I'm not entirely sure that I'm not missing
> something. Bill can you confirm my understanding here? If a web page for
> example calls into method foo() implemented in C++ and that function runs
> an infinite loop, how would we get our content process back? Is there any
> existing protection against this?

The IPDL timeout mechanism should pick this up and trigger a hard kill on the content process. That would in turn generate a crashed tab and a crash report that the user could submit. We don't have a method of recovering child processes that are deadlocked in C++ AFAIA.

Flags: needinfo?(jmathies)

Jim Mathies [:jimm]

Comment 9 • 7 years ago

(In reply to Jim Mathies [:jimm] from comment #8)
> (In reply to :Ehsan Akhgari (needinfo please, extremely long backlog, Away
> The IPDL timeout mechanism should pick this up and trigger a kill hard on
> the content process. That would in turn generate a crashed tab / crash
> report and that user could submit.

Hmm, in practice though I don't see this happening. At least for PBrowser, every message we send over is async now.
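
To illustrate why all-async messaging would keep a reply timeout from ever firing, here is a toy model. It assumes, as this comment implies, that the timeout is armed only while the parent waits for the reply to a synchronous message. `ReplyTimeoutTracker` and `MessageKind` are hypothetical names invented for this sketch and do not correspond to the real IPDL implementation.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical model; not the actual IPDL/IPC channel code.
enum class MessageKind { Async, Sync };

class ReplyTimeoutTracker {
 public:
  using Ms = std::chrono::milliseconds;

  explicit ReplyTimeoutTracker(Ms timeout) : mTimeout(timeout) {
    assert(timeout > Ms(0));
  }

  // Record a message sent to the child. Only a sync message arms the
  // deadline, because only a sync message has a reply to wait for.
  void OnSend(MessageKind kind, Ms now) {
    if (kind == MessageKind::Sync && !mAwaitingReply) {
      mAwaitingReply = true;
      mDeadline = now + mTimeout;
    }
  }

  // The child answered the outstanding sync message.
  void OnReply() { mAwaitingReply = false; }

  // True once a sync reply is overdue; async traffic never gets here.
  bool ShouldKill(Ms now) const { return mAwaitingReply && now >= mDeadline; }

 private:
  Ms mTimeout;
  Ms mDeadline{0};
  bool mAwaitingReply = false;
};
```

In this model a parent that only ever calls `OnSend(MessageKind::Async, ...)` never sees `ShouldKill` return true, no matter how long the child is wedged, which matches the observation about PBrowser above.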

Jim Mathies [:jimm]

Updated • 7 years ago

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → DUPLICATE

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 11 • 7 years ago

Moved tracking nom from here to bug 1374353.

status-firefox57: ? → ---

tracking-firefox57: ? → ---

Categories

(Core :: DOM: Content Processes, enhancement)

People

(Reporter: ehsan.akhgari, Unassigned)
