1270172 - Record child process stacks when parent process is killed with a running child process

Reporter

Description

•

9 years ago

When the watchdog kills the parent process for example for not shutting down in time or in some other way freezing up. Currently we don't get stacks for the child process, since on shutdown the parent process will actually block on the child process to shutdown, all the useful information is contained in the child process stacks and the parent process stack is actually a red herring.

Ryan VanderMeulen [:RyanVM]

Comment 1

•

9 years ago

Not really sure if Testing::mozbase is the right component or not for this (because I'm not sure if this a harness issue or something in Gecko itself), but this is actively making life difficult for developers to diagnose e10s topcrashes currently affecting us.

tracking-e10s: --- → ?

Ryan VanderMeulen [:RyanVM]

Comment 2

•

9 years ago

i.e. crashes like: https://treeherder.mozilla.org/logviewer.html#?repo=ash&job_id=173506#L2797

(not currently active) Ted Mielczarek

Comment 3

•

9 years ago

I think this needs to get fixed in the platform, if we're talking about intentional shutdown hang crashes like in the log in comment 2 (Bas mentions some other vague things in comment 0, they might need separate bugs if they're also issues). We have this watchdog thread for shutdown that just does a MOZ_CRASH if things take too long: https://dxr.mozilla.org/mozilla-central/rev/0a25833062a880f369e6f9f622413a94cc671bf4/toolkit/components/terminator/nsTerminator.cpp#157 If shutdown is taking too long because a content process is taking too long, we won't get a stack out of it because it'll just terminate itself when its connection to the parent process goes away. We'd need to do something smarter there, which might be hard because it's in the middle of shutdown and some services we would need to do things like write minidumps for the content process might be shutdown. We could possibly add an environment variable or preference to make the watchdog thread just hang forever instead of crashing, in which case the existing "get a stack from a hung process" code would kick in, which I believe nowadays handles getting stacks from child processes as well. You might want to ask some e10s folks about this.

Component: Mozbase → Async Tooling

Product: Testing → Toolkit

Version: Version 3 → unspecified

Jim Mathies [:jimm]

Updated

•

9 years ago

tracking-e10s: ? → +

Priority: -- → P1

Jim Mathies [:jimm]

Comment 4

•

9 years ago

Bas, I didn't block on this but I could change my mind if it's something we really need. Ryan mentioned that this was something you needed for debugging top crashes. Can you provide additional background?

Flags: needinfo?(bas)

Bas Schouten (:bas.schouten)

Reporter

Comment 5

•

9 years ago

(In reply to Jim Mathies [:jimm] from comment #4) > Bas, I didn't block on this but I could change my mind if it's something we > really need. Ryan mentioned that this was something you needed for debugging > top crashes. Can you provide additional background? Well, basically right now we can see some of the top crashes are in CompositorBridgeParent::Shutdown(), that might look like it's a Gfx issue, and it may be, but it certainly doesn't have to be, if the child process never reaches its shutdown function and closes down it's bridges to the parent (i.e. if the child is frozen somewhere outside of graphics), this hang in the parent will still occur. Without having a stack trace of where the child is hanging, we essentially have absolutely no useful information in our shutdown hang reports when the shutdown hang is caused by the child being frozen. Whether we should block on this I don't know, I somewhat doubt it, but I feel it's important tools to have as we gather more crash data.

Flags: needinfo?(bas)

Milan Sreckovic [:milan] (needinfo for best results)

Updated

•

9 years ago

Blocks: 1268559

Milan Sreckovic [:milan] (needinfo for best results)

Comment 6

•

9 years ago

Will this help? "Starting today, there is a change to crash signatures for a subset of e10s crashes. If the parent Firefox process detects that the child process has sent broken or unprocessable IPDL data, or is not shutting down in a timely manner, it kills the child process with a crash report. These crashes will now have a signature that indicates why the process was killed, rather than the child stack at the moment. The new signatures will look like "IPCError-browser | (msgtype=0x7,name=???) Route error: message sent to unknown actor ID" We intend to reprocess this back to 27-April for better e10s analysis."

Benjamin Smedberg

Comment 7

•

9 years ago

My understanding is that this bug is about our automation and has almost nothing to do with the Firefox crashreporter or crash-stats.

Jim Mathies [:jimm]

Comment 8

•

9 years ago

(In reply to Benjamin Smedberg [:bsmedberg] from comment #7) > My understanding is that this bug is about our automation and has almost > nothing to do with the Firefox crashreporter or crash-stats. I was thinking it had to do with shutdownkill reports which I guess don't collect matching child reports.

Jim Mathies [:jimm]

Updated

•

9 years ago

tracking-e10s: + → ?

Jim Mathies [:jimm]

Comment 9

•

9 years ago

(In reply to Jim Mathies [:jimm] from comment #8) > (In reply to Benjamin Smedberg [:bsmedberg] from comment #7) > > My understanding is that this bug is about our automation and has almost > > nothing to do with the Firefox crashreporter or crash-stats. > > I was thinking it had to do with shutdownkill reports which I guess don't > collect matching child reports. It applies to test hangs as well, per Ted's comment 3.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 10

•

9 years ago

(In reply to Jim Mathies [:jimm] from comment #8) > I was thinking it had to do with shutdownkill reports which I guess don't > collect matching child reports. I presume you mean shutdownhang reports rather than ShutDownKill reports. (ShutDownKill reports are paired reports, with the primary one being the child process.) Judging from the dependency on bug 1268559 also suggests it's about shutdownhang reports. On the other hand, maybe (if I'm understanding things correctly) an alternative (though less general) solution here would be to have a mechanism in CompositorBridgeParent like the one that generates the ShutDownKill reports, where ContentParent::StartForceKillTimer times out waiting for the child process and kills the child before nsTerminator decides that the parent process has taken too long to shut down and kills it.

Jim Mathies [:jimm]

Comment 11

•

9 years ago

(In reply to David Baron [:dbaron] ⌚️UTC-7 (review requests must explain patch) from comment #10) > (In reply to Jim Mathies [:jimm] from comment #8) > > I was thinking it had to do with shutdownkill reports which I guess don't > > collect matching child reports. > > I presume you mean shutdownhang reports rather than ShutDownKill reports. > (ShutDownKill reports are paired reports, with the primary one being the > child process.) No I was thinking of ShutDownKill. Your right though those are paired. Bas mentioned CompositorBridgeParent::Shutdown, I find nine reports with that in the signature. https://crash-stats.mozilla.com/signature/?product=Firefox&signature=shutdownhang+%7C+__psynch_cvwait+%7C+PR_WaitCondVar+%7C+%3Cname+omitted%3E+%7C+nsThread%3A%3AProcessNextEvent+%7C+NS_ProcessNextEvent+%7C+mozilla%3A%3Alayers%3A%3ACompositorBridgeParent%3A%3AShutDown I'll leave it to Bas to clarify but for now we're not blocking on this.

tracking-e10s: ? → +

Flags: needinfo?(bas)

Priority: P1 → P3

Bas Schouten (:bas.schouten)

Reporter

Comment 12

•

9 years ago

(In reply to Milan Sreckovic [:milan] from comment #6) > Will this help? > > "Starting today, there is a change to crash signatures for a subset of e10s > crashes. If the parent Firefox process detects that the child process has > sent broken or unprocessable IPDL data, or is not shutting down in a timely > manner, it kills the child process with a crash report. These crashes will > now have a signature that indicates why the process was killed, rather than > the child stack at the moment. > > The new signatures will look like "IPCError-browser | > (msgtype=0x7,name=???) Route error: message sent to unknown actor ID" > > We intend to reprocess this back to 27-April for better e10s analysis." This sounds like it will help. But it depends on whether it also applied to the child process not shutting down in a timely manner -on shutdown-. In any case, this bug is really about both tests and crash reports, it would be great if the same happens for Shutdown kills in the test harness. I'm sorry about the confusion this may have caused, the two seemed to be tightly related (i.e. if a shutdownkill happens and we generate a crash report for the child process I assume that will also just show as a crash report in the test harness) but maybe not as much as I thought.

Flags: needinfo?(bas)

Bas Schouten (:bas.schouten)

Reporter

Updated

•

9 years ago

Blocks: 1264694

Jim Mathies [:jimm]

Comment 13

•

9 years ago

Ok so to summarize, when we kill the parent through the hang monitor thread, we would like to pair that with the content process report *if* the content process is the reason for the shutdown wait.

tracking-e10s: + → ?

Priority: P3 → P1

Jim Mathies [:jimm]

Updated

•

9 years ago

tracking-e10s: ? → +

Flags: needinfo?(mconley)

Jim Mathies [:jimm]

Comment 14

•

9 years ago

Raised during triage today, open question: if these are browser shutdown hangs related to hung content processes, why doesn't the parent side content process shutdownkill timer solve the problem?

Bas Schouten (:bas.schouten)

Reporter

Comment 15

•

9 years ago

(In reply to Jim Mathies [:jimm] from comment #14) > Raised during triage today, open question: if these are browser shutdown > hangs related to hung content processes, why doesn't the parent side content > process shutdownkill timer solve the problem? I don't think I suggested it did? I don't know enough about this to say whether it does :-). Simply as a hack because I had to debug something I added something which simply kills the child process if it doesn't shutdown within 30 seconds, which was 30 seconds less than the parent process gets. And that made the parent process not die (since it will now continue and shutdown successfully), and brings in useful stuff for the crash report. That was a hack for debugging something in the test harness, but never-the-less extremely useful in digging into the cause of the issue.

Mike Conley (:mconley) (:⚙️)

Comment 16

•

9 years ago

Hey Bas, I was hoping to look into this today, but I think I need a better understanding of what's going on. Tell me if I have this right: 1) The parent is asked to be shut down 2) The parent starts to shut down, and asks its content process(es) to shut down 3) The parent is blocked waiting for the content processes to shut down 4) One of our hang monitors (HangMonitor? BHR? Watchdog? Other?) notices that we're waiting too long, and takes the action of killing the parent process to speed things along. 5) When (4) happens, the content process doesn't have a stack taken, which isn't great, since that's where the real problem is here - the slow shutdown of the content process. Is that roughly correct? If so, can you point me towards the component that's killing the parent, because it's not clear to me which component that is... the JS Watchdog?

Flags: needinfo?(mconley) → needinfo?(bas)

Jim Mathies [:jimm]

Comment 17

•

9 years ago

https://bugzilla.mozilla.org/show_bug.cgi?id=1268559#c60 This comment in the parent bug explains what I think is happening here. Getting a child stack won't be possible from the parent side if the patch in bug 1268559 lands. We'll have to find a way to trigger a child reports when we hit the child timeout.

Bas Schouten (:bas.schouten)

Reporter

Comment 18

•

9 years ago

(In reply to Mike Conley (:mconley) - (needinfo me!) from comment #16) > Hey Bas, > > I was hoping to look into this today, but I think I need a better > understanding of what's going on. Tell me if I have this right: > > 1) The parent is asked to be shut down > 2) The parent starts to shut down, and asks its content process(es) to shut > down > 3) The parent is blocked waiting for the content processes to shut down > 4) One of our hang monitors (HangMonitor? BHR? Watchdog? Other?) notices > that we're waiting too long, and takes the action of killing the parent > process to speed things along. > 5) When (4) happens, the content process doesn't have a stack taken, which > isn't great, since that's where the real problem is here - the slow shutdown > of the content process. > > Is that roughly correct? If so, can you point me towards the component > that's killing the parent, because it's not clear to me which component that > is... the JS Watchdog? It's the shutdown kill timer, at least for this particular stack. As Jimm illustrated, there's a possibility this might happen at run time with a child process being killed, but I'm not sure how severe the consequences of that would be to a parent.

Flags: needinfo?(bas)

(not currently active) Ted Mielczarek

Comment 19

•

9 years ago

(In reply to Mike Conley (:mconley) - (needinfo me!) from comment #19) > (In reply to Ted Mielczarek [:ted.mielczarek] from comment #3) > > If shutdown is taking too long because a content process is taking too long, > > we won't get a stack out of it because it'll just terminate itself when its > > connection to the parent process goes away. We'd need to do something > > smarter there, which might be hard because it's in the middle of shutdown > > and some services we would need to do things like write minidumps for the > > content process might be shutdown. > > Naive question - could we change ContentChild so that if it detects the > parent going away unexpectedly, it dumps a crash report before exiting? I don't think we can do that and get a useful result. The content process is wired up to simply send a message to the chrome process when it crashes, and the chrome process writes the minidump. If we write a minidump from the content process directly it'll be missing a lot of the metadata it gets from the chrome process, and I don't think it will be reported properly. > > We could possibly add an environment variable or preference to make the > > watchdog thread just hang forever instead of crashing, in which case the > > existing "get a stack from a hung process" code would kick in, which I > > believe nowadays handles getting stacks from child processes as well. > > Where is the "get a stack from a hung process" code? Could we somehow call > into that before crashing? I'm thinking of the XPCOM HangMonitor: https://dxr.mozilla.org/mozilla-central/rev/5511d54a3f172c1d68f98cc55dce4de1d0ba1b51/xpcom/threads/HangMonitor.cpp#236 ...but that doesn't seem to collect child process stacks, so I guess I was wrong. (It just calls `NS_RUNTIMEABORT`.)

Flags: needinfo?(ted)

Mike Conley (:mconley) (:⚙️)

Comment 20

•

9 years ago

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #3) > If shutdown is taking too long because a content process is taking too long, > we won't get a stack out of it because it'll just terminate itself when its > connection to the parent process goes away. We'd need to do something > smarter there, which might be hard because it's in the middle of shutdown > and some services we would need to do things like write minidumps for the > content process might be shutdown. Naive question - could we change ContentChild so that if it detects the parent going away unexpectedly, it dumps a crash report before exiting? > > We could possibly add an environment variable or preference to make the > watchdog thread just hang forever instead of crashing, in which case the > existing "get a stack from a hung process" code would kick in, which I > believe nowadays handles getting stacks from child processes as well. Where is the "get a stack from a hung process" code? Could we somehow call into that before crashing?

Flags: needinfo?(ted)

BugBot [:suhaib / :marco/ :calixte]

Comment 21

•

6 years ago

Moving to p2 because no activity for at least 24 weeks.
See How Do You Triage for more information

Priority: P1 → P2

BMO Automation

Updated

•

2 years ago

Severity: normal → S3