Open
Bug 1270172
Opened 9 years ago
Updated 2 years ago
Record child process stacks when parent process is killed with a running child process
Categories
(Toolkit :: Async Tooling, defect, P2)
Toolkit
Async Tooling
Tracking
()
NEW
Tracking | Status | |
---|---|---|
e10s | + | --- |
People
(Reporter: bas.schouten, Unassigned)
References
Details
When the watchdog kills the parent process for example for not shutting down in time or in some other way freezing up. Currently we don't get stacks for the child process, since on shutdown the parent process will actually block on the child process to shutdown, all the useful information is contained in the child process stacks and the parent process stack is actually a red herring.
Comment 1•9 years ago
|
||
Not really sure if Testing::mozbase is the right component or not for this (because I'm not sure if this a harness issue or something in Gecko itself), but this is actively making life difficult for developers to diagnose e10s topcrashes currently affecting us.
tracking-e10s:
--- → ?
Comment 2•9 years ago
|
||
i.e. crashes like:
https://treeherder.mozilla.org/logviewer.html#?repo=ash&job_id=173506#L2797
Comment 3•9 years ago
|
||
I think this needs to get fixed in the platform, if we're talking about intentional shutdown hang crashes like in the log in comment 2 (Bas mentions some other vague things in comment 0, they might need separate bugs if they're also issues). We have this watchdog thread for shutdown that just does a MOZ_CRASH if things take too long:
https://dxr.mozilla.org/mozilla-central/rev/0a25833062a880f369e6f9f622413a94cc671bf4/toolkit/components/terminator/nsTerminator.cpp#157
If shutdown is taking too long because a content process is taking too long, we won't get a stack out of it because it'll just terminate itself when its connection to the parent process goes away. We'd need to do something smarter there, which might be hard because it's in the middle of shutdown and some services we would need to do things like write minidumps for the content process might be shutdown.
We could possibly add an environment variable or preference to make the watchdog thread just hang forever instead of crashing, in which case the existing "get a stack from a hung process" code would kick in, which I believe nowadays handles getting stacks from child processes as well.
You might want to ask some e10s folks about this.
Component: Mozbase → Async Tooling
Product: Testing → Toolkit
Version: Version 3 → unspecified
![]() |
||
Updated•9 years ago
|
Priority: -- → P1
![]() |
||
Comment 4•9 years ago
|
||
Bas, I didn't block on this but I could change my mind if it's something we really need. Ryan mentioned that this was something you needed for debugging top crashes. Can you provide additional background?
Flags: needinfo?(bas)
Reporter | ||
Comment 5•9 years ago
|
||
(In reply to Jim Mathies [:jimm] from comment #4)
> Bas, I didn't block on this but I could change my mind if it's something we
> really need. Ryan mentioned that this was something you needed for debugging
> top crashes. Can you provide additional background?
Well, basically right now we can see some of the top crashes are in CompositorBridgeParent::Shutdown(), that might look like it's a Gfx issue, and it may be, but it certainly doesn't have to be, if the child process never reaches its shutdown function and closes down it's bridges to the parent (i.e. if the child is frozen somewhere outside of graphics), this hang in the parent will still occur. Without having a stack trace of where the child is hanging, we essentially have absolutely no useful information in our shutdown hang reports when the shutdown hang is caused by the child being frozen.
Whether we should block on this I don't know, I somewhat doubt it, but I feel it's important tools to have as we gather more crash data.
Flags: needinfo?(bas)
Will this help?
"Starting today, there is a change to crash signatures for a subset of e10s
crashes. If the parent Firefox process detects that the child process has
sent broken or unprocessable IPDL data, or is not shutting down in a timely
manner, it kills the child process with a crash report. These crashes will
now have a signature that indicates why the process was killed, rather than
the child stack at the moment.
The new signatures will look like "IPCError-browser |
(msgtype=0x7,name=???) Route error: message sent to unknown actor ID"
We intend to reprocess this back to 27-April for better e10s analysis."
Comment 7•9 years ago
|
||
My understanding is that this bug is about our automation and has almost nothing to do with the Firefox crashreporter or crash-stats.
![]() |
||
Comment 8•9 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #7)
> My understanding is that this bug is about our automation and has almost
> nothing to do with the Firefox crashreporter or crash-stats.
I was thinking it had to do with shutdownkill reports which I guess don't collect matching child reports.
![]() |
||
Updated•9 years ago
|
![]() |
||
Comment 9•9 years ago
|
||
(In reply to Jim Mathies [:jimm] from comment #8)
> (In reply to Benjamin Smedberg [:bsmedberg] from comment #7)
> > My understanding is that this bug is about our automation and has almost
> > nothing to do with the Firefox crashreporter or crash-stats.
>
> I was thinking it had to do with shutdownkill reports which I guess don't
> collect matching child reports.
It applies to test hangs as well, per Ted's comment 3.
(In reply to Jim Mathies [:jimm] from comment #8)
> I was thinking it had to do with shutdownkill reports which I guess don't
> collect matching child reports.
I presume you mean shutdownhang reports rather than ShutDownKill reports. (ShutDownKill reports are paired reports, with the primary one being the child process.)
Judging from the dependency on bug 1268559 also suggests it's about shutdownhang reports.
On the other hand, maybe (if I'm understanding things correctly) an alternative (though less general) solution here would be to have a mechanism in CompositorBridgeParent like the one that generates the ShutDownKill reports, where ContentParent::StartForceKillTimer times out waiting for the child process and kills the child before nsTerminator decides that the parent process has taken too long to shut down and kills it.
![]() |
||
Comment 11•9 years ago
|
||
(In reply to David Baron [:dbaron] ⌚️UTC-7 (review requests must explain patch) from comment #10)
> (In reply to Jim Mathies [:jimm] from comment #8)
> > I was thinking it had to do with shutdownkill reports which I guess don't
> > collect matching child reports.
>
> I presume you mean shutdownhang reports rather than ShutDownKill reports.
> (ShutDownKill reports are paired reports, with the primary one being the
> child process.)
No I was thinking of ShutDownKill. Your right though those are paired.
Bas mentioned CompositorBridgeParent::Shutdown, I find nine reports with that in the signature.
https://crash-stats.mozilla.com/signature/?product=Firefox&signature=shutdownhang+%7C+__psynch_cvwait+%7C+PR_WaitCondVar+%7C+%3Cname+omitted%3E+%7C+nsThread%3A%3AProcessNextEvent+%7C+NS_ProcessNextEvent+%7C+mozilla%3A%3Alayers%3A%3ACompositorBridgeParent%3A%3AShutDown
I'll leave it to Bas to clarify but for now we're not blocking on this.
Reporter | ||
Comment 12•9 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #6)
> Will this help?
>
> "Starting today, there is a change to crash signatures for a subset of e10s
> crashes. If the parent Firefox process detects that the child process has
> sent broken or unprocessable IPDL data, or is not shutting down in a timely
> manner, it kills the child process with a crash report. These crashes will
> now have a signature that indicates why the process was killed, rather than
> the child stack at the moment.
>
> The new signatures will look like "IPCError-browser |
> (msgtype=0x7,name=???) Route error: message sent to unknown actor ID"
>
> We intend to reprocess this back to 27-April for better e10s analysis."
This sounds like it will help. But it depends on whether it also applied to the child process not shutting down in a timely manner -on shutdown-.
In any case, this bug is really about both tests and crash reports, it would be great if the same happens for Shutdown kills in the test harness. I'm sorry about the confusion this may have caused, the two seemed to be tightly related (i.e. if a shutdownkill happens and we generate a crash report for the child process I assume that will also just show as a crash report in the test harness) but maybe not as much as I thought.
Flags: needinfo?(bas)
![]() |
||
Comment 13•9 years ago
|
||
Ok so to summarize, when we kill the parent through the hang monitor thread, we would like to pair that with the content process report *if* the content process is the reason for the shutdown wait.
Priority: P3 → P1
![]() |
||
Updated•9 years ago
|
Flags: needinfo?(mconley)
![]() |
||
Comment 14•9 years ago
|
||
Raised during triage today, open question: if these are browser shutdown hangs related to hung content processes, why doesn't the parent side content process shutdownkill timer solve the problem?
Reporter | ||
Comment 15•9 years ago
|
||
(In reply to Jim Mathies [:jimm] from comment #14)
> Raised during triage today, open question: if these are browser shutdown
> hangs related to hung content processes, why doesn't the parent side content
> process shutdownkill timer solve the problem?
I don't think I suggested it did? I don't know enough about this to say whether it does :-).
Simply as a hack because I had to debug something I added something which simply kills the child process if it doesn't shutdown within 30 seconds, which was 30 seconds less than the parent process gets. And that made the parent process not die (since it will now continue and shutdown successfully), and brings in useful stuff for the crash report. That was a hack for debugging something in the test harness, but never-the-less extremely useful in digging into the cause of the issue.
Comment 16•9 years ago
|
||
Hey Bas,
I was hoping to look into this today, but I think I need a better understanding of what's going on. Tell me if I have this right:
1) The parent is asked to be shut down
2) The parent starts to shut down, and asks its content process(es) to shut down
3) The parent is blocked waiting for the content processes to shut down
4) One of our hang monitors (HangMonitor? BHR? Watchdog? Other?) notices that we're waiting too long, and takes the action of killing the parent process to speed things along.
5) When (4) happens, the content process doesn't have a stack taken, which isn't great, since that's where the real problem is here - the slow shutdown of the content process.
Is that roughly correct? If so, can you point me towards the component that's killing the parent, because it's not clear to me which component that is... the JS Watchdog?
Flags: needinfo?(mconley) → needinfo?(bas)
![]() |
||
Comment 17•9 years ago
|
||
https://bugzilla.mozilla.org/show_bug.cgi?id=1268559#c60
This comment in the parent bug explains what I think is happening here. Getting a child stack won't be possible from the parent side if the patch in bug 1268559 lands. We'll have to find a way to trigger a child reports when we hit the child timeout.
Reporter | ||
Comment 18•9 years ago
|
||
(In reply to Mike Conley (:mconley) - (needinfo me!) from comment #16)
> Hey Bas,
>
> I was hoping to look into this today, but I think I need a better
> understanding of what's going on. Tell me if I have this right:
>
> 1) The parent is asked to be shut down
> 2) The parent starts to shut down, and asks its content process(es) to shut
> down
> 3) The parent is blocked waiting for the content processes to shut down
> 4) One of our hang monitors (HangMonitor? BHR? Watchdog? Other?) notices
> that we're waiting too long, and takes the action of killing the parent
> process to speed things along.
> 5) When (4) happens, the content process doesn't have a stack taken, which
> isn't great, since that's where the real problem is here - the slow shutdown
> of the content process.
>
> Is that roughly correct? If so, can you point me towards the component
> that's killing the parent, because it's not clear to me which component that
> is... the JS Watchdog?
It's the shutdown kill timer, at least for this particular stack. As Jimm illustrated, there's a possibility this might happen at run time with a child process being killed, but I'm not sure how severe the consequences of that would be to a parent.
Flags: needinfo?(bas)
Comment 19•9 years ago
|
||
(In reply to Mike Conley (:mconley) - (needinfo me!) from comment #19)
> (In reply to Ted Mielczarek [:ted.mielczarek] from comment #3)
> > If shutdown is taking too long because a content process is taking too long,
> > we won't get a stack out of it because it'll just terminate itself when its
> > connection to the parent process goes away. We'd need to do something
> > smarter there, which might be hard because it's in the middle of shutdown
> > and some services we would need to do things like write minidumps for the
> > content process might be shutdown.
>
> Naive question - could we change ContentChild so that if it detects the
> parent going away unexpectedly, it dumps a crash report before exiting?
I don't think we can do that and get a useful result. The content process is wired up to simply send a message to the chrome process when it crashes, and the chrome process writes the minidump. If we write a minidump from the content process directly it'll be missing a lot of the metadata it gets from the chrome process, and I don't think it will be reported properly.
> > We could possibly add an environment variable or preference to make the
> > watchdog thread just hang forever instead of crashing, in which case the
> > existing "get a stack from a hung process" code would kick in, which I
> > believe nowadays handles getting stacks from child processes as well.
>
> Where is the "get a stack from a hung process" code? Could we somehow call
> into that before crashing?
I'm thinking of the XPCOM HangMonitor:
https://dxr.mozilla.org/mozilla-central/rev/5511d54a3f172c1d68f98cc55dce4de1d0ba1b51/xpcom/threads/HangMonitor.cpp#236
...but that doesn't seem to collect child process stacks, so I guess I was wrong. (It just calls `NS_RUNTIMEABORT`.)
Flags: needinfo?(ted)
Comment 20•9 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #3)
> If shutdown is taking too long because a content process is taking too long,
> we won't get a stack out of it because it'll just terminate itself when its
> connection to the parent process goes away. We'd need to do something
> smarter there, which might be hard because it's in the middle of shutdown
> and some services we would need to do things like write minidumps for the
> content process might be shutdown.
Naive question - could we change ContentChild so that if it detects the parent going away unexpectedly, it dumps a crash report before exiting?
>
> We could possibly add an environment variable or preference to make the
> watchdog thread just hang forever instead of crashing, in which case the
> existing "get a stack from a hung process" code would kick in, which I
> believe nowadays handles getting stacks from child processes as well.
Where is the "get a stack from a hung process" code? Could we somehow call into that before crashing?
Flags: needinfo?(ted)
Comment 21•6 years ago
|
||
Moving to p2 because no activity for at least 24 weeks.
See How Do You Triage for more information
Priority: P1 → P2
Updated•2 years ago
|
Severity: normal → S3
You need to log in
before you can comment on or make changes to this bug.
Description
•