[mozcrash] Improve crash information output for e10s mode (content processes)

NEW
Unassigned

Status

Testing
Mozbase
a year ago
3 months ago

People

(Reporter: whimboo, Unassigned)

Tracking

Version 3
Points:
---

Firefox Tracking Flags

(e10s+)

Details

(Reporter)

Description

a year ago
Over on bug 1290186 we had expected crashes for content processes because our test is opening and closing new browser windows too quickly so that the content processes are not able to shutdown themselves and get killed by the parent. As result we create crash reports for the affected content processes, but in-pair also for the parent process to get the needed information - even it hasn't been crashed. An example can be found on bug 1289243.

I wonder what we can do to improve mozcrash for e10s (content) related crashes. Right now it reports even the above crashes as "PROCESS_CRASH", which then kills the current test job and no other tests are getting executed. I think that's not an ideal situation.

So here some questions which came up for me:

1) How important is it to log both the content and the parent crash stack. As I have heard from Mike, at least in the above case the content crash stack is not helpful at all, but the parent one contains the necessary information about the shutdown kill timer.

2) Could we modify the log output to eg. CONTENT_PROCESS_CRASH so we could change the handling of those crashes on Treeherder? If yes, I assume that minidump-stackwalk might need an update?

3) Maybe we could interpret the type of crash by analyzing the crash reason value. In the above case we have `DUMP_REQUESTED` which indicates that it's not a real crash. 

Ted, I would appreciate to get some feedback and maybe further ideas from you. Thanks.
Flags: needinfo?(ted)
(In reply to Henrik Skupin (:whimboo) from comment #0)
> 1) How important is it to log both the content and the parent crash stack.
> As I have heard from Mike, at least in the above case the content crash
> stack is not helpful at all, but the parent one contains the necessary
> information about the shutdown kill timer.

I think it depends on the crash. We don't always know where the interesting bits are.

> 2) Could we modify the log output to eg. CONTENT_PROCESS_CRASH so we could
> change the handling of those crashes on Treeherder? If yes, I assume that
> minidump-stackwalk might need an update?

We wouldn't need to change minidump-stackwalk, this information is available in the .extra file (which as we discussed in another bug mozcrash isn't reading currently) as the "ProcessType" annotation. It's not set for chrome process crashes, but it's set to "content" for content process crashes.

> 3) Maybe we could interpret the type of crash by analyzing the crash reason
> value. In the above case we have `DUMP_REQUESTED` which indicates that it's
> not a real crash. 

I don't think this is a great idea, there are lots of ways to get into these sorts of situations.

Couldn't you fix the root cause here--either fix the test harness to not trigger the crash, or fix the browser code to cope with this situation without killing the content process? That seems like it would be the best state of affairs.
Flags: needinfo?(ted)

Updated

a year ago
tracking-e10s: --- → +
(Reporter)

Comment 2

a year ago
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #1)
> Couldn't you fix the root cause here--either fix the test harness to not
> trigger the crash, or fix the browser code to cope with this situation
> without killing the content process? That seems like it would be the best
> state of affairs.

I think there are too many ways when a content process can be killed. But yes, the browser side should be improved for each of those crashes. Therefore I think we should not hide them. One situation we hit with automated tests is bug 1293284. Opening and closing windows and tabs can result in a way higher number of crashes.

So if we do not force-stop the browser in such situations, how safe is it to continue the tests? Could we expect any side-effects due to that crash? Or doesn't it outweigh the additional results we get?

I mean what we could do when checking the process status for socket errors is the following:

> if process.returncode == None:
>     if not crashed:
>         kill process
> else:
>     show process got unexpectedly closed message

If that is fine, I can get this implemented in Marionette here:
https://dxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette_driver/marionette.py#756
Flags: needinfo?(ted)
(Reporter)

Comment 3

a year ago
(In reply to Henrik Skupin (:whimboo) from comment #2)
> If that is fine, I can get this implemented in Marionette here:

This would actually be bug 1299216. So if you think it's nothing we can do in mozcrash, feel free to close this bug.
(Reporter)

Comment 4

a year ago
Oh, from my standpoint it would still be helpful to see if it is a chrome or content crash as reported on Treeherder.
(In reply to Henrik Skupin (:whimboo) from comment #2)
> (In reply to Ted Mielczarek [:ted.mielczarek] from comment #1)
> > Couldn't you fix the root cause here--either fix the test harness to not
> > trigger the crash, or fix the browser code to cope with this situation
> > without killing the content process? That seems like it would be the best
> > state of affairs.
> 
> I think there are too many ways when a content process can be killed. But
> yes, the browser side should be improved for each of those crashes.
> Therefore I think we should not hide them. One situation we hit with
> automated tests is bug 1293284. Opening and closing windows and tabs can
> result in a way higher number of crashes.
> 
> So if we do not force-stop the browser in such situations, how safe is it to
> continue the tests? Could we expect any side-effects due to that crash? Or
> doesn't it outweigh the additional results we get?

I don't know the answer to this question. Obviously the browser is written to continue to function in such scenarios (for our users' sake), but whether hitting that crash will invalidate your tests is not clear.

I'd be fine with changing mozcrash to read the .extra file and print information regarding what process each crash it encounters came from, but I think that's about the limits of what it's sensible to do. As we discussed in another bug, it might also make sense to implement `MOZ_CRASHREPORTER_SHUTDOWN` in Firefox to allow test harnesses to opt-in to simply have the browser exit if a content process crash happens. That way at least you'd have a clearer picture of what happened, and you wouldn't have to specially handle content process crashes.
Flags: needinfo?(ted)
Duplicate of this bug: 1395129
(Reporter)

Updated

3 months ago
Summary: [mozcrash] Improve crash information output for e10s mode → [mozcrash] Improve crash information output for e10s mode (content processes)
You need to log in before you can comment on or make changes to this bug.