Marionette sometimes doesn't detect minidump files (crashes) of Firefox

RESOLVED INVALID

Status

P3
normal
RESOLVED INVALID
a year ago
6 months ago

People

(Reporter: whimboo, Unassigned)

Tracking

(Blocks: 4 bugs)

52 Branch
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

a year ago
What I noticed over on bug 1223277 we sometimes are not able to detect the minidump files anymore. This is most likely happening because they are getting created with a delay. With mozcrash we are checking the minidump folder right away after Firefox closed.

This seems to mostly happen for content related crashes but in some cases like bug 1376773 it also happens for chrome.

Ted, has something been changed lately which could have caused such a delayed minidump creation? It came up the first time not too long ago.
Flags: needinfo?(ted)
(Reporter)

Updated

a year ago
Blocks: 1326124
(Reporter)

Updated

a year ago
Blocks: 1351931
(Reporter)

Updated

a year ago
Blocks: 1365391
(Reporter)

Updated

a year ago
Blocks: 1322138
(Reporter)

Updated

a year ago
Blocks: 1377030
(Reporter)

Updated

a year ago
Blocks: 1377359
Gabriele: this is the issue I mentioned last week about a potential race condition with minidump creation.
Flags: needinfo?(ted) → needinfo?(gsvelto)
Minidumps for content crashes are still being generated immediately but they're being processed so the file is usually locked for a brief while until processing is finished. This caused issues in mochitests when the harness tried to delete the minidumps that were still being processed. It shouldn't prevent other code from accessing them though. When I'm back from paternity leave I can try to reproduce locally and have a look. Sorry for not responding earlier but I've been on PL since after the all-hands and I'll be until the end of the month.
Flags: needinfo?(gsvelto)
(Reporter)

Comment 3

a year ago
Interestingly those failures have now been started to appear on mozilla-beta too. All that by early last week. Given that we had nearly no uplifts due to sheriffs being out, here a list of changes which got uplifted the days before:

https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?startdate=2017-06-28&enddate=2017-07-04

I know that we had and still have a couple of crashes for mozilla::dom::Request on central, so I wonder if the landing of bug 371889 could have something to do with it. Andrea, what do you think?
Flags: needinfo?(amarchesini)
(Reporter)

Comment 4

a year ago
Oh wait, I have just seen that the patch on bug 371889 got backed out on July 6th from mozilla-beta, but the behavior is still persistent on that branch. So it shouldn't be related.
Flags: needinfo?(amarchesini)
(Reporter)

Comment 5

a year ago
I wonder if bug 1363378 could have an effect here regarding the shutdown of Firefox, and the delay of minidump creation.
(Reporter)

Comment 6

a year ago
(In reply to Gabriele Svelto [:gsvelto] (on PTO until 31/07/2017) from comment #2)
> shouldn't prevent other code from accessing them though. When I'm back from
> paternity leave I can try to reproduce locally and have a look. Sorry for
> not responding earlier but I've been on PL since after the all-hands and
> I'll be until the end of the month.

Sounds good. Until then I will put ni? for you and will implement a workaround for the tests for the time being to make those not fail anymore.
Flags: needinfo?(gsvelto)
(Reporter)

Updated

a year ago
Blocks: 1381403
(Reporter)

Comment 7

a year ago
So with my workaround on bug 1381403 it seems that I was able to get it working by just having a timeout of 5s in waiting for the minidump files being created. But for code coverage builds it perma fails. So a couple of questions:

1) Is breakpad enabled for those builds?
2) How slow are those builds compared to normal builds?
3) Is the crash reporter affected by that slowness too, or does it have to process more info?
4) (Do we really store a target.tar.bz2 of 500MB for each build? WOW!!)
Flags: needinfo?(klahnakoski)
I have not looked at the builds for a while; looking now I see `target.tar.bz2` is big. I am not sure what roll it plays; they do not seem coverage related to me; but I can not discount it because of active experiments we are running. I added marco and gmierz who are familiar with the coverage builds to answer these questions for you.
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(klahnakoski)
Flags: needinfo?(gmierz2)
Looking at the contents of `target.tar.bz2`, it seems to be a build, with a much larger libxul.so file, which I assume is a side effect of the extra coverage code. There is little we can do about the size. If it is any consolation; coverage is only run on central (and try).
(In reply to Henrik Skupin (:whimboo) from comment #7)
> So with my workaround on bug 1381403 it seems that I was able to get it
> working by just having a timeout of 5s in waiting for the minidump files
> being created. But for code coverage builds it perma fails. So a couple of
> questions:
> 
> 1) Is breakpad enabled for those builds?
> 2) How slow are those builds compared to normal builds?
> 3) Is the crash reporter affected by that slowness too, or does it have to
> process more info?
> 4) (Do we really store a target.tar.bz2 of 500MB for each build? WOW!!)

1) The crash reporter is disabled in code coverage builds: https://hg.mozilla.org/mozilla-central/file/5f44d10bacca2d693413b529e0caadc73e634e1e/browser/config/mozconfigs/linux64/code-coverage#l7.

2) They are slower, I have no idea by how much.

3) See 1.
Flags: needinfo?(mcastelluccio)
(Reporter)

Comment 11

a year ago
(In reply to Marco Castelluccio [:marco] from comment #10)
> https://hg.mozilla.org/mozilla-central/file/
> 5f44d10bacca2d693413b529e0caadc73e634e1e/browser/config/mozconfigs/linux64/
> code-coverage#l7.

Ah, that would be the explanation why the tests are perma failing. We don't take that into account and simply return an empty array. We will have to automatically skip the test. I will care about it on bug 1381403.

So what's left here is definitely the fact that for normal builds the minidump files are getting created with a delay.
Flags: needinfo?(gmierz2)
(Reporter)

Comment 12

a year ago
Ted, I wonder if we shutdown Firefox too quickly when 'MOZ_CRASHREPORTER_SHUTDOWN' is set. I just found the following which is in use by mochitests:

https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/BrowserTestUtils/BrowserTestUtils.jsm#1081

Maybe we should delay the requested shutdown of Firefox until the crashmanager reports that the minidump files exist?
Flags: needinfo?(ted)
Setting MOZ_CRASHREPORTER_SHUTDOWN should not affect the crash report generation since it forces Firefox to close only when it has hit an actual content crash, see this:

https://dxr.mozilla.org/mozilla-central/rev/ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/browser/modules/ContentCrashHandlers.jsm#111

... and this happens later:

https://dxr.mozilla.org/mozilla-central/rev/ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/browser/modules/ContentCrashHandlers.jsm#150

As you can see from that code it is possible to get an ipc:content-shutdown message which corresponds to a crash but which doesn't have an associated minidump:

https://dxr.mozilla.org/mozilla-central/rev/ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/browser/modules/ContentCrashHandlers.jsm#118

That should happen only if Breakpad fails to generate the minidump though, which should really be an uncommon scenario. What's the best way to repro this? I can try and run a failing case a bunch of time with enough (manual) instrumentation in the code to figure out if the minidump is really being created or not.
Flags: needinfo?(gsvelto)
(Reporter)

Comment 14

a year ago
(In reply to Gabriele Svelto [:gsvelto] from comment #13)
> That should happen only if Breakpad fails to generate the minidump though,
> which should really be an uncommon scenario. What's the best way to repro

There are a couple of bugs open when breakpad failed to generate the dumps. Ted didn't seem to have time yet to check those.

> this? I can try and run a failing case a bunch of time with enough (manual)
> instrumentation in the code to figure out if the minidump is really being
> created or not.

Best is to run all the tests from the following test:
https://dxr.mozilla.org/mozilla-central/source/testing/marionette/harness/marionette_harness/tests/unit/test_crash.py

Please make sure to comment out the workaround inside of try/except and only leave line 42:
https://dxr.mozilla.org/mozilla-central/source/testing/marionette/harness/marionette_harness/tests/unit/test_crash.py#40-48

To run the test use `mach marionette test testing/marionette/harness/marionette_harness/tests/unit/test_crash.py -vv --gecko-log -` with tracing log output.

The platform these failures are mostly happening is osx-10-10 in Taskcluster.
(Reporter)

Comment 15

a year ago
I could assume that this bug could be related to bug 1387369.
See Also: → bug 1387369
(In reply to Henrik Skupin (:whimboo) from comment #15)
> I could assume that this bug could be related to bug 1387369.

I don't think bug 1387369 is related. It is the result of bug 1360308, which landed on 6/22, but the symptoms in the related bugs happen before that. Also, I suppose we are crashing intentionally in the tests, right? Bug 1387369 results from non-crashing minidump generation so it should be irrelevant.
(Reporter)

Comment 17

a year ago
We only shutdown Firefox (not crash) in case of content crashes. So not sure how all this is then related to chrome process crashes.
(Reporter)

Updated

a year ago
See Also: → bug 1390540
(Reporter)

Comment 18

a year ago
Gabriele, any idea how to proceed here?

Maybe it would be helpful as first step to add a log entry when the process is shutdown due to 'MOZ_CRASHREPORTER_SHUTDOWN'=1? It would give us an indication in the logs when such a situation happens.
Flags: needinfo?(gsvelto)
Created attachment 8897792 [details] [diff] [review]
[PATCH] Print a message when quitting in response to a content crash

This prints out a message when we quit in response to a content crash when MOZ_CRASHREPORTER_SHUTDOWN is set. I have been busy with a huge bug that's being on my TODO list for a while but I'll come to help out with this ASAP.
Flags: needinfo?(gsvelto)
Attachment #8897792 - Flags: review?(hskupin)
(Reporter)

Comment 20

a year ago
Comment on attachment 8897792 [details] [diff] [review]
[PATCH] Print a message when quitting in response to a content crash

Review of attachment 8897792 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry, but I'm not a browser peer and so cannot review it. Also I'm not sure if dump() or using a logger() is a better idea.
Attachment #8897792 - Flags: review?(hskupin)
(Reporter)

Updated

a year ago
Attachment #8897792 - Flags: review?(cyu)
(In reply to Henrik Skupin (:whimboo) from comment #20)
> Sorry, but I'm not a browser peer and so cannot review it. Also I'm not sure
> if dump() or using a logger() is a better idea.

Bugzilla's suggested reviewer interface suggested you for this, weird :|
(Reporter)

Comment 22

a year ago
(In reply to Gabriele Svelto [:gsvelto] from comment #21)
> Bugzilla's suggested reviewer interface suggested you for this, weird :|

Thank you for letting me know. I filed bug 1391163 to get this fixed.
Comment on attachment 8897792 [details] [diff] [review]
[PATCH] Print a message when quitting in response to a content crash

This only prints out the message on release build when browser.dom.window.dump.enabled is flipped to true. Do we do this during test? Or we just care about debug builds?
Flags: needinfo?(hskupin)
(Reporter)

Comment 24

a year ago
Yes, at least our Marionette tests are flipping this preference to true. I cannot speak about others. If there is a better way to get the information to stdout like with the logger, I'm all for it. But I don't know what's preferred to use in those JS modules on browser side.
Flags: needinfo?(hskupin)
Comment on attachment 8897792 [details] [diff] [review]
[PATCH] Print a message when quitting in response to a content crash

For debugging the underlying root cause, I think we can go with this patch.
Attachment #8897792 - Flags: review?(cyu) → review+

Comment 26

a year ago
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/135ecb6cf04a
Print a message when quitting in response to a content crash; r=cyu
(Reporter)

Updated

a year ago
Keywords: leave-open
(Reporter)

Comment 28

a year ago
Basically all the known crashes here are happening due to bug 1395504. It's just a webcontent process which doesn't quit and as such Firefox doesn't shutdown. Marionette kills the process after 120s, and at that time no minidump files exist.

I will have a look if I can enhance our tests so that we have a better reporting. Otherwise we have to wait for a fix on bug 1395504.
Depends on: 1395504
Flags: needinfo?(ted)
(Reporter)

Comment 29

11 months ago
(In reply to Henrik Skupin (:whimboo) from comment #28)
> I will have a look if I can enhance our tests so that we have a better
> reporting. Otherwise we have to wait for a fix on bug 1395504.

I don't think that there is something I can do to improve the tests. Thing is that the shutdown is drastically delayed, and Firefox doesn't even quit within our now raised shutdown timeout of 120s. Marionette just kills the browser at this stage, and as result no minidump files are written. It means we clearly have to wait for a fix for bug 1395504.
Priority: -- → P3
(Reporter)

Comment 30

7 months ago
Meanwhile I think there is nothing wrong with Marionette but this is all due to a hang during shutdown all caused by bug 1395504. So this is really something I would like to see an investigation, and fix for sure.

I don't think it makes sense to keep this bug further open.
Status: NEW → RESOLVED
Last Resolved: 7 months ago
Resolution: --- → INVALID
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.