Closed Bug 1376795 Opened 7 years ago Closed 7 years ago

Marionette sometimes doesn't detect minidump files (crashes) of Firefox

Categories

(Remote Protocol :: Marionette, enhancement, P3)

52 Branch
enhancement

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: whimboo, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

What I noticed over on bug 1223277 we sometimes are not able to detect the minidump files anymore. This is most likely happening because they are getting created with a delay. With mozcrash we are checking the minidump folder right away after Firefox closed. This seems to mostly happen for content related crashes but in some cases like bug 1376773 it also happens for chrome. Ted, has something been changed lately which could have caused such a delayed minidump creation? It came up the first time not too long ago.
Flags: needinfo?(ted)
Blocks: 1326124
Blocks: 1351931
Blocks: 1365391
Blocks: 1322138
Blocks: 1377030
Blocks: 1377359
Gabriele: this is the issue I mentioned last week about a potential race condition with minidump creation.
Flags: needinfo?(ted) → needinfo?(gsvelto)
Minidumps for content crashes are still being generated immediately but they're being processed so the file is usually locked for a brief while until processing is finished. This caused issues in mochitests when the harness tried to delete the minidumps that were still being processed. It shouldn't prevent other code from accessing them though. When I'm back from paternity leave I can try to reproduce locally and have a look. Sorry for not responding earlier but I've been on PL since after the all-hands and I'll be until the end of the month.
Flags: needinfo?(gsvelto)
Interestingly those failures have now been started to appear on mozilla-beta too. All that by early last week. Given that we had nearly no uplifts due to sheriffs being out, here a list of changes which got uplifted the days before: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?startdate=2017-06-28&enddate=2017-07-04 I know that we had and still have a couple of crashes for mozilla::dom::Request on central, so I wonder if the landing of bug 371889 could have something to do with it. Andrea, what do you think?
Flags: needinfo?(amarchesini)
Oh wait, I have just seen that the patch on bug 371889 got backed out on July 6th from mozilla-beta, but the behavior is still persistent on that branch. So it shouldn't be related.
Flags: needinfo?(amarchesini)
I wonder if bug 1363378 could have an effect here regarding the shutdown of Firefox, and the delay of minidump creation.
(In reply to Gabriele Svelto [:gsvelto] (on PTO until 31/07/2017) from comment #2) > shouldn't prevent other code from accessing them though. When I'm back from > paternity leave I can try to reproduce locally and have a look. Sorry for > not responding earlier but I've been on PL since after the all-hands and > I'll be until the end of the month. Sounds good. Until then I will put ni? for you and will implement a workaround for the tests for the time being to make those not fail anymore.
Flags: needinfo?(gsvelto)
Blocks: 1381403
So with my workaround on bug 1381403 it seems that I was able to get it working by just having a timeout of 5s in waiting for the minidump files being created. But for code coverage builds it perma fails. So a couple of questions: 1) Is breakpad enabled for those builds? 2) How slow are those builds compared to normal builds? 3) Is the crash reporter affected by that slowness too, or does it have to process more info? 4) (Do we really store a target.tar.bz2 of 500MB for each build? WOW!!)
Flags: needinfo?(klahnakoski)
I have not looked at the builds for a while; looking now I see `target.tar.bz2` is big. I am not sure what roll it plays; they do not seem coverage related to me; but I can not discount it because of active experiments we are running. I added marco and gmierz who are familiar with the coverage builds to answer these questions for you.
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(klahnakoski)
Flags: needinfo?(gmierz2)
Looking at the contents of `target.tar.bz2`, it seems to be a build, with a much larger libxul.so file, which I assume is a side effect of the extra coverage code. There is little we can do about the size. If it is any consolation; coverage is only run on central (and try).
(In reply to Henrik Skupin (:whimboo) from comment #7) > So with my workaround on bug 1381403 it seems that I was able to get it > working by just having a timeout of 5s in waiting for the minidump files > being created. But for code coverage builds it perma fails. So a couple of > questions: > > 1) Is breakpad enabled for those builds? > 2) How slow are those builds compared to normal builds? > 3) Is the crash reporter affected by that slowness too, or does it have to > process more info? > 4) (Do we really store a target.tar.bz2 of 500MB for each build? WOW!!) 1) The crash reporter is disabled in code coverage builds: https://hg.mozilla.org/mozilla-central/file/5f44d10bacca2d693413b529e0caadc73e634e1e/browser/config/mozconfigs/linux64/code-coverage#l7. 2) They are slower, I have no idea by how much. 3) See 1.
Flags: needinfo?(mcastelluccio)
(In reply to Marco Castelluccio [:marco] from comment #10) > https://hg.mozilla.org/mozilla-central/file/ > 5f44d10bacca2d693413b529e0caadc73e634e1e/browser/config/mozconfigs/linux64/ > code-coverage#l7. Ah, that would be the explanation why the tests are perma failing. We don't take that into account and simply return an empty array. We will have to automatically skip the test. I will care about it on bug 1381403. So what's left here is definitely the fact that for normal builds the minidump files are getting created with a delay.
Flags: needinfo?(gmierz2)
Ted, I wonder if we shutdown Firefox too quickly when 'MOZ_CRASHREPORTER_SHUTDOWN' is set. I just found the following which is in use by mochitests: https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/BrowserTestUtils/BrowserTestUtils.jsm#1081 Maybe we should delay the requested shutdown of Firefox until the crashmanager reports that the minidump files exist?
Flags: needinfo?(ted)
Setting MOZ_CRASHREPORTER_SHUTDOWN should not affect the crash report generation since it forces Firefox to close only when it has hit an actual content crash, see this: https://dxr.mozilla.org/mozilla-central/rev/ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/browser/modules/ContentCrashHandlers.jsm#111 ... and this happens later: https://dxr.mozilla.org/mozilla-central/rev/ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/browser/modules/ContentCrashHandlers.jsm#150 As you can see from that code it is possible to get an ipc:content-shutdown message which corresponds to a crash but which doesn't have an associated minidump: https://dxr.mozilla.org/mozilla-central/rev/ef9a0f01e4f68214f0ff8f4631783b8a0e075a82/browser/modules/ContentCrashHandlers.jsm#118 That should happen only if Breakpad fails to generate the minidump though, which should really be an uncommon scenario. What's the best way to repro this? I can try and run a failing case a bunch of time with enough (manual) instrumentation in the code to figure out if the minidump is really being created or not.
Flags: needinfo?(gsvelto)
(In reply to Gabriele Svelto [:gsvelto] from comment #13) > That should happen only if Breakpad fails to generate the minidump though, > which should really be an uncommon scenario. What's the best way to repro There are a couple of bugs open when breakpad failed to generate the dumps. Ted didn't seem to have time yet to check those. > this? I can try and run a failing case a bunch of time with enough (manual) > instrumentation in the code to figure out if the minidump is really being > created or not. Best is to run all the tests from the following test: https://dxr.mozilla.org/mozilla-central/source/testing/marionette/harness/marionette_harness/tests/unit/test_crash.py Please make sure to comment out the workaround inside of try/except and only leave line 42: https://dxr.mozilla.org/mozilla-central/source/testing/marionette/harness/marionette_harness/tests/unit/test_crash.py#40-48 To run the test use `mach marionette test testing/marionette/harness/marionette_harness/tests/unit/test_crash.py -vv --gecko-log -` with tracing log output. The platform these failures are mostly happening is osx-10-10 in Taskcluster.
I could assume that this bug could be related to bug 1387369.
See Also: → 1387369
(In reply to Henrik Skupin (:whimboo) from comment #15) > I could assume that this bug could be related to bug 1387369. I don't think bug 1387369 is related. It is the result of bug 1360308, which landed on 6/22, but the symptoms in the related bugs happen before that. Also, I suppose we are crashing intentionally in the tests, right? Bug 1387369 results from non-crashing minidump generation so it should be irrelevant.
We only shutdown Firefox (not crash) in case of content crashes. So not sure how all this is then related to chrome process crashes.
See Also: → 1390540
Gabriele, any idea how to proceed here? Maybe it would be helpful as first step to add a log entry when the process is shutdown due to 'MOZ_CRASHREPORTER_SHUTDOWN'=1? It would give us an indication in the logs when such a situation happens.
Flags: needinfo?(gsvelto)
This prints out a message when we quit in response to a content crash when MOZ_CRASHREPORTER_SHUTDOWN is set. I have been busy with a huge bug that's being on my TODO list for a while but I'll come to help out with this ASAP.
Flags: needinfo?(gsvelto)
Attachment #8897792 - Flags: review?(hskupin)
Comment on attachment 8897792 [details] [diff] [review] [PATCH] Print a message when quitting in response to a content crash Review of attachment 8897792 [details] [diff] [review]: ----------------------------------------------------------------- Sorry, but I'm not a browser peer and so cannot review it. Also I'm not sure if dump() or using a logger() is a better idea.
Attachment #8897792 - Flags: review?(hskupin)
Attachment #8897792 - Flags: review?(cyu)
(In reply to Henrik Skupin (:whimboo) from comment #20) > Sorry, but I'm not a browser peer and so cannot review it. Also I'm not sure > if dump() or using a logger() is a better idea. Bugzilla's suggested reviewer interface suggested you for this, weird :|
(In reply to Gabriele Svelto [:gsvelto] from comment #21) > Bugzilla's suggested reviewer interface suggested you for this, weird :| Thank you for letting me know. I filed bug 1391163 to get this fixed.
Comment on attachment 8897792 [details] [diff] [review] [PATCH] Print a message when quitting in response to a content crash This only prints out the message on release build when browser.dom.window.dump.enabled is flipped to true. Do we do this during test? Or we just care about debug builds?
Flags: needinfo?(hskupin)
Yes, at least our Marionette tests are flipping this preference to true. I cannot speak about others. If there is a better way to get the information to stdout like with the logger, I'm all for it. But I don't know what's preferred to use in those JS modules on browser side.
Flags: needinfo?(hskupin)
Comment on attachment 8897792 [details] [diff] [review] [PATCH] Print a message when quitting in response to a content crash For debugging the underlying root cause, I think we can go with this patch.
Attachment #8897792 - Flags: review?(cyu) → review+
Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/135ecb6cf04a Print a message when quitting in response to a content crash; r=cyu
Keywords: leave-open
Basically all the known crashes here are happening due to bug 1395504. It's just a webcontent process which doesn't quit and as such Firefox doesn't shutdown. Marionette kills the process after 120s, and at that time no minidump files exist. I will have a look if I can enhance our tests so that we have a better reporting. Otherwise we have to wait for a fix on bug 1395504.
Depends on: 1395504
Flags: needinfo?(ted)
(In reply to Henrik Skupin (:whimboo) from comment #28) > I will have a look if I can enhance our tests so that we have a better > reporting. Otherwise we have to wait for a fix on bug 1395504. I don't think that there is something I can do to improve the tests. Thing is that the shutdown is drastically delayed, and Firefox doesn't even quit within our now raised shutdown timeout of 120s. Marionette just kills the browser at this stage, and as result no minidump files are written. It means we clearly have to wait for a fix for bug 1395504.
Priority: -- → P3
Meanwhile I think there is nothing wrong with Marionette but this is all due to a hang during shutdown all caused by bug 1395504. So this is really something I would like to see an investigation, and fix for sure. I don't think it makes sense to keep this bug further open.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
No longer blocks: 1504482
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: