Closed Bug 1299216 Opened 8 years ago Closed 8 years ago

[e10s] Improve crash handling for Marionette

Categories

(Remote Protocol :: Marionette, defect)

defect
Not set
normal

Tracking

(e10s+, firefox52 fixed)

RESOLVED FIXED
mozilla52
Tracking Status
e10s + ---
firefox52 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(8 files)

58 bytes, text/x-review-board-request
ahal
: review+
Details
58 bytes, text/x-review-board-request
ahal
: review+
Details
58 bytes, text/x-review-board-request
ahal
: review+
Details
58 bytes, text/x-review-board-request
automatedtester
: review+
Details
58 bytes, text/x-review-board-request
automatedtester
: review+
Details
58 bytes, text/x-review-board-request
automatedtester
: review+
Details
58 bytes, text/x-review-board-request
automatedtester
: review+
Details
58 bytes, text/x-review-board-request
automatedtester
: review+
Details
The following job on Treeherder shows it perfectly:

> PROCESS-CRASH | test_profile_management.py TestLog.test_in_app_restart_the_browser | application crashed [@ mozalloc_abort(char const*)]
> 
[..]
> TEST-UNEXPECTED-ERROR | test_profile_management.py TestLog.test_in_app_restart_the_browser | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s) 

Because the crash we are seeing here is a crash in content, Firefox is still up and running. In force_shutdown [1] we actually check if the process is still alive after waiting 65s. If that is the case we kill Firefox. 

As it looks like the behavior might not be the best for e10s. Shall we keep running our tests if there was a crash in content? We may need a solution for bug 1293270 first, to be able to be fully sure about.

Ted, what do you think?

[1] https://dxr.mozilla.org/mozilla-central/rev/26e22af660e543ebb69930f082188b69ec756185/testing/marionette/client/marionette_driver/marionette.py#756
Flags: needinfo?(ted)
We actually discussed this situation a while back when the e10s work was first starting. I added an environment variable ` MOZ_CRASHREPORTER_SHUTDOWN` which was supposed to tell the application to quit if a content process crashed. It looks like we only actually implemented this for B2G though:
https://dxr.mozilla.org/mozilla-central/rev/26e22af660e543ebb69930f082188b69ec756185/b2g/chrome/content/shell.js#162

Even if we made this work for Firefox I don't know that we'd necessarily want this enabled for every harness, since we have some like browser-chrome that do intentionally cause content process crashes to test that functionality. Having the option for individual harnesses seems like it'd be a good idea, though.
Flags: needinfo?(ted)
Blocks: e10s-tests
tracking-e10s: --- → +
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
I'm currently blocked on bug 1300709 here given that I have to load a remoting URI but this does not work until sessionstore completed. I really don't want to add a sleep to the test.
Depends on: 1300709
Actually I don't need the extra restart in the tests so the missing sessionstore feature is not blocking me.
No longer depends on: 1300709
Blocks: 1292588
Blocks: 1297364
A first try push from two days ago was mostly green:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a0d088416f21&selectedJob=27320923

There were only failures for ASAN related builds, which after some investigation make sense because the crash reporter is not available for those builds. I pushed a follow-up try build with a fix for that right now. If all works well, I will upload the patch today, just after some clean-up.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5091b92dee2c
Marking as leave open so I can push the releases to PyPI once it got landed.
Keywords: leave-open
Whiteboard: [needs mozcrash and mozrunner release]
With the latest updates the tests on Linux and Windows pass just fine, but we have issues on OSX with mozprocess:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f5bb2a0c48d1&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable&selectedJob=27535448&filter-searchStr=mn

13:18:41    ERROR -  TEST-UNEXPECTED-ERROR | test_crash.py TestCrash.test_crash_chrome_process | OSError: [Errno 1] Operation not permitted
13:18:41     INFO -  Traceback (most recent call last):
13:18:41     INFO -    File "/builds/slave/test/build/venv/lib/python2.7/site-packages/marionette/marionette_test/testcases.py", line 161, in run
13:18:41     INFO -      testMethod()
13:18:41     INFO -    File "/builds/slave/test/build/tests/marionette/tests/testing/marionette/harness/marionette/tests/unit/test_crash.py", line 94, in test_crash_chrome_process
13:18:41     INFO -      self.crash, chrome=True)
13:18:41     INFO -    File "/tools/python27/lib/python2.7/unittest/case.py", line 989, in assertRaisesRegexp
13:18:41     INFO -      callable_obj(*args, **kwargs)
13:18:41     INFO -    File "/builds/slave/test/build/tests/marionette/tests/testing/marionette/harness/marionette/tests/unit/test_crash.py", line 84, in crash
13:18:41     INFO -      """, sandbox=sandbox)
13:18:41     INFO -    File "/builds/slave/test/build/venv/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1685, in execute_script
13:18:41     INFO -      rv = self._send_message("executeScript", body, key="value")
13:18:41     INFO -    File "/builds/slave/test/build/venv/lib/python2.7/site-packages/marionette_driver/decorators.py", line 40, in _
13:18:41     INFO -      m.handle_socket_failure(crashed)
13:18:41     INFO -    File "/builds/slave/test/build/venv/lib/python2.7/site-packages/marionette_driver/marionette.py", line 766, in handle_socket_failure
13:18:41     INFO -      self.instance.close()
13:18:41     INFO -    File "/builds/slave/test/build/venv/lib/python2.7/site-packages/marionette_driver/geckoinstance.py", line 172, in close
13:18:41     INFO -      self.runner.stop()
13:18:41     INFO -    File "/builds/slave/test/build/venv/lib/python2.7/site-packages/mozrunner/base/runner.py", line 169, in stop
13:18:41     INFO -      self.process_handler.kill(sig=sig)
13:18:41     INFO -    File "/builds/slave/test/build/venv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 749, in kill
13:18:41     INFO -      self.proc.kill(sig=sig)
13:18:41     INFO -    File "/builds/slave/test/build/venv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 165, in kill
13:18:41     INFO -      send_sig(signal.SIGTERM)
13:18:41     INFO -    File "/builds/slave/test/build/venv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 152, in send_sig
13:18:41     INFO -      os.killpg(pid, sig)
13:18:41     INFO -  TEST-INFO took 1892ms

Looks like the code I wrote for bug 1176758 still has a bad side-effect. I will investigate later today when the setup of my MBP has been done.

and:

> Exception socket.error: (57, 'Socket is not connected') in <bound method TcpTransport.__del__ of <marionette_driver.transport.TcpTransport object at 0x10daed310>> ignored

I have also seen this on Linux before, and the strange part is that this comes through wptserve.
The problematic issue here on OS X was that we didn't wait long enough when calling runner.wait(). A second only is too short and even with a crash the process might not be completely gone. Increasing the timeout value fixed the failures on OS X: https://treeherder.mozilla.org/#/jobs?repo=try&revision=61aba68b8a0a
Attachment #8791590 - Flags: review?(ahalberstadt)
Attachment #8791591 - Flags: review?(ahalberstadt)
Attachment #8791592 - Flags: review?(ahalberstadt)
Attachment #8793329 - Flags: review?(dburns)
Attachment #8791593 - Flags: review?(dburns)
Attachment #8791594 - Flags: review?(dburns)
Comment on attachment 8793329 [details]
Bug 1299216 - Don't care about socket not connected for sock.shutdown() call.

https://reviewboard.mozilla.org/r/80088/#review79828
Attachment #8793329 - Flags: review?(dburns) → review+
Comment on attachment 8791593 [details]
Bug 1299216 - Enhance handling of crashes in Marionette.

https://reviewboard.mozilla.org/r/78956/#review79830

::: testing/marionette/harness/marionette/marionette_test/decorators.py:64
(Diff revision 5)
>          func.metaparameters.append((func_suffix, args, kwargs))
>          return func
>      return wrapped
>  
>  
> +def run_if_e10s(target):

Shouldnt all tests run in e10s mode?
Comment on attachment 8791593 [details]
Bug 1299216 - Enhance handling of crashes in Marionette.

https://reviewboard.mozilla.org/r/78956/#review79830

> Shouldnt all tests run in e10s mode?

No, we still run Marionette tests in non-e10s mode on all branches.
Comment on attachment 8791594 [details]
Bug 1299216 - Remove always parameter from do_process_check decorator.

https://reviewboard.mozilla.org/r/78958/#review79836
Attachment #8791594 - Flags: review?(dburns) → review+
Comment on attachment 8791590 [details]
Bug 1299216 - [mozcrash] Count crash reports in check_for_crashes and bump version to 1.0.

https://reviewboard.mozilla.org/r/78950/#review79982

Thanks, this behaviour is more useful!

::: testing/mozbase/mozcrash/mozcrash/mozcrash.py:78
(Diff revision 2)
>      filename of the calling function will be used.
>  
>      If `quiet` is set, no PROCESS-CRASH message will be printed to stdout if a
>      crash is detected.
>  
> -    Returns True if any minidumps were found, False otherwise.
> +    Returns Number of minidump files found.

nit: number should be lower case
Attachment #8791590 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8791591 [details]
Bug 1299216 - [mozrunner] check_for_crashes() has to return the crash count since its last invocation.

https://reviewboard.mozilla.org/r/78952/#review79984

Lgtm!
Attachment #8791591 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8791592 [details]
Bug 1299216 - Bump mozrunner version to 6.13.

https://reviewboard.mozilla.org/r/78954/#review79988
Attachment #8791592 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8791592 [details]
Bug 1299216 - Bump mozrunner version to 6.13.

https://reviewboard.mozilla.org/r/78954/#review79990

::: testing/mozbase/mozrunner/setup.py:21
(Diff revision 2)
>          'mozlog >= 3.0',
>          'mozprocess >= 0.23',
>          'mozprofile >= 0.18',
>          ]
>  
>  EXTRAS_REQUIRE = {'crash': ['mozcrash >= 0.14']}

This should get bumped to 1.0
Comment on attachment 8791593 [details]
Bug 1299216 - Enhance handling of crashes in Marionette.

https://reviewboard.mozilla.org/r/78956/#review79832
Attachment #8791593 - Flags: review?(dburns) → review+
Comment on attachment 8791590 [details]
Bug 1299216 - [mozcrash] Count crash reports in check_for_crashes and bump version to 1.0.

https://reviewboard.mozilla.org/r/78948/#review80748

Updated patch after rebasing to mozilla-central. It includes a move of the line which ensures that the session gets deleted out of _request_in_app_shutdown into quit() and restart(), to ensure we always call it even in the case of using a callback. Given that no change is actually done I assume that I can take over the r+.

For safety I will retrigger a set of tests including Windows XP which seems to be a bit unstable.
Comment on attachment 8791590 [details]
Bug 1299216 - [mozcrash] Count crash reports in check_for_crashes and bump version to 1.0.

https://reviewboard.mozilla.org/r/78948/#review83156

Rebased again due to conflicts with the latest changes on bug 1291687. I triggered another set on Windows XP to check if the last issue is still persistent or not.
As reference the last try build which was successful across all platforms except WinXP where we had dozen of crashes of Firefox:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3750a6ed0e55&selectedJob=28315042

New try build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca295062ac9e
The hundreds of crashes on XP are still around. All of them are related to accessibility and specifically "mozilla::a11y::PDocAccessibleChild::SendCOMProxy()". I thought that ally is disabled by default when we run Marionette tests?

https://archive.mozilla.org/pub/firefox/try-builds/hskupin@mozilla.com-ca295062ac9e74bf0073941137c2e1a33027a73e/try-win32/try_xp_ix_test-marionette-e10s-bm110-tests1-windows-build15.txt.gz

Aaron, do you have an idea here why we are crashing that often?
Flags: needinfo?(aklotz)
Btw. the same crash also happens for other try builds which do not have any change to Marionette:

https://treeherder.mozilla.org/#/jobs?repo=try&filter-searchStr=windows%20xp&fromchange=e7f585576ba4c8377f218dbecb2cfabd77571e57&selectedJob=28888091

The difference to my push is that with my patch we continue the tests and do not abort after the first crash. Maybe we should better keep this behavior also for content crashes.

David, would you agree with the above?
Flags: needinfo?(dburns)
I see the same crashes over on bug 1141483. A build from before my vacation didn't show the crashes. So I feel that the following a11y change which landed on Oct 6th caused the massive crashes:

https://hg.mozilla.org/mozilla-central/rev/7cc58d0708e7#l8.10
(In reply to Henrik Skupin (:whimboo) from comment #57)
> Btw. the same crash also happens for other try builds which do not have any
> change to Marionette:
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&filter-
> searchStr=windows%20xp&fromchange=e7f585576ba4c8377f218dbecb2cfabd77571e57&se
> lectedJob=28888091
> 
> The difference to my push is that with my patch we continue the tests and do
> not abort after the first crash. Maybe we should better keep this behavior
> also for content crashes.
> 
> David, would you agree with the above?

Yup, sounds good.
Flags: needinfo?(dburns)
I'm close to a solution for an early abort but I'm blocked on bug 1309250 now. I will see if I can find a workaround for now to not have to raise an exception, or catch it sanely.
Depends on: 1309250
(In reply to Henrik Skupin (:whimboo) from comment #56)
> The hundreds of crashes on XP are still around. All of them are related to
> accessibility and specifically
> "mozilla::a11y::PDocAccessibleChild::SendCOMProxy()". I thought that ally is
> disabled by default when we run Marionette tests?
> 
> https://archive.mozilla.org/pub/firefox/try-builds/hskupin@mozilla.com-
> ca295062ac9e74bf0073941137c2e1a33027a73e/try-win32/try_xp_ix_test-marionette-
> e10s-bm110-tests1-windows-build15.txt.gz
> 
> Aaron, do you have an idea here why we are crashing that often?

It's because that is a11y code that only runs on e10s, but a11y+e10s is not is not a supported configuration on Windows XP at this time. If possible I'd suggest that we not run a11y and e10s together on Windows XP.
Flags: needinfo?(aklotz)
(In reply to Aaron Klotz [:aklotz] from comment #61)
> It's because that is a11y code that only runs on e10s, but a11y+e10s is not
> is not a supported configuration on Windows XP at this time. If possible I'd
> suggest that we not run a11y and e10s together on Windows XP.

That is problematic then for us given that Marionette enables a11y or pieces of it, which are needed for specific events to perform. A while ago I already filed bug 1277791 to get this investigated or stopped.
Depends on: 1277791
Comment on attachment 8791590 [details]
Bug 1299216 - [mozcrash] Count crash reports in check_for_crashes and bump version to 1.0.

https://reviewboard.mozilla.org/r/78948/#review83728

Trying to see if using `self.quit()` works better as only calling `self.instance.close()`. We would at least unregister all listeners which might cause lesser issues during a forced shutdown.
(In reply to David Burns :automatedtester from comment #59)
> > The difference to my push is that with my patch we continue the tests and do
> > not abort after the first crash. Maybe we should better keep this behavior
> > also for content crashes.
> > 
> > David, would you agree with the above?
> 
> Yup, sounds good.

Btw the early abort of a test run right after the instance crashed happens at the moment because of a side-effect which is tracked by bug 1292588. I do not have a clear answer yet without investigating it deeper. But it's most likely due to the additional error/exception in the test runner. With that we seem to break out of everything.

While thinking more about early aborts, I feel that we shouldn't do that. Instead we should always continue until all tests have been run. To get a better knowledge of how other test suites are handling this situation I checked Treeherder results for process crashes and was able to find some. As I was able to see other test suites like XPCShell, Robocop, Mochitest Media, RefTest Crash continue after a crash. Here some logs:

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-aurora-macosx64-debug/1476129263/mozilla-aurora_yosemite_r7-debug_test-xpcshell-bm135-tests1-macosx-build2.txt.gz

https://public-artifacts.taskcluster.net/Mfu_jSTiRlmSZ74PTmNzTA/0/public/logs/live_backing.log

https://archive.mozilla.org/pub/mobile/tinderbox-builds/autoland-android-api-15/1476258107/autoland_ubuntu64_vm_armv7_large_test-mochitest-media-2-bm130-tests1-linux64-build2.txt.gz

https://public-artifacts.taskcluster.net/a6lgsCWlTjS7qmmobS5Yqw/0/public/logs/live_backing.log

Ted, you as test module owner, can you please give an advise how Marionette should handle such a situation? Should the tests continue even when we could end-up with dozen/hundreds of crashes in case of total bustage of a build. Thanks.
Flags: needinfo?(ted)
Using quit() doesn't actually change anything for Windows XP. We still crash like hell in the a11y part:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=453572693326&selectedJob=29027853

So we really have to get a11y disabled, or we cannot land this patch until Firefox reaches 53.0a1, which would be sad.
Some information from IRC as discussed with Ted over there:

> [14:05:38] @ted AFAIK most suites will not restart a run if a browser crashesr
> [14:05:40] @ted xpcshell tests all run in individual proceses, so one crash doesn't terminate the run
> [14:05:48] @ted i'm not sure what mochitest does with run-by-dir nowadays
> [14:05:53] whimboo you mean continue or restart
> [14:06:38] jmaher ted: for mochitest we do run-by-dir, so if one dir has a crash, we will continue on and get coverage on other dirs with a fresh profile/browser
> [14:06:44] @ted okay, that makes sense
> [14:06:48] @ted that seems like sensible behavior
> [14:07:07] whimboo ted: with marionette we also restart the browser in case of a content crash
> [14:07:09] @ted run as much of the tests as possible, but don't try to run the rest of the tests in a directory if one of them crashes
> [14:07:27] @ted whimboo: i'm honestly not sure how mochitest would handle a content crash
> [14:07:29] whimboo i don't think we create a fresh profile yet
> [14:07:49] whimboo ted: in my patch i implemetned a shutdown and restart of firefox
> [14:07:55] @ted ah
> [14:07:59] @ted that seems sensible
> [14:08:14] whimboo if a content process crash happens we cannot be sure that the installed listeners by Marionette are still active
> [14:08:48] whimboo ted: so right now its just an assumption that we have a content process crash given that a minidump is present and firefox still running
> [14:08:58] whimboo mozcrash doesnt provide the necessary information yet
> [14:09:33] @ted yeah...
> [14:09:46] @ted well if there's a content process crash there is a chrome notification that happens
> [14:12:23] whimboo there is still the question how functional marionette is at this point. But yes, the chrome process should not be affected by that content crash
> [14:23:15] @ted yeah, terminating the browser on a content process crash is totally sensible
> [14:26:39] whimboo ted: so right now we wait 65s before crashing the browser in case if a possible content crash
> [14:26:50] whimboo just in case to be sure its not a slow shutdown
> [14:27:04] whimboo when firefox would kill a hanging thread after 60s
> [14:27:32] whimboo ted: so do you think the notificatino check would be important for the initial patch?
(In reply to Henrik Skupin (:whimboo) from comment #67)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=453572693326&selectedJob=29027853

We should actually never run Mn-e10s for WinXP on try. Those tests are not executed on any of our integration and release branches. So this is not really a blocker anymore.

I will refresh to latest mozilla-central and trigger another final try build.
I think operating the way that Mochitest does would be sensible--Mochitest will continue on to the next directory of tests after a crash, since it starts a fresh browser instance for each directory, but it will not attempt to restart the current directory in which it crashed, even if that means skipping the rest of the tests in that directory.

If the Marionette harness would have started a new browser process at some point then it's fine to let it continue to do so. If not, it should just stop.

> [14:27:32] whimboo ted: so do you think the notificatino check would be important for the initial patch?

I think this would be more reliable than just waiting for a timeout. It's fairly simple, you just listen for the "ipc:content-shutdown" observer service notification, and check if it has an "abnormal" key:
https://dxr.mozilla.org/mozilla-central/rev/7452437b3ab571b1d60aed4e973d82a1471f72b2/browser/modules/ContentCrashHandlers.jsm#79

As I mentioned back in comment 1, maybe we should just implement support for `MOZ_CRASHREPORTER_SHUTDOWN` in Firefox, to provide a way to exit with an error status if a content process crashes? Then harnesses could opt-in to this behavior. Aside from browser-chrome tests that are explicitly testing content process crashes it's unlikely that a test harness is going to be able to continue to run tests usefully after a content process crash.
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #70)
> If the Marionette harness would have started a new browser process at some
> point then it's fine to let it continue to do so. If not, it should just
> stop.

Whatever type of crash happens Marionette definitely brings up a new browser process. But compared to Mochitest Marionette would continue with the next test method (if there is one) from the same file. This is due to bug 1309250, and where I do not see a way to skip all remaining tests from that file. It means we won't loose any test coverage, but on the other side could have more reported crashes in case a test causes a higher amount of crashes.

> I think this would be more reliable than just waiting for a timeout. It's
> fairly simple, you just listen for the "ipc:content-shutdown" observer
> service notification, and check if it has an "abnormal" key:
> https://dxr.mozilla.org/mozilla-central/rev/
> 7452437b3ab571b1d60aed4e973d82a1471f72b2/browser/modules/
> ContentCrashHandlers.jsm#79

Ah, good to know that we have an observer for specifically this situation. I will have a look and will see if I can get this implemented. Would it be ok to get this moved to a follow-up bug? Or do you think we should have this right away? I ask because there will be API changes, and I would have to handle fallback code for older releases of Firefox. 

> As I mentioned back in comment 1, maybe we should just implement support for
> `MOZ_CRASHREPORTER_SHUTDOWN` in Firefox, to provide a way to exit with an
> error status if a content process crashes? Then harnesses could opt-in to
> this behavior. Aside from browser-chrome tests that are explicitly testing
> content process crashes it's unlikely that a test harness is going to be
> able to continue to run tests usefully after a content process crash.

That's something out of my scope. I'm happy to file a bug for you if you cannot do it, but it's nothing I could contribute to.
Flags: needinfo?(ted)
Talked with ted on IRC and here our conversation:

[21:44:15] whimboo ted: re content crashes... would it be fine to do the observer notification check as a follow-up or shall I add it with my initial patch to Marionette?
[21:44:44] @ted i guess doing it as a follow-up would be OK
[21:44:59] whimboo ted: k, there is also the quesiotn of backward compat
[21:45:03] @ted alternately we just get someone to support that env var in firefox in general
[21:45:05] whimboo especially for our update tests
[21:45:32] whimboo ted: should i file a bug for the env var?
[21:45:41] @ted whimboo: yeah, i think so
[21:45:44] @ted we could use it in mochitest etc too
[21:46:03] whimboo k, i will create a bug now
[21:46:16] whimboo ted: thanks. looks like my remaining questions are solved. 

So I'm going to file two new bugs. One bug for the env variable, and another one for handling the observer notification when content crashes happen.
Flags: needinfo?(ted)
Actually on ash we run mn-e10s jobs for Windows 8 x64 opt/debug. It means the test failures as can be seen on my last try push need to be fixed.

I did some more investigation on a local Win10 VM and noticed that the minidump files are getting written out with a delay. As result checking for the files right after the content crash would result in an empty list, which means no crash is detected. Exactly that is also the test failure as listed in my try build.

Ted, given that mozcrash doesn't wait for any probably to appear minidump files, a delay here could mean that we detect a crash and erroneously assign it to a maybe following test. Not sure if this is something we can control or wait for.

For this specific test which tests content crashes, I could update the minidump file retrieval method to just wait a bit. I think that this should be fine enough. But for real unwanted crashes we might end-up with the above situation.
Flags: needinfo?(ted)
Depends on: 1311016
(In reply to Henrik Skupin (:whimboo) from comment #75) 
> Ted, given that mozcrash doesn't wait for any probably to appear minidump
> files, a delay here could mean that we detect a crash and erroneously assign
> it to a maybe following test. Not sure if this is something we can control
> or wait for.

I think the only reasonable thing to do here is to wait for the observer service notification, which fires after the minidump is available.
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #76)
> I think the only reasonable thing to do here is to wait for the observer
> service notification, which fires after the minidump is available.

Hm, working on a temporary solution which might go away soon when we have a fix for bug 1311016 landed, I don't see valuable. Instead I picked up bug 1311016 and provided the patch. With the hope that all is fine and we can get it landed soon, I will be able to finalize this patch soon.
Given that autolander is currently broken I triggered the try build myself:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3bb855f560525e847c304cb9916f440821fefe60
Even with the new environment variable in place which let Firefox close itself, my test is still failing with no crash being detected:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=908e8d88290a4057414d5391c1d1ddccf6bc7cf3&selectedJob=29943878

Reason is that the minidump files are getting created somewhat delayed and the crash detection code returns false. With that we assume the process has been somehow shutdown with exit code 0.

I think that I will add some debugging code so we can figure out about which delay we are talking about here.
Depends on: 1313604
I talked with Ted and will try one more thing before digging into a loaner. So what could happen here is that we see a race because we check for crashes first and then for the process status. It means that when we close Firefox due to the crash, it shutdown the socket server before the process ends. Ideally the minidump file should have already been written, but who knows what's going on here because content processes are involved. It means that I will flip the checks, doing the process check first, and then checking for crashes. With that we can be sure that if the process is not running anymore, breakpad cannot do anything.
I pushed the changes as mentioned above to try and it's still failing on Windows 8 test slaves. This is totally crazy because when Firefox has been shutdown Breakpad can no longer write to disk, but we can see those files to appear a bit later. I wonder if that is some kind of caching when data is written to the disk. 

So I will make use of the Win8 loaner and hopefully figure out more details.
So this failure seem to only happen when the test_crash_chrome_process test is run before. I can also stop the failure when forcing another restart of Firefox at the end of the above test. For me it means that something is wrong with the instance of Firefox which gets started after the chrome crash.

I will dig more into all that tomorrow.
Ok, so this was a tricky one! But I think that I nailed it down to a specific issue... What happens? Here the excerpt from the chrome crash test:

>     def test_crash_chrome_process(self):
>         self.assertRaisesRegexp(IOError, 'Process crashed',
>                                 self.crash, chrome=True)
[..]
>         self.marionette.start_session()
>
>         self.assertNotEqual(self.marionette.session['processId'], self.pid)
>         self.marionette.get_url()

The first command actually crashes the browser and we start Firefox again via `start_session()`. The next assertion is also working fine. But then something is wrong with the call to `get_url()`. As result the next test for the content crash is always failing. When this line gets removed all is working fine. Leaving this line in and doing a restart afterward, also makes it working as stated in my last comment.

In this case get_url() runs inside content context which means `getCurrentUrl()` in listener.js is getting executed and we end-up here:

https://dxr.mozilla.org/mozilla-central/rev/3e73fd638e687a4d7f46613586e5156b8e2af846/testing/marionette/listener.js#1051

Something seems to block specific messages between listener.js and driver.js afterward and that as long as Firefox is running and not getting restarted. Due to that the call of the content crash script will not be forwarded to the frame script and doesn't get run. As result no crash happens BUT Marionette client is shutting down Firefox due to a socket timeout. This is why we get no minidump files, and the wrong error message.

What can we do? I will leave in the call to get_url() but comment it out for now with a bug reference to a new bug I have to file. Given that this problem only happens on a single platform we can get this investigated later. No need to do it now.
(In reply to Henrik Skupin (:whimboo) from comment #92)
> >         self.marionette.start_session()
> >
> >         self.assertNotEqual(self.marionette.session['processId'], self.pid)
> >         self.marionette.get_url()

Just to add, the faulty page here is about:blank, so I assume this issue could be related to bug 1312674.
Depends on: 1314594
Using the --artifact switch for try doesn't seem to create the Mn jobs for Windows. Also pulse_actions is lagging a lot. So I cannot even reschedule those jobs. I will trigger a new try build specifically for Windows.

Here the current results which look promosing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9418bf09f4a0
Comment on attachment 8793329 [details]
Bug 1299216 - Don't care about socket not connected for sock.shutdown() call.

https://reviewboard.mozilla.org/r/80088/#review89804

David, beside the 2 new commits in this series there is an existing one which I would like to have another review from you. Reason is that a lot of changes went in there so your original review is obsolete. Please see: https://reviewboard.mozilla.org/r/78956/diff/
Attachment #8804621 - Flags: review?(dburns)
Attachment #8806706 - Flags: review?(dburns)
Comment on attachment 8806706 [details]
Bug 1299216 - Wait for process exit first before checking for crashes.

https://reviewboard.mozilla.org/r/90060/#review90532
Attachment #8806706 - Flags: review?(dburns) → review+
Comment on attachment 8804621 [details]
Bug 1299216 - Make use of MOZ_CRASHREPORTER_SHUTDOWN environment variable in Marionette.

https://reviewboard.mozilla.org/r/88540/#review90534
Attachment #8804621 - Flags: review?(dburns) → review+
(In reply to Henrik Skupin (:whimboo) from comment #103)
> https://reviewboard.mozilla.org/r/80088/#review89804
> 
> David, beside the 2 new commits in this series there is an existing one
> which I would like to have another review from you. Reason is that a lot of
> changes went in there so your original review is obsolete. Please see:
> https://reviewboard.mozilla.org/r/78956/diff/

Maybe you missed this request to check an already reviewed commit again. Thanks
Flags: needinfo?(dburns)
looks fine to me
Flags: needinfo?(dburns)
Comment on attachment 8791590 [details]
Bug 1299216 - [mozcrash] Count crash reports in check_for_crashes and bump version to 1.0.

https://reviewboard.mozilla.org/r/78948/#review91220

Last update of the patch series to fix the rebase conflict for the unittest manifest file.
Summary: [e10s] Improve handling of content crashes for Marionette → [e10s] Improve crash handling for Marionette
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d58be05673a2
[mozcrash] Count crash reports in check_for_crashes and bump version to 1.0. r=ahal
https://hg.mozilla.org/integration/autoland/rev/6ec047c4e48e
[mozrunner] check_for_crashes() has to return the crash count since its last invocation. r=ahal
https://hg.mozilla.org/integration/autoland/rev/7c0bdd869d2a
Bump mozrunner version to 6.13. r=ahal
https://hg.mozilla.org/integration/autoland/rev/46fd1d89a918
Don't care about socket not connected for sock.shutdown() call. r=automatedtester
https://hg.mozilla.org/integration/autoland/rev/8463d0e5bf28
Make use of MOZ_CRASHREPORTER_SHUTDOWN environment variable in Marionette. r=automatedtester
https://hg.mozilla.org/integration/autoland/rev/32b908ad1b02
Enhance handling of crashes in Marionette. r=automatedtester
https://hg.mozilla.org/integration/autoland/rev/d57d7ada1080
Remove always parameter from do_process_check decorator. r=automatedtester
https://hg.mozilla.org/integration/autoland/rev/e7aa936d8b6e
Wait for process exit first before checking for crashes. r=automatedtester
Pushed to pypi:

mozcrash:
=========
creating build/bdist.macosx-10.11-x86_64/wheel/mozcrash-1.0.dist-info/WHEEL
Uploading distributions to https://upload.pypi.org/legacy/
Uploading mozcrash-1.0-py2-none-any.whl
[================================] 11673/11673 - 00:00:02
Uploading mozcrash-1.0.tar.gz
[================================] 10007/10007 - 00:00:03

mozrunner:
==========
creating build/bdist.macosx-10.11-x86_64/wheel/mozrunner-6.13.dist-info/WHEEL
Uploading distributions to https://upload.pypi.org/legacy/
Uploading mozrunner-6.13-py2-none-any.whl
[================================] 80823/80823 - 00:00:04
Uploading mozrunner-6.13.tar.gz
[================================] 70895/70895 - 00:00:02


Given that this patch contributes a lot of improvements for handling crashes especially for content crashes in e10s, it might be worth uplifting the patches to aurora. Lets wait one more day with the decision, which I will use to check if the patches would apply cleanly.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Keywords: leave-open
Resolution: --- → FIXED
Whiteboard: [needs mozcrash and mozrunner release]
Target Milestone: --- → mozilla52
There are actually too many dependencies which would have to be uplifted. So lets cancel a possible uplift to aurora.
Depends on: 1316552
Blocks: 1320380
No longer blocks: 1320380
Depends on: 1320380
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: