Closed Bug 1038868 Opened 10 years ago Closed 8 years ago

Use B2GDeviceRunner in the Marionette client

Categories

(Remote Protocol :: Marionette, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: davehunt, Unassigned)

References

Details

(Whiteboard: [affects=b2gdevice])

Attachments

(4 files, 8 obsolete files)

46 bytes, text/x-github-pull-request
Details | Review
175.62 KB, text/plain
Details
181.10 KB, text/plain
Details
19.53 KB, patch
ahal
: review+
Details | Diff | Splinter Review
With the latest mozrunner, we can now use B2GDeviceRunner when we're running against a physical device. This should allow us to remove some of the B2G utility methods from gaiatest and the marionette test runner mixins, and also provide a device manager instance to the tests as appropriate.
Blocks: 1038870
Blocks: 1039626
Attached patch Work in progress patch (obsolete) — Splinter Review
This is a work-in-progress patch, which is working for me locally. It does currently require the use of --app=b2g in order to determine that we're running against a B2G device. Could you please provide some feedback, Andrew? :)
Assignee: nobody → dave.hunt
Status: NEW → ASSIGNED
Attachment #8457024 - Flags: feedback?(ahalberstadt)
Attached patch gaiatest patch for wip (obsolete) — Splinter Review
This patch is for gaiatest, which works in association with the WIP patch.
Comment on attachment 8457024 [details] [diff] [review]
Work in progress patch

Review of attachment 8457024 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, this looks great! Thanks for doing this. DeviceRunner is pretty new, so don't be afraid to change things around if it makes sense. For example, maybe it isn't necessary to reboot the device in start (I'm doubtful, but it might be worth a try). That code has been there for a long time, dating back to when this used to live in b2gautomation.py.

::: testing/mozbase/mozrunner/mozrunner/base/device.py
@@ +21,5 @@
>      remote devices (or emulators), such as B2G.
>      """
>      def __init__(self, device_class, device_args=None, **kwargs):
>          process_args = kwargs.get('process_args', {})
> +        # TODO: Make streaming to stdout optional

Yeah, the mochitest structured logging patch actually has a fix that will make it so these values don't overwrite what gets passed in (which is very bad!). If that doesn't land soon, I'm going to cherry-pick it myself within the next couple of days as it will also be blocking me.

@@ +121,5 @@
>          self.check_for_crashes()
>  
>      def check_for_crashes(self):
>          dump_dir = self.device.pull_minidumps()
> +        return BaseRunner.check_for_crashes(self, dump_directory=dump_dir, test_name=self.last_test)

Good catch!
Attachment #8457024 - Flags: feedback?(ahalberstadt) → feedback+
(In reply to Andrew Halberstadt [:ahal] from comment #4)
> maybe it isn't necessary to reboot the device in start (I'm doubtful, but it
> might be worth a try).. that code has been there for a long time, dating
> back to when this used to live in b2gautomation.py.

I think if we're pushing a profile then it's necessary in order to pick up the new profile. We do have consumers of gaiatest that need to not restart between tests, so perhaps we could avoid the restart by indicating that we want to use the profile already on the device and skip the cleanup. This might be best as a subclass of B2GDeviceRunner.
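
Roughly what I have in mind, purely as a sketch (the class name and the exact hooks are hypothetical; this assumes start() is where the profile push and reboot happen):

    class NoRestartB2GDeviceRunner(B2GDeviceRunner):
        """Hypothetical: reuse the profile already on the device."""

        def start(self, *args, **kwargs):
            # Skip the profile push and reboot performed by the base start();
            # just attach to the already-running b2g process.
            self.device.connect()

        def cleanup(self):
            # Leave the device exactly as we found it.
            pass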

> ::: testing/mozbase/mozrunner/mozrunner/base/device.py
> @@ +21,5 @@
> >      remote devices (or emulators), such as B2G.
> >      """
> >      def __init__(self, device_class, device_args=None, **kwargs):
> >          process_args = kwargs.get('process_args', {})
> > +        # TODO: Make streaming to stdout optional
> 
> Yeah, the mochitest structured logging patch actually has a fix that will
> make it so these values don't overwrite what gets passed in (which is very
> bad!). If that doesn't land soon, I'm going to cherry-pick it myself within
> the next couple of days as it will also be blocking me.

Do you have the bug number so I can set it as a blocker?
Flags: needinfo?(ahalberstadt)
CCing Paul and Walter to provide info on the non-restart use case.
(In reply to Dave Hunt (:davehunt) from comment #2)
> It does currently require use of --app=b2g in order to determine that we're running
> against a B2G device.

As Zac pointed out on IRC, we could assume the app is b2g in gaiatest if no command line value has been set.
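
Something like this in the gaiatest runner would do it (option name assumed):

    # default to b2g when no --app value is given on the command line
    options.app = options.app or 'b2g'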
Depends on: 1040078
Depends on: 886570
Flags: needinfo?(ahalberstadt)
Depends on: 1051986
(In reply to Zac C (:zac) from comment #6)
> CCing Paul and Walter to provide info on the non-restart use case.

In our use case, we randomly pick test cases and trigger the gaiatest runner. Since we can't queue an unlimited number of test cases, we end up re-initializing the runner again and again.

Hopefully we can create the DeviceRunner in our wrapper and pass it to the runner, so that we can keep the instance and avoid rebooting the device.
Whiteboard: [affects=b2gdevice]
Depends on: 1057044
Depends on: 1059248
Attachment #8457024 - Attachment is obsolete: true
Attachment #8495960 - Flags: review?(ahalberstadt)
Attachment #8457028 - Attachment is obsolete: true
Attachment #8495963 - Flags: review?(zcampbell)
Attachment #8495963 - Flags: review?(rwood)
Attachment #8495963 - Flags: review?(ahalberstadt)
The mozilla-central patch should have no impact on our current testing as it only affects on-device tests. The Gaia patch depends on the mozilla-central patch, so we'll most likely want to release a new version of the marionette client ahead of the patch landing.
The patches also take care of crash reporting as a side-effect. I've included a unit test for crash detection and recovery. The MINIDUMP_STACKWALK environment variable must be set to the location of the stackwalk binary [1] in order to process the crash dump, and the --symbols command line option should be provided with a path or URL to the appropriate symbols file.

If the MINIDUMP_SAVE_PATH environment variable is set, any crash dump files will be saved to the location specified by its value.
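
For reference, this is roughly how those values end up being handed to mozcrash (a sketch; mozcrash also reads both environment variables itself when the arguments are omitted, and dump_dir/symbols_path/test_name are assumed to come from the harness):

    import os
    import mozcrash

    crashed = mozcrash.check_for_crashes(
        dump_directory=dump_dir,            # minidumps pulled from the device
        symbols_path=symbols_path,          # path or URL from --symbols
        stackwalk_binary=os.environ.get('MINIDUMP_STACKWALK'),
        dump_save_path=os.environ.get('MINIDUMP_SAVE_PATH'),
        test_name=test_name)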

[1] http://hg.mozilla.org/build/tools/file/tip/breakpad/
Attachment #8495963 - Flags: feedback?(pyang)
Comment on attachment 8495960 [details] [diff] [review]
Use B2GDeviceRunner in the Marionette client. v1.0

Review of attachment 8495960 [details] [diff] [review]:
-----------------------------------------------------------------

Lgtm! r+ with a couple questions and nits.

::: testing/marionette/client/marionette/marionette.py
@@ +18,5 @@
>  from decorators import do_crash_check
>  from keys import Keys
>  from marionette_transport import MarionetteTransport
>  
> +from mozrunner import B2GDeviceRunner

nit: just add this, comma separated, on the line below

@@ +474,5 @@
>                   gecko_log=None, homedir=None, baseurl=None, no_window=False, logdir=None,
>                   busybox=None, symbols_path=None, timeout=None, socket_timeout=360,
>                   device_serial=None, adb_path=None, process_args=None):
>          self.host = host
> +        self.port = self.remote_port = port

Should self.port be None to begin with and then get set for the first time when setup_port_forwarding is called?

@@ +548,5 @@
>  
> +        if app and app.lower() == 'b2g':
> +            # There's no way to know if the target is a device, so we try to
> +            # start a device runner and if that fails we assume the target is
> +            # not a device

If we get past all the if statements above, doesn't that pretty much only leave devices left?

::: testing/marionette/client/marionette/runner/base.py
@@ +574,4 @@
>                  'app_args': self.app_args,
>                  'bin': self.bin,
>                  'profile': self.profile,
> +                'gecko_log': self.gecko_log

nit: generally it's good form to leave trailing commas at the end of Python objects; this makes the diff smaller if someone adds a line later.
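
For example (illustrative):

    options = {
        'app_args': self.app_args,
        'bin': self.bin,
        'profile': self.profile,
        'gecko_log': self.gecko_log,  # trailing comma keeps future diffs small
    }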
Attachment #8495960 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8495963 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/gaia/pull/24475

I'm not super familiar with gaiatest, so mostly just looked at general python/structure stuff. But looks alright from what I can tell.
Attachment #8495963 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8495963 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/gaia/pull/24475

lgtm. I see that gaiatest gets the Marionette instance from its init parameters; we can create one, pass it to the runner, and avoid re-initialization. Thanks!
Attachment #8495963 - Flags: feedback?(pyang) → feedback+
Blocks: 1064798
Comment on attachment 8495963 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/gaia/pull/24475

LGTM. Just a question in the PR regarding turning off the device screen on GaiaTestCase.teardown.
Attachment #8495963 - Flags: review?(rwood) → review+
Do these have to be landed simultaneously or can we land the marionette client side before we test/land the gaiatest changes?
Flags: needinfo?(dave.hunt)
(In reply to Zac C (:zac) from comment #17)
> Do these have to be landed simultaneously or can we land the marionette
> client side before we test/land the gaiatest changes?

See comment 11.
Flags: needinfo?(dave.hunt)
Blocks: 1040742
Comment on attachment 8495963 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/gaia/pull/24475

r- as per two comments in the Github PR.
Attachment #8495963 - Flags: review?(zcampbell) → review-
I also found some issues when running against a local desktopb2g instance.

(am I running it correctly?)



In this case I started it with the device connected but not intending to use it.

1. Connect a running, marionette enabled flame device by usb, but make sure adb server is killed
2. Execute gaiatest using --binary and --profile and --restart to have it start desktopb2g
3. The runner will start the desktopb2g instance but then also reboot the flame. Then it all falls apart and all the tests will fail.

I guess it needs to be a bit more strict when binary and profile are passed in, so that it doesn't attempt any device commands.

Console trace:
(gecko-dev)zac@zac-X1-Carbon:~/Mozilla/gaia/tests/python/gaia-ui-tests$ gaiatest --testvars gaiatest/testvars_desktop.json --html-output=results/index.html  --binary /home/zac/desktopb2g/master/b2g/b2g-bin --profile /home/zac/Mozilla/gaia/profile/ gaiatest/tests/unit/test_kill.py --restart
Results will not be posted to Treeherder. Please set the following environment variables to enable Treeherder reports: TREEHERDER_KEY, TREEHERDER_SECRET
mozdevice {'os': 'flame-eng 4.4.2 KOT49H eng.cltbld.20141006.114623 test-keys'}
starting httpd
running webserver on http://10.246.27.94:46158/
SUITE-START | Running 1 tests
TEST-START | test_kill.py TestKill.test_kill
Traceback (most recent call last):
  File "/home/zac/Mozilla/src/testing/mozbase/mozrunner/mozrunner/base/runner.py", line 194, in check_for_crashes
    test=test_name)
TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'
TEST-UNEXPECTED-ERROR | test_kill.py TestKill.test_kill | AttributeError: 'GeckoRuntimeRunner' object has no attribute 'device'

Traceback (most recent call last):
  File "/home/zac/Mozilla/src/testing/marionette/client/marionette/marionette_test.py", line 249, in run
    self.setUp()
  File "/home/zac/Mozilla/gaia/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 747, in setUp
    self.device.start_b2g()
  File "/home/zac/Mozilla/gaia/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 567, in start_b2g
    self.marionette.port = runner.device.setup_port_forwarding(

TEST-INFO took 12785ms
Traceback (most recent call last):
  File "/home/zac/Mozilla/src/testing/mozbase/mozrunner/mozrunner/base/runner.py", line 194, in check_for_crashes
    test=test_name)
TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'
TEST-START | test_kill.py TestKill.test_kill_multiple
Traceback (most recent call last):
  File "/home/zac/Mozilla/src/testing/mozbase/mozrunner/mozrunner/base/runner.py", line 194, in check_for_crashes
    test=test_name)
TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'
TEST-UNEXPECTED-ERROR | test_kill.py TestKill.test_kill_multiple | AttributeError: 'GeckoRuntimeRunner' object has no attribute 'device'

Traceback (most recent call last):
  File "/home/zac/Mozilla/src/testing/marionette/client/marionette/marionette_test.py", line 249, in run
    self.setUp()
  File "/home/zac/Mozilla/gaia/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 747, in setUp
    self.device.start_b2g()
  File "/home/zac/Mozilla/gaia/tests/python/gaia-ui-tests/gaiatest/gaia_test.py", line 567, in start_b2g
    self.marionette.port = runner.device.setup_port_forwarding(

TEST-INFO took 2287ms
Traceback (most recent call last):
  File "/home/zac/Mozilla/src/testing/mozbase/mozrunner/mozrunner/base/runner.py", line 194, in check_for_crashes
    test=test_name)
TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'
Traceback (most recent call last):
  File "/home/zac/Mozilla/src/testing/mozbase/mozrunner/mozrunner/base/runner.py", line 194, in check_for_crashes
    test=test_name)
TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'

SUMMARY
-------
passed: 0
failed: 2
todo: 0

FAILED TESTS
-------
test_kill.py test_kill.TestKill.test_kill
test_kill.py test_kill.TestKill.test_kill_multiple
Traceback (most recent call last):
  File "/home/zac/Mozilla/src/testing/mozbase/mozrunner/mozrunner/base/runner.py", line 194, in check_for_crashes
    test=test_name)
TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'
Traceback (most recent call last):
  File "/home/zac/Mozilla/src/testing/mozbase/mozrunner/mozrunner/base/runner.py", line 194, in check_for_crashes
    test=test_name)
TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'
mozversion application_buildid: 20141001040205
mozversion application_changeset: 14665b1de5ee
mozversion application_display_name: B2G
mozversion application_name: B2G
mozversion application_repository: https://hg.mozilla.org/mozilla-central
mozversion application_version: 35.0a1
mozversion gaia_date: 1412163943
mozversion platform_buildid: 20141001040205
mozversion platform_changeset: 14665b1de5ee
mozversion platform_repository: https://hg.mozilla.org/mozilla-central
SUITE-END | took 19s
I also have some concern that when running the tests against a device it adds around 10-15 seconds per test. That's more than 10% in most cases, and it would add quite a lot of test run time on Jenkins CI.
(In reply to Zac C (:zac) from comment #20)
> I also found some issues around running against a local desktopb2g instances.
> 
> In this case I started it with the device connected but not intending to use
> it.

That sounds like a bug. I'll look into it.
(In reply to Zac C (:zac) from comment #21)
> I also have some concern that when running the tests against a device it
> adds around 10-15 seconds per test. That's more than 10% in most cases, it
> would add quite a lot of test run time on Jenkins CI.

I would expect an initial delay while the harness does a restart of b2g ahead of the first test, but it shouldn't have an impact on subsequent tests. I can do some timing comparisons and try to see where we're adding time.
(In reply to Dave Hunt (:davehunt) from comment #22)
> (In reply to Zac C (:zac) from comment #20)
> > In this case I started it with the device connected but not intending to use it.
> 
> That sounds like a bug. I'll look into it.

This was because I had an if where I needed an elif. After speaking on IRC we were also able to resolve the TypeError issues, which were likely due to the environment and perhaps an outdated mozcrash being used.
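
The shape of the bug, reconstructed (hypothetical, simplified):

    # With two independent ifs, passing --binary and --profile did not stop
    # the device branch from also running; elif makes them exclusive.
    if self.bin:
        runner_class = GeckoRuntimeRunner
    if app == 'b2g':  # bug: should have been elif
        runner_class = B2GDeviceRunner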
(In reply to Dave Hunt (:davehunt) from comment #23)
> (In reply to Zac C (:zac) from comment #21)
> > I also have some concern that when running the tests against a device it
> > adds around 10-15 seconds per test. That's more than 10% in most cases, it
> > would add quite a lot of test run time on Jenkins CI.
> 
> I would expect an initial delay while the harness does a restart of b2g
> ahead of the first test, but it shouldn't have an impact on subsequent
> tests. I can do some timing comparisons and try to see where we're adding
> time.

I ran the unit test suite (36 tests) before and after applying the patches. With the patches applied the suite took an additional 10 minutes (~16 seconds per test). Thinking about it, this is likely a combination of the change to the restart (should be negligible), the additional (one-off) restart at the beginning of the suite, and the additional overhead of checking for crashes regularly. As a check for crashes involves pulling any files from the device, and is performed frequently, I suspect it contributes the majority of the increase. I will continue to investigate.
I've profiled running just tests/unit/test_kill.py (two tests) with and without the patches applied. With the patches applied we gained ~49 seconds.

An additional 28 seconds was spent in gaia_test.py(setUp); however, our gaia_test.py(start_b2g) only increased by around 4 seconds. We have the additional restart that happens during the initialisation of the Marionette object, which has increased by 10 seconds.

We now also spend 13 additional seconds checking for crashes. Of this, we spent 5 seconds getting the minidumps directory and 8 seconds listing the remote files. Looking at mozrunner, we could certainly optimise this so we spend less time when checking for crashes on devices. I'll file a followup bug for this.
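
For example, something along these lines could avoid the expensive pull when there is nothing to process (a sketch; the attribute names are assumed):

    def check_for_crashes(self):
        # only pull minidumps when a remote listing shows any .dmp files
        remote_dir = self.device.app_ctx.remote_minidumps  # name assumed
        files = self.device.dm.listFiles(remote_dir)
        if not any(f.endswith('.dmp') for f in files):
            return False
        dump_dir = self.device.pull_minidumps()
        return BaseRunner.check_for_crashes(
            self, dump_directory=dump_dir, test_name=self.last_test)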

During these two tests we check for crashes 155 times, which does seem a little excessive to me. Malini: Do you know if we need to check so often? Also, do you have any other suggestions for reducing the tests run time?
Flags: needinfo?(mdas)
I've raised bug 1079774 to improve the crash detection on devices.
(In reply to Dave Hunt (:davehunt) from comment #26)
> I've profiled running just tests/unit/test_kill.py (two tests) with and
> without the patches applied. With the patches applied we gained ~49 seconds.
> 
> An additional 28 seconds was spent in gaia_test.py(setUp), however our
> gaia_test.py(start_b2g) only increased by around 4 seconds. We have the
> additional restart that happens during the initialisation of the Marionette
> object, which has increased by 10 seconds.
> 
> We now also spend 13 additional seconds checking for crashes. Of this, we
> spent 5 seconds getting the minidumps directory and 8 seconds listing the
> remote files. Looking at mozrunner, we could certainly optimise this so we
> spend less time when checking for crashes on devices. I'll file a followup
> bug for this.
> 
> During these two tests we check for crashes 155 times, which does seem a
> little excessive to me. Malini: Do you know if we need to check so often?
> Also, do you have any other suggestions for reducing the tests run time?

Definitely don't need to check that often. I think we should only check for crashes when we receive a connection-related exception (IOError, etc.).
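
The client already has a hook for this in the do_crash_check decorator; a minimal version of the idea (not the actual implementation) would be:

    import socket
    from functools import wraps

    def do_crash_check(func):
        """Only check for crashes when a connection-related error occurs."""
        @wraps(func)
        def _(self, *args, **kwargs):
            try:
                return func(self, *args, **kwargs)
            except (socket.error, IOError):
                # the process may have died; look for a minidump before
                # re-raising so the crash is reported close to its cause
                self.check_for_crash()
                raise
        return _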
Flags: needinfo?(mdas)
Depends on: 1081043
There are a couple of things still concerning me before I'm comfortable landing these patches. Each may need to be raised separately and potentially marked as blockers:

1. In at least two places if we detect a crash we abort the test run. As we have crash recovery (at least for gaiatest) we do not want to prevent all further tests from running in the event of a crash. See http://hg.mozilla.org/mozilla-central/file/29fbfc1b31aa/testing/marionette/client/marionette/runner/base.py#l189 and http://hg.mozilla.org/mozilla-central/file/29fbfc1b31aa/testing/marionette/client/marionette/runner/base.py#l808

2. The call to check_for_crash only returns true if a crash is detected by that call, so it's unlikely to return true if a crash was detected earlier on in the test. This means any logic that should execute if a crash has been detected at any point during the test (see above examples) may not work as expected.

3. I strongly suspect that our check for crashes can hit a race condition when the crash occurs immediately prior to the check. I have seen instances where the dmp file pulled from the device cannot be read. This might be unrelated, or it might be that the file was still being written to at the time. I have also seen a test immediately following a test that causes a crash to also detect a crash, suggesting that the crash files were still being written and are picked up during the setUp of the next test.

4. A crash occurring during a test does not appear to impact the test result. I expect that we would instead want to mark a test as failed if we witness a crash.

Malini and Andrew: Could you share your thoughts on the above concerns? I suspect we'll need changes to both the Marionette test runner and to mozrunner.
Flags: needinfo?(mdas)
Flags: needinfo?(ahalberstadt)
1) Not super familiar with this, but seems easy to do.

2) I don't really understand what you mean by this. Check for crash is very simple in that it returns true if there's a dmp file in the profile. So as long as we are calling check_for_crash in all the right places (i.e. on timeout, on test end, on process exit), and clean up the minidumps folder between tests, I don't see a problem here.

3) This is likely true, but isn't specific to gaiatest. Our desktop harnesses have been running like this for years and I don't think I've ever seen it happen. I think it might depend on how the harness is set up to use mozcrash. Typically we have an on_finish handler in mozprocess where we place the check_for_crashes call. When a crash happens, the gecko process won't actually exit until the minidump has finished being written. So we avoid any race conditions that way. I suspect if you are seeing this, then we aren't calling check_for_crashes in the right places or we aren't waiting for the process to exit before starting the next one.

4) Seems reasonable. This wasn't a concern for other suites since crash aborts the test run. There's already a CRASH status in mozlog's test_end method.
Flags: needinfo?(ahalberstadt)
(In reply to Andrew Halberstadt [:ahal] from comment #33)
> 2) I don't really understand what you mean by this.. Check for crash is very
> simple in that it returns true if there's a dmp file in the profile. So as
> long as we are calling check_for_crash in all the right places (i.e on
> timeout, on test end, on process exit), and clean up the minidumps folder
> between tests, I don't see a problem here.

I mean we should distinguish when we want to check if there is a minidump, and when we want to check if we have detected a crash at any point during the test. For example the current code appears to prevent further tests from running if there is a minidump but not if a minidump has already been detected/reported/deleted. The Marionette client itself checks for crashes if certain exceptions are thrown, so it's not safe to assume a minidump will be waiting until the harness picks things up after the test.

> 3) This is likely true, but isn't specific to gaiatest. Our desktop
> harnesses have been running like this for years and I don't think I've ever
> seen it happen.. I think it might depend on how the harness is set up to use
> mozcrash. Typically we have an on_finish handler in mozprocess where we
> place the check_for_crashes call. When a crash happens, the gecko process
> won't actually exit until the minidump has finished being written. So we
> avoid any race conditions that way. I suspect if you are seeing this, then
> we aren't calling check_for_crashes in the right places or we aren't waiting
> for the process to exit before starting the next one.

We have the on_finish handler, but I'm causing a crash using stop with a SIGABRT. I suspect the forced crash is too close to the on_finish handler's execution.
(In reply to Dave Hunt (:davehunt) from comment #34)
> (In reply to Andrew Halberstadt [:ahal] from comment #33)
> > 2) I don't really understand what you mean by this.. Check for crash is very
> > simple in that it returns true if there's a dmp file in the profile. So as
> > long as we are calling check_for_crash in all the right places (i.e on
> > timeout, on test end, on process exit), and clean up the minidumps folder
> > between tests, I don't see a problem here.
> 
> I mean we should distinguish when we want to check if there is a minidump,
> and when we want to check if we have detected a crash at any point during
> the test. For example the current code appears to prevent further tests from
> running if there is a minidump but not if a minidump has already been
> detected/reported/deleted. The Marionette client itself checks for crashes
> if certain exceptions are thrown, so it's not safe to assume a minidump will
> be waiting until the harness picks things up after the test.
> 
> > 3) This is likely true, but isn't specific to gaiatest. Our desktop
> > harnesses have been running like this for years and I don't think I've ever
> > seen it happen.. I think it might depend on how the harness is set up to use
> > mozcrash. Typically we have an on_finish handler in mozprocess where we
> > place the check_for_crashes call. When a crash happens, the gecko process
> > won't actually exit until the minidump has finished being written. So we
> > avoid any race conditions that way. I suspect if you are seeing this, then
> > we aren't calling check_for_crashes in the right places or we aren't waiting
> > for the process to exit before starting the next one.
> 

If I understand correctly, if we implement 1) then shouldn't we have consistent behaviour when checking for a crash, that is, if we see a crash, mark the test as a failure and continue?
Flags: needinfo?(mdas)
Blocks: 1087687
(In reply to Malini Das [:mdas] from comment #35)
> If I understand correctly, if we implement 1) then shouldn't we have
> consistent behaviour when checking for a crash, that is, if we see a crash,
> mark the test as a failure and continue?

That sounds like a combination of 1) and 4). In order to implement this we could remove the two places where we currently try to stop the remaining tests on crash, and indicate that a crash occurred when the test ends. I would also suggest that this solves 2), because we'll be removing that logic and instead using the runner.crashed value to determine if any crashes were detected in the life of a test. It doesn't help with 3); however, perhaps a small sleep in the unit test that demonstrates this is a reasonable trade-off for the coverage.

I'll work on an updated patch with the suggestions implemented.
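
For 2), the essence of what I have in mind is to track crashes cumulatively on the runner (a sketch; the or-0 guard is there because older mozcrash versions may return None):

    import mozcrash

    class BaseRunner(object):
        def __init__(self, symbols_path=None):
            self.symbols_path = symbols_path
            self.crashed = 0  # cumulative over the runner's lifetime

        def check_for_crashes(self, dump_directory, test_name=None):
            crash_count = mozcrash.check_for_crashes(
                dump_directory, self.symbols_path, test_name=test_name) or 0
            self.crashed += crash_count
            return crash_count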
I have a new working patch with the changes applied. It still requires some work and testing, but I've compared running the 36 unit tests with and without the patch to see what the impact is on the running time. With the patch applied the unit tests took ~3 minutes longer, which works out at ~5 seconds extra per test. Given that ~20 seconds of this is actually before we start running the first test, I think this is an acceptable increase that shouldn't impact our current suites considerably, especially given the advantage that the improved crash handling will provide us. What do you think, Zac?
Flags: needinfo?(zcampbell)
Yes I agree, that's an acceptable compromise. There are other places we can explore to make that time back.
Flags: needinfo?(zcampbell)
Comment on attachment 8512543 [details] [diff] [review]
Use B2GDeviceRunner in the Marionette client. v2.0

Try is looking good so far, so let's mark this for another review.
Flags: needinfo?(dave.hunt)
Attachment #8512543 - Flags: review?(ahalberstadt)
Comment on attachment 8512543 [details] [diff] [review]
Use B2GDeviceRunner in the Marionette client. v2.0

Review of attachment 8512543 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with comments (especially second one) addressed.

::: testing/marionette/client/marionette/marionette.py
@@ +474,5 @@
>                   gecko_log=None, homedir=None, baseurl=None, no_window=False, logdir=None,
>                   busybox=None, symbols_path=None, timeout=None, socket_timeout=360,
>                   device_serial=None, adb_path=None, process_args=None):
>          self.host = host
> +        self.port = self.remote_port = port

Hard to say for sure without full context, but looks like there's no reason to set self.port at this point anymore.

@@ +755,3 @@
>          if returncode is not None:
> +            print ('PROCESS-CRASH | %s | abnormal termination with exit code %d' % (name, returncode))
> +        assert(crashed == 0)

I think I like the idea of using an exception here, but I don't think this should be an assertion. Assertions can be disabled by passing -O to Python, which means we wouldn't detect crashes properly. Probably not a very common case, but I'd rather just raise some sort of exception if crashed > 0.
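
Something like this (untested sketch) would do:

    if crashed > 0:
        raise Exception('%d crash(es) detected during the test run' % crashed)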

::: testing/marionette/client/marionette/runner/base.py
@@ -579,4 @@
>                  'app_args': self.app_args,
>                  'bin': self.bin,
>                  'profile': self.profile,
> -                'gecko_log': self.gecko_log,

optional: Personally I like trailing commas because it makes it easier to add an entry.
Attachment #8512543 - Flags: review?(ahalberstadt) → review+
Thanks for the review. I've addressed two of your comments; in particular, check_for_crash has been reverted to return a boolean, but it will now return true if a crash has occurred at any point in the lifetime of the runner, rather than just at that moment. This patch also allows us to mark crashing tests as expected to fail, either by decorator or manifest.

I didn't change the self.port because otherwise this would not be set if the if/elif block was not executed.

Try:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=963580128383
Attachment #8512543 - Attachment is obsolete: true
Attachment #8514205 - Flags: review?(ahalberstadt)
Comment on attachment 8514205 [details] [diff] [review]
Use B2GDeviceRunner in the Marionette client. v3.0

Review of attachment 8514205 [details] [diff] [review]:
-----------------------------------------------------------------

This looks a lot less frightening :), thanks!

::: testing/marionette/client/marionette/runner/base.py
@@ -743,5 @@
>              for failed_test in self.failures:
>                  self.logger.info('%s' % failed_test[0])
>  
> -        try:
> -            self.marionette.check_for_crash()

I assume the check_for_crash calls in this file are no longer needed because they are in MarionetteTestCase instead?
Attachment #8514205 - Flags: review?(ahalberstadt) → review+
(In reply to Andrew Halberstadt [:ahal] from comment #43)
> Comment on attachment 8514205 [details] [diff] [review]
> Use B2GDeviceRunner in the Marionette client. v3.0
> 
> Review of attachment 8514205 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks a lot less frightening :), thanks!

Yeah, this is a much nicer approach. Thanks for helping me arrive here. :)

> ::: testing/marionette/client/marionette/runner/base.py
> @@ -743,5 @@
> >              for failed_test in self.failures:
> >                  self.logger.info('%s' % failed_test[0])
> >  
> > -        try:
> > -            self.marionette.check_for_crash()
> 
> I assume the check_for_crash's in this file are no longer needed because
> they are in MarionetteTestCase instead?

That's correct.

So my try run appears to have replicated a crash, and all subsequent tests failed because we don't restart between tests so runner.crashed will remain true. I don't think this is really a problem with this patch though. I tested a crash in Firefox desktop without the patch and we stop the entire test run on crash, so the difference now is that we'll have more failures as a result of a crash. When I replicated with my patch I got errors about starting a Marionette session, so the try run is a little different for some reason (are there types of crashes where we'll have a minidump but not lose the browser/session?)

I think a subsequent patch could detect that a previous test has crashed and, in that scenario, restart the runner during setUp. We kind of already do this with gaiatest for B2G, but I think it makes sense to have this in the Marionette test runner, where we can then override the behaviour for B2G. I think this is pretty much bug 1062676.
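
Something like the following in the test case, purely as a sketch (the attribute path and the recovery details are hypothetical):

    def setUp(self):
        # if a previous test crashed the process, restart the runner so the
        # remaining tests can recover (bug 1062676)
        runner = self.marionette.instance.runner  # attribute path assumed
        if runner.crashed:
            runner.stop()
            runner.start()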

Andrew: Does this sound good/make sense? If so, I'll go ahead and land this patch early next week and will closely follow it up with the gaiatest patch and release of the marionette-client to PyPI.
Flags: needinfo?(ahalberstadt)
(In reply to Dave Hunt (:davehunt) from comment #44)
> (In reply to Andrew Halberstadt [:ahal] from comment #43)
> > Comment on attachment 8514205 [details] [diff] [review]
> > Use B2GDeviceRunner in the Marionette client. v3.0
> > 
> > Review of attachment 8514205 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > This looks a lot less frightening :), thanks!
> 
> Yeah, this is a much nicer approach. Thanks for helping me arrive here. :)
> 
> > ::: testing/marionette/client/marionette/runner/base.py
> > @@ -743,5 @@
> > >              for failed_test in self.failures:
> > >                  self.logger.info('%s' % failed_test[0])
> > >  
> > > -        try:
> > > -            self.marionette.check_for_crash()
> > 
> > I assume the check_for_crash's in this file are no longer needed because
> > they are in MarionetteTestCase instead?
> 
> That's correct.
> 
> So my try run appears to have replicated a crash, and all subsequent tests
> failed because we don't restart between tests so runner.crashed will remain
> true. I don't think this is really a problem with this patch though. I
> tested a crash in Firefox desktop without the patch and we stop the entire
> test run on crash, so the difference now is that we'll have more failures as
> a result of a crash.

Can you post a link to the log? You might want to run this by the sheriffs to make sure it isn't being unnecessarily noisy.

> When I replicated with my patch I got errors about
> starting a Marionette session, so the try run is a little different for some
> reason (are there types of crashes where we'll have a minidump but not lose
> the browser/session?)

Yes, if e10s is enabled and MOZ_CRASHREPORTER_SHUTDOWN is unset, a child process crash should trigger this behaviour.

> Andrew: Does this sound good/make sense? If so, I'll go ahead and land this
> patch early next week and will closely follow it up with the gaiatest patch
> and release of the marionette-client to PyPI.

Yep makes sense, I'd just run the new log by the sheriffs.
Flags: needinfo?(ahalberstadt)
The log is here:
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=2827460&repo=try

So without this patch I'd expect that log would stop after the PROCESS-CRASH and subsequent TEST-UNEXPECTED-FAIL. Now we may have situations where a crash is followed by the remaining tests all failing either due to an assertion failure checking that no crashes have occurred (seen in the above log) or perhaps an exception stating that there's no Marionette session. We could fix this to recover from the crash in bug 1062676. Does this sound okay?
Flags: needinfo?(ryanvm)
Spamtastic failures tend to be a pain. I'd much prefer that the suite abort on a crash.
Flags: needinfo?(ryanvm)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #47)
> Spamtastic failures tend to be a pain. I'd much prefer that the suite abort
> on a crash.

We already have crash recovery in the Gaia tests, so aborting the suite on crash would prevent us from valuable test coverage.
This restores the behaviour of aborting the testrun on crash. We can override this in GaiaTest, where we have crash recovery.

Try:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=6bb5465e0af9
Attachment #8514205 - Attachment is obsolete: true
Attachment #8516752 - Flags: review?(ahalberstadt)
Comment on attachment 8495963 [details] [review]
Link to Github pull-request: https://github.com/mozilla-b2g/gaia/pull/24475

Clearing previous feedback/reviews and requesting review for updated patch.
Attachment #8495963 - Flags: review?
Attachment #8495963 - Flags: review-
Attachment #8495963 - Flags: review+
Attachment #8495963 - Flags: feedback+
I've been doing some testing and found a clash between desktopb2g and a device when both were running/attached to the host computer. adb commands would run against the device and tests against desktopb2g (which is bound to the port).

Dave and I debugged it and found that mozdevice's forward was failing silently (bug 1094164). With that resolved, the test run fails before it starts, forcing the user to disconnect one of the running instances.
Depends on: 1094164
Generally I'm happy with how this goes, but I would like to change the default, or have the option to override it, so the b2g process isn't stopped at the end of the test.
Our workflow is quite heavily rooted in leaving the device running after the test has finished, but this is mostly a requirement for debugging test failures or while writing tests, rather than when just executing the test suite.
Comment on attachment 8516752 [details] [diff] [review]
Use B2GDeviceRunner in the Marionette client. v3.1

Review of attachment 8516752 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, lgtm!
Attachment #8516752 - Flags: review?(ahalberstadt) → review+
(In reply to Zac C (:zac) from comment #52)
> Generally I'm happy with how this goes, but I would like to change the
> default or have the option to override it stopping the b2g process at the
> end of the test.

I agree that when we're running tests against a device it doesn't make sense to stop the process when the Marionette object is destroyed. I initially thought we could just skip the cleanup call if the target is a device (but not an emulator), but this also includes some code for restoring the original profile. I'm not convinced that's working as expected, though, as I've experienced an unbootable device after running tests... :/

Andrew: Do you have any thoughts on how we could leave devices in a usable state after tests have completed? Should the backup/restore be optional? Are you aware of the backup/restore being used anywhere?
Flags: needinfo?(ahalberstadt)
I worked out the issue with the device being unusable. We were calling runner.stop() to restart between tests, which doesn't call cleanup() and therefore does not restore the profile. On the next start() this means we back up the remote profiles.ini over the existing backup, thereby losing the ability to restore the original profile. I've solved this by calling cleanup() instead of stop(), but a similar issue could happen if a test aborts or is interrupted by the user. In that case we'll likely remain in the pushed profile, which isn't as bad as a device that won't boot.
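
In other words (simplified):

    # before the fix, between tests:
    runner.stop()     # kills b2g but does NOT restore the device profile
    runner.start()    # backs up the already-modified profile over the original

    # after the fix:
    runner.cleanup()  # restores the original profile first
    runner.start()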
It seems like stop() should probably call cleanup() instead of the other way around. I'm not sure if it's worth fixing though, as there are likely places that depend on this behaviour.

Also this parameter should control whether the profile gets restored or not afterwards:
http://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozrunner/mozrunner/devices/base.py#25

I'm not guaranteeing that it works exactly as advertised though :/
Flags: needinfo?(ahalberstadt)
This patch restarts the remote b2g service after cleanup, which means a completed test run will restart a device into a usable state with the original profile and settings. Testing against running emulators will also restart once the cleanup is completed.

There are scenarios where cleanup may fail, such as a keyboard interrupt or device crash. In these cases the profile will not be restored. We may want to add an option to skip the cleanup and leave the target in the state at the end of the last run test, however I feel this should be raised in a separate bug.

Try:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=7799fd930f1a
Attachment #8519926 - Flags: review?(ahalberstadt)
The try run suffered a timeout, probably related to bug 1096337. Here's a new try run:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=e06a7e99d170
(In reply to Dave Hunt (:davehunt) from comment #58)
> There are scenarios where cleanup may fail, such as a keyboard interrupt or
> device crash. In these cases the profile will not be restored. We may want
> to add an option to skip the cleanup and leave the target in the state at
> the end of the last run test, however I feel this should be raised in a
> separate bug.

We could piggy back off of the mozprofile 'restore' parameter: http://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprofile/mozprofile/profile.py#41

Another bug sounds fine to me though.
Comment on attachment 8519926 [details] [diff] [review]
Use B2GDeviceRunner in the Marionette client. v3.2

Review of attachment 8519926 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, r+ with the minor review comment fixed.

::: testing/mozbase/mozrunner/mozrunner/devices/base.py
@@ +257,5 @@
>          # Remove the test profile
>          self.dm.removeDir(self.app_ctx.remote_profile)
>  
> +        # Restart the application
> +        self.app_ctx.start_application()

Please wrap this in an "if not emulator".
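
i.e. something like (however the emulator check is spelled in this class):

    # restart only on physical devices; emulator profiles aren't persistent
    # and the emulator is killed after the run anyway
    if not self.is_emulator:  # attribute assumed
        self.app_ctx.start_application()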
Attachment #8519926 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8519911 [details] [diff] [review]
Use B2GDeviceRunner in the Marionette client. v3.2

Afaict, this is the same patch.
Attachment #8519911 - Attachment is obsolete: true
Attachment #8519911 - Flags: review?(ahalberstadt)
(In reply to Andrew Halberstadt [:ahal] from comment #61)
> Comment on attachment 8519926 [details] [diff] [review]
> Use B2GDeviceRunner in the Marionette client. v3.2
> 
> Review of attachment 8519926 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good, r+ with the minor review comment fixed.
> 
> ::: testing/mozbase/mozrunner/mozrunner/devices/base.py
> @@ +257,5 @@
> >          # Remove the test profile
> >          self.dm.removeDir(self.app_ctx.remote_profile)
> >  
> > +        # Restart the application
> > +        self.app_ctx.start_application()
> 
> Please wrap this in an "if not emulator".

I figured that if we're running tests against an emulator that's already running, we'd want to restart the application. This does mean that when we let Marionette start the emulator, we restart the application and then immediately kill the emulator process, but I didn't see that as an issue. Could you confirm that you don't want the application restarted on the emulator after restoring the profile, etc.?
Flags: needinfo?(ahalberstadt)
I guess it doesn't hurt, but changes to the profile on the emulator aren't persistent anyway, and we kill the emulator after our test runs, so it seems like a waste of time to do.
Flags: needinfo?(ahalberstadt)
(In reply to Andrew Halberstadt [:ahal] from comment #64)
> I guess it doesn't hurt, but changes to the profile on the emulator aren't
> persistent anyway, and we kill the emulator after our test runs, so it seems
> like a waste of time to do.

As discussed on IRC we don't kill the emulator if we didn't start it, and the restart of the application isn't expected to have much impact when we do kill the emulator.

Landed on inbound with a version bump for the Marionette client:
https://hg.mozilla.org/integration/mozilla-inbound/rev/92fc5580994a

We'll need to release to PyPI once this reaches mozilla-central, and then we'll need to revisit the Gaia patch to update it to use the new client.
Keywords: leave-open
Retriggers on inbound are strongly pointing to this as the cause of a recent spike in OSX Marionette crashes like the log below:
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=3821145&repo=mozilla-inbound

Backed out.
https://hg.mozilla.org/integration/mozilla-inbound/rev/053dba7ee315
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #66)
> Retriggers on inbound are strongly pointing to this as the cause of a recent
> spike in OSX Marionette crashes like the log below:
> https://treeherder.mozilla.org/ui/logviewer.
> html#?job_id=3821145&repo=mozilla-inbound
> 
> Backed out.
> https://hg.mozilla.org/integration/mozilla-inbound/rev/053dba7ee315

This would suggest that we're detecting crashes but not stopping the test suite. It also suggests that the crash we're detecting does not kill the browser instance. I wouldn't expect that we've increased the number of crashes; rather, we've improved our ability to detect them, with the side effect of not being able to recover from them. I also notice that we're not passing the symbols or setting the minidump stackwalk binary, so the crash details are severely limited.
(In reply to Dave Hunt (:davehunt) from comment #67)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #66)
> > Retriggers on inbound are strongly pointing to this as the cause of a recent
> > spike in OSX Marionette crashes like the log below:
> > https://treeherder.mozilla.org/ui/logviewer.
> > html#?job_id=3821145&repo=mozilla-inbound
> > 
> > Backed out.
> > https://hg.mozilla.org/integration/mozilla-inbound/rev/053dba7ee315
> 
> This would suggest that we're detecting crashes but not stopping the test
> suite. It also suggests that the crash we're detecting does not kill the
> browser instance. I wouldn't expect that we've increased the number of
> crashes, but instead our ability to detect them with a side-effect of not
> being able to recover from them. I also notice that we're not passing the
> symbols or setting the minidump stacktrace, so the crash details are
> severely limited.

Here are a couple of examples of similar crashes without this patch that go completely unnoticed: https://treeherder.mozilla.org/logviewer.html#?job_id=3837945&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=3812049&repo=mozilla-inbound

In these cases the crash is reported, the suite stops, but the tests all pass. I think the worst thing introduced by this bug's patch is that we continue running the tests after the crash is detected and therefore get a lot of extra noise. In my opinion failing on the crash is a good thing.
I've raised bug 1098258 to enable processing of these crash dumps.
(In reply to Dave Hunt (:davehunt) from comment #68)
> (In reply to Dave Hunt (:davehunt) from comment #67)
> > (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #66)
> > > Retriggers on inbound are strongly pointing to this as the cause of a recent
> > > spike in OSX Marionette crashes like the log below:
> > > https://treeherder.mozilla.org/ui/logviewer.
> > > html#?job_id=3821145&repo=mozilla-inbound
> > > 
> > > Backed out.
> > > https://hg.mozilla.org/integration/mozilla-inbound/rev/053dba7ee315
> > 
> > This would suggest that we're detecting crashes but not stopping the test
> > suite. It also suggests that the crash we're detecting does not kill the
> > browser instance. I wouldn't expect that we've increased the number of
> > crashes, but instead our ability to detect them with a side-effect of not
> > being able to recover from them. I also notice that we're not passing the
> > symbols or setting the minidump stacktrace, so the crash details are
> > severely limited.
> 
> Here's a couple of examples of a similar crashes without this patch that go
> completely unnoticed:
> https://treeherder.mozilla.org/logviewer.html#?job_id=3837945&repo=mozilla-
> inbound
> https://treeherder.mozilla.org/logviewer.html#?job_id=3812049&repo=mozilla-
> inbound
> 
> In these cases the crash is reported, the suite stops, but the tests all
> pass. I think the worst thing introduced by this bug's patch is that we
> continue running the tests after the crash is detected and therefore get a
> lot of extra noise. In my opinion failing on the crash is a good thing.

Josh, Steven: could you help find someone who can fix the Mac crashes?
Flags: needinfo?(smichaud)
Flags: needinfo?(joshmoz)
(In reply to Carsten Book [:Tomcat] from comment #70)
> Josh, Steven: could you help here finding someone who can fix the Mac
> Crashes ?

I think we need to open new bugs for any crashes this bug exposes. I'd also rather not block landing this for a crash fix, so would like to know what other options we might have.
This version of the patch restores the behaviour of terminating the test suite when a crash is detected outside of stopTest. This reduces the noise after a crash is detected. See the try run here, including an occurrence of the intermittent crash:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=c6b7257d8bed
Attachment #8516752 - Attachment is obsolete: true
Attachment #8519926 - Attachment is obsolete: true
Attachment #8522190 - Flags: review?(ahalberstadt)
(In reply to Dave Hunt (:davehunt) from comment #71)
> (In reply to Carsten Book [:Tomcat] from comment #70)
> > Josh, Steven: could you help here finding someone who can fix the Mac
> > Crashes ?
> 
> I think we need to open new bugs for any crashes this bug exposes. I'd also
> rather not block landing this for a crash fix, so would like to know what
> other options we might have.

If we have commitment from a dev to debug and potentially resolve the crash then we could disable the test for treeherder/mac.
But disabling the test without that commitment in advance would make me worry the test would just get ignored and the crash never attended to.
(In reply to Dave Hunt (:davehunt) from comment #71)
> I think we need to open new bugs for any crashes this bug exposes. I'd also
> rather not block landing this for a crash fix, so would like to know what
> other options we might have.

It's probably up to the sheriffs and how frequent the crash is. I've run into similar situations before (where my new change exposed pre-existing issues) and had to add in hacks to keep tbpl happy. It's a sad state of affairs.
Comment on attachment 8522190 [details] [diff] [review]
Use B2GDeviceRunner in the Marionette client. v3.3

Review of attachment 8522190 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, but make sure to talk to the sheriffs before re-landing. Even if this patch doesn't cause the crash, the sheriffs are well within their right to back it out anyway. So let's figure out what the best way forward is first.
Attachment #8522190 - Flags: review?(ahalberstadt) → review+
It's a sad state of affairs (no disagreement there), but unfortunately the visibility policy is clear on this:
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy

If this patch causes Mn on OSX to no longer meet visibility standards, we'll either need to back it out (like we did yesterday) or hide the suite. Leaving it failing a large percentage of the time in production because it wasn't directly your patch's fault is unfortunately not an option.

If it were me, I'd work to get symbolification fixed first before this re-lands so you know which team to ping about it (no reason that couldn't be sorted out on Try?). Then if you want to re-land and let the suite get hidden by default, we can file the appropriate bugs, mark the deps, and let the various stakeholders prioritize getting it fixed.

I'll also point out that as resource-constrained as we are on OSX, we're probably not going to want to stomach leaving them running but hidden for very long. We don't have excess capacity to waste like that.
(In reply to comment #70)

I'm not going to have time to investigate these crashes anytime soon.  And I suspect Josh is in the same boat.

Furthermore, I suspect these crashes aren't really Mac-specific, even if they're disproportionately triggered on the Mac.  In any case you don't appear to have any information about them.

So here's what you guys need to do:

1) Open a bug, and provide as much information as possible -- hopefully including Socorro and Apple crash reports and reliable STR.

2) Assign one of your own people to the bug, to investigate it to the point that it requires Mac-specific expertise to resolve -- if in fact that point is ever reached.
Flags: needinfo?(smichaud)
Flags: needinfo?(joshmoz)
Depends on: 1098258
No longer depends on: 1040078
Thanks to ahal's work on bug 1098258 we have crash details. A log showing the crash can be seen here: https://treeherder.mozilla.org/ui/logviewer.html#?job_id=66196&repo=ash

It appears to be already raised in bug 1090921 so setting that as a blocker.
Depends on: 1090921
(In reply to Dave Hunt (:davehunt) from comment #78)
> Thanks to ahal's work on bug 1098258 we have crash details. A log showing
> the crash can be seen here:
> https://treeherder.mozilla.org/ui/logviewer.html#?job_id=66196&repo=ash
> 
> It appears to be already raised in bug 1090921 so setting that as a blocker.

a) This is failing in an entirely different test:
test_visibility.py TestVisibility.testShouldAllowTheUserToTellIfAnElementIsDisplayedOrNot
Bug 1090921 is always in test_speech_queue.html. It's possible both are due to some underlying bug in the IPC code or in how it's used.
b) No assertion is flagged in the run you gave above, as they are in the speech test, though the line number matches.

Likely this is a locking/timing issue with IPC, as these aren't really in code that's directly part of speech or the test above. Possibly a timing/sequence-of-shutdown issue. Adding some CCs there.
Blocks: 1056001
Blocks: 1138569
This probably still makes sense for any Python-based testing on Firefox OS devices, but I'm no longer working on it.
Assignee: dave.hunt → nobody
Status: ASSIGNED → NEW
Attachment #8495963 - Flags: review?
B2G-related, so closing
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Product: Testing → Remote Protocol