[mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status

RESOLVED FIXED in Firefox 60

Status

enhancement
P1
normal
RESOLVED FIXED
a year ago
a year ago

People

(Reporter: whimboo, Assigned: whimboo)

Tracking

Version 3
mozilla60
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox60 fixed)

Details

Attachments

(1 attachment, 2 obsolete attachments)

As noticed on bug 1433905 comment 8 the DeviceRunner doesn't override the `poll()` and `wait()` methods, and as such calling both methods from a consumer of mozrunner will trigger the implementations in BaseRunner. Those do not check for the remote process, but instead check for the process status of `adb`, which in all the cases will return `0` - `adb` always forks itself as daemon.

Only `is_running` of the DeviceRunner class has the correct behavior:

https://dxr.mozilla.org/mozilla-central/rev/27fd083ed7ee5782e52a5eaf0286c5ffa8b35a9e/testing/mozbase/mozrunner/mozrunner/base/device.py#124

It queries the device manager and checks if the process exists. In case of FennecRunner this is the DeviceManagerADB class which is based on DeviceManager:

https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozdevice/mozdevice/devicemanagerADB.py
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozdevice/mozdevice/devicemanager.py

`processExist()` itself fetches the process info and returns True or False.

I would say we can use the same for `poll()` and `wait()` to at least return `None` or `0`. I'm not sure if there is a way to retrieve the exit code of an already gone process via adb. Goeff do you know?
Flags: needinfo?(gbrown)
(In reply to Henrik Skupin (:whimboo) from comment #0)
> I'm not sure if there is a way to retrieve the exit code of
> an already gone process via adb. Goeff do you know?

I think it is not possible to retrieve the exit code of a remote process which has already completed.

A technique that we have used with some success is to echo the exit status of the remote process:

https://dxr.mozilla.org/mozilla-central/rev/9b69cc60e5848f2f8802c911fd00771b50eed41f/testing/mozbase/mozdevice/mozdevice/devicemanagerADB.py#140

then wait for the process to end and collect and parse its output for the exit code:

https://dxr.mozilla.org/mozilla-central/rev/9b69cc60e5848f2f8802c911fd00771b50eed41f/testing/mozbase/mozdevice/mozdevice/devicemanagerADB.py#171-185

That ought to work if the remote process is a shell script or a native executable like 'ps' or 'ls'. Of course, if you are launching an application (firefox), that's probably going to be done through 'am start' or similar and this technique won't be of much use.
Flags: needinfo?(gbrown)
I got it successfully running as part of my patch for bug 1433873. Here a try build which utilizes a new wait() method on the DeviceRunner class:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e9408c41de5868adae6d675d2ff021f7e6d7af07

I will extract that patch, and upload it here for a separate release of mozrunner which is overdue anyway.
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Attachment #8953232 - Flags: review?(gbrown)
Attachment #8953233 - Flags: review?(gbrown)
Attachment #8953233 - Attachment is obsolete: true
Attachment #8953233 - Flags: review?(gbrown)
(In reply to Henrik Skupin (:whimboo) from comment #2)
> I will extract that patch, and upload it here for a separate release of
> mozrunner which is overdue anyway.

Well, we shouldn't release yet given that more work has to be done for both mozrunner and mozprocess, so I removed the second commit.

Comment 6

a year ago
mozreview-review
Comment on attachment 8953232 [details]
Bug 1438679 - [mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status.

https://reviewboard.mozilla.org/r/222514/#review228768

::: testing/mozbase/mozrunner/mozrunner/base/device.py:127
(Diff revision 1)
>                      self.app_ctx.remote_process))
>  
>      def is_running(self):
>          return self.app_ctx.dm.processExist(self.app_ctx.remote_process)
>  
> +    def poll(self):

I don't see poll() defined in the base class. What am I missing?

::: testing/mozbase/mozrunner/mozrunner/base/device.py:134
(Diff revision 1)
> +
> +    def wait(self, timeout=None):
> +        count = 0
> +
> +        while True:
> +            if not self.is_running():

The is_running() call is ultimately an adb command out to the remote device -- that might take some time.

I expect this implementation is usually fine, but on a slow/malfunctioning device, it might wait too long. Instead of assuming one loop == 1 second, I'd prefer to see something like:

end = datetime.now() + datetime.timedelta(timeout)
...
  if datetime.now() > end:
  
You can't guarantee a perfect wait time, but this approach will limit the damage.
Attachment #8953232 - Flags: review?(gbrown) → review-
Assignee

Comment 7

a year ago
mozreview-review-reply
Comment on attachment 8953232 [details]
Bug 1438679 - [mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status.

https://reviewboard.mozilla.org/r/222514/#review228768

> I don't see poll() defined in the base class. What am I missing?

Oh, I was working that much with the process classes that I didn't notice that mozrunner has such an awkward API. So instead of `poll()` we have the property `returncode` only, or have to use `wait(timeout=0)`. I will remove that method.

> The is_running() call is ultimately an adb command out to the remote device -- that might take some time.
> 
> I expect this implementation is usually fine, but on a slow/malfunctioning device, it might wait too long. Instead of assuming one loop == 1 second, I'd prefer to see something like:
> 
> end = datetime.now() + datetime.timedelta(timeout)
> ...
>   if datetime.now() > end:
>   
> You can't guarantee a perfect wait time, but this approach will limit the damage.

I checked again and we actually have the same 1s sleep logic in `start()` and `stop()`. I have never seen that such a call can take that long, but I agree that we should make that more accurate. I will update all affected methods in this module, which also gives us better coverage for `wait()`.

A new try build is up at https://treeherder.mozilla.org/#/jobs?repo=try&revision=700aaa319090b0ccdaca703d0b344e08d1550259
Summary: [mozrunner] DeviceRunner has to override poll() and wait() methods to check for remote process status → [mozrunner] DeviceRunner has to override wait() to check for remote process status
While digging deeper I noticed that we inappropriately override `is_running`, it's just identical to the BaseRunner. Instead we completely miss the `returncode` property, which has to be independent from the BaseRunner. Maybe that is what I originally tried to refer to but accidentally replaced with `poll()`.

Further I will patch `processExist()` which claims to return a boolean, but actually returns the process id.
Summary: [mozrunner] DeviceRunner has to override wait() to check for remote process status → [mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)

Comment 12

a year ago
mozreview-review
Comment on attachment 8953970 [details]
Bug 1438679 - [mozdevice] processExist() has to return a boolean and not remote process id.

https://reviewboard.mozilla.org/r/223128/#review229168
Attachment #8953970 - Flags: review?(gbrown) → review+

Comment 13

a year ago
mozreview-review
Comment on attachment 8953232 [details]
Bug 1438679 - [mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status.

https://reviewboard.mozilla.org/r/222514/#review229232

This looks great now. Thanks.
Attachment #8953232 - Flags: review?(gbrown) → review+

Comment 14

a year ago
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5685421d86eb
[mozdevice] processExist() has to return a boolean and not remote process id. r=gbrown
https://hg.mozilla.org/integration/autoland/rev/0f79e397a320
[mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status. r=gbrown
Backed out for xpcshell failures on Android.
 backout: https://hg.mozilla.org/integration/autoland/rev/d04639eccb1871310307a9515f3b310536cb652e
 push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=0f79e397a3202e6579fb8a6b642e828ab2dff793
 failure logs:
https://treeherder.mozilla.org/logviewer.html#?job_id=164398083&repo=autoland&lineNumber=26093

[task 2018-02-26T19:17:46.257Z] 19:17:46     INFO -  (xpcshell/head.js) | test MAIN run_test finished (1)
[task 2018-02-26T19:17:46.257Z] 19:17:46     INFO -  exiting test
[task 2018-02-26T19:17:46.258Z] 19:17:46     INFO -  intl/uconv/tests/unit/test_encode_CP1258.js | [7298, Main Thread] WARNING: '!gThread', file /builds/worker/workspace/build/src/xpcom/threads/nsTimerImpl.cpp, line 399
[task 2018-02-26T19:17:46.258Z] 19:17:46     INFO -  intl/uconv/tests/unit/test_encode_CP1258.js | [7298, Main Thread] WARNING: OOPDeinit() without successful OOPInit(): file /builds/worker/workspace/build/src/toolkit/crashreporter/nsExceptionHandler.cpp, line 3371
[task 2018-02-26T19:17:46.258Z] 19:17:46     INFO -  <<<<<<<
[task 2018-02-26T19:17:46.258Z] 19:17:46     INFO -  INFO | Result summary:
[task 2018-02-26T19:17:46.258Z] 19:17:46     INFO -  INFO | Passed: 152
[task 2018-02-26T19:17:46.258Z] 19:17:46  WARNING -  INFO | Failed: 152
[task 2018-02-26T19:17:46.259Z] 19:17:46  WARNING -  One or more unittests failed.
[task 2018-02-26T19:17:46.259Z] 19:17:46     INFO -  INFO | Todo: 0
[task 2018-02-26T19:17:46.259Z] 19:17:46     INFO -  INFO | Retried: 0
[task 2018-02-26T19:17:46.259Z] 19:17:46     INFO -  SUITE-END | took 1792s
[task 2018-02-26T19:17:46.273Z] 19:17:46    ERROR - Return code: 1
[task 2018-02-26T19:17:46.273Z] 19:17:46     INFO - TinderboxPrint: xpcshell<br/>152/<em class="testfail">152</em>/0
[task 2018-02-26T19:17:46.273Z] 19:17:46     INFO - ##### xpcshell log ends
Flags: needinfo?(hskupin)
It's totally not clear to me what's failing here. I cannot find any failing test in that log even it is listed as WARNING at the end. Geoff, do you have any idea? If not I will have to dig into that tomorrow.
Flags: needinfo?(hskupin) → needinfo?(gbrown)
I see a lot of "Process still running after test!", which seems to come from:

https://dxr.mozilla.org/mozilla-central/rev/7208b6a7b11c3ed8c87a7f17c9c30a8f9583e791/testing/xpcshell/runxpcshelltests.py#292-300

Note that poll() is implemented as:

https://dxr.mozilla.org/mozilla-central/rev/7208b6a7b11c3ed8c87a7f17c9c30a8f9583e791/testing/xpcshell/remotexpcshelltests.py#194

...which seems related to the processExist() change.
Flags: needinfo?(gbrown)
Interestingly I haven't gotten an email from Bugzilla about your reply, so I figured out the same meanwhile as you pointed out in the last comment. The usage of `processExist()` is really wrong here, and would explain it.
I see a lot of usage for `processExist()` in tree, and I think it is a better idea to actually move this change out to a different bug.

https://dxr.mozilla.org/mozilla-central/search?q=processExist&=mozilla-central

I will revert this change.
Comment hidden (mozreview-request)
Attachment #8953970 - Attachment is obsolete: true

Comment 23

a year ago
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1b30f674e78e
[mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status. r=gbrown

Comment 24

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/1b30f674e78e
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
You need to log in before you can comment on or make changes to this bug.