Closed Bug 1438679 Opened 4 years ago Closed 4 years ago

[mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status

Categories

(Testing :: Mozbase, enhancement, P1)

Version 3
enhancement

Tracking

(firefox60 fixed)

RESOLVED FIXED
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

Attachments

(1 file, 2 obsolete files)

As noticed on bug 1433905 comment 8 the DeviceRunner doesn't override the `poll()` and `wait()` methods, and as such calling both methods from a consumer of mozrunner will trigger the implementations in BaseRunner. Those do not check for the remote process, but instead check for the process status of `adb`, which in all the cases will return `0` - `adb` always forks itself as daemon.

Only `is_running` of the DeviceRunner class has the correct behavior:

https://dxr.mozilla.org/mozilla-central/rev/27fd083ed7ee5782e52a5eaf0286c5ffa8b35a9e/testing/mozbase/mozrunner/mozrunner/base/device.py#124

It queries the device manager and checks if the process exists. In case of FennecRunner this is the DeviceManagerADB class which is based on DeviceManager:

https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozdevice/mozdevice/devicemanagerADB.py
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozdevice/mozdevice/devicemanager.py

`processExist()` itself fetches the process info and returns True or False.

I would say we can use the same for `poll()` and `wait()` to at least return `None` or `0`. I'm not sure if there is a way to retrieve the exit code of an already gone process via adb. Goeff do you know?
Flags: needinfo?(gbrown)
(In reply to Henrik Skupin (:whimboo) from comment #0)
> I'm not sure if there is a way to retrieve the exit code of
> an already gone process via adb. Goeff do you know?

I think it is not possible to retrieve the exit code of a remote process which has already completed.

A technique that we have used with some success is to echo the exit status of the remote process:

https://dxr.mozilla.org/mozilla-central/rev/9b69cc60e5848f2f8802c911fd00771b50eed41f/testing/mozbase/mozdevice/mozdevice/devicemanagerADB.py#140

then wait for the process to end and collect and parse its output for the exit code:

https://dxr.mozilla.org/mozilla-central/rev/9b69cc60e5848f2f8802c911fd00771b50eed41f/testing/mozbase/mozdevice/mozdevice/devicemanagerADB.py#171-185

That ought to work if the remote process is a shell script or a native executable like 'ps' or 'ls'. Of course, if you are launching an application (firefox), that's probably going to be done through 'am start' or similar and this technique won't be of much use.
Flags: needinfo?(gbrown)
I got it successfully running as part of my patch for bug 1433873. Here a try build which utilizes a new wait() method on the DeviceRunner class:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e9408c41de5868adae6d675d2ff021f7e6d7af07

I will extract that patch, and upload it here for a separate release of mozrunner which is overdue anyway.
Attachment #8953232 - Flags: review?(gbrown)
Attachment #8953233 - Flags: review?(gbrown)
Attachment #8953233 - Attachment is obsolete: true
Attachment #8953233 - Flags: review?(gbrown)
(In reply to Henrik Skupin (:whimboo) from comment #2)
> I will extract that patch, and upload it here for a separate release of
> mozrunner which is overdue anyway.

Well, we shouldn't release yet given that more work has to be done for both mozrunner and mozprocess, so I removed the second commit.
Comment on attachment 8953232 [details]
Bug 1438679 - [mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status.

https://reviewboard.mozilla.org/r/222514/#review228768

::: testing/mozbase/mozrunner/mozrunner/base/device.py:127
(Diff revision 1)
>                      self.app_ctx.remote_process))
>  
>      def is_running(self):
>          return self.app_ctx.dm.processExist(self.app_ctx.remote_process)
>  
> +    def poll(self):

I don't see poll() defined in the base class. What am I missing?

::: testing/mozbase/mozrunner/mozrunner/base/device.py:134
(Diff revision 1)
> +
> +    def wait(self, timeout=None):
> +        count = 0
> +
> +        while True:
> +            if not self.is_running():

The is_running() call is ultimately an adb command out to the remote device -- that might take some time.

I expect this implementation is usually fine, but on a slow/malfunctioning device, it might wait too long. Instead of assuming one loop == 1 second, I'd prefer to see something like:

end = datetime.now() + datetime.timedelta(timeout)
...
  if datetime.now() > end:
  
You can't guarantee a perfect wait time, but this approach will limit the damage.
Attachment #8953232 - Flags: review?(gbrown) → review-
Comment on attachment 8953232 [details]
Bug 1438679 - [mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status.

https://reviewboard.mozilla.org/r/222514/#review228768

> I don't see poll() defined in the base class. What am I missing?

Oh, I was working that much with the process classes that I didn't notice that mozrunner has such an awkward API. So instead of `poll()` we have the property `returncode` only, or have to use `wait(timeout=0)`. I will remove that method.

> The is_running() call is ultimately an adb command out to the remote device -- that might take some time.
> 
> I expect this implementation is usually fine, but on a slow/malfunctioning device, it might wait too long. Instead of assuming one loop == 1 second, I'd prefer to see something like:
> 
> end = datetime.now() + datetime.timedelta(timeout)
> ...
>   if datetime.now() > end:
>   
> You can't guarantee a perfect wait time, but this approach will limit the damage.

I checked again and we actually have the same 1s sleep logic in `start()` and `stop()`. I have never seen that such a call can take that long, but I agree that we should make that more accurate. I will update all affected methods in this module, which also gives us better coverage for `wait()`.

A new try build is up at https://treeherder.mozilla.org/#/jobs?repo=try&revision=700aaa319090b0ccdaca703d0b344e08d1550259
Summary: [mozrunner] DeviceRunner has to override poll() and wait() methods to check for remote process status → [mozrunner] DeviceRunner has to override wait() to check for remote process status
While digging deeper I noticed that we inappropriately override `is_running`, it's just identical to the BaseRunner. Instead we completely miss the `returncode` property, which has to be independent from the BaseRunner. Maybe that is what I originally tried to refer to but accidentally replaced with `poll()`.

Further I will patch `processExist()` which claims to return a boolean, but actually returns the process id.
Summary: [mozrunner] DeviceRunner has to override wait() to check for remote process status → [mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status
Comment on attachment 8953970 [details]
Bug 1438679 - [mozdevice] processExist() has to return a boolean and not remote process id.

https://reviewboard.mozilla.org/r/223128/#review229168
Attachment #8953970 - Flags: review?(gbrown) → review+
Comment on attachment 8953232 [details]
Bug 1438679 - [mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status.

https://reviewboard.mozilla.org/r/222514/#review229232

This looks great now. Thanks.
Attachment #8953232 - Flags: review?(gbrown) → review+
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5685421d86eb
[mozdevice] processExist() has to return a boolean and not remote process id. r=gbrown
https://hg.mozilla.org/integration/autoland/rev/0f79e397a320
[mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status. r=gbrown
Backed out for xpcshell failures on Android.
 backout: https://hg.mozilla.org/integration/autoland/rev/d04639eccb1871310307a9515f3b310536cb652e
 push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=0f79e397a3202e6579fb8a6b642e828ab2dff793
 failure logs:
https://treeherder.mozilla.org/logviewer.html#?job_id=164398083&repo=autoland&lineNumber=26093

[task 2018-02-26T19:17:46.257Z] 19:17:46     INFO -  (xpcshell/head.js) | test MAIN run_test finished (1)
[task 2018-02-26T19:17:46.257Z] 19:17:46     INFO -  exiting test
[task 2018-02-26T19:17:46.258Z] 19:17:46     INFO -  intl/uconv/tests/unit/test_encode_CP1258.js | [7298, Main Thread] WARNING: '!gThread', file /builds/worker/workspace/build/src/xpcom/threads/nsTimerImpl.cpp, line 399
[task 2018-02-26T19:17:46.258Z] 19:17:46     INFO -  intl/uconv/tests/unit/test_encode_CP1258.js | [7298, Main Thread] WARNING: OOPDeinit() without successful OOPInit(): file /builds/worker/workspace/build/src/toolkit/crashreporter/nsExceptionHandler.cpp, line 3371
[task 2018-02-26T19:17:46.258Z] 19:17:46     INFO -  <<<<<<<
[task 2018-02-26T19:17:46.258Z] 19:17:46     INFO -  INFO | Result summary:
[task 2018-02-26T19:17:46.258Z] 19:17:46     INFO -  INFO | Passed: 152
[task 2018-02-26T19:17:46.258Z] 19:17:46  WARNING -  INFO | Failed: 152
[task 2018-02-26T19:17:46.259Z] 19:17:46  WARNING -  One or more unittests failed.
[task 2018-02-26T19:17:46.259Z] 19:17:46     INFO -  INFO | Todo: 0
[task 2018-02-26T19:17:46.259Z] 19:17:46     INFO -  INFO | Retried: 0
[task 2018-02-26T19:17:46.259Z] 19:17:46     INFO -  SUITE-END | took 1792s
[task 2018-02-26T19:17:46.273Z] 19:17:46    ERROR - Return code: 1
[task 2018-02-26T19:17:46.273Z] 19:17:46     INFO - TinderboxPrint: xpcshell<br/>152/<em class="testfail">152</em>/0
[task 2018-02-26T19:17:46.273Z] 19:17:46     INFO - ##### xpcshell log ends
Flags: needinfo?(hskupin)
It's totally not clear to me what's failing here. I cannot find any failing test in that log even it is listed as WARNING at the end. Geoff, do you have any idea? If not I will have to dig into that tomorrow.
Flags: needinfo?(hskupin) → needinfo?(gbrown)
I see a lot of "Process still running after test!", which seems to come from:

https://dxr.mozilla.org/mozilla-central/rev/7208b6a7b11c3ed8c87a7f17c9c30a8f9583e791/testing/xpcshell/runxpcshelltests.py#292-300

Note that poll() is implemented as:

https://dxr.mozilla.org/mozilla-central/rev/7208b6a7b11c3ed8c87a7f17c9c30a8f9583e791/testing/xpcshell/remotexpcshelltests.py#194

...which seems related to the processExist() change.
Flags: needinfo?(gbrown)
Interestingly I haven't gotten an email from Bugzilla about your reply, so I figured out the same meanwhile as you pointed out in the last comment. The usage of `processExist()` is really wrong here, and would explain it.
I see a lot of usage for `processExist()` in tree, and I think it is a better idea to actually move this change out to a different bug.

https://dxr.mozilla.org/mozilla-central/search?q=processExist&=mozilla-central

I will revert this change.
Attachment #8953970 - Attachment is obsolete: true
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1b30f674e78e
[mozrunner] DeviceRunner has to override returncode and wait() to check for remote process status. r=gbrown
https://hg.mozilla.org/mozilla-central/rev/1b30f674e78e
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
You need to log in before you can comment on or make changes to this bug.