Closed Bug 1051986 Opened 10 years ago Closed 10 years ago

[mozdevice] Make B2GDeviceRunner's stop method kill the remote process instead of the initial adb process

Categories

(Testing :: Mozbase, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla35

People

(Reporter: davehunt, Assigned: davehunt)

References

Details

Attachments

(1 file, 7 obsolete files)

Currently the stop method on a B2GDeviceRunner instance will kill the remote_binary process, which is '/system/bin/b2g.sh'. The remote process is actually '/system/b2g/b2g', and a kill on the remote_binary is not producing a crash report.

We should create a new stop method that first kills the remote process, and then kills the remote_binary. I have tested this locally and am able to produce crash reports.
OS: Mac OS X → Gonk (Firefox OS)
Hardware: x86 → ARM
Assignee: nobody → dave.hunt
Status: NEW → ASSIGNED
Attachment #8470931 - Flags: review?(ahalberstadt)
Comment on attachment 8470931 [details] [diff] [review]
Kill the correct b2g process on device. v1.0

Review of attachment 8470931 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks! r+ with comments addressed.

::: testing/mozbase/mozrunner/mozrunner/base/device.py
@@ +96,5 @@
>          if not self.device.wait_for_net():
>              raise Exception("Failed to get a network connection")
>  
> +    def stop(self, sig=None):
> +        sig = sig or signal.SIGABRT

I feel like SIGKILL should be the default (which mozdevice should already do if you just pass in None).

@@ +114,5 @@
>              if not self.app_ctx.dm.processExist(self.app_ctx.remote_process):
>                  break
>              time.sleep(1)
>          else:
>              print("timed out waiting for '%s' process to exit" % self.app_ctx.remote_process)

I think this while loop should also get moved into stop(). Seems like there is less of a chance for race conditions that way.
Attachment #8470931 - Flags: review?(ahalberstadt) → review+
So it feels kind of like the model's broken here. mozrunner seems to assume that a Runner corresponds to a particular process. But for B2G that isn't really the case, or at least the process isn't local. So it violates the assumed invariant that, in the absence of any crashes, process_handler is a live process until .stop is called.

It seems like the fix for this should really be either to remove that assumption or to write something like a proxy for the remote b2g process so it can be used like a local process.
When I say "the process isn't local" I mean that there exists a process on the device that could be used ("b2g") but the one that's actually used is the adb process that will start adb then exit.
Carried r+ and pushed to try: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=5c1466009e35

I won't take any further action until James' comments are addressed, though I'm not sure I have much useful input to add myself.
Attachment #8470931 - Attachment is obsolete: true
Attachment #8470952 - Flags: review+
I understand now that my original comment was wrong. As James points out, stop currently tries to kill the adb command that started the remote process. I suspect that fixing the model could be taken care of in a separate bug, and that we can temporarily workaround it to unblock bug 1039626.

I have also found that the remote process is somewhat resilient to being killed. I have found that first running stop before killing the remote process makes it much more obedient. I'll have an updated patch for review after some more testing.
Summary: [mozdevice] Make B2GDeviceRunner's stop method kill the remote process instead of the startup script → [mozdevice] Make B2GDeviceRunner's stop method kill the remote process instead of the initial adb process
Requesting review again due to the changes. Try run: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=b32634f64472
Attachment #8470952 - Attachment is obsolete: true
Attachment #8471571 - Flags: review?(ahalberstadt)
Yes, if you kill the process without running 'adb stop' on it first, the system will automatically try to restart it.
Comment on attachment 8471571 [details] [diff] [review]
Kill the correct b2g process on device. v1.2

Review of attachment 8471571 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me, r+ with one comment.

::: testing/mozbase/mozrunner/mozrunner/base/device.py
@@ +105,5 @@
> +                if not self.is_running() == remote_pid:
> +                    break
> +                time.sleep(1)
> +            else:
> +                print("timed out waiting for '%s' process to exit" % self.app_ctx.remote_process)

I think we should call BaseRunner.stop() at the end here so that process_handler gets set to None.
Attachment #8471571 - Flags: review?(ahalberstadt) → review+
Also have you tested this in the case that we are replacing the existing profile? CC'ing mdas just in case this breaks stuff for her.
(In reply to Andrew Halberstadt [:ahal] from comment #9)
> I think we should call BaseRunner.stop() at the end here so that
> process_handler gets set to None.

The process has already stopped, so including the BaseRunner.stop() pollutes the console with a message indicating that the process has already been killed. Perhaps in the case where we use ADB to launch a remote process we should set the process_handler to None after a successful start?

(In reply to Andrew Halberstadt [:ahal] from comment #10)
> Also have you tested this in the case that we are replacing the existing
> profile? CC'ing mdas just in case this breaks stuff for her.

The profile is recreated each time we call runner.start(). Could you expand on your question, as I'm not sure I understand it yet.

I also discovered from speaking to :jgraham and :dhylands that calling stop on the b2g process will kill it, so we're somewhat relying on a race condition here that the process is still available after calling stop so we can then kill it with a specified signal. As mentioned in comment 8 we are unable to simply kill the process as it will restart, which seems to be particularly disruptive during boot.
(In reply to Dave Hunt (:davehunt) from comment #11)
> The profile is recreated each time we call runner.start(). Could you expand
> on your question, as I'm not sure I understand it yet.

One of the motivating factors of implementing DeviceRunner the way it is was so that we could push a custom profile to the device and modify profiles.ini to point to it. My understanding was that you didn't need this feature for your work. But e.g the Gij tests that Malini is working on are heavily dependent on the ability to push customized profiles. I just wanted to make sure that we aren't breaking her work.
(I also depend on the profiles feature)
(In reply to Andrew Halberstadt [:ahal] from comment #12)
> (In reply to Dave Hunt (:davehunt) from comment #11)
> > The profile is recreated each time we call runner.start(). Could you expand
> > on your question, as I'm not sure I understand it yet.
> 
> One of the motivating factors of implementing DeviceRunner the way it is was
> so that we could push a custom profile to the device and modify profiles.ini
> to point to it. My understanding was that you didn't need this feature for
> your work. But e.g the Gij tests that Malini is working on are heavily
> dependent on the ability to push customized profiles. I just wanted to make
> sure that we aren't breaking her work.

Seems like it shouldn't affect it at all since we set the B2GDeviceRunner object's profile in start_runner (https://github.com/mozilla-b2g/gaia/blob/master/tests/python/runner-service/runner_service/handlers/runner.py#L125). Unless we destroy that object, it should push the custom profile.
This version of the patch now calls the BaseRunner.stop() method too as I'm no longer seeing the issues I previously saw with that. I'm also now storing if the runner has witnessed a crash, which makes it possible for me to write a test that causes a crash. Previously I was able to do this by directly checking for crashes in the test, but now the on_finish method is where the crash is being picked up so it's missed.

I'm still unsure about calling stop immediately followed by killing the process. While this has worked flawlessly for me I'm conscious that it would seem to be a race condition. See comment 11 for more details.
Attachment #8471571 - Attachment is obsolete: true
Attachment #8477310 - Flags: review?(ahalberstadt)
Attachment #8477310 - Flags: feedback?(james)
Comment on attachment 8477310 [details] [diff] [review]
Kill the correct b2g process on device. v1.3

I refreshed the wrong patch. Please hold.
Attachment #8477310 - Attachment is obsolete: true
Attachment #8477310 - Flags: review?(ahalberstadt)
Attachment #8477310 - Flags: feedback?(james)
Comment on attachment 8477324 [details] [diff] [review]
Kill the correct b2g process on device. v1.4

Review of attachment 8477324 [details] [diff] [review]:
-----------------------------------------------------------------

This will break crash stacks on timeout, otherwise it looks good.

::: testing/mozbase/mozrunner/mozrunner/base/device.py
@@ +98,5 @@
>  
> +    def stop(self, sig=None):
> +        try:
> +            remote_pid = self.is_running()
> +            self.app_ctx.stop_application()

Is this necessary? I think that stopping the application this way makes the 'sig' parameter useless (i.e we won't trigger the crashreporter on timeout). When testing locally, calling runner.on_timeout() should generate and process minidumps.

@@ +122,5 @@
>          if match:
>              self.last_test = match[-1]
>  
>      def on_timeout(self):
> +        self.stop()

You need to pass in sig=signal.SIGABRT here. Otherwise we won't get stacks if there is a timeout.
Attachment #8477324 - Flags: review?(ahalberstadt) → review-
(In reply to Andrew Halberstadt [:ahal] from comment #18)
> Comment on attachment 8477324 [details] [diff] [review]
> Kill the correct b2g process on device. v1.4
> 
> Review of attachment 8477324 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This will break crash stacks on timeout, otherwise it looks good.
> 
> ::: testing/mozbase/mozrunner/mozrunner/base/device.py
> @@ +98,5 @@
> >  
> > +    def stop(self, sig=None):
> > +        try:
> > +            remote_pid = self.is_running()
> > +            self.app_ctx.stop_application()
> 
> Is this necessary? I think that stopping the application this way makes the
> 'sig' parameter useless (i.e we won't trigger the crashreporter on timeout).
> When testing locally, calling runner.on_timeout() should generate and
> process minidumps.

Yes, see comment 8. If we don't first stop the b2g process then it will restart as soon as we kill it. This is a concern as mentioned in comment 11 and comment 15, but I'm not sure what else we should try.

> @@ +122,5 @@
> >          if match:
> >              self.last_test = match[-1]
> >  
> >      def on_timeout(self):
> > +        self.stop()
> 
> You need to pass in sig=signal.SIGABRT here. Otherwise we won't get stacks
> if there is a timeout.

I'll fix that and trigger a new try run. My last try run was very sad and I've yet to investigate why, so will try to reproduce the failures today on my Ubuntu VM.
Try: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=ce5d3337c0a2
Attachment #8477324 - Attachment is obsolete: true
Attachment #8477324 - Flags: feedback?(james)
Flags: needinfo?(dave.hunt)
Okay, I think I finally worked this out. The patch was causing us not to stop b2g on emulator at all and therefore not pick up the new profile, etc. This version of the patch works for me on device and emulator.

Try: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=0b878a3dd796
Attachment #8479876 - Attachment is obsolete: true
Attachment #8481593 - Flags: review?(ahalberstadt)
Flags: needinfo?(dave.hunt)
Comment on attachment 8481593 [details] [diff] [review]
Kill the correct b2g process on device. v1.6

Review of attachment 8481593 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for digging into this!
Attachment #8481593 - Flags: review?(ahalberstadt) → review+
Investigating potential regression in try results.
Flags: needinfo?(ahalberstadt)
Keywords: checkin-needed
https://tbpl.mozilla.org/php/getParsedLog.php?id=47239041&tree=Try&full=1#error0

Yep, this is a regression from the current patch and I see what is happening.

In mozprocess.wait() we join the outputThread here:
http://mxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#820

Normally this wait() function is called by the main thread and everything is fine. The "onTimeout" handlers however, are called from within the outThread here:
http://mxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#787

Because of this patch, the on_timeout handler in mozrunner now calls mozrunner.stop(), which then calls mozprocess.kill(), which then calls mozprocess.wait(). This results in the outThread trying to join with itself.

I think you could fix this by not calling BaseRunner.stop() anymore (I know I asked you to call it originally... sorry about that!). Alternatively we could fix mozprocess so that it won't try to join if we are in the outThread context.
Flags: needinfo?(ahalberstadt)
Comment on attachment 8486521 [details] [diff] [review]
Kill the correct b2g process on device. v1.7

Review of attachment 8486521 [details] [diff] [review]:
-----------------------------------------------------------------

If this fixes the intermittent, lgtm!
Attachment #8486521 - Flags: review?(ahalberstadt) → review+
https://hg.mozilla.org/mozilla-central/rev/55d2bb1cbe23
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: