Closed Bug 1506928 Opened Last year Closed 7 months ago

"mach raptor-test" doesn't handle KeyboardInterrupt and leaves Raptor Python process behind

Categories

(Testing :: Raptor, enhancement, P1)

Version 3
enhancement

Tracking

(firefox68 fixed)

RESOLVED FIXED
mozilla68
Tracking Status
firefox68 --- fixed

People

(Reporter: rwood, Assigned: whimboo)

References

Details

Attachments

(8 files, 1 obsolete file)

If you ctrl+c out of raptor tp6 early, the mitmproxy server will be left running. Then when running raptor again locally, it will error out because the mitmproxy server is already running. At that point you need to manually kill the leftover mitmdump process.

We should avoid this by:

Trap the ctrl+c and have raptor/mozharness kill the mitmproxy server before raptor finishes.

- or -

When raptor is starting mitmproxy playback, have it look for an existing mitmdump process and if it finds it, kill it first.
More specifically, this is the error we wish to avoid:

14:41:45     INFO -  Error starting proxy server: OSError(48, 'Address already in use')
Assignee: nobody → rwood
Status: NEW → ASSIGNED
Priority: -- → P2
Priority: P2 → P1

Hmmm I don't really understand the Raptor unit test (Windows only) failure here [1]. Importing psutil inside Raptor itself on all the platforms works / the patch is green; but not in the unit tests on Windows (linux64 unit tests are green).

:ahal, any ideas?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=440ea1f674d1014099a4a4ac7d508cce126af846&selectedJob=220053695

Flags: needinfo?(ahal)

I think psutil needs to be compiled on Windows. I'd highly recommend trying to avoid it if at all possible :).

Tbh, I'd recommend going the trap ctrl-c route anyway. Leaving processes lying around on the user's system is never good, even if it doesn't ultimately cause errors. You could either catch KeyboardInterrupt or use the builtin signal module to trap it globally no matter what (if you do this, be sure to have your handler call the original handler afterwards.. I can give you some pointers on this if you like).

Flags: needinfo?(ahal)

Thanks Andrew, ok I'll go the 'trap ctrl+c' route. Since this isn't as quick a fix as I was hoping, I need to put this on the back-burner for now.

Priority: P1 → P2
Attachment #9034486 - Attachment is obsolete: true

Also we should be reverting the Firefox policies that were set to turn on the proxy and import the mitmproxy CA cert, if someone ctrl+c's out of a test.

Summary: Prevent mitmproxy already running error → Capture ctrl+c in Raptor and clean-up appropriately

Haven't had a chance to get back to this; putting it up for grabs.

Assignee: rwood → nobody
Status: ASSIGNED → NEW
Duplicate of this bug: 1539103

The actual problem here is not located in Raptor itself but in its mach command, or mach directly. When I run testing/raptor/raptor/raptor.py directly it is all just fine.

Summary: Capture ctrl+c in Raptor and clean-up appropriately → "mach raptor-test" doesn't handle KeyboardInterrupt and leaves Raptor Python process behind

I will actually take a look here.

Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Priority: P2 → P1

So I got this working for Firefox desktop. There are two reasons why it was failing:

  1. The mach command runs Raptor through mozharness, and mozharness doesn't correctly handle keyboard interrupts in run_command().

Solution: Given the complexity of mozharness I would just vote to add a try/except block for KeyboardInterrupt around the subprocess run-time.

  1. Raptor itself gets invoked by mozharness via a subprocess, and if a keyboard interrupt happens mozharness will kill the Raptor process now. Given that Raptor doesn't have a signal handler set, the launched Firefox and webserver processes aren't taken down.

Solution: In case Raptor is called as subprocess and __main__ is executed, a signal handler has to be setup to handle SIGTERM and SIGKILL.

With my changes try looks still fine:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=96d4015d63c2337209b4

As of now I wasn't able yet to also get it working with mobile, and will also have to check with Chrome.

I need the fixes for bug 1541215 landed first.

Depends on: 1541215
Attachment #9055091 - Attachment description: Bug 1506928 - [raptor] Use signal handler for safe shutdown. r=rwood,tarek → Bug 1506928 - [raptor] Use signal handler for safe shutdown. r=rwood,tarek,davehunt

I have problem to get the application to quit on Android. While after a normal exit the geckoview_example process is gone, it is still present with my patch. I assume that stopping the process via mozdevice somehow doesn't work.

Attached patch whimboo-2.patchSplinter Review

This was the patch whimboo was working on which left the browser running after ctrl-c.

Attached patch whimboo-3.patchSplinter Review

You just need to catch the exception and stop the application. run_test_teardown wasn't sufficient. This worked for me in limited testing.

(In reply to Robert Wood [:rwood] (PTO - back 15-Apr) from comment #0)

If you ctrl+c out of raptor tp6 early, the mitmproxy server will be left
running. Then when running raptor again locally, it will error out because
the mitmproxy server is already running. At that point you need to manually
kill the leftover mitmdump process.

I'm not currently seeing the tests error out when mitmproxy is already running. Instead, the tests continue to run against the existing instance of mitmproxy. When looking in the mitmproxy.log I see the already running errors and 404s for the requests. Does the current patch address this, and ensure we have a new instance of mitmproxy each time?

(In reply to Bob Clary [:bc:] from comment #20)

You just need to catch the exception and stop the application. run_test_teardown wasn't sufficient. This worked for me in limited testing.

Sorry, I missed to add that line to the clean_up() method after resetting my local changes and re-applying the needed ones. Note that it is triggered from main() from within the finally block. That method currently force-closes the application on desktop, and as such I think should do the same for mobile.(In reply to Dave Hunt [:davehunt] [he/him] ⌚️UTC from comment #21)

I'm not currently seeing the tests error out when mitmproxy is already running. Instead, the tests continue to run against the existing instance of mitmproxy. When looking in the mitmproxy.log I see the already running errors and 404s for the requests. Does the current patch address this, and ensure we have a new instance of mitmproxy each time?

Yes, the control server is taken down with that patch. You will notice that also when running warm benchmark tests like speedometer on Android. It currently fails in force-stopping the application. Sadly only for warm tests, but not cold tests.

I added the following code around force-stopping the binary:

      try:
            self._logger.info("*** force stop")
            p = self.shell_output("am force-stop %s" % app_name,
                            timeout=timeout, root=root)
            self._logger.info("*** after force stop: %s" % p)
        except BaseException as e:
            self._logger.info("*** exc force stop: %s" % e)
            raise

When the Raptor tests get started I can see that a previously running geckoview_example binary is getting force-stopped:

09:04:57 INFO - raptor-main starting geckoview
09:04:57 INFO - adb *** in stop application
09:04:57 INFO - adb *** force stop
09:04:58 INFO - adb *** exit code for am force-stop org.mozilla.geckoview_example: 0
09:04:58 INFO - adb *** output:
09:04:58 INFO - adb shell_output: adb -s G070VM0990350952 wait-for-device shell am force-stop org.mozilla.geckoview_example; echo adb_returncode=$?, timeout: None, root: False, timedout: None, exitcode: 0, output:

But when the same method is called under a SIGINT condition it looks like that:

09:05:12 INFO - adb *** in stop application
09:05:12 INFO - adb *** force stop
09:05:12 WARNING - setting return code to -1
09:05:12 INFO - Killing logcat pid 6382.
09:05:12 CRITICAL - PERFHERDER_DATA was seen 0 times, expected 1.
09:05:12 INFO - Running post-action listener: _package_coverage_data
09:05:12 INFO - Running post-action listener: _resource_record_post_action
09:05:12 INFO - Running post-action listener: process_java_coverage_data
09:05:12 INFO - Running post-action listener: stop_device
09:05:12 INFO - Using adb 1.0.40
09:05:12 INFO - *** exit code for id: 0
09:05:12 INFO - *** output: uid=2000(shell) gid=2000(shell) groups=2000(shell),1004(input),1007(log),1011(adb),1015(sdcard_rw),1028(sdcard_r),3001(net_bt_admin),3002(net_bt),3003(inet),3006(net_bw_stats),3009(readproc) context=u:r:shell:s0

Not sure why id is called here instead of am force-stop.

When I move the code from main to the run_test_teardown method it works. I assume that it is a bit slow until we reach that point in the code, and Raptor got killed.

Also given that the browser application is getting started in the run_test() method, we should get rid of it in teardown.

The updated Phabricator review works for both desktop and mobile (cold and warm tests) now.

Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6ad35d665d4b
[mozharness] Handle KeyboardInterrupt in ScriptMixin. r=gbrown
https://hg.mozilla.org/integration/mozilla-inbound/rev/ab985d0cbffd
[raptor] Handle "MOZ_UPLOAD_DIR" as optional in Output class. r=rwood
https://hg.mozilla.org/integration/mozilla-inbound/rev/9ca5edb93d09
[raptor] Refactor arguments for constructors of Raptor classes. r=rwood
https://hg.mozilla.org/integration/mozilla-inbound/rev/120d7d8970e5
[raptor] Fix inheritance of methods in raptor classes. r=davehunt
https://hg.mozilla.org/integration/mozilla-inbound/rev/097b31242a5a
[raptor] "get()" defaults to None by default if key isn't found. r=davehunt
https://hg.mozilla.org/integration/mozilla-inbound/rev/98c7524647ae
[raptor] Use signal handler for safe shutdown. r=rwood,tarek,davehunt

I decided to push the patch series to inbound given that autoland is closed kinda often the last 24h, and phabricator was not able to land the patches due to rebase conflicts.

The problem here is that it hangs after starting geckoview_example app, and waiting for the webextension to be ready:

14:45:45 INFO - adb shell_output: adb -s HT8641A00042 wait-for-device shell am start -W -n org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity -a android.intent.action.Main --ez use_multiprocess True --es args "-profile /sdcard/raptor/profile --es env0 LOG_VERBOSE=1 --es env1 R_LOG_LEVEL=6" -d about:blank; echo adb_returncode=$?, timeout: None, root: False, timedout: None, exitcode: 0, output: Starting: Intent { act=android.intent.action.Main dat=about:blank cmp=org.mozilla.geckoview_example/.GeckoViewActivity (has extras) }
14:45:45 INFO - Status: ok
14:45:45 INFO - Activity: org.mozilla.geckoview_example/.GeckoViewActivity
14:45:45 INFO - ThisTime: 223
14:45:45 INFO - TotalTime: 223
14:45:45 INFO - WaitTime: 233
14:45:45 INFO - Complete
15:01:15 INFO - raptor-main application timed out after 933 seconds
15:01:15 INFO - adb shell_output: adb -s HT8641A00042 wait-for-device shell am force-stop org.mozilla.geckoview_example; echo adb_returncode=$?, timeout: None, root: False, timedout: None, exitcode: 0, output:

This is an Android only issue as the additional triggered jobs on inbound show:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&resultStatus=pending%2Crunning%2Csuccess%2Ctestfailed%2Cbusted%2Cexception&revision=98c7524647ae009fc69e434ef8dddd2c3fb72e81&searchStr=raptor%2Ctp6

Flags: needinfo?(hskupin)

I back-filled the landing changeset on inbound with additional Raptor jobs to see which jobs are actually affected. As it looks like those are the tp6m-1, and tp6m-2 ones. Those have multiple sites included, means the browser gets restarted in between those.

Given that my patches didn't apply anymore I had to rebase and thought to push again to try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fb5d39f61949f5447aa38dd756e2eb072af975ae

Failures are still present. Now I'm going to push only parts of the patch series to try, to see which changeset actually caused the regression. I suspect the following two changesets beside the SIGINT work:

  1. Refactor arguments for constructors of Raptor classes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=740a85e0b1e656e47950a54352502f0eaa97bb15

  2. Fix inheritance of methods in raptor classes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=50a674e027579227bae6b6a7d9eac5d25be430d4

All changes excluding the very last one for the signal handling additions work:

pass: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0f83c185a88a353f1f47b3525f5a975daffc2082&selectedJob=242126668
fail: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f64bdbecdd25951006703eca82dbb56d0a332447&selectedJob=242129456

I spotted one problem where we now force closing the app for warm tests which shouldn't happen, and which seems to be the underlying problem here.

So I found the problem of the hang. As noted above it only happens when the browser gets started a second time. Running it locally with an attached Android device, I was seeing that the browser came up successfully, and no failures were reported from the Raptor web extension.

As such I assumed that maybe the reverse proxy connection isn't working, and as such no connection can be made to the device by Raptor. Commenting out the appropriate line proves that.

An updated fix should be up soon.

Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6f64c339a2a8
[mozharness] Handle KeyboardInterrupt in ScriptMixin. r=gbrown
https://hg.mozilla.org/integration/autoland/rev/ec3fe3df0985
[raptor] Handle "MOZ_UPLOAD_DIR" as optional in Output class. r=rwood
https://hg.mozilla.org/integration/autoland/rev/97795bbf4e35
[raptor] Refactor arguments for constructors of Raptor classes. r=rwood
https://hg.mozilla.org/integration/autoland/rev/dbd92ad84948
[raptor] Fix inheritance of methods in raptor classes. r=davehunt
https://hg.mozilla.org/integration/autoland/rev/34a50e2bcaeb
[raptor] "get()" defaults to None by default if key isn't found. r=davehunt
https://hg.mozilla.org/integration/autoland/rev/b2677f62c9d0
[raptor] Use signal handler for safe shutdown. r=rwood,tarek
See Also: → 1546523
Blocks: 1546523
You need to log in before you can comment on or make changes to this bug.