Closed Bug 1276220 Opened 8 years ago Closed 8 years ago

Intermittent Automation Error: Could not terminate process - OSError: [Errno 1] Operation not permitted | 'firefox_ui_harness/cli_update.py'

Categories

(Testing :: Firefox UI Tests, defect)

Version: 49 Branch
Hardware: All
OS: macOS
Type: defect
Priority: Not set
Severity: critical

Tracking

(firefox48 unaffected, firefox49 fixed, firefox50 fixed)

RESOLVED FIXED

People

(Reporter: whimboo, Unassigned)

References

Details

(Keywords: hang, intermittent-failure, regression)

Attachments

(1 obsolete file)

Since yesterday a couple of our update tests on OS X have been busted:

Automation Error: mozprocess timed out after 300 seconds running ['/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/bin/python', '/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/firefox_ui_harness/cli_update.py', '--binary', '/Users/mozauto/jenkins/workspace/mozilla-central_update/build/application/FirefoxNightly.app/Contents/MacOS/firefox', '--address', 'localhost:2828', '--server-root', '/Users/mozauto/jenkins
timed out after 300 seconds of no output
Return code: -9 

The return code looks like a fatal error or crash (a negative return code means the process was terminated by that signal, so -9 is SIGKILL), but mozcrash doesn't detect anything.
This is most likely fallout from Stephen's patch on bug 394984. I can reproduce it locally too with the following command:

./mach firefox-ui-tests --binary %path_to_build_from_May_26_or_25%
Flags: needinfo?(spohl.mozilla.bugs)
Keywords: hang, regression
Maybe the commit https://hg.mozilla.org/mozilla-central/rev/a468fe7af937 caused it, given that it seems to change how child processes are launched. I suspect that mozprocess loses the pid here and cannot connect to Firefox anymore.
(In reply to Henrik Skupin (:whimboo) from comment #1)
> ./mach firefox-ui-tests --binary %path_to_build_from_May_26_or_25%

I'm confused by this timeline. If I'm reading the changelog properly, bug 394984 didn't land in time for the 5/25 nightly. Could you confirm?
Flags: needinfo?(spohl.mozilla.bugs) → needinfo?(hskupin)
Updates are broken for us with builds starting May 25th. The nightly build from that day was based on:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=d6d4e8417d2fd71fdf47c319b7a217f6ace9d5a5

The pushlog shows that your patches are part of that nightly:
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=d6d4e8417d2fd71fdf47c319b7a217f6ace9d5a5
Flags: needinfo?(hskupin)
Attachment #8757588 - Attachment is obsolete: true
Attachment #8757588 - Attachment is patch: false
Attempt to get this starrable.
Summary: Automation Error: mozprocess timed out after 300 seconds running ['python', 'firefox_ui_harness/cli_update.py'] → Intermittent Automation Error: mozprocess timed out after 300 seconds running | 'firefox_ui_harness/cli_update.py'
Stephen, I'm not familiar with process handling on OS X, so the following is just speculation... With the change linked below we seem to change the way child processes are started. Do I see it correctly that we no longer use the POSIX methods for this but some OS X-specific launcher tools?

https://hg.mozilla.org/integration/mozilla-inbound/rev/a468fe7af93708141cbbc201890909adf4acad83

As of now mozprocess is only able to follow processes if they are in the same process group. To kill the remaining open Firefox process it uses os.killpg(). I wonder if the above change breaks the new process out of this process group and leaves us with an independent process.
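
A minimal sketch of the mechanism described above (not mozprocess's actual code): if the child is started as its own process-group leader and all of its descendants stay in that group, a single os.killpg() call reaches them all.

# Sketch only: process-group based cleanup, roughly what mozprocess relies on.
import os
import signal
import subprocess

# Start the child as the leader of a new process group (pgid == pid).
proc = subprocess.Popen(['sleep', '60'], preexec_fn=os.setpgrp)

# As long as every descendant stays in that group, one killpg() cleans up all of them.
os.killpg(proc.pid, signal.SIGTERM)
proc.wait()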
Flags: needinfo?(spohl.mozilla.bugs)
I think that this is the problem. I tried the restart() command with one of our restart tests, and as you can see the process group always stays the same across the three restarts:

COMM               PID  PPID  PGID
$ ps xao comm,pid,ppid,pgid | grep Nightly
./FirefoxNightly  9624  9607  9624
$ ps xao comm,pid,ppid,pgid | grep Nightly
./FirefoxNightly  9629     1  9624
$ ps xao comm,pid,ppid,pgid | grep Nightly
./FirefoxNightly  9633     1  9624
$ ps xao comm,pid,ppid,pgid | grep Nightly
./FirefoxNightly  9636     1  9624

The following now shows the new behavior for update tests:

$ ps xao comm,pid,ppid,pgid | grep /var
/var/folders/z5/  9661  9644  9661        (first instance to apply some prefs)
$ ps xao comm,pid,ppid,pgid | grep /var
/var/folders/z5/  9671     1  9661        (after first restart to download and apply updates)
$ ps xao comm,pid,ppid,pgid | grep /var
/var/folders/z5/  9678     1  9678        (after application has been upgraded)

Because the process group id changes, we lose track of the underlying process group.
I'm now confident that this is clearly a regression from the Mac updater changes on bug 394984.
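
For illustration only (this is not harness code): the escape could be detected from Python with os.getpgid(), using the pids from the ps listing above as example values.

# Sketch only: check whether a Firefox pid still belongs to the tracked process group.
import os

def escaped_group(pid, tracked_pgid):
    try:
        return os.getpgid(pid) != tracked_pgid
    except OSError:
        # The process is already gone.
        return True

# Example values from the ps output above: the harness tracks group 9661, but after
# the upgrade the new binary runs as pid 9678 in its own group 9678.
print(escaped_group(9678, 9661))  # True -> os.killpg(9661, ...) no longer reaches it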
This change may contribute to fixing bugs such as bug 1258515. We should make mozprocess compatible with this change. We still use posix_spawnp in other places, which might be why you're not seeing the same behavior for regular restart tests. However, we may change this as well to further reduce problems like bug 1258515.
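
For illustration only (assuming Python 3.8+; this is not the code path used by the updater or the harness): a child started via posix_spawnp() stays in the parent's process group by default, which matches the restart tests above where the pgid remains 9624.

# Sketch only: posix_spawnp() children inherit the parent's process group unless a
# new group is explicitly requested.
import os

pid = os.posix_spawnp('sleep', ['sleep', '1'], dict(os.environ))
print(os.getpgid(pid) == os.getpgid(0))  # True: same process group as the parent
os.waitpid(pid, 0)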
Flags: needinfo?(spohl.mozilla.bugs)
So that means escaping the process group under those conditions is expected and necessary to fix issues like the one you laid out above? If not, why can't we make sure to stay inside it?
Depends on: 1176758
(In reply to Henrik Skupin (:whimboo) from comment #13)
> So that means escaping the process group under those conditions is expected
> and necessary to fix the issues like the one you layed out above?

Correct.
I see. OK, so as a first step I want to get rid of the 300s timeout, so that we get a better error message in automation. I will work on that via bug 1276886.
Depends on: 1276886
With the patch on bug 1276886 that landed over the weekend, the nightly builds (starting maybe Sunday) should at least no longer reproduce the hang after the update. I'm now working on bug 1176758 so that we can track the process even outside the known process group.
 05:57:17     INFO -  Could not terminate process: 57921
 05:57:17    ERROR - Failure during execution of the update test.
 05:57:17    ERROR - Traceback (most recent call last):
 05:57:17    ERROR - 
 05:57:17    ERROR -   File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/firefox_ui_harness/runners/update.py", line 60, in _run_tests
 05:57:17    ERROR -     FirefoxUITestRunner.run_tests(self, tests)
 05:57:17    ERROR - 
 05:57:17    ERROR -   File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/marionette/runner/base.py", line 827, in run_tests
 05:57:17    ERROR -     self._print_summary(tests)
 05:57:17    ERROR - 
 05:57:17    ERROR -   File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/marionette/runner/base.py", line 859, in _print_summary
 05:57:17    ERROR -     self.marionette.instance.close()
 05:57:17    ERROR - 
 05:57:17    ERROR -   File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/marionette_driver/geckoinstance.py", line 150, in close
 05:57:17    ERROR -     self.runner.stop()
 05:57:17    ERROR - 
 05:57:17    ERROR -   File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/mozrunner/base/runner.py", line 165, in stop
 05:57:17    ERROR -     self.process_handler.kill(sig=sig)
 05:57:17    ERROR - 
 05:57:17    ERROR -   File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 743, in kill
 05:57:17    ERROR -     self.proc.kill(sig=sig)
 05:57:17    ERROR - 
 05:57:17    ERROR -   File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 164, in kill
 05:57:17    ERROR -     send_sig(signal.SIGTERM)
 05:57:17    ERROR - 
 05:57:17    ERROR -   File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 151, in send_sig
 05:57:17    ERROR -     os.killpg(self.pid, sig)
 05:57:17    ERROR - 
 05:57:17    ERROR - OSError: [Errno 1] Operation not permitted
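
For reference, a short sketch of what the errno values from os.killpg() mean here; this follows POSIX kill()/killpg() semantics and is not taken from mozprocess. EPERM (Errno 1) means a process in the target group exists but we are not allowed to signal it, whereas ESRCH would mean the group is already gone.

# Sketch only: distinguishing the os.killpg() failure modes seen in the log above.
import errno
import os
import signal

def kill_group(pgid):
    try:
        os.killpg(pgid, signal.SIGTERM)
    except OSError as e:
        if e.errno == errno.ESRCH:
            return 'process group already gone'
        if e.errno == errno.EPERM:
            return 'not permitted to signal this group'  # the failure in the traceback
        raise
    return 'signalled'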
Summary: Intermittent Automation Error: mozprocess timed out after 300 seconds running | 'firefox_ui_harness/cli_update.py' → Intermittent Automation Error: Could not terminate process - OSError: [Errno 1] Operation not permitted | 'firefox_ui_harness/cli_update.py'
The mozprocess process tracking is not easy to solve, so I would propose we do a workaround patch in Marionette for now. It would mean we pass the current process id via the capabilities to the client and update the pid of the runner accordingly.
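
A rough sketch of what such a workaround could look like (this is not the patch that landed; the capability name 'processId' and the helper update_tracked_pid() are assumptions for illustration).

# Sketch only: let the client learn the real Firefox pid from the capabilities and
# retarget the runner so a later stop()/kill() reaches the right process.
from marionette_driver.marionette import Marionette

client = Marionette(host='localhost', port=2828)
client.start_session()

pid = client.session_capabilities.get('processId')  # capability name is an assumption

def update_tracked_pid(new_pid):
    # Hypothetical helper: the exact attribute path into the mozrunner/mozprocess
    # handler is an assumption based on the traceback above.
    client.instance.runner.process_handler.pid = new_pid

if pid is not None:
    update_tracked_pid(int(pid))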
Depends on: 1281750
This should be fixed for the nightly builds today by my patch on bug 1282570.
Status: NEW → RESOLVED
Closed: 8 years ago
Depends on: 1282570
Resolution: --- → FIXED
Does this only run on nightly or does the fix need uplift?
Flags: needinfo?(hskupin)
The uplift has been handled via bug 1282570 and was done on July 12th.
Flags: needinfo?(hskupin)
Depends on: 1329528