Closed Bug 1329528 Opened 3 years ago Closed 3 years ago

OSError: [Errno 1] Operation not permitted exception when killing a zombie process.

Categories

(Testing :: Mozbase, defect)

Version 3
defect
Not set

Tracking

(firefox52 wontfix, firefox-esr52 wontfix, firefox53 fixed, firefox54 fixed)

RESOLVED FIXED
mozilla54

People

(Reporter: gw, Assigned: sgiles)

Attachments

(1 file)

On OSX, calling the kill() method on the Process class can result in an unhandled OSError exception.

This occurs when the process in question has exited and is in the zombie state.

An example stack trace from [1] is:

Tests with unexpected results:
  ▶ ERROR [expected OK] /html/browsers/windows/noreferrer.html
  └   → Traceback (most recent call last):
  File "/Users/servo/buildbot/slave/mac-rel-wpt1/build/tests/wpt/harness/wptrunner/executors/base.py", line 149, in run_test
    result = self.do_test(test)
  File "/Users/servo/buildbot/slave/mac-rel-wpt1/build/tests/wpt/harness/wptrunner/executors/executorservo.py", line 143, in do_test
    self.proc.kill()
  File "/Users/servo/buildbot/slave/mac-rel-wpt1/build/python/_virtualenv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 766, in kill
    self.proc.kill(sig=sig)
  File "/Users/servo/buildbot/slave/mac-rel-wpt1/build/python/_virtualenv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 172, in kill
    send_sig(signal.SIGTERM)
  File "/Users/servo/buildbot/slave/mac-rel-wpt1/build/python/_virtualenv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 159, in send_sig
    os.killpg(pid, sig)
OSError: [Errno 1] Operation not permitted

When a process (group) is in zombie state, it will remain that way until waitpid() is called on it. I tested catching the exception when os.killpg() is called, and then calling os.waitpid(pid). This appears to fix the problem locally for my test case, but I'm not sure if this is the correct fix. 

[1] https://github.com/servo/servo/pull/14818
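
For illustration, the workaround described above (catch the EPERM from os.killpg(), reap the group with os.waitpid(), then retry once) might look roughly like the sketch below. The helper name kill_process_group and its return values are hypothetical and are not the mozprocess API; the sketch assumes Python 3 on a POSIX system and that pid is the leader of a process group whose members are children of the caller.

```python
import errno
import os
import signal


def kill_process_group(pid, sig=signal.SIGTERM, retries=0):
    """Send sig to the process group led by pid (hypothetical helper).

    Returns True if the signal was delivered, False if the group no
    longer exists (for example because it was just reaped).
    """
    try:
        os.killpg(pid, sig)
        return True
    except OSError as e:
        if e.errno == errno.ESRCH:
            # No such process group: it has already gone away.
            return False
        if e.errno != errno.EPERM or retries >= 1:
            # Any other error, or a second EPERM, is a real failure.
            raise

    # EPERM: on OS X this can mean the group only holds zombies.
    # A negative pid makes waitpid() wait on the whole process group.
    try:
        os.waitpid(-pid, 0)
    except OSError as e:
        if e.errno != errno.ECHILD:
            raise

    # Retry once now that the zombie (if any) has been reaped.
    return kill_process_group(pid, sig, retries + 1)
```

Note that the blocking os.waitpid() call is only reached after the signal has been attempted, so in the zombie case it should return immediately. If the EPERM has a different cause and the group is still running, this sketch would block there.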
This could be the reason for various test failures we see in Marionette restart tests.
Assignee: nobody → sgiles
There's a reproducible test case here (including the fix): https://github.com/servo/servo/pull/15579#issuecomment-280215792

Will add a patch for mozprocess tomorrow.
It would've been great to get a test case for this, but Python multiprocessing is too good at making sure you don't end up with Zombies.. :P
Comment on attachment 8837939 [details]
Bug 1329528 - Reap zombie processes on Mac OS if killing the process group initially fails with EPERM;

https://reviewboard.mozilla.org/r/112940/#review114466

::: testing/mozbase/mozprocess/mozprocess/processhandler.py:169
(Diff revision 1)
> +                            # before continuing
> +                            # Note: A negative pid refers to the entire process
> +                            # group
> +                            if retries < 1 and getattr(e, "errno", None) == errno.EPERM:
> +                                try:
> +                                    os.waitpid(-pid, 0)

What will happen if we do this call by default before sending killpg()? Would that cause a freeze?

I just wonder if we could get rid of all the extra code the patch will add.
(In reply to sgiles from comment #4)
> It would've been great to get a test case for this, but Python
> multiprocessing is too good at making sure you don't end up with Zombies.. :P

We have at least one forking process written in C in the tests folder of mozprocess. Maybe you could just add another one?
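
As a rough idea of what such a helper would need to do, a zombie can also be produced from Python with a bare os.fork(), as long as the parent does not wait on the child. The function name make_zombie is hypothetical, and the sketch only shows the zombie life cycle on a POSIX system; it is not a reproduction of the OS X EPERM, which would additionally need the child to be in its own process group.

```python
import os


def make_zombie():
    """Fork a child that exits at once and is deliberately not waited on.

    Returns the child's pid. The child stays in the process table as a
    zombie until the caller reaps it with os.waitpid(pid, 0).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately, skipping Python cleanup handlers.
        os._exit(0)

    # Parent: close our copy of the write end, then block until the
    # child's copy is closed, which happens when the child exits.
    os.close(write_fd)
    os.read(read_fd, 1)  # returns b"" at end of file
    os.close(read_fd)
    return pid
```

After make_zombie() returns, os.kill(pid, 0) still succeeds because the entry has not been reaped; once os.waitpid(pid, 0) has been called, the same os.kill() call is expected to fail with ESRCH.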
Comment on attachment 8837939 [details]
Bug 1329528 - Reap zombie processes on Mac OS if killing the process group initially fails with EPERM;

https://reviewboard.mozilla.org/r/112940/#review114520

Thanks for digging into this! Looks reasonable to me
Attachment #8837939 - Flags: review?(ahalberstadt) → review+
Comment on attachment 8837939 [details]
Bug 1329528 - Reap zombie processes on Mac OS if killing the process group initially fails with EPERM;

https://reviewboard.mozilla.org/r/112940/#review114466

> What will happen if we do this call by default before sending killpg()? Would that cause a freeze?
> 
> I just wonder if we could get rid of all the extra code the patch will add.

Yep, we need to send the kill signal first, otherwise we'll end up waiting on non-zombie processes and potentially hanging.
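
The ordering concern can be sketched as follows: a blocking waitpid() on a child that is still running does not return until that child exits, which is why the signal has to go first. The helper names would_block and kill_then_reap are hypothetical, and the sketch uses a single child and os.kill() rather than a process group, assuming Python 3 on a POSIX system.

```python
import os
import signal


def would_block(pid):
    """Return True if a blocking waitpid() on child pid would hang.

    With WNOHANG, waitpid() returns (0, 0) while the child is still
    running. If the child has already exited, this call reaps it.
    """
    return os.waitpid(pid, os.WNOHANG) == (0, 0)


def kill_then_reap(pid, sig=signal.SIGKILL):
    """Signal the child first, then wait for it; returns the raw status."""
    os.kill(pid, sig)
    _, status = os.waitpid(pid, 0)
    return status
```

For a child that is still sleeping, would_block(pid) is expected to return True, while kill_then_reap(pid) returns promptly with a status for which os.WIFSIGNALED() is true.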
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2b4830caedd4
Reap zombie processes on Mac OS if killing the process group initially fails with EPERM; r=ahal
https://hg.mozilla.org/mozilla-central/rev/2b4830caedd4
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
This issue also affects branches down to the current beta (52.0). We might want to wait a few days and then request an uplift of the patch. It would be great to see it fixed in the next ESR release as well.
No known fallout I'm aware of so far. Going to give this a go on Aurora for a bit before deciding on Beta.
https://hg.mozilla.org/releases/mozilla-aurora/rev/63baf28e129e
Bah, this needs to be rebased around bug 1309060 if we want to uplift this to 52.
Flags: needinfo?(sgiles)
Would still consider a rebased patch for ESR52 if it's practical to do so, but it's too late for Fx52 at this point.
Flags: needinfo?(sgiles)