Closed Bug 947248 Opened 11 years ago Closed 10 years ago

[mozprocess] Process.poll() doesn't account the i/o completion port handle and returns 0 even when the process hasn't fully stopped yet

Categories

(Testing :: Mozbase, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

Attachments

(1 file, 1 obsolete file)

As what I have seen on bug 944684 we return too early for an ended process. That means on WIndows the application can still have some handles to files open, which is causing "Access denied" errors or similar. As of now we retry to delete those files but that should not be the final solution given that other client code can also fail if it depends on the removal of those files.

Not sure where in detail the problem might be located in mozprocess but I think the handling of I/O completion ports could be related here.

A workaround would be to add a small sleep of 100ms after the self.proc.poll(), which in my case also fixed the problem. But that would be not that reliable given the different hardware specs. So a proper solution would be great to have.
Summary: [mozprocess] On Windows we → [mozprocess] On Windows we are returning too early from a finished process, which can cause 'Access denied' errors
This is not only an issue on Windows. I'm seeing something similar on OS X for the kill() method. When it gets called via Mozrunner, it returns -9 as expected. But the process is still running and I cannot create a new jsbridge connection because the old one from the former process is still around. Doing a 'ps' for Firefox I can see a zombie process.
OS: Windows 7 → All
Summary: [mozprocess] On Windows we are returning too early from a finished process, which can cause 'Access denied' errors → [mozprocess] We are returning too early from a finished process, which can cause 'Access denied' errors
It probably makes sense to block on the managed process actually disappearing inside kill(). Note that there are some cases where a process can not be killed, in which case we should probably log/print some kind of warning before proceeding.

http://unix.stackexchange.com/questions/5642/what-if-kill-9-does-not-work
Here the output from a testrun on OS X:

First start of Mozmill so a socket is available and we have a process with the pid 12799:

*** socket available ***
  501 12799 12793   0  4:09PM ttys000    0:00.57 /Applications/Firefox/Nightly.app/Contents/MacOS/firefox -foreground -profile /var/folders/wd/zmy4z7xn7wd7sjq90z1y52f80000gn/T/tmplfnjR_.mozrunner

Tests gets run and we kill the application via self.process_handler.kill() in mozrunner:

TEST-START | test1.js | setupModule
TEST-START | test1.js | testNewPageLoaded
Timeout: bridge.execFunction("41a1a023-882e-11e3-876e-109add6a95a5", bridge.registry["{063c6ce1-a697-a74d-b46a-26c685a4b018}"]["runTestFile"], ["/Volumes/data/code/mozmill/test/test1.js", null])

**** Lets kill the application...
Killed runner: -9

Right after we start the application again and accidentally detect that the socket is available. But it is for the old process which will be dead within a second and we fail in creating the bridge.

**** Restarting application after kill
*** socket available ***
  501 12799 12793   0  4:09PM ttys000    0:00.00 (firefox)
  501 12805 12793   0  4:10PM ttys000    0:00.00 /Applications/Firefox/Nightly.app/Contents/MacOS/firefox -foreground -profile /var/folders/wd/zmy4z7xn7wd7sjq90z1y52f80000gn/T/tmplfnjR_.mozrunner

TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed

I will have a look at this bug in a bit.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Blocks: 794020
So that's interesting. As talked on IRC with Andrew we usually use process groups on POSIX systems. I was able to get this fixed for killpg() but so far not for Popen.wait() which still returns too early. Not sure if that might actually be a platform bug. I'm able to workaround that by adding a small sleep of 1s as mentioned in my initial comment. I'm not sure why wait() in such a case returns immediately.
Blocks: 970846
Blocks: 970595
I haven't analyzed all the process related code for Windows yet, but what I can say is that poll() returns 0 way too early while the process is not fully gone yet. Also the outThread is not active anymore, whereby I'm not sure if that one is in use on Windows. I will dive deeper into it.
So the solution here is easy. As of now we do not overload the poll() method in the Process class. That means when we call poll() on the process_handler, it will call poll() directly from the Popen class. But that class doesn't know anything about our i/o completion port stuff. So calling it will directly return 0, even when we still have an i/o handle.

What we have to do here is simply to overload the poll() method and check for the handle first. If that is not existent anymore, we can call Popen.poll() for the returncode. Otherwise we have to return None to indicate that the process has not been finished yet.

Patch upcoming soon. Not sure if a test will work given that all tests for mozprocess are disabled on Windows. But I will try.
Summary: [mozprocess] We are returning too early from a finished process, which can cause 'Access denied' errors → [mozprocess] Process.poll() doesn't account the i/o completion port handle and returns 0 even when the process hasn't fully stopped yet
Attached patch Patch v1 (obsolete) — Splinter Review
Patch which fixes the problem. It's not fully tested yet. So I will do it now across platforms. I will also check which of our mozprocess tests we can re-enable now.
Attached patch Patch v1.1Splinter Review
Updated patch given that I exported it on my Windows machine with an older version of mozbase, so that the line numbers were not in sync.

I have tested this and it works like a charm. No longer issues with access denied issues on Windows during our Mozmill tests.
Attachment #8375026 - Attachment is obsolete: true
Attachment #8375057 - Flags: review?(ahalberstadt)
I think that re-enabling the disabled tests on Windows we should do on the bug which has already been filed a while back.
Comment on attachment 8375057 [details] [diff] [review]
Patch v1.1

Review of attachment 8375057 [details] [diff] [review]:
-----------------------------------------------------------------

Nice small change :)
Attachment #8375057 - Flags: review?(ahalberstadt) → review+
https://github.com/mozilla/mozbase/commit/f9ae3301c2f95e9bd03bfb7f73d129ce96c37dcc

I will keep the in-testsuite flag set, so we can add tests later when they all are re-enabled for Windows.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: