Closed Bug 1274584 Opened 8 years ago Closed 8 years ago

[mozprocess] OSError: IOCompletion Port failed to signal process shutdown | after dom/workers/test/serviceworkers/browser_force_refresh.js on windows

Categories

(Testing :: Mozbase, defect)

All
Windows
defect
Not set
normal

Tracking

(firefox49 fixed)

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed

People

(Reporter: ahal, Assigned: ahal)

References

Details

Attachments

(2 files)

Note, this is *not* an intermittent, so I'm filing this separately from bug 1152372, though the error is the same and I suspect a fix here will also fix that.

This is permafail on certain windows browser-chrome jobs (whichever chunk contains dom/workers/test/serviceworkers/browser_force_refresh.js). But for some reason I can't fathom, mozharness is not picking this error up! The error seems to be mostly harmless, as it happens after the tests have run. Due to runByDir, the next batch of tests starts right back up after it.

We discovered this error because of Paul's patch in bug 1261194 which switches the DesktopOutputParser for the StructuredOutputParser. The latter *does* correctly flag this traceback as an error.

We'll either need to fix this, or find a way to get the StructuredOutputParser to ignore it. I'd much prefer fixing this though :).
Note -  
I found function checkForZombie in mochitest
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#1845
Which means mochitest excepts hung processes and defines a way to cleanup.
If it is a convention to ignore post test hung process I'll give it a try to catch exception.
So the problem is simply that mozprocess is timing out while waiting for the child process to terminate:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#486

But the reality contradicts the comment from the above link. The comment suggests that if that happens, we will force kill the process. But instead we just raise OSError. I checked, and the traceback there is definitely just a Queue timeout. So maybe the fix is to force kill the process if that happens rather than raise OSError?
So mozprocess used to just warn about that error and return. Then this commit (with a hilariously bad commit message) started making it raise:
https://github.com/mozilla/mozbase-deprecated/commit/335c9724a189efcf2569b0ae384fb2921f5ff41c#diff-0c06f1ccd459838a7add9b23d5aae447R382

I suspect, though of course I don't know, that the rationale was we could force downstream consumers to deal with this rather than potentially leave zombie processes lying around.

I think we have two options:

1) Go with Paul's proposed fix from comment 1 and mark this WONTFIX (but at least improve the message to make it clear what this error means)

2) Attempt to hard kill the process when this happens like this comment suggests we do:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#487

Jgriffin, not that I expect you'll remember, but you actually wrote the above comment (digging way back in github history). Do you have any vague recollections on what is supposed to happen here? There is a discrepency between that comment and what we actually do.
Flags: needinfo?(jgriffin)
I think the theory was that _cleanup would kill the processes automatically, but looking at the code and at the Windows API documentation, it seems that doesn't happen unless JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE is set on the job, which isn't happening in this code.

In any case, I'm in favor of doing our best not to leave zombie processes around.
Flags: needinfo?(jgriffin)
As of python 2.7, os.kill() works on windows. I wonder if the fix is as simple as:

except:
    os.kill(self.pid)

I don't think we're using this anywhere with 2.6 anyway..
(In reply to Andrew Halberstadt [:ahal] from comment #5)
> As of python 2.7, os.kill() works on windows. I wonder if the fix is as
> simple as:
> 
> except:
>     os.kill(self.pid)
> 
> I don't think we're using this anywhere with 2.6 anyway..

It may be good enough; worth experimenting a bit with anyway.
I think I see what's happening but there's still lots I don't know. For context, there is a thread that monitors windows IO Completion ports (herein called IOC thread), and signals to the MainThread once it notices all child processes have been shutdown. When something goes wrong, and the child processes don't shutdown (after ~3 minutes), it will attempt to force kill them and abort via this call to self.kill():
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#521

However, that kill() function has a call to self.wait() in it which means the IOC thread will go there and happily block waiting for a message from itself that of course will never arrive:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#488

So we are technically cleaning up all the child processes properly (I think), but we go down the wrong code path afterwards and block for FINISHED a second time. I still have two questions I don't know the answer to, but that aren't strictly necessary to fix this bug:

1) Why is the IO Completion port failing to terminate the child processes in the first place? It has something to do with dom/workers/test/serviceworkers/browser_force_refresh.js, we should probably file a separate bug (even though it is probably harmless)

2) How does the MainThread block on self._procmngrthread? I verified that we never get into that _wait() method until the IOC thread receives JOB_OBJECT_MSG_EXIT_PROCESS. But I don't see how that triggers the MainThread to go into _wait(). Oh well.

I'll get a patch up next week after I do some more testing.
We can set MOZPROCESS_DEBUG to help debug windows process code. However on try
it is unreadable as there are multiple things using mozprocess, and each process
has multiple threads. It's impossible to tell which log message comes from where.
This improves the debug logs a bit by always specifying the PID and thread name.

There are a few other drive-by cleanups in this thread. The only one of note is
removing a python 2.5 only code path.

Review commit: https://reviewboard.mozilla.org/r/56338/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/56338/
Attachment #8757988 - Flags: review?(jgriffin)
Attachment #8757989 - Flags: review?(jgriffin)
Sometimes the IO completion port doesn't shutdown child processes. When this happens,
mozprocess will attempt to force kill the child processes manually. However, there is
a bug here which causes the OSError to get raised.

Although this fixes that bug, the original issue(s) which prevented the IOC port
from signaling shutdown remain and are still undiagnosed.

Review commit: https://reviewboard.mozilla.org/r/56340/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/56340/
Assignee: nobody → ahalberstadt
Status: NEW → ASSIGNED
Comment on attachment 8757988 [details]
MozReview Request: Bug 1274584 - [mozprocess] Improve debugging messages on windows, r?jgriffin

https://reviewboard.mozilla.org/r/56338/#review53382
Attachment #8757988 - Flags: review?(jgriffin) → review+
Comment on attachment 8757989 [details]
MozReview Request: Bug 1274584 - [mozprocess] Fix IO Completion Port failed to signal process shutdown, r?jgriffin

https://reviewboard.mozilla.org/r/56340/#review53386

I think this makes sense!
Attachment #8757989 - Flags: review?(jgriffin) → review+
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fe37010e2b84
[mozprocess] Improve debugging messages on windows, r=jgriffin
https://hg.mozilla.org/integration/mozilla-inbound/rev/fd75d39e98b6
[mozprocess] Fix IO Completion Port failed to signal process shutdown, r=jgriffin
https://hg.mozilla.org/mozilla-central/rev/fe37010e2b84
https://hg.mozilla.org/mozilla-central/rev/fd75d39e98b6
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
See Also: → 1753797
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: