Closed Bug 640047 Opened 13 years ago Closed 13 years ago

Investigate using IOCompletion ports to determine when app under test is dead

Categories

(Testing Graveyard :: Mozmill, defect)

x86
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cmtalbert, Assigned: cmtalbert)

References

Details

(Whiteboard: [mozmill-2.0+])

Attachments

(2 files, 1 obsolete file)

Over in bug 639879 we were fighting this scenario:
* Mozrunner launches a process in a job object (Windows)
* A test in the job object fails, causing the runner to call __init__.py::kill
* profile.cleanup() is called, and we cannot clean the profile because the process under test is still running.

To avoid this, we can either poll to ensure the process is down, or use I/O completion ports to listen for the process going down.  This bug investigates using I/O completion ports here, as they seem like a more deterministic (if asynchronous) way to detect process exit, and they get us out of the issue where we might call TerminateJobObject twice if we were to do a "kill" followed by a "wait".

This is a link to the code that caused the issue.  From it, it's fairly clear how to write a test for this behavior: have the while loop throw an exception, then ensure both that the kill works properly and that the non-exception path, which calls wait, works as well.
http://hg.mozilla.org/projects/addon-sdk/file/b8837d5790ba/python-lib/cuddlefish/runner.py#l275
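A throwaway sketch of that test scenario, using plain subprocess instead of mozrunner (run_and_cleanup is a hypothetical helper, not existing code):

```python
import subprocess
import sys

def run_and_cleanup(cmd, fail=False):
    """Skeleton of the scenario in the linked runner.py: if the monitoring
    loop raises, kill() must still fully reap the process before profile
    cleanup; on the normal path, wait() does the same."""
    proc = subprocess.Popen(cmd)
    try:
        if fail:
            # simulate the while loop throwing an exception mid-run
            raise RuntimeError("simulated test failure")
        proc.wait()
    except RuntimeError:
        proc.kill()
        proc.wait()  # reap, so the process is really gone
    # only now is it safe to call profile.cleanup()
    assert proc.poll() is not None
    return proc.returncode
```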
Here is the C code that enables I/O completion port notifications for job objects under Windows.


**************************************
Global variables and defines
**************************************

// Handles required
HANDLE hIOCP;			// Completion port that receives Job notifications
HANDLE hThreadIOCP;		// Completion port thread
HANDLE hJob;			// Job Object

// Completion keys for the completion port
#define COMPKEY_TERMINATE  ((UINT_PTR) 0)
#define COMPKEY_JOBOBJECT  ((UINT_PTR) 1)

**************************************
// This is the thread function which polls and dequeues the messages
**************************************
DWORD WINAPI JobNotify(PVOID)
{
	BOOL fDone = FALSE;
	BOOL fGotCompStat = FALSE;

	while (!fDone)
	{
		DWORD dwBytesXferred;
		ULONG_PTR CompKey;
		LPOVERLAPPED po;

		// Get queued completion status (times out after 2 secs.)
		fGotCompStat = GetQueuedCompletionStatus(hIOCP, &dwBytesXferred, &CompKey, &po, 2000);

		// If the function returns FALSE and po is NULL, we have either timed
		// out or the I/O completion port has been closed
		if (!fGotCompStat && (po == NULL))
		{
			// If the port has been closed, bail; otherwise poll again
			if (GetLastError() == ERROR_ABANDONED_WAIT_0)
			{
				fDone = TRUE;
			}
			continue;
		}

		// The app is shutting down, exit this thread
		fDone = (CompKey == COMPKEY_TERMINATE);

		// This is the key passed when the process was added to the job object
		if (CompKey == COMPKEY_JOBOBJECT)
		{
			switch (dwBytesXferred)
			{
			// Process count is zero, we're probably done here
			case JOB_OBJECT_MSG_ACTIVE_PROCESS_ZERO:
				break;

			// A new process has been added to the job object; the PID is in po
			case JOB_OBJECT_MSG_NEW_PROCESS:
				break;

			// One of the processes has exited; the PID is in po
			case JOB_OBJECT_MSG_EXIT_PROCESS:
				break;

			// One of the processes has exited abnormally; the PID is in po
			case JOB_OBJECT_MSG_ABNORMAL_EXIT_PROCESS:
				break;

			// Ignore any other message
			default:
				break;
			}
		}
	}
	return 0;
}

**************************************
This is the setup code
**************************************
    // Create the I/O completion port
	hIOCP = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);

    // create the job object
	hJob = CreateJobObject(NULL, _T("foo"));
	if (hJob != NULL)
		{
        // set the job object to kill children on job close also allow breakaways
		JOBOBJECT_EXTENDED_LIMIT_INFORMATION jeli = { 0 };
		jeli.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE | JOB_OBJECT_LIMIT_BREAKAWAY_OK;
		if( 0 == SetInformationJobObject( hJob, JobObjectExtendedLimitInformation, &jeli, sizeof(jeli)))
			{
			::MessageBox( 0, _T("Could not SetInformationJobObject"), _T("Error"), MB_OK);
			}

        // set the job object to send notifications using the COMPKEY_JOBOBJECT key constant
		JOBOBJECT_ASSOCIATE_COMPLETION_PORT joacp = { (PVOID) COMPKEY_JOBOBJECT, hIOCP };
		if (0 == SetInformationJobObject(hJob, JobObjectAssociateCompletionPortInformation, &joacp, sizeof(joacp)))
			{
			::MessageBox( 0, _T("Could not Associate Completion Port"), _T("Error"), MB_OK);
			}
		}

	// Create a thread that waits on the completion port
	hThreadIOCP = CreateThread(NULL, 0, JobNotify, NULL, 0, NULL);

**************************************
This is the cleanup code
**************************************
    	// Post a special key that tells the completion port thread to terminate
	PostQueuedCompletionStatus(hIOCP, 0, COMPKEY_TERMINATE, NULL);
	
	// Wait for the completion port thread to terminate
	WaitForSingleObject(hThreadIOCP, INFINITE);
   
	// Clean up everything properly
	CloseHandle(hIOCP);
	CloseHandle(hThreadIOCP);
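For reference, the dispatch logic of JobNotify can be sketched in Python with the Win32 call abstracted behind a get_status callable, so the loop itself is testable off Windows.  The message-ID values are the winnt.h constants; job_notify is an illustrative name, not mozprocess API:

```python
COMPKEY_TERMINATE = 0
COMPKEY_JOBOBJECT = 1

# Job-object notification message IDs (winnt.h)
JOB_OBJECT_MSG_ACTIVE_PROCESS_ZERO = 4
JOB_OBJECT_MSG_NEW_PROCESS = 6
JOB_OBJECT_MSG_EXIT_PROCESS = 7
JOB_OBJECT_MSG_ABNORMAL_EXIT_PROCESS = 8

NAMES = {
    JOB_OBJECT_MSG_ACTIVE_PROCESS_ZERO: "active-process-zero",
    JOB_OBJECT_MSG_NEW_PROCESS: "new-process",
    JOB_OBJECT_MSG_EXIT_PROCESS: "exit-process",
    JOB_OBJECT_MSG_ABNORMAL_EXIT_PROCESS: "abnormal-exit",
}

def job_notify(get_status):
    """get_status(timeout_ms) -> (ok, msgid, compkey), mirroring
    GetQueuedCompletionStatus.  Returns the list of observed events;
    stops on the COMPKEY_TERMINATE key or a closed port (ok False)."""
    events = []
    while True:
        ok, msgid, compkey = get_status(2000)
        if not ok:
            # a real loop would check GetLastError() to distinguish
            # ERROR_ABANDONED_WAIT_0 from a plain timeout; bail here
            break
        if compkey == COMPKEY_TERMINATE:
            break
        if compkey == COMPKEY_JOBOBJECT and msgid in NAMES:
            events.append(NAMES[msgid])
    return events
```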
Blocks: 600736
Attached patch WIP (obsolete) — Splinter Review
Jonathan and Bob, can you take a look at this?  This is the work in progress of the IO Completion work so far.  The set of things left to do:
1. Figure out if we care about getting the PIDs out of the overlapped struct.  It doesn't seem to work at all.  From what I've read of the pywin32 extensions source code, it looks like python may be GC'ing the object before I have a chance to use it.
2. I need to add in the wait() method from killableprocess into processmanager.  You can see the existing wait process here: https://github.com/mozautomation/mozmill/blob/master/mozprocess/mozprocess/killableprocess.py#L195 (it will of course be refactored to signal the IO comp port thread instead of polling).
3. I need to do the same set of process management classes for POSIX systems (hopefully linux and mac can be treated the same in this regard; this was what the old code did).
4. What should the API be?  Do we want to do a Popen with a special flag to indicate that this Popen call should manage child processes?  If it is called without that flag, then I think we'd want to just kill off the PID like we used to do in the original processmanager codebase.  Or do we always want to manage child processes?  I tend to think we should use a flag because in the event we can't create a job object, or if something should go wrong, we should at least default back to the original behavior (kill the process we spawn by PID)
5. I have to wire all this into mozrunner, but that is the last set of changes I'll make.
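For item 2, the rough shape of a wait() that blocks on a signal from the completion-port thread instead of polling might look like this (names are illustrative, not the actual mozprocess API):

```python
import threading

class JobWaiter:
    """The completion-port thread calls on_active_processes_zero() when the
    job reports JOB_OBJECT_MSG_ACTIVE_PROCESS_ZERO; wait() just blocks on
    the event, so there is no polling window after TerminateJobObject."""

    def __init__(self):
        self._done = threading.Event()
        self.returncode = None

    def on_active_processes_zero(self, returncode=0):
        # called from the IO completion port thread
        self.returncode = returncode
        self._done.set()

    def wait(self, timeout=None):
        if self._done.wait(timeout):
            return self.returncode
        return None  # timed out
```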

Notes to run it.
1. Check out mozmill from github (see above)
2. Apply patch
3. Unzip the horrible ugly processlauncher throw-away code
4. Edit horrible processlauncher code to hard code an application you'd like to launch.
5. (from mozmill dir) python setup_development.py
6. Run the test from mutt (the Mozmill UnitTest Tool): mutt testpy -b "C:\\Program Files (x86)\\Minefield\\firefox.exe" -m mutt\\mutt\\tests\\python\\pythontests.ini
Assignee: nobody → ctalbert
Status: NEW → ASSIGNED
Attachment #530147 - Flags: feedback?(jgriffin)
Comment on attachment 530147 [details] [diff] [review]
WIP

Review of attachment 530147 [details] [diff] [review]:

This is cool, and I can tell a lot of work went into all the ctypes code.  Good job!  See inline notes about GetOverlappedResult.

There seem to be multiple mechanisms for waiting for processes here - a) waitForFinish, which just waits for stdout output, b) the io completion port polling thread you wrote, and c) the wait method from killableprocess.py you mentioned still needing to import.  I guess wait is an improved version of subprocess.wait that gets called by waitForFinish, and the io completion port polling thread you wrote will take the place of some portions of wait when we're able to use job objects.

> What should the API be?  Do we want to do a Popen with a special flag to
> indicate that this Popen call should manage child processes?  If it is called
> without that flag, then I think we'd want to just kill off the PID like we used
> to do in the original processmanager codebase.  Or do we always want to manage
> child processes?

It's easy in a multiple process environment to leave orphaned child processes about (I was able to do it easily just tweaking your example code a little bit), so I think I'd be in favor of defaulting to managing child processes, and falling back to single process management if we can't create job objects.  You could add an ignoreChildren=False parameter to Popen so callers could specifically request that behavior if they had some reason to.  I guess the question then is, what do you do if ignoreChildren == False and you can't create job objects...do you raise an Exception, or just fallback silently?

There are some environments where we just won't be able to create job objects...mostly when Windows UAC is enabled and the current process is already in a managed job object.  In this case, we don't want the APIs to fail, so maybe silently falling back to single-process handling would be the best compromise.
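The fallback policy being discussed could be wired up roughly like this (can_use_job_objects and choose_process_mode are hypothetical names; on Windows the probe would actually attempt CreateJobObject via ctypes):

```python
import sys

def can_use_job_objects():
    """Capability probe.  On Windows this would try CreateJobObject and
    detect the UAC already-in-a-job case; elsewhere there are no jobs."""
    if not sys.platform.startswith("win"):
        return False
    # a real ctypes probe would go here
    return False

def choose_process_mode(ignore_children=False):
    """Manage children by default; fall back silently (but loudly logged)
    to single-process handling when job objects are unavailable."""
    if ignore_children:
        return "single-process"
    if can_use_job_objects():
        return "job-object"
    print("WARNING: cannot create job objects; falling back to "
          "single-process management", file=sys.stderr)
    return "single-process"
```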

::: mozprocess/mozprocess/processhandler.py
@@ +182,5 @@
+                    rtn = c_ulong(0)
+                    winprocess.GetOverlappedResult(self._job,
+                                                           byref(overlapped),
+                                                           byref(rtn),
+                                                           1)

You don't need to use OVERLAPPED here at all, you can just pass a ulong pointer instead, and read the PID directly from there.  I tested and it works:

in winprocess.py, change the definition of GetQueuedCompletionStatusProto:
   GetQueuedCompletionStatusProto = WINFUNCTYPE(BOOL,         # Return Type
                                                HANDLE,       # Completion Port
                                                LPDWORD,      # Msg ID
                                                LPULONG,      # Completion Key
                                                LPULONG,      # PID
                                                DWORD)        # milliseconds to wait

here, change:
    pid = c_ulong(0)
    portstatus = winprocess.GetQueuedCompletionStatus(self._io_port,
                                                      byref(msgid),
                                                      byref(compkey),
                                                      byref(pid),
                                                      10000)

Then, after receiving a JOB_OBJECT_MSG_NEW_PROCESS message, for example, pid contains the new process's PID.

@@ +188,5 @@
+                    print "winproc: Done getting status compkey: %s" % compkey.value
+                    print "winproc: And msgid is: %s" % msgid.value
+
+                    if (not portstatus):
+                        print "error detected"

This isn't necessarily an error; if GetLastError() returns 258 (WAIT_TIMEOUT), it just means that GetQueuedCompletionStatus timed out because nothing interesting happened.  Nothing wrong with the current logic, it just wasn't clear from reading the code.

@@ +217,5 @@
+                            rtn = c_ulong(0)
+                            winprocess.GetOverlappedResult(self._io_port,
+                                                           byref(overlapped),
+                                                           byref(rtn),
+                                                           1)

Don't need GetOverlappedResult here either, just use the pid variable from the previous GetQueuedCompletionStatus call (see above note).

@@ +367,5 @@
+                    err = self.GetLastError()
+                    if err == 38 or err == 109: # ERROR_HANDLE_EOF || ERROR_BROKEN_PIPE
+                        return ('', False)
+                    else:
+                        log.error("readWithTimeout got error: %d", err)

There's no 'log' object here (a bug in my original code), should probably raise an Exception instead.
Attachment #530147 - Flags: feedback?(jgriffin) → feedback+
(In reply to comment #4)

> There seem to be multiple mechanisms for waiting for processes here - a)
> waitForFinish, which just waits for stdout output, b) the io completion port
> polling thread you wrote, and c) the wait method from killableprocess.py you
> mentioned still needing to import.  I guess wait is an improved version of
> subprocess.wait that gets called by waitForFinish, and the io completion port
> polling thread you wrote will take the place of some portions of wait when
> we're able to use job objects.
That's exactly right. The wait is an improved version of subprocess.wait, which is called by waitForFinish, and that wait code will interact with the IO Completion code so that on Windows you do not have the non-deterministic time window between issuing TerminateProcess and the process actually ending.

> 
> > What should the API be?  Do we want to do a Popen with a special flag to
> > indicate that this Popen call should manage child processes?  If it is called
> > without that flag, then I think we'd want to just kill off the PID like we used
> > to do in the original processmanager codebase.  Or do we always want to manage
> > child processes?
> 
> It's easy in a multiple process environment to leave orphaned child processes
> about (I was able to do it easily just tweaking your example code a little
> bit), so I think I'd be in favor of defaulting to managing child processes, and
> falling back to single process management if we can't create job objects.  You
> could add an ignoreChildren=False parameter to Popen so callers could
> specifically request that behavior is they had some reason to.  I guess the
> question then is, what do you do if ignoreChildren == False and you can't
> create job objects...do you raise an Exception, or just fallback silently?
> 
> There are some environments where we just won't be able to create job
> objects...mostly when Windows UAC is enabled and the current process is already
> in a managed job object.  In this case, we don't want the API's to fail, so
> maybe silently falling back to single-process handling would be the best
> compromise.
> 
That's the compromise I'm leaning toward.  I'll provide a way to turn off managing child processes, and I'll fall back to ignoring the children if we can't create job objects for some reason.  I'm more in favor of falling back silently so that we have an API that "just works".  At levels this low, we'd rather it work and save throwing exceptions for things that are show stoppers. 

However, I do want to ensure that the fallback is logged in some sort of very visible way whenever we fall back.

> You don't need to use OVERLAPPED here at all, you can just pass a ulong pointer
> instead, and read the PID directly from there.  I tested and it works:
HOLY CRAP YOU ARE MY HERO.  Thanks so much for figuring this out.

> +                    if (not portstatus):
> +                        print "error detected"
> 
> This isn't necessarily an error; if GetLastError() returns 258 (WAIT_TIMEOUT),
> it just means that GetQueuedCompletionStatus timed out because nothing
> interesting happened.  Nothing wrong with the current logic, it just wasn't
> clear from reading the code.
Right.  I'll clean that up.

> @@ +367,5 @@
> +                    err = self.GetLastError()
> +                    if err == 38 or err == 109: # ERROR_HANDLE_EOF ||
> ERROR_BROKEN_PIPE
> +                        return ('', False)
> +                    else:
> +                        log.error("readWithTimeout got error: %d", err)
> 
> There's no 'log' object here (a bug in my original code), should probably raise
> an Exception instead.
Ok, I'll take a look at that.  We need to figure out logging across the board as well.  But that is another bug.  And I think that we have quite a bit of code that handles logging already anyway.  So we'll just have to factor that into this as an external module.
Here is a patch that is closer to final.
* Should run well on windows and linux.
* Should have basic running behavior on mac (still some debug output on mac, as there are a couple more things to track down).

Let's get a first review done and see where I stand while I'm fixing up the mac code.
Attachment #530147 - Attachment is obsolete: true
Attachment #532387 - Flags: review?(jhammel)
Attachment #532387 - Flags: feedback?(jgriffin)
if you want all the mozmill code so you have something to run it in (install git) and then:
1. git clone git://github.com/ctalbert/mozmill.git
2. cd mozmill
3. git fetch origin
4. git branch --track iocompletion origin/iocompletion
5. git checkout iocompletion

Then you can run mozmill:
python setup_development.py
mozmill -b <path to firefox> -t mutt/mutt/tests/js/test_assert.js

It's -t <path to mozmill test>.

Running it outside of mozmill is also completely valid and encouraged.  Let me know what you find.  You should be able to use the mozprocess/ component as a stand alone python module too.
Re comment 7

I get

INFO | Passed: 0
INFO | Failed: 1
INFO | Skipped: 0

Is that expected?
Comment on attachment 532387 [details] [diff] [review]
patch that should apply to tip of master

class ReportMsg():
        """ Trying to work with the standard python logger didn't work well.
            This just helps us print things to stdout.
        """

I'm not sure if this is intended to be checked in with the code, but I
would discourage it. The current Mozmill logger, and IMHO any logger
we should ultimately use (presumably MozLogger), has the option to
present stdout and stderr as logging levels. Unless there is a reason
to have more than two logging levels, I would suggest `print` and
`print >> stderr` as logging substitutes for any code to be checked
in. There is also no reason for this to be a class, as there is no
object or shared data. As it stands these should be free-standing
functions, though, again, you probably don't need them to be checked
in.

> 60 # Flip this for crazy debug output
self._debug = False

Probably also not the best thing to check in. You only use it twice
and only for print statements. I'd say just remove and if you need to
debug you can add in the two print statements you may need

68       # Odd workaround for mac - TODO: Figure out why this exists
69       self.kill_called = False


> 80 print "These are your ARGS %s " % args
To be gone

> 87 def __del__(self, _maxint=sys.maxint):

Currently this only cleans up on windows.  Shouldn't this do something
on the other OSes?

<nit/> if (self._handle):

It's not JS.  You don't need the ()s


102 except:
103 # TODO: Throw?
104 self.logger.warn("Could not terminate process")

Since it's a bare except, I say throw (or save the exception and
throw later).

121     finally:
122     self.returncode = -9
123     else:
124     os.kill(self.pid, signal.SIGKILL)
125     self.returncode = -9

Should document what -9 is

136    print "clintdbg: got wait"

To remove along with similar statements

143       def _timed_wait_callback(self, callback_func, timeout):
144       """ This function is used by linux and mac to handle their wait code.
145  It runs a wait loop with the given timeout and calls back into
146  a function (callback_func) with the given timeout parameter
147  """

If this is only linux + mac, then maybe it should be in an appropriate
if block?

157     now = datetime.datetime.now()
158     starttime = datetime.datetime.now()

now = starttime = datetime.datetime.now()
You might be better off using time.time() instead of dealing with
horrible timedeltas

160      while (((diff.seconds * 100 * 1000 + diff.microseconds) <
timeout * 1000 * 1000) and done is False

Like this one; should be diff.seconds * 1000 * 1000.
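The time.time() version of that wait loop, sketched (timed_wait is illustrative, not the patch code):

```python
import time

def timed_wait(poll, timeout):
    """Poll poll() until it returns a non-None returncode or `timeout`
    seconds pass; time.time() sidesteps the timedelta unit arithmetic."""
    start = time.time()
    while time.time() - start < timeout:
        rc = poll()
        if rc is not None:
            return rc
        time.sleep(0.05)
    return None  # timed out
```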

164  else:
165  if subprocess.poll() is not None:
166  done = self.returncode

AttributeError: 'module' object has no attribute 'poll'

why not done = True in this case?

> 180 if not isinstance(args, types.StringTypes):
> 181 args = subprocess.list2cmdline(args)

I've always seen this as `if not isinstance(args, basestring):`, which
to the best of my knowledge works fine.  We should probably decide on
a convention if they are the same, and I prefer my method since it
doesn't require an import (this is the only place you use the types
module). Is this the same on windows (just str and unicode on 'nix)?
Maybe I don't understand the point of the types module at all

Also, 'nix will have problems with string arguments if shell != True;
I guess this isn't the case for windows?

212   if canCreateJob:
213     self.logger.info("ProcessManager using job objects to manage
child processes")
214   creationflags |= winprocess.CREATE_BREAKAWAY_FROM_JOB
215   else:
216   # Since we've warned, we just log info here to inform you
217   # of the consequence of setting ignore_children = True
218   self.logger.info("ProcessManager NOT managing child processes")

Maybe there should be a flag (defaulting to False) that will raise on
not being able to manage child processes. And again, IMHO this should
be a print or print >> stderr vs a logger, at least until/if MozLogger
is done.  Likewise for line 207

270 if p2cread is not None:
271   p2cread.Close()
272   if c2pwrite is not None:
273   c2pwrite.Close()
274   if errwrite is not None:
275   errwrite.Close()

for i in (p2cread, c2pwrite, errwrite):
    if i is not None:
        i.Close()

(Or if you're masochistic: [i.Close() for i in (p2cread, c2pwrite,
errwrite) if i is not None])

276      time.sleep(.1)

Is there a point to this? If so, it should be documented.

281      if not (self._io_port) or not (self._job)

Again, no need for ()s

+                                del(self._spawned_procs[pid.value])
and again: `del self._spawned_procs[pid.value]`

311 # Wait 3 minutes for windows to get its act together

Why 3 minutes? how was this determined?

353 self._spawned_procs[pid.value] = 1

If the value doesn't matter, use a set instead of a dict

388    # Dude, the process is like totally dead!

*PLEASE* keep this line in! :)

474      elif sys.platform == "linux2" or (sys.platform in ('sunos5',
'solaris')):

Please fix this line while you're there:
`elif sys.platform in ('linux2', 'sunos5', 'solaris'):`


504      timeout = timeout * 1000
Inside self._timed_wait_callback you already multiply to make this
microseconds....now it is a factor of 1000 too big (ignoring the 100
vs 1000 nit above).


586     def run_popen_directly(klass, *args, **kwargs):

I don't think this function is necessary.  They can always just call
ProcessHandler.Process(*args, **kwargs)

mutt/mutt/tests/python/mozprocess1.py should just be
mozprocess.py. This should probably be stripped of hard-coded paths if
it is to be checked in.  Alternatively, we could start a sample
directory somewhere and put it there.  Enough of our tests don't work
that I'd hate to add another one.

I honestly have no idea why the linux/unix and mac code should differ
at all.  Maybe that's for a future bug.

Overall, I think this is a really good patch. I like the organization
of it and it makes a bunch of things clearer.  I'm not really
qualified to assess what is going on at the micro-level, especially in
windows land, but aside from the caveats above and tons of debugging
code to be removed, it looks good to go.
Attachment #532387 - Flags: review?(jhammel) → review+
Attachment #532387 - Flags: feedback?(jgriffin)
This landed on master, in a mess and a half.  Had merge issues with git - you can see the twenty some odd commit messages that leaked in from my personal branch.

https://github.com/mozautomation/mozmill/commit/488b2d620993cb6c23bd2e255a052309cb8edf1f

Word to the wise:
* Use commit --amend each time you want to commit a new change in a personal branch.  That way all your changes end up in one commit, which prevents you from getting into these intra-commit conflicts like I did.

anyhow, fixed now.

For future bugs in this area please use a new bug, as this one has already morphed purposes quite a bit, and I'd like not to morph its purpose any further.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard