Last Comment Bug 523208 - Ensure mochitest expects multiple processes
: Ensure mochitest expects multiple processes
Status: RESOLVED FIXED
:
Product: Testing
Classification: Components
Component: Mochitest (show other bugs)
: Other Branch
: All All
: -- normal (vote)
: ---
Assigned To: Benjamin Smedberg [:bsmedberg]
:
Mentors:
Depends on: 523894 525065 525090 534934 535564
Blocks: 519572 917842 523094
  Show dependency treegraph
 
Reported: 2009-10-19 13:52 PDT by Jonathan Griffin (:jgriffin)
Modified: 2013-09-18 08:02 PDT (History)
4 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
automation.py.in patch (9.70 KB, patch)
2009-11-06 15:57 PST, Jonathan Griffin (:jgriffin)
no flags Details | Diff | Review
patch v2 (11.05 KB, patch)
2009-11-10 16:15 PST, Jonathan Griffin (:jgriffin)
ted: review-
Details | Diff | Review
patch, v3 (9.34 KB, patch)
2009-11-20 14:11 PST, Benjamin Smedberg [:bsmedberg]
ted: review+
Details | Diff | Review

Description Jonathan Griffin (:jgriffin) 2009-10-19 13:52:25 PDT
For mochitest:

If a browser hangs, the test frameworks currently just kill it off. Now that
browsers are going to launch subprocesses, we need to make sure:

* if the process exits normally, there aren't any subprocesses left around
* if the process is killed off, there aren't orphan processes which could cause
weird problems on future test runs
Comment 1 Ted Mielczarek [:ted.mielczarek] 2009-10-19 13:55:24 PDT
Might need to pull in bsmedberg's killablesubprocess to do the right thing on Win32.
Comment 2 Benjamin Smedberg [:bsmedberg] 2009-10-19 14:11:32 PDT
Another possibility is to have the main process print the PID of each subprocess it launches, and then troll the log for those and explicitly kill them if necessary.

Also, right now the log from the parent and child process are going to be intermingled on stdout/stderr, which has the potential to mess up dump() statements. Should we consider redirecting the child process logs to different log files?
Comment 3 Jonathan Griffin (:jgriffin) 2009-10-19 14:33:32 PDT
According to http://benjamin.smedbergs.us/blog/2006-12-11/killableprocesspy/, killablesubprocess only works on Windows on win2k+...does this exclude WinXP?  If so we'd need to update it if we use it.  Also, I'm guessing it might not work as-is on mobile platforms.
Comment 4 Benjamin Smedberg [:bsmedberg] 2009-10-19 14:35:30 PDT
Ignore WINCE for now since we don't have a multi-process impl on there.

WinXP came out after Win2k, killableprocess works on both (any NT-style OS, really).
Comment 5 Jonathan Griffin (:jgriffin) 2009-10-19 14:49:04 PDT
In that case, I'm all for killablesubprocess...that seems like a neater solution than parsing large log files looking for PIDs.
Comment 6 Jonathan Griffin (:jgriffin) 2009-10-20 14:01:38 PDT
When testing this, it would be helpful to be able to artificially induce an orphan child process...is there any easy way to do this?
Comment 7 Benjamin Smedberg [:bsmedberg] 2009-10-20 14:25:13 PDT
Well, we could certainly do it with nsIProcess: launch `cat`, perhaps?
Comment 8 Ted Mielczarek [:ted.mielczarek] 2009-10-20 14:45:14 PDT
The nsIProcess unit tests have a couple of little programs that you could probably use or modify:
http://mxr.mozilla.org/mozilla-central/source/xpcom/tests/TestArguments.cpp
http://mxr.mozilla.org/mozilla-central/source/xpcom/tests/TestBlockingProcess.cpp
Comment 9 Jonathan Griffin (:jgriffin) 2009-10-21 16:54:25 PDT
I've done quite a bit of testing with this on Windows, and it works well; still need to do testing on Linux.  Do OOP plugins work on Mac these days?

Ted: runtests.py doesn't attempt to detect or recover from hung processes. Is there anything on tinderbox that does?  If so, that may need some adjustment.

Bsmedberg:  All leak logging info is from all processes is dumped into a single file.  There's no way to tell from the log which process generated the leak.  Additionally, each time a new process starts it blows away any existing log, so it's possible that a plugin process would terminate, and any leaks reported by it would be shredded by a new plugin process starting up.  I can file a separate bug for this.
Comment 10 Ted Mielczarek [:ted.mielczarek] 2009-10-21 17:04:50 PDT
I have patches in bug 501034 to make automation.py detect hangs and kill the process to get a stack.
Comment 11 Jonathan Griffin (:jgriffin) 2009-10-22 11:02:24 PDT
That's a cool way to handle hangs.  I need to see if this helps us with all the scenarios we can imagine for e10s.  For example, if a child process hangs, it can happen that the parent process will just wait for it, thus producing no output, even though it isn't hung.  In this case, the patch in bug 501034 would inject a crash into the wrong process.

A more complicated method that might work for e10s would be to enumerate the processes in the job, and see if any had hung (using OpenProcess and WaitForInputIdle).  This would require some additions to killableprocess, and something different would be required for linux/Mac.
Comment 12 Jonathan Griffin (:jgriffin) 2009-10-22 12:01:52 PDT
Re comment #9 and leak reporting, I see it already has a separate bug, bug 519570.
Comment 13 Jonathan Griffin (:jgriffin) 2009-10-23 15:35:55 PDT
The more I test around this stuff, the more I'm actually leaning towards bsmedberg's idea of logging PID's.  The killableprocess approach seems to have some holes:

1 - it doesn't fit well with the hung process patch Ted is making in bug 501034
2 - it doesn't do a good job of handling the case where the child process crashes but the parent process doesn't.  I'm guessing we'd want to catch this case and stop the test, not just continue obliviously.
3 - it may lead to headaches in the future as we look towards getting multiple processes enabled on other platforms

Instead, I'm thinking it would be much easier to log PID's to a dedicated log file (not injected into the regular log spew), and then work with that.  If the browser could maintain an array of active processes, and write the active PID list to the logfile each time the list changes, we could then iterate over that list periodically:

1 - if a PID in the logfile no longer exists, it's crashed and we should stop the test and do cleanup
2 - we can check if all of the active PID's are not hung (using WaitForInputIdle on Windows...I'm not sure how this is done on linux but I assume there is a way), and if we find one, stop the test and do a cleanup.  Integrate with ted's patch in 501034 after it lands.

For cleanup we'd always have a list of PID's to kill, so no worries there.

Comments, suggestions?
Comment 14 Benjamin Smedberg [:bsmedberg] 2009-10-23 16:11:22 PDT
* Why doesn't it fit well with Ted's work?
* We don't necessarily want the test to fail when the child process crashes. We're going to want to write tests where we explicitly crash the child process and then check and make sure the chrome process is still functional and doesn't hang.
* It's trivial on non-Windows platforms to group processes... killableprocess exists because there wasn't a facility to do that on Windows.

But I'm also fine with writing PIDs out.
Comment 15 Jonathan Griffin (:jgriffin) 2009-10-23 16:19:57 PDT
The killableprocess approach doesn't work well with Ted's patch because it can't tell which process is hung, only that there's been no output in a while.  It will inject a crash (in order to get a stack trace) into firefox, even if it's the child process that's hung, and firefox is just sitting around waiting for it.

If you're ok with that, that's fine.

> We don't necessarily want the test to fail when the child process crashes...

Hmm.  I'm not sure how to have it both ways unless we introduce another parameter to runtests.py to cause it to ignore child process crashes, and then run those "we want a crash" tests in a separate run.  Otherwise we could always ignore child crashes, but is there a danger that could cause unpredictable behavior in some other test down the line?
Comment 16 Ted Mielczarek [:ted.mielczarek] 2009-10-25 13:53:19 PDT
(In reply to comment #15)
> The killableprocess approach doesn't work well with Ted's patch because it
> can't tell which process is hung, only that there's been no output in a while. 
> It will inject a crash (in order to get a stack trace) into firefox, even if
> it's the child process that's hung, and firefox is just sitting around waiting
> for it.

Presumably this is a case that we will want to handle in Firefox itself. Hung child processes shouldn't cause the whole app to become non-responsive. Hopefully we could handle this in the JavaScript part of the Mochitest harness.
Comment 17 Jonathan Griffin (:jgriffin) 2009-10-26 10:39:59 PDT
> Presumably this is a case that we will want to handle in Firefox itself. Hung
> child processes shouldn't cause the whole app to become non-responsive.
> Hopefully we could handle this in the JavaScript part of the Mochitest harness.

All we can do in javascript is time the test out, which is what currently happens.  The test framework, though, is never aware that a child process hung.  I'm also thinking down the line to OOP content tests.  In that case, in mochitest-plain at least, it's likely the test framework will itself be running in the child process, so we won't be able to do anything about child process hangs within the context of the framework.  It seems it would be better to have hangs managed by the python test runners, since we're also making the python test runners responsible for process cleanup.

However, that's a bit down the road, and if we just want to implement a fast-and-easy killableprocess solution now, and worry about OOP content stuff later, I'm fine with that.
Comment 18 Ted Mielczarek [:ted.mielczarek] 2009-10-26 10:53:01 PDT
What I mean to say is, file a Core:IPC bug requesting chrome JavaScript accessible APIs for dealing with child processes so you could do this in the harness.
Comment 19 Benjamin Smedberg [:bsmedberg] 2009-10-27 11:09:58 PDT
The IPDL framework knows (or can know) when the child process hangs, but I'm really not concerned about that case at all from the framework. The only thing I really care about is that if a child process lives beyond the end of the test (its normal shutdown), the test should fail, and we should also terminate that process.
Comment 20 cmtalbert 2009-10-27 15:35:14 PDT
(In reply to comment #19)
> The IPDL framework knows (or can know) when the child process hangs, but I'm
> really not concerned about that case at all from the framework. The only thing
> I really care about is that if a child process lives beyond the end of the test
> (its normal shutdown), the test should fail, and we should also terminate that
> process.
I think we do want to detect child process hangs for the future case of OOP content.  For OOP plugins we can depend on the plugin being destructed at the end of the test and kill it if it is not, and mark the test as a fail.  My only concern with this approach (which may be invalid w.r.t. the implementation of E10S) is if plugins must be GC'd before the process is terminated how do we know how much time to allow for that GC to occur? Do we need to force GC at the end of the plugin tests?

Going forward with OOP content, I think we do want the ability to detect hangs in at least two cases:
As far as I know, mochitest (when run normally) runs everything through an iframe on a single page so there is only one content process in that case, and that content process should remain active from the beginning to the end of the test.  Or, does each new onload and onunload in that iframe create/kill a child process?  So, for normal mochitests down the road, we'd want to know when a process hangs.

The second case is one that we are already dealing with -- when a crashtest or reftest crash or hang we have no way to determine from the tinderbox logs which actually occurred.  This is something dbaron has brought up in the past.  If we could use this IPC call to detect a hung child process in this case, and could then print something to the log before ditching the test run, that'd be a significant improvement over where we are now.
Comment 21 Jonathan Griffin (:jgriffin) 2009-10-28 13:54:34 PDT
> The only thing
> I really care about is that if a child process lives beyond the end of the test
> (its normal shutdown), the test should fail, and we should also terminate that
> process.

If we want to fail tests in this condition, then the mochitest framework will have to be able to detect and deal with this.  I'll file a bug per comment #18.
Comment 22 Jonathan Griffin (:jgriffin) 2009-11-05 16:51:48 PST
I have a patch which works in every scenario I could device on both linux and win32.  It's built on top of ted's patches in bug 501034, which haven't landed yet on e10s.  In order to make a cleaner diff for review, do you mind if I push these to e10s, ted?

http://hg.mozilla.org/mozilla-central/rev/c641519baa90
http://hg.mozilla.org/mozilla-central/rev/3b932018ce6a
http://hg.mozilla.org/mozilla-central/rev/0de15169a880
Comment 23 Ted Mielczarek [:ted.mielczarek] 2009-11-05 16:56:13 PST
I think you want someone to merge m-c to e10s, actually.
Comment 24 Jonathan Griffin (:jgriffin) 2009-11-05 17:01:04 PST
> I think you want someone to merge m-c to e10s, actually.

Bsmedberg, any m-c to e10s merges planned for the near future?
Comment 25 Benjamin Smedberg [:bsmedberg] 2009-11-06 03:51:39 PST
> I think we do want to detect child process hangs for the future case of OOP
> content.  For OOP plugins we can depend on the plugin being destructed at the
> end of the test and kill it if it is not, and mark the test as a fail.  My only

I don't know what "we" means here. Once a plugin is loaded, it currently stays loaded until shutdown, no matter how many instances of it are created/destroyed. The parent process will have hang detection for its children, so there is absolutely no need to build some additional mutli-process hang detection in.

> Going forward with OOP content, I think we do want the ability to detect hangs
> in at least two cases:
> As far as I know, mochitest (when run normally) runs everything through an
> iframe on a single page so there is only one content process in that case, and
> that content process should remain active from the beginning to the end of the
> test.  Or, does each new onload and onunload in that iframe create/kill a child
> process?  So, for normal mochitests down the road, we'd want to know when a
> process hangs.

Right now there's only one content process ever. But again, since IPDL would notice hangs, why does it matter? We can make the IPDL hang detector perform any action we want in the chrome process, including dumping output to the log, since the chrome process isn't hung.

jgriffin, you can do the merge or catch any one of us on IRC.
Comment 26 Jonathan Griffin (:jgriffin) 2009-11-06 15:57:32 PST
Created attachment 410888 [details] [diff] [review]
automation.py.in patch

Thanks for the merge, bsmedberg.  

I've tested this patch on both linux and win32, and it seems to work well.  It cleans up any child processes after the test is finished, and also works with ted's patch in bug 501034 to handle hangs correctly in child processes, in the case where a child process is hung after the parent process closes.
Comment 27 Jonathan Griffin (:jgriffin) 2009-11-10 16:15:33 PST
Created attachment 411561 [details] [diff] [review]
patch v2

Update to original patch which makes sure that all copies of automation.py which are generated have MOZ_CHILD_PROCESS_NAME set correctly.
Comment 28 Ted Mielczarek [:ted.mielczarek] 2009-11-18 08:59:16 PST
Comment on attachment 411561 [details] [diff] [review]
patch v2

>diff --git a/build/Makefile.in b/build/Makefile.in
>--- a/build/Makefile.in
>+++ b/build/Makefile.in
>+ifdef MOZ_IPC
>+include $(topsrcdir)/ipc/app/defs.mk
>+endif

Why not just put this in automation-build.mk directly? Saves scattering it around the tree, which is the whole point of automation-build.mk.

>diff --git a/build/automation.py.in b/build/automation.py.in
>--- a/build/automation.py.in
>+++ b/build/automation.py.in
>+def killPid(pid):
>+  "Kills the process specified by |pid|"

For methods like these where we're going to have essentially two implementations, for POSIX and Win32, I'd prefer if you kept to what I did for readWithTimeout, and have two completely separate function definitions on opposite sides of an if. I think it makes things a bit clearer. (Also, you can just reuse my existing if block for this.

>+def isPidAlive(pid, processName = MOZ_CHILD_PROCESS_NAME, ignoreDefunct = False):
>+  else:
>+    pslist = Process(["ps", "-e"], stdout=subprocess.PIPE)
>+    pidlist = Process(["grep", str(pid)], stdin=pslist.stdout, stdout=subprocess.PIPE)
>+    output =  pidlist.communicate()[0]
>+    regexp = "%d.*?%s(.*?)$" % (pid, os.path.basename(processName))
>+    m = re.search(regexp, output)
>+    if m:
>+      if ignoreDefunct and "defunct" in m.group(1):
>+        return False
>+      else:
>+        return True
>+    else:
>+      return False

This is kind of ugly. How badly do you need the ignoreDefunct behavior? If you don't really need it, then you could just write:

import os, errno
def isPidAlive(pid):
  try:
    os.kill(pid, 0)
  except OSError, err:
    if err.errno == errno.ESRCH:
      return False
  return True

(This also skips the process name part, but you skip that in the Win32 implementation already...)

>     while line != "" and not didTimeout:
>-      log.info(line.rstrip())
>+      # As we read lines, search them for launched PID's, and if found,
>+      # append them to a list.
>+      line = line.rstrip()
>+      m = re.search("launched child process (\d*)$", line)
>+      if m:
>+        processList.append(int(m.group(1)))
>+      log.info(line)

It seems ugly to mix the PID info into stdout. I realize this was decided in a different bug, but I'd much rather have it go to a logfile specified in an environment variable for cleanliness. (We could also log it to stdout in DEBUG builds, that would be fine.) I commented to this effect on bug 525090.

Also, if or when we get code that cleanly shuts down child processes, it'd probably be good to log the shutdowns as well, so this code could remove those from the list of processes to check. That way if a child process is cleanly shut down, and another process is launched with the same PID, we don't have problems.

>       (line, didTimeout) = readWithTimeout(logsource, timeout)
>     if didTimeout:
>       log.info("TEST-UNEXPECTED-FAIL | automation.py | application timed out after %d seconds with no output", int(timeout))
>-      triggerBreakpad(proc, utilityPath)
>+      # We can timeout because either the default process, or a child process,
>+      # is still alive.  Before attempting to trigger breakpad on the default
>+      # process, check and see if it's really still running...
>+      isAlive = isPidAlive(proc.pid, processName = app, ignoreDefunct = True)
>+      if isAlive:
>+        log.info("TEST-UNEXPECTED-FAIL | automation.py | primary process still alive, triggering breakpad")

I don't think you need this log statement, the previous one about the timeout should be sufficient.

>+      # ...otherwise, we've probably timed out because of a stuck child
>+      # process.  In that case, scan all the PID's of launched processes, and
>+      # kill any that are still alive.
>+      for processPID in processList:
>+        isAlive = isPidAlive(processPID, ignoreDefunct = True)
>+        if isAlive:
>+          log.info("TEST-UNEXPECTED-FAIL | automation.py | child process %d still alive after shutdown", processPID)

This seems unnecessary. If the parent process dies, I don't think we hit the timeout path, readWithTimeout should just return an empty line. In any case, your child process cleanup block below will handle wayward child processes, so why bother here?

>+  # Do a final check for zombie child processes.  We can sometimes observe
>+  # a <defunct> child for a short while, so allow some time for the OS to 
>+  # clean it up before killing it with an error.
>+  doneTime = time.time() + 15
>+  continueChecking = True
>+  while continueChecking:
>+    foundProcess = False
>+    timedOut = time.time() >= doneTime
>+    for processPID in processList:
>+      isAlive = isPidAlive(processPID)
>+      if isAlive:
>+        foundProcess = True
>+        if timedOut:
>+          log.info("TEST-UNEXPECTED-FAIL | automation.py | child process %d still alive after shutdown", processPID)

You might want to stick that 15 in a variable, and indicate here that that's how long we waited for it, like "child process %d still alive %d seconds after shutdown".

r- for some cleanup and because I really want that "PIDs logged to stdout" changed to a logfile. Otherwise the majority of the patch looks good.
Comment 29 Benjamin Smedberg [:bsmedberg] 2009-11-20 14:11:40 PST
Created attachment 413703 [details] [diff] [review]
patch, v3

I've simplified a lot: moving the PID log into a different file means it's relatively simple to read it at any time. Also because there shouldn't ever be zombies (they should always be waited by the toplevel process) I've removed the possibility of allowing zombies, and I've removed the waitloop which uses delays to work around zombie issues.
Comment 30 Benjamin Smedberg [:bsmedberg] 2009-11-20 14:31:06 PST
Comment on attachment 413703 [details] [diff] [review]
patch, v3

I'm having trouble testing this effectively, but I believe it does what we want. I've had some odd situations where the main process would exit and runApp would sit there in readWithTimeout waiting for something to happen, but I'm pretty sure this patch doesn't change that codepath at all and so I think this is ready for review.
Comment 31 Ted Mielczarek [:ted.mielczarek] 2009-11-23 07:07:25 PST
Comment on attachment 413703 [details] [diff] [review]
patch, v3

>diff --git a/build/automation.py.in b/build/automation.py.in
>--- a/build/automation.py.in
>+++ b/build/automation.py.in
>@@ -44,16 +44,17 @@ import logging
> import os
> import re
> import select
> import shutil
> import signal
> import subprocess
> import sys
> import threading
>+import time

I don't think you're actually using "time" anymore.

>+  def isPidAlive(pid):
>+    try:
>+      # kill(pid, 0) checks for a valid PID without actually sending a signal
>+      os.kill(pid, 0)

Could probably stand a comment that this will throw if the process doesn't exist to make the control flow less confusing here.

>+    except OSError, err:
>+      if err.errno == errno.ESRCH:
>+        return False
>+      raise

And a comment here explaining that the raise is to pass on other failures.

> def runApp(testURL, env, app, profileDir, extraArgs,
>            runSSLTunnel = False, utilityPath = DIST_BIN,
>            xrePath = DIST_BIN, certPath = CERTS_SRC_DIR,
>            debuggerInfo = None, symbolsPath = None,
>-           timeout = DEFAULT_TIMEOUT):
>+           timeout = DEFAULT_TIMEOUT, processLog = None):

I don't know that there's any need to force the caller to pass in a filename here, is there? I'd just use tempfile.NamedTemporaryFile() to get a filename in here and let callers be ignorant of it.
http://docs.python.org/library/tempfile.html#tempfile.NamedTemporaryFile

>+  # Do a final check for zombie child processes.
>+  if processLog is not None:
>+    if not os.path.exists(processLog):
>+      log.info("TEST-UNEXPECTED-FAIL | automation.py | process PID log '%s' does not exist", processLog)

Will the process log be created if the env var is set, even if the IPC pref is off? If not, this will break regular Mochitests.

>+    else:
>+      processList = []
>+      pidRE = re.compile(r'launched child process (\d+)$')
>+      for line in open(processLog):
>+        m = pidRE.search(line)
>+        if m:
>+          processList.append(int(m.group(1)))
>+
>+      for processPID in processList:
>+        log.info("Checking for orphan process with PID %d", processPID)
>+        isAlive = isPidAlive(processPID)
>+        if isAlive:

I don't think you need the temporary variable here.

r=me with those changes.
Comment 32 Benjamin Smedberg [:bsmedberg] 2009-11-23 14:36:43 PST
http://hg.mozilla.org/projects/electrolysis/rev/db9989871e18

Note You need to log in before you can comment on or make changes to this bug.