I've tracked the freezes down to a root cause. The problem is the combination of threads + fork + Python + subprocess. Some Background: The test runner is implemented using a work stealing queue feeding into <count> threads. When each thread grabs a job to execute it creates a list with entries [None, None] and spawns _another_ thread, passing a reference to the list. The new thread does a subprocess.Popen(), sets the returned process object into the passed list (effectively passing a reference to the creating thread) then calls .communicate() on it in the sub-sub-thread. Subprocess.Popen is a python implemented function which calls os.fork() and then, in the child, os.exec(). The call to communicate in the parent calls os.waitpid(). This wait will not end until the test dies. The sub-thread that spawned the sub-sub (forking) thread joins on the forking thread with a timeout. When this timeout ends, it uses the reference passed from the sub-sub-thread to get the pid of the child process. It then calls os.kill with this pid in a loop, swallowing all exceptions. We then join (with infinite timeout) on the sub-sub-process in order to clean it up. This code is all basically correct, if seriously over-complicated. Problem 1: In the guts of os.fork, we have some code that looks more or less like this: PyGIL_Release(); // * fork(); PyGIL_Acquire(); Threads in Python are native threads, _but_ all access to any interpreter object is gated through a Global Interpreter Lock: the GIL. This allows external code to run in parallel, but anything that is python code or uses python apis has to hold this lock. The purpose of code like this is to allow other compute-bound python threads to get work done while this thread takes a potentially long-blocking, io-heavy system call. The problem arises when a process attempts to fork in a thread, but is interrupted at *. Specifically, another python thread is allowed to run, then the system lets the original thread continue running either while the other thread is still running or has simply been interrupted inside python code. Note: this can happen even in the mono-processor case (but significantly more rarely). The problem is that the other thread in this case will be holding the GIL when we fork. The new process will wake up and try to grab the GIL, waiting for it to be released by a thread that no-longer exists in the new world. Deadlock #1. The only reason that this problem does not manifest all the time is that the threads in the pytest runner are very I/O and sleep heavy. They are much more likely to stop running when they take a system call than to be shut down by the scheduler when still running. Even so, this problem manifests 2-4 times for me on a typical -j7 debug test suite run on my 4 core + hyperthreading i7. However, the testrunner code is not unprepared for this situation. As noted in the background, we would issue a kill after the timeout, unblocking things. However, we do not, because of: Problem 2: Subprocess.Popen wants to emit a nice exception for the parent if exec fails on the child, i.e. no such file, you don't have permission, that's not an executable, etc. To do this it creates a pipe in CLOEXEC mode. The parent reads one end. It doesn't expect to get any data out, it just expects an error when the child execs, implicitly closing it. But the child may have frozen in the C code immediately after the fork, well before it gets to run more python code. In this case the code it would run is the code that would run the exec. Deadlock #2. Conclusion: Now we have enough information to diagram the failure: 1) The child process forks but never execs (Problem 1). 2) The parent never leaves subprocess.Popen (Problem 2). 3) The forking thread in the parent process never passes the Popen result to the job thread. 4) The job thread times out on its initial join. 5) The job thread can't kill anything: it needs the pid stored in the Popen result which it was never passed. 6) The job thread assumes the forking thread died with an exception and unconditionally calls join on the thread to clean it up. 7) This thread is not dead, but is deadlocked at #2, waiting on deadlock #1. 8) The joining thread deadlocks (#3). 9) The test suite finishes and calls join on all the job threads. 10) The test suite deadlocks (#4). We are now in a state where we need to manually signal the test suite in order to kill it. Remediation: Easy: In (5) we can put a timeout on the join. We will still have the problem of massive random pauses and random oranges, but at least we won't put the test-suite in a state where it needs to be manually killed. Impossible: (1) and (2) are Python bugs. There has been a lot of talk in their community about how to address these. Their solution (the last conversation I read on this was ~6 months ago, so this may be out of date) was to tell program authors not to use threads and forking at the same time. We should not expect them to help with this. Correct: Rewrite. This is not how job servers should be written, for reasons that should now be evident. The correct way to write this program is a while(True) wait() loop that manages all jobs off of the main threads. This would have the added benefit of being tremendously less complicated. This should even be possible without having to specialize for each platform, if we use Python. Wider applicability: Whether you see these errors or not is quite dependent on the low-level scheduling details of the kernel version and configuration, the test suite, the processor, the background workload, cosmic rays, etc. However, the code in question is not conditioned on any specific operating system, so it is probably applicable to some degree on all fork/exec OS's. Windows uses fork/exec too in this code, but through an emulation layer. I am not sure if this leaves it vulnerable. The jit test harness uses the same code, but appears to hit this edge case less frequently. Both of these harnesses were copied from other test harnesses in mozilla's tree, so this probably has quite wide impact. Test harness owners should give their code a look to see if they suffer from the same root problem.
Great writeup. Rewriting and simplifying the core work loop would be great. We might need to use different techniques on Unix and Windows: Python doesn't have os.wait on Windows. (And this problem doesn't seem to happen on Windows machines, at least not on mine.)
Created attachment 595920 [details] [diff] [review] v0: Appears to work locally. https://tbpl.mozilla.org/?tree=Try&rev=c4edef706927
Assignee: general → terrence
Status: NEW → ASSIGNED
Created attachment 596093 [details] [diff] [review] v1: Cleaned up and added comments. No oranges on J tests.
Comment on attachment 596093 [details] [diff] [review] v1: Cleaned up and added comments. Review of attachment 596093 [details] [diff] [review]: ----------------------------------------------------------------- This is a great improvement to the task system, and the structure is good. I have a bunch of change requests, but they are all details. ::: js/src/tests/jstests.py @@ +192,5 @@ > except ImportError: > pass > results.pb = pb > > test_list = [ TestTask(test) for test in tests ] TestTask is just a useless wrapper with the new task loop, so you should probably create that only for the old version, and just pass the straight test list to the new version. @@ +197,4 @@ > > + if (sys.platform.startswith('linux') or > + sys.platform.startswith('darwin') > + ): Let's factor this part into RunAllTests(test_list) (or some such) which just dispatches to RunAllTests_Unix or RunAllTests_Windows ::: js/src/tests/unix.py @@ +1,1 @@ > +import os, sys, select The name of this module is too generic. Maybe tasks_unix.py (and also rename workers.py to tasks_win.py)? @@ +3,5 @@ > +from tests import TestOutput, NullTestOutput > + > +PROGRESS_BAR_GRANULARITY = 0.1 #sec > + > +class TaskStruct(object): This can be called just 'Task'. @@ +6,5 @@ > + > +class TaskStruct(object): > + def __init__(self, test, pid, stdout, stderr): > + self.test = test > + self.cmd = test.test.get_command(test.js_cmd_prefix) If you just pass in the tests objects, all the 'test.test' won't be needed any more. @@ +22,5 @@ > + (rout, wout) = os.pipe() > + (rerr, werr) = os.pipe() > + > + # Fork it. > + rv = os.fork() Comment is unnecessary. Please also delete the rest of the comments below in this function, except for the very helpful '# Parent' and '# Child'. @@ +50,5 @@ > + > +def get_max_wait(tasks, results, timeout): > + """ > + Sift through the task list and get the maximum time we can safely wait > + without overflowing any tasks timeout. Revise to: Return the maximum time we can wait before any task should time out. @@ +62,5 @@ > + wait = wait.total_seconds() > + > + # Note select timeout forever is None. > + if wait == 0: > + return None I don't understand what this is for. @@ +66,5 @@ > + return None > + > + # If we have a progress-meter, update it constantly. > + if results.pb is not None: > + wait = min(wait, PROGRESS_BAR_GRANULARITY) This seems OK, but it also doesn't really seem necessary to update the progress bar constantly when no tests are finishing. @@ +77,5 @@ > + """ > + rv = os.read(fd, 4096) > + frags.append(rv) > + while len(rv) >= 4096: > + rv = os.read(fd, 4096) Why >=? Isn't read supposed to read max nbytes data? Another question about this: is Python os.read() blocking or nonblocking? E.g., what happens if exactly 4096 bytes are available? Will the second call to read() block? @@ +80,5 @@ > + while len(rv) >= 4096: > + rv = os.read(fd, 4096) > + frags.append(rv) > + > +def wait_for_input(tasks, timeout): This should probably be renamed to something like 'read_available_input', since it actually does read everything rather than just waiting. @@ +115,5 @@ > +def timed_out(task, timeout): > + """ > + Return True if the given task is over its time alotment. Timeout may also > + be falsy, indicating that the task may not be timed out at this point. > + """ Revise to something like: Return True if the given task has been running for longer than |timeout|. |timeout| may be falsy, indicating an infinite timeout (in which case timed_out always returns False). @@ +126,5 @@ > + """ > + Search for children of this process that have finished. If they are tasks, > + then this routine will clean up the child and send a TestOutput to the > + results channel. This method returns a new task list that has had the ended > + tasks removed. Any particular reason to return a new task list instead of mutating the input? Mutating seems easier, and more what I would expect such a function to do. @@ +134,5 @@ > + pid, status = os.waitpid(0, os.WNOHANG) > + if pid == 0: > + break > + except OSError, e: > + if e.errno == 10: Please give |10| a name. @@ +165,5 @@ > + os.kill(task.pid, 9) > + > +def mainloop(tests, results, options): > + # Reverse for fast pop off end. > + tests.reverse() Conversely, I was surprised that this function would mutate its input. It probably doesn't matter, but it seems slightly nicer to take a copy if you're going to empty the list. @@ +172,5 @@ > + tasks =  > + > + while len(tests) or len(tasks): > + # spawn new tasks > + while len(tests) and len(tasks) < options.worker_count: This comment, and the ones below, are unnecessary. The first two comments in the method are good: the first one explains special code and the second a key variable. @@ +175,5 @@ > + # spawn new tasks > + while len(tests) and len(tasks) < options.worker_count: > + test = tests.pop() > + if test.test.enable or options.run_skipped: > + tasks.append(spawn_test(test)) This isn't quite right. I'm not exactly sure what is right: It would be nicer if we could just filter out skipped tests in the front end, before we even reach the worklist. The problem is that Tinderbox apparently wanted some 'TEST-KNOWN-FAIL' messages for those. It'd probably be nice to take that out of the TB output: I doubt it serves any useful function. But if you want to preserve present behavior, then this part should just push a NullTestOutput when you skip a test. @@ +187,5 @@ > + tasks = reap_zombies(tasks, results, False) > + > + # Forcibly halt and then reap anything living past its timeout. > + if kill_undead(tasks, results, options.timeout): > + tasks = reap_zombies(tasks, results, options.timeout) Would it work to call kill_undead first and then call reap_zombies just once? ::: js/src/tests/workers.py @@ +57,1 @@ > def __init__(self, id, tasks, sink, verbose): Any particular reason for this change?
Excellent comments, and I'll get a new patch up with them applied shortly. (In reply to David Mandelin from comment #4) > @@ +62,5 @@ > > + wait = wait.total_seconds() > > + > > + # Note select timeout forever is None. > > + if wait == 0: > > + return None > > I don't understand what this is for. The default timeout if none is set on the command-line is 0, but if we call select with a timeout of 0, it will return immediately. What we need to pass to select in this case is None. I'll update the comment. > @@ +66,5 @@ > > + return None > > + > > + # If we have a progress-meter, update it constantly. > > + if results.pb is not None: > > + wait = min(wait, PROGRESS_BAR_GRANULARITY) > > This seems OK, but it also doesn't really seem necessary to update the > progress bar constantly when no tests are finishing. Not necessary, but it does distinguish the "test suite is frozen" case from the "nothing is happening case", which I find nice. > @@ +77,5 @@ > > + """ > > + rv = os.read(fd, 4096) > > + frags.append(rv) > > + while len(rv) >= 4096: > > + rv = os.read(fd, 4096) > > Why >=? Isn't read supposed to read max nbytes data? > > Another question about this: is Python os.read() blocking or nonblocking? > E.g., what happens if exactly 4096 bytes are available? Will the second call > to read() block? You are right. os.read() is a slim wrapper around the read syscall -- the behavior is: return all data in the buffer if there is any there, or wait for more. We can put these fd's into non-blocking mode or insert a trivial insta-timeout select() call here. The problem is that we have to drain the full buffer as soon as the program exits, or we need to track dead-but-not-drained tasks in the task list. > @@ +126,5 @@ > > + """ > > + Search for children of this process that have finished. If they are tasks, > > + then this routine will clean up the child and send a TestOutput to the > > + results channel. This method returns a new task list that has had the ended > > + tasks removed. > > Any particular reason to return a new task list instead of mutating the > input? Mutating seems easier, and more what I would expect such a function > to do. I used a comprehension because it's more convenient to use than list.remove if you are only matching on one property of an object. But this is the only place that uses a functional style, so I'll just rewrite it. > @@ +187,5 @@ > > + tasks = reap_zombies(tasks, results, False) > > + > > + # Forcibly halt and then reap anything living past its timeout. > > + if kill_undead(tasks, results, options.timeout): > > + tasks = reap_zombies(tasks, results, options.timeout) > > Would it work to call kill_undead first and then call reap_zombies just once? I want to give all processes that finished before we woke up a chance to exit cleanly before we reap them and potentially set them as timed-out. This is in case the testrunner gets delayed by a significant amount before waking up, for example, if we are swapping heavily. > ::: js/src/tests/workers.py > @@ +57,1 @@ > > def __init__(self, id, tasks, sink, verbose): > > Any particular reason for this change? It's one of the places I looked for the deadlock before I found it elsewhere. I didn't undo this one because it was more concise. In particular, I prefer subclassing Thread and calling setDaemon on yourself in __init__ because there is then (almost) no way to construct a non-daemon Thread of this type.
Created attachment 597115 [details] [diff] [review] v2: With review feedback. Applied review comments, renamed the new file to tasks_unix, renamed workers to tasks_win, collapsed TestTask's one property into TestCase since we need it everywhere anyway, moved check_manifest into manifest.py, moved results handling functionality to new results.py file, handle skipped test list at the top level.
Comment on attachment 597115 [details] [diff] [review] v2: With review feedback. Review of attachment 597115 [details] [diff] [review]: ----------------------------------------------------------------- I like your thinking about being able to distinguish frozen vs. nothing happening. Thanks for the refactorings!
Attachment #597115 - Flags: review?(dmandelin) → review+
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla13
You need to log in before you can comment on or make changes to this bug.