Open Bug 819048 Opened 7 years ago Updated 2 years ago

Add cross-platform parallel job processing framework to tree

Categories

(Testing :: Mozbase, defect)


Tracking

(Not tracked)

People

(Reporter: gps, Unassigned)

Attachments

(1 file)

I would like the mozilla-central tree to eventually contain some Python code for *easily* and *robustly* launching and receiving results from several jobs in parallel. We would use this for things like parallel test execution and, well, general tasks (e.g. parsing moz.build files).

AFAIK, the closest solutions we have today are multiprocessing.Pool and make/pymake.

We can't use multiprocessing.Pool for a few reasons. First, it won't work on BSDs (see bug 808280). Second, it has known problems with SIGINT/KeyboardInterrupt (http://www.bryceboe.com/2012/02/14/python-multiprocessing-pool-and-keyboardinterrupt-revisited/).
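
For reference, the workaround people usually reach for looks roughly like the sketch below: workers ignore SIGINT so that only the parent sees Ctrl-C, and the parent waits with a timeout so the wait stays interruptible on Python 2. The names ignore_sigint, square, and run_interruptible are hypothetical and exist only for illustration; this is not what pymake does, just the kind of boilerplate every consumer ends up copying.

```python
import multiprocessing
import signal


def ignore_sigint():
    """Pool initializer: workers ignore Ctrl-C so only the parent handles it."""
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def square(x):
    # Hypothetical job, used only for illustration.
    return x * x


def run_interruptible(func, items, processes=2, timeout=3600):
    """Map func over items in a Pool that the parent can interrupt."""
    pool = multiprocessing.Pool(processes, initializer=ignore_sigint)
    try:
        # Waiting with a timeout keeps the parent responsive to Ctrl-C.
        results = pool.map_async(func, items).get(timeout)
    except BaseException:
        # Covers KeyboardInterrupt and timeouts: kill the workers, re-raise.
        pool.terminate()
        pool.join()
        raise
    pool.close()
    pool.join()
    return results
```

On Windows, run_interruptible(square, range(5)) would need to be called from under an if __name__ == "__main__": guard, since the workers re-import the main module there.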

The current implementation of pymake uses multiprocessing.Pool. Fortunately, we only use pymake on Windows. The KeyboardInterrupt issues probably exist, but AFAICT nobody has complained about them.

Using GNU make would be interesting. We'd have to make a few sacrifices, notably not being able to dynamically inject jobs into the worker pool. We could even override the SHELL variable in make so all rules spawn a Python interpreter! It's doable, but feels a bit hacky to me. I'd prefer a pure Python solution.

Anyway, multiprocessing.Pool and pymake are non-starters for a cross-platform solution that "just works." We need something that actually works. Maybe that is built on top of multiprocessing and/or make/pymake and it's simply an abstraction layer that does the right thing depending on what the platform supports. I don't know.

If we can import an existing package that does this, great. If not, I suppose we'll write our own (hopefully contributing to PyPI, of course).

I haven't exhaustively searched the Internets for existing solutions. If anyone knows of any, please suggest them!

Urgency on this is low. But, I imagine people will start wanting it more and more once we do things like converting test runners to execute tests in parallel.

I'm filing this under mozbase because, well, where else would I put it?
I don't really think we should block multiprocessing.Pool on BSD availability. As long as we gracefully fall back to serial execution that seems fine. Perhaps the SIGINT bits are a dealbreaker. Related: bug 795360 has a contributor making symbolstore.py parallel using multiprocessing.Pool.
If we use multiprocessing where it is available and fall back to serial execution, I'd be happy with that. I just want to avoid the problem where we have N consumers in the tree all writing the same ugly workarounds for multiprocessing shortcomings.
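
A minimal sketch of what such a shared helper could look like, so the fallback logic lives in one place. The name parallel_map and its serial flag are hypothetical, and the exception list is an assumption about how an unusable platform shows up (for example an ImportError when the semaphore support is missing).

```python
def parallel_map(func, items, processes=None, serial=False):
    """Run func over items in a Pool when possible, otherwise serially.

    serial=True forces the plain loop, which is handy for debugging.
    """
    items = list(items)
    pool = None
    if not serial:
        try:
            import multiprocessing
            pool = multiprocessing.Pool(processes)
        except (ImportError, OSError, NotImplementedError):
            # Assumed failure modes on platforms without working
            # multiprocessing primitives: degrade to serial execution.
            pool = None
    if pool is None:
        return [func(item) for item in items]
    try:
        return pool.map(func, items)
    finally:
        pool.close()
        pool.join()
```

Consumers would call parallel_map(job, inputs) everywhere and never touch Pool directly. In the parallel path func has to be picklable (a module-level function), which the serial path does not require.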

And, if we roll our own, we can also incrementally add features like, say, integration with psutil for per-process resource monitoring. I know I'd love to effortlessly record cycle counts for individual tests :D
Are these jobs always going to be external processes? If so can't we just use subprocess? There might be some tricks necessary to deal with pipes without blocking, but I think it shouldn't be too hard.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #3)
> Are these jobs always going to be external processes? If so can't we just
> use subprocess? There might be some tricks necessary to deal with pipes
> without blocking, but I think it shouldn't be too hard.

I was thinking this would be mostly for native Python parallelism. You pass in a Python function and it does something and returns a Python object. Now, said job function could spawn a child process if it wanted to. I just don't want to pigeonhole us into thinking "every job is a new process."
http://pypi.python.org/pypi/pprocess/0.5 seems interesting. Not sure if it works on Windows.
Attached patch: Proof of concept
After investigating a lot of packages on PyPI, I couldn't find anything that fulfilled our requirements. Most packages were lacking Windows support. Others had some really funky APIs or were doing some very suspect things (like a lot of low-level socket or win32 operations which scared me a bit).

So, I went ahead and pieced something together. I built things on top of multiprocessing.Pipe. I've tested it on OS X and Windows and it seems to work just fine! I /think/ it will still work on BSD (the part of multiprocessing that isn't BSD friendly is locking and AFAICT multiprocessing.Pipe and multiprocessing.Connection don't use locks).

I still need to find a select() substitute that works with pipes on Windows because the sleep() to prevent the busy loop really slows things down. And, I need a lot more guards and exception checks in areas. But, I think this is the start of something that we could use.
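
To make the sleep() problem concrete, here is a rough sketch of a poll loop over several multiprocessing.Pipe connections. It is not the code from the attached patch; collect_results and poll_interval are hypothetical names, and it assumes each worker sends exactly one result.

```python
import multiprocessing
import time


def collect_results(connections, poll_interval=0.01):
    """Receive one message per connection using a sleep-based poll loop."""
    connections = list(connections)
    results = [None] * len(connections)
    pending = set(range(len(connections)))
    while pending:
        progressed = False
        for index in sorted(pending):
            conn = connections[index]
            if conn.poll():  # non-blocking check for available data
                results[index] = conn.recv()
                pending.discard(index)
                progressed = True
        if not progressed:
            # Nothing was ready: sleep rather than spin. Every idle pass
            # costs up to poll_interval of added latency.
            time.sleep(poll_interval)
    return results
```

The sleep is what avoids the busy loop, and it is also where the slowdown comes from: a result that arrives just after a poll pass waits out the whole interval before it is noticed.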

I'd appreciate drive-bys on the API and implementation. The abstraction of multiple implementations per task is to eventually allow one-time serialization of callables. This is not yet implemented and leads to some unnecessary overhead.
Under the hood multiprocessing.Pipe is using CreateNamedPipe on Windows. Sadly, it does not set FILE_FLAG_OVERLAPPED (at least on Python 2.7 - I haven't checked Python 3). So, the handle returned is not usable by WaitForSingleObject or WaitForMultipleObjects. AFAIK, this means it's impossible to perform a select() equivalent on multiple named pipes. (.poll() is serviced by WaitNamedPipe which only allows 1 pipe handle). See also http://msdn.microsoft.com/en-us/library/windows/desktop/aa365781%28v=vs.85%29.aspx

I suppose we could implement our own bi-directional pipe in-line.

Until then, we'll have to make do with a sleep() based loop. Sadness.

I should seriously consider submitting some patches to CPython...
Python 3.3 opens the pipes with overlapped I/O and uses WaitForMultipleObjects internally for all operations \o/

Too bad we can't require Python 3 :(
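
For comparison, on Python 3.3+ the same kind of loop can block in multiprocessing.connection.wait() instead of sleeping. This is only a sketch under that version assumption; collect_with_wait is a hypothetical name, and it again assumes one result per connection.

```python
import multiprocessing
from multiprocessing.connection import wait


def collect_with_wait(connections, timeout=None):
    """Receive one message per connection, blocking in wait() (Python 3.3+)."""
    pending = {conn: index for index, conn in enumerate(connections)}
    results = [None] * len(pending)
    while pending:
        # wait() returns the subset of connections that are ready to read.
        ready = wait(list(pending), timeout)
        if not ready:
            raise RuntimeError("timed out waiting for results")
        for conn in ready:
            results[pending.pop(conn)] = conn.recv()
    return results
```

No sleep interval is involved, so a result is picked up as soon as the operating system reports the connection readable.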
The proof of concept breaks for large payloads. If we attempt to send too large a payload through the pipe, the pipe exceeds capacity and read() and write() both block. Bad news bears.
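
One possible mitigation, sketched here and not implemented in the patch, is to push large sends onto a helper thread so the main loop is always free to read. send_in_background is a hypothetical helper, and this assumes the blockage is a send() that cannot finish until the other end starts reading.

```python
import multiprocessing
import threading


def send_in_background(conn, payload):
    """Send payload on a helper thread so the caller can keep reading."""
    thread = threading.Thread(target=conn.send, args=(payload,))
    thread.daemon = True  # do not keep the process alive for a stuck send
    thread.start()
    return thread
```

The caller keeps the returned thread and joins it once the matching result has been received. Only one thread should write to a given connection end at a time.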

I've also been thinking about use cases a bit. We essentially have two models of execution:

1) tasks are Python callables
2) tasks are processes to execute

If we wanted to, we could certainly employ different models of execution for each. Although, requiring everything to be #1 would facilitate #2 (at the slight expense of unneeded processes).
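
To illustrate how #1 can cover #2, a process-style task can be expressed as an ordinary callable that spawns the child itself. The sketch below uses a hypothetical run_process job; the framework would only ever see a function that returns a Python object.

```python
import subprocess
import sys


def run_process(args):
    """Job callable: run a command, return (exit code, stdout, stderr)."""
    proc = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    # communicate() drains both pipes, so the child cannot stall on a
    # full stdout or stderr buffer.
    stdout, stderr = proc.communicate()
    return proc.returncode, stdout, stderr
```

A call such as run_process([sys.executable, "-c", "print('hi')"]) would then be scheduled like any other job. The cost is the extra worker process sitting between the scheduler and the child.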

Anyway, I may continue to hack on this in my spare time. Although, other projects are more important to me right now. If anyone else wants to have a go at it...
Blocks: 845748