processhandler.py frequently does very short waits

(Reporter: roc, Unassigned)

This is likely a performance problem due to frequent unnecessary context switches. It certainly shows up running the harness under rr.

In processhandler.py, _read does:
    line, callback = queue.get(True, 0.02)
That is, do a blocking read from the queue with a 20ms timeout. That's implemented in Queue.py as
                endtime = _time() + timeout
                while not self._qsize():
                    remaining = endtime - _time()
                    if remaining <= 0.0:
                        raise Empty
                    self.not_empty.wait(remaining)
That is probably reasonable, although it does an unnecessary system call and may iterate a few times. The real problem is in threading.py, in the wait method:
                # Balancing act:  We can't afford a pure busy loop, so we
                # have to sleep; but if we sleep the whole timeout time,
                # we'll be unresponsive.  The scheme here sleeps very
                # little at first, longer as time goes on, but never longer
                # than 20 times per second (or the timeout time remaining).
In other words, this is just a horrible implementation that can't actually get a wakeup when the condition variable is signaled; instead it just polls, with a heuristic to choose the inter-poll interval. In practice I see a series of waits of 1, 2, 4, 8 and 16ms, with an occasional 0.4ms or something like that to eat up the remaining time.
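
For reference, the timed-wait path in CPython 2.7's Condition.wait looks roughly like this (paraphrased, with the lock bookkeeping omitted); a notify() is only noticed at the next poll, so the wakeup gets delayed by whatever the current sleep interval is:

    endtime = _time() + timeout
    delay = 0.0005  # 500 us; doubled before the first sleep, so 1ms first
    while True:
        gotit = waiter.acquire(0)            # non-blocking try-acquire
        if gotit:
            break                            # the condition was notified
        remaining = endtime - _time()
        if remaining <= 0:
            break                            # timed out
        delay = min(delay * 2, remaining, .05)   # cap at 50ms
        _sleep(delay)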

Worse still, it appears that we have multiple threads doing these waits simultaneously, multiplying this overhead.

Surely it can't be that hard to use a properly implemented queue/condition variable which doesn't wake up until the condition is signaled or the timeout has expired. (He said hopefully!)

:jgraham wrote an interesting thought on IRC:

jgraham wonders why we do queue.get(True, 0.02), rather than queue.get(True, min(timeout,output_timeout) - now)

We should try that; maybe this will solve the problem here.
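
A rough sketch of what that could look like (the deadline arguments are illustrative absolute times, not the actual processhandler.py attributes):

    import time
    from Queue import Empty

    def get_next_line(queue, timeout_deadline, output_timeout_deadline):
        # Block until whichever deadline comes first, instead of waking
        # every 20ms; the deadlines are absolute time.time() values.
        remaining = min(timeout_deadline, output_timeout_deadline) - time.time()
        if remaining <= 0:
            raise Empty
        return queue.get(True, remaining)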

Well, if I understand roc's comment correctly, it won't *fix* things, because the actual problem is in the threading.wait implementation in the stdlib, which is a bit harder to fix. But we can at least avoid some of the polling this way.

Polling less frequently should help with the unnecessary context switching aspect of the problem, but due to threading.wait's implementation, it could actually make things *slower*.

The wait method starts out sleeping for very short intervals and progressively sleeps longer, up to 50ms. If we get the event (an output line) early in this cycle, that's up to 50ms of unnecessary blocking we could be doing. By calling queue.get every 20ms, we make sure threading.wait never reaches that long an interval (I don't know how quickly its sleep interval rises, but I'd guess it would get to about a 10ms sleep).
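
For what it's worth, a quick simulation of that doubling sleep against a 20ms timeout (a hedged reading of the 2.7 behaviour, not measured data) suggests the longest single sleep would be about 8ms:

    # Simulate the exponential-backoff sleeps for a 20ms timeout with no
    # event arriving, assuming the delay doubles from 1ms and is capped
    # by the time remaining.
    timeout = 0.02
    elapsed, delay, delays = 0.0, 0.0005, []
    while elapsed < timeout:
        delay = min(delay * 2, timeout - elapsed, 0.05)
        delays.append(delay)
        elapsed += delay
    print delays  # roughly [0.001, 0.002, 0.004, 0.008, 0.005]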

It comes down to whether the context switching is the primary cause of slowness, or the unnecessarily long waiting. Like roc mentioned, if we could wake up on a signal, both of these problems disappear.

I think this is an inherent problem in the threading library, and to fix it we'd need to move to something else. jgraham mentioned that this is fixed in Python 3's version of threading; maybe someone has backported it to Python 2? Another alternative might be the greenlet library [1].

Any performance win we can make in mozprocess is worth it if for no other reason than it is so ubiquitous.

[1] https://github.com/python-greenlet/greenlet

I think the multiprocessing Queue has a better event-based wakeup, but I don't know if the other overhead of that is prohibitive.
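
For reference, multiprocessing.Queue exposes the same get(block, timeout) signature, so trying it would be a small change; this is only a hedged sketch, not a tested swap:

    # multiprocessing.Queue's timed get() waits on the underlying pipe
    # rather than sleep-polling, but every item is pickled through that
    # pipe, so unpicklable things (e.g. callbacks) could no longer be
    # put on the queue directly.
    from multiprocessing import Queue as MPQueue
    from Queue import Empty

    queue = MPQueue()
    # the producer side would do queue.put(line) for each line of output
    try:
        line = queue.get(True, 0.02)
    except Empty:
        line = None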

FWIW it looks like this is fixed in Python 3 (but migrating to that is not sensible at this time).

There's another alternative that still involves using the threading library. If we don't pass a timeout to threading.wait, it won't poll [1]. So we could call `queue.get(True)` without a timeout, then handle timeout and outputTimeout in a separate thread from the one calling queue.get (a rough sketch follows after the links below). Someone on Stack Overflow made a generic implementation of this [2]. This still requires extra thread overhead, but probably less than using multiprocessing. We might also be able to re-use the main thread for this.

I guess we probably need to just implement a few of these ideas and do some profiling.

[1] https://github.com/python/cpython/blob/2.7/Lib/threading.py#L339
[2] http://stackoverflow.com/a/2231023/794460
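
A rough sketch of that idea (names are illustrative, not the real mozprocess API; note that threading.Timer itself waits on an Event with a timeout internally, so some polling moves into the timer thread, but the reader thread no longer wakes 50 times a second):

    import threading

    def read_queue(queue, output_timeout, on_output_timeout):
        # Enforce output_timeout with a watchdog timer so the reader can
        # block on queue.get(True) with no timeout (a plain lock acquire,
        # no sleep loop in Condition.wait).
        timer = threading.Timer(output_timeout, on_output_timeout)
        timer.daemon = True
        timer.start()
        while True:
            item = queue.get(True)      # no timeout: no polling
            timer.cancel()              # output arrived before the deadline
            if item is None:            # None is the EOF sentinel here
                break
            line, callback = item
            callback(line)
            # re-arm the watchdog for the next line
            timer = threading.Timer(output_timeout, on_output_timeout)
            timer.daemon = True
            timer.start()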

Maybe on a multicore machine (or VM) the performance impact is minor, since in a lot of cases you could process these pointless wakeups without actually context switching away from a Firefox thread. The bogosity of this code still grates :-).