Closed Bug 1195767 Opened 9 years ago Closed 9 years ago

mozilla::TaskQueues seem pretty heavy weight on OS X

Categories: Core :: XPCOM, defect
Priority: Not set
Severity: normal
Status: VERIFIED FIXED
Target Milestone: mozilla43
Tracking Status: firefox42 --- verified; firefox43 --- verified
People: Reporter: jrmuizel; Assigned: froydnj
References: Blocks 1 open bug
Attachments: 7 files

While playing a video on YouTube, I see a number of threads using 6% CPU in Instruments. The majority of this time seems to be spent in synchronization overhead, spread across functions like PR_Wait, PR_EnterMonitor, and PR_ExitMonitor.

I don't expect this bug to be actionable without some kind of microbenchmark but I'm filing it now to make people aware of a potential problem.
Blocks: 1195790
So, the task queues run on top of a thread pool. I'd expect the real work to be distributed across the threads in the thread pool, and for those threads to be idle a lot of the time as well.

Are the profiler samples in synchronization primitives you're seeing actually overhead, or is it just the profiler's way of saying that these threads spend most of their time blocked on their input queues?
(In reply to Bobby Holley (:bholley) from comment #1)
> So, the task queues run on top of a thread pool. I'd expect the real work to
> be distributed across the threads in the thread pool, and for those threads
> to be idle a lot of the time as well.
> 
> Are the profiler samples in synchronization primitives you're seeing
> actually overhead, or is it just the profiler's way of saying that these
> threads spend most of their time blocked on their input queues?

I believe, but am not certain, that this is actual overhead and not just blocked time. Instruments has an option for 'Record waiting time', and that was not checked while I was profiling.
This may not be related, but Jeff previously found that OS X locks on 10.10 were weirdly slow (as seen in bug 1137963), to the extent that debug builds were much slower.
I just profiled YouTube on OS X. I see the overhead, but it's not in TaskQueue per se; it's mostly in nsThreadPool.

I see the majority of CPU time being spent in nsThreadPool::Run as expected, but interestingly, only about 15% of that is spent in mozilla::TaskQueue::Runner::Run, where the actual work happens. The rest of it is spent in PR_Wait, PR_EnterMonitor, and PR_ExitMonitor.

So either the measurements are screwy (and reporting waiting time as active CPU time), or we're wasting a huge amount of CPU time on super-slow synchronization primitives. I'm tempted to believe the former, but the incident in comment 3 shows real-world (non-profiler) confirmation of problems like that.

So I think we need to investigate this further. Nathan, do you have cycles to take a look, or should we get it on the radar of the Low Level Tools Team?
Flags: needinfo?(nfroyd)
Here's a screenshot after my profiler run. Don't forget to profile the plugin-container and not firefox. ;-)
I will take a look, but it will be later this week.  Setting ni? for myself as a reminder.
Flags: needinfo?(nfroyd)
Ah, looks like a Bugzilla bug.
Flags: needinfo?(nfroyd)
Jeff, I assume your measurements were taken on 10.10?  Do you have measurements for 10.9?
Flags: needinfo?(jmuizelaar)
My measurements are from 10.9
Flags: needinfo?(jmuizelaar)
Mine were on 10.10.
I'm doing the measurements on 10.10, and I'm kind of a novice when it comes to Instruments, so I've just been poking around.  I see roughly the same thing as comment 5: for a thread running a thread pool, there's ~15-20% in TaskQueue::Runner::Run, and the rest spread out in synchronization functions.  This is all with the Time Profiler in Instruments.

Activity Monitor says the CPU is definitely working.

The weird thing is that after turning on NSPR logging for nsThreadPool, I reach a steady state where there are no messages being emitted to the log...but I still see samples in Instruments accumulating underneath the synchronization primitives called by nsThreadPool::Run.  So there's some loop happening that bypasses the logging, but I haven't been able to construct a scenario where that happens yet.
...or it's possible that Instruments is just accumulating samples in the synchronization routines because that's where we happen to be paused.  It's not clear to me whether the samples are being accumulated because there's actual work going on or because the thread is just stopped there.

I think there is actual work going on, because there are other thread pools (like the image decoding pool) and, AFAICT, those threads are not accumulating samples.  It's only a handful of threads that are getting sampled, and those all line up with media threads that I can see in lldb.
Attached file playback-thread.log
Hi, I'm a Gecko hacker who just learned how much fun it is to debug e10s code.

OK, after learning how NSPR logs work with child processes, I can see the MediaPlayback thread pool running.  And it looks like we just have threads waking up fairly constantly to do nothing but fall back asleep again.

The attached log shows NSPR_LOG_MODULES=nsThreadPool:5 for one of the MediaPlayback threads.  It's pretty big, but the following sort of behavior is not unusual:

2015-08-27 22:19:09.028607 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) MediaPlayback waiting [60000]
2015-08-27 22:19:09.029467 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) done waiting
2015-08-27 22:19:09.029481 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) MediaPlayback waiting [60000]
2015-08-27 22:19:09.029628 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) done waiting
2015-08-27 22:19:09.029643 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) MediaPlayback waiting [59999]
2015-08-27 22:19:09.126898 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) done waiting
2015-08-27 22:19:09.126904 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) MediaPlayback waiting [59902]
2015-08-27 22:19:09.127437 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) done waiting
2015-08-27 22:19:09.127457 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) MediaPlayback waiting [59902]
2015-08-27 22:19:09.127677 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) done waiting
2015-08-27 22:19:09.127870 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) MediaPlayback waiting [59901]
2015-08-27 22:19:09.128693 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) done waiting
2015-08-27 22:19:09.128711 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) MediaPlayback waiting [59900]
2015-08-27 22:19:09.128957 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) done waiting
2015-08-27 22:19:09.128970 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) MediaPlayback waiting [59900]
2015-08-27 22:19:09.131001 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) done waiting
2015-08-27 22:19:09.131144 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) MediaPlayback waiting [59898]
2015-08-27 22:19:09.131224 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) done waiting
2015-08-27 22:19:09.131249 UTC - 511442944[11fcb5bc0]: THRD-P(11fd2d710) MediaPlayback running [120111d60]

That is, within ~100ms, we repeated the following sequence nine times:

- started waiting on the thread pool's monitor:

http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThreadPool.cpp#219

- got woken up from our wait
- dropped the monitor, since we exited a ReentrantMonitorAutoEnter block
- reacquired the monitor
- reacquired the monitor again for the event queue to check whether we have any events
- dropped the monitor

Locks may be cheap, but they're not free.  And then imagine four threads (or more, since the MediaPlayback pool has a max of 12) in a thread pool doing all this, and the CPU usage numbers and the profiles start to become very believable: we're doing a bad job of farming out the work to the threads in the pool.  NSPR's synchronization primitives are not great, but I don't think they're at fault here.

Eliminating the last two steps from the above is pretty easy.  I'm not sure if there's a better way to do the first four steps.  It might be worth using a different thread pool loop for the media stuff, since the media pools want all the threads running all the time, and taking advantage of that would make managing the pool significantly easier...
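To make the loop shape concrete, here is a minimal sketch of the idle cycle described above, with standard C++ primitives standing in for NSPR's ReentrantMonitor and the real nsThreadPool/nsEventQueue classes (all names hypothetical, not the actual Gecko code):

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

struct ReentrantPoolSketch {
  std::recursive_mutex mMonitor;             // stands in for ReentrantMonitor
  std::condition_variable_any mCV;
  std::deque<std::function<void()>> mEvents;

  bool GetPendingEvent(std::function<void()>& aEvent) {
    // Re-enters the monitor the worker already holds: the redundant
    // acquire/release pair noted in the sequence above.
    std::lock_guard<std::recursive_mutex> lock(mMonitor);
    if (mEvents.empty()) {
      return false;
    }
    aEvent = std::move(mEvents.front());
    mEvents.pop_front();
    return true;
  }

  void WorkerIdleCycle() {
    std::unique_lock<std::recursive_mutex> lock(mMonitor);
    mCV.wait(lock);                 // wait: drops, then reacquires the monitor
    std::function<void()> event;
    if (!GetPendingEvent(event)) {  // reacquire + release, again
      return;                       // nothing to do: back to waiting
    }
    lock.unlock();                  // drop the monitor to run the event
    event();
  }
};

Multiply those lock operations by every thread in the pool on every wakeup and the CPU numbers above start to look plausible.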
Flags: needinfo?(nfroyd)
The comment here suggests that we might AddRef/Release, but we really do
no such thing.  Let's deal with the transfer of ownership directly,
rather than going through nsCOMPtr.  This change makes the code slightly
smaller, and it also makes later refactorings to pull the lock out of
this function easier to do, since we don't have to consider how to hold
the lock within the lifetime of the nsCOMPtr temporary.
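For illustration, a minimal sketch of the before/after shape of this change, with std::unique_ptr standing in for already_AddRefed<nsIRunnable> and hypothetical function names (the real patch uses aRunnable.take() inside nsEventQueue::PutEvent):

#include <cassert>
#include <memory>
#include <utility>

struct Event {};  // stands in for nsIRunnable

// Before: route the incoming owner through a smart-pointer temporary,
// then leak the reference into the raw slot.  The temporary adds
// nothing, and its lifetime has to be reasoned about relative to the
// queue lock.
void PutEventViaTemporary(Event** aSlot, std::unique_ptr<Event> aEvent) {
  std::unique_ptr<Event> temp(std::move(aEvent));  // the extra hop
  assert(!*aSlot);
  *aSlot = temp.release();
}

// After: transfer ownership straight into the slot.
void PutEventDirect(Event** aSlot, std::unique_ptr<Event> aEvent) {
  assert(!*aSlot);
  *aSlot = aEvent.release();
}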

Gerald, you said you looked at some nsThreadPool stuff in bug 1161405, so you
get to review.  If you are overloaded or don't feel comfortable reviewing this
stuff, please say so!  I can find someone else.
Attachment #8657135 - Flags: review?(gsquelart)
Clients of nsEventQueue don't always need fully reentrant monitors.
Let's account for that by having a base class templated on the monitor
type.  This change also opens up the possibility of having the monitor
for the event queue not owned by the event queue itself, but by the
client class, which makes a lot more sense than the current design.
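A rough sketch of the shape this enables, with standard C++ types standing in for Gecko monitors (hypothetical names, not the landed code):

#include <deque>
#include <functional>
#include <mutex>

// The queue is templated on its monitor type and borrows the monitor
// from its caller instead of owning a ReentrantMonitor itself.
template <typename MonitorType>
class EventQueueBase {
public:
  void PutEvent(std::function<void()> aEvent, MonitorType& aMonitor) {
    std::lock_guard<MonitorType> lock(aMonitor);
    mQueue.push_back(std::move(aEvent));
  }

  bool GetPendingEvent(std::function<void()>& aEvent, MonitorType& aMonitor) {
    std::lock_guard<MonitorType> lock(aMonitor);
    if (mQueue.empty()) {
      return false;
    }
    aEvent = std::move(mQueue.front());
    mQueue.pop_front();
    return true;
  }

private:
  std::deque<std::function<void()>> mQueue;
};

// A client that owns one plain mutex shared between its own state and
// its queue: the arrangement the next patch puts nsThreadPool into.
class PoolClient {
  std::mutex mMonitor;
  EventQueueBase<std::mutex> mQueue;
};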
Attachment #8657136 - Flags: review?(gsquelart)
There's no reason nsThreadPool needs to use a reentrant monitor for
locking its event queue.  Having it use a non-reentrant one should be
slightly more efficient, both in the general operation of the monitor,
and that we're not performing redundant locking in methods like
nsThreadPool::Run.  This change also eliminates the only usage of
nsEventQueue::GetReentrantMonitor.
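In miniature, the worker's idle cycle once the pool and its queue share one plain, non-reentrant monitor might look like this (standard C++ stand-ins, hypothetical names):

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

void WorkerIdleCycle(std::mutex& aMonitor,
                     std::condition_variable& aCV,
                     std::deque<std::function<void()>>& aEvents) {
  std::unique_lock<std::mutex> lock(aMonitor);
  while (aEvents.empty()) {
    aCV.wait(lock);  // a single wait; no reentrant re-entry for the queue
  }
  std::function<void()> event = std::move(aEvents.front());
  aEvents.pop_front();
  lock.unlock();     // drop the monitor exactly once to run the event
  event();
}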
Attachment #8657137 - Flags: review?(gsquelart)
The last commit eliminated the only client of this method, so we can
remove it now.
Attachment #8657138 - Flags: review?(gsquelart)
There's no reason to wake up all the threads in a thread pool when one
item gets placed in the queue.  Waking up one will serve the same
purpose and is significantly more efficient for thread pools with large
numbers of threads.

This is likely the most helpful patch, but it's also the riskiest, as it
modifies core behavior of our thread run loop.  Try says everything's green on
x86-64 Linux, however, and broadcast should only be necessary if we were ever
adding multiple items to the queue at once, which we're not.
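The change in miniature, with a std::condition_variable standing in for the NSPR monitor (a sketch under those stand-ins, not the landed patch):

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

struct WakeOneSketch {
  std::mutex mMonitor;
  std::condition_variable mCV;
  std::deque<std::function<void()>> mEvents;

  void PutEvent(std::function<void()> aEvent) {
    std::lock_guard<std::mutex> lock(mMonitor);
    mEvents.push_back(std::move(aEvent));
    // One event, one worker: notify_one() rather than notify_all().
    // Broadcast would only be needed if several events could be queued
    // per notification, which this queue never does.
    mCV.notify_one();
  }
};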
Attachment #8657140 - Flags: review?(gsquelart)
The patch series just posted appears to help some--at least profiling in Instruments no longer shows a mass of samples in locking methods.
Comment on attachment 8657135 [details] [diff] [review]
part 1 - remove nsCOMPtr temporary from nsEventQueue::PutEvent

Review of attachment 8657135 [details] [diff] [review]:
-----------------------------------------------------------------

r+ (looks correct to me), with style suggestions:

::: xpcom/threads/nsEventQueue.cpp
@@ +119,5 @@
>    }
>  
> +  nsIRunnable** location = &mTail->mEvents[mOffsetTail];
> +  MOZ_ASSERT(!*location);
> +  *location = aRunnable.take();

How about we just don't use an indirection, e.g.:
> MOZ_ASSERT(!mTail->mEvents[mOffsetTail]);
> mTail->mEvents[mOffsetTail] = aRunnable.take();
The MOZ_ASSERT only costs in DEBUG&non-optimized mode, which is already slow anyway.
And the hero statement is now clear about what it's doing, without using an indirection over 3 lines.
Of course, the disadvantage is maintenance of this duplicate expression; though hopefully that won't ever be needed. (right?)

If you prefer keeping the indirection, please consider making it |nsIRunnable*&|, and maybe giving it a name that shows its intent, e.g. 'emplacementLocation' or similar.
Attachment #8657135 - Flags: review?(gsquelart) → review+
Comment on attachment 8657136 [details] [diff] [review]
part 2 - create an nsEventQueueBase templated over the monitor type

Review of attachment 8657136 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with question and optional nit:

::: xpcom/threads/nsEventQueue.cpp
@@ +208,5 @@
> +
> +  return Base::Count(mon);
> +}
> +
> +#undef RAII_MONITOR_VARIABLE

What is this #undef about?

::: xpcom/threads/nsEventQueue.h
@@ +32,3 @@
>  // A threadsafe FIFO event queue...
> +template<typename MonitorType>
> +class nsEventQueueBase

This 'nsEventQueueBase' will not be public:ly derived, and will even be used directly (e.g., in nsThreadPool in the following patch), so is it really a *base* class?

Just naming it 'nsEventQueue' would probably cause too much trouble to be worth the effort here, as we'd either:
- have to rename the original nsEventQueue everywhere;
- Or maybe use more complicated templating magic, e.g.:
> template <typename MonitorType = ReentrantMonitor, bool InternalMonitor = true> nsEventQueue; with specializations for InternalMonitor being true/false.
(I'm actually liking this template better, but I understand if you don't want to rework everything now -- maybe a future bug, if you agree?)

Otherwise, you could rename nsEventQueueBase; the most explicit name might be something like 'nsExternallyMonitoredEventQueue', but it's quite a keyboardful.

All this is just a small nit for something that works as-is, so feel free to ignore!
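For concreteness, a rough rendering of the reviewer's suggested template, using standard C++ types as stand-ins (hypothetical; this variant was not pursued in this bug):

#include <deque>
#include <mutex>

template <typename MonitorType = std::recursive_mutex,  // ~ReentrantMonitor
          bool InternalMonitor = true>
class EventQueue;

// Default: the queue owns its monitor, like the original nsEventQueue.
template <typename MonitorType>
class EventQueue<MonitorType, true> {
public:
  void PutEvent(int aEvent) {
    std::lock_guard<MonitorType> lock(mMonitor);
    mQueue.push_back(aEvent);
  }

private:
  MonitorType mMonitor;
  std::deque<int> mQueue;  // int stands in for a refcounted runnable
};

// Specialization: the client owns the monitor and passes it in.
template <typename MonitorType>
class EventQueue<MonitorType, false> {
public:
  void PutEvent(int aEvent, MonitorType& aMonitor) {
    std::lock_guard<MonitorType> lock(aMonitor);
    mQueue.push_back(aEvent);
  }

private:
  std::deque<int> mQueue;
};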
Attachment #8657136 - Flags: review?(gsquelart) → review+
Comment on attachment 8657137 [details] [diff] [review]
part 3 - modify nsThreadPool to use a non-reentrant monitor

Review of attachment 8657137 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/nsThreadPool.cpp
@@ +11,5 @@
>  #include "nsMemory.h"
>  #include "nsAutoPtr.h"
>  #include "prinrval.h"
>  #include "mozilla/Logging.h"
> +#include "mozilla/Monitor.h"

Shouldn't this #include be in the header, where Monitor is first needed?
Attachment #8657137 - Flags: review?(gsquelart) → review+
Attachment #8657138 - Flags: review?(gsquelart) → review+
Attachment #8657140 - Flags: review?(gsquelart) → review+
Thanks for the reviews!  I addressed the review comments, except for renaming nsEventQueueBase.  There are a couple other things to clean up around that class, and I figured that doing that work in a separate bug would be better.
Assignee: nobody → nfroyd
Blocks: 1202497
This seems to have caused bug 1085667 to permafail.
Flags: needinfo?(nfroyd)
(In reply to Nigel Babu [:nigelb] from comment #26)
> This seems to have caused bug 1085667 to permafail.

Patches here landed on 2015-09-08 (AEST), 12 hours before this comment; but bug 1085667 started failing again on 2015-09-03, so I doubt this is the cause.
This made a moderate improvement to the power consumption of YouTube on Mac. See bug 1195790 comment 8.
Turns out it's actually a Gaia bug. Tomcat is backing it out now. Sorry about the confusion, folks!
Flags: needinfo?(nfroyd)
Comment on attachment 8657135 [details] [diff] [review]
part 1 - remove nsCOMPtr temporary from nsEventQueue::PutEvent

This approval request is for all the patches in this bug.  It's not crucial that these patches are uplifted, but performance/power improvements seem like a reasonable aurora uplift request.

Approval Request Comment
[Feature/regressing bug #]: None.
[User impact if declined]: Higher CPU/power usage on YouTube and other video sites; see comment 28, for instance.
[Describe test coverage new/current, TreeHerder]: There are no tests specifically for these changes, but platform features that use this code are pretty well tested.
[Risks and why]: These patches are reasonably low risk.
[String/UUID change made/needed]: None
Attachment #8657135 - Flags: approval-mozilla-aurora?
Can someone verify this bug?
Keywords: verifyme
(In reply to Sylvestre Ledru [:sylvestre] from comment #32)
> Can someone verify this bug?

What sort of verification are you looking for?  Are you looking for something different than comment 28?
Flags: needinfo?(sledru)
Status: RESOLVED → VERIFIED
Flags: needinfo?(sledru)
Comment on attachment 8657135 [details] [diff] [review]
part 1 - remove nsCOMPtr temporary from nsEventQueue::PutEvent

Missed that, thanks.
Taking it in aurora then!
Attachment #8657135 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8657140 - Flags: approval-mozilla-aurora+
Attachment #8657136 - Flags: approval-mozilla-aurora+
Attachment #8657137 - Flags: approval-mozilla-aurora+
Attachment #8657138 - Flags: approval-mozilla-aurora+
Nathan, this is totally rad work - thank you so much for actively maintaining this stuff!
Encountered the following results using the powermetrics command-line utility under Mac OS X 10.11:

Nightly without this fix (from 2015-08-27)
Name                               ID     CPU ms/s  User%  Deadlines (<2 ms, 2-5 ms)  Wakeups (Intr, Pkg idle)  GPU ms/s
org.mozilla.nightly                246    454.77                                      2361.91 1528.85           42.82     
  plugin-container                 497    335.73    56.63  1074.87 43.70              1227.11 792.13            10.68     
  firefox                          495    121.93    62.44  1003.83 2.99               1134.33 736.66            32.14              6.16   


latest 44.0a1 (2015-10-26)
Name                               ID     CPU ms/s  User%  Deadlines (<2 ms, 2-5 ms)  Wakeups (Intr, Pkg idle)  GPU ms/s
org.mozilla.nightly                252    288.18                                      301.14  194.97            73.03     
  plugin-container                 520    170.45    76.65  53.27   35.72              196.14  123.91            48.15     
  firefox                          519    121.41    78.47  12.57   2.39               104.16  70.43             24.87     

I can see reduced CPU usage for the plugin-container after this fix. Is this enough to be called verified? Or should I look at something else too? Thanks in advance!
Flags: needinfo?(nfroyd)
(In reply to Alexandra Lucinet, QA Mentor [:adalucinet] from comment #37)
> I can see reduced CPU usage for the plugin-container after this fix. Is this
> enough to be called verified? Or should I look at something else too?
> Thanks in advance!

This is enough, thanks!
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #38)
> This is enough, thanks!

Thanks, Nathan!
Same result encountered with 42.0RC build 1 (Build ID: 20151026170526):


Name                               ID     CPU ms/s  User%  Deadlines (<2 ms, 2-5 ms)  Wakeups (Intr, Pkg idle)  GPU ms/s
org.mozilla.firefox                284    210.45                                      223.73  171.24            139.35    
  firefox                          764    211.21    76.32  36.72   16.16              223.51  170.82            139.34    
  plugin-container                 765    0.24      67.25  0.00    0.00               0.40    0.40              0.00     


Marking accordingly.
Release Note Request (optional, but appreciated)
[Why is this notable]: Good to highlight performance wins for video playing
[Suggested wording]: "Lower CPU usage while playing videos on popular sites such as Youtube and Vimeo"  (The original problem was on OS X, and that's the only platform we've verified it on, but I'd be *very* surprised if it didn't have some effect on other platforms.)
[Links (documentation, blog post, etc)]: None; I was going to write a techy blog post on it, though, which made me think a relnote would be a good thing. :)
relnote-firefox: --- → ?
Flags: needinfo?(lhenry)
OK, great! A blog post would be very useful (best if it comes from a mozilla.org domain) and I can add the link any time during beta.
Flags: needinfo?(lhenry)
Summary: mozilla::TaskQueue's seem pretty heavy weight on OS X → mozilla::TaskQueues seem pretty heavy weight on OS X
Just realized this is for 42. I can still add a note for beta.  Sylvestre, it sounds like you will want this, anyway!
> "Lower CPU usage while playing videos on popular sites such as Youtube and Vimeo"

A supporting blog post with measurements would be very good. We don't want to oversell the improvement, because some users get upset if we claim an improvement and they don't see it themselves.
Our power usage for video playback on Mac sucks, and this bug appears to be about it sucking a little bit less. I'm not sure it's something we should draw too much attention to. By contrast, we have good power usage on Windows, albeit not as good as Edge's.
Since this is actually sylvestre's note for 42, he should make the call.  I suggest removing the note and then putting it back when we know we have a good supporting blog post to link to.
Flags: needinfo?(sledru)
Flags: needinfo?(nfroyd)
That'd be fine with me.  In comment 44, Anthony sounds like he would prefer to wait on relnoting changes until more substantial power reductions have been achieved anyway.
Flags: needinfo?(nfroyd)
Agreed! Thanks
Flags: needinfo?(sledru)
Removing the flag, please resubmit when we agree it is time to communicate on it.
relnote-firefox: ? → ---
While investigating bug 1230641, I've found that tasks take an unreasonably long time to be dispatched: we're talking 18ms from time to time, with an average of 8ms.

Interestingly, when I dropped the number of threads in the thread pool from 24 to 8, that average dropped to less than 3ms.

There are other issues at hand with bug 1230641, but the slowness of the task queues is certainly a big factor.
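For reference, one way to capture dispatch-to-run latency numbers like these is to timestamp each task when it is dispatched and again when it actually runs. A sketch with std::chrono (the in-tree analogue would use mozilla::TimeStamp; the wrapper name is hypothetical):

#include <chrono>
#include <cstdio>
#include <functional>
#include <utility>

using Clock = std::chrono::steady_clock;

// Wrap a task so that, when the pool finally runs it, it reports how
// long it sat in the queue.
std::function<void()> WrapWithLatencyProbe(std::function<void()> aTask) {
  Clock::time_point dispatched = Clock::now();
  return [aTask = std::move(aTask), dispatched] {
    auto waited = std::chrono::duration_cast<std::chrono::microseconds>(
        Clock::now() - dispatched);
    std::printf("queue latency: %lld us\n",
                static_cast<long long>(waited.count()));
    aTask();
  };
}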
See Also: → 1230641
(In reply to Jean-Yves Avenard [:jya] from comment #49)
> While investigating bug 1230641, I've found that tasks take an unreasonably
> long time to be dispatched: we're talking 18ms from time to time, with an
> average of 8ms.
> 
> Interestingly, when I dropped the number of threads in the thread pool from
> 24 to 8, that average dropped to less than 3ms.

Is that just the time spent in nsThreadPool::Dispatch, or were you measuring something else?  Is it possible for you to break down that 18ms by function, or even line?

Can you file a separate bug on that slowness?
Flags: needinfo?(jyavenard)
Sorry for the delay.

The issue is only seen on a particular Mac laptop with an Intel IGP, and only when performing particular graphics operations.

When the compositor is rendering GPU-based frames, everything slows to a crawl; I had first looked in the logs at how slowly tasks were dispatched. That is likely a red herring that is a direct consequence of the graphics bottleneck.
Flags: needinfo?(jyavenard)