Closed Bug 1431285 Opened 5 years ago Closed 5 years ago

Cap the number of stylo threads at six to avoid poor performance on many-core machines

Categories

(Core :: CSS Parsing and Computation, defect, P3)

Tracking

RESOLVED FIXED
mozilla63
Tracking Status
firefox-esr52 --- unaffected
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- fixed
firefox63 --- fixed

People

(Reporter: bas.schouten, Assigned: bholley)

Attachments

(1 file)

While investigating performance of the 'Multiply' test from the MotionMark benchmark suite (see browserbench.org/MotionMark/developer.html), I noticed that, at least when running it at the highest complexity (select 'maintain target FPS' and target a frame rate of 1), style processing was taking a very long time on my 20-core desktop machine. Here are some numbers:

Using Critical Sections for Malloc locking:

With 24 threads (which were created by default):
Styles phase 40ms, 70-80% of time in Malloc critical section spinlock.
With 6 threads:
Styles phase 24ms, 25-30% of time in Malloc critical section spinlock.
With 2 threads:
Styles phase 29ms, 0-5% of time in spinlocks
With 1 thread:
Styles phase 31ms

Glandium suggested using an SRWLock instead of critical sections; this resulted in:

With 24 threads:
Styles phase 34ms; VTune didn't recognise this lock as spinning, but I estimate about 50% of the time was spent spinning.
With 6 threads:
Styles phase 28ms; really hard to estimate how much time was spent spinning.

I didn't bother trying with fewer threads, as the SRWLock was by then obviously underperforming critical sections.

https://pastebin.mozilla.org/9076189 gives an idea of where the mallocs are coming from.
Priority: -- → P3
Trying to reproduce this on my MBP (which will default to 6 threads) with the Gecko profiler.

I ran the 'Multiply' test for 10 seconds with target FPS 1, and get the following profile: https://perfht.ml/2DB8BVP

I didn't enable profiling for the style threads (since that can sometimes skew the overall numbers), but it appears that ~90% of the styling time (time spent in Servo_TraverseSubtree) is in invalidate_style_if_needed, which runs on the main thread before the parallel work begins (the rayon_core::latch::{impl}::wait stuff).

This could be related to differences in our hardware or OS, but I'd like to first verify that we're performing the same steps. Does the above profile link differ from what you see when taking the same steps?
Flags: needinfo?(bas)
I'm in Toronto at the moment. I will get back to you next week.
(In reply to Bobby Holley (:bholley) from comment #1)
> Trying to reproduce this on my MBP (which will default to 6 threads) with
> the Gecko profiler.
> 
> I ran the 'Multiply' test for 10 seconds with target FPS 1, and get the
> following profile: https://perfht.ml/2DB8BVP
> 
> I didn't enable profiling for the style threads (since that can sometimes
> skew the overall numbers), but it appears that ~90% of the styling time
> (time spent in Servo_TraverseSubtree) is in invalidate_style_if_needed,
> which runs on the main thread before the parallel work begins (the
> rayon_core::latch::{impl}::wait stuff).
> 
> This could be related to differences in our hardware or OS, but I'd like to
> first verify that we're performing the same steps. Does the above profile
> link differ from what you see when taking the same steps?

I checked this, and my times are very different. I'm profiling with VTune, but with 6 worker threads I see about 6ms of 'parallel' work at the -start- of the Styles event, and then about 20 more ms of main-thread work. Nothing anywhere near the insane style processing times you're getting. Are Rust optimizations on in this build? Some of that looks like it should be inlined.
Flags: needinfo?(bas) → needinfo?(bobbyholley)
I think I must have had uBlock enabled, and hit bug 1418161.

I measured again and got more reasonable results. I then added the style threads to the profiler, 0.2ms resolution and 1 content process to reduce overhead: https://perfht.ml/2FByRN1

This matches Bas' results in comment 0. I'm curious as to why this hasn't come up before, and whether it's somehow unique to this testcase. I'll try to investigate more tomorrow.
I didn't end up having much time to investigate on Friday. However, I did think about it for a bit in my spare moments.

What strikes me as odd here is that I wouldn't expect us to have any contention. We have per-thread arenas. Each of these arenas does need its own lock for critical sections, since allocation might race with a free from another thread. But in practice that should be rare during the style phases (and indeed, I don't see any free stacks on the work threads in the profile from comment 4).

So I'd expect the per-arena malloc locks to be mostly uncontended during the parallel traversal, but we have profiles on both Windows and Mac showing them spinning a lot, which suggests high contention. So something fishy is going on.
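
To make that expectation concrete, here's a toy model of the per-thread-arena scheme (purely illustrative Rust, not jemalloc's actual implementation): allocation only takes the current thread's own arena lock, so the only cross-thread contention should come from freeing a block that some other thread allocated.

  use std::sync::Mutex;

  // Toy model of a per-thread arena; the lock stands in for the per-arena
  // state that jemalloc protects.
  struct Arena {
      free_list: Mutex<Vec<usize>>,
  }

  // A block remembers which arena it came from, because a free must lock the
  // *owning* arena even when it happens on a different thread.
  struct Block {
      arena_index: usize,
  }

  fn alloc(arenas: &[Arena], my_arena: usize) -> Block {
      // Allocation only takes the current thread's own arena lock, so two
      // threads allocating concurrently should never contend with each other.
      let _guard = arenas[my_arena].free_list.lock().unwrap();
      Block { arena_index: my_arena }
  }

  fn free(arenas: &[Arena], block: Block) {
      // Freeing locks the owning arena. If that arena belongs to another
      // thread, its lock bounces between cores -- the only contention in this
      // model, and expected to be rare during the parallel style traversal.
      let _guard = arenas[block.arena_index].free_list.lock().unwrap();
  }

  fn main() {
      let arenas: Vec<Arena> = (0..2)
          .map(|_| Arena { free_list: Mutex::new(Vec::new()) })
          .collect();
      let block = alloc(&arenas, 0);
      free(&arenas, block); // same-thread free: uncontended
  }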

Julian, is this something you have time to investigate on Monday? I can investigate it if you don't, though would be curious to at least know what your preferred tooling for the investigation would be.
Flags: needinfo?(jseward)
IIUC, you've got a bunch of threads contending for the malloc lock(s),
but that seems strange because you think there shouldn't be an excessive
amount of allocation or freeing going on.  Is that right?

In this case I'd be inclined to inspect the allocation and deallocation
points in the threads involved, to cross-check your intuition that there
isn't excessive allocation/freeing activity at these points.

I'd also look at whether blocks are being deallocated on a different thread
from which they get allocated.  We know this makes things worse in
at least two ways: (1) IIRC, it will cause jemalloc's per-arena locks 
to get bounced around between cores, and (2) it interacts with jemalloc's
free-poisoning to cause pointless cross-core traffic.

But I'm just guessing here.  I can chase it this week if you want; lmk.
Flags: needinfo?(jseward)
(In reply to Julian Seward [:jseward] from comment #6)
> IIUC, you've got a bunch of threads contending for the malloc lock(s),
> but that seems strange because you think there shouldn't be an excessive
> amount of allocation or freeing going on.  Is that right?

No. The issue is that I expect a lot of allocation and not very much freeing. And since each thread is supposed to have a thread-local arena, I would expect that the locks would map nearly-exclusively to separate threads, and thus would not expect to see very much spinning going on.

> But I'm just guessing here.  I can chase it this week if you want; lmk.

I keep hoping to have time but keep getting pulled away by various fires. How about this: if you have cycles Tuesday, see how far you can get here, and dump your status in the bug at EOD. If you don't have time or get stuck, I'll see if I have cycles during my day tomorrow, and so forth. Sound good?
Here are some observations.  These are taken with my standard P50 laptop
running Linux.  That is, 4 cores, 8 threads.  Comments about that at the
end.

Running the test described by Bas, it's clear there is some large loss
of performance at high thread counts, both from the numbers and simply
by watching the test display.  I set the test time to 10 seconds and got
the following numbers (I ran each twice):

  10 seconds
    nthr 16   900.91, 909.37
    nthr 32   861.92, 873.13
    nthr 64   207.43, 195.96

Bizarrely, the falloff isn't so bad for a 20-second run:

  20 seconds
    nthr 16   925.00, 925.00
    nthr 32   925.00, 925.00
    nthr 64   403.14, 508.98

Bas suggested on irc:

  So, here's a theory, the first couple of seconds the run is 'figuring out'
  how complex to make the scene (the target FPS of 1 should always make it
  take the highest complexity), maybe that phase is messing with your
  scores.

I dunno.

---------

I also ran while watching with Linux's perf profiler.  What I see from that
is that essentially all Style threads wind up burning cycles in
__memmove_avx_unaligned_erms, and as the number of threads increases, the
number of them in __memmove_avx_unaligned_erms increases at more or less the
same rate.  Given that perf doesn't change the speed at which threads run,
I'd consider its results fairly trustworthy.

---------

I then ran it with 4, 16 and 64 threads on Valgrind/Callgrind with MESI
simulation, to check for intended or unintended cross-core traffic, and lock
contention.  The results are pretty confusing, not least because Callgrind's
slowness makes the tests behave differently than they do natively.  But I didn't see
any excessive communication costs.  I did see a lot of instructions used in
"__memmove_avx_unaligned_erms", which corresponds with the 'perf'
measurements above.  Working back up the tree leads to "coco::*" which IIUC
is part of Rayon:

   coco::epoch::thread::pin::{{closure}} (mostly, and)
   coco::epoch::thread::pin::{{closure}}::{{closure}}
   -> <core::cell::Cell<T>>::set
      -> <core::cell::Cell<T>>::replace
         -> core::mem::replace
            -> core::mem::swap
               -> core::ptr::swap_nonoverlapping
                  -> core::ptr::swap_nonoverlapping_bytes
                      -> __memmove_avx_unaligned_erms

---------

Finally, I note from the above runs that plain /bin/top shows Fx consuming
>= 760% CPU (basically, all h/w threads available) at 8 threads and over.

---------

What this says to me is that this isn't a lock contention problem (since we
can saturate 8 h/w threads no problem), and it isn't a cross-core traffic
problem (Callgrind/MESI shows no sign of that).  It looks to me instead like
some kind of spinnery/livelocking in coco::epoch::thread::pin, at high
thread counts.

Note also, this might be a different problem from what shows up on Bas' machine
-- or perhaps there's more than one problem here.  The best would be to use
Linux/Perf on the same setup that Bas has, but I have no such machine.
(In reply to Julian Seward [:jseward] from comment #8)
> problem (Callgrind/MESI shows no sign of that).  It looks to me instead like
> some kind of spinnery/livelocking in coco::epoch::thread::pin, at high
> thread counts.

I see the following hot path into the coco:: complex:

  rayon_core::registry::WorkerThread::steal::{{closure}}
  -> <coco::deque::Stealer<T>>::steal
     -> <coco::deque::Deque<T>>::steal
        -> coco::epoch::thread::pin and coco::epoch::thread::is_pinned
           -> <std::thread::local::LocalKey<T>>::with
              -> <std::thread::local::LocalKey<T>>::try_with
                 -> coco::epoch::thread::pin::{{closure}}
                    and to some extent coco::epoch::thread::HARNESS::__getit

Niko, is it possible that Rayon could be getting into a situation where
threads lacking work are caused to spin in coco for long periods rather
than allowed to go to sleep quickly?  Something like that?
Flags: needinfo?(nmatsakis)
I tracked this down, and it's not pretty. Bug 1436541 has the scoop.
Depends on: 1436541
Flags: needinfo?(bobbyholley)
ok, so I guess it's not spinning on coco? I can imagine that we might spend some time repeatedly looking for work, but I'm not sure what would make us do that for a *particularly* long time once things are quiescent.
Flags: needinfo?(nmatsakis)
Hmm.  So it seems that I misinterpreted my measurements, and/or profiled the
wrong thing.  Sorry for the noise.

Anyway, with Bobby's fix from bug 1436541 applied, I get the following
MotionMark numbers for Animometer/Multiply, for a 20 second run:

20 seconds
  thr  4   925 925   %CPU 135
  thr  8   925 925   %CPU 155
  thr 16   925 925   %CPU 180
  thr 32   925 925   %CPU 290
  thr 64   532 569   %CPU 700

I also noted the average %CPU while the test ran.  I am not sure what to
draw from that, given that the CPU usage continues to increase with the
number of styling threads, even though the overall score is the same at
least until 32 threads.

Profiling with perf with 64 threads, I still see most styling threads in
__memmove_avx_unaligned_erms, as before.  With fewer threads that is not
nearly so much the case.  I suspect that is some kind of scaling effect in
Rayon with large-ish (32+) thread counts; but that's far beyond what's of
concern to us for Stylo.
Ok, so I think we've sorted out the big issue here. While we're at it though, this seems like a good time to address Bas' concern about diminishing returns for the thread pool in Stylo.

The current heuristic is NUM_LOGICAL_CORES * (3/4). So, assuming everybody's hyperthreaded, a single-core machine gets 1 thread (meaning we don't make a thread pool at all), a dual-core machine gets three threads, and a quad-core machine gets six threads.

The above is the vast majority of our user base, but there are high-end devices out there with lots of cores. Those devices are relatively rare but likely to be owned by influencers, so I think it's important that stylo perform optimally on those machines.

I think a good starting point is to determine whether more than six threads is _ever_ a significant win, which means testing some style-heavy sites. A good initial set would be:
* http://html.spec.whatwg.org/
* https://github.com/bholley/gecko/commit/96ecf505adde95240799d285de90a007310f7bda
* https://en.wikipedia.org/wiki/Barack_Obama
* https://www.amazon.com/STIGA-Evolution-Table-Tennis-Racket/dp/B00EFY9F1C
* http://bholley.net/testcases/style-perf-tests/perf-reftest/bloom-basic.html (possibly tweaking the parameters)
* file://path/to/mozilla-central/third_party/webkit/PerformanceTests/StyleBench/InteractiveRunner.html

We should measure the time spent in Servo_TraverseSubtree (making sure that the load happens in the foreground tab) with various values of STYLO_THREADS. It's probably also worth testing with/without uBlock Origin enabled, since adblockers inject a lot of CSS that significantly increases styling time, and power-users tend to have adblockers installed.
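
For what it's worth, here's roughly how an environment-variable override like STYLO_THREADS can be read for these measurements; a std-only Rust sketch, not the actual Gecko/Servo code:

  use std::env;

  // Sketch: let STYLO_THREADS override the computed thread count for the
  // measurements described above. Illustrative only -- the real handling
  // lives in the Gecko/Servo glue code.
  fn thread_count_override() -> Option<usize> {
      env::var("STYLO_THREADS").ok()?.parse().ok()
  }

  fn main() {
      match thread_count_override() {
          Some(n) => println!("STYLO_THREADS override: {} threads", n),
          None => println!("no override; use the default heuristic"),
      }
  }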

If anyone has a capable machine and is willing to run these measurements, let me know. Otherwise I'll try to make time for it next week.
See Also: → 1445422
I finally got around to doing the measurements on my 18-virtual-core Xeon, and the performance benefits generally seem to level off at around 6 threads. So it seems worth capping at 6, which gives us three supported configurations (no thread pool, 3 threads, 6 threads). I'll write a patch.
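
For reference, a minimal sketch of what the cap could look like, assuming the rayon crate's public ThreadPoolBuilder API; the function names and wiring are illustrative, not the actual Gecko/Servo code:

  use rayon::ThreadPoolBuilder;

  // Illustrative only: keep the 3/4-of-logical-cores heuristic, but never
  // ask for more than six style worker threads.
  fn style_thread_count(num_logical_cores: usize) -> usize {
      (num_logical_cores * 3 / 4).clamp(1, 6)
  }

  // A count of 1 means "sequential traversal": don't build a pool at all.
  fn build_style_pool(num_logical_cores: usize) -> Option<rayon::ThreadPool> {
      let threads = style_thread_count(num_logical_cores);
      if threads <= 1 {
          return None;
      }
      ThreadPoolBuilder::new().num_threads(threads).build().ok()
  }

  fn main() {
      for cores in [2usize, 4, 8, 40] {
          println!("{} logical cores -> {} style threads",
                   cores, style_thread_count(cores));
      }
  }

With the cap in place, a hyperthreaded quad-core still gets its six threads, while many-core workstations no longer fan out to dozens of style threads.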
Assignee: nobody → bobbyholley
MozReview-Commit-ID: 3qI1mIvDn8j
Comment on attachment 8989510 [details]
Bug 1431285 - Cap the number of style threads at six.

Emilio Cobos Álvarez (:emilio) has approved the revision.

https://phabricator.services.mozilla.com/D1928
Attachment #8989510 - Flags: review+
Attachment #8989510 - Attachment description: Bug 1431285 - Cap the number of style threads at six. v1 → Bug 1431285 - Cap the number of style threads at six.
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0d61bbcd24d8
Cap the number of style threads at six. r=emilio
https://hg.mozilla.org/mozilla-central/rev/0d61bbcd24d8
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
This may be worth uplifting? Seems fairly low-risk.
Flags: needinfo?(bobbyholley)
Comment on attachment 8989510 [details]
Bug 1431285 - Cap the number of style threads at six.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #19)
> This may be worth uplifting? Seems fairly low-risk.

Sure, why not. Not a huge amount of urgency but can't hurt.

Approval Request Comment
[Feature/Bug causing the regression]: stylo
[User impact if declined]: Unnecessary memory usage on many-core machines and potential performance pitfalls on those devices.
[Is this code covered by automated tests?]: Sort of. 
[Has the fix been verified in Nightly?]: No
[Needs manual test from QE? If yes, steps to reproduce]: Probably doesn't need manual testing. 
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: No
[Why is the change risky/not risky?]: This is just a 1-line change to ensure that we never create more than 6 stylo threads. We already use 6 on quad-core devices (like MBPs), so this just avoids spinning up dozens of threads on workstation-class machines with very high core counts.
[String changes made/needed]: None
Flags: needinfo?(bobbyholley)
Attachment #8989510 - Flags: approval-mozilla-beta?
Comment on attachment 8989510 [details]
Bug 1431285 - Cap the number of style threads at six.

Perf improvement / avoids excess memory usage in an edge case; let's uplift for beta 8.
Attachment #8989510 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Summary: Stylo performs poorly at least in some cases on many-core machines → Cap the number of stylo threads at six to avoid poor performance on many-core machines
Flags: qe-verify-