
stylo: Too much time spent styling on Speedometer Inferno benchmark

Status: NEW
Product: Core
Component: CSS Parsing and Computation
Priority: P1
Severity: normal
Reported: a month ago
Last updated: 7 days ago

People: (Reporter: bholley, Assigned: bholley)

Tracking: (Depends on: 2 bugs, Blocks: 1 bug)

Firefox Tracking Flags: (Not tracked)

Details

STR:
(1) Visit http://speedometer2.benj.me/InteractiveRunner.html
(2) Set Profiler resolution to 0.5ms
(3) Uncheck everything but "Inferno"
(4) Start profiler, click run, capture profiler

This is the profile I get for sequential stylo: https://perfht.ml/2s9uExz
This is the profile I get for gecko: https://perfht.ml/2s9AC1i

The Gecko profile spends 12.5ms in ResolveStyleFor, 0.5ms in *PseudoElementStyle, and 5ms in WalkRuleTree, summing to 18ms. The stylo profile spends 54.5ms in Traverse_Subtree. The numbers are significantly worse (92.5ms) with 6 threads, so something is obviously fishy here.

My hunch is that this is a variant of the "lots of tiny traversals" problem, potentially combined with some restyle hint bug causing us to invalidate too much. Will need some looking into, though it may be some days before I get to it.
Also worth looking at http://speedometer2.benj.me/InteractiveRunner.html?suite=VueJS-TodoMVC
Julian and I spent some time looking at this today.

The general pattern in this test is hundreds of restyles, each of which only actually styles 0-2 elements. So the parallelism won't help here, but we should at least avoid making things slower.

We discovered that the rust compiler is generating some silly memcpys that add a significant per-traversal cost, which is much more noticeable in the scenario described above. I'll get rid of these and then re-profile, though it might get pushed back to next week.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #2)
> We discovered that the rust compiler is generating some silly memcpys that
> add a significant per-traversal cost

My measurements (next comment) support this.  But excessive memcpys would,
afaics, only cause each thread to run a constant factor more slowly than it
otherwise would.  I don't see how this can cause performance to get worse
as the number of threads increases, unless the amount of redundant copying
that each thread does increases with the number of threads.  So perhaps
there's something else going on too.
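Sketching that argument as a toy model (illustrative only, my framing): a constant per-thread overhead factor cannot by itself make the total instruction count grow with the thread count.

```rust
// Toy model of the argument above (illustrative only): if excessive memcpys
// just make each thread run a constant factor k slower, the total
// instruction count is the same no matter how many threads share the work.
fn main() {
    let work = 1_000_000.0_f64; // fixed amount of styling work (insns)
    let k = 1.5; // constant per-thread memcpy overhead factor

    // total insns = (work per thread) * overhead * number of threads
    let total = |threads: f64| (work / threads) * k * threads;

    // Overhead alone cannot explain cost that scales with the thread count.
    assert_eq!(total(1.0), total(4.0));
}
```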
Here are my working notes so far, taken from private email with bholley.

I profiled with Callgrind, with pseudo-parallel (fine-grain interleaved)
thread execution.  This is with 1 thread (with thread pool) and 4 threads,
that is:

FORCE_STYLO_THREAD_POOL=1 STYLO_THREADS=1  ("p1")
FORCE_STYLO_THREAD_POOL=1 STYLO_THREADS=4  ("p4")

I did not see any numbers to make me believe the slowdowns are due to memory
effects (cache misses, memory bandwidth, or MESI invalidations).  Right now
it looks like a simple case of "too many instructions", for unclear reasons.

It does look as if there are some allocations whose size depends on the
number of threads (see "free" below).  That might explain the slowdown with
more threads.

Some measurements.
MI = million insns  (of the specified function and all its callees).
KI = thousand insns (ditto).
"x" = number of calls to the specified function.


There is no difference in the tree rooted at

SRM::ProcessPendingRestyles
p1  772 x,  118.947 MI
p4  786 x,  119.031 MI

calls
RT::DoProcessRestyles
p1  46 x, 113.893 MI
p4  45 x, 114.068 MI


But in the tree rooted at SRM::DoProcessPendingRestyles we have
p1  519 x, 50.36 MI
p4  521 x, 70.08 MI

This leads to a really big difference in m::ServoStyleSet::StyleDocument
p1  747 x, 20.82 MI
p4  750 x, 44.09 MI

difference carried through to m::SSS::PrepareAndTraverseSubtree
p1  3755+131 x, 23.998 MI
p4  3770+131 x, 47.351 MI

leading to Servo_TraverseSubtree
p1  4012 x,  22.068 MI
p4  4027 x,  45.405 MI

and on to geckoservo::glue::traverse_subtree
p1  4113 x,  21.792 MI
p4  4128 x,  45.127 MI


traverse_subtree is quasi-leaf (at the machine code level) because so much
stuff is inlined into it.  Nevertheless, it has the following callees:

p1 ->  5811 KI 1369x  free
   ->  5150 KI  671x  core::ptr::drop_in_place
   ->  4129 KI 1156x  memcpy
   ->  3626 KI  150x  RecalcStyleOnly

p4 -> 16563 KI 1372x  free
   ->  6946 KI  785x  core::ptr::drop_in_place
   -> 14692 KI 3756x  memcpy
   ->  3629 KI  150x  RecalcStyleOnly

The calls to drop_in_place and RecalcStyleOnly look harmless -- more or less
identical in both count and insns for p1 vs p4.

For memcpy, we have a lot more calls in the p4 case, but the cost is more or
less the same (3571 insn/call for p1, 3911 insn/call for p4).  This suggests
that the structure being copied is more or less the same size in both.

For free, we have the same number of calls (1369 for p1, 1372 for p4) but
the per-call cost is very different (4244 insn/call for p1, 12072 insn/call
for p4).  Given that our mozjemalloc free() poisons the freed memory,
this suggests to me that the p4 frees are freeing much larger chunks than
the p1 frees.  This warrants further investigation.
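For concreteness, the per-call figures fall out of the table like this (a quick sanity check on the quoted numbers, not new measurement):

```rust
// Illustrative arithmetic only: per-call costs derived from the Callgrind
// totals quoted above (KI = thousand instructions).
fn per_call(total_ki: f64, calls: f64) -> f64 {
    total_ki * 1000.0 / calls // instructions per call
}

fn main() {
    // free: same call count, but ~3x the per-call cost at 4 threads.
    let free_p1 = per_call(5811.0, 1369.0); // ~4245 insn/call
    let free_p4 = per_call(16563.0, 1372.0); // ~12072 insn/call
    assert!(free_p4 / free_p1 > 2.5);

    // memcpy: ~3x the call count, but roughly constant per-call cost.
    let memcpy_p1 = per_call(4129.0, 1156.0); // ~3571 insn/call
    let memcpy_p4 = per_call(14692.0, 3756.0); // ~3911 insn/call
    assert!((memcpy_p4 / memcpy_p1 - 1.0).abs() < 0.2);
}
```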
As far as the expensive calls from traverse_subtree to memcpy go, there is
so much inlining that I can't make much sense of the annotated sources.
However, the machine code for the hottest call looks like this:

  // r15 is the base of some array.
  // Compute r12 = index(r14) * 0x23e8(9192), and write zero at r15 + r12
  imul   $0x23e8,%r14,%r12
  movq   $0x0,(%r15,%r12,1)

  // "memcpy(dest = r15 + r12 + 8,  src = rbp - 0x27c0,  n = 0x1f18(7960))"
  lea    0x8(%r15,%r12,1),%rdi
  mov    $0x1f18,%edx
  lea    -0x27c0(%rbp),%rsi
  callq  a21510 <memcpy@plt>

So we compute an array index for objects of size 9192, which from the
DWARF info is probably the type
RefCell<core::option::Option<style::context::ThreadLocalStyleContext<style::gecko::wrapper::GeckoElement>>>

and then we're copying, from stack to probably-heap, an object of size
7960, probably a
StyleSharingCandidateCache<style::gecko::wrapper::GeckoElement> which is
located 8 bytes after the start of the RefCell<..> type.

Does that make sense to anyone?
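One way that pattern can arise (hypothetical names and sizes, mine, matched to the disassembly — a ~9 KB slot whose ~8 KB payload starts 8 bytes in):

```rust
use std::cell::RefCell;

// Hypothetical stand-in for the ~8 KB sharing cache seen in the memcpy.
struct SharingCache {
    entries: [u8; 7960],
}

fn main() {
    // One RefCell<Option<..>> slot per thread, stored contiguously. Moving a
    // freshly built SharingCache into slot[i] can compile to exactly the
    // observed pattern: write the Some discriminant at base + i * slot_size,
    // then memcpy the large payload from the stack into the heap slot.
    let slots: Vec<RefCell<Option<SharingCache>>> =
        (0..4).map(|_| RefCell::new(None)).collect();
    *slots[2].borrow_mut() = Some(SharingCache { entries: [0u8; 7960] });
    assert!(slots[2].borrow().is_some());
}
```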
I reran the tests with 1, 4, 16 and 64 threads, to try and shake out
scaling effects with memcpy and free, per comment #4.

For geckoservo::glue::traverse_subtree, #calls and total costs are:

p1   4163x   22.001 MI
p4   4148x   43.928 MI
p16  4168x  135.763 MI
p64  4163x  500.015 MI

callees of traverse_subtree are

#thr  callee_name      #calls   tot_cost    KI_per_call
-------------------------------------------------------

p1    free             1375x    5,859 KI    4.26
      drop_in_place     673x    5,260 KI    7.81
      memcpy           1160x    4,146 KI    3.57
      RecalcStyleOnly   150x    3,617 KI   24.11

p4    free             1372x   16,552 KI   12.06
      drop_in_place     785x    5,782 KI    7.36
      memcpy           3756x   14,692 KI    3.91
      RecalcStyleOnly   150x    3,609 KI   24.06

p16   free             1378x   65,019 KI   47.18
      drop_in_place     823x    5,976 KI    7.26
      memcpy          14212x   57,169 KI    4.02
      RecalcStyleOnly   150x    3,609 KI   24.06

p64   free             1372x  257,759 KI  187.87
      drop_in_place     907x    6,275 KI    6.91
      memcpy          55716x  225,779 KI    4.05
      RecalcStyleOnly   150x    3,612 KI   24.08

So: RecalcStyleOnly's costs are independent of the number of threads.

drop_in_place is almost independent.  The number of calls increases gently
(673x to 907x) but the per-call cost is approximately constant.

free is called the same number of times in each scenario, but the cost
is proportional to the number of threads.  Hence the size of the object
being freed is proportional to the number of threads.

memcpy is called a number of times proportional to the number of threads
(more or less), but the cost of each call is constant, hence the size of the
copied object is constant.
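A quick consistency check of those proportionality claims against the table above (quoted numbers, nothing new):

```rust
// Check the scaling claims from the p1/p4/p16/p64 table above
// (KI = thousand instructions).
fn main() {
    let threads = [1.0_f64, 4.0, 16.0, 64.0];
    let free_cost = [5859.0_f64, 16552.0, 65019.0, 257759.0]; // KI
    let memcpy_calls = [1160.0_f64, 3756.0, 14212.0, 55716.0];

    // Quadrupling the thread count roughly quadruples both the free() cost
    // and the memcpy call count (the p1 -> p4 step is a bit shy of 4x).
    for i in 1..4 {
        let t_ratio = threads[i] / threads[i - 1]; // always 4
        let f_ratio = free_cost[i] / free_cost[i - 1];
        let m_ratio = memcpy_calls[i] / memcpy_calls[i - 1];
        assert!(f_ratio > 0.6 * t_ratio && f_ratio < 1.2 * t_ratio);
        assert!(m_ratio > 0.6 * t_ratio && m_ratio < 1.2 * t_ratio);
    }
}
```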

One scenario which would be consistent with this is that the entire TLS is
copied (in fixed-sized chunks) and later freed (in one big chunk) by each
thread.  At the very least, there is clearly something deeply ungood going
on here, vis-à-vis memory management.
No, that's not right, because it would imply a cost in free() which is
quadratic in the number of threads.  But what we see is a cost in free()
which is proportional to the number of threads.  Here's take 2:

  An allocation of some kind, of which there are 1372 copies regardless
  of the number of threads, but which has size proportional to the number of
  threads.  All such structures are eventually freed by some thread or
  other.

So .. wouldn't that just be the TLS array?  Perhaps there are 1372 parallel
traversals in the workload.
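The two takes differ in how free() cost scales, which is what rules the first one out. A toy cost model (my framing, not measured data):

```rust
// Toy cost model contrasting the two hypotheses. One unit of "free work"
// stands for one byte poisoned by free().
fn main() {
    let n = 1372_u64; // allocations (~ parallel traversals) in the workload

    // Take 1 (rejected): each of T threads frees a whole TLS of size O(T),
    // so free() cost per run would grow like T^2.
    let take1 = |t: u64| n * t * t;

    // Take 2: N allocations of size O(T), each freed exactly once by some
    // thread -> total free() cost grows like T, matching the measurements.
    let take2 = |t: u64| n * t;

    // Doubling the thread count doubles take-2's cost, quadruples take-1's.
    assert_eq!(take2(8) / take2(4), 2);
    assert_eq!(take1(8) / take1(4), 4);
}
```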
The best I can get for an allocation stack of the abovementioned freed
blocks is

  Servo_TraverseSubtree
  -> geckoservo::glue::traverse_subtree
     -> style::parallel::traverse_dom  (advertised as being in
                                        liballoc_system/lib.rs, sigh)

I presume it's really

  let tls = ScopedTLS::<D::ThreadLocalContext>::new(pool);

in traverse_dom.  I'm not sure why the unwinding is so bad.
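A minimal sketch of that pattern (not Servo's actual ScopedTLS, just the shape the profile points at): every traversal allocates one context slot per pool thread and frees the whole array when it ends, so setup/teardown cost is proportional to the thread count even when the traversal styles 0-2 elements.

```rust
// Sketch (hypothetical types): per-traversal scoped TLS sized by the pool.
struct ThreadLocalContext {
    scratch: [u8; 1024], // stand-in for the real per-thread style context
}

struct ScopedTls {
    slots: Vec<Option<ThreadLocalContext>>, // one slot per pool thread
}

impl ScopedTls {
    fn new(pool_threads: usize) -> Self {
        // Allocation size is proportional to the thread count, regardless
        // of how much work the traversal actually does.
        ScopedTls { slots: (0..pool_threads).map(|_| None).collect() }
    }
}

fn main() {
    for &threads in [1_usize, 4, 16, 64].iter() {
        let tls = ScopedTls::new(threads);
        assert_eq!(tls.slots.len(), threads);
    } // tls dropped here: one free() whose cost scales with `threads`
}
```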

bholley, do the above few comments give you something useful to go on?
Basically sounds to me like "parallel traversal setup cost (which is
proportional to number of worker threads) swamps the actual cost of
these multiple mini-traversals" -- as you initially suggested.
Flags: needinfo?(bobbyholley)
In general this analysis looks fruitful - I won't have time to look until monday but will dig in then, and ping you if I have other questions or encounter any mysteries. Thanks for looking at this!
Depends on: 1378845
Depends on: 1380198
The patches in bug 1380198 remove the memmoves and allocations. They definitely help, but don't solve the problem entirely. Julian, care to apply those patches and take another look?
Flags: needinfo?(bobbyholley) → needinfo?(jseward)
As an update here - I've done some profiling and found various ways to reduce traversal overhead significantly, so I'm optimistic about that part. I still don't have a good explanation for why performance on [1] gets worse with more threads though. I could entertain some amount of context-switching overhead, but FORCE_STYLO_THREAD_POOL=1 STYLO_THREADS=1 is about halfway between STYLO_THREADS=1 and STYLO_THREADS=6, so there's definitely some other effect that's hurting us when we pile on more threads.

So I'll keep pushing on slimming down general traversal overhead. Julian, if you have time to look at the threading part, that would be helpful.

[1] https://github.com/heycam/style-perf-tests/blob/master/perf-reftest/tiny-traversal-singleton.html
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #10)
> The patches in bug 1380198 remove the memmoves and allocations. They
> definitely help, but don't solve the problem entirely.

Correct -- they (greatly!) reduce the constant-factor inefficiencies.  Even
for 1 (parallel) thread, the instruction count for
geckoservo::glue::traverse_subtree is more than halved.  For 4 threads it is
about 35% of what it was pre-patch.

Numbers below (compare against comment #6).  Summary is:

For the callees of geckoservo::glue::traverse_subtree, relative to pre bug 1380198:

  free: number of calls unchanged, per-call cost reduced by a factor of 6 to 18

  drop_in_place: number of calls unchanged, per-call cost reduced by 10% to 50%

  memcpy: number of calls halved, per-call cost reduced by a factor of 23 to 30

  RecalcStyleOnly: number and cost of calls unchanged

It doesn't change the behaviour that the free() cost is dependent on the
number of threads.  So, presumably, we're still allocating and freeing the
TLS for every traversal, but the TLS is now around 10 times smaller than it
was before.  Yes?

----

For geckoservo::glue::traverse_subtree, #calls and total costs are:

p1  4113x  10.786 MI  73547 Ge
p4  4098x  15.074 MI  74013 Ge
p16 4148x  20.396 MI  69117 Ge
p64 4133x  51.562 MI  64417 Ge

callees of traverse_subtree are

#thr  callee_name      #calls   tot_cost    KI_per_call
-------------------------------------------------------

p1    free             1369x      346 KI    0.25
      drop_in_place     671x    4,360 KI    6.49
      memcpy            432x       65 KI    0.15
      RecalcStyleOnly   150x    3,610 KI   24.06
      node_needs_trav  4079x      449 KI    0.11

p4    free             1372x    4,056 KI    2.95
      drop_in_place     785x    4,358 KI    5.55
      memcpy           1732x      261 KI    0.15
      RecalcStyleOnly   150x    3,614 KI   24.09
      node_needs_tra   4064x      447 KI    0.11

p16   free             1372x    9,388 KI    6.84
      drop_in_place     821x    3,238 KI    3.94
      memcpy           6928x    1,046 KI    0.15
      RecalcStyleOnly   150x    3,610 KI   24.06
      node_needs_trav  4114x      453 KI    0.11

p64   free             1369x   36,033 KI   26.32
      drop_in_place     879x    3,273 KI    3.72
      memcpy          27648x    4,174 KI    0.15
      RecalcStyleOnly   150x    3,622 KI   24.14
      node_needs_trav  4099x      451 KI    0.11
Flags: needinfo?(jseward)
(In reply to Julian Seward [:jseward] from comment #12)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #10)
> > The patches in bug 1380198 remove the memmoves and allocations. They
> > definitely help, but don't solve the problem entirely.
> 
> Correct -- they (greatly!) reduce the constant-factor inefficiencies.  Even
> for 1 (parallel) thread, the instruction count for
> geckoservo::glue::traverse_subtree is more than halved.  For 4 threads it is
> about 35% of what it was pre-patch.
> 
> Numbers below (compare against comment #6).  Summary is:
> 
> For the callees of geckoservo::glue::traverse_subtree, relative to pre bug
> 1380198:
> 
>   free: number of calls unchanged, per-call cost reduced by a factor of 6 to
> 18
> 
>   drop_in_place: number of calls unchanged, per-call cost reduced by 10% to
> 50%
> 
>   memcpy: number of calls halved, per-call cost reduced by a factor of 23 to
> 30
> 
>   RecalcStyleOnly: number and cost of calls unchanged
> 
> It doesn't change the behaviour that the free() cost is dependent on the
> number of threads.  So, presumably, we're still allocating and freeing the
> TLS for every traversal, but the TLS is now around 10 times smaller than it
> was before.  Yes?

That sounds right. If it's still a significant issue, I could look into eliminating that allocation, which would be a bit icky but not so bad. Do I understand from your numbers below that it is indeed worth doing? If so please file a dependent bug and needinfo me and I'll write it up.

Flags: needinfo?(jseward)
Depends on: 1380877
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #13)
> > It doesn't change the behaviour that the free() cost is dependent on the
> > number of threads.  So, presumably, we're still allocating and freeing the
> > TLS for every traversal, but the TLS is now around 10 times smaller than it
> > was before.  Yes?
> 
> That sounds right. If it's still a significant issue, I could look into
> eliminating that allocation, which would be a bit icky but not so bad. Do I
> understand from your numbers below that it is indeed worth doing? If so
> please file a dependent bug and needinfo me and I'll write it up.

I think it is worth doing so.  TLS free costs for small traversals are a
significant source of overhead even with small numbers of threads (like, 2).
I filed bug 1381021 with some more numbers.
Flags: needinfo?(jseward)
Depends on: 1381021