Open Bug 1371496 Opened 7 years ago Updated 2 years ago

stylo: Too much time spent styling on Speedometer Inferno benchmark

Categories

(Core :: CSS Parsing and Computation, defect, P4)


Tracking


Tracking Status
firefox57 --- wontfix

People

(Reporter: bholley, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

STR:
(1) Visit http://speedometer2.benj.me/InteractiveRunner.html
(2) Set the profiler resolution to 0.5ms
(3) Uncheck everything but "Inferno"
(4) Start the profiler, click Run, then capture the profile

This is the profile I get for sequential stylo: https://perfht.ml/2s9uExz
This is the profile I get for Gecko: https://perfht.ml/2s9AC1i

The Gecko profile spends 12.5ms in ResolveStyleFor, 0.5ms in *PseudoElementStyle, and 5ms in WalkRuleTree, summing to 18ms. The stylo profile spends 54.5ms in Traverse_Subtree. The numbers are significantly worse (92.5ms) with 6 threads, so something is obviously fishy here.

My hunch is that this is a variant of the "lots of tiny traversals" problem, potentially combined with some restyle hint bug causing us to invalidate too much. This will need some investigation, though it may be some days before I get to it.
Julian and I spent some time looking at this today. The general pattern in this test is hundreds of restyles, each of which only actually styles 0-2 elements. So the parallelism won't help here, but we should at least avoid making things slower. We discovered that the rust compiler is generating some silly memcpys that add a significant per-traversal cost, which is much more noticeable in the scenario described above. I'll get rid of these and then re-profile, though it might get pushed back to next week.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #2)
> We discovered that the rust compiler is generating some silly memcpys that
> add a significant per-traversal cost

My measurements (next comment) support this. But excessive memcpys would, afaics, only cause each thread to run a constant factor more slowly than it otherwise would. I don't see how this can cause performance to get worse as the number of threads increases, unless the amount of redundant copying that each thread does increases with the number of threads. So perhaps there's something else going on too.
Here are my working notes so far, taken from private email with bholley.

I profiled with Callgrind, with pseudo-parallel (fine-grain interleaved) thread execution. This is with 1 thread (with thread pool) and 4 threads, that is:

  FORCE_STYLO_THREAD_POOL=1 STYLO_THREADS=1   ("p1")
  FORCE_STYLO_THREAD_POOL=1 STYLO_THREADS=4   ("p4")

I did not see any numbers to make me believe the slowdowns are due to memory effects (cache misses, memory bandwidth, or MESI invalidations). Right now it looks like a simple case of "too many instructions", for unclear reasons. It does look as if there are some allocations whose size depends on the number of threads (see "free" below). That might explain the slowdown with more threads.

Some measurements. MI = million insns (of the specified function and all its callees). KI = thousand insns (ditto). "x" = number of calls to the specified function.

There is no difference in the tree rooted at SRM::ProcessPendingRestyles
  p1  772 x, 118.947 MI
  p4  786 x, 119.031 MI

which calls RT::DoProcessRestyles
  p1   46 x, 113.893 MI
  p4   45 x, 114.068 MI

But in the tree rooted at SRM::DoProcessPendingRestyles we have
  p1  519 x,  50.36 MI
  p4  521 x,  70.08 MI

which leads to a really big difference in m::ServoStyleSet::StyleDocument
  p1  747 x,  20.82 MI
  p4  750 x,  44.09 MI

The difference is carried through to m::SSS::PrepareAndTraverseSubtree
  p1  3755+131 x, 23.998 MI
  p4  3770+131 x, 47.351 MI

leading to Servo_TraverseSubtree
  p1  4012 x, 22.068 MI
  p4  4027 x, 45.405 MI

and on to geckoservo::glue::traverse_subtree
  p1  4113 x, 21.792 MI
  p4  4128 x, 45.127 MI

traverse_subtree is quasi-leaf (at the machine code level) because so much stuff is inlined into it. Nevertheless, it has the following callees:

  p1 ->  5811 KI 1369x free
     ->  5150 KI  671x core::ptr::drop_in_place
     ->  4129 KI 1156x memcpy
     ->  3626 KI  150x RecalcStyleOnly

  p4 -> 16563 KI 1372x free
     ->  6946 KI  785x core::ptr::drop_in_place
     -> 14692 KI 3756x memcpy
     ->  3629 KI  150x RecalcStyleOnly

The calls to drop_in_place and RecalcStyleOnly look harmless -- more or less identical in both count and insns for p1 vs p4.

For memcpy, we have a lot more calls in the p4 case, but the per-call cost is more or less the same (3911 insn/call for p1, 3571 insn/call for p4). This suggests that the copied structure is more or less the same size in both cases.

For free, we have the same number of calls (1369 for p1, 1372 for p4) but the per-call cost is very different (4244 insn/call for p1, 12072 insn/call for p4). Given that our mozjemalloc free() poisons the returned values, this suggests to me that the p4 frees are freeing much larger chunks than the p1 frees. This warrants further investigation.
As far as the expensive calls from traverse_subtree to memcpy go, there is so much inlining that I can't make much sense of the annotated sources. However, the machine code for the hottest call looks like this:

  // r15 is the base of some array.
  // Compute r12 = index(r14) * 0x23e8 (9192), and write zero at r15 + r12
  imul   $0x23e8,%r14,%r12
  movq   $0x0,(%r15,%r12,1)

  // "memcpy(dest = r15 + r12 + 8, src = rbp - 0x27c0, n = 0x1f18 (7960))"
  lea    0x8(%r15,%r12,1),%rdi
  mov    $0x1f18,%edx
  lea    -0x27c0(%rbp),%rsi
  callq  a21510 <memcpy@plt>

So we are computing an array index for objects of size 9192, which from the Dwarf is probably the type

  RefCell<core::option::Option<style::context::ThreadLocalStyleContext<style::gecko::wrapper::GeckoElement>>>

and then we're copying, from stack to probably-heap, an object of size 7960, probably a

  StyleSharingCandidateCache<style::gecko::wrapper::GeckoElement>

which is located 8 bytes after the start of the RefCell<..> type. Does that make sense to anyone?
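For illustration, here is a minimal standalone Rust sketch of the pattern that produces this kind of code. It is not the actual Servo code: BigCache is a made-up stand-in for StyleSharingCandidateCache, and the Vec stands in for whatever per-thread array r15 points at. The point is only that moving a large value into a RefCell<Option<...>> slot of a heap array lowers to a discriminant store plus a large stack-to-heap memcpy, matching the asm above.

    use std::cell::RefCell;

    // Made-up stand-in for the large style-sharing cache; ~7960 bytes
    // (995 * 8) of inline storage, chosen only to mirror the size in the asm.
    struct BigCache {
        entries: [u64; 995],
    }

    fn main() {
        // Heap array with one RefCell<Option<BigCache>> per worker thread.
        // The real slot type is larger (9192 bytes per the Dwarf info); this
        // sketch only shows the shape of the store-then-copy.
        let slots: Vec<RefCell<Option<BigCache>>> =
            (0..4).map(|_| RefCell::new(None)).collect();

        // Build the value on the stack, then move it into slot 0. Unless the
        // compiler elides the move, this lowers to writing the Option
        // discriminant and then memcpy'ing the ~8KB payload from stack to
        // heap, much like the "movq $0x0, ...; memcpy(..., 0x1f18)" sequence.
        let cache = BigCache { entries: [0u64; 995] };
        *slots[0].borrow_mut() = Some(cache);
    }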
I reran the tests with 1, 4, 16 and 64 threads, to try and shake out scaling effects with memcpy and free, per comment #4.

For geckoservo::glue::traverse_subtree, #calls and total costs are:

  p1   4163x   22.001 MI
  p4   4148x   43.928 MI
  p16  4168x  135.763 MI
  p64  4163x  500.015 MI

The callees of traverse_subtree are:

  #thr  callee_name       #calls     tot_cost    KI_per_call
  -----------------------------------------------------------
  p1    free               1375x      5,859 KI      4.26
        drop_in_place       673x      5,260 KI      7.81
        memcpy             1160x      4,146 KI      3.57
        RecalcStyleOnly     150x      3,617 KI     24.11

  p4    free               1372x     16,552 KI     12.06
        drop_in_place       785x      5,782 KI      7.36
        memcpy             3756x     14,692 KI      3.91
        RecalcStyleOnly     150x      3,609 KI     24.06

  p16   free               1378x     65,019 KI     47.18
        drop_in_place       823x      5,976 KI      7.26
        memcpy            14212x     57,169 KI      4.02
        RecalcStyleOnly     150x      3,609 KI     24.06

  p64   free               1372x    257,759 KI    187.87
        drop_in_place       907x      6,275 KI      6.91
        memcpy            55716x    225,779 KI      4.05
        RecalcStyleOnly     150x      3,612 KI     24.08

So:

- RecalcStyleOnly's costs are independent of the number of threads.

- drop_in_place is almost independent. The number of calls increases gently (673x to 907x) but the per-call cost is approximately constant.

- free is called the same number of times in each scenario, but the cost is proportional to the number of threads. Hence the size of the object being freed is proportional to the number of threads.

- memcpy is called a number of times proportional to the number of threads (more or less), but the cost of each call is constant, hence the size of each copied block is constant.

One scenario which would be consistent with this is that the entire TLS is copied (in fixed-sized chunks) and later freed (in one big chunk) by each thread. At the very least, there is clearly something deeply ungood going on here, vis-a-vis memory management.
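As a quick arithmetic cross-check on the free numbers above (using only the per-call costs from the table): dividing the per-call cost by the thread count gives a roughly constant per-thread figure, which is what you'd expect if each free is releasing one fixed-size slot per worker thread:

  p1     4.26 KI /  1 thread   = 4.26 KI per thread-slot
  p4    12.06 KI /  4 threads  = 3.02 KI per thread-slot
  p16   47.18 KI / 16 threads  = 2.95 KI per thread-slot
  p64  187.87 KI / 64 threads  = 2.94 KI per thread-slot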
No, that's not right, because it would imply a cost in free() which is quadratic in the number of threads. But what we see is a cost in free() which is proportional to the number of threads.

Here's take 2: an allocation of some kind, of which there are 1372 instances regardless of the number of threads, but which has size proportional to the number of threads. All such structures are eventually freed by some thread or other.

So .. wouldn't that just be the TLS array? Perhaps there are 1372 parallel traversals in the workload.
The best I can get for an allocation stack for the abovementioned freed blocks is

  Servo_TraverseSubtree
  -> geckoservo::glue::traverse_subtree
  -> style::parallel::traverse_dom
     (advertised as being in liballoc_system/lib.rs, sigh)

I presume it's really

  let tls = ScopedTLS::<D::ThreadLocalContext>::new(pool);

in traverse_dom. I'm not sure why the unwinding is so bad.

bholley, do the above few comments give you something useful to go on? Basically it sounds to me like "parallel traversal setup cost (which is proportional to the number of worker threads) swamps the actual cost of these multiple mini-traversals" -- as you initially suggested.
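For concreteness, here is a minimal sketch of the allocation pattern that line implies: a fresh per-thread slot array allocated and freed on every traversal, with size proportional to the pool size. This is illustrative only, not the actual ScopedTLS code; ThreadLocalContext here is a made-up stand-in for the real ThreadLocalStyleContext.

    use std::cell::RefCell;

    // Made-up stand-in for the per-thread style context; the real one is
    // several KB because of its inline caches.
    struct ThreadLocalContext {
        cache: [u64; 1024],
    }

    // Minimal ScopedTLS-like container: one slot per worker thread,
    // allocated at the start of a traversal and dropped (freed) at the end.
    struct ScopedTls {
        slots: Vec<RefCell<Option<ThreadLocalContext>>>,
    }

    impl ScopedTls {
        fn new(num_threads: usize) -> Self {
            // The heap allocation is num_threads * size_of::<slot>() bytes,
            // so both the allocation and the eventual (poisoning) free scale
            // linearly with the thread count -- even when the traversal only
            // styles one or two elements.
            ScopedTls {
                slots: (0..num_threads).map(|_| RefCell::new(None)).collect(),
            }
        }
    }

    fn main() {
        for _tiny_traversal in 0..1000 {
            let _tls = ScopedTls::new(64); // allocated per traversal ...
        }                                  // ... and freed again at the end of each iteration
    }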
Flags: needinfo?(bobbyholley)
In general this analysis looks fruitful - I won't have time to look until Monday, but I'll dig in then and ping you if I have other questions or encounter any mysteries. Thanks for looking at this!
Depends on: 1378845
Depends on: 1380198
The patches in bug 1380198 remove the memmoves and allocations. They definitely help, but don't solve the problem entirely. Julian, care to apply those patches and take another look?
Flags: needinfo?(bobbyholley) → needinfo?(jseward)
As an update here - I've done some profiling and found various ways to reduce traversal overhead significantly, so I'm optimistic about that part. I still don't have a good explanation for why performance on [1] gets worse with more threads though. I could entertain some amount of context-switching overhead, but FORCE_STYLO_THREAD_POOL=1 STYLO_THREADS=1 is about halfway between STYLO_THREADS=1 and STYLO_THREADS=6, so there's definitely some other effect that's hurting us when we pile on more threads. So I'll keep pushing on slimming down general traversal overhead. Julian, if you have time to look at the threading part, that would be helpful. [1] https://github.com/heycam/style-perf-tests/blob/master/perf-reftest/tiny-traversal-singleton.html
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #10)
> The patches in bug 1380198 remove the memmoves and allocations. They
> definitely help, but don't solve the problem entirely.

Correct -- they (greatly!) reduce the constant-factor inefficiencies. Even for 1 (parallel) thread, the instruction count for geckoservo::glue::traverse_subtree is more than halved. For 4 threads it is about 35% of what it was pre-patch.

Numbers below (compare against comment #6). Summary is:

For the callees of geckoservo::glue::traverse_subtree, relative to pre bug 1380198:

- free: number of calls unchanged, per-call cost reduced by a factor of 6 to 18

- drop_in_place: number of calls unchanged, per-call cost reduced by 10% to 50%

- memcpy: number of calls halved, per-call cost reduced by a factor of 23 to 30

- RecalcStyleOnly: number and cost of calls unchanged

It doesn't change the behaviour that the free() cost is dependent on the number of threads. So, presumably, we're still allocating and freeing the TLS for every traversal, but the TLS is now around 10 times smaller than it was before. Yes?

----

For geckoservo::glue::traverse_subtree, #calls and total costs are:

  p1   4113x   10.786 MI   73547 Ge
  p4   4098x   15.074 MI   74013 Ge
  p16  4148x   20.396 MI   69117 Ge
  p64  4133x   51.562 MI   64417 Ge

The callees of traverse_subtree are:

  #thr  callee_name       #calls     tot_cost    KI_per_call
  -----------------------------------------------------------
  p1    free               1369x        346 KI      0.25
        drop_in_place       671x      4,360 KI      6.49
        memcpy              432x         65 KI      0.15
        RecalcStyleOnly     150x      3,610 KI     24.06
        node_needs_trav    4079x        449 KI      0.11

  p4    free               1372x      4,056 KI      2.95
        drop_in_place       785x      4,358 KI      5.55
        memcpy             1732x        261 KI      0.15
        RecalcStyleOnly     150x      3,614 KI     24.09
        node_needs_trav    4064x        447 KI      0.11

  p16   free               1372x      9,388 KI      6.84
        drop_in_place       821x      3,238 KI      3.94
        memcpy             6928x      1,046 KI      0.15
        RecalcStyleOnly     150x      3,610 KI     24.06
        node_needs_trav    4114x        453 KI      0.11

  p64   free               1369x     36,033 KI     26.32
        drop_in_place       879x      3,273 KI      3.72
        memcpy            27648x      4,174 KI      0.15
        RecalcStyleOnly     150x      3,622 KI     24.14
        node_needs_trav    4099x        451 KI      0.11
Flags: needinfo?(jseward)
(In reply to Julian Seward [:jseward] from comment #12)
> It doesn't change the behaviour that the free() cost is dependent on the
> number of threads. So, presumably, we're still allocating and freeing the
> TLS for every traversal, but the TLS is now around 10 times smaller than it
> was before. Yes?

That sounds right. If it's still a significant issue, I could look into eliminating that allocation, which would be a bit icky but not so bad. Do I understand from your numbers in comment #12 that it is indeed worth doing? If so, please file a dependent bug and needinfo me and I'll write it up.
Flags: needinfo?(jseward)
Depends on: 1380877
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #13)
> > It doesn't change the behaviour that the free() cost is dependent on the
> > number of threads. So, presumably, we're still allocating and freeing the
> > TLS for every traversal, but the TLS is now around 10 times smaller than it
> > was before. Yes?
>
> That sounds right. If it's still a significant issue, I could look into
> eliminating that allocation, which would be a bit icky but not so bad. Do I
> understand from your numbers in comment #12 that it is indeed worth doing?
> If so, please file a dependent bug and needinfo me and I'll write it up.

I think it is worth doing. TLS free costs for small traversals are a significant source of overhead even with small numbers of threads (like, 2). I filed bug 1381021 with some more numbers.
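A rough sketch of the kind of change being proposed here -- purely illustrative; the real work is tracked in bug 1381021, and TlsPool and its methods are hypothetical names, not Servo APIs. The idea is simply to cache the slot vector across traversals instead of allocating and freeing it each time, so tiny traversals no longer pay a setup cost proportional to the worker count.

    use std::cell::RefCell;
    use std::sync::Mutex;

    // Made-up stand-in for the per-thread style context, as in the earlier sketch.
    struct ThreadLocalContext {
        cache: [u64; 1024],
    }

    // Hypothetical cached-TLS holder: the slot vector is created once and
    // handed back after each traversal instead of being freed.
    struct TlsPool {
        cached: Mutex<Option<Vec<RefCell<Option<ThreadLocalContext>>>>>,
    }

    impl TlsPool {
        fn new() -> Self {
            TlsPool { cached: Mutex::new(None) }
        }

        // Borrow the cached slot vector, creating it on first use.
        fn take(&self, num_threads: usize) -> Vec<RefCell<Option<ThreadLocalContext>>> {
            self.cached
                .lock()
                .unwrap()
                .take()
                .unwrap_or_else(|| (0..num_threads).map(|_| RefCell::new(None)).collect())
        }

        // Return the slot vector so the next traversal can reuse it.
        fn give_back(&self, slots: Vec<RefCell<Option<ThreadLocalContext>>>) {
            *self.cached.lock().unwrap() = Some(slots);
        }
    }

    fn main() {
        let pool = TlsPool::new();
        for _tiny_traversal in 0..1000 {
            let tls = pool.take(64); // reused after the first iteration
            // ... a traversal would hand out &tls[i] to worker i here ...
            pool.give_back(tls);
        }
    }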
Flags: needinfo?(jseward)
Depends on: 1381021
Depends on: 1383332
Priority: P1 → --
Priority: -- → P1
Depends on: 1392170
Depends on: 1393632
The important dependent bugs have landed, and we have verified that there is no Speedometer regression from stylo. We still spend a tiny bit more time styling on Inferno, which I would like to look into at some point, but it doesn't block shipping.
Priority: P1 → P4
status-firefox57=wontfix unless someone thinks this bug should block 57
Assignee: bobbyholley → nobody
Severity: normal → S3