stylo: Rewrite parallel.rs

RESOLVED FIXED

Status

()

Core
CSS Parsing and Computation
P1
normal
RESOLVED FIXED
2 months ago
15 days ago

People

(Reporter: bholley, Assigned: bholley)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 2 obsolete attachments)

I went in to the Berlin office today to meet with jseward, and we had a chat with nmatsakis over vidyo. I was explaining to him how our parallel traversal works, and then we started looking at the code, and I realized that the code doesn't do anything close to what I thought it was doing. Instead, it does various things that are terrible for performance.

The biggest offense, by far, is that children of siblings are coalesced into the same work unit (I thought they were partitioned per-parent). In practice, this means there is zero parallelism until we hit a level in the DOM with more than 64 same-level cousins. This explains why stylo has been performing so much better on wikipedia than myspace: wikipedia's DOM is very wide starting pretty early on, whereas myspace is narrower.

There are also various other dumb things going on in terms of doing unnecessary allocations and context switches. I spent the afternoon fixing all this stuff up and have a result that I'm pretty happy with.

Sequential stylo can style myspace's DOM in 17ms. Before this patch, parallel stylo styled in in 12ms. After this patch, parallel styles myspace in 3.5ms.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #0)
> Sequential stylo can style myspace's DOM in 17ms. Before this patch,
> parallel stylo styled in in 12ms. After this patch, parallel styles myspace
> in 3.5ms.

Wow.  That is an excellent result.
(In reply to Julian Seward [:jseward] from comment #1)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #0)
> > Sequential stylo can style myspace's DOM in 17ms. Before this patch,
> > parallel stylo styled in in 12ms. After this patch, parallel styles myspace
> > in 3.5ms.
> 
> Wow.  That is an excellent result.

Yeah, though I don't want to jinx it until I get the Talos numbers back. There's always a chance that I'm screwing up the measurements.

Also, the win on bloom-basic is significant but less pronounced. We're now looking at a 3.4x speedup from parallelism (up from 3x), but that's still a bit short of what I'd expect.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=156e5a6f852b367e74bc50d0c50f21212c2d033b

Fingers crossed. In the mean time I'm going to go find some falafel. :-)
O - [Child 1064] WARNING: stylo: cannot get ServoStyleSheets from XBL bindings yet. See bug 1290276.: file /home/worker/workspace/build/src/layout/base/nsCSSFrameConstructor.cpp, line 2752
1813
[task 2017-05-19T19:55:06.491083Z] 19:55:06     INFO - thread '<unnamed>' panicked at 'assertion failed: !traversal.is_parallel()', /home/worker/workspace/build/src/servo/components/style/sequential.rs:27
1814
[task 2017-05-19T19:55:06.491761Z] 19:55:06     INFO - stack backtrace:
1815
[task 2017-05-19T19:55:06.815927Z] 19:55:06     INFO -    0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
1816
[task 2017-05-19T19:55:06.816665Z] 19:55:06     INFO -    1: std::sys_common::backtrace::_print
1817
[task 2017-05-19T19:55:06.816763Z] 19:55:06     INFO -    2: std::panicking::default_hook::{{closure}}
1818
[task 2017-05-19T19:55:06.816871Z] 19:55:06     INFO -    3: std::panicking::default_hook
1819
[task 2017-05-19T19:55:06.817292Z] 19:55:06     INFO -    4: std::panicking::rust_panic_with_hook
1820
[task 2017-05-19T19:55:06.817387Z] 19:55:06     INFO -    5: std::panicking::begin_panic
1821
[task 2017-05-19T19:55:06.817489Z] 19:55:06     INFO -    6: style::sequential::traverse_dom
1822
[task 2017-05-19T19:55:06.817877Z] 19:55:06     INFO -    7: geckoservo::glue::traverse_subtree
1823
[task 2017-05-19T19:55:06.817965Z] 19:55:06     INFO -    8: Servo_TraverseSubtree
1824
[task 2017-05-19T19:55:06.818103Z] 19:55:06     INFO -    9: _ZN7mozilla13ServoStyleSet25PrepareAndTraverseSubtreeEPKNS_3dom7ElementENS_21TraversalRootBehaviorENS_24TraversalRestyleBehaviorE
1825
Yeah that's just because I had some other stale stuff in the branch. The perf results should still be valid, re-pushing the debug tests here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a7c96701bfa6f4c471534dac3fdaa0fda697586d
Created attachment 8869590 [details] [diff] [review]
Fix the parallel traversal. v1
Attachment #8869590 - Flags: review?(emilio+bugs)
Here's another Talos try push based on a more recent revision: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fb98394e86539475aef2ed400965fd4b2f525088

Results look like quite an improvement. We should get this into the tree.
Comment on attachment 8869590 [details] [diff] [review]
Fix the parallel traversal. v1

Review of attachment 8869590 [details] [diff] [review]:
-----------------------------------------------------------------

::: servo/components/style/parallel.rs
@@ +95,5 @@
>  
>      queue.install(|| {
>          rayon::scope(|scope| {
> +            traverse_nodes(nodes,
> +                           /* is_tail_call = */ true,

Can we convert is_tail_call into an enum? I don't think it's too much of a hassle, and it's worth for readability I believe, specially if you document the variants well :P

@@ +233,5 @@
> +                top_down_dom(&nodes, root, traversal_data, scope, traversal, tls);
> +            });
> +        }
> +    } else {
> +        let mut first_chunk: Option<Box<[SendNode<E::ConcreteNode>]>> = None;

Why using a boxed slice here instead of an Option<NodeList<E>>? If there's a good reason, leave a comment explaining why.

@@ +250,5 @@
> +        }
> +
> +        // If this is a tail call, bypass rayon for the first chunk and invoke top_down_dom
> +        // directly.
> +        debug_assert!(first_chunk.is_some() == is_tail_call);

nit: debug_assert_eq!.
Attachment #8869590 - Flags: review?(emilio+bugs) → review+
Created attachment 8869612 [details] [diff] [review]
Fix the parallel traversal. r=emilio
Attachment #8869612 - Flags: review+
Attachment #8869590 - Attachment is obsolete: true
For the crashtest failure: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a4a94f8d02eeee9e6a5fab51d0cbd70c064a4641
Assuming that the crashtest failure is fixed, the only remaining issue is the font crash, presumably because more stuff is executing in parallel now and we're uncovering latent bugs: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a7c96701bfa6f4c471534dac3fdaa0fda697586d

I'm basically out of time for the week. Manish, can you take that? It would be good to get this patch into the tree as soon as possible.
Flags: needinfo?(manishearth)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e1eea5fdc5802aa521d8261593e8362c61817328

The fix is just

diff --git a/layout/style/ServoBindings.cpp b/layout/style/ServoBindings.cpp
index 4f7e1ed..da9774a 100644
--- a/layout/style/ServoBindings.cpp
+++ b/layout/style/ServoBindings.cpp
@@ -1916,6 +1916,7 @@ void
 Gecko_nsStyleFont_FixupNoneGeneric(nsStyleFont* aFont,
                                    RawGeckoPresContextBorrowed aPresContext)
 {
+  MutexAutoLock lock(*sServoFontMetricsLock);
   const nsFont* defaultVariableFont =
     aPresContext->GetDefaultFont(kPresContext_DefaultVariableFont_ID,
                                  aFont->mLanguage);


However, before landing, I would like to get numbers on whether or not this mutex affected perf (so that we can verify that we return to the original situation when we add a finer-grained rwlock).

(Or, I can land this first, provided this analysis is done at some point)
Flags: needinfo?(manishearth)
Try is good. Should we land this?
Created attachment 8869670 [details] [diff] [review]
gecko-side threadsafety patch
Attachment #8869670 - Flags: review?(emilio+bugs)
(In reply to Manish Goregaokar [:manishearth] from comment #12)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=e1eea5fdc5802aa521d8261593e8362c61817328
> 
> The fix is just
> 
> diff --git a/layout/style/ServoBindings.cpp b/layout/style/ServoBindings.cpp
> index 4f7e1ed..da9774a 100644
> --- a/layout/style/ServoBindings.cpp
> +++ b/layout/style/ServoBindings.cpp
> @@ -1916,6 +1916,7 @@ void
>  Gecko_nsStyleFont_FixupNoneGeneric(nsStyleFont* aFont,
>                                     RawGeckoPresContextBorrowed aPresContext)
>  {
> +  MutexAutoLock lock(*sServoFontMetricsLock);
>    const nsFont* defaultVariableFont =
>      aPresContext->GetDefaultFont(kPresContext_DefaultVariableFont_ID,
>                                   aFont->mLanguage);
> 
> 
> However, before landing, I would like to get numbers on whether or not this
> mutex affected perf (so that we can verify that we return to the original
> situation when we add a finer-grained rwlock).

I just measured with a microbenchmark which I created by applying the patch at [1] to bloom-basic.html in [2].

The results weren't pretty. Before this patch, each worker thread spends less than 1ms in Gecko_nsStyleFont_FixupNoneGeneric. With this patch, each thread spends about 200ms thrashing for the lock, ruining overall performance.

So I'm not really comfortable landing this. I'll file a dependent bug.

[1] https://gist.github.com/bholley/e839eecacc8c56f80fec47b4356e212e
[2] https://github.com/heycam/style-perf-tests

https://perfht.ml/2qBgzqr
Depends on: 1366468
Attachment #8869670 - Flags: review?(emilio+bugs) → review-
Okay, we can then try the existing fix on the bug but that is not thread safe in the presence of XML:lang and also is affected by a bug in the gfx caching code.

I'll push up a try push.
Trying the WIP fix from bug 1362599

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b6bdc63353e62e176a9b10d412b34e7d84bd9e27

This is 100% thread safe, but will panic if you use xml:lang to use new languages. Don't do that (our tests don't, so far). But that makes this a temporary fix. Feel free to land, however, it's already r=heycam.

It contains hacks because the caching code is broken. It should not affect performance aside from a miniscule blip right before the first traversal.

The right solution here is to use a read-write lock. I'm not yet sure if the RW lock should be used in conjunction with the precaching, or if it should be used in isolation. It can be used in isolation, but precaching prevents a lot of contention (OTOH it adds to the pretraverse cost, though it's mostly code that would have to run anyway at some point during traversal).

Either way, those patches can be landed. They may make one or two test fail because of the gfx bug mentioned in bug 1362599 comment 12 (I have inserted a hack fix to prevent that, but it may not always work)
Ah, the naming scheme for the unicode atom is different, try https://treeherder.mozilla.org/#/jobs?repo=try&revision=68d0ccaedcbaf8f997d2cd3d837fc0daa2cd3e76
Okay, that passes. Feel free to land!

(do a quick measurement first though)
Can you please post the patch(es) you intend to be landed in bug 1366468? It looks like there were fixups in comment 18, and I'm not comfortable just scraping the try push and pushing it.
Flags: needinfo?(manishearth)
They've been posted on bug 1362599.

I'll open a new bug for the missing XML:lang stuff later (not on laptop right now)
Flags: needinfo?(manishearth)
Attachment #8869670 - Attachment is obsolete: true
Ok, just tested them, and the overhead is gone. Thanks!

So all I need to do is hit 'land commits' on bug 1366347?
Flags: needinfo?(manishearth)
I guess that won't work, because there's a whitespace change to servo code. So I'll presumably need to strip that. Anything else?
Yeah, sorry, that snuck in there when I was pushing. Land those with the servo part stripped, yes.
Flags: needinfo?(manishearth)
https://github.com/servo/servo/pull/16971
Status: NEW → RESOLVED
Last Resolved: 2 months ago
Resolution: --- → FIXED
I've started using a revision that including this change. It's incredibly faster than before! Amazing!
Duplicate of this bug: 1320812
You need to log in before you can comment on or make changes to this bug.