Closed Bug 1385982 Opened 7 years ago Closed 7 years ago

stylo: parallel traversal and style sharing still don't play nice together

Categories

(Core :: CSS Parsing and Computation, enhancement, P3)

53 Branch
enhancement

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox-esr52 --- unaffected
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- fixed

People

(Reporter: bzbarsky, Assigned: bzbarsky)

Details

Attachments

(1 file)

Testcase is:

  <!DOCTYPE html>
  <div id=first><span id="infirst"></span></div>
  <div id=second><span id="insecond"></span></div>
  <div id=third><span id="inthird"></span></div>

I'm seeing us use one style sharing cache (by pointer-identity) when styling the three divs and the <span id="inthird">, a second one for <span id="insecond">, and a third on for <span id="infirst">.  I would have expected all the work here to happen as a single work unit, on a single thread...  I get no difference in behavior if I force may_dispatch_tail to false in traverse_nodes in parallel.rs.

Looking at top_down_dom it seems to be a bit confused.  On the one hand, it has:

    // Collect all the children of the elements in our work unit. This will
    // contain the combined children of up to WORK_UNIT_MAX nodes
...
    let mut discovered_child_nodes = SmallVec::<[SendNode<E::ConcreteNode>; 128]>::new();

On the other hand it does:

        for n in nodes {
            // If the last node we processed produced children, spawn them off
            // into a work item.
...
            if !discovered_child_nodes.is_empty() {
              /* Do travers_nodes on discovered_child_nodes */
              discovered_child_nodes.clear();
            }

So in practice we never have more than one element's kids in discovered_child_nodes.  And in this case we end up with three separate work units, in exactly the way described above; the "inthird" span doesn't get spawned via that nested if and lands in the traverse_nodes() call in the end, which may or may not be a tail call; in tail mode it gets handled on the same thread, in non-tail it may not be, whatever.

I'm going to test making that inner check a bit less strict and waiting until we have a full work item before kicking off a separate traverse.
Summary: stylo: parallel traversal and style sharing still don't play nice together somehow → stylo: parallel traversal and style sharing still don't play nice together
OK, patch coming up.  Without that patch, loading our usual test page at https://en.wikipedia.org/wiki/Barack_Obama as of today I see 7463 non-pseudo style contexts once things settle.  With the patch, I see 5164.  In STYLO_THREADS=1 mode, I see 2236.
Hm. So the current behavior is somewhat intentional, in the sense that we previously would get no parallelism until the tree got wide enough, and that really hurt us. I fixed it in bug 1366347, and the difference was pretty pronounced.

But it's possible that we over-rotated. In particular, the work unit size was 64 before, and now it's 16. So I'm certainly willing to revisit this, but would like to do some measurements.

Ideally we'd measure the parallel traversal on our (locally-cached) testcases:
* obama
* myspace
* amazon product page
* youtube homepage
* facebook

And compare both total traversal times as well as amount of style sharing. We should compare with/without this patch, and also with half/double workunit sizes. Is that something you're willing to do?
Flags: needinfo?(bzbarsky)
I can probably do that if you link me to the relevant cached testcases.  As far as measuring traversal times, do we have existing logging to do this, or do you just do it via the profiler or something else?
Flags: needinfo?(bzbarsky) → needinfo?(bobbyholley)
Also, are you just measuring pageload, or some sort of toplevel restyle for your traversal measurements?
(In reply to Boris Zbarsky [:bz] from comment #5)
> I can probably do that if you link me to the relevant cached testcases.  As
> far as measuring traversal times, do we have existing logging to do this, or
> do you just do it via the profiler or something else?

I bundled up all of the above besides facebook (which doesn't have style overhead logged out, and which I don't want to publish logged in). I haven't been measuring facebook much anyway.

https://www.dropbox.com/s/a8mtk66jplcs1rj/stylo-sitetests.zip?dl=0

(In reply to Boris Zbarsky [:bz] from comment #6)
> Also, are you just measuring pageload, or some sort of toplevel restyle for
> your traversal measurements?

Just pageload, usually with the gecko profiler profiler at 0.5ms resolution. Not profiling StyleThreads, just measuring time in Servo_TraverseSubtree.
Flags: needinfo?(bobbyholley)
Comment on attachment 8892144 [details]
Bug 1385982.  Don't start kicking off work units during parallel stylo traversal until they're actually full.

https://reviewboard.mozilla.org/r/163122/#review168494

Canceling review until the measurements come in.
Attachment #8892144 - Flags: review?(bobbyholley)
Oh, and if it's not too late, please also test http://tests.themasta.com/twitter-compact/twitter-main.html
OK, so data.  Without the proposed change:

Amazon: 2793 styles, 44.5ms, 45ms, 45.5ms, 46.5ms, 47.5ms
Myspace: 1349 styles, 15ms, 8.5ms, 22.5ms, 7.5ms, 8.5ms
Obama: 5415 styles, 18ms, 14.5ms, 19.5ms, 15ms, 16.5ms
YouTube: 3683 styles, 52ms, 74ms, 64.5ms, 42.5ms, 44.5ms
Twitter: 1631 styles, 73.5ms, 54.5ms, 63ms, 65ms, 83.5ms

With the proposed change:

Amazon: 2735 styles, 43.5ms, 46ms, 38.5ms, 41ms, 42ms
Myspace: 1105 styles, 22ms, 11.5ms, 10ms, 13ms, 15ms
Obama: 4227 styles, 16ms, 14ms, 14ms, 16.5ms, 17ms
YouTube: 3002 styles, 61ms, 42ms, 66.5ms, 55.5ms, 64ms
Twitter: 1461 styles, 61ms, 87.5ms, 63ms, 64ms, 72ms

where "styles" means "steady-state number of non-pseudo ComputedValues live" (insofar as this makes sense for Youtube and Twitter; both are restyling continuously afaict) and the times are time under Servo_TraverseSubtree at 0.5ms resolution during pageload, with the same caveat for YouTube and Twitter; I'm sure it caught some of the restyling after the pageload finished....

The "styles" numbers are a bit noisy because they can depend on exactly when I stopped and took the measurement, expecially for the pages running script all the time.  But the noise is plus minus 10-20 or so, afaict.

With the timings, some of them were super-noisy, so I measured each one 5 times.

My general conclusion is that there are sharing wins on all the pages.  The myspace times do seem to get a bit slower, though it's hard to tell given how noisy it is.  All the other timings stay as they are, pretty much.  Obama may be a tiny bit faster; hard to say.

Thoughts?
Flags: needinfo?(bobbyholley)
OK, per IRC discussion, here are some measurements, all _with_ the proposed change, of the first restyle time for these various pages (except Amazon, which has no obvious first restyle, and Youtube for which I list the first two restyles) at different work unit sizes:

16 -- for these you can see the numbers of styles in comment 10:

Amazon: N/A | 2735 styles
Myspace: 10.2ms, 8.5ms, 5.9ms, 5.9ms, 5.5ms | 1105 styles.
Obama: 14.4ms, 16.14ms, 12.7ms, 12.7ms, 12.6ms | 4227 styles
YouTube: 2.7/8.8, 2.1/8.2, 2.3/9.2, 2.4/10.4, 3.0/7.7 | 3002 styles
Twitter: 49.7, 55.6, 74, 76, 77 | 1461 styles -- note: thermal throttling might have been kicking
                                                 in here; I got 105ms on the next load.

12:

Amazon: N/A | 2627 styles
Myspace: 5.8ms, 5.2ms, 5.1ms, 5.1ms, 5.6ms | 1240 styles
Obama: 14ms, 14.2ms, 13.0ms, 11.5ms, 12.8ms | 15593 styles
YouTube: 2.2/7.2, 2.5/7.7, 2.3/10.5, 2.6/8.9, 2.2/6.9 | 1991 styles
Twitter: 45.3, 42.5, 40.3, 56, 48.4 | 1423 styles

The youtube thing is weird because it had different behavior this time around compared to comment 10.  For example, the videos did not play and I didn't see the continuous restyling I saw before.  So take that with a grain of salt.

I don't know what to make of the other small drops in number of extant styles.  Note the huge rise on Obama, though.
So I just remeasures the style contexts on obama a few times, and that number was some sort of fluke. I got 4659, 4576, 4639.
Continuing to test work unit sizes:

20:

Amazon: N/A | 2652 styles
Myspace: 6.7ms, 6.3ms, 8.5ms, 8.4ms, 6.3ms | 1158 styles
Obama: 12.0ms, 11.7ms, 12.8ms, 10.9ms, 11.6ms | 4348 styles
YouTube: 2.3/10.8, 2.5/7.2, 2.5/8.1, 3.6/11.6, 2.2/9.1 | 1770 styles
Twitter: 52.5, 45.2, 44.2, 47.5, 50.1 | 1503 styles

Why the numbers for number of styles are not across the board smaller than 16 here is hard to say.  Presumably something about how we end up chunking things across threads...  Or maybe it's all noise.  :(
Comment on attachment 8892144 [details]
Bug 1385982.  Don't start kicking off work units during parallel stylo traversal until they're actually full.

So given all that, I think 16 looks like a reasonable size to keep sticking with, fwiw...
Attachment #8892144 - Flags: review?(bobbyholley)
Comment on attachment 8892144 [details]
Bug 1385982.  Don't start kicking off work units during parallel stylo traversal until they're actually full.

https://reviewboard.mozilla.org/r/163122/#review168916

::: servo/components/style/parallel.rs:37
(Diff revision 2)
>  /// Larger values will increase style sharing cache hits and general DOM locality
>  /// at the expense of decreased opportunities for parallelism. This value has not
>  /// been measured and could potentially be tuned.
>  pub const WORK_UNIT_MAX: usize = 16;

Link to the measurements in the bug?

::: servo/components/style/parallel.rs:137
(Diff revision 2)
>  /// * Never process a child before its parent (since child style depends on
>  ///   parent style). If this were to happen, the styling algorithm would panic.
>  /// * Prioritize discovering nodes as quickly as possible to maximize
> -///   opportunities for parallelism.
> +///   opportunities for parallelism.  But this needs to be weighed against
> +///   styling cousins on a single thread to improve sharing.
>  /// * Style all the children of a given node (i.e. all sibling nodes) on
>  ///   a single thread (with an upper bound to handle nodes with an
>  ///   abnormally large number of children). This is important because we use
>  ///   a thread-local cache to share styles between siblings.

This comment in general could be tweaked a bit more.

::: servo/components/style/parallel.rs:180
(Diff revision 2)
> -            // sibling directly on this thread without a spawn call.
> +            // children of the last sibling directly on this thread without a
> +            // spawn call.

It's not really "last sibling" anymore.

::: servo/components/style/parallel.rs:193
(Diff revision 2)
> +            // full work item to do so.  The former optimizes for speed of
> +            // discovery (we'll start discovering the kids of the things in
> +            // "nodes" ASAP).  The latter gives us better sharing (e.g. we can
> +            // share between cousins much better, because we don't hand them off
> +            // as separate work items, which are likely to end up on separate

Please describe/link to the measurements?

::: servo/components/style/parallel.rs:202
(Diff revision 2)
> +            // item is a deep tree which has (WORK_UNIT_MAX - 1) "linear"
> +            // branches, hence WORK_UNIT_MAX-1 elements at each level.  Such a

Technically I don't think the -1 is correct. If we have WORK_UNIT, it's even worse, because we lose the tail call, potentially context switch, and still process sequentially.

::: servo/components/style/parallel.rs:237
(Diff revision 2)
>      // Handle the children of the last element in this work unit. If any exist,
>      // we can process them (or at least one work unit's worth of them) directly

This comment could use updating.
Attachment #8892144 - Flags: review?(bobbyholley) → review+
Flags: needinfo?(bobbyholley)
Some more perf data bholley asked for: bloom-basic.  Without my patch:

bloom-basic: 85.59, 48.46, 47.93, 50.19, 66.51
bloom-basic-2: 53.55, 56.32, 53.00, 49.02, 45.44

With my patch:

bloom-basic: 59.63, 51.94, 46.63, 47.62, 48.20
bloom-basic-2: 49.08, 48.39, 48.08, 47.12, 48.27
Comment on attachment 8892144 [details]
Bug 1385982.  Don't start kicking off work units during parallel stylo traversal until they're actually full.

https://reviewboard.mozilla.org/r/163122/#review168916

> Technically I don't think the -1 is correct. If we have WORK_UNIT, it's even worse, because we lose the tail call, potentially context switch, and still process sequentially.

Good catch, fixed.
https://hg.mozilla.org/mozilla-central/rev/ee3040e6db09
Assignee: nobody → bzbarsky
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: