Closed Bug 572200 Opened 11 years ago Closed 8 years ago

Consider always caching reset structs on rulenodes when asked for them

Categories

(Core :: CSS Parsing and Computation, defect, P2)

x86
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla19
Tracking Status
firefox18 --- fixed

People

(Reporter: bzbarsky, Assigned: bzbarsky)

References

(Blocks 1 open bug)

Details

Attachments

(5 files)

I used the attached patch to record which rulenodes had PropagateDependentBit called on them, on the assumption that we'd cache at those callsites.  Then in ~nsRuleNode I classified rulenodes into 4 bins:  either the rulenode has _some_ reset struct cached on itself or it does not, and either it's had its PropagateDependentBit member called or it has not.

The data varies slightly depending on the test load, but here are the numbers for our Tp4 test:

43471 had PropagateDependentBit but no cached data
62778 had cached data but not PropagateDependentBit
109515 had both
78368 had neither.

So looks like about 41% of rulenodes are saving the memory needed by a nsCachedResetData.  14% would stop doing so if we cached in PropagateDependentBit.  The memory involved is 14 words; I'm still working on measuring the performance impact, if any.

If I look at actual dependent bits, not just the nodes that had PropagateDependentBit called on them, then a higher percentage of nodes has the bit but not cached data; apparently some rulenodes have dependent bits but never have style data gotten from them directly.

For inherit structs we'd almost certainly not want to do this.  I didn't measure, but the chance of many rulenodes having at least one inherit struct is ... low.
I tried doing this, and it seems to be a 1% win or so, maybe, on the benchmark I was measuring on.  I'll try to get more data, ideally with less noise...
Blocks: 500410
Priority: -- → P1
Priority: P1 → P2
One other note.  If we did this, so that the mCachedResetData would never be null for a rulenode, then we could also ensure that mCachedResetData is non-null for a style context by just pointing it at mRuleNode's mCachedResetData as needed.

We could further cache structs on the rulenode that asked for them, so that reset structs are treated in the rule tree just like inherit structs in the context tree: always cached right where you want them...
So the idea here is that you're proposing that we cache structs when we set dependent bits, possibly only on the requesting node rather than on all nodes on the path?  It seems like a classic space vs. size tradeoff; I don't have an opinion on what the right thing is other than that we should measure the tradeoff.
> you're proposing that we cache structs when we set dependent bits

Yes, pretty much.  And that if we do this for all nodes in that path, then we should consider not heap-allocating the mCachedResetData for rulenodes.

Any idea what would actually be useful to measure here?
I did do some measurements basically using the following testcase:

1)  Load HTML5 single-page spec.
2)  Change body.style.top.
3)  Flush out restyles only (not layout).  Measure the time this step takes.

Note that the vast majority of the time is reparenting style contexts, so there is no selector matching or anything involved; just recursing down in ReResolveStyleContext, creating the new style contexts, and the various Get/PeekStyleData involved, plus the CalcDifference calls.

On my machine a vanilla Mac build does this in 275ms or so.

If I implement the suggestion from this bug, that drops to 260ms.

If I then inline the fast path in nsRuleNode::GetStyle* (which is now always hit after the first time for that rulenode), it drops to 245ms.

The overall time for that testcase is about 23% recursive calls in ReResolveStyleContext (maybe we should pass along fewer arguments... if we can), 15% GetChildList calls on nsBlockFrame and nsContainerFrame (mostly the latter; see bug 645597), 6% the actual creation of new style contexts.

The places where time is spent getting styles are DidSetStyleContext (5% of total time after the changes listed above), CalcStyleDifference (20-25% of the time after said changes).

So there's no way we can get more than a 30% speedup on that testcase no matter what we do in this bug.  I _think_ that testcase is the one where we call into the ruletree most in comparison to other things we're doing.  I could be wrong, of course.

I find myself wishing for the ability to write self-modifying code so that after the first GetStyle* call and caching the data I can just self-modify away all the null-checks and gunk and have a fast-path that gets from the cache.  I've clearly been spending too much time with the jits.  :(

I wonder whether a better approach is reducing the number of style structs we have, such that we can stop lazily heap-allocating the storage for them.  That might get us enough wins in terms of not needing null-checks plus better cache locality
To continue:

to do even better than the above numbers (if nothing else because there would be no cost then to caching everything on every rulenode, etc).
This could be an interesting project for a layout intern if we ever get one: measuring just what properties we actually care about how much and figuring out whether our current set of style structs is the right one...
So as a result of profiling the testcase in bug 780341 comment 56, I came back to this.

I wrote a somewhat similar patch to Boris's, also based on the idea of caching only on rule nodes that actually have a style context (though that's technically slightly different from Boris's idea of doing it on-request), rather than the intermediate rule nodes in the chain that aren't pointed to.  Except I took a slightly different tack, and decided to (a) mark rule nodes that have a style context pointing to them and (b) maintain the invariant that any rule node that has a style context pointing to it *always* has any style structs in its mDependentBits cached, which allows completely eliminating nsRuleNode::GetParentData/GetParent##struct_ and removing the mDependentBits check in nsRuleNode::GetStyleData/GetStyle##struct_ around the call to it.  (I'm also doing it for both reset and inherited structs; I'm not sure whether that's a good idea or not, though I expect it won't make much difference, and I think the uniformity is probably valuable.)

This ended up giving me a 14% improvement on the testcase, which is at least intended to have the same performance characteristics as the gaia homescreen, and thus basically be a real testcase (related to dynamic changes of transform).  (This is substantially bigger, by percentage, than the 275ms to 260ms that Boris reported in comment 5; presumably the additional inlining that Boris used there to get to 245ms could help here as well.)

The patches I tested with are:
https://hg.mozilla.org/users/dbaron_mozilla.com/patches/raw-file/38d0e857c5a4/rule-node-know-used-directly
https://hg.mozilla.org/users/dbaron_mozilla.com/patches/raw-file/38d0e857c5a4/rule-node-dont-destroy-structs
https://hg.mozilla.org/users/dbaron_mozilla.com/patches/raw-file/38d0e857c5a4/cache-on-used-rule-nodes

I think it's worth noting that this is likely to have the most performance improvement on testcases that are doing simple dynamic changse to non-inherited properties on elements with a lot of descendants that match a lot of style rules.
Hmm.  So those patches don't give us an always-non-null mCachedResetData, right?  So comment 2 won't really apply...

Even without that I think it's a good idea.
Are any of the numbers a page loading performance test?  None of them look like it to me.  (Do we still run such a test?)

I was also interested in looking at memory numbers for the pageload tests.
> Are any of the numbers a page loading performance test? 

Hrm.  I don't know....  :(
The claim on irc is that tp5n_paint may or may not be the pageload time.  And that compare-talos didn't get updated when people once again renamed tests....
On the assumption that tp5n_paint is the relevant pageloading performance test, and tp5n_main_rss_paint is the relvant pageloading memory test:



Roughly a 1% improvement on tp5n_paint Linux32:

	Welch Two Sample t-test

data:  c(366.71, 369.69, 368.22, 367.83, 368.06) and c(365.59, 366.56, 363.35, 361.9, 368.34) 
t = 2.3839, df = 5.35, p-value = 0.05955
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -0.1695921  6.0775921 
sample estimates:
mean of x mean of y 
  368.102   365.148 



Roughly a 2% improvement on tp5n_paint Linux64:

	Welch Two Sample t-test

data:  c(317.95, 323.61, 322.27, 322.78, 319.85) and c(314.7, 315.12, 316.87, 316.68, 313.85) 
t = 4.8944, df = 6.264, p-value = 0.002417
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 2.953949 8.742051 
sample estimates:
mean of x mean of y 
  321.292   315.444 




Roughly a 1% increase on tp5n_main_rss_paint Linux32:

	Welch Two Sample t-test

data:  c(109.9, 109.9, 109.8, 109.6, 109.3) and c(110.7, 111.2, 114.9, 110.5, 110.3) 
t = -2.1024, df = 4.141, p-value = 0.101
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -4.1915361  0.5515361 
sample estimates:
mean of x mean of y 
   109.70    111.52 



Much smaller increase on tp5n_main_rss_paint Linux64:

	Welch Two Sample t-test

data:  c(143.4, 143.2, 142.8, 142.7, 142.4) and c(143.3, 142.8, 143.3, 142.9, 143) 
t = -0.7752, df = 6.388, p-value = 0.4659
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -0.6576939  0.3376939 
sample estimates:
mean of x mean of y 
   142.90    143.06
Some but not all of those results (before patches, then after):


OSX 10.6 tp5n_paint:

data:  c(257.82, 258, 258.68, 258.84, 257.93) and c(255.72, 257.29, 256.31, 255.71, 257.52) 
t = 3.993, df = 6.208, p-value = 0.006694
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 0.6838871 2.8041129 
sample estimates:
mean of x mean of y 
  258.254   256.510 


OSX 10.6 tp5n_main_rss_paint:

data:  c(179.2, 179.8, 179.2, 178.7, 179.2) and c(178.6, 177.9, 179.1, 178.1, 178.7) 
t = 2.6702, df = 7.667, p-value = 0.02944
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 0.09608106 1.38391894 
sample estimates:
mean of x mean of y 
   179.22    178.48 


WINNT 6.1 tp5n_paint:

data:  c(318.46, 318.38, 315.55, 316.99, 313.13) and c(314.06, 314.48, 314.6, 314.35, 312.33) 
t = 2.3477, df = 5.366, p-value = 0.06225
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -0.1849325  5.2609325 
sample estimates:
mean of x mean of y 
  316.502   313.964 


WINNT 6.1 tp5n_pbytes_paint:

data:  c(119.7, 121.1, 121.4, 121.9, 119.9) and c(121.3, 120.7, 121, 120, 121.2) 
t = -0.0819, df = 6.182, p-value = 0.9373
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -1.226767  1.146767 
sample estimates:
mean of x mean of y 
   120.80    120.84
Comment on attachment 678991 [details] [diff] [review]
, part 1:  Make rule nodes know whether they're used directly by a style context.

r=me
Attachment #678991 - Flags: review?(bzbarsky) → review+
Comment on attachment 678992 [details] [diff] [review]
, part 2:  Allow rule nodes to cache data owned by an ancestor rule node:  don't destroy structs that we don't own.

r=me
Attachment #678992 - Flags: review?(bzbarsky) → review+
Comment on attachment 678994 [details] [diff] [review]
, part 3:  Cache data that lives in the rule tree on every relevant rule node that has a style context directly pointing to it directly.

I don't think allocation from the presshell arena is infallible yet.  We should really fix that...

r=me modulo that.
Attachment #678994 - Flags: review?(bzbarsky) → review+
(In reply to Boris Zbarsky (:bz) from comment #24)
> Comment on attachment 678994 [details] [diff] [review]
> , part 3:  Cache data that lives in the rule tree on every relevant rule
> node that has a style context directly pointing to it directly.
> 
> I don't think allocation from the presshell arena is infallible yet.  We
> should really fix that...
> 
> r=me modulo that.

ok, fixing that in bug 809533.
Comment on attachment 678991 [details] [diff] [review]
, part 1:  Make rule nodes know whether they're used directly by a style context.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): N/A
User impact if declined: Lack of significant performance improvement to dynamic changes of non-inherited CSS properties on elements with large numbers of descendants, which is a signficant improvement for the B2G homescreen.
Testing completed (on m-c, etc.):  on mozilla-central
Risk to taking this patch (and alternatives if risky): Very low; just setting a bit on an an object.
String or UUID changes made by this patch: None
Attachment #678991 - Flags: approval-mozilla-aurora?
Comment on attachment 678992 [details] [diff] [review]
, part 2:  Allow rule nodes to cache data owned by an ancestor rule node:  don't destroy structs that we don't own.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): N/A
User impact if declined: Lack of significant performance improvement to dynamic changes of non-inherited CSS properties on elements with large numbers of descendants, which is a signficant improvement for the B2G homescreen.
Testing completed (on m-c, etc.):  on mozilla-central
Risk to taking this patch (and alternatives if risky):  Low; just tells destructor not to destroy objects that we don't own.
String or UUID changes made by this patch: None
Attachment #678992 - Flags: approval-mozilla-aurora?
Comment on attachment 678994 [details] [diff] [review]
, part 3:  Cache data that lives in the rule tree on every relevant rule node that has a style context directly pointing to it directly.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): N/A
User impact if declined: Lack of significant performance improvement to dynamic changes of non-inherited CSS properties on elements with large numbers of descendants, which is a signficant improvement for the B2G homescreen.
Testing completed (on m-c, etc.):  on mozilla-central
Risk to taking this patch (and alternatives if risky):  The risk from this patch is probably relatively low simply because the code is exercised so well by just about everything in the browser that we'd notice quickly if there were something wrong with it.  That said, there's always the possibility that it has some sort of mistake that we don't catch quickly (e.g., a new rare crash, a new memory leak).
String or UUID changes made by this patch: None
Attachment #678994 - Flags: approval-mozilla-aurora?
Comment on attachment 678991 [details] [diff] [review]
, part 1:  Make rule nodes know whether they're used directly by a style context.

Approving on aurora as the risk is manageable and this is something that helps b2g
Attachment #678991 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #678992 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #678994 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.