It takes too long to free a ComputedStyle object
Categories
(Core :: CSS Parsing and Computation, defect)
Tracking
()
People
(Reporter: mstange, Unassigned)
References
(Depends on 1 open bug, Blocks 2 open bugs)
Details
(Whiteboard: [sp3])
Of the entire measured time on Speedometer 3, a full percent is spent in core::ptr::drop_in_place(style::gecko_bindings::structs::root::mozilla::ComputedStyle*)
: https://share.firefox.dev/3O3ikGG
We're freeing all the component objects (e.g. nsStyleText
) one-by-one.
Is there a reason why these "components" are separate heap allocations? Could we place them inline into ServoComputedValues
?
(Making sp3 overall faster by 1% is equivalent to making a single subtest faster by 23%.)
Here are profiles of layout+style object destruction on TodoMVC-WebComponents:
Firefox: https://share.firefox.dev/4hkEGkO (2427 samples)
Chrome: https://share.firefox.dev/4dOoMfp (819 samples, 3x faster)
Making Firefox as fast as Chrome on destroying layout+style objects would make TodoMVC-WebComponents faster by 4.5%, affects most subtests, and would make overall sp3 faster by 1.4%. (updated 2024-10-17).
Updated•2 years ago
|
Comment 1•2 years ago
|
||
(In reply to Markus Stange [:mstange] from comment #0)
Is there a reason why these "components" are separate heap allocations? Could we place them inline into
ServoComputedValues
?
These style structs are intentionally split up, yeah -- that allows for more granular style-sharing between elements (which may e.g. have a different font but have all other properties being the same). See https://hacks.mozilla.org/2017/08/inside-a-super-fast-css-engine-quantum-css-aka-stylo/ "A sidenote: style struct sharing" and the notes about style structs in https://wiki.mozilla.org/Gecko:Overview#Gecko as well.
If I'm understanding the "could we place them inline into ServoComputedValues
" strawman-proposal correctly: I think that would roughly mean giving each element its own copy of every CSS property (or something to that effect), which would indeed make freeing a single ComputedStyle
instance faster, but would be much more expensive in terms of space and time overall.
Having said all that: it's possible there might still be low-hanging fruit here; maybe emilio can take a look to see if anything stands out as poke-worthy here.
Comment 2•2 years ago
|
||
Yes, they are different allocations because they are shared across elements. One other interesting pattern that this code has is that style structs and such get allocated on worker threads potentially, but freed on the main thread. I see there's a bunch of locking etc going on... Maybe that can be optimized somehow?
The other thing we could do is freeing them off the main thread, I guess. Not sure what the best approach for that would be though.
Comment 3•2 years ago
|
||
We could look at how often these are shared vs. not, and due to which properties. It might be that:
- We want to split some properties into separate (cheaper) structs.
- We might want to remove some structs if they don't achieve sharing.
Comment 4•2 years ago
|
||
I was thinking about how to speed these up and thought of somehow arena-allocating these or what not. But, crazy thought since most of the time here seems actually je_free... How crazy would it be to defer je_free / coalesce je_free calls to the same arena somewhat more generally?
Reporter | ||
Comment 5•2 years ago
|
||
I'd like to forward that question to Paul :)
Comment 6•2 years ago
|
||
...somehow arena-allocating these or what not...
There's a lot here, it's a discussion that I'd like to take to a video call. I want to give a example of the general question I have:
we could make jemalloc free things quickly if we knew they all belonged to the same run, bin and arena. But would that restriction on the API be too severe that such a batch API isn't useful? So what rules are useful to a batch API that give us the right amount of optimisations?
How crazy would it be to defer je_free / coalesce je_free calls to the same arena somewhat more generally?
That sounds almost as if jemalloc had some kind of visibility into a "free/not free" status of objects and could process them all in a batch. Like mark bits in a GC. Blending normal allocators and some GC-ish features (mark bits, or refcounts) is pretty exciting.
Who can I bring into such a meeting? Emilio? Smaug? I'll be on PTO next week, returning on the 7th.
Comment 8•2 years ago
|
||
I'd be happy to attend such a call of course.
we could make jemalloc free things quickly if we knew they all belonged to the same run, bin and arena. But would that restriction on the API be too severe that such a batch API isn't useful? So what rules are useful to a batch API that give us the right amount of optimisations?
That seems hard to guarantee from style code at least. Things are usually allocated from the worker threads, which have thread-local arenas, but we don't keep around which worker thread computed the style (and even so we might share some allocations across threads).
That sounds almost as if jemalloc had some kind of visibility into a "free/not free" status of objects and could process them all in a batch. Like mark bits in a GC. Blending normal allocators and some GC-ish features (mark bits, or refcounts) is pretty exciting.
Kinda? I was arguably thinking of something a bit more dumb (just record the free calls, do them all together potentially more efficiently, at some point later). How it is implemented I don't think I have a strong opinion on tho...
Reporter | ||
Comment 9•2 years ago
|
||
(In reply to Paul Bone [:pbone] from comment #6)
Who can I bring into such a meeting? Emilio? Smaug? I'll be on PTO next week, returning on the 7th.
Jeff Muizelaar would probably be happy to attend, too. (I'm on PTO until the 15th.)
Reporter | ||
Comment 11•2 years ago
|
||
... that might work! Not with an Arc<MyStruct>
, but maybe with a type MyStructArc = ArcWithAllocationRecycling<MyStruct, 16>
.
Updated•2 years ago
|
Reporter | ||
Comment 12•2 years ago
|
||
I've started writing a first cut at a recycling Arc
implementation: https://github.com/mstange/recycling-arc/blob/main/src/lib.rs
I haven't looked at all at what it would take to use this inside Firefox. It would probably be better if somebody else did that part (Emilio?).
The first goal would be to just have an extremely hacky prototype, to see if this approach would help at all.
Comment 13•2 years ago
|
||
I've also started a "batch free" implementation for jemalloc. I will make it general because it could also help the snow white killer and JS GC improve their usage of free. Bug 1843997
Reporter | ||
Comment 14•10 months ago
|
||
Style struct and layout object destruction still shows up as a top bucket where Firefox is slower than Chrome. Recent comparison reports say that making Firefox as fast as Chrome on this bucket would improve sp3 overall by 1.4%.
Here are up-to-date profiles of layout+style object destruction on TodoMVC-WebComponents:
Firefox: https://share.firefox.dev/4hkEGkO (2427 samples)
Chrome: https://share.firefox.dev/4dOoMfp (819 samples, 3x faster)
Updated•6 months ago
|
Description
•