Open Bug 1845130 Opened 2 years ago Updated 6 months ago

It takes too long to free a ComputedStyle object

Categories

(Core :: CSS Parsing and Computation, defect)

defect

Tracking

()

People

(Reporter: mstange, Unassigned)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Whiteboard: [sp3])

Of the entire measured time on Speedometer 3, a full percent is spent in core::ptr::drop_in_place(style::gecko_bindings::structs::root::mozilla::ComputedStyle*): https://share.firefox.dev/3O3ikGG

We're freeing all the component objects (e.g. nsStyleText) one-by-one.

Is there a reason why these "components" are separate heap allocations? Could we place them inline into ServoComputedValues?

(Making sp3 overall faster by 1% is equivalent to making a single subtest faster by 23%.)


Here are profiles of layout+style object destruction on TodoMVC-WebComponents:

Firefox: https://share.firefox.dev/4hkEGkO (2427 samples)
Chrome: https://share.firefox.dev/4dOoMfp (819 samples, 3x faster)

Making Firefox as fast as Chrome on destroying layout+style objects would make TodoMVC-WebComponents faster by 4.5%, affects most subtests, and would make overall sp3 faster by 1.4%. (updated 2024-10-17).

(In reply to Markus Stange [:mstange] from comment #0)

Is there a reason why these "components" are separate heap allocations? Could we place them inline into ServoComputedValues?

These style structs are intentionally split up, yeah -- that allows for more granular style-sharing between elements (which may e.g. have a different font but have all other properties being the same). See https://hacks.mozilla.org/2017/08/inside-a-super-fast-css-engine-quantum-css-aka-stylo/ "A sidenote: style struct sharing" and the notes about style structs in https://wiki.mozilla.org/Gecko:Overview#Gecko as well.

If I'm understanding the "could we place them inline into ServoComputedValues" strawman-proposal correctly: I think that would roughly mean giving each element its own copy of every CSS property (or something to that effect), which would indeed make freeing a single ComputedStyle instance faster, but would be much more expensive in terms of space and time overall.

Having said all that: it's possible there might still be low-hanging fruit here; maybe emilio can take a look to see if anything stands out as poke-worthy here.

Flags: needinfo?(emilio)

Yes, they are different allocations because they are shared across elements. One other interesting pattern that this code has is that style structs and such get allocated on worker threads potentially, but freed on the main thread. I see there's a bunch of locking etc going on... Maybe that can be optimized somehow?

The other thing we could do is freeing them off the main thread, I guess. Not sure what the best approach for that would be though.

Flags: needinfo?(emilio)

We could look at how often these are shared vs. not, and due to which properties. It might be that:

  • We want to split some properties into separate (cheaper) structs.
  • We might want to remove some structs if they don't achieve sharing.

I was thinking about how to speed these up and thought of somehow arena-allocating these or what not. But, crazy thought since most of the time here seems actually je_free... How crazy would it be to defer je_free / coalesce je_free calls to the same arena somewhat more generally?

Flags: needinfo?(mstange.moz)

I'd like to forward that question to Paul :)

Flags: needinfo?(mstange.moz) → needinfo?(pbone)

...somehow arena-allocating these or what not...

There's a lot here, it's a discussion that I'd like to take to a video call. I want to give a example of the general question I have:

we could make jemalloc free things quickly if we knew they all belonged to the same run, bin and arena. But would that restriction on the API be too severe that such a batch API isn't useful? So what rules are useful to a batch API that give us the right amount of optimisations?

How crazy would it be to defer je_free / coalesce je_free calls to the same arena somewhat more generally?

That sounds almost as if jemalloc had some kind of visibility into a "free/not free" status of objects and could process them all in a batch. Like mark bits in a GC. Blending normal allocators and some GC-ish features (mark bits, or refcounts) is pretty exciting.

Who can I bring into such a meeting? Emilio? Smaug? I'll be on PTO next week, returning on the 7th.

Flags: needinfo?(smaug)
Flags: needinfo?(pbone)
Flags: needinfo?(mstange.moz)
Flags: needinfo?(emilio)

I filed Bug 1843997 earlier to consider a batch API.

Depends on: 1843997

I'd be happy to attend such a call of course.

we could make jemalloc free things quickly if we knew they all belonged to the same run, bin and arena. But would that restriction on the API be too severe that such a batch API isn't useful? So what rules are useful to a batch API that give us the right amount of optimisations?

That seems hard to guarantee from style code at least. Things are usually allocated from the worker threads, which have thread-local arenas, but we don't keep around which worker thread computed the style (and even so we might share some allocations across threads).

That sounds almost as if jemalloc had some kind of visibility into a "free/not free" status of objects and could process them all in a batch. Like mark bits in a GC. Blending normal allocators and some GC-ish features (mark bits, or refcounts) is pretty exciting.

Kinda? I was arguably thinking of something a bit more dumb (just record the free calls, do them all together potentially more efficiently, at some point later). How it is implemented I don't think I have a strong opinion on tho...

Flags: needinfo?(emilio)

(In reply to Paul Bone [:pbone] from comment #6)

Who can I bring into such a meeting? Emilio? Smaug? I'll be on PTO next week, returning on the 7th.

Jeff Muizelaar would probably be happy to attend, too. (I'm on PTO until the 15th.)

Flags: needinfo?(mstange.moz)

Are we not trying to reuse the structs?

Flags: needinfo?(smaug)

... that might work! Not with an Arc<MyStruct>, but maybe with a type MyStructArc = ArcWithAllocationRecycling<MyStruct, 16>.

Severity: -- → S3

I've started writing a first cut at a recycling Arc implementation: https://github.com/mstange/recycling-arc/blob/main/src/lib.rs
I haven't looked at all at what it would take to use this inside Firefox. It would probably be better if somebody else did that part (Emilio?).
The first goal would be to just have an extremely hacky prototype, to see if this approach would help at all.

I've also started a "batch free" implementation for jemalloc. I will make it general because it could also help the snow white killer and JS GC improve their usage of free. Bug 1843997

See Also: → 1870020

Style struct and layout object destruction still shows up as a top bucket where Firefox is slower than Chrome. Recent comparison reports say that making Firefox as fast as Chrome on this bucket would improve sp3 overall by 1.4%.

Here are up-to-date profiles of layout+style object destruction on TodoMVC-WebComponents:

Firefox: https://share.firefox.dev/4hkEGkO (2427 samples)
Chrome: https://share.firefox.dev/4dOoMfp (819 samples, 3x faster)

Blocks: motionmark
You need to log in before you can comment on or make changes to this bug.