Closed Bug 1474793 Opened 4 years ago Closed 4 years ago

consider sharing UA style sheets across processes

Categories

(Core :: CSS Parsing and Computation, enhancement, P3)

Tracking

RESOLVED FIXED
mozilla68
Fission Milestone M2
Tracking Status
firefox68 --- fixed

People

(Reporter: heycam, Assigned: heycam)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [overhead:400K][layout:p1])

Attachments

(20 files, 1 obsolete file)

47 bytes, text/x-phabricator-request
Details | Review
I don't know if this is feasible yet.  Maybe not.

But we seem to use about 300 KiB to store all of the UA style sheets in a freshly opened content process that just loaded about:blank, and another 160 KiB for the cached UA sheet cascade data.  Can we share this data between processes in shared memory somehow?

The contents of UA style sheets shouldn't change; even when shown in devtools, you shouldn't be able to edit the rules.  But devtools does need to be able to look up those rules.  It would be tricky to handle the reference counting across processes.

The cached UA cascade data has reference-counted objects in it too, like selectors.  And there are atoms in there, which obviously wouldn't be pointer-identical to ones generated in the content process.

Maybe there's a different way we can represent this data in memory so it's suitable for use across processes?
I think it's going to be hard for the cascade data.  But maybe do-able for the style sheets themselves.  Since we should know at build time what the UA style sheets contain, we can ensure that we only ever need to have static atom references.

If we kept the same representation we would need to avoid actually bumping reference counts on the objects in the style sheet, since that would end up doing a copy-on-write.

Or we could have a completely separate representation -- maybe some kind of compact bytecode-like format -- that we generate at build time.  But it would be a pain to maintain both that representation and the regular Rust object representation.
(In reply to Cameron McCormack (:heycam) (busy until July 12) from comment #1)
> I think it's going to be hard for the cascade data.  But maybe do-able for
> the style sheets themselves.  Since we should know at build time what the UA
> style sheets contain, we can ensure that we only ever need to have static
> atom references.
> 
> If we kept the same representation we would need to avoid actually bumping
> reference counts on the objects in the style sheet, since that would end up
> doing a copy-on-write.

I'm not sure how the copy-on-write bit would work, since the child processes aren't forked from the parent process (i.e. e10s!=NUWA). So you can't share anything with pointers across the processes (since the virtual space doesn't line up), which makes this pretty hard.

> Or we could have a completely separate representation -- maybe some kind of
> compact bytecode-like format -- that we generate at build time.  But it
> would be a pain to maintain both that representation and the regular Rust
> object representation.

Yeah that seems like a huge amount of work, and not obviously tractable.

I wonder if we could get somewhere by partitioning the UA sheets and lazily loading them based on certain conditions. For example, we could probably avoid loading the MathML and SVG sheets until an element from that namespace is inserted into the document. It seems like we could trim counterstyles.css down and load the infrequently-used parts lazily. There's a lot of stuff in forms.css for a few infrequently-used form controls like number and range; perhaps we could load those on demand as well?
We've actually been moving away from on-demand loading recently, because the more granular you get about which UA style sheets are loaded in a given document, the more chance you have of having many entries in the UA sheet cascade cache (for different sets of sheets).  Though maybe in a post-Fission world there will be fewer entries in that cache anyway just because we have fewer documents loaded in a given process.
(In reply to Bobby Holley (:bholley) from comment #2)
> > Or we could have a completely separate representation -- maybe some kind of
> > compact bytecode-like format -- that we generate at build time.  But it
> > would be a pain to maintain both that representation and the regular Rust
> > object representation.
> 
> Yeah that seems like a huge amount of work, and not obviously tractable.

And I agree with this, FWIW, although it might be possible to leverage serde to avoid writing a whole bunch of code to handle this representation.
I discussed with heycam during the work week and I suggested another possible route to go.

My suggestion was that we can probably migrate the majority of the UA sheets into (maybe generated) Rust code, and have a special routine invoked during selector matching plus an additional (non-Arced) style source variant for the rule tree.

This is feasible because, as far as I can see, the majority of rules in UA sheets have only a single compound selector (i.e. no combinator), and many of them are even just a single simple selector, which makes it very easy to do the matching in code without going through the general matching mechanism.

Additionally, doing so may allow us to remove several internal pseudo-classes which were added purely for performance reasons and/or to enable special matching.

I suppose that putting data into code is probably the safest way to share something between processes.
And as for combinators used in UA sheets, probably all of them are child combinators, and they should usually take the happy path when the last element matches. The child combinator is probably not very hard to support in code as well.
Oh, well, we have some descendant combinators... which are probably not too bad either :)
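To make the idea concrete, here is a minimal, hypothetical sketch (the `Element` type and function name are invented for illustration, not Stylo's actual types) of matching a rule whose selector is a single compound selector, such as `input:disabled`, without going through the general matching machinery:

```rust
// Hypothetical sketch of matching a UA-sheet rule with a single compound
// selector (e.g. `input:disabled`) directly in code, bypassing the general
// selector-matching machinery. `Element` is a stand-in, not a Gecko type.
struct Element {
    local_name: &'static str,
    disabled: bool,
}

// Every simple selector in the compound is checked against the one
// element; with no combinators there is no tree walking at all.
fn matches_input_disabled(e: &Element) -> bool {
    e.local_name == "input" && e.disabled
}
```

Because each check only consults the candidate element itself, no context (ancestors, siblings) is needed, which is what makes the combinator-free subset the easy starting point.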
Here's the style sheet sizes for the sheets loaded in a fresh content process that only loaded about:blank:

128312 resource://gre-resources/forms.css
104912 resource://gre-resources/html.css
50216 resource://gre-resources/quirk.css
39288 resource://gre-resources/ua.css
25944 resource://gre-resources/mathml.css
14704 chrome://global/content/minimal-xul.css
8592 resource://gre/res/svg.css
5472 chrome://global/skin/scrollbars.css
2096 resource://gre-resources/counterstyles.css
1584 about:PreferenceStyleSheet
400 resource://gre-resources/noscript.css
Slightly different results with bug 1475191 patch applied:

128472 resource://gre-resources/forms.css
118880 resource://gre-resources/html.css
50504 resource://gre-resources/quirk.css
39416 resource://gre-resources/ua.css
26456 resource://gre-resources/mathml.css
14864 chrome://global/content/minimal-xul.css
8816 resource://gre/res/svg.css
5504 chrome://global/skin/scrollbars.css
2096 resource://gre-resources/counterstyles.css
1648 about:PreferenceStyleSheet
400 resource://gre-resources/noscript.css
Depends on: 1475197
It does seem like forms.css is a big fish here. A lot of that is the stuff for range, number, and date, so we could consider being more lazy with those, but that has the downsides mentioned in comment 3.

I think Xidorn's idea about baking this stuff into the binary is a really good one, and worth investigating more.
I guess that should be fine, although it does mean a step backwards in maintainability of the UA sheets.  A couple of things to be mindful of:

* handling invalidation correctly, if we're hard coding any rules that have dependencies on attribute values and element state

* ensuring we walk these rules at the right time, in case the relative order of rules (within a given sheet or between other UA sheets) matters for the cascade order


Xidorn mentions doing this just for rules that are sequences of simple selectors (i.e. have no combinators), but I wonder if it would be that much harder to support any arbitrary selectors, so long as we only reference static atoms.
Given the numbers from comment 9, roughly guessing 500K of overhead / content process.
Whiteboard: [overhead:500K]
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #12)
> Given the numbers from comment 9, roughly guessing 500K of overhead /
> content process.

Well, it's really 397k if you add up the numbers. Is the issue that we need to round up or down, or should we put 400k instead?
Flags: needinfo?(erahm)
(In reply to Bobby Holley (:bholley) from comment #13)
> (In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment
> #12)
> > Given the numbers from comment 9, roughly guessing 500K of overhead /
> > content process.
> 
> Well, it's really 397k if you add up the numbers. Is the issue that we need
> to round up or down, or should we put 400k instead?

I wasn't sure if we're also addressing the cached sheets (comment #0) in this bug as well. Maybe we should split that out?

Either way we're just trying to track level of impact with the [overhead] tag, rounding is okay :)
Flags: needinfo?(erahm)
Whiteboard: [overhead:500K] → [overhead:400K]
(In reply to Cameron McCormack (:heycam) from comment #11)
> I guess that should be fine, although it does mean a step backwards in
> maintainability of the UA sheets.

It might be a step backwards to some extent, but it has the advantage that the values are guaranteed to be valid at compile time. (We don't really have a way to verify the UA sheets we ship, especially the UA-only parts, at the moment.)

Also, if we at some point convert all UA styles into code, we can remove the UA-only checks as well as the parsing functions of some internal properties.

> * handling invalidation correctly, if we're hard coding any rules that have
> dependencies on attribute values and element state
> 
> * ensuring we walk these rules at the right time, in case the relative order
> of rules (within a given sheet or between other UA sheets) matters for the
> cascade order
> 
> Xidorn mentions doing this just for rules that are sequences of simple
> selectors (i.e. have no combinators), but I wonder if it would be that much
> harder to support any arbitrary selectors, so long as we only reference
> static atoms.

The things you listed above are exactly the reason I think we can start with compound selectors (i.e. no combinators). If everything can be checked against just a single element (without its context), there should be fewer cases where we could make mistakes, I guess...
(In reply to Eric Rahm [:erahm] from comment #14)
> I wasn't sure if we're also addressing the cached sheets (comment #0) in
> this bug as well. Maybe we should split that out?

Split that out to bug 1481975.
Assignee: nobody → cam
Status: NEW → ASSIGNED
Summary: consider sharing UA style sheets and/or cascade data across processes → consider sharing UA style sheets across processes
Depends on: 1482782
Note that compiling the UA sheets into code is going to make artifact builders unhappy, so there's going to need to be some sort of fallback case that doesn't use the compiled-in sheets, or uses the actual sheets if they're "newer", or something.
Why would that make artifact builders unhappy? The sheets would go into the binary directly and would no longer be .css files changeable in artifact builds, which should be fine since those are UA sheets and frontend people are unlikely to want to change them.
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #18)
> Why would that make artifact builders unhappy?

I don't understand why this is a question when:

> The sheets would go into the
> binary directly, and it would no longer be .css files changeable for
> artifact builds

really provides the answer.

> which should be fine since those are UA sheets and frontend
> people unlikely want to change them.

I dunno, if I'm a frontend person and I want to iterate on sheet design, for whatever reason, and I now have to actually compile a binary every single time, I feel like that would make me quite unhappy, no matter how infrequent it is.  How often do these sheets get modified, anyway?  Do we have any kind of numbers on that?
(In reply to Nathan Froyd [:froydnj] from comment #19)
> I dunno, if I'm a frontend person and I want to iterate on sheet design, for
> whatever reason, and I now have to actually compile a binary every single
> time, I feel like that would make me quite unhappy, no matter how infrequent
> it is.  How often do these sheets get modified, anyway?  Do we have any kind
> of numbers on that?

The point I think Xidorn was making is that most front-end CSS isn't UA sheets. UA sheets are generally defined in the HTML spec, and are something mostly the Gecko people touch.
Aha, #eggonface.

My misunderstanding, then, carry on! :)
Depends on: 1483121
I spoke a bit with Cameron about this, and I think depending on how complex this ends up looking like it may or may not end up paying off, assuming we can do more targeted reductions for all sheets.

For example, Facebook's login page and nytimes.com both load more than 5MB of stylesheets here (lol). It may be worth seeing what the average origin's sheet sizes are, but if we could shave off, let's say, 10% of the memory that the current stylesheets or cascade data use, it would end up reducing way more memory than this.

I may take a look at some of these if I have some free time; filed bug 1484563 to track that.
(In reply to Emilio Cobos Álvarez (:emilio) from comment #22)
> I spoke a bit with Cameron about this, and I think depending on how complex
> this ends up looking like it may or may not end up paying off, assuming we
> can do more targeted reductions for all sheets.
> 
> For example, Facebook's login page and nytimes.com all load more than 5MB of
> stylesheets here (lol). It may be worth to see what the average origin's
> sheet sizes are, but if we could reduce, let's say, 10% of the memory than
> the current stylesheets or cascade data use, it will end up reducing way
> more memory than this.

If there are ways we can generally reduce stylesheet memory usage across the board, that's great - but we also need to get our fixed per-process overhead down. There are lots of simple sites that currently don't consume a lot of memory to load, but will start to do so when we enable fission. Given that the UA sheets consume ~400k per process, encoding the UA sheets into the text segment seems very much worth doing.
If we can handle this sharing in a way that allows us to use it for stylesheets added by extensions, it will probably add substantially to the savings. Ad blockers are notorious for injecting huge stylesheets into every site, and lots of other add-ons tend to inject huge UI stylesheets on the off chance that they may eventually need them.

That said, even if we can't manage that, just sharing the UA stylesheets is almost certainly worth doing, even if web sites in general are a bigger problem. The same sort of argument goes for areas like JS. Web sites in general will probably have a lot more JS overhead than our chrome code does, but they don't suffer from the same sort of multiplying factor that we will as we increase our process count.

Our current base number is still close to 15MB in the best case, and we definitely can't afford 15MB of extra overhead for every frame, even if the websites we load into many frames are much larger than that.
Sure, to be clear I'm just somewhat concerned about the potential complexity here.

This would only ever work for UA sheets (static sheets), not for add-ons, and honestly it's a fair amount of work and complexity. Also note that no other engine supporting Site Isolation does this at all, as far as I can tell (I can try to ask Blink folks if you're curious about why; I kind of am...).

WebKit is doing a large set of general memory-usage improvements, which end up adding-up. Just cherry-picking the ones that are style-system related and that have landed over the last couple months:

  https://bugs.webkit.org/show_bug.cgi?id=186731
  https://bugs.webkit.org/show_bug.cgi?id=186705
  https://bugs.webkit.org/show_bug.cgi?id=186711
  https://bugs.webkit.org/show_bug.cgi?id=186700
  https://bugs.webkit.org/show_bug.cgi?id=186737
  https://bugs.webkit.org/show_bug.cgi?id=186739
  https://bugs.webkit.org/show_bug.cgi?id=186741
  https://bugs.webkit.org/show_bug.cgi?id=186713
  https://bugs.webkit.org/show_bug.cgi?id=186698
  https://bugs.webkit.org/show_bug.cgi?id=186708
  https://bugs.webkit.org/show_bug.cgi?id=186988
  https://bugs.webkit.org/show_bug.cgi?id=186990

I'm not asserting that they'd have the same multiplicative effect on every content process as this of course (maybe, though?). But also this kind of stuff would affect all stylesheets across the board, including add-on sheets, and every content page. I'm sure that this kind of stuff does end up adding up.

Anyway, the point I was trying to make in comment 22 is just that I don't think that any price in complexity is worth paying for 400K of savings if we can get them from somewhere else, and this sure looks complex without having seen the whole patch. I think it's worth investigating what other engines do in terms of memory usage for Site Isolation for layout and style.

That being said if the thing ends up not being as complex / hard to maintain as I fear and we can get all of it then that'd be great, for sure. In any case Cam is the style system owner, he has the last word, and also he's doing the work here so...
(In reply to Kris Maglione [:kmag] from comment #24)
> If we can handle this sharing in a way which will allow it to use it for
> stylesheets added by extensions, it will probably substantially add to the
> savings. Ad blockers are notorious for injecting huge stylesheets into every
> site, and lots of other add-ons tend to inject huge UI stylesheets on the
> off chance that they may eventually need them.

Ugh, I forgot about adblockers. That's going to be a big problem, come to think of it - those stylesheets tend to be very large, and we can't bake them into the binary.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #25)

> This would only ever work for UA sheets (static sheets), not for add-ons)

This is a fair point. Per above, it would be really great if we could find a solution that also helped addon sheets. But that doesn't mean we can just punt on the UA sheets.

> and honestly it's a fair amount of work and complexity. Also note that no
> other engine supporting Site Isolation does this at all as far as I can tell
> (I can try to ask Blink folks if you're curious about why, I kind of am...).

It's hard to compare engines here. Other browsers don't have frame scripts or as many vtables, so they have a natural advantage in terms of per-process memory overhead. And Chrome is certainly taking heat over the memory regressions from site isolation.

> WebKit is doing a large set of general memory-usage improvements, which end
> up adding-up. Just cherry-picking the ones that are style-system related and
> that have landed over the last couple months:
> 
>   https://bugs.webkit.org/show_bug.cgi?id=186731
>   https://bugs.webkit.org/show_bug.cgi?id=186705
>   https://bugs.webkit.org/show_bug.cgi?id=186711
>   https://bugs.webkit.org/show_bug.cgi?id=186700
>   https://bugs.webkit.org/show_bug.cgi?id=186737
>   https://bugs.webkit.org/show_bug.cgi?id=186739
>   https://bugs.webkit.org/show_bug.cgi?id=186741
>   https://bugs.webkit.org/show_bug.cgi?id=186713
>   https://bugs.webkit.org/show_bug.cgi?id=186698
>   https://bugs.webkit.org/show_bug.cgi?id=186708
>   https://bugs.webkit.org/show_bug.cgi?id=186988
>   https://bugs.webkit.org/show_bug.cgi?id=186990
> 
> I'm not asserting that they'd have the same multiplicative effect on every
> content process as this of course (maybe, though?). But also this kind of
> stuff would affect all stylesheets across the board, including add-on
> sheets, and every content page. I'm sure that this kind of stuff does end up
> adding up.

If we can shrink stylesheets significantly without significantly compromising performance, I'm all ears. 

> Anyway, the point I was trying to make in comment 22 is just that I don't
> think that any price in complexity is worth paying for 400K of savings if we
> can get them from somewhere else

If "somewhere else" means "shrinking stylesheets", that's certainly an alternative, but per above I'm skeptical that we can shrink them enough. If it means "finding memory savings elsewhere in the platform", that's a tougher sell. Per [1], there just aren't enough options for memory reduction to ignore a potential 400k win.


> That being said if the thing ends up not being as complex / hard to maintain
> as I fear and we can get all of it then that'd be great, for sure. In any
> case Cam is the style system owner, he has the last word, and also he's
> doing the work here so...

Yeah. And maybe he'll have a good idea for addon stylesheets. :-)  

[1] https://mail.mozilla.org/pipermail/firefox-dev/2018-July/006628.html
Most of my patch queue is oriented around pointing to selector components, rules, declaration blocks, etc. in static const memory without using pointers.  If we want to support sheets loaded by WebExtensions too, then that sounds like something we'd need to do in the parent process and use shared memory for, but we can probably use most of the same setup to avoid pointer values in the data.  Still, it would be a little trickier, as we wouldn't just have a single flat array of selector components etc., but one per extension-loaded style sheet.

Do we need to support mutation of style sheets loaded by WebExtensions?  I hope not. :-)
(In reply to Cameron McCormack (:heycam) (away 24 Aug) from comment #27)
> Most of my patch queue is oriented around pointing to selector components,
> rules, declaration blocks, etc. in static const memory without using
> pointers.  If we want to support sheets loaded by WebExtensions too, then
> that sounds like something we'd need to do in the parent process and use
> shared memory for, but we can probably use most of the same same setup to
> avoid pointer values in the data.  Still it would be a little trickier as we
> wouldn't just have a single flat array of selector components etc., but one
> per extension-loaded style sheet.

Oh, nice! That actually sounds pretty tractable, though certainly some amount of complexity. 
 
> Do we need to support mutation of style sheets loaded by WebExtensions?  I
> hope not. :-)

I also hope not. Kris?
Flags: needinfo?(kmaglione+bmo)
(In reply to Cameron McCormack (:heycam) (away 24 Aug) from comment #27)
> If we want to support sheets loaded by WebExtensions too, then
> that sounds like something we'd need to do in the parent process and use
> shared memory for, but we can probably use most of the same same setup to
> avoid pointer values in the data.

That should be doable. I've already been thinking about moving some of the
content script pre-loading logic to the parent process so we don't have to
re-compile every script for every new process. Doing the same for stylesheets
wouldn't be much extra work.


(In reply to Bobby Holley (:bholley) from comment #28)
> > Do we need to support mutation of style sheets loaded by WebExtensions?  I
> > hope not. :-)
> 
> I also hope not. Kris?

Not the ones that are loaded via the sheet loading APIs. Those are shared
between documents, and injected via DOMWindowUtils so they shouldn't be visible
to pages.

Sheets loaded via <link> tags probably need to support mutation, but those are a
lost cause anyway.
Flags: needinfo?(kmaglione+bmo)
One fly in the ointment is atoms.  The switch to use indexes for static atoms in my patch queue works well for the UA style sheets, since we just require up front that any identifier in the UA sheets has a static atom, and then have the canonical reference to them be their index, rather than the nsStaticAtom pointer.  For sheets loaded by extensions, we would need to have some sort of dynamically registered indexes/IDs for atoms, have that in shared memory, and then always resolve that index/ID to the nsDynamicAtom pointer when doing equality checks.  Doable, but much more of a pain.
(In reply to Cameron McCormack (:heycam) (away 24 Aug) from comment #30)
> One fly in the ointment is atoms.  The switch to use indexes for static
> atoms in my patch queue works well for the UA style sheets, since we just
> require up front that any identifier in the UA sheets have a static atom,
> and then have the canonical reference to them be their index, rather than
> the nsStaticAtom pointer.  For sheets loaded by extensions, we would need to
> have some sort of dynamically registered indexes/IDs for atoms, have that in
> shared memory, and then always resolve that index/ID to the nsDynamicAtom
> pointer when doing equality checks.  Doable but much more of a pain.

Yeah. It seems reasonable to land the UA sheet stuff as the MVP, just keeping in mind the eventual desire to make it work for addon sheets (and probably getting a bug on file to track it).
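The index-based scheme described in comment 30 could be sketched roughly like this (names and layout are illustrative only, not the real Stylo representation): the shared data stores indices into a build-time atom table instead of atom pointers, so it contains no process-specific addresses and can live in static const memory:

```rust
// Hypothetical sketch: selector components refer to atoms by index into a
// static table known at build time, rather than by pointer, so the data
// is position-independent and can be shared across processes.
#[derive(Debug, PartialEq)]
enum Component {
    LocalName(u32), // index into STATIC_ATOMS, not an atom pointer
    Class(u32),
}

// Build-time atom table; the index is the canonical reference.
static STATIC_ATOMS: &[&str] = &["input", "disabled"];

// A "pointer-free" selector stored entirely in static memory.
static UA_SELECTOR: &[Component] =
    &[Component::LocalName(0), Component::Class(1)];

// Resolving an index back to the local process's atom (modeled here as
// just the string itself).
fn atom_str(index: u32) -> &'static str {
    STATIC_ATOMS[index as usize]
}
```

The requirement that every identifier in the UA sheets has a static atom is what makes the index the canonical reference; anything not in the build-time table simply can't be represented this way.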
Whiteboard: [overhead:400K] → [overhead:400K][layout:p1]
Depends on: 1500362
Since this got mentioned in the meeting today, I just wanted to pop in and say that sharing memory between processes at the same address space is concerning. I'd like to better understand how that memory will be interacted with. Specifically, passing pointers (raw or offsets) over IPC is exceptionally dangerous.
(In reply to Tom Ritter [:tjr] from comment #32)
> Since this got mentioned in the meeting today, I just wanted to pop in and
> say that sharing memory between processes at the same address space is
> concerning. I'd like to better understand how that memory will be interacted
> with. Specifically, passing pointers (raw or offsets) over IPC is
> exceptionally dangerous.

Since all of the data would be generated in the parent process, I'm not especially concerned about it being dangerous.

I think the main issue is that style data has references to other data structures within the same data set, and having to store/access all of that data by offset is complicated. If there are certain data blocks that are mapped at the same address in all child processes, that makes the situation much simpler, aside from the problem of actually mapping data blocks at the same address in all content processes. That's easy enough on 64-bit, but a bit tricky on 32-bit.
(In reply to Kris Maglione [:kmag] from comment #33)
> (In reply to Tom Ritter [:tjr] from comment #32)
> > Since this got mentioned in the meeting today, I just wanted to pop in and
> > say that sharing memory between processes at the same address space is
> > concerning. I'd like to better understand how that memory will be interacted
> > with. Specifically, passing pointers (raw or offsets) over IPC is
> > exceptionally dangerous.
> 
> Since all of the data would be generated in the parent process, I'm not
> especially concerned about it being dangerous.
> 
> I think the main issue is that style data has references to other data
> structures within the same data set, and having to store/access all of that
> data by offset is complicated. If there are certain data blocks that are
> mapped at the same address in all child processes, that makes the situation
> much simpler.

Okay, that's what I was hoping: all pointers go Parent -> Child only.

This is a built-in remote code execution engine for Parent -> Child Process compromise, but I expect we already have a few of those.
(In reply to Kris Maglione [:kmag] from comment #33)
> I think the main issue is that style data has references to other data
> structures within the same data set, and having to store/access all of that
> data by offset is complicated. If there are certain data blocks that are
> mapped at the same address in all child processes, that makes the situation
> much simpler.

Yes, that's exactly my thinking.  The original patches I was working towards in this bug, for storing the style sheet in static data, basically added a bunch of places where we stored offsets into static data arrays, and added a bunch of enums that allowed either that or regular heap pointers.  It was getting too complex.

> Aside from making sure we can actually map data blocks at the
> same address in all content processes. That's easy enough on 64-bit, but a
> bit tricky on 32-bit.

How important is it to do something at all on 32-bit?  The technique of choosing a random address to map the shared memory at should work well for 64-bit, but on 32-bit it's going to cut the available contiguous address space in half on average, which wouldn't be great.
(In reply to Cameron McCormack (:heycam) from comment #35)
> > Aside from making sure we can actually map data blocks at the
> > same address in all content processes. That's easy enough on 64-bit, but a
> > bit tricky on 32-bit.
> 
> How important it is it to do something at all on 32 bit?  The technique to
> choose a random address to map the shared memory to should work well for 64
> bit, but on 32 bit on average it's going to cut the available contiguous
> address space in half, which wouldn't be great.

I think we should probably make a best effort to use shared data on 32-bit and fall back to per-process loads if we need to. I'm not sure what the Fission situation is going to look like on 32-bit, but I'm still pretty worried that 32-bit users will in general have less RAM, and that we won't be able to entirely ignore them.

We're going to wind up allocating stylesheet memory in those content processes regardless, so I kind of suspect the address space situation will in general look about the same either way, and possibly even give us less fragmentation in the shared sheet case.
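The trade-off under discussion can be sketched as follows (purely illustrative): if every process maps the shared block at the same base address, internal references can be ordinary pointers; otherwise each cross-reference has to be an offset resolved against the local mapping base:

```rust
// Illustrative sketch of offset-based addressing for a shared block.
// Offsets stored in the shared data are valid in every process; the
// resolved addresses only agree if the mapping base agrees.
struct SharedBlock {
    base: usize, // address at which this process mapped the block
}

impl SharedBlock {
    // Resolve a cross-reference stored as an offset within the block.
    fn resolve(&self, offset: usize) -> usize {
        self.base + offset
    }
}
```

If all processes can be made to agree on `base` (easy to arrange on 64-bit, harder on 32-bit, as noted above), the `resolve` step disappears and the shared data can embed ordinary pointers directly.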
Depends on: 1504034
Depends on: 1504104
Depends on: 1506760
Depends on: 1507050
Depends on: 1507687
Depends on: 1511276
This looks like a use case for a general way to freeze shared memory (bug 1479960).  A read-only mapping of a writeable descriptor/handle can generally have write permissions re-added later, so we need to make sure we're not ever giving the content processes write access (even if we trust the process at the time the mapping is established).
Depends on: 1479960
Thanks for the pointer to your work, Jed; I agree it would be good to drop write access from the parent process as soon as we finish copying the data into the buffer.
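The write-then-freeze discipline can be illustrated with a type-state sketch (this only models the ordering constraint; the actual mechanism in bug 1479960 drops OS-level write permissions on the mapping, and these type names are invented for illustration):

```rust
// Sketch of the freeze discipline: the parent fills the buffer through a
// writable handle, then irrevocably gives up write access before any
// content process sees the data. Consuming `self` makes the step one-way.
struct WritableShmem {
    bytes: Vec<u8>,
}

struct FrozenShmem {
    bytes: Vec<u8>,
}

impl WritableShmem {
    fn new() -> Self {
        WritableShmem { bytes: Vec::new() }
    }

    fn write(&mut self, data: &[u8]) {
        self.bytes.extend_from_slice(data);
    }

    // After freeze() the writable handle no longer exists, so no code
    // path at this level can re-add write access.
    fn freeze(self) -> FrozenShmem {
        FrozenShmem { bytes: self.bytes }
    }
}

impl FrozenShmem {
    fn as_slice(&self) -> &[u8] {
        &self.bytes
    }
}
```

The OS-level concern in comment 37 is the part this sketch can't capture: a read-only mapping of a still-writable handle can have write permission re-added, so the handle itself must be frozen before content processes receive it.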
Depends on: 1515533
Depends on: 1516212
Blocks: 1519001

Now that bug 1516493 and bug 1515201 have landed, my current patch queue shows the expected content process overhead reductions:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=d565e9d7de84da8d3974c8f5a02fc3067cc3420f&newProject=try&newRevision=3fbde7f7886236ca2641b8187883068663ddc6a1&framework=4

Nice work! And to make sure I understand - the solution here will also apply to sheets injected via addons?

Almost. There are a couple of style sheet features (like @import sheets) that I skipped support for since they weren't needed in UA sheets. I still need to investigate popular add-ons though to see whether they are loading big sheets that apply to all documents, or if they have adjusted to inserting page-specific sheets. I will look at that once this lands.

If we do have big addon-inserted sheets, then the one tricky thing will be atoms. In the UA sheets we have a fixed set of strings that are atomized, so we can just assert that all of the atoms within the style sheets are static atoms. For a sharable add-on generated sheet, we'd probably need to do something like atomize in each content process separately, and dynamically look up those atoms from a big table keyed on the string's pointer, or something similar.
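To make the speculative idea above concrete, here is a minimal sketch of such a per-process lookup table. All names are illustrative (this is not Gecko's atom machinery), and the shared-memory string address is modeled as a plain offset:

```rust
use std::collections::HashMap;

// Hypothetical per-process table mapping a shared-memory string's location
// (modeled as an offset here) to a process-local atom id. Repeated lookups
// for the same shared string yield the same local atom.
struct LocalAtomTable {
    by_offset: HashMap<usize, u32>, // shmem string offset -> local atom id
    strings: Vec<String>,           // local atom storage
}

impl LocalAtomTable {
    fn new() -> Self {
        LocalAtomTable { by_offset: HashMap::new(), strings: Vec::new() }
    }

    // Atomize the string found at `offset` in the shared buffer.
    fn atomize(&mut self, offset: usize, s: &str) -> u32 {
        if let Some(&id) = self.by_offset.get(&offset) {
            return id;
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_string());
        self.by_offset.insert(offset, id);
        id
    }
}
```

Each content process would build its own table lazily, so the shared sheet data itself only ever stores the stable offset.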

Depends on: 1519011
Depends on: 1519737

I had a look at what ABP, the most popular ad-blocking addon, is doing. I tested by opening a number of tabs (loading theage.com.au, reddit.com, bbc.com, mozilla.org, cnn.com, abc.net.au, and duckduckgo.com).

Each content process ended up with at least one very large (a bit over 1 MiB) style sheet, and most of them were unique. One of them ended up being loaded in all but one content process, and it was inserted into tracking / ad iframes. The sheet loaded into the top level document for each of these sites was unique for each site. I haven't dug into what the differences are, but they only differ by several hundred bytes, so it's probably a ~1 MiB prefix of general blocking rules plus a small number of site-specific rules. So this is already bad news in a non-Fission world, since these will be unique style sheets even within the same content process.

Some sites also got a small, site-specific style sheet (< 2 KiB).

If ABP were always using the same, separate 1 MiB style sheet, and splitting out anything site specific to a separate sheet, then we might be able to do something. The WebExtensions API ends up calling nsStyleSheetService::PreloadSheetAsync to create a StyleSheet object, and nsDOMWindowUtils::AddSheet later to insert it explicitly into each document. In PreloadSheetAsync we could send the style sheet contents to the parent process, get it to parse it into a sharable form, then pass a handle to the sharable form back down to the content process. The chance of allocating another shared memory buffer that can be mapped at the same address in the parent and content processes will be lower, at least on 32 bit, though I guess we could just reserve another few megabytes of address space for this purpose at the time we allocate the shared memory for the UA style sheets.

(The WebExtensions API takes the style sheet text from the addon and generates a data: URI which it passes into the style sheet service. This is also not great, as we have a massive nsIURI object hanging around in addition to the 1 MiB of parsed style sheet data.)

I also looked at uBlock Origin, which is the second most popular ad-blocking addon. It didn't load any style sheets into tracking / ad iframes. Each site had a unique style sheet. There may well be some commonality between them but it's not a big prefix like the ABP ones, so I couldn't tell at first glance. The sheet sizes were around 130 KiB.

The only thing left blocking uploading patches here is a Talos regression in a couple of tests (ts_paint, sessionrestore) on macOS. Profiling and bisecting my code reveals that this is due to the ipc::SharedMemoryBasic::ShareToProcess call that I do in ContentParent::InitInternal, so I can pass the shared memory buffer down to the content process early on, in the SetXPCOMProcessAttributes IPC call.

It turns out that this call gets blocked, waiting for the newly created content process to finish its sandbox compilation call, which happens under mozilla::EarlyStartMacSandboxIfEnabled. We have bug 1412855 on file for sandbox compilation being slow; here it's taking around 100 ms.

Making the early macOS sandbox start later (after my ShareToProcess call) sounds like it defeats the purpose of being early. So I think either we need to fix the slow sandbox compilation or use some other method for passing the shared memory down to the child, e.g. like what we do for prefs, passing a handle on the command line.

Base ts_paint profile:

https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FCcjM7ADDQw-oKWKL0Qqtlg%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_ts_paint.zip/calltree/?file=profile_ts_paint%2Fstartup%2Fcycle_19.profile&globalTrackOrder=0-2-3-4-5-1&hiddenGlobalTracks=2-3-4-5&localTrackOrderByPid=2566-1-0~2567-0~&range=0.0000_0.7335&search=mach_msg_trap&thread=0&v=3

Profile with patch queue applied:

https://perf-html.io/from-url/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FG6c2RW6ASS-w31HzoCooQw%2Fruns%2F0%2Fartifacts%2Fpublic%2Ftest_info%2Fprofile_ts_paint.zip/calltree/?file=profile_ts_paint%2Fstartup%2Fcycle_19.profile&globalTrackOrder=0-1-2-3-4&hiddenGlobalTracks=1-2-4&localTrackOrderByPid=1442-1-0~1444-0~1443-0~&range=0.0000_0.8167&search=mach_msg_trap&thread=0&v=3

(Not passing a handle on the command line, but having a fixed file descriptor for the UA sheet shared memory, which gets dup'd after the content process spawns.)

(In reply to Cameron McCormack (:heycam) from comment #43)

It turns out that this call gets blocked, waiting for the newly created content process to finish its sandbox compilation call, which happens under mozilla::EarlyStartMacSandboxIfEnabled. We have bug 1412855 on file for sandbox compilation being slow; here it's taking around 100 ms.

Bug 1465669 (using POSIX shm on Mac) would probably help here: sending the SCM_RIGHTS probably wouldn't block on anything to do with the receiving process, but even if it did it would block the IPC thread and not the main thread, which ought to at least impact total time less.

(In reply to Cameron McCormack (:heycam) (away Jan 18) from comment #43)

Making the early macOS sandbox start later (after my ShareToProcess call) sounds like it defeats the purpose of being early. So I think either we need to fix the slow sandbox compilation or use some other method for passing the shared memory down to the child, e.g. like what we do for prefs, passing a handle on the command line.

FWIW, I collected some rough data on Mac sandbox string compilation times on different hardware. Copied from bug 1501126 comment 3:

I also found that sandbox_init_with_parameters() is much slower
on older hardware like we have on try. Some rough numbers I collected
for the average execution time:

2008 MacBook with macOS 10.9: 500ms
try hardware (t-yosemite-r7-186) with macOS 10.10: 122ms
2015 MacBook Air with macOS 10.14: 15ms
2012 MacBook Pro with macOS 10.11: 14ms
2018 MacBook Pro with macOS 10.14: 8ms

Thanks, Haik, that's interesting. I wonder how much of the improvement over the try hardware is due to the newer OS rather than the hardware. In any case, I've written a workaround to avoid sending the shared memory over IPC on macOS for now.

Each user agent style sheet has a unique URLExtraData object containing
its URL, but since they are refcounted objects, we can't share them
easily across processes. Rather than adding support for copying them
into a shared memory buffer like we will do with the Rust objects, here
we just set up a static array of URLExtraData objects per UA style
sheet. The array will be filled in in a later patch.

Rust UrlExtraData objects, once they are transformed into their
sharable form and copied into the shared memory buffer, will reference
them by an index.
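A hedged sketch of that scheme: the Rust-side UrlExtraData can either hold the usual refcounted object or refer to a per-UA-sheet static entry by index. The URLs and type names below are illustrative, not the actual representation (the real one is a tagged pointer):

```rust
use std::sync::Arc;

// Illustrative stand-in for the static per-UA-sheet URLExtraData array.
static UA_SHEET_URLS: [&str; 2] = [
    "resource://gre-resources/ua.css",
    "resource://gre-resources/html.css",
];

enum UrlExtraData {
    Dynamic(Arc<String>), // the usual refcounted case
    Static(usize),        // index into the static per-UA-sheet array
}

impl UrlExtraData {
    fn url(&self) -> &str {
        match self {
            UrlExtraData::Dynamic(s) => s,
            UrlExtraData::Static(i) => UA_SHEET_URLS[*i],
        }
    }
}
```

Because the static variant is just an index, it can be copied into the shared memory buffer without any refcount traffic.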

Depends on D17181

UA style sheets only ever specify a single generic font family in font-family
properties, so we pre-create a unique, static SharedFontList for each generic
and change the representation of FontFamilyList to be able to refer to them
by their generic ID. This avoids having to share refcounted SharedFontList
objects across processes.
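The representation change might look something like this sketch (names are illustrative; the real FontFamilyList and SharedFontList types differ):

```rust
use std::sync::Arc;

// The fixed set of generics, each backed by a pre-created static list.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum GenericFontFamilyId {
    Serif,
    SansSerif,
    Monospace,
}

enum FontFamilyList {
    Shared(Arc<Vec<String>>),     // the usual refcounted case
    Generic(GenericFontFamilyId), // refers to a static per-generic list by ID
}

impl FontFamilyList {
    fn first_family(&self) -> String {
        match self {
            FontFamilyList::Shared(list) => list[0].clone(),
            FontFamilyList::Generic(id) => format!("{:?}", id),
        }
    }
}
```

Since UA sheets only ever use a single generic per font-family declaration, the Generic variant covers every case that needs to live in shared memory.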

Depends on D17182

This avoids having to support storing refcounted URLValue objects in shared memory,
which would be tricky.
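The lazy-creation idea can be sketched as follows, with a String standing in for the C++ URLValue object (all names here are illustrative):

```rust
use std::sync::OnceLock;

// Keep only the URL serialization in the shared data; build the (stand-in)
// URLValue object lazily, per process, on first use.
struct SpecifiedUrl {
    serialization: String,
    url_value: OnceLock<String>, // stand-in for the lazily created URLValue
}

impl SpecifiedUrl {
    fn new(serialization: &str) -> Self {
        SpecifiedUrl {
            serialization: serialization.to_string(),
            url_value: OnceLock::new(),
        }
    }

    // Created at most once, and only when something actually needs it.
    fn url_value(&self) -> &str {
        self.url_value
            .get_or_init(|| format!("URLValue({})", self.serialization))
    }
}
```

The shared copy then contains no refcounted pointer at all, only plain string data.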

Depends on D17183

Attachment #9038121 - Attachment description: Bug 1474793 - Part 7: Add SharedMemoryBuilder type and ToShared trait. → Bug 1474793 - Part 7: Add SharedMemoryBuilder type and ToShmem trait.
Attachment #9038123 - Attachment description: Bug 1474793 - Part 9: Add support for deriving ToShared. → Bug 1474793 - Part 9: Add support for deriving ToShmem.
Attachment #9038124 - Attachment description: Bug 1474793 - Part 10.1: Add simple ToShared implementations. → Bug 1474793 - Part 10.1: Add simple ToShmem implementations.
Attachment #9038125 - Attachment description: Bug 1474793 - Part 10.2: Add ToShared impls for collections and strings. → Bug 1474793 - Part 10.2: Add ToShmem impls for collections and strings.
Attachment #9038126 - Attachment description: Bug 1474793 - Part 10.3: Add ToShared impl for Atom. → Bug 1474793 - Part 10.3: Add ToShmem impl for Atom.
Attachment #9038127 - Attachment description: Bug 1474793 - Part 10.4: Add ToShared impl for shared_lock::Locked. → Bug 1474793 - Part 10.4: Add ToShmem impl for shared_lock::Locked.
Attachment #9038128 - Attachment description: Bug 1474793 - Part 10.5: Add ToShared impl for URLValueSource. → Bug 1474793 - Part 10.5: Add ToShmem impl for URLValueSource.
Attachment #9038129 - Attachment description: Bug 1474793 - Part 10.6: Add ToShared impl for UrlExtraData. → Bug 1474793 - Part 10.6: Add ToShmem impl for UrlExtraData.
Attachment #9038130 - Attachment description: Bug 1474793 - Part 10.7: Add ToShared impl for FontFamilyList. → Bug 1474793 - Part 10.7: Add ToShmem impl for FontFamilyList.
Attachment #9038131 - Attachment description: Bug 1474793 - Part 10.8: Add derived ToShared implementations. → Bug 1474793 - Part 10.8: Add derived ToShmem implementations.
Fission Milestone: --- → M2
No longer depends on: 1479960
Attachment #9038135 - Attachment is obsolete: true
Pushed by cmccormack@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/df3abfe65d13
Part 1: Add a UserAgentStyleSheetID enum based on UserAgentStyleSheetList.h. r=emilio
https://hg.mozilla.org/integration/autoland/rev/858a1c942df3
Part 2: Allow references to static C++ URLExtraData objects from Rust UrlExtraData. r=emilio
https://hg.mozilla.org/integration/autoland/rev/6e3e40504fe5
Part 3: Allow references to static, single-generic C++ SharedFontList objects from Rust FontFamilyList. r=emilio
https://hg.mozilla.org/integration/autoland/rev/2022cea7c178
Part 4: Allow C++ URLValue objects to be lazily created from Rust SpecifiedUrls. r=emilio
https://hg.mozilla.org/integration/autoland/rev/1133c358953b
Part 5: Add support for read only SharedRwLocks, which don't need any locking. r=emilio
https://hg.mozilla.org/integration/autoland/rev/f10b3a04f246
Part 6: Add support for static references to servo_arc::Arc. r=emilio
https://hg.mozilla.org/integration/autoland/rev/dbaaaa587dfe
Part 7: Add SharedMemoryBuilder type and ToShmem trait. r=emilio
https://hg.mozilla.org/integration/autoland/rev/e3103dfb3c84
Part 8: Factor out some of style_derive. r=emilio
https://hg.mozilla.org/integration/autoland/rev/112e458d8382
Part 9: Add support for deriving ToShmem. r=emilio
https://hg.mozilla.org/integration/autoland/rev/48168cceb50e
Part 10.1: Add simple ToShmem implementations. r=emilio
https://hg.mozilla.org/integration/autoland/rev/020b16c1291d
Part 10.2: Add ToShmem impls for collections and strings. r=emilio
https://hg.mozilla.org/integration/autoland/rev/11ac463510f6
Part 10.3: Add ToShmem impl for Atom. r=emilio
https://hg.mozilla.org/integration/autoland/rev/1d7e012e4c26
Part 10.4: Add ToShmem impl for shared_lock::Locked. r=emilio
https://hg.mozilla.org/integration/autoland/rev/cc47e694718e
Part 10.5: Add ToShmem impl for URLValueSource. r=emilio
https://hg.mozilla.org/integration/autoland/rev/6fe8bc633215
Part 10.6: Add ToShmem impl for UrlExtraData. r=emilio
https://hg.mozilla.org/integration/autoland/rev/637160560b12
Part 10.7: Add ToShmem impl for FontFamilyList. r=emilio
https://hg.mozilla.org/integration/autoland/rev/7c9ebe38c5a9
Part 10.8: Add derived ToShmem implementations. r=emilio
https://hg.mozilla.org/integration/autoland/rev/e450b952d748
Part 11: Add FFI API to use SharedMemoryBuilder. r=emilio
https://hg.mozilla.org/integration/autoland/rev/7d7b50ad974a
Part 12: Add FFI API to create a StyleSheet backed by shared data. r=emilio
https://hg.mozilla.org/integration/autoland/rev/81735d15243c
Part 13: Build and use shared memory user agent style sheets in parent and content processes. r=emilio,kmag

Impressive patch queue! Is there a high-level summary in the comments somewhere about how this works? If not, could you write one here? I'm also interested to know where we ended up on the adblocker question.

Flags: needinfo?(cam)

The awsy numbers for this look really good when enabling the pref, nice work Cameron!

Could you also file a bug for flipping the pref, link it to this bug, the meta bugs, and the bug(s) that block the flipping?

Blocks: 1533569

I don't think the comments paint a high level view of what's going on, so here's an overview:

  • In the parent process early on, we parse the UA style sheets into Gecko StyleSheet objects. Whereas some of the UA sheets used to be lazily loaded, we now load them all up front.
  • We create a shared memory segment that we know ahead of time will be big enough to store the UA sheets.
  • We create a Rust SharedMemoryBuilder object and pass it a pointer to the start of the shared memory.
  • Then for each UA sheet we call Servo_SharedMemoryBuilder_AddStylesheet, which takes the StylesheetContents::rules (an Arc<Locked<CssRules>>), produces a deep clone of it, and writes that clone into the shared memory.
    • The deep clone isn't a clone() call, but a call to ToShmem::to_shmem. The ToShmem trait must be implemented on every type that is reachable under the Arc<Locked<CssRules>>. The to_shmem call for each type will clone the value, but any heap allocation will be cloned and placed into the shared memory buffer, rather than heap allocated.
    • For most types, the ToShmem implementation is simple, and we just #[derive(ToShmem)] it. For the types that need special handling due to having heap allocations (Vec<T>, Box<T>, Arc<T>, etc.) we have impls that call to_shmem on the heap allocated data, and then create a new container (e.g. using Box::from_raw) that points into the shared memory.
    • Arc<T> and Locked<T> want to perform atomic writes on data that needs to be in the shared memory buffer (the reference count for Arc<T>, and the SharedRwLock's AtomicRefCell for Locked<T>), so we add special modes to those objects that skip the writes. For Arc<T>, that means never dropping the object since we don't track the reference count. That's fine, since we want to just drop the entire shared memory buffer at shutdown. For Locked<T>, we just panic on attempting to take the lock for writing. That's also fine, since we don't want devtools being able to modify UA sheets.
    • For Atoms in Rust, we change the representation of static atoms to be an index into the static atom table. Then if we need to Deref the Atom we look up the table. We panic if any Atom we encounter in the UA style sheets is not a static atom.
  • For each UA sheet, we create a new C++ StyleSheet object using the shared memory clone of the sheet contents, and throw away the original heap allocated one. (I guess we could avoid creating a new C++ StyleSheet object wrapping the shared contents, and update the original StyleSheet object's contents, but I doubt it matters.)
  • When we initially map the shared memory in the parent process, we choose an address which is far away from the current extent of the heap. Although not too far, since we don't want to unnecessarily fragment the virtual address space. Then in the child process, as early as possible, we try to map the shared memory at that same address. This means any internal pointers in the UA sheets back into the shared memory buffer are valid in the child process too. In practice this works nearly all the time. If we fail to map at the address we need, the child process falls back to parsing and allocating its own copy of the UA sheets.
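The ToShmem mechanism described above can be shown as a runnable miniature. This is a sketch, not Servo's actual implementation: the real SharedMemoryBuilder manages an actual shared memory segment and the real trait covers many more types, but the shape of the trait and the Box impl follow the description:

```rust
use std::mem::{align_of, size_of, ManuallyDrop};

// Stand-in builder: an ordinary Vec plays the role of the shared segment.
struct SharedMemoryBuilder {
    buf: Vec<u8>,
}

impl SharedMemoryBuilder {
    fn new(capacity: usize) -> Self {
        SharedMemoryBuilder { buf: Vec::with_capacity(capacity) }
    }

    // Move `value` into the buffer (respecting alignment) and return a
    // pointer to the copy. The buffer must never reallocate, or pointers
    // handed out earlier would be invalidated.
    fn write<T>(&mut self, value: T) -> *mut T {
        let align = align_of::<T>();
        let misalign = (self.buf.as_ptr() as usize + self.buf.len()) % align;
        if misalign != 0 {
            self.buf.resize(self.buf.len() + (align - misalign), 0);
        }
        let start = self.buf.len();
        self.buf.resize(start + size_of::<T>(), 0);
        assert!(self.buf.len() <= self.buf.capacity(), "must not reallocate");
        let ptr = self.buf[start..].as_mut_ptr() as *mut T;
        unsafe { std::ptr::write(ptr, value) };
        ptr
    }
}

// Deep-clone `self`, placing any heap allocations into the builder's buffer.
// ManuallyDrop because the clones are never individually dropped; the whole
// buffer is discarded at shutdown.
trait ToShmem: Sized {
    fn to_shmem(&self, builder: &mut SharedMemoryBuilder) -> ManuallyDrop<Self>;
}

impl ToShmem for u32 {
    fn to_shmem(&self, _: &mut SharedMemoryBuilder) -> ManuallyDrop<Self> {
        ManuallyDrop::new(*self) // no heap data: a plain copy suffices
    }
}

impl<T: ToShmem> ToShmem for Box<T> {
    fn to_shmem(&self, builder: &mut SharedMemoryBuilder) -> ManuallyDrop<Self> {
        // Convert the pointee, copy it into the buffer, then make a Box
        // that points into the buffer rather than the heap.
        let inner = ManuallyDrop::into_inner((**self).to_shmem(builder));
        let ptr = builder.write(inner);
        ManuallyDrop::new(unsafe { Box::from_raw(ptr) })
    }
}
```

In the real code the buffer is the shared memory mapping itself, which is why the fixed-address mapping in the last bullet matters: the Box-style internal pointers written in the parent are only valid in the child if the segment lands at the same address.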

For add-ons, the way style sheets are generated with site-specific rules would defeat the sharing. uBlock Origin for example has small-ish style sheets with just the right rules needed for each site, and so wouldn't benefit much from sharing. ABP has massive style sheets, but they are almost all unique style sheet text, and so wouldn't be shareable. We could ask ABP to change what they're doing.

But there are at least two challenges to extending this UA sheet sharing approach to work for addon loaded sheets. One is that they are dynamically added at a point in the process' life much later than we do the UA sheets. That means if we wanted a separate shared memory segment per addon loaded style sheet, there is a much lower chance of being able to map it to the same address in the child processes. Maybe we could reserve a certain amount of virtual address space ahead of time, but we would have to choose a size. The other challenge is that the addon loaded sheets have all kinds of dynamic atoms in them. We would need to have some kind of coordinated atom allocation strategy between processes if we wanted to make sure each of these atoms could be associated with a given index that could then be stored in the shared memory.

That isn't something I'm working on now.

Flags: needinfo?(cam)

Bug 1533569 is the one to turn the pref on.

(In reply to Cameron McCormack (:heycam) (away Apr 19-28) from comment #77)

I don't think the comments paint a high level view of what's going on, so here's an overview:

This is an excellent writeup, thank you! Would you mind dropping it in a comment somewhere so that it's in-tree? r=me on this text wherever you feel is appropriate.

Flags: needinfo?(cam)
Pushed by cmccormack@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3bd02d619749
Document the shared memory UA sheet setup in tree. r=bholley
Flags: needinfo?(cam)