Closed Bug 1367854 Opened 2 years ago Closed 9 months ago

[meta] stylo: Memory usage seems to be higher than Gecko

Categories

(Core :: CSS Parsing and Computation, enhancement, P4)

53 Branch
enhancement

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: bzbarsky, Unassigned)

References

(Depends on 5 open bugs, Blocks 2 open bugs)

Details

(Keywords: meta)

Attachments

(9 files)

Measuring on https://en.wikipedia.org/wiki/Barack_Obama I see stylo using about 20MB (aka 30% of total process memory) more memory than Gecko.

I'll file some bugs blocking this one with details from the investigation.
Keywords: meta
Priority: -- → P5
No longer blocks: stylo
Bumping this up to p1 and repurposing it a bit to track our deficit on AWSY [1].

I'd been hoping bug 1368290 would make a dent in this. But while it seems to have helped a little bit [2] it's not as much as I was hoping. Doing a bit of final validation that the sharing is working in bug 1367862 and bug 1369902, just to be sure.

In general, the memory impact of stylo should break down into two things:
(1) ComputedValues and the things they entrain (mostly rule nodes and style structs). These are held alive by elements and frames, and should generally scale with the size of the DOM.
(2) PerDocumentStyleData, which holds onto the stylist, which holds onto the parsed stylesheets and the various hashtables for looking up rules. This is all per-document and should generally scale with the size of the stylesheets.

For (1): IIUC, we should be measuring ComputedValues after bug 1383977, or at least the ComputedValues held alive by elements. The ComputedValues held alive by text node frames and anonymous box frames (which is what bug 1368290 was supposed to fix) would not be measured by that patch. Nick, are the sizes for these currently bundled into element-nodes? We probably need to share the same "seen" hashtable with a traversal of the frame tree to pick up non-element frames, at which point it may make more sense to count the ComputedValues overhead in a separate traversal of both the DOM and the frame tree.

For (2): AFAICT, we're currently measuring the servo side of stylesheets, but _not_ the rest of the stylist (because we reach the stylesheets via pointers from the gecko-side stylesheets). This misses the hashtables and all the other stuff that hangs off the stylist.

Given the above, my best guess as to the memory issue is that our stylist data structures are way too big. I make this guess because:
* We seem to have a reasonable number of style structs, and so (1) seems less likely to be the culprit.
* We're not measuring the hashtable sizes right now with the memory reporters, and there is a lot of heap-unclassified on the memory reports.
* Bug 1384945 suggests that heap-unclassified got significantly larger when legacy ABP was enabled, and legacy ABP adds lots and lots of style rules, which will end up in our hashtables.
* We know we spend a ton of time rebuilding and clearing the hashtables in the stylist (bug 1386045), so it would make sense if we were also putting way too much data in there.

So we should confirm or disprove the above theory, and then fix whatever the problem is. Nick, can you dig in with DMD and see what's going on?

[1] https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=%5Bmozilla-inbound,a630e0d4f7000610ca57d0f8da52b55d117632a9,1,4%5D&series=%5Bmozilla-inbound,eb19722dabafa2ad2c817df4f2b650a13b3ce984,1,4%5D&series=%5Bmozilla-inbound,5846d174e9e152bc853b8761026dd7b2d762b572,1,4%5D&series=%5Bautoland,5846d174e9e152bc853b8761026dd7b2d762b572,1,4%5D&series=%5Bautoland,eb19722dabafa2ad2c817df4f2b650a13b3ce984,1,4%5D

[2] The graphs are a bit tricky to read because bug 1381821 regressed style sharing and then bug 1387116 fixed it. So we need to measure with both bug 1387116 and bug 1368290 applied, but I think the recent graph results on autoland should have both.
Flags: needinfo?(n.nethercote)
Priority: P5 → P1
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #1)
> 
> For (1): IIUC, we should be measuring ComputedValues after bug 1383977, or
> at least the ComputedValues held alive by elements. The ComputedValues held
> alive by text node frames and anonymous box frames (which is what bug
> 1368290 was supposed to fix) would not be measured by that patch. Nick, are
> the sizes for these currently bundled into element-nodes?

Sorry, I'm not sure what "these" refers to in that last sentence -- "ComputedValues held alive by elements", or "ComputedValues held alive by text node frames and anonymous box frames"? Can you clarify?


> We probably need
> to share the same "seen" hashtable with a traversal of the frame tree to
> pick up non-element frames, at which point it may make more sense to count
> the ComputedValues overhead in a separate traversal of both the DOM and the
> frame tree.

My grasp of all these data structures -- both Stylo ones, as well as Gecko DOM and layout stuff -- is really weak. But I do know that the SeenPtrs table is created right at the very start of the memory reporting for each window, here:
http://searchfox.org/mozilla-central/rev/b52285fffc13f36eca6b47de735d4e4403b3859e/dom/base/nsWindowMemoryReporter.cpp#312-316
AIUI all the DOM and layout and JS reporting for a single window happens underneath that. So as long as there is no cross-window sharing of data structures happening, I don't think separate traversals will be necessary.

And if there is cross-window sharing, we can just move the creation of SeenPtrs up another level, to outside this loop:
http://searchfox.org/mozilla-central/rev/b52285fffc13f36eca6b47de735d4e4403b3859e/dom/base/nsWindowMemoryReporter.cpp#531-536
Flags: needinfo?(n.nethercote) → needinfo?(bobbyholley)
> For (2): AFAICT, we're currently measuring the servo side of stylesheets,
> but _not_ the rest of the stylist (because we reach the stylesheets via
> pointers from the gecko-side stylesheets). This misses the hashtables and
> all the other stuff that hangs off the stylist.
> 
> Given the above, my best guess as to the memory issue is that our stylist
> data structures are way too big. I make this guess because:
> * We seem to have a reasonable number of style structs, and so (1) seems
> less likely to be the culprit.
> * We're not measuring the hashtable sizes right now with the memory
> reporters, and there is a lot of heap-unclassified on the memory reports.
> * Bug 1384945 suggests that heap-unclassified got significantly larger when
> legacy ABP was enabled, and legacy ABP adds lots and lots of style rules,
> which will end up in our hashtables.
> * We know we spend a ton of time rebuilding and clearing the hashtables in
> the stylist (bug 1386045), so it would make sense if we were also putting
> way too much data in there.
> 
> So we should confirm or disprove the above theory, and then fix whatever the
> problem is. Nick, can you dig in with DMD and see what's going on?

Ok.

For everyone reading: DMD is best known as the tool you use to understand "heap-unclassified", but it can also be used as a generic memory profiler in "live" mode. (It also has a "cumulative" mode, which tells you about heap churn.) DMD breaks the info down in a different way to about:memory, and typically gives a lot more details, and so is likely to be a better tool for answering questions like the ones in comment 1.

I've done a DMD run. I will attach the output shortly. Here's the short version of how I did it:

- Run `mach run --dmd --stacks=full --live`
- Load the Obama Wikipedia page
- Open about:memory, click on the "Save" button in the "Save DMD Output" section.
- Move the DMD output files from /tmp to my current directory.
- Post-process each one with `$OBJDIR/dist/bin/dmd.py $FILE -f 10`. (On Linux
  this is slow, taking 10+ minutes per file, alas; blame addr2line. On Mac it's 
  faster.)

Full instructions for DMD are here: 
https://developer.mozilla.org/en-US/docs/Mozilla/Performance/DMD

I would recommend other Stylo people try this out and get familiar with DMD, because "Stylo authors get familiar with DMD" is likely to scale better and produce good outcomes faster than "DMD author gets familiar with Stylo".
Flags: needinfo?(bobbyholley)
Attached file DMD output with -f 5
I've attached DMD "live" output showing 5 frames per record, the top few Stylo-related records.

The types that show up, in rough order of importance:

- ComputedValues
- Stylist hash tables
- nsIPresShell arena allocations (I'm not sure if that is Stylo data per se)
- GeckoDisplay
- GeckoPosition
- PropertyDeclarationBlock
- HeaderWithLength
- ElementData
Attached file DMD output with -f 10
This is the same info as the previous attachment, but with 10 frames per record. This splits some records up, but gives more context for each record.
(In reply to Nicholas Nethercote [:njn] from comment #2)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #1)
> > 
> > For (1): IIUC, we should be measuring ComputedValues after bug 1383977, or
> > at least the ComputedValues held alive by elements. The ComputedValues held
> > alive by text node frames and anonymous box frames (which is what bug
> > 1368290 was supposed to fix) would not be measured by that patch. Nick, are
> > the sizes for these currently bundled into element-nodes?
> 
> Sorry, I'm not sure what "these" refers to in that last sentence --
> "ComputedValues held alive by elements", or "ComputedValues held alive by
> text node frames and anonymous box frames"? Can you clarify?

The former. I believe we're measuring element->CV but not frame->CV, so I want to confirm where element->CV is being billed (since frame->CV is presumably not being billed at all).

> 
> 
> > We probably need
> > to share the same "seen" hashtable with a traversal of the frame tree to
> > pick up non-element frames, at which point it may make more sense to count
> > the ComputedValues overhead in a separate traversal of both the DOM and the
> > frame tree.
> 
> My grasp of all these data structures -- both Stylo ones, as well as Gecko
> DOM and layout stuff -- is really weak. But I do know that the SeenPtrs
> table is created right at the very start of the memory reporting for each
> window, here:
> http://searchfox.org/mozilla-central/rev/
> b52285fffc13f36eca6b47de735d4e4403b3859e/dom/base/nsWindowMemoryReporter.
> cpp#312-316
> AIUI all the DOM and layout and JS reporting for a single window happens
> underneath that. So as long as there is no cross-window sharing of data
> structures happening,

There is no meaningful cross-window sharing.

> I don't think separate traversals will be necessary.


Well, I guess my concern is that we're not usefully grouping all the ComputedValues together. IIUC, the current setup is that element ComputedValues are billed to the element, and frame ComputedValues are ignored, so ComputedValues for element-less frames are not measured.

This means that we also need to measuring ComputedValues during a traversal of the frame tree. I'm not sure whether we currently traverse the frame tree at all during memory reporting. But even if we do, billing ComputedValues to elements and frames doesn't really make it clear what the overall cost of the ComputedValues is.
Thanks for the DMD reports Nick, they're very helpful.

I filed bug 1387956 and bug 1387958 on improving the memory reporters for ComputedValues and Stylist, respectively.

Xidorn is going to do the stylo-side investigation on why we're 10% worse on AWSY.
Assignee: nobody → xidorn+moz
And in terms of dividing up the work, I agree that it probably makes sense for xidorn to continue digging into the DMD reports (with some help from njn as needed), and njn to work on improving the memory reporters in bug 1387956 and bug 1387958 (with some help from xidorn/heycam/me/other as needed).
And to be clear, we need both to understand the problem and fix it (or at least make good progress on fixing it) by August 14th (one week from today), or we risk trouble in the next go/no-go discussion.
I should mention that both of those DMD outputs were for the Obama Wikipedia page.
Here's DMD output for the single-page HTML5 spec. It was done in "dark matter" mode; I've intermingled the "unreported" and "once-reported" records so they are sorted by size.

The various types that show up, in order of importance:

- ArenaAllocator (not a Stylo type?)
- ComputedValues
- GeckoDisplay
- ElementData
- GeckoTextReset
- GeckoFont
- GeckoBackground
- GeckoBorder
- GeckoPosition
- GeckoText

Compared to the Obama Wikipedia page, there is obviously more of the Gecko* types.
> (1): IIUC, we should be measuring ComputedValues after bug 1383977

We're measuring the ComputedValues but not the structs, right?  Gecko has various struct-sharing optimizations that Servo doesn't seem to, but I don't have a good idea of how much those might be affecting things.

I'm also not sure how the measurement in question shows up in about:memory; is it just part of the "element-nodes" number?

In any case, on autoland from last night, with the various sharing bits landed, on https://en.wikipedia.org/wiki/Barack_Obama I see numbers like so (after minimizing memory to make sure gc happens):

  stylo-parallel:
    8355 total ComputedValues
    89.35 MB total about:memory memory
    28.85 MB heap-unclassified
     8.09 MB element-nodes
     2.53 MB text-nodes
     0.35 MB frame-properties

  stylo-sequential:
    3971 total ComputedValues
    79.10 MB total about:memory memory
    25.46 MB heap-unclassified
     7.40 MB element-nodes
     2.53 MB text-nodes
     0.35 MB frame-properties

  Gecko:
    2621 total ComputedValues
    76.50 MB total about:memory memory
    20.32 MB heap-unclassified
     6.87 MB element-nodes
     2.53 MB text-nodes
     0.59 MB style-structs
     0.40 MB style-contexts
     0.26 MB rule-nodes
     0.31 MB style-sets
     0.35 MB frame-properties

I was using text-nodes and frame-properties as controls, because I expect them to not change between stylo and non-stylo builds.

Looks like Gecko's style system on this page uses a total of 1.56MB (style-structs, style-contexts, rule-nodes, style-sets)

stylo-sequential has an extra 5.24MB unclassified and 0.53MB "element-nodes" (so presumably some ComputedValues are being counted here), for what should be an increase of 5.42 + 0.53 - 1.56 = 4.39 MB.  The actual increase is 2.60MB; I'm not sure how to reconcile that, or the fact that diff between the reports claims an increase of 3.04MB...  Actually diffing the reports shows that gecko's "pres-shell" measurement is about 0.48MB more; not sure what that ends up measuring.  Also, gecko has a "style-sheets" measurement that is 0.8MB more than stylo-sequential's; presumably some of these bits live in heap-unclassified with servo sheets.

Looking at stylo-parallel compared to Gecko, there's 8.86MB extra unclassified (according to the about:memory diff) and 5.49MB more heap-overhead, mostly from bin-unused.  There's 1.56MB less memory used under window-objects (that's counting all the Gecko style system bits listed above, style-sheets, element nodes, etc).

Anyway, 4000 extra ComputedValue should be about 1MB or so last I checked.  So we're not quite measuring all ComputedValues going from sequential to parallel, and there's a bunch of additional stuff we're not measuring.  

Crucially, going from sequential to parallel should not change things like stylist and servo style sheet memory consumption, so that increase (about 5MB if we ignore the heap-overhead number, and I'm not sure whether we should ignore it) is all due to more ComputedValues and style structs, right?

I'll attach my memory reports in case people want to look.
> presumably some of these bits live in heap-unclassified with servo sheets.

Or servo sheets are actually a lot smaller than Gecko ones.  Or both.  ;)

Things I forgot to mention:

> * We seem to have a reasonable number of style structs, and so (1) seems less likely to be the culprit.

Do we?  Do we have actual numbers on style structs, as opposed to ComputedValues?

I agree that we should try to add memory reporting for the stylist hashtables and for style structs, so we can have a better idea of where the memory is going...
OK, so now looking at njn's dmd reports, three notes.

Attachment 8894399 [details] shows 1.761MB if memory used by hashtable bits (1.1MB was asked for, and another 660KB is slop).  This should be compared to "0.31MB" used for "style-sets", which I believe includes the relevant bits for Gecko.  But note that Gecko doesn't include the shared UA sheet bits in this, because they're shared.

The same attachment shows "1,357,776 bytes (1,332,632 requested / 25,144 slop)" for GeckoDisplay allocations.  That's 3,143 GeckoDisplay allocations at 432 bytes each.  For comparison, if you recall, Gecko used a total of 0.59MB for _all_ style structs taken together in my measurements.  So assuming njn's tests were done in a build with all the sharing bits, we're doing a lot worse at sharing GeckoDisplay with stylo than with Gecko.  Note that this can be shared both via sharing stylecontext/ComputedValues and via sharing across stylecontext/ComputedValues instances, and Gecko and stylo have pretty different mechanisms for the latter.

That same attachment also shows "748,544 bytes (385,968 requested / 362,576 slop)" for GeckoPosition, which has a size of "1,024".  But this is size _including_ slop.  The actual size is 528 bytes, which is just a bit bigger than the 512 byte bucket, so ends up in the 1024-byte bucket.  Gecko doesn't have this problem, because in the arena we just allocate these things next to each other....
And another data point: attachment 8894449 [details] shows "5,800,032 bytes (5,692,624 requested / 107,408 slop)" for GeckoDisplay.  Looking at about:memory with Gecko for the html5 single-page page, I see:

  2.85 MB (01.09%) ── style-structs

Other relevant stylo numbers from that attachment:

  1,828,992 bytes (1,676,576 requested / 152,416 slop) GeckoTextReset
  1,782,144 bytes (1,782,144 requested / 0 slop)       GeckoFont
  1,206,656 bytes (1,151,808 requested / 54,848 slop)  GeckoBackground (from mutate())
  1,173,392 bytes (1,120,056 requested / 53,336 slop)  GeckoBackground (from take())
  1,144,640 bytes (1,144,640 requested / 0 slop)       GeckoBorder
    919,552 bytes (474,144 requested / 445,408 slop)   GeckoPosition
    586,784 bytes (560,112 requested / 26,672 slop)    GeckoText

GeckoBackground is the only one where the allocation comes from take(); the rest are from mutate().  That might be worth looking into?
Nice find Boris!

So to crystallize: The issue bz's analysis points to is an important distinction I hadn't paid attention to before: the difference between sharing of StyleContexts (aka ComputedValues) and sharing of style structs.

We've mostly been focused on the sharing of StyleContexts. Gecko does this by putting tree state in GeckoStyleContext, which gives it precise sharing, but that strategy is incompatible with parallelism. So for ServoStyleContext, we get sharing with the opportunistic rust style sharing cache. This strategy is strictly less precise, and thus gives us worse sharing, especially when parallelism is involved. In bug 1367862 and bug 1369902, we've compared ComputedValues sharing between Gecko and Stylo. Stylo is somewhat worse, but the numbers don't seem significant enough to explain our AWSY discrepancy.

Using the same ComputedValues implies sharing style structs. However, we can also get style struct sharing across different ComputedValues. The normal way this happens is with the copy-on-write inheritance mechanism. When we create a new ComputedValues, we start by just referencing all inherited structs from the parent, and all the non-inherited structs from the "default" ComputedValues. Modifications cause us to copy. Gecko handles this in a relatively-isomorphic way.

However, Gecko has an additional optimization, in that it caches non-inherited style structs in the rule tree (as well as inherited structs where every single property is specified, which is generally true for Color, and sometimes for Font). This is an optimization we've been considering for stylo (bug 1367635), but have not implemented absent evidence that it's important. This might be that evidence.
To measure this, we should start with the ComputedValues logging patch in [1], and expand it to include logging of all the different style structs. We can the compare counts between Gecko and Stylo. If we see a large difference in non-inherited structs (and potentially Color/Font) relative to inherited ones, that is a likely indicator that this is hurting us.

[1] https://github.com/bholley/gecko/commits/style_context_count_logging
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #19)
> However, Gecko has an additional optimization, in that it caches
> non-inherited style structs in the rule tree (as well as inherited structs
> where every single property is specified, which is generally true for Color,
> and sometimes for Font). This is an optimization we've been considering for
> stylo (bug 1367635), but have not implemented absent evidence that it's
> important. This might be that evidence.

I guess there's a good way to provide a test-case here where it helps:

Here's a rule that would destroy nsStylePosition's struct sharing for stylo:

 * { box-sizing: border-box; }

This is a pretty common rule, and we won't ever share any nsStylePosition ever if we don't share the whole thing.

Assuming this is the root cause for the memory usage difference between Gecko and Stylo (which sounds plausible, though I guess we'd better have numbers to confirm that):

I've argued against caching stuff in the rule tree before... Mainly, doing it in parallel in the context of the already-complicated-enough rule tree sounds hard to get right and somewhat inefficient, in the sense that you may end up with wasted work and not great memory behaviour. Also, this forces us to rebuild the rule tree in a bunch of circumstances, which is not great, and doing that in parallel is also fun, specially if you want to preserve the avoid-selector-matching optimisations, which is important for when the viewport size changes, for example.

I guess it'd still be doable, though it sounds somewhat nasty... If I understand Gecko's algorithm for computing reset structs for a given rule node, it'd be roughly something like:

 * Compute parent rule node reset struct <x>
 * Apply declarations for this struct on <x>
 * Cache <x>.

It seems like there could be a fair amount of people fighting for a given rule node from different threads, specially for simple selectors like |*| or |div|. It's unclear how much of a problem would that be, assuming it hits the cache most of the time...

If we were to implement something like that, I think a thread-local rule node -> reset struct set cache is simpler both to implement before 57 and easier to reason about than caching stuff in the rule tree in parallel. Otherwise we would need to split the cascade basically in two, one function per reset struct, and one per inherited. Also, caching stuff in the rule node itself implies caching depending on parent font-size, etc, to handle properly relative units, etc... which sounds _way_ easier to implement correctly in a thread-local cache than in the rule tree.

The thread-local cache stuff would not be as optimal as caching in the rule tree, though, I suppose. In the sense that if you have something like:

 div { position: relative; }
 .foo { font-size: 5px; } // NB: Only inherited declarations.

You won't share the nsStyleDisplay structs between <div> and <div class="foo"> (though you'll share it on every <div>, and on every <div class="foo">, regardless on where do they inherit from, as long as they match the same rules).

In general, while we're measuring for what to do about this bug, it'd be nice to find numbers on how much time Gecko shares non-default structs with the same rule nodes vs. with different ones.

Still, I'm not sure how doable is it to fix this in the next few weeks... it's a pretty invasive change, wrt the rest of the cascade code... If it's not, I'd rather find the appropriate design first.

I can also look at what Blink / WebKit do, I think they have a MatchedPropertiesCache that they use for this, but I don't remember the exact details.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #21)
> I can also look at what Blink / WebKit do, I think they have a
> MatchedPropertiesCache that they use for this, but I don't remember the
> exact details.

I just looked real quick, and as far as I can tell they only share reset data either fully or nothing. If they share reset data but not inherited data, they do the equivalent of cascade(), but without applying non-inherited properties... I'm still not sure where/when do they handle different font-sizes and similar stuff. I _think_ that's the needs_apply_pass stuff (which would re-cascade the properties with more than a given priority... but I'd need to check more carefully).
(In reply to Emilio Cobos Álvarez [:emilio] from comment #22)
> (In reply to Emilio Cobos Álvarez [:emilio] from comment #21)
> > I can also look at what Blink / WebKit do, I think they have a
> > MatchedPropertiesCache that they use for this, but I don't remember the
> > exact details.
> 
> I just looked real quick, and as far as I can tell they only share reset
> data either fully or nothing. If they share reset data but not inherited
> data, they do the equivalent of cascade(), but without applying
> non-inherited properties... I'm still not sure where/when do they handle
> different font-sizes and similar stuff. I _think_ that's the
> needs_apply_pass stuff (which would re-cascade the properties with more than
> a given priority... but I'd need to check more carefully).

Forgot link to the relevant bit: https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/css/resolver/StyleResolver.cpp?l=1723&rcl=493cb615c06c29723ba41d63ea1486cb4ac1b00e
Also, another potential source of dumb memory usage is something like what follows:

 * { font-size: 10px; }

That will create a unique font struct per element, when we could just check whether the inherited value differs from the specified value to shave a bunch of structs.

I suspect this may be important for stuff like the |color| property and such.

Again, worth measuring. If we can close the gap with Gecko like this, I'd prefer to do this than introducing reset struct caching, fwiw.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #24)
> Again, worth measuring. If we can close the gap with Gecko like this, I'd
> prefer to do this than introducing reset struct caching, fwiw.

(If only because this is a non-invasive and completely transparent optimization)
So I did a measurement on the count of style structs, and it seems to confirm bz's theory that we are doing a lot worse on style struct sharing, which may lead to the large memory gap.

The comparison can be seen below:
+-----------------------+-------+-------+----------+
| Struct                | Gecko | Stylo |      S/G |
+-----------------------+-------+-------+----------+
| nsStyleContext        |  6290 |  6008 |    0.96x |
+-----------------------+-------+-------+----------+
| nsStyleBackground     |   613 |  1203 |    1.96x |
| nsStyleBorder         |   131 |   289 |    2.21x |
| nsStyleColor          |   306 |  1145 |    3.74x |
| nsStyleColumn         |    54 |    47 |    0.87x |
| nsStyleContent        |    19 |   193 |   10.16x |
| nsStyleDisplay        |   477 |  2299 |    4.82x |
| nsStyleEffects        |    32 |    75 |    2.34x |
| nsStyleFont           |   433 |  1872 |    4.32x |
| nsStyleList           |    23 |    81 |    3.52x |
| nsStyleMargin         |   133 |   540 |    4.06x |
| nsStyleOutline        |    10 |    12 |    1.20x |
| nsStylePadding        |   133 |   619 |    4.65x |
| nsStylePosition       |   259 |   346 |    1.34x |
| nsStyleSVG            |    23 |    25 |    1.09x |
| nsStyleSVGReset       |    10 |    12 |    1.20x |
| nsStyleTable          |     1 |     3 |    3.00x |
| nsStyleTableBorder    |    14 |    31 |    2.21x |
| nsStyleText           |   267 |   697 |    2.61x |
| nsStyleTextReset      |   584 |  1867 |    3.20x |
| nsStyleUIReset        |     1 |  1068 | 1068.00x |
| nsStyleUserInterface  |   814 |   588 |    0.72x |
| nsStyleVisibility     |    32 |    49 |    1.53x |
| nsStyleXUL            |     5 |    47 |    9.40x |
+-----------------------+-------+-------+----------+

I have no idea how we end up allocating fewer objects for some of the types (including style context), it seems style struct sharing is indeed a problem for us. Most reset structs stylo allocates 2x ~ 4x as many as gecko allocates. For some extreme cases (nsStyleContent and nsStyleUIReset), it could be 10x to 1000x. Also, inherit structs like nsStyleFont, nsStyleColor, and nsStyleList also shows up, which is understandable for the same reason.
My approach:

Preparation:
1. apply the measurement patch and build
2. export environment variable
> XPCOM_MEM_LOG_CLASSES=nsStyleFont,nsStyleColor,nsStyleList,nsStyleText,nsStyleVisibility,nsStyleUserInterface,nsStyleTableBorder,nsStyleSVG,nsStyleVariables,nsStyleBackground,nsStylePosition,nsStyleTextReset,nsStyleDisplay,nsStyleContent,nsStyleUIReset,nsStyleTable,nsStyleMargin,nsStylePadding,nsStyleBorder,nsStyleOutline,nsStyleXUL,nsStyleSVGReset,nsStyleColumn,nsStyleEffects,nsStyleContext
3. put "user_pref("layout.css.servo.enabled", false);" into objdir/tmp/scratch_user/prefs.js

For each measurement:
1. XPCOM_MEM_ALLOC_LOG=alloc.log mach run https://en.wikipedia.org/wiki/Barack_Obama
2. wait until the page finishes loading
3. open a new tab and enter about:memory
4. click the "Minimize memory usage" and wait until it finishes (for flushing the log)
5. click "Measure" and find the pid of content process which runs the page
6. find the log file for corresponding pid, copy it as {gecko,stylo}-{1,2}.log
7. exit firefox
8. copy the same log file again {gecko,stylo}-{1,2}-end.log

Repeat the steps above twice for each of gecko and stylo. Stylo with environments:
> STYLO_THREADS=1 STYLO_FORCE_ENABLED=1

Collect data:
1. use my counting script (will upload later) to check that all count downs to zero in *-end.log
2. list data in {gecko,stylo}-{1,2}.log

Data between gecko-1.log and gecko-2.log, as well as between stylo-1.log and stylo-2.log, is pretty stable. They have exactly identical numbers for live objects (though the allocation / destruction number are different). It is just an extra step to ensure that the data actually makes sense.
Attached file counting script
This is the script I mentioned in comment 27 to post-process the output from allocation log.
This is great analysis. Thanks Xidorn!

It sounds like we should probably pursue bug 1367635, unless anyone disagrees. Assigning this to heycam, since he's going to look at that bug.
Assignee: xidorn+moz → cam
Given the amount of work involved in bug 1367635 I think it would really be worth quantifying the actual memory impact of style structs if we can do it cheaply.  In particular if we could get an idea of how much of the awsy  memory gap is due to style structs, that would be pretty useful data.

I'm not sure how easy it is to run awsy locally to test this stuff, nor how easy it is to hack together something that we could run on try (can we run awsy on try?) and get some ballpark awsy numbers with the problem "fixed" (even in a way that is no good to land because of performance concerns).
Yeah that's a good point. Eric, how can we go about doing xidorn's measurements on AWSY, and using them to determine how much of the deficit we'll make up?
Flags: needinfo?(erahm)
My measurement is reusing the MOZ_COUNT_{CTOR,DTOR}, but that doesn't in general work with Stylo. I need to comment out the assertion about serial number.

But that's code change we can do in try push anyway. I think we just need to be able to set proper environment variables, flush and copy log at the right time, and upload it to treeherder, then we can do the same measurement with try push.
I filed bug 1388241 on the allocator slop issue from comment 17.

I guess the nsStyleBackground take() vs mutate() thing is because background props are vector properties and http://searchfox.org/mozilla-central/rev/bd39b6170f04afeefc751a23bb04e18bbd10352b/servo/components/style/properties/helpers.mako.rs#351-360 ends up moving allocations back and forth or something?  I'm not sure why that would cause us to have a long-lived thing created with take(), though....
So I did another analysis considering what the slop means for us even when we allocate the same number of style structs as gecko.

See the following table, the first "size" column means the sizeof value of the struct, the "number" column is the live number of the struct allocated by Gecko for Obama page, from comment 26.

In Gecko section, the "size" is the size of struct aligned to 8 bytes (the alignment of pres arena), and the "total" is size * number.

In Stylo section, the "size" is the size of ArcInner (aligned to 8 bytes then plus 8 bytes), "bucket" column is the jemalloc bucket it would fit in, assuming jemalloc has bucket for each power of two. "total" is bucket * number.

+-----------------------+---------------+---------------+------------------------+
|                       |               |    Gecko      |        Stylo           |
|                       | size | number | size | total  | size | bucket |  total |
+-----------------------+------+--------+------+--------+------+--------+--------|
| nsStyleColor          |    4 |    306 |    8 |   2448 |   16 |     16 |   4896 |
| nsStyleVisibility     |    7 |     32 |    8 |    256 |   16 |     16 |    512 |
| nsStyleTable          |    8 |      1 |    8 |      8 |   16 |     16 |     16 |
| nsStyleTableBorder    |   12 |     14 |   16 |    224 |   24 |     32 |    448 |
| nsStyleXUL            |   16 |      5 |   16 |     80 |   24 |     32 |    160 |
| nsStyleUserInterface  |   24 |    814 |   24 |  19536 |   32 |     32 |  26048 |
| nsStyleContent        |   24 |     19 |   24 |    456 |   32 |     32 |    608 |
| nsStyleMargin         |   40 |    133 |   40 |   5320 |   48 |     64 |   8512 |
| nsStylePadding        |   40 |    133 |   40 |   5320 |   48 |     64 |   8512 |
| nsStyleEffects        |   40 |     32 |   40 |   1280 |   48 |     64 |   2048 |
| nsStyleList           |   48 |     23 |   48 |   1104 |   56 |     64 |   1472 |
| nsStyleUIReset        |   56 |      1 |   56 |     56 |   64 |     64 |     64 |
| nsStyleColumn         |   64 |     54 |   64 |   3456 |   72 |    128 |   6912 |
| nsStyleTextReset      |   80 |    584 |   80 |  46720 |   88 |    128 |  74752 |
| nsStyleOutline        |  104 |     10 |  104 |   1040 |  112 |    128 |   1280 |
| nsStyleFont           |  120 |    433 |  120 |  51960 |  128 |    128 |  55424 |
| nsStyleSVG            |  128 |     23 |  128 |   2944 |  136 |    256 |   5888 |
| nsStyleText           |  160 |    267 |  160 |  42720 |  168 |    256 |  68352 |
| nsStyleBackground     |  160 |    613 |  160 |  98080 |  168 |    256 | 156928 |
| nsStyleSVGReset       |  192 |     10 |  192 |   1920 |  200 |    256 |   2560 |
| nsStyleBorder         |  312 |    131 |  312 |  40872 |  320 |    512 |  67072 |
| nsStyleDisplay        |  416 |    477 |  416 | 198432 |  424 |    512 | 244224 |
| nsStylePosition       |  440 |    259 |  440 | 113960 |  448 |    512 | 132608 |
+-----------------------+------+--------+------+--------+------+--------+--------+
|                 (sum) |                      | 638192 |               | 869296 |
+-----------------------+------+--------+------+--------+------+--------+--------+

So Stylo would still take 231KB more memory (or ~36%) than Gecko even if the number of style struct allocated is on par. Even if we consider the worst case that Gecko's arena can waste (which is considered < 12% based on bug 1388241 comment 0), we would still consume 155KB (or ~22%) more.

I'm not sure this amount can justify the complexity of using arena for style struct allocation in Stylo. Probably not.
Oh, and the size of nsStylePosition above is based on the estimation after we apply the patch in bug 1388255. Without that patch, Stylo would take 343KB (or ~52%) more memory.
If we consider the allocated number of Stylo in comment 26, the total size difference of style structs ("number for stylo" * "bucket in stylo" - "number for gecko" * "size in gecko") would be ~2.2MB, which claims the majority portion of increment mentioned in comment 12.
> assuming jemalloc has bucket for each power of two.

The actual jemalloc buckets are described at http://searchfox.org/mozilla-central/rev/c329d562fb6c6218bdb79290faaf015467ef89e2/memory/mozjemalloc/mozjemalloc.cpp#53-83 and come in 16-byte increments between 16 and 512 bytes; that's why it's useful to stay under 512.

What that means for the numbers above is that the table looks more like this:


+-----------------------+---------------+---------------+------------------------+
|                       |               |    Gecko      |        Stylo           |
|                       | size | number | size | total  | size | bucket |  total |
+-----------------------+------+--------+------+--------+------+--------+--------|
| nsStyleColor          |    4 |    306 |    8 |   2448 |   16 |     16 |   4896 |
| nsStyleVisibility     |    7 |     32 |    8 |    256 |   16 |     16 |    512 |
| nsStyleTable          |    8 |      1 |    8 |      8 |   16 |     16 |     16 |
| nsStyleTableBorder    |   12 |     14 |   16 |    224 |   24 |     32 |    448 |
| nsStyleXUL            |   16 |      5 |   16 |     80 |   24 |     32 |    160 |
| nsStyleUserInterface  |   24 |    814 |   24 |  19536 |   32 |     32 |  26048 |
| nsStyleContent        |   24 |     19 |   24 |    456 |   32 |     32 |    608 |
| nsStyleMargin         |   40 |    133 |   40 |   5320 |   48 |     48 |   6384 |
| nsStylePadding        |   40 |    133 |   40 |   5320 |   48 |     48 |   6384 |
| nsStyleEffects        |   40 |     32 |   40 |   1280 |   48 |     48 |   1536 |
| nsStyleList           |   48 |     23 |   48 |   1104 |   56 |     64 |   1472 |
| nsStyleUIReset        |   56 |      1 |   56 |     56 |   64 |     64 |     64 |
| nsStyleColumn         |   64 |     54 |   64 |   3456 |   72 |     80 |   4320 |
| nsStyleTextReset      |   80 |    584 |   80 |  46720 |   88 |     96 |  56064 |
| nsStyleOutline        |  104 |     10 |  104 |   1040 |  112 |    112 |   1120 |
| nsStyleFont           |  120 |    433 |  120 |  51960 |  128 |    128 |  55424 |
| nsStyleSVG            |  128 |     23 |  128 |   2944 |  136 |    144 |   3312 |
| nsStyleText           |  160 |    267 |  160 |  42720 |  168 |    176 |  46992 |
| nsStyleBackground     |  160 |    613 |  160 |  98080 |  168 |    176 | 107888 |
| nsStyleSVGReset       |  192 |     10 |  192 |   1920 |  200 |    208 |   2080 |
| nsStyleBorder         |  312 |    131 |  312 |  40872 |  320 |    320 |  41920 |
| nsStyleDisplay        |  416 |    477 |  416 | 198432 |  424 |    432 | 206064 |
| nsStylePosition       |  440 |    259 |  440 | 113960 |  448 |    448 | 116032 |
+-----------------------+------+--------+------+--------+------+--------+--------+
|                 (sum) |                      | 638192 |               | 689744 |
+-----------------------+------+--------+------+--------+------+--------+--------+

So we're only 8% worse than Gecko, not 36% worse.

For what it's worth, given the sizes here, the worst Gecko's arena can waste is not 12% but about 3.5% (in the nsStylePosition/nsStyleDisplay cases).  The worst stylo can waste is about 30%, if all our structs are nsStyleTableBorder/nsStyleXUL, which is pretty unlikely to actually happen.  ;)
Based on the jemalloc bucket strategy described in comment 37, the total size difference of style structs (mentioned in comment 36) would be ~1.7MB, so it leaves about 1MB unexplained, which may still worth investigating.
(In reply to Emilio Cobos Álvarez [:emilio] from comment #24)
> Also, another potential source of dumb memory usage is something like what
> follows:
> 
>  * { font-size: 10px; }
> 
> That will create a unique font struct per element, when we could just check
> whether the inherited value differs from the specified value to shave a
> bunch of structs.

I tried this, and it does help a bit on the servo-style-struct measurement of one of the sites (zol.com.zn) I picked out of the AWSY about:memory reports:

  Gecko                       0.55 MB
  Stylo                       4.27 MB
  Stylo + this                3.10 MB
  Stylo + bug 1367635         1.85 MB
  Stylo + this + bug 1367635  1.28 MB

Still, in all of these configurations, I didn't see any change in the reported explicit/resident memory usage that AWSY is reporting.  I'll look deeper into what the about:memory differences are showing.
Depends on: 1394297
heap-overhead is always bigger with Stylo, and with my patches for bug 1367635 it gets even bigger, wiping out the gains from style struct savings.  Nick wondered whether this was because we don't have arena allocation.

I just tried doing an AWSY try run for Gecko with nsPresArena allocation disabled for style structs (like this: https://hg.mozilla.org/try/rev/5b74cd2a1ab8e7bafb8aaf406c4fd3ac3271edd1).  But an about:memory difference report for memory-report-TabsOpenSettled-2.json.gz in both didn't show much of a difference in heap-overhead.
Blocks: 1392314
(This is crossing conversation a bit between here and bug 1367635, sorry.)

I experimented with using an arena in Stylo, to see its effect on heap-overhead, with this patch:

  https://hg.mozilla.org/try/rev/757ed65eff3e4072225ed0d12866f38d887ec300

That just allocates all style structs and ComputedValues objects using a single, global nsPresArena with a lock around accesses to it.  (To use the existing per-document arena, it seemed tricky to know which one it is from Arc::drop_slow.  I guess I could fatten ArcInner with an extra pointer to store the arena pointer, and see what happens.  But I'm not sure whether the single arena would bias towards better or worse looking results.)

Here are heap-overhead values for ~current mozilla-central AWSY memory-report-TabsOpenSettled-2.json reports, summing up each of the four content processes, averaged over four runs, for various configurations:

  Gecko:                           39.55 MB  [1]
  Stylo seq:                       47.02 MB  [2]
  Stylo seq + bug 1367635:         52.19 MB  [3]
  Stylo seq + arena:               49.78 MB  [4]
  Stylo seq + arena + bug 1367635: 48.82 MB  [5]

Calculated with:

  jq '.reports | map(select((.process | contains("Content")) and (.path == "heap-committed/overhead"))) | map(.amount) | add' reports-to-average-*.json | awk '{ n += $1 } END { print n / NR }'

So it seems like at least that the use of the arena prevents bug 1367635 from making things substantially worse.

Though there are still other issues with patch there, e.g. here's the about:memory difference between one report each from [2] and [5]:

352.24 MB (100.0%) -- explicit
├──196.02 MB (55.65%) -- images
│  ├──196.04 MB (55.66%) ++ content
│  └───-0.02 MB (-0.01%) ++ (2 tiny)
├──189.81 MB (53.89%) ── heap-unclassified
├──-35.90 MB (-10.19%) -- window-objects
│  ├──-29.20 MB (-8.29%) ++ (30 tiny)
│  └───-6.70 MB (-1.90%) -- top(http://localhost:8081/page_load_test/tp5n/zol.com.cn/www.zol.com.cn/index.html, id=NNN)
│      ├──-6.70 MB (-1.90%) -- active
│      │  ├──-6.66 MB (-1.89%) -- window(http://localhost:8081/page_load_test/tp5n/zol.com.cn/www.zol.com.cn/index.html)
│      │  │  ├──-6.64 MB (-1.89%) -- layout
│      │  │  │  ├──-4.27 MB (-1.21%) ++ servo-style-structs
│      │  │  │  └──-2.37 MB (-0.67%) -- (2 tiny)
│      │  │  │     ├──-2.37 MB (-0.67%) ++ computed-values
│      │  │  │     └───0.00 MB (00.00%) ── style-sets
│      │  │  └──-0.02 MB (-0.01%) ++ js-compartment(http://localhost:8081/page_load_test/tp5n/zol.com.cn/www.zol.com.cn/index.html)
│      │  └──-0.04 MB (-0.01%) ++ window(http://localhost:8081/page_load_test/tp5n/zol.com.cn/pic.zol-img.com.cn/201104/zhuanti0423home_0418.html)/layout
│      └──-0.00 MB (-0.00%) ++ js-zone(0xNNN)
└────2.31 MB (00.66%) -- (7 tiny)
     ├──2.12 MB (00.60%) ++ heap-overhead
     ├──0.23 MB (00.06%) ── atom-tables/main [4]
     ├──-0.18 MB (-0.05%) ++ js-non-window
     ├──0.14 MB (00.04%) ── profiler/profiler-state [4]
     ├──0.00 MB (00.00%) ++ xpconnect
     ├──-0.00 MB (-0.00%) ── gfx/font-shaped-words [4]
     └──0.00 MB (00.00%) ── preferences [4]

Since [5] is with the arena, and I didn't add any measurement code for that arena, all of the style struct and ComputedValues measurements are really inside heap-unclassified.  Still that just accounts for 6.64 MB, not the entire additional 189.81 MB of heap-unclassified.  (The current bug 1367635 patch doesn't yet measure cached-but-currently-unused ComputedValues hanging off rule nodes, or the structs created to hold them.  I'll add that tomorrow once bug 1394729 lands.  Even so that isn't likely to be the bulk of the heap-unclassified.)


[1] https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=1e89ea2871e7f3a2d3158ea0e660199bd282efce&filter-searchStr=awsy
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=1f9bc2e77651f8ad44cf40899c4fd078dc0c8568&group_state=expanded
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=f3e47789db9c425f67f01c465916a21dcb5d7dbf&group_state=expanded
[4] https://treeherder.mozilla.org/#/jobs?repo=try&revision=5a7331ee4e2c5a64eb2a06e168a1025fd163a7a7&group_state=expanded
[5] https://treeherder.mozilla.org/#/jobs?repo=try&revision=e3f6ff7362df0218b08c0ff3659ba2d58aaa7120&group_state=expanded
(In reply to Cameron McCormack (:heycam) from comment #41)
> Since [5] is with the arena, and I didn't add any measurement code for that
> arena, all of the style struct and ComputedValues measurements are really
> inside heap-unclassified.  Still that just accounts for 6.64 MB, not the
> entire additional 189.81 MB of heap-unclassified.

That should be the 35 MB of the top window-obejcts measurement; 6.64 MB is just for that one site.  Still plenty left to account for though.
I belatedly remembered that DMD can do diffs. So I did two --mode=live runs comparing Gecko and Stylo on the Obama Wikipedia page. Output is attached. Positive numbers mean Stylo is using more memory than Gecko.

Notes:

- I did a bit of editing to trim out obviously uninteresting records.

- There's a lot of expected stuff here, e.g. Stylo-only data structures such as CVs showing up as positive, and ArenaAllocator showing up as negative.

- There's a bunch of JS stuff, including XDRState, showing up higher for Stylo. It's possible this is just non-deterministic noise.

- The most interesting thing I learned: the extra threads used by Stylo increases the Gecko Profiler's memory usage. 6 extra threads costs 221,184 bytes on 64-bit (see record 24). This extra space is mostly for each thread's PseudoStack, which has 1024 entries of 32 bytes each on 64-bit.

- Total difference in live heap allocations was 7.5 MB.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #31)
> Yeah that's a good point. Eric, how can we go about doing xidorn's
> measurements on AWSY, and using them to determine how much of the deficit
> we'll make up?

I'm guessing this ni? has expired.
Flags: needinfo?(erahm)
Depends on: 1397971
Depends on: 1397976
Depends on: 1397626
Depends on: 1398322
Depends on: 1398328
Depends on: 1398658
Depends on: 1395488
Cameron, Boris, Emilio, and I are all working on this from different angles. Taking this bug since I'm on the hook to coordinate the right outcome.
Assignee: cam → bobbyholley
We're now at 275 vs 270 on AWSY. Great work everyone!

Still some more stuff to do here, but Joe requested during today's quantum call that we forgo further memory optimizations in 57 to reduce risk. Marking this as no longer blocking.
Assignee: bobbyholley → nobody
Priority: P1 → P4
Summary: stylo: Memory usage seems to be higher than Gecko → [meta] stylo: Memory usage seems to be higher than Gecko
Status: NEW → RESOLVED
Closed: 9 months ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.