stylo: Investigate sources of stylo memory overhead on the html single-page spec

NEW
Unassigned

Status

()

defect
P3
normal
2 years ago
2 years ago

People

(Reporter: bzbarsky, Unassigned)

Tracking

(Depends on 1 bug)

53 Branch
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox57 wontfix)

Details

()

Attachments

(4 attachments, 13 obsolete attachments)

49.56 KB, application/x-gzip
Details
48.19 KB, application/x-gzip
Details
47.31 KB, application/x-gzip
Details
47.34 KB, application/x-gzip
Details
Spun off from bug 1384945 for the non-adblock investigation.

I'll be posting some memory reports and analysis.
Blocks: 1367854
Summary: Investigate sources of stylo memory overhead on the html single-page spec → stylo: Investigate sources of stylo memory overhead on the html single-page spec
Posted file no-stylo.json.gz (obsolete) —
Posted file stylo-threads-1.json.gz (obsolete) —
Posted file stylo-threads-3.json.gz (obsolete) —
Posted file stylo-threads-6.json.gz (obsolete) —
Posted file stylo-threads-12.json.gz (obsolete) —
OK, so some notes.  The baseline no-stylo explicit allocation number here is ~270MB for the web content process.  All measurements are done on an 8-core (+ hyperthreading, so 16 logical cores) Linux box.  The STYLO_THREADS environment variable was used to force particular stylo threading behaviors.

Going from non-stylo to stylo-1-thread seems to have the following changes:

  -4MB style contexts.
  -3MB Gecko style structs
 +12MB ComputedValues
 +16MB servo style structs
  +8MB heap-unclassified
  +2MB heap-overhead, mostly bin-unused and bookkeeping.

Going from 1 thread to 3 threads for stylo gives us:

  +10MB ComputedValues
  +10MB servo style structs
   +2MB heap-unclassified
   +6MB heap-overhead, split 50-50 between bin-unused and page-cache.

Going from 3 threads to 6 threads for stylo gives us:

   +2.5MB ComputedValues
   +1.2MB servo style structs
   no real changes to heap-overhead or heap-unclassified.

Going from 6 to 12 threads for stylo gives us:

   +3MB ComputedValues
   +3MB servo style structs
   +7MB heap-overhead
   no real changes to heap-unclassified

I suspect the heap-overhead bits for 6 threads being exactly like 3 and then 12 being a lot more may somewhat be noise.

I will look into the heap-unclassified here next.
(P3 just to avoid having multiple blocking bugs tracking memory usage)
Priority: -- → P3
Oh, and notably all the above numbers are on 64-bit; I don't have an easily accessible 32-bit setup right now.

I tried doing a DMD run with STYLO_THREADS=3 to look at the heap-unclassified there.  The following bits are popping up as stylo-related:

1) 160185 32-byte allocations for ElementData.  That's a bit over 5MB, or about
   half the increase in heap-unclassified over Gecko if my numbers above match up
   with this number.  This page has 154180 elements on it, and some of those have
   ::before/::after, so this number of ElementData instances looks quite plausible.
   I filed bug 1397472 on adding a memory reporter here.
2) 10322 128-byte, 260 112-byte, 5876 80-byte, 63 64-byte, 99 32-byte allocations of
   nsTArray buffers under Gecko_FontFamilyList_AppendNamed and nsFont constructors.
   This adds up to about 1.8MB with servo, but only 340KB with Gecko, presumably
   because we have more Font structs with Servo. This is an array of FontFamilyName structs.
   Each such struct has a uint32_t type, then an nsString.  nsString is a char_type*,
   then uint32_t, then uint16_t, then uint16_t.  So nsString is 16 bytes on 64-bit,
   but alignment presumably means FontFamilyName is 32 bytes (vs 16 bytes on 32-bit).
   Plus we have the 8 bytes of length/capacity arrays store.  So a 112-byte allocation
   should hold 3 family names, while a 128 byte one can't hold 4... I dunno why we
   have 128-byte allocations here, unless we ended up doubling from a smaller size.
   Anyway, maybe we can shrink FontFamilyName, maybe we can do more sharing or something.
3) 2828 272-byte allocations of ComputedValues, created from Servo_ComputedValues_Inherit,
   called from ResolveStyleForText.  This is about 800KB.  I'm not sure where this ComputedValues
   is hanging out such that we don't report it.  Maybe cached on its parent ComputedValues?  We
   certainly _are_ reporting, via nsIFrame::AddSizeOfExcludingThisForTree, many megabytes of other
   ComputedValues we create for text...  Might be good to figure out how to memory-report this.
4) 95 272-byte allocations of ComputedValues created from nsButtonFrameRenderer::ReResolveStyles
   calling ProbePseudoElementStyle.  I guess this is the mInnerFocusStyle of the button frame.
   This is only 26KB, so not a huge deal, but we should still add a memory reporter here.
5) 9 ComputedValues allocations that came from Servo_TraverseSubtree computing style for an element.
   I have no idea why these were not reported, and that worries me a bit.
6) Various allocations under Servo_StyleSheet_FromUTF8Bytes.  Might be about 500KB in there;
   it's a bit hard to tell because they're so scattered all over.  I had thought we had
   memory reporters for the servo stylesheets...
7) Array allocations from Gecko_ClearAndResizeStyleContents, allocating arrays of nsStyleContentData.
   Looks like about 110KB of this.  In Gecko it's not in the top 1000 heap block records.
8) 12 8192-byte blocks allocated in SHARING_CACHE_KEY::__init.  Why 12?  I'm pretty sure I did
   STYLO_THREADS=3.  12 is the normal number we'd use; are we forgetting to consider
   STYLO_THREADS somewhere?  Or is this just coincidence?  I also have 12 BLOOM_KEY::__init, 
9) A few hundred KB under EagerPseudoStyles::set.

The stacks DMD produces for me have useful symbols, but claim "crtstuff.c:?" for the file/line in the Gecko code, which makes it a bit hard to tell where some allocations are happening.  :(

Anyway, the upshot is that we have lots of ElementData we're not reporting, we have a bunch more FontFamilyName array stuff due to worse struct/style sharing, we have fair amounts of ComputedValues (esp for text) that we're not reporting yet.  I think those items between them cover at least 3/4 of the heap-unclassified regression.
(In reply to Boris Zbarsky [:bz] (still digging out from vacation mail) from comment #8)
> Anyway, maybe we can shrink FontFamilyName, maybe we can do more sharing or something.

We can probably turn font family names into atom so that we avoid that allocation.
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #11)
> (In reply to Boris Zbarsky [:bz] (still digging out from vacation mail) from
> comment #8)
> > Anyway, maybe we can shrink FontFamilyName, maybe we can do more sharing or something.
> 
> We can probably turn font family names into atom so that we avoid that
> allocation.

Oh, the array buffer under Gecko_FontFamilyList_AppendNamed would not be affected by using atom, so yes this is really just from lack of style struct sharing.
> the array buffer under Gecko_FontFamilyList_AppendNamed would not be affected by using atom

Well, it would, in that an nsIAtom* is smaller than an nsString....
Would it make sense to make FontFamilyList a refcounted object, and have that object created (and stored) in the specified value of font-family?
So the patch in bug 1367635 helps a good bit with the style struct numbers.  They're still not down to Gecko level, but a lot closer than without.  With that patch, we get behavior like this (to be compared to comment 6):

Going from non-stylo to stylo-1-thread:

  +6MB servo style structs

Going from 1 thread to 3 threads for stylo gives us:

  +2.4MB servo style structs

Going from 3 threads to 6 threads for stylo gives us:

  +2.25MB servo style structs
> Would it make sense to make FontFamilyList a refcounted object

Possibly, yes.
> but alignment presumably means FontFamilyName is 32 bytes (vs 16 bytes on 32-bit).

This doesn't sound correct. FontFamilyName should be 24 bytes on 64bit, and 16 bytes on 32bit.

So 128 bytes can hold 5, 112 bytes can hold 4, 80 bytes can hold 3, 64 bytes can hold 2, and 32 bytes can hold 1.

If we change FontFamilyName to store an atom, it would become 16 bytes (vs 8 bytes on 32bit), and we would have 96 bytes for 5, 80 bytes for 4, 64 bytes for 3, 48 bytes for 2, and 32 bytes for 1. That sum up to ~1.3MB, which doesn't seem to be a big win.

Also note that, 16,620 nsStyleFont is 2MB itself... so probably finding a way to share this struct is still worthy?
Depends on: 1397626
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #23)
> > but alignment presumably means FontFamilyName is 32 bytes (vs 16 bytes on 32-bit).
> 
> This doesn't sound correct. FontFamilyName should be 24 bytes on 64bit, and
> 16 bytes on 32bit.

malloc_usable_size(malloc(24)) is 32.
(In reply to Mike Hommey [:glandium] from comment #24)
> malloc_usable_size(malloc(24)) is 32.

FontFamilyName is stored in an array, I don't think malloc_usable_size applies here.
> FontFamilyName should be 24 bytes on 64bit

Good catch.  

> That sum up to ~1.3MB, which doesn't seem to be a big win.

"big" is kinda relative.  We have a 60MB regression, so 1.3MB is not a huge chunk of it, but it's not nothing either.

I agree that sharing the Font struct would be good too, and anything we do there will reduce the win we get from making the FontFamilyName storage smaller.

Anyway, I went through the list in comment 8 and filed bug 1397813 on item 3, bug 1397815 on item 4.  As for item 8, I tried again with STYLO_THREADS=3 and can't reproduce.  So I must have somehow failed to pass STYLO_THREADS to mach run correctly when I got the numbers above...
Depends on: 1397813, 1397815
Attachment #8905180 - Attachment is obsolete: true
Attachment #8905181 - Attachment is obsolete: true
Attachment #8905182 - Attachment is obsolete: true
Attachment #8905184 - Attachment is obsolete: true
Attachment #8905185 - Attachment is obsolete: true
Attachment #8905312 - Attachment is obsolete: true
Attachment #8905313 - Attachment is obsolete: true
Attachment #8905316 - Attachment is obsolete: true
Attachment #8905317 - Attachment is obsolete: true
Attachment #8905318 - Attachment is obsolete: true
Attachment #8905325 - Attachment is obsolete: true
Attachment #8905326 - Attachment is obsolete: true
Attachment #8905327 - Attachment is obsolete: true
Compared to no-stylo this shows:

  +16MB ComputedValues
  +11MB servo style structs
   +5MB servo element data
   -4MB style contexts
   -3MB gecko style structs
   -2MB "pres-shell" (not sure where this is coming from)
   +2MB heap-unclassified
   +1MB heap-overhead

Compare this to comment 6, where we had +22MB ComputedValues and +26MB servo style structs, +10MB heap-unclassified, +8MB heap-overhead.  The total regression back then was ~57MB; now it's ~26MB.
This shows:

  +4.5MB ComputedValues
    +2MB servo-style-structs
  +0.5MB heap-unclassified
  +0.5MB heap-overhead

when compared against 3 threads.  Again, compare comment 6, where we used to only add 3.7MB here.  This makes sense, since we added more various sharing stuff, but it's all thread-local...
Here we have a 15MB regression over no-stylo.  ComputedValues is ~9MB and servo-stye-structs ~8MB.
Depends on: 1400915
You need to log in before you can comment on or make changes to this bug.