
Investigate which of the rulehash arrays should be nsAutoTArray

RESOLVED FIXED in mozilla10



CSS Parsing and Computation


(Reporter: bz, Assigned: bz)


(Blocks: 1 bug)

Mac OS X
Dependency tree / graph
Bug Flags:
in-testsuite -

Firefox Tracking Flags

(Not tracked)


(Whiteboard: [MemShrink:P2])


(17 attachments)

3.19 KB, patch
658 bytes, text/plain
75 bytes, text/plain
663 bytes, text/plain
227 bytes, text/plain
67 bytes, text/plain
34 bytes, text/plain
1.17 KB, text/plain
223 bytes, text/plain
336 bytes, text/plain
483 bytes, text/plain
95 bytes, text/plain
280 bytes, text/plain
35 bytes, text/plain
552 bytes, text/plain
11.10 KB, patch
7.12 KB, patch
: review+
Created attachment 555529 [details] [diff] [review]
Measurement patch

There are a bunch of arrays that rulehash uses.  Some of these are actually guaranteed to never be empty.  We should look into using nsAutoTArray for these, perhaps.

The attached patch logs at destruction time the sizes of the following arrays:

1) Array of rulevalues in RuleHashTableEntry
2) Array of universal rules in RuleHash
3) Array of nsCSSSelector* in AtomSelectorEntry
4) Array of state selectors in RuleCascadeData
5) Array of possibly-negated class selectors in RuleCascadeData
6) Array of possibly-negated id selectors in RuleCascadeData
7) Array of rules in PerWeightData

I then ran the log through some grep/sed/sort/uniq processing to produce a data file for each of those 7 with the following structure: the first line is the total number of arrays of the given type; each following line gives a count and a length at destruction, with the count listed first and the length second.  I'll attach all that stuff, then analysis of the results after loading google, gmail, cnn, slashdot, twitter...
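The grep/sed/sort/uniq pipeline itself isn't attached; for reference, a rough C++ equivalent of the histogram step might look like this (function name and shapes are mine, not from the patch):

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Sketch of the histogram step described above (the real post-processing
// used grep/sed/sort/uniq over the log).  Input: the array lengths logged
// at destruction time.  Output: (count, length) pairs, largest count
// first, matching the body of the attached data files.
std::vector<std::pair<int, int>> Histogram(const std::vector<int>& aLengths) {
  std::map<int, int> countByLength;
  for (int len : aLengths) {
    ++countByLength[len];
  }
  std::vector<std::pair<int, int>> rows;
  for (const auto& e : countByLength) {
    rows.push_back({e.second, e.first});  // count first, length second
  }
  std::sort(rows.begin(), rows.end(),
            [](const auto& a, const auto& b) { return a.first > b.first; });
  return rows;
}
```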
Blocks: 677571
Created attachment 555530 [details]
RuleHashTableEntry data
Created attachment 555531 [details]
Universal rule data
Created attachment 555532 [details]
AtomSelectorEntry data
Created attachment 555533 [details]
State selector data
Created attachment 555534 [details]
Possibly-negated class data
Created attachment 555535 [details]
Possibly-negated id data
Created attachment 555536 [details]
PerWeightData data

For the RuleHashTableEntry, the beginning of the file looks like this:

  36195  1
  11178  2
  3680  3
  2206  4
  1516  5

No 0s as expected: we don't create an entry if there is nothing to put in it.  So using at least an nsAutoTArray<1> or something that can hold one element is a no-brainer.  Not sure about auto-space for 2 elements; past that it's probably not that worthwhile.
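For context on why the length-1 case is a no-brainer: an auto array keeps room for its first element(s) inline, so no heap allocation happens until the array outgrows that space. A stripped-down sketch of the idea (not the real nsAutoTArray, whose header layout and growth policy differ, and assuming trivially copyable elements):

```cpp
#include <cstdlib>
#include <cstring>

// Minimal sketch of an array with inline ("auto") storage for one
// element, in the spirit of nsAutoTArray<T> with one slot.  Error
// handling and growth policy are simplified; T must be trivially
// copyable for the memcpy below.
template <typename T>
class AutoArray1 {
 public:
  AutoArray1() : mData(mInline), mLength(0), mCapacity(1) {}
  ~AutoArray1() {
    if (mData != mInline) {
      free(mData);
    }
  }
  void AppendElement(const T& aElem) {
    if (mLength == mCapacity) {
      size_t newCap = mCapacity * 2;
      T* newData = static_cast<T*>(malloc(newCap * sizeof(T)));
      memcpy(newData, mData, mLength * sizeof(T));
      if (mData != mInline) {
        free(mData);
      }
      mData = newData;
      mCapacity = newCap;
    }
    mData[mLength++] = aElem;
  }
  size_t Length() const { return mLength; }
  T& operator[](size_t i) { return mData[i]; }
  bool UsesInlineStorage() const { return mData == mInline; }

 private:
  T* mData;      // points at mInline until we outgrow it
  size_t mLength;
  size_t mCapacity;
  T mInline[1];  // inline storage: the common length-1 case is malloc-free
};
```

With the length distributions above, the inline slot absorbs the single-element case entirely, and only the tail of longer arrays ever touches malloc.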

For AtomSelectorEntry, we have:

  22027  1
  8164  2
  3172  3
  2037  4
  1333  5

Again, auto-space for one entry is a no-brainer, space for 2 I'm not sure about, things past that are probably not worth it.

For universal rules, we see that there are 867 total rulehashes of which 492 had no universal rules at all (yay!).  Not sure it's worth making this an auto array at all.

For possibly-negated classes, we have 248 RuleCascadeData of which 239 had no negated classes.  For ids, that number is 244.  So there's no point making these auto arrays either.

For state selectors, things are more interesting; out of 248 RuleCascadeData 84 had 4 state selectors, and 89 had 87 state selectors, with a smaller peak at 0 (15 instances) and 1 (11 instances).  I'm not sure it's worth doing an auto array here either.

For the PerWeightData, we have:

  1792  1
   450  2
   852  3
   280  4
   146  5

(again, as expected: no rules with a weight means no PerWeightData for that weight).  So an auto array holding one entry is totally worth it.  Past that point, I'm not sure; the distribution doesn't die off nearly as fast as the hashtable entry arrays (e.g. there are 302 instances of 7, 224 instances of 15, 222 instances of 17, 128 instances of 35, 124 instances of 335).
AtomSelectorEntry holds pointers, but the other two lists don't.  That suggests we might want something like AutoTArray<1>, which allocates extra space for one element, and also something like SmallVoidArray, which stores the pointer in mHdr.

I'm hoping to make an AutoTArray<1> lookalike which has a smaller footprint and is also memmovable.  If we want AutoTArray<N> which is memmovable, that should also be doable...
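The memmovability problem in a nutshell: a naive auto array points at its own inline buffer, and pldhash relocates entries with raw memory moves, which would leave that pointer aimed at the old slot. A self-contained illustration (simplified; not the actual nsAutoTArray layout):

```cpp
#include <cstring>

// A self-pointing auto array is not safely memmovable: after a raw
// memcpy, mData still holds the address of the *old* object's inline
// buffer.  (Illustration only; nsAutoTArray's real layout differs.)
struct NaiveAutoArray {
  int* mData;
  int mInline[1];
  NaiveAutoArray() : mData(mInline) { mInline[0] = 0; }
  bool IsSelfContained() const { return mData == mInline; }
};
```

After `memcpy(&dst, &src, sizeof src)`, `dst.IsSelfContained()` is false and `dst.mData` dangles once `src` goes away; this is why the patches in this bug route hashtable moves through a MoveEntry hook that swaps elements instead of copying bytes.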
Oh, on the other hand for the PerWeightData case the arrays are all temporary (they live in a hashtable owned by a stack-only CascadeEnumData).  So the only drawback of bigger tarrays is the cache locality loss; there's no long-term memory usage effect there.

Given that, an autoarray of size 4 (which would cover almost half the arrays in that data) might make sense.
Blocks: 682735
No longer blocks: 677571
Depends on: 677571
Erm, this should *depend on* bug 682735, which makes nsAutoTArray memmove'able.
No longer blocks: 682735
Depends on: 682735
Priority: -- → P2
dbaron pointed out that the atomselector and rulehash entries are actually used in multiple tables with possibly different growth behavior.  I'll attach some data files showing breakdowns of that stuff.
Created attachment 560449 [details]
Attribute AtomSelectorEntry data
Created attachment 560451 [details]
ID AtomSelectorEntry data
Attachment #560449 - Attachment description: Attribute AtomeSelectorEntry data → Attribute AtomSelectorEntry data
Created attachment 560452 [details]
Class AtomSelectorEntry data
Created attachment 560453 [details]
ID RuleHashTableEntry data
Created attachment 560455 [details]
Class RuleHashTableEntry data
Created attachment 560456 [details]
Namespace RuleHashTableEntry data
Created attachment 560457 [details]
Tag RuleHashTableEntry data
Created attachment 560458 [details] [diff] [review]
Updated measurement patch
It looks to me like for the most part the breakdown is the same for all the table types.  Exceptions are the tiny Namespace RuleHashTableEntry table and the ID AtomSelectorEntry table (for which length 1 would only cover about 35% of the entries; lengths 1 and 2 would cover closer to 55%).

David, thoughts on the array sizing?  Should we just use auto-2 arrays?
Created attachment 560506 [details] [diff] [review]
Test patch to make all these arrays auto arrays of size 1

Justin, want to give this a cachegrind spin?
Comment on attachment 560506 [details] [diff] [review]
Test patch to make all these arrays auto arrays of size 1

> +  newEntry->mSelectors.SwapElements(oldEntry->mSelectors);

Don't you need to fix up nsTArray's SwapElements implementation first?
Er, yes.  Gah.  I forgot about that.
Not happening tonight.  :(
Depends on: 687722
With the patch from bug 687722 applied, I measured the number of calls to malloc() when starting the browser to about:home, loading my gmail inbox, and closing the browser.

Before: 1 350 000
After:  1 280 000

That's a 1.05x change.
Pushed to Try to get some Talos results:
So the before/after numbers there are both with bug 687722, but before is without the attached patch and after is with the attached patch?

How do things look if you use auto-2 arrays in the attached patch?
Blocks: 688532
(In reply to Boris Zbarsky (:bz) from comment #28)
> So the before/after numbers there are both with bug 687722, but before
> is without the attached patch and after is with the attached patch?


> How do things look if you use auto-2 arrays in the attached patch?

Argh; I can't reproduce those numbers anymore.  (I measured multiple times originally!)  I think I need a more static testcase.
New numbers.  These are from running an instrumented build on the first 10 pages in the TP5 manifest.  I ran the test 10 times for each build.  Here are the mean number of times we call malloc:

Unpatched        1138000
nsAutoTArray<1>  1113000
nsAutoTArray<2>  1106000

So nsAutoTArray<1> is a 2% decrease, but nsAutoTArray<2> is only a 0.5% decrease over nsAutoTArray<1>.
If Talos is to be believed, this is good for a moderate memory improvement on Linux and Mac.  (It looks like the Windows Talos job was never started, for some reason.)

Results below are (patch) / (smallest recent value on m-i), in mb:

  private: 395 / 400
  rss:     139 / 143

  private: 453 / 461
  rss:     192 / 198

Mac 10.5:
  private: 909 / 902 (very noisy)
  rss:     173 / 179

Mac 10.6:
  private: 3500 / 3700 (this difference is too huge)
  rss:     279 / 280

I'll take a 2-3% drop in RSS!

I'll push to try to see what happens with size-2 nsAutoTArrays.
Try push with nsAutoTArray<2>:
erm, I mean
nsAutoTArray<2> gives maybe a 1mb decrease in RSS on Linux-32 and maybe 4mb on Mac 10.6.  These could be noise, though; the mac numbers in particular are very noisy.

I don't know if the distribution of rulehash sizes is different on TP5 than on what Boris tested on; in some other tests, I've seen that TP5 is a lot simpler than many of the sites I visit.

I'm inclined to suggest we should go with nsAutoTArray<2> because some of Boris's data suggests that it would be useful, and since we're already paying an extra 16 bytes for the empty nsAutoTArray plus an extra 4 bytes for the one RuleValue or an extra word for the one nsCSSSelector*.  So the choice is between RuleValue entries which have size 20 versus 24 bytes, and AtomSelectorEntrys with size 20/24 versus 24/30 bytes (32/64-bit).

We're already paying so much more for these auto arrays, maybe we should eat the extra 4 or 8 bytes.

But I think we'd do well with either nsAutoTArray<1> or nsAutoTArray<2>.
Comment on attachment 560506 [details] [diff] [review]
Test patch to make all these arrays auto arrays of size 1

dbaron, do you have a preference wrt size-1 or size-2?  It would be great to get this in before we branch.
Attachment #560506 - Flags: review?(dbaron)
Hm, I did a pretty bad job with those size computations.  Let me try again.  On 32-bit:

  RuleHashTableEntry:
  originally        8 bytes (PLDHashEntryHdr=4, nsTArray=4)
  nsAutoTArray<1>  32 bytes (PLDHashEntryHdr=4, nsAutoTArray<0>=16, RuleValue=12)
  nsAutoTArray<2>  44 bytes (PLDHashEntryHdr=4, nsAutoTArray<0>=16, 2 x RuleValue=24)

  AtomSelectorEntry:
  originally       12 bytes (PLDHashEntryHdr=4, nsIAtom*=4, nsTArray=4)
  nsAutoTArray<1>  28 bytes (PLDHashEntryHdr=4, nsIAtom*=4, nsAutoTArray<0>=16, nsCSSSelector*=4)
  nsAutoTArray<2>  32 bytes (PLDHashEntryHdr=4, nsIAtom*=4, nsAutoTArray<0>=16, 2 x nsCSSSelector*=8)
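These tallies can be sanity-checked with stand-in structs; a sketch, with made-up field names and ABI-dependent results (the real nsAutoTArray<0> header is 16 bytes on 32-bit, larger than this sketch's, so only the relative costs carry over):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative stand-ins for the entry layouts tallied above: a hash
// header, a tarray-style header (pointer + length + capacity), and N
// inline elements.  Field names are invented; real Gecko sizes depend
// on the platform ABI.
struct RuleValueLike {  // stands in for RuleValue (pointer + two ints)
  void* mRule;
  int32_t mIndex;
  int32_t mSelector;
};

template <unsigned N>
struct EntryWithInline {      // stands in for RuleHashTableEntry
  uint32_t mKeyHash;          // PLDHashEntryHdr
  void* mData;                // array header: pointer +
  uint32_t mLength;           //   length +
  uint32_t mCapacity;         //   capacity
  RuleValueLike mInline[N];   // auto storage for N elements
};
```

The key relationship is that each extra inline slot costs exactly one more `sizeof(RuleValueLike)` per entry, paid whether or not the slot is used.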
Comment on attachment 560506 [details] [diff] [review]
Test patch to make all these arrays auto arrays of size 1

>+static void
>+RuleHash_MoveEntry(PLDHashTable *table, const PLDHashEntryHdr *from,
>+                   PLDHashEntryHdr *to)

Please assert that from != to.

>+  RuleHashTableEntry *oldEntry =
>+    const_cast<RuleHashTableEntry*>(
>+      static_cast<const RuleHashTableEntry*>(from));
>+  RuleHashTableEntry *newEntry = new (to) RuleHashTableEntry();
>+  newEntry->mRules.SwapElements(oldEntry->mRules);

It seems like you need to call oldEntry->~RuleHashTableEntry(), or else you'll get leak stats misreporting.  Or am I missing something?

>+static void
>+RuleHash_TagTable_MoveEntry(PLDHashTable *table, const PLDHashEntryHdr *from,
>+                            PLDHashEntryHdr *to)

Same here (both comments).

>+static void
>+AtomSelector_MoveEntry(PLDHashTable *table, const PLDHashEntryHdr *from,
>+                       PLDHashEntryHdr *to)

and same 2 comments here

r=dbaron with that
Attachment #560506 - Flags: review?(dbaron) → review+
Good catch on both of those.  Will fix.
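For reference, the shape dbaron is asking for, sketched outside Gecko with stand-in types (the real hooks take PLDHashTable/PLDHashEntryHdr and live in pldhash):

```cpp
#include <cassert>
#include <new>
#include <vector>

// Sketch of the reviewed MoveEntry pattern: assert the slots differ,
// placement-new the destination, steal the source's elements, then run
// the source's destructor so allocation/leak bookkeeping stays balanced.
// FakeEntry stands in for RuleHashTableEntry.
struct FakeEntry {
  std::vector<int> mRules;  // stands in for nsTArray<RuleValue>
};

void MoveEntry(FakeEntry* from, FakeEntry* to) {
  assert(from != to);                   // the review's first request
  FakeEntry* newEntry = new (to) FakeEntry();
  newEntry->mRules.swap(from->mRules);  // SwapElements equivalent
  from->~FakeEntry();                   // the review's second request
}
```

Without the explicit destructor call, the moved-from entry's array header would never be torn down, which is exactly the leak-stats misreporting the review warns about.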
Per discussion with jlebar, going to go with auto-1 on the rulehash entries and auto-2 on the atomselector ones.  The atomselector case has a larger fraction of length-2 lists, and the cost of an extra entry is less there as well.

I'll file a followup on the PerWeightData arrays.
Flags: in-testsuite-
Target Milestone: --- → mozilla9
Backed that out in - debug Windows mochitest-4 logs got bloated up to the point of being truncated (at 52428800 bytes), and this was my best guess out of that push.
Whiteboard: [MemShrink]
Yeah, that log has about 65k lines with pldhash containing entries of size 48 and hence spewing warnings.

Double-checking our entry size numbers now.
So this is failing because our nsAutoTArrays are bigger on Windows than on Mac (and also maybe on Linux?), because Windows apparently aligns PRUint64 to 8 bytes but Mac aligns to 4 bytes.

The solution is to make nsAutoTArray<E> align E to its natural alignment.  I'll file a bug on this now.
Depends on: 689433
So the key difference is apparently that the alignment requirement of PRUint64 is different on Mac (at least) and Windows. It's 4 bytes for the former and 8 for the latter.

So on Mac, RuleHashTagTableEntry is 4 bytes of PRUint32, then 4 bytes of padding so the array can align on a boundary, then 16 bytes of array header (of which 4 bytes is actually padding!), then 12 bytes of first array entry, then 4 bytes of padding so the array size can be a multiple of 8.  Then another nsCOMPtr<nsIAtom>.

If you're counting along at home, that's 4+4+16+12+4+4 = 44.  And presumably MSVC wants to align the whole struct on an 8-byte boundary so those paddings make sense in the _next_ struct, so we get 48.

pldhash warns when the entrysize is > 10*sizeof(void*), and 48 > 40, so we lose.

On gcc32, the array is 4-byte aligned.  So we drop the 3*4 == 12 bytes of padding and end up with a 32-byte struct, which is small enough that pldhash won't warn.  If we added another RuleValue, we'd be up at 44 bytes there and warning as well, so it's a good thing we didn't.  ;)

On 64-bit we have 80 bytes to play with, so things will be fine there as long as they're fine on 32-bit, since alignment requirements will not be able to create word-sized holes like they did here.

The plan is for jlebar to convert nsAutoTArray to natural alignment, then we reland this patch.
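The alignment effect described above is easy to demonstrate with offsetof; a minimal sketch (exact numbers vary by ABI, which is the crux here):

```cpp
#include <cstddef>
#include <cstdint>

// A 4-byte hash followed by a field with uint64_t alignment, as in the
// layouts above.  On ABIs that align uint64_t members to 8 bytes (MSVC,
// and typical 64-bit targets), 4 bytes of padding follow mKeyHash; i386
// Mac gcc aligned PRUint64 to 4 bytes, so there the padding vanished.
struct PaddedEntry {
  uint32_t mKeyHash;  // PLDHashEntryHdr
  uint64_t mValue;    // anything with PRUint64's alignment requirement
};

// Padding inserted between the two members by this compiler's ABI.
constexpr size_t kPadding =
    offsetof(PaddedEntry, mValue) - sizeof(uint32_t);
```

The nsAutoTArray natural-alignment fix matters precisely because this padding silently compounds across the hash header, the array header, and the inline elements.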
Whiteboard: [MemShrink] → [need landing][MemShrink]
Target Milestone: mozilla9 → ---
Filed bug 689443 on sorting out what to do with PerWeightData.
Blocks: 689443


Whiteboard: [need landing][MemShrink] → [need landing][MemShrink:P2]
No longer depends on: 682735
I pushed this; I hope you don't mind, bz!
Whiteboard: [need landing][MemShrink:P2] → [MemShrink:P2][inbound]
(In reply to comment 31)
> I'll take a 2-3% drop in RSS!

I fell for the oldest mistake in the book here.  TBPL shows RSS in megabytes (1024^2 bytes), but I computed megabytes from the graphs by looking at the top three digits.  So naturally, the TBPL results are 2-3% smaller than the results from the graphs...

I really should not have made that mistake.

We'll see whether there's an actual RSS decrease, now that I've pushed this, but I'm not too optimistic.

I'll go make some noise now in the bugs I filed long ago on graphserver's units (or lack thereof).
Target Milestone: --- → mozilla10
I originally landed the wrong version of this, but I backed out and relanded.
Target Milestone: mozilla10 → ---
Correct version:
Resolution: --- → FIXED
Whiteboard: [MemShrink:P2][inbound] → [MemShrink:P2]
Target Milestone: --- → mozilla10