Closed Bug 129187 Opened 22 years ago Closed 22 years ago

49% of pageload spent in nsRuleNode::Transition on some pages

Categories

(Core :: CSS Parsing and Computation, defect)

x86
Linux
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla1.0

People

(Reporter: bzbarsky, Assigned: dbaron)

References


Details

(Keywords: perf)

Attachments

(2 files, 3 obsolete files)

BUILD: 2002-03-04

STEPS TO REPRODUCE:
1) Load the page linked to in the URL field
2) Wait a long while
Keywords: perf
Attached file jprof profile
I have another bug on this (I think it's worded as a table perf bug), although I
should probably use this one since the other one is a mess.

We probably need to fix rule nodes so that they switch to a hashtable if they
hit a certain child count.  (They used to use hashtables for everything, but
that was bloated and slow, so now only the root uses a hashtable.)  There are
certain cases where a non-root rule node can end up with a huge number of children.

This particular case, though, would also be fixed if we mapped identical STYLE
attributes to the same object.
Blocks: 54542
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.0
Attached patch patch (obsolete) — Splinter Review
This patch fixes the problem by changing the basis for the decision whether to
use a linked list or a hashtable for storing a rule node's children. 
Previously (since hyatt deCOMified nsRuleNode), only the root rule node stored
its children in a hashtable.  This patch allows other rule nodes to store their
children in a hashtable when they get above a certain number of children
(currently set at 16, because that sounded like a good number to me).  This can
be done simply by using the low bit of the children pointer to store
whether we're using a list or a hashtable.
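To illustrate the low-bit trick, here is a hypothetical, simplified sketch (Node, RuleList, and HashTable stand in for the real nsRuleNode, nsRuleList, and nsHashtable types in the patch; the accessor names are modeled on it, not copied from it):

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for nsRuleList / nsHashtable; any 2-byte-aligned type works.
struct RuleList  { void* dummy; };
struct HashTable { void* dummy; };

enum { kListType = 0, kHashType = 1, kTypeMask = 1 };

struct Node {
  // Low bit tags which kind of child storage is pointed at; the pointer
  // must be masked before use.  This relies on allocations being at
  // least 2-byte aligned so the low bit is always free.
  void* mChildrenTaggedPtr = nullptr;

  bool HaveChildren() const { return mChildrenTaggedPtr != nullptr; }
  bool ChildrenAreHashed() const {
    return (reinterpret_cast<uintptr_t>(mChildrenTaggedPtr) & kTypeMask)
           == kHashType;
  }
  void SetChildrenList(RuleList* aList) {
    mChildrenTaggedPtr = aList;  // list pointers are stored untagged
  }
  void SetChildrenHash(HashTable* aHash) {
    mChildrenTaggedPtr = reinterpret_cast<void*>(
        reinterpret_cast<uintptr_t>(aHash) | kHashType);
  }
  HashTable* ChildrenHash() const {
    return reinterpret_cast<HashTable*>(
        reinterpret_cast<uintptr_t>(mChildrenTaggedPtr)
        & ~uintptr_t(kTypeMask));
  }
};
```

The payoff is that no extra member is needed to record which representation is in use; the tag travels with the pointer itself.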

There is no need to store the count of the children because the only time we
ever add children is in nsRuleNode::Transition, so we can *count* the children
while traversing the linked list (if we're using a linked list -- and we never
want to convert in the other direction).  (Did I make an off-by-one error in my
definition of the max -- I need to work through the logic there carefully.  I
think I probably mean >= rather than >.  I should also probably add more
comments.)
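As a rough illustration of counting while traversing (not the patch itself; RuleList and the int rule key are hypothetical stand-ins for nsRuleList and nsIStyleRule*):

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for nsRuleList: a singly linked list of children keyed by rule.
struct RuleList {
  int mRule;
  RuleList* mNext;
};

const int kMaxChildrenInList = 16;

// Walk the list looking for aRule, counting nodes skipped along the way.
// Returns the matching node (or null); *aNumKids lets the caller decide
// whether the list has crossed the threshold and should become a hashtable.
RuleList* FindOrCount(RuleList* aHead, int aRule, int* aNumKids) {
  *aNumKids = 0;
  RuleList* curr = aHead;
  while (curr && curr->mRule != aRule) {
    curr = curr->mNext;
    ++*aNumKids;
  }
  return curr;
}
```

Since children are only ever added in one place, this count costs nothing beyond the lookup that was happening anyway, and conversion is one-way (list to hashtable), so no count ever needs to be stored.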

Anyway, on the testcase from bug 54542 (this one taxed my patience),
http://www.mozilla.org/newlayout/testcases/stress/wbtblclr.html , I did
two tests without my patch and two with my patch (just using the time in
the statusbar):

before: 12.885 12.753
after: 5.565 5.380
It's worth mentioning that I tried to keep the masking code as close as possible
to the code in nsXULAttributeValue.cpp, which is known to be portable.
The testcase in this bug goes from 111 seconds in a build I had lying around
from a few days ago to 44 seconds in a somewhat newer build that has my patch.
Comment on attachment 74565 [details] [diff] [review]
patch

16 is a little big for switching to hashing if you're using double hashing --
for good alpha, the table would be at least 32 entries in absolute capacity.

But this code still uses nsHashtable, sigh.  Any hope of PLDHashTable-ifying?

>-    nsRuleList*  mChildren;
>+  // A hashtable or list that maps from rules to our RuleNode children.
>+  // When matching rules, we use this table to transition from node to
>+  // node (constructing new nodes as needed to flesh out the tree).
>+
>+  void *mChildrenMaskedPtr;

The canonical term is Tagged, not Masked (the pointer as stored may have a tag
bit set, so it may need masking before use -- but it's not masked as stored).

>+  nsHashtable* ChildrenHash() {
>+    return (nsHashtable*) (PRWord(mChildrenMaskedPtr) & ~PRWord(kTypeMask));
>+  }

NS_REINTERPRET_CAST?

>+  void SetChildrenHash(nsHashtable *aHashtable) {
>+    mChildrenMaskedPtr = (void*)(PRWord(aHashtable) | kHashType);
>+  }

Ditto?

>@@ -454,46 +462,57 @@
> {
>   nsRuleNode* next = nsnull;
> 
>-  if (!mParent) {
>+  if (HaveChildren() && !ChildrenAreHashed()) {
>+    PRInt32 numKids = 0;
>+    nsRuleList* curr = ChildrenList();
>+    for ( ; curr && curr->mRuleNode->mRule != aRule;
>+            curr = curr->mNext, ++numKids);

While loop for better readability?  Then the third part of the for loop becomes
the body.

>+    if (curr)
>+      next = curr->mRuleNode;
>+    else if (numKids > kMaxChildrenInList)

Yes, use >=, or just == if you can never fail to switch to a hash table when
crossing the threshold.

/be
Forgot to ask: how are list nodes and nsHashtable allocated?  The code in
nsXULAttributeValue.cpp that you cite is not portable, it assumes malloc returns
0 mod 2 addresses.  I believe shaver is fixing (has fixed? patch in progress) it
to use nsFixedSizeAllocator, which uses an arena (arenas can ensure arbitrary
alignment).

JS is the mac-daddy in Mozilla of pointer-tagging, but again, its GC heap uses
arenas to ensure sufficient alignment.

/be
The nsHashtable objects are from |new|; the |nsRuleValue| objects are from the
pres shell's arena (which uses PL_ARENA_CONST_ALIGN_MASK == 3).  I thought the
code waterson wrote that was causing problems was the nsXULElement code where he
assumed that all pointers were 8-byte aligned.  Are there really any allocators
that don't produce 2-byte aligned pointers?

Anyway, new patch coming, once I clean up a few other issues.
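For what it's worth, the assumption in question, that |new| and malloc never return odd addresses so the low bit is free for a tag, can be sanity-checked at runtime (this only exercises the allocator at hand; it proves nothing about other ports):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// True if the low bit of the pointer is zero, i.e. available as a tag bit.
bool LowBitFree(void* p) {
  return (reinterpret_cast<uintptr_t>(p) & 1) == 0;
}
```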
Oh, right -- it sounds like 4-byte alignment was observed where 8 was expected
in bug 124335.  You're probably ok -- any architecture that byte-aligns
arbitrary memory that might contain a double at the front is not known to me,
and not a port that Mozilla supports currently (JS_SetPrivate requires its API
users to keep private data on 0 mod 2 boundaries).  Sorry for the false alarm.

PLDHashTable would still be a win -- is jkeiser's C++ sugar for it ready for 1.0?

/be 
Attached patch revised patch (obsolete) — Splinter Review
This patch fixes most of the comments and a few other minor things.  However:
 * I didn't change the two old-style casts, which were why I made comment 4.
 * I'm still hoping that |new| is 2-byte aligned everywhere.  (It certainly is
on all platforms where mozilla *currently* runs.)
 * I'm more worried about time with the list vs. hashtable tradeoff than about
space.  I think rule nodes with greater than 16 children are likely to be quite
rare.  The big question is how slow an nsHashtable lookup is.
 * I might want to convert this to pldhash at some point in the future, but not
as part of this patch.  I'm already manually separating this patch from other
changes in the same file (bug 131454).
Attachment #74565 - Attachment is obsolete: true
Yes, and I won't have time to add C++ happiness to pldhash until post 1.0, too
many other bugs.
After putting the following instrumentation in the destructor of nsRuleNode:

  PRInt32 count = 0;
  if (ChildrenAreHashed())
    count = ChildrenHash()->Count();
  else
    for (nsRuleList *curr = ChildrenList(); curr; curr = curr->mNext)
      ++count;
  printf("nsRuleNode(this=%p) has %ld children.\n", this, count);

and piping the output of mozilla bringing up a navigator window through
'| grep RuleNode | cut -d")" -f2 | sort | uniq -c | sort -k3 -n', I ended up
with the histogram:

    216	 has 0 children.
    346	 has 1 children.
     44	 has 2 children.
      9	 has 3 children.
      7	 has 4 children.
      2	 has 5 children.
      2	 has 6 children.
      1	 has 7 children.
      2	 has 8 children.
      1	 has 9 children.
      1	 has 20 children.
      1	 has 65 children.
If I brought up a bunch of web pages in addition to my homepage and also brought
up the mail window, I get the histogram:

   1101	 has 0 children.
   1231	 has 1 children.
    130	 has 2 children.
     56	 has 3 children.
     24	 has 4 children.
     16	 has 5 children.
      5	 has 6 children.
      9	 has 7 children.
      8	 has 8 children.
      3	 has 9 children.
      1	 has 10 children.
      3	 has 11 children.
      1	 has 12 children.
      1	 has 13 children.
      3	 has 14 children.
      1	 has 16 children.
      1	 has 17 children.
      1	 has 19 children.
      1	 has 20 children.
      1	 has 23 children.
      1	 has 30 children.
      1	 has 35 children.
      1	 has 40 children.
      1	 has 71 children.
      1	 has 83 children.
      1	 has 102 children.

I'm wondering what the best threshold is... probably it won't make a
measurable difference, though, so I may as well leave it at (or I could drop it
back to 16).
For loading a navigator window with 
http://l10n.openoffice.org./localization/English_Russian_OOo_Glossary.htm
I get:

  38135	 has 0 children.
    478	 has 1 children.
     48	 has 2 children.
     10	 has 3 children.
      9	 has 4 children.
      4	 has 5 children.
      2	 has 6 children.
      1	 has 7 children.
      3	 has 8 children.
      1	 has 9 children.
      1	 has 10 children.
      1	 has 17 children.
      1	 has 42 children.
      1	 has 46 children.
      1	 has 65 children.
      1	 has 296 children.
      1	 has 12502 children.
      2	 has 12503 children.

For loading a navigator window with
http://www.mozilla.org/newlayout/testcases/stress/wbtblclr.html I get:

   9117	 has 0 children.
    314	 has 1 children.
     41	 has 2 children.
      8	 has 3 children.
      7	 has 4 children.
      2	 has 5 children.
      2	 has 6 children.
      1	 has 7 children.
      2	 has 8 children.
      1	 has 9 children.
      1	 has 16 children.
      1	 has 65 children.
      1	 has 8911 children.
Keywords: mozilla1.0
This patch also converts to PLDHashTable.  The performance numbers seem roughly
the same as the previous patch (although the testcase for this bug took 48
seconds, twice, rather than the 44 I saw in one test before, but that's
probably a fluke).
Attachment #74576 - Attachment is obsolete: true
Comment on attachment 74628 [details] [diff] [review]
patch also converting to PLDHashTable

>+  PLDHashTable *hash = PL_NewDHashTable(&ChildrenHashOps, nsnull,
>+                                sizeof(ChildrenHashEntry), kMaxChildrenInList);

I just replaced this, which is bad since it triggers an immediate
table-grow, with:

  PLDHashTable *hash = PL_NewDHashTable(&ChildrenHashOps, nsnull,
					sizeof(ChildrenHashEntry),
					kMaxChildrenInList * 4);

which tends a little towards the other extreme (I could have used *2,
which would have given an initial alpha of just over .5 rather than
just over .25), but I figure an extra 512 bytes in a few spots to
save a bit of churn won't hurt.
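The load-factor arithmetic above, sketched out (pure arithmetic; PLDHash's own capacity rounding and growth policy are not modeled here):

```cpp
#include <cassert>

// Threshold at which a rule node's child list becomes a hashtable, so the
// table starts life holding kMaxChildrenInList + 1 = 17 entries.
const int kMaxChildrenInList = 16;

// Load factor ("alpha") of a table holding `entries` in `capacity` slots.
double Alpha(int entries, int capacity) {
  return double(entries) / double(capacity);
}
```

With capacity kMaxChildrenInList * 2 = 32 slots, the 17 initial entries give an alpha just over .5; with * 4 = 64 slots, just over .25, matching the comment above.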
Comment on attachment 74628 [details] [diff] [review]
patch also converting to PLDHashTable

>+static PLDHashTableOps ChildrenHashOps = {
>+  // It's probably better to allocate the table itself using the libc
>+  // allocator rather than the pres shell's arena because the table
>+  // doesn't grow very often and the pres shell's arena doesn't recycle
>+  // very large size allocations.

Agreed.  How about "using malloc and free" instead of "using the libc
allocator", which is Unix-centric (not that *I* mind :-).

>+  } else if (!next) {
>+    // Create the new entry in our list.
>+    next = new (mPresContext) nsRuleNode(mPresContext, aRule, this);
>+    SetChildrenList(new (mPresContext) nsRuleList(next, ChildrenList()));
>   }

Beware new failure here, return NS_ERROR_OUT_OF_MEMORY?

>+void
>+nsRuleNode::ConvertChildrenToHash()
>+{
>+  NS_ASSERTION(!ChildrenAreHashed() && HaveChildren(),
>+               "must have a non-empty list of children");
>+  PLDHashTable *hash = PL_NewDHashTable(&ChildrenHashOps, nsnull,
>+                                sizeof(ChildrenHashEntry), kMaxChildrenInList);

Nit: underhanging indent should line up with initial parameter column.

>+  for (nsRuleList* curr = ChildrenList(); curr;
>+       curr = curr->DestroySelf(mPresContext)) {
>+    ChildrenHashEntry *entry = NS_STATIC_CAST(ChildrenHashEntry*,
>+        PL_DHashTableOperate(hash, curr->mRuleNode->mRule, PL_DHASH_ADD));

If (!entry), return NS_ERROR_OUT_OF_MEMORY here.

>+    NS_ASSERTION(!entry->mRuleNode, "duplicate entries in list");
>+    entry->mRuleNode = curr->mRuleNode;
>+  }
>+  SetChildrenHash(hash);
>+}

Looks good otherwise, r/sr=brendan@mozilla.org -- do you not want your #ifdef
DEBUG_dbaron instrumentation, for sharing and posterity?

/be
Attachment #74628 - Flags: superreview+
Forgot to say "agree on the kMaxChildrenInList*4" change dbaron mentioned in
comment #16.

Hyatt, can you r= so we can get this into 1.0?

/be
I added an out-of-memory check that Brendan didn't mention, and I omitted the
out-of-memory check within the loop in ConvertChildrenToHash, since the initial
capacity of the table ensures that we won't run out.
Attachment #74628 - Attachment is obsolete: true
Comment on attachment 74659 [details] [diff] [review]
patch addressing Brendan's comments

Right, I mentioned there were other operator new calls to check -- then I
forgot about the preallocated table size being big enough (I'm hacking similar
code in js/src, I should have known).  This is all good.

/be
Attachment #74659 - Flags: superreview+
Great perf win for 54542, dbaron!  A very good idea.  As for alignment -
nsSmallVoidArray (which once was nsCheapVoidArray in content) holds child
pointers, and that already assumes that object (element) pointers are mod 2 (and
has assertions to that effect).
 r=hyatt
Comment on attachment 74659 [details] [diff] [review]
patch addressing Brendan's comments

a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #74659 - Flags: approval+
Fix checked in 2002-03-18 17:29 PST.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
The page given in the initial description still takes nearly 2 minutes (about
110s) to load on my 1.1GHz machine with 512MB of RAM and a cable connection.

Lynx did it in just over a minute (about 70s).

Should I reopen this bug or file a new one?

(Note: the page is around 5 MB.)
profile now shows the usual "large page" stuff -- lots of hits all over, nothing
standing out...