Last Comment Bug 240884 - (indexof) Thoughts on making IndexOf() suck less
(indexof)
: Thoughts on making IndexOf() suck less
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: DOM (show other bugs)
: Trunk
: x86 Linux
: -- normal (vote)
: ---
Assigned To: Jonas Sicking (:sicking) PTO Until July 5th
: Hixie (not reading bugmail)
Mentors:
Depends on:
Blocks: 237735
  Show dependency treegraph
 
Reported: 2004-04-18 09:46 PDT by Boris Zbarsky [:bz]
Modified: 2006-02-20 15:52 PST (History)
13 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Data (257 bytes, text/plain)
2004-04-18 15:26 PDT, Boris Zbarsky [:bz]
no flags Details
patch v1 (3.85 KB, patch)
2005-03-04 17:13 PST, Jonas Sicking (:sicking) PTO Until July 5th
bzbarsky: review+
bzbarsky: superreview+
Details | Diff | Splinter Review
raw testdata (109.12 KB, text/plain)
2005-03-10 16:27 PST, Jonas Sicking (:sicking) PTO Until July 5th
no flags Details
perlscript used to analyze stats (408 bytes, text/plain)
2005-11-17 20:08 PST, Jonas Sicking (:sicking) PTO Until July 5th
no flags Details

Description Boris Zbarsky [:bz] 2004-04-18 09:46:49 PDT
The basic problem is that our internal APIs are array-based, while the DOM has
things like nextSibling/prevSibling, and insertion based on another node's
location, instead of at an index.  So we end up calling nsIContent::IndexOf() a
lot.  And if there are a lot of kids, that can suck.

So my proposal is as following.  Create a class static hashtable in
nsAttrAndChildArray that maps nsAttrAndChildArray* to PRInt32.  Whenever an
operation is performed on nsAttrAndChildArray (or maybe only when ChildAt() and
IndexOfChild() are performed?) check whether we have more than N (needs tuning)
children, and if so save the index of the slot the operation affected.

Then when IndexOfChild() is called and we have more than N children, instead of
doing a linear sweep get the last accessed index from the hashtable and do an
outward sweep from it.  Pseudocode:

  PRInt32 cursor = GetCursorFromHashtable();
  PRInt32 increment = 1;
  PRInt32 childCount = GetChildCount();
  while (cursor >= 0 && cursor < childCount) {
    // Check child at cursor;
    cursor += increment;
    increment = - (increment + 1);
  }
  // If we get here, we ran into one of the ends of the array; search what
  // remains linearly, starting at the end nearest the cursor.

The idea is to exploit locality of reference; usually whatever index we want
will probably be close to whatever index we last worked with.  That certainly
holds for things like editor transactions, and I suspect it's true for DOM
access in general.

Thoughts?
Comment 1 Boris Zbarsky [:bz] 2004-04-18 15:24:23 PDT
I instrumented nsGenericElement::~nsGenericElement() by adding the following
line at the beginning:

   fprintf(stdout, "%d\n", GetChildCount());

Then loaded the following pages:

http://www.cnn.com
http://www.nytimes.com
http://news.bbc.co.uk
http://lxr.mozilla.org/seamonkey/source/layout/html/style/src/nsCSSFrameConstructor.cpp
http://www.tomshardware.com/

and then ran the resulting log file through:

perl -ne 'chomp; $test{$_}++;  END { for $i (keys %test) { print $i, " ",
$test{$i}, "\n"; } }' ~/log.txt | sort -n > ~/data.txt

I'll attach the results in a sec.
Comment 2 Boris Zbarsky [:bz] 2004-04-18 15:26:39 PDT
Created attachment 146443 [details]
Data

Conclusions:  most elements have either no kids or one kid.  Above about 12
kids the distribution is pretty flat ("a few here and there").	The one node
with ridiculous numbers of kids is the <pre> on the nsCSSFrameConstructor page,
probably.
Comment 3 Boris Zbarsky [:bz] 2004-04-18 15:31:48 PDT
More conclusions:

97% of all nodes in that test have 0 or 1 kids
99.84% have no more than 12 kids
A single node has 60% of the kids (the aforementioned <pre>).
Comment 4 Brendan Eich [:brendan] 2004-04-18 16:16:28 PDT
bz: so your idea of 20 as the hashing threshold is not a bad SWAG.  It doesn't
matter, so long as the threshold is not too close to 1.

/be
Comment 5 Boris Zbarsky [:bz] 2004-04-18 16:33:27 PDT
I had another thought.  Why hash?  We have a perfectly good array sitting there
that we could probably use to store the "last accessed" slot number....
Comment 6 Boris Zbarsky [:bz] 2004-04-18 16:56:38 PDT
In fact, we could steal a bit from mAttrAndChildCount to indicate whether we
have a child cursor.....
Comment 7 Boris Zbarsky [:bz] 2004-04-18 17:37:18 PDT
While I'm having evil thoughts....  At the moment, mAttrAndChildCount uses 22
bits for the child count and 10 bits for the attr count.  If I change that to
use 19 for child count and 10 for attr count, that's three bits I can use.  I
can steal two bits each from mImpl and mImpl->mMappedAttrs.  That's 7 bits. 
Stealing 12 more bits from mBufferSize gives me the 19 bits I need to represent
anything up to the child count.  That leaves 20 bits for mBufferSize, which is
enough, since the max buffer size we may need would be (childcount + attrcount *
attrsize), attrsize is two, and 19 bits plus two times 11 bits can be
represented in 20 bits....

If this were an obfuscated C++ contest (and we were ok to limiting to 500,000
kids), I think I'd write it this way.  Failing that, I'm currently planning to
steal a single bit from childcount to flag whether we have a cursor, and store
the cursor itself as the last entry in mImpl->mBuffer.  That will mean some
surgery on various code that makes assumptions about what lives where in
mBuffer, but.....  Sicking, sound reasonable?

Or should I just use a class-static hashtable as I was going to originally?
Comment 8 Jonas Sicking (:sicking) PTO Until July 5th 2004-04-18 20:23:33 PDT
My first question is, is IndexOf showing up high on profiles, and if it is could
clients be fixed rather then "fixing" IndexOf. This seems like quite a lot of
complexity if it just saves us 1-2%.

Other then that I think i like the hash-idea better. IMHO the current 22/10 bits
for attr and child counts are cutting it low as it is. Especially seing we in
the future probably want to inline a buffer in nsAttrAndChildArray, which is
going to complicate the code quite a bit anyway due to trying to squeeze bits.
Comment 9 Boris Zbarsky [:bz] 2004-04-18 20:43:25 PDT
> My first question is, is IndexOf showing up high on profiles

Reasonable question.  See the profile in the bug this one blocks.  With the
fixes I've made to speed up other top things in that profile recently, the
IndexOf fraction of the time is about 10%.  It's the #1 item in the flat
profile.  Granted, one could argue that this is a broken client and a pretty
weird case in general.

> and if it is could clients be fixed rather then "fixing" IndexOf. 

The basic problem is that most operations using the DOM apis require use of
IndexOf().....  See bug 240292 comment 2.  I don't see a way to fix such clients
short of moving them off the DOM apis or making them use fewer nodes.  The
latter we may be able to do with editor, with a lot of pain (and the former,
maybe, but the editor folks will oppose it bitterly).  But you can't do that for
actual web pages.  Granted, not many of these use that many nodes.

> This seems like quite a lot of complexity if it just saves us 1-2%.

If we do use a hashtable, the code involved will be tiny (a few hundred bytes;
just the modified search code).  If we set the threshold for it kicking in high
enough, it won't really be hit except on pathological examples where it's really
needed.  This is not to say that it's worth the trouble, necessarily, but I
wanted to at least have some discussion on the issue.

> Other then that I think i like the hash-idea better.

Ok, sounds good.
Comment 10 Jonas Sicking (:sicking) PTO Until July 5th 2004-04-18 21:03:31 PDT
> > and if it is could clients be fixed rather then "fixing" IndexOf. 
> 
> The basic problem is that most operations using the DOM apis require use of
> IndexOf().....  See bug 240292 comment 2.  I don't see a way to fix such clients
> short of moving them off the DOM apis or making them use fewer nodes.

Hmm.. moving off DOM apis sounds like a good idea for stuff that we're planning
on shipping with GRE, which would at least include parts of editor.

> The
> latter we may be able to do with editor, with a lot of pain (and the former,
> maybe, but the editor folks will oppose it bitterly).

Editor needs to stopp doing this <br> madness anyway. So if that's all that it
takes I don't see it as a showstopper.
Comment 11 Boris Zbarsky [:bz] 2004-04-18 21:24:15 PDT
> Hmm.. moving off DOM apis sounds like a good idea

The editor is _very_ purposefully built around frozen apis as much as possible.
 We can't expect to reserve the right to change nsIContent at whim (and change
it so) and at the same time expect people who don't want to rewrite their code
every two months to use it.

And frankly, I'm opposed to anyone outside gklayout using nsIContent on general
principle.  We have a perfectly good frozen, external API to the DOM tree, and
that's what external consumers should be using.  It's our job, as implementors
of this API, to do a half-decent job of it instead of solving all problems by
telling people not to use it.

> Editor needs to stopp doing this <br> madness anyway.

We keep saying this.  Let's see it happen.  Bug 240933 filed.

For the rest, I'm fine on holding off on this till we get some DOM performace
tests up and running; at that point, I think I'd like to give this a shot just
to see whether it has any effect (probably not, on small testcases).
Comment 12 Peter Van der Beken [:peterv] - away till Aug 1st 2004-04-19 05:12:08 PDT
> And frankly, I'm opposed to anyone outside gklayout using nsIContent
> on general principle.  We have a perfectly good frozen, external API
> to the DOM tree, and that's what external consumers should be using. 
> It's our job, as implementors of this API, to do a half-decent job of
> it instead of solving all problems by telling people not to use it.

I think that's a bit extreme. I'm all for optimizing our DOM API and leaving
clients the choice of sticking with a frozen, stable API without too much of a
performance cost. On the other hand using that API comes with a cost and I doubt
you can really take it all away. A lot of the error checks that the DOM API
mandates are unneeded if the client code is well written, and keeping some of
the Gecko implementation details in mind allows good optimizations too, even
though they impose the burden of following the changes to that implementation.
Clients that want to optimize performance by using nsIDocument, nsIContent,
nsIAttribute, ... should be given that option. That's the choice we made for the
XSLT processor and there's other code that could make the same choice IMHO.
Comment 13 Jonas Sicking (:sicking) PTO Until July 5th 2004-04-19 08:49:47 PDT
I agree with peterv. Though limited to code that we ship with "gecko" since such
code should be updated together with the reset of "gecko". In other words, we
don't want modules that might be running on different versions of gecko to use
nsIContent etc.

So in the context of editor, i'd think it would be fine to let libeditor use
nsIContent since that is always shipped with "gecko", however Composer should
probably avoid doing so so that it is possible to upgrade Composer and GRE
separatly.
Comment 14 Boris Zbarsky [:bz] 2004-04-28 19:37:49 PDT
One other thought on indexof.  It may turn out that some of the CSS3 selectors
require nontrivial use of indexof...
Comment 15 Boris Zbarsky [:bz] 2005-02-03 21:35:14 PST
Talked to sicking some more.  He's thinking a small static LRU cache is the way
to go here.
Comment 16 Jonas Sicking (:sicking) PTO Until July 5th 2005-02-03 22:16:32 PST
Static or per document sounds fine either way to me
Comment 17 Jonas Sicking (:sicking) PTO Until July 5th 2005-03-02 15:10:14 PST
I started working on this. I'm not sure adding stuff to this cache during
ChildAt or even GetSafeChildAt is a good idea. The things is that we call those
functions a ton so we might take a hit in a general case but only have a win for
the cases where IndexOf is used (which ideally should be rare). So i'm going to
try to update the indexcache in IndexOf only.
Comment 18 Jonas Sicking (:sicking) PTO Until July 5th 2005-03-04 17:13:38 PST
Created attachment 176334 [details] [diff] [review]
patch v1

I havn't run any numbers on this yet, but just stepping through the code this
looks pretty nice actually. I might defenetly be worth to play around with the
define parameters though.

One bad thing is that this "outward searching" quickly gets inefficient. For
example if the new child is just 10 indexes away we end up searching 20 kids in
a fairly inefficient manner (the outward search loop is much more complex then
a normal loop). So in many cases it might have just been more efficient to do a
normal search. OTOH, the extra complexity is stuff that modern processors
should be extreamly good at optimizing.

Hmm.. and come to think of it.. if I duplicate some code in there I could get a
more optimal loop... Not sure if it's worth it.

Something to try might be to just seach from the cached index and forward. Or
possibly from one back from the cached index and forward to catch reversed
iteration. Once I get some measuring code in there I'll try some different
algorithms.
Comment 19 Jonas Sicking (:sicking) PTO Until July 5th 2005-03-10 16:27:03 PST
Created attachment 177068 [details]
raw testdata

I ran a normal browsing session (that didn't include typeaheadfind or
nsCSSFrameConstructor) and logged all IndexOf calls. I then ran some stats on
the difference between the found indexes between calls to the same
nsAttrAndChildArray and got the following stats:

nf: 851
-19: 4
-17: 3
-16: 2
-11: 10
-10: 9
-8: 2
-6: 5
-5: 3
-4: 57
-3: 11
-2: 114
-1: 65
0: 5208
1: 462
2: 250
3: 4
4: 44
6: 4
8: 2
9: 2
10: 4
11: 12
13: 2
14: 6
16: 2
82: 2


I've filtered out. So in other words there were 3 calls to IndexOf where the
found index was 20 higher then a cached one (i filtered out all the 1s to keep
the data overviewable). 'nf' is how many times the searched-for child wasn't
found.

It's pretty interesting that 0 is by far the most common delta. I.e. most of
the time the returned index is the same as last time we called. So with caching
we'll hit the right index on the first try. It's also interesting that the
found index is often just before the cached index.

If I just get stats for cases where the number of children is >=15 the stats
look like:

nf: 27
-19: 4
-17: 3
-16: 2
-11: 10
-10: 9
-6: 2
-3: 6
-2: 6
-1: 7
0: 612
1: 45
2: 84
3: 3
4: 12
9: 2
10: 4
11: 12
13: 2
14: 6
16: 2
82: 2


I'll attach the raw data that I used to make these stats. The format is:
<thispointer> <number of children> <found index>.

I also ran some stats with nsCSSFrameConstrucor and some typeaheadfind. The
basic result of that is that typeaheadfind causes tons and tons of indexof
calls and seem to be stepping through the tree both forward and backward in a
very inefficient manner...
Comment 20 Jonas Sicking (:sicking) PTO Until July 5th 2005-03-10 16:30:24 PST
Comment on attachment 176334 [details] [diff] [review]
patch v1

Given that we so often find the same index as the one that's in the cache i
think we should lower the child-limit to 15. Other then that this patch is
fine.

I could save a few cycles by duplicating some code, but I don't think it's
worth it.
Comment 21 Boris Zbarsky [:bz] 2005-03-10 20:51:57 PST
Comment on attachment 176334 [details] [diff] [review]
patch v1

>Index: nsAttrAndChildArray.cpp
>+// This is inited to all zeroes since it's on the stack.

You mean "since it's static", right?

>+AddIndexToCache(const nsAttrAndChildArray* aArray, PRInt32 aIndex)
>+{
>+  if (indexCache[0].array != aArray) {
>+    PRUint32 i, last = NUM_INDEX_CACHE_SLOTS - 1;

Why bother with i?  You have the invariant that after the loop is done, last ==
i (as long as NUM_INDEX_CACHE_SLOTS > 1), so why not just have "last" as the
loop variable and assert that NUM_INDEX_CACHE_SLOTS > 1?

>+  // Use unsigned here since we compare count to cursor which has to be signed

You mean "use signed", right?

>+    PRInt32 inc = 1, sign = 1;
>+    // Seek outward from the last found index

Document that the inc/sign stuff just makes sure that inc changes sign every
time through and that the absolute value of inc increments by 1 each time
through?

>+    // We ran into one 'edge'. Add inc to cursor ones more to get back to

"once more"

>+    cursor += inc;

Want to assert here that "cursor" is in bounds after we do that, just in case? 
Wouldn't want someone to come along later and screw this up.

>+    if (sign > 0) {
>+      for (; cursor < count; ++cursor) {
>+        if (children[cursor] == aPossibleChild) {
>+          AddIndexToCache(this, cursor);
>+
>+          return NS_STATIC_CAST(PRInt32, cursor);
>+        }
>+      }
>+    }
>+    else {
>+      for (; cursor >= 0; --cursor) {
>+        if (children[cursor] == aPossibleChild) {
>+          AddIndexToCache(this, cursor);
>+
>+          return NS_STATIC_CAST(PRInt32, cursor);
>+        }
>+      }
>+    }

Another option is:

PRInt32 limit = (sign + 1) * (count + 1) - 1;
for (; cursor != limit; cursor += sign) {
  // loop body here
}

Avoids having to do that one branch and reduces code, but may be too obfuscated
for our taste, have slower increment, and have that random multiplication in
it...  Your call, anyway.

>   for (i = 0; i < count; ++i) {
>     if (children[i] == aPossibleChild) {
>+      if (count >= INDEX_CACHE_CHILD_LIMIT) {
>+        AddIndexToCache(this, i);

No need for this.  If that test were true, we would have returned by now (see
return -1 at the end of the big branch you added).

r+sr=bzbarsky with these addressed.  No strong opinions on 15 vs 20; pick one
based on what the data looks like?
Comment 22 Jonas Sicking (:sicking) PTO Until July 5th 2005-03-11 01:34:57 PST
(In reply to comment #21)
> >+    cursor += inc;
> 
> Want to assert here that "cursor" is in bounds after we do that, just in case? 
> Wouldn't want someone to come along later and screw this up.

It's perfectly valid that the cursor is out of bounds here, it just means that
the cached index was right in the middle of the childlist. The for-loops below
will deal.

> Another option is:
> 
> PRInt32 limit = (sign + 1) * (count + 1) - 1;
> for (; cursor != limit; cursor += sign) {
>   // loop body here
> }
> 
> Avoids having to do that one branch and reduces code, but may be too obfuscated
> for our taste, have slower increment, and have that random multiplication in
> it...  Your call, anyway.

I'll keep what was there. You need a divide by 2 in there too so that you don't
get count*2 in the limit. And this is a very common path since we'll get here
whenever no cached index was found.

> >   for (i = 0; i < count; ++i) {
> >     if (children[i] == aPossibleChild) {
> >+      if (count >= INDEX_CACHE_CHILD_LIMIT) {
> >+        AddIndexToCache(this, i);
> 
> No need for this.  If that test were true, we would have returned by now (see
> return -1 at the end of the big branch you added).

Doh! The code was written differently initially.

Did everything else.
Comment 23 Boris Zbarsky [:bz] 2005-03-11 03:04:48 PST
> It's perfectly valid that the cursor is out of bounds here,

Yep.  Realized that a few hours too late.

> You need a divide by 2 in there too 

Same here.
Comment 24 Jonas Sicking (:sicking) PTO Until July 5th 2005-03-11 15:22:29 PST
Checked in. It might be worth sometime to tweak the two defines and see if they
affect any of the critical usecases we find. But I feel pretty confident that
the current algorithm is a good one.
Comment 25 Jonas Sicking (:sicking) PTO Until July 5th 2005-11-17 20:08:55 PST
Created attachment 203489 [details]
perlscript used to analyze stats

Found the perlscript I used while fixing this bug. Attaching here so that I can kill it off my drive.
Comment 26 Jonas Sicking (:sicking) PTO Until July 5th 2006-02-20 15:52:46 PST
An even better cache implementation is done in bug 327969

Note You need to log in before you can comment on or make changes to this bug.