240884 - (indexof) Thoughts on making IndexOf() suck less

Reporter

Description

•

21 years ago

The basic problem is that our internal APIs are array-based, while the DOM has things like nextSibling/prevSibling, and insertion based on another node's location, instead of at an index. So we end up calling nsIContent::IndexOf() a lot. And if there are a lot of kids, that can suck. So my proposal is as following. Create a class static hashtable in nsAttrAndChildArray that maps nsAttrAndChildArray* to PRInt32. Whenever an operation is performed on nsAttrAndChildArray (or maybe only when ChildAt() and IndexOfChild() are performed?) check whether we have more than N (needs tuning) children, and if so save the index of the slot the operation affected. Then when IndexOfChild() is called and we have more than N children, instead of doing a linear sweep get the last accessed index from the hashtable and do an outward sweep from it. Pseudocode: PRInt32 cursor = GetCursorFromHashtable(); PRInt32 increment = 1; PRInt32 childCount = GetChildCount(); while (cursor >= 0 && cursor < childCount) { // Check child at cursor; cursor += increment; increment = - (increment + 1); } // If we get here, we ran into one of the ends of the array; search what // remains linearly, starting at the end nearest the cursor. The idea is to exploit locality of reference; usually whatever index we want will probably be close to whatever index we last worked with. That certainly holds for things like editor transactions, and I suspect it's true for DOM access in general. Thoughts?

Boris Zbarsky [:bzbarsky]

Reporter

Comment 1

•

21 years ago

I instrumented nsGenericElement::~nsGenericElement() by adding the following line at the beginning: fprintf(stdout, "%d\n", GetChildCount()); Then loaded the following pages: http://www.cnn.com http://www.nytimes.com http://news.bbc.co.uk http://lxr.mozilla.org/seamonkey/source/layout/html/style/src/nsCSSFrameConstructor.cpp http://www.tomshardware.com/ and then ran the resulting log file through: perl -ne 'chomp; $test{$_}++; END { for $i (keys %test) { print $i, " ", $test{$i}, "\n"; } }' ~/log.txt | sort -n > ~/data.txt I'll attach the results in a sec.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 2

•

21 years ago

Attached file Data — Details

Conclusions: most elements have either no kids or one kid. Above about 12 kids the distribution is pretty flat ("a few here and there"). The one node with ridiculous numbers of kids is the <pre> on the nsCSSFrameConstructor page, probably.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 3

•

21 years ago

More conclusions: 97% of all nodes in that test have 0 or 1 kids 99.84% have no more than 12 kids A single node has 60% of the kids (the aforementioned <pre>).

Brendan Eich [:brendan]

Comment 4

•

21 years ago

bz: so your idea of 20 as the hashing threshold is not a bad SWAG. It doesn't matter, so long as the threshold is not too close to 1. /be

Boris Zbarsky [:bzbarsky]

Reporter

Comment 5

•

21 years ago

I had another thought. Why hash? We have a perfectly good array sitting there that we could probably use to store the "last accessed" slot number....

Boris Zbarsky [:bzbarsky]

Reporter

Comment 6

•

21 years ago

In fact, we could steal a bit from mAttrAndChildCount to indicate whether we have a child cursor.....

Boris Zbarsky [:bzbarsky]

Reporter

Comment 7

•

21 years ago

While I'm having evil thoughts.... At the moment, mAttrAndChildCount uses 22 bits for the child count and 10 bits for the attr count. If I change that to use 19 for child count and 10 for attr count, that's three bits I can use. I can steal two bits each from mImpl and mImpl->mMappedAttrs. That's 7 bits. Stealing 12 more bits from mBufferSize gives me the 19 bits I need to represent anything up to the child count. That leaves 20 bits for mBufferSize, which is enough, since the max buffer size we may need would be (childcount + attrcount * attrsize), attrsize is two, and 19 bits plus two times 11 bits can be represented in 20 bits.... If this were an obfuscated C++ contest (and we were ok to limiting to 500,000 kids), I think I'd write it this way. Failing that, I'm currently planning to steal a single bit from childcount to flag whether we have a cursor, and store the cursor itself as the last entry in mImpl->mBuffer. That will mean some surgery on various code that makes assumptions about what lives where in mBuffer, but..... Sicking, sound reasonable? Or should I just use a class-static hashtable as I was going to originally?

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 8

•

21 years ago

My first question is, is IndexOf showing up high on profiles, and if it is could clients be fixed rather then "fixing" IndexOf. This seems like quite a lot of complexity if it just saves us 1-2%. Other then that I think i like the hash-idea better. IMHO the current 22/10 bits for attr and child counts are cutting it low as it is. Especially seing we in the future probably want to inline a buffer in nsAttrAndChildArray, which is going to complicate the code quite a bit anyway due to trying to squeeze bits.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 9

•

21 years ago

> My first question is, is IndexOf showing up high on profiles Reasonable question. See the profile in the bug this one blocks. With the fixes I've made to speed up other top things in that profile recently, the IndexOf fraction of the time is about 10%. It's the #1 item in the flat profile. Granted, one could argue that this is a broken client and a pretty weird case in general. > and if it is could clients be fixed rather then "fixing" IndexOf. The basic problem is that most operations using the DOM apis require use of IndexOf()..... See bug 240292 comment 2. I don't see a way to fix such clients short of moving them off the DOM apis or making them use fewer nodes. The latter we may be able to do with editor, with a lot of pain (and the former, maybe, but the editor folks will oppose it bitterly). But you can't do that for actual web pages. Granted, not many of these use that many nodes. > This seems like quite a lot of complexity if it just saves us 1-2%. If we do use a hashtable, the code involved will be tiny (a few hundred bytes; just the modified search code). If we set the threshold for it kicking in high enough, it won't really be hit except on pathological examples where it's really needed. This is not to say that it's worth the trouble, necessarily, but I wanted to at least have some discussion on the issue. > Other then that I think i like the hash-idea better. Ok, sounds good.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 10

•

21 years ago

> > and if it is could clients be fixed rather then "fixing" IndexOf. > > The basic problem is that most operations using the DOM apis require use of > IndexOf()..... See bug 240292 comment 2. I don't see a way to fix such clients > short of moving them off the DOM apis or making them use fewer nodes. Hmm.. moving off DOM apis sounds like a good idea for stuff that we're planning on shipping with GRE, which would at least include parts of editor. > The > latter we may be able to do with editor, with a lot of pain (and the former, > maybe, but the editor folks will oppose it bitterly). Editor needs to stopp doing this <br> madness anyway. So if that's all that it takes I don't see it as a showstopper.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 11

•

21 years ago

> Hmm.. moving off DOM apis sounds like a good idea The editor is _very_ purposefully built around frozen apis as much as possible. We can't expect to reserve the right to change nsIContent at whim (and change it so) and at the same time expect people who don't want to rewrite their code every two months to use it. And frankly, I'm opposed to anyone outside gklayout using nsIContent on general principle. We have a perfectly good frozen, external API to the DOM tree, and that's what external consumers should be using. It's our job, as implementors of this API, to do a half-decent job of it instead of solving all problems by telling people not to use it. > Editor needs to stopp doing this <br> madness anyway. We keep saying this. Let's see it happen. Bug 240933 filed. For the rest, I'm fine on holding off on this till we get some DOM performace tests up and running; at that point, I think I'd like to give this a shot just to see whether it has any effect (probably not, on small testcases).

Peter Van der Beken [:peterv]

Comment 12

•

21 years ago

> And frankly, I'm opposed to anyone outside gklayout using nsIContent > on general principle. We have a perfectly good frozen, external API > to the DOM tree, and that's what external consumers should be using. > It's our job, as implementors of this API, to do a half-decent job of > it instead of solving all problems by telling people not to use it. I think that's a bit extreme. I'm all for optimizing our DOM API and leaving clients the choice of sticking with a frozen, stable API without too much of a performance cost. On the other hand using that API comes with a cost and I doubt you can really take it all away. A lot of the error checks that the DOM API mandates are unneeded if the client code is well written, and keeping some of the Gecko implementation details in mind allows good optimizations too, even though they impose the burden of following the changes to that implementation. Clients that want to optimize performance by using nsIDocument, nsIContent, nsIAttribute, ... should be given that option. That's the choice we made for the XSLT processor and there's other code that could make the same choice IMHO.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 13

•

21 years ago

I agree with peterv. Though limited to code that we ship with "gecko" since such code should be updated together with the reset of "gecko". In other words, we don't want modules that might be running on different versions of gecko to use nsIContent etc. So in the context of editor, i'd think it would be fine to let libeditor use nsIContent since that is always shipped with "gecko", however Composer should probably avoid doing so so that it is possible to upgrade Composer and GRE separatly.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 14

•

21 years ago

One other thought on indexof. It may turn out that some of the CSS3 selectors require nontrivial use of indexof...

Boris Zbarsky [:bzbarsky]

Reporter

Comment 15

•

21 years ago

Talked to sicking some more. He's thinking a small static LRU cache is the way to go here.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 16

•

21 years ago

Static or per document sounds fine either way to me

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 17

•

21 years ago

I started working on this. I'm not sure adding stuff to this cache during ChildAt or even GetSafeChildAt is a good idea. The things is that we call those functions a ton so we might take a hit in a general case but only have a win for the cases where IndexOf is used (which ideally should be rare). So i'm going to try to update the indexcache in IndexOf only.

Assignee: general → bugmail

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 18

•

21 years ago

Attached patch patch v1 — Details — Splinter Review

I havn't run any numbers on this yet, but just stepping through the code this looks pretty nice actually. I might defenetly be worth to play around with the define parameters though. One bad thing is that this "outward searching" quickly gets inefficient. For example if the new child is just 10 indexes away we end up searching 20 kids in a fairly inefficient manner (the outward search loop is much more complex then a normal loop). So in many cases it might have just been more efficient to do a normal search. OTOH, the extra complexity is stuff that modern processors should be extreamly good at optimizing. Hmm.. and come to think of it.. if I duplicate some code in there I could get a more optimal loop... Not sure if it's worth it. Something to try might be to just seach from the cached index and forward. Or possibly from one back from the cached index and forward to catch reversed iteration. Once I get some measuring code in there I'll try some different algorithms.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Updated

•

21 years ago

Alias: indexof

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 19

•

21 years ago

Attached file raw testdata — Details

I ran a normal browsing session (that didn't include typeaheadfind or nsCSSFrameConstructor) and logged all IndexOf calls. I then ran some stats on the difference between the found indexes between calls to the same nsAttrAndChildArray and got the following stats: nf: 851 -19: 4 -17: 3 -16: 2 -11: 10 -10: 9 -8: 2 -6: 5 -5: 3 -4: 57 -3: 11 -2: 114 -1: 65 0: 5208 1: 462 2: 250 3: 4 4: 44 6: 4 8: 2 9: 2 10: 4 11: 12 13: 2 14: 6 16: 2 82: 2 I've filtered out. So in other words there were 3 calls to IndexOf where the found index was 20 higher then a cached one (i filtered out all the 1s to keep the data overviewable). 'nf' is how many times the searched-for child wasn't found. It's pretty interesting that 0 is by far the most common delta. I.e. most of the time the returned index is the same as last time we called. So with caching we'll hit the right index on the first try. It's also interesting that the found index is often just before the cached index. If I just get stats for cases where the number of children is >=15 the stats look like: nf: 27 -19: 4 -17: 3 -16: 2 -11: 10 -10: 9 -6: 2 -3: 6 -2: 6 -1: 7 0: 612 1: 45 2: 84 3: 3 4: 12 9: 2 10: 4 11: 12 13: 2 14: 6 16: 2 82: 2 I'll attach the raw data that I used to make these stats. The format is: <thispointer> <number of children> <found index>. I also ran some stats with nsCSSFrameConstrucor and some typeaheadfind. The basic result of that is that typeaheadfind causes tons and tons of indexof calls and seem to be stepping through the tree both forward and backward in a very inefficient manner...

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 20

•

21 years ago

Comment on attachment 176334 [details] [diff] [review] patch v1 Given that we so often find the same index as the one that's in the cache i think we should lower the child-limit to 15. Other then that this patch is fine. I could save a few cycles by duplicating some code, but I don't think it's worth it.

Attachment #176334 - Flags: superreview?(bzbarsky)

Attachment #176334 - Flags: review?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Reporter

Comment 21

•

21 years ago

Comment on attachment 176334 [details] [diff] [review] patch v1 >Index: nsAttrAndChildArray.cpp >+// This is inited to all zeroes since it's on the stack. You mean "since it's static", right? >+AddIndexToCache(const nsAttrAndChildArray* aArray, PRInt32 aIndex) >+{ >+ if (indexCache[0].array != aArray) { >+ PRUint32 i, last = NUM_INDEX_CACHE_SLOTS - 1; Why bother with i? You have the invariant that after the loop is done, last == i (as long as NUM_INDEX_CACHE_SLOTS > 1), so why not just have "last" as the loop variable and assert that NUM_INDEX_CACHE_SLOTS > 1? >+ // Use unsigned here since we compare count to cursor which has to be signed You mean "use signed", right? >+ PRInt32 inc = 1, sign = 1; >+ // Seek outward from the last found index Document that the inc/sign stuff just makes sure that inc changes sign every time through and that the absolute value of inc increments by 1 each time through? >+ // We ran into one 'edge'. Add inc to cursor ones more to get back to "once more" >+ cursor += inc; Want to assert here that "cursor" is in bounds after we do that, just in case? Wouldn't want someone to come along later and screw this up. >+ if (sign > 0) { >+ for (; cursor < count; ++cursor) { >+ if (children[cursor] == aPossibleChild) { >+ AddIndexToCache(this, cursor); >+ >+ return NS_STATIC_CAST(PRInt32, cursor); >+ } >+ } >+ } >+ else { >+ for (; cursor >= 0; --cursor) { >+ if (children[cursor] == aPossibleChild) { >+ AddIndexToCache(this, cursor); >+ >+ return NS_STATIC_CAST(PRInt32, cursor); >+ } >+ } >+ } Another option is: PRInt32 limit = (sign + 1) * (count + 1) - 1; for (; cursor != limit; cursor += sign) { // loop body here } Avoids having to do that one branch and reduces code, but may be too obfuscated for our taste, have slower increment, and have that random multiplication in it... Your call, anyway. > for (i = 0; i < count; ++i) { > if (children[i] == aPossibleChild) { >+ if (count >= INDEX_CACHE_CHILD_LIMIT) { >+ AddIndexToCache(this, i); No need for this. If that test were true, we would have returned by now (see return -1 at the end of the big branch you added). r+sr=bzbarsky with these addressed. No strong opinions on 15 vs 20; pick one based on what the data looks like?

Attachment #176334 - Flags: superreview?(bzbarsky)

Attachment #176334 - Flags: superreview+

Attachment #176334 - Flags: review?(bzbarsky)

Attachment #176334 - Flags: review+

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 22

•

21 years ago

(In reply to comment #21) > >+ cursor += inc; > > Want to assert here that "cursor" is in bounds after we do that, just in case? > Wouldn't want someone to come along later and screw this up. It's perfectly valid that the cursor is out of bounds here, it just means that the cached index was right in the middle of the childlist. The for-loops below will deal. > Another option is: > > PRInt32 limit = (sign + 1) * (count + 1) - 1; > for (; cursor != limit; cursor += sign) { > // loop body here > } > > Avoids having to do that one branch and reduces code, but may be too obfuscated > for our taste, have slower increment, and have that random multiplication in > it... Your call, anyway. I'll keep what was there. You need a divide by 2 in there too so that you don't get count*2 in the limit. And this is a very common path since we'll get here whenever no cached index was found. > > for (i = 0; i < count; ++i) { > > if (children[i] == aPossibleChild) { > >+ if (count >= INDEX_CACHE_CHILD_LIMIT) { > >+ AddIndexToCache(this, i); > > No need for this. If that test were true, we would have returned by now (see > return -1 at the end of the big branch you added). Doh! The code was written differently initially. Did everything else.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 23

•

21 years ago

> It's perfectly valid that the cursor is out of bounds here, Yep. Realized that a few hours too late. > You need a divide by 2 in there too Same here.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 24

•

21 years ago

Checked in. It might be worth sometime to tweak the two defines and see if they affect any of the critical usecases we find. But I feel pretty confident that the current algorithm is a good one.

Status: NEW → RESOLVED

Closed: 21 years ago

Resolution: --- → FIXED

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 25

•

20 years ago

Attached file perlscript used to analyze stats — Details

Found the perlscript I used while fixing this bug. Attaching here so that I can kill it off my drive.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Assignee

Comment 26

•

20 years ago

An even better cache implementation is done in bug 327969

Nobody; OK to take it and work on it

Updated

•

7 years ago

Component: DOM → DOM: Core & HTML

Data 21 years ago Boris Zbarsky [:bzbarsky] 257 bytes, text/plain		Details
patch v1 21 years ago Jonas Sicking (:sicking) No longer reading bugmail consistently 3.85 KB, patch	bzbarsky : review+ bzbarsky : superreview+	Details \| Diff \| Splinter Review
raw testdata 21 years ago Jonas Sicking (:sicking) No longer reading bugmail consistently 109.12 KB, text/plain		Details
perlscript used to analyze stats 20 years ago Jonas Sicking (:sicking) No longer reading bugmail consistently 408 bytes, text/plain		Details