Closed Bug 549541 Opened 10 years ago Closed 6 years ago

operating on mPtrToNodeMap makes up 12% of our total cycle collector time

Categories: Core :: XPCOM, defect

Platform: x86 macOS

Status: RESOLVED INCOMPLETE

People: (Reporter: gal, Unassigned)

References: (Blocks 1 open bug)

Whiteboard: [Snappy:p3]

Attachments: (2 files, 1 obsolete file)

We might want to try a more C++-ish hash table implementation or something fancy like a radix tree. There are a lot of integer multiplications in the hash table code on hot paths.
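For context, the multiply in question, paraphrased from pldhash rather than quoted exactly: every operate call scrambles the key hash with a golden-ratio multiplier before double hashing.

    #include <stdint.h>

    typedef uint32_t PLDHashNumber;
    #define PL_DHASH_GOLDEN_RATIO 0x9E3779B9U

    // Paraphrased from pldhash: every PL_DHashTableOperate call multiplies
    // the key hash by the golden-ratio constant before probing.
    static inline PLDHashNumber
    ScrambleKeyHash(PLDHashNumber keyHash)
    {
        return keyHash * PL_DHASH_GOLDEN_RATIO;  // one integer multiply per lookup
    }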
Blocks: 549533
Still true a year and a half later, though now that GC and CC times are measured separately, it seems to take about 12% of the total CC time.  pldhash also uses a crazy amount of memory: about as much as the CC graph, according to some numbers njn had.

I'll look into what happens if we swap out pldhash for js/public/Hashtable.  Should be easy to do, so if we can get even a little bit of a win off that, that would be nice.
Summary: operating on mPtrToNodeMap makes up 6% of our total CC/GC time → operating on mPtrToNodeMap makes up 12% of our total cycle collector time
It was pretty easy to change, but no clear performance win.  After this patch, we end up with GCGraphBuilder::AddNode being about 12-14% of the total time, which is about the same as what PL_DHashTableOperate was taking, possibly a little more.
Is there a better way of doing this? How many objects do we insert in the graph on average? Maybe we should have a word in each header?
Or maybe for classes that are particularly frequent we can add a word and avoid putting them in the map.
I think there are about 40k to 100k items.

I've been thinking that we could use the same mechanism as is currently used to store purple buffer entry pointers in the ref count.  The first thing we do is drain the purple buffer, so by the time we can possibly hit nodes we've seen a second time, there is nothing in the purple buffer, so we can check the low-order bit to see if the object is mapped or not.  The locality will probably be much worse, but we can do a lookup with just a few loads and bitwise operations.
It would also require some kind of unwind operation at the end of cycle collection to restore the ref counts, so that could be bad.
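A hypothetical sketch of that scheme; the names and tagging layout are illustrative, not the real refcount representation:

    #include <stdint.h>

    struct PtrInfo;  // stand-in for the CC graph node type

    // While the CC runs, the refcount word either holds the real count
    // (low bit clear) or a graph-node pointer (low bit set).  Node
    // pointers are word-aligned, so the low bit is free for tagging.
    struct TaggedRefCnt {
        uintptr_t mBits;

        bool IsMapped() const { return mBits & 1; }
        PtrInfo* GetNode() const { return (PtrInfo*)(mBits & ~(uintptr_t)1); }

        // Map: stash the node pointer; the real count must be saved
        // elsewhere (e.g. in the node) for the unwind pass at the end.
        void MapTo(PtrInfo* aNode) { mBits = (uintptr_t)aNode | 1; }
        void Unmap(uintptr_t aSavedCount) { mBits = aSavedCount; }
    };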
Luke, I tried replacing the PLDHash in the cycle collector with js/HashTable.  This is a simple mapping from pointers to pointers.  In my fairly unscientific testing (just browsing around) this didn't really seem to affect the % of the time the CC spends in the hash table, according to Shark.  I was wondering if that is expected?  Did I do something stupid with my implementation in the attached patch?  Thanks.  (I could also try setting up more of a precise microbenchmark that does a series of hash table operations on data gathered from a real run, and time that.)
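For reference, the shape of the change, sketched against the js::HashMap API of that era; PtrInfo stands in for the real graph-node type and the typedef and helper names are mine:

    #include "js/HashTable.h"

    struct PtrInfo;  // the CC graph node type

    // Pointer-to-node map, mirroring what mPtrToNodeMap holds.
    typedef js::HashMap<void*, PtrInfo*,
                        js::DefaultHasher<void*>,
                        js::SystemAllocPolicy> NodeMap;

    // Returns the node for aPtr, adding a fresh entry if absent
    // (assumes aMap.init() succeeded earlier).
    bool GetOrAdd(NodeMap& aMap, void* aPtr, PtrInfo* aNew, PtrInfo** aOut)
    {
        NodeMap::AddPtr p = aMap.lookupForAdd(aPtr);  // hashes once
        if (!p && !aMap.add(p, aPtr, aNew))
            return false;  // OOM
        *aOut = p->value;  // p points at the entry in either case
        return true;
    }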
Attached patch use crude trie (obsolete) — Splinter Review
Just for fun, I threw together a very simple trie to use instead of the hash table.  You can browse around a little without crashing.  It implements the single real trie operation we need, which is "Look up this key and if you don't find it, add an entry for it anyways, then in either case give me the reference to the value".

Over the 8 CCs I sharked, the trie lookup took 4% to 10% of the total time, mostly on the lower end, so that looks like an improvement.  I should figure out what memory usage looks like, and whether turning it into a radix trie will help.  I'd guess there could be some savings that way.
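The operation in question, sketched as a simple 4-bit-radix trie over the pointer's bits; the fan-out and layout here are illustrative, not necessarily the patch's:

    #include <stdint.h>
    #include <string.h>

    struct TrieNode {
        TrieNode* childs[16];
        void* value;
        TrieNode() : value(0) { memset(childs, 0, sizeof(childs)); }
    };

    // Lookup-or-add: walk the key a nibble at a time, creating nodes as
    // needed, and hand back a reference to the leaf's value slot in
    // either case.
    void*& LookupForAdd(TrieNode* root, void* key)
    {
        uintptr_t bits = (uintptr_t)key;
        TrieNode* node = root;
        for (unsigned i = 0; i < sizeof(void*) * 2; i++) {  // 2 nibbles per byte
            unsigned idx = (bits >> (i * 4)) & 0xF;
            if (!node->childs[idx])
                node->childs[idx] = new TrieNode();
            node = node->childs[idx];
        }
        return node->value;
    }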
My trie had a terrible bug, so my performance numbers are totally bogus.  With the correct version, the performance seems to be awful, so I'll have to look into this a bit more...
(In reply to Andrew McCreight [:mccr8] from comment #7)
> Luke, I tried replacing the PLDHash in the cycle collector with
> js/HashTable.  This is a simple mapping from pointers to pointers.  In my
> fairly unscientific testing (just browsing around) this didn't really seem
> to affect the % of the time the CC spends in the hash table, according to
> Shark.  I was wondering if that is expected?

Microbenchmarks generally show 1.5x - 3.2x improvement (bug 515812 comment 1).  bz also reported a significant speedup over JS_DHash, which is why js::HashMap is being used in layout/ (bug 598832, although no perf numbers are reported there).

Actually, I'm surprised you can compare time in HashMap vs. JS_DHash using Shark; usually HashMap/Set are inlined into the context, giving them no HashMap::* symbols and making it hard to differentiate time in the table vs. the calling code.  What does your profile look like?

Also, unlikely, but worth noting: js::DefaultHasher<void *> cuts off log2(sizeof(void*)) bits, whereas JS_DHashVoidPtrKeyStub simply does ptr >> 2.  That should only make a difference on 64-bit if the void*'s are *not* 8-byte aligned :)
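Side by side, simplified from the two libraries:

    #include <stdint.h>

    // JS_DHashVoidPtrKeyStub: always drops the low two bits.
    static uint32_t PLDHashVoidPtrKey(const void* key)
    {
        return (uint32_t)((uintptr_t)key >> 2);
    }

    // js::DefaultHasher<void*>: drops log2(sizeof(void*)) bits, i.e. 3 on
    // 64-bit.  If 64-bit pointers ever aren't 8-byte aligned, bit 2 is
    // live and this version throws it away.
    static uint32_t DefaultVoidPtrHash(const void* key)
    {
        const unsigned shift = sizeof(void*) == 8 ? 3 : 2;
        return (uint32_t)((uintptr_t)key >> shift);
    }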
Comment on attachment 567346 [details] [diff] [review]
use crude trie

+    ~TrieNode() {
+        for (PRUint32 i = 0; i < childsLen; i++) {
+            if (!childs[i])
+                delete childs[i];
+        }
+    }

That's probably not what you wanted, right?
(In reply to Luke Wagner [:luke] from comment #10)
> Microbenchmarks generally show 1.5x - 3.2x improvement (bug 515812 comment
> 1).

Okay, good to know.

> Actually, I'm surprised you can compare time in HashMap vs. JS_DHash using
> Shark; usually HashMap/Set are inlined into the context, giving them no
> HashMap::* symbols and making it hard to differentiate time in the table vs.
> the calling code.  What does your profile look like?

Right, but the function that calls it does very little aside from hash lookups.  Though now that I think about it, there is one other function call in there, which might take some time occasionally, so I'll have to try to split out that cost somehow.  Or at least figure out how long it normally takes.

> That's probably not what you wanted, right?

Oops.  I guess you're referring to the fact that it should be "if (childs[i])".
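For the record, the intended version (deleting null is harmless anyway, so the check is just an optimization; as originally written the destructor never freed any child):

    ~TrieNode() {
        for (PRUint32 i = 0; i < childsLen; i++) {
            if (childs[i])
                delete childs[i];
        }
    }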

I never wrote it down here, so for the record: the perf problem with the correct version turned out to be that it spent an insane amount of time in bzero.  That was about 15% of CC execution time, as much as the hash lookup normally takes.
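(Back-of-the-envelope, using the illustrative 16-way node sketched above: a 64-bit key walks 16 levels, so a brand-new key can allocate up to 16 nodes, each zeroing a 16-pointer child array, i.e. roughly 16 × 128 B ≈ 2 KiB of bzero per fresh key, against the ~40k-100k keys per graph build mentioned earlier.)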
A perhaps interesting data point here: http://blog.mozilla.com/nfroyd/ ("reordering plugins and edge profiles", no direct link because it 404's).  NoteXPComChild calls PL_DHashTableOperate 167k times, which I think is the mPtrToNodeMap lookup described here.  It looks like the cycle collector may be one of the heaviest users of PLDHash.
I followed Luke's advice and wrapped up the call to the hash table lookup with a JS_NEVER_INLINE function so I could figure out exactly the % of time spent in lookups.  Doing that, it looks like 12-16% of the time in the CC is spent in js/Hashtable lookups, which is about the same as PLDHash.  It sounds slightly higher, but my 12% number for PLDHash doesn't actually include all of the weird subcalls that it does, which are another few percent.
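The measurement trick, sketched; NodeMap is the hypothetical typedef from the earlier sketch, and JS_NEVER_INLINE is SpiderMonkey's no-inline annotation:

    // Keeping the lookup out of line gives Shark a dedicated symbol to
    // attribute samples to, instead of inlining it into the caller.
    static JS_NEVER_INLINE PtrInfo*
    LookupNode(NodeMap& aMap, void* aPtr)
    {
        NodeMap::Ptr p = aMap.lookup(aPtr);
        return p ? p->value : NULL;
    }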

Another thing that is probably relevant is that the other large chunk of time in the CC that I'm seeing is vm_map_lookup_entry, which I guess is indicative of a page fault or some other kind of cache miss, and that can take about 12-14% of the CC time (though sometimes much less).

Anyways, another suggestion Luke had for investigation is to see what the hash table stats are like.  I'll look into that.  If we're getting crazy collisions, that would be a problem.  It could also just be that we're thrashing memory too much with all of our cycle collector machinations, so that the hash table behavior doesn't matter much.  If it is the former problem, I can construct a cycle-collector-based microbenchmark, but that won't simulate the real environment where we are doing all this other lookup stuff.
Blocks: 698919
Whiteboard: [Snappy]
Whiteboard: [Snappy] → [Snappy:p3]
Here are the PLDHash statistics from a 40ms graph-building phase:

Double hashing statistics:
    table size (in entries): 65536
          number of entries: 46866
  number of removed entries: 0
         number of searches: 199321
             number of hits: 127879
           number of misses: 71442
      mean steps per search: 0.443129
     mean hash chain length: 1.50898
         standard deviation: 2.31379
  maximum hash chain length: 26
          number of lookups: 0
 adds that made a new entry: 46866
adds that recycled removeds: 0
   adds that found an entry: 127879
               add failures: 0
             useful removes: 0
            useless removes: 0
removes that freed an entry: 0
  removes while enumerating: 0
            number of grows: 1
          number of shrinks: 0
       number of compresses: 0
number of enumerate shrinks: 0
Hash table operations still take about 12 to 15% of CC time, when we're freeing a lot of objects (closing a tab).  If I recall correctly, it didn't take up much of the smaller steady state CCs.
The numbers don't look catastrophic, but you may still want to play with table->maxAlphaFrac: a bigger table (for the same number of entries) would mean fewer collisions and shorter chains, so less random memory probing.  (OTOH, a bigger table means more cache misses in general.)
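For scale, the table above sits at 46866 / 65536 ≈ 0.72 load, just under pldhash's default 0.75 maximum.  A sketch of the suggested tuning, using the alpha-bounds API pldhash exposed at the time; the bounds and initial capacity here are illustrative:

    // Grow at 50% full instead of the default 75%: shorter probe chains
    // at the cost of a larger table.
    PL_DHashTableInit(&mPtrToNodeMap, PL_DHashGetStubOps(), nsnull,
                      sizeof(PtrToNodeEntry), 32768);
    PL_DHashTableSetAlphaBounds(&mPtrToNodeMap, 0.5f, 0.25f);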
No longer blocks: 698919
Blocks: 698919
This is still probably true, but with ICC (incremental cycle collection) it is less of a problem.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE