Closed Bug 948445 Opened 6 years ago Closed 6 years ago

investigate alternative implementations for ProtoAndIfaceCache memory usage

Component: Core :: DOM: Core & HTML (defect)
Status: RESOLVED FIXED
Target Milestone: mozilla31
People: Reporter: froydnj; Assigned: froydnj
References: Blocks 2 open bugs
Whiteboard: [MemShrink:P2]
Attachments: 3 files, 4 obsolete files

From bug 946781 comment 20:

I wonder if the proto-iface cache really needs to be an array; I can't imagine that many members are really going to be hit in your typical scope.  Some webpages, sure, but not all of them.  Going for a balanced tree or something to reduce memory usage in the average case might be worth investigating.

From bug 946781 comment 22 and comment 23:

The perf of lookups here is critical for the performance of JS-wrapping.  One of the big speedups of WebIDL bindings over XPConnect ones is that we made the lookup of the prototype object not take approximately forever.

That said, some sort of balanced tree might not be too bad.  It would still be way better than the xpconnect setup.

Also, it should be simple to test the performance here: something like createElement or 
new Event or some such ought to be worst-cases for wrapping performance.

This bug is about investigating those alternatives.
Assuming I'm counting the size of std::map's entries correctly (I don't see a good way to report their memory explicitly, so I'm resorting to counting), implementing the cache with a std::map makes our xpconnect tree for a newly-opened browser look like:

└───0.95 MB (01.36%) -- xpconnect
    ├──0.62 MB (00.89%) ── scopes
    ├──0.25 MB (00.36%) ── runtime
    ├──0.05 MB (00.07%) ── js-component-loader
    └──0.03 MB (00.04%) ── proto-iface-cache

Compare this to the array-based cache reported in bug 946781 comment 20:

├───3.07 MB (04.09%) -- xpconnect
│   ├──2.06 MB (02.75%) ── proto-iface-cache
│   ├──0.75 MB (01.01%) ── scopes
│   └──0.25 MB (00.34%) -- (2 tiny)
│      ├──0.25 MB (00.34%) ── runtime
│      └──0.00 MB (00.00%) ── js-component-loader

~60x size reduction...I'll take it!  Now for the performance benchmarking side of things.
So, I'm completely new to DOM benchmarking.  bz, does this look like a reasonable benchmark for wrapping performance?

  var n = 100000;
  var start = Date.now();
  for (var i = 0; i < n; ++i) {
    var e = new CustomEvent("ev", {});
  }
  var end = Date.now();
  document.write("Time to wrap " + n + " CustomEvents: " + (end - start));
Flags: needinfo?(bzbarsky)
The way to tell is to profile.  ;)

And what profiling will tell you is that this case is dominated by the unforgeable attribute and some other event suck.  For example, new Event is 1.5x faster than new CustomEvent, and createElement is about 8x faster.  createTextNode is 16x faster.

So start with createTextNode, I guess:

  (function() {
    var n = 100000;
    var start = Date.now();
    for (var i = 0; i < n; ++i) {
      var t = document.createTextNode("x");
    }
    var end = Date.now();
    document.write("Time to wrap " + n + " TextNodes: " + (end - start));
  })();
Flags: needinfo?(bzbarsky)
Thanks for clarifying.

With that benchmark in hand, the numbers are roughly:

1M TextNodes, array cache: 2554ms
1M TextNodes, std::map cache: 2751ms

This is on x86-64 Linux, FWIW.

So switching over to std::map (or similar red-black tree) slows us down by ~8%, in exchange for an order-of-magnitude less memory overhead for the cache.  (Counting the nodes correctly gives us a ~20x improvement in memory usage for the caches in a freshly started browser.)
It's worth measuring the memory for the cache in a browser that actually loaded something interesting (and hence might have created several different objects).

In fact, it's worth testing the performance of the testcase in such a scenario (where presumably the std::map contains more stuff, right?)
Yeah, we should try a benchmark where we've first enumerated everything on Window.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #6)
> Yeah, we should try a benchmark where we've first enumerated everything on
> Window.

Looks like enumerating everything on Window, then doing the TextNode benchmark costs us another 5-7%.  So the performance is ~15% worse than what we have now after enumerating everything.
I think that's worth it but bz may disagree.

We could also cache the last interface lookup or something if rewrapping the same iface over and over again is common.
(In reply to Nathan Froyd (:froydnj) from comment #7)
> Looks like enumerating everything on Window, then doing the TextNode
> benchmark costs us another 5-7%.  So the performance is ~15% worse than what
> we have now after enumerating everything.
Sounds rather horrible. We're trying to make element/textnode creation faster
(we need, for example, to get rid of one extra addref/release).
As far as performance goes, in the benchmark above, each createTextNode call is about 300ns for me today, if I'm measuring right.  So an 8% hit is about 24ns.  A 15% hit is about 45ns.  It's a 2.6GHz CPU, so those are about 60-100 CPU cycles, which is not implausible for a binary tree search, I guess (though the 60 feels a bit high for a nearly-empty tree).

How does this change affect the "more real-world" (sic) test at <http://dromaeo.com/?dom-mod>?

Also, what is the memory usage once you've enumerated everything on Window?  That was the other question I had in comment 5.
Another thought:

If we can define a frequency-ordering on these things, we might be able to keep using an array but only size it big enough for what we've had to use in the scope so far.  This would require not storing protos and interface objects in a single array, of course (though you have that constraint anyway with the std::map approach, I expect).
(In reply to Boris Zbarsky [:bz] from comment #10)
> Also, what is the memory usage once you've enumerated everything on Window? 
> That was the other question I had in comment 5.

Worst case memory usage, where the # of prototypes and interfaces is N:

array cache: N*wordsize, rounded up to pagesize; 8K on a 32-bit platform (or it will be very shortly)

std::map cache: ~N*(6*wordsize), where 6*wordsize is about the size of a std::map tree node in libstdc++ (key, value, parent, left child, right child pointers, and an enum for red/black).  Mac and Windows <map> implementations have similar implementations.  I'm assuming Android does too.

So pretty bad memory usage regression in the worst case.
We're definitely over 1024 entries in this list, so I agree that we're at 8K on 32-bit and 16K on 64-bit.

6*wordsize, if it gets heap-allocated by jemalloc would get rounded up to 32 bytes on 32-bit and would land at 48 bytes on 64-bit.  So if those nodes are all malloced, we're actually looking at N*8*wordsize on 32-bit, or about a 4x regression over the current state (and getting worse for a bit as we add more things, until we bump up to 12K for the array).

I wonder whether we can get some data on real pages as to how many nodes the tree would have for them.  But no matter what, the case of about:blank and other nearly-empty (but not entirely!) globals is worth making small if we can...
And one other note.  A 24KB regression (that's our worst case on 32-bit; on 64-bit it's more like 32KB) in the size of this data structure on something like gmail is sort of a drop in the bucket.  So the real worst cases are tiny pages that manage to populate a lot of this data structure, which basically requires something like enumerating the window, I expect.
Something like this seems like a decent compromise: we get memory
savings because we lazily allocate portions of the cache, but lookups
remain fast no matter how many slots in the cache we've filled up.
Testing locally with the TextNode benchmark shows the difference between
the two to be in the noise (~0.5% or less) regardless of whether we've
enumerated everything on Window or not.  Testing single lookups might
show up the extra mallocs that we're doing now.

Memory reduction in a newly-started browser is smaller, about 7-8x for the
proto-iface-caches line item in about:memory:

├───1.35 MB (01.93%) -- xpconnect
│   ├──0.79 MB (01.13%) ── scopes
│   └──0.56 MB (00.80%) -- (3 tiny)
│      ├──0.29 MB (00.42%) ── proto-iface-cache
│      ├──0.21 MB (00.31%) ── runtime
│      └──0.05 MB (00.08%) ── js-component-loader

Further reductions could be gained if we did a more intelligent sorting
of the IDs as bz suggested in comment 11.

(Yes, I know there's a memory leak in the patch; a few other things could
stand to be cleaned up as well...)
Assignee: nobody → nfroyd
Whiteboard: [MemShrink] → [MemShrink:P2]
(In reply to Nathan Froyd (:froydnj) from comment #15)
> Something like this seems like a decent compromise: we get memory
> savings because we lazily allocate portions of the cache, but lookups
> remain fast no matter how many slots in the cache we've filled up.
> Testing locally with the TextNode benchmark shows the difference between
> the two to be in the noise (~0.5% or less) regardless of whether we've
> enumerated everything on Window or not.

Sadly, these numbers are totally bogus, because I was testing an opt/debug build for all the measurements I took yesterday.  Doh.

An opt, non-debug build says that the difference between the current flat scheme and this two-level scheme is ~15%, which is still too much.  I'm going to try splitting things so that the cache has explicit |get|, |set|, and |has| methods, which will at least eliminate a branch from |get| and ideally make things a bit faster.
Here's a better version that actually benchmarks well (on the
createTextNode benchmark) against the current scheme.  Somewhat
surprisingly, this version is about 5% *faster* than what we currently
have in the tree (opt build, x86-64 Linux); I don't have a good
explanation for this yet.

The code, however, is somewhat uglier.  I'm also not sure about the
JSAPI bits that I'd need to get HasEntry correct.  I see that GCC is
choosing to not inline it, presumably because JS::Heap<JSObject*>::set
is getting called in the middle, which seems suboptimal.

I want to break this up into two pieces (rename ProtoAndIfaceArray, then
reimplement) for actual review, but I'd like feedback on whether this is
going in the right direction.
Attachment #8345494 - Attachment is obsolete: true
Attachment #8346784 - Flags: feedback?(bzbarsky)
Comment on attachment 8346784 [details] [diff] [review]
use a two-level caching scheme for ProtoAndIfaceArray

I don't see why we can't just replace HasEntry with GetEntryIfExists() and have it still return JSObject*.  

In general, I'm ok with this.  This cuts our memory usage by a factor of presumably somewhere close to 16 in the about:blank case, and very slightly increases it in the "enumerate the window" case?
Attachment #8346784 - Flags: feedback?(bzbarsky) → feedback+
Summary: investigate alternative implementations for ProtoAndIfaceCache → investigate alternative implementations for ProtoAndIfaceCache memory usage
After looking at some CC logs, it looks like there are two distinct cases for these caches:

- Window, where the cache might have a lot of entries and access speed is (relatively) crucial;
- Everything else (e.g. ContentFrameMessageManager, BackstagePass)

Maybe it's worthwhile implementing two different cache strategies?  I have no idea how much plumbing would be needed so we know which kind of cache we should be using, or at what point the cache should be allocated, though.
It could be.  How many non-Window globals do we have floating around?  I guess one per worker, if nothing else...
Well, every JSM has a BackstagePass. There are something like ~100 system globals out there, right?
(In reply to Boris Zbarsky [:bz] from comment #20)
> It could be.  How many non-Window globals do we have floating around?  I
> guess one per worker, if nothing else...

The aforementioned CC log says that we have 213 globals, 200 of which are non-windows.  And some of those 13 Windows are ChromeWindows.
OK.

We could obviously flag somehow what sort of thing we have (e.g. bit-tagging the pointer in the private slot?).  It's not clear that the result would be faster than what we've already tried in this bug, though...
Also, note that the global already has 100-200 reserved slots for the js builtin version of this...
So what's the path forward here (if any)?
Flags: needinfo?(nfroyd)
The current status is this:

Benchmarking says we lose a non-trivial amount on a benchmark like comment 3.  I think the numbers were in the 5-15% range; I'd have to try redoing those.  That was with a two-level page-table-esque scheme.  I haven't tried more exotic approaches like separate cache implementations for Window-like globals and everything else.  I can re-benchmark and try something more exotic this week.

Ultimately, everything is going to be adding instructions (including branches) on top of the single memory read we have now.  So the real question is how much speed we are willing to sacrifice for the memory gains.  There's disagreement here, but I think smaug and/or bz have the final say in this.
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd (:froydnj) from comment #26)
> I think the numbers were in the 5-15% range; I'd have to try redoing
> those.  That was with a two-level page-table-esque scheme.

Re-benchmarking locally confirms this.  The numbers that I'm getting aren't particularly stable, but it looks like 10-15% is in the right ballpark.

> I haven't tried
> more exotic approaches like separate cache implementations for Window-like
> globals and everything else.  I can re-benchmark and try something more
> exotic this week.

This is the next task.
OK, so here's a version that does the separate implementations for window-like
globals and everything else.  Microbenchmarking shows it to be about as fast as
the current implementation, and about:memory confirms that we save a good bit
of space (~2.3MB on a newly-opened browser on a mostly-empty profile on x86-64 Linux).

We could squeeze out a little more space by using std::map (see comment 15) or
mozilla::SplayTree (less overhead than std::map) for our non-window-like cache
if desired.
Attachment #8346784 - Attachment is obsolete: true
Attachment #8389296 - Flags: review?(bugs)
Comment on attachment 8389296 [details] [diff] [review]
use different caching schemes for ProtoAndIfaceArray depending on the global kind

Review of attachment 8389296 [details] [diff] [review]:
-----------------------------------------------------------------

Erm.  One small mistake, let's fix that first.
Attachment #8389296 - Flags: review?(bugs)
Let's try that again.  Please see comment 28 for the details.
Attachment #8389296 - Attachment is obsolete: true
Attachment #8389357 - Flags: review?(bugs)
Comment on attachment 8389357 [details] [diff] [review]
use different caching schemes for ProtoAndIfaceArray depending on the global kind

Boris, could you review this. My queue is getting out of bounds.
Attachment #8389357 - Flags: review?(bugs) → review?(bzbarsky)
Comment on attachment 8389357 [details] [diff] [review]
use different caching schemes for ProtoAndIfaceArray depending on the global kind

The ProtoAndIfaceArray name might want changing too, since it's no longer an array.  Followup ok for that.

>+    Array<Page*, kProtoAndIfaceCacheCount / kPageSize + 1> mPages;

This will overallocate by 1 when kProtoAndIfaceCacheCount % kPageSize == 0.

Does it make sense to do kProtoAndIfaceCacheCount / kPageSize + size_t(bool(kProtoAndIfaceCacheCount % kPageSize)) or something?

>@@ -2392,17 +2547,17 @@ CreateGlobal(JSContext* aCx, T* aObject, nsWrapperCache* aCache,

Afaik, this is used for workers.  Do we want the WindowLike cache there?

>@@ -427,17 +427,17 @@ CreateGlobalObject(JSContext *cx, const JSClass *clasp, nsIPrincipal *principal,

And this is used for windows right now as far as I know...  Did you make sure that the right sorts of caches are being created for windows?

r=me with those nits dealt with and questions answered.
Attachment #8389357 - Flags: review?(bzbarsky) → review+
(In reply to Boris Zbarsky [:bz] from comment #32)
> >+    Array<Page*, kProtoAndIfaceCacheCount / kPageSize + 1> mPages;
> 
> This will overallocate by 1 when kProtoAndIfaceCacheCount % kPageSize == 0.
> 
> Does it make sense to do kProtoAndIfaceCacheCount / kPageSize +
> size_t(bool(kProtoAndIfaceCacheCount % kPageSize)) or something?

Doh, probably.  I had this fixed at one point and must have lost it in rebasing.

Do you have an opinion about using std::map or friends?

> >@@ -2392,17 +2547,17 @@ CreateGlobal(JSContext* aCx, T* aObject, nsWrapperCache* aCache,
> 
> Afaik, this is used for workers.  Do we want the WindowLike cache there?

I suspect not, since there are so many fewer interfaces available in workers?  Followup?

> >@@ -427,17 +427,17 @@ CreateGlobalObject(JSContext *cx, const JSClass *clasp, nsIPrincipal *principal,
> 
> And this is used for windows right now as far as I know...  Did you make
> sure that the right sorts of caches are being created for windows?

It looks like windows go through CreateGlobal; at least that's what WindowBinding.cpp has.
Flags: needinfo?(bzbarsky)
> Do you have an opinion about using std::map or friends?

I don't....

> Followup?

Sure.

> at least that's what WindowBinding.cpp has.

But does anything call that code so far?  I expect WindowBinding::Wrap is dead code at the moment...
Flags: needinfo?(bzbarsky)
FWIW, this cache accounts for ~2 MiB of the increases in start-up memory consumption we've seen over the past year or two.
Boris was right: windows weren't using the plain flat cache in attachment
8389357 [review].  That's because they were still going through the XPConnect bits for
creating global objects.

In the absence of WebIDL Window, then, we have to have a way for XPConnect to
create prototype caches in the style that Window would like.  This patch
implements that as an additional flag to initClassesWithNewWrappedGlobal.

It's pretty straightforward: add flag, pass it down to where it needs to go,
and fix up callers where appropriate.  Advice on the name welcome.

This patch will be folded into the previous one prior to landing.
Attachment #8394238 - Flags: review?(bobbyholley)
Assuming bholley is happy with the XPConnect patch, this is the bit that makes
nsGlobalWindow actually get the cache kind it would like.  I verified that we
do indeed get WindowLike caches with this patch.  Need to re-do the performance
numbers, though...
Attachment #8394241 - Flags: review?(bzbarsky)
Whoops, let's attach the patch without the busted debugging bits.
Attachment #8394238 - Attachment is obsolete: true
Attachment #8394238 - Flags: review?(bobbyholley)
Attachment #8394322 - Flags: review?(bobbyholley)
Comment on attachment 8394322 [details] [diff] [review]
add nsIXPConnect::USE_FLAT_PROTOTYPE_CACHE and propagate it around

Review of attachment 8394322 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not really wild about adding all this goop, especially since we're going to have to rip it out in a few weeks. Can't we just strcmp js::GetObjectClass with "Window" and "ChromeWindow" in AllocateProtoAndIfaceCache and be done with it?
(In reply to Bobby Holley (:bholley) from comment #39)
> I'm not really wild about adding all this goop, especially since we're going
> to have to rip it out in a few weeks. Can't we just strcmp
> js::GetObjectClass with "Window" and "ChromeWindow" in
> AllocateProtoAndIfaceCache and be done with it?

This is why we ask for reviews. :)  That works for me...Boris, WDYT?
Flags: needinfo?(bzbarsky)
Worksforme as well.
Flags: needinfo?(bzbarsky)
Attachment #8394241 - Flags: review?(bzbarsky)
Attachment #8394322 - Flags: review?(bobbyholley)
Blocks: 987457
Blocks: 987460
https://hg.mozilla.org/mozilla-central/rev/5ad139206674
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Component: DOM → DOM: Core & HTML