Closed Bug 747066 Opened 12 years ago Closed 11 years ago

xpc_UnmarkGrayObject is slow

Categories

(Core :: XPConnect, defect)

x86
macOS
defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla20

People

(Reporter: bzbarsky, Assigned: billm)

References

(Blocks 1 open bug)

Details

(Keywords: perf)

Attachments

(2 files, 1 obsolete file)

This is called all the time on hot paths when returning things from C++ to JS.

I see at least 3 issues here:

1)  Null-check of obj.  Lots of callers know they have a non-null object; we
    should have a version that skips the null-check.
2)  js::GCThingIsMarkedGray is not inlined.
3)  js::IsIncrementalBarrierNeededOnObject is not inlined.
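
As a sketch of point 1, the null check can be hoisted out of the common path by splitting the entry point. All names here are invented for illustration; this is not the real XPConnect code:

```cpp
#include <cassert>

struct FakeGCObject { bool gray; };

// The common inner path: callers that statically know obj is non-null
// skip the branch entirely.
inline void UnmarkGrayNonNull(FakeGCObject* obj) {
    obj->gray = false;   // stand-in for the real mark-bit manipulation
}

// The general entry point keeps the null check for callers that need it.
inline void UnmarkGray(FakeGCObject* obj) {
    if (!obj)
        return;
    UnmarkGrayNonNull(obj);
}
```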

As far as that goes, why exactly do we only need to do the IsIncrementalBarrier thing on non-gray objects?  Does unmarking gray automatically mark for GC purposes or something?  This part is not making sense to me or terrence.

In any case, anything we can do to speed this up would help.  My goal for total binding overhead is about 25 cycles, so anything eating into that is a bit painful.  ;)
Keywords: perf
I looked into this a little before.  It is pretty sad that js::GCThingIsMarkedGray isn't inlined, given that it just involves a few bit operations and a load (I think).  Unfortunately, the actual definition of it involves a crazy chain of dependencies deep into various GC machinery.  I wonder if we could make a shadow version of it that just has the relevant operations.

Similarly, it looks like needsBarrier is just grabbing a boolean out of the compartment.
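
The shadow idea might look roughly like this. `shadow::Compartment` and `FakeJSCompartment` are invented stand-ins, and the cast is only sound if the public mirror really matches the leading layout of the private type:

```cpp
#include <cassert>
#include <cstdint>

namespace shadow {
// Public mirror of just the leading fields of the private type; only
// valid if the real type keeps these members first, in this order.
struct Compartment {
    bool needsBarrier_;
};
} // namespace shadow

// Stand-in for the private JSCompartment. Note: no virtual functions,
// or a vtable pointer would sit in front of the data members and break
// the mirroring.
struct FakeJSCompartment {
    bool needsBarrier_;
    uint64_t otherGcState;
};

// The inlinable hot-path getter: a cast and a load, no out-of-line call.
inline bool CompartmentNeedsBarrier(FakeJSCompartment* comp) {
    return reinterpret_cast<shadow::Compartment*>(comp)->needsBarrier_;
}
```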
Talking to Boris, it sounds like the right approach here is to JIT all the calls to xpc_UnmarkGray that are actually hot. That way we can keep the internal mark bit stuff inside the JS engine. Also, we can almost entirely eliminate the incremental GC overhead from the JIT since the code can be specialized for whether IGC is running or not.

Here's why the incremental barrier is needed even if the object is not gray. Let's say an object was gray. Then we start an incremental GC. The first thing it does is to reset all the mark bits. The second thing it does is to ask XPConnect for the gray roots. Then it does a bunch of slices where it marks stuff black. During that time, the object that was gray (and is now white because the mark bits were cleared) could be passed into the JS engine. That's when xpc_UnmarkGray needs to mark it black. If it didn't then at best it might be marked gray during the gray marking phase, which happens in the final incremental GC slice. At worst, it would no longer be reachable from the gray roots and be collected.
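
The scenario described above can be sketched as a tiny simulation; the three-color model and names are simplified stand-ins, not SpiderMonkey code:

```cpp
#include <cassert>

enum class Mark { White, Gray, Black };
struct GCThing { Mark mark; };

inline void ExposeToJS(GCThing* obj, bool incrementalGcRunning) {
    if (obj->mark == Mark::Gray) {
        // Ordinary unmark-gray: promote gray to black.
        obj->mark = Mark::Black;
    } else if (incrementalGcRunning && obj->mark == Mark::White) {
        // The barrier case: a formerly-gray object whose mark bits were
        // cleared at the start of the incremental GC must be blackened
        // now, or a later slice could treat it as garbage.
        obj->mark = Mark::Black;
    }
}
```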
That's certainly the right long-term approach.  It'd be nice to have some speedups shorter-term if we can, though.  As in, this is one of the things making imagedata slower than it should be in Firefox 14...

> Here's why the incremental barrier is needed even if the object is not gray.

I get that part.  Why is it not needed if the object _is_ gray?
If an object is grey, then it also has its black bit set (indicating it is alive), so there's no danger of it being collected.  I think.
> I get that part.  Why is it not needed if the object _is_ gray?

Sorry, I misread your question. However, when JS is running during an incremental GC, nothing is ever gray. So that will never happen.
OK, I can quantify "slow" now, since Bill was asking earlier today.

On my 2.66GHz chip, if I just run the ImageData.data getter as it is now, I see it take about 41.5ns best case, and about 42.5ns on average.

If I comment out the xpc_UnmarkGrayObject in that getter, that drops to about 36.5ns on average (35ns best case).

So figure right now this call is taking about 6ns, or 16 cycles.  And yes, there's lots of other stuff we need to speed up too in the ImageData.data getter; I have a plan for most of the rest of it.  ;)

For comparison, Safari spends about 12-13ns _total_ for a getter like this on my hardware.

For getting an int, Safari spends 9ns.  With new-bindings we can manage that in about 15ns, with more speedups coming.  So if we can just shave _some_ time off the 6ns xpc_UnmarkGrayObject is taking, that will be a pretty good start.  Again, I'd love us to do what we can in terms of inlining in the near term, and do more jit work in the longer term.
Depends on: 747285
I can try to hack up some inlined versions of these functions for you to try next week, to see how the perf looks, but they will make Bill sad so they are not going to be a real solution.
Depends on: 747290
That would be great.  At the very least, it would tell me how much I can expect to win by making Bill sad, and hence whether it's worth thinking about making him sad.  ;)
(In reply to Andrew McCreight [:mccr8] from comment #7)
> I can try to hack up some inlined versions of these functions for you to try
> next week, to see how the perf looks, but they will make Bill sad so they
> are not going to be a real solution.

That sounds great. I'm all for getting more data.

Another idea that might be worth a try is to have a small stack of objects that UnmarkGray was called on. We would push onto the stack until it was full, and then call a function to process the items. I doubt we'd need a very big stack to amortize the call overhead. One problem I see with that is that we wouldn't have any place to store the stack, aside from a global variable. I'm not sure how kosher that would be in Gecko-land.
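
A minimal sketch of that batching idea, with an invented stack size and the global buffer the comment worries about:

```cpp
#include <cassert>
#include <cstddef>

struct GrayObj { bool gray; };

// Invented stack size; the real number would be tuned.
constexpr size_t kPendingMax = 32;

// The global buffer in question; in Gecko this would need an agreed-upon
// home rather than a file-scope static.
static GrayObj* gPending[kPendingMax];
static size_t gPendingCount = 0;

// The out-of-line call, now paid once per batch instead of per object.
void FlushPendingUnmarks() {
    for (size_t i = 0; i < gPendingCount; i++)
        gPending[i]->gray = false;   // stand-in for the real unmark work
    gPendingCount = 0;
}

inline void DeferredUnmarkGray(GrayObj* obj) {
    gPending[gPendingCount++] = obj;
    if (gPendingCount == kPendingMax)
        FlushPendingUnmarks();
}
```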
I can provide supervision to ensure that the making of that particular sausage is claimed to be kosher, if needed.  ;)
(In reply to Boris Zbarsky (:bz) from comment #8)
> That would be great.  At the very least, it would tell me how much I can
> expect to win by making Bill sad, and hence whether it's worth thinking
> about making him sad.  ;)

Yeah, that was my thought.

> (In reply to Andrew McCreight [:mccr8] from comment #7)
> Another idea that might be worth a try is to have a small stack of objects
> that UnmarkGray was called on. We would push onto the stack until it was
> full, and then call a function to process the items. I doubt we'd need a
> very big stack to amortize the call overhead. One problem I see with that is
> that we wouldn't have any place to store the stack, aside from a global
> variable. I'm not sure how kosher that would be in Gecko-land.

The stack idea is interesting.  You'd have to be careful to flush it before objects on the stack can be read from JS, or in any place where we don't barrier reads, which would limit the effectiveness a bit.  I'm not sure what sort of call patterns there are in these hot cases.  If it is just something like, bounce to C++, grab an object, return immediately to JS, then you may not get any benefit here.
(In reply to Andrew McCreight [:mccr8] from comment #11)
> The stack idea is interesting.  You'd have to be careful to flush it
> before objects on the stack can be read from JS, or in any place where
> we don't barrier reads, which would limit the effectiveness a bit.  I'm
> not sure what sort of call patterns there are in these hot cases.  If
> it is just something like, bounce to C++, grab an object, return
> immediately to JS, then you may not get any benefit here.

Yeah, I guess you're right. It's probably not worth it then.
There are several different hot cases.  One is "the microbenchmark".  The call pattern looks like this:

  for (var i = 0; i < bignum; ++i)
    document.firstChild;

One is "the silly code that didn't cache things in locals" and looks like this:

  var imgData = getCanvasImageData();
  for (var i = 0; i < imgData.width*imgData.height; ++i) {
    imgData.data[i*4] = getR(i);
    imgData.data[i*4+1] = getG(i);
    imgData.data[i*4+2] = getB(i);
    imgData.data[i*4+3] = 0;
  }

(I hope we can CSE our way to victory here eventually).  In both of these, it's precisely jump into C++, grab an object, return to JS.

Finally, there is "just walk over the DOM tree" which has lots of instances of jump into C++, grab an object, return.
This is quite hideous, but it does inline everything.  I haven't added a variant that skips the null check, but that should be easy enough if you want me to do that, bz.
I moved a field of JSCompartment, which I guess could undo some kind of micro-optimization.  I can undo that, but it will be a bit tedious.
Thanks.  With the inlining (but with the null-check left in place for now), our time for a single get in the spirit of https://bugs.webkit.org/attachment.cgi?id=130594 drops from about 41ns to about 36ns or so.  So that's most of the 6ns I measured in comment 6, for a microbenchmark in which everything is cached.
xpc_UnmarkGrayObject tends to show up in profiles. Could we get this fixed?
Andrew, is your patch too ugly?
It is way too ugly as is. It may be possible to make it less terrible. We really just need to compute a few offsets. I can take a look at this this week or next week if nobody else is interested.
It sounds like this needs to get done soon. This patch is less ugly, and it should get us what we need. I didn't deal with the NULL check, but that should be easy.

There are two magic constants in the HeapAPI.h header. We'll have to change them if we ever change the layout of a chunk. However, that's easy enough to do, and we hardly ever change this stuff.
Assignee: nobody → wmccloskey
Attachment #618058 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #682666 - Flags: review?(terrence)
Comment on attachment 682666 [details] [diff] [review]
inline getter for mark bits and incremental stuff

Review of attachment 682666 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/public/HeapAPI.h
@@ +41,5 @@
> +const size_t CellMask = CellSize - 1;
> +
> +/* These are magic constants derived from actual offsets in gc/Heap.h. */
> +const size_t ChunkMarkBitmapOffset = 1032376;
> +const size_t ChunkMarkBitmapBits = 129024;

Can we declare a matching pair in gc/Heap.h that are computed from the hidden constants and statically assert that they match their hard-coded public version?
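
The suggested cross-check could look like this; `FakeChunk` is an invented stand-in for the real chunk layout in gc/Heap.h, and the numbers are the ones from the patch:

```cpp
#include <cassert>
#include <cstddef>

// Public header side: the hard-coded magic numbers (values from the patch).
const size_t ChunkMarkBitmapOffset = 1032376;
const size_t ChunkMarkBitmapBits = 129024;

// Private header side: the same quantities derived from the actual
// layout. FakeChunk stands in for the chunk type defined in gc/Heap.h.
struct FakeChunk {
    char arenas[1032376];
    unsigned char markBitmap[129024 / 8];
};

// If anyone changes the layout, these fail at compile time instead of
// silently corrupting mark-bit lookups.
static_assert(offsetof(FakeChunk, markBitmap) == ChunkMarkBitmapOffset,
              "public ChunkMarkBitmapOffset out of sync with gc/Heap.h");
static_assert(sizeof(FakeChunk::markBitmap) * 8 == ChunkMarkBitmapBits,
              "public ChunkMarkBitmapBits out of sync with gc/Heap.h");
```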
Attachment #682666 - Flags: review?(terrence) → review+
Bill, thanks!  This is a good bit better.  We're at about parity with Chrome now on the tests I'm running.  Would be nice to see if we can get to within spitting distance of Safari.... ;)

The short of it is that we're at about 30 cycles for a getter now, and it would be good to shave another 7-10 cycles off of that.

Looking at a profile with per-line blame, insofar as one believes such things, it looks like there's no longer an obvious culprit; just lots of memory ops and whatnot.  For example, setting the *maskp in GetGCThingMarkWordAndMask is claimed to be about 12% of total testcase time on a single "inc eax" instruction; it slightly screams "cache miss" of some sort to me (and we're not missing L2 too much here, obviously, given the times involved!).

I'm going to try a patch to eliminate the null-check here and see how much it helps (prediction based on profile: about one cycle).  Past that, I'm not quite sure what else we can do here yet...
Looks like ditching the null-check (bug 813419) is about one cycle at best, yes.  The rest of UnmarkGray is now maybe 4 cycles or so.  I'll have to squeeze some more blood from other stones.  ;)
https://hg.mozilla.org/mozilla-central/rev/2ee20348ae59
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Backed out for bug 813619:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fb729e54421e
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 813619
I'm able to reproduce the crashes on a local Win64 machine. Assuming it is a PGO issue, I pushed something to try that disables the inlining on Win64. I'll retest when that build is finished.
  https://tbpl.mozilla.org/?tree=Try&rev=a08cd03eefc9
I pushed it to try twice and they both came back okay. Maybe the MacOS crashes were caused by something else.
https://hg.mozilla.org/integration/mozilla-inbound/rev/a6b604916694
I think the Win64 workaround was actually covering up a problem.

In particular, on Mac opt builds with these changes, IsIncrementalBarrierNeededOnGCThing always returns true.  And it does that because the object layout of a JSCompartment in those builds looks like this:

150         return reinterpret_cast<shadow::Compartment *>(comp)->needsBarrier_;
(gdb) x/3ga comp
0x108e88000:    0x101408430 <_ZTV13JSCompartment+16>    0x0
0x108e88010:    0x0

That is, the first word of the JSCompartment is the vtable pointer, and the data members (including the data member inherited from shadow::Compartment) come after that.  I wonder whether an assert in the JSCompartment ctor that the reinterpret_cast does the right thing would fail in a debug build... that is, whether the object layout in debug is actually different, or whether the code generated for the member get ends up somehow producing 0 instead of 1 in some cases.

In either case, if that method can return bogus values, I see no reason it couldn't return 0 when it should return 1, leading to crashes, right?
Thanks Boris! That would have taken me a while to figure out. Backed out again. I'll post a fix in a bit.
https://hg.mozilla.org/integration/mozilla-inbound/rev/a4c6aedf00de
So I've thought of two plausible fixes so far:

1)  Add a virtual function to shadow::Compartment, and an assert in the JSCompartment ctor
    that reinterpret_cast and static_cast to shadow::Compartment return the same thing.
2)  Do something with an extern size_t storing the actual offset of the boolean member
    inside JSCompartment, etc.

My preference of those is #1.
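
Option 1 might be sketched like this (names invented; `JSCompartmentSketch` stands in for JSCompartment). Giving the shadow type a virtual method keeps the vtable pointer at the same offset in both layouts, and the ctor assert catches any divergence in debug builds:

```cpp
#include <cassert>

namespace shadow {
// The virtual method here means shadow::Compartment has a vtable pointer
// at the same offset as the real type, so the data members line up.
struct Compartment {
    virtual ~Compartment() {}
    bool needsBarrier_ = false;
};
} // namespace shadow

// Stand-in for JSCompartment.
struct JSCompartmentSketch : shadow::Compartment {
    JSCompartmentSketch() {
        // The hot inlined getter uses reinterpret_cast; assert in the
        // ctor that it agrees with the layout-aware static_cast.
        assert(reinterpret_cast<shadow::Compartment*>(this) ==
               static_cast<shadow::Compartment*>(this));
    }
    int otherState = 0;
};
```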
Applying this on top of the first patch in this bug makes things work fine on Mac (and by "fine" I mean "faster in opt, no asserts in debug").
The route I'd like to take is to remove the virtual method from JSCompartment so that it doesn't have a vtable anymore. I'm afraid that this will cause more problems in the future. I'm working on a patch to do that, but I ran into some unexpected problems. I think I should be able to finish it in a few days.

Boris, if it's important we get this in sooner, I think it would be fine to take your patch as a temporary measure.
This isn't "must land within days" urgent.  Would just like to have it for 20, if possible.
It might still be a good idea to add an assert to the JSCompartment ctor that the cast is safe.
Oh yeah, good idea. You were away, so I pushed with r=bz. Hope this is okay.

https://hg.mozilla.org/integration/mozilla-inbound/rev/7f8e7625b12e
Yeah, looks good to me.  Thank you, and thank you again for inlining this stuff!
"Improvement! Dromaeo (DOM) increase 1.99% on MacOSX 10.8 Mozilla-Inbound" ;)
https://hg.mozilla.org/mozilla-central/rev/eadbe7e8bfa2
https://hg.mozilla.org/mozilla-central/rev/7f8e7625b12e
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Regressions: 1687002