Closed Bug 1058631 Opened 5 years ago Closed 5 years ago

Some small nursery GC performance improvements

Categories

(Core :: JavaScript: GC, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla34

People

(Reporter: jandem, Assigned: jandem)

Details

Attachments

(2 files)

I'll attach some patches that make the micro-benchmark below at least 10% faster on OS X. They also improve things on Linux and Windows, though especially on Linux the win appears to be smaller.

The patches also win a few hundred points each on Octane earley-boyer.

function f() {
    var prevO = null;
    for (var i=0; i<100000000; i++) {
	var o = {x: 1, y: 2, z: 3, prev: prevO};
	if (i % 8 === 0)
	    prevO = o;
    }
}
var t = new Date;
f();
print(new Date - t);
* Clang was not inlining Nursery::moveSlotsToTenured and Nursery::moveElementsToTenured into their only caller, I added MOZ_ALWAYS_INLINE to force inlining.

* Flags CrashAtUnhandlableOOM as MOZ_NORETURN, to hint to the compiler calls to it are cold and won't return.

* Some minor changes to avoid a few dereferences. These are unlikely to matter much, but very straight-forward so I left them in.

This patch improves the micro-benchmark in comment 0 from 1604 to 1502 ms on OS X 32-bit.
Attachment #8479080 - Flags: review?(terrence)
Since you're micro-optimizing: PodCopy, used in the copy operations, is not clever for most slot arrays, in fact it is fairly strange: For 128 elements or more PodCopy becomes a single memcpy.  For less than 128 elements it becomes a loop plus one memcpy call per element.

Of course it's possible that a good C++ compiler does something clever with that.
Comment on attachment 8479080 [details] [diff] [review]
Part 1 - Assorted tweaks

Review of attachment 8479080 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!
Attachment #8479080 - Flags: review?(terrence) → review+
Clang was emitting a call to a generic memcpy function. This patch adds an AllocKind switch statement so that each of the memcpy calls has a constant "size" argument, and the compiler can do something smarter.

For instance, for the FINALIZE_OBJECT0_BACKGROUND switch-case, Clang now emits the following instead of a memcpy call:

	movsd	(%ebx), %xmm0
	movsd	8(%ebx), %xmm1
	movsd	%xmm1, 8(%esi)
	movsd	%xmm0, (%esi)
Attachment #8479123 - Flags: review?(terrence)
Part 2 also wins about 100 ms on the comment 0 micro-benchmark, 1502 -> 1408 ms on OS X 32-bit.
(In reply to Lars T Hansen [:lth] from comment #2)
> Since you're micro-optimizing: PodCopy, used in the copy operations, is not
> clever for most slot arrays, in fact it is fairly strange: For 128 elements
> or more PodCopy becomes a single memcpy.  For less than 128 elements it
> becomes a loop plus one memcpy call per element.
> 
> Of course it's possible that a good C++ compiler does something clever with
> that.

Yeah, I think it's assuming that C++ compilers do something clever with small constant-size memcpy but not with variable-size memcpy (or at least not all compilers).
I do know that the < 128 elements thing was based on performance measurements; Waldo probably remembers more of the details.
Comment on attachment 8479123 [details] [diff] [review]
Part 2 - Specialize memcpy

I see a small regression with this patch on OS X x64 and Linux x64. Not sure why it behaves differently on x64 but it's probably not worth the complexity then.
Attachment #8479123 - Flags: review?(terrence)
https://hg.mozilla.org/mozilla-central/rev/0556ceb562e3
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
You need to log in before you can comment on or make changes to this bug.