
Cell::arenaHeader() should be avoided on the fast paths

RESOLVED FIXED in mozilla9

Status

Product: Core
Component: JavaScript Engine
Status: RESOLVED FIXED
Reported: 6 years ago
Last modified: 6 years ago

People

(Reporter: Igor Bukanov, Assigned: Igor Bukanov)

Tracking

Version: unspecified
Target Milestone: mozilla9
Points: ---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 1 obsolete attachment)

v1
60.59 KB, patch
bhackett
: review+
Details | Diff | Splinter Review
(Assignee)

Description

6 years ago
Currently, in a few places outside the GC, we use Cell::arenaHeader() to access the compartment for a GC thing or its allocation kind. While experimenting with an alternative layout for GC things, I discovered that such access is not free: it involves some arithmetic operations and a read from the start of the arena, which may not be in cache.

In particular, changing JSObject::nativeSearch to get the JSRuntime instance from cx rather than from Cell::compartment()->rt, and calculating the allocation kind in CopyInitializerObject from the number of fixed slots in the object rather than using Cell::allocKind, resulted in a 0.5% win on V8. Most of the improvement comes from the deltablue benchmark, which is 2.6% faster with the patch. The patch does not affect SunSpider scores.
(Assignee)

Comment 1

6 years ago
Created attachment 559581 [details] [diff] [review]
v1

This patch implements the above-mentioned elimination of the Cell::compartment()->rt and Cell::allocKind() calls in a couple of places. Since the patch adds a JSContext *cx parameter to JSObject::nativeSearch, it affected quite a number of callers and their signatures.
Attachment #559581 - Flags: review?(bhackett1024)
(Assignee)

Comment 2

6 years ago
Created attachment 559585 [details] [diff] [review]
v1

The previous attachment contains unrelated changes.
Attachment #559581 - Attachment is obsolete: true
Attachment #559581 - Flags: review?(bhackett1024)
Attachment #559585 - Flags: review?(bhackett1024)
(Assignee)

Comment 3

6 years ago
The patch here helps minimize the regressions from the smaller chunk sizes and different chunk layouts in bug 671702.
Blocks: 671702
(Assignee)

Updated

6 years ago
Blocks: 686144
Comment 4

6 years ago

Comment on attachment 559581 [details] [diff] [review]
v1

Makes sense, I've seen this cost for accessing things from the arena header before.  I'm surprised that changing CopyInitializerObject made a difference, though.  With TI disabled this function will get called a lot, but with TI enabled initializer objects should normally be constructed in jitcode.  Do you know how much CopyInitializerObject gets called in v8bench with TI enabled?
Attachment #559581 - Flags: review+
Attachment #559585 - Flags: review?(bhackett1024) → review+
(Assignee)

Comment 5

6 years ago
(In reply to Brian Hackett from comment #4)
> I'm surprised that changing CopyInitializerObject made a
> difference, though.  With TI disabled this function will get called a lot,
> but with TI enabled initializer objects should normally be constructed in
> jitcode.  Do you know how much CopyInitializerObject gets called in v8bench
> with TI enabled?

With TI it is called just 7141 times during splay. Without TI the number goes up to 509056. So maybe the speedup from that change is just noise. On the other hand, over 100 V8 runs on a system running only a VNC session and a shell for the benchmark, I did observe a speedup of around 1 ms from that extra change. That can be explained if all 7141 calls resulted in cache misses: that would translate into 250*7141, or about 1.8e6 cycles, which is about 0.5 ms. So the numbers roughly match. The question is: does the cache-miss theory sound plausible?
(In reply to Igor Bukanov from comment #5)
> With TI it is called just 7141 times during splay. Without TI the number
> goes up to 509056. So maybe the speedup from that change is just noise. On
> the other hand, over 100 V8 runs on a system running only a VNC session and
> a shell for the benchmark, I did observe a speedup of around 1 ms from that
> extra change. That can be explained if all 7141 calls resulted in cache
> misses: that would translate into 250*7141, or about 1.8e6 cycles, which is
> about 0.5 ms. So the numbers roughly match. The question is: does the
> cache-miss theory sound plausible?

Yeah, that sounds reasonable.  Once we have a generational GC and aren't constantly taking cache misses during allocation these ancillary sources of misses will become more and more important to track down.
(Assignee)

Comment 7

6 years ago
http://hg.mozilla.org/integration/mozilla-inbound/rev/f3908eb90151
http://hg.mozilla.org/mozilla-central/rev/f3908eb90151
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla9