Bug 686017 - Cell::arenaHeader() should be avoided on the fast paths

Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: JavaScript Engine
Version: unspecified
Platform: All All
Priority/Severity: -- normal
Target Milestone: mozilla9
Assigned To: Igor Bukanov
Blocks: 671702, 686144

Reported: 2011-09-09 14:38 PDT by Igor Bukanov
Modified: 2011-09-12 06:49 PDT
CC: 6 users


Attachments
v1 (64.90 KB, patch) - 2011-09-09 14:44 PDT, Igor Bukanov - bhackett1024: review+
v1 (60.59 KB, patch) - 2011-09-09 14:50 PDT, Igor Bukanov - bhackett1024: review+

Description Igor Bukanov 2011-09-09 14:38:41 PDT
Currently, in a few places outside the GC, we use Cell::arenaHeader() to access the compartment for a GC thing or its allocation kind. While experimenting with an alternative layout for GC things, I discovered that such access is not free: it involves some arithmetic operations and a read from the start of the arena, which may not be in the cache.

In particular, changing JSObject::nativeSearch to use cx to get the JSRuntime instance rather than Cell::compartment()->rt, and computing the allocation kind in CopyInitializerObject from the number of fixed slots in the object rather than using Cell::allocKind, resulted in a 0.5% win on V8. Most of the improvement comes from the deltablue benchmark, which is 2.6% faster with the patch. The patch does not affect SunSpider scores.
Comment 1 Igor Bukanov 2011-09-09 14:44:18 PDT
Created attachment 559581 [details] [diff] [review]
v1

This patch implements the above-mentioned elimination of the Cell::compartment()->rt and Cell::allocKind() calls in a couple of places. Since the patch adds a JSContext *cx parameter to JSObject::nativeSearch, it affects quite a number of callers and their signatures.
Comment 2 Igor Bukanov 2011-09-09 14:50:14 PDT
Created attachment 559585 [details] [diff] [review]
v1

The previous attachment contains unrelated changes.
Comment 3 Igor Bukanov 2011-09-10 10:25:50 PDT
The patch here makes it possible to minimize the regressions from the smaller chunk sizes and different chunk layouts in bug 671702.
Comment 4 Brian Hackett (:bhackett) 2011-09-11 12:39:22 PDT
Comment on attachment 559581 [details] [diff] [review]
v1

Makes sense, I've seen this cost for accessing things from the arena header before.  I'm surprised that changing CopyInitializerObject made a difference, though.  With TI disabled this function will get called a lot, but with TI enabled initializer objects should normally be constructed in jitcode.  Do you know how much CopyInitializerObject gets called in v8bench with TI enabled?
Comment 5 Igor Bukanov 2011-09-11 13:21:44 PDT
(In reply to Brian Hackett from comment #4)
> I'm surprised that changing CopyInitializerObject made a
> difference, though.  With TI disabled this function will get called a lot,
> but with TI enabled initializer objects should normally be constructed in
> jitcode.  Do you know how much CopyInitializerObject gets called in v8bench
> with TI enabled?

With TI it is called just 7141 times during splay. Without TI the number goes up to 509056. So maybe the speedup from that change is just noise. On the other hand, over 100 V8 runs on a system running only a VNC session and a shell for the benchmark, I did observe a speedup of around 1 ms from that extra change. That can be explained if those 7141 calls all resulted in cache misses: that would translate into 250*7141, or about 1.8e6 cycles, which is about 0.5 ms. So the numbers roughly match. The question is: does the cache-miss theory sound plausible?
Comment 6 Brian Hackett (:bhackett) 2011-09-11 13:32:41 PDT
(In reply to Igor Bukanov from comment #5)
> With TI it is called just 7141 times during splay. Without TI the number
> goes up to 509056. So maybe the speedup from that change is just noise. On
> the other hand, over 100 V8 runs on a system running only a VNC session and
> a shell for the benchmark, I did observe a speedup of around 1 ms from that
> extra change. That can be explained if those 7141 calls all resulted in
> cache misses: that would translate into 250*7141, or about 1.8e6 cycles,
> which is about 0.5 ms. So the numbers roughly match. The question is: does
> the cache-miss theory sound plausible?

Yeah, that sounds reasonable.  Once we have a generational GC and aren't constantly taking cache misses during allocation, these ancillary sources of misses will become more and more important to track down.
Comment 8 :Ehsan Akhgari (busy, don't ask for review please) 2011-09-12 06:49:43 PDT
http://hg.mozilla.org/mozilla-central/rev/f3908eb90151
