Bug 686017 - Cell::arenaHeader() should be avoided on the fast paths
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: JavaScript Engine
Version: unspecified
Platform: All All
Importance: -- normal
Target Milestone: mozilla9
Assigned To: Igor Bukanov
Blocks: 671702, 686144
Reported: 2011-09-09 14:38 PDT by Igor Bukanov
Modified: 2011-09-12 06:49 PDT
CC: 6 users


Attachments:
v1 (64.90 KB, patch) by Igor Bukanov, 2011-09-09 14:44 PDT; bhackett1024: review+
v1 (60.59 KB, patch) by Igor Bukanov, 2011-09-09 14:50 PDT; bhackett1024: review+

Description Igor Bukanov 2011-09-09 14:38:41 PDT
Currently, in a few places outside the GC, we use Cell::arenaHeader() to access the compartment of a GC thing or its allocation kind. While experimenting with an alternative layout for GC things, I discovered that such access is not free: it involves some arithmetic operations and a read from the start of the arena, which may not be in the cache.

In particular, changing JSObject::nativeSearch to obtain the JSRuntime instance via cx rather than via Cell::compartment()->rt, and calculating the allocation kind in CopyInitializerObject from the number of fixed slots in the object rather than via Cell::allocKind(), resulted in a 0.5% win in V8. Most of the improvement comes from the DeltaBlue benchmark, which is 2.6% faster with the patch. The patch does not affect SunSpider scores.
Comment 1 Igor Bukanov 2011-09-09 14:44:18 PDT
Created attachment 559581 [details] [diff] [review]
v1

This patch implements the above-mentioned elimination of the Cell::compartment()->rt and Cell::allocKind() calls in a couple of places. Since the patch adds a JSContext *cx parameter to JSObject::nativeSearch, it affected quite a number of callers and their signatures.
Comment 2 Igor Bukanov 2011-09-09 14:50:14 PDT
Created attachment 559585 [details] [diff] [review]
v1

The previous attachment contains unrelated changes.
Comment 3 Igor Bukanov 2011-09-10 10:25:50 PDT
The patch here helps to minimize the regressions from the smaller chunk sizes and different chunk layouts in bug 671702.
Comment 4 Brian Hackett (:bhackett) 2011-09-11 12:39:22 PDT
Comment on attachment 559581 [details] [diff] [review]
v1

Makes sense, I've seen this cost for accessing things from the arena header before.  I'm surprised that changing CopyInitializerObject made a difference, though.  With TI disabled this function will get called a lot, but with TI enabled initializer objects should normally be constructed in jitcode.  Do you know how much CopyInitializerObject gets called in v8bench with TI enabled?
Comment 5 Igor Bukanov 2011-09-11 13:21:44 PDT
(In reply to Brian Hackett from comment #4)
> I'm surprised that changing CopyInitializerObject made a
> difference, though.  With TI disabled this function will get called a lot,
> but with TI enabled initializer objects should normally be constructed in
> jitcode.  Do you know how much CopyInitializerObject gets called in v8bench
> with TI enabled?

With TI it is called just 7141 times during splay. Without TI the number goes up to 509056. So maybe the speedup from that change is just noise. On the other hand, with 100 V8 runs on a system running only a VNC session and a shell for the benchmark, I did observe a speedup of around 1 ms from that extra change. That can be explained if all 7141 calls resulted in cache misses: that would translate into 250*7141, or about 1.8e6 cycles, which is about 0.5 ms. So the numbers roughly match. The question is: does the cache-miss theory sound plausible?
Comment 6 Brian Hackett (:bhackett) 2011-09-11 13:32:41 PDT
(In reply to Igor Bukanov from comment #5)
> With TI it is called just 7141 times during splay. Without TI the number
> goes up to 509056. So maybe the speedup from that change is just noise. On
> the other hand, with 100 V8 runs on a system running only a VNC session and
> a shell for the benchmark, I did observe a speedup of around 1 ms from that
> extra change. That can be explained if all 7141 calls resulted in cache
> misses: that would translate into 250*7141, or about 1.8e6 cycles, which is
> about 0.5 ms. So the numbers roughly match. The question is: does the
> cache-miss theory sound plausible?

Yeah, that sounds reasonable. Once we have a generational GC and aren't constantly taking cache misses during allocation, these ancillary sources of misses will become more and more important to track down.
