Open
Bug 659404
Opened 14 years ago
Updated 1 year ago
Measure TLS cost on platforms/compilers of interest
Categories
(Core :: JavaScript Engine, defect)
NEW
People
(Reporter: dmandelin, Unassigned)
Attachments
(3 files)
(Spun off from bug 659241 comment 8.)
We keep talking about using TLS for JSContexts or other things, and the perf question keeps coming up. It would be nice to know what the cost actually is. Let's do a round of measurements of TLS on at least x86/x64 MSVC and x64/ARM GCC. Basic things we want to know:
- How long does it take to access something from TLS in cycles? It seems OK to do a hot-cache experiment, because JSContext and anything needed to get to it should stay in good caches most of the time.
- How many loads or other slow ops need to run? I.e., a brief summary of how TLS is actually accessed, to enhance understanding.
- Ultimately, we'd like an estimator of how much it will affect us in practice; some estimate of how many times cx gets accessed per million cycles run or something like that.
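A hot-cache experiment along the lines of the first bullet could look roughly like the sketch below. It is only an illustration: it uses C++11 `thread_local` as a stand-in for whatever TLS mechanism is being measured, and the names `benchGlobal`, `benchTls`, and `BenchResult` are made up for this example. It reports wall-clock nanoseconds per round rather than cycles.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical result record for one benchmark run.
struct BenchResult {
    std::uint64_t rounds;
    double nsPerRound;
};

static int gDummy = 0;
// Baseline: a plain global slot. volatile keeps the loads/stores in the loop.
static void* volatile gSlot = nullptr;
// Slot under test: C++11 thread_local (an assumption, not NSPR).
static thread_local void* volatile tSlot = nullptr;
// Sink so the accumulated value is observably used.
static volatile std::uintptr_t gSink = 0;

// One write plus one read of the global slot per round.
BenchResult benchGlobal(std::uint64_t rounds) {
    assert(rounds > 0);
    std::uintptr_t sum = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < rounds; ++i) {
        gSlot = &gDummy;
        sum += reinterpret_cast<std::uintptr_t>(gSlot);
    }
    auto stop = std::chrono::steady_clock::now();
    gSink = sum;
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return BenchResult{rounds, ns / static_cast<double>(rounds)};
}

// Same loop against the thread-local slot. The slot is named inside the loop
// on purpose; a compiler may still hoist the TLS address computation, so the
// generated code should be inspected before trusting the numbers.
BenchResult benchTls(std::uint64_t rounds) {
    assert(rounds > 0);
    std::uintptr_t sum = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < rounds; ++i) {
        tSlot = &gDummy;
        sum += reinterpret_cast<std::uintptr_t>(tSlot);
    }
    auto stop = std::chrono::steady_clock::now();
    gSink = sum;
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return BenchResult{rounds, ns / static_cast<double>(rounds)};
}
```

Calling both functions with the same round count and comparing `nsPerRound` gives a rough relative cost. Measuring an API-based mechanism (for example NSPR's thread-private calls) would mean replacing the body of the second loop with those calls.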
We need to measure the cost of the NSPR call on Mac, Linux, and Windows, not the compiler intrinsics. The GCC intrinsics aren't always available (they're not on Mac) and the Windows ones are basically broken except on Windows 7 - though we might be able to deal with it based on how we load the XUL library. Mac has some "fast" pthread variant but it's not clear that you can use this outside of the Kits, since it requires a hardcoded constant and Apple has reserved one for each place they need it.
V8, FWIW, does not use intrinsics.
For IonMonkey we mostly want to use TLS to get at the allocation pool. For vector resizes this should be somewhat rare, but not for instruction allocation, so it's worth measuring.
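To make the IonMonkey use case concrete, here is a minimal sketch of reaching a temporary allocation pool through TLS instead of passing it around. Everything in it is hypothetical (`TempPool`, `AutoPool`, `allocateTemp`, the fixed 4096-byte buffer); it is not IonMonkey's actual allocator, and it uses C++11 `thread_local` rather than NSPR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size bump allocator standing in for a compiler temp pool.
class TempPool {
    alignas(std::max_align_t) unsigned char buffer_[4096];
    std::size_t used_ = 0;

  public:
    // Returns aligned storage, or nullptr when the pool is exhausted.
    void* allocate(std::size_t bytes) {
        const std::size_t align = alignof(std::max_align_t);
        if (bytes > sizeof(buffer_))
            return nullptr;  // also guards the rounding below against overflow
        std::size_t rounded = (bytes + align - 1) & ~(align - 1);
        if (rounded > sizeof(buffer_) - used_)
            return nullptr;
        void* p = buffer_ + used_;
        used_ += rounded;
        return p;
    }
    std::size_t used() const { return used_; }
};

// The one TLS read that every allocation pays for.
static thread_local TempPool* tCurrentPool = nullptr;

// Installs a pool for the current thread for the lifetime of the scope.
class AutoPool {
    TempPool* previous_;

  public:
    explicit AutoPool(TempPool* pool) : previous_(tCurrentPool) { tCurrentPool = pool; }
    ~AutoPool() { tCurrentPool = previous_; }
    AutoPool(const AutoPool&) = delete;
    AutoPool& operator=(const AutoPool&) = delete;
};

// Allocation entry point: no pool or context parameter needed by callers.
void* allocateTemp(std::size_t bytes) {
    assert(tCurrentPool != nullptr);
    return tCurrentPool->allocate(bytes);
}
```

In this shape, instruction allocation hits the TLS read on every call to `allocateTemp`, which is the frequent path worth measuring, while callers no longer need to thread a pool pointer through their signatures.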
Comment 2•14 years ago
Assignee: general → adrake
Status: NEW → ASSIGNED
Comment 3•14 years ago
Comment 4•14 years ago
Comment 5•14 years ago
On latest OS X, 2.66 GHz Core i7.
adrake@charon:~/prbench$ ./glbench
1000000000 rounds in 4 seconds, ~250000000.000 rounds/sec
adrake@charon:~/prbench$ ./prbench
1000000000 rounds in 17 seconds, ~58823528.000 rounds/sec
Comment 6•14 years ago
On Fedora 15, x86_64, 2.66 GHz Core i7 (same machine).
[adrake@charon prbench]$ ./glbench
1000000000 rounds in 4 seconds, ~250000000.000 rounds/sec
[adrake@charon prbench]$ ./prbench
1000000000 rounds in 23 seconds, ~43478260.000 rounds/sec
Comment 7•14 years ago
On Windows 7, x86_64, 2.66 GHz Core i7 (same machine).
adrake@CHARON ~/Desktop/prbench
$ glbench.exe
1000000000 rounds in 5 seconds, ~200000000.000 rounds/sec
adrake@CHARON ~/Desktop/prbench
$ prbench
1000000000 rounds in 19 seconds, ~52631580.000 rounds/sec
On Windows 7, x86, 2.66 GHz Core i7 (same machine).
adrake@CHARON ~/Desktop/prbench
$ glbench.exe
1000000000 rounds in 4 seconds, ~0.000 rounds/sec
adrake@CHARON ~/Desktop/prbench
$ prbench
1000000000 rounds in 20 seconds, ~0.000 rounds/sec
Yes, I screwed up the rounds/sec on 32-bit. The time numbers are still good, though.
Some additional notes: there wasn't any detectable change in performance with different sorts of threads. The results were quite reproducible on all platforms: a full NSPR TLS read/write cycle was ~4x slower than doing two movs (which were not optimized away) in the same place.
I don't think this will noticeably slow down the engine by adding a few nanoseconds on top of every temporary allocation -- I suspect it will get lost in the noise. It may even end up being a win due to not needing a register or space in structures to pass around the generator or context, but I suspect the effect is likely to be very small.
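For reference, the timings above can be turned into per-round figures with a little arithmetic. The sketch below is just that arithmetic; the helper names are made up, and the cycle estimate assumes the nominal 2.66 GHz clock with no frequency scaling.

```cpp
#include <cassert>

// Nanoseconds per round from a total wall-clock time in seconds.
double nsPerRound(double seconds, double rounds) {
    assert(rounds > 0);
    return seconds * 1e9 / rounds;
}

// Approximate cycles per round at a nominal clock rate given in GHz
// (1 GHz is one cycle per nanosecond).
double cyclesPerRound(double ns, double ghz) {
    return ns * ghz;
}

// How many times slower the TLS run was than the baseline run.
double slowdown(double tlsSeconds, double baselineSeconds) {
    assert(baselineSeconds > 0);
    return tlsSeconds / baselineSeconds;
}
```

With 1000000000 rounds, the OS X run works out to 17 ns per round for prbench against 4 ns for glbench, or about 45 versus about 11 cycles at 2.66 GHz. Across the four configurations the ratio ranges from 3.8x (Windows x86_64, 19 s vs 5 s) to 5.75x (Fedora, 23 s vs 4 s). Since the timer only resolves whole seconds, these are rough figures.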
Comment 8•14 years ago
(In reply to comment #7)
> I don't think this will noticeably slow down the engine by adding a few
> nanoseconds on top of every temporary allocation -- I suspect it will get
> lost in the noise. It may even end up being a win due to not needing a
> register or space in structures to pass around the generator or context, but
> I suspect the effect is likely to be very small.
My rough estimate is the same. So let's try this in IonMonkey, benchmarking compilation performance before and after the change. If it doesn't slow anything down, we can consider doing it for the rest of the engine.
Comment 9•14 years ago
On Windows, don't use the NSPR functions; just use TlsAlloc directly: it's somewhat faster and has less overhead. Just note that there is a limited number (256) of TLS allocations, so you should really just have one for all of SpiderMonkey.
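A rough sketch of the "one slot for everything" idea: a single TLS index holds a pointer to a struct bundling all per-thread state. The struct and function names are hypothetical, not SpiderMonkey's. The Windows branch uses `TlsAlloc`/`TlsSetValue`/`TlsGetValue`; the other branch falls back to C++11 `thread_local` purely so the sketch builds elsewhere.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical bundle of everything the engine wants per thread.
struct PerThreadData {
    void* cx;        // stand-in for a JSContext*
    void* tempPool;  // stand-in for an allocation pool
};

#ifdef _WIN32
// The single TLS index shared by the whole engine.
static DWORD gTlsIndex = TLS_OUT_OF_INDEXES;

// Call once at startup, before any other thread uses the slot.
bool initPerThreadSlot() {
    if (gTlsIndex == TLS_OUT_OF_INDEXES)
        gTlsIndex = TlsAlloc();
    return gTlsIndex != TLS_OUT_OF_INDEXES;
}

bool setPerThreadData(PerThreadData* data) {
    return TlsSetValue(gTlsIndex, data) != 0;
}

PerThreadData* getPerThreadData() {
    return static_cast<PerThreadData*>(TlsGetValue(gTlsIndex));
}
#else
// Non-Windows fallback (an assumption for portability of this sketch).
static thread_local PerThreadData* tData = nullptr;

bool initPerThreadSlot() { return true; }

bool setPerThreadData(PerThreadData* data) {
    tData = data;
    return true;
}

PerThreadData* getPerThreadData() { return tData; }
#endif

// Example accessor: one TLS lookup, then an ordinary field load.
void* currentContext() {
    PerThreadData* data = getPerThreadData();
    assert(data != nullptr);
    return data->cx;
}
```

Adding another per-thread item then means adding a field to `PerThreadData` rather than consuming another TLS index.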
Updated•14 years ago
Assignee: adrake → general
Updated•11 years ago
Assignee: general → nobody
Updated•3 years ago
Severity: normal → S3