Open Bug 659404 Opened 14 years ago Updated 1 year ago

Measure TLS cost on platforms/compilers of interest

Categories

(Core :: JavaScript Engine, defect)

People

(Reporter: dmandelin, Unassigned)

Attachments

(3 files)

(Spun off from bug 659241 comment 8.)

We keep talking about using TLS for JSContexts or other things, and the perf question keeps coming up. It would be nice to know what the cost actually is. Let's do a round of measurements of TLS on at least x86/x64 MSVC and x64/ARM GCC.

Basic things we want to know:

- How long does it take to access something from TLS, in cycles? It seems OK to do a hot-cache experiment, because JSContext and anything needed to get to it should stay in good caches most of the time.
- How many loads or other slow ops need to run? I.e., a brief summary of how TLS is actually accessed, to enhance understanding.
- Ultimately, we'd like an estimator of how much it will affect us in practice: some estimate of how many times cx gets accessed per million cycles run, or something like that.
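A hot-cache experiment of the kind described above could look roughly like the sketch below. It is only an illustration, not the benchmark attached to this bug: the names `bench_global` and `bench_tls` are made up, it uses C11 `_Thread_local` (compiler-level TLS, so it says nothing about library-call TLS), and it times with `clock()`, which is coarse and needs a large round count to give a usable figure.

```c
#include <time.h>

static volatile long g_slot;                /* ordinary global */
static _Thread_local volatile long t_slot;  /* compiler-level TLS (C11) */

/* Processor time in seconds; coarse, but standard C. */
static double cpu_sec(void) {
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* One round = one store plus one load of the global. Returns ns per round. */
double bench_global(long rounds) {
    long seen = 0;
    double start = cpu_sec();
    for (long i = 0; i < rounds; i++) {
        g_slot = i;     /* volatile store, so it is not optimized away */
        seen = g_slot;  /* volatile load */
    }
    (void)seen;
    return (cpu_sec() - start) * 1e9 / (double)rounds;
}

/* Same loop against the thread-local slot. Returns ns per round. */
double bench_tls(long rounds) {
    long seen = 0;
    double start = cpu_sec();
    for (long i = 0; i < rounds; i++) {
        t_slot = i;
        seen = t_slot;
    }
    (void)seen;
    return (cpu_sec() - start) * 1e9 / (double)rounds;
}
```

Calling both functions with the same large round count and comparing the two results gives a per-round cost difference; multiplying nanoseconds by the clock rate in GHz converts it to cycles. Because the loop touches a single location repeatedly, it only ever measures the hot-cache case.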
We need to measure the cost of the NSPR call on Mac, Linux, and Windows, not the compiler intrinsics. The GCC intrinsics aren't always available (they're not on Mac) and the Windows ones are basically broken except on Windows 7 - though we might be able to deal with it based on how we load the XUL library. Mac has some "fast" pthread variant but it's not clear that you can use this outside of the Kits, since it requires a hardcoded constant and Apple has reserved one for each place they need it. v8, FWIW, does not use intrinsics. For IonMonkey we mostly want to use TLS to get at the allocation pool. For vector resizes this should be somewhat rare, but not for instruction allocation, so it's worth measuring.
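To measure a library call rather than a compiler intrinsic, the loop body has to go through the TLS API on every round. The sketch below uses POSIX `pthread_setspecific`/`pthread_getspecific` purely as a stand-in; for the measurement this comment asks for, the calls would be NSPR's `PR_SetThreadPrivate`/`PR_GetThreadPrivate` on an index from `PR_NewThreadPrivateIndex`. The function names `tls_bench_init` and `tls_round_trip` are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_key_t bench_key;

/* Create the key once, before any rounds run. Returns 0 on success. */
int tls_bench_init(void) {
    return pthread_key_create(&bench_key, NULL);
}

/*
 * One round = one library-call TLS write plus one library-call TLS read.
 * The value read back is dereferenced and summed so the calls cannot be
 * skipped. Returns 0 + 1 + ... + (rounds - 1), or -1 if a call fails.
 */
long tls_round_trip(long rounds) {
    long cell = 0;
    long sum = 0;
    for (long i = 0; i < rounds; i++) {
        cell = i;
        if (pthread_setspecific(bench_key, &cell) != 0)
            return -1;
        long *p = (long *)pthread_getspecific(bench_key);
        if (p == NULL)
            return -1;
        sum += *p;
    }
    return sum;
}
```

Wrapping `tls_round_trip` in any timer and comparing against the same loop with plain loads and stores gives the relative cost of the library path. On older glibc versions this may need to be linked with `-lpthread`.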
Assignee: general → adrake
Status: NEW → ASSIGNED
On latest OS X, 2.66 GHz Core i7.

adrake@charon:~/prbench$ ./glbench
1000000000 rounds in 4 seconds, ~250000000.000 rounds/sec
adrake@charon:~/prbench$ ./prbench
1000000000 rounds in 17 seconds, ~58823528.000 rounds/sec
On Fedora 15, x86_64, 2.66 GHz Core i7 (same machine).

[adrake@charon prbench]$ ./glbench
1000000000 rounds in 4 seconds, ~250000000.000 rounds/sec
[adrake@charon prbench]$ ./prbench
1000000000 rounds in 23 seconds, ~43478260.000 rounds/sec
On Windows 7, x86_64, 2.66 GHz Core i7 (same machine).

adrake@CHARON ~/Desktop/prbench
$ glbench.exe
1000000000 rounds in 5 seconds, ~200000000.000 rounds/sec
adrake@CHARON ~/Desktop/prbench
$ prbench
1000000000 rounds in 19 seconds, ~52631580.000 rounds/sec

On Windows 7, x86, 2.66 GHz Core i7 (same machine).

adrake@CHARON ~/Desktop/prbench
$ glbench.exe
1000000000 rounds in 4 seconds, ~0.000 rounds/sec
adrake@CHARON ~/Desktop/prbench
$ prbench
1000000000 rounds in 20 seconds, ~0.000 rounds/sec

Yes, I screwed up the rounds/sec on 32-bit. The time numbers are still good, though.

Some additional notes: there wasn't any detectable change in performance with different sorts of threads. The results were quite reproducible on all platforms: a full NSPR TLS read/write cycle was roughly 4x slower than doing two movs (which were not optimized away) in the same place.

I don't think this will noticeably slow down the engine by adding a few nanoseconds on top of every temporary allocation -- I suspect it will get lost in the noise. It may even end up being a win due to not needing a register or space in structures to pass around the generator or context, but I suspect the effect is likely to be very small.
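For reference, the reported times can be turned into per-round costs and a rough overhead estimate with a few lines of arithmetic. The helpers below are a sketch with made-up names; the only inputs taken from this bug are the round count, the elapsed seconds, and the 2.66 GHz clock. The access rate passed to `overhead_fraction` is not measured anywhere in this bug and has to be supplied by the caller.

```c
/* Nanoseconds per round, given total elapsed seconds and the round count. */
double ns_per_round(double seconds, double rounds) {
    return seconds * 1e9 / rounds;
}

/* Cycles per round, given ns per round and the clock rate in GHz. */
double cycles_per_round(double ns, double ghz) {
    return ns * ghz;  /* 1 GHz = 1 cycle per ns */
}

/*
 * Fraction of run time spent on the extra TLS cost, given the extra cycles
 * per access and how many accesses happen per million cycles run.
 */
double overhead_fraction(double extra_cycles_per_access,
                         double accesses_per_million_cycles) {
    return extra_cycles_per_access * accesses_per_million_cycles / 1e6;
}
```

With the OS X figures, `ns_per_round(4, 1e9)` is 4 ns and `ns_per_round(17, 1e9)` is 17 ns, which at 2.66 GHz is about 10.6 and 45.2 cycles per round, a difference of roughly 34.6 cycles. If, purely as a placeholder, cx were accessed 1000 times per million cycles and each access paid that full difference, `overhead_fraction(34.58, 1000)` would come to about 3.5%. The elapsed times are only resolved to whole seconds, so these figures are rough.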
(In reply to comment #7)
> I don't think this will noticeably slow down the engine by adding a few
> nanoseconds on top of every temporary allocation -- I suspect it will get
> lost in the noise. It may even end up being a win due to not needing a
> register or space in structures to pass around the generator or context, but
> I suspect the effect is likely to be very small.

My rough estimate is the same. So let's try this in IonMonkey, benchmarking compilation performance before and after the change. If it doesn't slow anything down, we can consider doing it for the rest of the engine.
On Windows, don't use the NSPR functions, just use TlsAlloc directly: it's somewhat faster and has less overhead. Just note that there is a limited number (256) of TLS allocations, so you should really just have one for all of SpiderMonkey.
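One way to get by with a single slot is to store a pointer to a per-thread struct that bundles everything the engine needs. The sketch below is an illustration under assumptions: `PerThreadData`, its fields, and the `engine_tls_*` names are hypothetical, not SpiderMonkey APIs. It calls `TlsAlloc`, `TlsSetValue`, and `TlsGetValue` directly on Windows and falls back to a pthread key elsewhere so the same code builds on other platforms.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
typedef DWORD tls_slot_t;
static int slot_create(tls_slot_t *s) {
    *s = TlsAlloc();
    return *s != TLS_OUT_OF_INDEXES;
}
static int slot_set(tls_slot_t s, void *p) { return TlsSetValue(s, p) != 0; }
static void *slot_get(tls_slot_t s) { return TlsGetValue(s); }
#else
#include <pthread.h>
typedef pthread_key_t tls_slot_t;
static int slot_create(tls_slot_t *s) { return pthread_key_create(s, NULL) == 0; }
static int slot_set(tls_slot_t s, void *p) { return pthread_setspecific(s, p) == 0; }
static void *slot_get(tls_slot_t s) { return pthread_getspecific(s); }
#endif

/* Hypothetical bundle of all per-thread engine state behind one slot. */
typedef struct PerThreadData {
    void *cx;          /* e.g. the current context */
    void *alloc_pool;  /* e.g. a compiler allocation pool */
} PerThreadData;

static tls_slot_t engine_slot;  /* the only TLS slot the engine allocates */

/* Allocate the slot once at startup. Returns nonzero on success. */
int engine_tls_init(void) { return slot_create(&engine_slot); }

/* Attach this thread's data to the slot. Returns nonzero on success. */
int engine_tls_bind(PerThreadData *data) { return slot_set(engine_slot, data); }

/* Fetch this thread's data, or NULL if nothing has been bound yet. */
PerThreadData *engine_tls_get(void) {
    return (PerThreadData *)slot_get(engine_slot);
}
```

Adding another piece of per-thread state then means adding a field to the struct rather than allocating another slot, at the cost of one extra pointer dereference per access.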
Assignee: adrake → general
Assignee: general → nobody
No assignee, updating the status.
Status: ASSIGNED → NEW
a11y-review: requested → ---
Severity: normal → S3