Open Bug 659404 Opened 14 years ago Updated 1 year ago

Measure TLS cost on platforms/compilers of interest

Status:

NEW

People

(Reporter: dmandelin, Unassigned)

Attachments

(3 files)

Preliminary TLS Benchmark, 14 years ago, Andrew Drake [:adrake], 877 bytes, text/plain
Preliminary Control Benchmark, 14 years ago, Andrew Drake [:adrake], 774 bytes, text/plain
Preliminary Benchmark Makefile, 14 years ago, Andrew Drake [:adrake], 85 bytes, text/plain

David Mandelin [:dmandelin]

Reporter

Description

14 years ago

(Spun off from bug 659241 comment 8.)

We keep talking about using TLS for JSContexts or other things, and the perf question keeps coming up. It would be nice to know what the cost actually is. Let's do a round of measurements of TLS on at least x86/x64 MSVC and x64/ARM GCC.

Basic things we want to know:

- How long does it take to access something from TLS, in cycles? A hot-cache experiment seems OK, because JSContext and anything needed to get to it should stay in good caches most of the time.

- How many loads or other slow ops need to run? I.e., a brief summary of how TLS is actually accessed, to enhance understanding.

- Ultimately, we'd like an estimator of how much it will affect us in practice: some estimate of how many times cx gets accessed per million cycles run, or something like that.
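A minimal sketch of the kind of hot-cache experiment described above, assuming a C11 compiler with `_Thread_local`. It is not the benchmark attached to this bug: all names (`tls_slot`, `global_slot`, `time_tls_rounds`, `time_global_rounds`) are hypothetical, and it only exercises compiler-managed TLS, not library-level TLS calls. The control loop does the same write and read against a plain global.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical names for illustration only. */
static _Thread_local volatile uintptr_t tls_slot; /* compiler-managed TLS */
static volatile uintptr_t global_slot;            /* control: plain global */
static volatile uintptr_t sink;                   /* keeps results live */

/* CPU seconds spent on `rounds` write+read cycles against the TLS slot. */
double time_tls_rounds(uint64_t rounds)
{
    assert(rounds > 0);
    uintptr_t acc = 0;
    clock_t start = clock();
    for (uint64_t i = 0; i < rounds; i++) {
        tls_slot = (uintptr_t)i; /* volatile write, not optimized away */
        acc += tls_slot;         /* volatile read back */
    }
    clock_t end = clock();
    sink = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Same loop against the plain global, as the baseline. */
double time_global_rounds(uint64_t rounds)
{
    assert(rounds > 0);
    uintptr_t acc = 0;
    clock_t start = clock();
    for (uint64_t i = 0; i < rounds; i++) {
        global_slot = (uintptr_t)i;
        acc += global_slot;
    }
    clock_t end = clock();
    sink = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling both functions with the same large round count and dividing the two results gives a rough relative cost. Because the same slot is touched every iteration, the data stays in cache, which matches the hot-cache assumption but says nothing about cold-cache behavior.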

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 1

14 years ago

We need to measure the cost of the NSPR call on Mac, Linux, and Windows, not the compiler intrinsics. The GCC intrinsics aren't always available (they're not on Mac) and the Windows ones are basically broken except on Windows 7 - though we might be able to deal with it based on how we load the XUL library. Mac has some "fast" pthread variant but it's not clear that you can use this outside of the Kits, since it requires a hardcoded constant and Apple has reserved one for each place they need it.

v8, FWIW, does not use intrinsics.

For IonMonkey we mostly want to use TLS to get at the allocation pool. For vector resizes this should be somewhat rare, but not for instruction allocation, so it's worth measuring.

Andrew Drake [:adrake]

Comment 2

14 years ago

Attached file: Preliminary TLS Benchmark

Assignee: general → adrake

Status: NEW → ASSIGNED

Andrew Drake [:adrake]

Comment 3

14 years ago

Attached file: Preliminary Control Benchmark

Andrew Drake [:adrake]

Comment 4

14 years ago

Attached file: Preliminary Benchmark Makefile

Andrew Drake [:adrake]

Comment 5

14 years ago

On latest OS X, 2.66 GHz Core i7.

adrake@charon:~/prbench$ ./glbench
1000000000 rounds in 4 seconds, ~250000000.000 rounds/sec
adrake@charon:~/prbench$ ./prbench
1000000000 rounds in 17 seconds, ~58823528.000 rounds/sec

Andrew Drake [:adrake]

Comment 6

14 years ago

On Fedora 15, x86_64, 2.66 GHz Core i7 (same machine).

[adrake@charon prbench]$ ./glbench
1000000000 rounds in 4 seconds, ~250000000.000 rounds/sec
[adrake@charon prbench]$ ./prbench
1000000000 rounds in 23 seconds, ~43478260.000 rounds/sec

Andrew Drake [:adrake]

Comment 7

14 years ago

On Windows 7, x86_64, 2.66 GHz Core i7 (same machine).

adrake@CHARON ~/Desktop/prbench
$ glbench.exe
1000000000 rounds in 5 seconds, ~200000000.000 rounds/sec
adrake@CHARON ~/Desktop/prbench
$ prbench
1000000000 rounds in 19 seconds, ~52631580.000 rounds/sec

On Windows 7, x86, 2.66 GHz Core i7 (same machine).

adrake@CHARON ~/Desktop/prbench
$ glbench.exe
1000000000 rounds in 4 seconds, ~0.000 rounds/sec
adrake@CHARON ~/Desktop/prbench
$ prbench
1000000000 rounds in 20 seconds, ~0.000 rounds/sec

Yes, I screwed up the rounds/sec on 32-bit. The time numbers are still good, though.

Some additional notes: there wasn't any detectable change in performance with different sorts of threads. The results were quite reproducible on all platforms: a full NSPR TLS read/write cycle is ~4x slower than doing two movs (which were not optimized away) in the same place.

I don't think this will noticeably slow down the engine by adding a few nanoseconds on top of every temporary allocation -- I suspect it will get lost in the noise. It may even end up being a win due to not needing a register or space in structures to pass around the generator or context, but I suspect the effect is likely to be very small.
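The timings reported in comments 5 through 7 can be converted to a per-round cost with simple arithmetic. The helpers below are a sketch with hypothetical names; the cycle estimate assumes the core runs at its nominal 2.66 GHz throughout (no frequency scaling), and the inputs are only accurate to the whole-second granularity of the reported times.

```c
#include <assert.h>

/* Nanoseconds per round, given total wall time and round count. */
double ns_per_round(double seconds, double rounds)
{
    assert(rounds > 0.0);
    return seconds * 1e9 / rounds;
}

/* Approximate cycles per round, assuming a fixed nominal clock in GHz. */
double cycles_per_round(double seconds, double rounds, double clock_ghz)
{
    return ns_per_round(seconds, rounds) * clock_ghz;
}

/* How many times slower the TLS loop is than the control loop. */
double slowdown(double tls_seconds, double control_seconds)
{
    assert(control_seconds > 0.0);
    return tls_seconds / control_seconds;
}
```

For the OS X run (17 seconds versus 4 seconds for 1000000000 rounds), this gives 17 ns versus 4 ns per round, or roughly 45 versus 11 cycles under the nominal-clock assumption, a slowdown of 4.25x. Each round is one read/write pair, so the cost of a single TLS access would be about half of the per-round figure.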

David Mandelin [:dmandelin]

Reporter

Comment 8

14 years ago

(In reply to comment #7)
> I don't think this will noticeably slow down the engine by adding a few
> nanoseconds on top of every temporary allocation -- I suspect it will get
> lost in the noise. It may even end up being a win due to not needing a
> register or space in structures to pass around the generator or context, but
> I suspect the effect is likely to be very small.

My rough estimate is the same. So let's try this in IonMonkey, benchmarking compilation performance before and after the change. If it doesn't slow anything down, we can consider doing it for the rest of the engine.

Benjamin Smedberg

Comment 9

14 years ago

On Windows, don't use the NSPR functions, just use TlsAlloc directly: it's somewhat faster and has less overhead. Just note that there are a limited number (256) of TLS allocations, so you should really just have one for all of SpiderMonkey.
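One way to read the "one slot for everything" suggestion is to allocate a single TLS index and store a pointer to a per-thread struct in it, so that further per-thread state costs no additional slots. The sketch below assumes that reading; `EngineThreadData` and the `engine_tls_*` functions are hypothetical names, not SpiderMonkey code. The Windows branch uses TlsAlloc, TlsSetValue, and TlsGetValue directly, and the other branch uses the pthread equivalents so the same sketch builds elsewhere.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
static DWORD engine_slot = TLS_OUT_OF_INDEXES;
#else
#include <pthread.h>
static pthread_key_t engine_slot;
#endif

/* Hypothetical per-thread bundle: everything hangs off one TLS slot. */
struct EngineThreadData {
    void *context;    /* e.g. the current context */
    void *alloc_pool; /* e.g. a temporary allocation pool */
};

/* Allocate the single slot once per process. Returns 0 on success. */
int engine_tls_init(void)
{
#ifdef _WIN32
    engine_slot = TlsAlloc();
    return engine_slot == TLS_OUT_OF_INDEXES ? -1 : 0;
#else
    return pthread_key_create(&engine_slot, NULL) == 0 ? 0 : -1;
#endif
}

/* Bind this thread's data to the slot. Returns 0 on success. */
int engine_tls_set(struct EngineThreadData *data)
{
#ifdef _WIN32
    return TlsSetValue(engine_slot, data) ? 0 : -1;
#else
    return pthread_setspecific(engine_slot, data) == 0 ? 0 : -1;
#endif
}

/* Fetch this thread's data, or NULL if none has been set. */
struct EngineThreadData *engine_tls_get(void)
{
#ifdef _WIN32
    return (struct EngineThreadData *)TlsGetValue(engine_slot);
#else
    return (struct EngineThreadData *)pthread_getspecific(engine_slot);
#endif
}
```

With this layout, reaching the allocation pool is one TLS lookup followed by one ordinary field load, and adding a new per-thread field only changes the struct.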

Andrew Drake [:adrake]

Updated

14 years ago

Assignee: adrake → general

Nobody; OK to take it and work on it

Assignee

Updated

11 years ago

Assignee: general → nobody

Sylvestre Ledru [:Sylvestre]

Comment 10

7 years ago

No assignee, updating the status.

Status: ASSIGNED → NEW

Tom S. (please needinfo tschuster)

Updated

4 years ago

a11y-review: requested → ---

tracking-firefox90: ? → ---

tracking-firefox-esr91: ? → ---

BMO Automation

Updated

3 years ago

Severity: normal → S3
