1435209 - Use CMOVcc instead of index masking

Assignee

Description

•

8 years ago

Using CMOVcc to zero the index register if OOB has a number of benefits: * Performance On x64 CMOVcc is a bit faster than the index masking, but it's significantly faster on x86 (since we use the slower masking scheme there). On Kraken I get the following on 32-bit: no mitigations: 832 ms index masking: 1011 ms (21.5% slower) CMOVcc: 911 ms (9.5% slower) Furthermore, this is the naive way - in many cases (masm.boundsCheck32ForLoad, wasm bounds checks maybe?) we can fold the CMP with the CMP we do for the bounds check and that will result in more speedups. * simplicity: it's much easier to reason about.

Nicolas B. Pierron [:nbp]

Updated

•

8 years ago

Depends on: 1435249

Jan de Mooij [:jandem]

Assignee

Updated

•

8 years ago

Assignee: nobody → jdemooij

Status: NEW → ASSIGNED

Jan de Mooij [:jandem]

Assignee

Updated

•

8 years ago

Blocks: 1435266

Jan de Mooij [:jandem]

Assignee

Comment 1

•

8 years ago

Does anyone have an older machine, to benchmark a patch for this? My MBP has an i7-4980HQ CPU but it would be great to verify these results on older CPUs. I'm also curious about AMD.

Flags: needinfo?(nicolas.b.pierron)

Flags: needinfo?(lhansen)

Jan de Mooij [:jandem]

Assignee

Comment 2

•

8 years ago

Newer architectures (Broadwell, Skylake) would be interesting too...

Lars T Hansen [:lth]

Comment 3

•

8 years ago

I have access to various ancient hardware, including older AMD and Intel chips (AMD FX-4100 and a couple of Macs that probably run Core-2 Duo, and some older i7 - at least). Until Monday I also have my mom's consumer Lenovo laptop in the house for servicing :-P so I can even run something there... My new Linux system is some Xeon Sandy Bridge thing, if that's interesting.

Flags: needinfo?(lhansen)

Jan de Mooij [:jandem]

Assignee

Comment 4

•

8 years ago

Great! I will work on a patch and do a Try push so we have builds we can compare.

Jan de Mooij [:jandem]

Assignee

Comment 5

•

8 years ago

Attached file Micro-benchmark — Details

First, here's a silly micro-benchmark that runs in the browser (but it's easy to run it in the shell too, of course). In the shell I got the following, best of 5 runs: x64 no mitigations: 333 ms x64 index masking: 323 ms x64 CMOVcc: 316 ms x86 no mitigations: 327 ms x86 index masking: 420 ms x86 CMOVcc: 314 ms So: (1) Yes, on this micro-benchmark we're slower without mitigations. Probably some weird cache effect but I'm not complaining. (2) This also confirms CMOVcc is a small win on 64-bit on my machine, but on 32-bit the improvement is much bigger.

Jan de Mooij [:jandem]

Assignee

Comment 6

•

8 years ago

I have builds to compare index masking vs CMOVcc. Mitigations are enabled by default with these builds, so you don't need to change any prefs or use shell flags. If anyone reading this could run Kraken [0] (and probably the micro-benchmark I attached) in a browser/shell build and report the results, that would be great. Ideally both 32-bit and 64-bit, but just one of them would help too. Index masking: https://treeherder.mozilla.org/#/jobs?repo=try&revision=55f3f4c434f82e4f0c7df9f3404da2d93b073ac5 CMOVcc: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b4efd393f6f3058858304268a6804134dd090bb0 [0] https://krakenbenchmark.mozilla.org/

Jan de Mooij [:jandem]

Assignee

Comment 7

•

8 years ago

Talos agrees this is an improvement. Linux64 Talos machines are Skylakes now (Xeon E3-1585Lv5) and it's a consistent 1-3% improvement there on Kraken compared to index masking. As expected, this is bigger on 32-bit: 8.83% on Win32. So it looks like CMOV performs well on (newer) Intel CPUs. Still very interested in AMD results and older Intels.

André Bargull [:anba]

Comment 8

•

8 years ago

AMD FX-6300 on Win7 (x64) Kraken, three runs for each configuration (chronology order). x64 58.0.1: [1678.6ms +/- 4.8%, 1609.9ms +/- 1.4%, 1651.1ms +/- 2.5%] x64 index masking: [1914.6ms +/- 5.6%, 1839.5ms +/- 1.3%, 1855.2ms +/- 1.5%] x64 CMOVcc: [1840.5ms +/- 4.0%, 1818.6ms +/- 2.1%, 1782.1ms +/- 3.8%] x86 index masking: [2104.8ms +/- 2.8%, 2117.3ms +/- 2.6%, 2091.4ms +/- 3.3%] x86 CMOVcc: [1949.2ms +/- 2.9%, 1968.9ms +/- 2.3%, 1932.7ms +/- 3.8%] Shell test case: x64 tip@094c6dbe48a3: 214ms x64 index masking: 370ms x64 CMOVcc: 319ms x86 index masking: 559ms x86 CMOVcc: 318ms

Jan de Mooij [:jandem]

Assignee

Comment 9

•

8 years ago

(In reply to André Bargull [:anba] from comment #8) > AMD FX-6300 on Win7 (x64) Thanks a lot! I'm very happy to see CMOVcc is faster on AMD too - I was worried it would be slow.

Emanuel Hoogeveen [:ehoogeveen]

Comment 10

•

8 years ago

I don't know if this is old enough to be interesting, but here are results for an Intel Core i7-3770K (Ivy Bridge, 2012) on Windows 10 x64. Kraken, average result first, individual results in parentheses: x64 Nightly: 1145.5ms (1152.7ms 1144.3ms 1166.4ms 1135.1ms 1144.7ms 1129.8ms) x64 index masking: 1293.8ms (1314.9ms 1284.4ms 1294.6ms 1303.6ms 1280.4ms 1284.7ms) x64 CMOVcc: 1310.2ms (1293.2ms 1310.3ms 1332.2ms 1298.7ms 1315.1ms 1311.6ms) x86 Nightly: 1196.6ms (1201.9ms 1195.0ms 1186.8ms 1211.0ms 1189.0ms 1195.7ms) x86 index masking: 1428.3ms (1404.9ms 1424.1ms 1452.8ms 1421.8ms 1444.0ms 1422.3ms) x86 CMOVcc: 1359.7ms (1379.6ms 1366.7ms 1359.6ms 1324.1ms 1388.0ms 1340.0ms) So on x64 it looks like a wash or a (very) slight regression, on x86 CMOVcc is a definite win.

Emanuel Hoogeveen [:ehoogeveen]

Comment 11

•

8 years ago

Results for the microbenchmark (all builds compiled locally): median min mean max sd x64 m-c tip: 304 300 305 321 3 x64 index masking: 316 300 318 359 14 x64 CMOVcc: 319 299 320 356 14 x86 m-c tip: 303 300 303 309 2 x86 index masking: 443 436 443 457 4 x86 CMOVcc: 315 299 318 349 13 Note that the difference between index masking and CMOVcc on x64 is not statistically significant after 200 runs.

Jan de Mooij [:jandem]

Assignee

Comment 12

•

8 years ago

Thanks! It's interesting the difference is smaller on that machine on x64. That makes me wonder about even older architectures like Core 2, maybe Lars can help us with that. But considering the other benefits, better perf on 32-bit and newer CPUs, we could take a perf regression there, as long as it's not too horrible. And we can still improve perf more by eliminating the second CMP in some cases.

hazel ward

Comment 13

•

8 years ago

x64 nightly: 1035.7ms +/- 2.6% 1052.2ms +/- 2.6% 1089.0ms +/- 3.7% Index Masking x64: 1233.8ms +/- 4.0% 1182.7ms +/- 1.7% 1247.9ms +/- 3.5% x86: 1503.0ms +/- 4.2% 1411.1ms +/- 2.0% 1448.0ms +/- 5.9% CMOVcc x64: 1155.6ms +/- 1.7% 1168.4ms +/- 5.3% 1157.7ms +/- 2.2% x86: 1283.2ms +/- 3.1% 1315.2ms +/- 3.5% 1264.7ms +/- 2.1%

Ekanan Ketunuti

Comment 14

•

8 years ago

Ekanan Ketunuti

Comment 15

•

8 years ago

Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz 2.27GHz micro-benchmark jsshell x86 no mitigations: 491 index masking: 1279 CMOVcc: 850

Jan de Mooij [:jandem]

Assignee

Comment 16

•

8 years ago

Thanks a lot people, very helpful. Ekanan, just curious, is that 32-bit Windows? If 64-bit, it would be very interesting to have data for Firefox 64-bit on Core 2 :)

Ekanan Ketunuti

Comment 17

•

8 years ago

It's windows 10 32-bit :)

hazel ward

Comment 18

•

8 years ago

I forgot to mention my build specs in #13 Win10 Pro x64 i7 6700HQ

Lars T Hansen [:lth]

Comment 19

•

8 years ago

Huh, turns out both my Core 2 Duo systems run Mac OS X 32-bit, not really a current or interesting platform... Anyway one of them is a model with the CPU designation T7600 @ 2.33GHz (MacBook Pro ca 2006).

Jan de Mooij [:jandem]

Assignee

Comment 20

•

8 years ago

Attached patch Patch — Details — Splinter Review

It turns out that this version: CMP index, length CMOVB index, output is a lot faster than this one: CMP length, index CMOVA index, output I don't understand why, but it repros on multiple benchmarks (index and length are both in registers and the code size is exactly the same). Anyway, I replaced the first version with the second, then had to revert it because the first is faster. Oh well.

Attachment #8948819 - Flags: review?(luke)

Luke Wagner [:luke]

Comment 21

•

8 years ago

Comment on attachment 8948819 [details] [diff] [review] Patch Review of attachment 8948819 [details] [diff] [review]: ----------------------------------------------------------------- Faster, shorter, simpler; pick any three.

Attachment #8948819 - Flags: review?(luke) → review+

Pulsebot

Comment 22

•

8 years ago

Pushed by jandemooij@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/0bc556c6e060 Use CMOVcc instead of index masking. r=luke

Andrei Ciure[:aciure]

Comment 23

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/0bc556c6e060

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

status-firefox60: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla60

Jan de Mooij [:jandem]

Assignee

Comment 24

•

8 years ago

Comment on attachment 8948819 [details] [diff] [review] Patch See bug 1431173 comment 6.

Attachment #8948819 - Flags: approval-mozilla-beta?

Ryan VanderMeulen [:RyanVM]

Comment 25

•

8 years ago

Comment on attachment 8948819 [details] [diff] [review] Patch Spectre-related fix. Taking for 59b8.

Attachment #8948819 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 26

•

8 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/41310e950428

status-firefox59: --- → fixed

tomaspartl

Comment 27

•

8 years ago

Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz Ubuntu 16.04 Micro benchmark: CMOV: 511 Index Masking: 522 ==================== Kraken =================== --------------------- CMOV -------------------- ----------------------------------------------- Total: 2368.4ms +/- 12.7% ----------------------------------------------- ai: 132.7ms +/- 17.0% astar: 132.7ms +/- 17.0% audio: 757.9ms +/- 9.9% beat-detection: 171.8ms +/- 15.8% dft: 354.9ms +/- 9.9% fft: 109.3ms +/- 16.0% oscillator: 121.9ms +/- 19.2% imaging: 610.1ms +/- 17.7% gaussian-blur: 197.8ms +/- 21.3% darkroom: 222.3ms +/- 18.6% desaturate: 190.0ms +/- 13.3% json: 146.7ms +/- 26.8% parse-financial: 88.8ms +/- 36.5% stringify-tinderbox: 57.9ms +/- 21.6% stanford: 721.0ms +/- 17.6% crypto-aes: 134.2ms +/- 17.8% crypto-ccm: 221.5ms +/- 20.8% crypto-pbkdf2: 256.1ms +/- 20.5% crypto-sha256-iterative: 109.2ms +/- 30.7% -------------------- Index Masking ------------ ----------------------------------------------- Total: 2574.1ms +/- 11.7% ----------------------------------------------- ai: 139.9ms +/- 16.6% astar: 139.9ms +/- 16.6% audio: 887.8ms +/- 12.9% beat-detection: 190.9ms +/- 16.3% dft: 422.5ms +/- 13.2% fft: 140.6ms +/- 28.9% oscillator: 133.8ms +/- 17.0% imaging: 626.9ms +/- 18.0% gaussian-blur: 210.6ms +/- 22.9% darkroom: 229.7ms +/- 18.0% desaturate: 186.6ms +/- 19.1% json: 146.1ms +/- 23.5% parse-financial: 75.3ms +/- 21.6% stringify-tinderbox: 70.8ms +/- 39.8% stanford: 773.4ms +/- 20.3% crypto-aes: 144.6ms +/- 19.0% crypto-ccm: 271.0ms +/- 39.1% crypto-pbkdf2: 262.2ms +/- 17.9% crypto-sha256-iterative: 95.6ms +/- 20.6% ============================== Octane ========================== -------------------------------- CMOV --------------------------- Octane 2.0 Octane Score: 17463 Richards 19691 Core language features Deltablue 23683 Core language features Crypto 16758 Bit & Math operations Raytrace 57218 Core language features EarleyBoyer 15394 Memory & GC Regexp 2898 Strings & arrays Splay 13470 Memory & GC SplayLatency 13934 GC latency NavierStokes 26284 Strings & arrays pdf.js 11430 Strings & arrays Mandreel 12715 Virtual machine MandreelLatency 11785 Compiler latency GB Emulator 33341 Virtual machine CodeLoad 14284 Loading & Parsing Box2DWeb 21731 Bit & Math operations zlib 53209 asm.js Typescript 14073 Virtual machine & GC ------------------------------ Index Masking ------------------- Octane 2.0 Octane Score: 16929 Richards 20606 Core language features Deltablue 21323 Core language features Crypto 16531 Bit & Math operations Raytrace 53057 Core language features EarleyBoyer 15761 Memory & GC Regexp 2286 Strings & arrays Splay 13438 Memory & GC SplayLatency 15719 GC latency NavierStokes 26389 Strings & arrays pdf.js 7344 Strings & arrays Mandreel 14224 Virtual machine MandreelLatency 17250 Compiler latency GB Emulator 37916 Virtual machine CodeLoad 13920 Loading & Parsing Box2DWeb 18199 Bit & Math operations zlib 35333 asm.js Typescript 16272 Virtual machine & GC

tomaspartl

Comment 28

•

8 years ago

Ooops! Please, ignore my comment #27. The correct results are: Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz Ubuntu 16.04 MICRO BENCHMARK --------------- CMOV: 520 INDEX MASKING: 559 KRAKEN ------- CMOV: 2394.7ms +/- 16.6% INDEX MASKING: 2609.1ms +/- 10.4% CMOV - KRAKEN - details ================================ Kraken JavaScript Benchmark Results Content Version: kraken-1.1 Run Again (You can bookmark this results URL for later comparison.) To compare to another run, paste a saved result URL in the text field below and press enter: =============================================== RESULTS (means and 95% confidence intervals) ----------------------------------------------- Total: 2394.7ms +/- 16.6% ----------------------------------------------- ai: 126.9ms +/- 16.3% astar: 126.9ms +/- 16.3% audio: 896.9ms +/- 25.1% beat-detection: 174.1ms +/- 18.5% dft: 457.1ms +/- 39.5% fft: 134.2ms +/- 30.5% oscillator: 131.5ms +/- 19.9% imaging: 592.7ms +/- 20.3% gaussian-blur: 190.1ms +/- 21.9% darkroom: 219.9ms +/- 21.7% desaturate: 182.7ms +/- 18.1% json: 127.8ms +/- 17.9% parse-financial: 77.2ms +/- 20.1% stringify-tinderbox: 50.6ms +/- 19.2% stanford: 650.4ms +/- 11.4% crypto-aes: 123.2ms +/- 13.3% crypto-ccm: 211.5ms +/- 34.2% crypto-pbkdf2: 220.1ms +/- 7.8% crypto-sha256-iterative: 95.6ms +/- 15.1% INDEX MASKING - KRAKEN - details ================================= Kraken JavaScript Benchmark Results Content Version: kraken-1.1 Run Again (You can bookmark this results URL for later comparison.) To compare to another run, paste a saved result URL in the text field below and press enter: =============================================== RESULTS (means and 95% confidence intervals) ----------------------------------------------- Total: 2609.1ms +/- 10.4% ----------------------------------------------- ai: 160.9ms +/- 16.1% astar: 160.9ms +/- 16.1% audio: 1028.7ms +/- 24.7% beat-detection: 250.7ms +/- 31.2% dft: 518.2ms +/- 32.4% fft: 133.9ms +/- 35.8% oscillator: 125.9ms +/- 20.5% imaging: 568.9ms +/- 12.9% gaussian-blur: 195.2ms +/- 16.0% darkroom: 205.7ms +/- 15.6% desaturate: 168.0ms +/- 11.8% json: 138.3ms +/- 23.4% parse-financial: 82.5ms +/- 29.0% stringify-tinderbox: 55.8ms +/- 37.5% stanford: 712.3ms +/- 15.6% crypto-aes: 156.6ms +/- 44.3% crypto-ccm: 192.8ms +/- 22.6% crypto-pbkdf2: 249.5ms +/- 15.3% crypto-sha256-iterative: 113.4ms +/- 22.4%

Jan de Mooij [:jandem]

Assignee

Comment 29

•

8 years ago

(In reply to tomaspartl from comment #28) > The correct results are: Thank you for these results!

Nicolas B. Pierron [:nbp]

Updated

•

7 years ago

Flags: needinfo?(nicolas.b.pierron)

Micro-benchmark 8 years ago Jan de Mooij [:jandem] 284 bytes, text/html		Details
Patch 8 years ago Jan de Mooij [:jandem] 19.99 KB, patch	luke : review+ RyanVM : approval-mozilla-beta+	Details \| Diff \| Splinter Review