412978 - TT: doubleToInt32 needs optimization

Reporter

Description

•

18 years ago

Some benchmarks show the doubleToInt32 primitive at over 3% of processing time. The code itself can be optimized further (especially on SSE2), but it also is likely getting called too often (eg, in unnecessary int->double promotion in loops).

Moh Haghighat

Comment 1

•

18 years ago

The performance of the "doubleToInt32(double d)" primitive in TT can be significantly improved by using the highly optimized code that we propose here. This primitive is hot and is used for array index conversions among others. Here are the speedups I can reproduce on a Core-Duo laptop: Test % Speedup ------------ --------- math-cordic 111.11 bitops-bit-in-byte 63.71 run-nsieve 57.45 bitops-nsieve-bits 34.76 crypto-sha1 34.29 run-access-fannkuch 25.20 3dmorph 20.51 *Note that Sunspider performance numbers with extremely short running time should be seen with caution. JSBench/HeapSort 25% This is a long-running test, and the speedup is stable. Here's the propose code: ------------------------------------------------------- typedef unsigned __int64 UINT64; typedef signed __int64 SINT64; typedef union { double d; UINT64 i; struct { unsigned int il, ih;} i32; } double_int; int32_t FASTCALL doubleToInt32(double d) { double_int du, duh, two32; #if DBLTOINT32_INT64 double_int adu; UINT64 sign_d; SINT64 MASK; #endif unsigned DI_H, u_tmp, expon, shift_amount; signed int res, mask32; // Here are the ECMA Specs // From the ES3 spec, 9.5 // 2. If Result(1) is NaN, +0, -0, +Inf, or -Inf, return +0. // 3. Compute sign(Result(1)) * floor(abs(Result(1))). // 4. Compute Result(3) modulo 2^32; that is, a finite integer value k of Number // type with positive sign and less than 2^32 in magnitude such the mathematical // difference of Result(3) and k is mathematically an integer multiple of 2^32. // 5. If Result(4) is greater than or equal to 2^31, return Result(4)- 2^32, // otherwise return Result(4). // Algorithm Outline // Step 1. If d is NaN, +/-Inf or |d|>=2^84 or |d|<1, then return 0 // All of this is implemented based on an exponent comparison. // Step 2. If |d|<2^31, then return (int)d // The cast to integer (conversion in RZ mode) returns the correct result. // Step 3. If |d|>=2^32, d:=fmod(d, 2^32) is taken -- but without a call // Step 4. If |d|>=2^31, then the fractional bits are cleared before applying the correction by 2^32: d - sign(d)*2^32 // Step 5. Return (int)d du.d = d; DI_H = du.i32.ih; u_tmp = (DI_H & 0x7ff00000) - 0x3ff00000; if(u_tmp >= (0x45300000-0x3ff00000)) // d is Nan, +/-Inf or +/-0, or |d|>=2^(32+52) or |d|<1, in which case result=0 return 0; if(u_tmp < 0x01f00000) // |d|<2^31 { res = (signed int)d; return (int32_t)res; } if(u_tmp > 0x01f00000) { // |d|>=2^32 expon = u_tmp >> 20; shift_amount = expon - 21; duh.i = du.i; #if DBLTOINT32_INT64 MASK = 0x8000000000000000ll; MASK = MASK >> shift_amount; #else mask32 = 0x80000000; if(shift_amount<32) { mask32 >>= shift_amount; duh.i32.ih = du.i32.ih & mask32; duh.i32.il = 0; } else { mask32 >>= (shift_amount-32); duh.i32.ih = du.i32.ih; duh.i32.il = du.i32.il & mask32; } #endif du.d -= duh.d; } DI_H = du.i32.ih; // eliminate fractional bits u_tmp = (DI_H & 0x7ff00000); if(u_tmp >= 0x41e00000) // |d|>=2^31 { expon = u_tmp >> 20; shift_amount = expon - (0x3ff - 11); #if DBLTOINT32_INT64 MASK = 0x8000000000000000ll; MASK = MASK >> shift_amount; du.i &= (UINT64)MASK; #else mask32 = 0x80000000; if(shift_amount<32) { mask32 >>= shift_amount; du.i32.ih &= mask32; du.i32.il = 0; } else { mask32 >>= (shift_amount-32); du.i32.il &= mask32; } #endif #if DBLTOINT32_INT64 sign_d = du.i & 0x8000000000000000ull; two32.i = 0x41f0000000000000ull ^ sign_d; #else two32.i32.ih = 0x41f00000 ^ (du.i32.ih & 0x80000000); two32.i32.il = 0; #endif du.d -= two32.d; } res = (signed int)du.d; return (int32_t)res; }

Steven Johnson

Reporter

Comment 2

•

18 years ago

I'm unclear on what DBLTOINT32_INT64 indicates and how/whether it is defined. Also, a minor point: we're now using C99 integer types for new code in TT, so this chould probably use uint64_t, etc.

Edwin Smith

Comment 3

•

18 years ago

Attached patch code below as patch to TT — Details — Splinter Review

Here's the same code in patch form. I converted to C99 types with the following assumptions, besides the obvious conversion: (signed int) => int32_t hopefully that's right for both 32bit and 64bit compilers. I also separated out the code based on DBLTOINT32_INT64.

Attachment #319495 - Flags: review?(mohammad.r.haghighat)

Edwin Smith

Comment 4

•

18 years ago

winxp bootcamp on macbook pro 2.6ghz core 2 duo (winxp noise is +/- 15ms) TT TT+patch boids 5813 5735 boidshackHeadless 454 407 gameoflife 422 360 access-binary-trees 141 156 access-fannkuch 157 125 access-nbody 187 188 access-nsieve 62 62 bitops-3bit-bits-in-byte 16 16 bitops-bits-in-byte 47 32 bitops-bitwise-and 266 251 bitops-nsieve-bits 79 47 crypto-md5 187 187 crypto-sha1 47 47 math-cordic 79 47 math-partial-sums 234 235 math-spectral-norm 32 31 s3d-cube 172 156 s3d-morph 94 78 s3d-raytrace 266 250 string-fasta 172 172 controlflow-recursive 31 31 chess 51 50 Crypt 1078 703 SOR 1218 860 HeapSort 10807 9493

Moh Haghighat

Comment 5

•

18 years ago

Setting DBLTOINT32_INT64, should result in even more speedup on 64-bit systems (e.g., Intel EM64T).

Moh Haghighat

Comment 6

•

18 years ago

A clarification: DBLTOINT32_INT64 should be set only if the target platform is 64-bit, otherwise (i.e., if it is set when the target platform is 32-bit) the code will still run correctly but not necessarily faster.

Moh Haghighat

Comment 7

•

18 years ago

The code looks good. It, however, needs to be guarded so it is used only on little-endian targets, such as x86. I'm not sure how this would work on all versions of ARM.

Edwin Smith

Comment 8

•

18 years ago

Comment on attachment 319495 [details] [diff] [review] code below as patch to TT marking + on behalf of Moh

Attachment #319495 - Flags: review?(mohammad.r.haghighat) → review+

Moh Haghighat

Comment 9

•

18 years ago

In SpiderMonkey on Windows, the total number of calls to js_DoubleToECMAInt32() during a complete run of Sunspider (5 iterations) is currently ~7.07M, while in TT, the function doubleToInt32() is currently called ~33.36M times (almost a factor of 5x). We should be able to eliminate a large number of unnecessary calls.

Edwin Smith

Comment 10

•

18 years ago

SM has another function, js_ValueToInt32() that is used in other situations. And, TT is using doubleToInt32 (ecma semantics) in cases when it could just use the simpler double->int cast. So, TT should use a simpler primitive where possible, but we should track stats on both uses. Even once the stats are adjusted i still agree TT could eliminate many calls to either primitive. some factors * SM (presumably) tries to keep ints as ints, by checking for overflow. TT currently will promote int->double, in cases then later use a primitive to check if the result is in the integer range. partly this is so type-analysis can be done... the result of an op is either always int or always double. Because SM doesn't need a result to always be one type or another it can be more aggressive about avoiding int->double promotion. (TT can/should do this too, but its more interesting while also trying to do type analysis both statically and at trace time).

Edwin Smith

Updated

•

18 years ago

Priority: -- → P1

Brendan Eich [:brendan]

Comment 11

•

18 years ago

(In reply to comment #10) > SM has another function, js_ValueToInt32() that is used in other situations. Just to clarify, this is the original value-to-int32 function but as you can see from a quick grep: $ grep js_ValueToInt32 *.[ch] jsapi.c: *ip = js_ValueToInt32(cx, &tvr.u.value); jsnum.c:js_ValueToInt32(JSContext *cx, jsval *vp) jsnum.h:js_ValueToInt32(JSContext *cx, jsval *vp); it's no longer used except by its API wrapper. The old name stuck in the JS API, and the API entry point JS_ValueToInt32 is used from setTimeout still (no clear DOM standard yet -- the WHATWG is working on it, though I think back-burner for now). Compatibility means more names, thus the ECMA name-parts. So no benchmark should show any SM code in js_ValueToInt32 -- only in the js_{Double,Value}ToECMA{Ui,I}nt32 quad. /be

Dan Schaffer

Comment 12

•

18 years ago

scrubbing tamarin bugs changing component to Tracing Virtual Machine.

Component: Virtual Machine → Tracing Virtual Machine

QA Contact: vm → tracing-vm

Edwin Smith

Updated

•

18 years ago

Attachment #319495 - Flags: review+ → review?(stejohns)

Edwin Smith

Updated

•

18 years ago

Attachment #319495 - Flags: review?(rwinchel)

Steven Johnson

Reporter

Comment 13

•

18 years ago

Comment on attachment 319495 [details] [diff] [review] code below as patch to TT Does the double_int union need an ifdef AVMPLUS_LITTLE_ENDIAN/AVMPLUS_BIG_ENDIAN? Otherwise, looks good

Attachment #319495 - Flags: review?(stejohns) → review+

Edwin Smith

Comment 14

•

18 years ago

aside from not having any big endian targets at the moment, yes. changeset: 415:ef89a85db977

Status: NEW → RESOLVED

Closed: 18 years ago

Resolution: --- → FIXED

Edwin Smith

Updated

•

18 years ago

Attachment #319495 - Flags: review?(rwinchel)

Andreas Gal :gal

Comment 15

•

17 years ago

The 64-bit implementation in this patch is buggy. It fails TM's mul-overflow test.