416398 - eliminate dbox where possible, when we know canonical NaN's will be preserved

Reporter

Description

•

17 years ago

The implementation of the math function isNaN() cab be significantly improved. Currently, it is defined as: static bool isNaN(double value) { return value != value; } On x86, C++ compilers use x87 instructions to implement this code. The generated code compares the argument value against itself and checks the flags. fld QWORD PTR [esp+04h] fcomp QWORD PTR [esp+04h] fnstsw ax test ah, 0x44h When the value is actually Not a Number (NaN), hardware exceptions happen and HW microcode handles the exception. This has severe performance penalty. There is a much more efficient way of doing this by avoiding the x87 FPU and using integer instructions (event without any need for SSE2). The performance improvement of the primitive is huge when handling NaNs - something like 35x improvement. When the argument is not a NaN value, the performance of the two approaches are comparable. IsNaN() is one of the hot Tamarin helper functions on the JSBench benchmarks (MolDyn), where ~29% of the total execution time is in isNaN(). By using the right code (fully expressible in C), I get about 20% overall improvement in that benchmark. - moh

Edwin Smith

Reporter

Comment 1

•

17 years ago

From: Edwin Smith [mailto:edwsmith@adobe.com] Sent: Thursday, February 07, 2008 9:36 PM To: Haghighat, Mohammad R; tamarin-devel@mozilla.org Subject: RE: a much more efficient implementation of isNaN() In tamarin-tracing, "x != x" is frequently used as an isNaN check, to box nan values with a canonical bit pattern. we also have a decent amount of C code doing "x != x" as a nan check, without explicitly calling isNaN. also, in TT, the code we emit is different for SSE2 or x87. usually, the values are not NaN. is this penalty for a NaN specific to the x87 instructions? on TT, with the x87 or sse2, we might be able to use the status registers after the instruction to detect NaN, rather than explicitly comparing x != x. with sse2 we could also use an integer instruction to check for the nan bit pattern. but for x87 if we dont use the fpu status register, then it sounds like a store and integer-load is required? Ed

Edwin Smith

Reporter

Comment 2

•

17 years ago

-----Original Message----- From: Mike Pall [mailto:mikelu-0802@mike.de] Sent: Friday, February 08, 2008 8:39 AM To: tamarin-devel@mozilla.org Cc: Edwin Smith Subject: Re: a much more efficient implementation of isNaN() Edwin Smith wrote: > In tamarin-tracing, "x != x" is frequently used as an isNaN check, to > box nan values with a canonical bit pattern. There's no need to canonicalize all NaNs. It's sufficient to check numbers flowing into the VM before boxing them (a simple integer comparison on the hiword). You'll need this for strtod() or calls/returns from C carrying an untrusted double to be boxed. The FPU only ever generates the canonical NaN on its own (e.g. for 0/0). NaNs are propagated by choosing the first NaN operand. This way you'll end up with canonical NaNs everywhere. The only caveat with your representation is the -(NaN) case for SSE (using xorpd). You'll want to avoid using this bit pattern for other types. But there are plenty of mantissa bits left. [At least that's what I'm doing and so far it works fine.] > is this penalty for a NaN specific to the x87 instructions? Nope, it applies everywhere (and to +-Inf and denormals, too). There is only a single FPU driven by different instruction sets. Same problem is present on other platforms. Should you ever support VFP on ARM, you'll find that it traps to the kernel in these cases. > on TT, with the x87 or sse2, we might be able to use the status > registers after the instruction to detect NaN, rather than explicitly > comparing x != x. The status register bits are sticky. You'd need to clear them before every operation. But the clear operation is microcoded and causes a pipeline flush. Believe me, you don't want this. :-) --Mike

Edwin Smith

Reporter

Comment 3

•

17 years ago

From: Haghighat, Mohammad R [mailto:mohammad.r.haghighat@intel.com] Sent: Friday, February 08, 2008 1:19 AM To: Edwin Smith; tamarin-devel@mozilla.org Subject: RE: a much more efficient implementation of isNaN() The penalty is for the float compare instruction (fcomp). Other instructions may also have penalties. I've got to see them before saying anything for sure. But the code for doing isNaN is actually very simple, specially if it is on 64 bit systems. Here's the code. I've also included that in the performance bug I submitted. #define I64(f) (*(long long int *)&f) static bool isNaN (double value) { unsigned long long int jvalue = (I64(value) & ~0x8000000000000000uLL); return (jvalue > 0x7ff0000000000000uLL); } Essentially, one needs to clear out the sign bit and compare the absolute value (interpreted as an unsigned integer) of the argument to +Infinity (again interpreted as an unsigned integer). If the argument is greater, then the value is a NaN. It would be a good idea to have an assertion (preferably during compile time) that the sizeof (long long int, or __int64) is 8 to ensure it is 64 bits. The code relies on this assumption. - moh

Edwin Smith

Reporter

Updated

•

17 years ago

Component: Virtual Machine → Tracing Virtual Machine

QA Contact: vm → tracing-vm

Edwin Smith

Reporter

Updated

•

17 years ago

Priority: -- → P2

Edwin Smith

Reporter

Comment 6

•

17 years ago

TT's Box encoding uses high bits fff8-ffff for int thru null. on x86, the canonical nan is 0xfff80... so we'll need to tweak the tagging scheme to fit the type tags in the range > 0xfff80... some ideas: 1. merge boxedNull & boxedVoid. null = BoxedVoid|0 undefined = BoxedVoid|1 hole = BoxedVoid|2 gc obj = BoxedVoid|ptr 1a. eliminate null as a tag, instead lead BoxedObj|Str|Ns have a null payload. - this eliminates branching when boxing pointers, but introduces branching for the various cases that process boxed obj|str|null. - TC encodes nulls as Str|0, Obj|0, etc. 2. eliminate BoxedNs. ns is a primitive to aid in bootstrapping symbol tables before class Namespace comes into being. however we would win if namespaces in symbol tables were not script-visible namespace objects but rather internal namespace values that dont require boxing, ever. 3. use top 17 bits for type tag: fff80 = canonical NaN fff88 fff90 fff98 ... ffff0 ffff8 = highest flag now there's plenty of room for more than 7 types. (17 bits = room for 15 types. 18 bits = room for 31 types, etc). as long as the loads & stores are 32 bits this should be fine since we dont need the next 16 (or 15) bits anyway. trying to use 16bit load/stores leads to store forwarding penalties on 32bit machines. could be non-optimal for 16bit databus arm7's

Steven Johnson

Comment 7

•

17 years ago

Of these, removing BoxedNs sounds the most appealing to me, since de-magicking Namespace is something we've tossed around for a while anyway. Not sure how much work it might be to do so, however. Re: 17 or 18 bits, keeping it 16 bits for now might prove a win on some ARM architectures, so if we can do so painlessly, IMHO worth doing. The penalties on Intel can easily be avoided by just promoting to 32-bit load/stores.

Edwin Smith

Reporter

Comment 8

•

17 years ago

Probably the bulk of work in de-magicking Namespace, is also where the bulk of the benefit is, namely: letting symbol tables contain some much lighter weight namespace thing (like a pointer into the abc or IL, nothing more), making light namespace-things from real Namespaces for dynamic looksup, and making real Namespaces from light namespaces, for iteration and proxy. (or maybe not for iteration, because dynamic expandos dont contain namespaces). the acid test is being able to bootstrap system to the point where we can new a real Namespace.

Edwin Smith

Reporter

Comment 9

•

17 years ago

After looking at the forth code some, i'm also warming up to combining BoxedNull and BoxedVoid. BoxedSpecial|0 = null, Special|1 = undefined, Special|2 = array hole, etc. ES4 will need a new not-initialized-const value too, maybe Special|3. most of the cases for null & void are the same, and where they are different, they seem to be cold paths. doing this *and* de-magiking namespace would eliminate 2 cases from all our 9-entry switch(boxtype) tables, which is a decent savings. similarly a con to using more boxtype types is more cases to handle in every such table. its also tempting to try and merge Int|Uint and use a longer integer-like value to avoid promotions... but now i'm starting to ramble.

Steven Johnson

Comment 10

•

17 years ago

yeah, combining null and void is probably easier to accomplish short-term, and is often handled the same. merging int and uint, I'm not as sure about... time for a new tracking bug on that topic :-)

Brendan Eich [:brendan]

Comment 11

•

17 years ago

Igor is making void a pseudo-boolean in SM (hole is a pseudo-b already, and we have another for async return -- what gets thrown at a generator being closed). Jason is reviving the hack-patch I did in bug 161110 to make SM's jsval mimic TT's box, so cc'ing him too. /be

Mike Pall

Comment 12

•

17 years ago

Attached file Store-to-load forwarding microbenchmark — Details

Here's a simple store-to-load forwarding microbenchmark. Relevance to TT (and for possible re-use in SM) explained below. It measures the time in cycles it takes to forward different data sizes through memory. Some cases are gracefully handled, but others incur heavy pipeline stalls. Which cases are good or bad highly depends on the CPU microarchitecture. Note that all of them assume proper alignment of the stack. Penalties for misalignment are much higher and they often disable the fast forwarding cases, too. I'm assuming you've already made sure the box type is always 8-byte aligned (a double inside a struct/union in the Linux/*BSD x86 ABI only forces alignment to 4 bytes, the Windows x86 ABI and probably all non-x86 architectures force 8 bytes). It's hard to adjust for the loop overhead on a super-scalar CPU. But it usually vanishes in the pipeline noise. Still, the cycle times may not accurately reflect just the forwarding itself. But the outliers are clearly visible. Below are the results for a Core2 (65nm), Pentium 4 (Celeron) and a Pentium III. Timings for an Intel Atom would be interesting (if anyone has got one already). I don't have an AMD K8 or K10 at hand, but the manuals indicate similar penalties. Now for the relevance to TT (and possibly SM): As you can see this relates to the type checks you're doing in TT after storing a number to a box. The double->short_hi case is always one of the worst. The double->int_hi case is much cheaper and basically free on a Core2. This means you should definitely convert over to 32 bit type tags. If you use type tags in the range -1..-128 you also get shorter instruction encodings for the type checks, too (cmp [mem], imm8). *** Store-to-load forwarding microbenchmark *** model name : Intel(R) Core(TM)2 CPU 6420 @ 2.13GHz CPU Frequency : 2128.000 MHz -- Same size: 2.3 cycles short 2->2 short 2.3 cycles int 4->4 int 6.0 cycles double 8->8 double -- Wide to narrow: 2.3 cycles int 4->2 short lo 13.0 cycles int 4->2 short hi 2.3 cycles double 8->2 short lo 13.0 cycles double 8->2 short hi CURRENT 16 bit box type check 2.3 cycles double 8->4 int lo 2.3 cycles double 8->4 int hi PROPOSED 32 bit box type check 2.3 cycles xmm 16->2 short lo 13.0 cycles xmm 16->2 short hi 2.3 cycles xmm 16->4 int lo 2.3 cycles xmm 16->4 int hi 3.0 cycles xmm 16->8 double lo 3.0 cycles xmm 16->8 double hi -- Narrow to wide: 14.0 cycles short 2+2->4 int 14.0 cycles int 4+4->8 double 14.0 cycles double 8+8->16 xmm *** Store-to-load forwarding microbenchmark *** model name : Intel(R) Celeron(R) CPU 2.66GHz CPU Frequency : 2660.468 MHz -- Same size: 2.7 cycles short 2->2 short 2.1 cycles int 4->4 int 20.0 cycles double 8->8 double -- Wide to narrow: 2.9 cycles int 4->2 short lo 5.1 cycles int 4->2 short hi 3.2 cycles double 8->2 short lo 38.0 cycles double 8->2 short hi CURRENT 16 bit box type check 2.4 cycles double 8->4 int lo 4.5 cycles double 8->4 int hi PROPOSED 32 bit box type check 2.7 cycles xmm 16->2 short lo 38.6 cycles xmm 16->2 short hi 2.9 cycles xmm 16->4 int lo 38.0 cycles xmm 16->4 int hi 6.1 cycles xmm 16->8 double lo 57.7 cycles xmm 16->8 double hi -- Narrow to wide: 42.0 cycles short 2+2->4 int 56.1 cycles int 4+4->8 double 71.8 cycles double 8+8->16 xmm *** Store-to-load forwarding microbenchmark *** model name : Pentium III (Coppermine) CPU Frequency : 996.698 MHz -- Same size: 9.0 cycles short 2->2 short 3.0 cycles int 4->4 int 3.1 cycles double 8->8 double -- Wide to narrow: 3.0 cycles int 4->2 short lo 11.1 cycles int 4->2 short hi 3.0 cycles double 8->2 short lo 11.0 cycles double 8->2 short hi CURRENT 16 bit box type check 4.1 cycles double 8->4 int lo 11.1 cycles double 8->4 int hi PROPOSED 32 bit box type check 4.0 cycles xmm 16->2 short lo 12.7 cycles xmm 16->2 short hi 3.1 cycles xmm 16->4 int lo 12.7 cycles xmm 16->4 int hi 3.0 cycles xmm 16->8 double lo 4.0 cycles xmm 16->8 double hi -- Narrow to wide: 13.1 cycles short 2+2->4 int 13.0 cycles int 4+4->8 double 4.1 cycles double 8+8->16 xmm

Steven Johnson

Comment 13

•

17 years ago

That sounds like a compelling win, but I think I'm missing something: even if we move to 32-bit type tags, we still have to fit the values into NaN bit patterns -- it's not clear to me how we can do this and also use values in the range -1...-128. Or are you simply proposing we replace our 16-bit load with a 32-bit load? (Sorry, slow brain day, I think...) Good food for thought, though... we need to revisit our Box approach in any event for x86-64 as well. (Does LuaJIT handle x86-64 at this point? Last I checked it was 32-bit only, not sure if you've pondered that issue.)

Mike Pall

Comment 14

•

17 years ago

Yes, you should replace a 16 bit load with a 32 bit load. And while you're at it, it makes sense not to use 0xfff90000-0xffff0000, but rather 0xfffffff9-0xffffffff as type tags. This gives you small immediates to compare against. And if you need a positive tag range, just use ~tag. Of course you are no longer restricted to only 7 tags, too. 2^19-1 tags give you a lot more freedom. Actually I'm currently using 13 tags in LuaJIT. E.g. splitting up the boolean type into a true and a false tag pays off. There are only a couple non-performance sensitive paths where you need to treat them the same. With careful ordering of tags you can also reduce the number of branches. E.g. checking for truth/falseness in if-statements reduces to a single compare in Lua. Of course JS has different semantics, so one needs to analyze the common fast-paths and optimize the tag layout accordingly. I've exchanged some thoughts with Edwin about how to keep 64 bit boxes on x64. I've given him permission to add the email exchange to bug 433793.

Dan Smith

Updated

•

16 years ago

Summary: TT: eliminate dbox where possible, when we know canonical NaN's will be preserved → eliminate dbox where possible, when we know canonical NaN's will be preserved

Dan Smith

Updated

•

16 years ago

Component: Tracing Virtual Machine → JIT Compiler (NanoJIT)

QA Contact: tracing-vm → nanojit

Edwin Smith

Reporter

Updated

•

15 years ago

Severity: normal → enhancement

Priority: P2 → --

Target Milestone: --- → Future

Edwin Smith

Reporter

Comment 15

•

15 years ago

this bug depends on the Box port to TR.

Depends on: 490565

Trevor Baker

Updated

•

15 years ago

Component: JIT Compiler (NanoJIT) → Nanojit

Product: Tamarin → Core

QA Contact: nanojit → nanojit

Trevor Baker

Updated

•

15 years ago

Component: Nanojit → JIT Compiler (NanoJIT)

Product: Core → Tamarin

QA Contact: nanojit → nanojit

Dan Smith

Updated

•

13 years ago

Flags: flashplayer-qrb+

Flags: flashplayer-injection-

Flags: flashplayer-bug-

Sylvestre Ledru [:Sylvestre]

Comment 16

•

6 years ago

Tamarin isn't maintained anymore. WONTFIX remaining bugs.

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → WONTFIX