Closed Bug 416398 Opened 16 years ago Closed 6 years ago

eliminate dbox where possible, when we know canonical NaN's will be preserved

Categories

(Tamarin Graveyard :: Baseline JIT (CodegenLIR), enhancement)

x86
Windows XP
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX
Future

People

(Reporter: edwsmith, Unassigned)

References

Details

Attachments

(1 file)

The implementation of the math function isNaN() cab be significantly improved. Currently, it is defined as:

static bool isNaN(double value) { return value != value; }

On x86, C++ compilers use x87 instructions to implement this code. The generated code compares the argument value against itself and checks the flags. 

fld      QWORD PTR [esp+04h]
fcomp    QWORD PTR [esp+04h]
fnstsw   ax
test     ah, 0x44h

When the value is actually Not a Number (NaN), hardware exceptions happen and HW microcode handles the exception. This has severe performance penalty. There is a much more efficient way of doing this by avoiding the x87 FPU and using integer instructions (event without any need for SSE2). The performance improvement of the primitive is huge when handling NaNs - something like 35x improvement. When the argument is not a NaN value, the performance of the two approaches are comparable.

IsNaN() is one of the hot Tamarin helper functions on the JSBench benchmarks (MolDyn), where ~29% of the total execution time is in isNaN(). By using the right code (fully expressible in C), I get  about 20% overall improvement in that benchmark. 

- moh
From: Edwin Smith [mailto:edwsmith@adobe.com] 
Sent: Thursday, February 07, 2008 9:36 PM
To: Haghighat, Mohammad R; tamarin-devel@mozilla.org
Subject: RE: a much more efficient implementation of isNaN()
In tamarin-tracing, "x != x" is frequently used as an isNaN check, to box nan values with a canonical bit pattern.  we also have a decent amount of C code doing "x != x" as a nan check, without explicitly calling isNaN.  also, in TT, the code we emit is different for SSE2 or x87.  usually, the values are not NaN.  is this penalty for a NaN specific to the x87 instructions?

on TT, with the x87 or sse2, we might be able to use the status registers after the instruction to detect NaN, rather than explicitly comparing x != x.  with sse2 we could also use an integer instruction to check for the nan bit pattern.  but for x87 if we dont use the fpu status register, then it sounds like a store and integer-load is required?  

Ed

-----Original Message-----
From: Mike Pall [mailto:mikelu-0802@mike.de] 
Sent: Friday, February 08, 2008 8:39 AM
To: tamarin-devel@mozilla.org
Cc: Edwin Smith
Subject: Re: a much more efficient implementation of isNaN()

Edwin Smith wrote:
> In tamarin-tracing, "x != x" is frequently used as an isNaN check, to
> box nan values with a canonical bit pattern.

There's no need to canonicalize all NaNs. It's sufficient to
check numbers flowing into the VM before boxing them (a simple
integer comparison on the hiword). You'll need this for strtod()
or calls/returns from C carrying an untrusted double to be boxed.

The FPU only ever generates the canonical NaN on its own (e.g.
for 0/0). NaNs are propagated by choosing the first NaN operand.
This way you'll end up with canonical NaNs everywhere.

The only caveat with your representation is the -(NaN) case for
SSE (using xorpd). You'll want to avoid using this bit pattern
for other types. But there are plenty of mantissa bits left.

[At least that's what I'm doing and so far it works fine.]

> is this penalty for a NaN specific to the x87 instructions?

Nope, it applies everywhere (and to +-Inf and denormals, too).
There is only a single FPU driven by different instruction sets.

Same problem is present on other platforms. Should you ever
support VFP on ARM, you'll find that it traps to the kernel
in these cases.

> on TT, with the x87 or sse2, we might be able to use the status
> registers after the instruction to detect NaN, rather than explicitly
> comparing x != x.

The status register bits are sticky. You'd need to clear them
before every operation. But the clear operation is microcoded and
causes a pipeline flush. Believe me, you don't want this. :-)

--Mike
From: Haghighat, Mohammad R [mailto:mohammad.r.haghighat@intel.com] 
Sent: Friday, February 08, 2008 1:19 AM
To: Edwin Smith; tamarin-devel@mozilla.org
Subject: RE: a much more efficient implementation of isNaN()

The penalty is for the float compare instruction (fcomp). Other instructions may also have penalties. I've got to see them before saying anything for sure.
But the code for doing isNaN is actually very simple, specially if it is on 64 bit systems.
 
Here's the code. I've also included that in the performance bug I submitted.
 
#define I64(f) (*(long long int *)&f)
static bool isNaN (double value)
{
    unsigned long long int jvalue = (I64(value) & ~0x8000000000000000uLL);
    return (jvalue > 0x7ff0000000000000uLL);
}
Essentially, one needs to clear out the sign bit and compare the absolute value (interpreted as an unsigned integer) of the argument to +Infinity (again interpreted as an unsigned integer). If the argument is greater, then the value is a NaN. It would be a good idea to have an assertion (preferably during compile time) that the sizeof (long long int, or __int64) is 8 to ensure it is 64 bits. The code relies on this assumption.
 
- moh
 
Component: Virtual Machine → Tracing Virtual Machine
QA Contact: vm → tracing-vm
Priority: -- → P2
TT's Box encoding uses high bits fff8-ffff for int thru null.  on x86, the canonical nan is 0xfff80...  so we'll need to tweak the tagging scheme to fit the type tags in the range > 0xfff80...  some ideas:

1. merge boxedNull & boxedVoid.

null      = BoxedVoid|0
undefined = BoxedVoid|1
hole      = BoxedVoid|2
gc obj    = BoxedVoid|ptr

1a.  eliminate null as a tag, instead lead BoxedObj|Str|Ns have a null payload.

- this eliminates branching when boxing pointers, but introduces branching for the various cases that process boxed obj|str|null.  
- TC encodes nulls as Str|0, Obj|0, etc.  

2. eliminate BoxedNs.

ns is a primitive to aid in bootstrapping
symbol tables before class Namespace comes into being.  however we
would win if namespaces in symbol tables were not script-visible
namespace objects but rather internal namespace values that dont
require boxing, ever.

3. use top 17 bits for type tag:

fff80 = canonical NaN
fff88
fff90
fff98
...
ffff0
ffff8 = highest flag

now there's plenty of room for more than 7 types.  (17 bits = room for 15 types.  18 bits = room for 31 types, etc).

as long as the loads & stores are 32 bits this should be fine since we dont need the next 16 (or 15) bits anyway.  trying to use 16bit load/stores leads to store forwarding penalties on 32bit machines.  

could be non-optimal for 16bit databus arm7's

Of these, removing BoxedNs sounds the most appealing to me, since de-magicking Namespace is something we've tossed around for a while anyway. Not sure how much work it might be to do so, however.

Re: 17 or 18 bits, keeping it 16 bits for now might prove a win on some ARM architectures, so if we can do so painlessly, IMHO worth doing. The penalties on Intel can easily be avoided by just promoting to 32-bit load/stores.

 
Probably the bulk of work in de-magicking Namespace, is also where the bulk of the benefit is, namely: letting symbol tables contain some much lighter weight namespace thing (like a pointer into the abc or IL, nothing more), making light namespace-things from real Namespaces for dynamic looksup, and making real Namespaces from light namespaces, for iteration and proxy.  (or maybe not for iteration, because dynamic expandos dont contain namespaces).

the acid test is being able to bootstrap system to the point where we can new a real Namespace.
After looking at the forth code some, i'm also warming up to combining BoxedNull and BoxedVoid.  BoxedSpecial|0 = null, Special|1 = undefined, Special|2 = array hole, etc.  ES4 will need a new not-initialized-const value too, maybe Special|3.  most of the cases for null & void are the same, and where they are different, they seem to be cold paths.

doing this *and* de-magiking namespace would eliminate 2 cases from all our 9-entry switch(boxtype) tables, which is a decent savings.  

similarly a con to using more boxtype types is more cases to handle in every such table.

its also tempting to try and merge Int|Uint and use a longer integer-like value to avoid promotions... but now i'm starting to ramble.
yeah, combining null and void is probably easier to accomplish short-term, and is often handled the same.

merging int and uint, I'm not as sure about... time for a new tracking bug on that topic :-)
Igor is making void a pseudo-boolean in SM (hole is a pseudo-b already, and we have another for async return -- what gets thrown at a generator being closed).

Jason is reviving the hack-patch I did in bug 161110 to make SM's jsval mimic TT's box, so cc'ing him too.

/be
Here's a simple store-to-load forwarding microbenchmark. Relevance to TT (and for possible re-use in SM) explained below.

It measures the time in cycles it takes to forward different data sizes through memory. Some cases are gracefully handled, but others incur heavy pipeline stalls. Which cases are good or bad highly depends on the CPU microarchitecture.

Note that all of them assume proper alignment of the stack. Penalties for misalignment are much higher and they often disable the fast forwarding cases, too. I'm assuming you've already made sure the box type is always 8-byte aligned (a double inside a struct/union in the Linux/*BSD x86 ABI only forces alignment to 4 bytes, the Windows x86 ABI and probably all non-x86 architectures force 8 bytes).

It's hard to adjust for the loop overhead on a super-scalar CPU. But it usually vanishes in the pipeline noise. Still, the cycle times may not accurately reflect just the forwarding itself. But the outliers are clearly visible.

Below are the results for a Core2 (65nm), Pentium 4 (Celeron) and a Pentium III. Timings for an Intel Atom would be interesting (if anyone has got one already). I don't have an AMD K8 or K10 at hand, but the manuals indicate similar penalties.

Now for the relevance to TT (and possibly SM):

As you can see this relates to the type checks you're doing in TT after storing a number to a box. The double->short_hi case is always one of the worst. The double->int_hi case is much cheaper and basically free on a Core2. This means you should definitely convert over to 32 bit type tags. If you use type tags in the range -1..-128 you also get shorter instruction encodings for the type checks, too (cmp [mem], imm8).

*** Store-to-load forwarding microbenchmark ***
model name	: Intel(R) Core(TM)2 CPU          6420  @ 2.13GHz
CPU Frequency   : 2128.000 MHz
-- Same size:
 2.3 cycles  short    2->2  short
 2.3 cycles  int      4->4  int
 6.0 cycles  double   8->8  double
-- Wide to narrow:
 2.3 cycles  int      4->2  short  lo
13.0 cycles  int      4->2  short  hi
 2.3 cycles  double   8->2  short  lo
13.0 cycles  double   8->2  short  hi        CURRENT  16 bit box type check
 2.3 cycles  double   8->4  int    lo
 2.3 cycles  double   8->4  int    hi        PROPOSED 32 bit box type check
 2.3 cycles  xmm     16->2  short  lo
13.0 cycles  xmm     16->2  short  hi
 2.3 cycles  xmm     16->4  int    lo
 2.3 cycles  xmm     16->4  int    hi
 3.0 cycles  xmm     16->8  double lo
 3.0 cycles  xmm     16->8  double hi
-- Narrow to wide:
14.0 cycles  short  2+2->4  int
14.0 cycles  int    4+4->8  double
14.0 cycles  double 8+8->16 xmm

*** Store-to-load forwarding microbenchmark ***
model name      : Intel(R) Celeron(R) CPU 2.66GHz
CPU Frequency   : 2660.468 MHz
-- Same size:
 2.7 cycles  short    2->2  short
 2.1 cycles  int      4->4  int
20.0 cycles  double   8->8  double
-- Wide to narrow:
 2.9 cycles  int      4->2  short  lo
 5.1 cycles  int      4->2  short  hi
 3.2 cycles  double   8->2  short  lo
38.0 cycles  double   8->2  short  hi        CURRENT  16 bit box type check
 2.4 cycles  double   8->4  int    lo
 4.5 cycles  double   8->4  int    hi        PROPOSED 32 bit box type check
 2.7 cycles  xmm     16->2  short  lo
38.6 cycles  xmm     16->2  short  hi
 2.9 cycles  xmm     16->4  int    lo
38.0 cycles  xmm     16->4  int    hi
 6.1 cycles  xmm     16->8  double lo
57.7 cycles  xmm     16->8  double hi
-- Narrow to wide:
42.0 cycles  short  2+2->4  int
56.1 cycles  int    4+4->8  double
71.8 cycles  double 8+8->16 xmm

*** Store-to-load forwarding microbenchmark ***
model name      : Pentium III (Coppermine)
CPU Frequency   : 996.698 MHz
-- Same size:
 9.0 cycles  short    2->2  short
 3.0 cycles  int      4->4  int
 3.1 cycles  double   8->8  double
-- Wide to narrow:
 3.0 cycles  int      4->2  short  lo
11.1 cycles  int      4->2  short  hi
 3.0 cycles  double   8->2  short  lo
11.0 cycles  double   8->2  short  hi        CURRENT  16 bit box type check
 4.1 cycles  double   8->4  int    lo
11.1 cycles  double   8->4  int    hi        PROPOSED 32 bit box type check
 4.0 cycles  xmm     16->2  short  lo
12.7 cycles  xmm     16->2  short  hi
 3.1 cycles  xmm     16->4  int    lo
12.7 cycles  xmm     16->4  int    hi
 3.0 cycles  xmm     16->8  double lo
 4.0 cycles  xmm     16->8  double hi
-- Narrow to wide:
13.1 cycles  short  2+2->4  int
13.0 cycles  int    4+4->8  double
 4.1 cycles  double 8+8->16 xmm
That sounds like a compelling win, but I think I'm missing something: even if we move to 32-bit type tags, we still have to fit the values into NaN bit patterns -- it's not clear to me how we can do this and also use values in the range -1...-128. Or are you simply proposing we replace our 16-bit load with a 32-bit load? (Sorry, slow brain day, I think...)

Good food for thought, though... we need to revisit our Box approach in any event for x86-64 as well. (Does LuaJIT handle x86-64 at this point? Last I checked it was 32-bit only, not sure if you've pondered that issue.)
Yes, you should replace a 16 bit load with a 32 bit load. And while you're at it, it makes sense not to use 0xfff90000-0xffff0000, but rather 0xfffffff9-0xffffffff as type tags. This gives you small immediates to compare against. And if you need a positive tag range, just use ~tag.

Of course you are no longer restricted to only 7 tags, too. 2^19-1 tags give you a lot more freedom.

Actually I'm currently using 13 tags in LuaJIT. E.g. splitting up the boolean type into a true and a false tag pays off. There are only a couple non-performance sensitive paths where you need to treat them the same.

With careful ordering of tags you can also reduce the number of branches. E.g. checking for truth/falseness in if-statements reduces to a single compare in Lua. Of course JS has different semantics, so one needs to analyze the common fast-paths and optimize the tag layout accordingly.


I've exchanged some thoughts with Edwin about how to keep 64 bit boxes on x64. I've given him permission to add the email exchange to bug 433793.
Summary: TT: eliminate dbox where possible, when we know canonical NaN's will be preserved → eliminate dbox where possible, when we know canonical NaN's will be preserved
Component: Tracing Virtual Machine → JIT Compiler (NanoJIT)
QA Contact: tracing-vm → nanojit
Severity: normal → enhancement
Priority: P2 → --
Target Milestone: --- → Future
this bug depends on the Box port to TR.
Depends on: 490565
Component: JIT Compiler (NanoJIT) → Nanojit
Product: Tamarin → Core
QA Contact: nanojit → nanojit
Component: Nanojit → JIT Compiler (NanoJIT)
Product: Core → Tamarin
QA Contact: nanojit → nanojit
Flags: flashplayer-qrb+
Flags: flashplayer-injection-
Flags: flashplayer-bug-
Tamarin isn't maintained anymore. WONTFIX remaining bugs.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: