Closed
Bug 416398
Opened 17 years ago
Closed 6 years ago
eliminate dbox where possible, when we know canonical NaN's will be preserved
Categories
(Tamarin Graveyard :: Baseline JIT (CodegenLIR), enhancement)
Tracking
(Not tracked)
RESOLVED
WONTFIX
Future
People
(Reporter: edwsmith, Unassigned)
References
Details
Attachments
(1 file)
6.03 KB,
text/x-csrc
|
Details |
The implementation of the math function isNaN() cab be significantly improved. Currently, it is defined as:
static bool isNaN(double value) { return value != value; }
On x86, C++ compilers use x87 instructions to implement this code. The generated code compares the argument value against itself and checks the flags.
fld QWORD PTR [esp+04h]
fcomp QWORD PTR [esp+04h]
fnstsw ax
test ah, 0x44h
When the value is actually Not a Number (NaN), hardware exceptions happen and HW microcode handles the exception. This has severe performance penalty. There is a much more efficient way of doing this by avoiding the x87 FPU and using integer instructions (event without any need for SSE2). The performance improvement of the primitive is huge when handling NaNs - something like 35x improvement. When the argument is not a NaN value, the performance of the two approaches are comparable.
IsNaN() is one of the hot Tamarin helper functions on the JSBench benchmarks (MolDyn), where ~29% of the total execution time is in isNaN(). By using the right code (fully expressible in C), I get about 20% overall improvement in that benchmark.
- moh
Reporter | ||
Comment 1•17 years ago
|
||
From: Edwin Smith [mailto:edwsmith@adobe.com]
Sent: Thursday, February 07, 2008 9:36 PM
To: Haghighat, Mohammad R; tamarin-devel@mozilla.org
Subject: RE: a much more efficient implementation of isNaN()
In tamarin-tracing, "x != x" is frequently used as an isNaN check, to box nan values with a canonical bit pattern. we also have a decent amount of C code doing "x != x" as a nan check, without explicitly calling isNaN. also, in TT, the code we emit is different for SSE2 or x87. usually, the values are not NaN. is this penalty for a NaN specific to the x87 instructions?
on TT, with the x87 or sse2, we might be able to use the status registers after the instruction to detect NaN, rather than explicitly comparing x != x. with sse2 we could also use an integer instruction to check for the nan bit pattern. but for x87 if we dont use the fpu status register, then it sounds like a store and integer-load is required?
Ed
Reporter | ||
Comment 2•17 years ago
|
||
-----Original Message-----
From: Mike Pall [mailto:mikelu-0802@mike.de]
Sent: Friday, February 08, 2008 8:39 AM
To: tamarin-devel@mozilla.org
Cc: Edwin Smith
Subject: Re: a much more efficient implementation of isNaN()
Edwin Smith wrote:
> In tamarin-tracing, "x != x" is frequently used as an isNaN check, to
> box nan values with a canonical bit pattern.
There's no need to canonicalize all NaNs. It's sufficient to
check numbers flowing into the VM before boxing them (a simple
integer comparison on the hiword). You'll need this for strtod()
or calls/returns from C carrying an untrusted double to be boxed.
The FPU only ever generates the canonical NaN on its own (e.g.
for 0/0). NaNs are propagated by choosing the first NaN operand.
This way you'll end up with canonical NaNs everywhere.
The only caveat with your representation is the -(NaN) case for
SSE (using xorpd). You'll want to avoid using this bit pattern
for other types. But there are plenty of mantissa bits left.
[At least that's what I'm doing and so far it works fine.]
> is this penalty for a NaN specific to the x87 instructions?
Nope, it applies everywhere (and to +-Inf and denormals, too).
There is only a single FPU driven by different instruction sets.
Same problem is present on other platforms. Should you ever
support VFP on ARM, you'll find that it traps to the kernel
in these cases.
> on TT, with the x87 or sse2, we might be able to use the status
> registers after the instruction to detect NaN, rather than explicitly
> comparing x != x.
The status register bits are sticky. You'd need to clear them
before every operation. But the clear operation is microcoded and
causes a pipeline flush. Believe me, you don't want this. :-)
--Mike
Reporter | ||
Comment 3•17 years ago
|
||
From: Haghighat, Mohammad R [mailto:mohammad.r.haghighat@intel.com]
Sent: Friday, February 08, 2008 1:19 AM
To: Edwin Smith; tamarin-devel@mozilla.org
Subject: RE: a much more efficient implementation of isNaN()
The penalty is for the float compare instruction (fcomp). Other instructions may also have penalties. I've got to see them before saying anything for sure.
But the code for doing isNaN is actually very simple, specially if it is on 64 bit systems.
Here's the code. I've also included that in the performance bug I submitted.
#define I64(f) (*(long long int *)&f)
static bool isNaN (double value)
{
unsigned long long int jvalue = (I64(value) & ~0x8000000000000000uLL);
return (jvalue > 0x7ff0000000000000uLL);
}
Essentially, one needs to clear out the sign bit and compare the absolute value (interpreted as an unsigned integer) of the argument to +Infinity (again interpreted as an unsigned integer). If the argument is greater, then the value is a NaN. It would be a good idea to have an assertion (preferably during compile time) that the sizeof (long long int, or __int64) is 8 to ensure it is 64 bits. The code relies on this assumption.
- moh
Reporter | ||
Updated•17 years ago
|
Component: Virtual Machine → Tracing Virtual Machine
QA Contact: vm → tracing-vm
Reporter | ||
Updated•17 years ago
|
Priority: -- → P2
Reporter | ||
Comment 6•17 years ago
|
||
TT's Box encoding uses high bits fff8-ffff for int thru null. on x86, the canonical nan is 0xfff80... so we'll need to tweak the tagging scheme to fit the type tags in the range > 0xfff80... some ideas:
1. merge boxedNull & boxedVoid.
null = BoxedVoid|0
undefined = BoxedVoid|1
hole = BoxedVoid|2
gc obj = BoxedVoid|ptr
1a. eliminate null as a tag, instead lead BoxedObj|Str|Ns have a null payload.
- this eliminates branching when boxing pointers, but introduces branching for the various cases that process boxed obj|str|null.
- TC encodes nulls as Str|0, Obj|0, etc.
2. eliminate BoxedNs.
ns is a primitive to aid in bootstrapping
symbol tables before class Namespace comes into being. however we
would win if namespaces in symbol tables were not script-visible
namespace objects but rather internal namespace values that dont
require boxing, ever.
3. use top 17 bits for type tag:
fff80 = canonical NaN
fff88
fff90
fff98
...
ffff0
ffff8 = highest flag
now there's plenty of room for more than 7 types. (17 bits = room for 15 types. 18 bits = room for 31 types, etc).
as long as the loads & stores are 32 bits this should be fine since we dont need the next 16 (or 15) bits anyway. trying to use 16bit load/stores leads to store forwarding penalties on 32bit machines.
could be non-optimal for 16bit databus arm7's
Comment 7•17 years ago
|
||
Of these, removing BoxedNs sounds the most appealing to me, since de-magicking Namespace is something we've tossed around for a while anyway. Not sure how much work it might be to do so, however.
Re: 17 or 18 bits, keeping it 16 bits for now might prove a win on some ARM architectures, so if we can do so painlessly, IMHO worth doing. The penalties on Intel can easily be avoided by just promoting to 32-bit load/stores.
Reporter | ||
Comment 8•17 years ago
|
||
Probably the bulk of work in de-magicking Namespace, is also where the bulk of the benefit is, namely: letting symbol tables contain some much lighter weight namespace thing (like a pointer into the abc or IL, nothing more), making light namespace-things from real Namespaces for dynamic looksup, and making real Namespaces from light namespaces, for iteration and proxy. (or maybe not for iteration, because dynamic expandos dont contain namespaces).
the acid test is being able to bootstrap system to the point where we can new a real Namespace.
Reporter | ||
Comment 9•17 years ago
|
||
After looking at the forth code some, i'm also warming up to combining BoxedNull and BoxedVoid. BoxedSpecial|0 = null, Special|1 = undefined, Special|2 = array hole, etc. ES4 will need a new not-initialized-const value too, maybe Special|3. most of the cases for null & void are the same, and where they are different, they seem to be cold paths.
doing this *and* de-magiking namespace would eliminate 2 cases from all our 9-entry switch(boxtype) tables, which is a decent savings.
similarly a con to using more boxtype types is more cases to handle in every such table.
its also tempting to try and merge Int|Uint and use a longer integer-like value to avoid promotions... but now i'm starting to ramble.
Comment 10•17 years ago
|
||
yeah, combining null and void is probably easier to accomplish short-term, and is often handled the same.
merging int and uint, I'm not as sure about... time for a new tracking bug on that topic :-)
Comment 11•17 years ago
|
||
Igor is making void a pseudo-boolean in SM (hole is a pseudo-b already, and we have another for async return -- what gets thrown at a generator being closed).
Jason is reviving the hack-patch I did in bug 161110 to make SM's jsval mimic TT's box, so cc'ing him too.
/be
Comment 12•17 years ago
|
||
Here's a simple store-to-load forwarding microbenchmark. Relevance to TT (and for possible re-use in SM) explained below.
It measures the time in cycles it takes to forward different data sizes through memory. Some cases are gracefully handled, but others incur heavy pipeline stalls. Which cases are good or bad highly depends on the CPU microarchitecture.
Note that all of them assume proper alignment of the stack. Penalties for misalignment are much higher and they often disable the fast forwarding cases, too. I'm assuming you've already made sure the box type is always 8-byte aligned (a double inside a struct/union in the Linux/*BSD x86 ABI only forces alignment to 4 bytes, the Windows x86 ABI and probably all non-x86 architectures force 8 bytes).
It's hard to adjust for the loop overhead on a super-scalar CPU. But it usually vanishes in the pipeline noise. Still, the cycle times may not accurately reflect just the forwarding itself. But the outliers are clearly visible.
Below are the results for a Core2 (65nm), Pentium 4 (Celeron) and a Pentium III. Timings for an Intel Atom would be interesting (if anyone has got one already). I don't have an AMD K8 or K10 at hand, but the manuals indicate similar penalties.
Now for the relevance to TT (and possibly SM):
As you can see this relates to the type checks you're doing in TT after storing a number to a box. The double->short_hi case is always one of the worst. The double->int_hi case is much cheaper and basically free on a Core2. This means you should definitely convert over to 32 bit type tags. If you use type tags in the range -1..-128 you also get shorter instruction encodings for the type checks, too (cmp [mem], imm8).
*** Store-to-load forwarding microbenchmark ***
model name : Intel(R) Core(TM)2 CPU 6420 @ 2.13GHz
CPU Frequency : 2128.000 MHz
-- Same size:
2.3 cycles short 2->2 short
2.3 cycles int 4->4 int
6.0 cycles double 8->8 double
-- Wide to narrow:
2.3 cycles int 4->2 short lo
13.0 cycles int 4->2 short hi
2.3 cycles double 8->2 short lo
13.0 cycles double 8->2 short hi CURRENT 16 bit box type check
2.3 cycles double 8->4 int lo
2.3 cycles double 8->4 int hi PROPOSED 32 bit box type check
2.3 cycles xmm 16->2 short lo
13.0 cycles xmm 16->2 short hi
2.3 cycles xmm 16->4 int lo
2.3 cycles xmm 16->4 int hi
3.0 cycles xmm 16->8 double lo
3.0 cycles xmm 16->8 double hi
-- Narrow to wide:
14.0 cycles short 2+2->4 int
14.0 cycles int 4+4->8 double
14.0 cycles double 8+8->16 xmm
*** Store-to-load forwarding microbenchmark ***
model name : Intel(R) Celeron(R) CPU 2.66GHz
CPU Frequency : 2660.468 MHz
-- Same size:
2.7 cycles short 2->2 short
2.1 cycles int 4->4 int
20.0 cycles double 8->8 double
-- Wide to narrow:
2.9 cycles int 4->2 short lo
5.1 cycles int 4->2 short hi
3.2 cycles double 8->2 short lo
38.0 cycles double 8->2 short hi CURRENT 16 bit box type check
2.4 cycles double 8->4 int lo
4.5 cycles double 8->4 int hi PROPOSED 32 bit box type check
2.7 cycles xmm 16->2 short lo
38.6 cycles xmm 16->2 short hi
2.9 cycles xmm 16->4 int lo
38.0 cycles xmm 16->4 int hi
6.1 cycles xmm 16->8 double lo
57.7 cycles xmm 16->8 double hi
-- Narrow to wide:
42.0 cycles short 2+2->4 int
56.1 cycles int 4+4->8 double
71.8 cycles double 8+8->16 xmm
*** Store-to-load forwarding microbenchmark ***
model name : Pentium III (Coppermine)
CPU Frequency : 996.698 MHz
-- Same size:
9.0 cycles short 2->2 short
3.0 cycles int 4->4 int
3.1 cycles double 8->8 double
-- Wide to narrow:
3.0 cycles int 4->2 short lo
11.1 cycles int 4->2 short hi
3.0 cycles double 8->2 short lo
11.0 cycles double 8->2 short hi CURRENT 16 bit box type check
4.1 cycles double 8->4 int lo
11.1 cycles double 8->4 int hi PROPOSED 32 bit box type check
4.0 cycles xmm 16->2 short lo
12.7 cycles xmm 16->2 short hi
3.1 cycles xmm 16->4 int lo
12.7 cycles xmm 16->4 int hi
3.0 cycles xmm 16->8 double lo
4.0 cycles xmm 16->8 double hi
-- Narrow to wide:
13.1 cycles short 2+2->4 int
13.0 cycles int 4+4->8 double
4.1 cycles double 8+8->16 xmm
Comment 13•17 years ago
|
||
That sounds like a compelling win, but I think I'm missing something: even if we move to 32-bit type tags, we still have to fit the values into NaN bit patterns -- it's not clear to me how we can do this and also use values in the range -1...-128. Or are you simply proposing we replace our 16-bit load with a 32-bit load? (Sorry, slow brain day, I think...)
Good food for thought, though... we need to revisit our Box approach in any event for x86-64 as well. (Does LuaJIT handle x86-64 at this point? Last I checked it was 32-bit only, not sure if you've pondered that issue.)
Comment 14•17 years ago
|
||
Yes, you should replace a 16 bit load with a 32 bit load. And while you're at it, it makes sense not to use 0xfff90000-0xffff0000, but rather 0xfffffff9-0xffffffff as type tags. This gives you small immediates to compare against. And if you need a positive tag range, just use ~tag.
Of course you are no longer restricted to only 7 tags, too. 2^19-1 tags give you a lot more freedom.
Actually I'm currently using 13 tags in LuaJIT. E.g. splitting up the boolean type into a true and a false tag pays off. There are only a couple non-performance sensitive paths where you need to treat them the same.
With careful ordering of tags you can also reduce the number of branches. E.g. checking for truth/falseness in if-statements reduces to a single compare in Lua. Of course JS has different semantics, so one needs to analyze the common fast-paths and optimize the tag layout accordingly.
I've exchanged some thoughts with Edwin about how to keep 64 bit boxes on x64. I've given him permission to add the email exchange to bug 433793.
Summary: TT: eliminate dbox where possible, when we know canonical NaN's will be preserved → eliminate dbox where possible, when we know canonical NaN's will be preserved
Component: Tracing Virtual Machine → JIT Compiler (NanoJIT)
QA Contact: tracing-vm → nanojit
Reporter | ||
Updated•15 years ago
|
Severity: normal → enhancement
Priority: P2 → --
Target Milestone: --- → Future
Updated•15 years ago
|
Component: JIT Compiler (NanoJIT) → Nanojit
Product: Tamarin → Core
QA Contact: nanojit → nanojit
Updated•15 years ago
|
Component: Nanojit → JIT Compiler (NanoJIT)
Product: Core → Tamarin
QA Contact: nanojit → nanojit
Flags: flashplayer-qrb+
Flags: flashplayer-injection-
Flags: flashplayer-bug-
Comment 16•6 years ago
|
||
Tamarin isn't maintained anymore. WONTFIX remaining bugs.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•