Closed Bug 697097 Opened 13 years ago Closed 13 years ago

float4_t pass/return conventions (was: float4 % does not produce the correct value when interpreted on win64)

Categories

(Tamarin Graveyard :: Interpreter, defect, P1)

x86_64
Windows 7
defect

Tracking

(Not tracked)

VERIFIED FIXED
Q2 12 - Cyril

People

(Reporter: brbaker, Assigned: virgilp)

References

Details

Attachments

(1 file, 1 obsolete file)

It appears as though the result ends up being float4.xxyy

Code:
var f1:float4 = float4(3f, 6f, 9f, 3f);
var f2:float4 = float4(2f, 4f, 3f, 4f);

function check(val1:*, val2:*):*
{
    return (val1 % val2);
}
print("expected: 1,2,0,3");
print(f1 % f2);
print(check(f1, f2));

When this is interpreted, the output is '1,1,2,2' instead of '1,2,0,3'

This only happens on Windows 64-bit; the 32-bit shell does NOT have this issue.

Testmedia: as3/Types/Float4/flt4_5_2_3.as
Flags: in-testsuite+
Flags: flashplayer-triage+
Flags: flashplayer-qrb+
Flags: flashplayer-bug-
I think this might be a little more general and not just isolated to modulo

Consider that this code also produces incorrect results on win64, but not win32

Code:
print(float4(float.NaN, 0f, -0f, 1f) ? 'PASSED' : 'FAILED');
var b:Boolean = float4(float.NaN, 0f, -0f, 1f);
print(b);

Expected output:
PASSED
true

Win64 -Ojit
FAILED
false

Win64 -Dinterp
PASSED
false
Hardware: x86 → x86_64
sh*t. I'm fairly certain this is a compiler bug :(
I have no solution that can be applied today, the bug is too pervasive.
Basically, sometimes when returning a float4 (__m128) value, the MSVC x64 compiler would mess it up (double the first two values, i.e. return "(a,a,b,b)" instead of "(a,b,c,d)"). That explains the "modulo" bug, that also explains the "toBoolean" bug (put the "1f" on the first position and you will no longer reproduce it). 

The compiler bug appears related to inlining, but disabling inlining (/Ob0) doesn't fix it, so it may be more than that. I tried selectively disabling optimization around a few functions, but while that locally "fixes" the problem for that function, the bug just pops up somewhere else.

I suggest that right now we work with "/Od" (disable optimization, but keep all the other "release" flags) on Win64, so that we may at least test correctness.
Attached patch disable optimizations on win64 (obsolete) — Splinter Review
stopgap fix to at least make sure that we can test properly on win64
Attachment #570564 - Flags: review?(virgilp)
Added to the patch queue series as 'bug697097.disableWIN64optimizations.diff' in changeset 321:2056bcae3a1b
Attachment #570564 - Flags: review?(virgilp) → review+
Priority: -- → P3
Target Milestone: --- → Q2 12 - Cyril
I'm pretty sure this bug is really about the float4_t pass/return, hence P1.
Priority: P3 → P1
This patch transforms float4 into a "struct" and works around the MSVC compiler bug(s?). It should also enable us to run on Tegra hardware (but build scripts need some modifications for that to work).

The general solution is: JIT uses the regular, "intrinsic" calling convention. The nativegen.py script is modified so that all AS natives that receive "float4" parameters will actually translate to a call to a C++ method that receives a "const float4_t&" reference; and all AS natives that return float4 will translate to a C++ method that receives an additional "float4_t& retval" parameter.  There is some magic (and one additional level of call indirection) inserted in builtin.cpp to adapt the "C++ calling convention" to the "JIT calling convention" (and also to work around MS compiler bugs - e.g. I had to manually align some variables on stack). 

The general rule of thumb going forward should be that NO (*) method from jit-calls.h should ever receive a "F4" argument, or return a "F4" argument - they should always receive "P" and return "V" (with an additional P parameter if they return float4). I tried to assert this but I can't really since the above rule isn't valid for e.g. vfcallimt. But some form of assertion should be added going forward (maybe ARGTYPE_F4 should be "undefined" after the thunk declarations). And yes, currently vector methods are incorrect, we no longer call those from codegenlir (otherwise we'd get wrong results).
Attachment #570564 - Attachment is obsolete: true
Attachment #575887 - Flags: superreview?(wmaddox)
Attachment #575887 - Flags: review?(lhansen)
Attachment #575887 - Flags: feedback?(edwsmith)
Comment on attachment 575887 [details] [diff] [review]
Transform "float4_t" to a plain struct type.

One more observation: as things stand right now, it's going to be tricky to guarantee -Ojit/-Dinterp consistency. For example - we moved "f4_mul" from an intrinsic that operates on SIMD types to a function that operates on structures (one float member at a time). The interpreter calls f4_mul - but CodegenLIR will of course continue to generate the SIMD instruction for multiplication (on all platforms but Tegra). I'm not sure that those two will yield strictly identical results in all circumstances - they may, but we must be careful (e.g. handling of denormalized values may be a difference).
We could of course keep the "intrinsic C++ version", but that will only narrow down the problem to specific cases (e.g. ARM devices that *do* have NEON hardware), it would not completely eliminate it.
Attachment #575887 - Flags: superreview?(wmaddox)
Attachment #575887 - Flags: superreview?(edwsmith)
Attachment #575887 - Flags: review?(wmaddox)
Attachment #575887 - Flags: feedback?(edwsmith)
Attachment #575887 - Flags: feedback?
FWIW, this patch is tested to work on debug-debugger Mac32, Win32, release-debugger Mac32, Win32, Mac64, Win64, Arm-Android. It also crashes on Honeycomb/Tegra2 with "illegal instruction", presumably due to our use of Neon instructions.
Comment on attachment 575887 [details] [diff] [review]
Transform "float4_t" to a plain struct type.

Review of attachment 575887 [details] [diff] [review]:
-----------------------------------------------------------------

This looks pretty good, and, although there are quite a few issues for consideration below, I didn't see any disastrous flaws.  I'll give this an R+, subject to adding some additional commentary as requested, and giving due consideration to the other points raised.  The platform-specific arithmetic intrinsics *must* be reinstated in the absence of a considered determination to the contrary from the spec side, or a thorough verification that the SIMD instructions behave identically to the generic float-based implementations.

Substantive issues:

The entire scheme for this patch is out of tune relative to established conventions in our codebase, and a bit of an ugly wart.  There needs to be a high-level design-oriented comment that explains why all of this was done, i.e., the various compiler bugs, funky alignment restrictions for SIMD types, differing ABIs, etc. so that the reader is not surprised, and knows which choices were made explicitly for good reason, and what might be incidental and safely subject to revision as the code evolves.

You noted the possibility that the generic float4 operations may differ in semantics from the SIMD instructions emitted by the JIT.  I noticed that you removed the platform-specific versions using the intrinsics, which are used not only by the builtin functions, but by constant-folding in the JIT.  It really is unacceptable that expressions evaluated opportunistically at compile-time should yield different results than had the computation been deferred.  I think that we should use the arithmetic intrinsics where available, and fall back to generic versions only on platforms that have no meaningful hardware support for float4.  On these platforms, we'll have to emulate float4 with scalar floats, so consistency should be maintained.  Operations that can be expected to behave consistently, such as swizzling and component selection, can continue to be implemented using generic code on all platforms. You've also noted the JIT vs. interpreter issue, which is at least as bad as the constant folding issue, at least when OSR is enabled.

It is not clear why AvmCore::float4(float4_t& retval, Atom atom) is concerned about the alignment of the retval argument, as it is a struct, and will not be written using a SIMD instruction.  A simple structure assignment should suffice in all cases.

#define VECTORFLOAT4ADDRF4(f) NULL
This is apparently just a placeholder, but is nonsense if actually invoked.
Need a comment here.

else if (false) { // was: objType == VECTORFLOAT4_TYPE
Presumably you are disabling temporarily-broken optimized vector paths. Need a comment here.

lea((sp-1) << 4
You no longer load float4 values with localGetf4, but you should be able to provide a functional wrapper around the code that generates the address.  Perhaps localGetf4Addr() or somesuch.

// FIXME (Bugzilla 703605): Probably not right for either NaN nor -0 vs +0
For F4 min or max.  Definitely wrong for Number min/max.  Did we decide to go for the naive single comparison for Float/Float4?

In instr.cpp, (e.g. __modulo), and in VMPI.h, we break the usual convention that C++ code returns float4_t via an 'out' reference parameter, returning a struct value instead.  Why do we mix these conventions?  If the return value is handled correctly and efficiently, it seems that we should use the ordinary return syntax consistently, except where the function can be invoked directly from JIT'd code, i.e., the return value must be returned to JIT'd code.

Why are thunkEnterVECR_adapter() and friends defined in generated code?  The code is boilerplate that is emitted w/o customization.  Also, given that it is defined in builtin.cpp, why is it declared, not in builtin.h, but in exec.cpp?

// Hack: that's not really the prototype, but that's the point, we can't specify the real prototype.
Why not?  I realize this isn't the prototype you want to use, but it appears that you cast function pointers to these routines to other types anyway.  The void* return type is simply bogus, and can't possibly reliably do the right thing, because the actual float4_ret_t result is not a pointer at all -- you've got to cast the function to something with the correct return type at the point you actually want to consume the result. (For example, see casts to VecrMethodProc or to GprMethodProc.)

op_negate, and others, in instr.cpp:
Are there tests for these, especially corner cases like -0?

Why all the deleted asType() definitions in builtin.h?  Were they previously superfluous as well, or did I miss something float/float4 related?

Loose end in nativegen.py:
if(ctype != CTYPE_FLOAT4): # In fact we only need this for one class (ClassClosure? I forget now.); but it's only important that we don't emit it for float4_t


Style:

I prefer to use an explicit pointer argument in preference to a non-const reference argument wherever possible, e.g., the first argument to AvmCore::float4(). This way, the mutability of the argument is made obvious by the fact that a pointer is passed, likely with an explicit '&' at the call site.  This convention is also part of the Google coding conventions used in Halfmoon.  It's not a big deal, though.

Patch hygiene:

There was an extensive systematic renaming of leaIns() to lea() that would best be left to a separate cleanup patch.  I say this for future reference, as I doubt it's worth the trouble to do that now.

double arg1 = AvmThunkUnbox_DOUBLE(double, argv[argoff1]);
In builtin.cpp, arguments are now broken out into temporary variables, such as arg1. This was likely not an essential part of the patch, and should have been done in a separate cleanup patch.  It created a huge amount of noise in this patch.

In general, a posted patch should not include machine-generated code, though, in this case, because the generation scheme itself was changed, it was helpful to see at least a few illustrative cases of the generator changes at work.

Nits:

Atom Float4Class::construct(int argc, Atom* argv)
-- funky indentation, possibly tabs vs. spaces

Please put a space after commas.

In instr.cpp: meas -> means (typo)
Attachment #575887 - Flags: review?(wmaddox) → review+
Just a quick comment on the "why not" question: we can't use "float4_ret_t" since it's not available (that's the whole point), and we can't use "float4_t" since that is a struct and would insert an additional argument in the parameter list for most platforms. We need to use a return type that doesn't modify the rest of the parameter list. Between "void" and "void*" I chose "void*" because I thought it would at least communicate that we return something. Maybe I should've used "float4_t*" or "float4_t&" ?
I think it is a very bad idea to have prototypes that don't match the function definition.  In fact, the only way you are getting away with this at all is by using "C" linkage so that name mangling doesn't occur.  Why can't we make float4_ret_t available?  The actual calling convention used when the trampoline is called will depend on the type to which the function pointer is cast -- the prototype itself will have no effect.

Alternatively, you could define pre-casted function pointer constants alongside the definition of the trampolines, so that their actual signatures need not be made visible at all.
Summary: float4 % does not produce the correct value when interpreted on win64 → float4_t pass/return conventions (was: float4 % does not produce the correct value when interpreted on win64)
Some notes (less interesting than Bill's):

Not obvious to me that treating float4_t as an array-of-float, which is done repeatedly, is in fact portable -- not obvious that the compiler can't insert packing between the elements.  However my reading of the spec has not yet delivered up any evidence one way or the other.

However, since we do depend on that, the rewrite to get rid of float4_overlay seems unnecessary, since the float4 could clearly be overlaid with a float[4] array.  The new code is less clear than the old code.

Stylistically having multiple statements per line is bad, as in this example from instr.cpp (but there are others, also not inside macros - I understand these lines replace initializing assignments but even so it's impossible to read the code):

   float4_t lhsv, rhsv; AvmCore::float4(lhsv, lhs); AvmCore::float4(rhsv, rhs);

If you really wanted to encapsulate it you would have a macro:

   #define float4_t_decl(var, init)  float4_t var; AvmCore::float4(var, init)

and then

   float4_t_decl(lhsv, lhs);
   float4_t_decl(rhsv, rhs);

But IMO just writing out the code on three (or four) lines is equally clear.

The added code in op_negate appears redundant, note the test against zeroIntAtom that guards it.

The METHOD for Float4VectorObject_getNativeIntProperty looks wrong, should it not be RF4 instead of P?

Above Float4Class::reciprocal you removed the reference to the pertinent bugzilla, please put it back.
Regarding intrinsics vs not: Generally, float4 results will be machine dependent, for example, on a machine that has no rsqrt support the result for rsqrt(x) will be the same as for 1/sqrt(x), but on a machine that has rsqrt support the result will be something else (and indeed different machines that have rsqrt support may provide different results).   For another example, the result for SIMD "+" will be whatever the machine provides, and may differ slightly from machine to machine: on a machine where the SIMD instructions do not support denormals, float4 "+" will not properly handle denormals.  (We could choose differently but that's the current strategy.)  We will likely extend this to all float4 methods such as max.

However the result for SIMD "+" should not ever vary *on the same machine* depending on eg optimization strategies, OSR, interpreter vs JIT, etc etc - so it is incorrect to implement SIMD "+" with elementwise non-SIMD operations in one case and with the SIMD operations in another case, unless those can be proven to have the same semantics.  If the results can differ then we must use the SIMD instructions, even if that's a headache.

Now one can argue that that's a lot of runaround for very little gain, but I think the situation where results can vary depending on completely random-seeming factors is unacceptable.  One thing is denormals, another thing is issues around +0, -0, and NaN, and who knows what else (hopefully not too much, but then - who knows?)
Other than that I subscribe to all of Bill's comments as well.

A very guarded R+.  I suggest waiting for the discussion to settle down and the issues to be resolved and then make a decision about whether we want to try to land this before the first drop or not (esp since it breaks the Vector.<float4> optimizations, though probably not in any unfixable way really).
Attachment #575887 - Flags: review?(lhansen) → review+
Should be fixed now - changes integrated into tr-float, tests pass.
Vector.<float4> is not optimized right now.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
(In reply to Brent Baker from comment #5)
> Added to the patch queue series as
> 'bug697097.disableWIN64optimizations.diff' in changeset 321:2056bcae3a1b

changeset:   7083:32f017a61221
tag:         tip
user:        Brent Baker <brbaker@adobe.com>
date:        Wed Dec 14 13:33:33 2011 -0500
summary:     Bug 679097: revert change that disabled compiler optimizations on windows64
Comment on attachment 575887 [details] [diff] [review]
Transform "float4_t" to a plain struct type.

rubber stamp... boing!
Attachment #575887 - Flags: superreview?(edwsmith) → superreview+
changeset: 6823:3be1efe8d051
user:      Virgil Palanciuc <virgilp@adobe.com>
summary:   bug 697097 disable optimizations on win64 builds

http://hg.mozilla.org/tamarin-redux/rev/3be1efe8d051
changeset:   7109:32f017a61221
user:        Brent Baker <brbaker@adobe.com>
date:        Wed Dec 14 13:33:33 2011 -0500
summary:     Bug 679097: revert change that disabled compiler optimizations on windows64
Status: RESOLVED → VERIFIED