Open Bug 996375 Opened 8 years ago Updated 5 years ago

optimize fast_sincos with SIMD

Categories

(Core :: JavaScript Engine: JIT, enhancement, P5)

People

(Reporter: sunfish, Assigned: sunfish)

References

(Blocks 1 open bug)

Attachments

(3 files, 2 obsolete files)

Attached patch simd-sincos.patch (obsolete) — Splinter Review
This patch optimizes the fast_sincos routine introduced in bug 967709 to use SIMD to compute the two polynomial functions in parallel, for approximately a 15% speedup on targets which build with SSE2 support. Unfortunately, that doesn't include 32-bit Windows, but it does include 64-bit Windows (in fact, all x64 targets) as well as 32-bit Mac.

It also introduces the beginning of a new header, "mozilla/SIMD.h", which contains basic abstractions for doing explicit SIMD programming.

Unfortunately, the sincos algorithm is still slower than V8's, but there are a few additional optimizations beyond this one being discussed.

I'm also considering reverting the fast_sincos patch altogether, to fix the performance regressions, and because this SIMD patch doesn't help 32-bit Windows. Possibly we can restructure it to compile two versions and check processor support at runtime, but that would take more time.
Presumably the reason we're building for Windows-32 without SSE2 support is recorded somewhere.  Do you have a reference?  (Not questioning it; I'm merely curious.)

Keeping an old piece of information about non-SSE2 systems alive (https://bugzilla.mozilla.org/show_bug.cgi?id=521245#c18):

"Evidently the low-power Via C7, which is fairly recent, does not support SSE2 but is supported in configurations up to 2GHz, ie, SSE2 issues aside Flash content might run well on systems based around that CPU.  Wikipedia notes that HP ultraportables running XP and Vista were based on this CPU as late as 2008."
I don't know of any authoritative statements, but there's bug 544402 which is still open, and there's some discussion in bug 836658 too.
Comment on attachment 8406548 [details] [diff] [review]
simd-sincos.patch

This patch is one step toward making the current fast_sincos fast enough. I'm also pursuing alternative approaches.
Attachment #8406548 - Flags: review?(jdemooij)
Comment on attachment 8406548 [details] [diff] [review]
simd-sincos.patch

Review of attachment 8406548 [details] [diff] [review]:
-----------------------------------------------------------------

Cool! I'll forward this to Waldo though as he owns the mfbt module and I'm not familiar with its style and other rules.

::: mfbt/SIMD.h
@@ +42,5 @@
> +#ifdef __GNUC__
> +#define MOZ_COPYSIGN __builtin_copysign
> +#elif defined _WIN32
> +#define MOZ_COPYSIGN _copysign
> +#else  

Nit: trailing whitespace.
Attachment #8406548 - Flags: review?(jdemooij) → review?(jwalden+bmo)
Attached patch simd-sincos.patch (obsolete) — Splinter Review
I just discovered mfbt/STYLE, so here's a patch which fixes the obvious issues. The only outstanding issue I'm aware of is the naming convention.

The obvious thing would be to rename functions like f_add to InterCaps-ish FAdd and types like v4f64 to V4F64. I'm happy to do that if that's what's desired, or any other naming scheme.
Assignee: nobody → sunfish
Attachment #8406548 - Attachment is obsolete: true
Attachment #8406548 - Flags: review?(jwalden+bmo)
Attachment #8407699 - Flags: review?(jwalden+bmo)
Coming back to an earlier question, I compared three variants of the sine approximation:

Error bits    Method
    0.0487    |sol = sol * (x * (x * x)); sol += x;|
    0.0489    |sol = x * (sol * (x * x)); sol += x;|
    0.0484    |sol = (x * x) * (sol * x); sol += x;|

The difference is statistically significant, but very small - about 1% between the best and worst methods. In particular, this is smaller than the gain (8.21%) from the coefficients I proposed in bug 997459.

I don't know if it makes a difference for the SIMD implementation here, but if it does I suggest just using the fastest/cleanest method :)
Oh, and I suspect this variant would be nicest in terms of parity with cosine:

|sol *= x * x; sol += 1.0; sol *= x;|

However, that variant *does* perform significantly worse, so avoid that one.
Refresh patch against trunk, replace the extract/merge signbit hack with a nicer, faster, and more portable approach, and rename all the functions and types in SIMD.h to conform to mfbt/STYLE (which was the last remaining STYLE issue I'm aware of).
Attachment #8407699 - Attachment is obsolete: true
Attachment #8407699 - Flags: review?(jwalden+bmo)
Attachment #8409713 - Flags: review?(jwalden+bmo)
jwalden, ping?
There was a discussion on IRC about this last night - Waldo has been extremely busy with security audits for a number (6+) of weeks now, but suggested froydnj and jcranmer for review on an MFBT bug (bug 983538, specifically). They're not listed as MFBT peers, but it appears the list is just out of date.
Comment on attachment 8409713 [details] [diff] [review]
simd-sincos.patch

Review of attachment 8409713 [details] [diff] [review]:
-----------------------------------------------------------------

Haven't looked at jsmath.cpp or TestSIMD.cpp changes, but this at least gets something going.  Will pick those up tomorrow hopefully.

I am sort of pretending I understand various bits of the inline asm here and all.  :-)

::: mfbt/SIMD.h
@@ +22,5 @@
> +   // Clang supports the SIMD syntax and features even on targets which lack
> +   // SIMD support.
> +#  define MOZ_EXT_VECTOR_SIMD 1
> +#  define MOZ_BUILTIN_SIMD 1
> +#elif defined(__GNUC__)

MOZ_IS_GCC

@@ +24,5 @@
> +#  define MOZ_EXT_VECTOR_SIMD 1
> +#  define MOZ_BUILTIN_SIMD 1
> +#elif defined(__GNUC__)
> +   // GCC emits suboptimal code around xorpd though (filed as GCC bug 60826), so
> +   // we use <xmmintrin.h> there also for now.

Would be nice to say "as of GCC 4.9" or so to give a quick number for comparison, for people wondering if this might possibly be out of date at all.  Bug number's authoritative, but it's slower to check through.

@@ +31,5 @@
> +#  elif defined(__ARM_NEON__) || defined(__ALTIVEC__)
> +#    define MOZ_BUILTIN_SIMD 1
> +#  endif
> +#endif
> +#if !defined(MOZ_XMMINTRIN_SIMD) && !defined(MOZ_BUILTIN_SIMD)

Blank line before this, please, to separate feature-detection from fallback-creation.

@@ +33,5 @@
> +#  endif
> +#endif
> +#if !defined(MOZ_XMMINTRIN_SIMD) && !defined(MOZ_BUILTIN_SIMD)
> +   // No SIMD support. Do what we can.
> +#  define MOZ_FAKE_SIMD 1

Rather than three separate #defines for all this, how about:

#define MOZ_BUILTIN_SIMD 0
#define MOZ_XMMINTRIN_SIMD 1
#define MOZ_EMULATED_SIMD 2

#define MOZ_SIMD_TYPE

#if MOZ_SIMD_TYPE == MOZ_BUILTIN_SIMD
...
#elif MOZ_SIMD_TYPE == MOZ_XMMINTRIN_SIMD
...
#elif MOZ_SIMD_TYPE == MOZ_EMULATED_SIMD
...
#else
#  error "No suitable SIMD implementation selected"
#endif

@@ +36,5 @@
> +   // No SIMD support. Do what we can.
> +#  define MOZ_FAKE_SIMD 1
> +#endif
> +
> +// Figure out how to call copysign.

Put this in FloatingPoint.h as mozilla::CopySign (with float/double overloads), I think.  It's not SIMD-centric at all.

@@ +37,5 @@
> +#  define MOZ_FAKE_SIMD 1
> +#endif
> +
> +// Figure out how to call copysign.
> +#ifdef __GNUC__

MOZ_IS_GCC and Compiler.h, and __clang__, are slightly preferred here.  And these days you should probably ask ehsan how to not mess with the up-and-coming clang-cl.  (Or at least give him a heads-up here.)

@@ +51,5 @@
> +#endif
> +
> +namespace mozilla {
> +
> +namespace SIMD {

Namespaces are typically lowercase, not sure if SIMD should or shouldn't be an exception.  Whatever works for you.

@@ +58,5 @@
> +   // Add may_alias since we often have to cast pointers around, especially
> +   // since we're restricted in what we can do due to portability constraints.
> +#  ifdef MOZ_EXT_VECTOR_SIMD
> +typedef double V2F64 __attribute__((__ext_vector_type__(2), __may_alias__));
> +typedef int64_t V2I64 __attribute__((__ext_vector_type__(2), __may_alias__));

Kinda sad the only way to make this work is the __may_alias__ hammer, and not something finer-grained about this only aliasing double and int64_t storage or so.

(I know clang lets you surround attribute names with __ to avoid conflicts.  Does gcc as well, I take it, going back as far as we care?)

@@ +64,5 @@
> +typedef double V2F64 __attribute__((__vector_size__(16), __may_alias__));
> +typedef int64_t V2I64 __attribute__((__vector_size__(16), __may_alias__));
> +#  endif
> +#endif
> +#ifdef MOZ_XMMINTRIN_SIMD

Use #elif here to make the alternation clear.  This looks a lot like possible duplicate definition, otherwise.

@@ +70,5 @@
> +typedef __m128i V2I64;
> +#endif
> +#ifdef MOZ_FAKE_SIMD
> +typedef union { double e[2]; int64_t i[2]; } V2F64;
> +typedef struct { int64_t e[2]; } V2I64;

union V2F64 { ... };
struct V2I64 { ... };

unless there's some reason not to (and if so, enlighten me!).

@@ +79,5 @@
> +inline V2F64 FSub(V2F64 l, V2F64 r) { return l - r; }
> +inline V2F64 FMul(V2F64 l, V2F64 r) { return l * r; }
> +inline V2F64 FDiv(V2F64 l, V2F64 r) { return l / r; }
> +#endif
> +#ifdef MOZ_XMMINTRIN_SIMD

#elif

@@ +85,5 @@
> +inline V2F64 FSub(V2F64 l, V2F64 r) { return _mm_sub_pd(l, r); }
> +inline V2F64 FMul(V2F64 l, V2F64 r) { return _mm_mul_pd(l, r); }
> +inline V2F64 FDiv(V2F64 l, V2F64 r) { return _mm_div_pd(l, r); }
> +#endif
> +#ifdef MOZ_FAKE_SIMD

#elif with a #else #error "need SIMD definition" #endif, or just plain #else.

@@ +86,5 @@
> +inline V2F64 FMul(V2F64 l, V2F64 r) { return _mm_mul_pd(l, r); }
> +inline V2F64 FDiv(V2F64 l, V2F64 r) { return _mm_div_pd(l, r); }
> +#endif
> +#ifdef MOZ_FAKE_SIMD
> +inline V2F64 FAdd(V2F64 l, V2F64 r) {

inline V2F64
FAdd(...)
{
...
}

and so on.

@@ +118,5 @@
> +// true. Use inline asm rinses, in this case to prevent it from converting
> +// operations like andnpd into integer operations, defeating CSE with other
> +// vectors, requiring awkward movabs instructions to materialize immediates,
> +// and generally doing weird things.
> +inline V2F64 FAnd   (V2F64 l, V2F64 r) {

inline V2F64
FAnd(...)
{
  ...
}

as noted before.

@@ +120,5 @@
> +// vectors, requiring awkward movabs instructions to materialize immediates,
> +// and generally doing weird things.
> +inline V2F64 FAnd   (V2F64 l, V2F64 r) {
> +  asm("" : "+x"(l));
> +  V2F64 result = (V2F64)((V2I64)l & (V2I64)r);

Use C++-style casts throughout, please.  static_cast<> or reinterpret_cast<>, whichever is allowed and does the right thing, else constructor casts if those don't work.  (And if none of those work, oh well, I guess C-style casts aren't end of the world.)  Written this way this looks really innocuous -- best to make it look at least a little dodgy.

@@ +123,5 @@
> +  asm("" : "+x"(l));
> +  V2F64 result = (V2F64)((V2I64)l & (V2I64)r);
> +  return result;
> +}
> +inline V2F64 FOr  (V2F64 l, V2F64 r) {

Unless l/r are canonical names, I'd slightly prefer a/b due to "l" looking like "1".

@@ +139,5 @@
> +  V2F64 result = (V2F64)(~(V2I64)l & (V2I64)r);
> +  return result;
> +}
> +#  else
> +inline V2F64 FAnd   (V2F64 l, V2F64 r) { return (V2F64)((V2I64)l & (V2I64)r); }

Same formatting as previously mentioned.

@@ +145,5 @@
> +inline V2F64 FXor   (V2F64 l, V2F64 r) { return (V2F64)((V2I64)l ^ (V2I64)r); }
> +inline V2F64 FAndnot(V2F64 l, V2F64 r) { return (V2F64)(~(V2I64)l & (V2I64)r);}
> +#  endif
> +#endif
> +#ifdef MOZ_XMMINTRIN_SIMD

#elif

@@ +146,5 @@
> +inline V2F64 FAndnot(V2F64 l, V2F64 r) { return (V2F64)(~(V2I64)l & (V2I64)r);}
> +#  endif
> +#endif
> +#ifdef MOZ_XMMINTRIN_SIMD
> +inline V2F64 FAnd   (V2F64 l, V2F64 r) { return _mm_and_pd   (l, r); }

Formatting.  (I'm going to stop mentioning this now.)

@@ +151,5 @@
> +inline V2F64 FOr    (V2F64 l, V2F64 r) { return _mm_or_pd    (l, r); }
> +inline V2F64 FXor   (V2F64 l, V2F64 r) { return _mm_xor_pd   (l, r); }
> +inline V2F64 FAndnot(V2F64 l, V2F64 r) { return _mm_andnot_pd(l, r); }
> +#endif
> +#ifdef MOZ_FAKE_SIMD

#elif and #else #error, or #else.

@@ +188,5 @@
> +  asm("" : "+x"(result));
> +  return result;
> +}
> +#  else
> +inline V2F64 BuildVector(double e0, double e1) { return (V2F64){ e0, e1 }; }

You should probably be consistent about brace-initializing a local and returning it, versus returning a cast aggregate.  I guess this all works because of different levels of compiler support for all this?  Probably not so good, because then someone will think casting a brace-init will work everywhere, not realizing that it merely happens to work in a particular place or so.

@@ +194,5 @@
> +inline V2F64 Splat2(double e) { return (V2F64){ e, e }; }
> +inline V2F64 FZero() { return (V2F64){ 0.0, 0.0 }; }
> +#endif
> +#ifdef MOZ_XMMINTRIN_SIMD
> +// This is not a typo. e1 and e0 really do go in this order.

Ugh.

@@ +227,5 @@
> +// GCC appears to provide no nicer way to do this, but we can get exactly what
> +// we want with an inline asm.
> +inline V2F64 ScalarToVector2(double e) {
> +  V2F64 result;
> +  asm("" : "=x"(result) : "0"(e));

Is this endianness-sensitive maybe?  At least call it out in a comment, if so.

@@ +235,5 @@
> +// This suffices, but it's often suboptimal as some compilers seem to want to
> +// set f to 0.
> +inline V2F64 ScalarToVector2(double e) {
> +  double f;
> +  return BuildVector(e, f);

This isn't actually kosher, I think.  The lvalue-to-rvalue conversion here has undefined behavior because the object is uninitialized, C+11 [conv.lval]p1.  If I'm reading right.  Might work with some compilers, to be sure, but this is in an #else applying to an unbounded set of compilers.  Probably just pass in some nonsense value in debug builds (how about 2.3477194465391097000e-175, i.e. 0x1badbadbadd00b1e?  ;-) ), and 0.0 in opt builds for at least speediness of generation?

@@ +245,5 @@
> +inline double VectorToScalar(V2F64 v) { return v.e[0]; }
> +inline int64_t VectorToScalar(V2I64 v) { return v.e[0]; }
> +#elif defined(__clang__) || \
> +      (defined(__GNUC__) && \
> +       (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8)))

MOZ_GCC_AT_LEAST(4, 8, 0)

@@ +252,5 @@
> +#else
> +// G++ pre 4.8 doesn't support subscripting on vector types.
> +// FIXME: Hopefully there's a better way to do this on MSVC.
> +inline double VectorToScalar(V2F64 v) { return *(double*)&v; }
> +inline int64_t VectorToScalar(V2I64 v) { return *(int64_t*)&v; }

Ye gods!  How do SIMD programmers live with themselves?

@@ +263,5 @@
> +#ifndef MOZ_FAKE_SIMD
> +  return FAndnot(FNegzero(), v);
> +#else
> +  // FAnd is slow on fake SIMD.
> +  V2F64 result = { fabs(v.e[0]), fabs(v.e[1]) };

mozilla::Abs from MathAlgorithms.h.

@@ +293,5 @@
> +
> +#ifndef MOZ_FAKE_SIMD
> +inline unsigned FSignbit(V2F64 v) { return ((V2I64)v)[0] < 0; }
> +#else
> +inline unsigned FSignbit(V2F64 v) { return v.i[0] < 0; }

Use bool for these.

::: mfbt/tests/TestSIMD.cpp
@@ +15,5 @@
> +  V2F64 x, y, z;
> +  x = BuildVector(2.0, 3.0);
> +  y = BuildVector(13.0, 17.0);
> +  z = FAdd(x, y);
> +  MOZ_ASSERT(((double *)&z)[0] == 15.0);

I guess we want MOZ_RELEASE_ASSERT for tests, these days.
Comment on attachment 8409713 [details] [diff] [review]
simd-sincos.patch

Review of attachment 8409713 [details] [diff] [review]:
-----------------------------------------------------------------

Okay, just nitpicks.  I will be interested to see what other magics can be performed with this SIMD stuff, in the future.

::: js/src/jsmath.cpp
@@ +368,2 @@
>  
> +struct sincos_result { double s, c; };

SinCosResult probably more common naming these days.  But I guess this is pre-existing.

@@ +370,3 @@
>  
> +// Constants generated using Mathematica's GeneralMiniMaxApproximation. The
> +// Even elements are for the sin polynomial approximation, the odd for the

even

@@ +373,5 @@
> +// cos polynomial approximation.
> +//
> +// The cos set uses one less coefficient than usual implementations to
> +// increase performance, raising the maximum approximation error to 2 bits.
> +MOZ_ALIGNED_DECL(extern const double coefficients[], 16);

The extern should be outside the macro, I think.

Can this be an array of double[2]?  Pretty strange to have an undifferentiated array when these are only used pairwise.

@@ +375,5 @@
> +// The cos set uses one less coefficient than usual implementations to
> +// increase performance, raising the maximum approximation error to 2 bits.
> +MOZ_ALIGNED_DECL(extern const double coefficients[], 16);
> +const double coefficients[] = {
> +    1.59046813973877163292e-10, 2.06467337476762997948e-9,

Perhaps worth preserving the polevl_sin number comments?

@@ +393,3 @@
>      // Make argument non-negative but save the sign.
> +    V2F64 vx = ScalarToVector2(x);
> +    double absx = VectorToScalar(FAbs(vx));

Is this expected to be faster for some reason than mozilla::Abs(x)?  (And if so, maybe that should be defined this way then.)

@@ +403,3 @@
>          sincos_result result = {
> +            sin(sx),
> +            cos(sx)

Why changing from |x| here?

I'd put the aggregate all on one line if it were me.

@@ +412,3 @@
>  
>      // Integer and fractional part modulo one octant.
> +    double y = static_cast<double>(i & -2);

I'd prefer ~uint32_t(1) rather than having to think about twos-complement representations of negative numbers and/or mixed-sign bitwise math.

@@ +424,2 @@
>      double zz = z * z;
> +    const V2F64 *c = (const V2F64 *)coefficients;

reinterpret_cast<>

@@ +440,4 @@
>  
> +    // Create an array which consists of the q0 sin, cos, -sin, -cos.
> +    const V2F64 reflect_vecs[2] = { ans, FNeg(ans) };
> +    const double *reflect = (const double*)reflect_vecs;

reinterpret_cast<>

@@ +445,5 @@
> +    // Load the result with the desired reflection.
> +    uint32_t quad_index = i >> 1;
> +    uint32_t orig_sign = FSignbit(vx);
> +    double sin_result = reflect[(quad_index + (orig_sign << 1)) & 3];
> +    double cos_result = reflect[(quad_index + 1) & 3];

I'd prefer seeing |% 4| in these, with the 4 better corresponding to "quadrant" and modular wraparound and to the number of elements in |reflect|.

::: mfbt/SIMD.h
@@ +9,5 @@
> +#ifndef mozilla_SIMD_h
> +#define mozilla_SIMD_h
> +
> +#include <math.h>
> +#include <stdint.h>

Oh, re naming in this file, I am assuming these are moderately canonical SIMD names for this stuff.  I have nil experience with SIMD before this patch, to know if these names are normal or not.

@@ +242,5 @@
> +
> +// Demote a vector to a scalar with the value of its first element.
> +#if defined(MOZ_FAKE_SIMD)
> +inline double VectorToScalar(V2F64 v) { return v.e[0]; }
> +inline int64_t VectorToScalar(V2I64 v) { return v.e[0]; }

So you have a method (I assume this is the canonical name?) to extract the first element.  Why not one for the second?  And for that matter, given jsmath.cpp wants 0/1 access, why not have VectorElemen(V2F64, 0 or 1) and VectorElement(V2I64, 0 or 1), so as to get rid of the wacky casting there?

::: mfbt/tests/TestSIMD.cpp
@@ +15,5 @@
> +  V2F64 x, y, z;
> +  x = BuildVector(2.0, 3.0);
> +  y = BuildVector(13.0, 17.0);
> +  z = FAdd(x, y);
> +  MOZ_ASSERT(((double *)&z)[0] == 15.0);

And if MOZ_RELEASE_ASSERT, then the #ifdef DEBUG should die.

@@ +19,5 @@
> +  MOZ_ASSERT(((double *)&z)[0] == 15.0);
> +  MOZ_ASSERT(((double *)&z)[1] == 20.0);
> +  z = FSub(x, y);
> +  MOZ_ASSERT(((double *)&z)[0] == -11.0);
> +  MOZ_ASSERT(((double *)&z)[1] == -14.0);

Use reinterpret_cast<> for all these C-style casts.  But, really, we should probably just have

static void
CheckContents(const V2F64& v, double d0, double d1)
{
  const double* dv = reinterpret_cast<const double*>(&v);
  MOZ_RELEASE_ASSERT(dv[0] == d0);
  MOZ_RELEASE_ASSERT(dv[1] == d1);
}

and use that everywhere.

@@ +34,5 @@
> +  z = FOr(x, y);
> +  MOZ_ASSERT(((double *)&z)[0] == 13.0);
> +  MOZ_ASSERT(((double *)&z)[1] == 25.0);
> +  z = FXor(x, y);
> +  MOZ_ASSERT(((double *)&z)[0] == 7.2314900401484045e-308);

I'd be inclined to write out |pow(2, -1021) + pow(2, -1022) + pow(2, -1024)| for this.

@@ +35,5 @@
> +  MOZ_ASSERT(((double *)&z)[0] == 13.0);
> +  MOZ_ASSERT(((double *)&z)[1] == 25.0);
> +  z = FXor(x, y);
> +  MOZ_ASSERT(((double *)&z)[0] == 7.2314900401484045e-308);
> +  MOZ_ASSERT(((double *)&z)[1] == 1.3906711615670009e-307);

Likewise |pow(2, -1020) + pow(2, -1021) + pow(2, -1024)|.

@@ +37,5 @@
> +  z = FXor(x, y);
> +  MOZ_ASSERT(((double *)&z)[0] == 7.2314900401484045e-308);
> +  MOZ_ASSERT(((double *)&z)[1] == 1.3906711615670009e-307);
> +  z = FAndnot(x, y);
> +  MOZ_ASSERT(((double *)&z)[0] == 7.2314900401484045e-308);

pow(2, -1021) + pow(2, -1022) + pow(2, -1024)

@@ +38,5 @@
> +  MOZ_ASSERT(((double *)&z)[0] == 7.2314900401484045e-308);
> +  MOZ_ASSERT(((double *)&z)[1] == 1.3906711615670009e-307);
> +  z = FAndnot(x, y);
> +  MOZ_ASSERT(((double *)&z)[0] == 7.2314900401484045e-308);
> +  MOZ_ASSERT(((double *)&z)[1] == 9.4565638986556059e-308);

pow(2, -1020) + pow(2, -1024)

@@ +46,5 @@
> +  MOZ_ASSERT(((double *)&z)[1] == 23.4);
> +
> +  z = ScalarToVector2(54.1);
> +  MOZ_ASSERT(((double *)&z)[0] == 54.1);
> +  MOZ_ASSERT(((double *)&z)[1] == 0.0);

Uh, I thought z1 here was supposed to be uninitialized here.  (Or 0 in release, 0x1badbadbadd00b1e in debug.)  What am I missing?

@@ +57,5 @@
> +  MOZ_ASSERT(((uint64_t *)&z)[1] == 0);
> +
> +  z = FNegzero();
> +  MOZ_ASSERT(((uint64_t *)&z)[0] == uint64_t(INT64_MIN));
> +  MOZ_ASSERT(((uint64_t *)&z)[1] == uint64_t(INT64_MIN));

mozilla::IsNegativeZero is a better way to test for -0.0.

@@ +61,5 @@
> +  MOZ_ASSERT(((uint64_t *)&z)[1] == uint64_t(INT64_MIN));
> +
> +  z = FSign(x);
> +  MOZ_ASSERT(((uint64_t *)&z)[0] == 0);
> +  MOZ_ASSERT(((uint64_t *)&z)[1] == 0);

mozilla::BitwiseCast<uint64_t>(FirstElement(z)) == 0, and so on.

FSign seems like kind of a strange operation for people to want, somewhat.  It's not used in this patch.  When do you expect it to be used?

@@ +65,5 @@
> +  MOZ_ASSERT(((uint64_t *)&z)[1] == 0);
> +
> +  z = FSign(BuildVector(-2.0, -0.0));
> +  MOZ_ASSERT(((uint64_t *)&z)[0] == uint64_t(INT64_MIN));
> +  MOZ_ASSERT(((uint64_t *)&z)[1] == uint64_t(INT64_MIN));

IsNegativeZero again.

@@ +69,5 @@
> +  MOZ_ASSERT(((uint64_t *)&z)[1] == uint64_t(INT64_MIN));
> +
> +  z = FAbs(BuildVector(-2.0, -0.0));
> +  MOZ_ASSERT(((double *)&z)[0] == 2.0);
> +  MOZ_ASSERT(((uint64_t *)&z)[1] == 0);

Okay, maybe the BitwiseCast zero thing should be a helper method at this point.

@@ +73,5 @@
> +  MOZ_ASSERT(((uint64_t *)&z)[1] == 0);
> +
> +  z = FNeg(BuildVector(-2.0, -0.0));
> +  MOZ_ASSERT(((double *)&z)[0] == 2.0);
> +  MOZ_ASSERT(((uint64_t *)&z)[1] == 0);

...yeah, definitely.
Attachment #8409713 - Flags: review?(jwalden+bmo) → review+
Thanks for the review! I'll work on incorporating your feedback and I'll post a new patch.

This patch unfortunately faces a few more challenges now though.

We still support pre-SSE2 hardware on 32-bit x86 (see bug 836658 for some discussion). The patch uses static feature detection, which would mean that all SSE2-capable 32-bit x86 machines (a substantial population) would miss out. It seems the best way to fix this is to use the dynamic SSE feature detection code in mozglue/build/SSE.h (perhaps moving it into MFBT), and to update this patch to do dynamic detection.

Then there's the fact that on 32-bit ARM, NEON doesn't support 64-bit floating-point, so this SIMD patch doesn't help at all.
FWIW, I'm working on some constants for cosine that have a significantly lower maximum error, which might bring the maximum error bits down to 1 (more or less matching the 7-coefficient version). The calculations aren't quite done yet, but they're pretty close.
These are the optimal coefficients for cosine:

-4.99999999999994837463e-1 // 9007199254740899 / exp2(54)
 4.16666666665035934081e-2 // 6004799503137160 / exp2(57)
-1.38888888717514846424e-3 // 6405119462134806 / exp2(62)
 2.48015790234723526716e-5 // 7320134093918044 / exp2(68)
-2.75552973116761202654e-7 // 5205048498006694 / exp2(74)
 2.06336381304078253165e-9 // 4988907577686989 / exp2(81)

Unfortunately, the maximum number of error bits with these coefficients is still 2. They are still significantly better than the old set, though; here's some numbers:
                                 Old          New          % of old
Max error (arbitrary precision)  7.87115e-17  5.15257e-17  65.46%
Max error (double precision)     2.22045e-16  2.22045e-16  100.0%
Mean error (double precision)    4.23791e-17  3.26041e-17  76.93%

These coefficients should be optimal down to the last bit - I basically brute forced them with a modified binary search, starting with the old set (the minimax optimization algorithm in Mathematica chooses suboptimal sample points, and I wasn't sure how to fix that).
This bug is also impacted by the news described in bug 967709 comment 81, as this SIMD optimization is built on the patch in that bug.
I don't know if we'll go forward with this approach, but I've been optimizing the coefficients for sine using a different metric: instead of just minimizing the relative error, which seems to be the norm (for sine), I optimized these coefficients to minimize the maximum error of |abs(sin(x) - approx(x)) / exp2(ceil(log2(sin(x))))| (for x in the range (0, Pi/4]), which should be a better match for double precision.

-1.66666666666666379859e-01 // -6004799503160651 / exp2(55)
 8.33333333332342621191e-03 //  4803839602522818 / exp2(59)
-1.98412698304596300195e-04 // -7320136533198066 / exp2(65)
 2.75573138831641897714e-06 //  6506786771988646 / exp2(71)
-2.50507840070547829866e-08 // -7571134896929437 / exp2(78)
 1.58981114240929394412e-10 //  6150283962782802 / exp2(85)

The attached graph shows the difference between the fdlibm coefficients and the new ones using this metric. There are several discontinuities, corresponding to changes in the exponent. The new coefficients are significantly better near 0.125 and near 0.490 (and worse in other places - no free lunch as it were). I haven't had time to test them in actual double precision to see if they produce a lower average error, but I'm confident that they're optimal even if it ends up making no difference.
Attachment #8448302 - Attachment description: Errors of old coefficients and new coefficients → Errors of old coefficients and new coefficients (Cosine)
Measurements in double precision show about a 1.90% improvement over the fdlibm coefficients. That is, the 1-bit errors occur about 1.90% less often. It would be interesting to see how these perform for doubles that can be squared exactly - for instance, single precision numbers cast to double precision. A C++ program could exhaustively test all of these, but I'm not sure Mathematica is fast enough.
These are the optimized coefficients for a 14th order series expansion of cosine on the interval [0, Pi / 4]:

-5.00000000000000000000e-01; // -4503599627370496 / exp2(53)
 4.16666666666666019037e-02; //  6004799503160652 / exp2(57)
-1.38888888888744738669e-03; // -6405119470031391 / exp2(62)
 2.48015872896715402194e-05; //  7320136533669418 / exp2(68)
-2.75573144010328132404e-07; // -5205429515413534 / exp2(74)
 2.08757292817081547831e-09; //  5047441626388429 / exp2(81)
-1.13599342767498618996e-11; // -7031458742419531 / exp2(89)

It's worth noting that the first two coefficients are simply the double-precision versions of the Maclaurin Series expansion. I also optimized sets of coefficients for the nearest higher and lower doubles, but they were significantly worse. Presumably the maximum error is simply so low here that using anything else near 0 becomes impossible to compensate for.

I wanted to optimize these because the fdlibm set was far from optimal (albeit with a low enough error that it probably makes no difference). I will see if I can also optimize the 15th order expansion for sine so we have two optimized sets of equal length for the SIMD path. I still need to check if we can correct for inaccuracies in the squares used in this algorithm though - that would get rid of the 1-bit errors.

Finally, I had some trouble finding this bug again - I think it should block bug 967709.
Blocks: 967709
Priority: -- → P5