Open Bug 967709 Opened 6 years ago Updated Last year

V8 is 2.8x faster at sin/cos

Categories: Core :: JavaScript Engine: JIT, defect
Tracking: REOPENED, mozilla31
People: (Reporter: luke, Assigned: sunfish)
References: Depends on 1 open bug
Details: Whiteboard: [leave open]
Attachments: 4 files, 4 obsolete files

Attached file sin benchmark
On the attached microbenchmark, which just pounds on sin with non-repeating values, V8 is about 2.8x faster on my Linux machine.  (Our sin/cos just calls into the C stdlib's sin/cos, so this is highly dependent on OS and stdlib version.  I'd appreciate seeing what numbers other people get.)  Profiling the box2d benchmark on awfy shows about 50% of its time is spent just calling sin/cos, and this gives V8 better overall throughput on my machine.  It looks like V8 rolls its own sin/cos (https://code.google.com/p/v8/source/detail?r=17594), which gives it more predictable performance.  They self-host sin/cos, which also avoids the call out from JIT code and all the overhead that incurs.  Since the sin/cos code isn't all that complex, it seems like we could do even better with MSin/MCos MIR/LIR ops.
To wit, when we get a hit in the math cache, we are only ~2x faster than V8 (which has since removed their math cache: https://code.google.com/p/v8/source/detail?r=18344).  With optimized sin/cos, perhaps we'd be able to do the righteous thing and remove our math cache as well (which was only added for benchmarketing parity in the first place).
I just tested on a Win32 Nightly vs. Chrome Dev and the perf difference is about the same as comment 0.
I really don't like this V8 implementation of sin/cos:

1) Uses HUGE 32 KiB lookup table. 
2) Not particularly precise -- up to 4 erroneous bits in [0, pi/2], 2.12 on average.
3) Uses imprecise reduction to [0, pi/2] that will lead to significant errors out of the first quadrant.
4) Can't calculate sin and cos simultaneously.

I quickly compared it with the Cephes-based VDT [1]. No lookup tables, only 16 double constants. 2 wrong bits max, 0.16 average in [0, pi/2], 0.25 average in [0, 10000]. Good reduction to quadrant. And it's not much slower than the implementation in V8 -- without FMA the V8 version was faster on my Haswell with GCC 4.8 -- 14.5s vs 11s to calculate 1 billion sin() calls -- but with FMA both scored around 10s (however, the V8 function calculates only sine, while VDT calculates both sine and cosine). So V8's implementation isn't even that much faster than the alternatives.

[1] https://svnweb.cern.ch/trac/vdt It's LGPL, but it's just a streamlined reimplementation of the functions from Cephes, so not hard to reimplement.
Thanks for the information!  The CERN slides are pretty encouraging about VDT [1][2].  Also, I think we'll have a good opportunity to emit MSinCos based on some of the testcases we've looked at (box2d).

For your benchmark comparisons, were you comparing V8's builtin sin/cos to native VDT or did you test some other configuration like native VDT vs. native version of V8's impl?

[1] https://indico.cern.ch/event/214784/session/7/contribution/219/material/slides/0.pdf
[2] http://indico.cern.ch/event/202688/session/9/contribution/4/material/slides/0.pdf
I'm seeing V8 as 1.9x faster on Fedora20 x64, glibc 2.18.

v8: 19
sm: 36
(In reply to Luke Wagner [:luke] from comment #4)
> For your benchmark comparisons, were you comparing V8's builtin sin/cos to
> native VDT or did you test some other configuration like native VDT vs.
> native version of V8's impl?

I copy-pasted the V8 implementation to C++ and minimally modified it to make it work: http://pastebin.com/aH0fsV5h Probably not a good port, but I don't think it was unfair to V8 -- the code generated by GCC looks reasonable, and JavaScript in V8 should require additional bounds checks, etc., I guess? More likely my tests were unfair to VDT, because I see jumps in the generated code.
(In reply to Vladislav Egorov from comment #6)
Great!  I was hoping that was the case.  If we do arch-specific codegen (using MacroAssembler, allowing us to dynamically test cpuid and guarantee SIMD optimization when HW support is present) then it seems like our impl would do relatively better.

Are there any other trig algorithms we should be considering other than VDT?  Some quick Googling seems to strongly support Cephes (and thus VDT).  One annoyance is that, iiuc, we can't just import the VDT code because it is LGPL; we'll need to rewrite it (w/ attribution, of course).  However, the code isn't very long, so that should be fine.
Attached file new_sincos.cpp (obsolete) —
Attached is a C++ implementation of sincos derived from the cephes implementations of sin and cos. Several of the optimizations are inspired by ideas in VDT, though the code is based directly on the cephes implementation and also contains several optimizations of my own.

Unlike VDT, my version includes the range and non-finite checks, and handles negative zero according to the ECMA spec, so it should be safe for any input (though like cephes it won't return useful results for extreme inputs). Also unlike VDT, in my version those are the *only* branches (and they're MOZ_UNLIKELY and should be highly predictable).

On Luke's sin benchmark here, my version is about 10% faster than VDT sincos (even with the extra checks).
Out of curiosity, do you know how the numerical constants in that code were obtained? Calculating a 53-bit mantissa by hand in Mathematica, I get some discrepancies. For instance, in the file you use -1.66666666666666307295e-1 for -1/6, but I get -1.66666666666666657415e-1 (I did this by doing ScientificForm[N[Round[2^55 / 3!] / 2^55, {21, Infinity}]], where Round[2^55 / 3!] is an integer between 2^52 and 2^53).
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #9)
> Out of curiosity, do you know how the numerical constants in that code were
> obtained? Calculating a 53-bit mantissa by hand in Mathematica, I get some
> discrepancies. For instance, in the file you use -1.66666666666666307295e-1
> for -1/6, but I get -1.66666666666666657415e-1 (I did this by doing
> ScientificForm[N[Round[2^55 / 3!] / 2^55, {21, Infinity}]], where Round[2^55
> / 3!] is an integer between 2^52 and 2^53).

No, I just used the constants from cephes. I don't know how they derived those constants, but it appears that they're compensating for rounding somehow. Replacing those constants with the obvious rational approximations seems to lead to less precise end results.
(In reply to Dan Gohman [:sunfish] from comment #10)
> No, I just used the constants from cephes. I don't know how they derived
> those constants, but it appears that they're compensating for rounding
> somehow. Replacing those constants with the obvious rational approximations
> seems to lead to less precise end results.

OK, thanks for confirming that :) I figured I was probably missing something.
Attached patch fast-sincos.patch (obsolete) — Splinter Review
Here's a patch which incorporates this Cephes-based sincos code into IonMonkey (and disables the cache!), plus a few more comments and code tweaks. On my system (Linux, GLIBC 2.17), it speeds up Luke's sin benchmark by over 3x, speeds up a Box2d benchmark by 47%, and generally looks good on a variety of other benchmarks. The new code is a little less precise overall, and it doesn't support values over pow(2, 30), but my impression is that this is considered an acceptable tradeoff here.

This doesn't do the sincos optimization, or emit the code via the MacroAssembler, but it's a good first step.

Help needed:

 - It currently fails on 3d-cube.js in SunSpider because its results are very slightly less precise, and 3d-cube.js' validation code doesn't use an epsilon. Is it important that we make this test pass as is?

 - Testing on other platforms and codes.

Any comments or questions on the code are welcome too :-).
(In reply to Dan Gohman [:sunfish] from comment #12)
> IonMonkey (and disables the cache!),

Sweet!  Is there any overall impact on SunSpider (for which the foul cache was added in the first place)?

> The new code is a little less precise
> overall, and it doesn't support values over pow(2, 30), but it's my
> impression that this is considered an acceptable tradeoff here.

I'm unfamiliar with what is normal for trig impls; is there precedent for this in other impls?

>  - It currently fails on 3d-cube.js in SunSpider because its results are
> very slightly less precise, and 3d-cube.js' validation code doesn't use an
> epsilon. Is it important that we make this test pass as is?

Not important: we just made up those validation benchmarks by running, printing the values, then copying them into the assert verbatim.  (On several occasions in the tracer days, we'd get a massive "speedup" that came from producing the wrong result.)  It'd be fine to add an epsilon to the test.
(In reply to Luke Wagner [:luke] from comment #13)
> Sweet!  Is there any overall impact on SunSpider (for which the foul cache
> was added in the first place)?

To add to this, it was added for 3d-morph so you want to check that test in particular.
(In reply to Jan de Mooij [:jandem] from comment #14)
> (In reply to Luke Wagner [:luke] from comment #13)
> > Sweet!  Is there any overall impact on SunSpider (for which the foul cache
> > was added in the first place)?
> 
> To add to this, it was added for 3d-morph so you want to check that test in
> particular.

The current patch does slow down 3d-morph:
  before: 0.016485148 seconds time elapsed                                          ( +-  0.02% )
  after:  0.019115016 seconds time elapsed                                          ( +-  0.02% )

But, it speeds up math-partial-sums.js:
  before: 0.020669752 seconds time elapsed                                          ( +-  0.02% )
  after:  0.016817460 seconds time elapsed                                          ( +-  0.02% )

Given that we aren't fond of caching, can we call it even? :-)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #9)
> Out of curiosity, do you know how the numerical constants in that code were
> obtained? Calculating a 53-bit mantissa by hand in Mathematica, I get some
> discrepancies. For instance, in the file you use -1.66666666666666307295e-1
> for -1/6, but I get -1.66666666666666657415e-1 (I did this by doing
> ScientificForm[N[Round[2^55 / 3!] / 2^55, {21, Infinity}]], where Round[2^55
> / 3!] is an integer between 2^52 and 2^53).

That's not a Taylor series, but most likely minimax. I can't check, because I don't currently have Mathematica, but something like this should work, if I didn't make a mistake somewhere -- MiniMaxApproximation[(Sin[Sqrt[x]]/(x*Sqrt[x])-1/x), {x, {-(Pi/4)^2, (Pi/4)^2}, 6, 0}].
(In reply to Dan Gohman [:sunfish] from comment #15)
Seems like a small win, even.  Also, math-partial-sums.js looks like a great candidate for the MSinCos optimization :)
Following comment 13, I created a patch to add an epsilon to 3d-cube.js on arewefastyet:

https://github.com/sunfishcode/arewefastyet/commit/03016720b128ad52f5e8eb49ec9e8ca8d0e84351

If this fast-sincos.patch is accepted, I'll submit this as a pull request to the main arewefastyet repository.
(In reply to Luke Wagner [:luke] from comment #13)
> (In reply to Dan Gohman [:sunfish] from comment #12)
> >  - It currently fails on 3d-cube.js in SunSpider because its results are
> > very slightly less precise, and 3d-cube.js' validation code doesn't use an
> > epsilon. Is it important that we make this test pass as is?
> 
> Not important: we just made up those validation benchmarks by running,
> printing the values, then copying them into the assert verbatim.  (On
> several occasions in the tracer days, we'd get a massive "speedup" that came
> from producing the wrong result.)  It'd be fine to add an epsilon to the
> test.

@Luke: this is about the actual SS benchmark, not about check-3d-cube.js in the jit-tests. I think we will have to request this change upstream from (WebKit) SunSpider! I don't think they will object too much, since they added a similar epsilon to 3d-morph in SunSpider 1.0(.1?)

@Dan: I noticed you used an epsilon of 1e-10. 3d-morph is using 1e-13. Is the fault that much bigger?
(In reply to Hannes Verschore [:h4writer] from comment #19)
> @Dan: I noticed you used an epsilon of 1e-10. 3d-morph is using 1e-13. Is
> the fault that much bigger?

Assuming the perfect answer is actually 2889.0, the existing code is already effectively using an epsilon of 1e-11, since some of the sizes were already expected to come out as 2889.0000000000055. With the fast-sincos patch, the last size (160) gets 2889.0000000000105, which pushes it slightly past 1e-11 to 1e-10.

(In reply to Vladislav Egorov from comment #16)
> That's not Taylor series, but most likely minimax. I can't check, because I
> don't currently have Mathematica, but something like this should work, if I
> didn't make mistake somewhere --
> MiniMaxApproximation[(Sin[Sqrt[x]]/(x*Sqrt[x])-1/x), {x, {-(Pi/4)^2,
> (Pi/4)^2}, 6, 0}].

I was previously unfamiliar with minimax approximations. Thanks for bringing this to my attention.

Also, I noticed that the fdlibm implementation of sin [0] uses yet a different series for a similar purpose, starting with -1.66666666666666324348e-01. Blindly plugging those values into the cephes code seems to produce slightly better overall results, but I haven't studied this in detail yet.

[0] http://netlib.org/fdlibm/k_sin.c

(In reply to Luke Wagner [:luke] from comment #13)
> (In reply to Dan Gohman [:sunfish] from comment #12)
> > The new code is a little less precise
> > overall, and it doesn't support values over pow(2, 30), but it's my
> > impression that this is considered an acceptable tradeoff here.
> 
> I'm unfamiliar with what is normal for trig impls; is there precedent for
> this in other impls?

The main precedent here is V8's new code. Cephes appears to be significantly more precise than what's now in V8.

Beyond that, it's not clear how much liberty the JS spec really intended to allow. Does it leave the door this wide open on purpose, or does it just assume that everyone would use the C libm or fdlibm (as suggested in the NOTE in 15.8.2) and consequently not feel the need to say anything more?
Vladislav was kind enough to explain some things to me via e-mail, and also linked this paper: http://www.research.scea.com/research/pdfs/RGREENfastermath_GDC02.pdf

After going through it, I've started generating constants using Mathematica's GeneralMiniMaxApproximation (allowing me to use the correct relative error function), and the results are somewhat different from others I've seen so far. I'll see if they give better error bounds before I post them here, but I wanted to give you a heads up.
As an update: I was finally able to prove empirically that fine-tuning the 'Bias' parameter in GeneralMiniMaxApproximation actually has a measurable effect on the result.

Across most of its range (from -0.49 to 0.99, at least) the *maximum* relative error is the same (around 2.22e-16, but consistent to 8 digits of precision or so - not worth tuning) when we compare the double precision result of the approximation to the idealized double precision result of sin(x).

However the *mean* absolute error (and the root mean square error) is measurably affected - going from around 9e-18 at extreme values to below 4.3e-18 at the optimum Bias. I think this optimum is the same Bias that minimizes the maximum relative error when using double precision coefficients but arbitrary precision calculations - something I can determine much more easily and to a higher degree of precision.

For a bit of context: To determine the mean absolute error, I'm doing Abs[1 - SetPrecision[approx[x], 64] / N[Double[Sin[SetPrecision[x,64]]],64]], where Double[] rounds to the nearest multiple of 2^(Floor[Log[2,Abs[x]]] - 52) and returns an arbitrary precision result (N[] doesn't do the correct rounding sometimes - I confirmed this with round-trip conversions).

This 64-digit precision result is often exactly 0, where the result of the approximation matches the result of the arbitrary precision sine function; that makes it very hard to integrate, because the ratio of zero to non-zero results is difficult to measure (NIntegrate falls on its face with anything but the "MonteCarlo" method, and is slower than doing it by hand).

For a given approximation, on each iteration I compute the arithmetic mean using a large number of randomly chosen points. By the central limit theorem, given enough independent samples they should be approximately normally distributed - so I fit them to the normal distribution using Mathematica's FindDistributionParameters[] and iterate until the mean is known to 1/1000 relative to its value, with 2 sigma confidence. Fully parallelized, this takes about 23 minutes per approximation (3 sigma confidence bumps that up to 46 minutes) on my Ivy Bridge clocked at 4.2GHz.

This is the fastest method I've found, so I'm glad it paid off. I'll narrow down the Bias as much as I can with this method, then see if it matches the optimal Bias from arbitrary precision computations like I think it does, and if so narrow it down from there until the double precision constants no longer change.
A slight correction: by "*mean* absolute error" I actually meant "the absolute value of the *mean* relative error". Also in case it wasn't clear, I should have an optimized set of coefficients for sin(x) later today or early tomorrow :) Doing the same thing for cos(x) should be trivial at this point.
Sorry about the delay, another update: I tested against the coefficients in attachment 8379086 [details] [diff] [review] and the others mentioned in comment #20, and found that while my coefficients were better than the former, the latter were far better than mine! I figured out that the result depends heavily on the way the relevant functions are expressed (e.g. a change of variables from x to y^(1/2)), and I've been experimenting with that. I *have* actually managed to improve on the coefficients mentioned in comment #20 now, though since there are now two variables in play (the Bias and z in x -> y^z) it's somewhat slow going.
Thanks for working on this!
Comment on attachment 8379086 [details] [diff] [review]
fast-sincos.patch

(In reply to Dan Gohman [:sunfish] from comment #25)
> Thanks for working on this!

No problem, I just hope it'll pay off :) I had a few comments about the patch:

> +  static const double m_4_pi = 4.0 / M_PI;
> +  int32_t i = static_cast<int32_t>(absx * m_4_pi);
> +
> +  // Integer and fractional part modulo one octant.
> +  int32_t quad_index = ((i + 1) >> 1) & 3;
> +  double y = static_cast<double>(i + (i & 1));
> +
> +  // Extended precision modular arithmetic
> +  double e0 = y * -7.85398125648498535156e-1;
> +  double e1 = y * -3.77489470793079817668e-8;
> +  double e2 = y * -2.69515142907905952645e-15;
> +  double z = absx + e0 + e1 + e2;

So you (effectively) divide the number by the size of an octant (45°), giving the index of the octant in the integer part. Then quad_index holds the quadrant, where quadrant 0 goes from -45° (or 315°) to 45° and so on. z = absx - y * Pi / 4 is then the remaining angle where Pi / 4 to Pi / 2 become -Pi / 4 to 0 (corrected by swapping sine and cosine).

I'd store 4 / Pi explicitly instead of leaving the division up to the compiler, so we can be sure it's rounded correctly (rounding to double precision and using the 21-digit form like the other constants, this would be 1.27323954473516276487). I'd also like to check the "Extended precision modular arithmetic" constants since I noticed there were sets in the original code for single precision, double precision and extended precision, and I think we want double precision here (I'd also like to calculate them myself). Finally, it might be good to ensure the subtractions in |z = absx + e0 + e1 + e2;| happen in the desired order - e.g. |z = ((absx + e0) + e1) + e2;|

I was also wondering: Does the limit really need to be 2^30? The biggest number for which we can unambiguously determine the octant is 7074237752028440.0 (1.27323954473516276487 * 7074237752028440.0 == 2^53 in double precision) - of course that doesn't leave much for the calculation of z, but there are many numbers larger than 2^30 that still have enough precision to reasonably calculate the sine or cosine for them. In addition, we only really need the least significant bits to calculate the octant, so you could do something like:

uint16_t i = static_cast<uint16_t>(static_cast<uint64_t>(absx * m_4_pi) & 0x7ULL);

It's a bit uglier and I don't know how much performance the 64-bit stuff will cost on 32-bit platforms (though it's just a cast and a bitwise AND), but it won't overflow and should be correct (though I haven't tested it).
Wait a minute, I'm stupid. You do need the higher bits because y uses them *facepalm*. So it would have to be

uint64_t i = static_cast<uint64_t>(absx * m_4_pi);

Nicer, but probably slower too.
It turns out I *was* missing something. The modular arithmetic constants are designed so that they have at least n trailing zero bits (except the third constant in each set). DP1sc has 30 trailing zeros, DP1 has 28 trailing zeros and the single precision DP2F has 13 trailing zeros (DP1F happens to have 3 more). This means that when multiplied with an integer, they give an exact result so long as the integer isn't too large.

For DP* (which are used in vdt), the largest integer that can be multiplied exactly is 341782641 (about 2^28.35). Since we round odd numbers up to the nearest even number, the largest valid value of i is 341782640. That puts the largest valid double precision floating point input at 4503599669691516 / 2^16 = 3.41782640999999940395e8.

For DP*sc (which are used in attachment 8379086 [details] [diff] [review]), the largest integer that can be multiplied exactly is 1367130616 (a little over 2^30). This translates to a maximum valid floating point input of 4503599844284045 / 2^22 = 1.07374187571622014046e9, so the limit in the patch is correct (slightly conservative if we're nitpicking :) ).

So the patch as written is correct, but we could push the limit up to 2^32 - 2 for free by making i into an uint32_t and using the following constants:

double e0 = y * -7.85398006439208984375e-1;  // 1647099 / 2^21
double e1 = y * -1.56958208208379801363e-7;  // 1380619 / 2^43
double e2 = y * -3.11168608594830669189e-14; // 4930663418217751 / 2^97

This would push the floating point input limit up to 4.29496815395995807648e9 (4503600527006717 / 2^20).
Uh, I got a bit caught up in my calculations there. The constants I suggested would raise the limit to 3.37325942455970764160e9 (221069929647945 / 2^16), not the value I gave in comment #28, because anything higher would exceed 2^32 - 2.
Attached patch fast-sincos.patch (obsolete) — Splinter Review
Here's an updated version of the patch. Changes from the previous version:

 - Use uint32_t and the new constants and increase the range check to 3.37325942455970764160e9, as described in comment 28 (and comment 29)
 - Replace 4 / M_PI with 1.27323954473516276487, as suggested in comment 26
 - Switch to the fdlibm constants for polevl_sin and polevl_cos.
 - Updated check-3d-cube.js to account for minor differences, following the advice of comment 13.

Current outstanding issues, collected from all the posts above:
 - I haven't yet benchmarked using uint64_t, as suggested in comment 26 and comment 27.
 - Comment 24 suggests that improvements on the fdlibm constants are possible.
 - I don't yet have an answer to the question at the end of comment 20, which is "how precise does JS actually want sin/cos to be?"

Comments, questions, testing, etc. all welcome :-).
Attachment #8379086 - Attachment is obsolete: true
(In reply to Dan Gohman [:sunfish] from comment #30)
>  - I haven't yet benchmarked using uint64_t, as suggested in comment 26 and
> comment 27.

I suggest disregarding this. The limit is mostly set by the modular arithmetic constants; while we could extend the range by using even more 0 bits in the constants, at some point we'd have to switch to using 4 constants or more (it's exponential), and we could never cover the full range anyway (7074237752028440.0 is more than 2^52) unless we switched to doing this step using 64-bit integer maths (which opens up a whole new can of worms).
Here's a new set of constants for cosine:

+  double ans = -1.13608562601317669453e-11; // -7032029422491539 / exp2(89)
+  ans *= x;
+  ans +=  2.08757485627641276222e-09; //  5047446288261708 / exp2(81)
+  ans *= x;
+  ans += -2.75573145560096088057e-07; // -5205429544687823 / exp2(74)
+  ans *= x;
+  ans +=  2.48015872902638771357e-05; //  7320136533844245 / exp2(68)
+  ans *= x;
+  ans += -1.38888888888755342166e-03; // -6405119470031880 / exp2(62)
+  ans *= x;
+  ans +=  4.16666666666666088426e-02; //  6004799503160653 / exp2(57)
+  return ans;

Unfortunately these don't actually improve the RMSE of the approximation much (as far as I'm able to determine, their RMSE is ~0.005% better) - this is because the approximation for cosine is one order higher than the approximation for sine, so even a mediocre set of constants in arbitrary precision will perform near optimally in double precision.

There's more to gain for sine - about 2%, going by what I had - but I lost the results I had to a crash :( I'll regenerate those now.

One thing worth noting: the code currently handles the first (or second, depending on how you look at it) coefficient of -0.5 specially, but there's no reason you couldn't pull that into polevl_cos as well. So you could add

+  ans *= x;
+  ans += -0.5;

to the end of polevl_cos and remove hzz altogether. For that matter, you could even put the final step in there as well with

+  ans *= x;
+  ans += 1.0;

and then just call polevl_cos directly. Perhaps that would be clearer for anyone looking at the code?
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #32)
> Here's a new set of constants for cosine:
[...]
> Unfortunately these don't actually improve the RMSE of the approximation
> much (as far as I'm able to determine, their RMSE is ~0.005% better) - this
> is because the approximation for cosine is one order higher than the
> approximation for sine, so even a mediocre set of constants in arbitrary
> precision will perform near optimally in double precision.

I tried out these numbers for cosine and am seeing a small improvement on [0,10000], but also a regression on [0,pi/2]. Numbers below.

> One thing worth noting: the code currently handles the first (or second,
> depending on how you look at it) coefficient of -0.5 specially, but there's
> no reason you couldn't pull that into polevl_cos as well. So you could add
> 
> +  ans *= x;
> +  ans += -0.5;
> 
> to the end of polevl_cos and remove hzz altogether. For that matter, you
> could even put the final step in there as well with
> 
> +  ans *= x;
> +  ans += 1.0;
> 
> and then just call polevl_cos directly. Perhaps that would be clearer for
> anyone looking at the code?

This actually changes the intermediate rounding, and it looks like a significant improvement! (and it also looks cleaner, and may even be faster). With the test harness from comment 6:

With values from 0 to 10000.0:

                                        | maxbitserr | average diffbits
  --------------------------------------+------------+-----------------
        V8, as implemented in comment 6 |     35     | 10.30454763
       VDT, as implemented in comment 6 |     2      | 0.25374226
                      The current patch |     2      | 0.25348941
    Same but with the new cos constants |     2      | 0.25348651
  Same but with the code simplification |     2      | 0.16708169

With values from 0 to pi/2:

                                        | maxbitserr | average diffbits
  --------------------------------------+------------+-----------------
        V8, as implemented in comment 6 |     4      | 2.11994594
       VDT, as implemented in comment 6 |     2      | 0.15812437
                      The current patch |     2      | 0.15643108
    Same but with the new cos constants |     2      | 0.1568469
  Same but with the code simplification |     1      | 0.07376456
(In reply to Dan Gohman [:sunfish] from comment #33)
> This actually changes the intermediate rounding, and it looks like a
> significant improvement!

Oh, nice! Yeah, I guess addition is left-to-right associative, so this changes things a bit. We can do the same thing for sine, turning it into

x (1 + x^2 (a + x^2 (b + x^2 (c + x^2 (d + x^2 (e + x^2 f))))))

So instead of multiplying by x and adding, we add and then multiply by x. This might preserve more bits of x in the result, although I'm not sure.

It's worth noting that using x^2 in the approximation is actually one of the main sources of error. Comparing Sin[x^2] to x * Sin[Sqrt[Double[x^2]]] / Sqrt[Double[x^2]] (so using the arbitrary precision root of a double precision square as in the approximation) shows one epsilon of error across the range, because we lose some bits by squaring. Of course there's no way around this, but it's good to keep in mind as a theoretical best.
As for the average, I think they *should* be the same for [0,10000] and [0,Pi/4] since it's the same range after range reduction, so maybe this is noise? I had a very hard time getting an RMSE consistent to more than 4 digits or so in Mathematica, but I imagine that doing it in c++ would be a lot faster.

Another explanation could be that I'm optimizing the RMS of the relative error in the arbitrary precision case (with constants rounded to double precision), whereas you're measuring the mean of the *absolute* error (the output of cosine is between sqrt(2) and 1, so epsilon is always the same). I can try optimizing for that instead since relative error doesn't actually make much sense for cosine, given that epsilon is the same across the range.

Oh, and testing in [0,10000] would use the sine approximation instead for 50% of the range, right? Optimizing for relative error does make sense for sine, though there's still a difference between using the mean and the root mean square (I'm not sure which is better here given that in double precision the error is going to be either 0 or epsilon).
Attached patch fast-sincos.patch (obsolete) — Splinter Review
Here's an updated patch which incorporates the cosine changes discussed in comment 32. I experimented with the corresponding sine changes, but they appear to hurt precision. I left the code in place under #if 0 in case anyone wants to see what I tried.

Happily, with these precision improvements, this patch no longer needs to loosen the epsilon values of any of the tests in the testsuite, and no longer requires changes to SunSpider.

I did some quick timing with the benchmark from comment 0. This patch is about 25% slower than the V8 code as translated in comment 6, but, as seen in comment 33 and now in the fact that no test changes are needed, it's much more precise.
Assignee: nobody → sunfish
Attachment #8388785 - Attachment is obsolete: true
I tried this with :sunfish beside me, and by bumping the value of N up by 100, got times that went from approximately 2700 down to about 1825 after applying the patch -- a reduction of about 30%.
Any further improvement that new sets of constants can bring is going to be very small at this point, and I don't want to hold this bug up any longer waiting for them, so I can file a followup if you'd like to just get this reviewed and landed.

Dan, what did you use to check the accuracy of each version? I'd like to run any new constants I come up with through that as well.
(In reply to Dan Gohman [:sunfish] from comment #36)
> I did some quick timing with the benchmark from comment 0. This patch is
> about 25% percent slower than the V8 code as translated in comment 6, but as
> seen in comment 33, and now the lack of need for test changes, it's much
> more precise.

If the v8 code is faster but less precise, then what is the benefit of firefox being more precise? It seems like web content will have to assume the worst precision anyhow, and more precise browsers will not actually help anything.

Perhaps the optimal thing would be to try to coordinate with chrome and other browsers on a mutually agreed-upon level of precision?
By the way, we could skip the range reduction code entirely if absx <= m_4_pi (and just use |z = absx;|) so long as we initialize quad_index to 0. Of course that would introduce a branch, and not necessarily a very predictable one - I don't know if code is likely to only use small angles, but code that uses angles in the range [-Pi,Pi] or [0,2Pi] is probably going to alternate a fair bit. So I don't know if this would be better or worse, but I wanted to mention it for completeness.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #38)
> Any further improvement that new sets of constants can bring is going to be
> very small at this point, and I don't want to hold this bug up any longer
> waiting for them, so I can file a followup if you'd like to just get this
> reviewed and landed.

Sounds good.

> Dan, what did you use to check the accuracy of each version? I'd like to run
> any new constants I come up with through that as well.

I'm using the test program linked from comment 6. I tested different ranges by modifying the values of topval and step.

(In reply to Alon Zakai (:azakai) from comment #39)
> (In reply to Dan Gohman [:sunfish] from comment #36)
> > I did some quick timing with the benchmark from comment 0. This patch is
> > about 25% percent slower than the V8 code as translated in comment 6, but as
> > seen in comment 33, and now the lack of need for test changes, it's much
> > more precise.
> 
> If the v8 code is faster but less precise, then what is the benefit of
> firefox being more precise? It seems like web content will have to assume
> the worst precision anyhow, and more precise browsers will not actually help
> anything.
> 
> Perhaps the optimal thing would be to try to coordinate with chrome and
> other browsers on a mutually agreed-upon level of precision?

I think having this patch in place will help discussions on this topic because it's a data point at an interesting place on the spectrum of speed versus precision.

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #40)
> By the way, we could skip the range reduction code entirely if absx <=
> m_4_pi (and just use |z = absx;|) so long as we initialize quad_index to 0.
> Of course that would introduce a branch, and not necessarily a very
> predictable one - I don't know if code is likely to only use small angles,
> but code that uses angles in the range [-Pi,Pi] or [0,2Pi] is probably going
> to alternate a fair bit. So I don't know if this would be better or worse,
> but I wanted to mention it for completeness.

As you say, this branch would mispredict a lot on some input sets, so I don't think we want to do this, but it is useful to keep in mind.
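For reference, the early-out being discussed can be sketched roughly like this. Everything here is hypothetical: the names are made up, std::sin/std::cos stand in for the patch's polynomial kernels, and pi/4 is assumed to be the intended threshold.

```cpp
#include <cmath>

struct SinCos { double s, c; };

// Hypothetical sketch of the proposed branch: skip range reduction
// entirely when |x| <= pi/4 (i.e. quad_index would be 0), at the cost
// of a possibly-mispredicted branch on other input distributions.
SinCos sincosSketch(double x) {
    const double PIO4 = 0.78539816339744830962;  // pi/4
    double absx = std::fabs(x);
    if (absx <= PIO4) {
        // The reduced argument is just absx; only the sign of sin
        // needs fixing up, since cos(-x) == cos(x).
        double s = std::sin(absx);
        double c = std::cos(absx);
        return { x < 0 ? -s : s, c };
    }
    // Otherwise fall through to the full range-reduction path.
    return { std::sin(x), std::cos(x) };
}
```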
Ok, let's call this ready to review :-).

:jorendorff gave me the idea to have this code fall back to libm sin and cos when the input range check fails, so we now support the full range of inputs (to the extent that libm does).

Also, I optimized the reflection code for a few more percentage points of speedup, and tweaked fast_sincos to use a struct return instead of returning values by pointer, since it's slightly faster on some platforms.
Attachment #8378730 - Attachment is obsolete: true
Attachment #8397345 - Attachment is obsolete: true
Attachment #8397940 - Flags: review?(jorendorff)
Attachment #8397940 - Flags: feedback?(emanuel.hoogeveen)
Comment on attachment 8397940 [details] [diff] [review]
fast-sincos.patch

Review of attachment 8397940 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsmath.cpp
@@ +319,5 @@
> + * nor does it take advantage of the standard's intent to permit JS to use the
> + * system C math library.
> + *
> + * The code carefully avoids branching, to avoid the cost of mispredictions
> + * either on random input sets or on input sets stradling a boundary condition

minorest of nits: "straddling"
Comment on attachment 8397940 [details] [diff] [review]
fast-sincos.patch

Looks good to me. I have one question regarding polevl_sin:

+static double polevl_sin(double z, double zz)
...
+  ans *= zz * z;
+  ans += z;
...
+  double q0_sin = polevl_sin(z, zz);

Would it change the result to just do |double q0_sin = z + z * polevl_sin(zz);| instead? This changes the rounding order from |ans * (zz * z)| to |(ans * zz) * z| (different from the two versions in attachment 8397345 [details] [diff] [review]), but I'm not sure whether one is worse than the other. Technically there's also |(ans * z) * zz| to try if you wanted to go that far :P
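The three associations being discussed can be compared directly in isolation. Each one rounds twice, so any pair can differ in the last bit but never by more than a few ulps. This is a standalone demonstration with made-up values, not the patch's actual code.

```cpp
#include <cmath>

// The three ways to associate ans * zz * z; each performs two
// roundings of the same real-number product, so results may differ
// in the final bit but stay within a couple of ulps of each other.
double orderA(double ans, double zz, double z) { return ans * (zz * z); }
double orderB(double ans, double zz, double z) { return (ans * zz) * z; }
double orderC(double ans, double zz, double z) { return (ans * z) * zz; }
```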
Attachment #8397940 - Flags: feedback?(emanuel.hoogeveen) → feedback+
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #44)
> Would it change the result to just do |double q0_sin = z + z *
> polevl_sin(zz);| instead? This changes the rounding order from |ans * (zz *
> z)| to |(ans * zz) * z| (different from the two versions in attachment
> 8397345 [details] [diff] [review]), but I'm not sure whether one is worse
> than the other. Technically there's also |(ans * z) * zz| to try if you
> wanted to go that far :P

I tried both alternatives, and they appear to make the results very slightly worse, so I'll stick with the current patch.
Some additional data points:

I ported the V8 code (as ported to C++ in comment 6) to jsmath.cpp to get a better comparison of the performance of the algorithms themselves, since V8's actual implementation is in JS. The V8 algorithm is about 15% faster on the microbenchmark in comment 0 than my latest patch here.

For precision, as measured by the test code in comment 6, the V8 algorithm's average number of differing bits on the input range [0,10] is 1.74, compared to 0.169 for the current patch here. On the input range [0,10000], the V8 algorithm gets 10.5, while the current patch gets 0.167.
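For anyone reproducing these numbers, a "differing bits" metric can be sketched as the bit length of the distance between the two results' integer representations. This is an assumption about what the comment 6 test program measures, not its actual code.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of a "differing bits" metric: the bit length of
// the ulp-distance between a and b, read through their IEEE 754 bit
// patterns. Assumes a and b share a sign (the integer ordering of
// double bit patterns breaks across zero).
int differingBits(double a, double b) {
    int64_t ia, ib;
    std::memcpy(&ia, &a, sizeof ia);
    std::memcpy(&ib, &b, sizeof ib);
    int64_t d = ia > ib ? ia - ib : ib - ia;
    int bits = 0;
    while (d) { ++bits; d >>= 1; }
    return bits;
}
```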
Blocks: 984018
jorendorff, ping
Flags: needinfo?(jorendorff)
Comment on attachment 8397940 [details] [diff] [review]
fast-sincos.patch

Review of attachment 8397940 [details] [diff] [review]:
-----------------------------------------------------------------

Very nice work. Sacrificing a teensy bit of precision for a whole lot of speed is what users will want here.

::: js/src/jsmath.cpp
@@ +343,5 @@
> +static double polevl_sin(double z, double zz)
> +{
> +  // Constants taken from fdlibm k_sin.c
> +  double ans = 1.58969099521155010221e-10;
> +  ans *= zz;

Indent 4 spaces, please.

@@ +919,5 @@
>  
>  double
>  js::math_sin_impl(MathCache *cache, double x)
>  {
> +    return math_sin_uncached(x);

It looks like math_sin_impl does not need to be exposed in jsmath.h. Please remove it and change math_sin to call math_sin_uncached directly (renaming as desired).

Same for cos.
Attachment #8397940 - Flags: review?(jorendorff) → review+
Flags: needinfo?(jorendorff)
It's also worth noting here that this implementation preserves the property that sin(-x) equals -sin(x), and cos(-x) equals cos(x) (NaN cases notwithstanding).

At this time, this property is not preserved in V8's implementation [0]. This is known to have broken at least one real-world codebase [1].

[0] http://code.google.com/p/v8/issues/detail?id=3089
[1] https://github.com/mbostock/d3/commit/bef5de751097c490332b5425415c59e71c8e480e
I just discovered that GLIBC's libm sin/cos got tremendously faster between 2.17 and 2.18, possibly with the help of changes like this [0]. The slow version is still widely used, as that's the version in the current Ubuntu release, for example.

The fast version in GLIBC is actually slightly faster than my patch on bench2d! I haven't studied the implementation in detail, but I did notice that it uses numerous branches, testing whether the input is in ranges like 2^-26 < x < 0.25, 0.25 < x < 0.855469, 0.855469 < x < 2.426265, and several others. This presumably allows it to go fast when it can, but I'm seeing it take about 1.5x as much time on random inputs (as measured by Luke's benchmark in comment 0) due to branch mispredictions.

I'm still planning to proceed with my current patch, as it will give us consistent performance across all platforms, versions of platforms, and input value distributions.

[0] https://sourceware.org/ml/libc-alpha/2013-05/msg01035.html
[1] https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/ieee754/dbl-64/s_sin.c;hb=HEAD
I'm a bit embarrassed to see that the benchmark I wrote in comment 0 is clownshoes.  My intention was to generate a bunch of values that kept cycling over [-3, 3] at slightly different offsets, but instead I effectively wrote "last += CONSTANT" so the array of inputs just contains a ton of giant values, so hugely non-representative.  Probably this doesn't negate the value of the patch, but it might be good to measure with something that stays in the common range of radians...
(also on Windows)
I imagine it's common for code to do

  for (i = 0; i < N; i++)
      signal[i] = ...expression containing Math.sin(k*i)...;

and thus end up well outside the range [-pi,pi]. I like that the branch in sunfish's patch is well-predicted until you get up to 3 billion or so. :)

But in any case, what GLIBC is doing (branching on values near zero) sounds like it would mispredict a lot in practice no matter what kind of application you have...
(In reply to Luke Wagner [:luke] from comment #51)
> I'm a bit embarrassed to see that the benchmark I wrote in comment 0 is
> clownshoes.  My intention was to generate a bunch of values that kept
> cycling over [-3, 3] at slightly different offsets, but instead I
> effectively wrote "last += CONSTANT" so the array of inputs just contains a
> ton of giant values, so hugely non-representative.  Probably this doesn't
> negate the value of the patch, but it might be good to measure with
> something that stays in the common range of radians...

I have tested performance on a variety of ranges. My patch has consistent performance on every range, up to its main limit. GLIBC 2.18 is faster on small ranges close to 0, but it's slower on [-3,3]. Bench2d's sin/cos inputs are in [0,2*pi], but about 97% are less than 1, and GLIBC 2.18 is about 6% faster on that benchmark. For several reasons discussed above, I think the patch is still what we want, but this is something we'll continue to watch.

(In reply to Jason Orendorff [:jorendorff] from comment #48)
> Indent 4 spaces, please.

Done.

> It looks like math_sin_impl does not need to be exposed in jsmath.h. Please
> remove it and change math_sin to call math_sin_uncached directly (renaming
> as desired).
> 
> Same for cos.

Done. I removed math_sin_impl and renamed math_sin_uncached to math_sin_impl, so it's now consistent with math_floor_impl and a few others.

For ideas on improving this implementation, please file new bugs. Thanks everyone for all your help! 

https://hg.mozilla.org/integration/mozilla-inbound/rev/8fa46ad24ecc
(In reply to Dan Gohman [:sunfish] from comment #54)
> For several reasons discussed above, I think the
> patch is still what we want, but this is something we'll continue to watch.

Yes, I'm definitely in favor, and glad to hear you've been measuring other ranges too.
(In reply to Dan Gohman [:sunfish] from comment #15)
> The current patch does slow down 3d-morph:
>   before: 0.016485148 seconds time elapsed                                  
> ( +-  0.02% )
>   after:  0.019115016 seconds time elapsed                                  
> ( +-  0.02% )
> 
> But, it speeds up math-partial-sums.js:
>   before: 0.020669752 seconds time elapsed                                  
> ( +-  0.02% )
>   after:  0.016817460 seconds time elapsed                                  
> ( +-  0.02% )
> 
> Given that we aren't fond of caching, can we call it even? :-)

On AWFY this regressed SS about 2.5% due to 3d-morph. There's no difference on partial-sums, did later changes affect this? Maybe we have a faster sin/cos on Mac?
IIRC, 3d-morph is the entire motivation for the math cache.  Before we do anything rash, I noticed that both 3d-cube and math-partial-sums contain trivial SinCos opportunities.  Perhaps the wins from bug 984018 would offset the loss on 3d-morph, allowing us to keep both our dignity and SS performance?
The current implementation saves a branch by computing both sin and cos at once - it might be faster to take a branch and only compute one or the other (though this would definitely make the implementation somewhat uglier). On the other hand, optimizing for cases that want both sin and cos could help a lot as well. I don't know that we would want to do both - caching the latest result of both cos and sin seems like the most straightforward fix, but I don't know how the math cache is implemented.
It seems like those branches would be static, based on the caller, so we could templatize the whole algorithm.
It's not quite that simple: you only know whether you want sin or cos after range reduction (even if the caller explicitly wants one or the other).

Right now the algorithm chooses in which order to return sin and cos right at the end, but I think it could be refactored to choose after range reduction and only calculate one or the other.
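A rough sketch of that refactoring, with hypothetical names: the quadrant index from range reduction determines both which kernel to evaluate and the sign of the result, so the choice can be made after reduction even when the caller only wants sin. The reduction below is deliberately naive (a single double-precision pi/2, fine for small arguments), and std::sin/std::cos stand in for the polynomial kernels.

```cpp
#include <cmath>

// Hypothetical sketch: compute only the kernel the quadrant requires.
// sin(z + q*pi/2) is  sin z, cos z, -sin z, -cos z  for q = 0..3.
double sinViaQuadrant(double x) {
    const double PIO2        = 1.57079632679489661923;  // pi/2
    const double TWO_OVER_PI = 0.63661977236758134308;  // 2/pi
    double absx = std::fabs(x);
    double n = std::floor(absx * TWO_OVER_PI + 0.5);
    double z = absx - n * PIO2;  // z roughly in [-pi/4, pi/4]
    int q = (int)std::fmod(n, 4.0);
    // Only now do we know which kernel the caller actually needs:
    double r = (q & 1) ? std::cos(z) : std::sin(z);
    if (q & 2)
        r = -r;
    return x < 0 ? -r : r;  // sin is odd
}
```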
It looks like this made emscripten-box2d slower on awfy, possibly also octane-box2d (octane is noisier, hard to tell). On my local machine I also see a slowdown on emscripten-box2d, 1.03 to 1.18 (15% slower). I guess the awfy bots and my machine might happen to have fast libc implementations of sin/cos?
Random idea: the default AWFY graphs could be an aggregate of all platforms so regressions for any one are easily seen without having to watch them all.
(In reply to Jan de Mooij [:jandem] from comment #56)
> On AWFY this regressed SS about 2.5% due to 3d-morph. There's no difference
> on partial-sums, did later changes affect this? Maybe we have a faster
> sin/cos on Mac?

In the process of tidying up after making changes for review feedback, I made what I thought was an innocuous change which actually disabled the optimization to choose MMathFunction nodes instead of generic MCall nodes. On my machine, this causes a slowdown on 3d-morph and math-partial-sums. I have now filed bug 994993 to track this (with a patch).
Depends on: 994993
https://hg.mozilla.org/mozilla-central/rev/8fa46ad24ecc
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
On AWFY, bug 994993 helped, but only on 64-bit, and not enough to recover the regression.

I'm now aware that the performance surveying I did in preparation for this bug relied too heavily on benchmarks which used (relatively) large and randomly distributed values, which made system libm implementations appear slower than they often are in practice. On code where the majority of values are very small and close together (97% of values are less than 1.0 in box2d's case), they are faster. Consequently, it now seems that this branchless Cephes-based implementation is not as much of a win in practice as was originally hoped.

I'm now considering the options. Re-introducing the math cache for sin/cos is one possibility; it would help 3d-morph, and it would also make the sincos optimization (bug 984018) much easier to implement. It's possible to compute the sin and cos polynomials in parallel using SIMD (rather than just relying on ILP); in a prototype this gained a few percentage points. Or, we may want to pick a different algorithm altogether.
It seems odd to me that it didn't help at all on 32-bit, but here's a few things we could try:

1) We could try what I suggested in comment #40 - skip past range reduction if absx <= 0.785398163397448278999 (Pi / 4, not m_4_pi as I suggested in comment #40). That should be pretty easy to try out, though we'd still calculate both sin and cos at the same time.

2) Orthogonally, we could refactor things so that fast_sincos takes a parameter saying whether to return sin or cos, then choose which one to calculate based on the results of range reduction (this would look a bit ugly, but..).

3) For cos specifically, we could see if we can get away with using one less coefficient. This would probably make the precision worse on average, but it might still be at most 1 bit off.

4) And finally as you've said, we could go back to caching the results, caching both sin and cos if we stick with the current design, or caching one or the other if we do refactor.
Also, it'd be good to measure against Windows builtin sin/cos in the [0, 1.0] range.
For testing the effects of a branch to skip range reduction, I expect the worst case range would be [0, Pi / 2] or [-Pi / 2, Pi / 2] since that would use range reduction 50% of the time.
(Just to be clear, I'm not saying we have to fix the 3d-morph regression completely: the math cache is pretty silly and may hurt more real-world workloads. If we can win a bit on partial-sums with the sincos optimization we can probably accept a small regression to remove an old hack, apparently V8 did the same.)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #66)
> It seems odd to me that it didn't help at all on 32-bit, but here's a few
> things we could try:
> 
> 1) We could try what I suggested in comment #40 - skip past range reduction
> if absx <= 0.785398163397448278999 (Pi / 4, not m_4_pi as I suggested in
> comment #40). That should be pretty easy to try out, though we'd still
> calculate both sin and cos at the same time.

I had originally sought to minimize branches, but it seems that this branch helps a lot on several important use cases, and doesn't hurt too much otherwise, so I may do this.

> 2) Orthogonally, we could refactor things so that fast_sincos takes a
> parameter saying whether to return sin or cos, then choose which one to
> calculate based on the results of range reduction (this would look a bit
> ugly, but..).

I'm experimenting with parallelizing the two polynomials with SIMD -- see bug 996375. On machines which support it, this reduces the overhead of computing both, and is less ugly :-}.

> 3) For cos specifically, we could see if we can get away with using one less
> coefficient. This would probably make the precision worse on average, but it
> might still be at most 1 bit off.

This sounds appealing, since it turns out that the cos side is doing one more mul+add than the sin side, which is especially annoying when they're being computed in parallel with SIMD :-). I tried a simple experiment to drop the first coefficient from cos, and I am seeing significant errors; Math.sin(3.93982011218085) is off by 12 bits, for example. Would you be interested in generating another set of coefficients for this purpose?
(In reply to Dan Gohman [:sunfish] from comment #70)
> Would you be
> interested in generating another set of coefficients for this purpose?

Certainly. In particular, I remember that when I tried this briefly, the coefficient for the 2nd order term turned out somewhat less than 1/2 (whereas in the current approximation it works better to set it to 1/2 and reduce the rest). I've almost finished optimizing the coefficients for sin - producing new ones for cos will take time, but I can probably give a rough set in a day or two to see if there's any hope of it being precise enough. Do you want a follow-up for that?
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #71)
> Certainly. In particular, I remember that when I tried this briefly, the
> coefficient for the 2nd order term turned out somewhat less than 1/2
> (whereas in the current approximation it works better to set it to 1/2 and
> reduce the rest). I've almost finished optimizing the coefficients for sin -
> producing new ones for cos will take time, but I can probably give a rough
> set in a day or two to see if there's any hope of it being precise enough.
> Do you want a follow-up for that?

Yes.
(In reply to Dan Gohman [:sunfish] from comment #72)
> (In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #71)
> > Certainly. In particular, I remember that when I tried this briefly, the
> > coefficient for the 2nd order term turned out somewhat less than 1/2
> > (whereas in the current approximation it works better to set it to 1/2 and
> > reduce the rest). I've almost finished optimizing the coefficients for sin -
> > producing new ones for cos will take time, but I can probably give a rough
> > set in a day or two to see if there's any hope of it being precise enough.
> > Do you want a follow-up for that?

Yes. My experiments suggest that we need this speedup (and others) for this algorithm to be competitive. If it's not precise enough and we can't do this we'll have to look at bigger changes, so an early estimate of the precision would be helpful.
Since the performance regressions are not yet resolved, I reverted this feature in anticipation of the release branch.
http://hg.mozilla.org/integration/mozilla-inbound/rev/ba2e9970b80f
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
https://hg.mozilla.org/mozilla-central/rev/ba2e9970b80f
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
(In reply to Carsten Book [:Tomcat] from comment #75)
> https://hg.mozilla.org/mozilla-central/rev/ba2e9970b80f

I think this was a backout - this should be a [leave open]. Feel free to remove this from the whiteboard whenever the new patch is ready to be landed to fix this.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [leave open]
If there are not any other lines of investigation which provide similar performance with a low error, perhaps we should start a conversation (on mozilla.dev.tech.js-engine.internals) about whether we should adopt a similar strategy to V8's.
From comment #46, the V8 algorithm is only about 15% faster than what landed here - and Dan's SIMD implementation might well make up that difference, while still being significantly more precise. I think the main problem here is that we're removing the (dumb) math cache for sin/cos at the same time, and the new algorithm isn't quite fast enough to make up the difference. Is that right, Dan?

Removing the math cache may be a goal in and of itself, but have we checked performance with the new algorithm if we cache the results like before? (we'd still be calculating both sin and cos and only caching the one we use, but the math cache isn't exactly sophisticated to begin with)
Dumb question: wouldn't a "smart" use of the math cache be to cache both sin & cos on look up of either, so the other is available for instances where both sin & cos are needed?
(In reply to Dave Garrett from comment #79)
> Dumb question: wouldn't a "smart" use of the math cache be to cache both sin
> & cos on look up of either, so the other is available for instances where
> both sin & cos are needed?

We are considering it; bug 984018 also discusses this. 

My tentative plan right now is to proceed with the SIMD patch (bug 996375, patch waiting for review), make SIMD work on platforms which need runtime feature detection, and then introduce a simple branch for small values that don't need range reduction. These together largely appear to make up the remaining difference in the testing I've done so far, though meaningful benchmarking is complex. If these specific changes end up insufficient, I'll drop this approach entirely and pursue other avenues mentioned above.

The cache does appear to be disproportionately represented in Sunspider, and I'm attempting to keep it in perspective. Many real applications use sin/cos in essentially uncacheable ways, so it's important to make the non-cached case fast, regardless of whether we also have a cache for the benefit of Sunspider or making the sincos optimization easier to implement or anything else.
Some major news on this topic:

TC39 discussed the topic of floating-point accuracy at their recent meeting. The consensus seems to be that JavaScript should start specifying accuracy requirements for its Math library functions. The details are not yet decided, but it seems that the desire is to be around 1 ulp for all functions, if not better.

Even though we don't have new rules defined yet, V8 has already switched from their fast sin/cos implementation to an fdlibm-derived sin/cos implementation. It is within 1 ulp, though it is also significantly slower than the previous implementation.

A 1-ulp requirement would also rule out the patch attached to this bug as it currently stands. We could potentially refine it to meet this requirement. I haven't yet decided whether this makes sense.

I am currently participating in an effort to develop a set of rules to propose for the standard. If anyone here has any opinions about what should be done, let's talk about them.
The ideal solution would probably to add an optional second argument for these functions to the standard to specify a desired precision. Set a reasonable default, but then things that need best speed or best precision could request the implementation bias they want.
Does 'within 1 ulp' mean the error should be less than 1 bit, or is a 1 bit error acceptable? If we use the older coefficients for cos, we can get that.

If we need perfect double precision, the current patch can't work because squaring the angle introduces an error, but I was wondering if we could use a correction based on the angle sum identities:

sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
cos(a + b) = cos(a)cos(b) - sin(a)sin(b)

If b is extremely small, this should essentially turn into:

sin(a + b) = sin(a) + b cos(a)
cos(a + b) = cos(a) - b sin(a)

We already calculate both sin and cos, so aside from determining 'b' this would simply add another multiply and add. To make the square exact, perhaps we could clear out the (26?) least significant bits of the angle, and set b to the remainder. I'm not sure what the fastest way to do this would be, and I haven't tried it to see if it gets rid of the error.

Searching around the web for a bit I didn't find any other ways to refine existing approximations, though they might exist.
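A minimal sketch of the split-and-correct idea above. The names are hypothetical, std::sin/std::cos stand in for whatever produces sin(a) and cos(a) (a table or polynomial), and clearing 27 low bits is one choice that makes b small enough for the first-order correction to be accurate.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: split x into a truncated angle a (low mantissa
// bits cleared) and an exact remainder b, then apply the first-order
// angle-sum correction  sin(a + b) ~= sin(a) + b*cos(a).
double refinedSin(double x) {
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= ~((uint64_t(1) << 27) - 1);  // zero the 27 low mantissa bits
    double a;
    std::memcpy(&a, &bits, sizeof a);
    double b = x - a;  // exact: a's bits are a prefix of x's
    return std::sin(a) + b * std::cos(a);
}
```

The neglected term is on the order of b*b/2, which for |x| around 1 is far below one ulp of the result, so the correction itself doesn't limit precision.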
(In reply to Dave Garrett from comment #82)
> The ideal solution would probably to add an optional second argument for
> these functions to the standard to specify a desired precision. Set a
> reasonable default, but then things that need best speed or best precision
> could request the implementation bias they want.

It's an interesting idea. I'll bounce it off some people and see what kind of reaction it gets.

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #83)
> Does 'within 1 ulp' mean the error should be less than 1 bit, or is a 1 bit
> error acceptable?

"within 1 ulp" permits some 1 bit errors. If we define ulp such that when the infinitely-precise result is between consecutive finite floating-point values a and b, 1 ulp is b - a, then both a and b are valid results within 1 ulp.

> Searching around the web for a bit I didn't find any other ways to refine
> existing approximations, though they might exist.

One of the main data points in favor of "correctly rounded" results is crlibm.
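To make that definition concrete, the size of one ulp at a finite double can be computed with std::nextafter. This is a helper for illustration only, not part of any proposal.

```cpp
#include <cmath>

// One ulp at a (for finite, non-maximal a): the gap between a and the
// next representable double toward +infinity.
double ulpAbove(double a) {
    return std::nextafter(a, INFINITY) - a;
}
```

With this helper, "within 1 ulp" means the returned value is one of the two representable doubles bracketing the infinitely-precise result.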
(In reply to Dan Gohman [:sunfish] from comment #84)
> (In reply to Dave Garrett from comment #82)
> > The ideal solution would probably to add an optional second argument for
> > these functions to the standard to specify a desired precision. Set a
> > reasonable default, but then things that need best speed or best precision
> > could request the implementation bias they want.
> 
> It's an interesting idea. I'll bounce it off some people and see what kind
> of reaction it gets.

This probably won't be web-compatible, unfortunately. Consider code that uses Math functions as callbacks for higher order functions that pass in additional arguments:

values = [0, 0.25, 0.5, 1];
sinValues = values.map(Math.sin);

This performs these Math.sin calls:
Math.sin(0, 0, values);
Math.sin(0.25, 1, values);
Math.sin(0.5, 2, values);
Math.sin(1, 3, values);
(In reply to Till Schneidereit [:till] from comment #85)
> (In reply to Dan Gohman [:sunfish] from comment #84)
> > (In reply to Dave Garrett from comment #82)
> > > The ideal solution would probably to add an optional second argument for
> > > these functions to the standard to specify a desired precision. Set a
> > > reasonable default, but then things that need best speed or best precision
> > > could request the implementation bias they want.
> > 
> > It's an interesting idea. I'll bounce it off some people and see what kind
> > of reaction it gets.
> 
> This probably won't be web-compatible, unfortunately. Consider code that
> uses Math functions as callbacks for higher order functions that pass in
> additional arguments:

Hmm... here's a more generalized route: add a Math.precision() method. It would take as its argument a precision in some way (units or constants on Math obj). It would return a variant of the Math object with all of the same methods, however these versions would use the given precision. For example:
Math.precision(n).sin(v)

Instead of changing arguments, this just lets you get variants of the Math object with arbitrary precisions.
I appreciate the ideas. However I've now talked with a variety of people, and the sentiment seems to be that the standard should just have one sine, which prefers accuracy over speed (provided it's within reason), and has no configuration knobs, at least for now. This is what the standard previously intended to have, and the present desire is primarily to close what is seen as a loophole.

Fully configuring sine implementation is complex. In addition to a maximum-ulp value, one might want to specify whether the function is required to be monotonic over the ranges where one would expect it, whether the function can return values outside of [-1,1], and so on. These requirements are useful, but exclude some approximation algorithms, and some applications may not require them.

JavaScript gives programmers enough tools to write quite good and fast implementations of sine and other functions themselves, and hopefully third-party libraries will be developed that most people can use, offering their own tradeoffs. This gives more flexibility than any standard-specified API could offer.
One thing we could do is add a single precision variant that uses fewer coefficients (but still uses double precision internally), then call that when someone does |Math.fround(Math.sin(x))|. Getting a 0-ULP single precision result using a shorter double precision calculation (3 or 4 coefficients rather than 6 or 7) should be easy.

Using single precision inside such a variant would presumably be even faster, but we'd probably get 1-ULP single precision errors.
If we could design our sin/cos algorithms to be float-commutative, that'd be *fantastic*.
Note, bug 1076670 just had to add a branch on win64 sin so this would be an extra boost there too.
Depends on: 996375
No longer blocks: 984018
FWIW, I have optimal sets of coefficients for up to and including a 15th order approximation (for sine). Unfortunately this approach always loses at least 1 bit of precision by using the double precision square of the input - and my naive attempts at fixing this (using a linear approximation for the less significant bits) didn't work. If we want that final bit, we need some sort of fast refinement step, or try a different approach.

As Dan pointed out, it looks like other approaches seen in standard libraries use a number of branches, but still manage to perform at a similar level to the approach here - or faster in some cases - while being fully precise. They also tend to be much more complicated however, so if we ever wanted to completely inline the computation in Ion, we'd probably want an implementation closer to the one proposed here (in particular the SIMD version).
I rebased the patch to do more performance testing on it.
These are the optimized coefficients for approximations of sine and cosine, up to 15th order and 14th order respectively (giving 7 coefficients for each).

I never got around to checking the number of error bits on these, but some of the lower order approximations might be useful for implementing sin and cos when only single precision output is needed (calculations should still be done in double precision, but need not use the highest order approximations).

These coefficients minimize the maximum error across the input range [0,π/4], relative to the precision of the IEEE 754 double precision output (for cosine this is simply the absolute error times a constant, as the output range is [√½,1]). The sign of the error alternates across the range, as these optimizations were unconstrained.

Unfortunately it is not possible to produce exact double precision answers using this method - even the highest order approximations will have 1-bit errors, because using the double precision square of the input doesn't preserve all the information. But these coefficients should minimize the error from other sources.
Benoit: As of bug 984018 I believe we cache both sin and cos when a joint sincos implementation is available. In addition, the SIMD implementation from bug 996375 should be faster than the latest patch attached to this bug. For performance testing, both should be taken into account.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #93)
> even the highest order approximations will have 1-bit
> errors, because using the double precision square of the input doesn't
> preserve all the information. But these coefficients should minimize the
> error from other sources.

I was wrong: this is not caused by rounding errors in *any* stage of the calculation. I did an experiment with the double-double library, comparing its built-in implementation to one using the coefficients I've proposed. Using the highest order approximation, it gets 62 digits of precision - but rounding both results to double precision first, it *still* shows 1-bit errors. So these 1-bit errors are actually 0.5-bit errors in edge cases where a *lot* of extra precision is required for accurate rounding. There's no getting rid of them without a quadruple precision approximation, and I suspect all OS standard libraries display them (Visual Studio certainly does).

I also did some digging to see if there's any kind of iterative refinement that can be done. I found two schemes, CORDIC and BKM, but they both rely on tracking the approximation of the angle being used, so they can't be used to refine an *existing* approximation. They *can* be used to reduce the range, so a lower order polynomial approximation can be used; but CORDIC seems to accumulate rounding errors pretty quickly, so I'm not sure it's worth it. I haven't managed to implement BKM so far - the papers describing it are rather heavy on mathematical notation and don't summarize the algorithm very well.

To summarize: the algorithm proposed here is as precise as we can reasonably get. While rounding errors in the calculation may increase the *number* of 1-bit errors, they don't introduce 2-bit errors, and it's not possible to eliminate 1-bit errors entirely anyway.

The only improvements possible here are to the performance (it seems branching may not be overly slow) and to the angle reduction (which is good, but doesn't work across the full range).
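On the reduction point: the standard trick for moderate |x| is Cody-Waite style reduction, sketched below under the simplifying assumption that k times the high part of pi/2 stays accurate enough (production code splits pi/2 into parts with trailing zero bits so that product is exact, and falls back to Payne-Hanek reduction for huge arguments — which is exactly the "doesn't work across the full range" caveat).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch of two-term argument reduction: x = k*(pi/2) + r, |r| <= pi/4.
// pi/2 is applied as a high part plus a low correction so the second
// subtraction recovers bits the first one would otherwise lose.
double reduce_pio2(double x, int* quadrant) {
    const double two_over_pi = 0.636619772367581343;  // 2/pi
    const double pio2_hi = 1.5707963267948966;        // pi/2 rounded to double
    const double pio2_lo = 6.123233995736766e-17;     // pi/2 - pio2_hi
    double k = std::nearbyint(x * two_over_pi);
    *quadrant = (int)((int64_t)k & 3);
    double r = x - k * pio2_hi;  // heavy cancellation: k*pio2_hi is close to x
    r -= k * pio2_lo;            // fold in the low bits of pi/2
    return r;
}
```

After reduction, sin(x) is recovered from sin(r) or cos(r) with a sign depending on the quadrant index.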
One addendum to that: Using error compensation during the polynomial approximation *does* make just enough of a difference that we could use 6 coefficients for cosine instead of 7 (matching the 6 we need for sine). Of course, the overhead of performing such error compensation almost certainly isn't worth it compared to just using one more coefficient.
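For reference, the compensation alluded to there is built from error-free transformations of sums — e.g. Knuth's branch-free TwoSum, shown as a generic sketch (not code from any patch in this bug):

```cpp
#include <cassert>

// Knuth's TwoSum: s is the rounded sum of a and b, e is the exact
// rounding error, so a + b == s + e holds exactly. Requires strict
// IEEE 754 arithmetic (breaks under -ffast-math style optimizations).
void two_sum(double a, double b, double* s, double* e) {
    *s = a + b;
    double t = *s - a;
    *e = (a - (*s - t)) + (b - t);
}
```

Threading such a compensation term through a Horner evaluation is what buys the extra fraction of a bit, at the cost of several extra operations per coefficient — which is why one more plain coefficient is usually the better trade.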
I'm seeing:

v8 (3.3): 119
node (4.4.4): 70
sm: 39

Is it an improvement on our side, or a regression on their side?
Flags: needinfo?(sunfish)
I don't believe there have been any changes on our side.
Flags: needinfo?(sunfish)