Open Bug 967709 Opened 6 years ago Updated Last year

V8 is 2.8x faster at sin/cos

Categories: Core :: JavaScript Engine: JIT, defect
Tracking: REOPENED, mozilla31
People: (Reporter: luke, Assigned: sunfish)
References: Depends on 1 open bug
Details: Whiteboard: [leave open]
Attachments: 4 files, 4 obsolete files

Attached file sin benchmark
On the attached microbenchmark, which just pounds on sin with non-repeating values, V8 is about 2.8x faster on my Linux machine.  (Our sin/cos just calls into the C stdlib's sin/cos, so this is highly dependent on OS and stdlib version.  I'd appreciate seeing what numbers other people get.)  Profiling the box2d benchmark on awfy shows about 50% of its time is spent just calling sin/cos, and this gives V8 better overall throughput on my machine.  It looks like V8 rolls its own sin/cos (https://code.google.com/p/v8/source/detail?r=17594), which gives it more predictable performance.  They self-host sin/cos, which also avoids the call out from JIT code and all the overhead that incurs.  Since the sin/cos code isn't all that complex, it seems like we could do even better with MSin/MCos MIR/LIR ops.
To wit, when we get a hit in the math cache, we are only ~2x faster than V8 (which has since removed their math cache: https://code.google.com/p/v8/source/detail?r=18344).  With optimized sin/cos, perhaps we'd be able to do the righteous thing and remove our math cache as well (which was only added for benchmarketing parity in the first place).
I just tested on a Win32 Nightly vs. Chrome Dev and the perf difference is about the same as comment 0.
I really don't like this V8 implementation of sin/cos:

1) Uses HUGE 32 KiB lookup table. 
2) Not particularly precise -- up to 4 erroneous bits in [0, pi/2], 2.12 on average.
3) Uses imprecise reduction to [0, pi/2] that will lead to significant errors out of the first quadrant.
4) Can't calculate sin and cos simultaneously.

I quickly compared it with the Cephes-based VDT [1]. No lookup tables, only 16 double constants. 2 wrong bits max, 0.16 average in [0, pi/2], 0.25 average in [0, 10000]. Good reduction to quadrant. And it's not much slower than the implementation in V8 -- without FMA the V8 version was faster on my Haswell with GCC 4.8 -- 14.5s vs 11s to calculate 1 billion sin() calls -- but with FMA both scored around 10s (however, the V8 function calculates only sine, while VDT calculates both sine and cosine). So V8's implementation isn't even that much faster than the alternatives.

[1] https://svnweb.cern.ch/trac/vdt It's LGPL, but it's just a streamlined reimplementation of the functions from Cephes, so not hard to reimplement.
Thanks for the information!  The CERN slides are pretty encouraging about VDT [1][2].  Also, I think we'll have a good opportunity to emit MSinCos based on some of the testcases we've looked at (box2d).

For your benchmark comparisons, were you comparing V8's builtin sin/cos to native VDT or did you test some other configuration like native VDT vs. native version of V8's impl?

[1] https://indico.cern.ch/event/214784/session/7/contribution/219/material/slides/0.pdf
[2] http://indico.cern.ch/event/202688/session/9/contribution/4/material/slides/0.pdf
I'm seeing V8 as 1.9x faster on Fedora20 x64, glibc 2.18.

v8: 19
sm: 36
(In reply to Luke Wagner [:luke] from comment #4)
> For your benchmark comparisons, were you comparing V8's builtin sin/cos to
> native VDT or did you test some other configuration like native VDT vs.
> native version of V8's impl?

I copy-pasted the V8 implementation to C++ and minimally modified it to make it work: http://pastebin.com/aH0fsV5h Probably not a good port, but I don't think it was unfair to V8 -- the code generated by GCC looks reasonable, and JavaScript in V8 should require additional bounds checks, etc., I guess? More likely my tests were unfair to VDT, because I see jumps in the generated code.
(In reply to Vladislav Egorov from comment #6)
Great!  I was hoping that was the case.  If we do arch-specific codegen (using MacroAssembler, allowing us to dynamically test cpuid and guarantee SIMD optimization when HW support is present) then it seems like our impl would do relatively better.

Are there any other trig algorithms we should be considering other than VDT?  Some quick Googling seems to strongly support Cephes (and thus VDT).  One annoyance is that, iiuc, we can't just import the VDT code because it is LGPL; we'll need to rewrite it (w/ attribution, of course).  However, the code isn't very long, so that should be fine.
Attached file new_sincos.cpp (obsolete) —
Attached is a C++ implementation of sincos derived from the cephes implementations of sin and cos. Several of the optimizations are inspired by ideas in VDT, though the code is based directly on the cephes implementation and also contains several optimizations of my own.

Unlike VDT, my version includes the range and non-finite checks, and handles negative zero according to the ECMA spec, so it should be safe for any input (though like cephes it won't return useful results for extreme inputs). Also unlike VDT, in my version those are the *only* branches (and they're MOZ_UNLIKELY and should be highly predictable).

On Luke's sin benchmark here, my version is about 10% faster than VDT sincos (even with the extra checks).
Out of curiosity, do you know how the numerical constants in that code were obtained? Calculating a 53-bit mantissa by hand in Mathematica, I get some discrepancies. For instance, in the file you use -1.66666666666666307295e-1 for -1/6, but I get -1.66666666666666657415e-1 (I did this by doing ScientificForm[N[Round[2^55 / 3!] / 2^55, {21, Infinity}]], where Round[2^55 / 3!] is an integer between 2^52 and 2^53).
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #9)
> Out of curiosity, do you know how the numerical constants in that code were
> obtained? Calculating a 53-bit mantissa by hand in Mathematica, I get some
> discrepancies. For instance, in the file you use -1.66666666666666307295e-1
> for -1/6, but I get -1.66666666666666657415e-1 (I did this by doing
> ScientificForm[N[Round[2^55 / 3!] / 2^55, {21, Infinity}]], where Round[2^55
> / 3!] is an integer between 2^52 and 2^53).

No, I just used the constants from cephes. I don't know how they derived those constants, but it appears that they're compensating for rounding somehow. Replacing those constants with the obvious rational approximations seems to lead to less precise end results.
(In reply to Dan Gohman [:sunfish] from comment #10)
> No, I just used the constants from cephes. I don't know how they derived
> those constants, but it appears that they're compensating for rounding
> somehow. Replacing those constants with the obvious rational approximations
> seems to lead to less precise end results.

OK, thanks for confirming that :) I figured I was probably missing something.
Attached patch fast-sincos.patch (obsolete) — Splinter Review
Here's a patch which incorporates this Cephes-based sincos code into IonMonkey (and disables the cache!), plus a few more comments and code tweaks. On my system (Linux, GLIBC 2.17), it speeds up Luke's sin benchmark by over 3x, speeds up a Box2d benchmark by 47%, and generally looks good on a variety of other benchmarks. The new code is a little less precise overall, and it doesn't support values over pow(2, 30), but my impression is that this is considered an acceptable tradeoff here.

This doesn't do the sincos optimization, or emit the code via the MacroAssembler, but it's a good first step.

Help needed:

 - It currently fails on 3d-cube.js in SunSpider because its results are very slightly less precise, and 3d-cube.js' validation code doesn't use an epsilon. Is it important that we make this test pass as is?

 - Testing on other platforms and codes.

Any comments or questions on the code are welcome too :-).
(In reply to Dan Gohman [:sunfish] from comment #12)
> IonMonkey (and disables the cache!),

Sweet!  Is there any overall impact on SunSpider (for which the foul cache was added in the first place)?

> The new code is a little less precise
> overall, and it doesn't support values over pow(2, 30), but it's my
> impression that this is considered an acceptable tradeoff here.

I'm unfamiliar with what is normal for trig impls; is there precedent for this in other impls?

>  - It currently fails on 3d-cube.js in SunSpider because its results are
> very slightly less precise, and 3d-cube.js' validation code doesn't use an
> epsilon. Is it important that we make this test pass as is?

Not important: we just made up those validation benchmarks by running, printing the values, then copying them into the assert verbatim.  (On several occasions in the tracer days, we'd get a massive "speedup" that came from producing the wrong result.)  It'd be fine to add an epsilon to the test.
(In reply to Luke Wagner [:luke] from comment #13)
> Sweet!  Is there any overall impact on SunSpider (for which the foul cache
> was added in the first place)?

To add to this, it was added for 3d-morph so you want to check that test in particular.
(In reply to Jan de Mooij [:jandem] from comment #14)
> (In reply to Luke Wagner [:luke] from comment #13)
> > Sweet!  Is there any overall impact on SunSpider (for which the foul cache
> > was added in the first place)?
> 
> To add to this, it was added for 3d-morph so you want to check that test in
> particular.

The current patch does slow down 3d-morph:
  before: 0.016485148 seconds time elapsed                                          ( +-  0.02% )
  after:  0.019115016 seconds time elapsed                                          ( +-  0.02% )

But, it speeds up math-partial-sums.js:
  before: 0.020669752 seconds time elapsed                                          ( +-  0.02% )
  after:  0.016817460 seconds time elapsed                                          ( +-  0.02% )

Given that we aren't fond of caching, can we call it even? :-)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #9)
> Out of curiosity, do you know how the numerical constants in that code were
> obtained? Calculating a 53-bit mantissa by hand in Mathematica, I get some
> discrepancies. For instance, in the file you use -1.66666666666666307295e-1
> for -1/6, but I get -1.66666666666666657415e-1 (I did this by doing
> ScientificForm[N[Round[2^55 / 3!] / 2^55, {21, Infinity}]], where Round[2^55
> / 3!] is an integer between 2^52 and 2^53).

That's not a Taylor series, but most likely minimax. I can't check, because I don't currently have Mathematica, but something like this should work, if I didn't make a mistake somewhere -- MiniMaxApproximation[(Sin[Sqrt[x]]/(x*Sqrt[x])-1/x), {x, {-(Pi/4)^2, (Pi/4)^2}, 6, 0}].
(In reply to Dan Gohman [:sunfish] from comment #15)
Seems like a small win, even.  Also, math-partial-sums.js looks like a great candidate for the MSinCos optimization :)
Following comment 13, I created a patch to add an epsilon to 3d-cube.js on arewefastyet:

https://github.com/sunfishcode/arewefastyet/commit/03016720b128ad52f5e8eb49ec9e8ca8d0e84351

If this fast-sincos.patch is accepted, I'll submit this as a pull request to the main arewefastyet repository.
(In reply to Luke Wagner [:luke] from comment #13)
> (In reply to Dan Gohman [:sunfish] from comment #12)
> >  - It currently fails on 3d-cube.js in SunSpider because its results are
> > very slightly less precise, and 3d-cube.js' validation code doesn't use an
> > epsilon. Is it important that we make this test pass as is?
> 
> Not important: we just made up those validation benchmarks by running,
> printing the values, then copying them into the assert verbatim.  (On
> several occasions in the tracer days, we'd get a massive "speedup" that came
> from producing the wrong result.)  It'd be fine to add an epsilon to the
> test.

@Luke: this is about the actual SS benchmark, not about check-3d-cube.js in the jit-tests. I think we will have to request this change upstream from (WebKit) SunSpider! I don't think they will object too much, since they added a similar epsilon to 3d-morph in SunSpider 1.0(.1?)

@Dan: I noticed you used an epsilon of 1e-10. 3d-morph is using 1e-13. Is the fault that much bigger?
(In reply to Hannes Verschore [:h4writer] from comment #19)
> @Dan: I noticed you used an epsilon of 1e-10. 3d-morph is using 1e-13. Is
> the fault that much bigger?

Assuming the perfect answer is actually 2889.0, the existing code is already effectively using an epsilon of 1e-11, since some of the sizes were already expected to come out as 2889.0000000000055. With the fast-sincos patch, the last size (160) gets 2889.0000000000105, which pushes it slightly past 1e-11 to 1e-10.

(In reply to Vladislav Egorov from comment #16)
> That's not Taylor series, but most likely minimax. I can't check, because I
> don't currently have Mathematica, but something like this should work, if I
> didn't make mistake somewhere --
> MiniMaxApproximation[(Sin[Sqrt[x]]/(x*Sqrt[x])-1/x), {x, {-(Pi/4)^2,
> (Pi/4)^2}, 6, 0}].

I was previously unfamiliar with minimax approximations. Thanks for bringing this to my attention.

Also, I noticed that the fdlibm implementation of sin [0] uses yet a different series for a similar purpose, starting with -1.66666666666666324348e-01. Blindly plugging those values into the cephes code seems to produce slightly better overall results, but I haven't studied this in detail yet.

[0] http://netlib.org/fdlibm/k_sin.c

(In reply to Luke Wagner [:luke] from comment #13)
> (In reply to Dan Gohman [:sunfish] from comment #12)
> > The new code is a little less precise
> > overall, and it doesn't support values over pow(2, 30), but it's my
> > impression that this is considered an acceptable tradeoff here.
> 
> I'm unfamiliar with what is normal for trig impls; is there precedent for
> this in other impls?

The main precedent here is V8's new code. Cephes appears to be significantly more precise than what's now in V8.

Beyond that, it's not clear how much liberty the JS spec really intended to allow. Does it leave the door this wide open on purpose, or does it just assume that everyone would use the C libm or fdlibm (as suggested in the NOTE in 15.8.2) and consequently not feel the need to say anything more?
Vladislav was kind enough to explain some things to me via e-mail, and also linked this paper: http://www.research.scea.com/research/pdfs/RGREENfastermath_GDC02.pdf

After going through it, I've started generating constants using Mathematica's GeneralMiniMaxApproximation (allowing me to use the correct relative error function), and the results are somewhat different from others I've seen so far. I'll see if they give better error bounds before I post them here, but I wanted to give you a heads up.
As an update: I was finally able to prove empirically that fine-tuning the 'Bias' parameter in GeneralMiniMaxApproximation actually has a measurable effect on the result.

Across most of its range (from -0.49 to 0.99, at least) the *maximum* relative error is the same (around 2.22e-16, but consistent to 8 digits of precision or so - not worth tuning) when we compare the double precision result of the approximation to the idealized double precision result of sin(x).

However the *mean* absolute error (and the root mean square error) is measurably affected - going from around 9e-18 at extreme values to below 4.3e-18 at the optimum Bias. I think this optimum is the same Bias that minimizes the maximum relative error when using double precision coefficients but arbitrary precision calculations - something I can determine much more easily and to a higher degree of precision.

For a bit of context: To determine the mean absolute error, I'm doing Abs[1 - SetPrecision[approx[x], 64] / N[Double[Sin[SetPrecision[x,64]]],64]], where Double[] rounds to the nearest multiple of 2^(Floor[Log[2,Abs[x]]] - 52) and returns an arbitrary precision result (N[] doesn't do the correct rounding sometimes - I confirmed this with round-trip conversions).

This 64-digit precision result is often exactly 0, where the result of the approximation matches the result of the arbitrary precision sine function; that makes it very hard to integrate, because the ratio of zero to non-zero results is difficult to measure (NIntegrate falls on its face with anything but the "MonteCarlo" method, and is slower than doing it by hand).

For a given approximation, on each iteration I compute the arithmetic mean using a large number of randomly chosen points. By the central limit theorem, given enough independent samples they should be approximately normally distributed - so I fit them to the normal distribution using Mathematica's FindDistributionParameters[] and iterate until the mean is known to 1/1000 relative to its value, with 2 sigma confidence. Fully parallelized, this takes about 23 minutes per approximation (3 sigma confidence bumps that up to 46 minutes) on my Ivy Bridge clocked at 4.2GHz.

This is the fastest method I've found, so I'm glad it paid off. I'll narrow down the Bias as much as I can with this method, then see if it matches the optimal Bias from arbitrary precision computations like I think it does, and if so narrow it down from there until the double precision constants no longer change.
A slight correction: by "*mean* absolute error" I actually meant "the absolute value of the *mean* relative error". Also in case it wasn't clear, I should have an optimized set of coefficients for sin(x) later today or early tomorrow :) Doing the same thing for cos(x) should be trivial at this point.
Sorry about the delay, another update: I tested against the coefficients in attachment 8379086 [details] [diff] [review] and the others mentioned in comment #20, and found that while my coefficients were better than the former, the latter were far better than mine! I figured out that the result depends heavily on the way the relevant functions are expressed (e.g. a change of variables from x to y^(1/2)), and I've been experimenting with that. I *have* actually managed to improve on the coefficients mentioned in comment #20 now, though since there are now two variables in play (the Bias and z in x -> y^z) it's somewhat slow going.
Thanks for working on this!
Comment on attachment 8379086 [details] [diff] [review]
fast-sincos.patch

(In reply to Dan Gohman [:sunfish] from comment #25)
> Thanks for working on this!

No problem, I just hope it'll pay off :) I had a few comments about the patch:

> +  static const double m_4_pi = 4.0 / M_PI;
> +  int32_t i = static_cast<int32_t>(absx * m_4_pi);
> +
> +  // Integer and fractional part modulo one octant.
> +  int32_t quad_index = ((i + 1) >> 1) & 3;
> +  double y = static_cast<double>(i + (i & 1));
> +
> +  // Extended precision modular arithmetic
> +  double e0 = y * -7.85398125648498535156e-1;
> +  double e1 = y * -3.77489470793079817668e-8;
> +  double e2 = y * -2.69515142907905952645e-15;
> +  double z = absx + e0 + e1 + e2;

So you (effectively) divide the number by the size of an octant (45°), giving the index of the octant in the integer part. Then quad_index holds the quadrant, where quadrant 0 goes from -45° (or 315°) to 45° and so on. z = absx - y * Pi / 4 is then the remaining angle where Pi / 4 to Pi / 2 become -Pi / 4 to 0 (corrected by swapping sine and cosine).

I'd store 4 / Pi explicitly instead of leaving the division up to the compiler, so we can be sure it's rounded correctly (rounding to double precision and using the 21-digit form like the other constants, this would be 1.27323954473516276487). I'd also like to check the "Extended precision modular arithmetic" constants since I noticed there were sets in the original code for single precision, double precision and extended precision, and I think we want double precision here (I'd also like to calculate them myself). Finally, it might be good to ensure the subtractions in |z = absx + e0 + e1 + e2;| happen in the desired order - e.g. |z = ((absx + e0) + e1) + e2;|

I was also wondering: Does the limit really need to be 2^30? The biggest number for which we can unambiguously determine the octant is 7074237752028440.0 (1.27323954473516276487 * 7074237752028440.0 == 2^53 in double precision) - of course that doesn't leave much for the calculation of z, but there are many numbers larger than 2^30 that still have enough precision to reasonably calculate the sine or cosine for them. In addition, we only really need the least significant bits to calculate the octant, so you could do something like:

uint16_t i = static_cast<uint16_t>(static_cast<uint64_t>(absx * m_4_pi) & 0x7ULL);

It's a bit uglier and I don't know how much performance the 64-bit stuff will cost on 32-bit platforms (though it's just a cast and a bitwise AND), but it won't overflow and should be correct (though I haven't tested it).
Wait a minute, I'm stupid. You do need the higher bits because y uses them *facepalm*. So it would have to be

uint64_t i = static_cast<uint64_t>(absx * m_4_pi);

Nicer, but probably slower too.
It turns out I *was* missing something. The modular arithmetic constants are designed so that they have at least n trailing zero bits (except the third constant in each set). DP1sc has 30 trailing zeros, DP1 has 28 trailing zeros and the single precision DP2F has 13 trailing zeros (DP1F happens to have 3 more). This means that when multiplied with an integer, they give an exact result so long as the integer isn't too large.

For DP* (which are used in vdt), the largest integer that can be multiplied exactly is 341782641 (about 2^28.35). Since we round odd numbers up to the nearest even number, the largest valid value of i is 341782640. That puts the largest valid double precision floating point input at 4503599669691516 / 2^16 = 3.41782640999999940395e8.

For DP*sc (which are used in attachment 8379086 [details] [diff] [review]), the largest integer that can be multiplied exactly is 1367130616 (a little over 2^30). This translates to a maximum valid floating point input of 4503599844284045 / 2^22 = 1.07374187571622014046e9, so the limit in the patch is correct (slightly conservative if we're nitpicking :) ).

So the patch as written is correct, but we could push the limit up to 2^32 - 2 for free by making i into an uint32_t and using the following constants:

double e0 = y * -7.85398006439208984375e-1;  // 1647099 / 2^21
double e1 = y * -1.56958208208379801363e-7;  // 1380619 / 2^43
double e2 = y * -3.11168608594830669189e-14; // 4930663418217751 / 2^97

This would push the floating point input limit up to 4.29496815395995807648e9 (4503600527006717 / 2^20).
Uh, I got a bit caught up in my calculations there. The constants I suggested would raise the limit to 3.37325942455970764160e9 (221069929647945 / 2^16), not the value I gave in comment #28, because anything higher would exceed 2^32 - 2.
Attached patch fast-sincos.patch (obsolete) — Splinter Review
Here's an updated version of the patch. Changes from the previous version:

 - Use uint32_t and the new constants and increase the range check to 3.37325942455970764160e9, as described in comment 28 (and comment 29)
 - Replace 4 / M_PI with 1.27323954473516276487, as suggested in comment 26
 - Switch to the fdlibm constants for polevl_sin and polevl_cos.
 - Updated check-3d-cube.js to account for minor differences, following the advice of comment 13.

Current outstanding issues, collected from all the posts above:
 - I haven't yet benchmarked using uint64_t, as suggested in comment 26 and comment 27.
 - Comment 24 suggests that improvements on the fdlibm constants are possible.
 - I don't yet have an answer to the question at the end of comment 20, which is "how precise does JS actually want sin/cos to be?"

Comments, questions, testing, etc. all welcome :-).
Attachment #8379086 - Attachment is obsolete: true
(In reply to Dan Gohman [:sunfish] from comment #30)
>  - I haven't yet benchmarked using uint64_t, as suggested in comment 26 and
> comment 27.

I suggest disregarding this. The limit is mostly set by the modular arithmetic constants; while we could extend the range by using even more 0 bits in the constants, at some point we'd have to switch to using 4 constants or more (it's exponential), and we could never cover the full range anyway (7074237752028440.0 is more than 2^52) unless we switched to doing this step using 64-bit integer maths (which opens up a whole new can of worms).
Here's a new set of constants for cosine:

+  double ans = -1.13608562601317669453e-11; // -7032029422491539 / exp2(89)
+  ans *= x;
+  ans +=  2.08757485627641276222e-09; //  5047446288261708 / exp2(81)
+  ans *= x;
+  ans += -2.75573145560096088057e-07; // -5205429544687823 / exp2(74)
+  ans *= x;
+  ans +=  2.48015872902638771357e-05; //  7320136533844245 / exp2(68)
+  ans *= x;
+  ans += -1.38888888888755342166e-03; // -6405119470031880 / exp2(62)
+  ans *= x;
+  ans +=  4.16666666666666088426e-02; //  6004799503160653 / exp2(57)
+  return ans;

Unfortunately these don't actually improve the RMSE of the approximation much (as far as I'm able to determine, their RMSE is ~0.005% better) - this is because the approximation for cosine is one order higher than the approximation for sine, so even a mediocre set of constants in arbitrary precision will perform near optimally in double precision.

There's more to gain for sine - about 2%, going by what I had - but I lost the results I had to a crash :( I'll regenerate those now.

One thing worth noting: the code currently handles the first (or second, depending on how you look at it) coefficient of -0.5 specially, but there's no reason you couldn't pull that into polevl_cos as well. So you could add

+  ans *= x;
+  ans += -0.5;

to the end of polevl_cos and remove hzz altogether. For that matter, you could even put the final step in there as well with

+  ans *= x;
+  ans += 1.0;

and then just call polevl_cos directly. Perhaps that would be clearer for anyone looking at the code?
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #32)
> Here's a new set of constants for cosine:
[...]
> Unfortunately these don't actually improve the RMSE of the approximation
> much (as far as I'm able to determine, their RMSE is ~0.005% better) - this
> is because the approximation for cosine is one order higher than the
> approximation for sine, so even a mediocre set of constants in arbitrary
> precision will perform near optimally in double precision.

I tried out these numbers for cosine and am seeing a small improvement on [0,10000], but also a regression on [0,pi/2]. Numbers below.

> One thing worth noting: the code currently handles the first (or second,
> depending on how you look at it) coefficient of -0.5 specially, but there's
> no reason you couldn't pull that into polevl_cos as well. So you could add
> 
> +  ans *= x;
> +  ans += -0.5;
> 
> to the end of polevl_cos and remove hzz altogether. For that matter, you
> could even put the final step in there as well with
> 
> +  ans *= x;
> +  ans += 1.0;
> 
> and then just call polevl_cos directly. Perhaps that would be clearer for
> anyone looking at the code?

This actually changes the intermediate rounding, and it looks like a significant improvement! (and it also looks cleaner, and may even be faster). With the test harness from comment 6:

With values from 0 to 10000.0:

                                        | maxbitserr | average diffbits
  --------------------------------------+------------+-----------------
        V8, as implemented in comment 6 |     35     | 10.30454763
       VDT, as implemented in comment 6 |     2      | 0.25374226
                      The current patch |     2      | 0.25348941
    Same but with the new cos constants |     2      | 0.25348651
  Same but with the code simplification |     2      | 0.16708169

With values from 0 to pi/2:

                                        | maxbitserr | average diffbits
  --------------------------------------+------------+-----------------
        V8, as implemented in comment 6 |     4      | 2.11994594
       VDT, as implemented in comment 6 |     2      | 0.15812437
                      The current patch |     2      | 0.15643108
    Same but with the new cos constants |     2      | 0.1568469
  Same but with the code simplification |     1      | 0.07376456
(In reply to Dan Gohman [:sunfish] from comment #33)
> This actually changes the intermediate rounding, and it looks like a
> significant improvement!

Oh, nice! Yeah, I guess addition is left-to-right associative, so this changes things a bit. We can do the same thing for sine, turning it into

x (1 + x^2 (a + x^2 (b + x^2 (c + x^2 (d + x^2 (e + x^2 f))))))

So instead of multiplying by x and adding, we add and then multiply by x. This might preserve more bits of x in the result, although I'm not sure.

It's worth noting that using x^2 in the approximation is actually one of the main sources of error. Comparing Sin[x^2] to x * Sin[Sqrt[Double[x^2]]] / Sqrt[Double[x^2]] (so using the arbitrary precision root of a double precision square as in the approximation) shows one epsilon of error across the range, because we lose some bits by squaring. Of course there's no way around this, but it's good to keep in mind as a theoretical best.
As for the average, I think they *should* be the same for [0,10000] and [0,Pi/4] since it's the same range after range reduction, so maybe this is noise? I had a very hard time getting an RMSE consistent to more than 4 digits or so in Mathematica, but I imagine that doing it in c++ would be a lot faster.

Another explanation could be that I'm optimizing the RMS of the relative error in the arbitrary precision case (with constants rounded to double precision), whereas you're measuring the mean of the *absolute* error (the output of cosine is between sqrt(2) and 1, so epsilon is always the same). I can try optimizing for that instead since relative error doesn't actually make much sense for cosine, given that epsilon is the same across the range.

Oh, and testing in [0,10000] would use the sine approximation instead for 50% of the range, right? Optimizing for relative error does make sense for sine, though there's still a difference between using the mean and the root mean square (I'm not sure which is better here given that in double precision the error is going to be either 0 or epsilon).
Attached patch fast-sincos.patch (obsolete) — Splinter Review
Here's an updated patch which incorporates the cosine changes discussed in comment 32. I experimented with the corresponding sine changes, but they appear to hurt precision. I left the code in place under #if 0 in case anyone wants to see what I tried.

Happily, with these precision improvements, this patch no longer needs to loosen the epsilon values of any of the tests in the testsuite, and no longer requires changes to SunSpider.

I did some quick timing with the benchmark from comment 0. This patch is about 25% slower than the V8 code as translated in comment 6, but, as seen in comment 33 and now in the fact that no test changes are needed, it's much more precise.
Assignee: nobody → sunfish
Attachment #8388785 - Attachment is obsolete: true
I tried this with :sunfish beside me, and by bumping the value of N up by 100, got times that went from approximately 2700 down to about 1825 after applying the patch -- a reduction of about 30%.
Any further improvement that new sets of constants can bring is going to be very small at this point, and I don't want to hold this bug up any longer waiting for them, so I can file a followup if you'd like to just get this reviewed and landed.

Dan, what did you use to check the accuracy of each version? I'd like to run any new constants I come up with through that as well.
(In reply to Dan Gohman [:sunfish] from comment #36)
> I did some quick timing with the benchmark from comment 0. This patch is
> about 25% percent slower than the V8 code as translated in comment 6, but as
> seen in comment 33, and now the lack of need for test changes, it's much
> more precise.

If the v8 code is faster but less precise, then what is the benefit of firefox being more precise? It seems like web content will have to assume the worst precision anyhow, and more precise browsers will not actually help anything.

Perhaps the optimal thing would be to try to coordinate with chrome and other browsers on a mutually agreed-upon level of precision?
By the way, we could skip the range reduction code entirely if absx <= m_4_pi (and just use |z = absx;|) so long as we initialize quad_index to 0. Of course that would introduce a branch, and not necessarily a very predictable one - I don't know if code is likely to only use small angles, but code that uses angles in the range [-Pi,Pi] or [0,2Pi] is probably going to alternate a fair bit. So I don't know if this would be better or worse, but I wanted to mention it for completeness.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #38)
> Any further improvement that new sets of constants can bring is going to be
> very small at this point, and I don't want to hold this bug up any longer
> waiting for them, so I can file a followup if you'd like to just get this
> reviewed and landed.

Sounds good.

> Dan, what did you use to check the accuracy of each version? I'd like to run
> any new constants I come up with through that as well.

I'm using the test program linked from comment 6. I tested different ranges by modifying the values of topval and step.

(In reply to Alon Zakai (:azakai) from comment #39)
> (In reply to Dan Gohman [:sunfish] from comment #36)
> > I did some quick timing with the benchmark from comment 0. This patch is
> > about 25% percent slower than the V8 code as translated in comment 6, but as
> > seen in comment 33, and now the lack of need for test changes, it's much
> > more precise.
> 
> If the v8 code is faster but less precise, then what is the benefit of
> firefox being more precise? It seems like web content will have to assume
> the worst precision anyhow, and more precise browsers will not actually help
> anything.
> 
> Perhaps the optimal thing would be to try to coordinate with chrome and
> other browsers on a mutually agreed-upon level of precision?

I think having this patch in place will help discussions on this topic because it's a data point at an interesting place on the spectrum of speed versus precision.

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #40)
> By the way, we could skip the range reduction code entirely if absx <=
> m_4_pi (and just use |z = absx;|) so long as we initialize quad_index to 0.
> Of course that would introduce a branch, and not necessarily a very
> predictable one - I don't know if code is likely to only use small angles,
> but code that uses angles in the range [-Pi,Pi] or [0,2Pi] is probably going
> to alternate a fair bit. So I don't know if this would be better or worse,
> but I wanted to mention it for completeness.

As you say, this branch would mispredict a lot on some input sets, so I don't think we want to do this, but it is useful to keep in mind.
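For reference, the early-out being discussed can be sketched roughly like this. Everything here is hypothetical: the names are made up, std::sin/std::cos stand in for the patch's polynomial kernels, and pi/4 is assumed to be the intended threshold.

```cpp
#include <cmath>

struct SinCos { double s, c; };

// Hypothetical sketch of the proposed branch: skip range reduction
// entirely when |x| <= pi/4 (i.e. quad_index would be 0), at the cost
// of a possibly-mispredicted branch on other input distributions.
SinCos sincosSketch(double x) {
    const double PIO4 = 0.78539816339744830962;  // pi/4
    double absx = std::fabs(x);
    if (absx <= PIO4) {
        // The reduced argument is just absx; only the sign of sin
        // needs fixing up, since cos(-x) == cos(x).
        double s = std::sin(absx);
        double c = std::cos(absx);
        return { x < 0 ? -s : s, c };
    }
    // Otherwise fall through to the full range-reduction path.
    return { std::sin(x), std::cos(x) };
}
```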
Ok, let's call this ready to review :-).

:jorendorff gave me the idea to have this code fall back to libm sin and cos when the input range check fails, so we now support the full range of inputs (to the extent that libm does).

Also, I optimized the reflection code for a few more percentage points of speedup, and tweaked fast_sincos to use a struct return instead of returning values by pointer, since it's slightly faster on some platforms.
Attachment #8378730 - Attachment is obsolete: true
Attachment #8397345 - Attachment is obsolete: true
Attachment #8397940 - Flags: review?(jorendorff)
Attachment #8397940 - Flags: feedback?(emanuel.hoogeveen)
Comment on attachment 8397940 [details] [diff] [review]
fast-sincos.patch

Review of attachment 8397940 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsmath.cpp
@@ +319,5 @@
> + * nor does it take advantage of the standard's intent to permit JS to use the
> + * system C math library.
> + *
> + * The code carefully avoids branching, to avoid the cost of mispredictions
> + * either on random input sets or on input sets stradling a boundary condition

minorest of nits: "straddling"
Comment on attachment 8397940 [details] [diff] [review]
fast-sincos.patch

Looks good to me. I have one question regarding polevl_sin:

+static double polevl_sin(double z, double zz)
...
+  ans *= zz * z;
+  ans += z;
...
+  double q0_sin = polevl_sin(z, zz);

Would it change the result to just do |double q0_sin = z + z * polevl_sin(zz);| instead? This changes the rounding order from |ans * (zz * z)| to |(ans * zz) * z| (different from the two versions in attachment 8397345 [details] [diff] [review]), but I'm not sure whether one is worse than the other. Technically there's also |(ans * z) * zz| to try if you wanted to go that far :P
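The three associations being discussed can be compared directly in isolation. Each one rounds twice, so any pair can differ in the last bit but never by more than a few ulps. This is a standalone demonstration with made-up values, not the patch's actual code.

```cpp
#include <cmath>

// The three ways to associate ans * zz * z; each performs two
// roundings of the same real-number product, so results may differ
// in the final bit but stay within a couple of ulps of each other.
double orderA(double ans, double zz, double z) { return ans * (zz * z); }
double orderB(double ans, double zz, double z) { return (ans * zz) * z; }
double orderC(double ans, double zz, double z) { return (ans * z) * zz; }
```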
Attachment #8397940 - Flags: feedback?(emanuel.hoogeveen) → feedback+
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #44)
> Would it change the result to just do |double q0_sin = z + z *
> polevl_sin(zz);| instead? This changes the rounding order from |ans * (zz *
> z)| to |(ans * zz) * z| (different from the two versions in attachment
> 8397345 [details] [diff] [review]), but I'm not sure whether one is worse
> than the other. Technically there's also |(ans * z) * zz| to try if you
> wanted to go that far :P

I tried both alternatives, and they appear to make the results very slightly worse, so I'll stick with the current patch.
Some additional data points:

I ported the V8 code (as ported to C++ in comment 6) to jsmath.cpp to get a better comparison of the performance of the algorithms themselves, since V8's actual implementation is in JS. The V8 algorithm is about 15% faster on the microbenchmark in comment 0 than my latest patch here.

For precision, as measured by the test code in comment 6, the V8 algorithm's average number of differing bits on the input range [0,10] is 1.74, compared to 0.169 for the current patch here. On the input range [0,10000], the V8 algorithm gets 10.5, while the current patch gets 0.167.
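For anyone reproducing these numbers, a "differing bits" metric can be sketched as the bit length of the distance between the two results' integer representations. This is an assumption about what the comment 6 test program measures, not its actual code.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of a "differing bits" metric: the bit length of
// the ulp-distance between a and b, read through their IEEE 754 bit
// patterns. Assumes a and b share a sign (the integer ordering of
// double bit patterns breaks across zero).
int differingBits(double a, double b) {
    int64_t ia, ib;
    std::memcpy(&ia, &a, sizeof ia);
    std::memcpy(&ib, &b, sizeof ib);
    int64_t d = ia > ib ? ia - ib : ib - ia;
    int bits = 0;
    while (d) { ++bits; d >>= 1; }
    return bits;
}
```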
Blocks: 984018
jorendorff, ping
Flags: needinfo?(jorendorff)
Comment on attachment 8397940 [details] [diff] [review]
fast-sincos.patch

Review of attachment 8397940 [details] [diff] [review]:
-----------------------------------------------------------------

Very nice work. Sacrificing a teensy bit of precision for a whole lot of speed is what users will want here.

::: js/src/jsmath.cpp
@@ +343,5 @@
> +static double polevl_sin(double z, double zz)
> +{
> +  // Constants taken from fdlibm k_sin.c
> +  double ans = 1.58969099521155010221e-10;
> +  ans *= zz;

Indent 4 spaces, please.

@@ +919,5 @@
>  
>  double
>  js::math_sin_impl(MathCache *cache, double x)
>  {
> +    return math_sin_uncached(x);

It looks like math_sin_impl does not need to be exposed in jsmath.h. Please remove it and change math_sin to call math_sin_uncached directly (renaming as desired).

Same for cos.
Attachment #8397940 - Flags: review?(jorendorff) → review+
Flags: needinfo?(jorendorff)
It's also worth noting here that this implementation preserves the property that sin(-x) equals -sin(x), and cos(-x) equals cos(x) (NaN cases notwithstanding).

At this time, this property is not preserved in V8's implementation [0]. This is known to have broken at least one real-world codebase [1].

[0] http://code.google.com/p/v8/issues/detail?id=3089
[1] https://github.com/mbostock/d3/commit/bef5de751097c490332b5425415c59e71c8e480e
I just discovered that GLIBC's libm sin/cos got tremendously faster between 2.17 and 2.18, possibly with the help of changes like this [0]. The slow version is still widely used, as that's the version in the current Ubuntu release, for example.

The fast version in GLIBC is actually slightly faster than my patch on bench2d! I haven't studied the implementation in detail, but I did notice that it uses numerous branches, testing whether the input is in ranges like 2^-26 < x < 0.25, 0.25 < x < 0.855469, 0.855469 < x < 2.426265, and several others. This presumably allows it to go fast when it can, but I'm seeing it take about 1.5x as much time on random inputs (as measured by Luke's benchmark in comment 0) due to branch mispredictions.

I'm still planning to proceed with my current patch, as it will give us consistent performance across all platforms, versions of platforms, and input value distributions.

[0] https://sourceware.org/ml/libc-alpha/2013-05/msg01035.html
[1] https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/ieee754/dbl-64/s_sin.c;hb=HEAD
I'm a bit embarrassed to see that the benchmark I wrote in comment 0 is clownshoes.  My intention was to generate a bunch of values that kept cycling over [-3, 3] at slightly different offsets, but instead I effectively wrote "last += CONSTANT" so the array of inputs just contains a ton of giant values, so hugely non-representative.  Probably this doesn't negate the value of the patch, but it might be good to measure with something that stays in the common range of radians...
(also on Windows)
I imagine it's common for code to do

  for (i = 0; i < N; i++)
      signal[i] = ...expression containing Math.sin(k*i)...;

and thus end up well outside the range [-pi,pi]. I like that the branch in sunfish's patch is well-predicted until you get up to 3 billion or so. :)

But in any case, what GLIBC is doing (branching on values near zero) sounds like it would mispredict a lot in practice no matter what kind of application you have...
(In reply to Luke Wagner [:luke] from comment #51)
> I'm a bit embarrassed to see that the benchmark I wrote in comment 0 is
> clownshoes.  My intention was to generate a bunch of values that kept
> cycling over [-3, 3] at slightly different offsets, but instead I
> effectively wrote "last += CONSTANT" so the array of inputs just contains a
> ton of giant values, so hugely non-representative.  Probably this doesn't
> negate the value of the patch, but it might be good to measure with
> something that stays in the common range of radians...

I have tested performance on a variety of ranges. My patch has consistent performance on every range, up to its main limit. GLIBC 2.18 is faster on small ranges close to 0, but it's slower on [-3,3]. Bench2d's sin/cos inputs are in [0,2*pi], but about 97% are less than 1, and GLIBC 2.18 is about 6% faster on that benchmark. For several reasons discussed above, I think the patch is still what we want, but this is something we'll continue to watch.

(In reply to Jason Orendorff [:jorendorff] from comment #48)
> Indent 4 spaces, please.

Done.

> It looks like math_sin_impl does not need to be exposed in jsmath.h. Please
> remove it and change math_sin to call math_sin_uncached directly (renaming
> as desired).
> 
> Same for cos.

Done. I removed math_sin_impl and renamed math_sin_uncached to math_sin_impl, so it's now consistent with math_floor_impl and a few others.

For ideas on improving this implementation, please file new bugs. Thanks everyone for all your help! 

https://hg.mozilla.org/integration/mozilla-inbound/rev/8fa46ad24ecc
(In reply to Dan Gohman [:sunfish] from comment #54)
> For several reasons discussed above, I think the
> patch is still what we want, but this is something we'll continue to watch.

Yes, I'm definitely in favor, and glad to hear you've been measuring other ranges too.
(In reply to Dan Gohman [:sunfish] from comment #15)
> The current patch does slow down 3d-morph:
>   before: 0.016485148 seconds time elapsed                                  
> ( +-  0.02% )
>   after:  0.019115016 seconds time elapsed                                  
> ( +-  0.02% )
> 
> But, it speeds up math-partial-sums.js:
>   before: 0.020669752 seconds time elapsed                                  
> ( +-  0.02% )
>   after:  0.016817460 seconds time elapsed                                  
> ( +-  0.02% )
> 
> Given that we aren't fond of caching, can we call it even? :-)

On AWFY this regressed SS about 2.5% due to 3d-morph. There's no difference on partial-sums, did later changes affect this? Maybe we have a faster sin/cos on Mac?
IIRC, 3d-morph is the entire motivation for the math cache.  Before we do anything rash, I noticed that both 3d-cube and math-partial-sums contain trivial SinCos opportunities.  Perhaps the wins from bug 984018 would offset the loss on 3d-morph, allowing us to keep both our dignity and SS performance?
The current implementation saves a branch by computing both sin and cos at once - it might be faster to take a branch and only compute one or the other (though this would definitely make the implementation somewhat uglier). On the other hand, optimizing for cases that want both sin and cos could help a lot as well. I don't know that we would want to do both - caching the latest result of both cos and sin seems like the most straightforward fix, but I don't know how the math cache is implemented.
It seems like those branches would be static, based on the caller, so we could templatize the whole algorithm.
It's not quite that simple: you only know whether you want sin or cos after range reduction (even if the caller explicitly wants one or the other).

Right now the algorithm chooses in which order to return sin and cos right at the end, but I think it could be refactored to choose after range reduction and only calculate one or the other.
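A rough sketch of that refactoring, with hypothetical names: the quadrant index from range reduction determines both which kernel to evaluate and the sign of the result, so the choice can be made after reduction even when the caller only wants sin. The reduction below is deliberately naive (a single double-precision pi/2, fine for small arguments), and std::sin/std::cos stand in for the polynomial kernels.

```cpp
#include <cmath>

// Hypothetical sketch: compute only the kernel the quadrant requires.
// sin(z + q*pi/2) is  sin z, cos z, -sin z, -cos z  for q = 0..3.
double sinViaQuadrant(double x) {
    const double PIO2        = 1.57079632679489661923;  // pi/2
    const double TWO_OVER_PI = 0.63661977236758134308;  // 2/pi
    double absx = std::fabs(x);
    double n = std::floor(absx * TWO_OVER_PI + 0.5);
    double z = absx - n * PIO2;  // z roughly in [-pi/4, pi/4]
    int q = (int)std::fmod(n, 4.0);
    // Only now do we know which kernel the caller actually needs:
    double r = (q & 1) ? std::cos(z) : std::sin(z);
    if (q & 2)
        r = -r;
    return x < 0 ? -r : r;  // sin is odd
}
```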
It looks like this made emscripten-box2d slower on awfy, possibly also octane-box2d (octane is noisier, hard to tell). On my local machine I also see a slowdown on emscripten-box2d, 1.03 to 1.18 (15% slower). I guess the awfy bots and my machine might happen to have fast libc implementations of sin/cos?
Random idea: the default AWFY graphs could be an aggregate of all platforms so regressions for any one are easily seen without having to watch them all.
(In reply to Jan de Mooij [:jandem] from comment #56)
> On AWFY this regressed SS about 2.5% due to 3d-morph. There's no difference
> on partial-sums, did later changes affect this? Maybe we have a faster
> sin/cos on Mac?

In the process of tidying up after making changes for review feedback, I made what I thought was an innocuous change which actually disabled the optimization to choose MMathFunction nodes instead of generic MCall nodes. On my machine, this causes a slowdown on 3d-morph and math-partial-sums. I have now filed bug 994993 to track this (with a patch).
Depends on: 994993
https://hg.mozilla.org/mozilla-central/rev/8fa46ad24ecc
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
On AWFY, bug 994993 helped, but only on 64-bit, and not enough to recover the regression.

I'm now aware that the performance surveying I did in preparation for this bug relied too heavily on benchmarks which used (relatively) large and randomly distributed values, which made system libm implementations appear slower than they often are in practice. On code where the majority of values are very small and close together (97% of values are less than 1.0 in box2d's case), they are faster. Consequently, it now seems that this branchless Cephes-based implementation is not as much of a win in practice as was originally hoped.

I'm now considering the options. Re-introducing the math cache for sin/cos is one possibility; it would help 3d-morph, and it would also make the sincos optimization (bug 984018) much easier to implement. It's possible to compute the sin and cos polynomials in parallel using SIMD (rather than just relying on ILP); in a prototype this gained a few percentage points. Or, we may want to pick a different algorithm altogether.
It seems odd to me that it didn't help at all on 32-bit, but here's a few things we could try:

1) We could try what I suggested in comment #40 - skip past range reduction if absx <= 0.785398163397448278999 (Pi / 4, not m_4_pi as I suggested in comment #40). That should be pretty easy to try out, though we'd still calculate both sin and cos at the same time.

2) Orthogonally, we could refactor things so that fast_sincos takes a parameter saying whether to return sin or cos, then choose which one to calculate based on the results of range reduction (this would look a bit ugly, but..).

3) For cos specifically, we could see if we can get away with using one less coefficient. This would probably make the precision worse on average, but it might still be at most 1 bit off.

4) And finally as you've said, we could go back to caching the results, caching both sin and cos if we stick with the current design, or caching one or the other if we do refactor.
Also, it'd be good to measure against Windows builtin sin/cos in the [0, 1.0] range.
For testing the effects of a branch to skip range reduction, I expect the worst case range would be [0, Pi / 2] or [-Pi / 2, Pi / 2] since that would use range reduction 50% of the time.
(Just to be clear, I'm not saying we have to fix the 3d-morph regression completely: the math cache is pretty silly and may hurt more real-world workloads. If we can win a bit on partial-sums with the sincos optimization we can probably accept a small regression to remove an old hack, apparently V8 did the same.)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #66)
> It seems odd to me that it didn't help at all on 32-bit, but here's a few
> things we could try:
> 
> 1) We could try what I suggested in comment #40 - skip past range reduction
> if absx <= 0.785398163397448278999 (Pi / 4, not m_4_pi as I suggested in
> comment #40). That should be pretty easy to try out, though we'd still
> calculate both sin and cos at the same time.

I had originally sought to minimize branches, but it seems that this branch helps a lot on several important use cases, and doesn't hurt too much otherwise, so I may do this.

> 2) Orthogonally, we could refactor things so that fast_sincos takes a
> parameter saying whether to return sin or cos, then choose which one to
> calculate based on the results of range reduction (this would look a bit
> ugly, but..).

I'm experimenting with parallelizing the two polynomials with SIMD -- see bug 996375. On machines which support it, this reduces the overhead of computing both, and is less ugly :-}.

> 3) For cos specifically, we could see if we can get away with using one less
> coefficient. This would probably make the precision worse on average, but it
> might still be at most 1 bit off.

This sounds appealing, since it turns out that the cos side is doing one more mul+add than the sin side, which is especially annoying when they're being computed in parallel with SIMD :-). I tried a simple experiment to drop the first coefficient from cos, and I am seeing significant errors; Math.sin(3.93982011218085) is off by 12 bits, for example. Would you be interested in generating another set of coefficients for this purpose?
(In reply to Dan Gohman [:sunfish] from comment #70)
> Would you be
> interested in generating another set of coefficients for this purpose?

Certainly. In particular, I remember that when I tried this briefly, the coefficient for the 2nd order term turned out somewhat less than 1/2 (whereas in the current approximation it works better to set it to 1/2 and reduce the rest). I've almost finished optimizing the coefficients for sin - producing new ones for cos will take time, but I can probably give a rough set in a day or two to see if there's any hope of it being precise enough. Do you want a follow-up for that?
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #71)
> Certainly. In particular, I remember that when I tried this briefly, the
> coefficient for the 2nd order term turned out somewhat less than 1/2
> (whereas in the current approximation it works better to set it to 1/2 and
> reduce the rest). I've almost finished optimizing the coefficients for sin -
> producing new ones for cos will take time, but I can probably give a rough
> set in a day or two to see if there's any hope of it being precise enough.
> Do you want a follow-up for that?

Yes.
(In reply to Dan Gohman [:sunfish] from comment #72)
> (In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #71)
> > Certainly. In particular, I remember that when I tried this briefly, the
> > coefficient for the 2nd order term turned out somewhat less than 1/2
> > (whereas in the current approximation it works better to set it to 1/2 and
> > reduce the rest). I've almost finished optimizing the coefficients for sin -
> > producing new ones for cos will take time, but I can probably give a rough
> > set in a day or two to see if there's any hope of it being precise enough.
> > Do you want a follow-up for that?

Yes. My experiments suggest that we need this speedup (and others) for this algorithm to be competitive. If it's not precise enough and we can't do this we'll have to look at bigger changes, so an early estimate of the precision would be helpful.
Since the performance regressions are not yet resolved, I reverted this feature in anticipation of the release branch.
http://hg.mozilla.org/integration/mozilla-inbound/rev/ba2e9970b80f
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
https://hg.mozilla.org/mozilla-central/rev/ba2e9970b80f
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
(In reply to Carsten Book [:Tomcat] from comment #75)
> https://hg.mozilla.org/mozilla-central/rev/ba2e9970b80f

I think this was a backout - this should be a [leave open]. Feel free to remove this from the whiteboard whenever the new patch is ready to be landed to fix this.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [leave open]
If there are not any other lines of investigation which provide similar performance with a low error, perhaps we should start a conversation (on mozilla.dev.tech.js-engine.internals) about whether we should adopt a similar strategy to V8's.
From comment #46, the V8 algorithm is only about 15% faster than what landed here - and Dan's SIMD implementation might well make up that difference, while still being significantly more precise. I think the main problem here is that we're removing the (dumb) math cache for sin/cos at the same time, and the new algorithm isn't quite fast enough to make up the difference. Is that right, Dan?

Removing the math cache may be a goal in and of itself, but have we checked performance with the new algorithm if we cache the results like before? (we'd still be calculating both sin and cos and only caching the one we use, but the math cache isn't exactly sophisticated to begin with)
Dumb question: wouldn't a "smart" use of the math cache be to cache both sin & cos on look up of either, so the other is available for instances where both sin & cos are needed?
(In reply to Dave Garrett from comment #79)
> Dumb question: wouldn't a "smart" use of the math cache be to cache both sin
> & cos on look up of either, so the other is available for instances where
> both sin & cos are needed?

We are considering it; bug 984018 also discusses this. 

My tentative plan right now is to proceed with the SIMD patch (bug 996375, patch waiting for review), make SIMD work on platforms which need runtime feature detection, and then introduce a simple branch for small values that don't need range reduction. These together largely appear to make up the remaining difference in the testing I've done so far, though meaningful benchmarking is complex. If these specific changes end up insufficient, I'll drop this approach entirely and pursue other avenues mentioned above.

The cache does appear to be disproportionately represented in Sunspider, and I'm attempting to keep it in perspective. Many real applications use sin/cos in essentially uncacheable ways, so it's important to make the non-cached case fast, regardless of whether we also have a cache for the benefit of Sunspider or making the sincos optimization easier to implement or anything else.
Some major news on this topic:

TC39 discussed the topic of floating-point accuracy at their recent meeting. The consensus seems to be that JavaScript should start specifying accuracy requirements for its Math library functions. The details are not yet decided, but it seems that the desire is to be around 1 ulp for all functions, if not better.

Even though we don't have new rules defined yet, V8 has already switched from their fast sin/cos implementation to an fdlibm-derived sin/cos implementation. It is within 1 ulp, though it is also significantly slower than the previous implementation.

A 1-ulp requirement would also rule out the patch attached to this bug as it currently stands. We could potentially refine it to meet this requirement. I haven't yet decided whether this makes sense.

I am currently participating in an effort to develop a set of rules to propose for the standard. If anyone here has any opinions about what should be done, let's talk about them.
The ideal solution would probably to add an optional second argument for these functions to the standard to specify a desired precision. Set a reasonable default, but then things that need best speed or best precision could request the implementation bias they want.
Does 'within 1 ulp' mean the error should be less than 1 bit, or is a 1 bit error acceptable? If we use the older coefficients for cos, we can get that.

If we need perfect double precision, the current patch can't work because squaring the angle introduces an error, but I was wondering if we could use a correction based on the angle sum identities:

sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
cos(a + b) = cos(a)cos(b) - sin(a)sin(b)

If b is extremely small, this should essentially turn into:

sin(a + b) = sin(a) + b cos(a)
cos(a + b) = cos(a) - b sin(a)

We already calculate both sin and cos, so aside from determining 'b' this would simply add another multiply and add. To make the square exact, perhaps we could clear out the (26?) least significant bits of the angle, and set b to the remainder. I'm not sure what the fastest way to do this would be, and I haven't tried it to see if it gets rid of the error.

Searching around the web for a bit I didn't find any other ways to refine existing approximations, though they might exist.
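A minimal sketch of the split-and-correct idea above. The names are hypothetical, std::sin/std::cos stand in for whatever produces sin(a) and cos(a) (a table or polynomial), and clearing 27 low bits is one choice that makes b small enough for the first-order correction to be accurate.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: split x into a truncated angle a (low mantissa
// bits cleared) and an exact remainder b, then apply the first-order
// angle-sum correction  sin(a + b) ~= sin(a) + b*cos(a).
double refinedSin(double x) {
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= ~((uint64_t(1) << 27) - 1);  // zero the 27 low mantissa bits
    double a;
    std::memcpy(&a, &bits, sizeof a);
    double b = x - a;  // exact: a's bits are a prefix of x's
    return std::sin(a) + b * std::cos(a);
}
```

The neglected term is on the order of b*b/2, which for |x| around 1 is far below one ulp of the result, so the correction itself doesn't limit precision.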
(In reply to Dave Garrett from comment #82)
> The ideal solution would probably to add an optional second argument for
> these functions to the standard to specify a desired precision. Set a
> reasonable default, but then things that need best speed or best precision
> could request the implementation bias they want.

It's an interesting idea. I'll bounce it off some people and see what kind of reaction it gets.

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #83)
> Does 'within 1 ulp' mean the error should be less than 1 bit, or is a 1 bit
> error acceptable?

"within 1 ulp" permits some 1 bit errors. If we define ulp such that when the infinitely-precise result is between consecutive finite floating-point values a and b, 1 ulp is b - a, then both a and b are valid results within 1 ulp.

> Searching around the web for a bit I didn't find any other ways to refine
> existing approximations, though they might exist.

One of the main data points in favor of "correctly rounded" results is crlibm.
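To make that definition concrete, the size of one ulp at a finite double can be computed with std::nextafter. This is a helper for illustration only, not part of any proposal.

```cpp
#include <cmath>

// One ulp at a (for finite, non-maximal a): the gap between a and the
// next representable double toward +infinity.
double ulpAbove(double a) {
    return std::nextafter(a, INFINITY) - a;
}
```

With this helper, "within 1 ulp" means the returned value is one of the two representable doubles bracketing the infinitely-precise result.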
(In reply to Dan Gohman [:sunfish] from comment #84)
> (In reply to Dave Garrett from comment #82)
> > The ideal solution would probably to add an optional second argument for
> > these functions to the standard to specify a desired precision. Set a
> > reasonable default, but then things that need best speed or best precision
> > could request the implementation bias they want.
> 
> It's an interesting idea. I'll bounce it off some people and see what kind
> of reaction it gets.

This probably won't be web-compatible, unfortunately. Consider code that uses Math functions as callbacks for higher order functions that pass in additional arguments:

values = [0, 0.25, 0.5, 1];
sinValues = values.map(Math.sin);

This performs these Math.sin calls:
Math.sin(0, 0, values);
Math.sin(0.25, 1, values);
Math.sin(0.5, 2, values);
Math.sin(1, 3, values);
(In reply to Till Schneidereit [:till] from comment #85)
> (In reply to Dan Gohman [:sunfish] from comment #84)
> > (In reply to Dave Garrett from comment #82)
> > > The ideal solution would probably to add an optional second argument for
> > > these functions to the standard to specify a desired precision. Set a
> > > reasonable default, but then things that need best speed or best precision
> > > could request the implementation bias they want.
> > 
> > It's an interesting idea. I'll bounce it off some people and see what kind
> > of reaction it gets.
> 
> This probably won't be web-compatible, unfortunately. Consider code that
> uses Math functions as callbacks for higher order functions that pass in
> additional arguments:

Hmm... here's a more generalized route: add a Math.precision() method. It would take as its argument a precision in some way (units or constants on Math obj). It would return a variant of the Math object with all of the same methods, however these versions would use the given precision. For example:
Math.precision(n).sin(v)

Instead of changing arguments, this just lets you get variants of the Math object with arbitrary precisions.
I appreciate the ideas. However I've now talked with a variety of people, and the sentiment seems to be that the standard should just have one sine, which prefers accuracy over speed (provided it's within reason), and has no configuration knobs, at least for now. This is what the standard previously intended to have, and the present desire is primarily to close what is seen as a loophole.

Fully configuring sine implementation is complex. In addition to a maximum-ulp value, one might want to specify whether the function is required to be monotonic over the ranges where one would expect it, whether the function can return values outside of [-1,1], and so on. These requirements are useful, but exclude some approximation algorithms, and some applications may not require them.

JavaScript gives programmers enough tools to write quite good and fast implementations of sine and other functions themselves, and hopefully third-party libraries will be developed that most people can use, offering their own tradeoffs. This gives more flexibility than any standard-specified API could offer.
One thing we could do is add a single precision variant that uses fewer coefficients (but still uses double precision internally), then call that when someone does |Math.fround(Math.sin(x))|. Getting a 0-ULP single precision result using a shorter double precision calculation (3 or 4 coefficients rather than 6 or 7) should be easy.

Using single precision inside such a variant would presumably be even faster, but we'd probably get 1-ULP single precision errors.
If we could design our sin/cos algorithms to be float-commutative, that'd be *fantastic*.
Note, bug 1076670 just had to add a branch on win64 sin so this would be an extra boost there too.
Depends on: 996375
No longer blocks: 984018
FWIW, I have optimal sets of coefficients for up to and including a 15th order approximation (for sine). Unfortunately this approach always loses at least 1 bit of precision by using the double precision square of the input - and my naive attempts at fixing this (using a linear approximation for the less significant bits) didn't work. If we want that final bit, we need some sort of fast refinement step, or try a different approach.

As Dan pointed out, it looks like other approaches seen in standard libraries use a number of branches, but still manage to perform at a similar level to the approach here - or faster in some cases - while being fully precise. They also tend to be much more complicated however, so if we ever wanted to completely inline the computation in Ion, we'd probably want an implementation closer to the one proposed here (in particular the SIMD version).
I rebased the patch to do more performance testing on it.
These are the optimized coefficients for approximations of sine and cosine, up to 15th order and 14th order respectively (giving 7 coefficients for each).

I never got around to checking the number of error bits on these, but some of the lower order approximations might be useful for implementing sin and cos when only single precision output is needed (calculations should still be done in double precision, but need not use the highest order approximations).

These coefficients minimize the maximum error across the input range [0,π/4], relative to the precision of the IEEE 754 double precision output (for cosine this is simply the absolute error times a constant, as the output range is [√½,1]). The sign of the error alternates across the range, as these optimizations were unconstrained.

Unfortunately it is not possible to produce exact double precision answers using this method - even the highest order approximations will have 1-bit errors, because using the double precision square of the input doesn't preserve all the information. But these coefficients should minimize the error from other sources.
Benoit: As of bug 984018 I believe we cache both sin and cos when a joint sincos implementation is available. In addition, the SIMD implementation from bug 996375 should be faster than the latest patch attached to this bug. For performance testing, both should be taken into account.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #93)
> even the highest order approximations will have 1-bit
> errors, because using the double precision square of the input doesn't
> preserve all the information. But these coefficients should minimize the
> error from other sources.

I was wrong: this is not caused by rounding errors in *any* stage of the calculation. I did an experiment with the double-double library, comparing its built-in implementation to one using the coefficients I've proposed. Using the highest order approximation, it gets 62 digits of precision - but rounding both results to double precision first, it *still* shows 1-bit errors. So these 1-bit errors are actually 0.5-bit errors in edge cases where a *lot* of extra precision is required for accurate rounding. There's no getting rid of them without a quadruple precision approximation, and I suspect all OS standard libraries display them (Visual Studio certainly does).

I also did some digging to see if there's any kind of iterative refinement that can be done. I found two schemes, CORDIC and BKM, but they both rely on tracking the approximation of the angle being used, so they can't be used to refine an *existing* approximation. They *can* be used to reduce the range, so a lower order polynomial approximation can be used; but CORDIC seems to accumulate rounding errors pretty quickly, so I'm not sure it's worth it. I haven't managed to implement BKM so far - the papers describing it are rather heavy on mathematical notation and don't summarize the algorithm very well.

To summarize: the algorithm proposed here is as precise as we can reasonably get. While rounding errors in the calculation may increase the *number* of 1-bit errors, they don't introduce 2-bit errors, and it's not possible to eliminate 1-bit errors entirely anyway.

The only improvements possible here are to the performance (it seems branching may not be overly slow) and to the angle reduction (which is good, but doesn't work across the full range).
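On the reduction point: the standard trick for moderate |x| is Cody-Waite style reduction, sketched below under the simplifying assumption that k times the high part of pi/2 stays accurate enough (production code splits pi/2 into parts with trailing zero bits so that product is exact, and falls back to Payne-Hanek reduction for huge arguments — which is exactly the "doesn't work across the full range" caveat).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch of two-term argument reduction: x = k*(pi/2) + r, |r| <= pi/4.
// pi/2 is applied as a high part plus a low correction so the second
// subtraction recovers bits the first one would otherwise lose.
double reduce_pio2(double x, int* quadrant) {
    const double two_over_pi = 0.636619772367581343;  // 2/pi
    const double pio2_hi = 1.5707963267948966;        // pi/2 rounded to double
    const double pio2_lo = 6.123233995736766e-17;     // pi/2 - pio2_hi
    double k = std::nearbyint(x * two_over_pi);
    *quadrant = (int)((int64_t)k & 3);
    double r = x - k * pio2_hi;  // heavy cancellation: k*pio2_hi is close to x
    r -= k * pio2_lo;            // fold in the low bits of pi/2
    return r;
}
```

After reduction, sin(x) is recovered from sin(r) or cos(r) with a sign depending on the quadrant index.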
One addendum to that: Using error compensation during the polynomial approximation *does* make just enough of a difference that we could use 6 coefficients for cosine instead of 7 (matching the 6 we need for sine). Of course, the overhead of performing such error compensation almost certainly isn't worth it compared to just using one more coefficient.
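For reference, the compensation alluded to there is built from error-free transformations of sums — e.g. Knuth's branch-free TwoSum, shown as a generic sketch (not code from any patch in this bug):

```cpp
#include <cassert>

// Knuth's TwoSum: s is the rounded sum of a and b, e is the exact
// rounding error, so a + b == s + e holds exactly. Requires strict
// IEEE 754 arithmetic (breaks under -ffast-math style optimizations).
void two_sum(double a, double b, double* s, double* e) {
    *s = a + b;
    double t = *s - a;
    *e = (a - (*s - t)) + (b - t);
}
```

Threading such a compensation term through a Horner evaluation is what buys the extra fraction of a bit, at the cost of several extra operations per coefficient — which is why one more plain coefficient is usually the better trade.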
I'm seeing:

v8 (3.3): 119
node (4.4.4): 70
sm: 39

Is it an improvement on our side, or a regression on their side?
Flags: needinfo?(sunfish)
I don't believe there have been any changes on our side.
Flags: needinfo?(sunfish)