Use division directly for generating double in XorShift128PlusRNG

RESOLVED FIXED in Firefox 45

Status

Core :: MFBT
RESOLVED FIXED
Opened 2 years ago
Closed 2 years ago

People

(Reporter: xidorn, Assigned: xidorn)

Tracking

Version: unspecified
Target Milestone: mozilla45

Firefox Tracking Flags

(firefox45 fixed)

Attachments

(2 attachments, 1 obsolete attachment)

(Assignee)

Description

2 years ago
Created attachment 8692263 [details]
benchmark test

cppreference.com notes that std::ldexp is inefficient in many implementations, and according to my benchmark, switching to using division directly gains a ~60% performance improvement with -O3 on g++ and clang, and an ~80% improvement with a release build on VS2015.

See the two programs in the attachment.
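(For reference, a minimal sketch of the two conversions being compared; the standalone function names and kMantissaBits = 53 are illustrative assumptions, not the attachment's exact code.)

  #include <cmath>
  #include <cstdint>

  // Both variants map a 53-bit random mantissa to a double in [0, 1).
  static const int kMantissaBits = 53;  // assumed, matching an IEEE-754 double

  // Current approach: scale by 2^-53 via std::ldexp.
  double toDoubleLdexp(uint64_t aMantissa) {
    return std::ldexp(static_cast<double>(aMantissa), -kMantissaBits);
  }

  // Proposed approach: divide by 2^53; since the divisor is an exact power
  // of two, compilers can turn the division into a multiplication by 0x1p-53.
  double toDoubleDivision(uint64_t aMantissa) {
    return double(aMantissa) / (1ULL << kMantissaBits);
  }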
(Assignee)

Comment 1

2 years ago
Created attachment 8692266 [details]
MozReview Request: Bug 1228182 - Use division directly for generating double in XorShift128PlusRNG.

Bug 1228182 - Use division directly for generating double in XorShift128PlusRNG.
Attachment #8692266 - Flags: review?(jwalden+bmo)
Comment 2

(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #1)
> Created attachment 8692266 [details]
> MozReview Request: Bug 1228182 - Use division directly for generating double
> in XorShift128PlusRNG.
> 
> Bug 1228182 - Use division directly for generating double in
> XorShift128PlusRNG.

Is this faster or slower than bug 1206356 comment 8?
Flags: needinfo?(quanxunzhen)
(Assignee)

Comment 3

2 years ago
That one doesn't produce the same result as the original algorithm.

With g++, that one is actually even slower than the original algorithm, but with clang, it is ~9% faster than mine on my Mac, and with VS2015, it is ~15% faster.

I guess that indicates this could be further optimized. But I'm not quite comfortable using such a magic number, and given that everything there is already constexpr, I think the compilers should be able to do all the optimization that can be done without affecting the final result.
Flags: needinfo?(quanxunzhen)
Comment 4

Thanks for doing this!

I'm looking into using this RNG for Math.random and thought the ldexp seemed more complicated than necessary.

Note that V8 does the following:

  static inline double ToDouble(uint64_t state0, uint64_t state1) {
    // Exponent for double values for [1.0 .. 2.0)
    static const uint64_t kExponentBits = V8_UINT64_C(0x3FF0000000000000);
    static const uint64_t kMantissaMask = V8_UINT64_C(0x000FFFFFFFFFFFFF);
    uint64_t random = ((state0 + state1) & kMantissaMask) | kExponentBits;
    return bit_cast<double>(random) - 1;
  }

What do you all think of that version? No division or multiplication, but I'm not a FP expert.
(Assignee)

Comment 5

2 years ago
Created attachment 8692513 [details]
benchmark test 2

It seems the V8 version is the fastest with my local compilers, G++ 5.2.0 and LLVM 7.0.0. It gives a ~14% improvement over mine with clang, and ~1% with g++.

Also note that the result from the V8 version differs very slightly (<0.005%), which might indicate some precision loss (given that our current impl is almost identical to bit manipulation, which shouldn't have any precision loss in theory). I have no idea what impact that has, though.

The attachment is my simple benchmark test. You can execute "./run_test.py c++" to run it. (It requires Python 3.3+.)

This test runner randomly picks two 64-bit integers at the beginning and feeds them to every test program.

For each test, it prints the accumulated result generated by the test program, and then the time it consumes in seconds.

* test0 is our current impl with ldexp
* test1 is the impl I proposed
* test2 is from bug 1206356 comment 8
* test3 is the V8 impl provided in comment 4

Someone may be able to migrate this test to Windows to see what's going on there. In my experience, we may need to reduce the number of iterations with the VS compiler; otherwise it might take too long.
Attachment #8692263 - Attachment is obsolete: true
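(The attachment itself isn't reproduced here; below is a minimal, self-contained sketch of the kind of per-candidate program the runner described above would time. The seed handling, the xorshift128+ step, and kMantissaBits = 53 are assumptions for illustration, not the attachment's exact code.)

  #include <chrono>
  #include <cstdint>
  #include <cstdio>
  #include <cstdlib>

  // Candidate conversion under benchmark; the division variant is used here,
  // and test0/test2/test3 would be swapped in at this point.
  static double toDouble(uint64_t state0, uint64_t state1) {
    static const int kMantissaBits = 53;  // assumed value
    uint64_t mantissa = (state0 + state1) >> (64 - kMantissaBits);
    return double(mantissa) / (1ULL << kMantissaBits);
  }

  int main(int argc, char** argv) {
    // The runner feeds the same two randomly picked 64-bit seeds to every
    // test program.
    uint64_t state0 = argc > 1 ? std::strtoull(argv[1], nullptr, 0) : 1;
    uint64_t state1 = argc > 2 ? std::strtoull(argv[2], nullptr, 0) : 2;

    double sum = 0.0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) {
      // xorshift128+ state update (shift constants from Vigna's paper; the
      // MFBT header may use a different published set).
      uint64_t s1 = state0;
      const uint64_t s0 = state1;
      state0 = s0;
      s1 ^= s1 << 23;
      state1 = s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26);
      sum += toDouble(state0, state1);
    }
    std::chrono::duration<double> seconds =
        std::chrono::steady_clock::now() - start;
    // Print the accumulated result, then the time consumed in seconds.
    std::printf("%.6f %.3f\n", sum, seconds.count());
    return 0;
  }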
Comment 6

xidorn, thanks for testing that.

We should probably just go with the division for now. That's what my patch to inline this RNG in JIT code (bug 322529) uses and it's significantly faster than what we had before anyway :)
Blocks: 322529
Comment 7

Probably worth noting that MSVC 2013 doesn't support constexpr - MOZ_CONSTEXPR expands to nothing for _MSC_VER < 1900 [1]. IIRC MSVC is a bit more generous with what it computes at compile time than gcc or clang, so it might not be an issue. The release version of MSVC 2015 does support constexpr (and should pass that check).

[1] https://dxr.mozilla.org/mozilla-central/source/mfbt/Attributes.h#55
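(For reference, the macro in mfbt/Attributes.h has roughly the following shape; this is a paraphrased sketch rather than the exact source, and the MOZ_HAVE_CXX11_CONSTEXPR feature check is not set for _MSC_VER < 1900.)

  // Paraphrased sketch of mfbt/Attributes.h: MOZ_CONSTEXPR is constexpr only
  // when the compiler's constexpr support has been detected.
  #if defined(MOZ_HAVE_CXX11_CONSTEXPR)
  #  define MOZ_CONSTEXPR constexpr
  #else
  #  define MOZ_CONSTEXPR /* no support */
  #endif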
Attachment #8692266 - Flags: review?(jwalden+bmo) → review+
Comment 8

Comment on attachment 8692266 [details]
MozReview Request: Bug 1228182 - Use division directly for generating double in XorShift128PlusRNG.

https://reviewboard.mozilla.org/r/26247/#review24177

::: mfbt/XorShift128PlusRNG.h:14
(Diff revision 1)
> +#include "mozilla/Attributes.h"

This should be alphabetized with the other includes.

::: mfbt/XorShift128PlusRNG.h:78
(Diff revision 1)
> -    return ldexp(static_cast<double>(mantissa), -kMantissaBits);
> +    return double(mantissa) / (1ULL << kMantissaBits);

Use <inttypes.h> (added immediately before <stdint.h> -- or supplanting it, if you want, as <inttypes.h> is defined to include the other) and UINT64_C(1), not 1ULL.

::: mfbt/XorShift128PlusRNG.h:78
(Diff revision 1)
> -    return ldexp(static_cast<double>(mantissa), -kMantissaBits);
> +    return double(mantissa) / (1ULL << kMantissaBits);

Kinda sad compilers seemingly don't optimize ldexp the way you'd expect.
(Assignee)

Comment 9

2 years ago
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #8)
> Comment on attachment 8692266 [details]
> MozReview Request: Bug 1228182 - Use division directly for generating double
> in XorShift128PlusRNG.
> 
> ::: mfbt/XorShift128PlusRNG.h:78
> (Diff revision 1)
> > -    return ldexp(static_cast<double>(mantissa), -kMantissaBits);
> > +    return double(mantissa) / (1ULL << kMantissaBits);
> 
> Kinda sad compilers seemingly don't optimize ldexp the way you'd expect.

Well, as mentioned in bug 1206356 comment 8, ldexp needs to handle some edge cases. On the other hand, a double-precision multiplication only costs 5 cycles (and conversion from uint64_t to double seems to cost 4 cycles) on a modern x64 processor according to Intel's manual, which is hard to beat considering the complexity of the floating-point format.

There is one possible optimization, as shown in comment 5, which should cost only one double-precision subtraction plus a cheaper conversion (according to the comment in bit_cast, just a load/store pair). That can be faster, but apparently not by much. In addition, the subtraction potentially loses some precision (which might, though probably unlikely, reduce the statistical randomness; I don't really know), so I wonder whether that optimization is worth doing.
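(As an aside on the "load/store pair" remark: a bit_cast can be written portably along the following lines; this is an illustrative stand-in rather than V8's or MFBT's actual helper, and compilers typically lower the memcpy to a register move or a single store/load pair.)

  #include <cstdint>
  #include <cstring>

  // Reinterpret the bits of a uint64_t as a double without violating strict
  // aliasing rules; the memcpy is usually optimized into a register move.
  static inline double BitCastToDouble(uint64_t aBits) {
    static_assert(sizeof(double) == sizeof(aBits), "size mismatch");
    double d;
    std::memcpy(&d, &aBits, sizeof(d));
    return d;
  }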
(Assignee)

Comment 10

2 years ago
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #8)
> Comment on attachment 8692266 [details]
> MozReview Request: Bug 1228182 - Use division directly for generating double
> in XorShift128PlusRNG.
> 
> https://reviewboard.mozilla.org/r/26247/#review24177
> 
> ::: mfbt/XorShift128PlusRNG.h:78
> (Diff revision 1)
> > -    return ldexp(static_cast<double>(mantissa), -kMantissaBits);
> > +    return double(mantissa) / (1ULL << kMantissaBits);
> 
> Use <inttypes.h> (added immediately before <stdint.h> -- or supplanting it,
> if you want, as <inttypes.h> is defined to include the other) and
> UINT64_C(1), not 1ULL.

Could you explain the reason behind doing so? Should I also change the line above this one to use UINT64_C?
Flags: needinfo?(jwalden+bmo)
Comment 11

(In reply to Xidorn Quan [:xidorn] (UTC+10) from comment #10)
> Could you explain what's the reason behind doing so?

Mostly that we're dealing explicitly with 64-bit integral math and types here, so using explicitly 64-bit things is good.  ULL is whatever unsigned long long happens to be, which is not necessarily 64-bit.

> Should I also change the line over this line to use UINT64_C?

Yeah.  Actually, that line in particular is working with 64-bit stuff and makes even *more* sense to so modify.
Flags: needinfo?(jwalden+bmo)
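(So the reviewed line presumably ends up along these lines; this is a sketch only, with kMantissaBits = 53 assumed and the surrounding function purely illustrative.)

  #include <inttypes.h>  // defined to include <stdint.h>; provides UINT64_C

  static const int kMantissaBits = 53;  // assumed value

  // Shift an explicitly 64-bit constant rather than 1ULL, whose width is
  // whatever unsigned long long happens to be.
  double mantissaToDouble(uint64_t aMantissa) {
    return double(aMantissa) / (UINT64_C(1) << kMantissaBits);
  }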
(Assignee)

Comment 12

2 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/2d2e1828198272efcc76d0a62b54cfd146143534
Bug 1228182 - Use division directly for generating double in XorShift128PlusRNG. r=Waldo
(Assignee)

Updated

2 years ago
Assignee: nobody → quanxunzhen

Comment 13

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/2d2e18281982
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox45: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla45