Closed Bug 578916 Opened 12 years ago Closed 11 years ago

JM: Cache results of Math.sin()

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: sstangl, Assigned: luke)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 3 obsolete files)

During the discussion of bug 578912, gal suggested caching the results of Math.sin() and possibly other fast natives. Dvander requested a bug per fastnative being analyzed. There may be many like this bug, but this one is sin().

Argument counts for Math.sin() are attached. It looks like 216,000 [120*1800] values are meaningful duplicates, while 16,000 are mostly unique.
SunSpider-0.9.1 contains three tests that call Math.sin(). They naturally have vastly different behavior:

3d-cube has the following arguments to Math.sin() [this is the full output]:
    204 0.087266
    204 0.052360
    204 0.017453

3d-morph has 120 different possible arguments; each argument is repeated 1800 times. Math.sin() is called 120 times between each argument seeing its next repeat; the arguments are thus perfectly cyclic.

math-partial-sums contains unique arguments that cannot be cached.
JSC did this. I think they got a 1 ms speedup or so.
This bug is funny.

I think Bigfoot (sayrer's benchmark suite) and JSMeter will not repeat the funny, though.

/be
I dug a little deeper and found we are giving up about 7 ms on 3d-morph due to this. So I guess it matters after all. It's still funny, though. :-D

JM wanted SS win: 7 ms
Blocks: JaegerSpeed
What would be done for 3d-morph?  A cycle of size 120 is hard to cache simply.
The main part of the JSC win might not be due to caching. Maybe our native calls are too slow or something. We should check on the data again after the current work on calls is done to see if there is anything left to do.
I thought native calls didn't suffer the same overheads as scripted calls?
(In reply to comment #7)
> I thought native calls didn't suffer the same overheads as scripted calls?

I'm not sure, but I do believe they are at least better currently in JM. Pretty much all I know about this is what's in comment 2 and comment 4: I know that JSC got about a 1 ms win from doing the caching part, and that when I commented out the call to sin on 3d-morph, then we gained about 7 ms relative speed (i.e., commenting it out in the test gave us 7+X ms gain, and JSC just X ms gain). I don't know if that's still true, and if so, why. If you can figure out what's really going on here, that would definitely help.
This optimization doesn't seem completely fake, I can imagine real codes that call sin/cos/etc for the same value.  I'll take a stab.
Assignee: general → lw
Attached patch WIP (obsolete) — Splinter Review
This just does the simple thing and does caching in math_sin (like JSC).  That alone makes 3d-morph 61% faster (7ms).  It could go even faster by pushing the cache hit to the IC.  Also, it seems like The Right Thing to do this for all the trig functions.  With a 4KB table each, that could get wasteful, so I was thinking about using just 1 table and including the trig function in the key...
Attached patch patch (obsolete) — Splinter Review
With this patch, 3d-morph goes from 16.9ms to 9ms with -j -m.  For reference, jsc gets 12ms.  (-m gives 10.5ms.  Part of the speedup may be that our hash function gives a 99% hit rate on 3d-morph compared to 89% for jsc)

This patch puts the cache in the thread-data (to avoid race conditions) and caches sin,cos,tan,log,asin,acos,atan.  Instead of giving each function its own cache, the function is used in the cache key so all share one.  Since (without the function key) each cache is 65KB (4K entries * 2 * sizeof(double)), this should save a lot of space.  (I suspect this is why jsc only caches sin.)

Also relevant: while this doesn't measurably slow down any other Math-using benchmarks, if you make a micro-bench that pounds on sin/cos/etc (without ever using the same value), there is an 8% slowdown, due to the hashing/cache-touching.
Attachment #480819 - Attachment is obsolete: true
Attachment #481599 - Flags: review?(jwalden+bmo)
math_exp_tn needs to hit the cache too, I assume?

Also, this looks like doing { sin(x); cos(x) } in a loop will always miss, right?  Probably fine, but just checking.
Attached patch add missing math-exp cases (obsolete) — Splinter Review
(In reply to comment #12)
> math_exp_tn needs to hit the cache too, I assume?

Thanks, yes.

> Also, this looks like doing { sin(x); cos(x) } in a loop will always miss,
> right?  Probably fine, but just checking.

Good point!  We could throw some bits of the function-ptr into the hash...
Attachment #481599 - Attachment is obsolete: true
Attachment #481599 - Flags: review?(jwalden+bmo)
Attachment #481603 - Flags: review?(jwalden+bmo)
Attachment #481603 - Attachment is patch: true
Attachment #481603 - Attachment mime type: application/octet-stream → text/plain
(In reply to comment #11)
> 
> This patch puts the cache in the thread-data (to avoid race conditions) and
> caches sin,cos,tan,log,asin,acos,atan.  Instead of giving each function its own
> cache, the function is used in the cache key so all share one.  Since (without
> the function key) each cache is 65KB (4K entries * 2 * sizeof(double)), this
> should save a lot of space.  (I suspect this is why jsc only caches sin.)

So each thread is now 65KB bigger?  That seems like a lot.
(In reply to comment #14)
> So each thread is now 65KB bigger?  That seems like a lot.

Smaller caches with the variety of simple/fast hashes I tried get a lot lower hit rate on 3d-morph (60% for 1/2 size, 25% for 1/4).  JSC apparently found the same.  Feel free to experiment, though, I'm certainly not expert on hash functions.  Note that its not each thread, but each thread that runs JS.  IIUC (which is a big if) thats main thread + a small number of worker threads.
(In reply to comment #15)
> 
> Smaller caches with the variety of simple/fast hashes I tried get a lot lower
> hit rate on 3d-morph (60% for 1/2 size, 25% for 1/4).  JSC apparently found the
> same.  Feel free to experiment, though, I'm certainly not expert on hash
> functions.

Sure, I can totally believe this is an appropriate hash table size for the purpose of this bug.

The broader question is whether this bug is worthwhile, ie. is it worth bloating each JS thread by 65K for an 8ms Sunspider win, when the optimization in question is of questionable general utility?  If someone could point to real-world program that would benefit from this I'd be happier.
I would not be surprised to discover that a bunch of the 3d-type demos hit the cache; IIRC the JSC guys found hits in the wild when they did it, and we see more mathy stuff these days.  Any cache hits in Kraken-land, or on chrome experiments?

Most scripts don't use trig, so if we made the cache be initialized on first trig-use, they wouldn't pay for it anyway.
> Most scripts don't use trig, so if we made the cache be initialized on first
> trig-use, they wouldn't pay for it anyway.

++shaver
(In reply to comment #17)
> Most scripts don't use trig, so if we made the cache be initialized on first
> trig-use, they wouldn't pay for it anyway.

I was going to do that, but it adds a failure mode to the math_sin traceable native.  I suppose it would be rather easy, though, to have TraceRecorder create them when recording calls to such traceable natives.
Attachment #481603 - Attachment is obsolete: true
Attachment #481733 - Flags: review?(jwalden+bmo)
Attachment #481603 - Flags: review?(jwalden+bmo)
Comment on attachment 481733 [details] [diff] [review]
with lazy cache allocation

>+class MathCache
>+{
>+    static const unsigned SizeLog2 = 12;
>+    static const unsigned Size = 1 << SizeLog2;
>+    struct Entry { double in; UnaryFunType f; double out; };
>+    Entry table[Size];
>+
>+  public:
>+    MathCache();
>+
>+    uintN hash(double x) {
>+        union { double d; struct { uint32 one, two; } s; } u = { x };
>+        uint32 hash32 = u.s.one ^ u.s.two;
>+        uint16 hash16 = (uint16)(hash32 ^ (hash32 >> 16));
>+        return (hash16 & (Size - 1)) ^ (hash16 >> (16 - SizeLog2));
>+    }
>+
>+    /*
>+     * N.B. lookup uses double-equality. This is only safe if hash() maps +0
>+     * and -0 to different table entries, which is asserted in MathCache().
>+     */
>+    double lookup(UnaryFunType f, double x) {
>+        uintN index = hash(x);
>+        Entry &e = table[index];
>+        if (e.in == x && e.f == f)
>+            return e.out;
>+        e.in = x;
>+        e.f = f;
>+        return (e.out = f(x));
>+    }
>+};

*This* is what all our caches should look like.

I note in passing that NaN inputs will never result in cache hits, but I can't believe we care.

r=me if you write some tests for a variety of interesting values (infinities, zeroes, NaN, a few values between 0 and 1) and all these Math functions to check that the result of the first call to a function and the result of the second call to the function are equal, and that a first call, followed by traced successive calls, all return the same value.
Attachment #481733 - Flags: review?(jwalden+bmo) → review+
Thanks Waldo!  math-trace-tests.js covers this all pretty well (it found the +0/-0 bug mentioned in the patch comments).

http://hg.mozilla.org/tracemonkey/rev/5b09b3845178
Whiteboard: fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/5b09b3845178
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
This optimization seems to help waveform audio being generated for mozWriteAudio.
Vindication!
https://bugzilla.mozilla.org/show_bug.cgi?id=598655 definitely might have somewhat depended on this bug.
See Also: → 901000
You need to log in before you can comment on or make changes to this bug.