Closed Bug 555010 Opened 15 years ago Closed 15 years ago

performance of Euler #14; a cache hinders when it should help

Categories

(Tamarin Graveyard :: Virtual Machine, defect, P3)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED
Q3 11 - Serrano

People

(Reporter: pnkfelix, Assigned: pnkfelix)

References

Details

Attachments

(3 files, 3 obsolete files)

Overview: a standard memoization technique is hurting performance for Tamarin where one would expect it to help (and where it _does_ help in many JavaScript implementations).

The Betanews benchmark suite [1] includes a port of a program to solve the 14th problem from Project Euler. The Euler #14 benchmark is the old saw about iteratively doing n → n/2 (n even); n → 3n + 1 (n odd), for all numbers up to 1,000,000. The version I ported is taken from the url below:

http://lanzkron.wordpress.com/2010/01/18/yet-another-meaningless-javascript-benchmark/

There are two versions of the code at the link, "naive" and "optimized". The naive version does the obvious iteration; the optimized one keeps a cache of previously encountered values. (More detail available at the website above.)

The curious result is this: with avmshell, the optimized version is significantly slower than the naive one. This does not jibe with the original author's experience with JavaScript on a variety of browsers. I've attached the relevant code (ported from the website above and further modified by Ed Smith).

bash-3.2$ ~/Dev/tamarin-redux/objdir/shell/avmshell /tmp/eul.abc
Naive found 837799 (chain length: 525) in 20133ms
Optimized found 837799 (chain length: 525) in 102131ms

I mentioned this to a few others over e-mail and we had a conversation with some hypotheses about why this happens. Is it memory management of the cache, or the representation for indexed properties, or . . . ?

The task:
1. Find out why Naive >> Optimized.
2. If possible, fix so Optimized >> Naive.

p.s. Ed Smith says he knows a tweak (presumably to the runtime) that might be the fix for this; figuring out this bug is in part an educational exercise for me to learn the codebase.

[1] http://www.betanews.com/article/Betanews-Relative-Performance-Index-for-browsers-30-How-it-works-and-why/1269459217
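For reference, the two variants can be sketched in JavaScript roughly like this (a reconstruction from the blog post's description, not the attached eul.as; all names are illustrative):

```javascript
// Chain length of the Collatz sequence starting at n (naive: recompute every time).
function chainLengthNaive(n) {
  let len = 1;
  while (n !== 1) {
    n = (n % 2 === 0) ? n / 2 : 3 * n + 1;
    len++;
  }
  return len;
}

// Memoized variant: stop early when hitting a value whose chain length is cached.
const cache = {};
function chainLengthOptimized(n) {
  const start = n;
  let len = 0;
  while (n !== 1 && !(n in cache)) {  // the "n in cache" lookup is central to this bug
    n = (n % 2 === 0) ? n / 2 : 3 * n + 1;
    len++;
  }
  len += (n === 1) ? 1 : cache[n];
  cache[start] = len;
  return len;
}

// Find the start below a limit with the longest chain.
function longestChain(limit, chainLength) {
  let best = 1, bestLen = 1;
  for (let i = 2; i < limit; i++) {
    const len = chainLength(i);
    if (len > bestLen) { best = i; bestLen = len; }
  }
  return { start: best, length: bestLen };
}
```

Note that the intermediate values of the iteration can grow far beyond the problem limit, which is why keys much larger than 1,000,000 end up in the cache.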
(In reply to comment #0)
> p.s. Ed Smith says he knows a tweak (presumably to the runtime) that might be
> the fix for this; figuring out this bug is in part an educational exercise for
> me to learn the codebase.

The tweak is to the test case.
Just to round out the description, here's a summary of what came up via email discussion:

Tommy: thought the VM was smart enough to let an integer be a property key and avoid string conversion, but maybe not.

Lars: maybe keys that should be integers but aren't are converted to strings rather than numbers. Could check by throwing in calls to Math.floor in next().

(Ed's tweak is related to Lars' hypothesis in some way, so that's a good first step.)
Assignee: nobody → fklockii
So far it looks like the problem is that the numbers used as keys can get quite large, and there is a threshold where setting/getting indexed properties becomes 5x to 10x slower. The threshold seems to lie somewhere around 268370000 -- 268410000. Of course it has to get slower at some point, since there presumably has to be a shift somewhere from a dense array representation to a sparse hashtable (assuming that's even what is happening at this point of shift). Does the slowdown meet what we should reasonably expect?
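One way to bracket that threshold is a microbenchmark along these lines (a sketch; the function name, counts, and the Date.now() timer are illustrative -- an avmshell test would use getTimer() instead):

```javascript
// Time repeated indexed sets/gets starting at a given key magnitude.
function probe(start, count) {
  const obj = {};
  const t0 = Date.now();
  for (let i = 0; i < count; i++) {
    obj[start + i] = i;                                // indexed set
    if (obj[start + i] !== i) throw new Error("bad");  // indexed get
  }
  return Date.now() - t0;  // elapsed milliseconds
}

// Compare a "good" starting key against a "bad" one, e.g.:
//   probe(200000000, 100000) vs. probe(300000000, 100000)
```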
Attachment #434939 - Attachment mime type: application/applefile → text/plain
Unless the slowdown is maxing out RAM and starting to swap, I would not "reasonably expect" a 5x or 10x slowdown due to large heaps, arrays, or objects (hashtables). IOW, it could be a symptom we care about. If this test case turns up more than one performance pitfall, all the better! When I ran it, naive & optimized were about neck and neck with a smaller input size (I deleted one 0 from the value in limit()). You could run it in Shark and compare the small-input profile with the big-input profile.
A way to apply the lesson about indexed properties to the eul.as file is to introduce a key_limit constant to the code, which limits how large the keys in the cache can get. One subtlety about applying this alternative tweak is that it is not enough to filter out key insertions; you need to also check the key_limit when you do key lookups (the "n in cache" expression). In fact, if you omit this, performance gets even worse than it was without the key_limit at all. This is probably a sign that the "in" operator is much slower than it should be for certain operands. I will probably post a variation of eul.as to illustrate this soon.
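A sketch of that key_limit scheme (JavaScript syntax; KEY_LIMIT, cachedLen, and storeLen are hypothetical names, not from eul.as), with the crucial point that the limit guards the lookup as well as the insertion:

```javascript
const KEY_LIMIT = 100000;  // illustrative bound on cache key magnitude
const cache = {};

// Guard the lookup as well as the insertion: without the n < KEY_LIMIT
// check here, "n in cache" on huge keys is itself the slow path.
function cachedLen(n) {
  return (n < KEY_LIMIT && n in cache) ? cache[n] : undefined;
}

// Filter out large keys entirely so they never enter the cache.
function storeLen(n, len) {
  if (n < KEY_LIMIT) cache[n] = len;
}
```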
* Illustrates the threshold where getting and setting indexed properties become 5x slower.
* Illustrates that the "(n in obj)" expression becomes very slow when obj contains many dynamic properties (more than the above threshold) and n has also crossed that threshold. Here, "very slow" means 50x a fast set/get; note that the in operator is already slow even with "nice" inputs, so crossing the threshold is "only" making the in operator 10x slower.

(One million dynamic properties may not be common, but hey, it was in the Euler #14 benchmark...)
Attachment #435160 - Attachment is obsolete: true
Generalized code to use command line to choose which operation and inputs to test. This allows sensible use of profiling tool (e.g. Shark) to investigate problem. I commented out the old invocations but you can still read them to see where the relevant thresholds are. For my testing on the command line, I am using start = 200000000 (good) and 300000000 (bad); it does not seem to matter much if one is reasonably far from the precise threshold value that lies in between those two extremes.
Attachment #435178 - Attachment is obsolete: true
2^28-1 is the largest integer that can be used as an array/object index without being converted to a String; it falls in the middle of your good/bad range. See MAX_INTEGER_MASK: http://hg.mozilla.org/tamarin-redux/file/tip/core/ScriptObject.h#l215 Interned Strings cache their integer values: http://hg.mozilla.org/tamarin-redux/file/tip/core/StringObject.h#l494
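For concreteness, the quoted boundary works out as follows (a trivial check mirroring the C++ constant, not using it):

```javascript
// 2^28 - 1: the largest integer usable as an index without String
// conversion, per MAX_INTEGER_MASK in ScriptObject.h.
const MAX_INTEGER_INDEX = (1 << 28) - 1;  // 268435455
```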
Status: NEW → ASSIGNED
Flags: flashplayer-qrb+
Priority: -- → P3
Target Milestone: --- → flash10.2
Okay, that jibes with what I see from Shark on cacheloop. The 300M value brings GCAlloc::Alloc invocations to the top of the profile data, with 16% of the runtime. Half of the Alloc invocations are from AvmCore::doubleToAtom_sse2; the other half stem from AvmCore::internUint32 (or really from avmplus::setprop_index). There's also overhead from other GC and String methods. I'm not quite sure which things are open to change, but it is probably wrong for many objects to be created when the same large index is used repeatedly. So perhaps more interning would help: something like changing the lookup mechanism to avoid allocating a freshly converted value when there's already a key present in the object's properties that would serve just as well?
Yeah, but even below that threshold, when the string/int memoizing supposedly works right, the optimized test isn't faster than the naive test. So we're onto a problem (and there are probably things to fix), but I suspect not the original problem. So the question is: even below the 2^28 threshold, are there access paths that are allocating strings when they shouldn't be?
Ed's last comment prompted me to gather data about the behavior at different values for limit: the running time of Naive and Optimized, the number of properties that end up in the cache, and the number of (failed/total) cache lookup attempts. The goal of this refinement is to show that at the outset, Optimized beats Naive, and then as you approach the 2^28-1 threshold, the ratio decreases and finally inverts.

One suspicious thing: the number of entries being put in the cache seems pretty small compared to the amount of overhead that maintaining the cache is introducing. For example, for limit: 500,000, there are 1,225 large numbers inserted into the cache, but the program takes 20 *seconds* longer to run. If the extra time really is just due to the large numbers alone, that seems absurd to me. . .

Anyway, the output is below.

----
limit set to: 10000
Naive found 6171 (chain length: 262) in 84ms
Optimized found 6171 (chain length: 262) in 30ms
total keys > 2^28-1 in cache: 0  big_fails: 0/0  lil_fails: 21663/31660

limit set to: 20000
Naive found 17647 (chain length: 279) in 176ms
Optimized found 17647 (chain length: 279) in 82ms
total keys > 2^28-1 in cache: 0  big_fails: 0/0  lil_fails: 43347/63344

limit set to: 50000
Naive found 35655 (chain length: 324) in 480ms
Optimized found 35655 (chain length: 324) in 341ms
total keys > 2^28-1 in cache: 0  big_fails: 0/0  lil_fails: 108171/158168

limit set to: 100000
Naive found 77031 (chain length: 351) in 1019ms
Optimized found 77031 (chain length: 351) in 962ms
total keys > 2^28-1 in cache: 26  big_fails: 26/26  lil_fails: 217185/317182

limit set to: 200000
Naive found 156159 (chain length: 383) in 2174ms
Optimized found 156159 (chain length: 383) in 4020ms
total keys > 2^28-1 in cache: 194  big_fails: 194/194  lil_fails: 433534/633531

limit set to: 500000
Naive found 410011 (chain length: 449) in 5871ms
Optimized found 410011 (chain length: 449) in 25514ms
total keys > 2^28-1 in cache: 1225  big_fails: 1225/1225  lil_fails: 1084008/1584005

limit set to: 1000000
Naive found 837799 (chain length: 525) in 12821ms
Optimized found 837799 (chain length: 525) in 69456ms
total keys > 2^28-1 in cache: 4182  big_fails: 4182/4187  lil_fails: 2164428/3164420

limit set to: 2000000
Naive found 1723519 (chain length: 557) in 26668ms
Optimized found 1723519 (chain length: 557) in 333779ms
total keys > 2^28-1 in cache: 18835  big_fails: 18835/18860  lil_fails: 4324652/6324624
Attachment #434939 - Attachment is obsolete: true
(In reply to comment #11)
> Yeah, but even below that threshold, when the string/int memoizing supposedly
> works right, the optimized test isn't faster than the naive test. so we're
> onto a problem (and there are probably things to fix), but I suspect not the
> original problem.

You are right. Numbers above 2^28-1 are certainly a problem for us, but they might not be the only problem we have on this benchmark. I discovered this while trying to refine the benchmark to add a key_limit bound on the cache entries, as I described in comment #6 above. The key_limit stops being effective at a *much* smaller threshold than 2^28-1; a cache with a key_limit set to 10^6 will perform quite poorly, which surprised me. This means that the problem surfaces with numbers below 2^28-1; this would help explain why performance for e.g. limit=500,000 was so poor in my results from comment #12: there were only 1,225 numbers > 2^28-1 in the cache, but there are probably many values in this unexplored bad range below 2^28-1. A key_limit of 1000 *does* yield good cache performance (as in, it beats the naive iteration), and that is why I had posted positive remarks in comment #6, but I did not realize that result was so dependent on my choice of such a small limit. . .
This is similar to the "Two variants of Euler #14" I attached at the outset, but adds a third parametric variant that imposes a limit on the magnitude of the keys of cache entries. (Since the new variant is parametric, it represents k variants rather than just 1.) This code illustrates that for the chosen problem limits, a small key limit (of, say, 1,000 or 100,000) performs better than one of 1M or above, as discussed in comment #13. Even setting aside the problems handling properties greater than 2^28-1, as briefly discussed in comment #9 and comment #10, we should still see good performance for key limits less than 2^28-1 ≈ 268M. So as Ed suggested in comment #11, I will next analyze the behavior for the cases in this specific range of keys above 100K but below 200M, and find out why the system is behaving so slowly there.
(In reply to comment #14)
> Created an attachment (id=435676) [details]
> 2+k variants of Euler #14 illustrating knees in performance curve

Here's sample output from running that script. Note the knees at Limit100K for the elapsed time.

problem limit set to: 50000
Naive found 35655 (chain length: 324) in 398ms postgc: 1ms
Optimized found 35655 (chain length: 324) in 276ms postgc: 3ms
Limite10K found 35655 (chain length: 324) in 109ms postgc: 2ms
Limit100K found 35655 (chain length: 324) in 136ms postgc: 4ms
Limited1M found 35655 (chain length: 324) in 226ms postgc: 5ms
Limit100M found 35655 (chain length: 324) in 235ms postgc: 5ms

problem limit set to: 100000
Naive found 77031 (chain length: 351) in 845ms postgc: 1ms
Optimized found 77031 (chain length: 351) in 706ms postgc: 16ms
Limite10K found 77031 (chain length: 351) in 295ms postgc: 3ms
Limit100K found 77031 (chain length: 351) in 273ms postgc: 3ms
Limited1M found 77031 (chain length: 351) in 676ms postgc: 14ms
Limit100M found 77031 (chain length: 351) in 764ms postgc: 9ms

problem limit set to: 200000
Naive found 156159 (chain length: 383) in 1808ms postgc: 2ms
Optimized found 156159 (chain length: 383) in 3150ms postgc: 51ms
Limite10K found 156159 (chain length: 383) in 818ms postgc: 3ms
Limit100K found 156159 (chain length: 383) in 444ms postgc: 6ms
Limited1M found 156159 (chain length: 383) in 1968ms postgc: 31ms
Limit100M found 156159 (chain length: 383) in 2630ms postgc: 44ms

problem limit set to: 500000
Naive found 410011 (chain length: 449) in 4899ms postgc: 4ms
Optimized found 410011 (chain length: 449) in 19815ms postgc: 59ms
Limite10K found 410011 (chain length: 449) in 2528ms postgc: 6ms
Limit100K found 410011 (chain length: 449) in 1692ms postgc: 8ms
Limited1M found 410011 (chain length: 449) in 16567ms postgc: 51ms
Limit100M found 410011 (chain length: 449) in 38018ms postgc: 47ms
Made a variant of the key-limited program that iterates one variant selected via the command line. Used Shark to analyze the program, comparing the profile with a key limit of 100K against a key limit of 1M. The profile when key limit = 1M is grossly dominated (85% of its time) by AvmCore::findString(String) or that function's invocations of the String::equals(String) method, and all of the paths to those two functions originate in Toplevel::in_operator(int, int) or that function's invocations of intern(int). The implementation of the in operator is my primary candidate for further investigation.
Ed mentioned to me that his tweak was to replace the "n in cache" expression with "cache[n] != undefined". Putting this in does indeed make Optimized much faster than Naive in all cases I saw.

Choosing an appropriate key limit performs better still, which one would expect, because a well-chosen key limit will essentially filter out thousands of extra cache entries that are *never* looked up. This argument holds even for key limits << 2^28; i.e. with a problem limit size of 200K, a key limit of 100K outperforms a key limit of 1M, because both modes have the *same* number of successful cache lookups; the only difference is how many extra and unused entries are inserted when you use the limit of 1M (answer: 795,213 entries; there are only 199,997 successful cache lookups in the first place for that problem limit size).

So there are two core problems here:
* The in operator is absurdly slow compared with doing "obj[n] != undefined".
* Indexed property operations setting/getting "obj[n]" are too slow when n >= 2^28.

The remaining potential issue (that indexed property operations setting/getting "obj[n]" might be too slow even for (large) n < 2^28) is a red herring here.
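Ed's tweak, sketched on a stand-in memo cache (illustrative names; the attached eul.as is not reproduced here):

```javascript
// Stand-in for the benchmark's cache object.
const cache = {};
cache[42] = 9;

// The slow path in Tamarin at the time: "in" ends up interning the key
// as a String and searching for it (per the Shark profile above).
function hasIn(n)  { return n in cache; }

// Ed's tweak: an ordinary indexed read plus an undefined check.
// (Caveat: an entry whose stored value is undefined reads as absent.)
function hasGet(n) { return cache[n] != undefined; }
```

Both expressions give the same answer on this benchmark, since the cache only ever stores chain lengths (positive integers); the difference is purely in which runtime path they exercise.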
Depends on: 555982
Depends on: 556023
bug 556023 should not block this, I *think*. That is, if one can run the original benchmark and determine that the cache is now helping rather than hurting, then it will be safe to close this ticket, even if bug 556023 has not been resolved at that time. (The description for 556023 starts off with a back-pointer to here, so no info is lost by this revision to the meta-data.)
No longer depends on: 556023
See Also: → 556023
More recent results for the benchmarks. The results here are on my MacBook Pro 2.8 GHz Core 2 Duo w/ 4 GB RAM, on a release build of avmshell. (The same machine I did my runs on back in March. Unfortunately I did not write down whether that run was done on a Release or Debugger build; for now I'm assuming it was a Release build, but perhaps that is over-optimistically guessing at how much our performance has improved...)

Anyway, lookups on numbers >> 2^28 are still causing problems (which is unsurprising). But that has been filed as bug 556023, and as you can see below, on the *original* benchmark the cache is helping for the most part. So following the logic I gave in comment 18, I'm closing this ticket.

For the immediate result below, compare against the report in the description of this bug.

tamarin-redux/objdir-selftest64 (hg:bug555010) % ./shell/avmshell /tmp/eul.abc
Naive found 837799 (chain length: 525) in 16462ms
Optimized found 837799 (chain length: 525) in 13060ms

For the below, compare against comment 12 above.
limit set to: 10000
Naive found 6171 (chain length: 262) in 63ms
Optimized found 6171 (chain length: 262) in 10ms
total keys > 2^28-1 in cache: 0  big_fails: 0/0  lil_fails: 21663/31660

limit set to: 20000
Naive found 17647 (chain length: 279) in 137ms
Optimized found 17647 (chain length: 279) in 21ms
total keys > 2^28-1 in cache: 0  big_fails: 0/0  lil_fails: 43347/63344

limit set to: 50000
Naive found 35655 (chain length: 324) in 377ms
Optimized found 35655 (chain length: 324) in 85ms
total keys > 2^28-1 in cache: 0  big_fails: 0/0  lil_fails: 108171/158168

limit set to: 100000
Naive found 77031 (chain length: 351) in 797ms
Optimized found 77031 (chain length: 351) in 252ms
total keys > 2^28-1 in cache: 26  big_fails: 26/26  lil_fails: 217185/317182

limit set to: 200000
Naive found 156159 (chain length: 383) in 1694ms
Optimized found 156159 (chain length: 383) in 708ms
total keys > 2^28-1 in cache: 194  big_fails: 194/194  lil_fails: 433534/633531

limit set to: 500000
Naive found 410011 (chain length: 449) in 4596ms
Optimized found 410011 (chain length: 449) in 3366ms
total keys > 2^28-1 in cache: 1225  big_fails: 1225/1225  lil_fails: 1084008/1584005

limit set to: 1000000
Naive found 837799 (chain length: 525) in 9714ms
Optimized found 837799 (chain length: 525) in 13070ms
total keys > 2^28-1 in cache: 4182  big_fails: 4182/4187  lil_fails: 2164428/3164420

limit set to: 2000000
Naive found 1723519 (chain length: 557) in 20717ms
Optimized found 1723519 (chain length: 557) in 43606ms
total keys > 2^28-1 in cache: 18835  big_fails: 18835/18860  lil_fails: 4324652/6324624
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Flags: flashplayer-bug-