959 bytes, text/plain
1.65 KB, patch
|Details | Diff | Splinter Review|
257.43 KB, patch
|Details | Diff | Splinter Review|
3.61 KB, text/plain
It's certainly possible that I'm missing something and this is invalid but there does seem to be some very slow code in spidermonkey. I was intrigued by a posting in n.p.p.performance about js timings but I'm interested in interpreter speed compared to real compiled code. I built a number of spidermonkey versions with different gcc's (egcs, 2.95.3, and 3.1.1) as well as different options (-O, -O2, -O3, -O3 -fomit-frame-pointer). Using slightly modified test code, I translated it into C and built executable using the same compiler options so there are js and C pairs. Going in, I would have been impressed if js was 5 times slower and I would have found 10 a reasonable number. However, all the pairs are between 55 to 60 times slower, almost a full order of magnitude worse than expected. That's not good at all. I do notice that js allocates about 7MB while running the test. Perhaps that has something to do with the time.
Reassigning to Kenton; cc'ing Brendan, rogerl, shaver, jrgm -
Timing Results: C vs 20020804 trunk xpcshell Win2k 1200 MHz Athlon C ----------------------- elapsed time: 0.036161s elapsed time: 0.03638s elapsed time: 0.036716s elapsed time: 0.036495s elapsed time: 0.036258s elapsed time: 0.039186s xpcshell ----------------------- elapsed time = 2.183s elapsed time = 2.213s elapsed time = 2.233s elapsed time = 2.193s elapsed time = 2.193s elapsed time = 2.223s
Did you really intend that the test program carry a divide-by-zero error through all the calculations (for x), and the value of y to always be zero? (Although, I'm not entirely surprised that js might be 60 times slower on a tight numeric loop. Within an order of magnitude (a factor of 10) would be nice, but not always expected).
What's the bug here? Dynamic languages are 10x-100x slower than static ones without aggressive optimizations that we haven't had time or need to do yet in SpiderMonkey. Try a better summary, or risk getting an INVALID resolution. This might end up dup'd against an existing bug (bug 121414, or possibly bug 61898, although I think that one should be closed and new one(s) filed). /be
I missed tenthumbs' comment about 7MB being allocated, but jrgm pointed the way: the testcase is completely bogus. It initializes x to 0, then proceeds to generate 0/0 on the first iteration of each loop, which is NaN. That NaN then spreads throughout the rest of the iterations, requiring a new double-valued GC-thing to be created. This bug would be most useful if it were the successor to bug 61898, with the summary I'm giving it now. /be
Well make that x=2.0, y=3.0 and it's not as bogus as you think. That just moves the numbers into the 50-55 range which isn't much of a change. In any case, infinities are allowed, so I don't see what's so bogus anyway. Now, would you like a new bug on recursive functions that are an order of magnitude slower than the one here?
tenthumbs: what's bogus is that the partial evaluation work I'm doing will soon make your JS benchmark run much faster than the C version. A more realistic benchmark wouldn't degenerate into a partial evaluation that becomes total and computes x and y at compile (JIT) time. Why don't you mail me your recursive function benchmark and we can figure out what's going on. Again, you're not going to get Perl, Python, JS, or other dynamic languages running as fast, or even at most 10x slower, than C code, without a lot of Self-ish code specialization, partial evaluation, JIT-compiling, code caching work. /be
Created attachment 94530 [details] [diff] [review] patch to avoid allocating NaNs and Infinities This patch will slow down double allocation slightly (by the cost of the failing JSDOUBLE_IS_FINITE test), but it will speed up the bogo-benchmark here a bit. If you change x's initial value to 1, then it slows down that less bogus, but still vacuous, variant benchmark. It seems to me the common case is not to generate lots of NaNs and infinities, so we shouldn't check in this patch. But I thought I'd post it. /be
+ if (rt->jsNegativeInfinity && *rt->jsPositiveInfinity == d) + return rt->jsNegativeInfinity; should be + if (rt->jsNegativeInfinity && *rt->jsNegativeInfinity == d) + return rt->jsNegativeInfinity; Couldn't you use one test of 'rt->state == JSRTS_UP'?
Created attachment 94949 [details] [diff] [review] duh, thanks rogerl I still don't think this patch should go in, because the dynamic frequence of non-finite values is (or should be, in well-behaved numerical JS) so low. Anyone have contrary thoughts or measurements? /be
cc'ing rginda, waldemar in case they have any contrary thoughts on this -
I instrumented the patch to count non-finite values passed to js_NewDouble, and found none (0) when starting Mozilla, loading slashdot.org in a browser window. Anyway, the patch wasn't meant to be checked in, it was more of a discussion device. So I'm going to keep talking to myself :-). Allocating doubles on the interpreter stack is harder. It would seem to require knowing the type at compile (or at partial evaluation) time, or else keeping a parallel type stack and checking it on every pop. The former is something JS2 makes easy, the latter is going to hurt JS1 performance. I argue that it is better to focus on more general wins (such as might be gained by fixing bug 121414), instead of spending too much effort on one cost measured by a synthetic benchmark. /be
If improving the handling of NaN and Inf implies a cost for the general numeric case, then, no, I don't think this should land.
Just as I'd submitted the last comment, I followed a link from /. and the number of non-finite new double GC-thing allocations saved by the patch went from 0 to 6 (out of only 213 new doubles created since browser startup, mostly by chrome JS -- note again that the trade-off of allocating doubles in the GC heap only when needed, and using tagged ints, seems to win). /be
I agree that handling non-finite values in software isn't worth the trouble. Trap it in the hardware if you really want to. I think you're all getting hung up on infinitities. They happen a lot more often than you think. Visit a number of web pages and see how many appear. I suspect that if you subtract out the chrome the percentage will be surprisingly high. A floating point value is essentially always a valid value. The fact that the hardware processes some values more slowly than others isn't particulary relevant because it affects all processes equally. The real issue is that spidermonkey is wasteful in both time and space on floating point.
tenthumbs, what in tarnation are you talking about? >I agree that handling non-finite values in software isn't worth the >trouble. Trap it in the hardware if you really want to. Trap what in hardware? In discussing the moot patch, we were talking about how JS allocates doubles, and whether it's worth returning per-runtime singletons for NaN and +/-Infinity -- so we were clearly talking only about software. >I think you're all getting hung up on infinitities. (Freudian slip? ;-) > They happen a lot more often than you think. Evidence? I already reported my results here, where are yours? >Visit a number of web pages and see how many appear. I suspect that if you >subtract out the chrome the percentage will be surprisingly high. 6 non-finite values out of 213 doubles, out of thousands to millions of tagged ints, sounds like a small number to me. What's surprising? What web pages do you mean? >A floating point value is essentially always a valid value. Strawman -- who said otherwise? Not that I know what "valid" means here. >The fact that the hardware processes some values more slowly No one ever said that hardware processes non-finite values more slowly, and I'm not sure popular FPUs do or do not penalize such values. This too seems to have nothing to do with the matter at hand, or what people have written in this bug's comments. >than others isn't particulary relevant because it affects all processes >equally. Again, not that anyone claimed non-finite values were slow or bogus to test (re-read my comment #8 about partial evaluation to understand why the benchmark is bogus whether it computes a NaN repeatedly in x, or a small fration in x), it seems to me you have failed to demonstrate that non-finite values occur frequently, or with any frequency sufficient to justify any software (or hardware, for that matter, if there are hardware penalties in current FPUs) special-case optimizations for those values. >The real issue is that spidermonkey is wasteful in both time and space on >floating point. No, the real issue is that your synthetic benchmark has nothing to do with Mozilla or JS performance on real world scripts, pending actual measurements and examples from you. If you're looking to prove that JS makes Mozilla slow because JS allocates doubles from its GC heap, you have to show some evidence (poor man's profilig by ctrl-c in a debugger that finds lots of js_NewDouble stack backtraces, ad-hoc instrumentation such as I added to find that only 6 non-finite values are created browsing /. and one of its article links, or jprof/quantify profiles as I strongly suggested in email to you) of frequent JS double GC-thing allocations. Until then, talk is cheap, and big talk (specifically, your use of "wasteful", which cries for real-world cost comparisons, as it begs questions about trade-offs that can't be answered without profiling data, or at least some global measurement of doubles allocated per unit time browsing an average url reference string, or doing some other user-level task) is even cheaper. /be
Next alpha. /be
I owe tenthumbs a bug about empty function invocation overhead, vs. perl empty subroutine. Tenthumbs, feel free to file it on me. /be
An alternative idea from Ed Smith of Adobe: change jsval to 64 bits and use NaN patterns to tag non-double values except for the single, canonical quiet NaN that JS needs. E.g., given 64-bit value x and exponent, mantissa, and signbit functions: x exponent(x) mantissa(x) signbit(x) = + =========== =========== ========== non-NaN double | not all ones ( one or both non-zero ) canonical NaN | all ones tag(1) 1 null | all ones tag(0) 1 undefined | all ones tag(1) 0 boolean | all ones tag(2) + x 0 int (48 bits!) | all ones tag(4) + x 0 string | all ones tag(5) + ptr(x) 0 object | all ones tag(6) + ptr(x) 0 private | all ones tag(7) + ptr(x) 0 others (namespaces, etc.) possible... The first two bytes contain the sign bit, 11 exponent bits, and 4 tag bits. Thus a private pointer can be up to 48 bits, which should be enough for practical 64-bit address space support. Thus tag(t) is t << 48 and ptr(x) is (x & bitmask(48)). Type tests can load just the first two bytes. There may be a better encoding than the sketch above, I just made it up! In particular perhaps null should be adjacent in tag terms to the reference types (string, object, etc.). Comments? /be
If 64-bit stack items cost too much on 32-bit systems I would be surprised. Control operators that push and pop the stack could use a separate control stack, but we already have precedent for pushing and popping two items. We may also end up with two PCs: one for fat opcodes as jsinterp.c uses today and the other for a lower-level code that can be traced and JITted efficiently, such that the heavy runtime type tax can be traced away (with guards). /be
boolean | all ones tag(2) + x 0 int (48 bits!) | all ones tag(4) + x 0 Oops, meant to include the boolean value in the tag bits, then thought better of it. Could use sign bit. Idea is to avoid loading 48 bits to get the true or false decoded. /be
The only problem I see with Ed Smith's suggestion is that the technique may require revision or worse on machines that use different floating-point representations. Honestly, I think that's a non-issue, but it probably should at least get mentioned ;)
Rhino's pure interpreter mode uses a separated double stack to hold numerical values. That made various math tests to run 20%-30% faster. Similarly, when Rhino compiles JS down to JVM's bytecode, it allocates separated JVM's register to hold numeric values.
(In reply to comment #21) > An alternative idea from Ed Smith of Adobe: change jsval to 64 bits and use NaN > patterns to tag non-double values except for the single, canonical quiet NaN > that JS needs. It is not necessary to even bother with NaN patterns, one can use an encoding that uses 61 bit for double values and allocates the rest of doubles on the heap. I.e. very small or very big doubles do not happen often in practical calculations and when they do, then it indicates about numerical instability in an algorithm.
This is interesting, but is there a real-world site that should expect as much as a 1% performance boost from the proposed change? On Intel 32-bit, won't it be bad for each jsval to soak up two registers instead of one?
(In reply to comment #24) > The only problem I see with Ed Smith's suggestion is that the technique may > require revision or worse on machines that use different floating-point > representations. Honestly, I think that's a non-issue, but it probably should > at least get mentioned ;) ECMA-262 requires IEEE double, and this means hardware that uses a different (or even more precise, i.e., IEEE 754 80 bit extended precision) does not conform without painful-to-performance extra software checks. This is a problem on Mac Intel laptops today, since Apple wierdly splits numeric chores across x87 and SSE and uses extended precision on the former. We have not found a way to make it stop (see bug ). Short story is that most sane hardware can use doubles without any overhead. (In reply to comment #26) > (In reply to comment #21) > > An alternative idea from Ed Smith of Adobe: change jsval to 64 bits and use NaN > > patterns to tag non-double values except for the single, canonical quiet NaN > > that JS needs. > > It is not necessary to even bother with NaN patterns, one can use an encoding > that uses 61 bit for double values and allocates the rest of doubles on the > heap. I.e. very small or very big doubles do not happen often in practical > calculations and when they do, then it indicates about numerical instability in > an algorithm. Can you do this without (a) extra overhead dealing with most doubles; (b) requiring whole-double loads to discern type? Probably, but (b) is important (ARM Thumb has a 16-bit data bus, I hear; the NaN idea allows the first two bytes to be loaded for type tests and even boolean tests). Jason: it's not just crypto. This bug was filed a long time ago, and the workload that crosses over the INT_FITS_IN_JSVAL threshold is only rising, in part due to benchmark wars but also due to real honest-to-goodness graphics and other numerically more intensive programming in JS. The cost of not optimizing doubles is unusually high unless you work hard on stack-like GC; the cost of stacking 64-bit values is relatively small on desktop CPUs. We would have to measure to be sure, but I think we have hopeful evidence. In a past life I moved a simulator for a 64-bit CPU to use 64-bit host code, by implementing MIPS R4000 64-bit support in GCC (my patches and another person's were merged for the final support), targeting the first SGI box to use the R4K. The speedup however was marginal, because other costs than load and store traffic dominated. The flip side was that the simulator targeting a 32-bit MIPS host ran well enough given the pipelining of the day and the other instructions in the schedule that filled the pipe. CPUs have become even more pipelined. Igor's data from Rhino is suggestive. Your point is well-taken for tagged ints: they may well want to remain 32 (not 31) bit, so only the type tag (if needed; JITted code would often know the on-trace precise type) and the low 32 bits of the double (and a 32-bit register to hold it) would be needed. This would reduce register pressure and memory bandwidth for all the data that stays in int-tagged 32-bit jsvals on 32-bit CPUs today. /be
(In reply to comment #29) > This is a problem > on Mac Intel laptops today, since Apple wierdly splits numeric chores across > x87 and SSE and uses extended precision on the former. We have not found a way > to make it stop (see bug ). See bug 264912, but I remember a more recent Mactel bug. Can someone id it? /be
(In reply to comment #30) > See bug 264912, but I remember a more recent Mactel bug. Can someone id it? Bug 338756?
midair... ecma/Date/22.214.171.124-1.js fails on 1.8.1|1.9.0 mactel but it may not be related to this and I haven't filed it yet. I have some other Linux 2.6.9 vs 2.6.2x kernel related failures I have yet to file that may or may not be related. ecma/TypeConversion/9.3.1-3.js (bug 338756) may be what you are thinking of.
(In reply to comment #29) > > It is not necessary to even bother with NaN patterns, one can use an encoding > > that uses 61 bit for double values and allocates the rest of doubles on the > > heap. I.e. very small or very big doubles do not happen often in practical > > calculations and when they do, then it indicates about numerical instability in > > an algorithm. > > Can you do this without (a) extra overhead dealing with most doubles; (b) > requiring whole-double loads to discern type? Probably, but (b) is important > (ARM Thumb has a 16-bit data bus, I hear; the NaN idea allows the first two > bytes to be loaded for type tests and even boolean tests). I thought the suggestion was to use 64bit slots in the interpreter, not to make typedef double jsval. In the former case the repackaging of interpreter slots into global jsval would be necessary in any case and then NaN or any other tagging schema would bring the same complexity. The advantages of 64 bit tagged slots versus NaN-packing is that it would be faster to unpack pointer values. But if the proposal is about universal 64-bit jsval, then NaN packaging would probably win. But then how to deal with non-atomic nature of 64 bit values on 32 bit CPU???
Here is some thoughts on that parallel stack for doubles: Whenever a new double should be pushed into interpreter's stack, a new pseudo boolean value is pushed instead and the double is transfered to the parallel stack. Then all stack operations should be updated to deal with this extra slot repackaging the double value into jsdouble* when necessary. As an extra optimization some native methods like Math.max can be made aware of the extra slots to avoid allocation of double values on the hip.
(In reply to comment #31) > (In reply to comment #30) > > See bug 264912, but I remember a more recent Mactel bug. Can someone id it? > > Bug 338756? That's it! Thanks, Gavin. (In reply to comment #33) > I thought the suggestion was to use 64bit slots in the interpreter, not to make > typedef double jsval. See this bug's summary ;-). > But if the proposal is about universal 64-bit jsval, then NaN packaging would > probably win. But then how to deal with non-atomic nature of 64 bit values on > 32 bit CPU??? Atomicity is not an issue AFAIK. The JS_THREADSAFE design does not count on jsval read/write atomicity AFAIK. (In reply to comment #34) > Here is some thoughts on that parallel stack for doubles: > > Whenever a new double should be pushed into interpreter's stack, a new pseudo > boolean value is pushed instead and the double is transfered to the parallel > stack. Then all stack operations should be updated to deal with this extra slot > repackaging the double value into jsdouble* when necessary. As an extra > optimization some native methods like Math.max can be made aware of the extra > slots to avoid allocation of double values on the hip. A more focused fix for this bug, but it adds code footprint and critical path branch tests for pseudo-booleans. Would like a unifying and minimizing approach. /be
An implementation using a fat jsval just to support inline double values can not even be source-compatible with the existing embeddings. For example, currently JSVAL_TO_DOUBLE returns a pointer to jsdouble. That pointer can be freely stored in global structures etc. as it a pointer to GC thing. With the fat jsval JSVAL_TO_DOUBLE does not have such pointer and at best it can return a pointer to itself. But that pointer is not stable and can not be stored in global structures. Similarly, JS_NewDouble returns jsdouble* which requires the result to be GC-thing.
Created attachment 305307 [details] [diff] [review] buggy WIP using tamarin-tracing boxed value representation Good news is it works enough to run perfect.js, Y.js and the bitwise-and (so-called) SunSpider test. Bad news: it loses ~60% on bitwise-and. This was a risk I saw going on. Not sure I want to go further right now when a temp stack for the interpreter could win for cases where heap writes bear far fewer doubles. Cc'ing TT folks. /be
> Bad news: it loses ~60% on bitwise-and. Where's the time going?
Slowdown may or may not be applicable to TT: traces often know when they are operating on 32-bit data (as opposed to 64-bit boxed data) and optimize for that. The interpreter has similar (but less effective) optimizations. I don't know enough about the internals of SpiderMonkey to know whether similar issues may apply.
Tracing's great because it allows all sorts of partial evaluation along the trace, but the concern here is pure interpreter performance, since you have to have an interpreter handy for those benchmarks that don't trace well. I ran out of time this weekend but I'll fire up shark today and try to see what is going on. /be
Sorry, still no time to dig into this but I will get to it. In the mean time, Mike Pall's post at http://article.gmane.org/gmane.comp.lang.lua.general/44823 (cited recently in cdouble's blog) suggests why interpreters or JITs that do not schedule carefully may suffer: "A half-baked port to LuaJIT 1.x showed some speedups on most cache-sensitive benchmarks, but also some bad slowdowns on numeric benchmarks. This is caused by store-to-load forwarding stalls due to the many type checks which load the 32 bit MSW right after a 64 bit LSW/MSW store. This was another point which made me think about a better design for LuaJIT." More when I have time. /be
Jason, will you take this bug? Thanks, /be
Created attachment 321250 [details] [diff] [review] still-probably-buggy WIP using TT boxed value representation Taking. The attached patch unbitrots Brendan's patch and fixes some bugs. It passes js/tests.
Created attachment 322106 [details] [diff] [review] WIP 3 Two fairly big changes: * No longer create or cope with UINT jsvals anywhere. It was a big correctness risk, and removing the extra "else if" cases improved performance. (JSTAG_UINT is still used in jsarray.c's array enumeration code, to tag certain enumeration state data.) * Rewrite all the jsval.h macros for speed on Intel32. I'm no expert--it seems likely that more juice can be squeezed out of them. Got a nice speed boost, but not enough. WIP2: 16.6% slower aggregate, 46% slower on bitops WIP3: 11.5% slower aggregate, 23% slower on bitops There's more low-hanging fruit, so I'll spend one more day on this.
I spent Thursday on this, trying two things: * Making atoms 64-bit values (changing "JSAtom *" to "jsatom") in the naivest possible way-- which is to say, everywhere, without trying to figure out whether I could turn stuff like JSFunction::atom into "JSString *" instead. This took many hours and turned out to be a slight performance loss. * Inspired by another Mike Pall post, since ditching UINT was a nice win, I tried ditching INT. The trade-off is, eliminating a ton of branches (typecases--but also stuff like the overflow check in x++) and js_NewNumberValue becomes a no-op; vs. doing more floating-point math, especially more doubleToInt conversion for array and string indexing. Didn't finish this but I suspect it's no win. Very tempted to spend more time on the latter (I'm curious!), but I should cut my losses and work on stuff with a higher chance of winning. :-P
I wonder whether we should give this another try. With the JIT the test cases that took a bath should fare much better.