
JS doubles are GC-things; could most be interpreter-stack allocated?

Status: RESOLVED DUPLICATE of bug 549143

Product: Core
Component: JavaScript Engine
Priority: P5
Severity: normal
Reported: 15 years ago
Last modified: 6 years ago

People: (Reporter: tenthumbs, Unassigned)

Keywords: perf
Version: Trunk
Target Milestone: Future
Points: ---

Firefox Tracking Flags: (Not tracked)

Attachments: (4 attachments, 3 obsolete attachments)

(Reporter)

Description

15 years ago
It's certainly possible that I'm missing something and this is invalid 
but there does seem to be some very slow code in spidermonkey.

I was intrigued by a posting in n.p.p.performance about js timings but 
I'm interested in interpreter speed compared to real compiled code.

I built a number of spidermonkey versions with different gcc's (egcs, 
2.95.3, and 3.1.1) as well as different options (-O, -O2, -O3, -O3 
-fomit-frame-pointer). Using slightly modified test code, I translated 
it into C and built executables using the same compiler options, so 
there are matched js and C pairs.

Going in, I would have been impressed if js was 5 times slower and I 
would have found 10 a reasonable number. However, all the pairs are 
between 55 and 60 times slower, almost a full order of magnitude worse 
than expected. That's not good at all.

I do notice that js allocates about 7MB while running the test. Perhaps 
that has something to do with the time.
(Reporter)

Comment 1

15 years ago
Created attachment 94020 [details]
test js and C code

JS code inside comment of C code

Updated

15 years ago
Keywords: perf

Comment 2

15 years ago
Reassigning to Kenton; cc'ing Brendan, rogerl, shaver, jrgm -
Assignee: rogerl → khanson

Comment 3

15 years ago
Timing Results: C vs 20020804 trunk xpcshell

Win2k 1200 MHz Athlon

C
-----------------------
elapsed time: 0.036161s
elapsed time: 0.03638s
elapsed time: 0.036716s
elapsed time: 0.036495s
elapsed time: 0.036258s
elapsed time: 0.039186s

xpcshell
-----------------------
elapsed time = 2.183s
elapsed time = 2.213s
elapsed time = 2.233s
elapsed time = 2.193s
elapsed time = 2.193s
elapsed time = 2.223s

Comment 4

15 years ago
Did you really intend for the test program to carry a divide-by-zero error 
through all the calculations (for x), and for the value of y to always be zero?

(Although I'm not entirely surprised that js might be 60 times slower on 
a tight numeric loop. Being within an order of magnitude (a factor of 10) would 
be nice, but it's not always expected.)
What's the bug here?  Dynamic languages are 10x-100x slower than static ones
without aggressive optimizations that we haven't had time or need to do yet in
SpiderMonkey.  Try a better summary, or risk getting an INVALID resolution.

This might end up dup'd against an existing bug (bug 121414, or possibly bug
61898, although I think that one should be closed and new one(s) filed).

/be
I missed tenthumbs' comment about 7MB being allocated, but jrgm pointed the way:
the testcase is completely bogus.  It initializes x to 0, then proceeds to
generate 0/0 on the first iteration of each loop, which is NaN.  That NaN then
spreads throughout the rest of the iterations, requiring a new double-valued
GC-thing to be created.

This bug would be most useful if it were the successor to bug 61898, with the
summary I'm giving it now.

/be
Summary: Is JS really this slow? → JS doubles are GC-things; could most be interpreter-stack allocated?
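For context, here is a rough sketch of the kind of 32-bit tagged jsval layout in use at the time, showing why any numeric result that does not fit the 31-bit int tag -- every NaN included -- forces a GC-heap double allocation. The tag constants and helper names are simplified illustrations, not SpiderMonkey's actual jsapi.h definitions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Rough sketch of a 32-bit-era tagged jsval: a few low bits carry the type
 * tag, so only ~31-bit integers can be stored inline.  Tag values and names
 * are illustrative, not the real jsapi.h definitions. */
typedef uintptr_t sketch_val;

#define SKETCH_TAG_INT       ((sketch_val)0x1)   /* low bit set: inline int     */
#define SKETCH_TAG_DOUBLE    ((sketch_val)0x2)   /* low bits: pointer to a cell */

#define SKETCH_INT_FITS(i)   ((i) >= -(1L << 30) && (i) < (1L << 30))
#define SKETCH_INT_TO_VAL(i) ((((sketch_val)(i)) << 1) | SKETCH_TAG_INT)

/* Any result outside the 31-bit int range -- and every NaN -- needs a heap
 * cell that the GC must later collect.  malloc stands in for the GC
 * allocator; its alignment leaves the low tag bits clear. */
static sketch_val sketch_double_to_val(double d)
{
    double *cell = malloc(sizeof *cell);
    *cell = d;
    return (sketch_val)cell | SKETCH_TAG_DOUBLE;
}
```

In the benchmark above, x is NaN on every iteration, so every assignment takes that slow allocation path.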
(Reporter)

Comment 7

15 years ago
Well, make that x=2.0, y=3.0 and it's not as bogus as you think. That 
just moves the numbers into the 50-55 range, which isn't much of a 
change. In any case, infinities are allowed, so I don't see what's so 
bogus anyway.

Now, would you like a new bug on recursive functions that are an order 
of magnitude slower than the one here?
tenthumbs: what's bogus is that the partial evaluation work I'm doing will soon
make your JS benchmark run much faster than the C version.  A more realistic
benchmark wouldn't degenerate into a partial evaluation that becomes total and
computes x and y at compile (JIT) time.

Why don't you mail me your recursive function benchmark and we can figure out
what's going on.

Again, you're not going to get Perl, Python, JS, or other dynamic languages
running as fast as C code, or even within 10x of it, without a lot of
Self-ish code specialization, partial evaluation, JIT-compiling, and code-caching work.

/be
Assignee: khanson → brendan
Created attachment 94530 [details] [diff] [review]
patch to avoid allocating NaNs and Infinities

This patch will slow down double allocation slightly (by the cost of the
failing JSDOUBLE_IS_FINITE test), but it will speed up the bogo-benchmark here
a bit. If you change x's initial value to 1, then it slows down that less
bogus, but still vacuous, variant benchmark.

It seems to me the common case is not to generate lots of NaNs and infinities,
so we shouldn't check in this patch.  But I thought I'd post it.

/be
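A minimal sketch of the approach the attachment describes: return per-runtime singletons for non-finite values before falling back to a fresh GC double. The helper name and the js_NewDouble fallback signature are assumptions; rt->jsNegativeInfinity and rt->jsPositiveInfinity are the runtime fields named in the review comment below, and rt->jsNaN and JSDOUBLE_IS_NaN are assumed to exist alongside them.

```c
/* Sketch only -- not the attached patch.  Singleton returns for NaN and
 * +/-Infinity; the common finite case pays just the failing
 * JSDOUBLE_IS_FINITE test. */
static jsdouble *
new_double_maybe_shared(JSContext *cx, jsdouble d)
{
    JSRuntime *rt = cx->runtime;

    if (!JSDOUBLE_IS_FINITE(d) && rt->state == JSRTS_UP) {
        if (JSDOUBLE_IS_NaN(d))
            return rt->jsNaN;                 /* assumed per-runtime singleton */
        return (d > 0)
               ? rt->jsPositiveInfinity
               : rt->jsNegativeInfinity;
    }
    return js_NewDouble(cx, d);               /* normal GC-heap path (assumed signature) */
}
```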

Comment 10

15 years ago
+        if (rt->jsNegativeInfinity && *rt->jsPositiveInfinity == d)
+            return rt->jsNegativeInfinity;

should be

+        if (rt->jsNegativeInfinity && *rt->jsNegativeInfinity == d)
+            return rt->jsNegativeInfinity;

Couldn't you use one test of 'rt->state == JSRTS_UP'?
Created attachment 94949 [details] [diff] [review]
duh, thanks rogerl

I still don't think this patch should go in, because the dynamic frequency of
non-finite values is (or should be, in well-behaved numerical JS) so low. 
Anyone have contrary thoughts or measurements?

/be
Attachment #94530 - Attachment is obsolete: true

Comment 12

15 years ago
cc'ing rginda, waldemar in case they have any contrary thoughts on this -
I instrumented the patch to count non-finite values passed to js_NewDouble, and
found none (0) when starting Mozilla, loading slashdot.org in a browser window.

Anyway, the patch wasn't meant to be checked in, it was more of a discussion
device.  So I'm going to keep talking to myself :-).

Allocating doubles on the interpreter stack is harder.  It would seem to require
knowing the type at compile (or at partial evaluation) time, or else keeping a
parallel type stack and checking it on every pop.  The former is something JS2
makes easy, the latter is going to hurt JS1 performance.  I argue that it is
better to focus on more general wins (such as might be gained by fixing bug
121414), instead of spending too much effort on one cost measured by a synthetic
benchmark.

/be

Comment 14

15 years ago
If improving the handling of NaN and Inf implies a cost for the general numeric
case, then, no, I don't think this should land.
Just as I'd submitted the last comment, I followed a link from /. and the number
of non-finite new double GC-thing allocations saved by the patch went from 0 to
6 (out of only 213 new doubles created since browser startup, mostly by chrome
JS -- note again that the trade-off of allocating doubles in the GC heap only
when needed, and using tagged ints, seems to win).

/be
(Reporter)

Comment 16

15 years ago
I agree that handling non-finite values in software isn't worth the
trouble. Trap it in the hardware if you really want to.

I think you're all getting hung up on infinitities. They happen a lot
more often than you think. Visit a number of web pages and see how many
appear. I suspect that if you subtract out the chrome the percentage
will be surprisingly high.

A floating point value is essentially always a valid value. The fact
that the hardware processes some values more slowly than others isn't
particularly relevant because it affects all processes equally. The real
issue is that spidermonkey is wasteful in both time and space on
floating point.
tenthumbs, what in tarnation are you talking about?

>I agree that handling non-finite values in software isn't worth the
>trouble. Trap it in the hardware if you really want to.

Trap what in hardware?  In discussing the moot patch, we were talking about how
JS allocates doubles, and whether it's worth returning per-runtime singletons
for NaN and +/-Infinity -- so we were clearly talking only about software.

>I think you're all getting hung up on infinitities.

(Freudian slip? ;-)

> They happen a lot more often than you think.

Evidence?

I already reported my results here, where are yours?

>Visit a number of web pages and see how many appear. I suspect that if you 
>subtract out the chrome the percentage will be surprisingly high.

6 non-finite values out of 213 doubles, out of thousands to millions of tagged
ints, sounds like a small number to me.  What's surprising?  What web pages do
you mean?

>A floating point value is essentially always a valid value.

Strawman -- who said otherwise?  Not that I know what "valid" means here.

>The fact that the hardware processes some values more slowly

No one ever said that hardware processes non-finite values more slowly, and I'm
not sure popular FPUs do or do not penalize such values.  This too seems to have
nothing to do with the matter at hand, or what people have written in this bug's
comments.

>than others isn't particularly relevant because it affects all processes
>equally.

Again, not that anyone claimed non-finite values were slow or bogus to test
(re-read my comment #8 about partial evaluation to understand why the benchmark
is bogus whether it computes a NaN repeatedly in x, or a small fraction in x), it
seems to me you have failed to demonstrate that non-finite values occur
frequently, or with any frequency sufficient to justify any software (or
hardware, for that matter, if there are hardware penalties in current FPUs)
special-case optimizations for those values.

>The real issue is that spidermonkey is wasteful in both time and space on
>floating point.

No, the real issue is that your synthetic benchmark has nothing to do with
Mozilla or JS performance on real world scripts, pending actual measurements and
examples from you.

If you're looking to prove that JS makes Mozilla slow because JS allocates
doubles from its GC heap, you have to show some evidence (poor man's profiling by
ctrl-c in a debugger that finds lots of js_NewDouble stack backtraces, ad-hoc
instrumentation such as I added to find that only 6 non-finite values are
created browsing /. and one of its article links, or jprof/quantify profiles as
I strongly suggested in email to you) of frequent JS double GC-thing allocations.

Until then, talk is cheap, and big talk (specifically, your use of "wasteful",
which cries for real-world cost comparisons, as it begs questions about
trade-offs that can't be answered without profiling data, or at least some
global measurement of doubles allocated per unit time browsing an average url
reference string, or doing some other user-level task) is even cheaper.

/be
Next alpha.

/be
Target Milestone: --- → mozilla1.4alpha
I owe tenthumbs a bug about empty function invocation overhead, vs. perl empty
subroutine.  Tenthumbs, feel free to file it on me.

/be
Status: NEW → ASSIGNED
Target Milestone: mozilla1.4alpha → mozilla1.5alpha

Updated

14 years ago
Target Milestone: mozilla1.5alpha → Future

Updated

13 years ago
Priority: -- → P5

Updated

13 years ago
OS: Linux → All
Hardware: PC → All

Updated

11 years ago
QA Contact: pschwartau → general

Updated

10 years ago
Duplicate of this bug: 384755

Updated

10 years ago
Blocks: 117611
An alternative idea from Ed Smith of Adobe: change jsval to 64 bits and use NaN patterns to tag non-double values except for the single, canonical quiet NaN that JS needs.

E.g., given 64-bit value x and exponent, mantissa, and signbit functions:

x                exponent(x)   mantissa(x)       signbit(x)
===============+===========================================
non-NaN double | not all ones  (   one or both non-zero   )
canonical NaN  | all ones      tag(1)            1
null           | all ones      tag(0)            1
undefined      | all ones      tag(1)            0
boolean        | all ones      tag(2) + x        0
int (48 bits!) | all ones      tag(4) + x        0
string         | all ones      tag(5) + ptr(x)   0
object         | all ones      tag(6) + ptr(x)   0
private        | all ones      tag(7) + ptr(x)   0
others (namespaces, etc.) possible...

The first two bytes contain the sign bit, 11 exponent bits, and 4 tag bits. Thus a private pointer can be up to 48 bits, which should be enough for practical 64-bit address space support. Thus tag(t) is t << 48 and ptr(x) is (x & bitmask(48)).

Type tests can load just the first two bytes.

There may be a better encoding than the sketch above, I just made it up! In particular perhaps null should be adjacent in tag terms to the reference types (string, object, etc.). 

Comments?

/be
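A rough C sketch of the encoding above, purely to make the bit layout concrete: the 4-bit tag sits in bits 48-51 (so tag(t) is t << 48) and payloads occupy the low 48 bits. All names here are invented for illustration, and the infinity / non-canonical NaN cases are glossed over just as in the table.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t boxed_val;                        /* hypothetical 64-bit jsval */

#define BOX_EXP_MASK   UINT64_C(0x7FF0000000000000) /* 11 exponent bits         */
#define BOX_TAG_SHIFT  48
#define BOX_PAYLOAD    ((UINT64_C(1) << BOX_TAG_SHIFT) - 1)
#define BOX_TAG_OBJECT UINT64_C(6)                  /* tag(6) from the table    */

static boxed_val box_double(double d)               /* non-NaN doubles unchanged */
{
    boxed_val v;
    memcpy(&v, &d, sizeof v);
    return v;
}

static boxed_val box_object(const void *p)          /* pointer fits in 48 bits  */
{
    return BOX_EXP_MASK | (BOX_TAG_OBJECT << BOX_TAG_SHIFT)
         | ((uint64_t)(uintptr_t)p & BOX_PAYLOAD);
}

/* Type tests need only the sign, exponent, and tag bits. */
static int is_nonnan_double(boxed_val v)
{
    return ((v >> BOX_TAG_SHIFT) & 0x7FF0) != 0x7FF0;  /* exponent not all ones */
}

static unsigned tag_of(boxed_val v)
{
    return (unsigned)((v >> BOX_TAG_SHIFT) & 0xF);
}
```

(The sketch shifts rather than doing a partial load, but all the type information lives in the high-order two bytes, which is the point about cheap type tests.)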
If 64-bit stack items cost too much on 32-bit systems I would be surprised. Control operators that push and pop the stack could use a separate control stack, but we already have precedent for pushing and popping two items. We may also end up with two PCs: one for fat opcodes as jsinterp.c uses today and the other for a lower-level code that can be traced and JITted efficiently, such that the heavy runtime type tax can be traced away (with guards).

/be
boolean        | all ones      tag(2) + x        0
int (48 bits!) | all ones      tag(4) + x        0

Oops, meant to include the boolean value in the tag bits, then thought better of it. Could use the sign bit. The idea is to avoid loading 48 bits just to decode true or false.

/be

Comment 24

10 years ago
The only problem I see with Ed Smith's suggestion is that the technique may require revision or worse on machines that use different floating-point representations. Honestly, I think that's a non-issue, but it probably should at least get mentioned ;)

Comment 25

10 years ago
Rhino's pure interpreter mode uses a separate double stack to hold numerical values. That made various math tests run 20%-30% faster. Similarly, when Rhino compiles JS down to JVM bytecode, it allocates separate JVM registers to hold numeric values.

Comment 26

10 years ago
(In reply to comment #21)
> An alternative idea from Ed Smith of Adobe: change jsval to 64 bits and use NaN
> patterns to tag non-double values except for the single, canonical quiet NaN
> that JS needs.

It is not necessary to even bother with NaN patterns: one can use an encoding that keeps 61 bits for double values and allocates the rest of the doubles on the heap. Very small or very big doubles do not happen often in practical calculations, and when they do, it indicates numerical instability in the algorithm.
This is interesting, but is there a real-world site that should expect as much as a 1% performance boost from the proposed change?

On Intel 32-bit, won't it be bad for each jsval to soak up two registers instead of one?

Comment 28

10 years ago
(In reply to comment #27)
> This is interesting, but is there a real-world site that should expect as much
> as a 1% performance boost from the proposed change?

Until integral numeric types are widely available, sites that implement cryptographic functions on the client side will see a decent performance increase.

Login pages at Google and Yahoo do this. And the site Meebo.com makes heavy use of cryptography implemented in Javascript, IIRC.
(In reply to comment #24)
> The only problem I see with Ed Smith's suggestion is that the technique may
> require revision or worse on machines that use different floating-point
> representations. Honestly, I think that's a non-issue, but it probably should
> at least get mentioned ;)

ECMA-262 requires IEEE double, and this means hardware that uses a different (or even more precise, i.e., IEEE 754 80-bit extended precision) representation does not conform without painful-to-performance extra software checks. This is a problem on Mac Intel laptops today, since Apple weirdly splits numeric chores across x87 and SSE and uses extended precision on the former. We have not found a way to make it stop (see bug ).

Short story is that most sane hardware can use doubles without any overhead.


(In reply to comment #26)
> (In reply to comment #21)
> > An alternative idea from Ed Smith of Adobe: change jsval to 64 bits and use NaN
> > patterns to tag non-double values except for the single, canonical quiet NaN
> > that JS needs.
> 
> It is not necessary to even bother with NaN patterns: one can use an encoding
> that keeps 61 bits for double values and allocates the rest of the doubles on
> the heap. Very small or very big doubles do not happen often in practical
> calculations, and when they do, it indicates numerical instability in the
> algorithm.

Can you do this without (a) extra overhead dealing with most doubles; (b) requiring whole-double loads to discern type? Probably, but (b) is important (ARM Thumb has a 16-bit data bus, I hear; the NaN idea allows the first two bytes to be loaded for type tests and even boolean tests).

Jason: it's not just crypto. This bug was filed a long time ago, and the workload that crosses over the INT_FITS_IN_JSVAL threshold is only rising, in part due to benchmark wars but also due to real honest-to-goodness graphics and other numerically more intensive programming in JS.

The cost of not optimizing doubles is unusually high unless you work hard on stack-like GC; the cost of stacking 64-bit values is relatively small on desktop CPUs. We would have to measure to be sure, but I think we have hopeful evidence.

In a past life I moved a simulator for a 64-bit CPU to use 64-bit host code, by implementing MIPS R4000 64-bit support in GCC (my patches and another person's were merged for the final support), targeting the first SGI box to use the R4K. The speedup however was marginal, because other costs than load and store traffic dominated. The flip side was that the simulator targeting a 32-bit MIPS host ran well enough given the pipelining of the day and the other instructions in the schedule that filled the pipe.

CPUs have become even more pipelined. Igor's data from Rhino is suggestive.

Your point is well-taken for tagged ints: they may well want to remain 32 (not 31) bit, so only the type tag (if needed; JITted code would often know the on-trace precise type) and the low 32 bits of the double (and a 32-bit register to hold it) would be needed. This would reduce register pressure and memory bandwidth for all the data that stays in int-tagged 32-bit jsvals on 32-bit CPUs today.

/be
(In reply to comment #29)
> This is a problem
> on Mac Intel laptops today, since Apple wierdly splits numeric chores across
> x87 and SSE and uses extended precision on the former. We have not found a way
> to make it stop (see bug ).

See bug 264912, but I remember a more recent Mactel bug. Can someone id it?

/be
(In reply to comment #30)
> See bug 264912, but I remember a more recent Mactel bug. Can someone id it?

Bug 338756?

Comment 32

10 years ago
midair...

ecma/Date/15.9.5.31-1.js fails on 1.8.1|1.9.0 mactel but it may not be related to this and I haven't filed it yet. I have some other Linux 2.6.9 vs 2.6.2x kernel related failures I have yet to file that may or may not be related.

ecma/TypeConversion/9.3.1-3.js (bug 338756) may be what you are thinking of. 

Comment 33

10 years ago
(In reply to comment #29)
> > It is not necessary to even bother with NaN patterns: one can use an encoding
> > that keeps 61 bits for double values and allocates the rest of the doubles on
> > the heap. Very small or very big doubles do not happen often in practical
> > calculations, and when they do, it indicates numerical instability in the
> > algorithm.
> 
> Can you do this without (a) extra overhead dealing with most doubles; (b)
> requiring whole-double loads to discern type? Probably, but (b) is important
> (ARM Thumb has a 16-bit data bus, I hear; the NaN idea allows the first two
> bytes to be loaded for type tests and even boolean tests).

I thought the suggestion was to use 64-bit slots in the interpreter, not to make jsval a typedef for double. In the former case the repackaging of interpreter slots into the global jsval format would be necessary anyway, and then NaN tagging or any other tagging scheme would bring the same complexity. The advantage of 64-bit tagged slots versus NaN-packing is that it would be faster to unpack pointer values.

But if the proposal is about a universal 64-bit jsval, then NaN packing would probably win. But then how do we deal with the non-atomic nature of 64-bit values on a 32-bit CPU?

Comment 34

10 years ago
Here are some thoughts on that parallel stack for doubles:

Whenever a new double should be pushed onto the interpreter's stack, a pseudo-boolean value is pushed instead and the double is transferred to the parallel stack. Then all stack operations should be updated to deal with this extra slot, repackaging the double value into a jsdouble* when necessary. As an extra optimization, some native methods like Math.max can be made aware of the extra slots to avoid allocating double values on the heap.
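A minimal sketch of that scheme, with everything (names, the marker encoding, fixed-size stacks) invented here for illustration: the value stack gets a pseudo-boolean marker while the double payload rides on a side stack at the same depth, so no GC double is allocated while the value stays on the stack.

```c
#include <stddef.h>
#include <stdint.h>

#define PSEUDO_DOUBLE 0xFFFFFFFEu      /* assumed otherwise-unused tagged pattern */

typedef struct {
    uint32_t vals[256];                /* tagged value stack (sketch-sized)       */
    double   dbls[256];                /* parallel stack for double payloads      */
    size_t   sp;                       /* shared stack pointer                    */
} SketchStack;

static void push_double(SketchStack *st, double d)
{
    st->vals[st->sp] = PSEUDO_DOUBLE;  /* marker on the main stack                */
    st->dbls[st->sp] = d;              /* payload on the side stack               */
    st->sp++;
}

/* Returns 1 and pops the payload if the top of stack is a parallel double;
 * otherwise leaves the stack alone so the caller can handle other tags. */
static int pop_if_double(SketchStack *st, double *out)
{
    if (st->sp == 0 || st->vals[st->sp - 1] != PSEUDO_DOUBLE)
        return 0;
    st->sp--;
    *out = st->dbls[st->sp];
    return 1;
}
```

Only when such a value is stored into an object slot or otherwise leaves the stack would it need to be repackaged into a heap jsdouble*, which is the repackaging step mentioned above.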
(In reply to comment #31)
> (In reply to comment #30)
> > See bug 264912, but I remember a more recent Mactel bug. Can someone id it?
> 
> Bug 338756?

That's it! Thanks, Gavin.

(In reply to comment #33)
> I thought the suggestion was to use 64-bit slots in the interpreter, not to
> make jsval a typedef for double.

See this bug's summary ;-).

> But if the proposal is about a universal 64-bit jsval, then NaN packing would
> probably win. But then how do we deal with the non-atomic nature of 64-bit
> values on a 32-bit CPU?

Atomicity is not an issue AFAIK. The JS_THREADSAFE design does not count on jsval read/write atomicity AFAIK.


(In reply to comment #34)
> Here are some thoughts on that parallel stack for doubles:
> 
> Whenever a new double should be pushed onto the interpreter's stack, a
> pseudo-boolean value is pushed instead and the double is transferred to the
> parallel stack. Then all stack operations should be updated to deal with this
> extra slot, repackaging the double value into a jsdouble* when necessary. As an
> extra optimization, some native methods like Math.max can be made aware of the
> extra slots to avoid allocating double values on the heap.

A more focused fix for this bug, but it adds code footprint and critical path branch tests for pseudo-booleans. Would like a unifying and minimizing approach.

/be

Comment 36

10 years ago
An implementation using a fat jsval just to support inline double values cannot even be source-compatible with existing embeddings.

For example, currently JSVAL_TO_DOUBLE returns a pointer to jsdouble. That pointer can be freely stored in global structures etc., since it is a pointer to a GC thing. With a fat jsval, JSVAL_TO_DOUBLE has no such pointer to return; at best it could return a pointer into the jsval itself, but that pointer is not stable and cannot be stored in global structures.

Similarly, JS_NewDouble returns a jsdouble*, which requires the result to be a GC thing.
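To make the compatibility point concrete, here is a hypothetical embedding fragment (the function and variable names are invented; JSVAL_IS_DOUBLE and JSVAL_TO_DOUBLE are the classic API being discussed): with GC-heap doubles the returned jsdouble* can be cached globally, as long as the value stays rooted, but with a fat self-contained jsval there is no heap cell for such a pointer to refer to.

```c
/* Hypothetical embedding code illustrating the contract that fat jsvals
 * would break.  Today JSVAL_TO_DOUBLE yields a pointer to a GC-heap cell,
 * so it can be stashed away (assuming the value is kept rooted). */
static jsdouble *cached_ratio;                 /* survives across calls */

static void
remember_ratio(JSContext *cx, jsval v)
{
    (void) cx;                                 /* rooting elided in this sketch */
    if (JSVAL_IS_DOUBLE(v))
        cached_ratio = JSVAL_TO_DOUBLE(v);     /* stable only because the double
                                                  lives in the GC heap */
}
```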
Created attachment 305307 [details] [diff] [review]
buggy WIP using tamarin-tracing boxed value representation

Good news is it works enough to run perfect.js, Y.js and the bitwise-and (so-called) SunSpider test. Bad news: it loses ~60% on bitwise-and. This was a risk I saw going in. Not sure I want to go further right now when a temp stack for the interpreter could win for cases where heap writes bear far fewer doubles.

Cc'ing TT folks.

/be

Updated

10 years ago
Attachment #305307 - Attachment description: buggy WIP using tamarin-tracing boxed value represnetation → buggy WIP using tamarin-tracing boxed value representation

Comment 38

10 years ago
> Bad news: it loses ~60% on bitwise-and.

Where's the time going?

Comment 39

10 years ago
Slowdown may or may not be applicable to TT: traces often know when they are operating on 32-bit data (as opposed to 64-bit boxed data) and optimize for that. The interpreter has similar (but less effective) optimizations. I don't know enough about the internals of SpiderMonkey to know whether similar issues may apply.
Tracing's great because it allows all sorts of partial evaluation along the trace, but the concern here is pure interpreter performance, since you have to have an interpreter handy for those benchmarks that don't trace well.

I ran out of time this weekend but I'll fire up shark today and try to see what is going on.

/be
Sorry, still no time to dig into this but I will get to it. In the mean time, Mike Pall's post at

http://article.gmane.org/gmane.comp.lang.lua.general/44823

(cited recently in cdouble's blog) suggests why interpreters or JITs that do not schedule carefully may suffer:

"A half-baked port to LuaJIT 1.x showed some speedups on most
cache-sensitive benchmarks, but also some bad slowdowns on
numeric benchmarks. This is caused by store-to-load forwarding
stalls due to the many type checks which load the 32 bit MSW
right after a 64 bit LSW/MSW store. This was another point which
made me think about a better design for LuaJIT."

More when I have time.

/be
Jason, will you take this bug? Thanks,

/be
Created attachment 321250 [details] [diff] [review]
still-probably-buggy WIP using TT boxed value representation

Taking.

The attached patch unbitrots Brendan's patch and fixes some bugs.  It passes js/tests.
Assignee: brendan → jorendorff
Attachment #305307 - Attachment is obsolete: true
Created attachment 322106 [details] [diff] [review]
WIP 3

Two fairly big changes:

* No longer create or cope with UINT jsvals anywhere.  It was a big correctness risk, and removing the extra "else if" cases improved performance.  (JSTAG_UINT is still used in jsarray.c's array enumeration code, to tag certain enumeration state data.)

* Rewrite all the jsval.h macros for speed on Intel32.  I'm no expert--it seems likely that more juice can be squeezed out of them.

Got a nice speed boost, but not enough.

WIP2:  16.6% slower aggregate, 46% slower on bitops
WIP3:  11.5% slower aggregate, 23% slower on bitops

There's more low-hanging fruit, so I'll spend one more day on this.
Attachment #321250 - Attachment is obsolete: true
Created attachment 322107 [details]
WIP 3 sunspider numbers (vs. mozilla-central trunk)
I spent Thursday on this, trying two things:

* Making atoms 64-bit values (changing "JSAtom *" to "jsatom") in the naivest possible way-- which is to say, everywhere, without trying to figure out whether I could turn stuff like JSFunction::atom into "JSString *" instead.  This took many hours and turned out to be a slight performance loss.

* Inspired by another Mike Pall post, since ditching UINT was a nice win, I tried ditching INT.  The trade-off: we eliminate a ton of branches (type cases, but also stuff like the overflow check in x++) and js_NewNumberValue becomes a no-op, versus doing more floating-point math, especially more doubleToInt conversion for array and string indexing.  Didn't finish this, but I suspect it's no win.

Very tempted to spend more time on the latter (I'm curious!), but I should cut my losses and work on stuff with a higher chance of winning. :-P
Assignee: jorendorff → general
Status: ASSIGNED → NEW
Flags: wanted1.9.2?

Comment 47

9 years ago
I wonder whether we should give this another try. With the JIT the test cases that took a bath should fare much better.

Updated

7 years ago
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 549143
Flags: wanted1.9.2?