Closed Bug 421864 Opened 16 years ago Closed 12 years ago

Interpreter creates too many doubles

Categories

(Core :: JavaScript Engine, enhancement, P2)

enhancement

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: x00000000, Unassigned)

References

Details

(Keywords: perf)

Attachments

(4 files, 19 obsolete files)

754 bytes, text/html
Details
5.36 KB, text/plain
Details
128.15 KB, image/svg+xml
Details
66.19 KB, patch
Details | Diff | Splinter Review
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.12) Gecko/20080302 SeaMonkey/1.1.8
Build Identifier: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9b5pre) Gecko/2008031006 Minefield/3.0b5pre

A expression like

  x += a + b + c + d + e + f + g + h;

where a .. h are doubles creates a new double for every intermediate step.
Instead, it could modify most of them in place.

I was dissatisfied with floating point performance in Mozilla, looked at the
code and found this possible optimization. Actually floating point performance
isn't that bad anymore; it has improved very much since beta 3.


Reproducible: Always

Steps to Reproduce:
Code & testcase
Actual Results:  
Interpreter creates too much doubles

Expected Results:  
Should recycle the doubles
Attached patch patch (obsolete) — Splinter Review
This patch passes the JS test suite and works in the browser, but I don't know
much about GC, so I'm not sure if it is really safe. It assumes that doubles
returned by functions can be modified in place, but I had to fix several
functions where this wasn't true.

After I wrote the patch I noticed that it will collide with bug 419632.
I also found bug 161110, which will make it obsolete. So I don't know if this
patch is useful.
Attached file testcase
Timings on my Athlon64, 2.2 GHz, Linux amd64 system:

Seamonkey 1.1.8     14.0  s  (built with Gentoo ebuild, amd64)
Firefox 3 alpha 7   11.7  s  (x86 binary)
Firefox 3 beta 1    11.0  s  (x86 binary)
Firefox 3 beta 3    10.2  s  (x86 binary)
Trunk                1.28 s  (CVS build, amd64)
Trunk + patch         .74 s  (CVS build, amd64)
Trunk + bug 419632   1.36 s  (CVS build, amd64)
Opera 9.10           1.69 s  (x86 binary)
Konqueror 3.5.6      6.48 s  (built with Gentoo ebuild, amd64)

Improvement with the patch is high, but the testcase tests exactly the
optimizations of the patch. Real world scripts won't benefit that much.

You can change the loop count of the testcase by adding a fragment like
#x:param(1e6) to the URL (1e6 is the default).
Good thinking -- return values are single-reference temporaries by definition.

Igor, can you help get this proven safe and into landable shape?

/be
Assignee: general → igor
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Interpreter creates too much doubles → Interpreter creates too many doubles
Keywords: perf
Version: unspecified → Trunk
In a rush at a conference, but I've thought about such schemes before. Is this patch correct when you do the following:

x = a + b;
y = x;
x += c + d + e;

given a-e in 1-5, is y still 3 after the += completes?

With a copy-on-write bit in gcflags, we could avoid copying NaN, etc. In general we'd need a write barrier, though. We have one for objects now, but not for stack frame slots. It is not free to add to, so the cost might undermine the win.

/be
AFAICS the idea from this patch is only applicable to cases of chained math operations like in x + y + z where it can eliminate one temporary. Cases like x + y can not be optimized since there is no guarantee that the result of x or y is not aliased due to SpiderMonkey API compatibility.

On the other hand numerical-intensive code does contain various forms of multiply-add. nbody and ray-tracing tests from the sunspider benchmark are littered with samples like:

vals[0] * v[0] + vals[1] * v[1] + vals[2] * v[2] + vals[3];

v1[1] * v2[2] - v1[2] * v2[1]

bodyi.vx -= dx * bodyj.mass * mag;

This creates a lot of temporaries.  
Severity: normal → enhancement
(In reply to comment #4)
> x = a + b;
> y = x;
> x += c + d + e;
> 
> given a-e in 1-5, is y still 3 after the += completes?

Yes, the last line translates to

  x = x + ((c + d) + e);

and the patch only reuses temporary results on the left hand side of an expression. So it saves just 1 double here. However, it could be extended to also reuse temporary results on the right hand side. I hadn't thought about that.
I found a case where the patch fails:

  var x = .5;
  var y = .5;
  var z = (function () { return x })() + .5;

x, y and z will all three be 1 afterwards. Interpreted functions do not always return temporary values as I had assumed.

NB: If you test, you have to use values that are not small integers. The patch doesn't affect integer arithmetic (but .5 + .5 won't be converted to an integer).
Comment 7 shows the aliasing problem I gave a different example of in comment 4. The copy-on-write idea is good, it requires a write barrier. Simple flow-unaware static analysis could be used to select a prefix opcode for arithmetic and other ops, to try to mutate left or right operands. The + operator can apply to strings too, so this idea could be used to optimize string concatenation too.

/be
The oh-so-close patch in https://bugzilla.mozilla.org/show_bug.cgi?id=419743 does similar static analysis to pick an optimized bytecode for string concatenation, so as to amortize allocation and copy costs better.  Might be interesting to do the same thing here, and just look for multiple runs of temporary-generating ops, and then picking some different bytecodes that operate on a pigeonhole unrooted double instead of putting a jsval on the stack?

The addition cases are troublesome, but at least in cases like this

vals[0] * v[0] + vals[1] * v[1] + vals[2] * v[2] + vals[3];

we can know that the addition's operands are going to be numbers, since * always produces a number -- except for possibly vals[3], but even then the in-place operate case would be OK.

Yeah, if we could get some JSOP_LOADTEMP and JSOP_ADDTEMP/MULTEMP/DIVTEMP things going, then we could probably work well for strings as well, since the string in the temporary slot could be safely realloced.  Oh, but that wouldn't work as well for subexpressions, I guess, since the pigeonhole would get reused.  Hrmph.
I had based this on Igor's patch for bug 421274, and would like to do so again if that patch re-lands soon. It's a small change.

This approach can be used for strings too. The virtual register numbers would map at runtime to either the double accumulators, or the string accumulators, for +. It may also win to let locals (var and let in functions) refer to tagged pointers to accumulator registers, but this requires escape analysis to avoid imposing a write barrier at runtime.

/be
Assignee: igor → brendan
Status: NEW → ASSIGNED
Attachment #309282 - Flags: review?(igor)
Igor, mind if I steal this bug?

/be
Hrm, I already did -- sorry, didn't mean to.

/be
Attached patch picked a few nits (obsolete) — Splinter Review
This passes js tests (so did the previous patch).

/be
Attachment #308379 - Attachment is obsolete: true
Attachment #309282 - Attachment is obsolete: true
Attachment #309293 - Flags: review?(igor)
Attachment #309282 - Flags: review?(igor)
js> function f() {x+=a*b + c}
js> dis(f)
main:
00000:  bindname "x"
00003:  dup
00004:  getxprop "x"
00007:  name "a"
00010:  name "b"
00013:  mul 1
00015:  name "c"
00018:  add 1
00020:  add 0
00022:  setname "x"
00025:  pop
00026:  stop

Source notes:
  0:    20 [  20] xdelta  
  1:    20 [   0] assignop

Without this patch, the add at 18 had regno 0 as its immediate. The fix is to FreeNodeReg, it should not reset pn_reg. Each accumulator is single-use, so we record uses in tree nodes. Freeing a register from busy state does not require erasing recorded uses in the tree.

/be
Attachment #309293 - Attachment is obsolete: true
Attachment #309450 - Flags: review?(igor)
Attachment #309293 - Flags: review?(igor)
Igor, jimb, anyone else: I would appreciate tips on compiler options or pragmas to guarantee double alignment of JSStackFrame.dregs (soon JSFrameRegs.dregs), to avoid the runtime alignment done within the overallocated fp->dregs space by FRAME_DREGP. We'd like to cover GCC, ICC, and MSVC of course.

/be
Surprised we don't have a testcase for these. Decompiling *=, etc., requires one-opcode look-behind. The ancient decompiler ass-u-med that lastop was exactly one byte behind.

/be
Attachment #309450 - Attachment is obsolete: true
Attachment #309452 - Flags: review?(igor)
Attachment #309450 - Flags: review?(igor)
Comment on attachment 309452 [details] [diff] [review]
fix decompilation of op= (assignment ops)

>@@ -2664,24 +2664,36 @@ js_TraceStackFrame(JSTracer *trc, JSStac
....
>+        for (vp = fp->spbase, end = vp + nslots; vp < end; vp++) {
>+            v = *vp;
>+            if ((size_t)(v - (jsval)fp->dregs) < sizeof fp->dregs)
>+                continue;
>+            if (JSVAL_IS_TRACEABLE(v)) {
>+                JS_SET_TRACING_INDEX(trc, "operand", vp - fp->spbase);
>+                JS_CallTracer(trc, JSVAL_TO_TRACEABLE(v),
>+                              JSVAL_TRACE_KIND(v));
>+            }
>+        }
>     }

This is too hacky. Given the amount of checks in JSVAL_IS_TRACEABLE(v) it is better to write the body of the loop as:

	    if (JSVAL_IS_OBJECT(v) && v != JSVAL_NULL) {
	        traceKind = JSTRACE_OBJECT;
	    } else if (JSVAL_IS_DOUBLE(v)) {
                jddouble *dp = JSVAL_TO_DOUBLE(v);
	        if ((size_t) (dp - fp->dregs) < JS_ARRAY_LENGTH(fp->dregs))
	            continue;
	        traceKind = JSTRACE_DOUBLE;
	    } else if (JSVAL_IS_STRING(v)) {
	        traceKind = JSTRACE_STRING;
	    } else {
	        JS_ASSERT(!JSVAL_IS_TRACEABLE(v));
	        continue;
	    }
            JS_SET_TRACING_INDEX(trc, "operand", vp - fp->spbase);
            JS_CallTracer(trc, JSVAL_TO_TRACEABLE(v), traceKind);

It also makes things ready for string registers.

>Index: jsinterp.c
>===================================================================
>RCS file: /cvsroot/mozilla/js/src/jsinterp.c,v
>retrieving revision 3.477
>diff -p -u -8 -r3.477 jsinterp.c
>--- jsinterp.c	13 Mar 2008 18:03:02 -0000	3.477
>+++ jsinterp.c	14 Mar 2008 17:11:26 -0000
>@@ -2221,16 +2221,20 @@ js_DumpOpMeters()
>  * reference.
>  */
> #define STORE_NUMBER(cx, n, d)                                                \
>     JS_BEGIN_MACRO                                                            \
>         jsint i_;                                                             \
>                                                                               \
>         if (JSDOUBLE_IS_INT(d, i_) && INT_FITS_IN_JSVAL(i_)) {                \
>             sp[n] = INT_TO_JSVAL(i_);                                         \
>+        } else if (OP_IS_ARITH(op) && pc[1]) {                                \
>+            jsdouble *dp_ = FRAME_DREGP(fp, pc[1]);                           \
>+            *dp_ = d;                                                         \
>+            sp[n] = DOUBLE_TO_JSVAL(dp_);                                     \
>         } else {                                                              \
>             SAVE_SP_AND_PC(fp);                                               \
>             if (!js_NewDoubleInRootedValue(cx, d, &sp[n]))                    \
>                 goto error;                                                   \
>         }                                                                     \
>     JS_END_MACRO

This clearly demonstrates how optimizing parser can improve things with almost no touches to the interpreter.

>diff -p -u -8 -r3.86 jsinterp.h
>- * NB: This struct is manually initialized in jsinterp.c and jsiter.c.  If you
>- * add new members, update both files.  But first, try to remove members.  The
>- * sharp* and xml* members should be moved onto the stack as local variables
>- * with well-known slots, if possible.
>+ * NB: This struct is manually initialized in jsinterp.c and jsiter.c. If you
>+ * add new members, update both files to initialize appropriately (the dregs
>+ * member does not need to be initialized). And please try to remove members.
>+ * The sharp* and xml* members should be moved into magic local variables with
>+ * well-known slots, if possible.
>  */
> struct JSStackFrame {
>     jsval           *sp;            /* stack pointer */
>     jsbytecode      *pc;            /* program counter */
....
>+    char            dregs[3 * sizeof(jsdouble)];

The plea to remove members and bloating frames that do not benefit from the patch is self-contradictory. The buffer for doubles can be allocated together with stack slots and the script can be marked as such. Then dregs can become a local variable in js_Interpret. 

Now, that would require to change the serialization format of jsscript to recover the number of double regs. Effectively it means to bump JSXDR_MAGIC_SCRIPT_CURRENT. But this will be a good thing since it would also allow to store the version immediately after the magic and treat both magic and version mismatch the same way. The code is not going to support older versions of xdr scripts in any case and pretending that it does leads to bugs and waste of time. Bumping magic during betas also gives better chance of testing the code.

>Index: jsparse.c
...
> JSBool
>-js_FoldConstants(JSContext *cx, JSParseNode *pn, JSTreeContext *tc)
>+js_FoldConstants(JSContext *cx, JSParseNode *pn, JSParseNode *parent,
>+                 JSTreeContext *tc)

Nit: this is no longer a constant folder. Name like ExpressionOptimizer would better reflect the new meaning of the function. 

>+                if (PN_IS_PURE(parent) && PN_IS_ARITH(pn)) {
...
>+                if (PN_IS_PURE(parent) && PN_IS_ARITH(pn)) {

The new macros is only used in this combination. This suggests to have one single macro instead. 

Another note is that the patch excludes the bit operation from the optimization. The cryptography-related benchmarks from SunSpider contains sequences like (a << shift | b) with a << shift. They generates ints that do not fit jsval in 50% of cases. Thus even bit-manipulation code may benefit from dval registers. But this is for another bug.
With the latest patch:

js> x
Assertion failure: pn->pn_reg <= 2, at jsparse.c:259
Aborted

Every test of the test suite fails with the same assertion for me.
x00000000, you need to clobber when structs used in multiple .c files change size. The patch works, please confirm that rebuilding from source fixed what you reported in comment 18.

/be
(In reply to comment #17)
> This is too hacky.

I am a hacker. :-)

> Given the amount of checks in JSVAL_IS_TRACEABLE(v) it is
> better to write the body of the loop as:
...
> It also makes things ready for string registers.

Thanks, will do.

> This clearly demonstrates how optimizing parser can improve things with almost
> no touches to the interpreter.

I've been trying to tag non-GC-page doubles for a while, in several bugs. Trick to avoiding a write barrier entails locality and instruction selection plus register allocation. More to come.

> >+ * The sharp* and xml* members should be moved into magic local variables with
> >+ * well-known slots, if possible.
> >  */
> > struct JSStackFrame {
> >     jsval           *sp;            /* stack pointer */
> >     jsbytecode      *pc;            /* program counter */
> ....
> >+    char            dregs[3 * sizeof(jsdouble)];
> 
> The plea to remove members and bloating frames that do not benefit from the
> patch is self-contradictory.

That's a bit harsh -- getting sharp* and xml* out is good no matter what. The double registers need to be double-aligned (0 mod 8 addresses), however, which is not guaranteed by jsval-aligned stack allocation.

> The buffer for doubles can be allocated together
> with stack slots and the script can be marked as such. Then dregs can become a
> local variable in js_Interpret.

But then it becomes another pointer to keep alive in a register, or spill and load. I'd rather it be a small offset from an existing pointer or known on-stack buffer such as JSFrameRegs. Then the only issue is alignment, hence my plea to you, JimB, et al.

> Now, that would require to change the serialization format of jsscript to
> recover the number of double regs. Effectively it means to bump
> JSXDR_MAGIC_SCRIPT_CURRENT. But this will be a good thing since it would also
> allow to store the version immediately after the magic and treat both magic and
> version mismatch the same way. The code is not going to support older versions
> of xdr scripts in any case and pretending that it does leads to bugs and waste
> of time. Bumping magic during betas also gives better chance of testing the
> code.

I'd rather not do this now. It is a good idea to reform script magic vs. XDR bytecode version, but time is short and this patch introduces exactly 2 double accumulators, intentionally to keep things simple and satisfy the 90% (99.9%) use-case where there are few parenthesized sub-expressions requiring more than one accumulator.

> >Index: jsparse.c
> ...
> > JSBool
> >-js_FoldConstants(JSContext *cx, JSParseNode *pn, JSTreeContext *tc)
> >+js_FoldConstants(JSContext *cx, JSParseNode *pn, JSParseNode *parent,
> >+                 JSTreeContext *tc)
> 
> Nit: this is no longer a constant folder. Name like ExpressionOptimizer would
> better reflect the new meaning of the function. 

Good idea, will do.

> >+                if (PN_IS_PURE(parent) && PN_IS_ARITH(pn)) {
> ...
> >+                if (PN_IS_PURE(parent) && PN_IS_ARITH(pn)) {
> 
> The new macros is only used in this combination. This suggests to have one
> single macro instead. 

Sure, but layered on these.

> Another note is that the patch excludes the bit operation from the
> optimization. The cryptography-related benchmarks from SunSpider contains
> sequences like (a << shift | b) with a << shift. They generates ints that do
> not fit jsval in 50% of cases. Thus even bit-manipulation code may benefit from
> dval registers. But this is for another bug.

That too is simply done in this patch, due to adjacency of opcodes and existing structure of js_FoldConstants. New patch shortly.

/be
Attached patch per Igor's review (obsolete) — Splinter Review
** TOTAL **:           1.108x as fast    3539.3ms +/- 1.7%   3195.0ms +/- 0.2%     significant

Unaccountably except due to TLB and cache conflicts, the only test that slowed down was:

    3bit-bits-in-byte: *1.042x as slow*    57.8ms +/- 1.8%     60.2ms +/- 0.5%     significant

This test stays in the interpreter and uses only int jsvals. Between PGO and the still-beckoning change to organize interpreter "cases" by "temperature" and the dynamic-successor relation, we should be able to make this go away.

Interested in how much this wins in PGO'ed builds. Easiest way to tell is to get the patch in ;-).

/be
Attachment #309502 - Flags: review?(igor)
(In reply to comment #19)
> x00000000, you need to clobber when structs used in multiple .c files change
> size. The patch works, please confirm that rebuilding from source fixed what
> you reported in comment 18.

I still get the error. At the end of js_FoldConstants:

          case PN_UNARY:
          case PN_NAME:
            /* Our kid may be null (e.g. return; vs. return e;). */
            pn1 = pn;
            do {
                pn1 = pn1->pn_kid;
            } while (pn1 && PN_TYPE(pn1) == TOK_RP);
            FreeNodeReg(pn1, tc);
            break;

*pn1->pn_kid is mostly zero, but pn_reg is 32. Seems to be a missing initialization of pn_reg.
(In reply to comment #20)
> > The buffer for doubles can be allocated together
> > with stack slots and the script can be marked as such. Then dregs can become a
> > local variable in js_Interpret.
> 
> But then it becomes another pointer to keep alive in a register, or spill and
> load. I'd rather it be a small offset from an existing pointer or known
> on-stack buffer such as JSFrameRegs. Then the only issue is alignment, hence my
> plea to you, JimB, et al.

An alternative for doubles in JSFrame is to have dregs[3] local in js_Interpret. For inlined calls a copy would have to be stored in JSInlinedFrame, of cause, but the coping can be done only if caller's frame needs doubles. This will have an extra benefit of avoiding access fp during the optimized operations helping the compiler to optimize code.

But I agree that this can be done in a separated bug.
Attachment #309452 - Attachment is obsolete: true
Attachment #309452 - Flags: review?(igor)
Priority: -- → P1
Target Milestone: --- → mozilla1.9beta5
jimb, mmoy, anyone: could use help with minimal compiler options or pragmas to double-align JSStackFrame.dregs (and well-type it, and remove padding overhead from it). GCC's -malign-double is a bit too big of a hammer; care about MSVC and ICC too as noted in a previous comment. Thoughts?

/be
Comment on attachment 309502 [details] [diff] [review]
per Igor's review

>@@ -2664,24 +2664,48 @@ js_TraceStackFrame(JSTracer *trc, JSStac
>+            } else if (JSVAL_IS_DOUBLE(v)) {
>+                jsdouble *dp = JSVAL_TO_DOUBLE(v);
>+                if ((size_t)((char *)dp - fp->dregs) < sizeof fp->dregs)
>+                    continue;
...
>RCS file: /cvsroot/mozilla/js/src/jsinterp.h,v
...
> struct JSStackFrame {
...
>+    /*
>+     * GC-thing-aligned double accumulators used to store intermediate results
>+     * to avoid allocating from the GC heap. Use FRAME_DREGP(fp,regno) to find
>+     * the address of the accumulator identified by regno (which is one-based,
>+     * note well) in the frame at fp.
>+     */
>+    char            dregs[3 * sizeof(jsdouble)];
>+
>+#define FRAME_DREGP(fp,regno)                                                 \
>+    (JS_ASSERT((regno) != 0),                                                 \
>+     (jsdouble *)(((jsuword)((fp)->dregs + ((regno) - 1) * sizeof(jsdouble)   \
>+                             + sizeof(jsdouble) - 1))                         \
>+                  & ~JSVAL_TAGMASK))

Why use char dregs[3 * sizeof(jsdouble)] and not just jsdouble dregs[3]? With the patch from bug 421274 that array would not even be double-aligned at the current position. Preparations for string regs should wait its own bug without complicating this one.

r+ with this and the issue from the comment 22 addressed.
(In reply to comment #23)
> (In reply to comment #20)
> An alternative for doubles in JSFrame is to have dregs[3] local in
> js_Interpret. For inlined calls a copy would have to be stored in
> JSInlinedFrame, of cause, but the coping can be done only if caller's frame
> needs doubles. This will have an extra benefit of avoiding access fp during the
> optimized operations helping the compiler to optimize code.

Still need something like JSFrameRegs *regs in JSStackFrame pointing to the local drefs, so the GC can avoid scanning 'em.

That suggests to me that we prioritize alignment and co-location in JSFrameRegs ahead of minimizing compiler-selected/variable save/restore code and data space (which I agree we should do -- the bug 389214 might serve already).

/be
(In reply to comment #26)
> Still need something like JSFrameRegs *regs in JSStackFrame pointing to the
> local drefs, so the GC can avoid scanning 'em.

Right, I have missed that.

(In reply to comment #25)
> Why use char dregs[3 * sizeof(jsdouble)] and not just jsdouble dregs[3]?

Because the wider type serves no purpose and may even mislead readers and (more important) debugger-drivers. True, the alignment could just happen to be good and the view of dregs[0] and dregs[1] valid -- but I found less than 50-50 odds of misalignment on Mactel, and looking at half a double in dregs[0] where the other half is in dregs[1] does no one any favors.

> With
> the patch from bug 421274 that array would not even be double-aligned at the
> current position.

Nothing guarantees double alignment except the code in this patch; whether dregs moves to JSFrameRegs is independent.

> Preparations for string regs should wait its own bug without
> complicating this one.

> r+ with this and the issue from the comment 22 addressed.

I do not know why x00000000 is seeing a crash. I'm not -- the patches pass the js testsuite and have since v1. Need more crash info, or joint debugging over IRC. x00000000, do you have a name? Can you get onto irc.mozilla.org and /query me?

/be
I meant to write "worse than 50-50 odds of misalignment". Not scientific but it was clear from some printfs I added.

/be
(In reply to comment #28)
> Because the wider type serves no purpose and may even mislead readers and (more
> important) debugger-drivers.

Right, this is another thing I have missed. But then the patch should use 3*sizeof(jsdouble) + sizeof(jsuword) bytes for the array to ensure that the array always has room for 3 8-byte aligned doubles. I would also prefer uint8, not char, as  the type for the array.

[Note I have extremely unstable internet connection today and IRC is useless for me as most of the time the message would not go through due to timeouts :(. With bugzilla I can click reload if that happens.]



 True, the alignment could just happen to be good
> and the view of dregs[0] and dregs[1] valid -- but I found less than 50-50 odds
> of misalignment on Mactel, and looking at half a double in dregs[0] where the
> other half is in dregs[1] does no one any favors.
> 
> > With
> > the patch from bug 421274 that array would not even be double-aligned at the
> > current position.
> 
> Nothing guarantees double alignment except the code in this patch; whether
> dregs moves to JSFrameRegs is independent.
> 
> > Preparations for string regs should wait its own bug without
> > complicating this one.
> 
> > r+ with this and the issue from the comment 22 addressed.
> 
> I do not know why x00000000 is seeing a crash. I'm not -- the patches pass the
> js testsuite and have since v1. Need more crash info, or joint debugging over
> IRC. x00000000, do you have a name? Can you get onto irc.mozilla.org and /query
> me?
> 
> /be
> 

The problem is that pn_kid is accessed for PN_NAME, which has no pn_kid.
See code snippet of comment 22.
There are only 2 double registers, hence the 3 * sizeof(jsdouble). I can use uint8 for sure, tend to be old-school about char (C guarantees sizeof(char) == 1 but I recall days of old when char could be a 36-bit word! on PDP-10s!).

But, I really want to believe, like Agent Mulder, that some pragma can be injected at just the right spot to ensure alignment without this over-allocating/aligning in the patch, and without changing global options.

/be
(In reply to comment #31)
> The problem is that pn_kid is accessed for PN_NAME, which has no pn_kid.
> See code snippet of comment 22.

Argh, I misremembered pn_expr as overlaying pn_kid. Sorry about that. Fixing in a minute, and thanks for finding it (I just re-ran the suite and did not have any unexpected failures -- what platform are you using?).

/be
(In reply to comment #33)
> what platform are you using?

Linux, amd64.
Attached patch with that fixed (obsolete) — Splinter Review
x00000000, thanks again -- I would appreciate it if you could test this too.

/be
Attachment #309502 - Attachment is obsolete: true
Attachment #309521 - Flags: review?(igor)
Attachment #309502 - Flags: review?(igor)
No need for a dummy regBusy[0] for the no-reg case.

/be
Attachment #309521 - Attachment is obsolete: true
Attachment #309535 - Flags: review?(igor)
Attachment #309521 - Flags: review?(igor)
About 1/3 of the test suite still crashes (with attachment 309521 [details] [diff] [review]; now segfaults without assertion failure). A minimal testcase that crashes is:

  var x = .5
  x * 1 + 1;

Crash is in BINARY_OP(*), but |x * 1| does not crash.
Crashing where? Might have to macro-expand not only BINARY_OP for the JSOP_MUL case, but also of course STORE_* macros. That should let you see the crashing line in context.

/be
Crash is at *dp_ = d in STORE_DOUBLE.
dp_ seems to be quite random, but non-null.
dp_ is assigned to on the previous line, from FRAME_DREGP(fp, pc[1]). That macro does not produce a random value unless the compiler is broken.

Can you disassemble that line's instructions in gdb, and the *dp_ = d; line's instructions too? Thanks,

/be
dp_ looses its upper 32 bits. The reason is probably that JSVAL_TAGMASK is 32 bit.
x000000000 (mind if I call you x0 for short? ;-), thanks again. This is an ancient 64-bit porting hazard in jsapi.h, over 11 years old. Working around it in FRAME_DREGP the obvious way should fix things for you. Please confirm.

/be
Attachment #309535 - Attachment is obsolete: true
Attachment #309563 - Flags: review?(igor)
Attachment #309535 - Flags: review?(igor)
All tests pass now (assuming the patch just changes ~JSVAL_TAGMASK to ~(jsuword)JSVAL_TAGMASK, which was my correction of the last patch when I ran the tests).

Additionally, also js1_5/extensions/regress-335700.js ("Object Construction with getter closures should be O(N)") passes, which failed before.
(In reply to comment #43)
> All tests pass now (assuming the patch just changes ~JSVAL_TAGMASK to
> ~(jsuword)JSVAL_TAGMASK, which was my correction of the last patch when I ran
> the tests).

That was it -- if you click on the diff link for the latest attachment, at the top of the colorized side-by-side diff is an interdiff link, which confirms this.

> Additionally, also js1_5/extensions/regress-335700.js ("Object Construction
> with getter closures should be O(N)") passes, which failed before.

That's a false negative -- that test flaps a bit in my experience. It's doing a least-squares fit over perhaps not enough data to find the quadratic curve.

Thanks for all the testing help, and the original bug report and approach.

/be
Some basic arithmetic is totally wrong now:

js> var a = .1, b = .9;
js> (a + a) * (2 * b);
3.24
js> (a + a) / (2 * b);
1
Comment on attachment 309563 [details] [diff] [review]
workaround JSVAL_TAGMASK not 64-bit clean bug

Registers aren't marked busy when they should be. New patch (with __attribute__ / __declspec based alignment of dregs) later tonight.

/be
Attachment #309563 - Attachment is obsolete: true
Attachment #309563 - Flags: review?(igor)
Attached patch passes mochitests and js tests (obsolete) — Splinter Review
I can't get __attribute__ ((aligned (8))) to do what it is supposed to do when used after a member declaration such as jsdouble dregs[2] and before the ; that ends the declaration. Any have a clue to loan?

I'll file followup bugs as needed and note them in comments before checking in (at least one in jsinterp.h about using alignment pragmas), but this patch is ready for review.

BTW, the fix from the last patch was to propagate pn_reg up from the kid to its TOK_RP (parenthesized group expression) node in js_OptimizeExprs.

/be
Attachment #309595 - Flags: review?(igor)
Bugzilla interdiff between the last two patches works, BTW.

/be
The patch still has problems. In

  a + b * c * (d + e)

the intermediate result of b * c will be destroyed by (d + e).
It also misses some optimizations, e.g.

  a * b * c * d * e

won't be optimized.

I'll post a patch that uses just a count of the used registers instead of
busy bits. I think that's a better approach, because they work like a
stack.


BTW, __attribute__ ((aligned (8))) does work for me in gcc 4.2.3.

    ...
    uint8    misalign;
    jsdouble dregs[2];

is already correctly aligned without an attribute, but

    ...
    uint8    misalign;
    jsdouble dregs[2] __attribute__ ((aligned (16)));

gives 16-byte alignment as expected.
Passes the js test suite and works in the browser.
x0, you rock -- I was trying to reason "bottom-up" based on singular use of intermediates, but that's unsound. Also, my __attribute__ ((aligned (8))) problems were self-inflicted. The hard case is the "inline_call" code in the interpreter, where JSInlineFrame is allocated on the interpreter stack. GCC does not even try to warn about infeasibility of enforcing the aligned attribute here.

New patch shortly, a mostly-stylistic-nit-picking pass over yours plus alignment pragmas.

/be
A remaining issue: After

  (x += .1) + 1

x will be corrupt. We need to exclude assignment ops from using registers.
Yes, pn_type will be TOK_ASSIGN where pn_op satisfies PN_IS_ARITH. I've got that alignment patch almost ready (frustrating dealing with case proliferation under the inline call code for JSOP_CALL in js_Interpret), I'll fix that too. Thanks,

/be
With the issue in comment 52 fixed, I get always the same output as in an unpatched build.
Attachment #309800 - Attachment mime type: application/octet-stream → text/plain
Blocks: 423300
Passes mochi, js, and x0's tcgen.pl tests (I had to fold in the r+ patch for bug 423300). Many thanks to x00000000@freenet.de for all the help and timely assists!

/be
Attachment #309595 - Attachment is obsolete: true
Attachment #309718 - Attachment is obsolete: true
Attachment #310190 - Flags: review?(igor)
Attachment #309595 - Flags: review?(igor)
Comment on attachment 310190 [details] [diff] [review]
proposed brendan+x0 love-child fix

>Index: jsgc.c
...
>@@ -2664,25 +2664,49 @@ js_TraceStackFrame(JSTracer *trc, JSStac
...
>         if (fp->regs) {
>             nslots = (uintN) (fp->regs->sp - fp->spbase);
>             JS_ASSERT(nslots <= fp->script->depth);
>-            TRACE_JSVALS(trc, nslots, fp->spbase, "operand");
>+            for (vp = fp->spbase, end = vp + nslots; vp < end; vp++) {
>+                jsval v;
>+                uintN traceKind;
>+
>+                v = *vp;
>+                if (!JSVAL_IS_PRIMITIVE(v)) {
>+                    traceKind = JSTRACE_OBJECT;
>+                } else if (JSVAL_IS_DOUBLE(v)) {
>+                    jsdouble *dp = JSVAL_TO_DOUBLE(v);
>+                    if (IS_FRAME_DREGP(fp, dp))
>+                        continue;

This is wrong and I should have seen this earlier. 

For callers of an inline frame fp->regs will point to JSInlineFrame.callerRegs but the double jsvals will still point to REGS.dregs local from JS_Interpret native stack. The situation is worse for generator frames as for them jsvals pointing to registers may well point to the stack of JS_Interpret that has finished its execution.

The following test case may expose the problem:

function gen()
{
    yield (Math.PI * 2) * (yield 0);
}

var x = gen();
x.next();
gc();
var result = x.send(1);
if (result !== Math.PI * 2)
    throw "Bad result: "+ result; 

This problem with generators pretty much requires to move dregs back to JSStackFrame.
Attachment #310190 - Flags: review?(igor)
Argh, I should have seen that. I need a vacation!

Ok, it is not hard to move dregs back to the frame.

/be
The JSOP_NEG case in jsinterp.c should have a STORE_NUMBER() instead of js_NewNumberInRootedValue() to actually use its register. That was already missing in my patch.


IMHO there should be 3 registers at least. Most scripts won't use the 3rd register much, but they are not expensive. The simplest expression that would use a 3rd register is something like

  a + b + c * d * -e


Some additional operators could be declared "pure". Most important are JSOP_STRICTEQ and JSOP_STRICTNE. The other missing optimizations are only edge cases, AFAICT (examples are JSOP_GETELEM where only integers are common or conditionals without relational operator like if (x + y) ...).


The TOK_ASSIGN case in js_OptimizeExprs() doesn't make much sense at all. Any constant expression (except the rightmost) should cause a syntax error anyways (and does so), thus there's nothing to optimize. I don't understand the (old) comment there:

      case TOK_ASSIGN:
        /*
         * Compound operators such as *= should be subject to folding, in case
         * the left-hand side is constant, and so that the decompiler produces
         * the same string that you get from decompiling a script or function
         * compiled from that same string.  As with +, += is special.
         */
(In reply to comment #58)
> The JSOP_NEG case in jsinterp.c should have a STORE_NUMBER() instead of
> js_NewNumberInRootedValue() to actually use its register. That was already
> missing in my patch.

Thanks, will fix in next patch.

> IMHO there should be 3 registers at least. Most scripts won't use the 3rd
> register much, but they are not expensive.

Hard to measure but they take up (or will take up, in the new patch) interpreter (not C) stack frame space, which is not cost free.

> The simplest expression that would
> use a 3rd register is something like
> 
>   a + b + c * d * -e

Does anything use three registers in the microbenchmarks we study? Just curious, not saying that those micros are decisive. I will instrument the new patch lightly to find out.

> Some additional operators could be declared "pure". Most important are
> JSOP_STRICTEQ and JSOP_STRICTNE.

These are not that common, although some JS gurus push them over == and !=. It would be awkward to reorder jsopcode.tbl heavily at this stage, but I'll see what I can do.

> The TOK_ASSIGN case in js_OptimizeExprs() doesn't make much sense at all. Any
> constant expression (except the rightmost) should cause a syntax error anyways
> (and does so), thus there's nothing to optimize. I don't understand the (old)
> comment there:
> 
>       case TOK_ASSIGN:
>         /*
>          * Compound operators such as *= should be subject to folding, in case
>          * the left-hand side is constant, and so that the decompiler produces
>          * the same string that you get from decompiling a script or function
>          * compiled from that same string.  As with +, += is special.
>          */
> 

To answer this question, use cvs annotate and log:

$ cvs annotate jsparse.c | grep 'Compound operators'

Annotations for jsparse.c
***************
3.210        (brendan% 25-Aug-06):          * Compound operators such as *= should be subject to folding, in case
$ cvs log -r3.210 jsparse.c
RCS file: /cvsroot/mozilla/js/src/jsparse.c,v
Working file: jsparse.c
head: 3.336
branch:
locks: strict
access list:
keyword substitution: kv
total revisions: 506;   selected revisions: 1
description:
----------------------------
revision 3.210
date: 2006/08/25 18:01:35;  author: brendan%mozilla.org;  state: Exp;  lines: +15 -0
Fold assign-ops harder, to uphold decompiler round-trip invariance (349663, r=shaver).
=============================================================================

Then see the cited bug 349663. The reason is due to the strange nature of const in JS1.5: you can assign to a const once per evaluation of its initialiser, after which assignment ops targeting it must evaluate it and the right operand but not set the const.

/be
Thanks for the explanation, I understand it now (although I do not understand why a const doesn't simply throw on re-assignment, but there are more specifications in JavaScript and ECMAScript that make little sense to me).

Complex arithmetic calculations are common in spherical trigonometry. I'm developing a script that calculates map projections of the earth and displays them with SVG, with the user being able to change the view interactively. That was much slower in Mozilla than in Opera until recently, which caused me to file this bug. The script would use more than 2 registers, but I checked that I could rewrite almost all such expressions to be more stack efficient and using only 2 registers. For most of the script, 2 registers are enough anyways. So if there's a real cost for the 3rd register, then it's not worth it. Really complex scripts are rare (OTOH, they tend to be the performance critical ones).

Not optimizing JSOP_STRICTEQ and JSOP_STRICTNE is ok, because floating point is much more likely to use relational than equality operators. It just seemed a little odd to me that they will be treated differently than their classical counterparts.
(In reply to comment #60)
>I'm
> developing a script that calculates map projections of the earth and displays
> them with SVG, with the user being able to change the view interactively.

Does this involve calls to trigonometric functions from Math?
(In reply to comment #61)
> (In reply to comment #60)
> >I'm
> > developing a script that calculates map projections of the earth and displays
> > them with SVG, with the user being able to change the view interactively.
> 
> Does this involve calls to trigonometric functions from Math?

Yes, but their results are cached in local vars and reused many times. But the slow function calls affect overall performance nevertheless (of course I use |var sin= Math.sin| etc. to accelerate them, but function call overhead is still high (in other browsers too)).
(In reply to comment #60)
> Thanks for the explanation, I understand it now (although I do not understand
> why a const doesn't simply throw on re-assignment, but there are more
> specifications in JavaScript and ECMAScript that make little sense to me).

It's all water under the bridge, but ECMA-262 Edition 1 was 1997, based on JS1 in Netscape, beta'ed in 1995. No try/catch, so errors were fatal. Attempts to set ReadOnly properties resulted in uncatchable error in Netscape (IIRC, my memory is failing at this distance), but the ECMA working group felt that silence, or at most a boolean status result, were better outcomes.

Hence the true or false result for the delete operator, as well as the result of an assignment to a ReadOnly property being the result of the right-hand side.

> Complex arithmetic calculations are common in spherical trigonometry. I'm
> developing a script that calculates map projections of the earth and displays
> them with SVG, with the user being able to change the view interactively. That
> was much slower in Mozilla than in Opera until recently, which caused me to
> file this bug. The script would use more than 2 registers, but I checked that I
> could rewrite almost all such expressions to be more stack efficient and using
> only 2 registers. For most of the script, 2 registers are enough anyways. So if
> there's a real cost for the 3rd register, then it's not worth it. Really
> complex scripts are rare (OTOH, they tend to be the performance critical ones).

Working on a new patch that puts the accumulators in the JSContext, and can have many more.

> Not optimizing JSOP_STRICTEQ and JSOP_STRICTNE is ok, because floating point is
> much more likely to use relational than equality operators. It just seemed a
> little odd to me that they will be treated differently than their classical
> counterparts.

It's a matter of time -- the freeze for Firefox 3 beta 5 is upon us.

/be
(In reply to comment #60)

> Complex arithmetic calculations are common in spherical trigonometry. I'm
> developing a script that calculates map projections of the earth and displays
> them with SVG, with the user being able to change the view interactively. That
> was much slower in Mozilla than in Opera until recently, which caused me to
> file this bug. The script would use more than 2 registers, but I checked that I
> could rewrite almost all such expressions to be more stack efficient and using
> only 2 registers. For most of the script, 2 registers are enough anyways. So if
> there's a real cost for the 3rd register, then it's not worth it. Really
> complex scripts are rare (OTOH, they tend to be the performance critical ones).
> 

Would you be willing to share a link to the app?  
Attached image SVG with floating point intensive script (obsolete) —
(In reply to comment #64)
> Would you be willing to share a link to the app?  

It's not ready yet, but the attached file is a mostly working example. It contours pure grid data first and projects it then (Lambert azimuthal equal-area). On click the map will be re-projected with a new center point. (Data is temperature trend of the last 30 years as calculated by NCEP/DOE reanalysis 2 (which is not reliable everywhere).)
Attachment #310406 - Attachment mime type: application/octet-stream → image/svg+xml
Same file unzipped. Bugzilla doesn't seem to be able to handle compressed SVGs.
Attachment #310406 - Attachment is obsolete: true
That page performs quite well on a 2.5 Ghz MacBook Pro on the trunk. It only takes a second to repaint the new globe.

On Firefox 2.0.0.12 and a 2.8 Ghz AMD X2 machine running Windows, it takes almost six seconds to repaint after a click.
(In reply to comment #67)
> That page performs quite well on a 2.5 Ghz MacBook Pro on the trunk. It only
> takes a second to repaint the new globe.
> 
> On Firefox 2.0.0.12 and a 2.8 Ghz AMD X2 machine running Windows, it takes
> almost six seconds to repaint after a click.

Yes, there was a huge improvement in performance somewhere between beta 3 and now. If I knew that before I started to search for possible improvements in the code, I probably had not filed this bug...
Glad you didn't find out, then. :)
I traced down when the improvement of projection calculation in the attached file happened. It was between the 2008-02-15 and 2008-02-16 nightlies, when the consumed time dropped from 3.42 s to 1.44 s on my machine. So it must have been bug 395993. A non-null this pointer has been passed to Math.sin etc. before, even when called as a local var.

Since then, timings improved further, but in multiple smaller steps. 2008-02-26 was at 1.17 s, beta 4 at 0.97 s, and the current nightly is at 0.82 s. I haven't traced them down further.
I have a new patch coming some time in the next week. This won't make mozilla1.9 / firefox3 of course.

/be
Target Milestone: mozilla1.9beta5 → ---
getting this on the 1.9.1 triage queue.
Flags: blocking1.9.1?
Attached patch snapshotSplinter Review
Shaver kindly offered to take this one across the finish line.

/be
Attachment #310190 - Attachment is obsolete: true
Assignee: brendan → shaver
Status: ASSIGNED → NEW
Comment on attachment 324346 [details] [diff] [review]
snapshot

>+#define OP_IN_RANGE(op,lo,hi)   ((uintN)((op)-(lo)) <= (uintN)((hi)-(lo)))
>+#define OP_IS_PURE(op)          OP_IN_RANGE(op, JSOP_EQ, JSOP_BITNOT)
>+#define OP_IS_ARITH(op)         OP_IN_RANGE(op, JSOP_BITOR, JSOP_NEG)

Maybe it's just my early morning brain, but won't OP_IN_RANGE report false positives for opcodes lower than lo?

OP_IN_RANGE(3,5,10)
  3 - 5 <= 10 - 5
  -2 <= 5

?
(In reply to comment #74)
> >+#define OP_IN_RANGE(op,lo,hi)   ((uintN)((op)-(lo)) <= (uintN)((hi)-(lo)))
...
> OP_IN_RANGE(3,5,10)
>   3 - 5 <= 10 - 5
>   -2 <= 5

Note the uintN part which gives:

  (uintN)(3 - 5) <= (uintN)(10 - 5)
  2**32 - 2 <= 5
  false
Flags: blocking1.9.1? → blocking1.9.1+
Igor is going to take this, aiui
Assignee: shaver → igor
I focus on this bug from now on.
(In reply to comment #73)
> Created an attachment (id=324346) [details]
> snapshot

My current thinking is to use stack slots for double registers. That is, the code would reserve some space before vars to store the doubles eliminating the need for any special storage in JSStackFrame  or JSContext for the price of an extra uint8 field counting the number of double registers in JSScript. But that comes for free as JSScript has unused uint8 field.
Depends on: 441686
Attached patch using slots for dregs v1 (obsolete) — Splinter Review
This patch is based on the patch from bug 441686 comment 18 and allocates the double registers between vars and stack slots. Except for these dregs allocation changes this is essentially the patch from the comment 73 ported to the trunk.

The patch just compiles and I have not done any testing with it.
Attached patch using slots for dregs v2 (obsolete) — Splinter Review
work in progress
Attachment #329363 - Attachment is obsolete: true
Attached patch using slots for dregs v3 (obsolete) — Splinter Review
yet another incremental update that properly initializes the number of double registers in the script.
Attachment #330875 - Attachment is obsolete: true
Attached patch using slots for dregs v4 (obsolete) — Splinter Review
Changes compared with the patch from the comment 73:

1. The double registers are stored in frame slots between function variables (or global/regexps in the script case) and the stack base. In STORE_DOUBLE macro the base for double registers is calculated from script->nfixed and fp->slots. Although it adds a few instructions, an alternative of caching dregsEnd_ from the macro on function entrance made things slower both on Pentium-M 1.1 GHz and Xeon 2.66 GHz (64 bit). 

2. I added JSParseOptimizer struct as a convenient way to pass to the code generator the ndregs counter for the compiled functions.
Attachment #330882 - Attachment is obsolete: true
Attachment #330952 - Flags: review?(brendan)
Attached patch using slots for dregs v5 (obsolete) — Splinter Review
The new version of the patch bumps XDR version to reflect bytecode changes and makes JSOP_BITNOT to use dregs to optimize cases like x & ~y.
Attachment #331009 - Flags: review?(brendan)
Attachment #330952 - Attachment is obsolete: true
Attachment #330952 - Flags: review?(brendan)
Attached patch using slots for dregs v6 (obsolete) — Splinter Review
The patch is a sync with the trunk to remove unrelated changes for editline makefile. I fixed that in the bug 447705.
Attachment #331069 - Flags: review?(brendan)
Attachment #331009 - Attachment is obsolete: true
Attachment #331009 - Flags: review?(brendan)
Comment on attachment 331069 [details] [diff] [review]
using slots for dregs v6

Looks like globalUses and loopyGlobalUses crept back in.

Since you are adding more uses, could you please rename JOF_2BYTE to JOF_UINT8? This will help tracemonkey merging too (from TM's jsopcode.h:

#define JOF_UINT8         13      /* uint8 immediate, e.g. top 8 bits of 24-bit
                                     atom index */

use this for best merge unity ;-).

What are the perf effects with the latest m-c shell builds and this patch?

I will review and stamp promptly the next patch. Thanks again, sorry for delays.

/be
Comment on attachment 331069 [details] [diff] [review]
using slots for dregs v6

New patch is comming
Attachment #331069 - Attachment is obsolete: true
Attachment #331069 - Flags: review?(brendan)
Priority: P1 → P2
Any progress on this? This will probably also help to improve on the v8 benchmarks, as there is a lot of math in there.
going to have to unblock on this, given lack of activity. moving to WANTED.
Flags: blocking1.9.1+ → wanted1.9.1+
This goes back to the pool.
Assignee: igor → general
This was a 1.9.1 blocker once, is this still an issue? Nominating for blocking-1.9.2 to get it back on the radar...
Flags: blocking1.9.2?
Flags: wanted1.9.2+
Flags: blocking1.9.2?
Flags: blocking1.9.2-
fatvals made this not a problem
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: