Closed Bug 433382 Opened 16 years ago Closed 16 years ago

More efficient interpreter switch when computed goto is not available

Categories

(Core :: JavaScript Engine, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: igor, Assigned: igor)

References

Details

(Keywords: perf)

Attachments

(3 files, 11 obsolete files)

Currently with MSVC compiler, which does not support computed goto extension of GCC, the interpreter loops looks effectively as:

for (;;) {
     op = (JSOp) *regs.pc;
     len = js_CodeSpec[op].length;
     if (interruptHandler) { 
        ....
     }
     switch (op) {
         ....
     }
     regs.pc += len;
} 

This structure is inefficient for 2 reasons:

1. The if check for interruptHandler even with branch prediction does not come for free. Using the idea from a patch for bug 430293 it should be possible to replace the if check with branch-less bit  operation.

2. It accesses len way to early initializing it from a global table. This forces the compiler to save its value on the native stack whenever a switch case calls a function. Moreover, when a jump operation performs a goto, this assignment to len is completely an necessary. Initializing len  the end of each case in the same way as the indirect threading code does should avoid this.
Keywords: perf
Are you sure 2 is a real problem in our optimized, PGO'ed builds?

Improving Windows perf incrementally is great, but even better would be computed gotos somehow -- esp. since we can count on VS8 now. Is there no way?

/be
Attached patch v1 (obsolete) — Splinter Review
Implementation of the ideas from the comment 1. The biggest part of the patch is introduction of that jsoplen.h headers that defines JSOP_XXX_LENGTH as macros. This way it is possible in the interpreter to specialize END_CASE expansion for different bytecode lengths. Ideally it would be nice to autogenerate the patch but for now at least it is statically checked.
Attachment #320728 - Flags: review?(brendan)
To get the results with GCC I passed the following args to the make:

  BUILD_OPT=1 INTERP_OPTIMIZER="-falign-functions=4096 -O3 -fstrict-aliasing" XCFLAGS="-DJS_THREADED_INTERP=0"

where  JS_THREADED_INTERP=0 disables indirect threading in the interpreter with GCC.

The end result is 16% speedup with the patch. Now if somebody can check the patch with VC8, that would be nice.
For references, here is the results of the switched interpreter loop *with* the patch applied versus the trunk with the indirect threading loop. The end result is that with the patch the switch is 12% slower. Without the patch the switched loop is slower by 30%.
(In reply to comment #1)
> Are you sure 2 is a real problem in our optimized, PGO'ed builds?

Given that the patch shows speedups up to 80% for particular benchmarks  with GCC 4.3.0 at -O3, that must be a problem.
(In reply to comment #5)
> (In reply to comment #1)
> > Are you sure 2 is a real problem in our optimized, PGO'ed builds?
> 
> Given that the patch shows speedups up to 80% for particular benchmarks  with
> GCC 4.3.0 at -O3, that must be a problem.

For references, here the tests that benefits the most:

  bitops:              1.31x as fast     1705.6ms +/- 0.3%   1299.2ms +/- 0.7%     significant
    3bit-bits-in-byte: 1.48x as fast      266.2ms +/- 0.1%    180.2ms +/- 0.3%     significant
    bits-in-byte:      1.79x as fast      365.4ms +/- 0.8%    204.6ms +/- 0.8%     significant
    bitwise-and:       1.078x as fast     589.4ms +/- 0.8%    546.8ms +/- 1.5%     significant
    nsieve-bits:       1.32x as fast      484.6ms +/- 0.2%    367.6ms +/- 0.5%     significant
Amazing -- that hoisted len load was ~13 years old!

Can you attach a patch including jsoplen.h? I'm still interested in PGO'ed build results too -- could be even better.

/be
Attached patch v1 for real (obsolete) — Splinter Review
The new version of the patch includes the missing jsoplen.h
Attachment #320728 - Attachment is obsolete: true
Attachment #320739 - Flags: review?(brendan)
Attachment #320728 - Flags: review?(brendan)
Can you automate generation of jsoplen.h from jsopcode.tbl?

/be
I'll get PGO results. It'll take some time though. Should have results tomorrow morning.
Attached patch v2 (obsolete) — Splinter Review
The new version auto-generates jsautooplen.h during the compilation and adjust the build files to account for that. Since xpconnect includes jsopcode.h, in the new version jsopcode.h does not include the generated file. Instead the C sources that uses JSOP_<name>_LENGTH constants includes the file directly. It avoids the need to adjust the makefile for xpconnect to include the build dir for js into the include path.
Attachment #320739 - Attachment is obsolete: true
Attachment #320798 - Flags: review?(brendan)
Attachment #320739 - Flags: review?(brendan)
Attachment #320798 - Attachment is patch: true
Attachment #320798 - Attachment mime type: application/octet-stream → text/plain
Attached patch v3 (obsolete) — Splinter Review
The previous version of the patch does not compile when liveconnect is on as the code there currently uses JSOP_STOP_LENGTH. To fix that I replaced the symbolic constant by one, its numeric value, together with the assert that the last pc is JSOP_STOP. This way there is no need to patch the makefile for liveconnect to add the build directory of js/src to the include path of liveconnect.
Attachment #320798 - Attachment is obsolete: true
Attachment #320905 - Flags: review?(brendan)
Attachment #320798 - Flags: review?(brendan)
I tried v1 and v3 of this, and neither compile for me.
Attached patch v3 with all the files (obsolete) — Splinter Review
In the previous patch I forgot to include jsoplengen.c file.
Attachment #320905 - Attachment is obsolete: true
Attachment #320972 - Flags: review?(brendan)
Attachment #320905 - Flags: review?(brendan)
With a PGO windows build, I get something about 5% faster than the nightly on the same hardware, with some tests showing 10% improvement, and no regressions to speak of.
Where should we start landing these patches? We need an hg repo for 3.next/1.9.next work. We can't use ActionMonkey because XPCOMGC is not done and won't be for a while, but we need short-term release-ability *and* no-regressions test-ability including in-browser tests and macro-benchmarks.

/be
Comment on attachment 320972 [details] [diff] [review]
v3 with all the files

Need to get jimb's autoconf build system to track this bug's changes.

>+#include <jsautooplen.h>

We use "" not <> for local headers including generated ones. <> means system include directories only, or used to. It may not matter given -I overuse, but stylistically it would be best to stick with "".

> # define DO_OP()            JS_EXTENSION_(goto *jumpTable[op])
>-# define DO_NEXT_OP(n)      do { METER_OP_PAIR(op, regs.pc[n]);               \
>-                                 op = (JSOp) *(regs.pc += (n));               \
>-                                 DO_OP(); } while (0)
>+# define DO_NEXT_OP(n)                                                        \
>+    JS_BEGIN_MACRO                                                            \
>+        METER_OP_PAIR(op, regs.pc[n]);                                        \
>+        op = (JSOp) *(regs.pc += (n));                                        \
>+        DO_OP();                                                              \
>+    JS_END_MACRO
>+
> # define BEGIN_CASE(OP)     L_##OP:
> # define END_CASE(OP)       DO_NEXT_OP(OP##_LENGTH);
> # define END_VARLEN_CASE    DO_NEXT_OP(len);
>-# define EMPTY_CASE(OP)     BEGIN_CASE(OP) op = (JSOp) *++regs.pc; DO_OP();
>-#else
>+# define ADD_EMPTY_CASE(OP) BEGIN_CASE(OP)                                    \
>+                              JS_ASSERT(js_CodeSpec[OP].length == 1);         \
>+                              op = (JSOp) *++regs.pc;                         \
>+                              DO_OP();

Style nit: the DO_NEXT_OP body should probably try to align with other macro definitions, including ADD_EMPTY_CASE. That macro might indent by 4 instead of 2, but that's a nit on top of a nit.

>+# define END_EMPTY_CASES
>+
>+#else /* !JS_THREADED_INTERP */
>+
> # define DO_OP()            goto do_op
>-# define DO_NEXT_OP(n)      goto advance_pc
>+# define DO_NEXT_OP(n)                                                        \
>+    JS_BEGIN_MACRO                                                            \
>+        JS_ASSERT((n) == len);                                                \
>+        goto advance_pc;                                                      \
>+    JS_END_MACRO

Same here.

>+/*
>+ * To share the code for all len == 1 cases we use the specialized label with
>+ * code that fall through to advance_pc: .

s/fall/falls/

>-    /* Check for too deep a C stack. */
>+    /* Check for a too deep C stack. */

English nit: previous version could have said "too deep of " but new version would want "too-deep" or "overdeep". Either is fine in my book, probably "too deep of " is more colloquial.

>-#if JS_THREADED_INTERP
>+    /*
>+     * It is important that "op" be initialized before calling DO_OP because
>+     * it is possible for "op" to be specially assigned during the normally
>+     * processing of an opcode while looping. We rely on DO_NEXT_OP to manage
>+     * "op" correctly in all other cases.

s/normally/normal/

>+OPDEF(JSOP_NOP,           230,  "nop",         NULL,  1,  0,  0,  0,  JOF_BYTE)

Suggest eliminating JSOP_SWAP and using its opcode. This touches a few places by removing undetectably-dead code.

/be
Blocks: js1.8.5
Attached patch v4 (obsolete) — Splinter Review
The new version of the patch addresses the nits and eliminated JSOP_SWAP as suggested. Here is a delta from the previous version:

--- /home/igor/m/ff/mozilla/quilt.ANS1mk/js/src/jsinterp.c	2008-05-22 16:12:30.000000000 +0200
+++ js/src/jsinterp.c	2008-05-22 16:03:47.000000000 +0200
@@ -79,3 +79,3 @@
 
-#include <jsautooplen.h>
+#include "jsautooplen.h"
 
@@ -2595,8 +2595,7 @@ js_Interpret(JSContext *cx)
 # define DO_OP()            JS_EXTENSION_(goto *jumpTable[op])
-# define DO_NEXT_OP(n)                                                        \
-    JS_BEGIN_MACRO                                                            \
-        METER_OP_PAIR(op, regs.pc[n]);                                        \
-        op = (JSOp) *(regs.pc += (n));                                        \
-        DO_OP();                                                              \
-    JS_END_MACRO
+# define DO_NEXT_OP(n)      JS_BEGIN_MACRO                                    \
+                                METER_OP_PAIR(op, regs.pc[n]);                \
+                                op = (JSOp) *(regs.pc += (n));                \
+                                DO_OP();                                      \
+                            JS_END_MACRO
 
@@ -2606,5 +2605,6 @@ js_Interpret(JSContext *cx)
 # define ADD_EMPTY_CASE(OP) BEGIN_CASE(OP)                                    \
-                              JS_ASSERT(js_CodeSpec[OP].length == 1);         \
-                              op = (JSOp) *++regs.pc;                         \
-                              DO_OP();
+                                JS_ASSERT(js_CodeSpec[OP].length == 1);       \
+                                op = (JSOp) *++regs.pc;                       \
+                                DO_OP();
+
 # define END_EMPTY_CASES
@@ -2614,7 +2614,6 @@ js_Interpret(JSContext *cx)
 # define DO_OP()            goto do_op
-# define DO_NEXT_OP(n)                                                        \
-    JS_BEGIN_MACRO                                                            \
-        JS_ASSERT((n) == len);                                                \
-        goto advance_pc;                                                      \
-    JS_END_MACRO
+# define DO_NEXT_OP(n)      JS_BEGIN_MACRO                                    \
+                                JS_ASSERT((n) == len);                        \
+                                goto advance_pc;                              \
+                            JS_END_MACRO
 
@@ -2627,3 +2626,3 @@ js_Interpret(JSContext *cx)
  * To share the code for all len == 1 cases we use the specialized label with
- * code that fall through to advance_pc: .
+ * code that falls through to advance_pc: .
  */
@@ -2640,3 +2639,3 @@ js_Interpret(JSContext *cx)
 
-    /* Check for a too deep C stack. */
+    /* Check for too deep of C stack. */
     JS_CHECK_RECURSION(cx, return JS_FALSE);
@@ -2777,3 +2776,3 @@ js_Interpret(JSContext *cx)
      * It is important that "op" be initialized before calling DO_OP because
-     * it is possible for "op" to be specially assigned during the normally
+     * it is possible for "op" to be specially assigned during the normal
      * processing of an opcode while looping. We rely on DO_NEXT_OP to manage
@@ -2786,6 +2785,8 @@ js_Interpret(JSContext *cx)
     /*
-     * This is a loop, but it does not look like a loop.  The loop-closing
-     * jump is distributed throughout interruptJumpTable, and comes back to
-     * the interrupt label.  The dispatch on op is through normalJumpTable.
-     * The trick is LOAD_INTERRUPT_HANDLER setting jumpTable appropriately.
+     * This is a loop, but it does not look like a loop. The loop-closing
+     * jump is distributed throughout goto *jumpTable[op] implementing DO_OP.
+     * When interrupts are enabled, jumpTable is set to interruptJumpTable
+     * where all jumps point to the JSOP_INTERRUPT case. The latter, after
+     * calling the interrupt handler, dispatches through normalJumpTable to
+     * continue the normal bytecode processing.
      */
@@ -2890,8 +2891,2 @@ js_Interpret(JSContext *cx)
 
-          BEGIN_CASE(JSOP_SWAP)
-            rtmp = regs.sp[-1];
-            regs.sp[-1] = regs.sp[-2];
-            regs.sp[-2] = rtmp;
-          END_CASE(JSOP_SWAP)
-
           BEGIN_CASE(JSOP_SETRVAL)
--- /home/igor/m/ff/mozilla/quilt.ANS1mk/js/src/jsopcode.tbl	2008-05-22 16:12:30.000000000 +0200
+++ js/src/jsopcode.tbl	2008-05-22 11:07:32.000000000 +0200
@@ -333,5 +333,5 @@ OPDEF(JSOP_FINALLY,     134,"finally",  
 /*
- * Swap the top two stack elements.
+ * Generic nop for the decompiler.
  */
-OPDEF(JSOP_SWAP,        135,"swap",       NULL,       1,  2,  2,  0,  JOF_BYTE)
+OPDEF(JSOP_NOP,         135,"nop",        NULL,       1,  0,  0,  0,  JOF_BYTE)
 
@@ -523,2 +523 @@ OPDEF(JSOP_INT32,         228, "int32", 
 OPDEF(JSOP_LENGTH,        229, "length",       NULL,  1,  1,  1, 18,  JOF_BYTE|JOF_PROP)
-OPDEF(JSOP_NOP,           230,  "nop",         NULL,  1,  0,  0,  0,  JOF_BYTE)
--- /home/igor/m/ff/mozilla/quilt.ANS1mk/js/src/jsopcode.c	2008-05-22 16:12:30.000000000 +0200
+++ js/src/jsopcode.c	2008-05-22 11:12:15.000000000 +0200
@@ -77,3 +77,3 @@
 
-#include <jsautooplen.h>
+#include "jsautooplen.h"
 
@@ -2158,10 +2158,2 @@ Decompile(SprintStack *ss, jsbytecode *p
 
-              case JSOP_SWAP:
-                /*
-                 * We don't generate this opcode currently, and previously we
-                 * did not need to decompile it.  If old, serialized bytecode
-                 * uses it still, we should fall through and set todo = -2.
-                 */
-                /* FALL THROUGH */
-
               case JSOP_GOSUB:
@@ -4946,4 +4938,3 @@ DecompileExpression(JSContext *cx, JSScr
     JS_ASSERT(op != JSOP_CASE && op != JSOP_CASEX &&
-              op != JSOP_DUP && op != JSOP_DUP2 &&
-              op != JSOP_SWAP);
+              op != JSOP_DUP && op != JSOP_DUP2);
 
@@ -5219,9 +5210,2 @@ ReconstructPCStack(JSContext *cx, JSScri
 
-          case JSOP_SWAP:
-            JS_ASSERT(ndefs == 2);
-            pc2 = pcstack[pcdepth];
-            pcstack[pcdepth] = pcstack[pcdepth + 1];
-            pcstack[pcdepth + 1] = pc2;
-            break;
-
           case JSOP_LEAVEBLOCKEXPR:
--- /home/igor/m/ff/mozilla/quilt.ANS1mk/js/src/jsdbgapi.c	2008-05-22 16:12:30.000000000 +0200
+++ js/src/jsdbgapi.c	2008-05-22 11:12:21.000000000 +0200
@@ -64,3 +64,3 @@
 
-#include <jsautooplen.h>
+#include "jsautooplen.h"
 
--- /home/igor/m/ff/mozilla/quilt.ANS1mk/js/src/jsemit.c	2008-05-22 16:12:30.000000000 +0200
+++ js/src/jsemit.c	2008-05-22 11:11:56.000000000 +0200
@@ -68,3 +68,3 @@
 
-#include <jsautooplen.h>
+#include "jsautooplen.h"
 
--- /home/igor/m/ff/mozilla/quilt.ANS1mk/js/src/jsobj.c	2008-05-22 16:12:30.000000000 +0200
+++ js/src/jsobj.c	2008-05-22 11:12:21.000000000 +0200
@@ -90,3 +90,3 @@
 
-#include <jsautooplen.h>
+#include "jsautooplen.h"
Attachment #320972 - Attachment is obsolete: true
Attachment #322098 - Flags: review?(brendan)
Attachment #320972 - Flags: review?(brendan)
Attached patch v4 with all files (obsolete) — Splinter Review
I forgot to include the newly introduced jsoplengen.c in the previous patch.
Attachment #322098 - Attachment is obsolete: true
Attachment #322104 - Flags: review?(brendan)
Attachment #322098 - Flags: review?(brendan)
Comment on attachment 322104 [details] [diff] [review]
v4 with all files

>-    /* Check for too deep a C stack. */
>+    /* Check for too deep of C stack. */

Sorry, my last comment should have suggested "of a" -- need the indefinite article before "C stack". "thread stack" would be even better in view of API name, and of the language-independent nature of the thread stack.

>+#if JS_THREADED_INTERP
>+    /*
>+     * This is a loop, but it does not look like a loop. The loop-closing
>+     * jump is distributed throughout goto *jumpTable[op] implementing DO_OP.

s/implementing/inside of/

r=me with these final nits picked.

/be
Attachment #322104 - Flags: review?(brendan) → review+
Attached patch v4b (obsolete) — Splinter Review
The new version addresses the nits from the comment 20:

+++ js/src/jsinterp.c	2008-05-22 19:42:16.000000000 +0200
@@ -2639,3 +2639,3 @@ js_Interpret(JSContext *cx)
 
-    /* Check for too deep of C stack. */
+    /* Check for too deep of a native thread stack. */
     JS_CHECK_RECURSION(cx, return JS_FALSE);
@@ -2786,3 +2786,3 @@ js_Interpret(JSContext *cx)
      * This is a loop, but it does not look like a loop. The loop-closing
-     * jump is distributed throughout goto *jumpTable[op] implementing DO_OP.
+     * jump is distributed throughout goto *jumpTable[op] inside of DO_OP.
      * When interrupts are enabled, jumpTable is set to interruptJumpTable
Attachment #322104 - Attachment is obsolete: true
Attachment #322135 - Flags: review+
This is ready to land, right?
(In reply to comment #22)
> This is ready to land, right?

Yes.

Attached patch v4b (mozilla-central version) (obsolete) — Splinter Review
The patch is a version of v4b generated against mozilla-central.
Attachment #322135 - Attachment is obsolete: true
Attachment #325536 - Flags: review+
Attached patch v5 (sync with tip) (obsolete) — Splinter Review
The new version synchronizes the patch with mozilla-central changes.
Attachment #325536 - Attachment is obsolete: true
Attachment #325937 - Flags: review+
mozilla-central push:

changeset:   15457:21527193c49b
tag:         tip
parent:      15455:f7aaa37e926a
user:        Igor Bukanov <igor@mir2.org>
date:        Fri Jun 20 09:36:56 2008 +0200
summary:     [Bug 433382] More efficient interpreter switch when computed goto is not available. r=brendan
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
I backed out the pushed changes: in the Makefile.in I fotrgot to change CSRCS to CPPSRCS for the dependency rule for the new autogenerated header.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #325943 - Flags: review+
Attachment #325943 - Attachment is patch: true
Attachment #325943 - Attachment mime type: application/octet-stream → text/plain
Attachment #325937 - Attachment is obsolete: true
mozilla-central push:

changeset:   15461:97977f224aff
tag:         tip
parent:      15458:404e6281b360
user:        Igor Bukanov <igor@mir2.org>
date:        Fri Jun 20 10:06:45 2008 +0200
summary:     [Bug 433382] More efficient interpreter switch when computed goto
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
I backed out again due to more breakage. Some dependencies are wrong...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The new patch includes the missing jsoplengen.cpp.
Attachment #325943 - Attachment is obsolete: true
Attachment #325955 - Flags: review+
Here is yet another push to mozilla-central:

changeset:   15463:5a770d1e8bb7
tag:         tip
user:        Igor Bukanov <igor@mir2.org>
date:        Fri Jun 20 11:55:49 2008 +0200
summary:     [Bug 433382] More efficient interpreter switch when computed goto is not available. r=brendan
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
Depends on: 441619
Flags: in-testsuite-
Flags: in-litmus-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: