Closed Bug 432541 Opened 17 years ago Closed 17 years ago

Generate more/better superwords to improve interpreter speed

Categories

(Tamarin Graveyard :: Tracing Virtual Machine, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: stejohns, Assigned: stejohns)

Details

Attachments

(1 file, 2 obsolete files)

Attached patch Patch (obsolete) — Splinter Review
Major restructuring of the Interpreter, Tracer, and Forth compiler to improve interpreter speed. (Note that this patch is Mac-only for now; testing and fixing on other platforms will be done soon and the patch revised accordingly.)

The big picture is that the Forth compiler now tries to identify superword candidates and generate them when possible, even for subsections of existing words. (It's no longer necessary -- or legal -- to use the SUPER: keyword in Forth.)

Mac Release builds are around twice as fast as current tip in -interp mode:

                            TT-OLD  TT-NEW
  prime4                      1386    1010
  access-binary-trees         1196     603
  access-fannkuch             3489    1747
  access-nbody                2260    1208
  access-nsieve                820     368
  bitops-3bit-bits-in-byte     631     276
  bitops-bits-in-byte          797     384
  bitops-bitwise-and          4897    2582
  bitops-nsieve-bits          1701     873
  crypto-md5                   706     318
  crypto-sha1                  726     329
  math-cordic                 1187     632
  math-partial-sums           3572    1870
  math-spectral-norm          1014     471
  s3d-cube                     845     470
  s3d-morph                   1443     694
  s3d-raytrace                1992    1026
  string-fasta                1954     996

In trace mode, speedup is less dramatic, but a few tests did see nontrivial speedups.

The code for implementing each primitive has been moved from Interpreter.cpp into prim.fs. The "interp" section implements the interpreter stage; the "trace" section implements the trace step. From these the Forth compiler will generate two loops: an interpreter loop designed for maximum speed (complete with superwords), and a tracer loop used when tracing a hot path for possible jitting. (Making these two separate loops gives a little code redundancy but allows the C++ compiler to do a better job of optimizing.) The interpreter and trace loops themselves are compiled in a new file, vm_interp.cpp, which is used by Interpreter.cpp.

Primitives now use the following syntax:

  PRIM: primname (( argsin -- argsout ))
  interp:{ C++ code goes here }
  trace:{ C++ code goes here }
  ;

Note that a trailing semicolon is now required.
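
To make the superword idea concrete, here is a minimal sketch in Python (the language fc.py is written in). It is not the code from the patch: the toy primitives, the DUP_ADD_DUP_MUL name, and the dispatch table are all invented for illustration. It only shows that fusing a run of primitives into one table entry gives the same stack result with fewer dispatches. In the real generator the C++ bodies would presumably be emitted back to back inside one case, rather than looped over as they are here.

```python
# Toy stack machine; primitive names and the fusion scheme are illustrative only.

def prim_dup(stack):
    stack.append(stack[-1])

def prim_add(stack):
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)

def prim_mul(stack):
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)

PRIMS = {"DUP": prim_dup, "ADD": prim_add, "MUL": prim_mul}


def make_superword(names):
    """Fuse a fixed run of primitives into one callable (a single dispatch)."""
    bodies = [PRIMS[n] for n in names]

    def superword(stack):
        for body in bodies:
            body(stack)

    return superword


def run(program, table, stack):
    """Dispatch loop; returns the number of dispatches performed."""
    dispatches = 0
    for opcode in program:
        table[opcode](stack)
        dispatches += 1
    return dispatches


def demo():
    plain = ["DUP", "ADD", "DUP", "MUL"]
    s1 = [3]
    n1 = run(plain, PRIMS, s1)

    # Hypothetical superword covering the whole run above.
    table = dict(PRIMS)
    table["DUP_ADD_DUP_MUL"] = make_superword(plain)
    s2 = [3]
    n2 = run(["DUP_ADD_DUP_MUL"], table, s2)
    return s1, n1, s2, n2
```

Calling demo() runs the same four primitives both ways, so the two stacks can be compared alongside the dispatch counts.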

You must specify both trace and interp blocks for all prims, with two exceptions: stack-motion prims (DUP etc.) specify only trace blocks, since the interp blocks are provided by the Forth compiler; and function primitives can omit the trace block (if they do, a default one is autogenerated).

The format/syntax of the interp/trace sections in prim.fs is still a little ragged and might change a bit (e.g., I am using the Forth parsing code to read in the C++ code, meaning that there is exaggerated white space at times, since that's all the Forth parser looks for).

The main knobs to control superword generation are --maxnest (which will inline words to a certain nesting depth, rather than issuing NEST) and --superlen (which defines the minimum length a potential superword must be, in Forth instructions, in order to be superworded). You can use these on the command line to set the global defaults, but it's much more useful to set them on a per-word basis. You can do this by editing core/wordmods.txt, which currently defines a handful of words as "cold" (no nesting or supering) or "hot" (aggressive nesting and supering). There's lots more tuning that could be done here, by adding words and/or tuning the granularity. There should definitely be some profiling done to see what other words might benefit from being marked as hot (or warm; you can have different settings for different words).

Typical settings result in a few hundred superwords being generated, meaning there just wasn't space in bytecode to do what we needed, so the Forth opcodes are now 16-bit words. (Thus the Token/wcodep/wcodepp types are gone, replaced by FOpcode/FOpcodep/FOpcodepp.)

The various INVOKE primitives (IINVOKEII, etc.) were all removed, since we can now devote a single opcode to each such call, and generation of the trace code can be derived from new information added to CallInfo.
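
As a rough illustration of how the two knobs interact, the following Python sketch inlines colon words up to a nesting depth and then looks for runs of primitives long enough to be superworded. Everything in it is an assumption made for the example: the WORDS table, the ("NEST", name) tuple, and the rule that a NEST ends a candidate run are not taken from fc.py, and the format of core/wordmods.txt is not modelled at all.

```python
# Hypothetical word table: each colon word is a list of primitives or other words.
WORDS = {
    "SQUARE": ["DUP", "MUL"],
    "CUBE": ["DUP", "SQUARE", "MUL"],
    "POLY": ["CUBE", "SWAP", "SQUARE", "ADD"],
}


def inline(body, maxnest, depth=0):
    """Expand calls to colon words up to maxnest levels; deeper calls stay as NEST."""
    out = []
    for tok in body:
        if tok in WORDS:
            if depth < maxnest:
                out.extend(inline(WORDS[tok], maxnest, depth + 1))
            else:
                out.append(("NEST", tok))
        else:
            out.append(tok)
    return out


def super_candidates(flat, superlen):
    """Return runs of consecutive primitives that are at least superlen long."""
    runs, current = [], []
    for tok in flat:
        if isinstance(tok, tuple):  # assumed: a NEST breaks the straight-line run
            if len(current) >= superlen:
                runs.append(current)
            current = []
        else:
            current.append(tok)
    if len(current) >= superlen:
        runs.append(current)
    return runs
```

With maxnest set to 0, POLY keeps both of its calls as NESTs and no run reaches a superlen of 3; raising maxnest flattens more of the body and exposes longer candidate runs, which is the trade-off the "hot" and "cold" per-word settings are tuning.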

Since we have such a huge amount of opcode space now that opcodes are 16-bit words, all EXTERN: words in Forth get an opcode dedicated to them; this is basically a NEST internally, but it allows more compact Forth generation in IL.cpp.

Other notes:
-- rewrote getDefaultNamespace and createStackTrace in Forth so that they could use CURRENTFRAME; this means we no longer have any function primitives that access interp.currentFrame and thus don't have to stop a trace for them. (Do we still need interp.currentFrame at all?)
-- added AVMPLUS_UNALIGNED_ACCESS define to avmbuild.h; this is used (instead of various Intel checks) to indicate whether it's safe to do unaligned word/long access.
-- added a --forthonly flag to builtin.py; passing this regenerates Forth but not the builtin AS3 code.
-- turned on -fstrict-aliasing (and -Wstrict-aliasing) in Xcode builds and fixed a few things that broke as a result, mostly in MathUtils.
Attachment #319686 - Flags: review?(edwsmith)
Priority: -- → P1
Looks like wordmods.txt is missing from the patch
I wonder if it's possible to use #line directives (iirc) to get the debugger to point to the code in prim.fs when debugging vm_interp.cpp
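
A sketch of what that could look like on the generator side, assuming the Forth compiler knows the prim.fs line each C++ fragment came from. The function name and the (line_number, cpp_text) pairing are hypothetical; the only thing taken as given is the standard C/C++ `#line N "file"` preprocessor directive.

```python
def emit_with_line_directives(fragments, source_name="prim.fs"):
    """Prefix each C++ fragment with a #line directive naming its origin.

    fragments: list of (line_number, cpp_text) pairs; this pairing is an
    assumption about what the generator could track, not its real data model.
    """
    out = []
    for line_number, cpp_text in fragments:
        out.append('#line %d "%s"' % (line_number, source_name))
        out.append(cpp_text)
    return "\n".join(out)
```

A real generator would probably also emit a further #line after each fragment to switch back to the generated file, so that the hand-written parts of vm_interp.cpp are not attributed to prim.fs.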
Attached patch Patch (obsolete) — Splinter Review
new patch that includes missing file
Attachment #319686 - Attachment is obsolete: true
Attachment #319686 - Flags: review?(edwsmith)
Attached patch Patch — Splinter Review
Built and tested in MSVC for Win32 and PocketPC/ARM. (ARM testing was minimal but does run a few tests in interp and trace mode.) Fixed a few performance issues that turned up in the Windows build; we're now at least at parity with previous Windows builds, and ahead on most tests (though not by nearly as much as the GCC builds). I think this patch is ready to go in.
Attachment #319799 - Attachment is obsolete: true
Attachment #319904 - Flags: review?(edwsmith)
Is it still a win to hand-superoptimize certain primitives, like PICK2, PICK3, etc., or can superwording combine (e.g.) 5 PICK into a primitive, or into adjacent primitives? Admittedly such hand-built superwords may make sense when they're used directly from IL.cpp. iow, should PICK4 and friends be defined like 3drop and friends?

Since IP is accessed directly in interp{} and trace{} blocks, should we make it mean the same thing in both contexts? fc.py can substitute (ip-1) or (ip+1) where it needs to and optimize away the const offset. Maybe follow-on work; obviously what's there was ported, and changing what IP means in one or the other will be tedious, but it's probably worth it for the long run.
Attachment #319904 - Flags: review?(edwsmith) → review+
nice; i like seeing interp{} trace{} blocks together. I was trying to figure out the overhead of the tracing framework in the interpreter. What I mean is, if we are interpreting, there is a conditional check (per opcode) for whether tracing is enabled, right? As we are trying to increase interp speed, I wonder if we should apply some technique to dynamically un-hook the tracer from the interp loop when it's not being used.
re: PICK2 and friends, we get a nontrivial benefit from optimizing those directly because the Forth compiler knows how to rearrange stack motion at compile time. That said, we could specialize it for pick-and-a-constant rather than the hardcoded ones we have now.

re: IP being same in both contexts, yep, I'd like to do that as follow-on work. Existing code all made the assumptions you see and it was simpler to maintain that than to introduce off-by-one errors.

re: tracer and interp loop unhooked, that's already done in this patch! The do_interp loop is used when we are not tracing, the do_trace loop when we are, and there are only a handful of primitives that need to check to see if we are swapping between modes.
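
For readers following along, the two-loop arrangement can be sketched in Python as below. This is only a model of the idea: the do_interp and do_trace names come from the comment above, but the toy opcodes (PUSH, DEC, LOOP), the hot-loop counter, and the choice of LOOP as the one primitive that checks for a mode switch are all assumptions made for illustration.

```python
def do_interp(program, ip, stack, hot_threshold, counts):
    """Fast loop: no per-opcode tracing check. Only LOOP can request a switch."""
    while ip < len(program):
        op, arg = program[ip]
        ip += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "DEC":
            stack[-1] -= 1
        elif op == "LOOP":  # jump back to arg while top of stack > 0
            if stack[-1] > 0:
                counts[arg] = counts.get(arg, 0) + 1
                ip = arg
                if counts[arg] == hot_threshold:
                    return "trace", ip  # hand over to the tracing loop once
    return "done", ip


def do_trace(program, ip, stack, trace):
    """Slow loop: records every opcode it executes, for one loop iteration."""
    while ip < len(program):
        op, arg = program[ip]
        trace.append((ip, op))
        ip += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "DEC":
            stack[-1] -= 1
        elif op == "LOOP":
            if stack[-1] > 0:
                ip = arg
            return "interp", ip  # one iteration recorded; back to the fast loop
    return "done", ip


def run(program, hot_threshold=2):
    """Driver that swaps between the two loops until the program ends."""
    stack, trace, counts = [], [], {}
    mode, ip = "interp", 0
    while mode != "done":
        if mode == "interp":
            mode, ip = do_interp(program, ip, stack, hot_threshold, counts)
        else:
            mode, ip = do_trace(program, ip, stack, trace)
    return stack, trace
```

The point of the split is that the PUSH and DEC paths in do_interp carry no tracing test at all; only the back-edge primitive pays for the check.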
pushed as changeset: 356:81b347b101d0
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED