Bug 432541
Opened 17 years ago
Closed 17 years ago
Generate more/better superwords to improve interpreter speed
Categories
(Tamarin Graveyard :: Tracing Virtual Machine, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: stejohns, Assigned: stejohns)
Attachments (1 file, 2 obsolete files): 427.41 KB patch, edwsmith: review+
Major restructuring of the Interpreter, Tracer, and Forth compiler to improve interpreter speed. (Note that this patch is Mac-only for now; testing and fixing on other platforms will be done soon and the patch revised accordingly.)
Big picture is that the Forth compiler now tries to identify superword candidates and generate them when possible, even for subsections of existing words. (It's no longer necessary -- or legal -- to use the SUPER: keyword in Forth.) Mac Release builds are around twice as fast as current tip in -interp mode:
test                        TT-OLD   TT-NEW
prime4                        1386     1010
access-binary-trees           1196      603
access-fannkuch               3489     1747
access-nbody                  2260     1208
access-nsieve                  820      368
bitops-3bit-bits-in-byte       631      276
bitops-bits-in-byte            797      384
bitops-bitwise-and            4897     2582
bitops-nsieve-bits            1701      873
crypto-md5                     706      318
crypto-sha1                    726      329
math-cordic                   1187      632
math-partial-sums             3572     1870
math-spectral-norm            1014      471
s3d-cube                       845      470
s3d-morph                     1443      694
s3d-raytrace                  1992     1026
string-fasta                  1954      996
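As a quick check of the "around twice as fast" figure, the per-test ratios can be computed from the numbers above. This is a minimal sketch: the timings are copied from the table (the bug does not state their units), and summarizing them with a geometric mean is an assumption made here, not something the patch author did.

```python
import math

# (TT-OLD, TT-NEW) pairs copied from the table in this bug; units are not stated.
TIMINGS = {
    "prime4": (1386, 1010),
    "access-binary-trees": (1196, 603),
    "access-fannkuch": (3489, 1747),
    "access-nbody": (2260, 1208),
    "access-nsieve": (820, 368),
    "bitops-3bit-bits-in-byte": (631, 276),
    "bitops-bits-in-byte": (797, 384),
    "bitops-bitwise-and": (4897, 2582),
    "bitops-nsieve-bits": (1701, 873),
    "crypto-md5": (706, 318),
    "crypto-sha1": (726, 329),
    "math-cordic": (1187, 632),
    "math-partial-sums": (3572, 1870),
    "math-spectral-norm": (1014, 471),
    "s3d-cube": (845, 470),
    "s3d-morph": (1443, 694),
    "s3d-raytrace": (1992, 1026),
    "string-fasta": (1954, 996),
}


def speedups(timings):
    """Return {test: old / new} for every benchmark."""
    return {name: old / new for name, (old, new) in timings.items()}


def geometric_mean(values):
    """Geometric mean, the usual way to average ratios."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    ratios = speedups(TIMINGS)
    # Print slowest-improving tests first.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name:26s} {ratio:.2f}x")
    print(f"geometric mean: {geometric_mean(ratios.values()):.2f}x")
```

Sorting by ratio makes the outliers easy to spot; prime4 improves the least of the listed tests.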
In trace mode, speedup is less dramatic, but a few tests did see nontrivial speedups.
The code for implementing each primitive has been moved from Interpreter.cpp into prim.fs. The "interp" section implements the interpreter stage; the "trace" section the trace step. From these the Forth compiler will generate two loops: an interpreter loop designed for maximum speed (complete with superwords), and a tracer loop used when tracing a hot path for possible jitting. (Making these two separate loops gives a little code redundancy but allows the C++ compiler to do a better job of optimizing.)
The interpreter and trace loops themselves are compiled in a new file, vm_interp.cpp, which is used by Interpreter.cpp.
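The idea of generating two loops from one primitive table can be sketched roughly as follows. This is not the code in fc.py: the primitive table, the C++ snippets, the `Frame` type, and the `record()` helper are all invented for illustration. Only the loop names `do_interp` and `do_trace` come from this bug (see comment 7).

```python
# Hypothetical primitive table: name -> {"interp": C++ body, "trace": C++ body}.
# The C++ snippets and the record() helper are made up for this sketch.
PRIMS = {
    "IADD": {"interp": "sp[-1] += sp[0]; --sp;",
             "trace":  "record(OP_IADD); sp[-1] += sp[0]; --sp;"},
    "DROP": {"interp": "--sp;",
             "trace":  "record(OP_DROP); --sp;"},
}


def emit_loop(func_name, prims, section):
    """Emit C++ source for one dispatch loop, using only the given section."""
    lines = [f"void {func_name}(Frame& f) {{",
             "  for (;;) {",
             "    switch (*f.ip++) {"]
    for name, bodies in prims.items():
        lines.append(f"      case OP_{name}: {{ {bodies[section]} }} break;")
    lines += ["    }", "  }", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Same primitives, two separate loops: one lean, one with tracing hooks.
    print(emit_loop("do_interp", PRIMS, "interp"))
    print()
    print(emit_loop("do_trace", PRIMS, "trace"))
```

Because the interpreter loop never contains tracing code, the C++ compiler sees a smaller, branch-free body per case, which is the optimization benefit described above.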
Primitives now use a syntax of:
PRIM: primname (( argsin -- argsout ))
interp:{ C++ code goes here }
trace:{ C++ code goes here } ;
Note that a trailing semicolon is now required. You must specify both trace and interp blocks for all prims, except: stack-motion prims (DUP etc) only specify trace blocks, the interp blocks are provided by the Forth compiler; and function primitives can omit the trace block (if they do then a default one is autogenerated).
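To make the shape of a definition concrete, here is a toy parser for the syntax above. It is a sketch under stated assumptions, not the parser in fc.py: it splits purely on whitespace, ends a block at the first standalone `}` token, and the sample primitive (`IADD`, `record_add()`) is hypothetical.

```python
def parse_prim(text):
    """Parse one 'PRIM: name (( in -- out )) interp:{ ... } trace:{ ... } ;' form."""
    tokens = text.split()  # whitespace-only tokenizing, as a Forth reader would do
    pos = 0

    def expect(tok):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != tok:
            found = tokens[pos] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"expected {tok!r}, found {found!r}")
        pos += 1

    def take_until(stop):
        nonlocal pos
        out = []
        while pos < len(tokens) and tokens[pos] != stop:
            out.append(tokens[pos])
            pos += 1
        expect(stop)
        return out

    def take_body():
        # Assumption: nested braces only count when they are standalone tokens.
        nonlocal pos
        depth, out = 0, []
        while pos < len(tokens):
            tok = tokens[pos]
            pos += 1
            if tok == "{":
                depth += 1
            elif tok == "}":
                if depth == 0:
                    return " ".join(out)
                depth -= 1
            out.append(tok)
        raise SyntaxError("unterminated block")

    expect("PRIM:")
    name = tokens[pos]
    pos += 1
    expect("((")
    args_in = take_until("--")
    args_out = take_until("))")
    sections = {}
    while pos < len(tokens) and tokens[pos] in ("interp:{", "trace:{"):
        key = tokens[pos][:-2]  # strip the trailing ":{"
        pos += 1
        sections[key] = take_body()
    expect(";")  # the trailing semicolon is required
    return {"name": name, "in": args_in, "out": args_out, **sections}


# Hypothetical primitive, for illustration only.
EXAMPLE = """
PRIM: IADD (( a b -- c ))
interp:{ c = a + b; }
trace:{ c = a + b; record_add(); } ;
"""

if __name__ == "__main__":
    print(parse_prim(EXAMPLE))
```

Whitespace-only tokenizing is why the C++ bodies need spaces around the braces, which matches the "exaggerated white space" caveat in this description.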
The format/syntax of the interp/trace sections in prim.fs is still a little ragged and might change a bit (e.g., I am using the Forth parsing code to read in the C++ code, meaning that there is exaggerated white space at times since that's all the Forth parser looks for).
The main knobs to control superword generation are --maxnest (which will inline words to a certain nesting depth, rather than issuing NEST) and --superlen (which defines the minimum length a potential superword must be, in Forth instructions, in order to be superworded). You can use these on the command line to set the global defaults, but it's much more useful to set them on a per-word basis. You can do this by editing core/wordmods.txt, which currently defines a handful of words as "cold" (no nesting or supering) or "hot" (aggressive nesting and supering). There's lots more tuning that could be done here, by adding words and/or tuning the granularity. There should definitely be some profiling done to see what other words might benefit from being marked as hot (or warm; you can have different settings for different words).
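A toy model may help show how the two knobs interact. It is only a sketch: the word definitions, the `HOT`/`COLD` dictionaries, and the `SUPER(...)` notation are hypothetical, control flow is ignored, and the real format of core/wordmods.txt is not shown in this bug, so per-word settings are passed as a plain dictionary.

```python
DEFAULTS = {"maxnest": 1, "superlen": 3}

# Hypothetical per-word overrides standing in for "hot" and "cold" entries.
HOT = {"maxnest": 4, "superlen": 2}
COLD = {"maxnest": 0, "superlen": 10**9}  # never inline, never fuse


def flatten(word, words, maxnest, depth=0):
    """Inline calls to other words up to maxnest levels; deeper calls stay NESTs."""
    out = []
    for item in words[word]:
        if item in words:  # a call to another Forth word
            if depth < maxnest:
                out.extend(flatten(item, words, maxnest, depth + 1))
            else:
                out.append(("NEST", item))
        else:  # a primitive
            out.append(item)
    return out


def superword(word, words, mods=None):
    """Fuse each run of primitives of length >= superlen into one superword."""
    settings = {**DEFAULTS, **(mods or {}).get(word, {})}
    flat = flatten(word, words, settings["maxnest"])
    result, run = [], []

    def flush():
        if run and len(run) >= settings["superlen"]:
            result.append("SUPER(" + " ".join(run) + ")")
        else:
            result.extend(run)
        run.clear()

    for item in flat:
        if isinstance(item, tuple):  # a NEST breaks the current run
            flush()
            result.append(item)
        else:
            run.append(item)
    flush()
    return result


# Hypothetical word definitions.
WORDS = {
    "square": ["DUP", "IMUL"],
    "sumsq": ["square", "SWAP", "square", "IADD"],
}

if __name__ == "__main__":
    print(superword("sumsq", WORDS, {"sumsq": HOT}))
    print(superword("sumsq", WORDS, {"sumsq": COLD}))
```

With the hot settings the whole body collapses into a single fused sequence; with the cold settings the nested calls remain NESTs and no run is long enough to fuse.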
Typical settings result in a few hundred superwords being generated, meaning there just wasn't space in bytecode to do what we needed, so the Forth opcodes are now 16-bit words. (Thus the Token/wcodep/wcodepp types are gone, replaced by FOpcode/FOpcodep/FOpcodepp)
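The space argument is simple arithmetic: an 8-bit opcode can name at most 256 operations, which a few hundred superwords plus the base primitives exceed. A small sketch using Python's `struct` module follows; the opcode values and the little-endian byte order are assumptions for illustration, not the VM's actual encoding.

```python
import struct


def encode(opcodes, width_bits=16):
    """Pack opcode numbers as unsigned 8-bit or 16-bit little-endian values."""
    fmt = {8: "B", 16: "H"}[width_bits]
    return struct.pack(f"<{len(opcodes)}{fmt}", *opcodes)


if __name__ == "__main__":
    program = [1, 2, 300]  # 300 stands in for a superword opcode above 255
    print(len(encode(program, 16)), "bytes with 16-bit opcodes")
    try:
        encode(program, 8)
    except struct.error as exc:
        print("does not fit in 8 bits:", exc)
```

The cost is that each opcode takes two bytes instead of one, in exchange for up to 65536 distinct opcodes.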
The various INVOKE primitives (IINVOKEII, etc) were all removed, since we can now devote a single opcode for each such call, and generation of the trace code can be derived based on new information added to CallInfo.
Since we now have such a huge amount of opcode space thanks to the wider 16-bit opcodes, all EXTERN: words in Forth get an opcode dedicated to them; this is basically a NEST internally, but it allows more compact Forth generation in IL.cpp.
Other notes:
-- rewrote getDefaultNamespace and createStackTrace in Forth so that they could use CURRENTFRAME; this means we no longer have any function primitives that access interp.currentFrame and thus don't have to stop a trace for them. (Do we still need interp.currentFrame at all?)
-- Added AVMPLUS_UNALIGNED_ACCESS define to avmbuild.h; this is used (instead of various Intel checks) to indicate whether it's safe to do unaligned word/long access.
-- added a --forthonly flag to builtin.py; passing this regenerates forth but not the builtin AS3 code.
-- turned on -fstrict-aliasing (and -Wstrict-aliasing) in Xcode builds and fixed a few things that broke as a result, mostly in MathUtils
Attachment #319686 -
Flags: review?(edwsmith)
Updated•17 years ago
|
Priority: -- → P1
Comment 1 • 17 years ago
Looks like wordmods.txt is missing from the patch
Comment 2 • 17 years ago
I wonder if it's possible to use #file, #line (iirc) directives to get the debugger to point to the code in prim.fs when debugging vm_interp.cpp
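For reference, the C and C++ preprocessors provide a `#line` directive that takes a line number and an optional file name (there is no separate `#file` directive). A generator could emit one before each copied body, roughly like the sketch below. This is hypothetical and not what fc.py does; the primitive names, line numbers, and C++ bodies are invented.

```python
def emit_with_line_directives(prims, source_file="prim.fs"):
    """Emit C++ cases, each preceded by a #line pointing back at its source.

    prims is a list of (name, line_number_in_source, cpp_body) tuples.
    """
    out = []
    for name, line_no, body in prims:
        # '#line N "file"' makes the compiler attribute the next line to file:N.
        out.append(f'#line {line_no} "{source_file}"')
        out.append(f"case OP_{name}: {{ {body} }} break;")
    return "\n".join(out)


if __name__ == "__main__":
    # Hypothetical primitives and line numbers.
    sample = [("IADD", 12, "sp[-1] += sp[0]; --sp;"),
              ("DROP", 20, "--sp;")]
    print(emit_with_line_directives(sample))
```

Since debuggers read file and line information produced by the compiler, breakpoints and stepping would then refer to prim.fs for those bodies.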
Comment 3 (Assignee) • 17 years ago
new patch that includes missing file
Attachment #319686 -
Attachment is obsolete: true
Attachment #319686 -
Flags: review?(edwsmith)
Comment 4 (Assignee) • 17 years ago
Built and tested in MSVC for Win32 and PocketPC/ARM. (Arm testing was minimal but does run a few tests in interp and trace mode.) Fixed a few performance issues that turned up in Windows build; we're now at least at parity with previous Windows builds, and ahead on most (though not nearly as much as the GCC builds). I think this patch is ready to go in.
Attachment #319799 -
Attachment is obsolete: true
Attachment #319904 -
Flags: review?(edwsmith)
Comment 5 • 17 years ago
Is it still a win to hand-superoptimize certain primitives, like PICK2, PICK3, etc, or can superwording combine (eg) 5 PICK into a primitive, or into adjacent primitives? Admittedly such hand-built superwords maybe make sense when they're used directly from IL.cpp. IOW, should PICK4 and friends be defined like 3drop and friends?
Since IP is accessed directly in interp{} and trace{} blocks, should we make it mean the same thing in both contexts? fc.py can substitute (ip-1) or (ip+1) where it needs to and optimize away the const offset. Maybe follow-on work; obviously what's there was ported, and changing what IP means in one or the other will be tedious, but probably worth it for the long run.
Updated•17 years ago
|
Attachment #319904 -
Flags: review?(edwsmith) → review+
Comment 6 • 17 years ago
nice; i like seeing interp{} trace{} blocks together.
I was trying to figure out the overhead of the tracing framework in the
interpreter. What I mean is, if we are interpreting there is a conditional
check (per opcode) for tracing enabled or not; right?
As we are trying to increase interp speed I wonder if we should apply some
technique to dynamically un-hook the tracer from the interp loop when it's not
being used.
Comment 7 (Assignee) • 17 years ago
re: PICK2 and friends, we get a nontrivial benefit from optimizing those directly because the forth compiler knows how to rearrange stack motion at compile time. That said, we could specialize it for pick-and-a-constant rather than the hardcoded ones we have now.
re: IP being same in both contexts, yep, I'd like to do that as follow-on work. existing code all made the assumptions you see and it was simpler to maintain that than to introduce off-by-one errors.
re: tracer and interp loop unhooked, that's already done in this patch! the do_interp loop is used when we are not tracing, the do_trace loop when we are. and there are only a handful of primitives that need to check to see if we are swapping between modes.
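On the PICK2 point above, "rearranging stack motion at compile time" can be pictured as a symbolic stack simulation: the compiler tracks which input each stack slot holds, so a run of stack-motion words reduces to one net mapping. The sketch below is an illustration only, not the Forth compiler's code. It assumes the standard Forth convention that `0 PICK` behaves like DUP, and the op encoding is hypothetical.

```python
def stack_effect(ops, depth):
    """Symbolically run stack-motion ops; return the final stack, top last.

    Inputs are named in0 (deepest) .. in{depth-1} (top of stack).
    """
    stack = [f"in{i}" for i in range(depth)]
    for op in ops:
        if op == "DUP":
            stack.append(stack[-1])
        elif op == "DROP":
            stack.pop()
        elif op == "SWAP":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif isinstance(op, tuple) and op[0] == "PICK":
            # ("PICK", n) copies the n-th item from the top; n == 0 is DUP.
            stack.append(stack[-1 - op[1]])
        else:
            raise ValueError(f"not a stack-motion op: {op!r}")
    return stack


if __name__ == "__main__":
    # Three stack-motion words reduce to "overwrite the top with a copy of in0".
    print(stack_effect([("PICK", 2), "SWAP", "DROP"], depth=3))
```

Once the net mapping is known, a superword only has to perform the moves that actually change a slot, instead of executing each stack word in turn.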
Comment 8 (Assignee) • 17 years ago
pushed as changeset: 356:81b347b101d0
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED