Closed Bug 440601 Opened 16 years ago Closed 15 years ago

Analyze register usage of helper functions called from the trace.

Categories

(Tamarin Graveyard :: Baseline JIT (CodegenLIR), enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

VERIFIED WONTFIX
Future

People

(Reporter: gal, Unassigned)

Details

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_3; en-us) AppleWebKit/525.18 (KHTML, like Gecko) Version/3.1.1 Safari/525.20
Build Identifier: 

Many helper functions invoked from generated traces are just small chunks of C code that don't touch all registers. Some simple analysis can be performed on them to detect which registers are actually used. We are doing this analysis in our Hotpath VM at runtime. For Tamarin we would probably want a build-time analysis. This requires all functions to be part of the build image. Calls into libc are ok, but only if libc is bound dynamically.

The analysis is pretty simplistic but works for most use cases. Limitations: 
- the helper function must be a leaf function (at the machine code level, inlined method calls are fine).
- no indirect calls. Direct calls are fine and the analysis follows those edges.
- currently only implemented for x86, but other architectures should be even easier.
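As a purely illustrative sketch, the transitive part of such an analysis might look like the Python below; the function names and register representation here are hypothetical, not taken from our actual Java implementation:

```python
# Hypothetical sketch of the transitive register-usage analysis: given
# per-function register sets and direct-call edges (as recovered by a
# disassembler), union registers over everything reachable, and give up
# (assume all registers clobbered) on anything we cannot follow.

ALL_REGS = frozenset(['eax', 'ebx', 'ecx', 'edx',
                      'esi', 'edi', 'esp', 'ebp'])

def used_registers(func, regs, calls, indirect):
    """regs:     dict func -> registers the body touches directly
    calls:    dict func -> set of directly called functions
    indirect: funcs containing indirect calls, where the analysis bails"""
    seen, result, stack = set(), set(), [func]
    while stack:
        f = stack.pop()
        if f in seen:
            continue
        seen.add(f)
        if f in indirect or f not in regs:
            # Not a leaf-analyzable function: assume the worst.
            return set(ALL_REGS)
        result |= regs[f]
        stack.extend(calls.get(f, ()))
    return result
```

With this shape, a helper that only touches %eax and calls one sub-helper touching %ecx comes back as {eax, ecx}; any indirect call collapses the answer to the full register set, matching the limitations listed above.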

Our code is written in Java (10000 LOC disassembler + about 100 LOC for the analysis). We can either use the code directly and drop it into Tamarin, or try to re-tool the analysis (e.g. using objdump). The former requires taking a dependency on Java being present at build time. The latter takes more work, but might be easier to get right (I trust objdump more than I trust our disassembler).

Gregor Wagner from UCI would be interested in helping with this work since he is doing similar native code analysis with said framework.

Comments?

Reproducible: Always
yes please!

And agreed that objdump might be the right way to approach it. Although it shouldn't really matter, since I imagine the output will be in the form of a data structure (in a separate file) that gets compiled into our code and is used by Assembler.
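To illustrate (the names and bit encoding are invented for the example, not an agreed format), that compiled-in data structure could be as simple as a generated C table mapping helper name to a clobber bitmask, emitted by the analysis script:

```python
# Hypothetical generator for the compiled-in table: turn analysis
# results (helper name -> set of clobbered registers) into a C array
# the Assembler could index. The bit assignments are arbitrary.

REG_BITS = {'eax': 1, 'ecx': 2, 'edx': 4, 'ebx': 8,
            'esp': 16, 'ebp': 32, 'esi': 64, 'edi': 128}

def emit_clobber_table(results):
    lines = ['static const struct { const char* name; unsigned mask; }',
             'helper_clobbers[] = {']
    for name in sorted(results):
        mask = 0
        for r in results[name]:
            mask |= REG_BITS[r]
        lines.append('    { "%s", 0x%02x },' % (name, mask))
    lines.append('};')
    return '\n'.join(lines)
```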

Well, right now we have a Java disassembler that generates Java data structures for the analyzer to look at. We could use objdump plus sed, grep, and other kinds of Unix horror to do the analysis. This would take a bit more time but would make the disassembler more trustworthy, and we wouldn't rely on a JVM being present.

Btw, if the analysis fails (or cannot be run due to a missing JVM), we can simply save all registers. So it's not a fatal dependency.
I use otool on Mac to disassemble object files. The Python script collects all jumps and calls and generates the set of registers used by a function. The script has to be modified for different platforms, but here is my basic idea :)

from string import Template
import re
import os
import string
import sys

Set = set	#use the builtin set type; the deprecated sets.Set is not needed
infile = ""
funcname = ""
functions = Set()
regs = {}  	#Dictionary function name -> registers
calls = {}  	#Dictionary function name -> called functions
addr = {} 	#Dictionary address->function name
jumps = {}
search = Set()
resultregs = Set()

def collectCalls(func):
	#Recursively gather every function reachable from func via direct
	#calls and via jumps whose targets land at a known function start.
	for x in calls.get(func, ()):
		if x not in search:
			search.add(x)
			collectCalls(x)
	for y in jumps.get(func, ()):
		if y in addr and addr[y] not in search:
			search.add(addr[y])
			collectCalls(addr[y])

def collectRegs(func):
	#Union the register sets of func and everything it reaches.
	search.add(func)
	collectCalls(func)
	for s in search:
		for temp in regs.get(s, ()):
			resultregs.add(temp)

def addall():
	regs[funcname].add('%eax')
	regs[funcname].add('%ebx')
	regs[funcname].add('%ecx')
	regs[funcname].add('%edx')
	regs[funcname].add('%esp')
	regs[funcname].add('%ebp')
	regs[funcname].add('%esi')
	regs[funcname].add('%edi')

print "usage: python reg.py <input_file(.o)> <method_name>"

infile = sys.argv[1]
command = Template("otool -tV $inp > tmp.txt")
os.popen(command.substitute(inp=infile))

method = sys.argv[2]
whole = open("tmp.txt").readlines()

for currline in whole:
	if currline.startswith("_"):
		funcname = currline.strip().strip(":")
		functions.add(funcname)
		regs[funcname] = Set()
		calls[funcname] = Set()
		jumps[funcname] = Set()
	else:
		if funcname:
			if currline.startswith("0"):
				address = currline[0:8]
				addr[address]=funcname
			#General Register
			m=re.search('%e..',currline)
			if m:
				regs[funcname].add(m.group())
			#XMM Register
			#m=re.search('%x...', currline)
			#if m:
			#	regs[funcname].add(m.group())
			#division/multiplication implicitly clobber %edx
			m=re.search('divss|divl|imull', currline)
			if m:
				regs[funcname].add('%edx')
			#jumps: a mnemonic starting with 'j' followed by a hex target
			m=re.search(r'\bj\w*\s.*0x', currline)
			if m:
				ad=re.search('0x........',currline)
				if ad:
					jumps[funcname].add(ad.group()[2:10])
			if 'calll' in currline:
				L = currline.split()
				for word in L:
					if word.startswith('_'):
						calls[funcname].add(word.strip().strip(':'))
			#jumps to register
			m=re.search('\*%e..',currline)
			if m:
				addall()
			
if method not in functions:
	print "function not found"
	print functions
	sys.exit(1)

print "original regs: "
print regs[method]

print "calls to: "
print calls[method]

print "jumps to:"
print jumps[method]

collectRegs(method)

print "regs collected from:"
print search

print "used regs:"
print resultregs

Status: UNCONFIRMED → NEW
Ever confirmed: true
Attached file updated python script
I've extended the static analysis script here to be a little more robust and a little less optimistic, to abstract away ISA and object-dumper differences, and to emit the beginnings of a usable C file with the symbols requested, a static call graph in comments, and the inferred register set.

You must run it on the loadable artifact -- libmozjs.dylib, libmozjs.so, or js3250.dll depending on platform -- rather than on individual object files, and at the moment it gives pretty uninspiring answers for most of the symbols: too many jump off to libc, the JS runtime, or (on Linux) the PLT-resolver functions in ld.so. For most functions the register set is "everything". The indirections through the PLT strike me as a bug, though, and if so can be fixed. I'll look into it on Monday.

If you want to give it a try, make sure to do a build with debugging symbols present so the dumpers can find most of the symbols (though not debug-enabled; then you'll get bogus paths for asserts and such).
Assignee: nobody → graydon
Status: NEW → ASSIGNED
I have investigated this further and am close to concluding that it's not going to help.

- Given all the jumping into runtime-linked sub-helpers (libc, libm and the like) as well as jumping through the PLT to hit other parts of libmozjs.so that are default visible, there only seem to be (from the functions listed in builtins.tbl) 5 candidates that have register usage we can bound as "less than the entire set".

- There is a trick we can use (involving asm labels and aliases) to avoid PLT calls *within* libmozjs.so. But it's intrusive and a big-ish patch.

- There isn't a lot of opportunity for winning if we bother to do so. If I simulate the PLT not being there in the analysis script, I get the same set of only 5 symbols with bounded register use.

The symbols in question are js_UnboxDouble, js_Math_floor, js_String_p_charCodeAt, js_EqualStrings, js_CompareStrings, and js_BooleanToNumber. 

Useful enough to justify continuing on this path? I am having a hard time imagining this is much of a potential win.
Which platform does this analysis apply to? On Mac OS X the sin/cos/floor/ceil/pow libm functions were all perfectly analyzable last time we used this. I am willing to import frequently called libm source into the tree, put it in a separate file, and compile it with special flags. Flushing all registers is a pretty big penalty, especially for UnboxDouble/BoxDouble/UnboxInt/BoxInt.
linux-x86. It's not that we can't analyze libm, it's that it's runtime-linked. If we can get allocator functions and libm functions in-tree, or otherwise statically linked, we'd be ok.
Instead of analyzing this register usage, does it make sense to just JIT some of these methods inline? Thinking especially of charcodeat and unboxdouble
(In reply to comment #8)
> Instead of analyzing this register usage, does it make sense to just JIT some
> of these methods inline? Thinking especially of charcodeat and unboxdouble

I have the same feeling. For anything small enough that the analysis returns something we'll enjoy hearing about, it seems to me we'd be better off open-coding the thing in LIR.
I guess we can do some manual analysis and try this. Math-partial for example would be a good candidate.
Right now LIR doesn't support control flow inside a trace, so we can't really implement builtins directly (we plan to add that pretty soon though).
If you're still interested: I poked around with udis86 while waiting for some email the other day, and sketched out a miniature runtime analysis. It collects registers in use and recursively traces through jumps and calls with immediate-ish operands. If it can't tell what it's looking at, it just assumes everything is clobbered, but for small functions (and absent any PLT linkage) it figures out a smaller register set.

Personally I'd be more comfortable with something like this in the binary than a static script. You'll need, naturally, to install udis86 to get this to work. But it seems to do x86 and x64. Sample output:

~/src/udis86$ ./register_analysis
disassembling 0x8048ddb
0x8048ddb: push ebp
0x8048ddc: mov ebp, esp
0x8048dde: push ebx
0x8048ddf: sub esp, 0x8
0x8048de2: mov eax, [ebp+0x8]
0x8048de5: mov [esp], eax
0x8048de8: call 0x8048db5
conditional jump to 0x8048db5
    disassembling 0x8048db5
    0x8048db5: push ebp
    0x8048db6: mov ebp, esp
    0x8048db8: add dword [ebp+0x8], 0x1
    0x8048dbc: mov eax, [ebp+0x8]
    0x8048dbf: pop ebp
    0x8048dc0: ret 
0x8048ded: mov ebx, eax
0x8048def: mov eax, [ebp+0x8]
0x8048df2: add eax, 0x2
0x8048df5: mov [esp+0x4], eax
0x8048df9: mov eax, [ebp+0x8]
0x8048dfc: mov [esp], eax
0x8048dff: call 0x8048dc1
conditional jump to 0x8048dc1
    disassembling 0x8048dc1
    0x8048dc1: push ebp
    0x8048dc2: mov ebp, esp
    0x8048dc4: sub esp, 0x4
    0x8048dc7: mov eax, [ebp+0xc]
    0x8048dca: mov [esp], eax
    0x8048dcd: call 0x8048db5
    conditional jump to 0x8048db5
        disassembling 0x8048db5
        0x8048db5: push ebp
        0x8048db6: mov ebp, esp
        0x8048db8: add dword [ebp+0x8], 0x1
        0x8048dbc: mov eax, [ebp+0x8]
        0x8048dbf: pop ebp
        0x8048dc0: ret 
    0x8048dd2: mov ecx, eax
    0x8048dd4: mov eax, [ebp+0x8]
    0x8048dd7: shl eax, cl
    0x8048dd9: leave 
    0x8048dda: ret 
0x8048e04: lea eax, [ebx+eax]
0x8048e07: add esp, 0x8
0x8048e0a: pop ebx
0x8048e0b: pop ebp
0x8048e0c: ret 
register use of 0x8048ddb:  eax ecx ebx esp ebp
Maybe I'm missing something obvious, but why is the static answer provided by the ABI's partition of the register set into call-saved/call-clobbered not good enough?
(In reply to comment #13)
> Maybe I'm missing something obvious, but why is the static answer provided by
> the ABI's partition of the register set into call-saved/call-clobbered not good
> enough?

Two reasons. One: lots of functions don't *use* all the registers they're allowed to clobber. We can be more aggressive if we know exactly what they use. Two: the static ABI is probably "plain x86" but the JIT will almost always dynamically find itself on an SSE2 platform. It currently assumes the worst -- that the callees will clobber all the SSE2 regs, which are huge -- even though it's likely they were compiled in a static mode that uses no SSE regs at all.

A static compile-time-generated indication that "no spidermonkey builtins use SSE regs or otherwise corrupt SSE state" would probably go a long way to removing the pain.
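That coarse indication would be cheap to compute. A hedged sketch (the regex and helper name are mine, not from the attached scripts):

```python
# Sketch of a compile-time "do any builtins touch SSE/x87 state?" check:
# scan disassembly text for XMM or x87 register references. If no
# builtin mentions them, the JIT could skip saving SSE registers around
# calls into builtins.

import re

SSE_OR_X87 = re.compile(r'%xmm\d+|%st\(?\d\)?')

def uses_sse_state(disasm_lines):
    return any(SSE_OR_X87.search(line) for line in disasm_lines)
```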
ARM has nice ELF tags for this sort of thing, I wonder if we could get something similar standardized for x86.

But maybe it would be easier to attack the problem from the other end: it should be easy to know at any given call site which of the call-clobbered registers are actually being used by trace code, and only those need to be saved, no?  In particular, surely most trace code *doesn't* use all of the SSE state?
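For illustration, the call-site view amounts to intersecting the trace's live set with a clobber set; the register names below model the usual IA-32 call-clobbered set plus XMM, and a per-callee mask from the analysis would shrink it further:

```python
# Sketch of the call-site approach: only registers that are both live
# in the trace at the call and potentially clobbered by the callee need
# saving. ABI_CLOBBERED models the IA-32 caller-saved set plus XMM.

ABI_CLOBBERED = frozenset(
    ['eax', 'ecx', 'edx'] + ['xmm%d' % i for i in range(8)])

def regs_to_save(live_at_call, callee_clobbers=ABI_CLOBBERED):
    # With per-helper analysis results, callee_clobbers can be narrowed
    # from the full ABI set to what the helper actually touches.
    return set(live_at_call) & set(callee_clobbers)
```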
(In reply to comment #15)

> But maybe it would be easier to attack the problem from the other end: it
> should be easy to know at any given call site which of the call-clobbered
> registers are actually being used by trace code, and only those need to be
> saved, no?  In particular, surely most trace code *doesn't* use all of the SSE
> state?

Why wouldn't it? They're very useful registers. In any case, we certainly only spill live registers already. We do not know exactly how expensive this problem is, because measuring it somewhat depends on the ability to construct working alternatives, but it's something Andreas said helped quite a bit in the hotpath work.

I believe this thread already reached a general stopping point though, which is "try re-coding many of the built-ins as LIR". It's just waiting on the redux patch to land so we get in-trace branching. I only posted the udis86 sample as a sort of "thing to try if we feel like revisiting".
Assignee: graydon → nobody
Component: Tracing Virtual Machine → JIT Compiler (NanoJIT)
QA Contact: tracing-vm → nanojit
Target Milestone: --- → Future
Status: ASSIGNED → NEW
Closing this; prototypes of embedding LIR for helper functions are underway, at least in Bug 514515.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX
Status: RESOLVED → VERIFIED