Closed Bug 536285 Opened 15 years ago Closed 14 years ago

Import Nitro assembler

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking


RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: jseward)

References

Details

Attachments

(1 file, 5 obsolete files)

For inline threading, we're going to take the Nitro assembler. Key things to do:

- import the files
- for each included wtf header, either import the header or replace with one of our own. wtf/Platform.h is probably the hard one.
- add the files to Makefile.in. Can probably pattern this after nanojit's.
I don't mean to sound defensive here, but am curious why we'd be bringing another assembler into the tree when we already have one in NJ. Is there a substantial difference in interface or capability?
Using NJ was also our first thought.  Our reasoning to do otherwise is as follows.

First, for the following reasons, LIR doesn't seem to be the ideal intermediate form for the simple assembly-splatting we aim to do:
 - the Nitro approach, which seems to work well, uses a global register allocation strategy and controls exactly how calls to functions from the generated assembly work.  Without specific hacks, it would be difficult to have this degree of control with LIR.
 - any simple assembly-splatter is going to be a lot faster than something like NJ (I think dmandelin measured this) and we are going to be assembling a lot more code than we currently do.  By running 10x more code through NJ, NJ hackers may be constrained from doing future optimizations which are affordable for tracing-JIT code but not for baseline-JIT code.

Basically, we felt that NJ works at a higher level of abstraction than we need and that lowering it sufficiently might conflict with other NJ goals and slow development.

This led to Brendan's idea of building a second interface into NJ's backend, used by both NJ and the new baseline compiler, so that we could avoid duplication.  There are problems with that as well:
 - there is a fundamental mismatch in that NJ's backend works backwards and we will be generating assembly from bytecode going forwards
 - our goal is to get up and running fast, and building such an interface means solving more problems

So, we decided to start from Nitro's assembler.  As an added bonus, their library interface seems well thought out and, FWIW, allows quite readable client code.
I'm down with this, so long as we look at sharing where we can, when we can afford to. It still seems like Nanojit could benefit along with total code size, if nothing else benefits. Bug habitat would go down with code size, too (I'm not saying where the bugs would lie -- just noting they'll be there).

We do not want to have too many duplicate abstractions or we will find the costs continue to accrue. Jorendorff noted the sunk cost fallacy elsewhere in the code. Ken Thompson talks in Coders at Work about software practically rotting if you don't keep turning it over. So we should not just glue together disparate source and let redundancy fester.

/be
The forward vs. backwards distinction is *really* important, I think that will make any kind of sharing really hard.  For example, it's actually impossible in NJ to iterate forwards over LIR code once it's written into a LirBuffer due to the current representation (which could be changed, but still).  And the register allocation and code generation, which are utterly intertwined, are also entirely based around traversing LIR backwards.
If NJ can't generate code fast enough, granted, that could be a problem. But that's the only real problem I'm seeing. The remainder derive from this assumed fact: the need for a secondary client interface and the rejection thereof based on the backwards-assembly pass and the fact that NJ performs (local) RA[1]. 

The existing interface to NJ is LirWriter, which works forwards. So if it worked fast enough, I wouldn't see the issue. Can we see those measurements?

[1]: Careful with terminology: Nitro's RA is not 'global' in the conventional sense of 'considering multiple basic blocks'. It's *fixed*. Registers are all pre-assigned to fixed roles, in the C++ source code for the assembler.
(In reply to comment #5)
> If NJ can't generate code fast enough, granted, that could be a problem. But
> that's the only real problem I'm seeing. The remainder derive from this assumed
> fact: the need for a secondary client interface

This was what I advocated to Luke, mentioned in comment 2 (4th para).

Rereading comment 2 I see phrases such as "Basically, we felt" -- always warning signs. This isn't Oprah, after all :-P. We should try to quantify.

It may be that that gut-feeling is dead-on, and I'm ok with cutting some measure-twice/hack-once corners, especially given a righteous codebase like JSC to steal from, provided we minimize gross duplication later. Even with backward NJ assembly, we could possibly share some low-level instruction formatting code, at first glance.

But Graydon's comment is making me reconsider my "down with this" reaction in comment 3, in spite of NJ's abstraction overhead and the purty JSC Assembler code. Measuring would be better, even with the intuitive costs of NJ and more obvious wins of JSC/Assembler.

> and the rejection thereof based
> on the backwards-assembly pass and the fact that NJ performs (local) RA[1]. 
> 
> The existing interface to NJ is LirWriter, which works forwards. So if it
> worked fast enough, I wouldn't see the issue. Can we see those measurements?

Did Sully's patch use LirWriter? If so we can measure if that patch is unrotted.

> [1]: Careful with terminology: Nitro's RA is not 'global' in the conventional
> sense of 'considering multiple basic blocks'. It's *fixed*. Registers are all
> pre-assigned to fixed roles, in the C++ source code for the assembler.

Fixed allocation is good enough for inline-threading IA32, possibly even for x64 too, given our stack machine or a pseudo-register VM. JSC uses the latter but I haven't checked to see if it wins.

Our stack usage is unoptimized, and our JS source-to-bytecode compiler does no value optimizations to speak of (flat closures and constant folding are two exceptions). It may be a wash between stack and register VMs for JS if we did the same analysis that JSC's front end does (again, something I've not had time to study). Anyone know more?

/be
(In reply to comment #6)

> It may be that that gut-feeling is dead-on, and I'm ok with some cutting some
> measure-twice/hack-once corners, especially given a righteous codebase like JSC
> to steal from, provided we minimize gross duplication later. Even with backward
> NJ assembly, we could possibly share some low-level instruction formatting
> code, at first glance.

Eh, *maybe*. Don't get me wrong, I don't hate Nitro's code. From a cursory glance, it looks a fair bit tidier and simpler than NJ. It's doing less, and I can completely believe it might run faster. Maybe even enough to matter. If the numbers say you should use it, I'd just go use it, don't bother trying to share bits of code with NJ. I'm just not sure if it's worth the cost of people-time -- which will exist -- to maintain 2 assemblers in our tree. 

> Fixed allocation is good enough for inline-threading IA32, possibly even for
> x64 too, given our stack machine or a pseudo-register VM. JSC uses the latter
> but I haven't check to see if it wins.

Yeah, fixed register allocation is not the end of the world for a quick and dirty code generator, lots of production systems do it and it works well enough. It's just something to be sure we're all understanding: nitro doesn't do "global RA" in the sense most compiler people mean. That's fine. If it did, it'd almost *certainly* be more expensive than NJ. Generally: fixed is cheaper than local, local is cheaper than global, and the generated code gets better as you go from cheaper to more expensive.

> Our stack usage is unoptimized, our JS source to bytecode compiler does no
> value optimizations to speak of (flat closures and constant folding are two
> exceptions). It may be a wash between stack and register VMs for JS if we did
> the same analysis that JSC's front end does (again, something I've not had time
> to study). Anyone know more?

If you're going to start overhauling the bytecode format, you've got bigger fish to fry. The interpreter loop is not the only bit of SM that's tightly designed around SM bytecode semantics.
(In reply to comment #6)
>
> Our stack usage is unoptimized, our JS source to bytecode compiler does no
> value optimizations to speak of (flat closures and constant folding are two
> exceptions). It may be a wash between stack and register VMs for JS if we did
> the same analysis that JSC's front end does

This ties in with bug 536630.
(In reply to comment #7)
> (In reply to comment #6)
> 
> > It may be that that gut-feeling is dead-on, and I'm ok with some cutting
> > some measure-twice/hack-once corners, especially given a righteous codebase
> > like JSC to steal from, provided we minimize gross duplication later. Even
> > with backward NJ assembly, we could possibly share some low-level
> > instruction formatting code, at first glance.

I would not be opposed to making NJ do forward assembly someday.

> Eh, *maybe*. Don't get me wrong, I don't hate Nitro's code. From a cursory
> glance, it looks a fair bit tidier and simpler than NJ. It's doing less, and
> I can completely believe it might run faster. 

When I did the regexp compiler, I compared our compilation time with WebKit's for the same regexes. At that time, I measured ours as 8x longer. I don't know if that's the same now, or if it holds up on general code. I do also think that Sully had a longer compilation time than I did with my prototype, but I don't know the factor.

But to me compilation time is not the main issue.

> Maybe even enough to matter. If the
> numbers say you should use it, I'd just go use it, don't bother trying to
> share bits of code with NJ. I'm just not sure if it's worth the cost of
> people-time -- which will exist -- to maintain 2 assemblers in our tree. 
> 
> > Fixed allocation is good enough for inline-threading IA32, possibly even
> > for x64 too, given our stack machine or a pseudo-register VM. JSC uses the
> > latter but I haven't check to see if it wins.
> 
> Yeah, fixed register allocation is not the end of the world for a quick and
> dirty code generator, lots of production systems do it and it works well
> enough. It's just something to be sure we're all understanding: nitro doesn't
> do "global RA" in the sense most compiler people mean. That's fine. If it
> did, it'd almost *certainly* be more expensive than NJ. Generally: fixed is
> cheaper than local, local is cheaper than global, and the generated code gets
> better as you go from cheaper to more expensive.

Hopefully the terminology is cleared up now. I think when Luke said "global register allocation strategy", he meant that registers are selected according to a simple pattern that is used everywhere and was not in fact referring to "traditional" register allocation at all.

Anyway, I think the real point here is that if you do use a classic greedy register allocator like nanojit's, you will get various bogus spills and useless reg->reg moves. Our argument is that if we instead use carefully tuned problem-specific register selection patterns, we will be able to generate better code. Working through an example would be a good test of this hypothesis.

Another point is that in JaegerMonkey, we will probably not be able to realize the chief advantage of nanojit's register allocator, which is that it can potentially keep more things in registers across many bytecodes. The reason is that we have arbitrary control flow, which is confusing to nanojit's allocator. It also requires the use of LIR_live instructions, which are confusing and bad. 
To avoid those problems, Sully ended up storing everything back after each instruction, so you don't really get any reg alloc benefits.

The other big reason we want to target an assembler instead of LIR is that there are some important optimizations that LIR doesn't know how to do. In particular, stack layout and function calls need careful optimization that is not currently implemented for LIR.

WebKit does "stub function" (functions that implement a bytecode case) calls very efficiently. They reserve about 16 words on the native stack. The "top" ones (higher address) contain their equivalents of cx, script, etc. They are set before the first call and never changed after that. The "bottom" (lower address) are reserved for the value arguments. To make a 2-argument call (say to a slow path of JSOP_ADD), you can then just do:

    mov  [esp],   arg1
    mov  [esp+4], arg2
    call &foo

I would think NJ would need significant modification to be able to reserve stack space, keep certain values unchanged but know they are available for calls, avoid stack cleanup after the call, etc.

There are likely to be other such optimizations that require detailed control of the assembly, e.g., generating slow path code out of the main line of straight code.
Shorter faster summary of comment 9:

- NJ was designed as a trace compiler and specialized for that purpose. It incorporates a tightly integrated assembler designed to support that purpose.

- In JM, we will be doing inline-threading compilation of general CFGs. We need a compiler that's strong with CFGs and can conveniently express inline-threading-specific optimizations.
Understood, and all ... plausible arguments. I certainly didn't mean to imply NJ has a *good* RA, or that its RA was anything you'd be getting much advantage from here. Possibly as you say the number of bogus moves you avoid with a fixed mapping will make up for the number you are forced to generate to work around the fixed mapping.

However, your contention that LIR_live is too "confusing and bad" to use seems to me exaggerated, considering that Tamarin uses NJ as a full-CFG, method-at-a-time target. Similarly the notion that NJ can't be changed a bit to accommodate pre-allocating outgoing function arguments in particular stack slots seems unwarranted. Has anyone tried? It doesn't seem like it'd be impossible. Some kind of hint to the stack-slot allocator? Likewise, as far as I know the assembler doesn't really *rely* on being in control of RA either; you can probably push a fixed allocation through with only a bit of an enrichment of the LirWriter interface.

Moreover, it seems likely to me that if you stick to NJ you'll be able to share some code (or at least copy-and-modify ability) with the bytecode-to-LIR semantic recording methods in the tracer. Move to a new assembler and you'll be writing more code from scratch.

I guess I just see you taking as read the notion that "importing" is the end of the costs, both opportunity and future. You'll have multi-port testing and merging burden if you do this, and picking bugs out of Yet More Code. It'll cost a lot of person-hours over the lifetime of the project and future releases. Even having NJ in-tree costs a fair bit. Who is going to do the merging? Are you going to send stuff back to Apple or remain forked? 

Sorry to be belaboring this point, maybe I'm just emotionally attached to sunk costs like the investment in NJ. Feels like a hasty decision from here.
(from phone, sorry for being terse)

(In reply to comment #5)
> If NJ can't generate code fast enough, granted, that could be a problem. But
> that's the only real problem I'm seeing. The remainder derive from this assumed
> fact: the need for a secondary client interface and the rejection thereof based
> on the backwards-assembly pass and the fact that NJ performs (local) RA[1]. 

The issues described in para 2 in comment 2 do not assume that. 

Sorry for the confusion with "global"; did not mean intraprocedural.

(In reply to comment #11)
> However, your contention that LIR_live is too "confusing and bad" to use seems
> to me exaggerated, considering that Tamarin uses NJ as a full-CFG,
> method-at-a-time target. 

I used LIR_live in the regexp compiler; I found it quite tricky to use correctly.  Leads to very hard to find RA bugs.

> Moreover, it seems likely to me that if you stick to NJ you'll be able to share
> some code (or at least copy-and-modify ability) with the bytecode-to-LIR
> semantic recording methods in the tracer. Move to a new assembler and you'll be
> writing more code from scratch.

I really don't see much code reuse between the tracing and baseline JITs, even if the latter used LIR. 
 
> You'll have multi-port testing and
> merging burden if you do this, and picking bugs out of Yet More Code. It'll
> cost a lot of person-hours over the lifetime of the project and future
> releases. Even having NJ in-tree costs a fair bit. Who is going to do the
> merging? Are you going to send stuff back to Apple or remain forked?

Perhaps you overestimate the size and complexity of Nitro's assembler.  Unlike NJ, it does little beyond pasting opcodes; half the value is the API and how it lets clients write arch-independent code while still allowing arch-specific codegen as needed in the JIT compiler.  No need to build a generic "hinting" mechanism to indirectly achieve the desired assembly code.  NJ and the assembler needed by the baseline JIT are really very different animals.

Good-ness, iPhone text editing has a long way to go.
Leaving aside the question of whether or not this is a good idea,
here's a status update.

I copied a minimal set of files from
WebKit/JavaScriptCore/{assembler,jit,wtf}, made a Makefile, and
wrote a test driver to exercise/learn about important features
of the assembler.  I have 5 test cases working:

  (1) straight line code

  (2) a loop

  (3) an if-then-else

  (4) the above surrounded by a simple function prologue/epilogue, so
      it can be called directly from C++ land

  (5) straight line code containing a call to C++ land, where we run
      it once, then change (repatch) the call instruction to a
      different fn, and run it again

These work on all 3 supported platforms: x86, x86_64 and arm.

Cases (1)-(3) can be generated using the convenience functions in
MacroAssembler.h, and are therefore arch independent.  (4) and (5)
require arch-specific calling conventions and function
prologue/epilogue construction, and so cannot be done using
the convenience functions.

Overall the assembler seems pretty clean and kludge-free.

The test driver is attached.  It shows the generated code for each
example, for all 3 archs.
(In reply to comment #13)
[snip]
>   (5) straight line code containing a call to C++ land, where we run
>       it once, then change (repatch) the call instruction to a
>       different fn, and run it again
> 
> These work on all 3 supported platforms: x86, x86_64 and arm.

Cool! It sounds like you already have most of what we need for the v1 baseline JIT. Have you tried yet to build it inside our tree? I'm hoping we can have that ready by the end of January, which hopefully is when Luke will have the new stack format ready.
(In reply to comment #15)

> JIT. Have you tried yet to build it inside our tree?

Yes.  I got it building successfully inside the tree (in a 
directory js/src/assembler) yesterday, and checked that it builds on all 3 archs.
The patch is big but pretty clean; it's just a bunch of new files
and some minor mods to js/src/Makefile.in.  (w/ much help from
Ted and bsmedberg).

I am currently trying to saw off the LGPLd code and either
get rid of it entirely or replace it.  So far I managed to 
get rid of FastMalloc/FastAlloc, and currently I'm trying to
remove the Vector implementation and hook up to js/src/jsvector.h
instead.  It's a bit tricky.  Once that's done there's a bunch
of smart pointer templatery to deal with, and then it's done.
(In reply to comment #16)
> I am currently trying to saw off the LGPLd code [...]

Got rid of Vector (using jsvector now) and some other headers.
Am now down to 4 LGPLd files, all to do with pointer magic afaics:

wtf/RefCounted.h
   used by jit/ExecutableAllocator.h

wtf/PassRefPtr.h
   used by assembler/MacroAssemblerCodeRef.h

wtf/Noncopyable.h
   used by wtf/OwnPtr.h
           wtf/RefCounted.h
           wtf/Vector.h
           assembler/AbstractMacroAssembler.h
           assembler/RepatchBuffer.h
           assembler/LinkBuffer.h

wtf/RefPtr.h
   used by jit/ExecutableAllocator.h
           assembler/MacroAssemblerCodeRef.h
Julian: Any chance you could post the surrounding files that lets one use the test driver? I'm interested in playing around with it for benchmarking.
(In reply to comment #18)
> Julian: Any chance you could post the surrounding files that lets one use the
> test driver? I'm interested in playing around with it for benchmarking.

Attached in Comment #19.  It is also LGPL-free now.  In the top
directory, do "make" and then run ./Main.  Builds and runs
Memcheck-clean for me on all 3 supported archs.  Will (re-)integrate
into TM build system and post a proper patch soon.
(In reply to comment #20)
> In the top directory, do "make" and then run ./Main.

Duh.  I forgot to say, in the Makefile, you need to set MFLAG thusly:

  to build on x86_64-linux,  set MFLAG = -m64
  to build on x86-linux,     set MFLAG = -m32
  to build on arm-linux,     set MFLAG =    # (empty)
Attached patch patch against TM tree (obsolete) — Splinter Review
This is the initial import patch.  It builds on x86-linux, x86_64-linux,
arm-linux and x86-darwin.  The imported code is placed in js/src/assembler,
and there are some minor changes to js/src/Makefile.in to get it to build.
It will be built/linked into jsshell, as required.  Also, "make TestMain"
builds the temporary test driver program as a standalone test/experimentation
framework.

The imported files are:

   assembler/TestMain.cpp       # tmp test driver, not imported from wk
   assembler/assembler/ARMAssembler.cpp
   assembler/assembler/ARMAssembler.h
   assembler/assembler/ARMv7Assembler.h
   assembler/assembler/AbstractMacroAssembler.h
   assembler/assembler/AssemblerBuffer.h
   assembler/assembler/AssemblerBufferWithConstantPool.h
   assembler/assembler/CodeLocation.h
   assembler/assembler/LinkBuffer.h
   assembler/assembler/MacroAssembler.h
   assembler/assembler/MacroAssemblerARM.cpp
   assembler/assembler/MacroAssemblerARM.h
   assembler/assembler/MacroAssemblerARMv7.h
   assembler/assembler/MacroAssemblerCodeRef.h
   assembler/assembler/MacroAssemblerX86.h
   assembler/assembler/MacroAssemblerX86Common.h
   assembler/assembler/MacroAssemblerX86_64.h
   assembler/assembler/RepatchBuffer.h
   assembler/assembler/X86Assembler.h
   assembler/jit/ExecutableAllocator.cpp
   assembler/jit/ExecutableAllocator.h
   assembler/jit/ExecutableAllocatorPosix.cpp
   assembler/moco/MocoStubs.h
   assembler/wtf/Assertions.cpp
   assembler/wtf/Assertions.h
   assembler/wtf/Platform.h
   assembler/wtf/SegmentedVector.h

All the smart-pointer stuff has been removed, and jsvector.h is used
instead of the webkit Vector.h.

One minor problem is that compilation produces hundreds of warnings
from the C preprocessor, mostly of the form

  warning: this use of "defined" may not be portable

because js is compiled with -pedantic.  So some further cleanup is required.
Attachment #423440 - Attachment is obsolete: true
Attachment #424230 - Flags: superreview?(dsicore)
Attachment #424230 - Flags: review?(dmandelin)
It doesn't build on Windows. I'm still working on that. The first problem is that 

  #if COMPILER(MSVC)

apparently doesn't work on Windows the way they want it to. The second is that if I fix that I get mysterious errors.
I got it to build on Windows, except TestMain.cpp, which uses some GCC inline assembly that I did not want to fix up right now. I'm not sure we need to, although I suppose we should turn off building it in the Makefile in that case.

The main problem was with the configuration macros defined in Platform.h. They do it like this:

  #define COMPILER(WTF_FEATURE) (defined(WTF_COMPILER##WTF_FEATURE) && \
                                 WTF_COMPILER##WTF_FEATURE)

AFAICT, this is bogus, because the behavior of using |defined| in the replacement text of a #define is undefined per the C and C++ standards. GCC happens to make this work the way they want, but MSVC doesn't. When MSVC sees:

  #if COMPILER(MSVC)

it seems to expand the RHS to 

  (defined(1) && 1)

and considers this to be false. I think only the part after the && is really required, anyway. I removed it and that makes these macros work on Windows.

The other problem is that they include C99's |stdint.h|, which of course MSVC doesn't provide. WebKit includes a copyright-free version of that file in its Windows support libs, so I added that to the tree and Makefile.in.

After that, it all seems to build, except for a few things in TestMain.cpp. I fixed a few, but the others require replacement of some GCC inline assembly, as noted before.
Ah, there has been some confusion and some small duplication of work.
I didn't realise you'd managed to actually get it to build on Windows.

Anyway, attached is a revised patch which builds and runs the TestMain
tests on {x86,amd64,arm}-linux, x86-win32 and x86-darwin.  For x86-win32,
it just runs the two TestMain tests which don't require inline assembly,
but in any case those are the most interesting tests.  I think this is
suitable now for reviewing.
Attachment #421152 - Attachment is obsolete: true
Attachment #422817 - Attachment is obsolete: true
Attachment #424230 - Attachment is obsolete: true
Attachment #424344 - Attachment is obsolete: true
Attachment #424230 - Flags: superreview?(dsicore)
Attachment #424230 - Flags: review?(dmandelin)
Comment on attachment 425018 [details] [diff] [review]
patch that builds on {x86,amd64,arm}-linux, x86-win32 and x86-darwin

I looked into the "kludges" comment. INCLUDES gets set by config/config.mk, which is included in the middle of Makefile.in. I also noticed that all the modifications to INCLUDES in Makefile.in are already at that point. So I think it's fine the way it is, but you can move it back or modify the comment if you want.
Attachment #425018 - Flags: review+
Attachment #425018 - Flags: superreview?(dsicore)
Comment on attachment 425018 [details] [diff] [review]
patch that builds on {x86,amd64,arm}-linux, x86-win32 and x86-darwin

Do we need to add an entry in about:license for these files?  Seems we do.  Reviewed with legal.  I think the about:license issue is our only concern.  If we do need to add a reference, Sayre, can you file a bug, please?

sr+=damons
Attachment #425018 - Flags: superreview?(dsicore) → superreview+
(In reply to comment #28)
> (From update of attachment 425018 [details] [diff] [review])
> Do we need to add an entry in about:license for these files?  Seems we do. 
> Reviewed with legal.  I think the about:license issue is our only concern.  If
> we do need to add a reference, Sayre, can you file a bug, please?

Damon, unclear what you intend.  Are we ok to land as-is and file a 
followup bug for the about:license thing, or should that be fixed
before landing?
(In reply to comment #28)
> (From update of attachment 425018 [details] [diff] [review])
> Do we need to add an entry in about:license for these files?  Seems we do. 
> Reviewed with legal.  I think the about:license issue is our only concern.  If
> we do need to add a reference, Sayre, can you file a bug, please?

Yep, needs to be added to about:license. Please CC gerv and me on such a bug and add proper blocks/depends linkage.

> sr+=damons

I'm confused as to why you're granting super-review here when you're not a super-reviewer.
bug 544755 is the about:license update bug. it must be fixed before this lands, so we don't distribute nightly builds that violate Nitro's license.
Depends on: 544755
Bug 544755 has landed in TM, so this can land now.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED