Closed Bug 499976 Opened 15 years ago Closed 15 years ago

Clean up nanojit to use fixed allocation

Categories

(Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect)


VERIFIED FIXED

People

(Reporter: edwsmith, Assigned: edwsmith)

References

Details

Attachments

(3 obsolete files)

Placeholder for discussion, dependent bugs, and ultimately, patches.

Design goals:

* nanojit should use vm-provided handle for allocation needs
* and not use MMgc::GC, nor even reference MMgc::GC

open questions:

* null check, or assume allocations never fail (and let the allocator longjmp or throw)?  Discussion welcome here.  In a discussion with graydon and jorendorf on #jsapi we concluded the longjmp approach is plausible enough for TM to start prototyping.  It's already okay for Tamarin.

* if the allocator longjmps out, is it okay to leak some memory?  (Specifically, for some allocations returned by nanojit::Allocator not to be freed back to nanojit::Allocator.)  If not, then all allocations must be reachable at the point cleanup occurs, pushing more allocations to the heap so they can be rooted.
Depends on: 499980
nanojit would be a perfect playground to try out setjmp/longjmp-style OOM handling. It's fairly small and allocates memory in a well understood way. I would try not to leak any memory if we use longjmps when we hit the soft ceiling (i.e. max code cache size). If it's only used for the hard OOM case, then I think we can leak. The embedding is going to abort or crash or do something bad anyway.
If the cleanup action is to run all the destructors you *can* reach and then free all the memory owned by nanojit::Allocator (assuming it maintains its own allocation list, or works out of a pool), I don't see that we'd be "leaking" at all.

IMO the question is not so much whether we can tolerate the use of a pool, but rather whether we can tolerate certain sets of destructors never running. If all the destructors in question are doing is cleaning up more stuff in the same pool, I don't see a "leak" that anyone outside the cleanup code would be able to detect.
(In reply to comment #1)
> nanojit would be a perfect playground to try out setjmp/longjmp-style OOM
> handling. Its fairly small and allocates memory in a well understood way. I
> would try not to leak any memory if we use longjmps when we hit the soft
> ceiling (i.e. max code cache size).

Adobe has experience with ugly-macro TRY/CATCH syntactic salt for setjmp and longjmp.

The problem is not getting control flow back to some top level, that is stupidly easy. The problem is unwind-protect. This should *not* be done by hand, it needs static analysis.

But instead of setjmp/longjmp hacking, why not try C++ exceptions and see how bad the RTTI hit is?

> If its only used for the hard OOM case,
> then I think we can leak.

> The embedding is going to abort or crash or do
> something bad anyway.

This is manifestly false. See the newsgroup thread, the javascript: URL.

/be
#3: the newsgroup post is about hitting a quote limit. That's fine and we should be able to cleanly do that. If we ever really hit OOM, the world _is_ going to crash. Been there. Tried it. FF behaves very badly when malloc fails.
quota limit, not quote limit. spelling still on the plane
(In reply to comment #3)
> ugly-macro TRY/CATCH syntactic salt

nicely put.

> But instead of setjmp/longjmp hacking, why not try C++ exceptions and see how
> bad the RTTI hit is?

We'll need a fallback that doesn't require C++ exceptions, as many configurations of FlashPlayer build with C++ exceptions disabled.
(In reply to comment #3)

> Adobe has experience with ugly-macro TRY/CATCH syntactic salt for setjmp and
> longjmp.

As does SEH, as does Symbian. Probably more software has shipped this way than with DWARF2 exceptions. It's not like we're talking about the kookiest untested behavior in the world.

> The problem is not getting control flow back to some top level, that is
> stupidly easy. The problem is unwind-protect. This should *not* be done by
> hand, it needs static analysis.

Possibly. Note that we're *only* talking about failing to run dtors for things allocated via nanojit::Allocate. And currently, if you read the dtors in nanojit, all any of them do is free memory. Flushing a pool is fine. We just need to be aware that dtors are not guaranteed to run in js/src/nanojit, particularly for things like stack objects, so they can't do any nontrivial correctness things. There are worse constraints in life. If it'll make you feel better we can statically enforce "no dtors exist in nanojit/, period"?

> But instead of setjmp/longjmp hacking, why not try C++ exceptions and see how
> bad the RTTI hit is?

It's not the RTTI cost so much as the implementation of exceptions on MSVC that's the cost center. They push and pop cleanup-list entries on frame boundaries when there are local dtors (a la Symbian) and use sj/lj themselves. See the recent platform thread on 'heretic thoughts'.

> This is manifestly false. See the newsgroup thread, the javascript: URL.

Agreed. But we can get control and flush the pool, we're back in business. I think this will work.
I agree it's tractable that nanojit's classes be written to be safe when their destructors don't run.  It's a small enough codebase, and the static-analyzer wizards out there make me think so.

if we get in the habit (in nanojit) of using stack-allocated containers, which internally do nanojit::Allocator allocations, then it just means allocator.free() won't always be called for every allocator.alloc().

that just means:

a) the allocator keeps its own list, or
b) so what, we leak a bit just before we shut everything down

choice is up to the vm.

alternatively, we could avoid getting in that habit, and carefully ensure all allocations are rooted. Then no list is required, and no leaks happen.  This discipline changes the shape of the nanojit code, though.

(at what point are we just going to bite the bullet and start using
bump-pointer arena allocators?)
(In reply to comment #4)
> #3: the newsgroup post is about hitting a quote limit. Thats fine and we should
> be able ot cleanly do that. If we ever really hit OOM, the world _is_ going to
> crash. Been there. Tried it. FF behaves very badly when malloc fails.

This is false, and shows misunderstanding of the system. Again, try this:

javascript:var s=Array(1e6).join();var a=[];for(var i=0;i!=2000;++i)a.push(s);a.toString();

in your Firefox address bar. There is no "quota" being checked in Array or push or toString/join. Nor should there be, since that simply reinvents OOM handling with less precision.

/be
(In reply to comment #8)

> (at what point are we just going to bite the bullet and start using
> bump-pointer arena allocators?)

I would say "immediately"; my hope is to carve out an N-megabyte region for the JIT, allocate linearly, and recover from OOM-that-longjmps *or* early-detected quota exhaustion by flushing it.

That is, after all, the *assumption* that tracemonkey was designed around. It's just not the *reality* at the moment. My only hope with the conclusion of this bug is that we can actually align reality with the assumption.
I know we survive that particular case. I did test it. I also remember debugging to death an OOM crash and whipping up a patch that handles OOM in the tracer data structure (Queue), only to realize when I tested it that right after coming out of the JS engine FF falls on its face with an unchecked malloc.

I think the engine should never cause OOM. I see our interior malloc blocks as a bug, not a feature. They should come from the JS heap, and if you use too much of it you should run into a quota of some sort instead of causing OOM and likely affecting the embedding around you.
(In reply to comment #7)
> We just
> need to be aware that dtors are not guaranteed to run in js/src/nanojit,
> particularly for things like stack objects, so they can't do any nontrivial
> correctness things. There are worse constraints in life. If it'll make you feel
> better we can statically enforce "no dtors exist in nanojit/, period"?

Maybe. My feelings aren't the issue, though.

Gather round, _mes enfants_, for my experience-over-feelings-based storytime.

I have tried just-so arena allocation schemes over the years; SpiderMonkey's compiler still uses one. They are attractive at first, but over time they tend to grow lifetime-extension and custom-cleanup hair.

GCC's objstacks eventually gave way to Boehm GC, arguably overkill (Linus thought so). Even if GCC went too far, the general trajectory has been recapitulated by many codebase-projectiles over time.

> See the recent platform thread on 'heretic thoughts'.

Thanks for the tip, I will (wish groups links were shorter).

> > This is manifestly false. See the newsgroup thread, the javascript: URL.
> 
> Agreed. But we can get control and flush the pool, we're back in business. I
> think this will work.

Ok. It's up to you guys. Nanojit is small and allegedly simple enough that you should be able to pull it off and hold the line. My cautionary tale stands, tho.

/be
(In reply to comment #11)
> I know we survive that particular case. I did test it. I also remember
> debugging to death an OOM crash and whipping up a patch that handles OOM in the
> tracer data structure (Queue), only to realize when I tested it that right
> after coming out of the JS engine FF falls on its face with an unchecked
> malloc.

That is a bug in Firefox, or Gecko, but bugs do not justify more bugs. Embedders, including Gecko in spite of its other OOM handling bugs, benefit from SM's OOM checks.

> I think the engine should never cause OOM. I see our interior malloc blocks as
> a bug, not a feature. They should come from the JS heap, and if you use too
> much of it you should run into a quota of some sort instead of causing OOM and
> likely affecting the embedding around you.

How is this a more round wheel that you've just invented, caveman Andreas? Your "JS heap" fire frightens me, but even I, caveman Brendan, understand error-checking wheels and how they roll. :-P

/be
I question fixed quota judgments, or even ones based on a fraction of physical memory. Why shouldn't the dynamic range of the Web come close to using all of my system's usable VM?

See bug 479090 comment 9 for predictable badness from costly quota hitting. Yeah, the code cache design usually works, but making the only negative feedback loop dump everything makes a bang-bang control system.

Quota setting is bad business, most users won't do it, a few will do it badly, and we won't get the defaults right.

/be
(In reply to comment #12)

> I have tried just-so arena allocation schemes over the years; SpiderMonkey's
> compiler still uses one. They are attractive at first, but over time they tend
> to grow lifetime-extension and custom-cleanup hair.

Valid historical point. We may indeed feel some future pressure to keep JIT fragments alive longer than the current scheme. But even if we need to shift to separate code-pool vs. compilation-pipeline-pool (which I believe Ed has already done), and generationalize the code pool so old hot code doesn't keep being rebuilt, I think it can all still fit in this framework.

Keep in mind that the JIT has a lot of leeway that GCC and similar cases didn't have: so long as we have a strategy for cleaning ourselves up, "destroying the world and returning to the interpreter" is a valid corner-case strategy. It's what we do in a lot of cases already. We're sort of working in a system that's optional and transient by design.
(In reply to comment #14)
> I question fixed quota judgments, or even ones based on a fraction of physical
> memory. Why shouldn't the dynamic range of the Web come close to using all of
> my system's usable VM?

Automatically resizing the arena on flush is harmless and easy; you're mixing issues here. The issue is whether arena-alloc + flush is a valid *lifecycle* strategy, and for allocations we're always allowed to throw out w/o *semantic* penalty, I believe it is. We can balloon the heap up to use lots of physical memory if need be.

As I said above, I believe if we went so far as to have separate pools for the compilation-pipeline allocations and the fragment metadata / code pages, we'd be much less flush-happy. Add to this a simple generation-shifting scheme between N code arenas and you've invented copying GC without having to copy. You just *regenerate* your artifact in the colder generation.

I really think that being transient and optional gives us a lot of lifetime-policy options you won't see by comparing these apples to the oranges of normal memory-management.
<tangent>

if we're going to provide a nice fast bump-allocator to nanojit, (for lir, maybe, plus tables, maps, lists, whatever structures it needs), then it would be great to be able to count on that in the nanojit code.

a fast bump-allocating arena allocator would not want to even have a free() API, hence we shouldn't try to make nanojit agnostic about this.  It either knows it's using a bump allocator and never calls free, or knows it's calling a non-bump allocator and calls free when it can. (It's good to free stack-lifetime things on normal exit, even if you have a flushing pool.)

</tangent>

what I mostly care about in this thread is what assumptions we have to bake
into the nanojit code itself.  Without getting into arena allocators, so far I think we have:

* nj allocations will always go through a handle (no global new/delete or malloc/free).

* the handle is a vm provided allocator.

* the vm-provided allocator never returns null to nanojit code.

* okay to have auto scope structs in nanojit that do allocations via our handle, even if their destructors never run due to a longjmp.  (if VM requires absolute OOM recovery, then VM-provided allocator must track allocations on its own).

and finally, for good measure:

* code is managed separately from data, with a dedicated code allocator.  (tamarin has this already, see nanojit::CodeAlloc in tamarin-redux).  code-allocator also should be factored to use vm-provided SPI which can map to mmap or something similar.  page protections and i-cache flushing are concerns.

* the code allocator also never returns null, it longjmps as needed, like the data allocators.
Is it considered feasible/desirable to impose a maximum limit on the size
of fragment that the JIT can handle?  This could be used as a way to 
bound the max memory that any specific JIT invocation would use, without
significant performance loss, since:

* almost all fragments would be smaller than the limit, and hence unaffected,
  and

* for those fragments larger than the limit, there would be a small
  performance overhead due to trace-to-trace transition (not sure if this
  is correct terminology) at the point of truncation.  But because the 
  threshold is set quite high, that cost is amortized over a relatively
  large amount of useful work.

It's what we do in Valgrind, and it makes memory management in the JIT
relatively simple.

(In reply to comment #16)

Or have N code arenas (say, 8) and use them in FIFO style.  It's less
drastic than trashing the entire code cache when it gets full -- it's at
least a step towards LRU management of the code cache, without the 
overhead or complexity of real LRU.
(In reply to comment #16)
> (In reply to comment #14)
> > I question fixed quota judgments, or even ones based on a fraction of physical
> > memory. Why shouldn't the dynamic range of the Web come close to using all of
> > my system's usable VM?
> 
> Automatically resizing the arena on flush is harmless and easy; you're mixing
> issues here.

I object that it was not I who mixed issues, but our code, as that bug 479090 demonstrates.

Hindsight? Not entirely: we have an optional-and-transient JIT, which has its benefits, but it can underperform non-linearly. Interpreter performance, even if it is up to v8 or nitro par, can be hurt by the compensation costs. We've had blacklisting bugs. And monitoring and compilation costs count too.

We need a better interpreter/JIT cost balance, Andreas has talked about this; Mike Pall too. This is a big problem, a meta-bug, it needs many fixes including interpreter and GC -- but arguably also code cache policy.

> The issue is whether arena-alloc + flush is a valid *lifecycle*
> strategy, and for allocations we're always allowed to throw out w/o *semantic*
> penalty, I believe it is. We can balloon the heap up to use lots of physical
> memory if need be.

I agree, but (correct me if I'm wrong) right now we have a 16MB limit (32MB didn't stick, right?) for both short and long lived stuff. I don't mean to harp on this if it's just "bugs to fix", and you are right that a just-so LIFO or all-or-nothing approach can be good for the compilation-transient allocations.

I'm really objecting to quotas for code cache, object allocation, string allocation, etc. This is an argument I'll take back to the newsgroup; I hope others will comment there too.

> I really think that being transient and optional gives us a lot of
> lifetime-policy options you won't see by comparing these apples to the oranges
> of normal memory-management.

Agreed, in principle. The control theory here seems weak, though. Sure, we can dump any amount of memory on some "OOM" signal, and recompute (interpret and retrace). But this can cost too much if it happens too often. We benefit by a simpler nanojit, but what if the user experience sucks too often, especially on benchmarks we haven't seen yet?

We have no time component in the control system, AFAIK. For other web-driven caches we use LRU-SP or some naive LRU replacement, and a generous quota based on physical or even virtual memory, so time is not an issue. For recent textrun glyph caching work, we have added time-based caching. Cached textruns go free after a while and under memory pressure (I may not have that last bit right).

I'm probably off in the weeds now, talking about long-lived code cache and not anything this bug would change. Wanted to get the time component thought out, in case it's helpful.

/be
This is all great stuff, but I started this bug to talk about an API between nanojit and the VM for *data* (lir, structures, etc), not *code*.

Both are important, of course.  And we've made progress in TR on the code allocator; we can evict individual fragments (lists of code blocks) now, and the code is clean enough to insert a nice clean wrapper around mmap/virtualprotect/whatever.

(Not to mention, it's easy to have N CodeAlloc instances, if that's desirable.)
(In reply to comment #20)
> this is all great stuff, but i started this bug to talk about an api between
> nanojit and the VM for *data* (lir, structures, etc).  not *code*.

Agreed. I think as both Julian and Brendan point out we'll gradually mutate the code cache towards something more like LRU or generations or ... something incorporating better notions of time. It's a bit subtle due to the inter-fragment jump dependencies and whatnot. But you're right it's a bit off topic.

The only sense in which it's *on* topic is really that the allocator-API choices encode lifetime (and recovery-strategy) assumptions, and we'll have to be careful if those are changing, that they change to match existing (or likewise-adjusted) assumptions elsewhere. Case in point: guard records and exits probably need to be changed to be allocated *outside* any "compilation objects" arena, if that arena is going to be flushed more often than the code cache. They have to live as long as the fragment-and-code, in order to permit knitting fragments together. They'll want to live in this "CodeAlloc" thingy.

It might help here to sketch out an actual design proposal :)
prototype: 
http://hg.mozilla.org/users/edwsmith_adobe.com/verifymem/rev/63689270997c

design sketch:

* nanojit::Allocator is a bump-pointer arena with a simple alloc/free SPI
  implemented by the host VM for getting real memory.

   class Allocator {
      Allocator()
      ~Allocator()          // for each chunk: freeChunk()
      void* alloc(size_t)   // fast: bump pointer; slow: allocChunk()
   private:
      void* allocChunk(size_t) // host-provided SPI
      void freeChunk(void*)    // host-provided SPI
   }

* Allocator.alloc never returns NULL.  The vm-provided allocChunk() SPI is
  expected to longjmp or throw upon OOM
* ~Allocator() frees all chunks via the freeChunk() SPI

prototype adds util.h/cpp with Allocator, BitSet, Seq<>, SeqBuilder<>, HashMap<>, and TreeMap<> implementations; all nanojit code uses these structures, and they all allocate from Allocator.  (TreeMap is needed for LabelMap, which is verbose-only, so it's a generic binary tree.)

* Assembler, live(), etc take an Allocator parameter from the host VM.  In
  Tamarin we have two main Allocators: one (alloc1) whose lifetime is during
  LIR generation, and the other (alloc2) whose lifetime is through the
  end of native code generation.  (I imagine in TM there's a 3rd: alloc3,
  whose lifetime matches the lifetime of a native code fragment).

* LIR and LirBuffer are allocated using alloc2.  LirWriters and their
  helpers are allocated using alloc1.  This is all under VM control;
  nanojit doesn't specify.

* the newly added container classes mostly grow monotonically, and
  they know they're using a bump arena.  (avmplus::List<> and SortedMap<> are
  not arena friendly because they realloc a lot).

* the only destructors that *must* run, to avoid leaks, are ~Allocator() (data)
  and ~CodeAlloc() (code).

* all the explicit OOM checks, deletes, and frees can be removed for any code
  that knows it's using Allocator
Re comment 22: sounds simple, which I like!
Attachment #388263 - Flags: review?(rreitmai)
Attachment #388263 - Flags: review?(graydon)
This patch also removes other unused error enums, including the outOMem enum.
Assignee: nobody → edwsmith
Attachment #388277 - Flags: review?(rreitmai)
Attachment #388277 - Flags: review?(graydon)
Attachment #388263 - Flags: review?(rreitmai) → review+
Comment on attachment 388277 [details] [diff] [review]
remove dead outOMem code since Allocator.alloc doesn't return NULL

Are we no longer using some of those AssmError values?
Attachment #388277 - Flags: review?(rreitmai) → review+
(In reply to comment #26)
> (From update of attachment 388277 [details] [diff] [review])
> Are we no longer using some of those AssmError values?

the patch removes use of outOMem, the others were already unused in TR and TM.
Comment on attachment 388263 [details] [diff] [review]
introduce class Allocator, use it in LirBuffer to replace segmentList

Mostly fine. A few minor points. Fix whatever of them sound agreeable.

- You'll need to adapt a couple bits of this to the changes to use byte arithmetic, of course. Collide with njn's patch here in whichever order you and he prefer :)

- I know I said I didn't care earlier today, but now that I see it and think about it, util.{h,cpp} is sufficiently against convention that I think Allocator should really go in its own file. 

- I find the use of an anonymous union declaring locals in Allocator::fill ... drifting back towards encoding address arithmetic in partly-numeric, partly-implicit-in-type terms. Likewise to a lesser degree the data8 union member in the Chunk. Maybe a couple casts to and from uintptr_t in the code, instead?

- A way to clear the allocator without destructing it might be handy. Maybe factor the clearing-loop out into a public member.
Attachment #388263 - Flags: review?(graydon) → review+
Comment on attachment 388277 [details] [diff] [review]
remove dead outOMem code since Allocator.alloc doesn't return NULL

Mostly changes to your code, so fine by me. I'm surprised all those error codes are dead! But can't find any references here either.
Attachment #388277 - Flags: review?(graydon) → review+
(In reply to comment #28)
> (From update of attachment 388263 [details] [diff] [review])

> - You'll need to adapt a couple bits of this to the changes to use byte
> arithmetic, or course. Collide with njn's patch here in whichever order you and
> he prefer :)

Yup, this puts Allocator in place with our LIR branch, and in the merge playground I'm importing TM's LIR branch and resolving.  Resolving the
allocation goo is simpler with the new Allocator up and running.

> - I know I said I didn't care earlier today, but now that I see it and think
> about it, util.{h,cpp} is sufficiently against convention that I think
> Allocator should really go in its own file. 

I'll think about it.  With a proliferation of a bunch of small container classes, it gets annoying to put each in its own file.  But Allocator.cpp|h, plus (say) Containers.cpp|h, is workable.

> - I find the use of an anonymous union declaring locals in Allocator::fill ...
> drifting back towards encoding address arithmetic in partly-numeric,
> partly-implicit-in-type terms. Likewise to a lesser degree the data8 union
> member in the Chunk. Maybe a couple casts to and from uintptr_t in the code,
> instead?

will do before submitting.

> - A way to clear the allocator without destructing it might be handy. Maybe
> factor the clearing-loop out into a public member.

easy enough.
pushed changeset 2134:d1fffa9d6662

Bug #499976: Mark avmplus::Allocator.obj to be counted as part of nanojit in the size report. (r+ self)
Use Allocator to allocate strings and label table entries, eliminates use of avmplus::String and some uses of AvmCore from nanojit code.
Attachment #388263 - Attachment is obsolete: true
Attachment #388277 - Attachment is obsolete: true
Attachment #388692 - Flags: review?(rreitmai)
Attachment #388692 - Flags: review?(rreitmai) → review+
Comment on attachment 388692 [details] [diff] [review]
use Allocator for verbose label tables

pushed (last week)
http://hg.mozilla.org/tamarin-redux/rev/2136
Attachment #388692 - Attachment is obsolete: true
Depends on: 506390
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Resolved fixed engineering / work item that has been pushed.  Setting status to verified.
Status: RESOLVED → VERIFIED