Closed Bug 473769 Opened 16 years ago Closed 15 years ago

stack frame size limited to 1K due to allocation structures

Categories

(Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect, P2)

Tracking

(Not tracked)

VERIFIED FIXED
flash10.1

People

(Reporter: edwsmith, Assigned: stejohns)

References

()

Details

(Whiteboard: fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey)

Attachments

(11 obsolete files)

class Assembler uses the same constant for AR.entry[] as for Assembler._resvTable[].  This means we can have at most 256 reservations, or 256 words on the stack, whichever limit is hit first.  The reservation index in LIns is the "real" limit, and should only limit the # of reservations, not the size of the stack frame.

If you have reservations whose size is >4 bytes (for doubles, or LIR_alloc areas), it's easy to run out of stack space before running out of live reservation structs.

An extreme example is on PPC or PPC64 where sizeof(jmp_buf) is 768 or 920, respectively, and Tamarin methods with try/catch blocks need to allocate a jmp_buf on the stack.  Those methods almost always fall back to interpretation because of these resource limits.
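As back-of-the-envelope arithmetic (a sketch with made-up helper names, not Tamarin code), the slot budget works out like this:

```cpp
#include <cstdint>

// Illustrative constants from the description above: 256 four-byte AR
// entries give a 1K stack frame budget.
const uint32_t kSlotBytes = 4;
const uint32_t kMaxSlots  = 256;   // 256 slots * 4 bytes = 1K

// Number of AR slots an allocation of `bytes` consumes (rounded up).
uint32_t slotsFor(uint32_t bytes) {
    return (bytes + kSlotBytes - 1) / kSlotBytes;
}

// A PPC64 jmp_buf alone (920 bytes -> 230 slots) nearly exhausts the
// 256-slot budget, so normal spill slots on top push past the limit.
bool fitsInFrame(uint32_t totalBytes) {
    return slotsFor(totalBytes) <= kMaxSlots;
}
```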
TraceMonkey's nanojit has partly fixed this issue by widening LIR instructions to 4 words each, and moving Reservation into the instruction.  See bug 490947.

Merging this change eliminates the 8-bit restriction on the # of Reservations, and completely eliminates Assembler._resvTable.  We can then make AR.entry[] arbitrarily larger.  Reservation.arIndex is still limited to 16 bits, allowing at most 256K per stack frame, up from 1K.
The only reason we used an 8bit index for the reservation table was to keep instruction size small.  
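The 16-bit limit above works out as follows (a sketch with hypothetical helpers, not nanojit's actual API):

```cpp
#include <cstdint>

const uint32_t kSlotBytes   = 4;
const uint32_t kArIndexBits = 16;   // width of Reservation.arIndex

// With a 16-bit arIndex, the largest addressable frame is
// 2^16 slots * 4 bytes = 256K, up from the old 1K.
uint32_t maxFrameBytes() {
    return (1u << kArIndexBits) * kSlotBytes;
}

// arIndex is the scaled displacement from the frame pointer; on a
// downward-growing stack, index i maps to byte offset -4*i.
int32_t dispFor(uint32_t arIndex) {
    return -int32_t(arIndex * kSlotBytes);
}
```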

It would be interesting to collect some data to see how much (if anything) we are saving by compressing the instruction into a single word (i.e. as we do currently).

If there are some savings (i.e. tramp usage is low) then maybe a 64-bit form might be a good middle ground.  A full 16 bytes, while attractive in many ways (speed/simplicity), does represent a full 4x increase in size.
Nick is working on a variable-sized LIR encoding. There is actually no hard reason to make LIns fixed-size. He reported a 2x+ code size reduction from the 16-byte variant. We should cross-link the bugs ... and we'll really get back to merging soon :)
I gave up on the 64-bit LIns middle ground after experimenting in TT.  It's slightly bigger, with slightly fewer tramps, compared to 32-bit LIns, but you still have the headaches and cost of tramps.  A variable-sized LIns that uses full pointers, or offsets from the start of a movable-but-contiguous buffer, is a better middle ground because it's only a modest increase in size (<4x, probably 2-3x in practice), but eliminates the tramp complexity.
This would be a good funnel to facilitate merging us back together. Maybe you guys can talk to nick and make sure that his design meets your requirements as well, and then we port and merge back and forth. We are mostly done with 3.5, so we should be able to take disruptive changes onto trunk in two weeks or less.
I started a bug for the variable-width LIR in TM: #492866.
FWIW, bug 492866 is pretty well advanced now.
The last piece of work to close out this bug is to update how we manage the stack area (struct AR).  Currently there are NJ_MAX_STACK_ENTRY (=256) entries we can manage, 4 bytes per entry.

Easy fix: increase NJ_MAX_STACK_ENTRY, or even make it variable.  This is okay when we have mostly scalar entries because each entry points to a different LIns.

However, when we have large LIR_allocs in the code stream, we waste space in AR.entry[].  A LIR_alloc of (say) 128 bytes consumes 128/4 = 32 entries in this array, all of which point to the same LIR_alloc.  The same thing happens to a smaller degree when there are lots of 8-byte (double or 64-bit pointer) entries: many runs of two identical LIns* in AR.entry[].

AR.entry is indexed by Reservation->arIndex, which is the scaled displacement from the frame pointer.  It's used to free AR entries and in disp() to calculate the frame pointer offset.

Harder: to compact these runs of the same LIns, we would need to change the data structure, and replace these direct indexing actions with something else.  I've spent very little time thinking about it, but it's basically a small memory allocation problem: minimize the overall frame size, track/re-use free regions, and minimize fragmentation.

note: Reservation::arIndex is 16 bits, so we still have an upper limit of 64k*4=256K stack frame size.
For the short-term we can probably get by with tweaking the constant, but if we need/want larger stacks then we'll probably require the *harder* option listed in comment #8.
Priority: -- → P3
Target Milestone: --- → flash10.1
FWIW I'm overflowing the stack regularly with large blocks with lirasm --random (see bug 519873).  E.g. on x86 it overflows if I do a block much bigger than 10,000 insns.

Increasing NJ_MAX_STACK_ENTRY from 256 to 4096 or 16384 would be really easy.  We probably don't want to go all the way to 65536, it would be nice to keep a couple of bits spare for other uses.
The cases where tamarin overflows the structure in practice are caused by LIR_allocs with large sizes.  Even with short blocks and very few live ranges, these hit the limits.  On ppc64, for example, sizeof(jmp_buf) is huge. I agree, 2^14 is probably an upper bound for the current approach.
Related to comment 8 and comment 9:  the algorithm used by arReserve() to find free slots is really awful, particularly for the more-than-8-bytes case.
Assignee: edwsmith → nobody
Crossref to Adobe Watson bug # 2498076
Severity: normal → major
Priority: P3 → P2
Assignee: nobody → stejohns
Status: NEW → ASSIGNED
bumping NJ_MAX_STACK_ENTRY to (say) 16384 does in fact address the issues at hand, but I'm unsure of the downside to simply increasing it... I'm presuming the main objection is that we have to allocate (and traverse) a much larger AR.entry array? If so, any reason this couldn't be a growable array?
The only obvious downside is that increasing NJ_MAX_STACK_ENTRY will increase sizeof(Assembler).  I'd say let's do it and see what the impact is.

Yes, AR could be made growable; however, nanojit::Allocator doesn't lend itself to realloc-style growing (old short blocks aren't reused until the whole allocator is freed).  If we don't mind using the memory all the time (in tamarin, Allocator is short-lived) then fixed size is better.

possible problems:
  - >16K alloc every time we jit, regardless of method size (yawn)
  - longer compile time for big methods because of linear scans in arReserve() (previously those methods would have failed)

the real fix of the AR data structures is probably not all that hard, but it's also not needed if we don't mind using a modest amount more memory.  If we assume LIR size and code size are correlated with stack frame size, then the growth in stack frame size is probably overshadowed by an even bigger growth in code size.
OK then, a related question: why is NJ_MAX_STACK_ENTRY specific to each architecture? (i.e., if there is no per-architecture risk then we should move it into core code; otherwise, we need to identify said risk...)

I'm guessing the explanation of the above is that we want to be able to constrain the maximum AR size for low-stack devices... in which case rejiggering the data structures to be smaller and more efficient won't really help the underlying concern.

(In reply to comment #12)
> Related to comment 8 and comment 9:  the algorithm used by arReserve() to find
> free slots is really awful, particularly for the more-than-8-bytes case.

Ow, it certainly is. But before optimizing, does profiling suggest this is a meaningful hotspot? (or just an ugly one that needs to be rewritten for the sake of elegance?)
We currently use a linear search when looking for space in AR for each definition, so compilation time is essentially quadratic in the number of definitions. Ed, the large-code-implies-large-frame argument is not always accurate, I believe. insAlloc() grabs pretty large chunks at once, so a small fragment with a single alloc can make all subsequent definitions go really slow due to the linear search. I think we should rewrite the underlying code a bit instead of just bumping the limit (on an unrelated note, the code is also completely broken for anything but downwards stack growth; I tried to use the other direction for the ESP-relative stack addressing patch).
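The cost Andreas describes can be modeled in a few lines (a toy model, not the real arReserve()): one large alloc occupies a prefix of the table, and every later single-slot reservation rescans that prefix, so total work grows quadratically with the number of definitions.

```cpp
#include <vector>
#include <cstddef>

// Count entry comparisons made by a naive first-fit scan when a big
// alloc of `allocSlots` slots precedes `numDefs` single-slot defs.
size_t comparisonsForLinearScan(size_t allocSlots, size_t numDefs) {
    std::vector<bool> used(allocSlots + numDefs, false);
    for (size_t i = 0; i < allocSlots; i++) used[i] = true;  // the big alloc

    size_t comparisons = 0;
    for (size_t d = 0; d < numDefs; d++) {
        for (size_t i = 0; i < used.size(); i++) {  // rescan from slot 0
            comparisons++;
            if (!used[i]) { used[i] = true; break; }
        }
    }
    // Total = numDefs*allocSlots + numDefs*(numDefs+1)/2
    return comparisons;
}
```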
Increases NJ_MAX_STACK_ENTRY to 16384 for x86, x64, and ppc systems.
Attachment #417407 - Flags: review?(edwsmith)
Attachment #417407 - Flags: review?(nnethercote)
Yes, NJ_MAX_STACK_ENTRY was made per-arch so that we could tune it as needed.  

Despite its inelegance I've never seen arReserve show up as hot in profiling, but I'm guessing that as we ratchet up the entry size at some point it will start to show up.

re: Andreas and upward growing stack... it used to work at some point.  We should probably fix it or remove the stack_direction() macros as the implication of having these is that it works.
(In reply to comment #18)
> We currently use a linear search when looking for space in AR for each
> definition, so thats essentially quadratic with compilation time.

true, but does it matter in actual cases? ("class AllEvil extends PrematureOptimization" and all that)

> I think we should rewrite the underlying code a 
> bit instead of just bumping the limit

These are orthogonal needs. The problem on my immediate plate is that with NJ_MAX_STACK_ENTRY=256, methods that require large amounts of temporary stack space will fail to JIT, thus falling back to the interpreter and running many times slower. The underlying code could use a rewrite, but I don't see how that would allow us to allocate more stack space than the constant allows for.
Steven, I am not opposed to bumping the number. Not at all. I am just saying the code underlying it needs to be rewritten eventually. The code already pops up in lirasm profiles as is, without even bumping the pointer.
(In reply to comment #22)
> Steven, I am not opposed to bumping the number. 

Nor am I opposed to rewriting the code :-)
it might be easy to tweak the linear scan code in arReserve() to skip N entries when the entry is a LIR_alloc; right now it strides 2 at a time and does not skip ahead.  It's a low-risk change but I don't know how much it will help.
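The skip-ahead tweak, in miniature (hypothetical types; the real table stores repeated LIns* rather than run lengths): when the scan hits an entry belonging to a multi-slot alloc, jump past the whole run instead of stepping a fixed stride.

```cpp
#include <cstdint>

struct Entry { uint32_t slots; };   // 0 = free, else run length in slots

// Find the first free slot, skipping whole allocations in one step.
int firstFree(const Entry* entries, uint32_t n) {
    uint32_t i = 0;
    while (i < n) {
        if (entries[i].slots == 0)
            return int(i);
        i += entries[i].slots;      // skip the entire LIR_alloc run
    }
    return -1;                      // table full
}
```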
Attachment #417407 - Flags: review?(edwsmith) → review+
arReserve() accounts for something like 10% of time in lirasm --random, when the block size is big.  This is largely because at startup lirasm --random allocates a 256 entry buffer from which it does all its loads and stores, so every call to arReserve() has to scan over that buffer.  So improving arReserve() would speed up lirasm --random if nothing else.  We should file a separate bug for that and assign it to Gal as he keeps complaining about it :)

Steven, 256 to 16384 is a big jump, could you do something smaller like 4096?  Why did you choose 16384?

I timed TM on SunSpider with the change and got repeatable slowdowns of 5--10ms.  I tried 4096 and got smaller (but still repeatable) slowdowns.  It may be the additional zeroing required in Assembler::reset() but I'm not sure.

Also, if this change is done on ARM it will cause problems, because the maximum offset in a lot of cases (esp. STR) is -4096..4096.
Bring it on :) I am happy to fix that one. Ed's suggestion sounds like a good start.
(In reply to comment #25)
> arReserve() accounts for something like 10% of time in lirasm --random, when

true, but on the flipside, this is about as synthetic a benchmark as we can get. Not arguing at all that it shouldn't be improved, but what are the numbers for more realistic benches, I wonder?
 
> Steven, 256 to 16384 is a big jump, could you do something smaller like 4096? 
> Why did you choose 16384?

Mainly due to earlier comments that suggested we might as well crank it that far. 4096 would probably solve most of the immediate problems I'm trying to deal with, though a few outliers might still be problematic. 

> I timed TM on SunSpider with the change and got repeatable slowdowns of
> 5--10ms.  I tried 4096 and got smaller (but still repeatable) slowdowns.  It
> may be the additional zeroing required in Assembler::reset() but I'm not sure.

Side note: given the larger size, perhaps VMPI_memset(0) would be more appropriate for clearing the entry table in reset().
 
> Also, if this change is done on ARM it will cause because the maximum offset 

Yep, ARM *definitely* can't go that high.
Revised version of the previous patch; in addition to expanding the size, it revises AR to keep track of the active range at any time, so we don't have to search the entire range of entries (so in theory, functions don't pay a perf cost for the larger size of entries unless they actually use it).

Note that this doesn't attempt to address the fundamentally stupid algorithm being used.

Note also that I haven't done before-and-after timing for this patch, so the speedup (or lack thereof) is purely theoretical... posting for feedback on the approach first.
Attachment #417407 - Attachment is obsolete: true
Attachment #417844 - Flags: review?(nnethercote)
Attachment #417407 - Flags: review?(nnethercote)

(In reply to comment #29)
> Created an attachment (id=417844) [details]
> Increase NJ_MAX_STACK_ENTRY on desktop systems, revise search algo
> 
> Revised version of the previous patch; in addition to expanding the size, it
> revises AR to keep track of the active range at any time, so we don't have to
> search the entire range of entries (so in theory, functions don't pay a perf
> cost for the larger size of entries unless they actually use it).

Do we save searching the entries, or clearing the entries, or both?  I'm more interested in the latter, as it is probably the source of the slow-down in the case where the extra stack space isn't being used.

I think I see what you're trying to do, but the code changes are much more complicated than I would have expected.  Are you trying to do too much in a single patch?  Seems like you just want an extra 'limit' field in AR, and when searching you'd search from 0 to 'limit', and when clearing you'd just reset 'limit' to 0 (or maybe 1).

> Note also that I haven't done before-and-after timing for this patch, so the
> speedup (or lack thereof) is purely theoretical... posting for feedback on the
> approach first.

Numbers would make it more convincing :)
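The 'limit' field idea suggested above, in miniature (hypothetical names, not the actual patch): track the highest entry handed out, search only up to it, and clear only that region on reset.

```cpp
#include <cstdint>
#include <cstring>

const uint32_t kMaxEntries = 16384;

struct MiniAR {
    const void* entries[kMaxEntries];
    uint32_t    limit;                 // highest index handed out so far

    MiniAR() : limit(0) { std::memset(entries, 0, sizeof(entries)); }

    // Scan only the active region; entries above `limit` are known free.
    uint32_t reserve(const void* ins) {
        for (uint32_t i = 1; i <= limit; i++) {
            if (entries[i] == 0) { entries[i] = ins; return i; }
        }
        if (limit + 1 >= kMaxEntries) return 0;   // frame full
        entries[++limit] = ins;
        return limit;
    }

    // Clearing touches only the used prefix, not all 16384 entries.
    void clear() {
        std::memset(entries, 0, (limit + 1) * sizeof(entries[0]));
        limit = 0;
    }
};
```

This is why unused extra capacity shouldn't cost anything per compile: both the scan and the clear are bounded by the frame's actual usage, not by NJ_MAX_STACK_ENTRY.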
(In reply to comment #30)

> Do we save searching the entries, or clearing the entries, or both?  I'm more
> interested in the latter, as it is probably the source of the slow-down in the
> case where the extra stack space isn't being used.

both.
 
> I think I see what you're trying to do, but the code changes are much more
> complicated than I would have expected.  Are you trying to do too much in a
> single patch? 

Probably, tried to encapsulate everything too.

> Seems like you just want an extra 'limit' field in AR, and when
> searching you'd search from 0 to 'limit', and when clearing you'd just reset
> 'limit' to 0 (or maybe 1).

okey dokey
(In reply to comment #30)

> Numbers would make it more convincing :)

absolutely. any particular benches you'd expect to be more relevant than others?
(In reply to comment #32)
> (In reply to comment #30)
> 
> > Numbers would make it more convincing :)
> 
> absolutely. any particular benches you'd expect to be more relevant than
> others?

TraceMonkey on SunSpider is my main concern :)  I can test that when the time comes.

You could just add some counters into arReserve that count how many entries get scanned/cleared.  So even if times aren't affected on the usual benchmarks you know that how much less scanning/clearing you're doing.
(In reply to comment #30)

> I think I see what you're trying to do, but the code changes are much more
> complicated than I would have expected.  Are you trying to do too much in a
> single patch?  Seems like you just want an extra 'limit' field in AR, and when
> searching you'd search from 0 to 'limit', and when clearing you'd just reset
> 'limit' to 0 (or maybe 1).

one thing to note is that the current "tos" is misleading... it's commented as "current top of stack entry", but it's used as "high water mark" in the various backends.
Attached patch Encapsulate AR structure (obsolete) — Splinter Review
Trying to split the previous patch into smaller bits... this doesn't change any algorithms or constants, merely encapsulates AR access, such that the external world no longer gets random access, but rather just reserves, frees, or iterates. (Quite possibly treading on the too-much-cleverness line, as I'm not sure that AR warrants full encapsulation, but it arguably will make it much easier to safely review the changes for limiting scanning, and almost certainly will make bug 534797 easier to do...)
Attachment #417844 - Attachment is obsolete: true
Attachment #418009 - Flags: review?(nnethercote)
Attachment #417844 - Flags: review?(nnethercote)
Additive to the previous patch: enlarge the maximum size for x86/x64/ppc, and enhance the algorithm to only scan the areas in use.

Passes lirasm --random 100000 on MacTel32, MacTel64.

With both patches applied, SunSpider reports 1.010x as fast for MacTel32 and 1.009x as fast for MacTel64 (both nondebug builds, of course).
Attachment #418036 - Flags: review?(nnethercote)
Attachment #418009 - Flags: review?(edwsmith)
Attachment #418036 - Flags: review?(edwsmith)
Same as previous patch, but with less crashy version of AR::Iter::next method
Attachment #418036 - Attachment is obsolete: true
Attachment #418068 - Flags: review?(nnethercote)
Attachment #418036 - Flags: review?(nnethercote)
Attachment #418036 - Flags: review?(edwsmith)
Attachment #418068 - Flags: review?(edwsmith)
Comment on attachment 418009 [details] [diff] [review]
Encapsulate AR structure

I like the encapsulation of AR, it hides some of the gnarly details.


>+    bool AR::Iter::next(LIns*& ins, uint32_t& size, int32_t& offset) 
>+    { 
>+        while (_i <= _ar._highWaterMark)
>+        {
>+            if ((ins = _ar._entries[++_i]) != NULL)
>+            {   
>+                size = sizeFor(ins);
>+                _i += size - 1;
>+                offset = _i;
>+                return true;
>+            }
>+        }
>+        ins = NULL;
>+        offset = 0;
>+        return false;
>+    }

I like the iterator.  But I found the argument names 'size' and 'offset'
non-obvious, particularly in terms of their units.  Some suggestions: 'size'
could be 'nStackSlots', 'sizeFor' could be 'nStackSlotsFor'.  And we have
arIndex() elsewhere, so 'arIndex' for 'offset'?


>+    void AR::checkForResourceConsistency(const RegAlloc& regs) const
>+    {
>+        for (uint32_t i = 1; i <= _highWaterMark; ++i)
>+        {
>+            LIns* ins = _entries[i];
>+            if (!ins)
>+                continue;
>+            Register r = ins->getReg();
>+            uint32_t arIndex = ins->getArIndex();
>+            if (arIndex != 0) {

I think this condition must always be true (it's one of the invariants) and
so should be changed to an assertion.


>+                if (ins->isop(LIR_alloc)) {
>+                    int j=i+1;
>+                    for (int n = i + (ins->size()>>2); j < n; j++) {
>+                        NanoAssert(_entries[j]==ins); // @todo, should this be _entries[n]?

No, _entries[j] is correct, but the loop initialisation is very confusing --
it initialises the loop variable, j, outside the loop, but initialises the
loop limit, n, inside the loop.  Swap those two and it'll be clearer.


>+                    }
>+                    NanoAssert(arIndex == (uint32_t)j-1);
>+                    i = j-1;
>+                }
>+                else if (ins->isQuad()) {
>+                    NanoAssert(_entries[i - stack_direction(1)]==ins);
>+                    i += 1; // skip high word

Having stack_direction() here but not in the LIR_alloc case is odd.
stack_direction is probably broken, as you've said elsewhere.  Maybe leave
it as is and file a separate bug to fix it.


> 
>@@ -468,8 +492,9 @@
>         if (!ins->isUsed())
>             ins->markAsUsed();
>         if (!ins->getArIndex()) {
>-            ins->setArIndex(arReserve(ins));
>-            NanoAssert(ins->getArIndex() <= _activation.tos);
>+            uint32_t const idx = arReserve(ins);
>+            ins->setArIndex(idx);
>+            NanoAssert(_activation.isValidEntry(ins->getArIndex(), ins) == (idx != 0));
>         }
>         return disp(ins);
>     }

As above, 'arIndex' is probably a better name than 'idx'.

Also, I find the final assertion too clever for its own good.  I find this
clearer:

  uint32_t const arIndex = arReserve(ins);
  if (arIndex != 0) {
      ins->setArIndex(arIndex);
      NanoAssert(_activation.isValidEntry(ins->getArIndex(), ins));
  }


>+        
>+        LIns* ins = 0;
>+        uint32_t size = 0;
>+        int32_t offset = 0;

>+        for (AR::Iter iter(_activation); iter.next(ins, size, offset); )
>+        {
>+            const char* n = _thisfrag->lirbuf->names->formatRef(ins);
>+            if (size > 2) {
>+                VMPI_sprintf(s," %d-%d(%s)", 4*offset, 4*(offset+size-1), n);
>+            }
>+            else if (size == 2) {
>+                VMPI_sprintf(s," %d+(%s)", 4*offset, n);
>+            }
>+            else {
>+                VMPI_sprintf(s," %d(%s)", 4*offset, n);
>             }
>             s += VMPI_strlen(s);
>         }

Change the for-loop to a while-loop?  I don't much like for loops that don't
follow the standard pattern.

I'd suggest folding the size>2 and size==2 cases together.  And again,
'offset' -> 'arIndex'.


>@@ -1641,29 +1646,28 @@
>     }
>+    inline bool AR::isEmptyRange(uint32_t count, uint32_t start) const

Would isEmptyRange(start, count) be more intuitive?  It is to me.


>-        if (i >= (int32_t)ar.tos) {
>-            ar.tos = i+1;
>+        if (i >= _highWaterMark) {
>+            _highWaterMark = i+1;
>         }
>-        if (tos+size >= NJ_MAX_STACK_ENTRY) {
>-            setError(StackFull);
>+        if (_highWaterMark+size >= NJ_MAX_STACK_ENTRY) {
>+            return 0;
>         }

Careful!  ar.tos and tos were distinct values.  You've conflated them into a
single value, _highWaterMark.  You'll need to add another variable
oldHighWaterMark.


>+    void Assembler::arFree(LIns* l)
>+    {
>+        uint32_t arIndex = l->getArIndex();
>+        if (arIndex) {
>+            NanoAssert(_activation.isValidEntry(arIndex, l));
>+            _activation.freeEntryAt(arIndex);        // free any stack stack space associated with entry
>+        }
>+    }

I don't like functions called "doSomething" that only do something under
certain conditions.  "arMaybeFree" or "arFreeIfUsed" or something like that
would be better.


>+    inline /*static*/ uint32_t AR::sizeFor(LIns* l)
>+    {
>+        return l->isop(LIR_alloc) ? (l->size()>>2) : (l->isQuad() ? 2 : 1);
>+    }

I've been moving things towards consistently using 'ins' for LIR instructions.  'l' is a poor name for a variable because it looks too similar to '1'.


>@@ -258,7 +296,7 @@
>             NIns*       genEpilogue();
> 
>             uint32_t    arReserve(LIns* l);
>+            void        arFree(LIns* l);

'l' -> 'ins', again.
I think you've attached the wrong patch for "expand maximum size".  Can you re-attach?
For what it's worth, MMgc assumes that the stack (seen as a sequence of frames) grows from higher addresses to lower addresses, and will need fixing if that turns out not to be true.  Stack overflow checking in the VM is the same.
Attachment #418009 - Flags: review?(edwsmith)
Comment on attachment 418068 [details] [diff] [review]
Expand maximum size & optimize algo, v2

oops, this is the mops range check filter patch (right patch, wrong bug, or vice versa)
Attachment #418068 - Flags: review?(edwsmith) → review-
Not to invalidate the work that Steven is doing to clean-up the existing code, but if bug 534797 is going to address the arReserve algo issue (assuming it is being actively worked on), is it worth overhauling the code to this extent at this point?  

Or does bumping the limit to 4K (which I assume solves the immediate problem) cause too much of a slowdown?
Steven's patch is a quick easy win, and when it's ready to land it won't hurt compilation time because it will be pay-as-you-go (only used entries get initialized or scanned).

Andreas's patch will be pay-less-as-you-go :-).

in my opinion, a lesson from the last two years of nanojit work is to clean up early and often. True, we risk some churn if both patches land in quick succession, but it's worth it.
(In reply to comment #38)
 
> I like the iterator.  But I found the argument names 'size' and 'offset'
> non-obvious, particular in terms of their units.  Some suggestions: 'size'
> could be 'nStackSlots', 'sizeFor' could be 'nStackSlotsFor'.  And we have
> arIndex() elsewhere, so 'arIndex' for 'offset'?

Done.
 
> I think this condition must always be true (it's one of the invariants) and
> so should be changed to an assertion.

Done.
 

> No, _entries[j] is correct, but the loop initialisation is very confusing --
> it initialises the loop variable, j, outside the loop, but initialises the
> loop limit, n, inside the loop.  Swap those two and it'll be clearer.

Done.
 

> stack_direction is probably broken, as you've said elsewhere.  Maybe leave
> it as is and final a separate bug to fix it.

https://bugzilla.mozilla.org/show_bug.cgi?id=535044
 

> Also, I find the final assertion too clever for it's own good.  I find this
> clearer:

I disagree with you on this one: (1) the assertion is well-defined and unambiguous, (2) your suggested replacement isn't equivalent, as it only checks for validity if arReserve succeeded (vs. also checking if it failed)

> Change the for-loop to a while-loop?  I don't much like for loops that don't
> follow the standard pattern.

Eh, I'm not religious either way; this is a pattern we use in several other places in Tamarin, though afaik it hasn't shown up in NJ. IMHO this usage has a (slight) advantage over a while loop in that it keeps the lifetime scope of the iterator to the loop itself, whereas a while loop would require extra {} goodness to do the same... so unless this is a major stylistic objection I'm inclined to leave it.

> I'd suggest folding the size>2 and size==2 cases together.  And again,
> 'offset' -> 'arIndex'.

Done

> Would isEmptyRange(start, count) be more intuitive?  It is to me.

Ditto, this was just an effort to minimize code churn by keeping the previous pattern... but let's change it.

> Careful!  ar.tos and tos were distinct values.  You've conflated them into a
> single value, _highWaterMark.  You'll need to add another variable
> oldHighWaterMark.

this is true, but looking at it now, I wonder if the old code was wrong... since stack growth is downward, is "oldHighWaterMark+nStackSlots" meaningful?
 
> I don't like functions called "doSomething" that only do something under
> certain conditions.  "arMaybeFree" or "arFreeIfUsed" or something like that
> would be better.

maybe, but scope creep alert is in effect, since this patch is preserving existing name & semantics of arFree.
 
> I've been moving things towards consistently using 'ins' for LIR instructions. 
> 'l' is a poor name for a variable because it looks too similar to '1'.

Agree 100%; kept "l" solely to reduce change churn. I'll change.

(In reply to comment #39)
> I think you've attached the wrong patch for "expand maximum size".  Can you
> re-attach?

oops.

(In reply to comment #43)
> Andreas's patch will be pay-less-as-you-go :-).

and also, Andreas's patch doesn't exist yet, so there's nothing for me to conflict with. :-)
Attached patch Encapsulate AR structure, v2 (obsolete) — Splinter Review
Tweaked version with Nick's suggestions incorporated
Attachment #418009 - Attachment is obsolete: true
Attachment #418219 - Flags: review?(nnethercote)
Attachment #418009 - Flags: review?(nnethercote)
Attachment #418219 - Flags: review?(edwsmith)
Correct version of the previous patch.
Attachment #418068 - Attachment is obsolete: true
Attachment #418223 - Flags: review?(nnethercote)
Attachment #418068 - Flags: review?(nnethercote)
Attachment #418223 - Flags: review?(edwsmith)
Comment on attachment 418219 [details] [diff] [review]
Encapsulate AR structure, v2

(In reply to comment #44)
> 
> > Careful!  ar.tos and tos were distinct values.  You've conflated them into a
> > single value, _highWaterMark.  You'll need to add another variable
> > oldHighWaterMark.
> 
> this is true, but looking at it now, I wonder if the old code was wrong...
> since stack growth is downward, is "oldHighWaterMark+nStackSlots" meaningful?

Hmm, not sure.  The current code certainly isn't obvious... how about you add a boolean with a name like 'success', initialise it to false, and then set it true if a slot is found (ie. before each of the 'break' statements).  Then replace the oldHighWaterMark test with 'if (!success) return 0;'.  I'd find that much easier to understand.


> > I don't like functions called "doSomething" that only do something under
> > certain conditions.  "arMaybeFree" or "arFreeIfUsed" or something like that
> > would be better.
> 
> maybe, but scope creep alert is in effect, since this patch is preserving
> existing name & semantics of arFree.

No it isn't!   arFree() has completely changed, it doesn't even have the same signature any more.  Please change.


BTW, although I suggested 'nStackSlots' I see that "Entry" is used a lot, so maybe 'nEntries' would be better, if you like.


Thanks for the other changes.  r=me with the first two changes above made, no need for a re-review.
Attachment #418219 - Flags: review?(nnethercote) → review+
Attachment #418223 - Flags: review?(nnethercote) → review-
Comment on attachment 418223 [details] [diff] [review]
Expand maximum size & optimize algo, v2a

I like the patch's aim, but I don't like some of the details.


>@@ -1656,61 +1690,75 @@
>     uint32_t AR::reserveEntry(LIns* ins)
>     {
>         uint32_t const nStackSlots = nStackSlotsFor(ins);
>-        uint32_t start = 1;
>-        uint32_t i = 0;
> 
>         if (nStackSlots == 1) {
>             // easy most common case -- find a hole, or make the frame bigger
>+            for (uint32_t i=1; i <= _curActive; i++) {
>                 if (_entries[i] == 0) {
>                     // found a hole
>                     _entries[i] = ins;
>-                    break;
>+                    return i;
>                 }
>             }
>+
>+            uint32_t const spaceLeft = NJ_MAX_STACK_ENTRY - _curActive - 1;
>+            if (spaceLeft >= 1)
>+            {
>+                 NanoAssert(_entries[_curActive+1] == BAD_ENTRY);
>+                _entries[++_curActive] = ins;
>+                if (_highWaterMark < _curActive) 
>+                    _highWaterMark = _curActive;
>+                return _curActive;
>+             }

I don't like that this is now in two steps, and the second is just repeating the first.  IMO this loop is the one place where we should use NJ_MAX_STACK_ENTRY instead of _curActive/_highWaterMark as the loop limit.  Then the duplication could be removed and it would be more like the old version.


>+        uint32_t        _curActive;                     /* index of highest currently-in-use entry */
>         uint32_t        _highWaterMark;                 /* index of highest entry used since last clear() */
>         LIns*           _entries[ NJ_MAX_STACK_ENTRY ]; /* maps to 4B contiguous locations relative to the frame pointer.

It took me a while to work out the difference between _curActive and _highWaterMark.  AIUI they will be equal unless the highest entries are freed, in which case _curActive will go down but _highWaterMark will be the same?

I'd prefer to remove _curActive and just use _highWaterMark throughout.  You may end up iterating over slightly more entries, but I suspect the number will be very small (ie. the average difference between _curActive and _highWaterMark will be *much* less than the difference between _highWaterMark and NJ_MAX_STACK_ENTRY), and the code will be much simpler.


You've chosen 16384 for the stack size, I'll ask again:  would 4096 suffice for your current purposes?  If so, it doesn't seem sensible to make it four times bigger than needed.


(BTW, I think it's generally preferred to have one patch per bug.  Doesn't matter for now but something for the future.)
Attachment #418223 - Flags: review?(edwsmith)
Attachment #418219 - Flags: review?(edwsmith) → review+
(In reply to comment #48)
> (BTW, I think it's generally preferred to have one patch per bug.  Doesn't
> matter for now but something for the future.)

Perhaps, but IMHO it's better to have multiple patches if it means easier reviewability. Ease-of-review trumps number-of-patches.
(In reply to comment #47)

> The current code certainly isn't obvious... how about you add a
> boolean with a name like 'success', initialise it to false, and then set it
> true if a slot is found (ie. before each of the 'break' statements).  Then
> replace the oldHighWaterMark test with 'if (!success) return 0;'.  I'd find
> that much easier to understand.

Done.
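A hypothetical illustration of the 'success' flag cleanup, with a stand-in search loop (the real code has several such loops, one per break):

```cpp
#include <cstdint>

// Instead of inferring failure from an unchanged high-water mark, record
// the outcome explicitly in a boolean.
uint32_t reserveSketch(void* entries[], uint32_t n,
                       uint32_t& highWaterMark, void* ins) {
    bool success = false;
    uint32_t slot = 0;
    for (uint32_t i = 1; i < n; i++) {
        if (entries[i] == nullptr) {
            entries[i] = ins;
            slot = i;
            success = true;   // set before each 'break'
            break;
        }
    }
    if (!success)
        return 0;             // replaces the oldHighWaterMark test
    if (highWaterMark < slot)
        highWaterMark = slot;
    return slot;
}

bool demo() {
    void* e[4] = {nullptr, nullptr, nullptr, nullptr};
    uint32_t hwm = 0;
    void* ins = &hwm;   // dummy instruction pointer
    if (reserveSketch(e, 4, hwm, ins) != 1) return false;
    if (reserveSketch(e, 4, hwm, ins) != 2) return false;
    if (reserveSketch(e, 4, hwm, ins) != 3) return false;
    if (reserveSketch(e, 4, hwm, ins) != 0) return false;  // full: fails
    return hwm == 3;
}
```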
 
> No it isn't!   arFree() has completely changed, it doesn't even have the same
> signature any more.  Please change.

Doh! You're right. Busted.

> BTW, although I suggested 'nStackSlots' I see that "Entry" is used a lot, so
> maybe 'nEntries' would be better, if you like.

'stackslots' is more descriptive IMHO.
Attached patch Encapsulate AR structure, v3 (obsolete) — Splinter Review
Re-requesting review because the minor cleanup revealed a subtle bug: reserveEntry was updating _highWaterMark as though it were a count rather than an index. Fixed the logic in arReserve that did this, then changed highWaterMark()->stackSpaceNeeded() to return a count rather than an index.
Attachment #418219 - Attachment is obsolete: true
Attachment #418412 - Flags: review?(nnethercote)
Attachment #418412 - Attachment is patch: true
Attachment #418412 - Attachment mime type: application/octet-stream → text/plain
(In reply to comment #48)
> I'd prefer to remove _curActive and just use _highWaterMark throughout.

Hmm... I think you are probably correct. Let me re-work it (and re-run the perf numbers).

> You've chosen 16384 for the stack size, I'll ask again:  would 4096 suffice for
> your current purposes?  If so, it doesn't seem sensible to make it four times
> bigger than needed.

The underlying issue I'm trying to deal with is that existing SWF compilers tend to generate large vector initializers in a naive way, which requires a lot of stack space. It's pretty easy to run out of space with the current limit of 256 (in which case we don't jit the function and interpret it instead, which can be dramatically slower). Of course, no matter what we raise the constant to, it's still gonna be possible to make a testcase that won't be handled, but the most reasonable short-term fix (IMHO) is to choose the largest size that's likely to be safe & sensible for a given architecture and make that the max.

Earlier discussion in this bug had suggested 16384 as a maximum, so that's what I went with. I could be persuaded to go with 4096, but I'm a little concerned we'll find some pathological case that isn't handled by it.

Edwin/Rick, you want to weigh in here with an opinion?

(Obviously the longterm fix is to correct the SWF compilers to generate more sensible code for initializers, but we're still stuck with supporting existing content without dramatically slowing down.)
> (Obviously the longterm fix is to correct the SWF compilers to generate more
> sensible code for initializers

Has a defect report been filed on this?  I will file and/or fix it.
(In reply to comment #53)
> Has a defect report been filed on this?  I will file and/or fix it.

Not that I know of. Make it so!
(In reply to comment #48)
> (From update of attachment 418223 [details] [diff] [review])
> I don't like that this is now in two steps, and the second is just repeating
> the first.  IMO this loop is the one place where we should use
> NJ_MAX_STACK_ENTRY instead of _curActive/_highWaterMark as the loop limit. 
> Then the duplication could be removed and it would be more like the old
> version.

Not sure I agree here... the idea of structuring it this way is that we don't have to zero the whole array every time we call reset, since everything > highWaterMark is defined to be available. We could go back to zeroing the whole array, in which case the approach you mention would work just fine, but we'd be paying a larger zeroing price on every reset for the soon-to-be-larger array.
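The two-step scheme and its interaction with clear() can be sketched like this (names approximate the patch; the point is that clear() only touches the prefix up to the high-water mark, so slots above it are free but never zeroed):

```cpp
#include <cstdint>

enum { MAX_ENTRY = 4096 };

struct AR {
    uint32_t highWaterMark;
    void*    entries[MAX_ENTRY];

    AR() : highWaterMark(0) { entries[0] = nullptr; }

    void clear() {
        // Only the prefix that was ever used needs zeroing; with a
        // 4K-16K array this is the cost the two-step reserve avoids.
        for (uint32_t i = 0; i <= highWaterMark; i++)
            entries[i] = nullptr;
        highWaterMark = 0;
    }

    uint32_t reserve(void* ins) {
        // step 1: search only the in-use prefix for a hole
        for (uint32_t i = 1; i <= highWaterMark; i++)
            if (entries[i] == nullptr) { entries[i] = ins; return i; }
        // step 2: grow past the high-water mark; slots there are free
        // by definition, so they need never have been zeroed
        if (highWaterMark + 1 < MAX_ENTRY) {
            entries[++highWaterMark] = ins;
            return highWaterMark;
        }
        return 0;
    }
};

bool demo() {
    AR ar;
    if (ar.reserve(&ar) != 1) return false;   // step 2: grow
    if (ar.reserve(&ar) != 2) return false;
    ar.entries[2] = nullptr;                  // free the top slot
    if (ar.reserve(&ar) != 2) return false;   // step 1: hole found
    ar.clear();
    if (ar.highWaterMark != 0) return false;
    return ar.reserve(&ar) == 1;              // starts over after clear
}
```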
Revised version with some (but not all) of Nick's suggestions. _curActive is gone, but I'm sticking with my two-step-reserve strategy for now (see previous comment for explanation).
Attachment #418223 - Attachment is obsolete: true
Attachment #418444 - Flags: review?(nnethercote)
Attachment #418444 - Flags: review?(edwsmith)
Attachment #418444 - Attachment is patch: true
Attachment #418444 - Attachment mime type: application/octet-stream → text/plain
Attachment #418412 - Flags: review?(nnethercote) → review+
Attachment #418444 - Flags: review?(nnethercote) → review+
Comment on attachment 418412 [details] [diff] [review]
Encapsulate AR structure, v3

http://hg.mozilla.org/projects/nanojit-central/rev/a2bdfa990384
Attachment #418412 - Attachment is obsolete: true
Comment on attachment 418444 [details] [diff] [review]
Expand maximum size & optimize algo, v3

http://hg.mozilla.org/projects/nanojit-central/rev/0a12bb3f8436

(with verbal r+ from edwin)
Attachment #418444 - Attachment is obsolete: true
Attachment #418444 - Flags: review?(edwsmith)
http://hg.mozilla.org/tamarin-redux/rev/d1bb609bb3fe
Whiteboard: fixed-in-tamarin
Attached patch Speed up AR::validate (obsolete) — Splinter Review
Nick observes that AR::validate is unduly slow in debug builds. This attempts to mitigate that by doing a full validate only every 100th call and a lighter check the rest of the time... and only calling it in checkForResourceConsistency.
Attachment #418770 - Flags: review?(nnethercote)
Comment on attachment 418770 [details] [diff] [review]
Speed up AR::validate

>+    void AR::validate()
>+    {
>+        if (++_validateCounter >= VALIDATE_FULL_RATE)
>+        {
>+            validateFull();
>+            _validateCounter= 0;
>+        }
>+        else
>+        {
>+            validateQuick();
>+        }
>+    }

I would make _validateCounter a static local variable in this method and just do a full check every Nth time regardless of calls to AR::clear().  It avoids spreading the complexity of the every-Nth-time checking all around.
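A sketch of that suggestion, with stand-ins for the two check routines (counters added here only so the rate-limiting is observable):

```cpp
#include <cstdint>

static int fullChecks = 0, quickChecks = 0;
static void validateFull()  { ++fullChecks;  }   // stand-in for the real check
static void validateQuick() { ++quickChecks; }   // stand-in for the real check

static const uint32_t VALIDATE_FULL_RATE = 100;

void validate() {
    // Static local: survives across calls, so no member variable and no
    // interaction with AR::clear() -- every Nth call gets the full check.
    static uint32_t counter = 0;
    if (++counter >= VALIDATE_FULL_RATE) {
        validateFull();
        counter = 0;
    } else {
        validateQuick();
    }
}

bool demo() {
    for (int i = 0; i < 250; i++) validate();
    // full check fires on calls 100 and 200; the other 248 are quick
    return fullChecks == 2 && quickChecks == 248;
}
```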


>diff -r b60bbeac9080 nanojit/Assembler.h
>--- a/nanojit/Assembler.h	Mon Dec 21 16:14:30 2009 -0800
>+++ b/nanojit/Assembler.h	Mon Dec 21 18:54:16 2009 -0800
>@@ -104,9 +104,13 @@
>     {
>     private:
>         uint32_t        _highWaterMark;                 /* index of highest entry used since last clear() */
>+        #ifdef _DEBUG
>+        uint32_t        _validateCounter;               /* only do full validation when this reaches VALIDATE_FULL_RATE */
>+        #endif

I think this would be better written using the debug_only() macro.  But if _validateCounter is made local it won't be relevant.


>         LIns*           _entries[ NJ_MAX_STACK_ENTRY ]; /* maps to 4B contiguous locations relative to the frame pointer.
>                                                             NB: _entries[0] is always unused */
>         #ifdef _DEBUG
>+        enum { VALIDATE_FULL_RATE = 100 };              /* do full validation once per this-many calls to validate */
>         static LIns* const BAD_ENTRY;
>         #endif

Why is this an enum?  Although, again, if _validateCounter is made local it won't be relevant.
Attached patch Speed up AR::validate, v2 (obsolete) — Splinter Review
Changes suggested by Nick. If you like it, feel free to go ahead and push.
Attachment #418944 - Flags: review?(nnethercote)
Comment on attachment 418944 [details] [diff] [review]
Speed up AR::validate, v2

Looks good.

Semi-related nit: I noticed in the AR::freeEntryAt() patch that you're using both NULL and 0 for _entries[0].  You might like to fix that up too...
Attachment #418944 - Flags: review?(nnethercote) → review+
BTW, would you mind pushing?  Thanks.
Attachment #418770 - Attachment is obsolete: true
Attachment #418770 - Flags: review?(nnethercote)
Comment on attachment 418944 [details] [diff] [review]
Speed up AR::validate, v2

nj-c: http://hg.mozilla.org/projects/nanojit-central/rev/0ff411e99654
Attachment #418944 - Attachment is obsolete: true
http://hg.mozilla.org/mozilla-central/rev/f3f879abe16b
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
QA Contact: nanojit → dschaffe
QA Contact: dschaffe → brbaker
Verifying the bug fix via bug 493949, which is a duplicate of this bug but had known testcases that would fail to jit. All of those testcases are now jitting in tamarin. Marking verified now that the source has been submitted to all necessary repos.
Status: RESOLVED → VERIFIED
Flags: in-testsuite+