IonMonkey: Fix ESP to a large offset from the start of the frame

RESOLVED FIXED

Status

()

RESOLVED FIXED
7 years ago
7 years ago

People

(Reporter: dvander, Assigned: dvander)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 2 obsolete attachments)

The problem: We want to avoid bloating JIT memory with large sync blocks. Snapshots are compressed (variable-length encoded and bitpacked), but we still get this pattern:
   add eax, edx
   jo _exit_block_N:
   ....

_exit_block_N:
   mov eax, <SnapshotPointer>
   jmp BailoutTrampoline

These exit blocks use 11 bytes on x86, 16-17 on x64. We know from JM that this sort of stuff accounts for a large fraction of JIT memory. With EBP relative addressing, compression is easy. As long as each script has a mapping from [0..n] -> Bailout, we can compose a global jump table:

  add eax, edx
  jo _GlobalExit + sizeof(call) * N

  _GlobalExit:
   call BailoutTrampoline
   call BailoutTrampoline
   call BailoutTrampoline
   call BailoutTrampoline
   call BailoutTrampoline
   call BailoutTrampoline
   ...
   BailoutTrampoline:
 
By picking the right |call| entry, we can compute (returnAddress - _GlobalExit) / sizeof(call) and find N. Then we can look up bailout N in the script. But, we have ESP relative offsets, so we don't know what the script is!

Dave and I thought we could generate a jump table for each frame size (this could be lazy, but would still be very heavy), but Luke has proposed an idea that I really like and will solve other problems as well: we can fix ESP to a static offset from the start of the frame, and unwind it before calls. For example:

   sub esp, STATIC_FRAME_SIZE
   ...
   ...
   add esp, <dynamic frame size>
   call function
   sub esp, <dynamic frame size>
   ...
   ...
   add esp, STATIC_FRAME_SIZE
   ret

STATIC_FRAME_SIZE could also automatically include slack space for cycle-resolving spills, argument pushing, etc. Let's do this!
When i read that, i thought there must be an other way to do this! So i remembered my days when i did some reverse engineering. A common technique to protect programs from being debugged is to change the control flow in some not obvious manner. On of these tricks is to register an exception or interupt handler, raise an exception or interupt and continue execution in the handler.

So generally we have could split two groups of processor, those who support non-executable memory (NX bit) and those who don't.

When NX is support we could do something like this:
  add eax, edx
  jo <not executable page + 0> 
If not we would do something like:
  add eax, edx
  jo _GlobalExit + sizeof(call) * N

  _GlobalExit:
  mov [<not writable page + 0>], 0

Then only linux we could register signal handler for SIGSEGV, and when siginfo_t->si_addr in the handler is inside our not executable/not writable page, we read the table index from that address. We could also even store the data there in continuous chunks to avoid one more indirection. 

On windows there is something.

Drawbacks:
 - Overriding the default signal handler is evil (could affect breakpad?)
 - This is obviously a lot slower than one jmp and one call
 - Makes debugging harder (see :)
I think on windows we could use this http://www.rohitab.com/structured-exception-handling-in-assembly-language, also i don't know yet how to get the page address.
Created attachment 547953 [details] [diff] [review]
part 1: change slot order

Previously slot 0 would get [esp+0], slot 1 would get [esp+4], etc. This patch changes it to be as originally intended: slot 0 gets [esp+frameDepth-4], slot 1 gets [esp+frameDepth-8], etc, so the last slot is at the bottom of the stack.
Attachment #547953 - Flags: review?(adrake)
Created attachment 547955 [details] [diff] [review]
part 2: introduce frame size classes

We shouldn't just have one fixed size:
 * It places an upper-limit on frame sizes
 * It introduces large esp-relative offsets which cost 3-bytes per stack move.

However, having a totally dynamic size means we'd need a bailout table for each frame size. As a compromise, this patch allows us to break separate frame sizes into just a few classes. (There is some unknown upper bound given we cap the number of interpreter slots and args.)

This is a very simple heuristic so I can proceed with bailouts. We can tweak once we are running on real code. For example, we could lazily create bailout tables only after we've measured that for a given frame class, (SizeOfGuard * Guards) >= sizeof(GuardTable). We could also keep guard tables small and grow them as needed. Or, simply not grow them, and use bigger guards after the ideal number is reached. Measuring v8 on v8bench, the most guards in a method was 826 (~400 on Kraken) and the average is much smaller.
Attachment #547955 - Flags: review?(adrake)
(In reply to comment #1)
> When NX is support we could do something like this:
>   add eax, edx
>   jo <not executable page + 0> 

That's a great idea! With that, the memory overhead would essentially be 0.
I just found out that Java has code for doing this, because they use signal handlers to catch null pointer and division-by-zero exceptions. 

For the different implementation look at the -signal.h files eg. here http://opensource.apple.com/source/gcc/gcc-5479/libjava/include/. I also found a exceptions that uses those header files to convert this exceptions into c++ exceptions (http://code.google.com/p/segvcatch/).

I did some very unscientific measurements, one assembler function causing a signal handler and one just returning. The difference seems to be around 10-20 ms. That doesn't sound to bad to me.

nasm -f elf64 -o execute.o execute.s

If you wan't to test the signal handler
gcc -o signal -DTRAP execute.o signal.c
otherwise
gcc -o signal execute.o signal.c
Created attachment 548042 [details]
the test code
Note that this only works for explicit guards (i.e. jo <bad_page + N>). If we were to use signals implicitly, like, to catch NULL-derefs and avoid the test+branch, we would have to use ebp-relative addressing.
Well, one workaround would be to have a per-compartment map of signal instruction to snapshot pointer, and store the frame size in the snapshot.
>Well, one workaround would be to have a per-compartment map of signal >instruction to snapshot pointer, and store the frame size in the snapshot.

I don't know for what you need the framesize, but just to find out the right snapshot, we could map the rip/eip of the failing instruction to the snapshot.

Also i just found cacaovm.org, which use this technique for some kind of inline caching. The implementation seems to be really clean and under GPL license.
Comment on attachment 547953 [details] [diff] [review]
part 1: change slot order

Review of attachment 547953 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good with two little things:

::: js/src/ion/IonLIR.h
@@ +609,5 @@
>  
>  class LInstructionVisitor
>  {
>    public:
> +#define VISIT_INS(op) virtual bool visit##op(L##op *) { return true; JS_NOT_REACHED("implement " #op); return false; }

I think you've got all the possibilities covered here -- which one should it be?

::: js/src/ion/shared/CodeGenerator-shared.h
@@ +74,5 @@
> +    int32 framePushed_;
> +
> +    inline int32 ArgToStackOffset(int32 slot) {
> +        JS_ASSERT(slot >= 0);
> +        return framePushed_ + frameDepth_ + ION_FRAME_PREFIX_SIZE + slot;

slot * STACK_SLOT_SIZE
Attachment #547953 - Flags: review?(adrake) → review+
Comment on attachment 547955 [details] [diff] [review]
part 2: introduce frame size classes

Review of attachment 547955 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ion/shared/CodeGenerator-shared.cpp
@@ +39,5 @@
>   *
>   * ***** END LICENSE BLOCK ***** */
>  #include "CodeGenerator-shared.h"
>  #include "ion/MIRGenerator.h"
> +#include "ion/IonFrames.h"

This file is missing.

@@ +51,5 @@
>      frameDepth_(graph.localSlotCount() * sizeof(STACK_SLOT_SIZE)),
>      framePushed_(0)
>  {
> +    frameClass_ = ClassForFrameSize(frameDepth_);
> +    frameStaticSize_ = FixedSizeOfFrameClass(frameClass_);

It seems like something like this would be a little cleaner:

frameClass_ = FrameSizeClass::fromSize(frameDepth_);
frameStaticSize_ = frameClass_.size();

::: js/src/ion/shared/CodeGenerator-shared.h
@@ +82,3 @@
>      inline int32 ArgToStackOffset(int32 slot) {
>          JS_ASSERT(slot >= 0);
> +        return framePushed_ + frameStaticSize_ + ION_FRAME_PREFIX_SIZE + slot;

slot * STACK_SIZE, but this is covered in my comments for the previous patch, too

::: js/src/ion/shared/CodeGenerator-x86-shared.cpp
@@ +58,5 @@
> +static const uint32 LAST_FRAME_INCREMENT = 512;
> +static const uint32 FrameSizes[] = { 128, 256, LAST_FRAME_SIZE };
> +
> +uint32
> +ion::ClassForFrameSize(uint32 frameSize)

If this is all x86/_64 specific, then what goes in IonFrames.h?
Attachment #547955 - Flags: review?(adrake) → review-
Argument slots are already adjusted during LIR, so no * STACK_SLOT_SIZE there. IonFrames.h is pretty thin now but I did forget to attach it, it will also house bailout logic in the future.
Created attachment 548243 [details] [diff] [review]
part 2: nits fixed, added missing file

Thanks, I added the suggested abstraction and it worked nicely.
Attachment #547955 - Attachment is obsolete: true
Attachment #548243 - Flags: review?(adrake)
Created attachment 548245 [details] [diff] [review]
part 2: attach the right file
Attachment #548243 - Attachment is obsolete: true
Attachment #548243 - Flags: review?(adrake)
Attachment #548245 - Flags: review?(adrake)
Comment on attachment 548245 [details] [diff] [review]
part 2: attach the right file

Review of attachment 548245 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #548245 - Flags: review?(adrake) → review+
You need to log in before you can comment on or make changes to this bug.