Closed Bug 1027885 Opened 6 years ago Closed 6 years ago

OdinMonkey: add async stack-walking support

Categories

(Core :: JavaScript Engine: JIT, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla34

People

(Reporter: luke, Assigned: luke)

References

(Blocks 1 open bug)

Details

Attachments

(12 files, 9 obsolete files)

1.93 KB, patch
sunfish
: review+
Details | Diff | Splinter Review
24.31 KB, patch
bbouvier
: review+
Details | Diff | Splinter Review
8.99 KB, patch
bbouvier
: review+
Details | Diff | Splinter Review
3.80 KB, patch
jandem
: review+
Details | Diff | Splinter Review
8.63 KB, patch
dougc
: review+
Details | Diff | Splinter Review
34.92 KB, patch
dougc
: review+
Details | Diff | Splinter Review
63.25 KB, patch
dougc
: review+
Details | Diff | Splinter Review
17.83 KB, patch
dougc
: review+
Details | Diff | Splinter Review
53.26 KB, patch
dougc
: review+
Details | Diff | Splinter Review
130.02 KB, patch
dougc
: review+
Details | Diff | Splinter Review
1.10 KB, patch
luke
: review+
Details | Diff | Splinter Review
30.44 KB, patch
dougc
: review+
Details | Diff | Splinter Review
Right now, asm.js code doesn't show up in the profiler because asm.js code doesn't push SPS psuedo stack frames when profiling is enabled.  This could be done (by patching all callsites to a SPS-pushing thunk), but, talking to Jan/Kannan/others, I think we want to move away from the pseudo-stack in general since pseudo-stack-pushing is pretty error prone and has non-trivial performance overhead.

The profiler asynchronously stops the thread (using SIGPROF/SuspendThread) which means that, if we don't push SPS frames, we need to be able to unwind the stack at an arbitrary pc inside jit code.  There are various ways to do this.  One way would be to build something analogous to SWARF that would tell us, at every pc, things like stack depth.  I started down this path initially and it started to look like it'd (1) be a lot of work for all of Ion, (2) constrain our flexibility in generating code since now have to statically describe what the code does in metadata, (3) a maintenance headache, since bugs would only show up sometimes during profiling.  (I have a patch that single-steps the processor so we could write testcases that sample at every pc, but this still requires that we write exhaustive testcases.)

After considering some hybrid solutions, I think the best way forward is to just save the frame-pointer like all C/C++ did before DWARF unwinding.  Yes, this means an extra "push ebp; mov esp, ebp" in the prologue and "pop ebp" in the epilogue and burning a GPR.  Fortunately I wasn't able to measure any significant changes on asm.js or general benchmarks (it was noisy, though, so we'd need to double-check awfy).  As x86 becomes less common (when we ship FF64 on Win64, which I hear is this year), this will matter even less.  To wit, v8 also maintains a frame pointer.

As an added bonus, tools generally benefit from maintaining frame pointer: some gdbs (not x64 gdb on linux, but 32-bit gdb and gdb on osx), lldb, and breakpad has a fallback path to frame pointer.

The patches in this bug are just a first step in this direction: FramePointer is preserved in asm.js code and the existing AsmJSFrameIterator is re-implemented in terms of frame-pointer-walking.  Async stack-walking will be in a separate bug as will saving the frame pointer in all of Ion.  All over, this was a pretty simple change and even simplified several corner cases involving JS->C++ calls.
Attached patch simplify-labelsSplinter Review
Random trivial cleanup.
Attachment #8443080 - Flags: review?(sunfish)
Attached patch rename-moveSplinter Review
Now there are 2 words in the "stuff pushed before masm.framePushed", on all archs.  Thus, AsmJSSizeOfRetAddr is no longer a good description.  This patch also moves the declaration of AllRegsExceptSP to be closer to its first use.
Attachment #8443083 - Flags: review?(benj)
This patch also just hoists some code which makes GenerateCode a little nicer to read and also prepares for extension in a later patch.
Attachment #8443085 - Flags: review?(benj)
Attached patch maintain-fp (obsolete) — Splinter Review
This patch causes asm.js to maintain FramePointer, but it doesn't use it anywhere.
Attachment #8443086 - Flags: review?(benj)
Comment on attachment 8443080 [details] [diff] [review]
simplify-labels

Review of attachment 8443080 [details] [diff] [review]:
-----------------------------------------------------------------

As of bug 976446, Label itself no longer asserts ever, so NonAssertingLabel is currently redundant. That said, it would be nice to restore Label's asserting behavior, and it will help to have those uses of Label which know they need to opt out explicitly declare themselves as such. So LGTM.
Attachment #8443080 - Flags: review?(sunfish) → review+
Attached patch fp-stack-iter (obsolete) — Splinter Review
This patch re-implements AsmJSFrameIter to use the frame pointer.  At a high-level, the changes are:
 - A sorted array of CodeRanges is added to the AsmJSModule. A CodeRange describes a range of code in the AsmJSModule and is used to attach things like function name, or stub descriptions.
 - CallSite is still used to provide line/column info, but only for JS->JS calls and, with CodeRanges, it no longer needs to store the function name index.
 - Instead of recording the sp at exit, the fp is now recorded.  This thankfully avoids a whole bunch very careful code to deal with the variety of sp displacements.
 - We can now safely stack-walk even under interrupt callbacks because fp points to the interrupted code's frame.
Attachment #8443092 - Flags: review?(dtc-moz)
Comment on attachment 8443092 [details] [diff] [review]
fp-stack-iter

Review of attachment 8443092 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry, dedicating a register to the frame pointer will harm performance and is not necessary.

1. I accept the general principle that register pressure has a significant impact on performance. The claimed measurements and 'google does this' does nothing to debunk this.

2. The patches dedicate a register to the frame pointer which by this general principle will degrade performance or potential performance.

3. The patches dedicate a register to the frame pointer irrespective of the profiler even being used. It will degrade performance even if the profiler is not used.

4. Generally a profiler is only required to give statistically useful results, it does not need to be highly accurate.

5. From this requirement a plausible stack trace is adequate. The unresolved traces could be optionally noted in total or per function to help measure the statistical significance of the results to the developer.

6. For a start, a plausible stack trace can be establish by noting the stack depth for each function and the address or offset range for each function, plus the previously noted stack depth at the return sites. This would obviously not be accurate at all function locations, but would be accurate at most locations. This information can be used to attempt to walk the stack of an interrupted asm.js function, taking care to keep within the expected stack range. If the stack looks plausible then it can be accepted for profiling purposed, otherwise the profiler could at least record the immediate function and could bin the unresolved traces per function. This strategy could be improved later by increasing the accuracy of the noted stack depths and/or inspecting instructions to detect know outliers, and the recorded unresolved traces could be used to measure the accuracy.
Attachment #8443092 - Flags: review?(dtc-moz) → review-
Comment on attachment 8443083 [details] [diff] [review]
rename-move

Review of attachment 8443083 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM

::: js/src/jit/arm/MacroAssembler-arm.cpp
@@ +3875,5 @@
>  #if defined(JS_CODEGEN_ARM_HARDFP) || defined(JS_ARM_SIMULATOR)
>      if (useHardFpABI())
>          *stackAdjust += 2*((usedFloatSlots_ > NumFloatArgRegs) ? usedFloatSlots_ - NumFloatArgRegs : 0) * sizeof(intptr_t);
>  #endif
> +    uint32_t alignmentAtPrologue = callFromAsmJS ? AsmJSFrameSize : 0;

Pre-existing: you should need to modify the same thing in MacroAssembler-mips.cpp, which so far still uses AlignmentAtAsmJSPrologue.
Attachment #8443083 - Flags: review?(benj) → review+
Comment on attachment 8443085 [details] [diff] [review]
hoist-start-end-function

Review of attachment 8443085 [details] [diff] [review]:
-----------------------------------------------------------------

Nice.

::: js/src/jit/AsmJS.cpp
@@ +1459,5 @@
> +            return false;
> +# ifdef JS_ION_PERF
> +        mir.perfSpewer().noteBlocksOffsets();
> +        unsigned endInlineCodeOffset = mir.perfSpewer().endInlineCode.offset();
> +        if (!module_->addPerfProfiledBlocks(func.name(), startCodeOffset, endInlineCodeOffset,

nit: this should be
  if (PerfBlockEnabled() && !module_->addPerfProfiledBlocks(...))

@@ -5413,5 @@
> -
> -    // Unlike regular IonMonkey which links and generates a new JitCode for
> -    // every function, we accumulate all the functions in the module in a
> -    // single MacroAssembler and link at end. Linking asm.js doesn't require a
> -    // CodeGenerator so we can destroy it now.

Could we keep this comment?
Attachment #8443085 - Flags: review?(benj) → review+
Comment on attachment 8443086 [details] [diff] [review]
maintain-fp

Review of attachment 8443086 [details] [diff] [review]:
-----------------------------------------------------------------

Nice! For MIPS, you should be able to use r30 rather than s7 as the frame pointer (actually fp is already defined there as r30).

To be sure to understand everything:
- we don't need to maintain FP in the stack overflow exit as it eventually jumps to the throwLabel, and FP is popped in the throw exit.
- the interrupt exit handles exceptional behavior, which doesn't change semantics of code running, and thus shouldn't appear in any form of stack-walking. For this reason, we don't need to maintain FP in the GenerateInterruptExit.
Is that right? If that's the case, could you add comments explaining this at the top of GenerateStackOverflowExit and GenerateInterruptExit please?

Clearing the review flag, as I would like to see the answers to these questions. Also, I understand Douglas' concerns about the impact this can have on speed for ARM platforms. Do you have some measurements of the perf impact on ARM platforms, especially on low-end devices?

::: js/src/jit/AsmJS.cpp
@@ +6406,2 @@
>      masm.PushRegsInMask(GeneralRegisterSet((1<<GlobalReg.code()) | (1<<HeapReg.code())));
>  #elif defined(JS_CODEGEN_MIPS)

nit: this can be factored out with the above ifdef(ARM)

::: js/src/jit/arm/CodeGenerator-arm.cpp
@@ +49,5 @@
>  }
>  
>  bool
> +CodeGeneratorARM::generateEpilogue()
> +{

JS_ASSERT(!gen->compilingAsmJS());

@@ +108,2 @@
>  
> +    masm.pop(FramePointer);

Could you also add the debug check that SP and FP are equal before popping, as what's being done for x86?

::: js/src/jit/mips/Architecture-mips.h
@@ +125,5 @@
>      }
>  
>      static Code FromName(const char *name);
>  
> +    static const Code FramePointer = s7;

It seems that FP == r30.

::: js/src/jit/mips/Assembler-mips.h
@@ +71,5 @@
>  static MOZ_CONSTEXPR_VAR Register IntArgReg1 = a1;
>  static MOZ_CONSTEXPR_VAR Register IntArgReg2 = a2;
>  static MOZ_CONSTEXPR_VAR Register IntArgReg3 = a3;
> +static MOZ_CONSTEXPR_VAR Register GlobalReg = s5; // used by Odin
> +static MOZ_CONSTEXPR_VAR Register HeapReg = s6; // used by Odin

nit: not needed as FramePointer == fp == r30

@@ +105,5 @@
>  
>  static MOZ_CONSTEXPR_VAR Register JSReturnReg_Type = v1;
>  static MOZ_CONSTEXPR_VAR Register JSReturnReg_Data = v0;
>  static MOZ_CONSTEXPR_VAR Register StackPointer = sp;
> +static MOZ_CONSTEXPR_VAR Register FramePointer = { Registers::s7 };

nit: Registers::fp

::: js/src/jit/mips/CodeGenerator-mips.cpp
@@ +107,2 @@
>  
> +    masm.pop(FramePointer);

as for ARM, could you add the debug check that FP == SP before popping?
Attachment #8443086 - Flags: review?(benj)
(In reply to Douglas Crosher [:dougc] from comment #7)
> Sorry, dedicating a register to the frame pointer will harm performance and
> is not necessary.

One always has to make tradeoffs.  These were stated in comment 0.

> 1. I accept the general principle that register pressure has a significant
> impact on performance. The claimed measurements and 'google does this' does
> nothing to debunk this.

The claim wasn't "there is zero negative impact" but, rather, "the impact is small enough".  We'll have more measurements once we land this on awfy.

> 3. The patches dedicate a register to the frame pointer irrespective of the
> profiler even being used. It will degrade performance even if the profiler
> is not used.

I'd like to avoid recompiling when profiling is enabled.  (1) this is a bad user experience, (2) if we maintain frame pointer always, then we can use async stack-walking when profiling isn't enabled (for the chromehang monitor).  It will also allow other tools to do a better job (e.g. crash stacks).

> 6. For a start, a plausible stack trace can be establish by noting the stack
> depth for each function and the address or offset range for each function,
[...]

This approach was discussed in comment 0.  I'd also rather not do this fuzzy-stack-unwinding as it will lead to bugs going undetected and will result in a lower-quality result.

I would like to go forward with this patch.
I would like to suggest that recompiling is not that bad here. There are 2 main use cases for profiling: (1) A benchmark, just a single iteration of the main loop that does some stuff, and (2) an app like a game that runs multiple frames (using requestAnimationFrame or setTimeout).

In the first case, the user turned on the profiler before starting the benchmark, as there is no way to turn it on in the middle, so we don't need to recompile - we can compile the first and only time with frame pointers.

In the second case, the user turns on the profiler as the app is running. The profiler itself is enabled from the top of the event loop, so there is nothing on the stack. This suggests we can recompile and swap in the globals, replacing all the code with a frame pointer-compiled version on the next frame.

I agree that recompiling on profiler activation is not the best user experience. But this would only happen for people running the profiler, and we could cache the frame pointer version so it only happens once per codebase. (In theory, we could even speculatively compile the frame pointer version ahead of time if the profiler tab is open, before it is actually activated.) Given that, this seems to me like a small cost compared to the benefit of not having the frame pointer overhead.

Of course this depends on the actual overhead, which I guess we don't know yet. It just seems like we've worked so hard to squeeze every drop of performance out of Odin, so sacrificing a pointer just for the profiler feels troubling.
(In reply to Alon Zakai (:azakai) from comment #12)
> Given that, this seems to me like a small cost compared to the
> benefit of not having the frame pointer overhead.

Again, this all depends on the measured frame pointer overhead.  Only x86 is register-poor enough for this to really matter and, at least when I measured, it didn't seem to make a big difference (though there was a lot of noise).  I'd like to land the frame pointer patch to see what the actual impact is on awfy; then we can talk about whether it makes sense to only enable frame-pointer in profiling builds.  As I already said, there is more benefit than just avoiding a big jank when enabling the profiler: frame-pointers assist in recovering stacks for the crash-reporter and hang monitor (not a big deal for asm.js, but it would be for Ion) as well as debuggers.
(In reply to Benjamin Bouvier [:bbouvier] from comment #10)
> - we don't need to maintain FP in the stack overflow exit as it eventually
> jumps to the throwLabel, and FP is popped in the throw exit.

Almost: the overflow exit isn't called, it is jumped to, so the caller's fp *is* the overflow exit's fp.

> - the interrupt exit handles exceptional behavior, which doesn't change
> semantics of code running, and thus shouldn't appear in any form of
> stack-walking. For this reason, we don't need to maintain FP in the
> GenerateInterruptExit.

Same reasoning here as with the overflow exit (except there isn't an explicit 'jump').

> Clearing the review flag, as I would like to see the answers to these
> questions. Also, I understand Douglas' concerns about the impact this can
> have on speed for ARM platforms. Do you have some measurements of the perf
> impact on ARM platforms, especially on low-end devices?

I was hoping to measure this by just pushing and seeing what happens on awfy; that way we can see the effects on a bunch of different hardware at once.  I would only expect x86 to be an issue, though, as ARM/x64 have a bunch of registers.
Testing on the ARM (Flame), running the asm.js benchmarks from awfy, with the only change being the taking of a register for asm.js code generation. Result without a register taken / result a register taken.

box2d-throughput: 0.99490464
bullet-throughput: 0.9811837
lua_binarytrees-throughput: 0.9974996
zlib-throughput: 0.979948
asmjs-ubench: 0.97556335

Some of these show around a 2% loss of performance from just losing a register to the frame pointer.

These small difference are hard to measure and we also need to use some design knowledge to guide decisions.

Registers are a scarce resource. We might really need a register in future for a pointer to thread local storage, or it might be possible to usefully allocate one of the asm.js module closure variables to a register such as an emscripten stack pointer.

Performance gains are getting harder to obtain, and 2% is very large - we could invest many man months chasing a 2% gain.

If there is an alternative to dedicating a register to the frame pointer then I challenge you to explore it.
(In reply to Douglas Crosher [:dougc] from comment #16)
Thanks for measuring.  2% is higher than I expected for ARM, but I forgot that, even though ARM has 16 registers, 6 are already unavailable (pc, lr, sp, ip, asm.js global, asm.js heap), leaving ARM with only 3 more GPRs than x86.

Do you suppose you could measure just the overhead of the push;mov/pop in the prologue/epilogue (w/o taking the FramePointer)?  This is something we could reduce (by eliding for leaf functions that don't contain loops or calls).

> We might really need a register in future for a pointer to thread local storage

That is, effectively, the asm.js global register.

> If there is an alternative to dedicating a register to the frame pointer then I challenge you to explore it.

The obvious option is recompiling when profiling is enabled, which I had really wanted to avoid.  2% on ARM does seem significant enough to warrant the extra complexity, though.
Yet another strategy I had been considering was to use a particular word on the heap as, effectively, 'fp'.  The advantage of this strategy is that, since it doesn't consume a register, we can toggle profiling by patching the prologue/epilogue in-place.  This still leaves the overhead, in non-profiling mode, of several nops in the prologue/epilogue, but that can be erased by patching all callsites (and func-ptr-tables) to point after the profiling prologue and patching over the profiling epilogue with the ret.  I abandoned this earlier b/c of some other difficulties and maintaining a frame-pointer seemed so much more elegant, but perhaps this is the best way forward as it avoids full recompilation.
(In reply to Luke Wagner [:luke] from comment #19)
> Yet another strategy I had been considering was to use a particular word on
> the heap as, effectively, 'fp'.

Pushing a tag word on the stack at a known position in the frame?

> The advantage of this strategy is that,
> since it doesn't consume a register, we can toggle profiling by patching the
> prologue/epilogue in-place.  This still leaves the overhead, in
> non-profiling mode, of several nops in the prologue/epilogue, but that can
> be erased by patching all callsites (and func-ptr-tables) to point after the
> profiling prologue and patching over the profiling epilogue with the ret.

I may misunderstand the strategy, but using a tag word on the stack would not be reliable on its own, but it could be used to improve the plausibility of stack traces. However finding a known return address trace is probably just as good. You already have the stack depth at the call return sites, and following these already gives confidence in a stack trace.

1. SIGPROF. Get the PC from the context. Look up the address in the asm.js function code ranges to map to info for the asm.js function.

2. Use the asm.js function info and the PC to determine a likely stack depth. This would first check the recorder stack depths at the call return sites, otherwise use a fall-back stack depth. If necessary inspect the instructions to handle some known sequences that modify the stack depth, or add more stack depth info. Perhaps store the range of stack depths for the function to allow multiple attempts to find a plausible stack trace.

3. Get the SP from the context. Add the likely stack depth to create an estimated frame pointer. Validate this as being within range of the stack. Is the asm.js module entry stack pointer already recorded? If so then this would be useful for range checking.

4. Read the return address from the stack at the estimated frame pointer, and lookup the associated function and return site stack depth. If not found then it is not a valid stack trace, return false. If this frame looks ok then continue walking the stack validating the frames. While a stack frame tag word could be added, finding an trace of expected return addresses is probably just as good. If the search fails then try another stack depth (if searching a range).

The % of valid versus invalid stack traces found could be recorded as one measure of how well this works.

It would be interesting to know the range of stack depths within a function? It might be that no stack depths within range for one function could allow a bad stack depth estimate to skip the immediate functions frame to the next frame.

This would leave only an bad stack depth estimate pointing to a stale return address trail as a source of error.

For development it would be possible to try all stack depths within range for the function and note how many lead to valid stack traces. If not well constrained then it might be necessary to poison the trail or sharpen the stack depth estimates etc.
(In reply to Douglas Crosher [:dougc] from comment #20)
> (In reply to Luke Wagner [:luke] from comment #19)
> > Yet another strategy I had been considering was to use a particular word on
> > the heap as, effectively, 'fp'.
> 
> Pushing a tag word on the stack at a known position in the frame?

No, on the heap.  Probably in the global data section (as this is easy to get to on all archs).  This heap fp would be treated just like a real fp register and get pushed/popped on the stack in the epilogue/prologue.  With this, we should always get precise unwinding.  Importantly, though, it doesn't require recompilation, just patching.
(In reply to Luke Wagner [:luke] from comment #21)
> (In reply to Douglas Crosher [:dougc] from comment #20)
> > (In reply to Luke Wagner [:luke] from comment #19)
> > > Yet another strategy I had been considering was to use a particular word on
> > > the heap as, effectively, 'fp'.
> > 
> > Pushing a tag word on the stack at a known position in the frame?
> 
> No, on the heap.  Probably in the global data section (as this is easy to
> get to on all archs).  This heap fp would be treated just like a real fp
> register and get pushed/popped on the stack in the epilogue/prologue.  With
> this, we should always get precise unwinding.  Importantly, though, it
> doesn't require recompilation, just patching.

Great idea. This would add just one instruction of overhead at the function entry on the ARM, to store the stack in a defined location in the global data section, assuming the offset is small - a small negative offset is also an option. Would not expect this to impact performance much. Could ignore the patching of function entry addresses for a start, but keep it as a future option.

If the stack depth at the return sites is kept then it would also be possible to avoid pushing the previous frame pointer on the stack at function entry.
(In reply to Douglas Crosher [:dougc] from comment #22)
Hmm, I don't think we can avoid the push pop of global.fp in the prologue/epilogue.  The reason to save it isn't unwinding (for that we already have CallSite); the reason is that, when I return to my caller, I've clobbered global.fp so I need to restore it (just like for a real fp register).

Unfortunately, I think that means we're talking about 3 instruction prologue on arm:
  load global.fp, scratch
  push scratch
  store sp, global.fp
and a two instruction epilogue.

I think the patching won't be too bad.  Instead of what I said previously, I thinking that, for each function/trampoline, we generate a stub:
  profiling-prologue
  call real-function
  profiling-epilogue
In non-profiling mode, all calls go directly to real-function.  When profiling mode is enabled, all CallSites are patched to point to the real-function's stub.  Fortunately, we already have the address of all callsites (in CallSite, used for stack-unwinding).  All indirect calls load from the func-ptr-tables, so we'd patch those too.  This is preferable to trying to stick the profiling-prologue/epilogue in the function directly since the profiling-prologue affects the static stack depth.
Summary: OdinMonkey: maintain frame pointer in asm.js code → OdinMonkey: add async stack-walking support
Depends on: 1030446
Attachment #8443086 - Attachment is obsolete: true
Comment on attachment 8443092 [details] [diff] [review]
fp-stack-iter

I'll make a go at the above strategy, starting with some prep work in bug 1030446.
Attachment #8443092 - Attachment is obsolete: true
(In reply to Luke Wagner [:luke] from comment #23)
> (In reply to Douglas Crosher [:dougc] from comment #22)
> Hmm, I don't think we can avoid the push pop of global.fp in the
> prologue/epilogue.  The reason to save it isn't unwinding (for that we
> already have CallSite); the reason is that, when I return to my caller, I've
> clobbered global.fp so I need to restore it (just like for a real fp

Yes, good point.

> I think the patching won't be too bad.  Instead of what I said previously, I
> thinking that, for each function/trampoline, we generate a stub:
>   profiling-prologue
>   call real-function
>   profiling-epilogue
> In non-profiling mode, all calls go directly to real-function.  When
> profiling mode is enabled, all CallSites are patched to point to the
> real-function's stub.  Fortunately, we already have the address of all
> callsites (in CallSite, used for stack-unwinding).  All indirect calls load
> from the func-ptr-tables, so we'd patch those too.  This is preferable to
> trying to stick the profiling-prologue/epilogue in the function directly
> since the profiling-prologue affects the static stack depth.

Yes, another great idea. Thank you for taking another look at this.
Thanks for measuring; saved a landing and backout :)
Arg, I implemented the above thunk method but, of course, it totally breaks the stack layout of stack arguments as well as Win64/MIPS shadow stack space.  One option is to copy everything above the original sp to recreate the frame in the thunk... not fun.  Instead, I think we can do something rather like this:

profiling_prologue:
  load *fp, tmp
  store sp, *fp
  push tmp
  sub sp, frameSize
  j body
prologue:  # callsites always call here, profiling mode or not
  [push lr] # maybe
  nop  # patched to 'j profiling_prologue' when profiling is enabled
  check stack overflow (maybe)
  sub sp, frameSize + sizeof(void*)
body:
  ...
epilogue:
  nop  # patched to 'j profiling_epilogue' when profiling is enabled
  inc sp, frameSize + sizeof(void*)
  ret
profiling_epilogue:
  pop tmp
  store tmp, *fp
  inc sp, frameSize
  ret

With this design, the frame always has an AsmJSFrameSize = 2*sizeof(void*), but in non-profiling mode, the second word is uninitialized.  Thus, the only real overhead in non-profiling mode is the nop.  Putting the profiling_prologue/epilogue right before/after the prologue/epilogue means the nops are only 2 bytes (on x86).
(In reply to Luke Wagner [:luke] from comment #27)
> Arg, I implemented the above thunk method but, of course, it totally breaks
> the stack layout of stack arguments as well as Win64/MIPS shadow stack
> space.  One option is to copy everything above the original sp to recreate
> the frame in the thunk... not fun.  Instead, I think we can do something
> rather like this:

>   profiling-prologue
>   call real-function
>   profiling-epilogue

I understood this as setting up a new call frame with the arguments. Only arguments passed on the stack would need to be copied. Should work, just like any other trampoline.

I guess the patchable nop/jump could work too.

Or back to recording some stack depths and seeking plausible traces.
Attached patch WIP (obsolete) — Splinter Review
I still have some tidying up and more testing to do, so this is still WIP, but I'm almost done.  This patch (which is really 5 patches combined) ultimately adds a JS::ProfilingFrameIterator to the public JS API and shell tests that use single-stepping in the ARM simulator to verify that we get the right stack at every instruction.  Thus, with this patch, when profiling is enabled, we'll do all the patching mentioned below, but nothing will actually use the maintained fp.  It'll be nice to land this ahead of the SPS work (which I'll do in another bug) so we can get earlier testing.

A few notes on the strategy I went with:
 - The in-memory fp is stored in the AsmJSActivation.  This adds an extra indirection when setting fp but it really simplifies things compared to storing fp in the globalData and dealing with nested activations.
 - At compile-time, both profiling and normal prologues/epilogues are compiled.  The profiling prologue/epilogue are normally dead code.  When profiling is enabled the code is patched to run the profiling prologue/epilogue instead.  For the prologues, this is achieved by patching all callsites to instead call the profiling prologue.  For epilogues, there is a nop right before the normal epilogue that gets patched into a jump to the profiling epilogue.

While I get it ready for final review, Douglas, do you suppose you could look at the parts I commented with "dougc" and tell me what the "right" way to do this instruction patching is?  Any other comments are welcome too, of course.
Attachment #8455867 - Flags: feedback?(dtc-moz)
Attached patch WIP (obsolete) — Splinter Review
Oops, hg export'd too much.
Attachment #8455867 - Attachment is obsolete: true
Attachment #8455867 - Flags: feedback?(dtc-moz)
Attachment #8455868 - Flags: feedback?(dtc-moz)
Attached patch rm-unused-stateSplinter Review
Unrelated, but I found a dead field.  This used to control some expensive assertions in StackIter back in the bad old days.
Attachment #8457409 - Flags: review?(jdemooij)
Attachment #8455868 - Flags: feedback?(dtc-moz)
This patch stores the current AsmJSActivation* in the global data.  On x64 this saves a 64-bit immediate move and on ARM this saves a movw/movt.  This is useful for subsequent patches which will want to get the current AsmJSActivation* in every prologue.  In addition to performance, this reduces code bloat and AsmJSModule size (static link data for patching the AsmJSImm_Runtime).
Attachment #8457412 - Flags: review?(dtc-moz)
Attached patch avoid-passing-cxSplinter Review
This patch removes the explicit cx arg to various C++ functions called from asm.js since it can be retrieved instead from TLS.  In particular, this simplifies the stack-overflow call refactoring in the next patch.
Attachment #8457414 - Flags: review?(dtc-moz)
Attached patch redo-exitfpSplinter Review
This patch refactors how exitFP is set (again, I'm done now :) so that, in the next patch, we can use the exact same prologue/epilogue codegen for normal asm.js functions as well as exits.  This allows for there to be a single 'fp' field instead of having a separate 'fp' and 'exitFP' which simplifies everything.
Attachment #8457417 - Flags: review?(dtc-moz)
Attached patch profiling-fp (obsolete) — Splinter Review
This patch does what comment 29 described.  The new public JS API is JS::ProfilingFrameIterator; this is what, in a separate bug, will be used in the SPS integration.  A few other notes:
 - ProfilingFrameIterator will later be extended to iterate over all other JS activations, hence the separation between ProfilingFrameIterator and AsmJSProfilingFrameIterator.
 - the AlignedStorage hackery is to avoid needing to expose AsmJSProfilingFrameIterator (and, later the Ion/Baseline/Interp iterators) in public headers
Attachment #8457423 - Flags: review?(dtc-moz)
Attachment #8455868 - Attachment is obsolete: true
Also, a few measurements:
 - enabling/disabling profiling mode takes about 50ms on the Unity DT2 demo; I'm pretty happy with that
 - the overall AsmJSModule size increase is ~10% on the big asm.js apps I looked at.  The main reason for this is that these modules have >100,000 functions and this patch adds approximately 50 bytes of overhead per function (for the profiling prologue/epilogue).  Also, CodeRanges get 2 words bigger.
Comment on attachment 8457412 [details] [diff] [review]
activation-in-global-data

Review of attachment 8457412 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.
Attachment #8457412 - Flags: review?(dtc-moz) → review+
Comment on attachment 8457414 [details] [diff] [review]
avoid-passing-cx

Review of attachment 8457414 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.
Attachment #8457414 - Flags: review?(dtc-moz) → review+
Comment on attachment 8457417 [details] [diff] [review]
redo-exitfp

Review of attachment 8457417 [details] [diff] [review]:
-----------------------------------------------------------------

Great work.

Probably a silly question but could the activation be stored in the global data, rather than a pointer to the activation, to save an indirection?

You mentioned that storing the fp in the global data was problematic, so I guess this would have the same issues, copying state around?
Attachment #8457417 - Flags: review?(dtc-moz) → review+
(In reply to Douglas Crosher [:dougc] from comment #39)
> Probably a silly question but could the activation be stored in the global
> data, rather than a pointer to the activation, to save an indirection?
> 
> You mentioned that storing the fp in the global data was problematic, so I
> guess this would have the same issues, copying state around?

Not a silly question; you're right it could.  The annoyance is dealing with nested AsmJSActivations clobbering and restoring and, more importantly, figuring out the right fp given an AsmJSActivation.  All eminently solvable, but I felt like the one extra indirection in profiling mode wasn't worth the complexity; definitely something we can do later.  I think the target for Ion-FFIs, though, is that asm.js and Ion share a single JitActivation and thus the IonFFI thunk isn't an exit.
Comment on attachment 8457409 [details] [diff] [review]
rm-unused-state

Review of attachment 8457409 [details] [diff] [review]:
-----------------------------------------------------------------

Ah, nice.
Attachment #8457409 - Flags: review?(jdemooij) → review+
(Oh there's a call to enableStackWalkingAssertion in js/src/tests/ecma_5/Object/defineProperty-setup.js that you can also remove now.)
This patch saves about .7mb by changing how absolute address link data is stored.  It also prepares for the next patch.
Attachment #8458298 - Flags: review?(dtc-moz)
Attached patch builtin-thunksSplinter Review
This is the last patch.  This patch injects a thunk between a builtin callsites and the native callee (e.g., between a call to Math.sin and the sin function).  Otherwise, if the profiler samples during 'sin', fp will point to the caller's frame and unwinding will start at the caller of the caller of sin which is terrible.  With this patch, the profile will explicitly include the builtin 'sin' (even when only JS profiling is active) and its caller.
Attachment #8458300 - Flags: review?(dtc-moz)
Blocks: 1040390
Comment on attachment 8458298 [details] [diff] [review]
optimize-absolute-links

Review of attachment 8458298 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Might even help separate the pre-adjusted offsets from their actual-offsets.
Attachment #8458298 - Flags: review?(dtc-moz) → review+
Comment on attachment 8458300 [details] [diff] [review]
builtin-thunks

Review of attachment 8458300 [details] [diff] [review]:
-----------------------------------------------------------------

Looks ok. Have a query about the usage of AutoForbidPools in another one of the patches, and this usage might need changing here too.

Looks like there is no cost when not profiling, the call sites are patched?
There's not a lot of code bloat as there are only a few builtins?
Just the overhead of recording the call sites to patch?

If this were just needed to record the builtin functions in the trace and if these could be recorded against the caller asm.js leaf function then it might not be worth it. You claim this is needed to be able to record the asm.js leaf function too? If so then what happens if a asm.js leaf function a interrupted? Just curious if there might have been other options.
Attachment #8458300 - Flags: review?(dtc-moz) → review+
Comment on attachment 8457423 [details] [diff] [review]
profiling-fp

Review of attachment 8457423 [details] [diff] [review]:
-----------------------------------------------------------------

Nice work.

I have some concerns about the hard coded instruction sizes on the ARM, and in particular the ARMv6. Could you please double check that the jit-tests pass in the ARM simulator using ARMv6 code generation. This can be done by setting the ARMHWCAP environment variable to "vfp". The tbpl does not check the ARMv6 code generation. For some test coverage of pool placement issues you could try the tests with the environment variable ARM_ASM_NOP_FILL set to 1. You could try some of the large asm.js demos in the browser too using these.

Comments below are mainly suggestions. I don't expect them to be addressed before landing.

::: js/public/ProfilingFrameIterator.h
@@ +22,5 @@
> +// This iterator can be used to walk the stack of a thread suspended at an
> +// arbitrary pc. To provide acurate results, profiling must have been enabled
> +// (via EnableRuntimeProfilingStack) before executing the callstack being
> +// unwound. Stack iteration can only occur while the thread being unwound is
> +// suspended.

Is this true. A thread could walk it's own stack after being interrupted, but not suspended?

::: js/src/jit/AsmJS.cpp
@@ +1465,2 @@
>          JS_ASSERT(!finishedFunctionBodies_);
>          masm_.align(AsmJSPageSize);

btw this will only align the offset from the start of the code to a page boundary. Hope we have a check elsewhere to ensure that the start is page aligned.

::: js/src/jit/AsmJSFrameIterator.cpp
@@ +127,5 @@
> +static void
> +GenerateProfilingPrologue(MacroAssembler &masm, unsigned framePushed, AsmJSExitReason reason,
> +                          Label *begin)
> +{
> +    DebugOnly<uint32_t> sizeAtBegin = masm.size();

Could you avoid using size() and use currentOffset()?

@@ +131,5 @@
> +    DebugOnly<uint32_t> sizeAtBegin = masm.size();
> +    masm.bind(begin);
> +
> +    PushRetAddr(masm);
> +    JS_ASSERT(PushedRetAddr == masm.size() - sizeAtBegin);

Same here, and search for other uses of size() and see if they can be replaced with currentOffset.

@@ +140,5 @@
> +    JS_ASSERT(PushedFP == masm.size() - sizeAtBegin);
> +
> +    masm.storePtr(StackPointer, Address(act, AsmJSActivation::offsetOfFP()));
> +    JS_ASSERT(StoredFP == masm.size() - sizeAtBegin);
> +

Is this the end of the region that really needs to have a known instruction layout? If so might it be possible to abstract this differently to keep the following out of the no-pool region?

@@ +142,5 @@
> +    masm.storePtr(StackPointer, Address(act, AsmJSActivation::offsetOfFP()));
> +    JS_ASSERT(StoredFP == masm.size() - sizeAtBegin);
> +
> +    if (reason != AsmJSNoExit)
> +        masm.store32(Imm32(reason), Address(act, AsmJSActivation::offsetOfExitReason()));

We should check that the 'reason' and the offsetOfExitReason are small enough that the ARM backend does not need to use a pool entry on the ARMv6 as this is not possible in the no-pool region. Would it matter if the ARMv7 emitted three instructions here, a movw/t and the store?

@@ +145,5 @@
> +    if (reason != AsmJSNoExit)
> +        masm.store32(Imm32(reason), Address(act, AsmJSActivation::offsetOfExitReason()));
> +
> +    if (framePushed)
> +        masm.subPtr(Imm32(framePushed), StackPointer);

Similar issue here, if the frame is large.

@@ +150,5 @@
> +}
> +
> +// Generate the inverse of GenerateProfilingPrologue.
> +static void
> +GenerateProfilingEpilogue(MacroAssembler &masm, unsigned framePushed, AsmJSExitReason reason,

I can not see an obvious reason that this gets wrapped in an AutoForbidPools region. Is it because some of the labels need to point directly at some instructions? If so then could this be documented and could you just wrap the label and instruction in an AutoForbidPools region.

@@ +189,5 @@
> +#if defined(JS_CODEGEN_ARM)
> +    AutoForbidPools afp(&masm);
> +#endif
> +
> +    GenerateProfilingPrologue(masm, framePushed, AsmJSNoExit, &labels->begin);

Can the scope of the AutoForbidPools region be ended here?

@@ +200,2 @@
>      PushRetAddr(masm);
>      masm.subPtr(Imm32(framePushed + AsmJSFrameBytesAfterReturnAddress), StackPointer);

The ARM backend might need a pool entry for a large immediate value.

@@ +213,4 @@
>          masm.branchPtr(Assembler::AboveOrEqual,
>                         AsmJSAbsoluteAddress(AsmJSImm_StackLimit),
>                         StackPointer,
>                         target);

And here.

@@ +237,5 @@
> +#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
> +    masm.twoByteNop();
> +#elif defined(JS_CODEGEN_ARM) || defined(JS_CODEGEN_MIPS)
> +    masm.nop();
> +#endif

Could you end the scope of the above AutoForbidPools region here. Is it just so that the label points to the Nop to be patched and not to a pool?

@@ +244,1 @@
>      masm.addPtr(Imm32(framePushed + AsmJSFrameBytesAfterReturnAddress), StackPointer);

A pool entry might be needed here, so it might be necessary to keep this outside the AutoForbidPools region.

@@ +246,5 @@
>      masm.setFramePushed(0);
>  
> +    // Profiling epilogue:
> +    masm.bind(&labels->profilingEpilogue);
> +    GenerateProfilingEpilogue(masm, framePushed, AsmJSNoExit, &labels->profilingReturn);

Wrap this is another AutoForbidPools region if necessary. Is it just this Epilogue that needs to be in the AutoForbidPools region? If so then consider moving the AutoForbidPools into GenerateProfilingEpilogue.

@@ +380,5 @@
> +    //  - for Math and other builtin calls, when profiling is activated, we
> +    //    patch all call sites to instead call through a thunk; and
> +    //  - for interrupts, we just accept that we'll lose the innermost frame.
> +    // However, we do want FFI trampolines to show up in callstacks (so that
> +    // they properly accumulate self-time) and for this we use the exitReason.

Oh, I see this answers some questions I ask about a prior patch, good.

@@ +479,5 @@
> +        uint32_t offsetInModule = ((uint8_t*)state.pc) - module_->codeBase();
> +        JS_ASSERT(offsetInModule < module_->codeBytes());
> +        JS_ASSERT(offsetInModule >= codeRange->begin());
> +        JS_ASSERT(offsetInModule < codeRange->end());
> +        uint32_t offsetInCodeRange = offsetInModule - codeRange->begin();

Would like to explore an alternative for the ARM that inspects the instructions, but if this works then it seems ok.

::: js/src/jit/AsmJSModule.cpp
@@ +1439,5 @@
> +        return;
> +
> +    // Conservatively flush the icache for the entire module.
> +    AutoFlushICache afc("AsmJSModule::setProfilingEnabled");
> +    setAutoFlushICacheRange();

Good.

@@ +1453,5 @@
> +
> +        uint8_t *callerRetAddr = code_ + cs.returnAddressOffset();
> +#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
> +        void *callee = JSC::X86Assembler::getRel32Target(callerRetAddr);
> +#else

#elif defined(JS_CODEGEN_ARM) ?

@@ +1454,5 @@
> +        uint8_t *callerRetAddr = code_ + cs.returnAddressOffset();
> +#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
> +        void *callee = JSC::X86Assembler::getRel32Target(callerRetAddr);
> +#else
> +        uint8_t *caller = callerRetAddr - 4;

Is there a label just after the calls? I should add it in as a requirement for the assembler that an offset or label just after an instruction is guaranteed to point to the end of the instruction, so that pools are not dumped after instructions.

@@ +1456,5 @@
> +        void *callee = JSC::X86Assembler::getRel32Target(callerRetAddr);
> +#else
> +        uint8_t *caller = callerRetAddr - 4;
> +        Instruction *callerInsn = reinterpret_cast<Instruction*>(caller);
> +        BOffImm calleeOffset;

MOZ_ASSERT(callerInsn->is<InstBLImm>());

@@ +1457,5 @@
> +#else
> +        uint8_t *caller = callerRetAddr - 4;
> +        Instruction *callerInsn = reinterpret_cast<Instruction*>(caller);
> +        BOffImm calleeOffset;
> +        callerInsn->as<InstBLImm>()->extractImm(&calleeOffset);

It appears more appropriate for this all to be in the backend, but it's fine here for now.

@@ +1459,5 @@
> +        Instruction *callerInsn = reinterpret_cast<Instruction*>(caller);
> +        BOffImm calleeOffset;
> +        callerInsn->as<InstBLImm>()->extractImm(&calleeOffset);
> +        Assembler::Condition c;
> +        callerInsn->extractCond(&c);

Is a condition other than Always expected?

@@ +1474,5 @@
> +#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
> +        JSC::X86Assembler::setRel32(callerRetAddr, newCallee);
> +#elif defined(JS_CODEGEN_ARM)
> +        // dougc: is there a nicer way to do this patching?
> +        *(uint32_t*)caller = OpBl | int(c) | BOffImm(newCallee - caller).encode();

Again this might be more appropriate in the backend.
Perhaps: new (callerInsn) InstBLImm(BOffImm(newCallee - caller), c);

@@ +1527,5 @@
> +        }
> +#elif defined(JS_CODEGEN_ARM)
> +        // dougc: is there a nicer way to do this patching?
> +        if (enabled) {
> +            ptrdiff_t jumpOffset = profilingEpilogue - jump;

MOZ_ASSERT(jump->is<InstNOP>());

@@ +1528,5 @@
> +#elif defined(JS_CODEGEN_ARM)
> +        // dougc: is there a nicer way to do this patching?
> +        if (enabled) {
> +            ptrdiff_t jumpOffset = profilingEpilogue - jump;
> +            *(uint32_t*)jump = OpB | int(Assembler::Always) | BOffImm(jumpOffset).encode();

Perhaps: new (jump) InstBImm(BOffImm(jumpOffset), Assembler::Always);

@@ +1529,5 @@
> +        // dougc: is there a nicer way to do this patching?
> +        if (enabled) {
> +            ptrdiff_t jumpOffset = profilingEpilogue - jump;
> +            *(uint32_t*)jump = OpB | int(Assembler::Always) | BOffImm(jumpOffset).encode();
> +        } else {

MOZ_ASSERT(jump->is<InstBImm>());

@@ +1530,5 @@
> +        if (enabled) {
> +            ptrdiff_t jumpOffset = profilingEpilogue - jump;
> +            *(uint32_t*)jump = OpB | int(Assembler::Always) | BOffImm(jumpOffset).encode();
> +        } else {
> +            *(uint32_t*)jump = 0xe320f000;

Perhaps: *jump = InstNOP();

::: js/src/shell/js.cpp
@@ +4987,5 @@
> +"enableSingleStepProfiling()",
> +"  This function will fail on platforms that don't support single-step profiling\n"
> +"  (currently everything but ARM-simulator). When enabled, at every instruction a\n"
> +"  backtrace will be recorded and stored in an array. Adjacent duplicate backtraces\n"
> +"  are discarded."),

Good testing.
Attachment #8457423 - Flags: review?(dtc-moz) → review+
(In reply to Douglas Crosher [:dougc] from comment #46)
> Looks like there is no cost when not profiling, the call sites are patched?

Correct

> There's not a lot of code bloat as there are only a few builtins?

Right, <1000 bytes.

> Just the overhead of recording the call sites to patch?

Indeed, we already had them (in the StaticLinkData for serialization patching purposes :)

> If this were just needed to record the builtin functions in the trace and if
> these could be recorded against the caller asm.js leaf function then it
> might not be worth it. You claim this is needed to be able to record the
> asm.js leaf function too? If so then what happens if a asm.js leaf function
> a interrupted? Just curious if there might have been other options.

The key difference is sampling inside jit code vs sampling outside jit code.  When you sample inside jit code, you have pc so you can get the enclosing function via lookupCodeRange(pc).  When outside jit code, the only breadcrumb you have to start unwinding the innermost jit code is fp.  Unfortunately, given the fp of the innermost frame, you can only start unwinding at the *caller* of the innermost frame b/c you have no way to find sp/fp.  Other options:
 - Store an AsmJSActivation::exitPC before calling out.  This is easy on ARM, but one cannot directly access pc on x86/x64, requiring tricks like __i686.get_pc_thunk.bx.  The other issue is where these store instructions go.  If they go in the caller, then they add overhead when not profiling (although you could nop+patching).  If they go in a thunk, then we're basically doing what we're doing here, saving only a few instructions, and adding a new concept (exitFP).
 - Storing an AsmJSActivation::exitSP before calling out (from which you could get the pc).  This has bad corner cases when sampling a thread that has been interrupted by a signal handler (you necessarily set the sp before it points to a valid frame).  It also requires the caller to push pc before the call on ARM/MIPS which complicates things and either adds overhead in the non-profiling case or requires nop+patching (which still has overhead).

With zero overhead and reusing the static link data, this seemed like the easiest way.  In a perfect world, we'd record full native unwind data so we'd unwind from C++ directly into JIT code, but that is no small task.  Maybe it'll be worth it to do it in the future, though.

As a nice side-benefit, it makes it easy to show Math builtins in the profile (which you wouldn't otherwise see in normal JS-only profiling mode).  In C+++JS-profiling-mode, we can filter these out (that's why they have a separate Kind in the iterator).
(In reply to Douglas Crosher [:dougc] from comment #47)
Great suggestions, thanks!

In particular, the AFPs used to need to cover the entire prologue/epilogue since CodeRange::beingToEntry and the other 2 used to be static constants.  However, that had other problems (immediate size being one) so I turned them into dynamic offsets.  I'll post a new patch that more narrowly scopes the AFPs.

> > +// This iterator can be used to walk the stack of a thread suspended at an
> > +// arbitrary pc. To provide acurate results, profiling must have been enabled
> > +// (via EnableRuntimeProfilingStack) before executing the callstack being
> > +// unwound. Stack iteration can only occur while the thread being unwound is
> > +// suspended.
> 
> Is this true. A thread could walk it's own stack after being interrupted,
> but not suspended?

I didn't really word this correctly; I just meant to say "don't walk the stack while the thread is running" but, that's kinda obvious, so I'll just remove that bit.

> btw this will only align the offset from the start of the code to a page
> boundary. Hope we have a check elsewhere to ensure that the start is page
> aligned.

Yeah, we already depend on page alignment and assert that the results of VirtualAlloc/mmap are page-aligned.

> > +        uint8_t *callerRetAddr = code_ + cs.returnAddressOffset();
> > +#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
> > +        void *callee = JSC::X86Assembler::getRel32Target(callerRetAddr);
> > +#else
> > +        uint8_t *caller = callerRetAddr - 4;
> 
> Is there a label just after the calls? I should add it in as a requirement
> for the assembler that an offset or label just after an instruction is
> guaranteed to point to the end of the instruction, so that pools are not
> dumped after instructions.

No label: basically, after generating a call instruction, we save masm.currentOffset() (which, for ARM, should be the address of the call instruction + 4) (see the CallSiteDesc call overloads in MacroAssembler-arm.h).  However, I guess there is one concern: in AsmJSModule::finish, we update CallSite::returnAddressOffset with actualOffset; will actualOffset ever move an offset that was before a pool (call-pc + 4) to after the pool?  Running with ARM_ASM_NOP_FILL=1 doesn't seem to expose any problems.
Attached patch profiling-fpSplinter Review
This patch also rebases over bug 986673, which adds a new exit stub.
Attachment #8457423 - Attachment is obsolete: true
Attachment #8458972 - Flags: review?(dtc-moz)
Comment on attachment 8458972 [details] [diff] [review]
profiling-fp

Review of attachment 8458972 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good for now, but can see some potential issues to explore in future.

::: js/src/jit/AsmJSFrameIterator.cpp
@@ +178,5 @@
> +
> +    // AsmJSProfilingFrameIterator assumes that there is only a single
> +    // instruction (whose offset is recorded by profilingReturn) after the store
> +    // which sets AsmJSActivation::fp to the caller's fp: the return. Use
> +    // AutoForbidPools to ensure that a pool (which includes a jump) inserted

"is not inserted"

@@ +189,5 @@
> +        masm.pop(Operand(act, AsmJSActivation::offsetOfFP()));
> +#else
> +        Register fp = ABIArgGenerator::NonArgReturnVolatileReg1;
> +        masm.pop(fp);
> +        masm.storePtr(fp, Address(act, AsmJSActivation::offsetOfFP()));

Is it also important that a pool not be inserted between the pop and storePtr?

@@ +257,5 @@
> +#if defined(JS_CODEGEN_ARM)
> +    // Flush pending pools so they do not get dumped between the profilingReturn
> +    // and profilingJump/profilingEpilogue labels since the difference must be
> +    // less than UINT8_MAX.
> +    masm.flushBuffer();

Bit of a hack that probably only works because of the simplicity of the current assembler buffer. The stack adjustment could also be loaded before entering the AFP region if the offset is large. If the current hack holds together as-is then it seems ok for now, and other options can be explored in future.

::: js/src/jit/AsmJSModule.cpp
@@ +1190,5 @@
> +    JS_ASSERT(profilingReturn_ - profilingJump <= UINT8_MAX);
> +    profilingJumpToProfilingReturn_ = profilingReturn_ - profilingJump;
> +
> +    JS_ASSERT(profilingReturn_ - profilingEpilogue <= UINT8_MAX);
> +    profilingEpilogueToProfilingReturn_ = profilingReturn_ - profilingEpilogue;

These could be encoded in units of instructions for the ARM and MIPS to give a little extra range but doing so would not solve the above issue.
Attachment #8458972 - Flags: review?(dtc-moz) → review+
(In reply to Luke Wagner [:luke] from comment #49)
> (In reply to Douglas Crosher [:dougc] from comment #47)

Thank you for the explaination.

> > > +        uint8_t *callerRetAddr = code_ + cs.returnAddressOffset();
> > > +#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
> > > +        void *callee = JSC::X86Assembler::getRel32Target(callerRetAddr);
> > > +#else
> > > +        uint8_t *caller = callerRetAddr - 4;
> > 
> > Is there a label just after the calls? I should add it in as a requirement
> > for the assembler that an offset or label just after an instruction is
> > guaranteed to point to the end of the instruction, so that pools are not
> > dumped after instructions.
> 
> No label: basically, after generating a call instruction, we save
> masm.currentOffset() (which, for ARM, should be the address of the call
> instruction + 4) (see the CallSiteDesc call overloads in
> MacroAssembler-arm.h).  However, I guess there is one concern: in
> AsmJSModule::finish, we update CallSite::returnAddressOffset with
> actualOffset; will actualOffset ever move an offset that was before a pool
> (call-pc + 4) to after the pool?

It seems very reasonable to add a requirement to the assembler that a pool is not placed after an instruction and before a call to currentOffset(). The NOP fill is only placed before instructions, not after.

> Running with ARM_ASM_NOP_FILL=1 doesn't seem to expose any problems.

A large amount of fill might also be a useful test given the use of deltas and their limited range. But such tests will bloat the code and be very slow.
Comment on attachment 8458972 [details] [diff] [review]
profiling-fp

Review of attachment 8458972 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/AsmJSFrameIterator.cpp
@@ +257,5 @@
> +#if defined(JS_CODEGEN_ARM)
> +    // Flush pending pools so they do not get dumped between the profilingReturn
> +    // and profilingJump/profilingEpilogue labels since the difference must be
> +    // less than UINT8_MAX.
> +    masm.flushBuffer();

A good alternative might be to increase the delta encoding size to 16 bits and then the above masm.flushBuffer() could be removed. This increase could be specific to the ARM and MIPS and just for delta to the epilogue. Perhaps some of the other deltas use less than 8 bits, and for the ARM and MIPS the deltas could be in units of 4 byte instructions, so these might be repacked in future.
(In reply to Douglas Crosher [:dougc] from comment #51)
> Is it also important that a pool not be inserted between the pop and
> storePtr?

Nope, that's fine since noone is looking at state.sp.  It matters in the prologue case only because we haven't yet stored fp.

(In reply to Douglas Crosher [:dougc] from comment #53)
> A good alternative might be to increase the delta encoding size to 16 bits
> and then the above masm.flushBuffer() could be removed. This increase could
> be specific to the ARM and MIPS and just for delta to the epilogue. Perhaps
> some of the other deltas use less than 8 bits, and for the ARM and MIPS the
> deltas could be in units of 4 byte instructions, so these might be repacked
> in future.

That's a good idea, but I just don't feel flushBuffer is that much of a hack.  Increasing the size of CodeRange (there are 150,000 of these for Unreal/Unity apps) and doing #ifdef and/or bitfield hacking have their own complexity cost.
This seems to be what was intended and fixes the JS_ION_PERF build.
Attachment #8459871 - Flags: review?(luke)
Comment on attachment 8459871 [details] [diff] [review]
fix-ion-perf-build.patch

Thanks!
Attachment #8459871 - Flags: review?(luke) → review+
Blocks: 1041868
Blocks: 1043453
No longer blocks: 1043453
Depends on: 1043453
Comment on attachment 8458972 [details] [diff] [review]
profiling-fp

Review of attachment 8458972 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/AsmJSFrameIterator.cpp
@@ +128,5 @@
> +static void
> +GenerateProfilingPrologue(MacroAssembler &masm, unsigned framePushed, AsmJSExitReason reason,
> +                          Label *begin)
> +{
> +    Register act = ABIArgGenerator::NonArgReturnVolatileReg0;

Just a heads up that we have a problem here on the ARM. For the built in calls the NonArgReturnVolatileReg0, which is r4, is not volatile and the code expects it to be preserved. Suggest holding back patches that enable the use of this until this is resolved. Looking into it.
(In reply to Douglas Crosher [:dougc] from comment #61)
I think we're safe: GenerateAsmJSEntryPrologue doesn't call GenerateProfilingPrologue.  Initially I did, but ran into other issues.  Given this important implicit constraint, I should probably re-linline GenerateAsmJSEntry(Prologue|Epilogue) to make this dependency explicit.

Btw, with this one case removed, all the "volatile" in the name NonArgReturnVolatileReg is meaningless.  I can put either volatile or non-volatile regs in and it works fine.  (In fact, inexplicably, x64 uses one volatile AND one non-volatile.)

I'll write a patch to tidy these up.
Attached patch change-act-reg (obsolete) — Splinter Review
Actually, dougc pointed out on IRC that:
 - the builtin thunk is called by asm.js and the caller may assume, since they are calling C++, that non-volatiles are clobbered
 - while almost all callers currently don't assume this, ARM soft div/mod calls do (they don't specify isCall()=true, which spills everything)
Attachment #8462717 - Flags: review?(dtc-moz)
(Nice catch, btw!)
Comment on attachment 8462717 [details] [diff] [review]
change-act-reg

Review of attachment 8462717 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, but there are still problems and I would like to see these fixed and to retest before r+. See attachment 8462717 [details] [diff] [review] for some suggested changes that I have been testing.

::: js/src/jit/AsmJSFrameIterator.cpp
@@ +140,5 @@
> +    // so we have to use the second scratch register (lr). That means we must be
> +    // not use any MacroAssembler methods that clobber secondScratchReg. To
> +    // catch bugs, temporarily invalidate secondScratchReg.
> +    Register act = lr;
> +    masm.setSecondScratchReg(InvalidReg);

This is a good idea - setting the second scratch register to InvalidReg.

@@ +175,5 @@
>      if (framePushed)
>          masm.subPtr(Imm32(framePushed), StackPointer);
> +
> +#if defined(JS_CODEGEN_ARM)
> +    masm.setSecondScratchReg(lr);

This could be moved above subPtr() and doing so might help make it clear to the reader that subPtr() could use the second scratch register.

@@ +184,5 @@
>  static void
>  GenerateProfilingEpilogue(MacroAssembler &masm, unsigned framePushed, AsmJSExit::Reason reason,
>                            Label *profilingReturn)
>  {
> +    // We need one scratch register for the profiling prologue. Because this

s/prologue/epilogue/

@@ +186,5 @@
>                            Label *profilingReturn)
>  {
> +    // We need one scratch register for the profiling prologue. Because this
> +    // prologue may be executed in a thunk entering or leaving from C++, we must
> +    // preserve the system ABI return address and non-volatile registers.

s/address//

@@ +190,5 @@
> +    // preserve the system ABI return address and non-volatile registers.
> +#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
> +    Register act = ecx;
> +#elif defined(JS_CODEGEN_ARM)
> +    Register act = r2;

Using r2 is probably ok, but we can use lr again. As above, invalidate the second scratch reg.

@@ +198,3 @@
>  
>      if (framePushed)
>          masm.addPtr(Imm32(framePushed), StackPointer);

This could be moved above the definition or 'act' - even more usefuly if using the lr for act.

@@ +201,5 @@
>  
>      masm.loadAsmJSActivation(act);
>  
>      if (reason != AsmJSExit::None)
>          masm.store32(Imm32(AsmJSExit::None), Address(act, AsmJSActivation::offsetOfExitReason()));

If using lr for act then store32_NoSecondScratch() can be used here.

@@ +220,2 @@
>          masm.pop(fp);
>          masm.storePtr(fp, Address(act, AsmJSActivation::offsetOfFP()));

The NonArgReturnReg1 is still non-volatile and must not be destructively modified here. Suggest using r12 for the ARM. If using lr for act then the second scratch register can be restored after storePtr().
Attachment #8462717 - Flags: review?(dtc-moz)
The copying of stack arguments, in GenerateBuiltinThunk(), would also be a problem for the ARM. None of the built in functions are using stack arguments on the ARM, but perhaps adding an assertion and source comment about the issues would help catch and explain the issue in case it comes up in future. The problem could be easily solved in future by using volatile registers when copy the stack arguments.
Attachment #8463084 - Attachment is obsolete: true
(In reply to Douglas Crosher [:dougc] from comment #66)
> @@ +190,5 @@
> > +    // preserve the system ABI return address and non-volatile registers.
> > +#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
> > +    Register act = ecx;
> > +#elif defined(JS_CODEGEN_ARM)
> > +    Register act = r2;
> 
> Using r2 is probably ok, but we can use lr again. As above, invalidate the
> second scratch reg.

r2 is less constrained so it seemed like the simpler solution.  With lr, we have to audit and there is a fair amount of codegen in the live range of 'scratch' audit.  It's only b/c the prologue requires we keep alive the args that we have to do the lr hack.

> The NonArgReturnReg1 is still non-volatile and must not be destructively
> modified here. Suggest using r12 for the ARM. If using lr for act then the
> second scratch register can be restored after storePtr().

Oops, I must not have qref'd before attaching the patch; my local version has two scratch registers in GenerateProfilingEpilogue to remove this.  Sorry!
Attached patch change-act-reg (obsolete) — Splinter Review
Attachment #8462717 - Attachment is obsolete: true
Attachment #8463111 - Attachment is obsolete: true
Attachment #8463407 - Flags: review?(dtc-moz)
Attached patch change-act-regSplinter Review
Oops, last patch was bogus.
Attachment #8463407 - Attachment is obsolete: true
Attachment #8463407 - Flags: review?(dtc-moz)
Attachment #8463494 - Flags: review?(dtc-moz)
Comment on attachment 8463494 [details] [diff] [review]
change-act-reg

Review of attachment 8463494 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. No problems with this patch in testing.
Attachment #8463494 - Flags: review?(dtc-moz) → review+
https://hg.mozilla.org/mozilla-central/rev/6f9dc979a994
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
Blocks: 1068885
Depends on: 1100083
You need to log in before you can comment on or make changes to this bug.