Closed Bug 1635987 Opened 4 years ago Closed 2 years ago

Allow native stack unwinding to recover after it gets lost in JavaScript JIT code, by exposing EnterJIT's saved register values to the profiler

Categories

(Core :: Gecko Profiler, enhancement, P2)

enhancement

Tracking

()

RESOLVED DUPLICATE of bug 1707489

People

(Reporter: mstange, Unassigned)

References

Details

This is a proposal for one solution of bug 1434402.

Problem statement: The Gecko profiler frequently captures broken native stacks for samples in JavaScript execution. That's because the stack walker cannot unwind through JS JIT code. Native stack walking aborts halfway through, and then the rest of the merged stack that's displayed in the profiler is just made of JS frames and profiler label frames, but no more native frames. This leads to very confusing profiles with bifurcated stacks, such as this one: https://perfht.ml/3bcnOtc - the two subtrees starting at "promise callback" should be the same subtree (both called by mozilla::CycleCollectedJSContext::AfterProcessTask), but they're split into different parts because they don't share the same native stack ancestor frames.

Bug 1426134 suggests one solution to this problem: Preserve framepointer values in JS JIT code so that the unwinder can keep unwinding through these functions.

In this bug, I want to suggest an alternative: The JS engine should give the profiler access to the saved register values that it pushes to the stack in the JIT entry trampolines. Then the profiler can restart stack walking at those places in the stack. More specifically:

  • The JIT entry trampoline should store the stack address of the saved register values into cx->_profilingActivation->asJit()->stackAddress.
  • The JS engine should expose a profiling API to iterate on-stack activation stackAddresses. It should probably also read the register values itself and expose them via architecture-specific structs, so that the profiler doesn't need to rely on the JS-internal stack layout.

(Something like this was implemented 6 years ago for 32 bit ARM in bug 914561, but it got broken in bug 1057082 (see bug 1578726 comment 1 for context) and is no longer used.)

This solution has the following advantages over bug 1426134:

  • It does not eat into the number of registers that are available to JIT code. This is important on 32 bit architectures with their limited number of registers.
  • It allows stack walking to resume in more cases: If Jit code is on the stack, but stack frames near the top of the stack are inside a library without unwinding information, then native stack walking already gives up in those top stack frames, before it even gets to the JIT frames. Having a safe point in the stack to restart stack walking would allow at least part of the stack to be recovered.

Implementation notes:

Before we call the enter jit code, we put a JitActivation on the stack. If the profiler is currently running, its constructor sets cx->_profilingActivation to itself.
So the JitActivation struct just needs a new field stackAddress.
The generated code in the enter JIT trampoline needs to put the address into the right place.
It needs to handle both profiler on and profiler off mode, because, according to some old comments I read, the Baseline JIT does not recompile JIT code if the profiler state is toggled.

So the generated code needs to be:

// just after pushing all the relevant registers:
cx->profilingActivation != null?
cx->profilingActivation is jit? (should always be true, I think)
get address to cx->profilingActivation->stackAddress
write stack pointer to cx->profilingActivation->stackAddress
emit jump dest labels

(this doesn't modify the stack or the stack pointer.)

The JSContext cx remains at the same address for the entire lifetime of the generated code. So we can hard-code its address (or actually the address of its _profilingActivation member) into the generated code for the trampoline.

On the profiler side, the following changes will be necessary:

  • For each stack walking method:
    • Make sure the method supports stopping early, i.e. make it respect an aStackEnd parameter.
    • When walking the native stack: Call the JS engine API to iterate stack activations. For each activation: Run stack walking until the activation's stack address, and re-initialize stack walking with the activation's recovered register values. At the end, walk the rest of the stack.

(In reply to Markus Stange [:mstange] from comment #0)

So the generated code needs to be:

// just after pushing all the relevant registers:
cx->profilingActivation != null?
cx->profilingActivation is jit? (should always be true, I think)
get address to cx->profilingActivation->stackAddress
write stack pointer to cx->profilingActivation->stackAddress
emit jump dest labels

Would it also work to just unconditionally write stack pointer to cx->activation->stackAddress? How relevant is the profilingActivation part of this?

Oh, sure! The profilingActivation part is not that relevant. When I looked at the JSContext fields, it just seemed easier to use profilingActivation because that's a raw pointer.

In case it helps prototyping or so, the masm code in the EnterJit trampoline would look like this:

masm.loadJSContext(reg);
masm.loadPtr(Address(reg, JSContext::offsetOfActivation()), reg);
masm.storeStackPtr(Address(reg, JitActivation::offsetOfStackAddress()));

With reg being an available Register and JitActivation::offsetOfStackAddress returning the offset of the new field.

I'm starting to read up on all this! I'll see what I can learn on my own, but I may request some help along the way...

An early thought:
Could part of this solution be done with labels?
In fact, I'm thinking (possibly naively) we could implement a first prototype of the start&stop stack-walker by using the existing label stack; it may even help a bit with the JIT issues, by losing fewer native frames.
And later we could switch to, or add, the JIT hints. What do you think?
And/or if labels do work well, maybe we could try to use labels from the JIT enter/exit code instead?

(In reply to Gerald Squelart [:gerald] (he/him) from comment #4)

Could part of this solution be done with labels?

I think so, what is the overhead of labels? Storing the last frame address?

In fact, I'm thinking (possibly naively) we could implement a first prototype of the start&stop stack-walker by using the existing label stack; it may even help a bit with the JIT issues, by losing fewer native frames.
And later we could switch to, or add, the JIT hints. What do you think?

Sounds good to me. The JIT stack frames has a specific form and values which are potentially discoverable in an automated way.

And/or if labels do work well, maybe we could try to use labels from the JIT enter/exit code instead?

The problem with that is that we do a lot of enter/exit, so the overhead of labels would really be the key point here.
Note that we do exit JIT code each time we enter a C++ function from the Jit.

Maybe a simple way would be to register the last JIT activation address in the profiler, which would then be capable of checking whether we are in JIT code or not by checking the JIT activation structure against the stack pointer of the thread. The JIT activation is already maintained to make the GC work.

Thinking about this for a few seconds more, I realized that labels don't actually know about actual stack frames, and where the PC & SP would be relative to the label's on-stack object. 🤦 So there would be no way to restart stack-walking from just a label. (Right?)
Sorry for the waste of time, I'll go back to studying your full solution...

I think the answer is different for platform / configurations that use a frame pointer for stack walking, and platforms / configurations that don't.
If you have a frame pointer, we could store the frame pointer value in the label frame, during the label frame's push method.
If you don't have a frame pointer, the stack walker needs a lot more register values to be able to resume stack walking. You would need to store all those register values in the label frame. There are two problems with that: 1. It bloats the label frame size, so it would need to be limited to a new "flavor" of label frames. And 2. The function that obtains the register values (before it stores them) might have allocated stack space for itself (depending on what the optimizer does). So parts the stack could change between obtaining the register values and entering the JS code. And the unwinder might try to get values from those parts of the stack that have changed.
So the safest thing is to save the register values from assembly code under our control (makes us independent from the optimizer), and at a place where we know that anything that's on the stack "before" the stored sp value will not change.

(In reply to Markus Stange [:mstange] from comment #7)

I think the answer is different for platform / configurations that use a frame pointer for stack walking, and platforms / configurations that don't.
If you have a frame pointer, we could store the frame pointer value in the label frame, during the label frame's push method.
If you don't have a frame pointer, the stack walker needs a lot more register values to be able to resume stack walking. You would need to store all those register values in the label frame. […]

On all architectures, we have a generateEnterJIT which is responsible for copy the content of all registers which are supposed to not change across a function call, such as:
https://searchfox.org/mozilla-central/rev/4dac9993b609fccc87e82682614faf2a44cda306/js/src/jit/x64/Trampoline-x64.cpp#62-94

This is used for C++ -> JIT, I do not know for C++ -> WASM case, as WASM is closer to the system ABI than our JITs.

In the spirit of being respectful of everyone's time, trying to find an efficient way to get this done, while I'm slowly learning more about our stack walkers and JIT frames, I'm proposing the following plan:

  1. I could start a stand-alone C++ prototype that would rework one of the stack walkers, to make it stop&start-capable.
  2. In parallel, one+ of you experts (:jandem, :mstange, :nbp) could code a quick&ugly generateEnterJIT patch for x64 (because that's my main hardware); no testing yet, only make sure it doesn't crash.
    Then I'll continue:
  3. Join the two, debug, build better tests, land it -- unless we think it should only land at the very end.
  4. Gradually patch more trampolines and stack walkers.

Please let me know what you think of the above. I'll start with 1 for now.
And if the plan is approved, who would volunteer for 2? 😁 If no one has the time, I can give it a try, but I may still need a bit of help; And conversely if someone would like to do more, I'm happy to let you have all the fun!

(In reply to Gerald Squelart [:gerald] (he/him) from comment #9)

Please let me know what you think of the above. I'll start with 1 for now.
And if the plan is approved, who would volunteer for 2? 😁 If no one has the time, I can give it a try, but I may still need a bit of help; And conversely if someone would like to do more, I'm happy to let you have all the fun!

The plan sounds good to me. If you need any help for step 2, you can ask in #spidermonkey.

If you are not confident with writing code using the MacroAssembler yet, I suggest looking at callWithABI and implement it in C++ in the mean time.

Alright, I took the plunge, and finally started fiddling with generateEnterJIT in Trampoline-x64.cpp. Thanks to all your comments (mostly comment 3 + tweaks) I managed to record the address of the "frame descriptor" into a new field in cx->activation_, and I can extract it and compare it to other available data... (Code in Try, for the curious.)

Wait a minute... "frame descriptor"? Other available data?? Is the work already done for me??? 🤔

If I read the code correctly, the info is mostly there:

  • In generateEnterJIT, there's a frame descriptor that helpfully indicates that this is a CppToJSJit frame, and the number of bytes in the frame, so it'd be possible to retrieve the previous stack frame pointer and the return address.
  • JS::ProfilingFrameIterator can go through all the frames, and in particular it gives us the "stackAddress", which is where the local return address is pushed right after the frame descriptor!

Glancing at the FrameIterator code, it seems possible that it could identify CppToJSJit frames (and I'm thinking that this could be stored in/near the ProfilingFrameIterator::Frame::kind), and extract the extra registers I would need to restart the stack walkers.
MergeStacks iterates through all frames anyway, so I think it wouldn't add much work to find these important frames, and use them with the stop&start walkers.

Before I go further into the woods, what do you experts think? TIA.

(In reply to Gerald Squelart [:gerald] (he/him) from comment #11)

If I read the code correctly, the info is mostly there:

  • In generateEnterJIT, there's a frame descriptor that helpfully indicates that this is a CppToJSJit frame, and the number of bytes in the frame, so it'd be possible to retrieve the previous stack frame pointer and the return address.

CppToJSJit have artificially empty stack frame, the size only correspond to the size of the Jit Frame, and does not account for the amount of space reserved for arguments.

  • JS::ProfilingFrameIterator can go through all the frames, and in particular it gives us the "stackAddress", which is where the local return address is pushed right after the frame descriptor!

Yes. You would still have to skip the list of arguments given to the JS function, and then we could potentially map a structure with everything which is alive on the stack, thus recovering the desired register spilled at the entry of the trampoline.

Glancing at the FrameIterator code, it seems possible that it could identify CppToJSJit frames (and I'm thinking that this could be stored in/near the ProfilingFrameIterator::Frame::kind), and extract the extra registers I would need to restart the stack walkers.

Instead of adding new field to the ProfilingFrameIterator::Frame, you could add a new Frame::kind, which can be use to help interpreter the existing content of what is spilled on the stack. Within which you will find the rsp, rbp registers, or any others that you might want to spill.

From what I understand you should not have to restart the stack walking as the ProfilingFrameITerator could be used first, collecting all the gap informations, and use this information later while walking the C++ stack.

Before I go further into the woods, what do you experts think? TIA.

I think this is a good idea!

Update: I got the prototype working!
Normal profile: https://share.firefox.dev/3qyPbWJ
With stack-fixing: https://share.firefox.dev/3ptbImt
The difference is most visible in the Stack Chart, it used to have lots of "cuts" where some stacks with Javascript would seem to be re-rooted on labels like "XREMain". Now the chart is much smoother, e.g., in "Web Content 3", only 3.4% of samples have incomplete native stacks, instead of 51% before.

Prototype code, very ugly, only Works For Me™ on my 64-bit Windows PC for now: https://hg.mozilla.org/try/rev/c3cadb221dd3dedf120e2995bc1ad850d0fc3e12

The main learning is that the enterJIT code doesn't actually have to change, the needed information is already there! Details:
In the profiler's MergeStacks, the JS::ProfilingFrameIterator lets us extractStack into JS::ProfilingFrameIterator::Frames in which the stackAddress points at the enterJIT return address, which was pushed right after the frame descriptor.
The descriptor contains a type, we're only interested in FrameType::CppToJSJit.
And it has a size, which helps skip the function arguments pushed on the stack.
From there, a few more fixed-size hops gets me to where the frame pointer was pushed, and the CALL return address follows.
With these last two registers, and the stack address after pretend-popping the return address, I can restart the stack walker. 🎉

At a glance the other platforms seem to have a similar shape (stack address pushed after frame descriptor, which helps skip function arguments, after which I can skip other fixed things to find the function return address). So I think there's a good chance this strategy will work everywhere.

Now to make all this prototype code cleaner...
I think that the "magic" code (here it is in the prototype) that can use the JS::ProfilingFrameIterator::Frame::stackAddress to retrieve the needed stack-walking registers, should be next to the generateEnterJIT function in each Trampoline-X.cpp file, because they're related: If generateEnterJIT is modified to push more or less stuff, we'll need to update the corresponding stack-walking-register-retrieval code as well.
And then I'll need to find a way to access this function from the profiler code.

Thoughts, suggestions, feedback, criticism?

Nice work! I haven't looked at this in too much detail yet, but here's one idea: To minimize the magic hopping when getting the values from the stack, it would be nice if all trampoline generators had a struct whose layout mirrors the trampoline's stack frame layout. The arm implementation has one: struct EnterJitStack, and the mips64 implementation also has a struct EnterJitRegs. Since we'll need to add the "magic" code next to all trampoline generators anyway, it might be nice to add these structs for all of them too, so that the magic code can look really simple.

Yes, great to have this work!

(In reply to Markus Stange [:mstange] from comment #14)

To minimize the magic hopping when getting the values from the stack, it would be nice if all trampoline generators had a struct whose layout mirrors the trampoline's stack frame layout. The arm implementation has one: struct EnterJitStack, and the mips64 implementation also has a struct EnterJitRegs. Since we'll need to add the "magic" code next to all trampoline generators anyway, it might be nice to add these structs for all of them too, so that the magic code can look really simple.

Yeah and ideally we'd share the trampoline code, having a separate copy for each platform is pretty annoying. Let me see if I can do something there today...

(In reply to Gerald Squelart [:gerald] (he/him) from comment #13)

[…] 🎉

At a glance the other platforms seem to have a similar shape (stack address pushed after frame descriptor, which helps skip function arguments, after which I can skip other fixed things to find the function return address). So I think there's a good chance this strategy will work everywhere.

[…]

Thoughts, suggestions, feedback, criticism?

Nice work!

One remark is that sp += 8; // skip r15 can in statically known stack usage be rewritten as

struct StackLayout { uint64_t r15; };
sp += sizeof(StackLayout);

Identically, you could reinterpret the stack pointer given the structure to fetch the corresponding value:

struct StackLayout { uint64_t r15; };
StackLayout* stack = reinterpret_cast<StackLayout*>(sp);
(void) stack->r15; // read r15 out of the stack

If you are going to make such structure/offset, I would ask that we can, at best share the code, at least assert the consistency between the JIT and the profiler.

Thank you both.
The prototype code was written as quickly as I could, doing lots of ugly reinterpret_casts to move around the stack and read thing.
I certainly intend to make it clearer. I'll try Markus' suggestion of having a big struct to map the end of the stack frame, to both read and skip things (as you suggested, Nicolas, IIUC).

(In reply to Nicolas B. Pierron [:nbp] from comment #16)

If you are going to make such structure/offset, I would ask that we can, at best share the code, at least assert the consistency between the JIT and the profiler.

I'm hoping this stack-reading code would live in the same Trampoline-X.cpp files, so it should be consistent with the JIT, being part of it.
I'm not sure what you're thinking of regarding consistency with the profiler? 🤔
In any case, I'll make sure I get feedback and reviews from knowledgeable people on all sides (JIT, profiler, stack walkers).

I'll circle around when I have a cleaner (not too hacky) proof of concept, and we can refine things...

(In reply to Gerald Squelart [:gerald] (he/him) from comment #17)

(In reply to Nicolas B. Pierron [:nbp] from comment #16)

If you are going to make such structure/offset, I would ask that we can, at best share the code, at least assert the consistency between the JIT and the profiler.

I'm hoping this stack-reading code would live in the same Trampoline-X.cpp files, so it should be consistent with the JIT, being part of it.
I'm not sure what you're thinking of regarding consistency with the profiler? 🤔

This is only a wishful thinking that any changes made to the trampoline or the profiler code do not go out of sync without a compilation failure, or a runtime failure.

Wow, time flies!
Update: I'm getting closer to proper code.
Before going too far on the JS side, could you (:jandem or :nbp) please have a look at https://hg.mozilla.org/try/rev/4e80bfd99a8d10e17fa8fa286f152a9ccad7dd6b , see if there are things I should do differently? (At a high level for now, we can hash out the line-by-line details and bikeshed names in later reviews.)

In summary, I'm adding JS::ProfilingFrameIterator::getCppEntryRegisters() (since the profiler already uses this iterator to find all JIT frames), which calls a platform-dependent jit::JitRuntime::getCppEntryRegisters(), which knows the shape of its CppToJit frames and can retrieve the registers I need to resume stack-walking.
There's only the implementation for x64 for now. I've added some descriptive structs as Markus&Nicolas suggested, very handy!

High-level approach looks good!

It would be nice to use EnterJITStackEntry more directly in generateEnterJIT too, especially here. There you can probably subtract sizeof(EnterJITStackEntry) from the stack pointer, and then fill in each field using storePtr and offsetof(EnterJITStackEntry, foo).

This sounds good!

Minor detail would be to rely on existing Frame structure, such as JitFrameLayout, and expect a pointer of this type instead of a void* pointer.

(In reply to Jan de Mooij [:jandem] from comment #20)

It would be nice to use EnterJITStackEntry more directly in generateEnterJIT too, especially here. There you can probably subtract sizeof(EnterJITStackEntry) from the stack pointer, and then fill in each field using storePtr and offsetof(EnterJITStackEntry, foo).

I was a bit afraid of that: Introducing this struct would result in requests to use it everywhere! 😱
I managed to use it for the xmm registers. For the other I may need more guidance, but I was also concerned that these storePtrs would be less efficient than the current pushes?
Anyway, I've just filed sub-bug 1700869 to handle the JS side of this bug, we can continue the discussion there...

(In reply to Nicolas B. Pierron [:nbp] from comment #21)

Minor detail would be to rely on existing Frame structure, such as JitFrameLayout, and expect a pointer of this type instead of a void* pointer.

Ah, thanks for the tip. I ended up using the base class CommonFrameLayout, it has everything I need.

(In reply to Gerald Squelart [:gerald] (he/him) from comment #22)

I managed to use it for the xmm registers. For the other I may need more guidance, but I was also concerned that these storePtrs would be less efficient than the current pushes?

Fair enough. Adding a comment suggesting updating the other function too is probably good enough for this.

QA Whiteboard: qa-not-actionable

I'm considering this bug fixed thanks to bug 1707489, which did pretty much what comment 0 suggested!

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Duplicate of bug: 1707489
Resolution: FIXED → DUPLICATE
You need to log in before you can comment on or make changes to this bug.