Closed Bug 855977 Opened 12 years ago Closed 12 years ago

SPS breakpad: disable stack scanning for profiling unwinds

Categories

(Core :: Gecko Profiler, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla23

People

(Reporter: jseward, Assigned: jseward)

References

Details

Attachments

(1 file, 1 obsolete file)

Breakpad will -- as a last-ditch measure -- use stack scanning to find probable return addresses. This is useful when trying to extract info from crashed out processes, but is problematic when doing profiling, because the heuristic tends to insert frames which don't actually exist, messing up the profiling results. This is because there is no way to tell which values-that-look-like-return-addresses on the stack are actually live, and which are dead ones created by now-defunct stack frames, which haven't got overwritten. I think it might be possible to do this without modifying breakpad, since breakpad tells us the 'trustyworthyness' of each frame it gives us. So it would be possible to ignore untrustyworthy frames, or (maybe better) truncate the stack at the first one encountered.
I thought we short-circuted this with the LocalDebugInfoSymbolizer?
Hmm. LocalDebugInfoSymbolizer does have this // Lie to the stackwalker to short-circuit stack-scanning heuristics. virtual bool HasImplementation() { return false; } and the only place it is called is from Stackwalker::InstructionAddressSeemsValid, which causes that routine to return 'true'. I can't remember any details about this now, but I have the feeling that this was done so as to stop stack scanning poking around with symbols to try and figure out if an address is a plausible return address. I don't get the impression from reading the code, that LocalDebugInfoSymbolizer's HasImplementation method would actually stop stack scanning from happening. But maybe I misunderstand.
Ah, you are correct. So we have two options, I think: 1) Let the stack walker scan, but drop the rest of the stack on the floor when we find a frame that was found by scanning (a little wasteful, but not terrible) 2) Add an interface to google_breakpad::Stackwalker to disable stack scanning.
> So we have two options, I think: I had planned to do (1) on the basis that it doesn't involve messing with breakpad, and not messing with breakpad is generally a Good Thing. However, I'd also agree that (2) is a generally better solution.
I should also maybe point out that (1) is pretty close to being a 1-liner on the SPS side of things, whereas (2) will inevitably involve a bunch of plumbing.
Might as well just hook up the easy thing, then. If it's a problem we can look at the more complicated solution. The stack traces you get should be the same in either case anyway.
I'd vote for (1). Once we get this all hooked up and working I think the stack scanning could be use later but slightly differently if the results aren't accurate enough or we have system library that don't have proper unwind info. (1) Stack scanning results wouldn't show up in the aggregation. (2) You'd be able to click on the timeline and ask for the best guess for the source.
Assignee: nobody → jseward
Attached patch WIP patch, for consideration (obsolete) — Splinter Review
Here's a WIP patch I've been playing with. I discovered a couple of things with this. The story is more complex than I thought. Firstly, due to the way in which unwind results are returned from do_breakpad_unwind_Buffer to its caller, we can't just break out of the loop when we find a frame we don't like. Doing so causes uninitialised garbage to be fed through to the rest of the profiler. This patch handles it correctly, but at the cost of a bit more complexity. Secondly, truncating stacks at the first untrustworthy frame does indeed give more believable results on x86_64 Linux, at least. However it does appear to mean no unwinds out of libc when (I think) firefox is blocked in syscalls. Which means in practice, no useful data for thread(s) blocked in syscalls. Not sure why breakpad can't handle that case. Maybe there is no CFI on the syscall assembly code. Thirdly, as a result of the above, it's clear that truncating the stacks on the SPS side rather than giving the problem to breakpad (iow, using comment 3 option 1) is the better thing to do. Because it gives us flexibility. Eg, we could try and work around the above problem by accepting exactly on stack-scanned innermost frame. Using comment 3 option 2 would make that impossible.
(In reply to Julian Seward from comment #8) > Eg, we could try and work around the above > problem by accepting exactly one stack-scanned innermost frame. Trivially done, with the above patch: change "if (dubious)" to "if (dubious && frame_index > 1)". That helps in the case where we have (eg) libxul calling libc.so, but not in the case where (eg) libxcb.so calls libc.so and there is no unwind info for libxcb.so (for whatever reason). Not sure what to do about this.
(In reply to Julian Seward from comment #8) > Secondly, truncating stacks at the first untrustworthy frame does > indeed give more believable results on x86_64 Linux, at least. > However it does appear to mean no unwinds out of libc when (I think) > firefox is blocked in syscalls. Which means in practice, no useful > data for thread(s) blocked in syscalls. Not sure why breakpad can't > handle that case. Maybe there is no CFI on the syscall assembly code. There are some symbol files for linux-gate.so in the Breakpad tree: http://code.google.com/p/google-breakpad/source/browse/#svn%2Ftrunk%2Fsrc%2Fclient%2Flinux%2Fdata If you're seeing stacks truncated in syscalls we should probably bake the equivalent of this into our LocalDebugInfoSymbolizer.
Blocks: 855466
Attached patch Revised patchSplinter Review
Turns off stack scanning by default, but allows it to be selectively re-enabled using env var MOZ_PROFILER_STACK_SCAN=<number> where <number> is the max number of dubious frames that can be accepted in a stack trace, before it gets truncated. This allows easy experimentation whilst we are still in the debugging-the-native-unwinder phase. Also moves the declarations for sUnwind{Mode,Interval,StackScan} from TableTicker.h to platform.h, which seems a more logical place for them.
Attachment #732727 - Attachment is obsolete: true
Attachment #734037 - Flags: review?(bgirard)
(In reply to Julian Seward from comment #11) > Also moves the declarations for sUnwind{Mode,Interval,StackScan} from > TableTicker.h to platform.h, which seems a more logical place for > them. The reason this wasn't done before was because Sampler.h was the interface to the sampler and TableTicker.h was just an implementation. The idea was to allow multiple sampler to exists. But at this point it we wont be support other samplers so I've reverted that decision in my multi-thread patch and it's fine here as well.
Comment on attachment 734037 [details] [diff] [review] Revised patch Review of attachment 734037 [details] [diff] [review]: ----------------------------------------------------------------- ::: tools/profiler/platform.h @@ +221,5 @@ > +void read_profiler_env_vars(); > +typedef enum { UnwINVALID, UnwNATIVE, UnwPSEUDO, UnwCOMBINED } UnwMode; > +extern UnwMode sUnwindMode; /* what mode? */ > +extern int sUnwindInterval; /* in milliseconds */ > +extern int sUnwindStackScan; /* max # of dubious frames allowed */ It would be nice to have a way to use INT_MAX/generally allow stack scan but for now this is just debugging so it's fine.
Attachment #734037 - Flags: review?(bgirard) → review+
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: