Closed Bug 855977 Opened 11 years ago Closed 11 years ago

SPS breakpad: disable stack scanning for profiling unwinds

Categories

(Core :: Gecko Profiler, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla23

People

(Reporter: jseward, Assigned: jseward)

References

Details

Attachments

(1 file, 1 obsolete file)

Breakpad will -- as a last-ditch measure -- use stack scanning to find 
probable return addresses.  This is useful when trying to extract info
from crashed out processes, but is problematic when doing profiling,
because the heuristic tends to insert frames which don't actually
exist, messing up the profiling results.  This is because there is no 
way to tell which values-that-look-like-return-addresses on the stack
are actually live, and which are dead ones created by now-defunct
stack frames, which haven't got overwritten.

I think it might be possible to do this without modifying breakpad,
since breakpad tells us the 'trustyworthyness' of each frame it gives
us.  So it would be possible to ignore untrustyworthy frames, or
(maybe better) truncate the stack at the first one encountered.
I thought we short-circuted this with the LocalDebugInfoSymbolizer?
Hmm.  LocalDebugInfoSymbolizer does have this

  // Lie to the stackwalker to short-circuit stack-scanning heuristics.
  virtual bool HasImplementation() { return false; }

and the only place it is called is from Stackwalker::InstructionAddressSeemsValid,
which causes that routine to return 'true'.

I can't remember any details about this now, but I have the feeling that 
this was done so as to stop stack scanning poking around with symbols to
try and figure out if an address is a plausible return address.  I don't
get the impression from reading the code, that LocalDebugInfoSymbolizer's
HasImplementation method would actually stop stack scanning from happening.

But maybe I misunderstand.
Ah, you are correct. So we have two options, I think:
1) Let the stack walker scan, but drop the rest of the stack on the floor when we find a frame that was found by scanning (a little wasteful, but not terrible)
2) Add an interface to google_breakpad::Stackwalker to disable stack scanning.
> So we have two options, I think:

I had planned to do (1) on the basis that it doesn't involve messing
with breakpad, and not messing with breakpad is generally a Good
Thing.  However, I'd also agree that (2) is a generally better
solution.
I should also maybe point out that (1) is pretty close to being a 1-liner 
on the SPS side of things, whereas (2) will inevitably involve a bunch of
plumbing.
Might as well just hook up the easy thing, then. If it's a problem we can look at the more complicated solution. The stack traces you get should be the same in either case anyway.
I'd vote for (1). Once we get this all hooked up and working I think the stack scanning could be use later but slightly differently if the results aren't accurate enough or we have system library that don't have proper unwind info.

(1) Stack scanning results wouldn't show up in the aggregation.
(2) You'd be able to click on the timeline and ask for the best guess for the source.
Assignee: nobody → jseward
Attached patch WIP patch, for consideration (obsolete) — Splinter Review
Here's a WIP patch I've been playing with.  I discovered a couple of
things with this.  The story is more complex than I thought.

Firstly, due to the way in which unwind results are returned from
do_breakpad_unwind_Buffer to its caller, we can't just break out of
the loop when we find a frame we don't like.  Doing so causes
uninitialised garbage to be fed through to the rest of the profiler.
This patch handles it correctly, but at the cost of a bit more
complexity.

Secondly, truncating stacks at the first untrustworthy frame does
indeed give more believable results on x86_64 Linux, at least.
However it does appear to mean no unwinds out of libc when (I think)
firefox is blocked in syscalls.  Which means in practice, no useful
data for thread(s) blocked in syscalls.  Not sure why breakpad can't
handle that case.  Maybe there is no CFI on the syscall assembly code.

Thirdly, as a result of the above, it's clear that truncating the
stacks on the SPS side rather than giving the problem to breakpad
(iow, using comment 3 option 1) is the better thing to do.  Because it
gives us flexibility.  Eg, we could try and work around the above
problem by accepting exactly on stack-scanned innermost frame.  Using
comment 3 option 2 would make that impossible.
(In reply to Julian Seward from comment #8)
> Eg, we could try and work around the above
> problem by accepting exactly one stack-scanned innermost frame.

Trivially done, with the above patch: change "if (dubious)" to 
"if (dubious && frame_index > 1)".

That helps in the case where we have (eg) libxul calling libc.so, but
not in the case where (eg) libxcb.so calls libc.so and there is no
unwind info for libxcb.so (for whatever reason).

Not sure what to do about this.
(In reply to Julian Seward from comment #8)
> Secondly, truncating stacks at the first untrustworthy frame does
> indeed give more believable results on x86_64 Linux, at least.
> However it does appear to mean no unwinds out of libc when (I think)
> firefox is blocked in syscalls.  Which means in practice, no useful
> data for thread(s) blocked in syscalls.  Not sure why breakpad can't
> handle that case.  Maybe there is no CFI on the syscall assembly code.

There are some symbol files for linux-gate.so in the Breakpad tree:
http://code.google.com/p/google-breakpad/source/browse/#svn%2Ftrunk%2Fsrc%2Fclient%2Flinux%2Fdata

If you're seeing stacks truncated in syscalls we should probably bake the equivalent of this into our LocalDebugInfoSymbolizer.
Blocks: 855466
Attached patch Revised patchSplinter Review
Turns off stack scanning by default, but allows it to be selectively
re-enabled using env var MOZ_PROFILER_STACK_SCAN=<number> where
<number> is the max number of dubious frames that can be accepted in a
stack trace, before it gets truncated.  This allows easy
experimentation whilst we are still in the
debugging-the-native-unwinder phase.

Also moves the declarations for sUnwind{Mode,Interval,StackScan} from
TableTicker.h to platform.h, which seems a more logical place for
them.
Attachment #732727 - Attachment is obsolete: true
Attachment #734037 - Flags: review?(bgirard)
(In reply to Julian Seward from comment #11)
> Also moves the declarations for sUnwind{Mode,Interval,StackScan} from
> TableTicker.h to platform.h, which seems a more logical place for
> them.

The reason this wasn't done before was because Sampler.h was the interface to the sampler and TableTicker.h was just an implementation. The idea was to allow multiple sampler to exists. But at this point it we wont be support other samplers so I've reverted that decision in my multi-thread patch and it's fine here as well.
Comment on attachment 734037 [details] [diff] [review]
Revised patch

Review of attachment 734037 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/platform.h
@@ +221,5 @@
> +void read_profiler_env_vars();
> +typedef  enum { UnwINVALID, UnwNATIVE, UnwPSEUDO, UnwCOMBINED }  UnwMode;
> +extern UnwMode sUnwindMode;       /* what mode? */
> +extern int     sUnwindInterval;   /* in milliseconds */
> +extern int     sUnwindStackScan;  /* max # of dubious frames allowed */

It would be nice to have a way to use INT_MAX/generally allow stack scan but for now this is just debugging so it's fine.
Attachment #734037 - Flags: review?(bgirard) → review+
https://hg.mozilla.org/mozilla-central/rev/f0063cd8a37b
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: