Open Bug 1209131 Opened 9 years ago Updated 2 years ago

[meta] Need a way to capture tagged stacks in Telemetry

Categories

(Toolkit :: Telemetry, defect, P4)

defect

Tracking

()

People

(Reporter: bholley, Unassigned)

References

(Depends on 4 open bugs)

Details

(Keywords: meta, Whiteboard: [measurement:client] [measurement:client:project])

When hardening our security, I often want to introduce a new invariant. My process for doing it looks roughly like this:

(1) MOZ_RELEASE_ASSERT the invariant.
(2) Push to try, fixing until green.
(3) Land to m-c, and watch for crash reports. Examine the stacks that come in, and fix them.

(3) is necessary because our test coverage is imperfect. Usually the churn isn't too bad, but sometimes it is (as it was for bug 1072150).

It would be really awesome if I could gather this data without making Nightly less stable. At some point we'd still want to land the release assertion so that we're sure the invariant holds, but we could avert a lot of unnecessary pain by doing so.
Flags: needinfo?(gfritzsche)
If we have a regular need for that i wonder if we should potentially have a keyed stack store for this kind of data.

Vladan, what do you think?
Could we re-use any of the existing functionality for hangs for this?

Or could we maybe add functionality on Telemetry to snapshot the stack and submit it under some string key, so we end up with (key -> (stack -> count)) mappings or similar?
Flags: needinfo?(gfritzsche) → needinfo?(vladan.bugzilla)
Do we actually have a way to capture the stack in opt builds? I'd think we'd need to involved Socorro somehow.
Nightly already walks its stack during jank events of 1 second or more (--enable-profiling makes this possible), but doing this regularly would add performance overhead. Stack walking when the browser is hung for over a second is different from adding stackwalking when the browser is responsive.

If we are able to address the performance concerns, I would prefer this info be stored in a custom section of the Telemetry payload. You should be able to re-use the chrome hangs or Background Thread Hangs data structures for this.
Flags: needinfo?(vladan.bugzilla)
For this kind of issue, capturing even just a few stacks per day per user would be useful. A developer would get a list of stacks, fix the top issues, wait for those to wash out, and then see what comes next.
One thing that jseward and I have discussed in the past is implementing an "unwinder service" whose sole responsibility is capturing stacks. The profiler would then just be another client that would request stacks from this service.

This situation could be yet another use case for an unwinder service.
> but doing this regularly would add performance overhead.

For Bobby's use case we're talking about things that should be rare enough that we're willing to consider MOZ_CRASHing on them in nightlies.  I suspect that the performance overhead is not a huge problem there.

And yes, we could capture the stack once per session per probe location or something like that to mitigate the concern over performance.
Ok. Georg, are you able to work on this?
Flags: needinfo?(gfritzsche)
Whiteboard: [measurement:client]
Not right now, i'll check if we can get it on our list for the near future.
If someone else has time, i think we can sketch out what should work between me and aklotz.
Flags: needinfo?(gfritzsche)
Ok.

Benjamin, just making sure you're aware of this prioritization given the impact that release-mode assertions are having on Firefox stability. It's a suboptimal programming pattern, but we currently lack something better.
Flags: needinfo?(benjamin)
Flags: needinfo?(benjamin)
I will mentor Iaroslav for this small project.
We will use this bug as the tracking bug and break the work down into smaller actionable bugs.

We will open a bug for a further design discussion soon.
Summary: Need a way to capture stacks / crash dump without actually making the build crash → [meta] Need a way to capture stacks / crash dump without actually making the build crash
For the context of this project we will not go for the unwinder service.
That seems like a separate improvement that doesn't have to block this bug.
MozStackWalk() can already do this.
Summary: [meta] Need a way to capture stacks / crash dump without actually making the build crash → [meta] Need a way to capture tagged stacks in Telemetry
Priority: -- → P3
Georg, how are you going to do this without the unwinder?
Flags: needinfo?(gfritzsche)
We already use do this for two hang Telemetry data collections, it's just moving this to an unwinder service that seems a bit involved for a short-term project.
I assume this could either share a common implementation or we use MozStackWalk() directly.


Any blockers there?
Flags: needinfo?(gfritzsche)
Oh no, I thought you meant that you weren't going to do client-side unwinding and were going to use minidumps instead. You should make the decision about the code structure.
Depends on: 1220120
Depends on: 1225851
Whiteboard: [measurement:client] → [measurement:client] [measurement:client:project]
Priority: P3 → P4
Depends on: 1299770
Depends on: 1299772
Depends on: 1299773
Depends on: 1307746
Depends on: 1316088
Depends on: 1316793
Depends on: 1330013
Depends on: 1330564
See bug 1322735, which IIUC should let us walk the stack on non-profiling builds as well.
(Eric Rahm [:erahm] on bug, comment #63)
> I see what you're saying now, yes this improvement is not isolated to
> crashes. It should help with getting useful stack traces across the board on
> win32. Bug 1209131 seems to be limiting itself to builds that have profiling
> enabled so that might be an issue as this doesn't change when profiling is
> enabled. That conversation is probably better discussed in that bug.

Do i understand this right that, with bug 1322735 landed, we should be able to enable the stack capturing here for all Win32 builds?
Flags: needinfo?(erahm)
(In reply to Georg Fritzsche [:gfritzsche] [away Jan 14 - 24] from comment #17)
> (Eric Rahm [:erahm] on bug, comment #63)
> > I see what you're saying now, yes this improvement is not isolated to
> > crashes. It should help with getting useful stack traces across the board on
> > win32. Bug 1209131 seems to be limiting itself to builds that have profiling
> > enabled so that might be an issue as this doesn't change when profiling is
> > enabled. That conversation is probably better discussed in that bug.
> 
> Do i understand this right that, with bug 1322735 landed, we should be able
> to enable the stack capturing here for all Win32 builds?

Correct!
Flags: needinfo?(erahm)
Depends on: 1335444
Bobby, the stack collection mechanism here is becoming ready for use:
http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/collection/stack-capture.html

To test this with useful data, do you have any current use-cases in mind?
Flags: needinfo?(bobbyholley)
I'm mostly working on stylo right now, and don't have any immediate use-cases now. Boris, anything we want to sniff out on the DOM side these days?
Flags: needinfo?(bobbyholley) → needinfo?(bzbarsky)
Nothing obvious that I know of.  I think we just started doing MOZ_CRASH/MOZ_RELEASE_ASSERT for the cases we really cared about.
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] (still a bit busy) from comment #21)
> Nothing obvious that I know of.  I think we just started doing
> MOZ_CRASH/MOZ_RELEASE_ASSERT for the cases we really cared about.

The question is whether there are cases we care less about (but still care about) that this unblocks. I guess we'll need to wait for something to come up in DOM-land.

On the broader question though, might be worth an email to dev-platform saying that this is possible and asking people to try it out?
I will write out a broader mail once this is wrapped up and we ran a test.
Keywords: meta
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.