Open Bug 1209131 Opened 9 years ago Updated 2 years ago

[meta] Need a way to capture tagged stacks in Telemetry

Categories

(Toolkit :: Telemetry, defect, P4)

Product:

Component:

Type:

defect

Priority:

P4

Severity:

S3

Tracking

()

Status:

NEW

People

(Reporter: bholley, Unassigned)

References

(Depends on 4 open bugs)

Details

(Keywords: meta, Whiteboard: [measurement:client] [measurement:client:project])

Bobby Holley (:bholley)

Reporter

Description

•

9 years ago

When hardening our security, I often want to introduce a new invariant. My process for doing it looks roughly like this: (1) MOZ_RELEASE_ASSERT the invariant. (2) Push to try, fixing until green. (3) Land to m-c, and watch for crash reports. Examine the stacks that come in, and fix them. (3) is necessary because our test coverage is imperfect. Usually the churn isn't too bad, but sometimes it is (as it was for bug 1072150). It would be really awesome if I could gather this data without making Nightly less stable. At some point we'd still want to land the release assertion so that we're sure the invariant holds, but we could avert a lot of unnecessary pain by doing so.

Bobby Holley (:bholley)

Reporter

Updated

•

9 years ago

Flags: needinfo?(gfritzsche)

Georg Fritzsche [:gfritzsche]

Comment 1

•

9 years ago

If we have a regular need for that i wonder if we should potentially have a keyed stack store for this kind of data. Vladan, what do you think? Could we re-use any of the existing functionality for hangs for this? Or could we maybe add functionality on Telemetry to snapshot the stack and submit it under some string key, so we end up with (key -> (stack -> count)) mappings or similar?

Flags: needinfo?(gfritzsche) → needinfo?(vladan.bugzilla)

Bobby Holley (:bholley)

Reporter

Comment 2

•

9 years ago

Do we actually have a way to capture the stack in opt builds? I'd think we'd need to involved Socorro somehow.

Vladan Djeric (:vladan)

Comment 3

•

9 years ago

Nightly already walks its stack during jank events of 1 second or more (--enable-profiling makes this possible), but doing this regularly would add performance overhead. Stack walking when the browser is hung for over a second is different from adding stackwalking when the browser is responsive. If we are able to address the performance concerns, I would prefer this info be stored in a custom section of the Telemetry payload. You should be able to re-use the chrome hangs or Background Thread Hangs data structures for this.

Flags: needinfo?(vladan.bugzilla)

Andrew McCreight [:mccr8]

Comment 4

•

9 years ago

For this kind of issue, capturing even just a few stacks per day per user would be useful. A developer would get a list of stacks, fix the top issues, wait for those to wash out, and then see what comes next.

(No longer employed by Mozilla) Aaron Klotz

Comment 5

•

9 years ago

One thing that jseward and I have discussed in the past is implementing an "unwinder service" whose sole responsibility is capturing stacks. The profiler would then just be another client that would request stacks from this service. This situation could be yet another use case for an unwinder service.

Boris Zbarsky [:bzbarsky]

Comment 6

•

9 years ago

> but doing this regularly would add performance overhead. For Bobby's use case we're talking about things that should be rare enough that we're willing to consider MOZ_CRASHing on them in nightlies. I suspect that the performance overhead is not a huge problem there. And yes, we could capture the stack once per session per probe location or something like that to mitigate the concern over performance.

Bobby Holley (:bholley)

Reporter

Comment 7

•

9 years ago

Ok. Georg, are you able to work on this?

Flags: needinfo?(gfritzsche)

Georg Fritzsche [:gfritzsche]

Updated

•

9 years ago

Blocks: TelemetryAdditions

Whiteboard: [measurement:client]

Georg Fritzsche [:gfritzsche]

Comment 8

•

9 years ago

Not right now, i'll check if we can get it on our list for the near future. If someone else has time, i think we can sketch out what should work between me and aklotz.

Flags: needinfo?(gfritzsche)

Bobby Holley (:bholley)

Reporter

Comment 9

•

9 years ago

Ok. Benjamin, just making sure you're aware of this prioritization given the impact that release-mode assertions are having on Firefox stability. It's a suboptimal programming pattern, but we currently lack something better.

Flags: needinfo?(benjamin)

Benjamin Smedberg

Updated

•

9 years ago

Flags: needinfo?(benjamin)

Georg Fritzsche [:gfritzsche]

Comment 10

•

9 years ago

I will mentor Iaroslav for this small project. We will use this bug as the tracking bug and break the work down into smaller actionable bugs. We will open a bug for a further design discussion soon.

Summary: Need a way to capture stacks / crash dump without actually making the build crash → [meta] Need a way to capture stacks / crash dump without actually making the build crash

Georg Fritzsche [:gfritzsche]

Comment 11

•

9 years ago

For the context of this project we will not go for the unwinder service. That seems like a separate improvement that doesn't have to block this bug.

Andrew McCreight [:mccr8]

Comment 12

•

9 years ago

MozStackWalk() can already do this.

Georg Fritzsche [:gfritzsche]

Updated

•

9 years ago

Summary: [meta] Need a way to capture stacks / crash dump without actually making the build crash → [meta] Need a way to capture tagged stacks in Telemetry

Georg Fritzsche [:gfritzsche]

Updated

•

9 years ago

Priority: -- → P3

Benjamin Smedberg

Comment 13

•

9 years ago

Georg, how are you going to do this without the unwinder?

Flags: needinfo?(gfritzsche)

Georg Fritzsche [:gfritzsche]

Comment 14

•

9 years ago

We already use do this for two hang Telemetry data collections, it's just moving this to an unwinder service that seems a bit involved for a short-term project. I assume this could either share a common implementation or we use MozStackWalk() directly. Any blockers there?

Flags: needinfo?(gfritzsche)

Benjamin Smedberg

Comment 15

•

9 years ago

Oh no, I thought you meant that you weren't going to do client-side unwinding and were going to use minidumps instead. You should make the decision about the code structure.

Georg Fritzsche [:gfritzsche]

Updated

•

9 years ago

Depends on: 1220120

Iaroslav Sheptykin

Updated

•

9 years ago

Depends on: 1225851

Georg Fritzsche [:gfritzsche]

Updated

•

9 years ago

Whiteboard: [measurement:client] → [measurement:client] [measurement:client:project]

Georg Fritzsche [:gfritzsche]

Updated

•

9 years ago

Priority: P3 → P4

Georg Fritzsche [:gfritzsche]

Updated

•

8 years ago

Depends on: 1299770

Georg Fritzsche [:gfritzsche]

Updated

•

8 years ago

Depends on: 1299772

Georg Fritzsche [:gfritzsche]

Updated

•

8 years ago

Depends on: 1299773

Georg Fritzsche [:gfritzsche]

Updated

•

8 years ago

Depends on: 1307746

Georg Fritzsche [:gfritzsche]

Updated

•

8 years ago

Depends on: 1316088

Georg Fritzsche [:gfritzsche]

Updated

•

8 years ago

Depends on: 1316793

Georg Fritzsche [:gfritzsche]

Updated

•

8 years ago

Depends on: 1330013

Georg Fritzsche [:gfritzsche]

Updated

•

8 years ago

Depends on: 1330564

Bobby Holley (:bholley)

Reporter

Comment 16

•

8 years ago

See bug 1322735, which IIUC should let us walk the stack on non-profiling builds as well.

Georg Fritzsche [:gfritzsche]

Comment 17

•

8 years ago

(Eric Rahm [:erahm] on bug, comment #63) > I see what you're saying now, yes this improvement is not isolated to > crashes. It should help with getting useful stack traces across the board on > win32. Bug 1209131 seems to be limiting itself to builds that have profiling > enabled so that might be an issue as this doesn't change when profiling is > enabled. That conversation is probably better discussed in that bug. Do i understand this right that, with bug 1322735 landed, we should be able to enable the stack capturing here for all Win32 builds?

Flags: needinfo?(erahm)

Eric Rahm [:erahm]

Comment 18

•

8 years ago

(In reply to Georg Fritzsche [:gfritzsche] [away Jan 14 - 24] from comment #17) > (Eric Rahm [:erahm] on bug, comment #63) > > I see what you're saying now, yes this improvement is not isolated to > > crashes. It should help with getting useful stack traces across the board on > > win32. Bug 1209131 seems to be limiting itself to builds that have profiling > > enabled so that might be an issue as this doesn't change when profiling is > > enabled. That conversation is probably better discussed in that bug. > > Do i understand this right that, with bug 1322735 landed, we should be able > to enable the stack capturing here for all Win32 builds? Correct!

Flags: needinfo?(erahm)

Georg Fritzsche [:gfritzsche]

Updated

•

8 years ago

Depends on: 1335444

Georg Fritzsche [:gfritzsche]

Comment 19

•

8 years ago

Bobby, the stack collection mechanism here is becoming ready for use: http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/collection/stack-capture.html To test this with useful data, do you have any current use-cases in mind?

Flags: needinfo?(bobbyholley)

Bobby Holley (:bholley)

Reporter

Comment 20

•

8 years ago

I'm mostly working on stylo right now, and don't have any immediate use-cases now. Boris, anything we want to sniff out on the DOM side these days?

Flags: needinfo?(bobbyholley) → needinfo?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 21

•

8 years ago

Nothing obvious that I know of. I think we just started doing MOZ_CRASH/MOZ_RELEASE_ASSERT for the cases we really cared about.

Flags: needinfo?(bzbarsky)

Bobby Holley (:bholley)

Reporter

Comment 22

•

8 years ago

(In reply to Boris Zbarsky [:bz] (still a bit busy) from comment #21) > Nothing obvious that I know of. I think we just started doing > MOZ_CRASH/MOZ_RELEASE_ASSERT for the cases we really cared about. The question is whether there are cases we care less about (but still care about) that this unblocks. I guess we'll need to wait for something to come up in DOM-land. On the broader question though, might be worth an email to dev-platform saying that this is possible and asking people to try it out?

Georg Fritzsche [:gfritzsche]

Comment 23

•

8 years ago

I will write out a broader mail once this is wrapped up and we ran a test.

Bobby Holley (:bholley)

Reporter

Comment 24

•

6 years ago

There is new demand for this: https://groups.google.com/forum/#!topic/mozilla.dev.platform/Z5ikfQMTJUA

Sylvestre Ledru [:Sylvestre]

Updated

•

6 years ago

Keywords: meta

Updated

•

2 years ago

Severity: normal → S3

You need to log in before you can comment on or make changes to this bug.