Open
Bug 1209131
Opened 9 years ago
Updated 2 years ago
[meta] Need a way to capture tagged stacks in Telemetry
Categories
(Toolkit :: Telemetry, defect, P4)
Toolkit
Telemetry
Tracking
()
NEW
People
(Reporter: bholley, Unassigned)
References
(Depends on 4 open bugs)
Details
(Keywords: meta, Whiteboard: [measurement:client] [measurement:client:project])
When hardening our security, I often want to introduce a new invariant. My process for doing it looks roughly like this:
(1) MOZ_RELEASE_ASSERT the invariant.
(2) Push to try, fixing until green.
(3) Land to m-c, and watch for crash reports. Examine the stacks that come in, and fix them.
(3) is necessary because our test coverage is imperfect. Usually the churn isn't too bad, but sometimes it is (as it was for bug 1072150).
It would be really awesome if I could gather this data without making Nightly less stable. At some point we'd still want to land the release assertion so that we're sure the invariant holds, but we could avert a lot of unnecessary pain by doing so.
Reporter | ||
Updated•9 years ago
|
Flags: needinfo?(gfritzsche)
Comment 1•9 years ago
|
||
If we have a regular need for that i wonder if we should potentially have a keyed stack store for this kind of data.
Vladan, what do you think?
Could we re-use any of the existing functionality for hangs for this?
Or could we maybe add functionality on Telemetry to snapshot the stack and submit it under some string key, so we end up with (key -> (stack -> count)) mappings or similar?
Flags: needinfo?(gfritzsche) → needinfo?(vladan.bugzilla)
Reporter | ||
Comment 2•9 years ago
|
||
Do we actually have a way to capture the stack in opt builds? I'd think we'd need to involved Socorro somehow.
Comment 3•9 years ago
|
||
Nightly already walks its stack during jank events of 1 second or more (--enable-profiling makes this possible), but doing this regularly would add performance overhead. Stack walking when the browser is hung for over a second is different from adding stackwalking when the browser is responsive.
If we are able to address the performance concerns, I would prefer this info be stored in a custom section of the Telemetry payload. You should be able to re-use the chrome hangs or Background Thread Hangs data structures for this.
Flags: needinfo?(vladan.bugzilla)
Comment 4•9 years ago
|
||
For this kind of issue, capturing even just a few stacks per day per user would be useful. A developer would get a list of stacks, fix the top issues, wait for those to wash out, and then see what comes next.
Comment 5•9 years ago
|
||
One thing that jseward and I have discussed in the past is implementing an "unwinder service" whose sole responsibility is capturing stacks. The profiler would then just be another client that would request stacks from this service.
This situation could be yet another use case for an unwinder service.
![]() |
||
Comment 6•9 years ago
|
||
> but doing this regularly would add performance overhead.
For Bobby's use case we're talking about things that should be rare enough that we're willing to consider MOZ_CRASHing on them in nightlies. I suspect that the performance overhead is not a huge problem there.
And yes, we could capture the stack once per session per probe location or something like that to mitigate the concern over performance.
Reporter | ||
Comment 7•9 years ago
|
||
Ok. Georg, are you able to work on this?
Flags: needinfo?(gfritzsche)
Updated•9 years ago
|
Blocks: TelemetryAdditions
Whiteboard: [measurement:client]
Comment 8•9 years ago
|
||
Not right now, i'll check if we can get it on our list for the near future.
If someone else has time, i think we can sketch out what should work between me and aklotz.
Flags: needinfo?(gfritzsche)
Reporter | ||
Comment 9•9 years ago
|
||
Ok.
Benjamin, just making sure you're aware of this prioritization given the impact that release-mode assertions are having on Firefox stability. It's a suboptimal programming pattern, but we currently lack something better.
Flags: needinfo?(benjamin)
Updated•9 years ago
|
Flags: needinfo?(benjamin)
Comment 10•9 years ago
|
||
I will mentor Iaroslav for this small project.
We will use this bug as the tracking bug and break the work down into smaller actionable bugs.
We will open a bug for a further design discussion soon.
Summary: Need a way to capture stacks / crash dump without actually making the build crash → [meta] Need a way to capture stacks / crash dump without actually making the build crash
Comment 11•9 years ago
|
||
For the context of this project we will not go for the unwinder service.
That seems like a separate improvement that doesn't have to block this bug.
Comment 12•9 years ago
|
||
MozStackWalk() can already do this.
Updated•9 years ago
|
Summary: [meta] Need a way to capture stacks / crash dump without actually making the build crash → [meta] Need a way to capture tagged stacks in Telemetry
Updated•9 years ago
|
Priority: -- → P3
Comment 13•9 years ago
|
||
Georg, how are you going to do this without the unwinder?
Flags: needinfo?(gfritzsche)
Comment 14•9 years ago
|
||
We already use do this for two hang Telemetry data collections, it's just moving this to an unwinder service that seems a bit involved for a short-term project.
I assume this could either share a common implementation or we use MozStackWalk() directly.
Any blockers there?
Flags: needinfo?(gfritzsche)
Comment 15•9 years ago
|
||
Oh no, I thought you meant that you weren't going to do client-side unwinding and were going to use minidumps instead. You should make the decision about the code structure.
Updated•9 years ago
|
Whiteboard: [measurement:client] → [measurement:client] [measurement:client:project]
Updated•9 years ago
|
Priority: P3 → P4
Reporter | ||
Comment 16•8 years ago
|
||
See bug 1322735, which IIUC should let us walk the stack on non-profiling builds as well.
Comment 17•8 years ago
|
||
(Eric Rahm [:erahm] on bug, comment #63)
> I see what you're saying now, yes this improvement is not isolated to
> crashes. It should help with getting useful stack traces across the board on
> win32. Bug 1209131 seems to be limiting itself to builds that have profiling
> enabled so that might be an issue as this doesn't change when profiling is
> enabled. That conversation is probably better discussed in that bug.
Do i understand this right that, with bug 1322735 landed, we should be able to enable the stack capturing here for all Win32 builds?
Flags: needinfo?(erahm)
Comment 18•8 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] [away Jan 14 - 24] from comment #17)
> (Eric Rahm [:erahm] on bug, comment #63)
> > I see what you're saying now, yes this improvement is not isolated to
> > crashes. It should help with getting useful stack traces across the board on
> > win32. Bug 1209131 seems to be limiting itself to builds that have profiling
> > enabled so that might be an issue as this doesn't change when profiling is
> > enabled. That conversation is probably better discussed in that bug.
>
> Do i understand this right that, with bug 1322735 landed, we should be able
> to enable the stack capturing here for all Win32 builds?
Correct!
Flags: needinfo?(erahm)
Comment 19•8 years ago
|
||
Bobby, the stack collection mechanism here is becoming ready for use:
http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/collection/stack-capture.html
To test this with useful data, do you have any current use-cases in mind?
Flags: needinfo?(bobbyholley)
Reporter | ||
Comment 20•8 years ago
|
||
I'm mostly working on stylo right now, and don't have any immediate use-cases now. Boris, anything we want to sniff out on the DOM side these days?
Flags: needinfo?(bobbyholley) → needinfo?(bzbarsky)
![]() |
||
Comment 21•8 years ago
|
||
Nothing obvious that I know of. I think we just started doing MOZ_CRASH/MOZ_RELEASE_ASSERT for the cases we really cared about.
Flags: needinfo?(bzbarsky)
Reporter | ||
Comment 22•8 years ago
|
||
(In reply to Boris Zbarsky [:bz] (still a bit busy) from comment #21)
> Nothing obvious that I know of. I think we just started doing
> MOZ_CRASH/MOZ_RELEASE_ASSERT for the cases we really cared about.
The question is whether there are cases we care less about (but still care about) that this unblocks. I guess we'll need to wait for something to come up in DOM-land.
On the broader question though, might be worth an email to dev-platform saying that this is possible and asking people to try it out?
Comment 23•8 years ago
|
||
I will write out a broader mail once this is wrapped up and we ran a test.
Reporter | ||
Comment 24•6 years ago
|
||
There is new demand for this: https://groups.google.com/forum/#!topic/mozilla.dev.platform/Z5ikfQMTJUA
Updated•2 years ago
|
Severity: normal → S3
You need to log in
before you can comment on or make changes to this bug.
Description
•