Closed Bug 411431 Opened 16 years ago Closed 15 years ago

Store/Display raw stack data and registers

Categories

(Socorro :: General, task, P1)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 517010

People

(Reporter: samuel.sidler+old, Unassigned)


Details

Reported by ted.mielczarek

We should store and (optionally, perhaps only for authed users) display raw
stack data and registers.  dbaron has used this in lieu of having local
variable information to debug difficult crashers.  This may require changes
to minidump_stackwalk.
Priority: -- → P1
Need clarification of what's important here.  dbaron: do you need the full stack memory for each thread, plus the registers for each frame?
Pretty much, though in the past I'd worked out registers for non-top frames based on saved values in the stack memory.  For examples, see bug 302536 (crash diagnosis), bug 347054 (using the stack memory to get a correct stack trace when talkback's was wrong), bug 334177, bug 321366, bug 347053 (which we never figured out), bug 335038 (analysis never made sense), bug 328184, bug 319797 (still not fixed), bug 303821 (see also comment 30 for potentially useful analysis we could also do).
Flags: blocking1.9+
Schrep, I do not think this should be a 1.9 blocker... we cannot hand out the raw stack data outside of MoCo due to privacy concerns, so the pool of people who can use this data is very limited. We have a symbol server now, so some decent subset of crashes can be debugged "in the field". We could even, when the rare case arises, give dbaron the actual minidump file, which can be attached in a debugger.

The amount of work required to implement this is non-trivial given the current minidump_stackwalk architecture, and I think we could spend our pre-1.9 effort on many more fruitful areas of the reporter.
k moving off list
Flags: blocking1.9+ → blocking1.9-
> so the pool of people who can use this data is very limited.

the number of engineers working at MoCo, or that we could grant access to, is pretty significant, but in reality we just need to make sure there is a streamlined process for a few folks who have a passion for killing really hard-to-diagnose crasher bugs (like dbaron)


would bsmedberg's suggestion in comment 3 get us to a point that would have aided in the analysis and debugging of the bugs dbaron mentioned in comment 2?

if handing out minidump files will work, that would be great, but if it won't, we really need to be working on another solution.

could we test the minidump process in the analysis of a few hairy crashers from the fx3 beta 3 release that have us stumped?

what's needed to get started?

https://bugzilla.mozilla.org/show_bug.cgi?id=418378 is an example of a top crash bug that we are going in circles on, and really need to take to the next level of analysis.

what's the process for handing out minidumps if someone wanted to look at bug 418378?

renominating until we get some kind of process in place to make sure the analysis can happen, and we don't make critical mistakes with shipping firefox 3.

let's make this bug track setting up the process for getting at minidumps, if nothing else.  who can help document or set that up?

Flags: blocking1.9- → blocking1.9?
Saving actual minidumps should be easy; the processor isn't doing that now, but they're just files, and we could copy them somewhere. On Windows they can be loaded in a debugger for postmortem debugging. On other platforms you can run minidump_stackwalk or minidump_dump on them directly to get some info, but without having the symbol files handy it's a little tricky to debug them.
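(For illustration only, a minimal sketch of what "copy them somewhere" could look like in the processor; the directory, function name, and Python 3 usage are assumptions, not existing Socorro code.)

import os
import shutil

SAVED_DUMP_DIR = "/data/socorro/saved_minidumps"  # hypothetical storage location

def save_raw_minidump(dump_path, crash_id):
    """Keep a copy of the raw .dmp so it can later be opened in a Windows
    debugger or fed to minidump_stackwalk / minidump_dump."""
    os.makedirs(SAVED_DUMP_DIR, exist_ok=True)
    dest = os.path.join(SAVED_DUMP_DIR, "%s.dmp" % crash_id)
    shutil.copy2(dump_path, dest)
    return dest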
instead of saving all minidumps, it'd be nice if we could set up triggers:

if Signature matches (....) AND platform is (win32)
  StoreMiniDump
  MailNotification: dbaron

Storing all the minidumps we collect could be expensive, but we could even configure the server to automatically store one minidump for each signature when it becomes a topcrash.
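(A rough, purely illustrative sketch of how a trigger rule like the pseudocode above might be evaluated in the processor; the rule format, helper functions, and notification address are all made up.)

import re

RULES = [
    {"signature": r"nsGlobalWindow::SaveWindowState",
     "platform": "win32",
     "notify": "dbaron@example.org"},
]

def store_minidump(dump_path, crash_id):
    # placeholder for whatever "StoreMiniDump" ends up doing
    pass

def mail_notification(recipient, message):
    # placeholder; a real version might use smtplib
    print("notify %s: %s" % (recipient, message))

def apply_trigger_rules(report, dump_path):
    for rule in RULES:
        if (re.search(rule["signature"], report["signature"])
                and report["platform"] == rule["platform"]):
            store_minidump(dump_path, report["crash_id"])
            mail_notification(rule["notify"],
                              "minidump saved for %s" % report["signature"])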

There are definitely people who can work with the Windows minidumps. If an NDA is necessary, this shouldn't be insurmountable. If we're collecting users' email addresses, we should be able to send them an email asking for their permission if people are really worried (this should not be automatic; it would be done by the person about to open the dump, before opening it).
I don't think we need to ask permission. People are opting in to send us this information. As long as we're not making it all publicly visible, I think we're within our rights ethically to distribute minidumps to employees of the corporation and other trusted engineers.
Also, to address the other portion of your comment, the processor doesn't currently know what's a topcrash. If we had a separate table listing topcrash signatures, then it could pretty easily perform what you pseudocoded there.
the pseudocode was actually meant to be separate from the topcrash flavor - for the obscure crash (not the topcrash). Although if you supported the pseudocode, then someone could easily write external code that added rules for all current topcrashes, and removed rules as it got minidumps that matched the topcrashes.

but however it's implemented, it would be nice :).
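(As an illustration of that external-code idea, plus the separate topcrash-signature table ted mentioned, a hedged sketch follows; the table name, columns, and rule format are invented for the example.)

def sync_rules_with_topcrashes(db_conn, rules, collected_signatures):
    """Add a rule for every current topcrash signature we aren't already
    watching, and drop rules whose minidump has already been collected.
    collected_signatures is assumed to be the set of signatures for which
    a minidump was already stored."""
    cur = db_conn.cursor()
    cur.execute("SELECT signature FROM topcrash_signatures")  # hypothetical table
    topcrashes = set(row[0] for row in cur.fetchall())

    watched = set(r["signature"] for r in rules)
    for sig in topcrashes - watched:
        rules.append({"signature": sig, "platform": "win32",
                      "notify": "crash-triage@example.org"})

    return [r for r in rules if r["signature"] not in collected_signatures]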
one other idea would be to save one minidump for each unique stack trace, as in:


if Signature does not match (any existing signature in the database)
  StoreMiniDump


That would mean we always have one minidump on file for each crash signature, and folks wouldn't have to hang around waiting on notifications.
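(A minimal sketch of that per-signature idea, assuming the processor can query the signatures it has already seen; the table name and query placeholders are illustrative only.)

import shutil

def store_first_minidump_for_signature(db_conn, signature, dump_path, dest_path):
    """Store the raw dump only if this signature has never been seen before."""
    cur = db_conn.cursor()
    cur.execute("SELECT 1 FROM reports WHERE signature = %s LIMIT 1",
                (signature,))
    if cur.fetchone() is None:
        # first report with this signature: keep the raw dump on file
        shutil.copy2(dump_path, dest_path)
        return True
    return False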

so, what do we need to do to turn on minidump saving?

some changes to the scripts like above, or other?
more disk space somewhere?
anything else?

how quickly could we put something in place?

That sounds like a sensible plan, and not too hard to implement. It would require some code changes to the processor. morgamic will have to comment on how quickly we could get it done.
Flags: tracking1.9? → blocking1.9?
morgamic?
Scaling all three tiers and improving processor throughput are a lot more important right now.  Given current priorities this can't be completed until April.  We just can't drop scalability or our other blockers (q1 goals) for this unless it's mission critical, and I am not seeing that it is.
morgamic,

you're right, the system isn't much good to us if it isn't scaled to the expected load, but on the other hand it also isn't much good to us if we don't have the tools in place to understand and analyze the data it provides.  Right now we have several top crashers in the beta3 top 10 list for which we have exhausted the first stage of analysis, where someone has looked at the stack traces, looked at comments, and tried to find common patterns and reproducible test cases that lead to the crashes.  The next stage for looking at these kinds of crashes is to do code inspections around the crashing code, and to do the analysis dbaron talks about in comment 2, which would involve minidumps or some other blackbox analysis tools.

Right now, at least this set of bugs for fx3 top 10 crashers is blocked on getting minidumps or some other way of getting a more detailed look at registers and other details from the blackbox.


1 XPCWrappedNativeScope (fixed for beta4)
2 MultiByteToWideChar
3 @0x0
4 jpinscp.dll@0xcf45
5 RtlpCoalesceFreeBlocks
6 nsGlobalWindow::SaveWindowState (needs minidump analysis) bug 418378
7 nsChromeRegistry::CheckForNewChrome() (could use minidump analysis) bug 391311
8 npLegitCheckPlugin.dll@0x14e09
9 UniscribeItem::SaveGlyphs (needs comment analysis, then minidump? bug 418382)
10 HashString(nsAString_internal const&) (needs minidump analysis) bug 418381


It looks like we will be shipping with these bugs still in the top crash list unless we can get some better analysis systems in place.

If there is any way to balance some scalability work with getting this minidump saving in place, it would be really helpful towards improving the stability of fx3 at launch.

Talking with Ted about this, it's all server side.  I don't think we should block the release on this, but it's super critical to have as much crash info as possible.  Ted's aligned to make sure this gets fixed, so morgamic, can you see to it on your end as well?  Just prioritize accordingly.

Minusing the blocker flag, marking wanted1.9.0.x+ (but really ASAP, because we don't need a client release to make it happen).
Flags: wanted1.9.0.x+
Flags: blocking1.9?
Flags: blocking1.9-
As for who should have access, may I suggest that access to these minidumps be tied into the mercurial/cvs account/LDAP story, so that non-MoCo employees can benefit from it as well?

This is how the TryServer restricts access, and seems to work fine.
Austin and David had a discussion about this.  Assigning to aking.
Assignee: nobody → aking
As filed, I don't think this is a good solution. Discussion here mentions saving minidumps for developers, so duping to bug 517010.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → DUPLICATE
Assignee: ozten.bugs → nobody
Component: Socorro → General
Product: Webtools → Socorro