Closed Bug 480503 Opened 15 years ago Closed 8 years ago

Can't search for stack frames beyond the top frame in socorro

Categories

(Socorro :: Webapp, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: adrian)

References

Details

(Keywords: regression, Whiteboard: [search])

Attachments

(1 file)

With the latest upgrade the search UI was updated too. It is no longer possible to search for frames at the second level or deeper. You can only search for the frame on top of the stack.

We should revert this change because searching under the first 10 frames was really helpful when you have an address only as the top frame. How should those searches be performed now?
[08:31am] ted: there was a dropdown
[08:31am] ted: [stack signature, one of the top 10 frames] [is exactly, contains, starts with]

I believe this was cut from the design intentionally. Will ping neilio.
lars has improved the data that this stack signature search hits...
Does this make the top 10 frames option obsolete?

Should we add some copy to educate users about this new feature?
or
Should we add this drop down back in to advanced filters?
(In reply to comment #2)
> lars has improved the data that this stack signature search hits...
> ... Should we add some copy to educate users about this new feature?

improved/new feature in what way?  Top of stack analysis?  Eg.
Bug 411349 Better signatures for crashes with just an address as the top frame


> or
> Should we add this drop down back in to advanced filters?

If only Bug 411349 then I'm guessing "top n" still needed. 

Perhaps a "top 5" would serve 90% of my needs, if it puts far less load on query than top 10. Just a thought.
Whiteboard: cloud
Target Milestone: --- → Future
Target Milestone: Future → 1.8
yeah, top 5 sounds like a good number.  I think that is what we had with talkback.

There are a few cases where searching for things down in the 7th frame was interesting like in bug 533035, but those are infrequent.

an interesting output to these search inputs would be to show the calls on the stack "horizontally" rather than vertically to allow sorting/counting of the various forms of the stacks.  This would allow you to get a pile of signatures or different stacks within the same signature, and sort out the similarities and differences.  It's a bit tough to look at without wide screen monitors but maybe we could figure out some presentation and data hiding tricks to help.

See the attachments in bug 295514 like https://bug295514.bugzilla.mozilla.org/attachment.cgi?id=445121 for some rough ideas on this kind of analysis report.
maybe this spins off into a more advanced search bug, but one of the things that I think we want to do with this kind of search is to not only input some terms to search for a specific call in a specific frame (e.g. return all the reports with PR_Lock in the second frame of stack), but also to input a series of frames and get back all the reports that have that sequence in the top 5 (or 10) frames

take for example bug 563847 

I want to give these two, three or four frames as the input like this

a. ntdll.dll RtlEnterCriticalSection    
b. nspr4.dll PR_Lock    
c. xul.dll mozilla::MutexAutoLock::MutexAutoLock(mozilla::Mutex &)    
d. xul.dll mozilla::plugins::ChildAsyncCall::RemoveFromAsyncList()    

then have the query return any signature or set of reports that has that sequence within the set of frames we look at.  for example it would return the signature RtlRaiseStatus and a stack that looks like

1 ntdll.dll RtlRaiseStatus    
2 ntdll.dll SHATransformP3    
3 ntdll.dll RtlEnterCriticalSection    
4 nspr4.dll PR_Lock    
5 xul.dll mozilla::MutexAutoLock::MutexAutoLock(mozilla::Mutex &)    
6 xul.dll mozilla::plugins::ChildAsyncCall::RemoveFromAsyncList()    

or

1 ntdll.dll RtlRaiseStatus    
2 ntdll.dll TransformMD5    
3 ntdll.dll RtlEnterCriticalSection    
4 nspr4.dll PR_Lock    
5 xul.dll mozilla::MutexAutoLock::MutexAutoLock(mozilla::Mutex &)    
6 xul.dll mozilla::plugins::ChildAsyncCall::RemoveFromAsyncList()    


where a matches 3, b matches 4, c matches 5, etc...

and it would return a signature like RtlEnterCriticalSection with the stack

1 ntdll.dll RtlEnterCriticalSection    
2 nspr4.dll PR_Lock    
3 xul.dll mozilla::MutexAutoLock::MutexAutoLock(mozilla::Mutex &)    
4 xul.dll mozilla::plugins::ChildAsyncCall::RemoveFromAsyncList()    
5 anything...
6 anything...

where a matches 1, b matches 2, etc...

and similar matching within the stack frame window.
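
The sequence-in-window matching described above could be sketched roughly as follows. This is only an illustration, not Socorro code: the function name, the "contains" style of matching and the requirement that the whole sequence fit inside the window are assumptions made for the example.

```python
from typing import Sequence


def find_frame_sequence(stack: Sequence[str], wanted: Sequence[str], window: int = 10) -> int:
    """Return the 0-based index where `wanted` starts within the top `window`
    frames of `stack`, or -1 if the sequence does not occur there.

    Each wanted entry matches a frame when it is a substring of that frame
    (an assumed "contains" rule).
    """
    top = list(stack[:window])
    n = len(wanted)
    if n == 0:
        return -1
    for start in range(len(top) - n + 1):
        if all(w in frame for w, frame in zip(wanted, top[start:start + n])):
            return start
    return -1


stack = [
    "ntdll.dll RtlRaiseStatus",
    "ntdll.dll SHATransformP3",
    "ntdll.dll RtlEnterCriticalSection",
    "nspr4.dll PR_Lock",
    "xul.dll mozilla::MutexAutoLock::MutexAutoLock(mozilla::Mutex &)",
    "xul.dll mozilla::plugins::ChildAsyncCall::RemoveFromAsyncList()",
]
wanted = ["RtlEnterCriticalSection", "PR_Lock", "MutexAutoLock", "RemoveFromAsyncList"]
print(find_frame_sequence(stack, wanted))
```

With a window of 10 the four-frame input matches the RtlRaiseStatus stack starting at its third frame. With a window of 5 it would not, because the last wanted frame sits in frame 6, which is one reason the window size matters.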

If we did something like this we could turn these kind of queries into the smart analysis talked about in Bug 527304.
2.x
Target Milestone: 1.8 → 2.0
I think 5 frames is too small to be really useful.  I vote for 10 (if you
must have a fixed number).  Would it be technically feasible to let
the user choose?
I've hacked a system to do a bit of this while we wait for the full solution.  We actually might want to think of doing something like this to optimize queries and minimize processing time.

For a given release I take the top 300 signatures,  
then for each signature 
  find a single report for the signature 
then foreach signature
  create a file with the stack

That gives me a directory of files I can search over any part of the stack to find related code.

I used this while working on bug 49129, and I think it's going to be useful for many more bugs.

I noticed a couple of performance, space, and utility trade offs were encountered pretty quickly while creating this system.

- I'm guessing there really isn't much additional expense of getting, storing, or using the top 1,5,10 frames, or full stack.  we have to access a full copy of the stack to perform any of these tasks and the length of the stack is generally limited to 10-100 lines of text worst case.

- just getting a single report for each signature is useful for a lot of cases, especially when there is one root cause and similar stack for each signature; however, we do have signatures that contain a wider variation of stacks and possible problems.  so enlarging the sample size for each signature helps to solve that problem.  gathering a large or complete sample of stacks for the top 300 would also help us with "smart analysis" kinds of reports where we are trying to figure out the distribution of various stacks within a given signature.

- just working off the top 300 signatures is actually pretty useful since that's where we are doing most of the analysis on bugs.  I think the solution we have been thinking about involves access to X frames of any of the 30,000 signatures and many more stacks within that signature set.  That's going to be pretty expensive.

So a space/time trade-off starts to surface pretty quickly when we think about the searchable space.

does the system allow search of the full stack for a sample of 1 report for X top signatures? e.g.

  -- sample of 1 report x 300 signatures = 300 files or tables to store and make searchable.

that's pretty fast to assemble and delivers quite a bit of utility.  the time to get at that set of information doesn't take too long and for a given release on a given day it looks like it takes about 900-1,000kb to store the sample of stacks.  finding a way to share this level of functionality actually might be interesting, until we get a more robust system in place.  

beyond that there are more questions.

A. Is it more useful to expand the number of reports of a given signature?  e.g.
 
  sample of 10 reports x 300 signatures = 3000 files or tables to store and make searchable 

or, 

B. is it more useful to expand the depth of the signatures?  e.g.

  - sample of 1 report for 3000 signatures = 3000 files or tables to store and make searchable.

space and performance of A and B is going to be roughly the same, but I think A provides more utility.

does the system allow search of X frames of the stack of any given report for any given signature?   I think this is going to turn out to be really expensive, especially when we turn off throttling for major releases and greatly expand the search space of the number of reports.
er, actually the bug I was working on was https://bugzilla.mozilla.org/show_bug.cgi?id=449129#c59  (currently security closed)
or, here is an example of the directory structure that we might think about making searchable

filename convention is rank..sig, and that could be expanded to rank..sample..sig like below if we kept three samples for each stack:

   1..1..js_fun_apply.JSContext.,.unsigned.int,.js::Value..
   1..2..js_fun_apply.JSContext.,.unsigned.int,.js::Value..
   1..3..js_fun_apply.JSContext.,.unsigned.int,.js::Value..
   2..1..HeapDestroy
   2..2..HeapDestroy
   2..3..HeapDestroy
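
For illustration, the rank..sample..sig naming could be generated like this. It is a sketch, not the script actually used: the helper names are hypothetical, and the character substitution (turning "(", ")", "*", spaces and "/" into ".") is a guess that happens to reproduce the names listed above, assuming the original signature was js_fun_apply(JSContext*, unsigned int, js::Value*).

```python
import re
from pathlib import Path
from typing import List, Tuple


def stack_filename(rank: int, sample: int, signature: str) -> str:
    """Build a rank..sample..sig file name.

    Assumption: '(', ')', '*', spaces and '/' are each replaced by '.'.
    """
    safe = re.sub(r"[()* /]", ".", signature)
    return f"{rank}..{sample}..{safe}"


def write_stacks(root: Path, ranked: List[Tuple[str, List[str]]]) -> List[Path]:
    """Write one file per sampled stack.

    `ranked` is a list of (signature, [stack_text, ...]) ordered by rank,
    a hypothetical input shape chosen for this example.
    """
    root.mkdir(parents=True, exist_ok=True)
    written = []
    for rank, (signature, stacks) in enumerate(ranked, start=1):
        for sample, stack_text in enumerate(stacks, start=1):
            path = root / stack_filename(rank, sample, signature)
            path.write_text(stack_text)
            written.append(path)
    return written
```

Keeping the rank first in the name means a plain directory listing or grep output still shows how important each hit is.
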
so here is what I'll propose as the temporary solution.

get IT to set up a gigabyte of storage at 
http://people.mozilla.org/crash-stacks.  
that should give us enough room to experiment a bit.

start out running the script to pull 10 stacks for each of the top 200 signatures for the latest 4.0beta, then experiment a bit to see if we can expand that to other releases like latest 3.6.x,  and bump up to 25 or 50 stacks per sig.

we immediately get some "free" search options like:

 login to people.m.o if you have an account there and use grep or text processing tools to quickly look anywhere SocketSend was on the stack. 
In the quick test this grep found it's in the 267th, 35th, 37th, 43rd ranked stacks.

% grep SocketSend  stacks/topstacks-4.0b6/*

stacks/topstacks-4.0b6/267..VirtualAllocEx:
0|4|nspr4.dll|SocketSend|hg:hg.mozilla.org/mozilla-central:nsprpub/pr/src/io/prsocket.c:633e895d5e84|681|0xb

stacks/topstacks-4.0b6/35..SocketSend:
0|0|nspr4.dll|SocketSend|hg:hg.mozilla.org/mozilla-central:nsprpub/pr/src/io/prsocket.c:633e895d5e84|688|0x0

stacks/topstacks-4.0b6/37.._PR_MD_SEND:
7|4|nspr4.dll|SocketSend|hg:hg.mozilla.org/mozilla-central:nsprpub/pr/src/io/prsocket.c:633e895d5e84|681|0xb

stacks/topstacks-4.0b6/43..send:
0|3|nspr4.dll|SocketSend|hg:hg.mozilla.org/mozilla-central:nsprpub/pr/src/io/prsocket.c:633e895d5e84|681|0xb

a simple, but more refined, grep could look for a specific frame or set of frames on the stack.
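
A more refined search of that kind might look like the sketch below. It assumes, from the sample lines above, that each frame line is laid out as thread|frame|module|function|source|line|offset. That field order is inferred from the grep output and is not taken from any format documentation, and the helper names are made up for the example.

```python
from typing import Iterable, NamedTuple, Optional


class Frame(NamedTuple):
    thread: int
    index: int
    module: str
    function: str


def parse_frame(line: str) -> Optional[Frame]:
    """Parse one pipe-delimited frame line, or return None for other lines."""
    parts = line.rstrip("\n").split("|")
    if len(parts) < 4 or not (parts[0].isdigit() and parts[1].isdigit()):
        return None
    return Frame(int(parts[0]), int(parts[1]), parts[2], parts[3])


def has_function_at(lines: Iterable[str], function: str, frame_index: int) -> bool:
    """True if any thread has exactly `function` at position `frame_index`."""
    for line in lines:
        frame = parse_frame(line)
        if frame and frame.index == frame_index and frame.function == function:
            return True
    return False
```

For example, has_function_at(open(path), "SocketSend", 4) would pick out the 267..VirtualAllocEx and 37.._PR_MD_SEND files from the grep output above but not 35..SocketSend, where SocketSend is frame 0.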

Second, we can make these files as search engine friendly as possible and coax google crawlers into indexing the files, then see if site search terms like 

   site:people.mozilla.com inurl:crash-stacks  SocketSend

also deliver interesting results.

Then next we could look at hooking up mxr or some other more refined search tool on the data.
ok, I have this up and running to gather 10 stacks for each of the top 100 signatures on 4.0b6

http://people.mozilla.com/~chofmann/crash-stats/stacks/topstacks-4.0b6/

it looks like it takes about 25 minutes to build that set of 1000 stacks.

Then it only takes a few seconds to generate another useful report that gives a breakdown of different stacks for all the signatures looking at the top 6 frames.

http://people.mozilla.com/~chofmann/crash-stats/stacks/stack-breakdown-4.0b6.txt

pretty interesting that only 23 of the top 100 signatures have stacks that all look the same.
chofmann:
1. please only look at the crashing thread
here's a basic rule for this [perl]
/^(\d+)/ and $this_thread = $1; if (!defined $crashed_thread) { $crashed_thread = $this_thread; } elsif ($this_thread != $crashed_thread) { ignore_frame(); }

example of confused report: ....Signature number: 14-nsPluginInstanceOwner::CreateWidget

2. please don't consider the crashing thread number as relevant except for "is 0" "is not 0" 
here's a basic rule for this [perl]:
s/^[^0]\d*/x/

example of confused report: ....Signature number: 10-nssTrustDomain_LockCertCache
good catch.  what causes non-crashing threads to be at the top of the raw dump?

I was just grabbing the top 6 frames of the raw dump.
Now I grab up to the top 6 lines where "^0|"

have a look at:
http://people.mozilla.com/~chofmann/crash-stats/stacks/stack-breakdown-4.0b6-1.txt

that changes things up a bit on counts of signatures where all the stacks are the same, but not much.

  24 sigs where 10 stacks are the same    was 23 10
  12 sigs where  9 stacks are the same    was 10  9
   8 sigs where  8 stacks are the same    was  8  8
   7 sigs where  7 stacks are the same    was  8  7
   9 sigs where  6 stacks are the same    was 10  6
   ...
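
A breakdown like the one above could be computed along these lines. This is a sketch under an assumption: "N sigs where k stacks are the same" is read here as "for N signatures, the largest group of sampled stacks with identical top frames has size k". The actual report script may count differently, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def largest_identical_group(stacks: Sequence[Sequence[str]], depth: int = 6) -> int:
    """Size of the biggest group of stacks sharing the same top `depth` frames."""
    groups = Counter(tuple(stack[:depth]) for stack in stacks)
    return max(groups.values()) if groups else 0


def agreement_histogram(samples: Dict[str, List[List[str]]], depth: int = 6) -> Counter:
    """Map k -> number of signatures whose largest identical group has size k."""
    return Counter(largest_identical_group(stacks, depth) for stacks in samples.values())
```

Here `samples` maps each signature to its list of sampled stacks, each stack being a list of frame strings.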
well, the simplest thing is crashing early in a thread's stack. but the other thing is when we hit a frame and don't have enough information to walk to its next stack frame (optimized third party libraries).

Right now, we have a bug where JIT frames aren't walkable,

....Signature number: 4-cairo_d2d_present_backbuffer

A. please have your code stop when it hits a frame without anything such as this:
0|2|xul.dll|

The info past that point is basically garbage -- unless it's the first frame (see B).

....Signature number: 5-pthread_mutex_lock
      6 0|0|XUL|__tcf_2
      3 0|0|XUL|__tcf_0

I don't have a good explanation of what __tcf_X is right now, but for the time being I'd request that you treat them as equivalent:

s/__tcf_\d+/__tcf_X/

the frame past it is
0|1|libSystem.B.dylib|__cxa_finalize

so we're basically looking at abort() or something like it in some probably optimized manner.

  count:   top 6 frames
      2 0|0|DWrite.dll|`anonymous namespace'::TrueTypeFontMetricsBuilder::GetBlackBox(unsigned short)

Please change the way you write this, to this:

  count: 2  top 6 frames
0|0|DWrite.dll|`anonymous namespace'::TrueTypeFontMetricsBuilder::GetBlackBox(unsigned short)

Right now I keep missing frame 0 because it's overindented and hidden behind a hit count.

>      1 0|0|XUL|nsTHashtable<nsBaseHashtableET<nsCStringHashKey, nsRefPtr<imgCacheEntry> > >::s_ClearEntry

Please unescape HTML <>& in .txt format.

....Signature number: 10-nssTrustDomain_LockCertCache
.... distribution of 10 different stacks
  count:   top 6 frames
      2 0|0|DWrite.dll|`anonymous namespace'::TrueTypeFontMetricsBuilder::GetBlackBox(unsigned short)

For some reason the label here is bogus, this isn't nssTrustDomain_LockCertCache...

....Signature number: 16-_purecallnsXPCWrappedJS::QueryInterfacensIDconst,void
.... distribution of 10 different stacks
  count:   top 6 frames
      6 0|0|ntdll.dll|KiFastSystemCallRet
0|1|ntdll.dll|ZwWaitForSingleObject
0|2|kernel32.dll|WaitForSingleObjectEx
0|3|kernel32.dll|WaitForSingleObject
0|4|xul.dll|google_breakpad::ExceptionHandler::WriteMinidumpOnHandlerThread(_EXCEPTION_POINTERS *,MDRawAssertionInfo *)
0|5|xul.dll|google_breakpad::ExceptionHandler::HandlePureVirtualCall()

For purecall, please skip the first 5 frames. i.e. start with:
0|5|xul.dll|google_breakpad::ExceptionHandler::HandlePureVirtualCall()

Or possibly something past that.

....Signature number: 19-_SEH_prolog4_GS
0|2|KERNELBASE.dll|

can your script automatically queue symbol pulls for kernelbase? we're just missing symbols here.

....Signature number: 24-nsLocalFile::Removeint
.... distribution of 10 different stacks
  count:   top 6 frames
      4 0|0||
0|1|xul.dll|nsLocalFile::Remove(int)

B. Skip the first frame in the output when it's empty, as it was here...
0|0||
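
Taken together, the clean-up requests in this comment (rules A and B, the __tcf_X equivalence and the purecall trimming) could be sketched as below. The (module, function) tuple shape and the function name are assumptions for the example. Rule A is implemented so that the symbol-less frame itself is kept and only the frames after it are dropped, which is one possible reading of the request.

```python
import re
from typing import List, Tuple

Frame = Tuple[str, str]  # (module, function), a hypothetical minimal shape

PURECALL = "google_breakpad::ExceptionHandler::HandlePureVirtualCall()"
TCF = re.compile(r"__tcf_\d+")


def clean_frames(frames: List[Frame]) -> List[Frame]:
    """Apply the requested clean-ups to the frames of one thread."""
    # Purecall: start at HandlePureVirtualCall, dropping the frames above it.
    for i, (_, function) in enumerate(frames):
        if function == PURECALL:
            frames = frames[i:]
            break

    cleaned: List[Frame] = []
    for position, (module, function) in enumerate(frames):
        if position == 0 and not function:
            # Rule B: skip a completely empty first frame; keep a module-only one.
            if module:
                cleaned.append((module, function))
            continue
        # Treat __tcf_0, __tcf_2, ... as equivalent.
        function = TCF.sub("__tcf_X", function)
        cleaned.append((module, function))
        if not function:
            # Rule A: nothing after a frame without a function name is trusted.
            break
    return cleaned
```

Writing each rule as a separate, commented step makes it easier to list in a report which fix-ups were applied to a given stack.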
re comment 16.
oh err. no, your logic isn't right.

the crashing thread is always at the top (it might not be thread 0!). it's just a matter of the crashing thread not always having 6 decipherable stack frames, so after that you just get stacks for each of the live threads starting from 0 skipping the crashed thread.
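
Given that description, picking out only the crashing thread could look like this sketch: take the thread number of the first frame line and stop as soon as a different thread number appears. The function name and the assumption that frame lines start with a numeric thread field followed by "|" are mine, inferred from the dump excerpts in this bug.

```python
from typing import Iterable, List


def crashing_thread_frames(frame_lines: Iterable[str], limit: int = 6) -> List[str]:
    """Return up to `limit` frame lines belonging to the first (crashing) thread."""
    crashed = None
    kept: List[str] = []
    for line in frame_lines:
        if "|" not in line:
            continue  # not a pipe-delimited line
        thread = line.split("|", 1)[0]
        if not thread.isdigit():
            continue  # header or module line, not a frame
        if crashed is None:
            crashed = thread  # the first frame line belongs to the crashing thread
        elif thread != crashed:
            break  # the other threads follow; ignore them
        kept.append(line)
        if len(kept) == limit:
            break
    return kept
```

Unlike filtering on "^0|", this also works when the crashing thread is not thread 0, and it returns fewer than `limit` frames when the crashing thread's stack is short.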
(In reply to comment #17)
> well, the simplest thing is crashing early in a thread's stack. but the other
> thing is when we hit a frame and don't have enough information to walk to its
> next stack frame (optimized third party libraries).
> 
> Right now, we have a bug where JIT frames aren't walkable,
> 
> ....Signature number: 4-cairo_d2d_present_backbuffer
> 
> A. please have your code stop when it hits a frame without anything such as
> this:
> 0|2|xul.dll|
> 
> The info past that point is basically garbage -- unless it's the first frame
> (see B).
> 

I notice we don't do that in the socorro web interface
http://crash-stats.mozilla.com/report/index/2416707b-c8d5-46ea-9250-b98e12101016

How many of the things you suggest to clean up the stacks from the raw data are currently done in the socorro web interface?

Maybe working off the raw dumps isn't the right thing if there is a better source of "cleaned up stack data", or maybe we should create this single source of cleaned up stack data and make it available to the socorro web interface, and any analysis tools that want to work off it.
It depends on what we're doing. when looking at a single report it's helpful to have as much information as readily available as possible. The risk of information being wrong is that you look at it and shake your head.

For correlations, the data past those points is IME generally garbage (and too variable) and I'd rather exclude it. Note that when I say "past" i mean that you should include that frame - so that we get the caller module for which we don't have information.

Some of my requests would be good for Socorro. The purecall stuff should really be cut off the top of Socorro too -- iirc it isn't atm. As should the hang detection equivalent. But it's probably best to basically define individual features (as bugs) and then identify whether they apply to both. It might be good to write the code in a modular way so that it could be shared by merely choosing to call the function from the individual consumer.

I don't have a mental model of all the things Socorro is doing, it'd be good if someone actually had a live list in human readable terms (this means something I can load in a web browser w/o speaking Python).

I believe that working from the raw data is the right thing here, I don't think you can get much from the processed versions.
(In reply to comment #20)
> It depends on what we're doing. when looking at a single report it's helpful to
> have as much information as readily available as possible. The risk of
> information being wrong is that you look at it and shake your head.
> 
> For correlations, the data past those points is IME generally garbage (and too
> variable) and I'd rather exclude it. Note that when I say "past" i mean that
> you should include that frame - so that we get the caller module for which we
> don't have information.
> 

yeah, mostly what I was thinking when doing this was to try and characterize the raw data that we are getting.  

Maybe there isn't much value in that.   For this particular case with  4-cairo_d2d_present_backbuffer is there any value in knowing that 6 out of 10 times we got a reasonably good stack, and 4 out of ten we got the stack plus garbage, and there were two different ways that garbage took form?

As we start to try and do more "fix-ups" on the data (including skiplisting) it might be good to document these somehow, then note in the reports when the fix-ups are applied.
ok,  I've moved the prototype of this over to a new location on people.m.o

these reports allow searching for any snippets of code within the top 10 frames of the top 300 signatures, looking at a sample of 10 reports for each signature. Just load the page in a browser and use "find in page."

http://people.mozilla.com/crash_stacks/stack-summary-3.6.11.txt
http://people.mozilla.com/crash_stacks/stack-summary-4.0b6.txt
http://people.mozilla.com/crash_stacks/stack-summary-4.0b8pre.txt

then for full stack search for the top 300 signatures login to people.m.o and grep within the directories under 

/var/www/html/crash_stacks/topstacks-4.0b6
/var/www/html/crash_stacks/topstacks-4.0b8pre
/var/www/html/crash_stacks/topstacks-3.6.11

I'll update these every few days until we get something better in place.
This is a major regression.  I'd like to have this fixed in 1.9, please.
chofmann: this set looks like stack overflow from infinite recursion of nsDisplayList::Paint/nsDisplayClip::Paint:

....Signature number: 6-nsIFrame::GetOffsetTonsIFrameconst
______ distribution of 10 different stacks, looking at top 10 frames
      5  stacks like
0|0|xul.dll|nsIFrame::GetOffsetTo(nsIFrame const *)
0|1|xul.dll|nsDisplayImage::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|2|xul.dll|nsDisplayList::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|3|xul.dll|nsDisplayClip::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|4|xul.dll|nsDisplayList::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|5|xul.dll|nsDisplayClip::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|6|xul.dll|nsDisplayList::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|7|xul.dll|nsDisplayClip::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|8|xul.dll|nsDisplayList::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|9|xul.dll|nsDisplayClip::Paint(nsDisplayListBuilder *,nsIRenderingContext *)

      1  stacks like
0|0|xul.dll|nsIFrame::GetOffsetTo(nsIFrame const *)
0|1|xul.dll|nsDisplayListBuilder::ToReferenceFrame(nsIFrame const *)
0|2|xul.dll|nsDisplayPlugin::GetBounds(nsDisplayListBuilder *)
0|3|xul.dll|nsDisplayPlugin::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|4|xul.dll|nsDisplayList::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|5|xul.dll|nsDisplayClip::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|6|xul.dll|nsDisplayList::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|7|xul.dll|nsDisplayClip::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|8|xul.dll|nsLayoutUtils::PaintFrame(nsIRenderingContext *,nsIFrame *,nsRegion const &amp;,unsigned int,unsigned int)
0|9|xul.dll|PresShell::Paint(nsIView *,nsIRenderingContext *,nsRegion const &amp;)

      1  stacks like
0|0|xul.dll|nsIFrame::GetOffsetTo(nsIFrame const *)
0|1|xul.dll|nsDisplayListBuilder::ToReferenceFrame(nsIFrame const *)
0|2|xul.dll|nsDisplayPlugin::GetBounds(nsDisplayListBuilder *)
0|3|xul.dll|nsDisplayPlugin::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|4|xul.dll|nsDisplayList::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|5|xul.dll|nsDisplayClip::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|6|xul.dll|nsDisplayList::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|7|xul.dll|nsDisplayClip::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|8|xul.dll|nsDisplayList::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
0|9|xul.dll|nsDisplayClip::Paint(nsDisplayListBuilder *,nsIRenderingContext *)
----

So, I think I need the crash reason :)
trying a run to see what that looks like.  we might want to add crash address too.
Summary: No possibility anymore to search for stack frames except the top frame → Can't search for stack frames beyond the top frame in socorro
Will current ES implementation plans fix this?
Assignee: nobody → adrian
I think to do this usefully we might need to fix bug 573100 as well. Otherwise the stack data is in an unhelpful format that will be difficult to search meaningfully.
Same as for bug 573100, searching into the dump is possible with ES, so this will be fixed by the ES implementation. 

See bug 654567 for ES implementation.
Target Milestone: 2.0 → 2.1
Target Milestone: 2.1 → 2.2
Target Milestone: 2.2 → 2.3
Blocks: 678101
Target Milestone: 2.3 → 2.4
Target Milestone: 2.4 → ---
Component: Socorro → General
Product: Webtools → Socorro
Ping?  what's the ETA for fixing this two-year old major regression?
(In reply to Mats Palmgren [:mats] from comment #29)
> Ping?  what's the ETA for fixing this two-year old major regression?

Socorro never was able to do that, so it's not a regression. Please don't confuse things here. Talkback, which might have been able to do that, was completely different software that was not open.

Resolving this in Socorro will AFAIK need ElasticSearch to be deployed and that's currently waiting for hardware to be available and installed.
It's a regression in the sense that an important feature of crash-stats.mozilla.com
was simply removed and thereby severely crippling our ability to analyze crashes for
the past two years.

I'm happy to hear that it's just a matter of installing new hardware and then
deploying the new software, when will that take place?
I'm not aware of crash-stats.mozilla.com being able to search beyond the top frame at any time, so I wouldn't call it a regression.

I'm not sure if it will work instantly when the new hardware is installed (for the "when" you should ask IT and/or look into the bug for getting it up), I guess it will require some software work as well, but the Socorro team needs the hardware to even be able to test any code that might get written for this.
Yes, chofmann indicated it was a feature of the older talkback system. Let's not argue about whether or not it's a regression, it seems like it would be very useful to have. Kairo, once we have the elastic search stuff in place, is this sort of thing easy to put in place? I understand the Socorro folks are blocked right at the moment.
(In reply to Sheila Mooney from comment #33)
> Kairo, once we have the elastic search stuff in place,
> it this sort of thing easy to put in place? I understand the Socorro folks
> are blocked right at the moment.

From what the Socorro team has told me, it should not be too hard to get this in place once ElasticSearch is in production, yes. Unfortunately, this is still blocked by getting the hardware up and it's a quite frustrating issue for all of us involved, including the Socorro team, and bickering on how unusable things are in the current situation is just rubbing it in. We all know we need to resolve this, but unfortunately it needs more patience from everyone.
are bugs open for setting up the hardware?  

let's get them added to the dependency list.   

Again, we are missing out on crash fixes that could be made by not having this capability in place.  Some recent research at http://www.cse.ust.hk/~hmseo/CRASH/Home.html shows the kind of problems that can be prevented and solved by being able to search across frames for other callers that may be part of our crash problems.
Depends on: 656297, 726725
Whiteboard: cloud → [search]
No longer blocks: 678101
Depends on: 678101
Component: General → Middleware
Priority: -- → P4
why is this now moved down to p4?
Adrian is ordering his personal bug queue.  That's all it means.
Please re-prioritize.  This bug is crippling our ability to hunt down crash bugs.
It's very important to us that we can search for signatures in the top 10 frames
given the amount of noise there is at the top.
Priority: P4 → --
Priority: -- → P1
It is now possible to search into the entire dump field using supersearch. For example: https://crash-stats.mozilla.com/search/?product=Firefox&dump=EmitNameOp&_facets=signature

I think that resolves this bug. Please reopen if it doesn't.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Awesome!  Thanks Adrian.
Given that this was regressed by moving from pipe dump to JSON dump, I'm reopening the bug.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Quick status update here:

We have had the dump available in search for a few months last year, and those were glorious days, but we then moved from the "pipe" dump to the "json" dump and that ability to search was lost. Super Search is far from being ideal for searching that json dump, for it would require the addition of many confusing fields or quite expensive and hacky features. The "spectateur" prototype [0] has been a solution to this problem, and it worked for a short bit, but then we moved to AWS and had to remove the dump from our Elasticsearch database entirely for disk space reasons. That's where we are now.

Now, looking at the future. The dump is a big piece of data and it's very expensive for us to store it in Elasticsearch. So what I propose is that we store it, but only for a very short period of time (something like 2 weeks). It would not be possible to search the dump directly through Super Search, but we can make it available through the API, so that you can write your own script to search those dumps, or use a tool like spectateur or Benjamin's crash-stats-api-magic [1]. 

We have not started working on this yet, and I would like to get your feedback on this solution if you have any. Other suggestions are of course very welcome as well. 

[0] http://spectateur.mozilla.io/
[1] http://bsmedberg.github.io/crash-stats-api-magic/analyze-crash.html
(In reply to Adrian Gaudebert [:adrian] from comment #44)
> 
> We have not started working on this yet, and I would like to get your
> feedback on this solution if you have any. Other suggestions are of course
> very welcome as well. 
> 
> [0] http://spectateur.mozilla.io/
> [1] http://bsmedberg.github.io/crash-stats-api-magic/analyze-crash.html

This sounds like it might be workable. Is the json from the raw dump tab currently available through any kind of api?
What if you had a single field called "super-signature"? It would have a similar form as "signature", but none of the signature generation rules would be used: just a long conglomeration of enumerated frame signatures (and/or other frame info that was included in the pipe dump) of the crashing thread. It wouldn't actually be used directly for bucketing, but could bring back the "glorious days" of searching for patterns in the complete stack.
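
As a rough illustration of the idea, such a field could be built by joining the frame signatures of the crashing thread into one string, which a plain "contains" search can then match against. The function name, the " | " delimiter and the "(unknown)" placeholder are assumptions for this sketch, not a description of what Socorro does.

```python
from typing import Iterable


def super_signature(frame_signatures: Iterable[str], delimiter: str = " | ") -> str:
    """Join all frame signatures of the crashing thread into one searchable string.

    Frames without a name are kept as a placeholder so that neighbouring
    frames do not appear adjacent when they were not.
    """
    return delimiter.join(sig if sig else "(unknown)" for sig in frame_signatures)


frames = [
    "RtlEnterCriticalSection",
    "PR_Lock",
    "mozilla::MutexAutoLock::MutexAutoLock(mozilla::Mutex &)",
    "mozilla::plugins::ChildAsyncCall::RemoveFromAsyncList()",
]
signature = super_signature(frames)
print("RtlEnterCriticalSection | PR_Lock" in signature)
```

Because the delimiter is part of the string, searching for "A | B" finds stacks where A is called directly from B, while searching for A alone finds it at any depth.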
(In reply to K Lars Lohn [:lars] [:klohn] from comment #46)
> What if you had a single field called "super-signature" ? It would have a
> similar form as "signature", but none of the signature generation rules
> would be used: just a long conglomeration of enumerated frame signatures
> (and/or other frame info that was included in the pipe dump) of the crashing
> thread. It wouldn't actually be used directly for bucketting, but could
> bring back the "glorious days" of searching for patterns in the complete
> stack.

That would solve the majority of the problem, I think.
So alternately, maybe we could store just a *bit* of the json_dump, like `json_dump.crashing_thread.frames`, which is at most the top 10 frames of the crashing thread, which ought to be useful enough and not unbounded in size.
Depends on: 1208129
Sorry for not deep-reading the whole bug history here but Ted mentions in https://bugzilla.mozilla.org/show_bug.cgi?id=480503#c48 that we could store some of the json_dump.crashing_thread.frames in the processed crash. Thus available in SuperSearch. That was implemented a couple of months ago. Does that move this bug forward?
Component: Middleware → Webapp
(In reply to Peter Bengtsson [:peterbe] from comment #50)
> That was implemented a couple of months ago. 

Nope, that was never implemented. See bug 1266099.
Bug 1208129 should have resolved this problem.  All the frame signatures of the crashing thread are concatenated in one field called "proto_signature".
Indeed! Here's an example search: 

https://crash-stats.mozilla.com/search/?product=Firefox&proto_signature=~nsXMLHttpRequest%3A%3AOnStopRequest&_sort=-date&_facets=proto_signature&_columns=proto_signature
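
A search URL of that shape can be assembled in a script, for instance as below. The parameter names and the "~" prefix are copied from the example URL above, where "~" appears to request a "contains" match. The helper name is hypothetical, and the sketch only builds the URL without sending any request.

```python
from urllib.parse import urlencode

BASE = "https://crash-stats.mozilla.com/search/"


def proto_signature_search_url(function: str, product: str = "Firefox") -> str:
    """Build a Super Search URL looking for `function` anywhere in proto_signature."""
    params = [
        ("product", product),
        ("proto_signature", "~" + function),
        ("_sort", "-date"),
        ("_facets", "proto_signature"),
        ("_columns", "proto_signature"),
    ]
    # Keep '~' literal so the result matches the hand-written URL.
    return BASE + "?" + urlencode(params, safe="~")


print(proto_signature_search_url("nsXMLHttpRequest::OnStopRequest"))
```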
Status: REOPENED → RESOLVED
Closed: 11 years ago → 8 years ago
Resolution: --- → FIXED