Closed Bug 519703 Opened 15 years ago Closed 11 years ago

enhancements for crash csv files

Categories

(Socorro :: General, task)

Platform: x86 All
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chofmann, Unassigned)

References

Details

the use of the .csv files for finding interesting urls to test automatically is going really well, and it's finding a ton of good crashes and crashes with security implications.

we are also able to use the data files to help analyze trends in the crash data in new and experimental ways that weren't possible in the past.

there are a few improvements that would make things even better.

1) shorten the cycle for producing the report to 12 hours with AM and PM reports

filename conventions would change to 20090928-am-crashdata.csv and
20090928-pm-crashdata.csv, covering the periods from midnight to noon and noon to midnight.

This would allow us to stay on top of trends like bug 519039 with a bit more precision, instead of waiting around for a full 24 hours before we can see the next set of trends in the data.
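For concreteness, here is a minimal sketch of the proposed naming (the helper function and its arguments are illustrative, not part of any spec):

from datetime import date

# A sketch of the proposed twice-daily naming; 'am' covers midnight to
# noon and 'pm' covers noon to midnight.
def crashdata_filename(day, period):
    assert period in ("am", "pm")
    return "%s-%s-crashdata.csv" % (day.strftime("%Y%m%d"), period)

# crashdata_filename(date(2009, 9, 28), "am") -> '20090928-am-crashdata.csv'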

2) add source file name to the reports.  this would be the first source file name that appears in the stack.

this would allow us to dig out key crashes where the signature name rarely matches the area of the code. we could look for things such as all "font" bugs and run them through automated testing cycles, as requested in Bug 513642 - report needed for gfx crashes.

3) IP address and e-mail address.  these would be used to analyze which IP addresses and collections of addresses we receive the most crashes from, and are another tool for identifying reports coming from the same source.  In some cases we may contact the user for additional information on how they generated the crash and ask for assistance in debugging and trying out fixes.

this adds some additional privacy concerns to the file, but we already have privacy-sensitive data in the files with the urls, and have taken precautions to protect the data.

if any of these things are hard or will take a long time, let's do them in priority order.  the generation of these files is really turning out to be a valuable platform for analysis and crash reproduction.  I hope we can do these things easily and quickly.
all the fields can be tacked onto the end of the \t-separated order.
sounds like ted is removing e-mail, so let's dispense with that part of the request.
We removed it a long time ago, FWIW.
more enhancement requests


Add time since start up.  I didn't know this was around when the original file spec was made.  this would be really useful.



Bug 521906 -  Breakpad should report if extension compatibility overrides are being used.   this would also allow some very interesting analysis.

top source file line in stack  https://bugzilla.mozilla.org/show_bug.cgi?id=513642#c3   -- in some cases this would allow some quick analysis on the correlation between the source file or area of code changed and the stack signature, improving the speed of getting bugs into the right engineer's hands.
sounds like e-mail is back, but I'd just as soon leave it out of the .csv files.

the web interface is the best way to get at this data when needed in case-by-case situations, and provides better protection of the data.
maybe we can do these as part of the work on the .csv files that is planned for the next socorro milestone.


[Bug 529431] create daily summary snapshots of all crashes and publish on people or crash-stats.m.c

[Bug 530160] Daily Crash Dump contains bogus Linux version
dolske and dmandelin are users of these files so they should know the changes are coming at some point too.
jdaggett and jst too.
since flash is involved in about 20-30% of all crashes, it would be good to reach in and pick out the flash version from the module list.  that would help with analysis and testing of those crashes.
So I think this captures a summary of all the desired changes to .csv files for this round

fixes to existing fields

 - make a sanitized version of the output that contains no url data - Bug 529431
 - fix bogus linux versions - Bug 530160
 - remove tabs from comments - Bug 530665

new tab separated fields appended to the existing fields (see the sketch after this list)

 Uptime
 IP_ADDR
 top-most-sourcefile-listed-on-the-stack
 Addon_compat_override_setting --  Bug 521906
 flash-version
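A minimal sketch of that append (the helper and sample values are hypothetical, not Socorro's actual writer); it also shows why stripping tabs from free-text fields matters, since an unescaped tab in user_comments would shift every later column:

# Hypothetical helper: append new columns to an existing tab-separated row.
# Tabs and newlines are stripped from each value first, since an unescaped
# tab in a free-text field shifts every later column (cf. Bug 530665).
def append_fields(existing_row, new_values):
    cleaned = [str(v).replace("\t", " ").replace("\n", " ") for v in new_values]
    return existing_row + "\t" + "\t".join(cleaned)

# Sample values, purely for illustration:
row = "some_signature\thttp://example.com/page"  # ...existing columns
row = append_fields(row, [4242, "203.0.113.7", "nsTextFrame.cpp",
                          True, "10.0.42.34"])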
Group: webtools-security
Target Milestone: --- → 1.3
Assignee: nobody → griswolf
Target Milestone: 1.3 → 1.4
wayne mery is also interested in getting a set of these files for thunderbird crash analysis, and could probably work off the set of data planned for the public reports since urls aren't very interesting for tbird crash analysis.
(In reply to comment #5)
> sounds like e-mail is back, but I just as soon leave it out of the .csv files.

how about, reduced by domain?
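A trivial sketch of the reduction being floated here, purely illustrative:

# Reduce an address to its domain: 'user@example.com' -> 'example.com'.
# Returns '' for reports that carry no address.
def email_domain(addr):
    return addr.rsplit("@", 1)[1].lower() if "@" in addr else ""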
I'd just as soon leave e-mails off.  it seems like the best access/use of those is in individual cases related to a particular signature on a particular report.   I can see how the domains might be valuable to tbird, but I think less so for Firefox.

If the number of ADUs corresponding to the product version is available, I'd like to see that added.  that would allow correlating volume changes with uptake in adoption of releases. See https://bugzilla.mozilla.org/show_bug.cgi?id=530955#c22 for more info.


  so to update the list it would look like


fixes to existing fields

 - make a sanitized version of the output that contains no url data - Bug 529431
 - fix bogus linux versions - Bug 530160
 - remove tabs from comments - Bug 530665

new tab separated fields appended to the existing fields

 Uptime
 IP_ADDR
 top-most-sourcefile-listed-on-the-stack
 Addon_compat_override_setting --  Bug 521906
 flash-version
 adu's-corresponding-to-release-version-listed in field #8
Assignee: griswolf → lars
also add the number of cores reported in the crash report
Assignee: lars → griswolf
There is a basic problem with the python/psycopg2 interface:
 - it hands us a list of lists for the query: expected
 - each inner list represents a row of query output: expected
 - the length of an inner list is not the same from row to row: unexpected

I *do not understand* this behavior. We should be getting full lists with interspersed nulls, and we are indeed getting SOME nulls, but apparently not all of them. This appears to be broken for the existing code (as seen from khan against dm-breakpad-stagedb). I cannot fix this issue overnight: Too fried to do good debugging.
... not too fried to do a little debugging. It appears to be in the writer, not the database access. And it appears to be associated with array elements whose value is ''. On a set of 10 lines, we were getting 3 failures. Fixing all the blanks gets us down to 1 failure on the set of 10. Same 'fixed' set has 6 failures in 30, though. All of them seem to be associated with complex signatures.
my error. Complex signatures were screwing up my awk script.
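For anyone repeating this kind of check, a small field-count histogram (a sketch, not the awk script in question) makes ragged rows easy to spot, since any row with an unescaped separator in a free-text field shows up with the wrong count:

import collections
import sys

# Histogram of tab-separated field counts per line of a crashdata file.
# A healthy file shows a single count; ragged rows betray an unescaped
# separator somewhere in a free-text field.
counts = collections.Counter()
with open(sys.argv[1]) as f:
    for line in f:
        counts[line.rstrip("\n").count("\t") + 1] += 1

for n_fields, n_rows in sorted(counts.items()):
    print("%7d rows with %d fields" % (n_rows, n_fields))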
Not completed.

I have added Uptime, Addon_compat_override_setting and adu's for this version, in svn revision 1721

IP_ADDR is incompletely specified, but if I understand the request, you want the IP address where the browser lives. I don't have access to that data so far as I can tell.

topmost source-file and flash version are available in principle, but the effort was greater than the available time for Socorro 1.4.
(In reply to comment #19)
> 
> IP_ADDR is incompletely specified, but if I understand the request, you want
> the IP address where the browser lives. I don't have access to that data so far
> as I can tell.
>

yes,  IP_ADDR is the ip address of the crashing system.
 
I thought I ran across docs that said this was available, but now I can't find any.   It's not shown in http://code.google.com/p/socorro/wiki/schema
so what is the next step here?  can the updated script be handed off to aravind so the cron job can start running it?  having Uptime and Addon_compat_override_setting will help us open up some new areas of analysis.

it would be great if we could get a "top source file" and "flash version number" version of the script in place as soon as possible.  maybe independent of the 1.5 socorro release, since this isn't connected to a socorro update.

the adobe guys need the "flash version number" feature as soon as possible to help verify fixes they have made in flash 10.1.   we need the "top source file" feature so we can get each of the gecko module owners looking at top crashes in their area of the code.
This is all possible with some effort: we will need to either do more parsing while the processor has the information from the dump file, or reparse the dump file during the run of dailyUrl. It's not clear which option will be more feasible: the processor is already running close to its limit (recall the issue with throttling Firefox 3.6), so doing more work there is iffy. Similarly, the cost of reparsing the dump file seems excessive and redundant. Just adding the code to handle the ADUs made the dailyUrl script run 10 times as long. That's still quite reasonable so far (8.5 seconds before, 87 seconds now, on khan) since it runs once daily.
you make the call...  I've been a proponent of keeping the initial processing lean and mean so we can process as many incoming reports as we can, and then loading up on post processing.  I filed Bug 541873 - build capacity to process more crash reports - to get people thinking about beefing up post processing if needed.

There is a lot more beyond these simple tasks that I think we eventually want to do.  Most of the 22 other bugs I have on file mean we need to dig into the data, draw additional correlations, and do cross checking and run second passes of analysis...   https://bugzilla.mozilla.org/buglist.cgi?emailreporter2=1;emailtype2=substring;resolution=DUPLICATE;resolution=---;classification=Server%20Software;query_format=advanced;email2=chofmann%40gmail.com;component=Socorro;product=Webtools

others have bugs on file that also hint at the need for the same kinds of intensive post processing.
In general there is one decision to be made here.  Do we want to pursue further enhancements, reports, processing capability, storage, etc.. in the old system or are we going to depend on the Hadoop infrastructure that's coming up for this sort of stuff?  Every new report or every new enhancement we make adds additional load to the existing system which is barely keeping up.  Doing more on the current system means adding more resources to it.
We've built out a couple of knobs for Socorro:
* server side throttling
* number of processors

We haven't tweaked the number of processors lately. This system scales horizontally until we hit IO issues on disk.

Have we bumped the processors up lately and analyzed the performance characteristics of the system? Lars recently mentioned we've run up to 9 at a time.
(In reply to comment #25)
> We haven't tweaked the number of processors lately. This system scales
> horizontally until we hit IO issues on disk.

> Have we bumped the processors up lately and analyzed the performance
> characteristics of the system? Lars recently mentioned we've run up to 9 at a
> time.

We are hitting IO issues now.  We have never run the processors on more physical boxes than we do now.  It used to be only two boxes in the past.  What lars is talking about looked like more processors because I had limited each processor to one thread and fired off more of them to help with isolating problems in the past.  We have gotten past that and we currently run two processors on two boxes + 1 on the monitor box.  The load average on these boxes hovers around 4 to 5.  Increasing the number of processors can only decrease performance on these boxes.

Another solution is to add more processor boxes into the mix.  We could scale horizontally a little bit more, but we will run into I/O issues on the NFS store.
push remainder out to 1.5
Status: NEW → ASSIGNED
Target Milestone: 1.4 → 1.5
(In reply to comment #24)
> In general there is one decision to be made here.  Do we want to pursue further
> enhancements, reports, processing capability, storage, etc.. in the old system
> or are we going to depend on the Hadoop infrastructure that's coming up for
> this sort of stuff?  Every new report or every new enhancement we make adds
> additional load to the existing system which is barely keeping up.  Doing more
> on the current system means adding more resources to it.


I'm not sure the issue is so global for this specific request.  It definitely is for a broader set of enhancement requests.   For this one the trade-off is: can I get 87 seconds of processing time to get broader crashkilling by each module owner and engineering subgroup by providing the source file names, and to get a much better handle on which flash versions are involved in the 30% of our overall crash load.  Trying to analyze the flash crashes right now is like trying to analyze firefox crashes without the firefox version number.
and the 87 seconds I need each day happen at midnight, or at times when I/O issues are less likely to be critical.
(In reply to comment #29)
> and the 87 seconds I need each day happens at midnight, or some times when I/O
> issues are as likely to be critical.

My earlier comments weren't directed at this specific request - but more at this: "There is a lot more beyond these simple tasks that I think we eventually want to do.  Most of the 22 other bugs I have on file mean we need to dig into the data, draw additional correlations, and do cross checking and run second passes of analysis..."

Just pointing out that the existing setup can't scale without further investments.
@Frank: It looks like the dailyUrl script is failing consistently after a point - this could be a PostgreSQL timeout (even though the postgres guys swear that it shouldn't happen).  Here is what I see in the logs.

At the end of the output - 

2010-02-09 06:50:25,546 ERROR - MainThread Caught Error: psycopg2.ProgrammingError
2010-02-09 06:50:25,553 ERROR - server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

2010-02-09 06:50:25,561 INFO - trace back follows:
  File "/data/breakpad/processor/socorro/lib/psycopghelper.py", line 139, in cleanup
    aDatabaseConnectionPair[0].rollback()

2010-02-09 06:50:25,571 INFO - done.
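If the failure really is a dropped idle connection (the log alone doesn't prove that), one common mitigation is to enable TCP keepalives and stream the result set through a named server-side cursor instead of materializing one giant client-side result. A sketch, with a placeholder DSN:

import psycopg2

# keepalives* are standard libpq parameters; the DSN is a placeholder,
# not the real production connection string.
conn = psycopg2.connect(
    "dbname=breakpad user=socorro host=db.example.internal "
    "keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=5"
)

# A named cursor keeps the result set on the server and streams it to the
# client in itersize-row batches instead of building it all at once.
cur = conn.cursor(name="daily_url_export")
cur.itersize = 5000
cur.execute("SELECT * FROM reports WHERE date_processed >= %s",
            ("2010-02-08",))
for row in cur:
    pass  # write the row out to the .csv here
conn.rollback()  # read-only export: just release the transaction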
Target Milestone: 1.5 → 1.6
do the additional processing during the first pass, per meeting today
Bringing comment #13 to the bottom for easier visual retrieval:

new tab separated fields appended to the existing fields

 DONE Uptime
 CANNOT DO IP_ADDR
 DONE Addon_compat_override_setting --  Bug 521906
 DONE adu's-corresponding-to-release-version-listed in field #8

 top-most-sourcefile-listed-on-the-stack
 flash-version
hey charles,  

we are trying to get a system in place to dig all the flash version info out and put it in some reports to get better access to the version info associated with each crash.  

windows is looking to be straightforward.

with mac we were just thinking about trying to scrape the version info out of the new library/module names where they are available.  we started looking at the data and that might not work as easily as we thought.  see the report

http://crash-stats.mozilla.com/report/index/0ead1d21-11eb-4930-8f2e-e70f82100213

on the modules tab are two libraries listed

  Flash Player

and

  FlashPlayer-10.6

is this an outlier, or the standard way it should look?  Is this a case where two versions of flash might be loaded somehow?
This would be great and I can see where this can be a problem for 10.1.  So what you are seeing is a loader and then the actual plug-in.  

We currently have two plugins in a loader.  In there we have the 10.4+10.5 plugin and 10.6, which is why it looks like the player is loaded twice.

Since we released version 10.0.42.32 we export the build version information as a public symbol.  If it helps you can find it as FlashPlayer_[VERSION]
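So, assuming the module's public symbols come through as strings and [VERSION] uses dots or underscores (the exact separator isn't stated above), pulling the version out could be as simple as this hypothetical helper:

import re

# Match the exported FlashPlayer_[VERSION] public symbol; accept dots or
# underscores inside the version since the exact format isn't specified.
_FLASH_SYM = re.compile(r"FlashPlayer_([0-9][0-9._]*)")

def flash_version_from_symbols(symbols):
    for sym in symbols:
        m = _FLASH_SYM.search(sym)
        if m:
            return m.group(1).replace("_", ".").rstrip(".")
    return None

# flash_version_from_symbols(["FlashPlayer_10.0.42.32"]) -> '10.0.42.32'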
This is a bit of sql that will be needed. Putting it here to make sure we don't lose track. Note that the special case of reports_20090202 is just that: a special case.

-- Temporarily detach the special-case partition from the parent table.
alter table reports_20090202 no inherit reports;
-- Add the three new columns to the special case directly...
alter table reports_20090202
  add column topmost_filenames TEXT,
  add column addons_checked boolean,
  add column flash_version TEXT
;
-- ...and to the parent, which propagates them to the other inheriting
-- reports_* partitions.
alter table reports
  add column topmost_filenames TEXT,
  add column addons_checked boolean,
  add column flash_version TEXT
;

-- Reattach the special case now that its columns match the parent again.
alter table reports_20090202 inherit reports;
this is still on as part of the next socorro update, right?
According to the meeting yesterday, yes.
waiting for production push
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
looking at the .csv files from yesterday it appears the new fields are there, 

1 signature
2 url
3 uuid_url
4 client_crash_date
5 date_processed
6 last_crash
7 product
8 version
9 build
10 branch
11 os_name
12 os_version
13 cpu_name
14 address
15 bug_list
16 user_comments
17 uptime_seconds
18 email
19 adu_count
20 topmost_filenames
21 addons_checked
22 flash_version


but no useful data yet.  the processors were switched over at around 7-8pm, so we were expecting to see some data.

last reports in the files were time processed 201004082359 and time of crash 201004082359

the topmost_filenames field shows some blanks and some \N

awk -F'\t' '{print $20}' 20100408* | sort | uniq -c | sort -nr | more
271147 \N
65969 
   1 topmost_filenames


and flash version shows all \N

awk -F'\t' '{print $22}' 20100408* | sort | uniq -c | sort -nr | more
337116 \N
   1 flash_version
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I've been poring over frank's code on this and I've found some fatal flaws.  Those fields will never get populated.  we're going to have a 3.6.1 release and I'll get this fixed...  sorry about the trouble.
s/3.6.1/1.6.1/
Target Milestone: 1.6 → 1.6.1
Component: Socorro → General
Product: Webtools → Socorro
Hey Lars: what should be the resolution here?  The bug was still assigned to Frank so it's been rotting for a while.
passing the buck to Kairo: is there still a problem in the CSVs?
topmost_filenames and flash_version work, I'm using them for my custom reports. No idea about addons_checked, I never looked into it.
I'm not sure which CSV files this bug is talking about. I see some of these fields in a crash report, and some fields derived from these in the .json dump.

Is this resolved?
Assignee: griswolf → nobody
Flags: needinfo?(kairo)
Target Milestone: 1.6.1 → ---
these files are produced via https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/daily_url.py#L238

the files now contain:

1 signature
2 url
3 uuid_url
4 client_crash_date
5 date_processed
6 last_crash
7 product
8 version
9 build
10 branch
11 os_name
12 os_version
13 cpu_info
14 address
15 bug_list
16 user_comments
17 uptime_seconds
18 email
19 adu_count
20 topmost_filenames
21 addons_checked
22 flash_version
23 hangid
24 reason
25 process_type
26 app_notes
27 install_age
28 duplicate_of
29 release_channel
30 productid

productid is not empty, and neither is topmost_filenames, addons_checked, or flash_version. It looks like this is ok now.
Thanks bob!
Status: REOPENED → RESOLVED
Closed: 14 years ago → 11 years ago
Resolution: --- → FIXED
(In reply to Chris Lonnen :lonnen from comment #46)
> I'm not sure which CSV files this bug is talking about.

Just to clarify this: The CSV files are *-pub-crashdata.csv.gz in the dated directories of https://crash-analysis.mozilla.com/crash_analysis/ and are more or less daily dumps out of the reports table.
Flags: needinfo?(kairo)