Closed Bug 579136 Opened 12 years ago Closed 5 years ago

Duplicate crashes in Socorro [tracker]

Categories

(Socorro :: General, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: marcia, Assigned: laura)

References

(Blocks 1 open bug, )

Details

Attachments

(5 files, 3 obsolete files)

Seen whilst reviewing crash stats today.

I have seen several stacks which seem to not look right.

http://tinyurl.com/2d5ywqj is one example of the same crash which looks as if it was submitted 38 times. The URL is the same in each instance.

http://tinyurl.com/22rfuj3 is another example.

bc wonders if it is possible that either Soccoro is automatically submitting the same crash over and over or is processing the same report over and over. Either way I have not seen that much identical information being shown in my daily review of crash stats.
These are definitely dupes - the uptime is the same for each.

Did you see others?  

lars, xstevens: could this be related to our monitor problems of yesterday?
url, buildid, crashdate, time since last crash, install age are also identical. correlations are at 100%. These are essentially two identical sets of crash reports. The libnspr4 signature dates back to 20100703. The mozjs signature dates to 20100714.
http://tinyurl.com/27cb5l8 contains a set of crashes that all seem have to the same uptime as well.
http://tinyurl.com/22rfuj3 is from the Mac set that I caught, I can go back to the Windows machine if you need others.
while each of these crashes is essentially identical, they have there own uuid
and submission time.  That means that the collector is actually seeing these
arrive individually from the client.  The collector reads the data from the http POST from the client and then adds the uuid and submission time to it.   

The monitor only sees uuids and never the data behind them.  It just assigns uuids to processors.  It is very unlikely that yesterday's monitor trouble is related to this problem.

Could breakpad be submitting the same crash repeatedly?
assuming the duplicates only differ in uuid and date processed, counting the total lines in each crashdump file and then counting the unique lines after removing the uuid and the date processed can give a count of the actual submitted reports vs the unique reports for a given day. The difference between the two is the number of duplicate submissions for that day.

$ for f in *crashdata.csv.gz; do before=`gunzip -c  $f  | awk -F'\t' '{print $1, $2, $4, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22}'|wc -l`; after=`gunzip -c $f | awk -F'\t' '{print $1, $2, $4, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22}' | sort -u | wc -l`; echo "$f before $before, after $after"; done
20100626-crashdata.csv.gz before 487779, after 486501
20100627-crashdata.csv.gz before 419135, after 418172
20100628-crashdata.csv.gz before 344641, after 343559
20100629-crashdata.csv.gz before 301676, after 300666
20100630-crashdata.csv.gz before 287176, after 286346
20100701-crashdata.csv.gz before 270586, after 269851
20100702-crashdata.csv.gz before 258072, after 257209
20100703-crashdata.csv.gz before 230711, after 230097
20100704-crashdata.csv.gz before 235063, after 234426
20100705-crashdata.csv.gz before 263509, after 262917
20100706-crashdata.csv.gz before 268535, after 267941
20100707-crashdata.csv.gz before 328875, after 328025
20100708-crashdata.csv.gz before 336260, after 335340
20100709-crashdata.csv.gz before 272916, after 271960
20100710-crashdata.csv.gz before 246530, after 245750
20100711-crashdata.csv.gz before 255869, after 255241
20100712-crashdata.csv.gz before 285620, after 284851
20100713-crashdata.csv.gz before 283947, after 283092
20100714-crashdata.csv.gz before 272448, after 269337
Attached file duplicates 20090801-20100701 (obsolete) —
we also have documented dups of the hang reports.  see bug 568849
We are going to run a map reduce tomorrow that will count the number of duplicates broken down by date and product+buildid.  It will also count how many of those duplicates were throttled vs processed.
with access to the latest database, I've been doing some queries, trolling for duplicates.  I'm using sql that will group unlikely combinations of columns over the data from this week.

select
  install_age,
  last_crash,
  uptime,
  process_type,
  product,
  version,
  url,
  count(*) as total
from reports_20100712
group by
  1,2,3,4,5,6,7
having
  count(*) > 3
order by
  3 desc, 1, 2, 4;

This query will give all the repeats where all those columns in the select list are the same.  I sort by uptime descending because I want to see crashes that aren't immediately on restart first.  This query found 360 duplicates across all products.

Then I reran the query on a week's data from a year ago.   It found only 136. The number of times for each repeat were generally lower than the first dataset. Interestingly, only one had a url associated with it, while most had urls from the current dataset.  

I'm not sure yet what this means, but I thought I should report it before I left for the day.  I'll continue my investigation tomorrow.
we probably need to calibrate the dups against total reports received per day as well.  I have a bit of that data.

A year ago I think we were running 118k-180k processed reports per day, now we are running 240k-280k per day.

$ wc -l 2010071
  246530 20100710-crashdata.csv
  255869 20100711-crashdata.csv
  285620 20100712-crashdata.csv
  283947 20100713-crashdata.csv
  272448 20100714-crashdata.csv

$ wc -l 2009071*  
  186111 20090716-crashdata.csv
  181468 20090717-crashdata.csv
  155158 20090718-crashdata.csv
  118364 20090719-crashdata.csv
we might also be seeing this in some thunderbird crash sigs
We will be running a map reduce job in production that should get us the duplicate counts and other details for all products that are submitted by Socorro.
I'm nitpicking here and changing the title.  (Socorro doesn't submit anything.)

From Bob's numbers it looks like this issue has existed for at least a year, too.
Summary: Socorro may be automatically submitting the same crash over and over → Duplicate crashes in Socorro
For 7/14:
Browser BuildId InstallTime StartupTime CrashTime ProcessType legacy_0 legacy_1 total
Firefox 20100202165920 1266538977 1268937980 1268940564 null    16      1073    1089

The OOID 001060ae-c30f-4871-a1ef-c2a072100610 got duplicated 1089 times and only 16 of it got processed. 

Should we (metrics team) go ahead and generate statistics on ranges of dupes etc. ?
One data point doesn't tell me much (why am I telling the metrics team this? ;) )

More data would be good.
Work in progress, haven't fully audited the results, but they are basically tabulating the data aphadke already attached above.
Questions:
- What percentage of reports are duplicates?  (total, by date, by buildid)
- How are dupes different from non dupes?  Are they over-represented:
  - For hangs vs non hangs?
  - For a particular product?
  - For a particular build?
Assignee: nobody → aphadke
This image shows that most of our duplicates are due to startup crashes.  Also, we had a huge spike in them on 2010-07-14.  We need to run the job on subsequent days to see if that is a new trend or just a single spike.
Attachment #457671 - Attachment is obsolete: true
Attachment #458478 - Attachment is obsolete: true
Attachment #458482 - Attachment is obsolete: true
It looks like Firefox 3.6.6 Windows is generating significantly more duplicate startup crash submissions.
It looks like the startup crashes went back down from that second graph, too.

Anyone (chofmann? Tomcat? Marcia? LegNeato?) have any ideas why 3.6.6 on Windows might have given us all these duplicate startup crashes?  Seems like there might be an underlying Firefox bug.
Please note that currently, the number shown in the Y axis is a sum of all the duplicate submissions.  That does mean that 3.6.6 will be higher than the other builds since it has a significantly higher user population so there are more total submissions that might possibly be duplicated.

I'm trying to think of how we might be able to change this number to be normalized based on the population so we can tell if it is generating more duplicates per capita than previous builds.
PDF export of a cross tab that shows the following:

Avg. Duplicates, Duplicates, Fingerprints and Std. dev. of Duplicates broken down by BuildID vs. IsStartupCrash and SubmitDate. The data is filtered on IsUnique, which keeps 0. The view is filtered on SubmitDate and BuildID. The SubmitDate filter excludes 0. The BuildID filter has multiple members selected.

I still don't have an easy way to provide quartiles or export to a spreadsheet.  I'm going to CC the Portugal team to see if they can do that tomorrow morning.
should we close this ticket or assign it to someone other than me (aphadke)?
The full analysis view is now available on the metrics site:
https://metrics.mozilla.com/pentaho/Pivot?solution=metrics&path=&action=SocorroDuplicatesStudy.analysisview.xaction

You should be able to use that to explore any of the several dimensions included and either graph or export it to CSV.  If you have any trouble with it, please just let me know what you are trying to do and I'll be happy to help out.
Assignee: aphadke → laura
Severity: critical → normal
OS: Mac OS X → All
Hardware: x86 → All
Daniel, I can't view that page (get the error: This page cannot be accessed directly. It must be linked to from other pages)
duplicate crashes are actually getting in the way of diagnosing and getting traction on the introduction of new crashes on trunk development, pre-beta builds, and experimental branches. Daily volume of top 10-20 crashes on these builds is often around the range of 20-40 crashes today, so its easy for a duplicate to slip into the top crash list, and when it does we waste time trying to sort out the "real frequency" of the crash and the priority it should receive.

We need a way for this to be eliminated.  One possibility is surface some info that tells anyone analyzing the reports the number of possible duplicates in any given sample.

Daniel suggest watching time_since_last_crash and install_time as a possible way to quickly fingerprint duplicate reports.  That might be a good start.
We might need to expand this definition to improve the fingerprinting.

We should also spin off another bug to get a review of the client code to better understand error handling and transactions conditions between client and server where these dups might be occurring.

For example it looks like ted suggest returning the http status to the client to help it understand the state of the transfer.
http://mxr.mozilla.org/mozilla1.9.2/source/toolkit/crashreporter/client/crashreporter_win.cpp#432
http://code.google.com/p/google-breakpad/issues/detail?id=220

We could also look at the possibility that these are all "manual resubmissions" happening under the conditions in this code http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/client/crashreporter.cpp#312

Maybe the client could also forward info to indicate which crashes are "automatically submitted" and which are "manual."
(In reply to comment #29)
> For example it looks like ted suggest returning the http status to the client
> to help it understand the state of the transfer.
> http://mxr.mozilla.org/mozilla1.9.2/source/toolkit/crashreporter/client/crashreporter_win.cpp#432
> http://code.google.com/p/google-breakpad/issues/detail?id=220

This could be a possible root cause, if the collector returned an error status, but in fact had accepted the report, and then the user manually resubmitted it.

> 
> We could also look at the possibility that these are all "manual resubmissions"
> happening under the conditions in this code
> http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/client/crashreporter.cpp#312

This code is currently unused. We never rolled out that kind of throttling.


> Maybe the client could also forward info to indicate which crashes are
> "automatically submitted" and which are "manual."

We already do this. Crashes submitted by manually clicking in about:crashes send "Throttleable=0".
here is another area to look at for bugs in the client, or for fingerprinting on the server.

it looks like it might be a bit more complicated that just watching for matches on install_time and time_since_last_crash.

here is a sample of crashes for the signature gfxWindowsSurface::GetImageSurface() which is one of these signatures with a suspect early release cycle one-day-spikes. 

in these cases if you take install time minus time since last crash the possible duplicates begin to appear as a result of the client appearing to increment install time before resending.

in the set of reports below it looks like

#  install since itime        uptime
   time    last   - last_time
           crash

1. 6707   115   6592  103 
2. 7988    28   7960   10 
3. 8020    32   7988   20    -- 3 is a retransmission of 2, 32 seconds later
4. 8040    20   8020   12    -- 4 is a retransmission of 3, 20 seconds later
5. 8128    88   8040   28    -- 5 is a retransmission of 4, 88 seconds later
6. 8193    65   8128   59    -- 6 is a retransmission of 5
7. 8236    43   8193   36    -- 7 is a retransmission of 6
8. 8438   168   8270  134 
9. 8555   117   8438  112    -- 9 is a retransmission of 8

here is the set of reports used in this sample if we want to look deeper.

http://crash-stats.mozilla.com/report/index/4b7eaabd-2078-4da0-80b2-a5f8d2100921
http://crash-stats.mozilla.com/report/index/a974adf2-9661-4058-8259-e38cc2100921
http://crash-stats.mozilla.com/report/index/930553de-34fe-4325-891f-aeb562100921
http://crash-stats.mozilla.com/report/index/78b4cdc2-966e-44b1-a6a2-f39d32100921
http://crash-stats.mozilla.com/report/index/8e05cdd7-6634-46ce-b063-501292100921
http://crash-stats.mozilla.com/report/index/77f7d3d8-c06f-4436-924d-b7f1b2100921
http://crash-stats.mozilla.com/report/index/c488053c-1b4a-4532-bc05-7bc952100921
http://crash-stats.mozilla.com/report/index/867c8d99-0e14-4307-a57e-aa1802100921
http://crash-stats.mozilla.com/report/index/55a524a6-7552-4065-98ea-0932c2100921
this is also strange. processed time and client crash time doesn't seem to be advancing in synch with the incremental advance of install time and time since last crash.

one theory to explain this is that somehow the dups are queing on the client, with each report advancing time since install and time since last crash, then are submitted in fast succession and the server picks them up as unique reports.

another theory is that these really aren't dups, even though the majority of the reports share a lot in common when trying to fingerprint signature and various correlations and connects of time, os (Windows NT 5.1.2600 Service Pack 2), build id ( 20100921041551 ) , and crash address (0x116) as you can see in
http://crash-stats.mozilla.com/report/list?signature=gfxWindowsSurface::GetImageSurface%28%29
beyond proximity in summission time, version, branch, os, crash address, e-mail status  addon compat check status, and flash version 8 of these also have the same URL
that last list with urls and other fingerprinting info might support the theory that the original crash event happened around 201009211505 when the user hit that url a http://200.181.114.4, 

then a bunch of reports were dupped on the client, 

then 21 minutes later at 201009211526 the server began receiving and processing a flood of these duplicates over the next 4 minutes until 201009211530
Blocks: 598814
another pile of these kind of growing install age, but close proximity in submission time dups found in bug 598814

 201009210613 201009210612 5456 383 5073 101	 http://crash-stats.mozilla.com/report/index/528346a5-5eab-4298-afcd-8fc442100921
 201009210613 201009210613 5472 16 5456 0	 http://crash-stats.mozilla.com/report/index/81f4e24d-3c3e-4935-b42d-1f0e62100921
 201009210613 201009210613 5474 2 5472 0	 http://crash-stats.mozilla.com/report/index/27c92b06-b46e-4c8d-9ca5-028c52100921
 201009210613 201009210613 5476 2 5474 1	 http://crash-stats.mozilla.com/report/index/6699dc08-02b6-4460-ab9a-b2cd92100921
 201009210613 201009210613 5487 8 5479 0	 http://crash-stats.mozilla.com/report/index/3280a928-9d66-4f36-8fb5-19fc22100921
 201009210613 201009210613 5491 4 5487 0	 http://crash-stats.mozilla.com/report/index/2c9f3785-c1c9-47a9-99cf-c6dad2100921
 201009210613 201009210613 5509 7 5502 0	 http://crash-stats.mozilla.com/report/index/9e84e4b3-7c5f-4af5-b33b-756742100921
 201009210614 201009210614 5553 44 5509 0	 http://crash-stats.mozilla.com/report/index/c70b6784-3645-4b97-a0e7-f3bf32100921
 201009210614 201009210614 5563 10 5553 0	 http://crash-stats.mozilla.com/report/index/c169d43e-448c-498b-ac56-5594f2100921
 201009210614 201009210614 5567 4 5563 0	 http://crash-stats.mozilla.com/report/index/48ca6327-cb69-476f-a745-07cb02100921
 201009210614 201009210614 5568 1 5567 0	 http://crash-stats.mozilla.com/report/index/76fa154c-ecb3-446e-a6bb-d4a1d2100921
 201009210614 201009210614 5570 2 5568 0	 http://crash-stats.mozilla.com/report/index/0e8aaab6-b87c-4999-a34e-232892100921
 201009210614 201009210614 5572 2 5570 0	 http://crash-stats.mozilla.com/report/index/77615e63-e1bf-44a4-ad30-93c9c2100921
 201009210615 201009210614 5574 2 5572 0	 http://crash-stats.mozilla.com/report/index/141aaadf-5dfb-4b17-87cf-878b02100921
 201009210615 201009210614 5576 2 5574 0	 http://crash-stats.mozilla.com/report/index/665c79f8-1678-4d6c-b193-3c1642100921
 201009210615 201009210615 5579 3 5576 0	 http://crash-stats.mozilla.com/report/index/3bd76f2e-71f0-431f-9a57-cea1b2100921
 201009210615 201009210615 5584 5 5579 0	 http://crash-stats.mozilla.com/report/index/77cb4751-4d5a-4058-9f54-56ad82100921
 201009210615 201009210615 5588 4 5584 0	 http://crash-stats.mozilla.com/report/index/4267e52a-75d6-47b3-a195-55c582100921
 201009210615 201009210615 5589 1 5588 0	 http://crash-stats.mozilla.com/report/index/2ba77df1-4b4b-4f0d-9066-0fe002100921
 201009210615 201009210615 5590 1 5589 0	 http://crash-stats.mozilla.com/report/index/bb83854c-5e5f-40ea-9ae0-6b12b2100921
 201009210615 201009210615 5592 2 5590 0	 http://crash-stats.mozilla.com/report/index/c81c381a-b32d-4024-a8f3-f409d2100921
 201009210615 201009210615 5594 2 5592 0	 http://crash-stats.mozilla.com/report/index/5364e724-2ce6-4a22-9d95-515b02100921
 201009210615 201009210615 5595 1 5594 0	 http://crash-stats.mozilla.com/report/index/4f067d44-8d09-40c4-816b-0a6a82100921
Blocks: 602102
Blocks: 597960
(In reply to comment #36)
> From the TB guys:
> http://crash-stats.mozilla.com/report/list?range_value=2&range_unit=weeks&signature=JS_InternString&version=Thunderbird%3A3.1.4

This spike began and ended on 9/26/2010. JS_InternString is now back at nominal (low) levels
Blocks: 606083
Blocks: 608378
Blocks: 609173
Blocks: 616675
Depends on: 617212
During the last week I have noticed instances of these dupe crashes showing up more regularly.  Here are some examples:

http://tinyurl.com/2ebd27q
http://tinyurl.com/3a8k9eb
Blocks: 618293
I want to repeat my comment 29.

duplicate crashes are actually getting in the way of diagnosing and getting
traction on the introduction of new crashes on trunk development, pre-beta
builds, and experimental branches.  

the last dependency I added with Bug 618293 shows 71 of these duplicate reports being injected within a 1 hour time frame by a single user on firefox 4 b8pre.

I think this is our #1 problem.  Its bloating our storage requirements, throwing off stability metrics, and wasting crash analysis cycles.   we should get together at the on-site next week and figure out how to figure out whats going on and then reduce and/or remove this problem on the client or server or both.    laura/ted/daniel, are you up for that?
In my automated testing, I see cases where several dump files are all created with the same signature but with slightly different times. In at least some of the cases, the browser crashes, then restarts and crashes again immediately sometimes creating up to 6 dump files before it gives up the ghost.

A recent example:

    StartupTime = 1291929817
    InstallTime = 1291895125
    SecondsSinceLastCrash = 2769811
    CrashTime = 1291929838

    StartupTime = 1291929854
    InstallTime = 1291895125
    SecondsSinceLastCrash = 12
    CrashTime = 1291929865

    StartupTime = 1291929838
    InstallTime = 1291895125
    SecondsSinceLastCrash = 15
    CrashTime = 1291929853
(In reply to comment #40)

> I think this is our #1 problem.  Its bloating our storage requirements,
> throwing off stability metrics, and wasting crash analysis cycles.   we should
> get together at the on-site next week and figure out how to figure out whats
> going on and then reduce and/or remove this problem on the client or server or
> both.    laura/ted/daniel, are you up for that?

Yes, for sure.  Lars should be in too.  Send us a Zimbra invite, please?
we had a good meeting and agree this should rise in the priority ranking among tasks.

first step was to identify the possible several different dup scenarios that we might be seeing, the figure out how to either mark or remove the dups depending on how well the exact problem is understood.

-breakpad misbehavior/bug
  probably we want to remove these but will need instrumentation to figure out more.  
  we could track/diff crash upload filenames/contents
  add a crash counter and include that in the submission to makes dups easier to spot.
  we can also try injecting the fix suggested in comment 30.

-user resubmission
   could be a contributor on plugin reload, or as users hit session restore
   probably want to mark these to allow analysis and possible improvements to 
   how these two features work.

-malware
   could cause continue crashes with small variations in the stacks.  same with user resubmissions.  this will make dup associations a bit harder.


The other topic we covered was to figure out how to start some dup associations and get these surfaced in the web UI and maybe .csv files so we can use that information to reduce wasted time in crash analysis.

The most likely candidate seem to be to figure out how to surface install time info.  right now we toss install time info after converting it to "install age".

we need to figure out if install age can converted back to install time in the web interface. 

maybe we should spin off another bug for that task and keep this as the tracking bug for all activities.
bug 619983  requests adding install age/install time to the signature listings.
Blocks: 619983
Blocks: 574711
Blocks: 624797
Blocks: 627579
Blocks: 628104
(In reply to comment #30)
> (In reply to comment #29)
> > For example it looks like ted suggest returning the http status to the client
> > to help it understand the state of the transfer.
> > http://mxr.mozilla.org/mozilla1.9.2/source/toolkit/crashreporter/client/crashreporter_win.cpp#432
> > http://code.google.com/p/google-breakpad/issues/detail?id=220
> 
> This could be a possible root cause, if the collector returned an error status,
> but in fact had accepted the report, and then the user manually resubmitted it.

Should we spin off another bug, or couple of bugs to return status and check it on the client?
Severity: normal → major
Target Milestone: --- → 1.7.7
Targeting for next release.

I think we should spin off a few smaller bugs.  Let's address that in the triage meeting.
URL: 628816
Summary: Duplicate crashes in Socorro → Duplicate crashes in Socorro [tracker]
Depends on: 628816
Depends on: 629086
Depends on: 629088
We noticed a pretty large duplicate spike when looking at early B11 crash data: http://tinyurl.com/5u57jdp appears to be all reports from the same person. Should we track this issue here or in a separate bug?
set up another bug, and add it to the blocking dependency list.
Depends on: 632618
No longer depends on: 632618
Blocks: 632618
Blocks: 632785
Blocks: 635209
Blocks: 617212
No longer depends on: 617212
All the 1.7.7 parts of this are done.  Remaining is a 1.7.8 bug, so bumping the tracker to 1.7.8.
Target Milestone: 1.7.7 → 1.7.8
an additional fingerprinting thing we could do is something around the module list.

One simple and inexpensive thing would be to calculate the number of modules loaded in the module list and display that and use it in the dup marking.

I'm seeing enough of variability in that number that I think it could be useful in sorting out "retransmission of the same report", kind of dups.
in addition to the module count, we could do simple checksums on the module list, and the stack.
Depends on: 651541
Depends on: 651543
Parts of this tracker are done, and the other dependencies are slated for 2.0, so pushing this to 2.0 also.
Target Milestone: 1.7.8 → 2.0
Dependent bugs pushed out already
Target Milestone: 2.0 → 2.2
Target Milestone: 2.2 → 2.3
Target Milestone: 2.3 → 2.4
Target Milestone: 2.4 → ---
Component: Socorro → General
Product: Webtools → Socorro
We have some mitigation mechanisms that have been put in place so I'm going to call this fixed.

We are moving forward with crash pings as the source of record for crash rates which obviates any outstanding scope beyond what is fixed.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.