Closed Bug 570478 Opened 15 years ago Closed 15 years ago

Breakpad reports "[Errno 24] Too many open files" in Processor Notes

Categories: Socorro :: General, task
Platform: x86 macOS
Type: task
Priority: Not set
Severity: blocker
Tracking: (Not tracked)
Status: RESOLVED WONTFIX
People: (Reporter: Gijs, Assigned: lars)
Attachments: (1 file)

e.g.: http://crash-stats.mozilla.com/report/index/bp-faa65b2b-f806-48f9-8e18-dfd092100607

http://crash-stats.mozilla.com/report/index/3da761ee-1ffb-487a-858a-b89862100607

Not sure what other info is relevant. Both are Fx 3.6.3 reports, one from Windows, one from Mac. If you need more info, please just ask and I'll do my best.
Whatever's broken here, it sure doesn't sound good.
Severity: normal → blocker
Lars, can you take a look?

I also note that the top crash for 3.6.3 is "no signature", which doesn't seem right either:
http://crash-stats.mozilla.com/products/Firefox/versions/3.6.3
Assignee: nobody → lars
This problem seemed to affect only one of the five processors, ruining the processing for 115802 crashes since last Friday morning.  That's as far back as the rotating logs go, so we can assume that the problem started sometime earlier than that.

There is no obvious cause for this problem: the file handling code in the processor has not changed in over a year and has not been problematic before. All file handling code is protected within try:finally: blocks. Because v1.8 of the processor replaces NFS file handling with HBase, I'm not sure there is much value in trying to track down this subtle problem.
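
The pattern in question looks roughly like this (a minimal sketch of the idiom, not the actual processor code; the function and path names are hypothetical):

    def read_dump(dump_path):
        # read a minidump, releasing the handle even if the read raises
        dump_file = open(dump_path, 'rb')
        try:
            return dump_file.read()
        finally:
            # runs whether read() succeeds or raises, so no handle leaks here
            dump_file.close()

With every open paired with a close in a finally clause, a descriptor leak would have to come from somewhere less obvious, such as a handle opened outside such a block.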

I've submitted an IT request to monitor the processor.  If the string "ERROR - [Errno 24] Too many open files" appears in the logs, someone will be notified immediately to take that processor offline.
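
For illustration, the check could be as simple as this sketch (a hypothetical script, not the actual IT monitoring setup):

    import sys

    PATTERN = "ERROR - [Errno 24] Too many open files"

    def log_has_fd_exhaustion(log_path):
        # scan a processor log for the fatal errno 24 string
        with open(log_path, 'r', errors='replace') as log_file:
            return any(PATTERN in line for line in log_file)

    if __name__ == '__main__':
        if log_has_fd_exhaustion(sys.argv[1]):
            sys.exit(1)  # non-zero exit: alert someone to take the processor offline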

As for the frequency of "no signature" crashes, that is an unrelated problem.  Go to the link supplied in comment #2, then click on any crash.  Then click on the "raw crash" tab and notice line 3, beginning with "Crash".  There should be an integer at the end of that line indicating which thread caused the crash.  If it is empty, the processor cannot create a signature because it doesn't know which thread crashed.  The UI, however, shows the stack for the crashing thread: that is an error in the UI (see Bug 528578).
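
The check amounts to something like this sketch, assuming the raw crash is the pipe-delimited breakpad stackwalk output (e.g. "Crash|SIGSEGV|0x0|0", where the final field is the crashing thread number):

    def crashing_thread(raw_crash_lines):
        # return the crashing thread index, or None if breakpad left it blank
        for line in raw_crash_lines:
            if line.startswith("Crash|"):
                last_field = line.rstrip().split("|")[-1]
                return int(last_field) if last_field else None
        return None

When this comes back None, the processor has no thread to build a signature from.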

Interestingly, the processor is supposed to make a note of this "unknown crashing thread" in the processor notes section.  That message is missing from the UI.  I'm investigating that under another bug: Bug 570516.
The WONTFIX label on this sounds extreme; however, we can reopen if it recurs.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX
Did we restart, or take that 5th processor offline, to stop the problem?

Does anyone recall what time any adjustments were made?

Looking for any connections to Bug 571118.

I guess it would also be really strange if a high percentage of the Firefox 3.6.4 crashes had been routed through that particular processor, but that would explain what looks like a jump in 3.6.4 crash volume starting around 2010-06-07 16:00 PDT.
The problematic processor was restarted.  If its resurrection is the cause of a jump, then there should be a corresponding drop four or five days earlier when the problem began.
(In reply to comment #6)
> Did we restart, or take that 5th processor offline, to stop the problem?
> 
> Does anyone recall what time any adjustments were made?
> 

I bumped up the maxfiles to 8 times the current value on all the processors and restarted them.  We also have a bug on file to monitor the processors for this problem.
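
In-process, that amounts to roughly the following (illustrative only; the actual change was made at the OS level, outside Python):

    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = soft * 8
    if hard != resource.RLIM_INFINITY:
        new_soft = min(new_soft, hard)  # the soft limit may not exceed the hard limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    print("open-file limit raised from %d to %d" % (soft, new_soft))
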
Looking through the IRC logs, it appears that aravind restarted the errant processor sometime between 9:50 AM and 10:00 AM PDT.
Attached image: hourly crash volume June 1-9
(In reply to comment #7)
> The problematic processor was restarted.  If its resurrection is the cause of a
> jump, then there should be a corresponding drop four or five days earlier when
> the problem began.

There was a reduced number of overall processed crashes per hour June 2-6, but some of that could be explained by the normal weekend reduction in submissions.

June 7-9 set new all-time highs for hourly peaks, and June 8-9 for total crashes processed.

total reports:

  378282 20100601-crashdata.csv
  379287 20100602-crashdata.csv
  332924 20100603-crashdata.csv
  333114 20100604-crashdata.csv
  304221 20100605-crashdata.csv
  327009 20100606-crashdata.csv
  378025 20100607-crashdata.csv
  402816 20100608-crashdata.csv
  397863 20100609-crashdata.csv
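
For what it's worth, the day-over-day change works out as follows (a quick sketch over the totals above):

    counts = {
        "20100601": 378282, "20100602": 379287, "20100603": 332924,
        "20100604": 333114, "20100605": 304221, "20100606": 327009,
        "20100607": 378025, "20100608": 402816, "20100609": 397863,
    }

    previous = None
    for day in sorted(counts):
        total = counts[day]
        if previous is None:
            print("%s: %d" % (day, total))
        else:
            change = 100.0 * (total - previous) / previous
            print("%s: %d (%+.1f%%)" % (day, total, change))
        previous = total

The June 6 to June 7 step is about +15.6%, with June 8 up another +6.6%.
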
Bug 560628 "ADU daily UI changes for hang detection" is coming online today as part of the Socorro 1.7 upgrade.  That will be another data point.

There was one new crash spiking on 3.6.4 (and other releases) during the period when the uptick happened: Bug 570722 "Sharp rise in crashes [@ SogouPy.ime ]".  If anyone spots others, post here.

Also, http://people.mozilla.com/~chofmann/crash-stats/20100609/crash-counts.txt is posted now and shows roughly the same ADU, crash counts, and crashes per 100 users on June 9 as on June 8.
Component: Socorro → General
Product: Webtools → Socorro