Closed Bug 834403 Opened 13 years ago Closed 12 years ago

Crashes backed up on collectors since Tuesday 1/22, and collectors not catching up

Categories

(Socorro :: General, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: laura, Unassigned)

Attachments

(1 file)

collector04 is behaving normally. All the others are running extremely slowly and are also experiencing many HBase connection failures. collector01 has nearly 200k crashes backed up on disk. I also note that the crashes are all in date/[date]/00/00 instead of being distributed over subdirectories, which is surely not helping. The problem appears to have started around 7am PT on Tuesday 1/22. Current status: instrumenting a collector (01) to get more information.
Still debugging the issue. I recommend that we come up with two new monitors:
* "acceptable number of timeouts when talking to hbase" - in the past this seems to stay below 10-15/hr
* "how many files we have per directory" on each collector
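A rough sketch of what the "files per directory" monitor could look like, assuming a plain walk of the crash storage tree is acceptable on a collector. The function names and the threshold are hypothetical and are not part of Socorro.

```python
import os


def count_files_per_directory(root):
    """Map every directory under root to its number of non-directory entries."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        counts[dirpath] = len(filenames)
    return counts


def find_overfull_directories(root, max_files):
    """Return (path, count) pairs, largest first, for directories
    holding more than max_files entries. max_files is an assumed threshold."""
    counts = count_files_per_directory(root)
    overfull = [(path, n) for path, n in counts.items() if n > max_files]
    return sorted(overfull, key=lambda item: item[1], reverse=True)
```

A monitor could call find_overfull_directories() on the collector's storage root on a schedule and alert when the result is non-empty. Walking a directory that already holds hundreds of thousands of files is itself slow, so the check interval would need to be chosen with that in mind.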
This is running on collector01. Just eyeballing the times, it looks like removing files from the filesystem is taking the longest by far.
Update: :jakem has gotten all collectors back online with fresh primaryCrashStore directories. We took sp-collector03 out of the Zeus pool and we are processing the old crashes. This is proceeding at a rate of about 2k crashes per minute, on a backlog of about 490k.
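For a back-of-the-envelope estimate of how long that backlog takes, here is a minimal sketch. It assumes the 2k/minute rate stays constant and applies to the whole 490k backlog, neither of which is confirmed here. The function name is hypothetical.

```python
def backlog_eta_minutes(backlog, rate_per_minute):
    """Estimate minutes needed to drain a backlog at a constant rate."""
    if rate_per_minute <= 0:
        raise ValueError("rate_per_minute must be positive")
    return backlog / rate_per_minute


# Figures taken from the update above; constant throughput is an assumption.
eta = backlog_eta_minutes(490_000, 2_000)
print(f"{eta:.0f} minutes (~{eta / 60:.1f} hours)")  # 245 minutes (~4.1 hours)
```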
More info: we rolled collectors and crash movers back to Socorro 33 (pre-multi-dump). We will need to roll them forward again once the storage bug is fixed.
I should also point out that once we complete backlog processing, we will need to backfill the days 1/22-1/24.
sp-collector03 has completed backlog. :jakem is starting up processing for backlog on sp-collector02.
Another note - socorro team needs sudo access as apache/socorro user on sp-collectorXX nodes.
sp-collector02 is completed. sp-collector06 was already completed earlier. 01 is underway.
01 and 04 are completed. 05 is underway... but proving more problematic than the others did.
Where did we get to? Still need to run backfill after this completes, too.
:mpressman is running backfill now per IRC.
Backfill was complete by 1/29.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED