Bug 834403
Opened 13 years ago
Closed 12 years ago
Crashes backed up on collectors since Tuesday 1/22, and collectors not catching up
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: laura, Unassigned)
Attachments
(1 file)
2.44 KB,
text/plain
collector04 is behaving normally.
All the others are running extremely slowly and experiencing many HBase connection failures, too. collector01 has nearly 200k crashes backed up on disk.
I also note that the crashes are all in date/[date]/00/00 instead of being distributed over subdirectories, which is surely not helping.
The problem appears to have started around 7am PT on Tuesday 1/22.
Current status: we are instrumenting a collector (01) to get more information.
Comment 1•13 years ago
Still debugging the issue.
I recommend that we come up with two new monitors:
* "acceptable number of timeouts when talking to hbase" - in the past this seems to stay below 10-15/hr
* "how many files we have per directory" on each collector
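A minimal sketch of what the two proposed checks could look like. The function names, the storage root and the thresholds are hypothetical and not taken from Socorro or its monitoring setup; only the 10-15/hr figure comes from the comment above.

```python
import os
from datetime import datetime, timedelta


def overfull_directories(root, max_files):
    """Return {directory: file_count} for every directory under root
    that holds more than max_files regular entries (hypothetical limit)."""
    overfull = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if len(filenames) > max_files:
            overfull[dirpath] = len(filenames)
    return overfull


def timeouts_in_last_hour(timeout_times, now):
    """Count timeout events whose timestamp falls within the hour before now.

    timeout_times is assumed to be an iterable of datetime objects parsed
    from the collector logs; the log parsing itself is not shown here."""
    cutoff = now - timedelta(hours=1)
    return sum(1 for t in timeout_times if cutoff < t <= now)


def hbase_timeouts_acceptable(timeout_times, now, threshold=15):
    """True while the hourly timeout count stays at or below the threshold.

    The default of 15 is the upper end of the 10-15/hr range seen in the past."""
    return timeouts_in_last_hour(timeout_times, now) <= threshold
```

A monitoring job could call `overfull_directories()` on each collector's crash store and alert when the result is non-empty, and call `hbase_timeouts_acceptable()` with the timestamps it has collected for the past hour.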
Comment 2•13 years ago
This is running on collector01. Just eyeballing the times, it looks like removing files from the filesystem is taking the longest by far.
Comment 3•13 years ago
Update:
:jakem has gotten all collectors back online with fresh new primaryCrashStore directories.
We took sp-collector03 out of the Zeus pool and are processing its old crashes. This is proceeding at a rate of about 2k crashes per minute, against a backlog of about 490k.
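As a rough estimate from the figures above, assuming the rate stays constant and no new crashes arrive on the node while it is out of the pool (the function name is illustrative):

```python
def drain_minutes(backlog, rate_per_minute):
    """Minutes needed to clear a backlog at a constant processing rate."""
    return backlog / rate_per_minute


minutes = drain_minutes(490_000, 2_000)
hours, remainder = divmod(minutes, 60)
print(f"{minutes:.0f} min (~{int(hours)} h {int(remainder)} min)")
# prints "245 min (~4 h 5 min)"
```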
Comment 4•13 years ago (Reporter)
More info: we rolled collectors and crash movers back to Socorro 33 (pre-multi-dump). We will need to roll them forward again once the storage bug is fixed.
Comment 5•13 years ago (Reporter)
I should also point out that once we complete backlog processing, we will need to backfill the days 1/22-1/24.
Comment 6•13 years ago
sp-collector03 has completed its backlog. :jakem is starting up backlog processing on sp-collector02.
Comment 7•13 years ago
Another note: the Socorro team needs sudo access as the apache/socorro user on the sp-collectorXX nodes.
Comment 8•13 years ago
sp-collector02 is completed. sp-collector06 was already completed earlier. 01 is underway.
Comment 9•13 years ago
01 and 04 are completed. 05 is underway... but proving more problematic than the others did.
Comment 10•13 years ago (Reporter)
Where did we get to? Still need to run backfill after this completes, too.
Comment 11•13 years ago
:mpressman is running backfill now per IRC.
Comment 12•13 years ago
Backfill was complete by 1/29.
Updated•12 years ago
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED