Closed
Bug 626944
Opened 13 years ago
Closed 13 years ago
Socorro - hbaseResubmit cron missing on two collectors
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: lars, Unassigned)
Details
During the update of collectors from 1755 to 1756, jabba noticed that pm-app-collector04 and pm-app-collector06 had no hbaseResubmit cron job. There are thousands of crashes from all the way back to 20101007 waiting for submission. All the crashes from the recent hbase outages over the last week are also in there. We cannot just unleash them into the general pool until we understand the ramifications. Will old data suddenly appearing in old tables have serious effects on the Postgres migration that has already happened? I've deferred submitting an IT request until we've had a chance to discuss it. However, we should respond quickly in the morning.
Reporter
Updated•13 years ago
Severity: normal → critical
Lars, is this why the crash I just submitted comes up "Archived report can't be found" (as well as the crashes I've previously submitted in the last month)? http://crash-stats.mozilla.com/report/index/e54c77bc-2f78-46f7-b679-5a0fe2110118 is the one from about a half-hour ago.
Sorry, that first sentence should have included the phrase "or is that some other known bug, or do I need to file a new bug on the issue".
In the course of doing my weekly pre-meeting crash analysis tonight, I discovered that not a single crash--old, new, anything--was available via crash-stats, so I filed bug 626953 instead. Sorry for spamming this one ;)
Comment 4•13 years ago
[root@pm-app-collector04 local_failed_hbase_crashes]# du -hsx *
4.6G	20101004
1.1G	20101007
193M	20101008
2.2G	20101009
124K	20101011
27M	20101012
5.6G	20101013
424K	20101014
371M	20101015
6.2M	20101017
1.7G	20101018
918M	20101020
36K	20101021
584M	20101022
64K	20101024
2.4G	20101025
103M	20101026
1.7G	20101027
874M	20101029
76K	20101031
693M	20101101
581M	20101103
2.7G	20101104
34M	20101106
905M	20101107
433M	20101108
668M	20101110
296K	20101111
743M	20101112
188K	20101114
1.3G	20101115
13M	20101116
771M	20101117
592M	20101119
625M	20101121
1.2G	20101123
36K	20101124
1.8M	20101125
12M	20101126
2.4G	20101127
1.4M	20101129
554M	20101130
14M	20101203
5.1G	20101204
52K	20101205
1.4M	20101206
1.6G	20101207
64K	20101208
1.8G	20101210
72K	20101218
228K	20101222
456K	20101223
112K	20101224
628K	20110106
304K	20110107
124K	20110109
234M	20110110
6.2M	20110111
900K	20110113
1.1G	20110114
506M	20110115
472M	20110116
2.3G	20110118
[root@pm-app-collector04 local_failed_hbase_crashes]#

[root@pm-app-collector06 local_failed_hbase_crashes]# du -hsx *
1.1G	20101007
195M	20101008
2.2G	20101009
30M	20101012
5.7G	20101013
324K	20101014
354M	20101015
4.9M	20101017
1.7G	20101018
848M	20101020
613M	20101022
2.5G	20101025
96M	20101026
1.6G	20101027
380K	20101028
850M	20101029
552K	20101031
681M	20101101
100K	20101102
685M	20101103
2.4G	20101104
68K	20101105
18M	20101106
999M	20101107
450M	20101108
64K	20101109
672M	20101110
124K	20101111
675M	20101112
116K	20101114
1.3G	20101115
27M	20101116
768M	20101117
572M	20101119
630M	20101121
1.1G	20101123
568K	20101125
5.3M	20101126
2.1G	20101127
36K	20101128
844K	20101129
572M	20101130
30M	20101203
5.3G	20101204
36K	20101205
1.9M	20101206
1.5G	20101207
1.8G	20101210
64K	20101226
140K	20101231
336K	20110108
258M	20110110
576K	20110111
1.6M	20110113
1022M	20110114
412M	20110115
501M	20110116
1.7G	20110118
[root@pm-app-collector06 local_failed_hbase_crashes]#
Reporter
Comment 5•13 years ago
I want all the 2011 crashes submitted as soon as possible. This means moving the directories that represent 2010 data to a holding location. Then we set up the proper hbaseResubmit.py cron and let it feed on the 2011 crashes. I will file an IT request within the next few minutes. Then we can discuss the fate of the remaining 2010 crashes.
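For illustration only, the "move the 2010 directories to a holding location" step could be sketched roughly like this. The function name, the holding path argument and the cutoff default are hypothetical; the only thing taken from this bug is that the fallback storage has top-level directories named YYYYMMDD. This is not the procedure IT actually ran.

```python
import os
import shutil


def move_dirs_before(root, holding, cutoff="20110101"):
    """Move YYYYMMDD-named subdirectories of root dated before cutoff into holding.

    Hypothetical helper: root, holding and cutoff are illustrative parameters.
    Returns the list of directory names that were moved.
    """
    os.makedirs(holding, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(root)):
        src = os.path.join(root, name)
        # Only touch directories whose names look like dates, e.g. 20101204.
        if not (os.path.isdir(src) and len(name) == 8 and name.isdigit()):
            continue
        # Fixed-width digit strings compare in chronological order.
        if name < cutoff:
            shutil.move(src, os.path.join(holding, name))
            moved.append(name)
    return moved
```

Something like move_dirs_before("/opt/local_failed_hbase_crashes", some_holding_dir) would then leave only the 2011 date directories for the resubmit cron to pick up. The sketch does not check whether symbolic links inside the moved directories still resolve afterwards, so that would need verifying separately.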
Comment 6•13 years ago
[root@pm-app-collector01 ~]# find /opt/local_failed_hbase_crashes -name '*json' |wc -l
120093
[root@pm-app-collector01 ~]#

[root@pm-app-collector06 ~]# find /opt/local_failed_hbase_crashes -name '*json' |wc -l
506885
[root@pm-app-collector06 ~]#

Still waiting for results for collectors 02, 03, 04, 05
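If a per-date breakdown of those counts would help the discussion, a rough Python equivalent of the find | wc -l pipeline might look like the sketch below. The function name is hypothetical, and it assumes the top-level subdirectories are the date directories, as in the du listing for collectors 04 and 06.

```python
import fnmatch
import os
from collections import Counter


def count_json_by_date(root):
    """Count entries named '*json' under root, grouped by top-level subdirectory.

    Hypothetical helper. Like find -name, it matches directory names as well
    as file names and does not descend into symlinked directories.
    """
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = rel.split(os.sep)[0]  # '.' for entries directly under root
        for name in dirnames + filenames:
            if fnmatch.fnmatchcase(name, "*json"):
                counts[top] += 1
    return counts
```

Summing the values of the returned counter should give the same kind of total as the command above, and the individual keys show how much of it is 2010 versus 2011 data.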
Reporter
Comment 7•13 years ago
jabba reports that other collectors also have crashes hanging out in their fallback storage areas - I'm investigating. On pm-app-collector01, I'm seeing degenerate cases where the dump is missing. I'm also seeing crashes where the symbolic links from the date branch are missing. This would prevent the resubmitter from finding them. I theorize that these are ones that the resubmitter actually found once and they failed submission for some reason.
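A quick way to list the "dump is missing" cases could be something like the following. This is a sketch under a stated assumption: that each crash is stored as a name.json file with a sibling name.dump file in the same directory. That layout, the ".dump" suffix and the function name are assumptions not confirmed in this bug. It does not attempt to detect the missing date-branch symbolic links.

```python
import os


def find_json_without_dump(root, dump_suffix=".dump"):
    """Return sorted paths of *.json files that have no sibling dump file.

    Hypothetical helper: assumes <name>.json and <name><dump_suffix> live in
    the same directory, which is an assumption about the storage layout.
    """
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        names = set(filenames)
        for name in filenames:
            if name.endswith(".json"):
                stem = name[: -len(".json")]
                if stem + dump_suffix not in names:
                    missing.append(os.path.join(dirpath, name))
    return sorted(missing)
```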
Comment 8•13 years ago
[root@pm-app-collector04 ~]# find /opt/local_failed_hbase_crashes -name '*json' |wc -l
552287
[root@pm-app-collector04 ~]#
Comment 9•13 years ago
[root@pm-app-collector03 ~]# find /opt/local_failed_hbase_crashes -name '*json' |wc -l
180941
[root@pm-app-collector03 ~]#
Comment 10•13 years ago
[root@pm-app-collector02 ~]# find /opt/local_failed_hbase_crashes -name '*json' |wc -l
220182
[root@pm-app-collector02 ~]#
Comment 11•13 years ago
This is fixed (and actually irrelevant now, since SJC/hbaseresubmit is dead to us).
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Assignee
Updated•13 years ago
Component: Socorro → General
Product: Webtools → Socorro