Closed Bug 626944 Opened 13 years ago Closed 13 years ago

Socorro - hbaseResubmit cron missing on two collectors

Categories

(Socorro :: General, task)

x86_64
Linux
task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lars, Unassigned)

Details

During the update of collectors from 1755 to 1756, jabba noticed that pm-app-collector04 and pm-app-collector06 had no hbaseResubmit cron job.  There are thousands of crashes from all the way back to 20101007 waiting for submission.  All the crashes from the recent hbase outages over the last week are also in there.

We cannot just unleash them into the general pool until we understand the ramifications.  Will old data suddenly appearing in old tables have serious effects on the Postgres migration that has already happened?

I've deferred submitting an IT request until we've had a chance to discuss it.  However, we should respond quickly in the morning.
Severity: normal → critical
Lars, is this why the crash I just submitted comes up "Archived report can't be found" (as well as the crashes I've previously submitted in the last month)?

http://crash-stats.mozilla.com/report/index/e54c77bc-2f78-46f7-b679-5a0fe2110118 is the one from about a half-hour ago.
Sorry, that first sentence should have included the phrase "or is that some other known bug, or do I need to file a new bug on the issue".
In the course of doing my weekly pre-meeting crash analysis tonight, I discovered that not a single crash--old, new, anything--was available via crash-stats, so I filed bug 626953 instead.  Sorry for spamming this one ;)
[root@pm-app-collector04 local_failed_hbase_crashes]# du -hsx *
4.6G    20101004
1.1G    20101007
193M    20101008
2.2G    20101009
124K    20101011
27M     20101012
5.6G    20101013
424K    20101014
371M    20101015
6.2M    20101017
1.7G    20101018
918M    20101020
36K     20101021
584M    20101022
64K     20101024
2.4G    20101025
103M    20101026
1.7G    20101027
874M    20101029
76K     20101031
693M    20101101
581M    20101103
2.7G    20101104
34M     20101106
905M    20101107
433M    20101108
668M    20101110
296K    20101111
743M    20101112
188K    20101114
1.3G    20101115
13M     20101116
771M    20101117
592M    20101119
625M    20101121
1.2G    20101123
36K     20101124
1.8M    20101125
12M     20101126
2.4G    20101127
1.4M    20101129
554M    20101130
14M     20101203
5.1G    20101204
52K     20101205
1.4M    20101206
1.6G    20101207
64K     20101208
1.8G    20101210
72K     20101218
228K    20101222
456K    20101223
112K    20101224
628K    20110106
304K    20110107
124K    20110109
234M    20110110
6.2M    20110111
900K    20110113
1.1G    20110114
506M    20110115
472M    20110116
2.3G    20110118
[root@pm-app-collector04 local_failed_hbase_crashes]# 


[root@pm-app-collector06 local_failed_hbase_crashes]# du -hsx *
1.1G    20101007
195M    20101008
2.2G    20101009
30M     20101012
5.7G    20101013
324K    20101014
354M    20101015
4.9M    20101017
1.7G    20101018
848M    20101020
613M    20101022
2.5G    20101025
96M     20101026
1.6G    20101027
380K    20101028
850M    20101029
552K    20101031
681M    20101101
100K    20101102
685M    20101103
2.4G    20101104
68K     20101105
18M     20101106
999M    20101107
450M    20101108
64K     20101109
672M    20101110
124K    20101111
675M    20101112
116K    20101114
1.3G    20101115
27M     20101116
768M    20101117
572M    20101119
630M    20101121
1.1G    20101123
568K    20101125
5.3M    20101126
2.1G    20101127
36K     20101128
844K    20101129
572M    20101130
30M     20101203
5.3G    20101204
36K     20101205
1.9M    20101206
1.5G    20101207
1.8G    20101210
64K     20101226
140K    20101231
336K    20110108
258M    20110110
576K    20110111
1.6M    20110113
1022M   20110114
412M    20110115
501M    20110116
1.7G    20110118
[root@pm-app-collector06 local_failed_hbase_crashes]#
I want all the 2011 crashes submitted as soon as possible.  This means moving the directories that represent 2010 data to a holding location.  Then we set up the proper hbaseResubmit.py cron and let it feed on the 2011 crashes.  I will file an IT request within the next few minutes.

Then we can discuss the fate of the remaining 2010 crashes.
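
A minimal sketch of the "move 2010 to a holding location" step, assuming the fallback storage root contains one directory per day named YYYYMMDD, as in the du listings above. The function names, the holding path, and the dry-run default are hypothetical and not part of Socorro; the cutoff of 20110101 simply separates 2010 from 2011.

```python
import os
import shutil


def dirs_before(root, cutoff):
    """Return the YYYYMMDD-named subdirectories of root dated before cutoff."""
    old = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        # Only consider directories named as eight digits, e.g. 20101007.
        if os.path.isdir(path) and len(name) == 8 and name.isdigit():
            # Fixed-width digit strings sort the same way as the dates do.
            if name < cutoff:
                old.append(name)
    return old


def move_to_holding(root, holding, cutoff="20110101", dry_run=True):
    """Move date directories older than cutoff from root into holding.

    With dry_run=True nothing is moved; the names that would be moved
    are returned so they can be reviewed first.
    """
    moved = dirs_before(root, cutoff)
    if not dry_run:
        os.makedirs(holding, exist_ok=True)
        for name in moved:
            shutil.move(os.path.join(root, name), os.path.join(holding, name))
    return moved
```

Running it once with dry_run=True and comparing the result against the du output would be a cheap check before anything is actually moved.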
[root@pm-app-collector01 ~]# find /opt/local_failed_hbase_crashes -name '*json' |wc -l
120093
[root@pm-app-collector01 ~]# 

[root@pm-app-collector06 ~]# find /opt/local_failed_hbase_crashes -name '*json' |wc -l
506885
[root@pm-app-collector06 ~]# 

Still waiting for results for collectors 02, 03, 04, 05
jabba reports that other collectors also have crashes hanging out in their fallback storage areas - I'm investigating.  

On pm-app-collector01, I'm seeing degenerate cases where the dump is missing.  I'm also seeing crashes where the symbolic links from the date branch are missing.  This would prevent the resubmitter from finding them.  I theorize that these are ones that the resubmitter actually found once and that then failed submission for some reason.
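
A rough way to list the "dump is missing" cases, sketched under the assumption that each crash is stored as a <name>.json file with a sibling <name>.dump file in the same directory. That layout and the function name are assumptions for illustration, not confirmed from the Socorro source. The sketch does not try to detect absent date-branch symlinks, since that would need the real directory layout.

```python
import os


def find_missing_dumps(root):
    """Return paths of .json files that have no sibling .dump file."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".json"):
                continue
            # Assumed naming: foo.json pairs with foo.dump.
            dump_path = os.path.join(dirpath, name[: -len(".json")] + ".dump")
            if not os.path.isfile(dump_path):
                missing.append(os.path.join(dirpath, name))
    return sorted(missing)
```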
[root@pm-app-collector04 ~]# find /opt/local_failed_hbase_crashes -name '*json' |wc -l
552287
[root@pm-app-collector04 ~]#
[root@pm-app-collector03 ~]# find /opt/local_failed_hbase_crashes -name '*json' |wc -l
180941
[root@pm-app-collector03 ~]#
[root@pm-app-collector02 ~]# find /opt/local_failed_hbase_crashes -name '*json' |wc -l
220182
[root@pm-app-collector02 ~]#
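
For deciding what to do with the 2010 backlog, a per-day breakdown may be more useful than the single totals above. The sketch below roughly mirrors find ... -name '*json' | wc -l, but groups the count by top-level date directory. The function name is hypothetical, and unlike find it counts only files, not directories whose names end in "json".

```python
import os


def count_json_by_date(root):
    """Map each top-level directory under root to its count of *json files."""
    counts = {}
    for date in sorted(os.listdir(root)):
        top = os.path.join(root, date)
        if not os.path.isdir(top):
            continue
        total = 0
        # os.walk does not follow symlinked directories by default.
        for _dirpath, _dirnames, filenames in os.walk(top):
            total += sum(1 for f in filenames if f.endswith("json"))
        counts[date] = total
    return counts
```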
This is fixed (and actually now irrelevant, since SJC/hbaseresubmit is dead to us).
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro