Closed Bug 604081 Opened 15 years ago Closed 12 years ago

Socorro - collector and nagios

Categories

(Socorro :: General, task)

x86_64
Linux
task
Not set
major

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: lars, Unassigned)

Details

On reading the code of the collector nagios check for the first time, I'm concerned that it may be telling us of a deeper collector problem than we realized. The collector, if it encounters HBase trouble, should put crashes into fallback storage and then return a normal ooid. From the standpoint of the submitter, there should be no difference between a regular successful HBase storage and a fallback storage. On getting a warning from nagios about the collector, I think it's saying that both the HBase _and_ fallback storage failed. At least that's what I get from reading the nagios check. This warrants investigation.

The nagios collector code:

#! /bin/sh
PATH=$PATH:$HOME/bin:~/python_extras/bin
export PATH
PYTHONPATH=~/python_extras/lib:/opt/processor:/usr/lib/python2.4/site-packages
export PYTHONPATH

UUID=`python /opt/processor/socorro/collector/submitter.py -u $1 -h crash-reports.mozilla.com -j /opt/processor/crash_data/breakpad_test.json -d /opt/processor/crash_data/breakpad_test.dump`
if [ $? -eq 0 ]; then
    UUID=`echo "$UUID" | sed -n -e '/CrashID=bp-.\{8\}-.\{4\}-.\{4\}-.\{4\}-.\{12\}/p'`
    if [ -z "${UUID}" ]; then
        echo "CRITICAL: Unable to submit crash report."
        exit 2
    else
        echo "OK: Submitted crash report."
        exit 0
    fi
fi
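One detail worth noting when reading the check quoted above: if the submitter exits with a non-zero status, none of the branches run, so the script prints nothing and ends with status 0. The CRITICAL message is only reached when the submitter exits cleanly but its output contains no CrashID line. A minimal sketch of a variant that separates the two cases follows. The function name classify_submission and the choice of UNKNOWN (3) for a failed submitter run are assumptions for illustration, not part of the deployed check; the return codes follow the usual nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Hypothetical helper (not part of the deployed check): classify one
# submitter run for nagios.
#   $1 = exit status of the submitter
#   $2 = captured output of the submitter
# Return codes: 0 OK, 2 CRITICAL, 3 UNKNOWN.
classify_submission() {
    status=$1
    output=$2

    if [ "$status" -ne 0 ]; then
        # The submitter itself failed. The original check stays silent
        # here; reporting UNKNOWN is an assumption made for this sketch.
        echo "UNKNOWN: submitter exited with status $status."
        return 3
    fi

    # Same pattern as the original check: keep only a CrashID line.
    crash_id=$(printf '%s\n' "$output" |
        sed -n -e '/CrashID=bp-.\{8\}-.\{4\}-.\{4\}-.\{4\}-.\{12\}/p')

    if [ -z "$crash_id" ]; then
        # Clean exit but no crash ID came back from the collector.
        echo "CRITICAL: Unable to submit crash report."
        return 2
    fi

    echo "OK: Submitted crash report."
    return 0
}
```

In the real check, the function would be called with `$?` and the captured submitter output, and the script would then exit with the function's return code.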
Severity: normal → major
jabba/aravind - could we get a copy of the collector log from around when this happened? I see this one in IRC from right before the HBase restart:

16:24 < nagios> [75] cm-breakpad04:File Age - /home/processor/processor_1.log is CRITICAL: FILE_AGE CRITICAL: /home/processor/processor_1.log is 360 seconds old and 99917215 bytes

I believe the collector timeout was at 5 seconds, so +/- 2 minutes around 16:24.
(In reply to comment #1)
> jabba/aravind - could we get a copy of the collector log from around when this
> happened? I see this one in IRC from right before the HBase restart:
>
> 16:24 < nagios> [75] cm-breakpad04:File Age - /home/processor/processor_1.log
> is CRITICAL: FILE_AGE CRITICAL: /home/processor/processor_1.log
> is 360 seconds old and 99917215 bytes
>
> I believe the collector timeout was at 5 seconds, so +/- 2 minutes around 16:24

Sorry, I tabbed out of the textarea and hit space (which accidentally submitted my post); that is the wrong alert.
Here is an example from before the restart:

16:09 < nagios> [61] pm-app-collector01:Socorro Collector is CRITICAL: CRITICAL: Unable to submit crash report.
Component: Socorro → General
Product: Webtools → Socorro
No longer relevant.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX