Closed
Bug 604081
Opened 15 years ago
Closed 12 years ago
Socorro - collector and nagios
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: lars, Unassigned)
Details
On reading the code of the collector nagios check for the first time, I'm concerned that it may be pointing to a deeper collector problem than we realized.
The collector, if it encounters hbase trouble, should put crashes into fallback storage and then return a normal ooid. From the standpoint of the submitter, there should be no difference between a regular successful hbase store and a fallback store.
When nagios warns about the collector, I think it is saying that both the hbase _and_ the fallback storage failed. At least that's what I get from reading the nagios check.
This warrants investigation.
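To make the expected behaviour concrete, here is a minimal sketch of the fallback logic described above. It is not the collector's actual code: store_primary, store_fallback, accept_crash and the PRIMARY_OK/FALLBACK_OK switches are hypothetical stand-ins, used only to show that the submitter should see an error solely when both stores fail.

```shell
#!/bin/sh
# Hypothetical switches simulating backend health (1 = healthy, 0 = failing).
PRIMARY_OK=${PRIMARY_OK:-1}
FALLBACK_OK=${FALLBACK_OK:-1}

# Hypothetical stand-ins for hbase storage and fallback storage.
# Each returns 0 on success and non-zero on failure.
store_primary()  { [ "$PRIMARY_OK" -eq 1 ]; }
store_fallback() { [ "$FALLBACK_OK" -eq 1 ]; }

# Try primary storage first, then fallback storage. The submitter gets the
# same "CrashID=bp-<ooid>" line either way; only a double failure is an error.
accept_crash() {
    ooid=$1
    if store_primary "$ooid" || store_fallback "$ooid"; then
        echo "CrashID=bp-$ooid"
        return 0
    fi
    echo "storage failed" >&2
    return 1
}
```

Under this model, a check that only looks for the CrashID line cannot tell a primary store from a fallback store, so a CRITICAL from such a check would mean neither store accepted the crash.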
The nagios collector code:
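#! /bin/sh
# Make the locally installed Python tools and the Socorro modules available.
PATH=$PATH:$HOME/bin:~/python_extras/bin
export PATH
PYTHONPATH=~/python_extras/lib:/opt/processor:/usr/lib/python2.4/site-packages
export PYTHONPATH
# Submit a test crash to the collector; $1 is passed through as the -u argument.
UUID=`python /opt/processor/socorro/collector/submitter.py -u $1 -h crash-reports.mozilla.com -j /opt/processor/crash_data/breakpad_test.json -d /opt/processor/crash_data/breakpad_test.dump`
# Note: if the submitter exits non-zero, the check prints no status line and falls through.
if [ $? -eq 0 ]; then
    # Keep only the line that carries a well-formed crash ID.
    UUID=`echo "$UUID" | sed -n -e '/CrashID=bp-.\{8\}-.\{4\}-.\{4\}-.\{4\}-.\{12\}/p'`
    if [ -z "${UUID}" ]; then
        echo "CRITICAL: Unable to submit crash report."
        exit 2
    else
        echo "OK: Submitted crash report."
        exit 0
    fi
fi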
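One thing worth noting about the check as written: if submitter.py itself exits non-zero, the outer if is skipped, no status line is printed, and the script ends with exit status 0, which nagios treats as OK. A possible variant that reports that case explicitly is sketched below. The check_collector function and its error message are hypothetical, and the submit command is passed in as arguments so the logic can be exercised without a real collector.

```shell
#!/bin/sh
# Hypothetical wrapper: "$@" is the submit command, for example the
# submitter.py invocation used by the original check.
check_collector() {
    # Treat a failing submit command as CRITICAL instead of falling through.
    if ! output=$("$@" 2>&1); then
        echo "CRITICAL: Submitter exited with an error."
        return 2
    fi
    # Same crash ID pattern as the original sed expression.
    if echo "$output" | grep -q 'CrashID=bp-.\{8\}-.\{4\}-.\{4\}-.\{4\}-.\{12\}'; then
        echo "OK: Submitted crash report."
        return 0
    fi
    echo "CRITICAL: Unable to submit crash report."
    return 2
}
```

The return values follow the usual nagios plugin convention (0 for OK, 2 for CRITICAL); a real check would end with exit "$?" after calling the function.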
| Reporter | ||
Updated•15 years ago
Severity: normal → major
Comment 1•15 years ago
jabba/aravind - could we get a copy of the collector log from around when this happened? I see this one in IRC from right before the HBase restart:
16:24 < nagios> [75] cm-breakpad04:File Age - /home/processor/processor_1.log
is CRITICAL: FILE_AGE CRITICAL: /home/processor/processor_1.log
is 360 seconds old and 99917215 bytes
I believe the collector timeout was at 5 seconds, so +/- 2 minutes around 16:24
Comment 2•15 years ago
(In reply to comment #1)
> jabba/aravind - could we get a copy of the collector log from around when this
> happened? I see this one in IRC from right before the HBase restart:
>
> 16:24 < nagios> [75] cm-breakpad04:File Age - /home/processor/processor_1.log
> is CRITICAL: FILE_AGE CRITICAL: /home/processor/processor_1.log
> is 360 seconds old and 99917215 bytes
>
> I believe the collector timeout was at 5 seconds, so +/- 2 minutes around 16:24
Sorry, I tabbed out of the textarea and hit space, which accidentally submitted my post. That is the wrong alert.
Comment 3•15 years ago
Here is an example from before the restart:
16:09 < nagios> [61] pm-app-collector01:Socorro Collector is CRITICAL:
CRITICAL: Unable to submit crash report.
| Assignee | ||
Updated•14 years ago
Component: Socorro → General
Product: Webtools → Socorro
| Reporter | ||
Comment 4•12 years ago
no longer relevant
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX