Closed
Bug 944974
Opened 12 years ago
Closed 10 years ago
Collector-1 showing much more apache log entries than crashmover is showing writes
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: lonnen, Unassigned)
Details
Sat 23:29:38 PST [1409] socorro-collector1.webapp.phx1.mozilla.com:Socorro Crash Volume is CRITICAL: CRITICAL: 10296 is NOT within 20% of 4472 (http://m.allizom.org/Socorro+Crash+Volume)
The above alert went off twice on Saturday, Nov 30: once at ~16:00 and again at ~23:30. The alert measures the discrepancy between the number of Apache log entries for crashes (the first number) and the number of writes in the crashmover logs (the second number).
So far it has only affected collector-1. For reference, here are the numbers from all six collectors around the time of one such alert:
socorro-collector1.webapp.phx1.mozilla.com:Socorro Crash Volume is CRITICAL - CRITICAL: 10296 is NOT within 20% of 3953 Last Checked: 2013-11-30 22:29:28 PST
socorro-collector2.webapp.phx1.mozilla.com:Socorro Crash Volume is OK - OK: 1818 is within 10% of 1830 Last Checked: 2013-11-30 23:12:26 PST
socorro-collector3.webapp.phx1.mozilla.com:Socorro Crash Volume is OK - OK: 1870 is within 10% of 1868 Last Checked: 2013-12-01 00:09:55 PST
socorro-collector4.webapp.phx1.mozilla.com:Socorro Crash Volume is OK - OK: 1357 is within 10% of 1342 Last Checked: 2013-12-01 00:07:22 PST
socorro-collector5.webapp.phx1.mozilla.com:Socorro Crash Volume is OK - OK: 835 is within 10% of 853 Last Checked: 2013-12-01 00:04:49 PST
socorro-collector6.webapp.phx1.mozilla.com:Socorro Crash Volume is OK - OK: 400 is within 10% of 415 Last Checked: 2013-12-01 00:02:16 PST
Removing collector-1 from the pool clears the alert within a few minutes. I've asked IT to remove it from the pool temporarily to keep it from alerting overnight.
Comment 1•12 years ago
While the daily totals are not an exact match, they are so close that I have no concern that the collector or crashmover was malfunctioning. I need to see what this alert is actually testing.
$ # number of crashes accepted by the collector:
$ grep accepted error_log-20131201 | wc -l
2231918
$ # number of crashes moved by the crashmover
$ grep saved socorro-crashmover.log-20131201 | wc -l
2231904
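To put the two totals above in the same terms the alert uses, here is a minimal sketch that expresses the gap as a percentage. The function name `pct_diff` is hypothetical, and measuring the difference relative to the second number (the crashmover count) is an assumption based on the wording of the alert ("X is NOT within 20% of Y"), not something confirmed from the check script.

```shell
#!/bin/sh
# Hypothetical helper, not part of Socorro or the nagios check.
# Prints the absolute difference between $1 and $2 as a percentage of $2.
# Assumption: the alert measures the gap relative to the second number.
pct_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        d = a - b
        if (d < 0) d = -d
        printf "%.4f\n", d / b * 100
    }'
}

# Daily totals from the grep counts above: accepted vs. saved.
pct_diff 2231918 2231904
```

With the totals for the whole day the gap is a tiny fraction of a percent, far below the 10% warning level, which suggests the alert is looking at a much shorter window than the full day's logs.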
Updated•12 years ago
Summary: Collector-1 showing much more apache log entries that cashmover is showing writes → Collector-1 showing much more apache log entries that crashmover is showing writes
Updated•12 years ago
Summary: Collector-1 showing much more apache log entries that crashmover is showing writes → Collector-1 showing much more apache log entries than crashmover is showing writes
Comment 2•12 years ago
:lars -- https://bugzilla.mozilla.org/show_bug.cgi?id=841578 discusses the implementation of the error.
I'm cc'ing ericz for insight, since he authored the alert.
Comment 3•12 years ago
From bug 841578:
"I implemented a new nagios check, check_socorro_volume, that compares the incoming crash reports (from Apache logs) to outgoing crashes saved to HBase (from socorro-crashmover.log) and WARNS if they differ by greater than 10% in either direction. It alerts as CRITICAL if they are off by more than 20% from each other. These values are probably not ideal so we can tune them down the road..."
The actual check script is /usr/lib64/nagios/plugins/custom/check_socorro_volume.sh on the collectors so you can see what it's doing. It's a pretty straightforward shell script aside from perhaps calculating percent difference with bc.
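For readers without access to the collectors, the thresholds described above can be sketched roughly as follows. This is not the real check_socorro_volume.sh: the function name `classify_volume` is hypothetical, the difference is assumed to be measured relative to the crashmover count, and integer arithmetic is used here in place of the `bc` calculation the real script is said to use. The return values follow the usual nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Hypothetical sketch of the 10% / 20% threshold logic, not the real script.
# $1 = incoming crash count (Apache log), $2 = saved count (crashmover log).
classify_volume() {
    incoming=$1
    saved=$2

    # Without a positive baseline there is nothing to compare against.
    if [ "$saved" -le 0 ]; then
        echo "UNKNOWN"
        return 3
    fi

    # Absolute difference, so a gap in either direction is caught.
    diff=$((incoming - saved))
    if [ "$diff" -lt 0 ]; then
        diff=$((-diff))
    fi

    # diff/saved > 20% is the same as diff*100 > saved*20 (no floats needed).
    if [ $((diff * 100)) -gt $((saved * 20)) ]; then
        echo "CRITICAL"
        return 2
    elif [ $((diff * 100)) -gt $((saved * 10)) ]; then
        echo "WARNING"
        return 1
    fi

    echo "OK"
    return 0
}

# collector2's numbers from the original report.
classify_volume 1818 1830
```

Fed collector-1's numbers from the alert (10296 and 4472), the same function reports CRITICAL, which matches what nagios showed.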
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME