Closed Bug 944974 Opened 12 years ago Closed 10 years ago

Collector-1 showing much more apache log entries than crashmover is showing writes

Categories

(Socorro :: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: lonnen, Unassigned)

Details

Sat 23:29:38 PST [1409] socorro-collector1.webapp.phx1.mozilla.com:Socorro Crash Volume is CRITICAL: CRITICAL: 10296 is NOT within 20% of 4472 (http://m.allizom.org/Socorro+Crash+Volume)

Once on Saturday, Nov 30 at ~16:00 and again at ~23:30 we had the above alert go off. The alert measures the discrepancy between the number of apache log entries for crashes (the first number) and the number of writes in the crashmover logs (the second number). So far it has only affected collector-1.

For reference, here are numbers from the collectors shortly after one such alert:

socorro-collector1.webapp.phx1.mozilla.com:Socorro Crash Volume is CRITICAL - CRITICAL: 10296 is NOT within 20% of 3953 Last Checked: 2013-11-30 22:29:28 PST
socorro-collector2.webapp.phx1.mozilla.com:Socorro Crash Volume is OK - OK: 1818 is within 10% of 1830 Last Checked: 2013-11-30 23:12:26 PST
socorro-collector3.webapp.phx1.mozilla.com:Socorro Crash Volume is OK - OK: 1870 is within 10% of 1868 Last Checked: 2013-12-01 00:09:55 PST
socorro-collector4.webapp.phx1.mozilla.com:Socorro Crash Volume is OK - OK: 1357 is within 10% of 1342 Last Checked: 2013-12-01 00:07:22 PST
socorro-collector5.webapp.phx1.mozilla.com:Socorro Crash Volume is OK - OK: 835 is within 10% of 853 Last Checked: 2013-12-01 00:04:49 PST
socorro-collector6.webapp.phx1.mozilla.com:Socorro Crash Volume is OK - OK: 400 is within 10% of 415 Last Checked: 2013-12-01 00:02:16 PST

Removing collector-1 from the pool clears the alert within a few minutes. I've asked IT to remove it from the pool temporarily to keep it from alerting overnight.
While the numbers are not an exact match, they are so close that I have no concern that the collector/crashmover were malfunctioning. I need to see what this alert is actually testing.

$ # number of crashes accepted by the collector:
$ grep accepted error_log-20131201 | wc -l
2231918
$ # number of crashes moved by the crashmover
$ grep saved socorro-crashmover.log-20131201 | wc -l
2231904
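To put the two daily totals above on the same scale as the alert thresholds, the gap can be expressed as a percentage. This is only a rough sketch: `pct_gap` is a hypothetical helper, and measuring the gap relative to the crashmover count is an assumption, not something confirmed in this bug.

```shell
# Hypothetical helper: absolute gap between two counts, as a percentage
# of the second count. Which count serves as the base is an assumption.
pct_gap() {
    awk -v a="$1" -v s="$2" 'BEGIN {
        d = a - s
        if (d < 0) d = -d
        printf "%.4f\n", d * 100 / s
    }'
}

accepted=2231918   # from: grep accepted error_log-20131201 | wc -l
saved=2231904      # from: grep saved socorro-crashmover.log-20131201 | wc -l
pct_gap "$accepted" "$saved"   # prints 0.0006
```

Under that assumption, the whole-day difference of 14 crashes is far below the 10% warning threshold, which is consistent with the collector and crashmover not having lost crashes over the full day.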
Summary: Collector-1 showing much more apache log entries that cashmover is showing writes → Collector-1 showing much more apache log entries that crashmover is showing writes
Summary: Collector-1 showing much more apache log entries that crashmover is showing writes → Collector-1 showing much more apache log entries than crashmover is showing writes
:lars -- https://bugzilla.mozilla.org/show_bug.cgi?id=841578 discusses the implementation of the alert. I'm cc'ing ericz for insight since he authored it.
From bug 841578: "I implemented a new nagios check, check_socorro_volume, that compares the incoming crash reports (from Apache logs) to outgoing crashes saved to HBase (from socorro-crashmover.log) and WARNS if they differ by greater than 10% in either direction. It alerts as CRITICAL if they are off by more than 20% from each other. These values are probably not ideal so we can tune them down the road..." The actual check script is /usr/lib64/nagios/plugins/custom/check_socorro_volume.sh on the collectors so you can see what it's doing. It's a pretty straightforward shell script aside from perhaps calculating percent difference with bc.
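The threshold logic described in that quote might look roughly like the sketch below. This is not the contents of check_socorro_volume.sh: `check_volume` is a hypothetical function, the percent difference is assumed to be taken relative to the crashmover count, and awk is used for the arithmetic here in place of bc. The message wording follows the alert text quoted earlier in this bug, and the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
# Hypothetical sketch of a volume check; NOT the real check_socorro_volume.sh.
# $1 = crash entries counted in the Apache log
# $2 = crashes saved according to socorro-crashmover.log
# Assumption: the difference is measured relative to the crashmover count.
check_volume() {
    awk -v a="$1" -v m="$2" 'BEGIN {
        if (m == 0) {
            print "UNKNOWN: crashmover count is zero"
            exit 3
        }
        diff = (a > m ? a - m : m - a) * 100 / m
        if (diff > 20) {
            printf "CRITICAL: %d is NOT within 20%% of %d\n", a, m
            exit 2
        } else if (diff > 10) {
            printf "WARNING: %d is NOT within 10%% of %d\n", a, m
            exit 1
        }
        printf "OK: %d is within 10%% of %d\n", a, m
        exit 0
    }'
}
```

Run against the figures reported in this bug, `check_volume 10296 4472` prints `CRITICAL: 10296 is NOT within 20% of 4472` and returns 2, while `check_volume 1818 1830` prints `OK: 1818 is within 10% of 1830` and returns 0.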
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME