Closed Bug 1140489 Opened 9 years ago Closed 9 years ago

Create a Nagios alert for zlb bandwidth based on graphite

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cliang, Assigned: ericz)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/695] )

Despite what is implied in the ZLB documentation, we *don't* trigger a log event when the amount of incoming/outgoing traffic hits or exceeds our licensed limit.  Knowing when we are close to or at that limit would be useful in troubleshooting issues, especially those around releases.

AFAICT, the license limit applies to both incoming and outgoing traffic.  Network transmission / reception information is being collected in graphite (in octets).  I don't know if it is better to try to alert based on the shape of the graph (looking for "flat tops") or if it is better to grab the numbers straight from graphite and gonkulate appropriately.  

Once this alert is sorted out, alerts should also go to #buildduty in IRC (see also https://bugzilla.mozilla.org/show_bug.cgi?id=1130242#c10)
Assignee: server-ops-webops → eziegenhorn
I've created a check_graphite_upper check that uses data in Graphite to determine whether a ZLB's total rx and tx bandwidth is close to or exceeding the license limit of 2Gb/s.  The downside is that we'd require an entry in services.pp, with that ZLB's graphite data urls, for every ZLB we add.  This goes against the grain of most of our checks, so I may try to make it more generic with a custom check script that can translate between hostnames and the corresponding graphite urls.
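A minimal sketch of the hostname-to-URL translation described above. The Graphite host, the metric naming scheme, and the interface name are all assumptions made for illustration; none of them are given in this bug. Only the `/render` endpoint with its `target`, `from` and `format` parameters is standard Graphite.

```python
from urllib.parse import urlencode

# Hypothetical values: the real Graphite host and metric layout are not stated here.
GRAPHITE_BASE = "https://graphite.example.com"
METRIC_TEMPLATE = "hosts.{host}.interface.if_octets.{iface}.{direction}"


def graphite_targets(hostname, iface="eth0"):
    """Return the rx and tx metric paths for one ZLB (assumed naming scheme)."""
    # Assumption: dots in the hostname are stored as underscores in the metric path.
    host = hostname.replace(".", "_")
    return [
        METRIC_TEMPLATE.format(host=host, iface=iface, direction=direction)
        for direction in ("rx", "tx")
    ]


def render_url(hostname, window="-10min"):
    """Build a Graphite render API URL that returns JSON for both directions."""
    params = [("target", target) for target in graphite_targets(hostname)]
    params += [("from", window), ("format", "json")]
    return GRAPHITE_BASE + "/render?" + urlencode(params)
```

With something like this, a single generic check could derive its URLs from the hostname instead of needing one services.pp entry per ZLB.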

For the record, I believe Graphite holds octets/minute information.  Translating Zeus' 2Gbit/s to octets/minute, I get 257698037 as the license limit.  Looking at zlb1.ops.scl3, I see current traffic at 62-83% of that octets/min limit, which seems reasonable and hopefully verifies my math.
The new check zeus-bandwidth, "Zeus Bandwidth License Limit", is in place now and alerts on IRC for sysadmins, netops-alerts and buildduty.  It is currently quite noisy, but the alerts all appear to be legitimate because ZLBs keep hitting the 2Gbit/s limit.  Once we get over the current issues, we should turn it on to page the MOC.

Another issue is that currently graphite in phx1 is overwhelmed with data much greater than historical levels.  Opened bug 1145163 for that issue.  Until that is resolved, the alerts for ZLB traffic in PHX1 will likely show UNKNOWN.
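As a rough illustration of why missing Graphite data ends up as UNKNOWN, here is a sketch that parses a Graphite render response in JSON form (a list of series, each with `[value, timestamp]` datapoints where the value may be null) and returns no result when a series has no usable samples. The function name and the sample payload are made up for this example; this is not the actual check script.

```python
import json

UNKNOWN = 3  # standard Nagios plugin exit code for UNKNOWN


def mean_total(render_json):
    """Sum the per-series averages (e.g. rx + tx) over the returned window.

    Returns None when there are no series, or when any series contains only
    nulls, so the caller can report UNKNOWN instead of a misleading OK.
    """
    series = json.loads(render_json)
    if not series:
        return None
    total = 0.0
    for entry in series:
        values = [value for value, _timestamp in entry["datapoints"] if value is not None]
        if not values:
            return None
        total += sum(values) / len(values)
    return total


# Made-up sample payload in the Graphite JSON shape.
sample = (
    '[{"target": "rx", "datapoints": [[100.0, 1], [null, 2], [300.0, 3]]},'
    ' {"target": "tx", "datapoints": [[50.0, 1], [50.0, 2], [null, 3]]}]'
)
status = UNKNOWN if mean_total(sample) is None else None  # None here means "go on to threshold checks"
```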
Reexamining the math behind the limit.  Zeus' own traffic graphs are labeled as bits per second so I believe the limit is 2Gbits/s.  I believe the collectd data in graphite is actually a sample or average that equates to octets/s.  Doing the math to convert from one to the other we get:

2 Gbit/s * 1000 Mbit/Gbit = 2000 Mbit/s
2000 Mbit/s * 1000 Kbit/Mbit = 2000000 Kbit/s
2000000 Kbit/s * 1000 bits/Kbit = 2000000000 bits/s
2000000000 bits/s * 1/8 octets/bit = 250000000 octets/s

+ 10% to allow for non-Zeus traffic to/from those boxes = 275000000 for the critical alert.  I'm making it alert only after 5 consecutive minutely checks fail, and smoothing the data over the last 10 minutes, all in an effort to make it alert less.
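The arithmetic and the noise-reduction approach above can be sketched as follows. This is only an illustration under assumptions: the class and its names are hypothetical, and in a real Nagios setup the "5 checks in a row" behaviour would more likely come from service configuration (for example `max_check_attempts`) than from code in the check itself.

```python
from collections import deque

# 2 Gbit/s license limit, decimal units, converted to octets per second.
LICENSE_BITS_PER_SEC = 2 * 1000 ** 3
LIMIT_OCTETS_PER_SEC = LICENSE_BITS_PER_SEC // 8      # 250000000
# 10% headroom for non-Zeus traffic to/from the same boxes.
CRITICAL_THRESHOLD = LIMIT_OCTETS_PER_SEC * 110 // 100  # 275000000


class ConsecutiveAlert:
    """Alert only after `required` consecutive smoothed samples exceed the threshold."""

    def __init__(self, threshold, window=10, required=5):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # last `window` one-minute samples
        self.required = required
        self.streak = 0

    def observe(self, octets_per_sec):
        """Record one minutely sample and return True if the alert should fire."""
        self.samples.append(octets_per_sec)
        smoothed = sum(self.samples) / len(self.samples)
        self.streak = self.streak + 1 if smoothed > self.threshold else 0
        return self.streak >= self.required
```

Feeding it five one-minute samples of 300000000 octets/s fires the alert on the fifth sample; a single quiet sample that pulls the smoothed value back under the threshold resets the streak.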
Raised thresholds to 130% and 140% of the license limit because it's still too noisy.  I think we may have to make this check measure actual Zeus traffic, rather than all traffic to a Zeus server, in order to be accurate, because we're not seeing a good match with this data.
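For illustration, a percentage-based classification might look like the sketch below. It assumes 130% is the warning threshold and 140% the critical one (the comment above does not say which is which), uses the 250000000 octets/s limit worked out earlier, and returns the standard Nagios plugin exit codes. The function name is hypothetical.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

LIMIT_OCTETS_PER_SEC = 250_000_000  # 2 Gbit/s expressed in octets/s


def classify(octets_per_sec, warn_pct=130, crit_pct=140):
    """Map a smoothed rx+tx rate to a Nagios status.

    Assumption: 130% is warning and 140% is critical.
    """
    if octets_per_sec is None:
        return UNKNOWN  # no data available from Graphite
    # Integer comparison avoids floating-point rounding at the boundaries.
    scaled = octets_per_sec * 100
    if scaled >= LIMIT_OCTETS_PER_SEC * crit_pct:
        return CRITICAL
    if scaled >= LIMIT_OCTETS_PER_SEC * warn_pct:
        return WARNING
    return OK
```

Keeping the percentages as parameters would make it straightforward to use different values per datacenter.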
I bumped this up further for scl3 (hardware limits vs license, temp change till April 18th) , split out the config for scl3 and everything else in Nagios.
This is done; we can tune this check down the road as necessary, just like every other Nagios check.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to Shyam Mani [:fox2mike] from comment #6)
> I bumped this up further for scl3 (hardware limits vs license, temp change
> till April 18th) , split out the config for scl3 and everything else in
> Nagios.

Change is now extended till May 19th.
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard