Closed Bug 1243497 Opened 9 years ago Closed 9 years ago

Need to monitor streams between Encoders and Streaming engines

Categories

(Infrastructure & Operations :: MOC: Service Requests, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: richard, Assigned: ryanc)

Details

Attachments

(2 files)

We need a way to monitor the RTMP traffic to port 1935 on Air Mozilla's Wowza Streaming Engines. The streaming engines run on Linux VMs: two are VMs in the SCL3 data center, and one is an Amazon EC2 instance.

wowza1.scl3.mozilla.com (63.245.214.154)
wowza2.scl3.mozilla.com (63.245.214.151)
ec2-54-83-79-243.compute-1.amazonaws.com (54.83.79.243)

SSH interfaces for the SCL3 VMs are on the VPN:

wowza1.corpdmz.scl3.mozilla.com (10.22.72.153)
wowza2.corpdmz.scl3.mozilla.com (10.22.72.140)

Most of the RTMP traffic to those destinations originates from VBrick encoders in Mozilla offices:

MTV-Commons     https://encoder3.commons.av.mtv2.mozilla.com      10.252.55.23   00:07:DF:01:B6:3A
MTV-Cyberspace  https://encoder2.avlab.av.mtv2.mozilla.com/       10.252.55.164  00:07:DF:01:B6:42
MTV-Atom        https://encoder1.atom.av.mtv2.mozilla.com/        10.252.55.230  00:07:DF:01:B6:52
SFO-Commons     https://encoder2.commons.voip.sfo1.mozilla.com/   10.251.47.11   00:07:DF:01:B6:4E
SFO-Cyberspace  https://encoder2.avlab.voip.sfo1.mozilla.com/     10.251.45.16   00:07:DF:01:B6:53
PDX-Commons     https://encoder2.commons.av.pdx1.mozilla.com/     10.248.47.11   00:07:DF:01:B6:55
YVR-Commons     https://encoder2.commons.av.yvr1.mozilla.com/     10.244.55.11   00:07:DF:01:B6:39
TOR-Commons     https://encoder2.commons.voip.tor1.mozilla.com/   10.242.47.10   00:07:DF:01:B6:37
LON-Commons     https://encoder2.commons.voip.lon1.mozilla.com/   10.246.47.11   00:07:DF:01:B6:4F
PAR-Commons     https://encoder2.commons.voip.par1.mozilla.com/   10.243.47.11   00:07:DF:01:B6:58

There is also some traffic from VBrick and OBS encoders on connections that may or may not be in Mozilla offices.

We need a way to detect and log RTMP connection drops, as well as general traffic statistics such as incoming bandwidth and dropped-packet counts. We need both logs and graphs, with sampling averages shorter than 10 seconds (the size of a video media segment).
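(For illustration, a minimal sketch of how drop detection could be scripted with conntrack-tools, assuming the package can be installed on these hosts; the log path is a placeholder, and the exact event output varies by kernel and conntrack version:)

  # Log every teardown of a tracked connection to port 1935 (RTMP).
  # DESTROY events cover both clean closes and timed-out/dropped flows.
  conntrack -E -e DESTROY -p tcp --dport 1935 >> /var/log/rtmp-conntrack.log

  # Per-CPU connection-tracking statistics, including a "drop" counter:
  conntrack -S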
Assignee: nobody → rchilds
Status: NEW → ASSIGNED
Richard,

Just some questions before I proceed:

A. Is it OK to install a package on these hosts? (conntrack-tools)
B. If you want these to be Nagios checks, what are the alert conditions?
C. For graphing, does Graphite work for you?
Sure. I know nothing about either Nagios or Graphite. That's why I came to your team for a solution. Educate me. What can you do?
Attached image eth0-rx-tx
Sure:

A. Give you graphs via https://graphite-scl3.mozilla.org/ -- see the attachment for an example. More plugins need to be added for port statistics. I already see the SCL3 hosts in there, but not the EC2 instance -- does it have another hostname? If not, that's not a problem; we can work something out.

B. Have Nagios alert in an IRC channel, or alert the MOC (so we can escalate to you), if a statistic exceeds a specified threshold, e.g.:

   Alert if incoming traffic on eth0 is over 10MB/s
   Alert if there are any errors on eth0
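(A sketch of what one of those checks could look like as a Nagios service definition; the check_iftraffic command and its thresholds are illustrative placeholders, not a plugin configured on these hosts:)

  # Hypothetical Nagios service -- plugin name and thresholds are placeholders.
  define service {
      use                 generic-service
      host_name           wowza1.corpdmz.scl3.mozilla.com
      service_description eth0 incoming traffic
      # warn at 8 MB/s, critical at 10 MB/s (illustrative values)
      check_command       check_iftraffic!eth0!8!10
  }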
Attached patch patch1 (Splinter Review)
Richard,

In order to proceed we would need to make those changes, but I'd also like to know when the best time to do this is, in case the changes need to be reverted. Also NI'ing Brian to review this commit.

I'm expecting this to require iptables (which is already installed and running) with no rules applied, and then installing the collectd plugin for conntrack.
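(For context, collectd's conntrack plugin takes no options; loading it is enough to start graphing the kernel's connection-tracking entry count. A minimal sketch of the relevant collectd.conf line, assuming a stock collectd install:)

  # /etc/collectd.conf (excerpt): graph nf_conntrack entry counts.
  LoadPlugin conntrack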
Attachment #8716564 - Flags: review?(bhourigan)
Comment on attachment 8716564 [details] [diff] [review]
patch1

Looks reasonable to me. My only word of advice is to also ensure that conntrack_max is set to a reasonable value (the default is too small) to avoid service interruptions during peak traffic.
Attachment #8716564 - Flags: review?(bhourigan) → review+
The calendar of AirMo events is at https://air.mozilla.org/calendar/. Any time between scheduled events is fine; afternoons next week look pretty quiet right now. If you're unsure, ping me on IRC (Richard), and text my phone (in the phonebook) if you have trouble finding me on IRC.

Graphite looks OK for graphing, but I'd like to find a way to watch traffic between specific endpoints, or at least over specific ports. If we hit a glitch, it's important to know whether the problem was with RTMP ingest (port 1935) or the HLS push to the CDN (port 443). Error counts and dropped-packet counts would be useful, particularly if they're graphed separately for each port.
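(One way to get per-port visibility, assuming collectd is the agent feeding Graphite here: its tcpconns plugin counts TCP connections by state for selected ports. A minimal sketch, with the port choices matching the ingest/push split described above:)

  # /etc/collectd.conf (excerpt): count TCP connections by state,
  # split out for the RTMP ingest and HLS push ports.
  LoadPlugin tcpconns
  <Plugin "tcpconns">
    ListeningPorts false
    LocalPort  "1935"   # RTMP ingest
    RemotePort "443"    # HLS push to the CDN
  </Plugin>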
(In reply to Brian Hourigan [:digi] from comment #5)
> Comment on attachment 8716564 [details] [diff] [review]
> patch1
>
> looks reasonable to me. my only word of advice is to also ensure that
> conntrack_max is set to a reasonable value (the default is too small) to
> avoid service interruptions during peak traffic.

Here are the default values for conntrack_max and the hash table:

  net.netfilter.nf_conntrack_max = 65536
  net.netfilter.nf_conntrack_buckets = 16384

The VMs have 16GB of RAM, and this is the formula I've found for determining conntrack_max:

  16000 * 1024^2 / 16384 = 1,024,000

But what about the hash size? Does this look correct?
Flags: needinfo?(bhourigan)
> The VMs have 16GB of RAM, and this is the formula I've found for determining
> conntrack_max:
>
> 16000 * 1024^2 / 16384 = 1,024,000
>
> But what about the hash size? Does this look correct?

Looks good. I found a citation to back up the math, and it includes a calculation for determining HASHSIZE as well:

https://wiki.khnet.info/index.php/Conntrack_tuning#Default_value_of_HASHSIZE
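(Applying that page's rule of thumb, HASHSIZE = CONNTRACK_MAX / 8, to the numbers above gives 1,024,000 / 8 = 128,000. A minimal sketch of how the two values could be applied at runtime; treat the numbers as output of the formula rather than the tuning actually committed:)

  # Raise the tracking-table limit from the 65536 default:
  sysctl -w net.netfilter.nf_conntrack_max=1024000

  # The bucket count (hashsize) is a module parameter, writable at runtime;
  # HASHSIZE = CONNTRACK_MAX / 8:
  echo 128000 > /sys/module/nf_conntrack/parameters/hashsize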
Flags: needinfo?(bhourigan)
Holding off on this until the stuttering issues are resolved.
Alright, Richard sent me an email directly asking me to proceed with this. Committed in r115253.

I get the following:

  Error: Could not start Service[iptables]: Execution of '/sbin/service iptables start' returned 6:
  Wrapped exception:
  Execution of '/sbin/service iptables start' returned 6:
  Error: /Stage[main]/Iptables/Service[iptables]/ensure: change from stopped to running failed: Could not start Service[iptables]: Execution of '/sbin/service iptables start' returned 6:

This isn't affecting conntrackd stats or connectivity, though.

Here's the RX and TX for wowza[12].scl3:

https://graphite-scl3.mozilla.org/graphlot/?width=909&height=645&_salt=1456544311.362&target=hosts.wowza1_corpdmz_scl3_mozilla_com.interface.if_octets.eth0.tx&target=hosts.wowza1_corpdmz_scl3_mozilla_com.interface.if_octets.eth0.rx&target=hosts.wowza2_corpdmz_scl3_mozilla_com.interface.if_octets.eth0.tx&target=hosts.wowza2_corpdmz_scl3_mozilla_com.interface.if_octets.eth0.rx

And conntrack entropy:

https://graphite-scl3.mozilla.org/graphlot/?width=909&height=645&_salt=1456544201.7&target=hosts.wowza1_corpdmz_scl3_mozilla_com.conntrack.conntrack.entropy&target=hosts.wowza2_corpdmz_scl3_mozilla_com.conntrack.conntrack.entropy

Please get back to me as soon as you can to confirm that these hosts are functioning properly. Hopefully this provides some useful data for your diagnosis.
Graphs are working. Can you point me at documentation on how to change the graphing parameters? I'd like to change the time scale if possible. What is the sampling rate?
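(For reference, Graphite's standard render endpoint takes the time window as from/until URL parameters; graphlot may expose the same controls differently, so treat this as a sketch of the usual Graphite convention:)

  # Last 6 hours of eth0 RX on wowza1, rendered directly:
  https://graphite-scl3.mozilla.org/render?target=hosts.wowza1_corpdmz_scl3_mozilla_com.interface.if_octets.eth0.rx&from=-6hours&until=now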
For general traffic stats I also added wowza1/2 to Observium. The two pages you might find interesting for each host are:

https://observium.private.scl3.mozilla.com/device/device=432/tab=graphs/group=netstats/ and
https://observium.private.scl3.mozilla.com/device/device=432/tab=port/port=904973/

https://observium.private.scl3.mozilla.com/device/device=431/tab=graphs/group=netstats/ and
https://observium.private.scl3.mozilla.com/device/device=431/tab=port/port=904971/

The polling is done every 5 minutes, but since most of the metrics are counters, increases in packet loss and other statistics will still be visible. We can also use NetFlow to visualize specific flows, but can't get packet loss per flow.
The issue with iptables was resolved in Bug 1255170. Richard, are we all set here? Do you have all the information you need with Graphite and Observium?
Flags: needinfo?(richard)
All set. The observium links are quite helpful, thanks!
Flags: needinfo?(richard)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED