Bug 1243497: Need to monitor streams between Encoders and Streaming engines
Status: RESOLVED FIXED (opened 9 years ago, closed 9 years ago)
Category: Infrastructure & Operations :: MOC: Service Requests (task)
Tracking: Not tracked
People: Reporter: richard, Assigned: ryanc
Attachments (2 files):
  image/png, 303.69 KB
  patch, 482 bytes (bhourigan: review+)
We need a way to monitor the RTMP traffic to Port 1935 on Air Mozilla's Wowza Streaming Engines.
The streaming engines run on Linux VMs. Two are VMs in the SCL3 Data Center, and one is an Amazon EC2 instance.
wowza1.scl3.mozilla.com (63.245.214.154)
wowza2.scl3.mozilla.com (63.245.214.151)
ec2-54-83-79-243.compute-1.amazonaws.com (54.83.79.243)
SSH interfaces for the SCL3 VMs are on the VPN:
wowza1.corpdmz.scl3.mozilla.com (10.22.72.153)
wowza2.corpdmz.scl3.mozilla.com (10.22.72.140)
Most of the RTMP traffic to those destinations originates from VBrick Encoders in Mozilla offices:
Location        Encoder URL                                      IP address     MAC address
MTV-Commons     https://encoder3.commons.av.mtv2.mozilla.com     10.252.55.23   00:07:DF:01:B6:3A
MTV-Cyberspace  https://encoder2.avlab.av.mtv2.mozilla.com/      10.252.55.164  00:07:DF:01:B6:42
MTV-Atom        https://encoder1.atom.av.mtv2.mozilla.com/       10.252.55.230  00:07:DF:01:B6:52
SFO-Commons     https://encoder2.commons.voip.sfo1.mozilla.com/  10.251.47.11   00:07:DF:01:B6:4E
SFO-Cyberspace  https://encoder2.avlab.voip.sfo1.mozilla.com/    10.251.45.16   00:07:DF:01:B6:53
PDX-Commons     https://encoder2.commons.av.pdx1.mozilla.com/    10.248.47.11   00:07:DF:01:B6:55
YVR-Commons     https://encoder2.commons.av.yvr1.mozilla.com/    10.244.55.11   00:07:DF:01:B6:39
TOR-Commons     https://encoder2.commons.voip.tor1.mozilla.com/  10.242.47.10   00:07:DF:01:B6:37
LON-Commons     https://encoder2.commons.voip.lon1.mozilla.com/  10.246.47.11   00:07:DF:01:B6:4F
PAR-Commons     https://encoder2.commons.voip.par1.mozilla.com/  10.243.47.11   00:07:DF:01:B6:58
There is some traffic that comes from VBrick and OBS encoders with connections that may or may not be in Mozilla offices.
We need a way to detect and log RTMP connection drops, as well as general traffic statistics such as incoming bandwidth and dropped packet counts.
We need both logs and graphs with sampling intervals shorter than 10 seconds (the length of a video media segment).
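One way to detect dropped RTMP connections without extra packages is to sample the kernel's TCP table every few seconds and compare the sets of connected peers. The sketch below is only an illustration under assumptions: it parses text in the format of Linux's /proc/net/tcp (IPv4 only, little-endian host), and the function names are hypothetical rather than part of any tool mentioned in this bug.

```python
import ipaddress

RTMP_PORT = 1935
TCP_ESTABLISHED = "01"  # state code used in /proc/net/tcp


def established_peers(proc_net_tcp, port=RTMP_PORT):
    """Return remote IPv4 addresses with an ESTABLISHED connection to local `port`."""
    peers = set()
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        local_port = int(local.split(":")[1], 16)
        if local_port != port or state != TCP_ESTABLISHED:
            continue
        # Addresses are hex in host byte order (assumed little-endian, as on x86).
        raw = bytes.fromhex(remote.split(":")[0])[::-1]
        peers.add(str(ipaddress.IPv4Address(raw)))
    return peers


def dropped_peers(previous, current):
    """Peers that were connected in the previous sample but not in the current one."""
    return previous - current
```

In practice a small loop would read /proc/net/tcp at an interval shorter than 10 seconds, call `established_peers` on the contents, and log whatever `dropped_peers` returns together with a timestamp.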
Updated • Assignee • 9 years ago
Assignee: nobody → rchilds
Status: NEW → ASSIGNED
Comment 1 • Assignee • 9 years ago
Richard,
Just some questions before I proceed.
A. Is it OK to install a package on these hosts? (conntrack-tools)
B. If you want these to be Nagios checks, what are the alert conditions?
C. For graphing, does Graphite work for you?
Comment 2 • Reporter • 9 years ago
Sure.
I know nothing about either Nagios or Graphite. That's why I came to your team for a solution.
Educate me. What can you do?
Comment 3 • Assignee • 9 years ago
Sure,
A. Give you graphs via https://graphite-scl3.mozilla.org/ -- See attachment for example
More plugins need to be added for port statistics. I already see the SCL3 hosts in there, but not the EC2 instance. Does it have another hostname? If not, that's not a problem; we can work something out.
B. Have Nagios alert in an IRC channel or alert the MOC (so we can escalate to you) if a given statistic exceeds a specified threshold, e.g.:
Alert if incoming traffic on eth0 is over 10MB/s
Alert if any errors on eth0
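The two example conditions above could be expressed as a Nagios-style check along the following lines. This is a hedged sketch, not the check that was deployed: the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), while the dictionary keys, the function name, and the reading of "10MB/s" as decimal megabytes are assumptions made for illustration.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # conventional Nagios plugin exit codes

# "10MB/s" from the example above; decimal megabytes is an assumption.
RX_LIMIT_BYTES_PER_SEC = 10 * 1000 * 1000


def evaluate(prev, curr, interval_sec):
    """Compare two eth0 counter samples and return (exit_code, message)."""
    if interval_sec <= 0:
        return UNKNOWN, "UNKNOWN - invalid sampling interval"
    rx_rate = (curr["rx_bytes"] - prev["rx_bytes"]) / interval_sec
    new_errors = curr["rx_errors"] - prev["rx_errors"]
    if new_errors > 0:
        return CRITICAL, f"CRITICAL - {new_errors} new errors on eth0"
    if rx_rate > RX_LIMIT_BYTES_PER_SEC:
        return CRITICAL, f"CRITICAL - eth0 rx {rx_rate:.0f} B/s over limit"
    return OK, f"OK - eth0 rx {rx_rate:.0f} B/s"
```

A real plugin would read the counters from the host (for example from /sys/class/net/eth0/statistics/), print the message, and exit with the returned code.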
Comment 4 • Assignee • 9 years ago
Richard,
To proceed, we need to make these changes, but I'd also like to know the best time to do so, in case they need to be reverted.
Also NI'ing Brian to review this commit.
I expect this to require iptables (already installed and running) with NO rules applied, followed by installing the collectd plugin for conntrack.
Attachment #8716564 - Flags: review?(bhourigan)
Comment 5 • 9 years ago
Comment on attachment 8716564
patch1
looks reasonable to me. my only word of advice is to also ensure that conntrack_max is set to a reasonable value (the default is too small) to avoid service interruptions during peak traffic.
Attachment #8716564 - Flags: review?(bhourigan) → review+
Comment 6 • Reporter • 9 years ago
The calendar of AirMo events is at https://air.mozilla.org/calendar/
Anytime between scheduled events is fine. Afternoons next week look pretty quiet right now.
If you're unsure, ping me on IRC (Richard), and text my phone (in the phonebook) if you have trouble finding me on IRC.
Graphite looks ok for graphing, but I'd like to find a way to watch traffic between specific endpoints, or at least over specific ports. If we hit a glitch it's important to know if the problem was with RTMP ingest (port 1935) or HLS push to the CDN (port 443). Error counts and dropped packet counts would be useful, particularly if they're graphed separately for each port.
Comment 7 • Assignee • 9 years ago
(In reply to Brian Hourigan [:digi] from comment #5)
> Comment on attachment 8716564
> patch1
>
> looks reasonable to me. my only word of advice is to also ensure that
> conntrack_max is set to a reasonable value (the default is too small) to
> avoid service interruptions during peak traffic.
Here are the default values for conntrack_max and the hash table size:
net.netfilter.nf_conntrack_max = 65536
net.netfilter.nf_conntrack_buckets = 16384
The VMs have 16GB of RAM, and this is the formula I've found to determine conntrack_max,
16000*1024^2/16384 = 1,024,000
But what about hash size? Does this look correct?
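The arithmetic above can be checked with a few lines of Python. The conntrack_max line reproduces the figure from this comment. The bucket count is only one possible answer to the hash size question: it assumes the 4:1 ratio between the two default values quoted above is preserved, which is an assumption for illustration and not necessarily the calculation given in any tuning guide.

```python
RAM_MB = 16000  # figure used in the comment for a 16GB VM

# conntrack_max as computed in the comment: RAM in bytes divided by 16384.
conntrack_max = RAM_MB * 1024**2 // 16384

# Defaults reported on the hosts.
DEFAULT_MAX = 65536
DEFAULT_BUCKETS = 16384

# Assumption: keep the same max-to-buckets ratio as the defaults.
ratio = DEFAULT_MAX // DEFAULT_BUCKETS
conntrack_buckets = conntrack_max // ratio

print(f"net.netfilter.nf_conntrack_max = {conntrack_max}")
print(f"nf_conntrack_buckets (assumed ratio {ratio}:1) = {conntrack_buckets}")
```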
Updated • Assignee • 9 years ago
Flags: needinfo?(bhourigan)
Comment 8 • 9 years ago
> The VMs have 16GB of RAM, and this is the formula I've found to determine
> conntrack_max,
>
> 16000*1024^2/16384 = 1,024,000
>
> But what about hash size? Does this look correct?
Looks good. I found a citation to back up the math, and there is a calculation for determining hashsize as well.
https://wiki.khnet.info/index.php/Conntrack_tuning#Default_value_of_HASHSIZE
Flags: needinfo?(bhourigan)
Comment 9 • Assignee • 9 years ago
Holding off on this till stuttering issues are resolved.
Comment 10 • Assignee • 9 years ago
Alright,
Richard sent me an email directly asking me to proceed with this. Committed in r115253. I get the following,
Error: Could not start Service[iptables]: Execution of '/sbin/service iptables start' returned 6:
Wrapped exception:
Execution of '/sbin/service iptables start' returned 6:
Error: /Stage[main]/Iptables/Service[iptables]/ensure: change from stopped to running failed: Could not start Service[iptables]: Execution of '/sbin/service iptables start' returned 6:
This isn't affecting conntrackd stats or connectivity, though. Here's the RX and TX for wowza[12].scl3
https://graphite-scl3.mozilla.org/graphlot/?width=909&height=645&_salt=1456544311.362&target=hosts.wowza1_corpdmz_scl3_mozilla_com.interface.if_octets.eth0.tx&target=hosts.wowza1_corpdmz_scl3_mozilla_com.interface.if_octets.eth0.rx&target=hosts.wowza2_corpdmz_scl3_mozilla_com.interface.if_octets.eth0.tx&target=hosts.wowza2_corpdmz_scl3_mozilla_com.interface.if_octets.eth0.rx
And conntrack entropy,
https://graphite-scl3.mozilla.org/graphlot/?width=909&height=645&_salt=1456544201.7&target=hosts.wowza1_corpdmz_scl3_mozilla_com.conntrack.conntrack.entropy&target=hosts.wowza2_corpdmz_scl3_mozilla_com.conntrack.conntrack.entropy
Please get back to me as soon as you can to confirm that these hosts are functioning properly. Hopefully this provides some useful data in your diagnosis.
Comment 11 • Reporter • 9 years ago
Graphs are working. Can you point me at documentation about how to change graphing parameters? I'd like to change the time scale if possible. What is the sampling rate?
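On the time scale question, Graphite's render API accepts a `from` parameter with relative times such as `-15min` or `-1h`, alongside one or more `target` parameters. The helper below is a minimal sketch that builds such a URL; the host name and metric path are taken from the links in this bug, while the `/render` endpoint being enabled on this particular instance and the function name are assumptions.

```python
from urllib.parse import urlencode

GRAPHITE_BASE = "https://graphite-scl3.mozilla.org"  # host taken from the links above


def render_url(targets, time_from="-1h", fmt="json"):
    """Build a Graphite render API URL for the given metric paths."""
    params = [("target", t) for t in targets]
    params += [("from", time_from), ("format", fmt)]
    return f"{GRAPHITE_BASE}/render?{urlencode(params)}"


url = render_url(
    ["hosts.wowza1_corpdmz_scl3_mozilla_com.interface.if_octets.eth0.rx"],
    time_from="-15min",
)
print(url)
```

Changing `time_from` narrows or widens the window; the spacing of the returned datapoints reflects whatever interval the collector and storage schema use, which this bug does not state.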
Comment 12 • 9 years ago
For general traffic stats I also added wowza1/2 to Observium:
The 2 pages you might find interesting are
https://observium.private.scl3.mozilla.com/device/device=432/tab=graphs/group=netstats/
and
https://observium.private.scl3.mozilla.com/device/device=432/tab=port/port=904973/
https://observium.private.scl3.mozilla.com/device/device=431/tab=graphs/group=netstats/
and
https://observium.private.scl3.mozilla.com/device/device=431/tab=port/port=904971/
Polling is done every 5 minutes, but since most of these metrics are counters, increases in packet loss and other metrics will still be visible.
We can also use netflow to visualize specific flows, but we can't show packet loss per flow.
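The point about counters can be illustrated with a short calculation: because a counter only ever increases, two readings taken 5 minutes apart still capture every packet or error in between, and the difference gives an average rate for the interval. This is a generic sketch, not Observium's code; the function name and the wraparound handling are assumptions.

```python
def counter_rate(prev_value, curr_value, interval_sec, counter_bits=64):
    """Average per-second rate between two readings of an increasing counter."""
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    delta = curr_value - prev_value
    if delta < 0:  # the counter wrapped around between polls
        delta += 2 ** counter_bits
    return delta / interval_sec


# 300 dropped packets counted over a 5-minute (300 s) polling interval.
print(counter_rate(1000, 1300, 300))
```

The trade-off is resolution: the total is accurate, but a short burst of loss is averaged over the whole polling interval, so it cannot be pinned to a specific 10-second media segment.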
Comment 13 • Assignee • 9 years ago
Issue with iptables was resolved in Bug 1255170.
Richard,
Are we all set here? Do you have all the information you need with Graphite and Observium?
Flags: needinfo?(richard)
Comment 14 • Reporter • 9 years ago
All set. The observium links are quite helpful, thanks!
Flags: needinfo?(richard)
Updated • Assignee • 9 years ago
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED