PAR Air Mozilla Streaming Issue - Seeing drops between the Encoder and the Wowza Engine


Status

Infrastructure & Operations
NetOps: Other
RESOLVED INVALID
2 years ago
2 years ago

People

(Reporter: Tony Recendez, Assigned: dcurado)

Tracking

Details

Attachments

(3 attachments)

(Reporter)

Description

2 years ago
Created attachment 8767988 [details]
unspecified.png

Air Mozilla Stream is choppy and repeatedly drops.

Please check the connection between the Encoder and the Wowza Engine:

Wowza Engine: 54.218.169.156
Encoder: 10.243.47.11
(Reporter)

Comment 1

2 years ago
Created attachment 8767991 [details]
unspecified2.png
(Assignee)

Comment 2

2 years ago
Hard to figure out what's going on here.

First of all, the Wowza engine is located in AWS.
Right away, that limits what we can control.
That's the deal when we put our stuff into other people's infrastructure.

Second, I can't ping 54.218.169.156.  That is probably a security-group issue 
on your AWS instance.  

Lastly, here's what a traceroute from PAR1 looks like:

mozilla@canary1:~$ traceroute 54.218.169.156
traceroute to 54.218.169.156 (54.218.169.156), 30 hops max, 60 byte packets
 1  fw1.corp.par1.mozilla.net (10.243.24.1)  0.649 ms  0.653 ms  0.987 ms
 2  81.31.15.153 (81.31.15.153)  3.908 ms  3.929 ms  3.928 ms
 3  ae-18.r04.parsfr01.fr.bb.gin.ntt.net (81.25.197.185)  4.093 ms  4.104 ms  4.695 ms
 4  ae-2.r25.londen12.uk.bb.gin.ntt.net (129.250.6.13)  11.311 ms ae-8.r23.londen03.uk.bb.gin.ntt.net (129.250.6.206)  17.854 ms  18.020 ms
 5  ae-10.r24.londen12.uk.bb.gin.ntt.net (129.250.4.23)  31.987 ms  32.417 ms ae-1.r24.londen12.uk.bb.gin.ntt.net (129.250.2.26)  33.084 ms
 6  ae-5.r24.nycmny01.us.bb.gin.ntt.net (129.250.2.18)  78.924 ms  75.708 ms  81.837 ms
 7  ae-1.r25.nycmny01.us.bb.gin.ntt.net (129.250.3.207)  76.958 ms  76.046 ms  81.643 ms
 8  ae-1.r20.chcgil09.us.bb.gin.ntt.net (129.250.2.166)  95.331 ms ae-28.r05.sttlwa01.us.bb.gin.ntt.net (129.250.2.45)  146.949 ms  142.043 ms
 9  ae-3.amazon.sttlwa01.us.bb.gin.ntt.net (198.104.202.182)  147.782 ms ae-2.amazon.sttlwa01.us.bb.gin.ntt.net (129.250.201.178)  146.200 ms ae-1.r20.sttlwa01.us.bb.gin.ntt.net (129.250.3.42)  137.778 ms
10  ae-28.r05.sttlwa01.us.bb.gin.ntt.net (129.250.2.45)  150.858 ms *  153.412 ms
11  * * ae-3.amazon.sttlwa01.us.bb.gin.ntt.net (198.104.202.182)  149.189 ms
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  54.239.48.189 (54.239.48.189)  154.052 ms * *
18  * * *
19  205.251.230.125 (205.251.230.125)  158.297 ms 54.239.48.191 (54.239.48.191)  146.617 ms 205.251.232.169 (205.251.232.169)  151.117 ms
20  * * *
21  * * *
22  * * *
23  * * ec2-50-112-0-197.us-west-2.compute.amazonaws.com (50.112.0.197)  156.856 ms
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *

A bit of a mess.
But it also shows that there is no problem with our provider in PAR1, who hands the packets
off to NTT.  NTT then appears to hand the packets to AWS, whose network filters most traceroute
probes, which is why we get all those "stars" in the response.

Can you look into the security group you have on the EC2 instance and allow ICMP through?
That will at least let us ping it.
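
For reference, a minimal sketch of what that change could look like from the AWS CLI; the group ID and source CIDR below are placeholders, not values from this bug, and the same rule can of course be added through the console instead:

  # Hypothetical example: allow ICMP from a single source block.
  # Substitute the Wowza instance's security group ID and the par1 egress CIDR.
  aws ec2 authorize-security-group-ingress \
      --group-id sg-0123456789abcdef0 \
      --protocol icmp --port -1 \
      --cidr 203.0.113.0/24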
Assignee: network-operations → dcurado
Status: NEW → ASSIGNED
Flags: needinfo?(trecendez)

Comment 3

2 years ago
R2 controls that security group.

Comment 4

2 years ago
From the par1 canary host, all responding hops show stable ICMP latency until they reach AWS's network; after that it gets a bit messy, with some packet loss.

mozilla@canary1:~$ mtr 54.218.169.156 --report
Start: Tue Jul  5 14:36:15 2016
HOST: canary1                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- fw1.corp.par1.mozilla.net  0.0%    10    0.6   0.6   0.5   0.7   0.0
  2.|-- 81.31.15.153               0.0%    10    3.5   5.2   2.1  27.8   8.0
  3.|-- ae-18.r04.parsfr01.fr.bb.  0.0%    10    2.8   3.2   2.8   4.2   0.3
  4.|-- ae-2.r25.londen12.uk.bb.g  0.0%    10    9.7   9.9   9.4  11.1   0.3
  5.|-- ae-1.r24.londen12.uk.bb.g  0.0%    10   27.5  27.9  27.4  29.7   0.6
  6.|-- ae-5.r24.nycmny01.us.bb.g  0.0%    10   75.7  75.7  75.6  75.9   0.0
  7.|-- ae-1.r25.nycmny01.us.bb.g  0.0%    10   75.7  77.5  75.7  82.8   2.6
  8.|-- ae-1.r20.chcgil09.us.bb.g  0.0%    10   94.1  94.2  94.1  94.4   0.0
  9.|-- ae-1.r20.sttlwa01.us.bb.g  0.0%    10  149.5 164.1 149.5 273.7  38.8
 10.|-- ae-28.r05.sttlwa01.us.bb.  0.0%    10  149.5 149.6 149.5 149.9   0.0
 11.|-- ae-2.amazon.sttlwa01.us.b  0.0%    10  149.6 151.1 149.3 160.3   3.4
 12.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
 13.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
 14.|-- 205.251.232.92             0.0%    10  156.7 156.7 156.7 156.8   0.0
 15.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
 16.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
 17.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
 18.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
 19.|-- 205.251.230.125           10.0%    10  155.9 156.4 155.9 157.8   0.4
 20.|-- ???                        0.0%     0    0.0   0.0   0.0   0.0   0.0



A second run shows something cleaner:
mozilla@canary1:~$ mtr 54.218.169.156 --report
Start: Tue Jul  5 14:39:47 2016
HOST: canary1                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- fw1.corp.par1.mozilla.net  0.0%    10    0.6   0.6   0.6   0.7   0.0
  2.|-- 81.31.15.153               0.0%    10    2.1   3.0   2.0   7.0   1.5
  3.|-- ae-18.r04.parsfr01.fr.bb.  0.0%    10    3.1   2.9   2.6   3.6   0.0
  4.|-- ae-2.r25.londen12.uk.bb.g  0.0%    10   11.2  10.1   9.6  11.2   0.3
  5.|-- ae-1.r24.londen12.uk.bb.g  0.0%    10   27.7  27.8  27.4  29.6   0.5
  6.|-- ae-5.r24.nycmny01.us.bb.g  0.0%    10   75.8  76.3  75.8  77.6   0.3
  7.|-- ae-1.r25.nycmny01.us.bb.g  0.0%    10   75.6  77.3  75.6  82.8   2.7
  8.|-- ae-1.r20.chcgil09.us.bb.g  0.0%    10   94.1  95.9  94.0 111.3   5.4
  9.|-- ae-1.r20.sttlwa01.us.bb.g  0.0%    10  149.5 155.3 149.5 180.8  12.1
 10.|-- ae-28.r05.sttlwa01.us.bb.  0.0%    10  149.4 149.6 149.4 149.8   0.0
 11.|-- ae-2.amazon.sttlwa01.us.b  0.0%    10  149.5 151.9 149.3 174.6   8.0
 12.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0

That suggests AWS may have had a transient issue. Has it improved since?

It would also be useful for testing if the final host answered ICMP.

Nothing special to report toward other hosts on the Internet:
See for example:
http://smokeping1.private.scl3.mozilla.com/smokeping/smokeping.cgi?displaymode=n;start=2016-07-05%2011:33;end=now;target=Offices.OfficesEurope.edgecast-fra~canary1.corp.par1.mozilla.com
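
Another angle worth trying from canary1: since the encoder pushes RTMP over TCP/1935, an mtr run in TCP mode against that port exercises roughly the same path the stream takes, and it does not depend on the final host answering ICMP. A sketch, assuming the mtr build on canary1 is recent enough to support TCP probes:

  # TCP-mode mtr toward the Wowza RTMP port (requires mtr with --tcp support)
  mtr --report --report-cycles 60 --tcp --port 1935 54.218.169.156

Loss or latency spikes that only show up on the final hop in TCP mode would point at the instance or its security group rather than the transit path.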
Assignee: dcurado → network-operations
Status: ASSIGNED → NEW
Assignee: network-operations → dcurado
Status: NEW → ASSIGNED
(Reporter)

Updated

2 years ago
Flags: needinfo?(trecendez)

Comment 5

2 years ago
I'm prepared to open up ICMP, but I don't want to open it to all addresses. Can you give me an originating CIDR block?
Flags: needinfo?(dcurado)
(Assignee)

Comment 6

2 years ago
Why not?  Throughout Mozilla's infrastructure we allow ICMP echo.
Otherwise it becomes very difficult to debug network problems, as you have already found out.

From the security-group listing you posted in IRC, lots of ports are open to the entire Internet.
I'm wondering why, given that openness, you would be worried about ICMP.
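
If it helps, the rules in question can also be dumped from the CLI rather than pasted from the console; a sketch with a placeholder group ID:

  # Show the current inbound rules on the Wowza instance's security group
  # (sg-0123456789abcdef0 is a placeholder).
  aws ec2 describe-security-groups \
      --group-ids sg-0123456789abcdef0 \
      --query 'SecurityGroups[0].IpPermissions'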
Flags: needinfo?(dcurado)

Comment 7

2 years ago
ICMP has been opened in the Security Group associated with that Wowza instance.

Comment 8

2 years ago
Filed a ticket with VBrick with the following info:

We've got a problem with our 9000 Series encoder in Paris.

The connection to our Wowza Engine is dropping at approximately 2 minute intervals.

This is serial number 91051500020 running version 4.4.0c firmware.

The System Events log is showing error 70 each time the connection drops:

            SYSEVENT:07/05/2016 13:32:09LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:30:11LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:28:13LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:26:21LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:24:28LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:22:39LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:20:45LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:18:44LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:17:09LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:15:11LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:13:28LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:11:45LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:09:54LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:08:14LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:06:30LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:04:40LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:03:01LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 13:00:55LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 12:59:06LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 12:57:27LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 12:55:43LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 12:53:49LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 12:51:51LOC RTMP send error 70 for TX 4
            SYSEVENT:07/05/2016 12:49:50LOC RTMP send error 70 for TX 4



...and there were unusual errors recently in the Sys Info log:

            SysInfo:07/03/2016 15:52:11LOC Task:tNetraCtl File:C:/swng/apps/netra/netraControl.c Line:3370 Tick:484816547, Media Processor watchdog reset encoder index:0, reason 4
            SysInfo:07/03/2016 04:48:17LOC Err code 0x1700
            SysInfo:07/03/2016 01:12:00LOC Task:tNetraCtl File:C:/swng/apps/netra/netraControl.c Line:3370 Tick:432006369, Media Processor watchdog reset encoder index:0, reason 1
            SysInfo:07/03/2016 01:12:00LOC Err code 0x1700
            SysInfo:07/02/2016 08:49:50LOC Task:tNetraCtl File:C:/swng/apps/netra/netraControl.c Line:3370 Tick:373077488, Media Processor watchdog reset encoder index:3, reason 4
            SysInfo:06/28/2016 05:38:21LOC Task:mud1_t2 File:C:/swng/apps/util/pm.c Line:1919 Tick:15995812, PM_getReplyPipe(): Sequence error. Expected Seq = 6711, received Seq = 6666
            SysInfo:06/28/2016 05:35:56LOC Task:mud1_t1 File:C:/swng/apps/snmp/snmpLeafSystem.c Line:2561 Tick:15851205, internal apply error
            SysInfo:06/28/2016 05:35:56LOC Task:mud1_t1 File:C:/swng/apps/util/pm.c Line:2457 Tick:15851204, sendCmdToRoot(): Timeout receiving reply  


The transmit buffer size for TX 4 is 4MB.  The current configuration for TX 4 is:

                        <Group id="Transmitter.4">
                            <Object id="1" Name="Enable Transmitter"  Value="Disabled"/>
                            <Object id="2" Name="Name"  Value="HTTPS W1a Restricted"/>
                            <Object id="3" Name="Stream Select"  Value="1"/>
                            <Object id="4" Name="Destination IP Address Type "  Value="Host Name"/>
                            <Object id="5" Name="Destination IP Address"  Value="ec2-54-190-118-230.us-west-2.compute.amazonaws.com"/>
                            <Object id="6" Name="Video Port"  Value="4444"/>
                            <Object id="7" Name="Audio Port"  Value="4644"/>
                            <Object id="8" Name="MPEG2TS Port"  Value="4444"/>
                            <Object id="9" Name="RTMP Port"  Value="1935"/>
                            <Object id="10" Name="RTCP Transmit Enable"  Value="Enabled"/>
                            <Object id="11" Name="RTCP Retransmit Time"  Value="10"/>
                            <Object id="12" Name="Announce Enable"  Value="Enabled"/>
                            <Object id="13" Name="Announce Use Global IP/Port"  Value="Enabled"/>
                            <Object id="14" Name="Announce IP Address Type "  Value="IP Address"/>
                            <Object id="15" Name="Announce IP Address"  Value="224.2.127.254"/>
                            <Object id="16" Name="Announce Port"  Value="9875"/>
                            <Object id="17" Name="Announce Send To Unicast Destination"  Value="Disabled"/>
                            <Object id="18" Name="External Announce Enable"  Value="Disabled"/>
                            <Object id="19" Name="External Announce Use Global IP/Port"  Value="Enabled"/>
                            <Object id="20" Name="External Announce IP Address Type"  Value="IP Address"/>
                            <Object id="21" Name="External Announce IP Address"  Value=""/>
                            <Object id="22" Name="External Announce Port"  Value="1040"/>
                            <Object id="23" Name="External Announce URL"  Value=""/>
                            <Object id="24" Name="Auto Unicast Mode"  Value="Disabled"/>
                            <Object id="25" Name="Auto Unicast Dest Port"  Value="554"/>
                            <Object id="26" Name="Auto Unicast Dest Pub Pnt Name"  Value="mystream7.sdp"/>
                            <Object id="27" Name="Auto Unicast Dest Username"  Value="broadcast"/>
                            <Object id="28" Name="Auto Unicast Dest Password"  Value="broadcast"/>
                            <Object id="29" Name="RTMP Application"  Value="HTTPS_Restricted"/>
                            <Object id="30" Name="RTMP Stream"  Value="PAR-Commons"/>
                            <Object id="31" Name="RTMP Username"  Value="VBrick-9000"/>
                            <Object id="32" Name="RTMP Password"  Value="8Bpm6Fus7giU97A6"/>
                            <Object id="33" Name="FEC Stream 1 Enable"  Value="Enabled"/>
                            <Object id="34" Name="FEC Stream 2 Enable"  Value="Enabled"/>
                            <Object id="35" Name="RTP Encapsulation Enable"  Value="Disabled"/>
                            <Object id="36" Name="Tx Session Name"  Value=""/>
                            <Object id="37" Name="Smooth Streaming Dest Port"  Value="80"/>
                            <Object id="38" Name="Smooth Streaming Dest Pub Pnt Name"  Value="VBLiveSmoothStream.isml"/>
                            <Object id="39" Name="Smooth Streaming Dest Username"  Value="broadcast"/>
                            <Object id="40" Name="Smooth Streaming Dest Password"  Value="broadcast"/>
                            <Object id="41" Name="Smooth Streaming Domain"  Value=""/>
                            <Object id="42" Name="RTMP Timecode Enable"  Value="Disabled"/>
                            <Object id="43" Name="RTMP Timecode Frame Interval"  Value="15"/>
                            <Object id="44" Name="RTMP DVR Auto Start Enable"  Value="Disabled"/>
                            <Object id="45" Name="Transmit Buffer Size"  Value="4M"/>
                            <Object id="46" Name="RTMP Id"  Value="VBrick"/>
                            <Object id="9999" Name="Apply Program Transmitter"  Value="Apply"/>
                        </Group>
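
A note on the numbers: at 3.5 Mb/sec, the 4 MB (32 Mbit) transmit buffer holds only about nine seconds of stream, so if error 70 reflects that buffer backing up (an assumption, pending VBrick's answer), relatively short stalls on the path to AWS would be enough to trip it. To take the encoder hardware out of the equation, a synthetic push to the same Wowza application from a host in par1 could be tried; a rough sketch assuming ffmpeg is available there, with the destination host and credentials to be filled in from the transmitter config above:

  # Synthetic ~3.5 Mb/s RTMP push to the same Wowza application/stream.
  # WOWZA_HOST is a placeholder for the real destination.
  ffmpeg -re -f lavfi -i testsrc=size=1280x720:rate=30 \
         -c:v libx264 -pix_fmt yuv420p -b:v 3500k -maxrate 3500k -bufsize 7000k \
         -f flv rtmp://WOWZA_HOST:1935/HTTPS_Restricted/PAR-Commons

If the test push drops at similar two-minute intervals, the problem is on the path or the Wowza side; if it holds, attention shifts back to the encoder.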

Comment 9

2 years ago
VBrick incident number: 00045630.
(Assignee)

Comment 10

2 years ago
Re-assigning this bug to Richard, as it doesn't appear to be a NetOps issue.
Assignee: dcurado → richard

Comment 11

2 years ago
The problem appears to be a bandwidth availability issue.  Reducing the RTMP stream bitrate from 3.5 Mb/sec to 2.5 Mb/sec solves the periodic disconnect issue, but reduces the quality of the video streams from Paris.

Handing this back to Dave.
Assignee: richard → dcurado

Comment 12

2 years ago
We could try a different instance type, such as m4.2xlarge. That would mean moving the instance into a VPC, which is a good idea anyway.
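
Before committing to a move, a sustained TCP throughput test between par1 and the instance would show whether the ceiling that forced the drop to 2.5 Mb/sec sits in the transit path or on the instance itself. A sketch, assuming iperf3 can be installed on both ends and TCP/5201 is opened in the security group:

  # On the Wowza instance (server side):
  iperf3 -s

  # From a host in par1 (client side), 60-second run toward the instance:
  iperf3 -c 54.218.169.156 -t 60

A single TCP flow comfortably sustaining well above 3.5 Mb/sec for the whole run would argue against a raw bandwidth ceiling and point back at buffering or the application layer.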
(Assignee)

Comment 13

2 years ago
I am attaching a screenshot of our par1 firewall interface to the Internet.
We have 100 Mbit/s of capacity there.  Over the past week we touched 80 Mbit/s once;
otherwise the average is (as you can see) quite low.

The Mozilla portion of the network infrastructure in use for this application is working correctly and has enough bandwidth.

Within AWS, or getting to AWS... we can't control that.

Please open a new bug with Richard Weiss to figure out what can be done on the AWS side.
Thank you.
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → INVALID
(Assignee)

Comment 14

2 years ago
Created attachment 8768479 [details]
Screen Shot 2016-07-06 at 2.21.50 PM.png