Closed Bug 1179709 Opened 9 years ago Closed 9 years ago

We lost the tunnel to use1

Categories

(Infrastructure & Operations Graveyard :: NetOps: Other, task)

Type: task / Priority: Not set / Severity: normal

Tracking

(Not tracked)

Status: RESOLVED INCOMPLETE

People

(Reporter: Usul, Assigned: dcurado)

Details

I got the following:

<nagios-releng> (IRC) Thu 03:07:14 PDT [4479] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is CRITICAL: SNMP CRITICAL - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime *30* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)

<nagios-usw1> Thu 06:09:17 EDT [10213] nagios1.private.releng.use1.mozilla.com (10.134.75.31) is DOWN :PING CRITICAL - Packet loss = 100%
* nagios-use1 has quit (Ping timeout: 121 seconds)

<nagios-releng> Thu 03:09:34 PDT [4480] nagios1.private.releng.use1.mozilla.com (10.134.75.31) is DOWN :PING CRITICAL - Packet loss = 100%

<nagios-usw2> Thu 06:09:44 EDT [10300] nagios1.private.releng.use1.mozilla.com (10.134.75.31) is DOWN :PING CRITICAL - Packet loss = 100%

I can't ssh to nagios1.private.releng.use1.mozilla.com even when the mozilla-vpn is up; it times out.
Of course I paged dcurado, and as soon as the page was sent:

<nagios-releng> Thu 03:20:35 PDT [4540] nagios1.private.releng.use1.mozilla.com (10.134.75.31) is UP :PING OK - Packet loss = 0%, RTA = 83.53 ms
<nagios-usw2> Thu 06:20:35 EDT [10302] nagios1.private.releng.use1.mozilla.com (10.134.75.31) is UP :PING OK - Packet loss = 0%, RTA = 103.51 ms
<nagios-usw1> Thu 06:20:37 EDT [10215] nagios1.private.releng.use1.mozilla.com (10.134.75.31) is UP :PING OK - Packet loss = 0%, RTA = 83.74 ms
<nagios-releng> (IRC) Thu 03:20:55 PDT [4556] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is CRITICAL: SNMP CRITICAL - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime *40* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)
Went down again:
<nagios-releng> (IRC) Thu 03:34:52 PDT [4576] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-1 is CRITICAL: SNMP CRITICAL - BGP sess vpn-c149afa8-1 (use1/169.254.255.73) uptime *8* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-1)
<nagios-releng> (IRC) Thu 03:34:52 PDT [4579] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is CRITICAL: SNMP CRITICAL - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime *1* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)
<nagios-usw2> Thu 06:36:55 EDT [10304] nagios1.private.releng.use1.mozilla.com (10.134.75.31) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-releng> Thu 03:37:02 PDT [4581] nagios1.private.releng.use1.mozilla.com (10.134.75.31) is DOWN :PING CRITICAL - Packet loss = 100%
* nagios-use1 has quit (Ping timeout: 121 seconds)
<nagios-usw1> Thu 06:37:48 EDT [10217] nagios1.private.releng.use1.mozilla.com (10.134.75.31) is DOWN :PING CRITICAL - Packet loss = 100%
<nagios-scl3> Thu 03:38:40 PDT [5085] levin.mozilla.org:IRC TCP - port 8443 is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/IRC+TCP+-+port+8443)
It came back up:
<nagios-releng> (IRC) Thu 03:39:51 PDT [4585] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-1 is CRITICAL: SNMP CRITICAL - BGP sess vpn-c149afa8-1 (use1/169.254.255.73) uptime *294* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-1)
<nagios-releng> (IRC) Thu 03:39:51 PDT [4588] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is WARNING: SNMP WARNING - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime *301* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)
<nagios-releng> (IRC) Thu 03:40:51 PDT [4599] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-1 is WARNING: SNMP WARNING - BGP sess vpn-c149afa8-1 (use1/169.254.255.73) uptime *354* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-1)
<nagios-scl3> Thu 03:41:00 PDT [5087] levin.mozilla.org:IRC Clients is CRITICAL: NRPE: Unable to read output (http://m.mozilla.org/IRC+Clients)
<nagios-scl3> Thu 03:43:29 PDT [5089] levin.mozilla.org:IRC TCP - port 8443 is OK: TCP OK - 0.080 second response time on port 8443 (http://m.mozilla.org/IRC+TCP+-+port+8443)
<nagios-releng> (IRC) Thu 03:44:51 PDT [4619] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is OK: SNMP OK - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime 602 secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)
<nagios-releng> (IRC) Thu 03:45:52 PDT [4624] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-1 is OK: SNMP OK - BGP sess vpn-c149afa8-1 (use1/169.254.255.73) uptime 655 secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-1)
<nagios-scl3> Thu 03:47:00 PDT [5090] levin.mozilla.org:IRC TCP - port 6665 is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/IRC+TCP+-+port+6665)
<nagios-scl3> Thu 03:54:40 PDT [5092] levin.mozilla.org:IRC Clients is OK: CRITCLIENTS OK: clients 3819 (critcount=5) (http://m.mozilla.org/IRC+Clients)
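
An aside on reading these alerts: the check alarms on low BGP session uptime, i.e. a session that has only just (re)established after a drop, which is why a session that is technically up can still page as CRITICAL. From the transitions above, the thresholds look like roughly CRITICAL below 300 seconds of uptime and WARNING below 600. A minimal sketch of that logic in Python; the threshold values are inferred from this log, not taken from the actual nagios plugin:

CRIT_SECS = 300   # inferred: 294s paged CRITICAL, 301s paged WARNING
WARN_SECS = 600   # inferred: 354s paged WARNING, 602s reported OK

def classify_bgp_uptime(uptime_secs: int) -> str:
    """Map a BGP session's established time to an alert state.

    Low uptime means the session recently flapped; an established
    session can therefore still be CRITICAL.
    """
    if uptime_secs < CRIT_SECS:
        return "CRITICAL"
    if uptime_secs < WARN_SECS:
        return "WARNING"
    return "OK"

for secs in (30, 294, 301, 602):
    print(secs, classify_bgp_uptime(secs))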
The ipsec tunnels and bgp sessions from both fw1.scl3 and fw1.releng.scl3 to AWS's us-east-1 VPN endpoint are down.

From fw1.releng.scl3, I can ping one of the two IP addresses of the AWS VPN endpoint, but ipsec and bgp are (as mentioned) down.  At the moment I am assuming that AWS has a problem with their VPN endpoint.

gateway gw-vpn-c149afa8-1 {
    ike-policy ike-pol-vpn-c149afa8-1;
    address 72.21.209.193;
    dead-peer-detection;
    external-interface reth0.1030;
}
gateway gw-vpn-c149afa8-2 {
    ike-policy ike-pol-vpn-c149afa8-2;
    address 72.21.209.225;
    dead-peer-detection;
    external-interface reth0.1030;
} 
                                        
{primary:node1}
dcurado@fw1.ops.releng.scl3.mozilla.net> ping 72.21.209.225 
PING 72.21.209.225 (72.21.209.225): 56 data bytes
64 bytes from 72.21.209.225: icmp_seq=0 ttl=52 time=84.930 ms
64 bytes from 72.21.209.225: icmp_seq=1 ttl=52 time=85.167 ms
64 bytes from 72.21.209.225: icmp_seq=2 ttl=52 time=98.711 ms
^C
--- 72.21.209.225 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 84.930/89.603/98.711/6.441 ms

{primary:node1}
dcurado@fw1.ops.releng.scl3.mozilla.net> ping 72.21.209.193    
PING 72.21.209.193 (72.21.209.193): 56 data bytes
^C
--- 72.21.209.193 ping statistics ---
12 packets transmitted, 0 packets received, 100% packet loss

-----------------
and the same behavior from fw1.scl3...

gateway gw-vpn-6510f20c-1 {             
    ike-policy ike-pol-vpn-6510f20c-1;  
    address 207.171.167.235;            
    dead-peer-detection;                
    external-interface reth0.1108;      
}                                       
gateway gw-vpn-6510f20c-2 {             
    ike-policy ike-pol-vpn-6510f20c-2;  
    address 207.171.167.234;            
    dead-peer-detection;                
    external-interface reth0.1108;      
}                                                    
                                        
{primary:node0}
dcurado@fw1.ops.scl3.mozilla.net> ping 207.171.167.235 
PING 207.171.167.235 (207.171.167.235): 56 data bytes
64 bytes from 207.171.167.235: icmp_seq=0 ttl=52 time=84.177 ms
64 bytes from 207.171.167.235: icmp_seq=1 ttl=52 time=83.265 ms
64 bytes from 207.171.167.235: icmp_seq=2 ttl=52 time=83.188 ms
64 bytes from 207.171.167.235: icmp_seq=3 ttl=52 time=126.620 ms
^C
--- 207.171.167.235 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/stddev = 83.188/94.312/126.620/18.657 ms

{primary:node0}
dcurado@fw1.ops.scl3.mozilla.net> ping 207.171.167.234    
PING 207.171.167.234 (207.171.167.234): 56 data bytes
^C
--- 207.171.167.234 ping statistics ---
5 packets transmitted, 0 packets received, 100% packet loss

I checked http://status.aws.amazon.com/ which is not reporting any problem at this time.
But it seems clear from the above that they are actually having a problem of some sort.

I suspect that a whole bunch of activity is going on within AWS at the moment, but I will 
try opening a support case with them anyway, just in case I am wrong in what I believe is
going on.
Opened a case with AWS support:

Jul 2, 2015, 07:15 AM -0400
Case Number: 1434443071
Subject: All ipsec tunnels and BGP sessions to us-east-1 are down
Type: Technical / Severity: Urgent / Status: Unassigned

Our support deal with them requires them to get back to us within 1 hour.
But... I'm thinking it may take longer than that. 
I will update this bug once I know more.
Assignee: network-operations → dcurado
Status: NEW → ASSIGNED
Closed trees since this affects releng etc.
I am seeing both ipsec tunnels and bgp sessions from fw1.releng.scl3 as up.
If it does not cause too much pain, I would recommend keeping the trees closed for a little while
longer.  Perhaps 30 minutes?
Just to make sure AWS really has their act together, and to avoid problems from the trees opening and closing again.
So, affected sites/services:
IRC (levin.mozilla.org is in use1)
masterfirefoxos.mozilla.org
releng - trees closed.
While the VPN connection from fw1.releng.scl3 to AWS us-east-1 (which consists of 2 ipsec tunnels
and 2 bgp sessions) has been restored at this time, the VPN connections from fw1.scl3 to AWS us-east-1
are not 100% restored.  For the latter set of VPN connections (fw1.scl3 to AWS us-east-1), one ipsec tunnel and one bgp session from each VPN connection are still down.
This is non-service-impacting, as we are not actually using the VPN connections from fw1.scl3 to us-east-1 for anything at this time.  (They are for the IT AWS project, which does not have any services in production.)
(In reply to Dave Curado :dcurado from comment #7)
> I am seeing both ipsec tunnels and bgp sessions from fw1.releng.scl3 as up.
> If it does not cause too much pain, I would recommend keeping the trees
> closed for a little while
> longer.  Perhaps 30 minutes?
> Just to make sure AWS really has their act together, and to avoid problems
> from the trees opening and closing again.

yeah that's fine, will report/check back in 30 minutes here and monitor the bug
OK, making another cup of coffee and taking a closer look at this.
Still seeing the VPN connection from fw1.releng.scl3 as fully restored.
But also still seeing some of the (not really in production) ipsec tunnels and bgp sessions from
fw1.scl3 to AWS us-east-1 as down.

Here's the thing: I can ping the AWS VPN end points from here in Boston, but not all of them from
fw1.scl3.  I am making a list of all the IP addresses in question so I can make a map of who can ping
who and from where.  

Note that all of our IP address end points are part of the same prefixes we announce to the world via BGP, so I don't see that we (Mozilla) are causing any sort of reachability issue here.
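
A quick way to build that ping map is to script the sweep. A minimal sketch in Python, run from a Linux host (it uses iputils ping flags); the gateway list holds just two illustrative entries taken from this bug, not the full set, and the measured results follow below:

import subprocess

# Hypothetical endpoint list; the real one is in the table below.
ENDPOINTS = {
    "gw-vpn-6510f20c-1": "207.171.167.235",
    "gw-vpn-6510f20c-2": "207.171.167.234",
}

def reachable(addr: str, count: int = 3, timeout: int = 5) -> bool:
    """True if at least one ICMP echo reply comes back."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout), addr],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

for gw, addr in ENDPOINTS.items():
    print(gw, addr, "good" if reachable(addr) else "no response")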
Here's the status of reachability from fw1.scl3 to all AWS VPN end points
(columns: gateway name, AWS endpoint address, external interface, local source IP, ping result):

gateway gw-vpn-8a35d6e3-1 address 54.240.217.166 reth0.1025 63.245.214.78 no response
gateway gw-vpn-8a35d6e3-2 address 54.240.217.160 reth0.1025 63.245.214.78 good
gateway gw-vpn-ae590beb-1 address 204.246.163.77 reth0.1024 63.245.214.54 good
gateway gw-vpn-ae590beb-2 address 204.246.163.76 reth0.1024 63.245.214.54 good
gateway gw-vpn-e80eec81-1 address 207.171.167.235 reth0.1116 63.245.214.117 good
gateway gw-vpn-e80eec81-2 address 207.171.167.234 reth0.1116 63.245.214.117 no response
gateway gw-vpn-4376aa5d-1 address 54.239.50.133 reth0.1116 63.245.214.117 good
gateway gw-vpn-4376aa5d-2 address 54.239.50.132 reth0.1116 63.245.214.117 good
gateway gw-vpn-6510f20c-1 address 207.171.167.235 reth0.1108 63.245.214.109 good
gateway gw-vpn-6510f20c-2 address 207.171.167.234 reth0.1108 63.245.214.109 good
gateway gw-vpn-d072aece-1 address 54.239.50.132 reth0.1108 63.245.214.109 good
gateway gw-vpn-d072aece-2 address 54.239.50.133 reth0.1108 63.245.214.109 good
gateway gw-vpn-2d5dbf44-1 address 54.240.202.251 reth0.1112 63.245.214.113 good
gateway gw-vpn-2d5dbf44-2 address 54.240.202.249 reth0.1112 63.245.214.113 good
gateway gw-vpn-93835c8d-1 address 54.239.50.133 reth0.1112 63.245.214.113 good
gateway gw-vpn-93835c8d-2 address 205.251.233.122 reth0.1112 63.245.214.113 good

However, ipsec and bgp sessions to some of the end points marked as "good" are
still down.  

dcurado@fw1.ops.scl3.mozilla.net> show bgp summary 
Groups: 4 Peers: 20 Down peers: 6
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0               278        138          0          0          0          0
inet6.0               18          7          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
10.0.22.1             65000     501083     502954       0       2    22w4d18h 37/134/134/0         0/0/0/0
10.0.22.5             65000     501019     502913       0       3    22w4d18h 95/134/134/0         0/0/0/0
169.254.249.49         7224      65957      69435       0      80     1w0d11h 1/1/1/0              0/0/0/0
169.254.249.53         7224      61561      64827       0      48 6d 23:25:28 0/1/1/0              0/0/0/0
169.254.249.57         7224     191745     202182       0       3     3w0d18h 1/1/1/0              0/0/0/0
169.254.249.61         7224      63993      67386       0      22      1w0d6h 0/1/1/0              0/0/0/0
169.254.249.65         7224      66617      70130       0      85     1w0d13h 0/1/1/0              0/0/0/0
169.254.249.69         7224      79161      83448       0      48     1w1d23h 1/1/1/0              0/0/0/0
169.254.253.17         7224     503463     529221       0      14      5w2d3h 1/1/1/0              0/0/0/0
169.254.253.21         7224     328344     345269       0       9      5w2d3h 0/1/1/0              0/0/0/0
169.254.255.33         7224     223922     235780       0      32       23:39 Connect
169.254.255.37         7224     139199     146602       0     149        4:54 Connect
169.254.255.41         7224     190288     200342       0      36     3:21:15 Connect
169.254.255.45         7224     266923     281131       0     138     1:06:05 Connect
169.254.255.49         7224     166553     173170       0       4     1:00:51 Connect
169.254.255.53         7224     279122     287721       0       4     1:11:23 1/1/1/0              0/0/0/0
169.254.255.73         7224      64002      67375       0     156     2:28:52 Connect
169.254.255.77         7224     327238     344783       0      94        3:45 1/1/1/0              0/0/0/0
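
How to read that State column: an established peer shows its prefix counts (e.g. 1/1/1/0), while a peer that is down shows its BGP FSM state instead, e.g. Connect. A minimal sketch for picking the down peers out of saved output; the filename is hypothetical:

# A session that is not established prints its FSM state name in the
# State column; an established one prints prefix counts instead.
DOWN_STATES = {"Idle", "Connect", "Active", "OpenSent", "OpenConfirm"}

def down_peers(path: str = "bgp-summary.txt"):
    """Yield (peer, state) for each non-established BGP session."""
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if fields and fields[-1] in DOWN_STATES:
                yield fields[0], fields[-1]

for peer, state in down_peers():
    print(peer, state)

Run over the output above, that would list the six peers stuck in Connect, matching "Down peers: 6" in the summary line.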

Fortunately, as per my previous comments, none of the VPN connections that are down
are carrying production traffic.
As per our AWS architect, I changed the status update method on our case with AWS from "web" to "phone".

The result, for those following along at home, is that AWS' automated phone system immediately called me,
and put me in a hold queue.

<sigh>
Update:
Have been listening to bad hold music for about 10 minutes now.
No idea of the call hold queue depth.

The same ipsec tunnels and bgp sessions are still down on fw1.scl3, no change there.

HOWEVER: previously, both ipsec tunnels (and associated bgp sessions) from fw1.releng.scl3 had been
      restored.  Now one of those ipsec tunnels has gone down again.  Whatever problem is occurring
      is still ongoing.
Now both ipsec tunnels and associated bgp sessions from fw1.releng.scl3 to AWS us-east-1 are back up.

dcurado@fw1.ops.releng.scl3.mozilla.net> show bgp summary 
Groups: 3 Peers: 8 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0               274        134          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
10.0.22.9             65000     501089     506903       0       2      8w0d8h 31/134/134/0         0/0/0/0
10.0.22.13            65000     501013     506871       0       1    22w4d23h 100/134/134/0        0/0/0/0
169.254.249.25         7224     327342     351708       0      11      5w2d5h 1/1/1/0              0/0/0/0
169.254.249.29         7224     327648     352116       0      17      2w2d7h 0/1/1/0              0/0/0/0
169.254.253.25         7224     329542     351707       0      13      5w2d5h 0/1/1/0              0/0/0/0
169.254.253.29         7224     606740     648526       0      10     9w5d14h 1/1/1/0              0/0/0/0
169.254.255.73         7224        273        292       0      37          25 0/1/1/0              0/0/0/0 <----
169.254.255.77         7224     507253     546655       0      54     1:31:09 1/1/1/0              0/0/0/0 <----

bad hold music for ~15 minutes now
Update:
All ipsec tunnels and associated BGP sessions have now been restored.
Bad hold music up to about 20 minutes now.  I'm going to leave that call 
going for at least an hour, just to see if AWS ever actually picks up their
end of the line -- to give Mozilla IT management that data point.

dcurado@fw1.ops.scl3.mozilla.net> show bgp summary 
Groups: 4 Peers: 20 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0               284        140          0          0          0          0
inet6.0               18          7          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
10.0.22.1             65000     501138     503008       0       2    22w4d19h 38/134/134/0         0/0/0/0
10.0.22.5             65000     501074     502967       0       3    22w4d18h 94/134/134/0         0/0/0/0
169.254.249.49         7224      66108      69595       0      80     1w0d11h 1/1/1/0              0/0/0/0
169.254.249.53         7224      61713      64987       0      48 6d 23:50:05 0/1/1/0              0/0/0/0
169.254.249.57         7224     191895     202342       0       3     3w0d18h 1/1/1/0              0/0/0/0
169.254.249.61         7224      64144      67546       0      22      1w0d6h 0/1/1/0              0/0/0/0
169.254.249.65         7224      66768      70290       0      85     1w0d13h 0/1/1/0              0/0/0/0
169.254.249.69         7224      79312      83608       0      48     1w1d23h 1/1/1/0              0/0/0/0
169.254.253.17         7224     503614     529381       0      14      5w2d4h 1/1/1/0              0/0/0/0
169.254.253.21         7224     328495     345429       0       9      5w2d3h 0/1/1/0              0/0/0/0
169.254.255.33         7224     223959     235819       0      32        5:38 1/1/1/0              0/0/0/0
169.254.255.37         7224         29         30       0     149        4:12 0/1/1/0              0/0/0/0
169.254.255.41         7224     190324     200380       0      36        5:29 1/1/1/0              0/0/0/0
169.254.255.45         7224         37         37       0     138        5:23 0/1/1/0              0/0/0/0
169.254.255.49         7224     166590     173209       0       4        5:35 0/1/1/0              0/0/0/0
169.254.255.53         7224     279273     287881       0       4     1:36:00 1/1/1/0              0/0/0/0
169.254.255.73         7224      64036      67410       0     156        5:04 0/1/1/0              0/0/0/0
169.254.255.77         7224     327389     344943       0      94       28:22 1/1/1/0              0/0/0/0
2620:101:8000:1028::1       65000     495359     503005       0       2    22w4d19h Establ
  inet6.0: 6/9/9/0
2620:101:8000:1029::1       65000     495335     502965       0       3    22w4d18h Establ
  inet6.0: 1/9/9/0

----

dcurado@fw1.ops.releng.scl3.mozilla.net> show bgp summary 
Groups: 3 Peers: 8 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0               274        134          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
10.0.22.9             65000     501101     506915       0       2      8w0d9h 31/134/134/0         0/0/0/0
10.0.22.13            65000     501025     506883       0       1    22w4d23h 100/134/134/0        0/0/0/0
169.254.249.25         7224     327376     351744       0      11      5w2d5h 1/1/1/0              0/0/0/0
169.254.249.29         7224     327681     352153       0      17      2w2d7h 0/1/1/0              0/0/0/0
169.254.253.25         7224     329576     351744       0      13      5w2d5h 0/1/1/0              0/0/0/0
169.254.253.29         7224     606774     648562       0      10     9w5d14h 1/1/1/0              0/0/0/0
169.254.255.73         7224        306        328       0      37        5:55 0/1/1/0              0/0/0/0
169.254.255.77         7224     507287     546691       0      54     1:36:39 1/1/1/0              0/0/0/0
trees reopened at 7:01am pacific
on hold for 1 hour and 3 minutes now
AWS picked up after ~1h 20mins.
However, they were only able to look at our VPN connections and verify that they are, in fact, up.
After a few minutes they were able to find that there was some problem with the VPN infrastructure
in us-east-1, and that the VPN team is looking at it but does not have a root cause yet.

I was promised a call back by COB today with an RFO (reason for outage).
From AWS just now:

Hello Dave,

Thank you for your time on call today. Here is the summary of our conversation. 

You had VPN outage this morning. Your IPSec and BGP peering went down around 2015-07-02 10:05:23 UTC. You had a period of outage and called in to see if there was any issue going on with AWS us-east-1 region endpoint. I was able to confirm that there was certainly an outage that happened but the issue was resolved. You confirmed that your connectivity is restored but wanted to know the root cause analysis on what happened. 

I have engaged the VPN team for further investigation and root cause analysis. Pending their investigation I will contact you with an update as soon as I receive an answer from them. Please do not hesitate to reach out to us if you have any questions or concern.

Best regards,

Guru K.
Amazon Web Services
We value your feedback. Please rate my response using the link below.
Root cause never came from AWS; at this point I don't think we'll be getting one, and badgering them
for it seems like tilting at windmills.

Closing this bug.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard