Testpool connection issues

RESOLVED FIXED

Status

Infrastructure & Operations
CIDuty
--
blocker
RESOLVED FIXED
5 years ago
a month ago

People

(Reporter: Tomcat, Assigned: XioNoX)

Tracking

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

5 years ago
We seem to have some connection issues on inbound right now.

Failures like:

-> https://tbpl.mozilla.org/php/getParsedLog.php?id=30763981&tree=Mozilla-Inbound
Can't download from http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1384864326/firefox-28.0a1.en-US.linux-x86_64.tests.zip to /builds/slave/test/build/firefox-28.0a1.en-US.linux-x86_64.tests.zip!

-> https://tbpl.mozilla.org/php/getParsedLog.php?id=30763107&tree=Mozilla-Inbound
TEST-UNEXPECTED-FAIL | test_killall.py test_killall.TestKillAll.test_kill_all | error: [Errno 111] Connection refused


-> https://tbpl.mozilla.org/php/getParsedLog.php?id=30763913&tree=Mozilla-Inbound
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.spread.pb.PBConnectionLost'>: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

Could someone check this?
(Reporter)

Comment 1

5 years ago
Note: this now seems to affect other trees as well, so all trees are closed for this.
Summary: Possible testpool connection issues on mozilla-inbound → Testpool connection issues

Comment 2

5 years ago
Nick helped me look into this. In short, stage was really overloaded for a bit with uploads and virus scan traffic. 

(In reply to Carsten Book [:Tomcat] from comment #0)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=30763981&tree=Mozilla-
> Inbound
> Can't download from
> http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-
> inbound-linux64/1384864326/firefox-28.0a1.en-US.linux-x86_64.tests.zip to
> /builds/slave/test/build/firefox-28.0a1.en-US.linux-x86_64.tests.zip!

Cross-colo link that timed out due to excessive load on stage.

> https://tbpl.mozilla.org/php/getParsedLog.php?id=30763107&tree=Mozilla-
> Inbound
> TEST-UNEXPECTED-FAIL | test_killall.py
> test_killall.TestKillAll.test_kill_all | error: [Errno 111] Connection
> refused

This looks like an actual b2g test failure, but I could be wrong.
 
> ->
> https://tbpl.mozilla.org/php/getParsedLog.php?id=30763913&tree=Mozilla-
> Inbound
> remoteFailed: [Failure instance: Traceback (failure with no frames): <class
> 'twisted.spread.pb.PBConnectionLost'>: [Failure instance: Traceback (failure
> with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection
> to the other side was lost in a non-clean fashion.

Cross-colo link that timed out due to excessive load on stage.

Load is down now. I retriggered one of the failed mochitest jobs (https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=f12f7257b1c6&jobname=Ubuntu%20VM%2012.04%20x64%20mozilla-inbound%20opt%20test%20mochitest-3) and it downloaded its resources fine.

Updated

5 years ago
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Happening again. All trees closed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 4

5 years ago
Not sure if it's related yet, but:

11:07 PM PST We are investigating increased API error rates and latencies in the US-EAST-1 Region.
11:39 PM PST We continue to investigate the issue causing increased API error rates and latencies in the US-EAST-1 Region.
Nov 19, 12:29 AM PST Between 10:37 PM and 11:45 PM PST on November 18 we experienced increased API error rates and latencies in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.

Comment 5

5 years ago
Is there any update on this? :-)

Comment 6

5 years ago
[12:07pm] coop|buildduty: ping - all trees are currently closed due to network issues. would any of the work from the TCW over the weekend have affected the network?
[12:07pm] coop|buildduty: e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=924570
[12:08pm] nthomas: or are we seeing any bgp flaps on the RelEng VPCs to us-west-2 or us-east-1 ?
[12:09pm] arr: netops ^^
[12:11pm] nthomas: vpn-0525f61b in us-west-2 is 'One tunnel down'
[12:11pm] nthomas: in the AWS console
[12:12pm] XioNoX: hi
[12:12pm] nthomas: us-east-1 looks fine (now)
[12:13pm] XioNoX: nthomas: yes, the VPCs flapped a lot today
[12:14pm] nthomas: fun. any info on where the issue might lie ?
[12:15pm] XioNoX: and they have just flapped one more time
[12:15pm] nthomas: the usual 'somewhere outside our network' ?
[12:15pm] XioNoX: let me look
[12:21pm] nthomas: AWS says the 205.251.233.121 one is flapping

Comment 7

5 years ago
[12:37pm] Callek|Buildduty: XioNoX: so I can try and re-gather some state, are you in standby mode for direction, or are you investigating already?
[12:37pm] cturra: rail: can you file a bug for me for this pls?
[12:37pm] XioNoX: Callek|Buildduty: I'm monitoring it more closely
[12:38pm] XioNoX: and yes it's always the same sessions that is bouncing so in theory bgp should fail over the 2nd link
[12:39pm] XioNoX: Callek|Buildduty: which network is behind this VPC?
[12:40pm] Callek|Buildduty: XioNoX: is this session bouncing on scl3 or aws side?
[12:40pm] Callek|Buildduty: XioNoX: we are aware of BGP failovers not preserving session state on the AWS end of course
[12:40pm] Callek|Buildduty: aiui all VPC connections to and from aws go through scl3 for us
[12:40pm] XioNoX: Callek|Buildduty: 10.132.0.0/16      *[BGP/170] 04:58:12, localpref 100
[12:41pm] XioNoX:                       AS path: 7224 I
[12:41pm] XioNoX:                     > to 169.254.249.29 via st0.2132
[12:41pm] XioNoX: so bgp prefers the other path anyway
[12:41pm] XioNoX: so even if the other session bounces it shouldn't impact anything
[12:43pm] XioNoX: Callek|Buildduty: you can join #netops-alerts if you want to see it bounce btw
[12:43pm] • Callek|Buildduty looks up key
[12:47pm] XioNoX: Callek|Buildduty: so yeah, this single sessions flapping shouldn't cause any issue
[12:48pm] Callek|Buildduty: we've certainly seen loss of session state between us-west2 and scl3 today, multiple times
[12:48pm] Callek|Buildduty: and if the BGP session flaps, and fails over to the backup we do lose session state on AWS's end
[12:49pm] XioNoX: Callek|Buildduty: http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw2
[12:50pm] Callek|Buildduty: I sadly don't have any great insight beyond what you've told me and that generic comment, so I'm happy to just say to watch it closer for the next few hours and when/if things crop up we'll have current and relevant data
[12:50pm] Callek|Buildduty: (unless coop|lunch or nthomas|away wish to inject and say/ask for something more anyway)
[12:51pm] XioNoX: yeah I'm around so don't hesitate to ping me as soon as you're seeing something
[12:51pm] XioNoX: on smokeping my guess would be that where there is some packet loss is when bgp fails over
[12:52pm] XioNoX: but it doesn't seem like anything very bad from there
[12:52pm] XioNoX: but suboptimal at least
[12:53pm] Callek|Buildduty: XioNoX: ok, I told sheriff that things seem good at this instant, and failing new data there doesn't seem to be anything we can action on, and to ping me/buildduty as soon as any new issues show

Comment 8

5 years ago
Informally, we've heard from other AWS customers that they've been seeing the same issue (BGP failovers) with USW2 today.
Back again. Trees closed.
We brought this issue up to AWS support before and received the following reply:

"In the case of BGP connection failover, existing TCP connections are unfortunately not going to be transparently handled using the secondary tunnel. The only work-around I can think of would be to support appropriate retry procedures in the application itself or use asynchronous communication methods."

"Please accept my apologies, we are doing all that's possible to provide highly available and stable VPN connections but if the actual communication happens over Internet, intermittent connectivity issues cannot be entirely avoided despite of the protocol being used."
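The "retry procedures in the application itself" that AWS suggests could look roughly like the sketch below. This is only an illustration in Python, not what our tools actually do: the function name retry, its parameters and the backoff values are all made up for the example.

```python
import time


def retry(operation, attempts=5, base_delay=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Hypothetical helper: connection resets, refusals and socket timeouts
    are all OSError subclasses in Python 3, so they are retried by default.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2x, 4x, ... between tries.
            sleep(base_delay * 2 ** (attempt - 1))
```

A download step would then be wrapped as retry(lambda: fetch(url)), where fetch stands for whatever the job already uses to download. This only helps operations that are safe to repeat from scratch; it does nothing for a long-lived connection that is dropped midway.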
3:46:34 PM - coop|buildduty: load on ftp is normal now, and AWS service hasn't flapped in 80 minutes

Trees reopened.
[3:44pm] coop|buildduty: but there are two different issues here: 1) BGP flapping in AWS; and, 2) a giant load spike on ftp
[3:44pm] coop|buildduty: we can't do much about 1)
[3:45pm] coop|buildduty: and we're trying to deal with 2) by taking the problem node out of the shared pool, and possibly curtailing experiments with S3
[3:46pm] coop|buildduty: load on ftp is normal now, and AWS service hasn't flapped in 80 minutes
(Reporter)

Comment 13

5 years ago
Seems it's back: https://tbpl.mozilla.org/php/getParsedLog.php?id=30818963&tree=B2g-Inbound

Resolving ftp.mozilla.org (ftp.mozilla.org)... failed: Name or service not known.
wget: unable to resolve host address `ftp.mozilla.org'
(Reporter)

Comment 14

5 years ago
Another example (is this also stage?): CalledProcessError: Command '['hg', 'pull', 'http://hg-internal.dmz.scl3.mozilla.com/gaia-l10n/zh-TW']' returned non-zero exit status 255 - hg not responding
(Reporter)

Comment 15

5 years ago
Trees closed again. We now also get upload errors like: reverse mapping checking getaddrinfo for upload-zlb.vips.scl3.mozilla.com [63.245.215.47] failed - POSSIBLE BREAK-IN ATTEMPT!
Whatever host that is running on failed to look up reverse DNS for 63.245.215.47, probably from the same network problems that made the connection timeout. Ignore it.
(Assignee)

Comment 17

5 years ago
BGP flapped again between 10:20 and 10:30 UTC, which shows up as some packet loss on Smokeping:
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw2
But it is fine now.

Also, the high packet loss we can see on Smokeping between 22:00 and 23:30 yesterday (and the related outage) occurred while the VPNs to AWS were stable (the last flap was at 21:50 and the next one was at 01:15).

During that time pings to the AWS side of the ipsec tunnel were fine (no packet loss, etc...) but pings to hosts in AWS were experiencing difficulties (mostly high packet loss). Based on that I'd say that something else was going on within AWS during and after (or just after) the tunnels flapping.
(Reporter)

Comment 18

5 years ago
and trees reopened
We're going to stop using usw2 slaves while things are still rocky. Patch incoming.
Created attachment 8335457 [details] [diff] [review]
don't start new usw2 instances
Attachment #8335457 - Flags: review?(rail)
Comment on attachment 8335457 [details] [diff] [review]
don't start new usw2 instances

Rail tells me that we need to adjust a file on cruncher (~buildduty/bin/aws_watch_pending.sh) instead of doing this. He did that, so we won't start any new instances in usw2 now. We need to revert this at some point.
Attachment #8335457 - Attachment is obsolete: true
Attachment #8335457 - Flags: review?(rail)
(In reply to Ben Hearsum [:bhearsum] from comment #21)
> Comment on attachment 8335457 [details] [diff] [review]
> don't start new usw2 instances
> 
> Rail tells me that we need to adjust a file on cruncher
> (~buildduty/bin/aws_watch_pending.sh) instead of doing this. He did that, so
> we won't start any new instances in usw2 now. We need to revert this at some
> point.

We decided to try re-enabling usw2 this morning, as we're already getting backed up. We'll shut it off again if we see widespread disconnects. IRC conversation, for posterity:
12:03 < philor> bhearsum: well, builds still pending on https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=f0544dfcc75c from 20 minutes ago is sort of infra busted
12:03  * bhearsum has a look
12:04 < philor> but since someone's had a push to every single high-priority branch except release, it may well just be load
12:04 < bhearsum> it looks like those are all amazon jobs...
12:04 < bhearsum> and without usw2, our throughput is lower
12:05 < bhearsum> let me confirm that we're actually using everything from use1 though
12:05 < RyanVM> bhearsum: FWIW, I had a similar complaint earlier today
12:06 < RyanVM> coop didn't find any smoking guns
12:06 < RyanVM> and it magically resolved itself
12:06 < bhearsum> it looks like load to me - all instances in use1 except for a few try builders are running already
12:06 < bhearsum> one thing we could try is to re-enable usw2, but make it lower priority
12:07 < bhearsum> we might still hit issues at peak load, but it would make it less likely to hit them during lower load times without affecting our throughput
12:15 < RyanVM> bhearsum: *sigh*
12:15 < RyanVM> we're already in peakish load
12:15 < bhearsum> yeah, it sucks :(
12:15 < RyanVM> so we either get backlog or frequent disconnects
12:15 < RyanVM> or maybe usw2 is working again and we'll be fine
12:15 < bhearsum> it's possible that the link will behave better today
12:15 < bhearsum> yeah
12:15 < bhearsum> do you want to try it? we can turn it off again if things go nuts
12:16 < RyanVM> might as well roll the dice
I almost forgot to do the lower priority part...did that just now in https://hg.mozilla.org/build/cloud-tools/rev/808c69d48696
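For readers wondering what "re-enable usw2, but make it lower priority" means in practice, here is a rough Python sketch of the idea. It is not the code from the cloud-tools change above; the function name, the tuple layout and the numbers are invented for illustration.

```python
def plan_instances(pending, regions):
    """Spread `pending` jobs over regions, filling preferred regions first.

    regions: list of (name, priority, free_slots); a lower priority number
    means the region is tried earlier. All names here are hypothetical.
    """
    plan = {}
    for name, _priority, free_slots in sorted(regions, key=lambda r: r[1]):
        if pending <= 0:
            break
        take = min(pending, free_slots)
        if take:
            plan[name] = take
            pending -= take
    return plan
```

With use1 preferred, a small backlog stays entirely in use1 and usw2 is only used once use1 is full, so the flaky link is only exposed at peak load.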
Dustin and catlee mentioned something about the type of routing we're doing with Amazon (static vs dynamic) and that switching to the other type might give us hung connections (which might resume) instead of dropped connections when BGP flaps.

This is all hearsay on my part, so cc-ing them both here.
Arzhel and I discussed this idea, too.  It's trading one kind of redundancy (of endpoint failure) for another (not losing connection state).  I'm not sure it'd be a win..
I'd like to respond to our open case with Amazon before it gets closed again.

Arzhel or Dustin: what hard data can I put in the Amazon ticket? The most recent substantive reply from AWS is this:

"Hi,

Thank you for contacting AWS Premium Support! 

I'm really sorry you are having connectivity issues due to the BGP session flapping. 

I just verified on our side and I can confirm there were no scheduled maintenance or no other issues reported on our side for yesterday 2013/11/19. 

Could you confirm if you made any changes at all to your configuration for vpn-0525f61b (assuming this is the VPN connection you are referring) in region US West (Oregon)? Perhaps it could be an issue local to your ISP or provider? Do you have some debug logs from your side that you are able to share with us to help us cross-check the information at time of the event? Any additional information and log output related to the IPsec tunnels would really help in this case. 

I look forward to your answer and don't hesitate to contact us if you have additional questions or concerns! 

Best regards, 

Valentin R. 
Amazon Web Services"
(Assignee)

Comment 27

5 years ago
Created attachment 8338801 [details]
Amazon-ipsec.log

We have 2 tunnels that go to usw2 (remote endpoints 205.251.233.121 and 205.251.233.122), both have the same endpoint on our side (63.245.214.82) and the same path to Amazon.

When the tunnels flap (go down/up within less than 2min) it's usually one or the other but not often both at the same time (for example vpn1 will flap 15 times then stay stable and a few hours later vpn2 will do the same). A traceroute/mtr from our side to Amazon when a tunnel is down or flapping doesn't show any issues on the path.
The resources on the router are fine and we have other VPNs to other Amazon instances on the same router which are perfectly stable.

Also, even when the tunnels are UP, pinging the other side of the tunnel is fine (for example 169.254.249.25) but any other machine within AWS seems to suffer from packet loss very often.

I'm attaching the most verbose (and recent) IKE logs that the router can output.
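In case it helps when going through the logs, counting flaps per tunnel with the definition above (down then up again within less than 2 minutes) could be sketched like this in Python. The event tuple format is an assumption made for the example; it is not the format of the attached router log.

```python
def count_flaps(events, max_gap=120):
    """Count down->up cycles per tunnel that complete in under max_gap seconds.

    events: iterable of (unix_seconds, tunnel_name, state) sorted by time,
    with state either "down" or "up" (hypothetical, pre-parsed input).
    """
    down_since = {}
    flaps = {}
    for ts, tunnel, state in events:
        if state == "down":
            # Remember only the first "down" of an outage.
            down_since.setdefault(tunnel, ts)
        elif state == "up":
            started = down_since.pop(tunnel, None)
            if started is not None and ts - started < max_gap:
                flaps[tunnel] = flaps.get(tunnel, 0) + 1
    return flaps
```

An outage that lasts longer than max_gap is deliberately not counted as a flap, so the result separates short bounces from real downtime.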
AWS Support responds:

Hello

You indicated that you performed a MTR trace to test the connection from your side to your VGW. Could you please run this test again and provide me with the output. Please ensure you perform the test using the public destination IP of your VGW (205.251.233.121 or 122). Once the test is running, please press the 'o' key and enter LDRS NBAW VG JMXI in order to adjust the MTR fields. This is only available on a Linux host. 

Could you please verify the following:
- The MTU size being used on your tunnel interface. The MTU that needs to be applied is 1396. It is possible for a BGP neighbor to flap due to this.
- Your BGP timers that are being used. Your timeout will need to be 10 and the hold timers will be 30 (on Cisco this is defined by "neighbor <neighbor> timers 10 30 30").
- Ensure you have BGP soft-reconfiguration inbound applied
- The number of errors or resets on the interface (this could indicate a layer 1 issue)
- If you have defined a static route to your BGP peer or if you are using a default route. This could cause recursive routing issues. Please see here for more information: http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a00800945fe.shtml
- Please may you run the BGP troubleshooting commands listed here for your respective device: http://docs.aws.amazon.com/AmazonVPC/latest/NetworkAdminGuide/Cisco_Troubleshooting.html

Once you have performed the above, please may you run a debug on BGP so we may get more information on why the BGP neighbor state changed.

Please may you provide us with the above information so we may further investigate why BGP peering is failing. I look forward to your response.

Best regards,

Darren R | Cape Town Support Center
Amazon Web Services

Comment 29

5 years ago
What do we want to do with this bug?

Should we open another bug to shift to upload to ftp/S3 and download artifacts from ftp/S3 depending on the host's location?
I don't think we're done with that project but I believe it would be the long term solution.
Most of the points in comment 28 are for Arzhel/netops, so let's sort that out first?
For your long-range mirror-everything-in-S3 plans, the https://tbpl.mozilla.org/php/getParsedLog.php?id=31346862&tree=Holly style failure to install blobuploader from http://pypi.pvt.build.mozilla.org is a significant percentage of these failures.
(Assignee)

Comment 32

5 years ago
Very thorough email from Amazon (hopefully this could be indexed and help other people):

----- Original Message -----
> Arzhel,
> 
> The "No SA for incoming SPI: [n]" is normal to see if the tunnel is down.
> This means you received traffic from your VPC, but since the tunnel is down
> there's no SA (Security Association).
> 
> Regarding this message: "%-RT_IPSEC_REPLAY: Replay packet detected on IPSec
> tunnel on reth0.1030 with tunnel ID 0x20003! From 205.251.233.121 to
> 63.245.214.82/52, ESP, SPI 0xa779d17, SEQ 0xae7ff5." - this is what I
> suspected may be happening. I believe what's going on here is that your
> Juniper is receiving packets that are out of order (specifically, the SPI,
> or Security Payload Index) and so the Juniper thinks this is a replay
> attack, and it drops the packets, killing the tunnel. Juniper specifically
> identifies this in the following KBs:
> 
> [SRX to ScreenOS VPN] VPN Packets from ISG to SRX is Dropped Due to Out of
> Sequence Packets with Replay Protection Enabled - KB 26671 -
> https://kb.juniper.net/InfoCenter/index?page=content&id=KB26671&actp=LIST_RECENT&smlogin=true
> 
> and
> 
> [ScreenOS] Why disable VPN engines, and how? - KB 22051 -
> http://kb.juniper.net/InfoCenter/index?subtoken=0D0C2B5ABADB72091311EA462E5E051D&page=content&id=KB22051&enablesaml=yes
> 
> You need a Juniper login to view these. Just in case you don't have one, I've
> saved each as PDFs and attached them. The issue applies to ScreenOS, ISG,
> NS5000, and SRX. In a nutshell, there are multiple VPN engines on the
> Juniper that process IPsec traffic. When the device is under load (and this
> would explain the sporadic nature of your outages), different engines can
> receive packets that were meant to be handled by specific engines. The
> Juniper device perceives these as being out of order, and considers it an
> attack, thusly generating the "RT_IPSEC_REPLAY: Replay packet detected...!"
> message you've seen, and accordingly drops the traffic. Juniper has a couple
> methods to resolve this, which relate to disabling VPN engines. Please
> review the two attached PDFs - they're pretty short - and let me know how
> you want to proceed. If what Juniper suggests are not options, then your
> next best bet would be to try connecting to an instance based VPN rather
> than our hardware based VPNs. There are several solutions for this - for
> example, Cisco has their CSR 1000v available as an AMI (free 60 day trial)
> that you can evaluate, and there are OpenSwan solutions as well. We can talk
> about those if you're interested, but given the error messages you've
> received, along with the behavior observed (and we have other customers who
> have run into the exact same problem), I'm pretty confident this is what's
> happening. If you like, we can also setup a call to go over all this. Please
> let me know what works.
> 
> To see the file named 'Juniper Networks - [SRX to ScreenOS VPN] VPN Packets
> from ISG to SRX is Dropped Due to Out of Sequence Packets with Replay
> Protection Enabled - Knowledge Base.pdf, Juniper Networks - [ScreenOS] Why
> disable VPN engines, and how_ - Knowledge Base.pdf' included with this
> correspondence, please use the case link given below the signature.
> 
> Best regards,
> 
> Leigh H.
> Amazon Web Services
-----
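As a generic illustration of why out-of-order delivery can trip replay protection, here is a minimal Python sketch of a sequence-number sliding window, which is how ESP anti-replay is usually described (RFC 4303). It is not Juniper's implementation; the class name, the window size and the set-based bookkeeping are assumptions made for the example.

```python
class ReplayWindow:
    """Toy anti-replay check keyed on the ESP sequence number."""

    def __init__(self, size=64):
        self.size = size      # how far behind the newest packet we still accept
        self.highest = 0      # highest sequence number accepted so far
        self.seen = set()     # accepted sequence numbers still inside the window

    def accept(self, seq):
        if seq > self.highest:
            # Newer than anything seen: slide the window forward.
            self.highest = seq
            self.seen.add(seq)
            self.seen = {s for s in self.seen if s > self.highest - self.size}
            return True
        if seq <= self.highest - self.size:
            return False      # too far behind the window: treated as a replay
        if seq in self.seen:
            return False      # exact duplicate: a replay
        self.seen.add(seq)    # late but inside the window: accepted
        return True
```

The point for this bug is the second branch: a legitimate packet that arrives too late, for instance because it was processed on a slower path, is indistinguishable from a replayed one and gets dropped.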

I disabled the replay protection for those tunnels and we're monitoring them. Everything should now stay stable.
ACK'ed from me yesterday to disable replay protection here.
We had a flap again today:

[10:22:58]	nagios-releng	[Dec/04 09:52:19] Wed 06:52:19 PST [4162] fw1.private.releng.scl3.mozilla.net:BGP usw2 vpn-0525f61b-1 is CRITICAL: SNMP CRITICAL - BGP sess vpn-0525f61b-1 (usw2/169.254.249.25) uptime *13* secs (http://m.allizom.org/BGP+usw2+vpn-0525f61b-1)
[10:22:58]	nagios-releng	[Dec/04 09:52:39] Wed 06:52:39 PST [4165] panda-0144.p1.releng.scl1.mozilla.com is DOWN :CRITICAL: in state failed_pxe_booting
[10:22:58]	nagios-releng	[Dec/04 09:57:19] Wed 06:57:19 PST [4166] fw1.private.releng.scl3.mozilla.net:BGP usw2 vpn-0525f61b-1 is WARNING: SNMP WARNING - BGP sess vpn-0525f61b-1 (usw2/169.254.249.25) uptime *313* secs (http://m.allizom.org/BGP+usw2+vpn-0525f61b-1)
[10:22:58]	nagios-releng	[Dec/04 10:02:19] Wed 07:02:19 PST [4169] fw1.private.releng.scl3.mozilla.net:BGP usw2 vpn-0525f61b-1 is OK: SNMP OK - BGP sess vpn-0525f61b-1 (usw2/169.254.249.25) uptime 613 secs (http://m.allizom.org/BGP+usw2+vpn-0525f61b-1)
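The alerts above go CRITICAL, WARNING, then OK as the BGP session uptime grows (13, 313, 613 secs). A check with that behaviour could be sketched as below in Python. The 300 and 600 second thresholds are guesses that merely fit these three samples; the real Nagios check's thresholds are not shown in this bug.

```python
def bgp_uptime_state(uptime_secs, critical_below=300, warning_below=600):
    """Map BGP session uptime to a Nagios-style state.

    A very recent session start means the session just flapped.
    Both thresholds are assumed values, not the production ones.
    """
    if uptime_secs < critical_below:
        return "CRITICAL"
    if uptime_secs < warning_below:
        return "WARNING"
    return "OK"
```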

Updated

5 years ago
See Also: → bug 946238

Comment 35

5 years ago
Can this be related? (Currently all trees are closed and a blocker bug has been filed.)

(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #0)
> We're seeing frequent disconnects in big spurts. All trees closed.
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=31446488&tree=Mozilla-B2g26-
> v1.2
> 
> exceptions.TypeError: '_sre.SRE_Pattern' object is not iterable

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=946238
No, that's an unrelated bustage.

Updated

5 years ago
Duplicate of this bug: 936248
Arzhel has been interacting with AWS support like a champ on this.
Assignee: nobody → arzhel
(Assignee)

Comment 39

5 years ago
Quick update, 

The fix solved the main issue (VPCs flapping like crazy as soon as it was a bit loaded) which put us in a good spot.
Fixing that revealed another small issue (VPCs flapping once a day or couple days), which has another error message, etc.. and we're still working on it with Amazon.
Thank you for all your work on this - much appreciated! :-)

Updated

4 years ago
Duplicate of this bug: 918677
(Assignee)

Comment 42

4 years ago
I believe we can close this; offloading the ipsec tunnels solved the remaining issues.
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago → 4 years ago
Resolution: --- → FIXED

Updated

a month ago
Product: Release Engineering → Infrastructure & Operations