Closed Bug 1036176 Opened 11 years ago Closed 11 years ago

Some spot instances in us-east-1 are failing to connect to hg.mozilla.org

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: x86
OS: Linux
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: nthomas, Unassigned)

Details

Attachments

(2 files)

eg bld-linux64-spot-116.build.releng.use1:
$ nc -vz hg.mozilla.org 443
nc: connect to hg.mozilla.org port 443 (tcp) failed: Connection timed out

vs bld-linux64-spot-114.build.releng.use1:
$ nc -vz hg.mozilla.org 443
Connection to hg.mozilla.org 443 port [tcp/https] succeeded!

Same AMI, availability zone, security group, and traceroute to hg.m.o, but different result.

Known busted, with public IP:
bld-linux64-spot-110  54.209.99.210
bld-linux64-spot-111  54.210.0.198
bld-linux64-spot-115  54.88.247.236
bld-linux64-spot-116  54.210.0.205

Known OK:
bld-linux64-spot-112  54.210.7.111
bld-linux64-spot-113  54.209.228.110
bld-linux64-spot-114  54.88.139.20
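A minimal sketch of automating that comparison, assuming the slaves are reachable over ssh under the names used above (the host list and domain suffix are illustrative; adjust to the full internal FQDNs if needed):

# check hg.mozilla.org:443 from each slave; -w 10 avoids waiting out the full TCP timeout
for slave in bld-linux64-spot-114.build.releng.use1 bld-linux64-spot-116.build.releng.use1; do
    echo "== $slave =="
    ssh "$slave" 'nc -vz -w 10 hg.mozilla.org 443' 2>&1
done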
Disabled the four busted in slavealloc, as well as bld-linux64-spot-118 (54.88.255.24). bld-linux64-spot-038 also had this issue but now OK.
bld-linux64-spot-110 re-enabled after it recovered. The aws_watch_pending log says a new spot request was opened for this at 16:06, so the instance was recreated then and subsequently did a green build.
FWIW, it got the exact same public IP (54.209.99.210). The busted slaves can connect to people.mozilla.org:80 and www.mozilla.org:80/443 just fine, but not hg.m.o:80. Very weird. Gonna terminate bld-linux64-spot-111.
Very odd. If it happens again, can you grab the subnet of the good/bad slaves too? Private IPs would be sufficient too.
Will do. I am fairly sure there were examples of working and not-working slaves in the same subnet (even taking into account /25's we use). All the busted spot instances have been destroyed by attrition, so re-enabled in slavealloc.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Well goodie, there are some test slaves with the same issue.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Some bld-linux64-spot are failing to connect to hg.mozilla.org → Some spot instances in us-east-1 are failing to connect to hg.mozilla.org
Slave name             Current IP       First failure (PDT)
tst-linux64-spot-950   10.134.157.122   2014-07-08 11:24:17
tst-linux64-spot-859   10.134.157.218   2014-07-08 11:25:11
tst-linux64-spot-1036  10.134.59.23     2014-07-08 18:10:32
tst-linux64-spot-805   already gone     2014-07-08 18:16:21
tst-linux64-spot-940   already gone     2014-07-08 18:35:32
bld-linux64-spot-1005  10.134.55.117    2014-07-08 18:55:38

All terminated. mtr and traceroute work fine; should try mtr --tcp.
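For reference, a sketch of the mtr --tcp run suggested above, from an affected slave (port 443 matches the failing nc checks; the flags are standard mtr options):

mtr --tcp --port 443 --report --report-cycles 10 hg.mozilla.org

Unlike the default ICMP mode, this probes with TCP SYNs to port 443, so a block that only affects TCP/443 should show up as loss at or after the hop where the filtering happens.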
Attached file Testing notes
tl;dr - Two tst-linux64-spot slaves on the same subnet etc; one can reach sites routed over the internet, the other can't. Different AMIs, but I've double-checked that this doesn't make a difference in another working/failing pair. VCS guys say Zeus isn't doing any IP blacklisting; asking netops, but the next stop is an AWS ticket.
bld-linux64-spot-119 was doing this again today, rail terminated it on RyanVM's request.
Opened case 222113071 with AWS:

i-f57518d9 is an example of an issue we've been seeing in us-east-1 since 2014-07-08. We route traffic to 63.245.215.25 via igw-09b7cc67, but the instance cannot open a connection on port 80 or 443. Traceroute works normally, and confirms the routing is over the internet rather than our VPN connection; same for mtr --tcp. We suspect some sort of block on the Amazon side, because other spot instances using the same AMI, subnet, availability zone, etc are working normally.

Instance ID(s): i-f57518d9
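A hedged way to double-check the routing claim from the AWS CLI (assumes credentials for this account are configured; <subnet-id> is a placeholder for the value returned by the first command):

# subnet of the affected instance
aws ec2 describe-instances --instance-ids i-f57518d9 \
    --query 'Reservations[].Instances[].SubnetId' --output text

# routes for that subnet's route table; the default route should point at igw-09b7cc67
aws ec2 describe-route-tables \
    --filters Name=association.subnet-id,Values=<subnet-id> \
    --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId]' --output table

If the subnet has no explicit route table association, check the VPC's main route table instead.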
If we have a lot of trouble with this we can try disabling us-east-1c, as they've all been in that availability zone in my spot checks.
Kim had issues with tst-emulator64-spot-086 (i-f16b04dd). Disabled in slavealloc and changed the moz-status tag to avoid automatic recycling. Commented in the AWS case.

For the record: the external IP is 54.210.0.89, security group is sg-f0f1239f, subnet subnet-b8643190, VPC vpc-b42100df, routing table rtb-d77190b2.

(In reply to Nick Thomas [:nthomas] from comment #12)
> If we have a lot of trouble with this we can try disabling us-east-1c, as
> they've all been in that availability zone in my spot checks.

Ouch, we already disabled us-east-1b because it lacks a lot of features (no c3, no SSD EBS).
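For anyone repeating the "keep it from being recycled" step, a sketch of the tag change via the AWS CLI; the tag value here is illustrative, not necessarily what the releng tooling (e.g. aws_watch_pending) actually expects:

# mark the instance so automation leaves it alone; "disabled" is an assumed value
aws ec2 create-tags --resources i-f16b04dd --tags Key=moz-status,Value=disabled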
Bug 1044429 looks like another case of this, starting at 2014-07-25 23:34:05 until it was terminated. use1 again. And bug 1044426, starting 2014-07-25 23:20:15, use1.
Hey Nick, has there been any follow-up on our AWS ticket, with netops, etc. about this?
Flags: needinfo?(nthomas)
Attached file AWS response
AWS responded to the ticket with some suggestions on how to test further (see attachment). We haven't tried that yet because of other interruptions, and because we need an instance that's exhibiting the problem. Bug 1039076 was caused by our IT blocking some IPs at the zlb level on hg.m.o, and I wondered if that was the cause of the problem here, but reading from comment #6 onwards it seems not. Would be good to verify that with tcpdump (i.e. a connection abruptly hung up by a block, vs no connection at all).
Flags: needinfo?(nthomas)
Last report was 2 weeks ago, and given c#16 I'm not seeing any reason to leave this open. Resolving WORKSFORME for now; we can reopen if we get new info.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
The two cases in comment #14 were only 5 days ago, but I take your point about wait-and-see. AWS has closed the ticket on their side.

A tcpdump would look something like:

# connect as root
apt-get install tcpdump
tcpdump -nn '(host hg.mozilla.org or ftp-ssl.mozilla.org) and port 443'

Add -X to dump the packets too.
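To separate "actively reset by a block" from "no response at all", a narrower filter that only shows SYN and RST packets can help (a sketch; the tcp[tcpflags] expression is standard pcap syntax):

tcpdump -nn 'host hg.mozilla.org and port 443 and (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)'

Repeated outbound SYNs with nothing coming back points at packets being dropped somewhere; an immediate RST from the far side points at an active block like the zlb one in bug 1039076.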
tst-linux64-spot-291 did this from 2014-07-31 12:06:54 until it went away after 2014-07-31 16:45:21.
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard