Closed Bug 1036176 Opened 11 years ago Closed 11 years ago

Some spot instances in us-east-1 are failing to connect to hg.mozilla.org

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: x86
OS: Linux
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: nthomas, Unassigned)

Details

Attachments

(2 files)

eg bld-linux64-spot-116.build.releng.use1:
$ nc -vz hg.mozilla.org 443
nc: connect to hg.mozilla.org port 443 (tcp) failed: Connection timed out

vs bld-linux64-spot-114.build.releng.use1:
$ nc -vz hg.mozilla.org 443
Connection to hg.mozilla.org 443 port [tcp/https] succeeded!

Same AMI, availability zone, security group, and traceroute to hg.m.o, but different result.

Known busted, with public IP:
bld-linux64-spot-110  54.209.99.210
bld-linux64-spot-111  54.210.0.198
bld-linux64-spot-115  54.88.247.236
bld-linux64-spot-116  54.210.0.205

Known OK:
bld-linux64-spot-112  54.210.7.111
bld-linux64-spot-113  54.209.228.110
bld-linux64-spot-114  54.88.139.20
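A minimal sketch of automating that comparison, assuming the slaves are reachable over ssh under the names used above (the host list and domain suffix are illustrative; adjust to the full internal FQDNs if needed):

# check hg.mozilla.org:443 from each slave; -w 10 avoids waiting out the full TCP timeout
for slave in bld-linux64-spot-114.build.releng.use1 bld-linux64-spot-116.build.releng.use1; do
    echo "== $slave =="
    ssh "$slave" 'nc -vz -w 10 hg.mozilla.org 443' 2>&1
done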
Disabled the four busted in slavealloc, as well as bld-linux64-spot-118 (54.88.255.24). bld-linux64-spot-038 also had this issue but now OK.
bld-linux64-spot-110 re-enabled after it recovered. The aws_watch_pending log says a new spot request was opened for this at 16:06, so the instance was recreated then and subsequently did a green build.
FWIW, it got the exact same public IP (54.209.99.210). The busted slaves can connect to people.mozilla.org:80 and www.mozilla.org:80/443 just fine, but not hg.m.o:80. Very weird. Gonna terminate bld-linux64-spot-111.
Very odd. If it happens again, can you grab the subnet of the good/bad slaves too? Private IPs would be sufficient too.
Will do. I am fairly sure there were examples of working and not-working slaves in the same subnet (even taking into account /25's we use). All the busted spot instances have been destroyed by attrition, so re-enabled in slavealloc.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Well goodie, there are some test slaves with the same issue.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Some bld-linux64-spot are failing to connect to hg.mozilla.org → Some spot instances in us-east-1 are failing to connect to hg.mozilla.org
Slave name             Current IP       First failure (PDT)
tst-linux64-spot-950   10.134.157.122   2014-07-08 11:24:17
tst-linux64-spot-859   10.134.157.218   2014-07-08 11:25:11
tst-linux64-spot-1036  10.134.59.23     2014-07-08 18:10:32
tst-linux64-spot-805   already gone     2014-07-08 18:16:21
tst-linux64-spot-940   already gone     2014-07-08 18:35:32
bld-linux64-spot-1005  10.134.55.117    2014-07-08 18:55:38

All terminated. mtr and traceroute work fine; should try mtr --tcp.
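For reference, a sketch of the mtr --tcp run suggested above, from an affected slave (port 443 matches the failing nc checks; the flags are standard mtr options):

mtr --tcp --port 443 --report --report-cycles 10 hg.mozilla.org

Unlike the default ICMP mode, this probes with TCP SYNs to port 443, so a block that only affects TCP/443 should show up as loss at or after the hop where the filtering happens.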
Attached file Testing notes
tl;dr - Two tst-linux64-spot slaves on the same subnet etc; one can reach sites routed over the internet, the other can't. Different AMIs, but I've double-checked that this doesn't make a difference in another working/failing pair. VCS guys say Zeus isn't doing any IP blacklisting; asking netops, but the next stop is an AWS ticket.
bld-linux64-spot-119 was doing this again today, rail terminated it on RyanVM's request.
Opened case 222113071 with AWS:

i-f57518d9 is an example of an issue we've been seeing in us-east-1 since 2014-07-08. We route traffic to 63.245.215.25 via igw-09b7cc67, but the instance cannot open a connection on port 80 or 443. Traceroute works normally, and confirms the routing is over the internet rather than our VPN connection; same for mtr --tcp. We suspect some sort of block on the Amazon side, because other spot instances using the same AMI, subnet, availability zone, etc are working normally.

Instance ID(s): i-f57518d9
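A hedged way to double-check the routing claim from the AWS CLI (assumes credentials for this account are configured; <subnet-id> is a placeholder for the value returned by the first command):

# subnet of the affected instance
aws ec2 describe-instances --instance-ids i-f57518d9 \
    --query 'Reservations[].Instances[].SubnetId' --output text

# routes for that subnet's route table; the default route should point at igw-09b7cc67
aws ec2 describe-route-tables \
    --filters Name=association.subnet-id,Values=<subnet-id> \
    --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId]' --output table

If the subnet has no explicit route table association, check the VPC's main route table instead.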
If we have a lot of trouble with this we can try disabling us-east-1c, as they've all been in that availability zone in my spot checks.
Kim had issues with tst-emulator64-spot-086 (i-f16b04dd). Disabled in slavealloc and changed the moz-status tag to avoid automatic recycling. Commented in the AWS case.

For the record: the external IP is 54.210.0.89, security group is sg-f0f1239f, subnet subnet-b8643190, VPC vpc-b42100df, routing table rtb-d77190b2.

(In reply to Nick Thomas [:nthomas] from comment #12)
> If we have a lot of trouble with this we can try disabling us-east-1c, as
> they've all been in that availability zone in my spot checks.

Ouch, we already disabled us-east-1b because it lacks a lot of features (no c3, no SSD EBS).
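For anyone repeating the "keep it from being recycled" step, a sketch of the tag change via the AWS CLI; the tag value here is illustrative, not necessarily what the releng tooling (e.g. aws_watch_pending) actually expects:

# mark the instance so automation leaves it alone; "disabled" is an assumed value
aws ec2 create-tags --resources i-f16b04dd --tags Key=moz-status,Value=disabled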
Bug 1044429 looks like another case of this, starting at 2014-07-25 23:34:05 until it was terminated. use1 again. And bug 1044426, starting 2014-07-25 23:20:15, use1.
Hey Nick, has there been any follow-up on our AWS ticket, with netops, etc. about this?
Flags: needinfo?(nthomas)
Attached file AWS response
AWS responded to the ticket with some suggestions on how to test further (see attachment). We haven't tried that yet because of other interruptions, and because we need an instance that's exhibiting the problem. Bug 1039076 was caused by our IT blocking some IPs at the zlb level on hg.m.o, and I wondered if that was the cause of the problem here, but reading from comment #6 onwards it seems not. Would be good to verify that with tcpdump (i.e. a connection abruptly hung up by a block, vs no connection at all).
Flags: needinfo?(nthomas)
Last report was 2 weeks ago, and given c#16 I'm not seeing any reason to leave this open. Resolving WORKSFORME for now; we can reopen if we get new info.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
The two cases in comment #14 were only 5 days ago, but I take your point about wait-and-see. AWS has closed the ticket on their side.

A tcpdump would look something like:

# connect as root
apt-get install tcpdump
tcpdump -nn '(host hg.mozilla.org or ftp-ssl.mozilla.org) and port 443'

Add -X to dump the packets too.
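To separate "actively reset by a block" from "no response at all", a narrower filter that only shows SYN and RST packets can help (a sketch; the tcp[tcpflags] expression is standard pcap syntax):

tcpdump -nn 'host hg.mozilla.org and port 443 and (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)'

Repeated outbound SYNs with nothing coming back points at packets being dropped somewhere; an immediate RST from the far side points at an active block like the zlb one in bug 1039076.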
tst-linux64-spot-291 did this from 2014-07-31 12:06:54 until it went away after 2014-07-31 16:45:21.
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard