Closed
Bug 1036176
Opened 11 years ago
Closed 11 years ago
Some spot instances in us-east-1 are failing to connect to hg.mozilla.org
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: nthomas, Unassigned)
Details
Attachments
(2 files)
e.g. on bld-linux64-spot-116.build.releng.use1:
$ nc -vz hg.mozilla.org 443
nc: connect to hg.mozilla.org port 443 (tcp) failed: Connection timed out
vs bld-linux64-spot-114.build.releng.use1:
$ nc -vz hg.mozilla.org 443
Connection to hg.mozilla.org 443 port [tcp/https] succeeded!
Same AMI, availability zone, security group, and traceroute path to hg.m.o, but a different result.
Known busted with public IP:
bld-linux64-spot-110 54.209.99.210
bld-linux64-spot-111 54.210.0.198
bld-linux64-spot-115 54.88.247.236
bld-linux64-spot-116 54.210.0.205
Known OK:
bld-linux64-spot-112 54.210.7.111
bld-linux64-spot-113 54.209.228.110
bld-linux64-spot-114 54.88.139.20
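For the record, the check above was run on each slave in turn; something like this loop reproduces it in one go (a sketch, assuming ssh access to the slaves):
# run the nc reachability check on each spot instance
for n in 110 111 112 113 114 115 116; do
  h="bld-linux64-spot-$n.build.releng.use1"
  printf '%s: ' "$h"
  # -w 10 caps the connect timeout; nc reports on stderr, hence 2>&1
  ssh "$h" 'nc -vz -w 10 hg.mozilla.org 443' 2>&1 | tail -1
done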
Comment 1 • 11 years ago (Reporter)
Disabled the four busted slaves in slavealloc, as well as bld-linux64-spot-118 (54.88.255.24).
bld-linux64-spot-038 also had this issue but is now OK.
Comment 2 • 11 years ago (Reporter)
bld-linux64-spot-110 re-enabled after it recovered. The aws_watch_pending log says a new spot request was opened for this at 16:06, so the instance was recreated then and subsequently did a green build.
Comment 3 • 11 years ago (Reporter)
FWIW, it got the exact same public IP (54.209.99.210). The busted slaves can connect to people.mozilla.org:80 and www.mozilla.org:80/443 just fine, but not hg.m.o:80. Very weird.
Gonna terminate bld-linux64-spot-111.
Comment 4 • 11 years ago
Very odd.
If it happens again, can you grab the subnet of the good/bad slaves too? Private IPs would be sufficient.
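On the instance itself the metadata service has both; something like this should do (a sketch using the standard EC2 metadata endpoints, assuming a single network interface):
# private IP of this instance
curl -s http://169.254.169.254/latest/meta-data/local-ipv4; echo
# subnet id, looked up via the interface MAC (the listing includes a trailing slash)
MAC=$(curl -s http://169.254.169.254/latest/meta-data/network/interfaces/macs/ | head -1)
curl -s "http://169.254.169.254/latest/meta-data/network/interfaces/macs/${MAC}subnet-id"; echo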
Comment 5 • 11 years ago (Reporter)
Will do. I'm fairly sure there were examples of working and not-working slaves in the same subnet (even taking into account the /25s we use).
All the busted spot instances have since been destroyed by attrition, so they've been re-enabled in slavealloc.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 6 • 11 years ago (Reporter)
Well goodie, there are some test slaves with the same issue.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Some bld-linux64-spot are failing to connect to hg.mozilla.org → Some spot instances in us-east-1 are failing to connect to hg.mozilla.org
Comment 7 • 11 years ago (Reporter)
Slave name             Current IP       First failure (PDT)
tst-linux64-spot-950   10.134.157.122   2014-07-08 11:24:17
tst-linux64-spot-859   10.134.157.218   2014-07-08 11:25:11
tst-linux64-spot-1036  10.134.59.23     2014-07-08 18:10:32
tst-linux64-spot-805   already gone     2014-07-08 18:16:21
tst-linux64-spot-940   already gone     2014-07-08 18:35:32
bld-linux64-spot-1005  10.134.55.117    2014-07-08 18:55:38
All terminated.
mtr and traceroute work fine; should try mtr --tcp next time.
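Probably something along these lines (a sketch; mtr's TCP mode needs root):
# trace with TCP SYN probes to the port that actually fails
sudo mtr --tcp --port 443 --report --report-cycles 10 hg.mozilla.org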
Comment 8 • 11 years ago
Fresh occurrences:
bld-linux64-spot-1041
https://tbpl.mozilla.org/php/getParsedLog.php?id=43437769&tree=Mozilla-Beta
bld-linux64-spot-1045
https://tbpl.mozilla.org/php/getParsedLog.php?id=43437519&tree=Mozilla-Beta
Comment 9 • 11 years ago (Reporter)
tl;dr - Two tst-linux64-spot slaves on the same subnet etc: one can reach sites routed over the internet, the other can't. They run different AMIs, but I've double-checked with another working/failing pair that the AMI doesn't make a difference.
VCS guys say Zeus isn't doing any IP blacklisting; I'm asking netops, but the next stop is an AWS ticket.
Comment 10 • 11 years ago (Reporter)
bld-linux64-spot-119 was doing this again today; rail terminated it at RyanVM's request.
Comment 11 • 11 years ago (Reporter)
Opened case 222113071 with AWS:
i-f57518d9 is an example of an issue we've been seeing in us-east-1 since 2014-07-08. We route traffic to 63.245.215.25 via igw-09b7cc67, but the instance cannot open a connection on port 80 or 443. Traceroute works normally, and confirms the routing is over the internet rather than our VPN connection; same for mtr --tcp. We suspect some sort of block on the Amazon side, because other spot instances using the same AMI, subnet, availability zone, etc are working normally. Instance ID(s): i-f57518d9
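For reference, "Traceroute works normally" above means checks along these lines (a sketch, not the exact invocations):
# ordinary traceroute completes and shows a public-internet path via the igw, not the VPN
traceroute -n 63.245.215.25
# probe the failing port itself with TCP SYN probes (needs root)
sudo traceroute -T -p 443 -n 63.245.215.25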
Comment 12 • 11 years ago (Reporter)
If we have a lot of trouble with this we can try disabling us-east-1c, as they've all been in that availability zone in my spot checks.
Comment 13 • 11 years ago
Kim had issues with tst-emulator64-spot-086 (i-f16b04dd). Disabled in slavealloc and changed the moz-status tag to avoid automatic recycling.
Commented in the AWS case.
For the record, the external IP is 54.210.0.89, security group is sg-f0f1239f, subnet subnet-b8643190, VPC vpc-b42100df, routing table rtb-d77190b2.
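(All gathered with something like the following; a sketch, assuming the AWS CLI is configured:)
# public IP, subnet, VPC, and security groups for the instance
aws ec2 describe-instances --instance-ids i-f16b04dd \
  --query 'Reservations[].Instances[].[PublicIpAddress,SubnetId,VpcId,SecurityGroups[].GroupId]' \
  --output text
# route table associated with that subnet
aws ec2 describe-route-tables \
  --filters Name=association.subnet-id,Values=subnet-b8643190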
(In reply to Nick Thomas [:nthomas] from comment #12)
> If we have a lot of trouble with this we can try disabling us-east-1c, as
> they've all been in that availability zone in my spot checks.
Ouch, we already disabled us-east-1b because it lacks a lot of features (no c3, no SSD EBS).
Comment 14 • 11 years ago (Reporter)
Bug 1044429 looks like another case of this, starting at 2014-07-25 23:34:05 and lasting until the instance was terminated; use1 again.
And bug 1044426, starting 2014-07-25 23:20:15, also use1.
Comment 15 • 11 years ago
Hey Nick,
Has there been any follow-up on our AWS ticket, or with netops, etc. about this?
Flags: needinfo?(nthomas)
Comment 16 • 11 years ago (Reporter)
AWS responded to the ticket with some suggestions on how to test further (see attachment). We haven't tried them yet because of other interruptions, and because we need an instance that is exhibiting the problem.
Bug 1039076 was caused by our IT blocking some IPs at the ZLB level on hg.m.o, and I wondered if that was the cause of the problem here. But if you read from comment #6 onwards, it seems not. It would be good to verify that with tcpdump (i.e. a connection abruptly reset by a block, vs no connection at all).
Flags: needinfo?(nthomas)
Comment 17 • 11 years ago
The last report was 2 weeks ago, and given comment #16 I'm not seeing any reason to leave this open. Resolving WORKSFORME for now; we can reopen if we get new info.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → WORKSFORME
Comment 18 • 11 years ago (Reporter)
The two cases in comment #14 were only 5 days ago, but I take your point about wait-and-see. AWS has closed the ticket on their side.
A tcpdump would look something like (connect as root):
apt-get install tcpdump
tcpdump -nn '(host hg.mozilla.org or ftp-ssl.mozilla.org) and port 443'
Add -X to dump the packet contents too.
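What to look for in the capture: if the SYNs just get retransmitted with no reply, something is silently dropping the traffic; if a RST comes straight back, something is actively refusing us. A narrower filter for just those packets (a sketch, standard pcap syntax):
# show only SYN and RST packets to/from hg.m.o on 443
tcpdump -nn 'host hg.mozilla.org and port 443 and tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'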
Comment 19 • 11 years ago (Reporter)
tst-linux64-spot-291 did this 2014-07-31 12:06:54 until it went away after 2014-07-31 16:45:21.
Updated • 7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated • 6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard