Closed Bug 1491948 Opened 6 years ago Closed 6 years ago

connections between some RelEng machines in AWS and some in mdc1/mdc2 are much slower since Friday

Categories: Infrastructure & Operations :: SRE
Type: task
Priority: Not set
Severity: critical
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: bhearsum; Assignee: dividehex
Attachments: 1 file

This appears to have begun between 9:30am and 11:30am EDT on Friday, September 14th. The most obvious symptom is connections between depsigning workers (RelEng use1/usw2) and signing servers (RelEng mdc1/mdc2) taking a lot longer than before. 

Requests used to take less than a second, e.g.:
2018-09-14 14:44:13,606 - signingscript.utils - INFO - 2018-09-14 14:44:13,606 - Starting new HTTPS connection (1): signing7.srv.releng.mdc1.mozilla.com:9110
2018-09-14 14:44:13,995 - signingscript.utils - INFO - 2018-09-14 14:44:13,995 - https://signing7.srv.releng.mdc1.mozilla.com:9110 "GET /sign/sha2signcode/bc0f99f65d37709f2b4475af58c63aa0b9474260 HTTP/1.1" 404 None

And now take multiple seconds:
2018-09-14 21:04:18,862 - signingscript.utils - INFO - 2018-09-14 21:04:18,861 - Starting new HTTPS connection (1): signing11.srv.releng.mdc2.mozilla.com:9110
2018-09-14 21:04:23,929 - signingscript.utils - INFO - 2018-09-14 21:04:23,928 - https://signing11.srv.releng.mdc2.mozilla.com:9110 "GET /sign/sha2signcode/5127b34c5cea78b2d28d0f210730cfc6c9a5789a HTTP/1.1" 404 None
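
For reference, a quick way to see where the extra time goes is to split out the name lookup from the rest of the request. This is only a sketch (not what was actually run); the URL is borrowed from the log lines above, and -k is there just because the bare path will fail TLS checks / return 404 anyway:

curl -sk -o /dev/null \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s\n' \
  https://signing7.srv.releng.mdc1.mozilla.com:9110/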

This is mtr output from depsigning-worker14.srv.releng.usw2.mozilla.com -> signing11.srv.releng.mdc2.mozilla.com

 Host                                                                                              Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 169.254.249.69                                                                                  0.0%   807    1.2   2.8   0.3  41.6   6.5
 2. ???
 3. 54.239.51.210                                                                                   0.0%   806    0.8   1.5   0.6 129.2   5.7
 4. 52.93.15.222                                                                                    0.0%   806    1.5   1.9   1.3  76.0   3.4
    52.93.13.74
 5. 52.93.15.219                                                                                    0.0%   806    1.6   2.1   1.3  84.6   4.7
    54.239.48.179
 6. 169.254.249.13                                                                                  0.0%   806    1.1   1.9   1.1  74.7   6.0
 7. 169.254.249.14                                                                                  0.0%   806   84.7  84.7  83.3  86.4   0.2
 8. signing11.srv.releng.mdc2.mozilla.com                                                           0.0%   806   87.6  86.2  83.5  88.7   1.4
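
A report like the one above can be generated with something along these lines (the exact flags and cycle count weren't recorded in this bug, so treat this as an approximation):

mtr -rwb -c 800 signing11.srv.releng.mdc2.mozilla.com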



This has resulted in Taskcluster jobs taking ~10min instead of ~3min, which is causing backlogs. It is not closing the trees, so I'm filing this as critical instead of blocker.
Once we shut off the scl3 vpns, it looks like we've lost all internet connectivity in the releng use1/usw2 regions... essentially, `mtr github.com` hangs.
Trees appear to be closed for this now?
Yes, trees closed at 2018-09-17T20:29+00:00.
Severity: critical → blocker
Just FYI, the timing on when this started coincides with the area surrounding MDC2 and AWS us-east getting hit by a hurricane... may or may not be related, but just putting that out there.
I realize this is still important, but I want to note that jobs are still passing - we're just backlogged because things are slow. We had a brief period where jobs were failing (when we tried a fix that didn't work), but that is long over.

It's not my decision whether or not to close the trees, but I'll note that we've been in this state since Friday with trees open.
For timing/correlation/history: 2018-09-13 was the day we powered off the majority of hosts in scl3. We left up most infra items for IT and releng.

After caucusing at a standup on 2018-09-14, we agreed to power down the remainder of the releng.scl3 infra.  It was at that point that admin1[ab].private.releng.scl3 and ns[12].private.releng.scl3 went down, ~1612 UTC, which would line up with the impact times.
Trees re-opened at 2018-09-17T23:09:05+00:00. We have a workaround in place to improve signing speed: gcox turned ns[12].private.releng.scl3 back on, and releng reverted the DHCP options change reported in bug 1491497.

With SCL3 networking not lasting much longer, the search for a proper fix is on. dividehex and gcox have been using tcpdump to check network traffic.
Severity: blocker → critical
We can potentially solve this by:

- pointing at scl3 dns servers, without any of the below. This is not a long-term option
- turning off ipv6 on the scriptworker hosts, even if we're pointing at mdc dns

/etc/sysctl.conf :
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

sysctl -p

we'd need to add this to puppet if we want it to persist. This is not ideal; ipv6 is the future.

- enabling single-request in /etc/resolv.conf, even if we're pointing at mdc dns

options single-request

(we'd need to add this via the aws dhcp config somehow? a hand-applied sketch follows after this list)

- It's possible we could see an improvement by going to CentOS > 6.5; infra is using 7 and isn't seeing this issue. Untested.
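
For anyone who wants to try these by hand on a single scriptworker before anything is puppetized, here's a rough sketch (run as root; the hostname and nameserver IP are just examples from this bug, and none of this persists across reboots or dhclient runs):

# disable ipv6 at runtime (same settings as the sysctl.conf lines above)
sysctl -w net.ipv6.conf.all.disable_ipv6=1
sysctl -w net.ipv6.conf.default.disable_ipv6=1
sysctl -w net.ipv6.conf.lo.disable_ipv6=1

# or: serialize the A/AAAA lookups instead
echo 'options single-request' >> /etc/resolv.conf

# verify: lookups should come back well under a second
time getent hosts signing11.srv.releng.mdc2.mozilla.com
time dig +short signing11.srv.releng.mdc2.mozilla.com @10.48.75.120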
tl;dr
Adding options single-request-reopen to resolv.conf seems to fix the issue.
Doing a dig @10.48.75.120 and @10.50.75.120 returns responses immediately.

Longer explanation
Best guess is that the difference in firewalls between SCL3 and MDC1/MDC2 is the culprit. By default the resolver fires off two lookups, one for the A record and one for the AAAA. The A response comes back immediately, the AAAA times out, hence the 5-second slowdown.

The firewalls in MDC1/MDC2, being Panorama, are much more application-specific.


from man resolv.conf:

single-request-reopen (since glibc 2.9)
       The resolver uses the same socket for the A and AAAA requests. Some hardware mistakenly sends back only one reply. When that happens the client system will sit and wait for the second reply. Turning this option on changes this behavior so that if two requests from the same port are not handled correctly it will close the socket and open a new one before sending the second request.
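
To watch this on the wire from a scriptworker, something like the following should do (sketch only; the interface name is a guess):

# dump DNS traffic while forcing a dual A/AAAA lookup
tcpdump -ni eth0 'udp port 53' &
getent hosts signing11.srv.releng.mdc2.mozilla.com
# without the option: both queries leave from the same source port, only the A
# answer comes back, and the resolver blocks ~5s waiting for the AAAA.
# with `options single-request-reopen`: the retry goes out on a fresh socket.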
I remember having to do that options single-request thing on the office admin hosts when we first got IPv6 in the offices.
Digging in further, it looks like this could be a network-layer issue inside the VPC.

We've engaged jbircher for additional troubleshooting.
Attached file GitHub Pull Request
Assignee: network-operations → jwatkins
Worked further with :johnb; there were multiple paths in/out of the VPC.

He downed the 2nd tunnel from MDC2 to the VPC and things are working now.

:bhearsum,
Can you please test further?
Flags: needinfo?(bhearsum)
on signing-linux-1.srv.releng.use1.mozilla.com, I:

- kept ipv6 enabled
- pointed resolv.conf at mdc*
- removed the `options` line from the resolv.conf

This resulted in slow wget. Adding the `options` line back in sped up wget.
That tells me that downing the 2nd tunnel didn't resolve the issue. It may be part of the solution, but it's not the whole thing.
Flags: needinfo?(bhearsum)
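
For reference, the A/B test above boils down to roughly this (sketch only; the URL is illustrative, and the sed assumes the `options` line is the only options entry in resolv.conf):

# remove the options line -> lookups stall ~5s, wget is slow
sed -i '/^options /d' /etc/resolv.conf
time wget -q -O /dev/null --no-check-certificate https://signing11.srv.releng.mdc2.mozilla.com:9110/

# put it back -> fast again
echo 'options single-request-reopen' >> /etc/resolv.conf
time wget -q -O /dev/null --no-check-certificate https://signing11.srv.releng.mdc2.mozilla.com:9110/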
(In reply to Aki Sasaki [:aki] from comment #14)
> on signing-linux-1.srv.releng.use1.mozilla.com, I:
> 
> - kept ipv6 enabled
> - pointed resolv.conf at mdc*
> - removed the `options` line from the resolv.conf
> 
> This resulted in slow wget. Adding the `options` line back in sped up wget.
> That tells me that downing the 2nd tunnel didn't resolve the issue. It may
> be part of the solution, but it's not the whole thing.

Per :dividehex, we need the 2nd tunnel to avoid SPOF.
We've rolled out the `single-request-reopen` fix, pointed DNS back at mdc*, and rebooted all the scriptworkers.
This appears to have resolved our network slowdowns.

If we find a network-level fix, we can test by removing the single-request-reopen line from /etc/resolv.conf.
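
A quick sanity check after the rollout (and the thing to revert if a network-level fix lands) looks like this; the hostname is just an example:

grep '^options' /etc/resolv.conf   # expect: options single-request-reopen
time getent hosts signing7.srv.releng.mdc1.mozilla.com   # should be sub-second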
(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #7)
> Trees re-opened at 2018-09-17T23:09:05+00:00. We have a workaround in place
> to improve signing speed: gcox turned ns[12].private.releng.scl3 back on,
> and releng reverted the DHCP options change reported in bug 1491497.

nameservers for scl3 powered back down as of 2018-09-18T21:27:00+00:00, r=aki
Releng's EC2 routing tables have all been flipped, modulo the unused subnets with names like "relops-vpc" or "upload-nat".
We should be able to turn off the SCL3 VPNs.
Depends on: 1492312, 1492318
I think we can resolve this bug. (Alternately we can wait til we ship releases tomorrow with the current network settings, but I'm not too worried.)
Component: NetOps → Infrastructure: AWS
QA Contact: jbircher → cshields
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED