Closed Bug 1125688 Opened 9 years ago Closed 9 years ago

Constant stream of timeouts as AWS slaves download from tooltool, pvtbuilds and ftp-ssl and upload to signing servers

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

Details

Attachments

(2 files)

We've been seeing packet loss since ~10am Pacific. Both tunnels are currently routing via above.net and Zayo to Amazon, and then we see loss within AWS:

                                                           Packets               Pings
 Host                                              Loss%   Snt Drop   Avg  Best  Wrst StDev
 1. 10.26.48.1                                     0.0%    60    0    0.7   0.5   3.6   0.4
 2. v-1030.core1.releng.scl3.mozilla.net           0.0%    60    0    2.0   1.5   4.6   0.4
 3. v-1032.border2.scl3.mozilla.net                0.0%    60    0    2.7   0.8  42.2   6.6
 4. xe-0-0-3.border1.scl3.mozilla.net              0.0%    60    0    5.8   1.1  61.8  12.8
 5. xe-1-2-0.border1.pao1.mozilla.net              0.0%    60    0    2.9   1.3  28.6   4.3
 6. xe-3-1-0.mpr2.pao1.us.above.net                0.0%    60    0    1.9   1.3  17.3   2.2
 7. ae7.cr2.sjc2.us.zip.zayo.com                   0.0%    60    0    4.6   1.9  32.3   6.7
 8. ae8.cr1.sjc2.us.zip.zayo.com                   0.0%    60    0    5.8   1.9  32.3   8.1
 9. ae9.mpr3.sjc7.us.zip.zayo.com                  0.0%    60    0    2.5   2.0  14.8   1.7
10. equinix02-sfo5.amazon.com                      0.0%    60    0    2.3   2.1   6.5   0.6
11. 54.240.242.48                                  0.0%    60    0  204.9 194.8 242.0   7.2
12. 54.240.242.57                                  0.0%    60    0  157.7  54.3 220.8  52.8
13. 54.239.41.98                                   6.7%    60    4  220.9 220.2 228.7   1.2
14. 178.236.3.114                                  6.7%    60    4  220.4 220.0 221.6   0.3
15. 54.239.41.157                                  5.1%    60    3  220.9 219.3 231.5   2.0
16. 54.239.52.134                                  8.5%    60    5  220.7 218.3 238.2   2.8
17. 54.239.41.161                                  5.1%    60    3  220.7 218.5 225.2   1.0
18. 205.251.225.214                                0.0%    60    0  220.2 219.4 224.7   0.9
19. 205.251.232.74                                10.2%    59    6  220.7 220.0 224.8   0.8
20. 205.251.232.205                                5.1%    59    3  220.9 219.7 234.8   2.0
21. 54.239.48.187                                  5.1%    59    3  220.8 220.0 224.7   0.8
22. 205.251.233.122                                3.4%    59    2  221.6 220.0 259.3   5.2
Similar story in us-east-1, big jump in latency with packet loss.
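(For reference, the trace above is a standard mtr report; a minimal sketch like the one below reproduces it from a releng host. The target hostname here is hypothetical, and mtr needs to be installed and usually needs root for ICMP. This is just to show how the numbers were gathered, not part of the fix.)

import subprocess

# Hypothetical target on the far side of the usw2 tunnel; substitute
# whichever host you actually want to probe.
TARGET = "nagios1.private.releng.usw2.mozilla.com"

# Report mode with 60 probe cycles, matching the 60-packet trace above.
result = subprocess.run(
    ["mtr", "--report", "--report-cycles", "60", TARGET],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)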
Severity: normal → blocker
From http://status.aws.amazon.com/:

We are investigating an issue with an external provider, which may be impacting Internet connectivity between some customer networks and the US-WEST-2 Region. Connectivity to instances and services within the Region is not impacted by the event.

And dcurado from netops confirms it's an issue between the last hops in that mtr (Equinix and Amazon). Interestingly, our us-east-1 tunnels also go via equinix02-sfo5.amazon.com, but the next hop differs.
use1 looks to be in better shape (currently no packet loss, normal latency), so we're disabling new spot instances in usw2 with
 https://github.com/mozilla/build-cloud-tools/commit/c71599085deda217c89338de2178fb994b3a7584
(warning: I got the region wrong in the commit message).
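(Purely as an illustration of what that commit amounts to: zero out the usw2 limits in the cloud-tools config so no new spot instances get requested there. The config path and schema in this sketch are hypothetical; the real change is the commit linked above.)

import json

# Hypothetical path and schema; the real edit lives in
# mozilla/build-cloud-tools (see the commit linked above).
CONFIG = "configs/watch_pending.cfg"

with open(CONFIG) as f:
    cfg = json.load(f)

# Stop requesting new spot instances in us-west-2 by zeroing its limits;
# us-east-1 is left untouched.
for limits in cfg.get("limits", {}).values():
    if "us-west-2" in limits:
        limits["us-west-2"] = 0

with open(CONFIG, "w") as f:
    json.dump(cfg, f, indent=2)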
Terminated about 351 build and test spot instances in usw2. But we're hitting capacity on test spot instances in use1.
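(For anyone curious what that mass termination looks like in code, here is a hedged boto3 sketch: enumerate running spot instances in us-west-2 and terminate them. The moz-type tag values are assumptions, not the exact filters used for those 351 instances.)

import boto3

# Sketch only: find running spot instances in us-west-2 and terminate them.
# The tag filter below is an assumption about how build/test slaves are
# tagged, not the exact query used above.
ec2 = boto3.client("ec2", region_name="us-west-2")

instance_ids = []
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[
        {"Name": "instance-lifecycle", "Values": ["spot"]},
        {"Name": "instance-state-name", "Values": ["running"]},
        {"Name": "tag:moz-type", "Values": ["bld-linux64-spot", "tst-linux64-spot"]},
    ]
):
    for reservation in page["Reservations"]:
        instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])

if instance_ids:
    ec2.terminate_instances(InstanceIds=instance_ids)
print("terminated %d spot instances" % len(instance_ids))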
Status: we're waiting for Amazon to resolve their networking issues. RelEng (and maybe others in MoCo) can look at
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw2
to see the recent trends for packet loss and latency. Current state is:
* usw2 is consistently bad (5-15% packet loss, 220ms latency vs. 0% and 25ms normally).
* use1 is usually fine (0% loss, 75ms latency), but packet loss occasionally jumps to 5-15%.

We appear to have some pending compile jobs on fx-team, m-i and b2g-i, probably because their jacuzzi pools are entirely or partly based in the now-disabled usw2.
use1 deteriorated almost as soon as I posted comment #6, but in the last hour we've had very little packet loss on both the use1 and usw2 links. Latency is still fluctuating, and AWS hasn't cleared their notice yet, so let's leave it a little longer. I'll check again in about 90 minutes.
The network is fine to both regions now. Sheriffs, feel free to reopen the trees at your discretion.
Trees were reopened at 23:12 Pacific.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard