Constant stream of timeouts as AWS slaves download from tooltool, pvtbuilds and ftp-ssl and upload to signing servers

Status: RESOLVED FIXED
Product / Component: Release Engineering :: Buildduty
Severity: blocker
Reporter: philor
Assignee: Unassigned
Firefox tracking flags: (Not tracked)
Attachments: 2
Opened / last updated: 3 years ago

Description (Reporter, 3 years ago):

e.g. http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-linux32_gecko/1422228576/mozilla-inbound-linux32_gecko-bm72-build1-build1398.txt.gz or http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-inbound-android-x86/1422230495/mozilla-inbound-android-x86-bm94-build1-build1636.txt.gz or https://bugzilla.mozilla.org/show_bug.cgi?id=1059287#c124 through #c214

If it involves data transfer between AWS and the outside world, it's timed out a dozen times and I've retriggered it a dozen times this afternoon, and now my retrigger finger is tired.

All trees closed.
Created attachment 8554336 [details]
us-west-2 packetloss/latency

We've been seeing packet loss since ~10am Pacific. Both tunnels are currently routing via above.net and Zayo to Amazon, with the loss appearing once inside AWS:

                                                           Packets               Pings
 Host                                              Loss%   Snt Drop   Avg  Best  Wrst StDev
 1. 10.26.48.1                                     0.0%    60    0    0.7   0.5   3.6   0.4
 2. v-1030.core1.releng.scl3.mozilla.net           0.0%    60    0    2.0   1.5   4.6   0.4
 3. v-1032.border2.scl3.mozilla.net                0.0%    60    0    2.7   0.8  42.2   6.6
 4. xe-0-0-3.border1.scl3.mozilla.net              0.0%    60    0    5.8   1.1  61.8  12.8
 5. xe-1-2-0.border1.pao1.mozilla.net              0.0%    60    0    2.9   1.3  28.6   4.3
 6. xe-3-1-0.mpr2.pao1.us.above.net                0.0%    60    0    1.9   1.3  17.3   2.2
 7. ae7.cr2.sjc2.us.zip.zayo.com                   0.0%    60    0    4.6   1.9  32.3   6.7
 8. ae8.cr1.sjc2.us.zip.zayo.com                   0.0%    60    0    5.8   1.9  32.3   8.1
 9. ae9.mpr3.sjc7.us.zip.zayo.com                  0.0%    60    0    2.5   2.0  14.8   1.7
10. equinix02-sfo5.amazon.com                      0.0%    60    0    2.3   2.1   6.5   0.6
11. 54.240.242.48                                  0.0%    60    0  204.9 194.8 242.0   7.2
12. 54.240.242.57                                  0.0%    60    0  157.7  54.3 220.8  52.8
13. 54.239.41.98                                   6.7%    60    4  220.9 220.2 228.7   1.2
14. 178.236.3.114                                  6.7%    60    4  220.4 220.0 221.6   0.3
15. 54.239.41.157                                  5.1%    60    3  220.9 219.3 231.5   2.0
16. 54.239.52.134                                  8.5%    60    5  220.7 218.3 238.2   2.8
17. 54.239.41.161                                  5.1%    60    3  220.7 218.5 225.2   1.0
18. 205.251.225.214                                0.0%    60    0  220.2 219.4 224.7   0.9
19. 205.251.232.74                                10.2%    59    6  220.7 220.0 224.8   0.8
20. 205.251.232.205                                5.1%    59    3  220.9 219.7 234.8   2.0
21. 54.239.48.187                                  5.1%    59    3  220.8 220.0 224.7   0.8
22. 205.251.233.122                                3.4%    59    2  221.6 220.0 259.3   5.2
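
For reference, the attached reports can be reproduced by running mtr in report mode from the releng side of the tunnels. The sketch below is not part of the original diagnosis; it simply shells out to mtr and keeps the hops that show loss. The target IP is just the last hop of the us-west-2 trace above; substitute whichever tunnel endpoint you are chasing.

    # Hedged sketch, not from the bug: reproduce the attached mtr reports and
    # keep only hops that report loss. 205.251.233.122 is the last hop of the
    # us-west-2 trace above; point it at whichever endpoint matters to you.
    import subprocess

    def lossy_hops(target, cycles=60):
        """Run mtr in report mode and return the hop lines with >0% loss."""
        out = subprocess.run(
            ["mtr", "--report", "--report-wide", "--no-dns",
             "--report-cycles", str(cycles), target],
            capture_output=True, text=True, check=True,
        ).stdout
        return [line for line in out.splitlines()
                if "%" in line and "Loss%" not in line and " 0.0%" not in line]

    if __name__ == "__main__":
        print("\n".join(lossy_hops("205.251.233.122")))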
Created attachment 8554337 [details]
us-east-1 packetloss/latency

Similar story in us-east-1: a big jump in latency along with packet loss.
(Reporter) Updated 3 years ago
Severity: normal → blocker
From http://status.aws.amazon.com/:

We are investigating an issue with an external provider, which may be impacting Internet connectivity between some customer networks and the US-WEST-2 Region. Connectivity to instances and services within the Region is not impacted by the event.

And dcurado from netops confirms it's an issue beyond the last Mozilla-visible hop in that mtr, i.e. between Equinix and Amazon. Interesting that our us-east-1 tunnels also go via equinix02-sfo5.amazon.com, but the next hop differs.
use1 looks to be in better shape (currently no packet loss, normal latency), so disabling new spot instances in usw2 with
 https://github.com/mozilla/build-cloud-tools/commit/c71599085deda217c89338de2178fb994b3a7584
(warning: I got the region wrong in the commit message).
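
For readers unfamiliar with build-cloud-tools, the gist of that change is simply "stop launching new spot instances in the degraded region." The fragment below is a hypothetical illustration of that guard, not the actual build-cloud-tools code or its config format:

    # Hypothetical illustration only; build-cloud-tools has its own config format.
    DISABLED_REGIONS = {"us-west-2"}   # us-east-1 left enabled for now

    def may_launch_spot(region, pending_jobs):
        """Only start new spot instances in regions we haven't pulled out of."""
        return region not in DISABLED_REGIONS and pending_jobs > 0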
Terminated about 351 build and test spot instances in usw2. But we're hitting capacity on test spot instances in use1.
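
The bulk termination itself was done with releng tooling; purely as an illustration of the same operation, here is a hedged boto3 sketch that finds the running spot instances in a region and, only when asked, terminates them. The filters below select all running spot instances, whereas the real run presumably also distinguished the build and test pools:

    # Hedged boto3 sketch, not the tooling actually used for the 351 instances.
    import boto3

    def running_spot_instances(region="us-west-2"):
        """Return the IDs of running spot instances in `region`."""
        ec2 = boto3.client("ec2", region_name=region)
        ids = []
        for page in ec2.get_paginator("describe_instances").paginate(Filters=[
            {"Name": "instance-lifecycle", "Values": ["spot"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]):
            for reservation in page["Reservations"]:
                ids.extend(i["InstanceId"] for i in reservation["Instances"])
        return ids

    def terminate(ids, region="us-west-2", dry_run=True):
        """Print what would be terminated; only act when dry_run is False."""
        print("would terminate %d instances in %s" % (len(ids), region))
        if ids and not dry_run:
            boto3.client("ec2", region_name=region).terminate_instances(InstanceIds=ids)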
Status: we're waiting for Amazon to resolve their networking issues. RelEng (and maybe wider in MoCo) can look at 
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw2
to see the recent trends in packet loss and latency (a rough command-line stand-in is sketched after this list). Current state is:
* usw2: consistently bad (5-15% packet loss, 220ms latency, vs. 0% and 25ms normally).
* use1: usually fine (0% loss, 75ms latency), but intermittently jumps to 5-15% loss.
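
For anyone without smokeping access, a crude stand-in is to ping the in-region nagios hosts from the releng network and flag loss/latency against the baselines above. The hostnames below are guesses derived from the smokeping target names, not confirmed FQDNs:

    # Crude stand-in for the smokeping graphs; hostnames are assumptions.
    import re
    import subprocess

    TARGETS = {
        "use1": "nagios1.private.releng.use1.mozilla.com",   # assumed FQDN
        "usw2": "nagios1.private.releng.usw2.mozilla.com",   # assumed FQDN
    }

    def probe(host, count=20):
        """Return (loss %, avg rtt ms) parsed from a standard `ping -c` run."""
        out = subprocess.run(["ping", "-q", "-c", str(count), host],
                             capture_output=True, text=True).stdout
        loss = float(re.search(r"(\d+(?:\.\d+)?)% packet loss", out).group(1))
        rtt = re.search(r"= [\d.]+/([\d.]+)/", out)
        return loss, float(rtt.group(1)) if rtt else float("nan")

    for region, host in TARGETS.items():
        loss, avg = probe(host)
        flag = "  <-- degraded" if loss > 2 or avg > 150 else ""
        print("%s: %.1f%% loss, %.1f ms avg%s" % (region, loss, avg, flag))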

We appear to have some pending compile jobs on fx-team, m-i and b2g-i, probably due to jacuzzi pools that are entirely or partly based in the now-disabled usw2.
use1 deteriorated just as soon as I posted comment #6, but in the last hour we've seen very little packet loss on both the use1 and usw2 links. Latency is still fluctuating, and AWS hasn't cleared their notice yet, so let's leave it a little longer still. I'll check again in about 90 minutes.
Pushed https://github.com/mozilla/build-cloud-tools/commit/fc7e3ba4d4840f978784f654a72e7eb9b9a1e767 to re-enable spot instances in us-west-2, to test the waters.
The network is fine to both regions now. Sheriffs, feel free to reopen trees again at your discretion.
Trees were reopened at 23:12 Pacific.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED