Closed Bug 1125688 Opened 9 years ago Closed 9 years ago

Constant stream of timeouts as AWS slaves download from tooltool, pvtbuilds and ftp-ssl and upload to signing servers

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

Details

Attachments

(2 files)

We've been seeing packet loss since ~10am Pacific. Both tunnels are currently routing via above.net and Zayo to Amazon, and then we see loss within AWS:

                                                           Packets               Pings
 Host                                              Loss%   Snt Drop   Avg  Best  Wrst StDev
 1. 10.26.48.1                                     0.0%    60    0    0.7   0.5   3.6   0.4
 2. v-1030.core1.releng.scl3.mozilla.net           0.0%    60    0    2.0   1.5   4.6   0.4
 3. v-1032.border2.scl3.mozilla.net                0.0%    60    0    2.7   0.8  42.2   6.6
 4. xe-0-0-3.border1.scl3.mozilla.net              0.0%    60    0    5.8   1.1  61.8  12.8
 5. xe-1-2-0.border1.pao1.mozilla.net              0.0%    60    0    2.9   1.3  28.6   4.3
 6. xe-3-1-0.mpr2.pao1.us.above.net                0.0%    60    0    1.9   1.3  17.3   2.2
 7. ae7.cr2.sjc2.us.zip.zayo.com                   0.0%    60    0    4.6   1.9  32.3   6.7
 8. ae8.cr1.sjc2.us.zip.zayo.com                   0.0%    60    0    5.8   1.9  32.3   8.1
 9. ae9.mpr3.sjc7.us.zip.zayo.com                  0.0%    60    0    2.5   2.0  14.8   1.7
10. equinix02-sfo5.amazon.com                      0.0%    60    0    2.3   2.1   6.5   0.6
11. 54.240.242.48                                  0.0%    60    0  204.9 194.8 242.0   7.2
12. 54.240.242.57                                  0.0%    60    0  157.7  54.3 220.8  52.8
13. 54.239.41.98                                   6.7%    60    4  220.9 220.2 228.7   1.2
14. 178.236.3.114                                  6.7%    60    4  220.4 220.0 221.6   0.3
15. 54.239.41.157                                  5.1%    60    3  220.9 219.3 231.5   2.0
16. 54.239.52.134                                  8.5%    60    5  220.7 218.3 238.2   2.8
17. 54.239.41.161                                  5.1%    60    3  220.7 218.5 225.2   1.0
18. 205.251.225.214                                0.0%    60    0  220.2 219.4 224.7   0.9
19. 205.251.232.74                                10.2%    59    6  220.7 220.0 224.8   0.8
20. 205.251.232.205                                5.1%    59    3  220.9 219.7 234.8   2.0
21. 54.239.48.187                                  5.1%    59    3  220.8 220.0 224.7   0.8
22. 205.251.233.122                                3.4%    59    2  221.6 220.0 259.3   5.2
Similar story in us-east-1, big jump in latency with packet loss.
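(For reference, the trace above is a standard mtr report; a minimal sketch like the one below reproduces it from a releng host. The target hostname here is hypothetical, and mtr needs to be installed and usually needs root for ICMP. This is just to show how the numbers were gathered, not part of the fix.)

import subprocess

# Hypothetical target on the far side of the usw2 tunnel; substitute
# whichever host you actually want to probe.
TARGET = "nagios1.private.releng.usw2.mozilla.com"

# Report mode with 60 probe cycles, matching the 60-packet trace above.
result = subprocess.run(
    ["mtr", "--report", "--report-cycles", "60", TARGET],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)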
Severity: normal → blocker
From http://status.aws.amazon.com/:

We are investigating an issue with an external provider, which may be impacting Internet connectivity between some customer networks and the US-WEST-2 Region. Connectivity to instances and services within the Region is not impacted by the event.

And dcurado from netops confirms it's an issue between the last hops in that mtr (Equinix and Amazon). Interestingly, our us-east-1 tunnels also go via equinix02-sfo5.amazon.com, but the next hop differs.
use1 looks to be in better shape (currently no packet loss, normal latency), so we're disabling new spot instances in usw2 with
 https://github.com/mozilla/build-cloud-tools/commit/c71599085deda217c89338de2178fb994b3a7584
(warning: I got the region wrong in the commit message).
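(Purely as an illustration of what that commit amounts to: zero out the usw2 limits in the cloud-tools config so no new spot instances get requested there. The config path and schema in this sketch are hypothetical; the real change is the commit linked above.)

import json

# Hypothetical path and schema; the real edit lives in
# mozilla/build-cloud-tools (see the commit linked above).
CONFIG = "configs/watch_pending.cfg"

with open(CONFIG) as f:
    cfg = json.load(f)

# Stop requesting new spot instances in us-west-2 by zeroing its limits;
# us-east-1 is left untouched.
for limits in cfg.get("limits", {}).values():
    if "us-west-2" in limits:
        limits["us-west-2"] = 0

with open(CONFIG, "w") as f:
    json.dump(cfg, f, indent=2)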
Terminated about 351 build and test spot instances in usw2. But we're hitting capacity on test spot instances in use1.
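(For anyone curious what that mass termination looks like in code, here is a hedged boto3 sketch: enumerate running spot instances in us-west-2 and terminate them. The moz-type tag values are assumptions, not the exact filters used for those 351 instances.)

import boto3

# Sketch only: find running spot instances in us-west-2 and terminate them.
# The tag filter below is an assumption about how build/test slaves are
# tagged, not the exact query used above.
ec2 = boto3.client("ec2", region_name="us-west-2")

instance_ids = []
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[
        {"Name": "instance-lifecycle", "Values": ["spot"]},
        {"Name": "instance-state-name", "Values": ["running"]},
        {"Name": "tag:moz-type", "Values": ["bld-linux64-spot", "tst-linux64-spot"]},
    ]
):
    for reservation in page["Reservations"]:
        instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])

if instance_ids:
    ec2.terminate_instances(InstanceIds=instance_ids)
print("terminated %d spot instances" % len(instance_ids))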
Status: we're waiting for Amazon to resolve their networking issues. RelEng (and maybe others in MoCo) can look at
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw2
to see the recent trends for packet loss and latency. Current state is:
* usw2 is consistently bad (5-15% packet loss, 220ms latency vs. 0% and 25ms normally).
* use1 is usually fine (0% loss, 75ms latency), but packet loss occasionally jumps to 5-15%.

We appear to have some pending compile jobs on fx-team, m-i and b2g-i, probably because their jacuzzi pools are entirely or partly based in the now-disabled usw2.
use1 deteriorated almost as soon as I posted comment #6, but in the last hour we've had very little packet loss on both the use1 and usw2 links. Latency is still fluctuating, and AWS hasn't cleared their notice yet, so let's leave it a little longer. I'll check again in about 90 minutes.
The network is fine to both regions now. Sheriffs, feel free to reopen the trees at your discretion.
Trees were reopened at 23:12 Pacific.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard