Bug 1125688 (Closed) · Opened 9 years ago · Closed 9 years ago

Constant stream of timeouts as AWS slaves download from tooltool, pvtbuilds and ftp-ssl and upload to signing servers

Categories: Infrastructure & Operations Graveyard :: CIDuty (task)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: philor; Assignee: Unassigned
Attachments: 2 files
Description•9 years ago

Examples:

http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-linux32_gecko/1422228576/mozilla-inbound-linux32_gecko-bm72-build1-build1398.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-inbound-android-x86/1422230495/mozilla-inbound-android-x86-bm94-build1-build1636.txt.gz
https://bugzilla.mozilla.org/show_bug.cgi?id=1059287#c124 through #c214

If it involves data transfer between AWS and the outside world, it's timed out a dozen times and I've retriggered it a dozen times this afternoon, and now my retrigger finger is tired. All trees closed.
Comment 1•9 years ago
We've been seeing packet loss since ~10am Pacific. Both tunnels are currently routing via above.net and Zayo to Amazon, then the loss starts inside AWS:

                                            Packets                Pings
 Host                                      Loss%  Snt Drop    Avg   Best   Wrst StDev
  1. 10.26.48.1                             0.0%   60    0    0.7    0.5    3.6   0.4
  2. v-1030.core1.releng.scl3.mozilla.net   0.0%   60    0    2.0    1.5    4.6   0.4
  3. v-1032.border2.scl3.mozilla.net        0.0%   60    0    2.7    0.8   42.2   6.6
  4. xe-0-0-3.border1.scl3.mozilla.net      0.0%   60    0    5.8    1.1   61.8  12.8
  5. xe-1-2-0.border1.pao1.mozilla.net      0.0%   60    0    2.9    1.3   28.6   4.3
  6. xe-3-1-0.mpr2.pao1.us.above.net        0.0%   60    0    1.9    1.3   17.3   2.2
  7. ae7.cr2.sjc2.us.zip.zayo.com           0.0%   60    0    4.6    1.9   32.3   6.7
  8. ae8.cr1.sjc2.us.zip.zayo.com           0.0%   60    0    5.8    1.9   32.3   8.1
  9. ae9.mpr3.sjc7.us.zip.zayo.com          0.0%   60    0    2.5    2.0   14.8   1.7
 10. equinix02-sfo5.amazon.com              0.0%   60    0    2.3    2.1    6.5   0.6
 11. 54.240.242.48                          0.0%   60    0  204.9  194.8  242.0   7.2
 12. 54.240.242.57                          0.0%   60    0  157.7   54.3  220.8  52.8
 13. 54.239.41.98                           6.7%   60    4  220.9  220.2  228.7   1.2
 14. 178.236.3.114                          6.7%   60    4  220.4  220.0  221.6   0.3
 15. 54.239.41.157                          5.1%   60    3  220.9  219.3  231.5   2.0
 16. 54.239.52.134                          8.5%   60    5  220.7  218.3  238.2   2.8
 17. 54.239.41.161                          5.1%   60    3  220.7  218.5  225.2   1.0
 18. 205.251.225.214                        0.0%   60    0  220.2  219.4  224.7   0.9
 19. 205.251.232.74                        10.2%   59    6  220.7  220.0  224.8   0.8
 20. 205.251.232.205                        5.1%   59    3  220.9  219.7  234.8   2.0
 21. 54.239.48.187                          5.1%   59    3  220.8  220.0  224.7   0.8
 22. 205.251.233.122                        3.4%   59    2  221.6  220.0  259.3   5.2
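For reference, a report like the one above can be regenerated with mtr and scanned for lossy hops. A minimal sketch, assuming mtr is installed (it needs raw-socket privileges, so typically runs setuid or as root) and using a hypothetical target hostname rather than the actual tunnel endpoint:

    import subprocess

    TARGET = "ec2.us-west-2.amazonaws.com"  # hypothetical target for illustration

    # -r = report mode, -w = wide report, -c 60 = 60 probe cycles (matches Snt=60 above)
    report = subprocess.run(
        ["mtr", "-r", "-w", "-c", "60", TARGET],
        capture_output=True, text=True, check=True,
    ).stdout

    # Print any hop with non-zero loss; in the report above that starts at hop 13,
    # i.e. inside Amazon's network rather than on our side or the transit providers.
    for line in report.splitlines():
        if any(f.endswith("%") and f not in ("0.0%", "Loss%") for f in line.split()):
            print("lossy hop:", line)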
Comment 2•9 years ago
Similar story in us-east-1, big jump in latency with packet loss.
Updated by reporter•9 years ago
Severity: normal → blocker
Comment 3•9 years ago
From http://status.aws.amazon.com/:

  "We are investigating an issue with an external provider, which may be impacting Internet connectivity between some customer networks and the US-WEST-2 Region. Connectivity to instances and services within the Region is not impacted by the event."

And dcurado from netops confirms it's an issue between the last hops of that mtr (equinix and amazon). Interesting that our us-east-1 tunnels also go via equinix02-sfo5.amazon.com, but the next hop differs.
Comment 4•9 years ago
use1 looks to be in better shape (currently no packet loss, normal latency), so I'm disabling new spot instances in usw2 with https://github.com/mozilla/build-cloud-tools/commit/c71599085deda217c89338de2178fb994b3a7584 (warning: I got the region wrong in the commit message).
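The disabling itself was that config change in build-cloud-tools, which stops the tooling from requesting new usw2 instances. For illustration only, any still-open spot requests in the region could also be cancelled directly; a minimal sketch, assuming boto3 and configured AWS credentials:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")

    # Find spot requests that are still open (not yet fulfilled) and cancel them,
    # so no further usw2 instances come up while the network is degraded.
    open_requests = ec2.describe_spot_instance_requests(
        Filters=[{"Name": "state", "Values": ["open"]}]
    )["SpotInstanceRequests"]

    request_ids = [r["SpotInstanceRequestId"] for r in open_requests]
    if request_ids:
        ec2.cancel_spot_instance_requests(SpotInstanceRequestIds=request_ids)
    print("cancelled %d open spot requests in us-west-2" % len(request_ids))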
Comment 5•9 years ago
Terminated about 351 build and test spot instances in usw2. But we're hitting capacity on test spot instances in use1.
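A hedged sketch of that kind of bulk termination, assuming boto3; the actual work was done with releng's cloud tooling rather than this code:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")

    # Collect all running spot instances (they carry instance-lifecycle=spot).
    instance_ids = []
    for page in ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "instance-lifecycle", "Values": ["spot"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])

    # Terminate in modest batches to stay well under per-call request limits.
    for i in range(0, len(instance_ids), 100):
        ec2.terminate_instances(InstanceIds=instance_ids[i:i + 100])
    print("terminated %d spot instances in us-west-2" % len(instance_ids))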
Comment 6•9 years ago
Status: we're waiting for Amazon to resolve their networking issues.

RelEng (and maybe wider in MoCo) can look at
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw2
to see the recent trends for packet loss and latency. Current state is:
* usw2 is consistently bad (5-15% packet loss, 220ms latency, vs 0% and 25ms normally).
* use1 is usually fine (0% loss, 75ms latency), but jumps to 5-15% loss intermittently.

We appear to have some pending compile jobs on fx-team, m-i and b2g-i, which are probably due to jacuzzi pools that are entirely or partly based in the disabled usw2.
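Those smokeping graphs live on the internal network; as a rough stand-in, loss and latency to a host can be sampled with plain ping. A minimal sketch, assuming a Linux-style ping summary and a hypothetical hostname in place of the internal nagios endpoints:

    import re
    import subprocess

    TARGET = "nagios1-releng-use1.example.com"  # hypothetical; the real endpoints are internal

    out = subprocess.run(
        ["ping", "-c", "20", TARGET],
        capture_output=True, text=True,
    ).stdout

    # Linux ping prints e.g. "20 packets transmitted, 19 received, 5% packet loss"
    # and "rtt min/avg/max/mdev = 74.1/75.3/80.2/1.2 ms".
    loss = re.search(r"([\d.]+)% packet loss", out)
    avg = re.search(r"= [\d.]+/([\d.]+)/", out)
    if loss and avg:
        print("%s: %s%% loss, %s ms avg latency" % (TARGET, loss.group(1), avg.group(1)))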
Comment 7•9 years ago
use1 deteriorated just as soon as I posted comment #6, but in the last hour we've had very little packet loss on both the use1 and usw2 links. Latency is still fluctuating, and AWS hasn't cleared their notice yet, so let's leave it a little longer still. I'll check again in about 90 minutes.
Comment 8•9 years ago
Landed https://github.com/mozilla/build-cloud-tools/commit/fc7e3ba4d4840f978784f654a72e7eb9b9a1e767 to re-enable spot instances in us-west-2 and test the waters.
Comment 9•9 years ago
The network to both regions is fine now. Sheriffs, feel free to reopen the trees at your discretion.
Comment 10•9 years ago
Trees were reopened at 23:12 Pacific.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard