Closed Bug 1140070 Opened 10 years ago Closed 10 years ago

Frequent FTP timeouts affecting all trees and platforms

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Unassigned)

Details

Everything's closed.

https://treeherder.mozilla.org/logviewer.html#?job_id=7266074&repo=mozilla-inbound

11:01:34 INFO - mkdir: C:\slave\test\build
11:01:34 INFO - Downloading https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1425578721/firefox-39.0a1.en-US.win32.tests.zip to C:\slave\test\build\firefox-39.0a1.en-US.win32.tests.zip
11:01:34 INFO - retry: Calling _download_file with args: ('https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1425578721/firefox-39.0a1.en-US.win32.tests.zip', 'C:\\slave\\test\\build\\firefox-39.0a1.en-US.win32.tests.zip'), kwargs: {}, attempt #1
command timed out: 1800 seconds without output, attempting to kill
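For readers unfamiliar with the failure mode in the log above: the download stalled silently until the outer 1800-second no-output watchdog killed the job, so the retry wrapper never got a chance at a second attempt. A minimal sketch of the general idea follows, assuming a per-read socket timeout much shorter than the watchdog. The names `download_file` and `retry` and every default value here are hypothetical illustrations, not mozharness's actual implementation.

```python
import shutil
import time
import urllib.request


def download_file(url, dest, read_timeout=60):
    """Download url to dest, failing fast if the connection stalls.

    The timeout applies to each blocking socket operation, so a stalled
    transfer raises an OSError subclass instead of hanging silently.
    (Hypothetical helper; the 60-second value is an assumption.)
    """
    with urllib.request.urlopen(url, timeout=read_timeout) as resp:
        with open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)
    return dest


def retry(action, attempts=3, sleeptime=30.0, exceptions=(OSError,)):
    """Call action() up to `attempts` times, sleeping longer after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(sleeptime * attempt)
```

Under these assumptions a caller would wrap the download as `retry(lambda: download_file(url, dest))`, so a stalled transfer surfaces as an error after about a minute and is retried, rather than consuming the whole watchdog window.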
This looks similar to what we saw last Tuesday when we released 36.0. Since we released 36.0.1 today, the root cause (general load from the CDNs?) may be the same. I've asked in #moc for details about the network and ZLBs.
FWIW, with update tests, we've seen the same symptoms a few times in recent months, including when we were in Portland (I think we pushed 34.0.5 live or so at that point).
(Timestamps ET)
[18:17:33] nthomas why are the tree's still closed ?
[18:17:48] nthomas no reason from an infra pov
[18:18:03] jlund|buildduty I thinnk KWierso mentioned they aree waiting on something
[18:18:12] nthomas ok
[18:18:23] nthomas back after lunch
[18:18:36] KWierso jlund|buildduty: was waiting to see if these merges I'm doing are free from the infra timeouts
[18:19:04] nthomas|away they should be, we're back under the traffic limit on the load balancers for a while
[18:19:30] KWierso jlund|buildduty: speaking of which, I just merged inbound with that mozharness rev
[18:19:45] RyanVM nthomas|away: bad time to mention that AFAICT, nobody updated the bug with that info?
[18:20:21] RyanVM can't have it both ways, screaming for a bug to be filed for every closure and then not adding new info to it once it's there
[18:21:13] RyanVM looks at the scrollback in this channel - not even a comment here about it until just now
[18:21:33] RyanVM so yeah, "Why are the trees still closed?" - we aren't mind readers
[18:22:04] KWierso ... yet

So we've been closed since 15:14 ET, to now at least (18:49 ET), over this issue, not a clue (in this bug or #releng) that there was any "all clear" type of statement. Neither I nor sheriffs were told.

Kinda confused.

* Are we actually better now?
* Do we expect this to happen again?
* Do we know what caused the added load on our LB?
* Do we have any mitigation steps forthcoming, outlined in any bugs (other than "we hope to be mostly in AWS for ftp by end of Q2") -- Basically a mitigation that helps us ~ now.
* How did we pinpoint this to a LB issue
* Why do we feel the issue is clear now (could it only be clear because our infra has been closed for 3 hours since this bug was filed) ?

Feel free to redirect needinfo as appropriate. I'm shooting towards the messenger I know of.
Flags: needinfo?(nthomas)
Reopened the trees for lack of anything else to really do, since tests aren't timing out while no tests are running for lack of pushes.
Sorry, the conversations in #moc didn't make it back to #releng or to here.

The problem was release related, in that 36.0.1 shipped at 10:50 Pacific and the CDNs started requesting a lot of files from the origin server. That happens to be hosted on the same Zeus load balancer as ftp-ssl.mozilla.org (part of it, anyway), where the tests are served from. atoll did all the heavy lifting on the debugging, and this is the point at which we were all clear from a traffic point of view:

[13:48] <atoll> cdn is also already back to normal, so it seems unaffected / out of problem set
<atoll> also, my wget is running at 60Mbit now
[13:49] <atoll> maybe it was just the CDN saturating the license limit

AFAIK we don't know why the last few releases have been like that. I would like someone to look at the CloudFront config, since that is a relatively new CDN in the mix.

Other bugs of interest, suggestions
* bug 1130242 to get alerts when the Zeus load balancers hit the license-based traffic limits
* revisiting the allocation of domains to ZLBs, to isolate release from test traffic
* precaching CDNs before release, when we can control the rate. bug 1140165 to do that on edgecast, one of the four CDNs
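The alerting suggestion above (bug 1130242) amounts to comparing current load-balancer throughput against the licensed limit and warning before the limit is reached. A minimal sketch of that check follows; the function name, the thresholds, and the Mbit/s figures are all hypothetical, and nothing here reflects how the Zeus load balancers actually report traffic.

```python
def traffic_status(current_mbps, limit_mbps, warn_ratio=0.8, crit_ratio=0.95):
    """Classify load-balancer throughput relative to a licensed traffic limit.

    All thresholds are illustrative assumptions, not values from this bug.
    """
    if limit_mbps <= 0:
        raise ValueError("limit_mbps must be positive")
    utilisation = current_mbps / limit_mbps
    if utilisation >= crit_ratio:
        return "CRITICAL"
    if utilisation >= warn_ratio:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical sample: a 1000 Mbit/s limit at three load levels.
    for sample in (500, 850, 990):
        print(sample, traffic_status(sample, 1000))
```

A warning band below the hard limit is the point of the exercise: it would give sheriffs and buildduty time to react before test downloads start stalling.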
Flags: needinfo?(nthomas)
I've also filed bug 1140398 to give us some early warning in #buildduty when stage.m.o is under load.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Nick Thomas [:nthomas] from comment #5)
> * revisiting the allocation of domains to ZLBs, to isolate release from test
> traffic

Bug 1141812 for this.
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard