4 years ago
Everything's closed.

11:01:34 INFO - mkdir: C:\slave\test\build
11:01:34 INFO - Downloading to C:\slave\test\build\
11:01:34 INFO - retry: Calling _download_file with args: ('', 'C:\\slave\\test\\build\\'), kwargs: {}, attempt #1
command timed out: 1800 seconds without output, attempting to kill
This looks similar to what we saw last Tuesday when we released 36.0. Since we released 36.0.1 today, the root cause (general load from the CDNs?) may be the same.

I've asked in #moc for details about the network and zlbs.

4 years ago
FWIW, with update tests, we've seen the same symptoms a few times in recent months, including when we were in Portland (I think we pushed 34.0.5 live or so at that point).
(Timestamps ET)
[18:17:33]	nthomas	why are the tree's still closed ?
[18:17:48]	nthomas	no reason from an infra pov
[18:18:03]	jlund|buildduty	I thinnk KWierso mentioned they aree waiting on something
[18:18:12]	nthomas	ok
[18:18:23]	nthomas	back after lunch
[18:18:36]	KWierso	jlund|buildduty: was waiting to see if these merges I'm doing are free from the infra timeouts
[18:19:04]	nthomas|away	they should be, we're back under the traffic limit on the load balancers for a while
[18:19:30]	KWierso	jlund|buildduty: speaking of which, I just merged inbound with that mozharness rev
[18:19:45]	RyanVM	nthomas|away: bad time to mention that AFAICT, nobody updated the bug with that info?
[18:20:21]	RyanVM	can't have it both ways, screaming for a bug to be filed for every closure and then not adding new info to it once it's there
[18:21:13]	RyanVM	looks at the scrollback in this channel - not even a comment here about it until just now
[18:21:33]	RyanVM	so yeah, "Why are the trees still closed?" - we aren't mind readers
[18:22:04]	KWierso	... yet

So we've been closed over this issue from 15:14 ET until at least now (18:49 ET), with no clue (in this bug or in #releng) that there had been any "all clear" type of statement. Neither I nor the sheriffs were told. Kinda confused.

* Are we actually better now?
* Do we expect this to happen again?
* Do we know what caused the added load on our LB?
* Do we have any mitigation steps forthcoming, outlined in any bugs (other than "we hope to be mostly in AWS for ftp by end of Q2") -- Basically a mitigation that helps us ~ now.
* How did we pinpoint this to a LB issue?
* Why do we feel the issue is clear now? (Could it only be clear because our infra has been closed for 3 hours since this bug was filed?)

Feel free to redirect needinfo as appropriate. I'm shooting towards the messenger I know of.
Reopened the trees for lack of anything else to really do, since tests aren't timing out while no tests are running for lack of pushes.
Sorry, the conversations in #moc didn't make it back to #releng or to here. 

The problem was release related, in that 36.0.1 shipped at 10:50 Pacific and the CDNs started requesting a lot of files from the origin server. That origin happens to be hosted on the same Zeus load balancer as (part of, anyway) where the tests are served from. atoll did all the heavy lifting on the debugging, and this is the point at which we were all clear from a traffic point of view:

[13:48]	<atoll> cdn is also already back to normal, so it seems unaffected / out of problem set
	<atoll> also, my wget is running at 60Mbit now
[13:49]	<atoll> maybe it was just the CDN saturating the license limit

AFAIK we don't know why the last few releases have been like that. I would like someone to look at the CloudFront config, since that is a relatively new CDN in the mix.

Other bugs of interest, suggestions
* bug 1130242 to get alerts when the Zeus load balancers hit the license-based traffic limits
* revisiting the allocation of domains to ZLBs, to isolate release from test traffic
* precaching CDNs before release, when we can control the rate. bug 1140165 to do that on Edgecast, one of the four CDNs
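
FWIW, a rough sketch of what "precaching when we can control the rate" could look like. This is only an illustration, not what bug 1140165 implements: the `prewarm` name, the `max_per_second` knob and the caller-supplied `fetch` function are all hypothetical, and the clock and sleep functions are injectable only so the pacing can be checked without touching the network.

```python
import time


def prewarm(urls, fetch, max_per_second, clock=time.monotonic, sleep=time.sleep):
    """Call fetch(url) for each URL, starting at most max_per_second requests per second.

    `fetch` is a hypothetical caller-supplied function (for example, one that
    requests the URL through a CDN hostname so the edge pulls it from origin).
    """
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    interval = 1.0 / max_per_second  # minimum spacing between request starts
    next_start = clock()
    results = []
    for url in urls:
        wait = next_start - clock()
        if wait > 0:
            sleep(wait)  # hold back so the origin never sees a burst
        started = clock()
        results.append((url, fetch(url)))
        # Space the next request from when this one actually started.
        next_start = started + interval
    return results
```

The point of pacing from the request start time is that a slow fetch does not get "made up for" with a burst afterwards, which is the behaviour we would want to avoid against a load balancer with a traffic limit.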
I've also filed bug 1140398 to give us some early warning in #buildduty when stage.m.o is under load.
Bug 1141812 for this.


