Closed Bug 957502 Opened 10 years ago Closed 10 years ago

All trees closed due to timeouts as usw2 slaves download or clone

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: adam)

References

Details

Attachments

(7 files)

The bulk of the failures can be found from bug 934938 comment 65 on down (poor webops), but we're also seeing gaia tests time out while doing their poor version of cloning gaia-central without using hgtool, and Android builds timing out while downloading from http://runtime-binaries.pvt.build.mozilla.org/tooltool, and probably other timeouts I've already managed to forget.

I'm told that the high numbers of all the instances involved mean it's usw2 that's having troubles.
Device build failures like https://tbpl.mozilla.org/php/getParsedLog.php?id=32681302&tree=Mozilla-Central were the other thing I managed to forget.
Trying to reproduce https://tbpl.mozilla.org/php/getParsedLog.php?id=32680752&tree=Mozilla-Inbound I get 100-200KB/s, which results in a download time over 20 mins. Pretty sure that is what's leading to the timeouts on 1200s.
ftp.m.o seems healthy, based on the total traffic going out of it. From SCL1 I get 2MB/s, and similar from USE1, but transfers in and out of USW2 are slow.
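As a rough sanity check on the timeout math (back-of-envelope, assuming a ~300MB package of the sort compile jobs upload): 300MB at 200KB/s is about 1500s, already past the 1200s timeout, while at 2MB/s the same transfer takes roughly 150s.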

XioNoX, could you check the health of the VPN endpoint on our side? Otherwise we need to raise a support case with Amazon.

I'll look into disabling USW2 in the meantime.
Flags: needinfo?(arzhel)
Attached image Smokeping Jan 7th
We believe this was due to packet loss between our network and the EC2 instances in USW2; see the attached graph for elevated loss rates during the period when builds were timing out. The loss rate has now fallen back to 0% and we can re-open the trees.

We can't tell at this point if it's loss on the internet between us and Amazon, or within Amazon itself. If it happens again we can ssh into cruncher to run 
  mtr 205.251.233.121
and
  mtr 205.251.233.122
for the two halves of the tunnel. That should simulate the path the tunnel packets take between us and Amazon.

Many thanks to XioNoX for his support on IRC.
Flags: needinfo?(arzhel)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
What do we, or Amazon, do around 19:30ish? We're currently having another episode; the earliest affected jobs I've seen started at 19:36.
Eh, might as well get a jump on it, and just reclose every tree right now.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reopened, looks like the problem fizzled out around 21:00, so merely critical until it happens again tomorrow at 19:30 and it becomes a blocker again.
Severity: blocker → critical
Attachment #8357022 - Attachment description: Smokeping → Smokeping Jan 7th
Attached image Smokeping Jan 8th
Unfortunately the 20+ min lag in noticing the problem meant we weren't able to traceroute to a cause.
Note, UTC times on those graphs.
I'm logging the output of mtr from cruncher once per minute for the two addresses listed in comment 4.

Logs are at buildduty@cruncher:logs/*.mtr

New buildduty@cruncher crontab entries:
* * * * *	export d=`date`; export ip=205.251.233.122; /usr/sbin/mtr --no-dns --raw --report --report-cycles=5 $ip 2>&1 | while read line; do echo $d: $line; done >> logs/$ip.mtr
* * * * *	export d=`date`; export ip=205.251.233.121; /usr/sbin/mtr --no-dns --raw --report --report-cycles=5 $ip 2>&1 | while read line; do echo $d: $line; done >> logs/$ip.mtr


Example output:

Thu Jan 9 08:26:01 PST 2014: HOST: cruncher.srv.releng.scl3.mo Loss% Snt Last Avg Best Wrst StDev
Thu Jan 9 08:26:01 PST 2014: 1. 10.26.48.1 0.0% 5 0.6 2.3 0.5 9.2 3.8
Thu Jan 9 08:26:01 PST 2014: 2. 63.245.214.81 0.0% 5 7.1 3.7 1.6 7.1 2.2
Thu Jan 9 08:26:01 PST 2014: 3. 63.245.214.85 0.0% 5 1.0 1.1 1.0 1.2 0.1
Thu Jan 9 08:26:01 PST 2014: 4. 63.245.219.129 0.0% 5 1.3 1.8 1.2 3.5 1.0
Thu Jan 9 08:26:01 PST 2014: 5. 64.125.170.33 0.0% 5 1.3 1.4 1.2 1.5 0.1
Thu Jan 9 08:26:01 PST 2014: 6. 64.125.31.61 0.0% 5 1.3 1.6 1.3 2.4 0.5
Thu Jan 9 08:26:01 PST 2014: 7. aws bin 100.0 5 0.0 0.0 0.0 0.0 0.0
Thu Jan 9 08:26:01 PST 2014: 8. 205.251.229.2 0.0% 5 37.6 47.5 35.1 72.1 14.9
Thu Jan 9 08:26:01 PST 2014: 9. 205.251.232.66 0.0% 5 35.4 42.4 35.0 68.2 14.5
Thu Jan 9 08:26:01 PST 2014: 10. 205.251.232.149 0.0% 5 35.5 41.2 35.1 65.0 13.3
Thu Jan 9 08:26:01 PST 2014: 11. 205.251.232.167 0.0% 5 35.4 37.7 35.3 46.5 4.9
Thu Jan 9 08:26:01 PST 2014: 12. 205.251.233.122 0.0% 5 35.3 32.7 28.4 35.6 3.8
seems this closed the trees again
From #releng for posterity:

[19:28]	<nthomas>	I would hazard a guess that network load is a little lower than the last two days, so we'll see if my hypothesis survives
[19:33]	seeing some loss on our firewall in scl3 though
[19:36]	bgp flap just now
[19:37]	oh, use1, n/m
[19:42]	<philor>	https://tbpl.mozilla.org/php/getParsedLog.php?id=32798013&tree=B2g-Inbound
	perhaps the storm moved east
[19:43]	<nthomas>	was just going to say that I'm seeing 5% packet loss on a firewall on our side
[19:44]	and only 200KB/s sucking data from ftp into usw2
[19:47]	<philor>	and there's three timeouts cloning gaia-central, I think it's time for the nightly closure
[20:37]	fascinating, tons of gaia-central clone timeouts and nothing else
[20:38]	<nthomas>	maybe they're the canary
	getting about 600KB/s from ftp.m.o at the mo
	or even as much as 2MB/s
	so probably we're over the worst of it
[20:39]	<philor>	yeah, I figured canary and closed, but it's been over an hour without any mass burning of linux/b2g tests
	<nthomas>	could do with a gaia-central bundle I think
[20:40]	<philor>	and some retrying use of hgtool
[20:41]	<nthomas>	set the bug to blocker if you need any help, not that I have any knobs to twiddle on this
I was looking into the problem during the day today. At one point I had myself convinced that high traffic through the tunnels (~1Gbps) corresponded to smokeping measuring 500ms ping times and packet loss of 5-10%. We didn't see that around 7:30 though - ping stayed low and traffic was maybe half the peak value. We also didn't see issues with those parameters earlier in the day.

Did manage to get a mtr run, and it fingered fw1.releng.scl3 (where the tunnels terminate):
                                 My traceroute  [v0.75]
cruncher.srv.releng.scl3.mozilla.com (0.0.0.0)                 Thu Jan  9 19:59:33 2014
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                 Packets               Pings
 Host                                          Loss%   Snt Drop   Avg  Best  Wrst StDev
 1. 10.26.48.1                                  5.5%  1940  107   1.6   0.5  27.3   3.4
 2. v-1030.core1.releng.scl3.mozilla.net        0.0%  1940    0   4.0   1.2  54.4   6.6
 3. v-1031.border1.scl3.mozilla.net             0.0%  1939    0   3.5   0.8  68.0   7.8
 4. xe-1-2-0.border1.pao1.mozilla.net           0.0%  1939    0  17.7   1.2  95.2  16.2
 5. xe-3-1-0.mpr2.pao1.us.above.net             0.0%  1939    0   5.3   1.2  93.3  10.3
 6. xe-1-0-0.mpr1.pao1.us.above.net             0.0%  1939    0   5.5   1.2 113.5  12.5
 7. ???
 8. 205.251.229.4                               0.0%  1939    0  37.7  34.7  89.3   7.7
 9. 205.251.232.96                              0.0%  1939    0  37.9  34.7 613.4  14.9
10. 205.251.232.153                             0.0%  1939    0  37.9  34.6  89.4   7.5
11. 205.251.232.63                              0.0%  1939    0  38.0  34.7  88.2   7.5
    205.251.232.165
12. 205.251.233.121                             0.0%  1939    0  50.8  28.4 135.2  21.2

So is there some sort of cron equivalent on the firewall that comes along every evening? Why does usw2 have issues and not use1? AIUI both sets of tunnels terminate on the same device.

From IRC discussion XioNoX is poking at it, so setting a needinfo to see what he finds.
Flags: needinfo?(arzhel)
On fw1.releng.scl3, the CPU of the SPC (the module in charge of IPsec and firewall sessions) is currently at ~80% usage with about 1Gbps of traffic and ~9000 sessions.
So far I think this issue is caused by a combination of high IPsec traffic and a high number of sessions, which overloads that CPU.

That firewall is supposed to handle 2Gbps of IPsec traffic, but the addition of firewall sessions might impact it more than expected.

We will need to monitor those values the next time this issue happens to confirm or rule that out. If it's confirmed, several solutions are possible, for example getting a 2nd identical module, or not using the IPsec tunnels for the connections from AWS to our public services (ftp/hg).
Flags: needinfo?(arzhel)
Only a moderate episode tonight, starting late at around 20:30.
(In reply to Arzhel Younsi [:XioNoX] from comment #14)

Ah hah! This seems likely from what I've seen. The next opportunity to verify is probably around 1900 Pacific on Monday, but it may occur earlier if there is a 'busy Monday' effect.
I've been looking at what we can do to reduce the traffic through the tunnels. The biggest sources of tunnel traffic will be 
*) retrieving source from hg.m.o  (often small, but can turn into 600MB d/l from ftp.m.o for a bundle)
*) uploading files to ftp.m.o after compile jobs complete (300-450MB each)
*) downloading files from ftp.m.o to run tests (many requests for data from upload)
Do we have any easily accessible metrics on this, btw?

Avoiding the tunnel for ftp and hg - As a proof of concept, I've used [1] to create an EC2 instance in a public subnet, which provides NAT to instances in the private subnets. This seems to work fine in a quick test, but wasn't any faster to d/l from ftp because the load on the tunnel and ftp were low when I was testing. I'll check it when more developers are awake and the systems are busy. On the nice side of the coin, a routing table is used to select the tunnel or NAT gateway, so we can opt individual subnets in or out (ie compilers vs testers vs support hosts). We also do ftp uploads & downloads on different IPs. Can't do things by port though, but passing through ssh seems to work OK. Access to hg-internal.dmz.scl3 will need to keep running down the tunnel, so may not get much win from hg.m.o.

We don't know yet what size instance we would need to get enough network I/O, or, since it adds a SPOF, how to make it robust/HA. Probably one beefy NAT box per availability zone per region. You can do failover fairly easily [2], which might be OK as we retry some things. Some ongoing costs involved (a few instances, no new traffic costs), plus additional complexity in managing the networking in AWS.

Alternatives - we could also look at some caching in AWS for the test downloads. Using squid would be one option, e.g. [3], but it will require some investigation. Could also roll our own ftp 'mirror' in each AWS region, and some sort of smart downloader that knows to try locally before heading further away. Even further off.

[1] http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_NAT_Instance.html
[2] http://aws.amazon.com/articles/2781451301784570
[3] https://aws.amazon.com/articles/5995712515781075
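For reference, a minimal sketch of the routing-table side of the NAT proof of concept above, using the AWS CLI. All IDs and the destination CIDR are placeholders, not our real values, and this isn't necessarily how the PoC was actually set up:

# Let the NAT instance forward traffic that isn't addressed to it.
aws ec2 modify-instance-attribute --instance-id i-0abc1234 --no-source-dest-check

# In the route table of one private subnet, send traffic for a public service
# (placeholder CIDR standing in for e.g. ftp.m.o) via the NAT instance instead
# of the IPsec tunnel; all other destinations keep their existing routes.
aws ec2 create-route --route-table-id rtb-0def5678 \
  --destination-cidr-block 203.0.113.0/24 \
  --instance-id i-0abc1234

Opting a subnet in or out is then just adding/removing that route, or pointing the subnet at a different route table.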
(In reply to Nick Thomas [:nthomas] from comment #17)
> I've been looking at what we can do to reduce the traffic through the
> tunnels. The biggest sources of tunnel traffic will be 
> *) retrieving source from hg.m.o  (often small, but can turn into 600MB d/l
> from ftp.m.o for a bundle)

Bug 851398 would help with this massively - we currently rm -rf most of the source directories (tools, mozharness, gecko, ...) and re-clone, instead of using hg purge.
Depends on: 851398
(In reply to Ed Morley [:edmorley UTC+0] from comment #18)
> (In reply to Nick Thomas [:nthomas] from comment #17)
> > I've been looking at what we can do to reduce the traffic through the
> > tunnels. The biggest sources of tunnel traffic will be 
> > *) retrieving source from hg.m.o  (often small, but can turn into 600MB d/l
> > from ftp.m.o for a bundle)
> 
> Bug 851398 would help with this massively - we currently rm -rf most of the
> source directories (tools, mozharness, gecko, ...) and re-clone, instead of
> using hg purge.

That's not entirely true. We rm -rf the build directory, but in most cases we have a local clone of gecko we can use and don't have to re-clone it from scratch.
Another non-standard episode tonight, barely noticeable starting at 19:30, but coming to a nice rolling boil at 21:00.
This hit us again hard and required another round of zillions of retriggers :)
I wrote a small script that pulls the SPU usage value from the firewall and writes it to a file. Next time it happens we will have more data.
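Not the actual script, but the shape of it is roughly this (cron on a monitoring host; the exact SRX command is an assumption on my part - it may just as well poll via SNMP):

*/5 * * * *	export d=`date`; ssh fw1.releng.scl3 'show security monitoring fpc 0' 2>&1 | while read line; do echo $d: $line; done >> logs/spu-usage.log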
Blocks: 934938
Case 2014-0117-0233 opened with Juniper to investigate possible performance issues on our side.
Depends on: 960571
We're back to sucking mightily, beyond belief, and every tree is closed again.
Closed all trees at 20:37.
:arzhel was looking into this last I knew
Assignee: nobody → arzhel
:arzhel informed me in IRC that he is on PTO, and this closes the trees for a few hours daily right now, so assigning to :adam for any rectification/remediation he can provide.
Assignee: arzhel → adam
Reopened, it fizzled out some unknowable time ago.
Observation from https://treestatus.mozilla.org/b2g-inbound - and discussion with sheriffs in #releng:
 - this is happening every weeknight PT (M-Th)
 - it impacts the throughput of build/testing every night
 - some nights the trees are not closed due to lighter overall load
 - all reports in bug 934938 are also due to this condition

(In reply to Arzhel Younsi [:XioNoX] from comment #23)
> Case 2014-0117-0233 opened with Juniper to investigate possible performance
> issues on our side.

Can we get an update on Juniper's feedback while :arzhel is on PTO (comment 27)?

What is the current plan for steps to resolution?
Flags: needinfo?(adam)
We have a case open with Juniper and we'll work toward resolution. This seems like we're going to have to make upgrades (buy more/new hardware, run dedicated lines, etc.) in order to push this harder. In the meantime can we get our SLAs from Amazon so that I know what they are promising from their side?
Flags: needinfo?(adam)
Impossible for me to tell whether it's actually recovered, or we've just run out of running jobs to fail. Reopened after four hours.
Only closed for three hours tonight.
(In reply to Phil Ringnalda (:philor) from comment #33)
> Only closed for three hours tonight.

Is this our first Friday night closure for this reason? I.e. any gut feeling that the issue is getting worse?
Flags: needinfo?(philringnalda)
I'd say it's worse, yes: I just merged mozilla-inbound and b2g-inbound to mozilla-central, and mozilla-central back to m-i, b-i and f-t, so we have a total of maybe 10 pushes and a thousand jobs running, less than half our normal full load, and we're hitting this on a Saturday night.
Flags: needinfo?(philringnalda)
This'll cut down my spamming of this bug, though: from now on, just assume that every tree is closed every day from 19:30 to 23:30.
One thing I do know that we do around the time when it all goes to pot (remember the thing that we do in the evening, that we're supposed to be looking for?) is trigger PGO builds on the tip of every tree that does periodic PGO, even if we've already built PGO on that revision, and trigger non-unified builds on the tip of every tree, even if we've already built non-unified on that revision. We do that at 18:00, but we might very well trigger tests on the PGO builds, upload the PGO builds and their logs, and upload the non-unified builds and their logs all around 19:30.

It can't be the cause, both because we also do that at 00:00, 06:00 and 12:00, and because we started doing that in the middle of January while this started somewhere between the Christmas break and early January (anything involving load wouldn't have shown up until January), but it certainly can be contributing to the degree of bustage.
Depends on: 964071
Trees closed again, not inclined to open them any time soon.
This / bug 934938 needs escalating asap, my patience is somewhat diminishing :-(
I'll send some emails out.
Severity: critical → blocker
Despite the suspicious timing, 12 hours before when this usually hits, I think the current episode is something distinct, since this doesn't involve 500s from ftp.m.o and doesn't involve Windows and Mac slaves.
c#39 was caused by Bug 963768 c#9
Depends on: 963104
We've decreased the priority of usw2, so we should be preferring use1 at this point.
Depends on: 964486
Depends on: 964497
Laura pinged me about this.

There is a question about whether or not we can "turn off IPSEC" in the interim. To be clear, turning off IPSEC is not a valid option. IPSEC is a requirement to connect to AWS VPCs. The only other option would be AWS Direct Connect, which has considerable lead time.

It seems there are a few options detailed in this etherpad, https://infra.etherpad.mozilla.org/aws-ipsec which include:

* Add hardware modules to the firewalls
* Moving FTP and HG traffic off the IPSEC tunnel
I've received a quote for the FW expansion and they're working on the lead time for the resources if it is determined that adding cards is our preferred direction.
Funny thing happened on the way to the nightly closure tonight: ordinarily, I only look at finished-and-failed jobs, so I don't even see scheduling, but at 21:00 (inconveniently one minute before it recovered) I noticed that we hadn't been scheduling any tests and builds for 30 minutes or so. Not surprisingly, once we scheduled 30 minutes' worth all in one huge wad so they all cloned and downloaded at once, we choked out the VPN.

Do we routinely have a 30 minute period, 20-30 minutes before this starts, when we don't schedule any jobs before scheduling them all at once, causing this to happen?
(In reply to Phil Ringnalda (:philor) from comment #44)
> I noticed that we
> hadn't been scheduling any tests and builds for 30 minutes or so. Not
> surprisingly, once we scheduled 30 minutes worth all in one huge wad so they
> all cloned and downloaded at once, we choked out the VPN.

Perhaps completely unrelated, but I noticed that some jobs I retriggered on a try job last night (the tree was open at this point) simply never happened. I retriggered them again and they came back green, but the earlier triggers appear to have been eaten.
There have been some (probably unrelated) problems with self-serve which would have caused that. See bug 963720.
I've queried the netflow data around the tree closure events; it seems that the top talkers just before the tunnel died were:

- v-1030.fw1.releng.scl3.mozilla.net - over 589Mbit/sec on average on this one alone. That's an external IP used for source NAT for a number of RelEng systems.
- ftp1-zlb.vips.scl3.mozilla.com - 170Mbit/sec
- ftp2-zlb.vips.scl3.mozilla.com - 168Mbit/sec

- 205.251.233.122
- 72.21.209.193
- hg-zlb.vips.scl3.mozilla.com
- upload-zlb.vips.scl3.mozilla.com
- pvtbuilds.pvt.build.mozilla.org

Note - this is netflow data, so by its very nature (statistical, sampling one packet in N) the actual flow rates might have been higher; the numbers quoted here are an estimate on the low side.
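If anyone wants to reproduce this sort of query and our collector stores nfcapd files, a top-talkers-by-bytes view for a time window looks roughly like this with nfdump (tool, path and time window here are assumptions, not necessarily what was actually used):

nfdump -R /path/to/nfcapd-files -t '2014/01/22.19:00:00-2014/01/22.21:00:00' -s ip/bytes -n 20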
Here is the list of subnets the RelEng VPC talks to (or the other way around). I used the fact that no subnet is smaller than a /24.

cat releng_vpc_ips_sorted_uniq_novpc.txt | cut -d '.' -f 1-3 | sort -n | uniq
10.12.128
10.12.130
10.12.131
10.12.132
10.12.133
10.12.135
10.12.48
10.12.49
10.12.50
10.12.51
10.12.52
10.22.248
10.22.70
10.22.74
10.22.75
10.22.81
10.250.48
10.26.40
10.26.41
10.26.48
10.26.56
10.26.57
10.26.75
10.26.80
10.26.84
10.8.74
10.8.75
10.8.81
I'm parsing the netflow data for 24h of traffic from last Wednesday.
A list of hostnames with their IPs is attached to the bug - see releng_vpc_ips_sorted_uniq_novpc_resolved.txt

For the record, generated with

for i in `cat releng_vpc_ips_sorted_uniq_novpc.txt`; do (echo -n $i; echo -n " "; dig +short -x $i) >> releng_vpc_ips_sorted_uniq_novpc_resolved.txt; done;
Attached the full netflow data I was parsing, should any of you feel like digging into it yourself: releng_vpc.csv.bz2, a bzip2 file of over 7MB.
1. It looks like it is a SPU meltdown - connectivity collapses frequently after a large data transfer starts, like from/to the FTP server. The SPU's theoretical IPSEC performance can only be achieved in ideal conditions; we can't confirm it, but it's standard marketing practice to measure throughput with an almost empty session table (i.e. almost no firewalling, or just nothing at all) and maximum-size packets. I haven't seen the SPU usage go under 82% since yesterday, which means we don't have a lot of headroom should any be necessary. NetOps can correct me here.

2. We do not know for sure that purchasing another two SPUs will resolve the problem; per the Juniper ticket, they aren't saying it will.

3. Comment #47 should give you an overview of what kind of communication is generating most of the traffic and is the number one candidate to move out of IPSEC and onto TLS, should we decide to go that route.

4. Comments #48 and #50 have a parsed list of hosts talking with the VPCs. The data are for a single 24h timeslot, last Wednesday. I can recreate it for any other period; just keep in mind it's a time-consuming and manual exercise, and querying netflow data for anything more than 48h takes a lot of time.
bug 964804 appears related to philor's note in comment 44.
Depends on: 964804
[13:28:29] < michal`> adam: one thing. say we can spread the single vpn session over two SPUs and we get a lot more throughput 
                      this way. I think we should rate limit the ZLB uploads and downloads a bit so it does not end up going as 
                      fast as it can and consuming what we just gained
[13:28:51] < adam> That's sound logic.

TL;DR we must figure out how fast we need the downloads and uploads to go and rate limit them, otherwise we might saturate the SPUs again. The limit should be enforced at the ZLB level.
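Purely as an illustration of the kind of cap being discussed (a host-level Linux traffic-shaping example, not the actual ZLB bandwidth-class config, and the 200mbit figure is made up):

# Cap outbound traffic on eth0 to ~200 Mbit/s with a token bucket filter.
tc qdisc add dev eth0 root tbf rate 200mbit burst 256kb latency 50ms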
Depends on: 964993
Depends on: 965001
Depends on: 965289
Attached is a CSV traffic summary that includes bytes transferred. This is sampled over a 3-hour period last night (approx 17:00 - 20:00 PST).

The rows are sorted by total bytes transferred. It's pretty obvious that HG sends a lot of traffic across the VPN, and that pvtbuilds receives a lot of traffic from across the VPN.

:michal` is commenting with more summary details.  See his comment.  :-)
Ben just fixed bug 964411 this morning, which should help with the HG traffic a bit.
The majority of the traffic is pvtbuilds and HG; I'll have more scientific data with numbers later today. After filtering out all the pvtbuilds+HG traffic I can't see more than 60-100Mbit/sec, with occasional spikes to around 200+.

Next step - breakdown by ports. Stay tuned.
Depends on: 965911
Attached the query output showing which instances use hg-internal over the IPSEC tunnel.
And after a nice long period of things being good, we're seeing timeouts from both ftp-ssl.m.o and pvtbuilds.pvt.build.m.o, all trees closed.
Attached image Smokeping Feb 6th
Best I can tell this was an issue within Amazon's network, which hurt usw2 only. 

This is the smokeping graph from admin1b.releng.scl3 to nagios1.private.releng.usw2, where there is bad packet loss but latency is still good. I checked the traffic levels on the AWS tunnels and fw1.releng.scl3 overall, and it wasn't anything higher than we dealt with all day, so that's consistent with the latency.

We have a cron running mtr from cruncher.srv.releng.scl3 to Amazon's side of the two usw2 tunnels, and saw packet loss at 198.32.176.36 (paix01-sfo4.amazon.com) between 02:42 and 05:40 UTC. That was only on the .122 tunnel, not the .121 tunnel. It only sends 5 packets, so the only possible results are 0%/20%/40%/... loss; it's not a very strong test, but there's still some info there.
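If we want finer granularity than 20% steps next time, one option (not applied, just a thought) would be to bump the cycle count in the existing cruncher crontab entries, e.g.:

* * * * *	export d=`date`; export ip=205.251.233.122; /usr/sbin/mtr --no-dns --raw --report --report-cycles=20 $ip 2>&1 | while read line; do echo $d: $line; done >> logs/$ip.mtr

20 cycles brings the loss resolution down to 5% and still finishes well within the minute.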
No longer depends on: 851398
No longer depends on: 963104
Seems the issue with AWS came back today, e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=34833613&tree=B2g-Inbound - trees closed till the machine/test recovers.
(In reply to Carsten Book [:Tomcat] from comment #63)
> Seems the issue with AWS came back today, e.g.
> https://tbpl.mozilla.org/php/getParsedLog.php?id=34833613&tree=B2g-Inbound -
> trees closed till the machine/test recovers.

This log shows:
02:50:17    ERROR -  ssh: Could not resolve hostname pvtbuilds.pvt.build.mozilla.org: Name or service not known

All the earlier comments in this bug talk about very slow download speeds, not DNS issues (though it's hard to be sure because all of the TBPL links have expired). I don't think this particular log is related to this.
All work on this particular issue has been completed. Please open new bugs for AWS issues, even if they appear to have similar symptoms.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard