Closed Bug 957502 Opened 10 years ago Closed 10 years ago

All trees closed due to timeouts as usw2 slaves download or clone

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: adam)

References

Details

Attachments

(7 files)

The bulk of the failures can be found from bug 934938 comment 65 on down (poor webops), but we're also seeing gaia tests time out while doing their poor version of cloning gaia-central without using hgtool, and Android builds timing out while downloading from http://runtime-binaries.pvt.build.mozilla.org/tooltool, and probably other timeouts I've already managed to forget.

I'm told that the high numbers of all the instances involved mean it's usw2 that's having troubles.
Device build failures like https://tbpl.mozilla.org/php/getParsedLog.php?id=32681302&tree=Mozilla-Central were the other thing I managed to forget.
Trying to reproduce https://tbpl.mozilla.org/php/getParsedLog.php?id=32680752&tree=Mozilla-Inbound I get 100-200KB/s, which results in a download time over 20 mins. Pretty sure that is what's leading to the timeouts on 1200s.
ftp.m.o seems healthy, based on the total traffic going out of it. From SCL1 I get 2MB/s, and similar from USE1, but transfers in and out of USW2 are slow.
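As a rough sanity check on the timeout math (back-of-envelope, assuming a ~300MB package of the sort compile jobs upload): 300MB at 200KB/s is about 1500s, already past the 1200s timeout, while at 2MB/s the same transfer takes roughly 150s.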

XioNoX, could you check the health of the VPN endpoint on our side? Otherwise we need to raise a support case with Amazon.

I'll look into disabling USW2 in the meantime.
Flags: needinfo?(arzhel)
Attached image Smokeping Jan 7th
We believe this was due to packet loss between our network and the EC2 instances in USW2; see the attached graph for elevated loss rates during the period when builds were timing out. The loss rate has now fallen back to 0% and we can re-open the trees.

We can't tell at this point if it's loss on the internet between us and Amazon, or within Amazon itself. If it happens again we can ssh into cruncher to run 
  mtr 205.251.233.121
and
  mtr 205.251.233.122
for the two halves of the tunnel. That should simulate the path the tunnel packets take between us and Amazon.

Many thanks to XioNoX for his support on IRC.
Flags: needinfo?(arzhel)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
What do we, or Amazon, do around 19:30ish? We're currently having another episode; the earliest affected jobs I've seen started at 19:36.
Eh, might as well get a jump on it, and just reclose every tree right now.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reopened, looks like the problem fizzled out around 21:00, so merely critical until it happens again tomorrow at 19:30 and it becomes a blocker again.
Severity: blocker → critical
Attachment #8357022 - Attachment description: Smokeping → Smokeping Jan 7th
Attached image Smokeping Jan 8th
Unfortunately the 20+ min lag in noticing the problem meant we weren't able to traceroute to a cause.
Note, UTC times on those graphs.
I'm logging the output of mtr from cruncher once per minute for the two addresses listed in comment 4.

Logs are at buildduty@cruncher:logs/*.mtr

New buildduty@cruncher crontab entries:
* * * * *	export d=`date`; export ip=205.251.233.122; /usr/sbin/mtr --no-dns --raw --report --report-cycles=5 $ip 2>&1 | while read line; do echo $d: $line; done >> logs/$ip.mtr
* * * * *	export d=`date`; export ip=205.251.233.121; /usr/sbin/mtr --no-dns --raw --report --report-cycles=5 $ip 2>&1 | while read line; do echo $d: $line; done >> logs/$ip.mtr


Example output:

Thu Jan 9 08:26:01 PST 2014: HOST: cruncher.srv.releng.scl3.mo Loss% Snt Last Avg Best Wrst StDev
Thu Jan 9 08:26:01 PST 2014: 1. 10.26.48.1 0.0% 5 0.6 2.3 0.5 9.2 3.8
Thu Jan 9 08:26:01 PST 2014: 2. 63.245.214.81 0.0% 5 7.1 3.7 1.6 7.1 2.2
Thu Jan 9 08:26:01 PST 2014: 3. 63.245.214.85 0.0% 5 1.0 1.1 1.0 1.2 0.1
Thu Jan 9 08:26:01 PST 2014: 4. 63.245.219.129 0.0% 5 1.3 1.8 1.2 3.5 1.0
Thu Jan 9 08:26:01 PST 2014: 5. 64.125.170.33 0.0% 5 1.3 1.4 1.2 1.5 0.1
Thu Jan 9 08:26:01 PST 2014: 6. 64.125.31.61 0.0% 5 1.3 1.6 1.3 2.4 0.5
Thu Jan 9 08:26:01 PST 2014: 7. aws bin 100.0 5 0.0 0.0 0.0 0.0 0.0
Thu Jan 9 08:26:01 PST 2014: 8. 205.251.229.2 0.0% 5 37.6 47.5 35.1 72.1 14.9
Thu Jan 9 08:26:01 PST 2014: 9. 205.251.232.66 0.0% 5 35.4 42.4 35.0 68.2 14.5
Thu Jan 9 08:26:01 PST 2014: 10. 205.251.232.149 0.0% 5 35.5 41.2 35.1 65.0 13.3
Thu Jan 9 08:26:01 PST 2014: 11. 205.251.232.167 0.0% 5 35.4 37.7 35.3 46.5 4.9
Thu Jan 9 08:26:01 PST 2014: 12. 205.251.233.122 0.0% 5 35.3 32.7 28.4 35.6 3.8
seems this closed the trees again
From #releng for posterity:

[19:28]	<nthomas>	I would hazard a guess that network load is a little lower than the last two days, so we'll see if my hypothesis survives
[19:33]	seeing some loss on our firewall in scl3 though
[19:36]	bgp flap just now
[19:37]	oh, use1, n/m
[19:42]	<philor>	https://tbpl.mozilla.org/php/getParsedLog.php?id=32798013&tree=B2g-Inbound
	perhaps the storm moved east
[19:43]	<nthomas>	was just going to say that I'm seeing 5% packet loss on a firewall on our side
[19:44]	and only 200KB/s sucking data from ftp into usw2
[19:47]	<philor>	and there's three timeouts cloning gaia-central, I think it's time for the nightly closure
[20:37]	fascinating, tons of gaia-central clone timeouts and nothing else
[20:38]	<nthomas>	maybe they're the canary
	getting about 600KB/s from ftp.m.o at the mo
	or even as much as 2MB/s
	so probably we're over the worst of it
[20:39]	<philor>	yeah, I figured canary and closed, but it's been over an hour without any mass burning of linux/b2g tests
	<nthomas>	could do with a gaia-central bundle I think
[20:40]	<philor>	and some retrying use of hgtool
[20:41]	<nthomas>	set the bug to blocker if you need any help, not that I have any knobs to twiddle on this
I was looking into the problem during the day today. At one point I had myself convinced that high traffic through the tunnels (~1Gbps) corresponded to smokeping measuring 500ms ping times and packet loss of 5-10%. We didn't see that around 7:30 though - ping stayed low and traffic was maybe half the peak value. We also didn't see issues with those parameters earlier in the day.

Did manage to get a mtr run, and it fingered fw1.releng.scl3 (where the tunnels terminate):
                                 My traceroute  [v0.75]
cruncher.srv.releng.scl3.mozilla.com (0.0.0.0)                 Thu Jan  9 19:59:33 2014
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                 Packets               Pings
 Host                                          Loss%   Snt Drop   Avg  Best  Wrst StDev
 1. 10.26.48.1                                  5.5%  1940  107   1.6   0.5  27.3   3.4
 2. v-1030.core1.releng.scl3.mozilla.net        0.0%  1940    0   4.0   1.2  54.4   6.6
 3. v-1031.border1.scl3.mozilla.net             0.0%  1939    0   3.5   0.8  68.0   7.8
 4. xe-1-2-0.border1.pao1.mozilla.net           0.0%  1939    0  17.7   1.2  95.2  16.2
 5. xe-3-1-0.mpr2.pao1.us.above.net             0.0%  1939    0   5.3   1.2  93.3  10.3
 6. xe-1-0-0.mpr1.pao1.us.above.net             0.0%  1939    0   5.5   1.2 113.5  12.5
 7. ???
 8. 205.251.229.4                               0.0%  1939    0  37.7  34.7  89.3   7.7
 9. 205.251.232.96                              0.0%  1939    0  37.9  34.7 613.4  14.9
10. 205.251.232.153                             0.0%  1939    0  37.9  34.6  89.4   7.5
11. 205.251.232.63                              0.0%  1939    0  38.0  34.7  88.2   7.5
    205.251.232.165
12. 205.251.233.121                             0.0%  1939    0  50.8  28.4 135.2  21.2

So is there some sort of cron equivalent on the firewall that comes along every evening? Why does usw2 have issues and not use1? AIUI both sets of tunnels terminate on the same device.

From IRC discussion XioNoX is poking at it, so setting a needinfo to see what he finds.
Flags: needinfo?(arzhel)
On fw1.releng.scl3, the CPU of the SPC (the module in charge of IPsec and firewall sessions) is currently at ~80% usage with about 1Gbps of traffic and ~9000 sessions.
So far I think this issue is caused by a combination of high IPsec traffic and a high number of sessions, which overloads that CPU.

That firewall is supposed to handle 2Gbps of IPsec traffic, but the addition of firewall sessions might impact it more than expected.

We will need to monitor those values the next time this issue happens to confirm or rule that out. If it's confirmed, several solutions are possible, for example getting a 2nd identical module, or not using the IPsec tunnels for the connections from AWS to our public services (ftp/hg).
Flags: needinfo?(arzhel)
Only a moderate episode tonight, starting late at around 20:30.
(In reply to Arzhel Younsi [:XioNoX] from comment #14)

Ah hah! This seems likely from what I've seen. The next opportunity to verify is probably around 1900 Pacific on Monday, but it may occur earlier if there is a 'busy Monday' effect.
I've been looking at what we can do to reduce the traffic through the tunnels. The biggest sources of tunnel traffic will be 
*) retrieving source from hg.m.o  (often small, but can turn into 600MB d/l from ftp.m.o for a bundle)
*) uploading files to ftp.m.o after compile jobs complete (300-450MB each)
*) downloading files from ftp.m.o to run tests (many requests for data from upload)
Do we have any easily accessible metrics on this, btw?

Avoiding the tunnel for ftp and hg - As a proof of concept, I've used [1] to create an EC2 instance in a public subnet, which provides NAT to instances in the private subnets. This seems to work fine in a quick test, but wasn't any faster to d/l from ftp because the load on the tunnel and ftp were low when I was testing. I'll check it when more developers are awake and the systems are busy. On the nice side of the coin, a routing table is used to select the tunnel or NAT gateway, so we can opt individual subnets in or out (ie compilers vs testers vs support hosts). We also do ftp uploads & downloads on different IPs. Can't do things by port though, but passing through ssh seems to work OK. Access to hg-internal.dmz.scl3 will need to keep running down the tunnel, so may not get much win from hg.m.o.

We don't know yet what size instance we would need to get enough network I/O, or, since it adds a SPOF, how to make it robust/HA. Probably one beefy NAT box per availability zone per region. You can do failover fairly easily [2], which might be OK as we retry some things. Some ongoing costs involved (a few instances, no new traffic costs), plus additional complexity in managing the networking in AWS.

Alternatives - we could also look at some caching in AWS for the test downloads. Using squid would be one option, e.g. [3], but it will require some investigation. Could also roll our own ftp 'mirror' in each AWS region, and some sort of smart downloader that knows to try locally before heading further away. Even further off.

[1] http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_NAT_Instance.html
[2] http://aws.amazon.com/articles/2781451301784570
[3] https://aws.amazon.com/articles/5995712515781075
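For reference, a minimal sketch of the routing-table side of the NAT proof of concept above, using the AWS CLI. All IDs and the destination CIDR are placeholders, not our real values, and this isn't necessarily how the PoC was actually set up:

# Let the NAT instance forward traffic that isn't addressed to it.
aws ec2 modify-instance-attribute --instance-id i-0abc1234 --no-source-dest-check

# In the route table of one private subnet, send traffic for a public service
# (placeholder CIDR standing in for e.g. ftp.m.o) via the NAT instance instead
# of the IPsec tunnel; all other destinations keep their existing routes.
aws ec2 create-route --route-table-id rtb-0def5678 \
  --destination-cidr-block 203.0.113.0/24 \
  --instance-id i-0abc1234

Opting a subnet in or out is then just adding/removing that route, or pointing the subnet at a different route table.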
(In reply to Nick Thomas [:nthomas] from comment #17)
> I've been looking at what we can do to reduce the traffic through the
> tunnels. The biggest sources of tunnel traffic will be 
> *) retrieving source from hg.m.o  (often small, but can turn into 600MB d/l
> from ftp.m.o for a bundle)

Bug 851398 would help with this massively - we currently rm -rf most of the source directories (tools, mozharness, gecko, ...) and re-clone, instead of using hg purge.
Depends on: 851398
(In reply to Ed Morley [:edmorley UTC+0] from comment #18)
> (In reply to Nick Thomas [:nthomas] from comment #17)
> > I've been looking at what we can do to reduce the traffic through the
> > tunnels. The biggest sources of tunnel traffic will be 
> > *) retrieving source from hg.m.o  (often small, but can turn into 600MB d/l
> > from ftp.m.o for a bundle)
> 
> Bug 851398 would help with this massively - we currently rm -rf most of the
> source directories (tools, mozharness, gecko, ...) and re-clone, instead of
> using hg purge.

That's not entirely true. We rm -rf the build directory, but in most cases we have a local clone of gecko we can use and don't have to re-clone it from scratch.
Another non-standard episode tonight, barely noticeable starting at 19:30, but coming to a nice rolling boil at 21:00.
This hit us again hard and required another round of zillions of retriggers :)
I wrote a small script that pulls the SPU usage value from the firewall and writes it to a file. Next time it happens we will have more data.
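Not the actual script, but the shape of it is roughly this (cron on a monitoring host; the exact SRX command is an assumption on my part - it may just as well poll via SNMP):

*/5 * * * *	export d=`date`; ssh fw1.releng.scl3 'show security monitoring fpc 0' 2>&1 | while read line; do echo $d: $line; done >> logs/spu-usage.log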
Blocks: 934938
Case 2014-0117-0233 opened with Juniper to investigate possible performance issues on our side.
Depends on: 960571
We're back to sucking mightily, beyond belief, and every tree is closed again.
Closed all trees at 20:37.
:arzhel was looking into this last I knew
Assignee: nobody → arzhel
:arzhel informed me in IRC that he is on PTO, and this closes the trees for a few hours daily right now, so assigning to :adam for any rectification/remediation he can provide.
Assignee: arzhel → adam
Reopened, it fizzled out some unknowable time ago.
Observation from https://treestatus.mozilla.org/b2g-inbound - and discussion with sheriffs in #releng:
 - this is happening every weeknight PT (M-Th)
 - it impacts the throughput of build/testing every night
 - some nights the trees are not closed due to lighter overall load
 - all reports in bug 934938 are also due to this condition

(In reply to Arzhel Younsi [:XioNoX] from comment #23)
> Case 2014-0117-0233 opened with Juniper to investigate possible performance
> issues on our side.

Can we get an update on Juniper's feedback while :arzhel is on PTO (comment 27)?

What is the current plan for steps to resolution?
Flags: needinfo?(adam)
We have a case open with Juniper and we'll work toward resolution. This seems like we're going to have to make upgrades (buy more/new hardware, run dedicated lines, etc.) in order to push this harder. In the meantime can we get our SLAs from Amazon so that I know what they are promising from their side?
Flags: needinfo?(adam)
Impossible for me to tell whether it's actually recovered, or we've just run out of running jobs to fail. Reopened after four hours.
Only closed for three hours tonight.
(In reply to Phil Ringnalda (:philor) from comment #33)
> Only closed for three hours tonight.

Is this our first Friday night closure for this reason? I.e. any gut feeling that the issue is getting worse?
Flags: needinfo?(philringnalda)
I'd say it's worse, yes: I just merged mozilla-inbound and b2g-inbound to mozilla-central, and mozilla-central back to m-i, b-i and f-t, so we have a total of maybe 10 pushes and a thousand jobs running, less than half our normal full load, and we're hitting this on a Saturday night.
Flags: needinfo?(philringnalda)
This'll cut down my spamming of this bug, though: from now on, just assume that every tree is closed every day from 19:30 to 23:30.
One thing I do know that we do around the time when it all goes to pot (remember the thing that we do in the evening, that we're supposed to be looking for?) is trigger PGO builds on the tip of every tree that does periodic PGO, even if we've already built PGO on that revision, and trigger non-unified builds on the tip of every tree, even if we've already built non-unified on that revision. We do that at 18:00, but we might very well trigger tests on the PGO builds, upload the PGO builds and their logs, and upload the non-unified builds and their logs all around 19:30.

It can't be the cause, both because we also do that at 00:00, 06:00 and 12:00, and because we started doing that in the middle of January while this started somewhere between the Christmas break and early January (anything involving load wouldn't have shown up until January), but it certainly can be contributing to the degree of bustage.
Depends on: 964071
Trees closed again, not inclined to open them any time soon.
This / bug 934938 needs escalating asap, my patience is somewhat diminishing :-(
I'll send some emails out.
Severity: critical → blocker
Despite the suspicious timing, 12 hours before when this usually hits, I think the current episode is something distinct, since this doesn't involve 500s from ftp.m.o and doesn't involve Windows and Mac slaves.
c#39 was caused by Bug 963768 c#9
Depends on: 963104
We've decreased the priority of usw2, so we should be preferring use1 at this point.
Depends on: 964486
Depends on: 964497
Laura pinged me about this.

There is a question about whether or not we can "turn off IPSEC" in the interim. To be clear, turning off IPSEC is not a valid option. IPSEC is a requirement to connect to AWS VPCs. The only other option would be AWS Direct Connect, which has considerable lead time.

It seems there are a few options detailed in this etherpad, https://infra.etherpad.mozilla.org/aws-ipsec which include:

* Add hardware modules to the firewalls
* Moving FTP and HG traffic off the IPSEC tunnel
I've received a quote for the FW expansion and they're working on the lead time for the resources if it is determined that adding cards is our preferred direction.
Funny thing happened on the way to the nightly closure tonight: ordinarily, I only look at finished-and-failed jobs, so I don't even see scheduling, but at 21:00 (inconveniently one minute before it recovered) I noticed that we hadn't been scheduling any tests and builds for 30 minutes or so. Not surprisingly, once we scheduled 30 minutes' worth all in one huge wad so they all cloned and downloaded at once, we choked out the VPN.

Do we routinely have a 30 minute period, 20-30 minutes before this starts, when we don't schedule any jobs before scheduling them all at once, causing this to happen?
(In reply to Phil Ringnalda (:philor) from comment #44)
> I noticed that we
> hadn't been scheduling any tests and builds for 30 minutes or so. Not
> surprisingly, once we scheduled 30 minutes worth all in one huge wad so they
> all cloned and downloaded at once, we choked out the VPN.

Perhaps completely unrelated, but I noticed that some jobs I retriggered on a try job last night (the tree was open at this point) simply never happened. I retriggered them again and they came back green, but the earlier triggers appear to have been eaten.
There have been some (probably unrelated) problems with self-serve which would have caused that. See bug 963720.
I've queried the netflow data around the tree closure events; it seems that the top talkers just before the tunnel died were:

- v-1030.fw1.releng.scl3.mozilla.net - over 589Mbit/sec on average on this one alone. That's an external IP used for source NAT for a number of RelEng systems.
- ftp1-zlb.vips.scl3.mozilla.com - 170Mbit/sec
- ftp2-zlb.vips.scl3.mozilla.com - 168Mbit/sec

- 205.251.233.122
- 72.21.209.193
- hg-zlb.vips.scl3.mozilla.com
- upload-zlb.vips.scl3.mozilla.com
- pvtbuilds.pvt.build.mozilla.org

Note - this is netflow data, so by its very nature (statistical, sampling one packet in N) the actual flow rates might have been higher; the numbers quoted here are an estimate on the low side.
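If anyone wants to reproduce this sort of query and our collector stores nfcapd files, a top-talkers-by-bytes view for a time window looks roughly like this with nfdump (tool, path and time window here are assumptions, not necessarily what was actually used):

nfdump -R /path/to/nfcapd-files -t '2014/01/22.19:00:00-2014/01/22.21:00:00' -s ip/bytes -n 20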
Here is the list of subnets the RelEng VPC talks to (or the other way around). I used the fact that no subnet is smaller than a /24.

cat releng_vpc_ips_sorted_uniq_novpc.txt | cut -d '.' -f 1-3 | sort -n | uniq
10.12.128
10.12.130
10.12.131
10.12.132
10.12.133
10.12.135
10.12.48
10.12.49
10.12.50
10.12.51
10.12.52
10.22.248
10.22.70
10.22.74
10.22.75
10.22.81
10.250.48
10.26.40
10.26.41
10.26.48
10.26.56
10.26.57
10.26.75
10.26.80
10.26.84
10.8.74
10.8.75
10.8.81
I'm parsing the netflow data for 24h of traffic from last Wednesday.
A list of hostnames with their IPs is attached to the bug - see releng_vpc_ips_sorted_uniq_novpc_resolved.txt

For the record, generated with

for i in `cat releng_vpc_ips_sorted_uniq_novpc.txt`; do (echo -n $i; echo -n " "; dig +short -x $i) >> releng_vpc_ips_sorted_uniq_novpc_resolved.txt; done;
Attached the full netflow data I was parsing, should any of you feel like digging into it yourself: releng_vpc.csv.bz2, a bzip2 file of over 7MB.
1. It looks like it is a SPU meltdown - connectivity collapses frequently after a large data transfer starts, like from/to the FTP server. The SPU's theoretical IPSEC performance can only be achieved in ideal conditions; we can't confirm it, but it's standard marketing practice to measure throughput with an almost empty session table (i.e. almost no firewalling, or just nothing at all) and maximum-size packets. I haven't seen the SPU usage go under 82% since yesterday, which means we don't have a lot of headroom should any be necessary. NetOps can correct me here.

2. We do not know for sure that purchasing another two SPUs will resolve the problem; per the Juniper ticket, they aren't saying it will.

3. Comment #47 should give you an overview of what kind of communication is generating most of the traffic and is the number one candidate to move out of IPSEC and onto TLS, should we decide to go that route.

4. Comments #48 and #50 have a parsed list of hosts talking with the VPCs. The data are for a single 24h timeslot, last Wednesday. I can recreate it for any other period; just keep in mind it's a time-consuming and manual exercise, and querying netflow data for anything more than 48h takes a lot of time.
bug 964804 appears related to philor's note in comment 44.
Depends on: 964804
[13:28:29] < michal`> adam: one thing. say we can spread the single vpn session over two SPUs and we get a lot more throughput 
                      this way. I think we should rate limit the ZLB uploads and downloads a bit so it does not end up going as 
                      fast as it can and consuming what we just gained
[13:28:51] < adam> That's sound logic.

TL;DR we must figure out how fast we need the downloads and uploads to go and rate limit them, otherwise we might saturate the SPUs again. The limit should be enforced at the ZLB level.
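Purely as an illustration of the kind of cap being discussed (a host-level Linux traffic-shaping example, not the actual ZLB bandwidth-class config, and the 200mbit figure is made up):

# Cap outbound traffic on eth0 to ~200 Mbit/s with a token bucket filter.
tc qdisc add dev eth0 root tbf rate 200mbit burst 256kb latency 50ms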
Depends on: 964993
Depends on: 965001
Depends on: 965289
Attached is a CSV traffic summary that includes bytes transferred. This is sampled over a 3-hour period last night (approx 17:00 - 20:00 PST).

The rows are sorted by total bytes transferred. It's pretty obvious that HG sends a lot of traffic across the VPN, and that pvtbuilds receives a lot of traffic from across the VPN.

:michal` is commenting with more summary details.  See his comment.  :-)
Ben just fixed bug 964411 this morning, which should help with the HG traffic a bit.
The majority of the traffic is pvtbuilds and HG; I'll have more scientific data with numbers later today. After filtering out all the pvtbuilds+HG traffic I can't see more than 60-100Mbit/sec, with occasional spikes to around 200+.

Next step - breakdown by ports. Stay tuned.
Depends on: 965911
Attached the query output showing which instances use hg-internal over the IPSEC tunnel.
And after a nice long period of things being good, we're seeing timeouts from both ftp-ssl.m.o and pvtbuilds.pvt.build.m.o, all trees closed.
Attached image Smokeping Feb 6th
Best I can tell this was an issue within Amazon's network, which hurt usw2 only. 

This is the smokeping graph from admin1b.releng.scl3 to nagios1.private.releng.usw2, where there is bad packet loss but latency is still good. I checked the traffic levels on the AWS tunnels and fw1.releng.scl3 overall, and it wasn't anything higher than we dealt with all day, so that's consistent with the latency.

We have a cron running mtr from cruncher.srv.releng.scl3 to Amazon's side of the two usw2 tunnels, and saw packet loss at 198.32.176.36 (paix01-sfo4.amazon.com) between 02:42 and 05:40 UTC. That was only on the .122 tunnel, not the .121 tunnel. It only sends 5 packets, so the only possible results are 0%/20%/40%/... loss; it's not a very strong test, but there's still some info there.
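If we want finer granularity than 20% steps next time, one option (not applied, just a thought) would be to bump the cycle count in the existing cruncher crontab entries, e.g.:

* * * * *	export d=`date`; export ip=205.251.233.122; /usr/sbin/mtr --no-dns --raw --report --report-cycles=20 $ip 2>&1 | while read line; do echo $d: $line; done >> logs/$ip.mtr

20 cycles brings the loss resolution down to 5% and still finishes well within the minute.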
No longer depends on: 851398
No longer depends on: 963104
Seems the issue with AWS came back today, e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=34833613&tree=B2g-Inbound - trees closed till the machine/test recovers.
(In reply to Carsten Book [:Tomcat] from comment #63)
> Seems the issue with AWS came back today, e.g.
> https://tbpl.mozilla.org/php/getParsedLog.php?id=34833613&tree=B2g-Inbound -
> trees closed till the machine/test recovers.

This log shows:
02:50:17    ERROR -  ssh: Could not resolve hostname pvtbuilds.pvt.build.mozilla.org: Name or service not known

All the earlier comments in this bug talk about very slow download speeds, not DNS issues (though it's hard to be sure because all of the TBPL links have expired). I don't think this particular log is related to this.
All work on this particular issue has been completed. Please open new bugs for AWS issues, even if they appear to have similar symptoms.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard