Closed
Bug 957502
Opened 11 years ago
Closed 11 years ago
All trees closed due to timeouts as usw2 slaves download or clone
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: adam)
References
Details
Attachments
(7 files)
The bulk of the failures can be found from bug 934938 comment 65 on down (poor webops), but we're also seeing gaia tests time out while doing their poor version of cloning gaia-central without using hgtool, and Android builds timing out while downloading from http://runtime-binaries.pvt.build.mozilla.org/tooltool, and probably other timeouts I've already managed to forget.
I'm told that the high numbers of all the instances involved mean it's usw2 that's having troubles.
Reporter
Comment 1•11 years ago
Device build failures like https://tbpl.mozilla.org/php/getParsedLog.php?id=32681302&tree=Mozilla-Central, that was what else I managed to forget.
Comment 2•11 years ago
Trying to reproduce https://tbpl.mozilla.org/php/getParsedLog.php?id=32680752&tree=Mozilla-Inbound I get 100-200KB/s, which results in a download time of over 20 mins. Pretty sure that's what's leading to the timeouts at 1200s.
Comment 3•11 years ago
ftp.m.o seems healthy, based on the amount of total traffic going out of it. From SCL1 I get 2MB/s, and similar from USE1. But transfer in and out of USW2 is slow.
XioNoX, could you check the health of the VPN endpoint on our side? Otherwise we need to raise a support case with Amazon.
I'll look into disabling USW2 in the meantime.
Flags: needinfo?(arzhel)
Comment 4•11 years ago
We believe this was due to packet loss between our network and the EC2 instances in USW2; see the attached graph for elevated loss rates during the period when builds were timing out. The loss rate has now fallen back to 0% and we can re-open the trees.
We can't tell at this point if it's loss on the internet between us and Amazon, or within Amazon itself. If it happens again we can ssh into cruncher to run
mtr 205.251.233.121
and
mtr 205.251.233.122
for the two halves of the tunnel. That should simulate the path the tunnel packets take between us and Amazon.
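If running these by hand, report mode gives a pasteable summary instead of the interactive display; something like:
# --report prints a summary after the given number of cycles; --no-dns skips reverse lookups on the tunnel hops
mtr --no-dns --report --report-cycles=10 205.251.233.121
mtr --no-dns --report --report-cycles=10 205.251.233.122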
Many thanks to XioNoX for his support on IRC.
Flags: needinfo?(arzhel)
Updated•11 years ago
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter
Comment 5•11 years ago
What do we or Amazon do around 19:30ish? We're currently having another episode, with the earliest jobs I've seen having started at 19:36.
Reporter
Comment 6•11 years ago
Eh, might as well get a jump on it, and just reclose every tree right now.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 7•11 years ago
Reopened, looks like the problem fizzled out around 21:00, so merely critical until it happens again tomorrow at 19:30 and it becomes a blocker again.
Severity: blocker → critical
Updated•11 years ago
Attachment #8357022 - Attachment description: Smokeping → Smokeping Jan 7th
Comment 8•11 years ago
Unfortunately the 20+ min lag in noticing the problem meant we weren't able to traceroute to a cause.
Comment 9•11 years ago
Note, UTC times on those graphs.
Comment 10•11 years ago
I'm logging the output of mtr from cruncher once per minute for the two addresses listed in comment 4.
Logs are at buildduty@cruncher:logs/*.mtr
New buildduty@cruncher crontab entries:
* * * * * export d=`date`; export ip=205.251.233.122; /usr/sbin/mtr --no-dns --raw --report --report-cycles=5 $ip 2>&1 | while read line; do echo $d: $line; done >> logs/$ip.mtr
* * * * * export d=`date`; export ip=205.251.233.121; /usr/sbin/mtr --no-dns --raw --report --report-cycles=5 $ip 2>&1 | while read line; do echo $d: $line; done >> logs/$ip.mtr
Example output:
Thu Jan 9 08:26:01 PST 2014: HOST: cruncher.srv.releng.scl3.mo Loss% Snt Last Avg Best Wrst StDev
Thu Jan 9 08:26:01 PST 2014: 1. 10.26.48.1 0.0% 5 0.6 2.3 0.5 9.2 3.8
Thu Jan 9 08:26:01 PST 2014: 2. 63.245.214.81 0.0% 5 7.1 3.7 1.6 7.1 2.2
Thu Jan 9 08:26:01 PST 2014: 3. 63.245.214.85 0.0% 5 1.0 1.1 1.0 1.2 0.1
Thu Jan 9 08:26:01 PST 2014: 4. 63.245.219.129 0.0% 5 1.3 1.8 1.2 3.5 1.0
Thu Jan 9 08:26:01 PST 2014: 5. 64.125.170.33 0.0% 5 1.3 1.4 1.2 1.5 0.1
Thu Jan 9 08:26:01 PST 2014: 6. 64.125.31.61 0.0% 5 1.3 1.6 1.3 2.4 0.5
Thu Jan 9 08:26:01 PST 2014: 7. aws bin 100.0 5 0.0 0.0 0.0 0.0 0.0
Thu Jan 9 08:26:01 PST 2014: 8. 205.251.229.2 0.0% 5 37.6 47.5 35.1 72.1 14.9
Thu Jan 9 08:26:01 PST 2014: 9. 205.251.232.66 0.0% 5 35.4 42.4 35.0 68.2 14.5
Thu Jan 9 08:26:01 PST 2014: 10. 205.251.232.149 0.0% 5 35.5 41.2 35.1 65.0 13.3
Thu Jan 9 08:26:01 PST 2014: 11. 205.251.232.167 0.0% 5 35.4 37.7 35.3 46.5 4.9
Thu Jan 9 08:26:01 PST 2014: 12. 205.251.233.122 0.0% 5 35.3 32.7 28.4 35.6 3.8
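To scan those logs later for bad minutes, something along these lines should work (a rough sketch, assuming the report-style format shown above, where the loss percentage is the 9th whitespace-separated field):
# Print the timestamp and loss% for every sample where the far tunnel endpoint reported non-zero loss.
awk '/205\.251\.233\.122 / && $9 != "0.0%" {print $1, $2, $3, $4, $9}' logs/205.251.233.122.mtr
awk '/205\.251\.233\.121 / && $9 != "0.0%" {print $1, $2, $3, $4, $9}' logs/205.251.233.121.mtr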
Comment 11•11 years ago
seems this closed the trees again
From #releng for posterity:
[19:28] <nthomas> I would hazard a guess that network load is a little lower than the last two days, so we'll see if my hypothesis survives
[19:33] seeing some loss on our firewall in scl3 though
[19:36] bgp flap just now
[19:37] oh, use1, n/m
[19:42] <philor> https://tbpl.mozilla.org/php/getParsedLog.php?id=32798013&tree=B2g-Inbound
perhaps the storm moved east
[19:43] <nthomas> was just going to say that I'm seeing 5% packet loss on a firewall on our side
[19:44] and only 200KB/s sucking data from ftp into usw2
[19:47] <philor> and there's three timeouts cloning gaia-central, I think it's time for the nightly closure
[20:37] fascinating, tons of gaia-central clone timeouts and nothing else
[20:38] <nthomas> maybe they're the canary
getting about 600KB/s from ftp.m.o at the mo
or even as much as 2MB/s
so probably we're over the worst of it
[20:39] <philor> yeah, I figured canary and closed, but it's been over an hour without any mass burning of linux/b2g tests
<nthomas> could do with a gaia-central bundle I think
[20:40] <philor> and some retrying use of hgtool
[20:41] <nthomas> set the bug to blocker if you need any help, not that I have any knobs to twiddle on this
Comment 13•11 years ago
I was looking into the problem during the day today. At one point I had myself convinced that high traffic through the tunnels (~1Gbps) corresponded to smokeping measuring 500ms ping times and packet loss of 5-10%. We didn't see that around 7:30 though - ping stayed low and traffic was maybe half the peak value. We didn't see issues with those parameters earlier in the day.
Did manage to get an mtr run, and it fingered fw1.releng.scl3 (where the tunnels terminate):
My traceroute [v0.75]
cruncher.srv.releng.scl3.mozilla.com (0.0.0.0) Thu Jan 9 19:59:33 2014
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Drop Avg Best Wrst StDev
1. 10.26.48.1 5.5% 1940 107 1.6 0.5 27.3 3.4
2. v-1030.core1.releng.scl3.mozilla.net 0.0% 1940 0 4.0 1.2 54.4 6.6
3. v-1031.border1.scl3.mozilla.net 0.0% 1939 0 3.5 0.8 68.0 7.8
4. xe-1-2-0.border1.pao1.mozilla.net 0.0% 1939 0 17.7 1.2 95.2 16.2
5. xe-3-1-0.mpr2.pao1.us.above.net 0.0% 1939 0 5.3 1.2 93.3 10.3
6. xe-1-0-0.mpr1.pao1.us.above.net 0.0% 1939 0 5.5 1.2 113.5 12.5
7. ???
8. 205.251.229.4 0.0% 1939 0 37.7 34.7 89.3 7.7
9. 205.251.232.96 0.0% 1939 0 37.9 34.7 613.4 14.9
10. 205.251.232.153 0.0% 1939 0 37.9 34.6 89.4 7.5
11. 205.251.232.63 0.0% 1939 0 38.0 34.7 88.2 7.5
205.251.232.165
12. 205.251.233.121 0.0% 1939 0 50.8 28.4 135.2 21.2
So is there some sort of cron-equivalent on the firewall that comes along every evening? Why does usw2 have issues and not use1? AIUI both sets of tunnels terminate on the same device.
From IRC discussion XioNoX is poking at it, so setting a needinfo to see what he finds.
Flags: needinfo?(arzhel)
Comment 14•11 years ago
On fw1.releng.scl3, the CPU usage of the SPC (the module in charge of IPsec and firewall sessions) is currently at ~80% with about 1Gbps of traffic and ~9000 sessions.
So far I think this issue is caused by a combination of high IPsec traffic and a high number of sessions overloading that CPU.
That firewall is supposed to handle 2Gbps of IPsec traffic, but the addition of firewall sessions might impact it more than expected.
We will need to monitor those values the next time this issue happens to confirm or rule that out. If it's confirmed, several solutions are possible, for example getting a 2nd identical module, or not using the IPsec tunnels for the connections from AWS to our public services (ftp/hg).
Flags: needinfo?(arzhel)
Reporter
Comment 15•11 years ago
Only a moderate episode tonight, starting late at around 20:30.
Comment 16•11 years ago
(In reply to Arzhel Younsi [:XioNoX] from comment #14)
Ah hah! This seems likely from what I've seen. The next opportunity to verify is probably around 1900 Pacific on Monday, but it may occur earlier if there is a 'busy Monday' effect.
Comment 17•11 years ago
I've been looking at what we can do to reduce the traffic through the tunnels. The biggest sources of tunnel traffic will be
*) retrieving source from hg.m.o (often small, but can turn into 600MB d/l from ftp.m.o for a bundle)
*) uploading files to ftp.m.o after compile jobs complete (300-450MB each)
*) downloading files from ftp.m.o to run tests (many requests for data from upload)
Do we have any easily accessible metrics on this, btw?
Avoiding the tunnel for ftp and hg - As a proof of concept, I've used [1] to create an EC2 instance in a public subnet, which provides NAT to instances in the private subnets. This seems to work fine in a quick test, but wasn't any faster to d/l from ftp because the load on the tunnel and ftp were low when I was testing. I'll check it when more developers are awake and the systems are busy. On the nice side of the coin, a routing table is used to select the tunnel or NAT gateway, so we can opt individual subnets in or out (ie compilers vs testers vs support hosts). We also do ftp uploads & downloads on different IPs. Can't do things by port though, but passing through ssh seems to work OK. Access to hg-internal.dmz.scl3 will need to keep running down the tunnel, so may not get much win from hg.m.o.
We don't know yet what size instance we would need to get enough network I/O, nor, since it adds a SPOF, how to make it robust/HA. Probably one beefy NAT box per availability zone per region. You can do failover fairly easily [2], which might be OK as we retry some things. Some ongoing costs involved (a few instances, no new traffic costs), plus additional complexity in managing the networking in AWS.
Alternatives - we could also look at some caching in AWS for the test downloads. Using squid would be one option, eg [3], but it will require some investigation. Could also roll our own ftp 'mirror' in each AWS region, and some sort of smart downloader that knows to try locally before heading further away. Even further off.
[1] http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_NAT_Instance.html
[2] http://aws.amazon.com/articles/2781451301784570
[3] https://aws.amazon.com/articles/5995712515781075
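For the record, the routing-table opt-in described above amounts to something like the following with the AWS CLI; the route table ID, NAT instance ID, and destination CIDR below are placeholders, not our real values:
# Hypothetical IDs/CIDR for illustration only.
NAT_INSTANCE=i-0abc1234           # NAT instance in the public subnet
RTB=rtb-0abc1234                  # route table attached to a builder/tester subnet
FTP_CIDR=203.0.113.0/24           # stand-in for the ftp.m.o VIPs
# NAT instances must have source/destination checking disabled.
aws ec2 modify-instance-attribute --instance-id $NAT_INSTANCE --no-source-dest-check
# Route only ftp-bound traffic from this subnet via the NAT instance; everything else keeps using the tunnel routes.
aws ec2 create-route --route-table-id $RTB --destination-cidr-block $FTP_CIDR --instance-id $NAT_INSTANCE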
Comment 18•11 years ago
(In reply to Nick Thomas [:nthomas] from comment #17)
> I've been looking at what we can do to reduce the traffic through the
> tunnels. The biggest sources of tunnel traffic will be
> *) retrieving source from hg.m.o (often small, but can turn into 600MB d/l
> from ftp.m.o for a bundle)
Bug 851398 would help with this massively - we currently rm -rf most of the source directories (tools, mozharness, gecko, ...) and re-clone, instead of using hg purge.
Depends on: 851398
Comment 19•11 years ago
(In reply to Ed Morley [:edmorley UTC+0] from comment #18)
> (In reply to Nick Thomas [:nthomas] from comment #17)
> > I've been looking at what we can do to reduce the traffic through the
> > tunnels. The biggest sources of tunnel traffic will be
> > *) retrieving source from hg.m.o (often small, but can turn into 600MB d/l
> > from ftp.m.o for a bundle)
>
> Bug 851398 would help with this massively - we currently rm -rf most of the
> source directories (tools, mozharness, gecko, ...) and re-clone, instead of
> using hg purge.
That's not entirely true. We rm -rf the build directory, but in most cases we have a local clone of gecko we can use and don't have to re-clone it from scratch.
Reporter
Comment 20•11 years ago
Another non-standard episode tonight, barely noticeable starting at 19:30, but coming to a nice rolling boil at 21:00.
Comment 21•11 years ago
This hit us again hard and required another round of zillions of retriggers :)
Comment 22•11 years ago
I wrote a small script that pulls the SPU usage value from the firewall and writes it to a file. Next time it happens we will have more data.
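The polling loop is roughly of this shape (a sketch, not the actual script; it assumes an SRX-style "show security monitoring fpc <slot>" command, and the ssh username and log path are made up):
#!/bin/bash
# Sketch only: poll SPU utilisation once a minute and append timestamped readings to a file.
FW=fw1.releng.scl3.mozilla.net
LOG=$HOME/spu-usage.log
while true; do
    ts=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
    ssh netops@$FW 'show security monitoring fpc 0' 2>&1 \
        | grep -i 'utilization' \
        | while read line; do echo "$ts $line"; done >> "$LOG"
    sleep 60
done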
Comment 23•11 years ago
Case 2014-0117-0233 opened with Juniper to investigate possible performance issues on our side.
Reporter
Comment 24•11 years ago
We're back to sucking mightily, beyond belief, and every tree is closed again.
Reporter
Comment 25•11 years ago
Closed all trees at 20:37.
Comment 27•11 years ago
:arzhel informed me in IRC that he is on PTO, and this closes the trees for a few hours daily right now, so assigning to :adam for any rectification/remediation he can provide.
Assignee: arzhel → adam
Reporter
Comment 28•11 years ago
Reopened, it fizzled out some unknowable time ago.
Comment 29•11 years ago
Observation from https://treestatus.mozilla.org/b2g-inbound - and discussion with sheriffs in #releng:
- this is happening every weeknight PT (M-Th)
- it impacts the throughput of build/testing every night
- some nights the trees are not closed due to lighter overall load
- all reports in bug 934938 are also due to this condition
(In reply to Arzhel Younsi [:XioNoX] from comment #23)
> Case 2014-0117-0233 opened with Juniper to investigate possible performance
> issues on our side.
Can we get an update on Juniper's feedback while :arzhel is on PTO (comment 27)?
What is the current plan for steps to resolution?
Flags: needinfo?(adam)
Trees closed at 18:38
Assignee
Comment 31•11 years ago
We have a case open with Juniper and we'll work toward resolution. It seems we're going to have to make upgrades (buy more/new hardware, run dedicated lines, etc.) in order to push this harder. In the meantime, can we get our SLAs from Amazon so that I know what they are promising from their side?
Flags: needinfo?(adam)
Reporter
Comment 32•11 years ago
Impossible for me to tell whether it's actually recovered, or we've just run out of running jobs to fail. Reopened after four hours.
Reporter
Comment 33•11 years ago
Only closed for three hours tonight.
Comment 34•11 years ago
(In reply to Phil Ringnalda (:philor) from comment #33)
> Only closed for three hours tonight.
Is this our first Friday night closure for this reason? I.e. any gut feeling that the issue is getting worse?
Flags: needinfo?(philringnalda)
Reporter
Comment 35•11 years ago
I'd say it's worse, yes: I just merged mozilla-inbound and b2g-inbound to mozilla-central, and mozilla-central back to m-i, b-i and f-t, so we have a total of maybe 10 pushes and a thousand jobs running, less than half our normal full load, and we're hitting this on a Saturday night.
Flags: needinfo?(philringnalda)
Reporter
Comment 36•11 years ago
This'll cut down my spamming of this bug, though: from now on, just assume that every tree is closed every day from 19:30 to 23:30.
Reporter
Comment 37•11 years ago
One thing I do know that we do around the time when it all goes to pot (remember the thing that we do in the evening, that we're supposed to be looking for?) is trigger PGO builds on the tip of every tree that does periodic PGO, even if we've already built PGO on that revision, and trigger non-unified builds on the tip of every tree, even if we've already built non-unified on that revision. We do that at 18:00, but we might very well trigger tests on the PGO builds, upload the PGO builds and their logs, and upload the non-unified builds and their logs all around 19:30.
It can't be the cause, both because we also do that at 00:00, 06:00, and 12:00, and because we started doing that in the middle of January, while this started somewhere between before the Christmas break and early January (anything involving load wouldn't have shown up until January) - but it certainly can be contributing to the degree of bustage.
Depends on: 964071
Comment 38•11 years ago
Trees closed again, not inclined to open them any time soon.
This / bug 934938 needs escalating asap, my patience is somewhat diminishing :-(
I'll send some emails out.
Updated•11 years ago
Severity: critical → blocker
Reporter
Comment 39•11 years ago
Despite the suspicious timing, 12 hours before when this usually hits, I think the current episode is something distinct, since this bug doesn't involve 500s from ftp.m.o and doesn't involve Windows and Mac slaves.
Comment 40•11 years ago
Comment 39 was caused by bug 963768 comment 9.
Comment 41•11 years ago
We've decreased priority of usw2, so should be preferring use1 at this point.
Comment 42•11 years ago
Laura pinged me about this.
There is a question about whether or not we can "turn off IPSEC" in the interim. To be clear, turning off IPSEC is not a valid option. IPSEC is a requirement for connecting to AWS VPCs. The only other option would be AWS Direct Connect, which has considerable lead time.
It seems there are a few options detailed in this etherpad, https://infra.etherpad.mozilla.org/aws-ipsec which include:
* Add hardware modules to the firewalls
* Moving FTP and HG traffic off the IPSEC tunnel
Assignee
Comment 43•11 years ago
I've received a quote for the FW expansion and they're working on the lead time for the resources if it is determined that adding cards is our preferred direction.
Reporter
Comment 44•11 years ago
Funny thing happened on the way to the nightly closure tonight: ordinarily, I only look at finished-and-failed jobs, so I don't even see scheduling, but at 21:00 (inconveniently one minute before it recovered) I noticed that we hadn't been scheduling any tests and builds for 30 minutes or so. Not surprisingly, once we scheduled 30 minutes worth all in one huge wad so they all cloned and downloaded at once, we choked out the VPN.
Do we routinely have a 30 minute period, 20-30 minutes before this starts, when we don't schedule any jobs before scheduling them all at once, causing this to happen?
Comment 45•11 years ago
(In reply to Phil Ringnalda (:philor) from comment #44)
> I noticed that we
> hadn't been scheduling any tests and builds for 30 minutes or so. Not
> surprisingly, once we scheduled 30 minutes worth all in one huge wad so they
> all cloned and downloaded at once, we choked out the VPN.
Perhaps completely unrelated, but I noticed that some jobs I retriggered on a try job last night (the tree was open at this point) simply never happened. I retriggered them again and they came back green, but the earlier triggers appear to have been eaten.
Comment 46•11 years ago
There have been some (probably unrelated) problems with self-serve which would have caused that. See bug 963720.
Comment 47•11 years ago
I've queried the netflow data around the tree closure events; it seems that the top talkers just before the tunnel died were:
- v-1030.fw1.releng.scl3.mozilla.net - over 589Mbit/sec on average on this one alone. That's an external IP used for source NAT for a number of RelEng systems.
- ftp1-zlb.vips.scl3.mozilla.com - 170Mbit/sec
- ftp2-zlb.vips.scl3.mozilla.com - 168Mbit/sec
- 205.251.233.122
- 72.21.209.193
- hg-zlb.vips.scl3.mozilla.com
- upload-zlb.vips.scl3.mozilla.com
- pvtbuilds.pvt.build.mozilla.org
Note - it's netflow data, so by its very nature (statistical sampling of one packet in N) the actual flow rates might have been higher; the numbers quoted here are estimates on the low side.
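A query of roughly this shape would reproduce a top-talkers list like the one above, assuming an nfdump-style collector; the capture path, time window, and VPC CIDR below are placeholders:
# Top 10 source IPs by bytes sent toward the (placeholder) VPC range during the closure window.
nfdump -R /var/flows/fw1.releng.scl3 \
    -t 2014/01/28.19:00:00-2014/01/28.21:00:00 \
    -s srcip/bytes -n 10 'dst net 10.128.0.0/9'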
Comment 48•11 years ago
Here is the list of subnets the RelEng VPC talks to (or the other way around). I used the fact that no subnet is smaller than a /24.
cat releng_vpc_ips_sorted_uniq_novpc.txt | cut -d '.' -f 1-3 | sort -n | uniq
10.12.128
10.12.130
10.12.131
10.12.132
10.12.133
10.12.135
10.12.48
10.12.49
10.12.50
10.12.51
10.12.52
10.22.248
10.22.70
10.22.74
10.22.75
10.22.81
10.250.48
10.26.40
10.26.41
10.26.48
10.26.56
10.26.57
10.26.75
10.26.80
10.26.84
10.8.74
10.8.75
10.8.81
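(As an aside, plain sort -n stops parsing at the second dot, which is why 10.12.128 sorts ahead of 10.12.48 above; a per-octet numeric sort would be something like:)
cut -d '.' -f 1-3 releng_vpc_ips_sorted_uniq_novpc.txt | sort -t . -k1,1n -k2,2n -k3,3n -u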
Comment 49•11 years ago
I'm parsing the netflow data for 24h of traffic from last Wednesday.
Comment 50•11 years ago
A list of hostnames with IPs is attached to the bug - see releng_vpc_ips_sorted_uniq_novpc_resolved.txt
For the record, generated with
for i in `cat releng_vpc_ips_sorted_uniq_novpc.txt`; do (echo -n $i; echo -n " "; dig +short -x $i) >> releng_vpc_ips_sorted_uniq_novpc_resolved.txt; done;
Comment 51•11 years ago
Comment 52•11 years ago
Comment 53•11 years ago
Attached is the full output of the netflow data I was parsing, should any of you feel like digging into it yourself: releng_vpc.csv.bz2, a 7MB+ bzip2 file.
Comment 54•11 years ago
1. It looks like it is a SPU meltdown - connectivity collapses frequently after a large data transfer starts, like from/to the FTP server. The SPU's theoretical performance for IPsec can only be achieved in ideal conditions. We can't confirm it, but it's a standard marketing practice to measure throughput with an almost empty session table (i.e. almost no firewalling, or just nothing at all) and maximum-size packets. I haven't seen the SPU usage go under 82% since yesterday, which means we don't have a lot of headroom should any be needed. NetOps can correct me here.
2. We do not know for sure that purchasing another two SPUs will resolve the problems; per the Juniper ticket, they aren't saying it will.
3. Comment #47 should give you an overview of what kind of communication is generating most of the traffic and is the number one candidate to move out of IPsec and onto TLS, should we decide to go that route.
4. Comments #48 and #50 have a parsed list of hosts talking with the VPCs. Data are for a single 24h timeslot, last Wednesday. I can recreate it for any other period, just keep in mind it's a time-consuming and manual exercise, and querying netflow data for anything more than 48h takes a lot of time.
Comment 55•11 years ago
bug 964804 appears related to philor's note in comment 44.
Depends on: 964804
Comment 56•11 years ago
[13:28:29] < michal`> adam: one thing. say we can spread the single vpn session over two SPUs and we get a lot more throughput this way. I think we should rate limit the ZLB uploads and downloads a bit so it does not end up going as fast as it can and consuming what we just gained
[13:28:51] < adam> That's sound logic.
TL;DR we must figure out how fast we need the downloads and uploads to go and rate limit them accordingly, otherwise we might saturate the SPUs again. The limit should be enforced at the ZLB level.
Comment 57•11 years ago
Attached is a CSV traffic summary that includes bytes transferred. This is sampled over a 3 hour period last night (approx 17:00 - 20:00 PST).
The columns are sorted by total bytes transferred. It's pretty obvious that HG sends a lot of traffic across the VPN, and that pvtbuilds receives a lot of traffic from across the VPN.
:michal` is commenting with more summary details. See his comment. :-)
Comment 58•11 years ago
Ben just fixed bug 964411 this morning, which should help with the HG traffic a bit.
Comment 59•11 years ago
The majority of the traffic is pvtbuilds and HG; I'll have more scientific data with numbers later today. After filtering out all the pvtbuilds+HG traffic I can't see more than 60-100Mbit/sec, with occasional spikes to around 200+.
Next step - breakdown by ports. Stay tuned.
Comment 60•11 years ago
Attached is the query output showing which instances use hg-internal over the IPsec tunnel.
Reporter
Comment 61•11 years ago
And after a nice long period of things being good, we're seeing timeouts from both ftp-ssl.m.o and pvtbuilds.pvt.build.m.o, all trees closed.
Comment 62•11 years ago
Best I can tell this was an issue within Amazon's network, which hurt usw2 only.
This is the smokeping graph from admin1b.releng.scl3 to nagios1.private.releng.usw2, where there is bad packet loss but latency is still good. I checked the traffic levels on the AWS tunnels and fw1.releng.scl3 overall, and it wasn't anything higher than we dealt with all day, so that's consistent with the latency.
We have a cron running mtr from cruncher.srv.releng.scl3 to Amazon's side of the two usw2 tunnels, and saw packet loss at 198.32.176.36 (paix01-sfo4.amazon.com) between 02:42 and 05:40 UTC. That was only on the .122 tunnel, and not the .121 tunnel. It's only sending 5 packets, so the only possible results are 0%/20%/40%... loss and it's not a very strong test, but there's still some info there.
Comment 63•11 years ago
seems the issue with aws came back today like https://tbpl.mozilla.org/php/getParsedLog.php?id=34833613&tree=B2g-Inbound trees closed till the machine/test recovers
Comment 64•11 years ago
(In reply to Carsten Book [:Tomcat] from comment #63)
> seems the issue with aws came back today like
> https://tbpl.mozilla.org/php/getParsedLog.php?id=34833613&tree=B2g-Inbound
> trees closed till the machine/test recovers
This log shows:
02:50:17 ERROR - ssh: Could not resolve hostname pvtbuilds.pvt.build.mozilla.org: Name or service not known
All the earlier comments in this bug talk about very slow download speeds, not DNS issues (though it's hard to be sure because all of the TBPL links have expired). I don't think this particular log is related to this bug.
Comment 65•11 years ago
All work on this particular issue has been completed. Please open new bugs for AWS issues, even if it appears to have similar symptoms.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard