Closed
Bug 1096337
Opened 11 years ago
Closed 9 years ago
Excessive requests to hg.mozilla.org exhaust network capacity
Categories
(Developer Services :: Mercurial: hg.mozilla.org, defect)
Developer Services
Mercurial: hg.mozilla.org
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Unassigned)
References
Details
(Keywords: meta, Whiteboard: [tracker] [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4134] )
Attachments
(1 file)
235.67 KB, text/plain
As an example: https://treeherder.mozilla.org/ui/logviewer.html#?job_id=610809&repo=mozilla-central
06:40:27 INFO - Downloading https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/1415626437/firefox-36.0a1.en-US.win32.tests.zip to C:\slave\test\build\firefox-36.0a1.en-US.win32.tests.zip
06:40:27 INFO - retry: Calling <bound method Proxxy._download_file of <mozharness.mozilla.proxxy.Proxxy object at 0x01D23550>> with args: ('https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/1415626437/firefox-36.0a1.en-US.win32.tests.zip', 'C:\\slave\\test\\build\\firefox-36.0a1.en-US.win32.tests.zip'), kwargs: {}, attempt #1
command timed out: 1800 seconds without output, attempting to kill
Closing trees
Comment 1•11 years ago
I imagine this will resolve itself once the surge in FTP load is reduced by switching Firefox Developer Edition to the CDN (until now, Aurora just used FTP directly, as I understand it).
Comment 2•11 years ago
There have been discussions in #moc.
If you look at the Zeus network traffic graphs going back a month, there is a strong correlation between high traffic spikes and in-flight requests on hg.mozilla.org, even though the Zeus traffic in question does not involve hg.mozilla.org.
The hg.mozilla.org logs from the last 2 events reveal that the server has plenty of CPU, I/O, and memory capacity during these events: it just can't send bits to the network fast enough. In-flight requests pile up and the server eventually has no slots left to process requests. Operations like cloning mozharness that normally take under 4s are taking over a minute. Outbound bandwidth both to EC2 and the SCL3 firewall (v-1030.fw1.releng.scl3.mozilla.net) is severely limited.
We believe the high traffic spikes correlate with Firefox releases: releasing Firefox triggers a cascading network outage due to load saturation. This problem goes back to at least mid-October.
I believe there is a network device or link in SCL3 that hits capacity during these events. However, I don't believe anyone has positively identified what device that is. There are theories. I'd love to offer a smoking gun, but I don't know the network topology and am not sure where to access data to go digging for answers.
Comment 3•11 years ago
While I still believe hg.mozilla.org isn't the original point of failure here (the network is), network traffic from hg.mozilla.org is certainly contributing to the problem.
Here is a list of outbound traffic (bytes) per URL path from one hgweb node today:
217584175027 integration/gaia-central
114539102183 build/tools
93529675286 projects/fig/json-pushes
25273822086 integration/gaia-2_1
16728568156 mozilla-central
14468451244 releases/mozilla-beta
13736330981 integration/mozilla-inbound
12666163503 build/mozharness
11122526727 build/talos
9457257245 releases/mozilla-aurora
6090081997 integration/gaia-2_0
4769007406 projects/maple
3934642453 projects/ash
2859290530 projects/cypress/json-pushes
2454524984 integration/gaia-1_4
995438278 projects/cedar
969852715 integration/fx-team
912853369 projects/holly/json-pushes
878740118 projects/pine/json-pushes
759913211 comm-central
610110130 projects/ux/json-pushes
588445720 projects/elm/json-pushes
565873165 releases/comm-beta
557689275 l10n-central/pl
530654231 projects/ionmonkey/json-pushes
522200050 services/services-central/json-pushes
498329975 users/stage-ffxbld/tools
447961775 projects/ash/json-pushes
443713333 projects/build-system/json-pushes
395443968 try
369771260 projects/larch/json-pushes
355321901 l10n-central/es-ES
326484408 users/Callek_gmail.com/tools
294789762 projects/graphics/json-pushes
283687740 l10n-central/pt-BR
227922785 projects/cedar/json-pushes
211938973 projects/date/json-pushes
193656544 build/ash-mozharness
160807916 projects/jamun/json-pushes
139600810 releases/comm-aurora
122206275 gaia-l10n/en-US
I've already filed a few bugs to track some obvious problems. What's shocking to me is that build/tools is only a 29 MB repo but it is transferring more than every repository except gaia-central. Ouch.
Comment 4•11 years ago
Total CPU time per repository / URL path:
38460 projects/fig/json-pushes
35368 build/tools
32895 integration/gaia-central
10609 integration/mozilla-inbound
8293 build/mozharness
5638 mozilla-central
4830 releases/mozilla-beta
3588 integration/gaia-2_1
3131 releases/mozilla-aurora
2842 try
2568 build/talos
1548 projects/maple
1528 try/json-pushes
1257 projects/ash
853 integration/gaia-2_0
682 projects/cypress/json-pushes
518 integration/fx-team
388 projects/cedar
371 integration/gaia-1_4
263 l10n-central/pl
223 comm-central
219 projects/holly/json-pushes
209 projects/pine/json-pushes
189 projects/ux/json-pushes
176 l10n-central/es-ES
174 releases/comm-beta
167 projects/ionmonkey/json-pushes
161 services/services-central/json-pushes
154 integration/b2g-inbound
144 users/stage-ffxbld/tools
135 projects/elm/json-pushes
135 mozilla-central/annotate/0bed6fc0b0b7/configure.in
133 projects/build-system/json-pushes
127 releases/mozilla-release
125 l10n-central/pt-BR
121 build/ash-mozharness
108 projects/larch/json-pushes
108 projects/ash/json-pushes
107 users/Callek_gmail.com/tools
106 gaia-l10n/en-US
IMO integration/mozilla-inbound should be #1. The items above it are all bugs.
Comment 5•11 years ago
Total response size (bytes) for all hgweb nodes for Nov 10 UTC:
1992687013298 integration/gaia-central
1049664755232 build/tools
762034508701 projects/fig/json-pushes
182886245712 mozilla-central
155743621405 releases/mozilla-beta
149080455046 integration/mozilla-inbound
122715571509 integration/gaia-2_1
114123852497 build/mozharness
101884717657 build/talos
86674227517 integration/gaia-2_0
79790138227 projects/maple
71777273460 releases/mozilla-aurora
34514508687 projects/cedar
28572272607 projects/ash
27587098727 integration/gaia-1_4
25165673944 projects/cypress/json-pushes
20745737350 try
9010094078 projects/pine/json-pushes
7524875339 projects/holly/json-pushes
6655613549 releases/comm-beta
6624339681 projects/elm/json-pushes
6346144306 integration/fx-team
5837963779 comm-central
5732022028 projects/ux/json-pushes
5433899292 projects/ionmonkey/json-pushes
5374096650 l10n-central/pl
5069931793 services/services-central/json-pushes
5037247503 integration/b2g-inbound
4298728516 users/Callek_gmail.com/tools
4205904989 users/stage-ffxbld/tools
4074232038 projects/ash/json-pushes
3750094202 projects/build-system/json-pushes
3571335520 projects/larch/json-pushes
3171262433 l10n-central/es-ES
2949379660 releases/mozilla-release
2672131460 releases/mozilla-b2g32_v2_0
2569241108 l10n-central/pt-BR
2538777823 releases/mozilla-b2g30_v1_4
2303840600 releases/comm-aurora
1973782077 projects/cedar/json-pushes
1950817230 projects/graphics/json-pushes
1950144199 projects/date/json-pushes
1781215071 releases/mozilla-esr31
1521218934 projects/jamun/json-pushes
1405078180 build/ash-mozharness
1159592537 gaia-l10n/en-US
1137746745 releases/l10n/mozilla-beta/pl
Firefox release automation is DoSing itself.
Comment 6•11 years ago
To put these numbers in perspective, a 1 Gbps link can deliver ~125 MB/s, or 10,800,000 MB = 10,800 GB (~10.8 TB) per day.
In terms of saturation of a 1 Gbps link over a day:
Contrib Cumul Repo/URL
18.45% 18.45% integration/gaia-central
9.72% 28.17% build/tools
7.06% 35.23% projects/fig/json-pushes
1.69% 36.92% mozilla-central
1.44% 38.36% releases/mozilla-beta
1.38% 39.74% integration/mozilla-inbound
1.14% 40.88% integration/gaia-2_1
1.06% 41.93% build/mozharness
0.94% 42.88% build/talos
0.80% 43.68% integration/gaia-2_0
0.74% 44.42% projects/maple
0.66% 45.08% releases/mozilla-aurora
0.32% 45.40% projects/cedar
0.26% 45.67% projects/ash
0.26% 45.92% integration/gaia-1_4
0.23% 46.16% projects/cypress/json-pushes
0.19% 46.35% try
So, yeah, cloning/pulling our top 17 repositories can get a 1 Gbps link to ~50% saturation.
This would arguably be tolerable. However, load isn't evenly distributed (it is concentrated when the sun is over California), and we have other things occurring over the link (like FTP traffic).
We need to significantly reduce the bandwidth used by Mercurial clones.
While we can deploy things on the Mercurial server to reduce bandwidth, I still think release automation has a long ways to go to cut down on excessive clones and pulls.
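For reference, a minimal sketch (assuming the 1 Gbps ~= 10,800,000 MB/day figure above; byte counts copied from the comment 5 totals) showing how the percentages in this table are derived:

# Sketch: reproduce the saturation table from per-repo daily byte totals.
# Assumes a 1 Gbps link ~= 125 MB/s ~= 10,800,000 MB/day.
LINK_BYTES_PER_DAY = 125_000_000 * 86400  # ~1.08e13 bytes

# (bytes, repo/URL) pairs from the Nov 10 totals in comment 5 (truncated).
totals = [
    (1992687013298, "integration/gaia-central"),
    (1049664755232, "build/tools"),
    (762034508701, "projects/fig/json-pushes"),
    (182886245712, "mozilla-central"),
]

cumulative = 0.0
print("Contrib  Cumul   Repo/URL")
for nbytes, repo in sorted(totals, reverse=True):
    contrib = 100.0 * nbytes / LINK_BYTES_PER_DAY
    cumulative += contrib
    print(f"{contrib:6.2f}% {cumulative:6.2f}% {repo}")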
Comment 7•11 years ago
For the official record and so credit is properly given, I didn't realize we were wasting quite so much bandwidth in Mercurial land until catlee pasted a link in IRC showing the overall traffic of hg.mozilla.org being greater than the overall traffic for ftp. I was shocked by this and quite frankly didn't believe it at first. Judging by catlee's reaction on IRC, I don't think he believed it either :) I wrote a quick script for dumping per-repository totals from the Mercurial serverlog extension logs to a) validate the claim and b) isolate where all that data was going. That quickly revealed some numbers (pasted above) that validated the data catlee pulled, were significantly larger than I expected, and made it really easy to identify some oddities.
As they say, data is king. I hope when all of this is over we have more accessible and easy-to-use dashboards for quickly finding critical data like this.
As an aside, those of us maintaining the Mercurial servers have been staring at the total bandwidth number for a while. You can clearly see it in the "Network Graph" at https://nigelbabu.github.io/hgstats/. However, the unit is simply "M." I assumed that was Mbps. Looking at it now, the underlying data source is in octets, so it is actually MBps - 8x larger than I thought. I wonder how many more years I need to be an engineer before I realize not to make assumptions about bits vs bytes.
Finally, https://graphite.mozilla.org/render?from=-6weeks&until=now&width=586&height=308&_salt=1408010393.594&yMax=&yMin=&target=sumSeries%28hosts.hgweb*_dmz_scl3_mozilla_com.interface.if_octets.bond0.tx%29&title=Outbound%20Network%20Traffic&hideLegend=true&_uniq=0.8170505894781112 shows that hg.mozilla.org peak network bandwidth increased dramatically around Oct 21. I wonder what triggered that...
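For posterity, a rough sketch of the kind of script described above (hypothetical, not the actual one; it assumes whitespace-delimited serverlog lines with the URL path and response size at known field offsets, which may not match the real log format):

#!/usr/bin/env python
# Sketch: sum response bytes per repository from serverlog-style access logs.
# Assumptions (adjust for the real format): whitespace-delimited lines, URL
# path in field index 2 and response size in bytes in field index 3.
# Usage: python repo_totals.py < access.log
import sys
from collections import defaultdict

totals = defaultdict(int)
for line in sys.stdin:
    fields = line.split()
    if len(fields) < 4:
        continue
    path, size = fields[2], fields[3]
    if not size.isdigit():
        continue
    # Normalize "/repo/cmd?args" down to the repository/URL path.
    repo = path.lstrip("/").split("?")[0]
    totals[repo] += int(size)

for repo, nbytes in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(nbytes, repo)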
Comment 9•11 years ago
Is the traffic coming from hg's and ftp's public facing interfaces?
host hg.mozilla.org
hg.mozilla.org has address 63.245.215.25
host ftp.mozilla.org
ftp.mozilla.org has address 63.245.215.56
ftp.mozilla.org has address 63.245.215.46
and going to destinations on the Internet?
Thanks.
Comment 10•11 years ago
My understanding is that traffic is going from hg.mozilla.org's public interfaces both to the public internet *and* to internal consumers, like fw1.releng.scl3.
One important thing I left out is that during "network events," traffic is slow both to fw1 *and* to the public internet. Assuming fw1 is not in the path of hg.mozilla.org -> EC2 releng infra, this seems to imply that there is a network issue outside of fw1.
Also, please read http://gregoryszorc.com/blog/2014/11/07/mercurial-server-hiccup-2014-11-06/ for my reasoning for pinning this on network capacity instead of something else. tl;dr: the hg servers have plenty of idle resources; they just can't seem to send bits to the network fast enough. Operations that commonly take 4s to transfer 20MB are taking >100s.
Comment 11•11 years ago
It would be good if we knew what percentage of the traffic is going from where to where.
Traffic going from hg.m.o and/or ftp.m.o out to the Internet... we need to look into that for sure. There should be no bottlenecks there. Really, outbound packets should go from server -> switch -> border router -> border router -> Internet, all at 10 Gbps, and the load on those three network devices is a small fraction of what they are capable of.
For traffic passing through fw1.releng.scl3, do we know the actual destination? I'm wondering if that traffic is going to AWS. If so, the path those packets take is complicated and disappears into the AWS cloud, which we don't have visibility into. =-(
Yeah, more info is needed here. I don't know how these servers are connected. I know *where* they are connected in SCL3, as in which VLAN, but not whether they are connected via 1G or 10G, or how the load balancers are involved (not pointing a finger, just including them in the path if they should be).
Yes, I read your blog yesterday. Good sleuthing on your part!
Hey, aren't you supposed to be taking the day off? =-)
Comment 12•11 years ago
Should add something:
> Assuming fw1 is not in the path of hg.mozilla.org -> EC2 releng infra,
> this seems to imply that there is a network issue outside of fw1.
It depends on which EC2 instance the packets are going to.
We have IPSec tunnels from both fw1.releng.scl3 and fw1.scl3 to AWS.
Here's the breakdown WRT destination IP addresses:
fw1.scl3:
-------------------
10.136/16
10.128/16
10.152/16
fw1.releng.scl3:
-------------------
10.132/16
10.130/16
10.134/16
This has got me curious, because if hg.m.o or ftp.m.o sent packets destined for any of the above-mentioned IP space, those packets would hit core1.scl3 or core2.scl3 (a mated pair of switch/routers), which do not have routes for any of these /16s.
I wonder what the routing table looks like (netstat -rn) on hg.m.o and ftp.m.o. Specifically, do they send traffic destined for this address space out an interface other than their public-facing one?
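As an aside, a quick sketch (a hypothetical helper using Python's stdlib ipaddress module; the /16 lists are the ones above, and the 10.x example address is made up) for checking which IPsec tunnel a destination address would fall under:

# Sketch: map a destination IP to the firewall whose IPsec tunnel covers it,
# based on the /16 breakdown above.
import ipaddress

TUNNELS = {
    "fw1.scl3": ["10.136.0.0/16", "10.128.0.0/16", "10.152.0.0/16"],
    "fw1.releng.scl3": ["10.132.0.0/16", "10.130.0.0/16", "10.134.0.0/16"],
}

def tunnel_for(ip):
    addr = ipaddress.ip_address(ip)
    for fw, nets in TUNNELS.items():
        if any(addr in ipaddress.ip_network(net) for net in nets):
            return fw
    return None  # not covered by either tunnel (e.g. public AWS space)

print(tunnel_for("10.134.48.20"))   # -> fw1.releng.scl3 (hypothetical address)
print(tunnel_for("54.148.43.165"))  # -> None (public AWS address from the logs)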
Comment 13•11 years ago
This is a list of times for hgweb1 to serve "bundle" requests for the build/tools repository on UTC day Nov 10 (partial day coverage only). A "bundle" request is an `hg clone` or `hg pull`. Most of these are clones (you can tell by the ~20MB response size).
Columns are:
* Wall time to serve request
* IP address of requester. IP comes from X-CLUSTER-CLIENT-IP HTTP request header or REMOTE_ADDR environment variable (presumably set by Apache to the remote IP address).
* Response bytes
* Time request arrived
Average response time: 40.5s
Median response time: 8.8s
As you can see, the response times at the bottom are quite extreme.
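A minimal sketch (assuming the attachment is a plain-text file named tools-bundle-times with the four whitespace-separated columns listed above, wall time first and no header line) of how the average/median were computed:

# Sketch: average and median wall time from the tools-bundle-times attachment.
# Assumed columns: wall_time requester_ip response_bytes start_time.
import statistics
import sys

times = []
with open(sys.argv[1] if len(sys.argv) > 1 else "tools-bundle-times") as fh:
    for line in fh:
        fields = line.split()
        if fields:
            times.append(float(fields[0]))

print(f"Average response time: {statistics.mean(times):.1f}s")
print(f"Median response time: {statistics.median(times):.1f}s")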
Comment 14•11 years ago
Note that hgwebN.dmz.scl3 are behind Zeus, so I don't think routing tables on those hosts matter too much.
I have no clue how Zeus is configured. bkero is the subject expert from my perspective.
Comment 15•11 years ago
Ah. Those requests from the 54/8 address space are from various AWS sites, and they are using their public-facing interface to avoid (offload) the IPsec tunnels.
And (as I'm sure you've already figured out) 63.245.214.82 is the NAT IP for fw1.releng.scl3.
If we ramp up traffic to/from the VLANs behind fw1.releng.scl3 to hg.m.o/ftp.m.o, then that firewall could be a bottleneck.
However, that doesn't explain why traffic between AWS (public space) and hg.m.o/ftp.m.o experiences an issue as well. If the slow connections happen at the same time, to both locations, that's a big data point to me. Before I write a script to order the lines in your attachment by date/time, do we already know this to be the case?
Thanks!
Comment 16•11 years ago
I am very curious about how Zeus is configured.
The hgwebN.dmz.scl3 hosts may sit behind a firewall (fw1.scl3.mozilla.net), depending on how Zeus connects to them.
Comment 17•11 years ago
awk '{print $4, $1, $2, $3}' tools-bundle-times | sort -n
Start looking around 11-10T14:00:00. Everything is fine.
We start seeing some long requests:
11-10T14:13:57 52.557 54.165.75.203 19956506
11-10T14:14:02 48.249 63.245.214.82 19956506
11-10T14:14:03 46.507 54.173.41.136 19956506
11-10T14:15:44 8.793 63.245.214.82 19956506
11-10T14:16:16 5.979 54.148.37.234 19956506
11-10T14:16:17 20.412 54.173.162.237 19956506
Then some very long ones:
11-10T14:24:07 361.713 54.148.53.5 19956506
11-10T14:27:13 433.966 63.245.214.82 19956506
11-10T14:33:47 193.295 63.245.214.82 19956506
11-10T14:34:06 200.85 54.148.52.195 19956506
11-10T14:34:21 719.954 63.245.214.82 19956506
11-10T14:34:46 207.176 54.173.165.17 19956506
11-10T14:36:03 41.733 63.245.214.82 19956506
11-10T14:36:20 34.411 63.245.214.82 19956506
11-10T14:37:02 8.252 63.245.214.82 19956506
11-10T14:37:17 765.684 63.245.214.82 19956506
11-10T14:37:18 757.881 63.245.214.82 19956506
By 15:00 things are in a royal mess:
11-10T15:05:56 999.702 54.148.43.165 14879752
11-10T15:05:56 999.899 63.245.214.82 15075600
11-10T15:06:01 1370.844 54.173.99.241 19956506
11-10T15:06:43 12.078 63.245.214.82 19956506
11-10T15:07:03 8.326 54.148.46.20 19956506
11-10T15:07:12 9.244 63.245.214.82 19956506
11-10T15:07:30 8.447 54.148.51.71 19956506
11-10T15:07:58 1319.753 63.245.214.82 18164098
11-10T15:08:00 7.995 54.173.59.194 19956506
11-10T15:08:09 999.944 63.245.214.82 14069530
11-10T15:08:32 14.494 54.173.189.124 19956506
11-10T15:09:14 8.768 54.173.180.137 19956506
11-10T15:09:24 1319.101 54.69.126.186 18565592
That's 1300s to transfer 20MB - ~16KB/s on average.
Comment 18•11 years ago
(In reply to Dave Curado :dcurado from comment #11)
> Yeah, more info is needed here. I don't know how these servers are connected.
> I know *where* they are connected in SCL3, as in which VLAN, but not whether
> they are connected via 1G or 10G, or how the load balancers are involved
> (not pointing a finger, just including them in the path if they should be).
For posterity - the load balancers are limited to 2Gbps by license, and all traffic to/from hg.m.o travels through them.
Comment 19•11 years ago
Apologies -- I don't have any idea how the load balancers are set up.
(I'm looking at the info in mana, but I think I need a network diagram...)
Does this mean that we can get 2 Gbps in total, or 2 Gbps through each device in the load balancer cluster? (And are there 4 in the SCL3 cluster?)
Thanks very much.
Comment 20•11 years ago
Good question, and I don't know either! :-) I think, because of the way we normally do things, that we're looking at 2 Gbps in total.
The mana doc says it's "per node" but the linked-to Stingray docs do not. If it's per Zeus node, the next question is: does that mean the ZLB VIP needs to be active on all Zeus nodes? The default is that the VIP runs active on one Zeus node and is passive on all others for failover. Perhaps if we ran active-active we would get (2 Gbps * nodes).
Dropping a needinfo on :jakem, since I think he'll have the answers we want.
Flags: needinfo?(nmaul)
Comment 21•11 years ago
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/82612530
Comment 22•11 years ago
In SCL3 we have 2 Gbps of bandwidth per ZLB node: a node can push up to 2 Gbps before Stingray starts throttling it internally. With 4 nodes, that means a total possible bandwidth of 8 Gbps.
However, as :fubar correctly ascertained, hg.m.o (and most) VIPs are active on only one node at a time, effectively limiting them to 2 Gbps.
ftp.m.o is one that is done differently... it is load balanced across 2 IPs, which are active on separate ZLB nodes, giving it an effective theoretical capacity of 4 Gbps. I say 'theoretical' because that 4 Gbps is obviously shared with everything else that happens to be on those 2 nodes.
There is also ftp-ssl.m.o, which is split among 3 VIPs (3 separate ZLB nodes). As I recall, this is purely a "workaround" name having something to do with saturation of the VPC (or perhaps a firewall), and is only used by RelEng... it may be a factor here as well.
hg.m.o is active on zlb5
ftp.m.o is active on zlb5 and zlb6
ftp-ssl.m.o is active on zlb1, zlb3, and zlb5
Therefore, hg.m.o's theoretical 2 Gbps max is shared with 1/2 of ftp.m.o's theoretical 4 Gbps max, as well as 1/3 of ftp-ssl.m.o's theoretical 6 Gbps max.
This does mean that if you're pulling heavily from both hg and ftp at the same time, they will tend to interfere with each other more than might otherwise be expected (that is, outside of any relationship they may or may not share with respect to firewalls).
We could add another IP to hg.m.o and distribute it similarly to the way ftp.m.o is. There are only 4 ZLB nodes in this cluster, so it's impossible for these 3 services to be entirely disjoint (unless we shrink one or more of them)... but that doesn't mean we can't do better than it currently is.
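To make the sharing concrete, a small sketch (assuming 2 Gbps per node and the VIP placements listed above) of which services contend for each ZLB node:

# Sketch of the ZLB capacity sharing described above.
# Assumptions: 2 Gbps per node, VIP placements as listed in this comment.
NODE_CAPACITY_GBPS = 2

vip_nodes = {
    "hg.m.o": ["zlb5"],
    "ftp.m.o": ["zlb5", "zlb6"],
    "ftp-ssl.m.o": ["zlb1", "zlb3", "zlb5"],
}

# Invert the mapping: which VIPs are active on each node?
node_vips = {}
for vip, nodes in vip_nodes.items():
    for node in nodes:
        node_vips.setdefault(node, []).append(vip)

for node, vips in sorted(node_vips.items()):
    print(f"{node}: {NODE_CAPACITY_GBPS} Gbps shared by {', '.join(vips)}")
# zlb5 ends up shared by hg.m.o, ftp.m.o, and ftp-ssl.m.o, which is why heavy
# ftp/ftp-ssl traffic eats into hg.m.o's effective 2 Gbps ceiling.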
Flags: needinfo?(nmaul)
Comment 23•11 years ago
I think perhaps we found the "network resistor" in this path?
We were pushing 5 Gbps out the front door of SCL3 at this time, and sending ~2 Gbps from the border routers towards fw1.releng.scl3.
That's what I tend to look at: all the interfaces on the 2 border routers at the head end of SCL3. Traffic from the VLAN(s) supporting the load balancers going into the releng networks (and vice versa) has to traverse those head-end border routers, so measuring throughput there gives a pretty accurate picture.
Comment 24•11 years ago
Thanks for the offer to distribute hg.mozilla.org across multiple IPs! However, unless it becomes SOP, I don't believe it is necessary anymore.
We identified a lot of "dumb" traffic we shouldn't have been generating. We've already made giant leaps toward reducing the traffic and we have more in the works (see dependent bugs).
You can clearly see a drop-off in hg.mozilla.org this week at https://graphite.mozilla.org/render?from=-14days&until=now&width=1000&height=600&_salt=1408010393.594&yMax=&yMin=&target=sumSeries%28hosts.hgweb*_dmz_scl3_mozilla_com.interface.if_octets.bond0.tx%29&title=Outbound%20Network%20Traffic&hideLegend=true&_uniq=0.8170505894781112. Expect a similar drop next week if the remaining patches all land and get deployed.
Comment 25•11 years ago
Wow, great work in reducing the traffic load!
Comment 26•11 years ago
(In reply to Gregory Szorc [:gps] from comment #24)
> We identified a lot of "dumb" traffic we shouldn't have been generating.
> We've already made giant leaps toward reducing the traffic and we have more
> in the works (see dependent bugs).
Awesome, that certainly helps buildduty! Keep up the good work! =)
Comment 27•11 years ago
I plotted the per-repository bandwidth usage over time. You can find results at https://docs.google.com/spreadsheets/d/1EMhKWnKLLA24jX4cPmLcBWq6Fuk4tVJBEN4BH1D25d0/pubchart?oid=1200300937&format=interactive
If you are running e10s, you may need to disable that. Or just load the link in another browser.
Google Docs won't let me save defaults for this chart type. You'll want to:
1) Change X axis (bottom axis) to "Order: Alphabetical"
2) Under "Size" in the top right, change to "Bytes Sent"
3) Optionally change "Color" to something else to make it slightly more readable
When you move the date slider, you'll notice a few things:
1) A handful of repositories are "long poles" (build/tools, integration/gaia-central, fig pushlog, etc) and are sucking up absurd amounts of bandwidth.
2) gaia-central appears to be the main culprit behind our 200 Mbps traffic increase starting Oct 21. Still no clue what changed.
3) It appears all the major bad actors are being tracked by this bug. Resolving the dependent bugs will drastically reduce overall traffic load.
Comment 28•11 years ago
Moving to a component appropriate for ongoing discussion and changing summary to be more specific.
Component: Buildduty → Mercurial: hg.mozilla.org
Product: Release Engineering → Developer Services
QA Contact: bugspam.Callek → hwine
Summary: Trees suffering on timeouts - Trees closed → Excessive requests to hg.mozilla.org exhaust network capacity
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4134]
Comment 29•11 years ago
> fw1.scl3:
> -------------------
> 10.136/16
> 10.128/16
> 10.152/16
>
> fw1.releng.scl3:
> -------------------
> 10.132/16
> 10.130/16
> 10.134/16
Just to toss this out there for future reference -- all releng buildslaves, and thus all releng traffic sources, are in 10.132/16 and 10.134/16 (releng.use1 and releng.usw2, respectively). 10.130/16 is releng.usw1, but is unused due to high instance prices.
Comment 30•10 years ago
Adjusting status to reflect current reality. This event is over, but we started using this bug as a tracker.
Severity: blocker → normal
Keywords: meta
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4134] → [tracker] [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4134]
Comment 31•9 years ago
Closing. Using bundles has moved a lot of traffic off Zeus, and I don't see a reason to keep this open, even as a tracker.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED