Closed
Bug 1096337
Opened 11 years ago
Closed 9 years ago
Excessive requests to hg.mozilla.org exhaust network capacity
Categories
(Developer Services :: Mercurial: hg.mozilla.org, defect)
Developer Services
Mercurial: hg.mozilla.org
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Unassigned)
References
Details
(Keywords: meta, Whiteboard: [tracker] [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4134] )
Attachments
(1 file)
235.67 KB, text/plain
As an example: https://treeherder.mozilla.org/ui/logviewer.html#?job_id=610809&repo=mozilla-central
06:40:27 INFO - Downloading https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/1415626437/firefox-36.0a1.en-US.win32.tests.zip to C:\slave\test\build\firefox-36.0a1.en-US.win32.tests.zip
06:40:27 INFO - retry: Calling <bound method Proxxy._download_file of <mozharness.mozilla.proxxy.Proxxy object at 0x01D23550>> with args: ('https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32/1415626437/firefox-36.0a1.en-US.win32.tests.zip', 'C:\\slave\\test\\build\\firefox-36.0a1.en-US.win32.tests.zip'), kwargs: {}, attempt #1
command timed out: 1800 seconds without output, attempting to kill
Closing trees
Comment 1•11 years ago
I imagine this will resolve itself once the surge in FTP load is reduced by switching Firefox Developer Edition to the CDN (until now, Aurora just used FTP directly, as I understand it).
Comment 2•11 years ago
There have been discussions in #moc.
If you look at the Zeus network traffic graphs going back a month, there is a strong correlation between high traffic spikes and in-flight requests on hg.mozilla.org, even though the Zeus traffic in question does not involve hg.mozilla.org.
The hg.mozilla.org logs from the last 2 events reveal that the server has plenty of CPU, I/O, and memory capacity during these events: it just can't send bits to the network fast enough. In-flight requests pile up and the server eventually has no slots left to process requests. Operations like cloning mozharness that normally take under 4s are taking over a minute. Outbound bandwidth both to EC2 and the SCL3 firewall (v-1030.fw1.releng.scl3.mozilla.net) is severely limited.
We believe the high traffic spikes correlate with Firefox releases: releasing Firefox triggers a cascading network outage due to load saturation. This problem goes back to at least mid-October.
I believe there is a network device or link in SCL3 that hits capacity during these events. However, I don't believe anyone has positively identified what device that is. There are theories. I'd love to offer a smoking gun, but I don't know the network topology and am not sure where to access data to go digging for answers.
Comment 3•11 years ago
While I still believe hg.mozilla.org isn't the original point of failure here (the network is), network traffic from hg.mozilla.org is certainly contributing to the problem.
Here is a list of outbound traffic (bytes) per URL path from one hgweb node today:
217584175027 integration/gaia-central
114539102183 build/tools
93529675286 projects/fig/json-pushes
25273822086 integration/gaia-2_1
16728568156 mozilla-central
14468451244 releases/mozilla-beta
13736330981 integration/mozilla-inbound
12666163503 build/mozharness
11122526727 build/talos
9457257245 releases/mozilla-aurora
6090081997 integration/gaia-2_0
4769007406 projects/maple
3934642453 projects/ash
2859290530 projects/cypress/json-pushes
2454524984 integration/gaia-1_4
995438278 projects/cedar
969852715 integration/fx-team
912853369 projects/holly/json-pushes
878740118 projects/pine/json-pushes
759913211 comm-central
610110130 projects/ux/json-pushes
588445720 projects/elm/json-pushes
565873165 releases/comm-beta
557689275 l10n-central/pl
530654231 projects/ionmonkey/json-pushes
522200050 services/services-central/json-pushes
498329975 users/stage-ffxbld/tools
447961775 projects/ash/json-pushes
443713333 projects/build-system/json-pushes
395443968 try
369771260 projects/larch/json-pushes
355321901 l10n-central/es-ES
326484408 users/Callek_gmail.com/tools
294789762 projects/graphics/json-pushes
283687740 l10n-central/pt-BR
227922785 projects/cedar/json-pushes
211938973 projects/date/json-pushes
193656544 build/ash-mozharness
160807916 projects/jamun/json-pushes
139600810 releases/comm-aurora
122206275 gaia-l10n/en-US
I've already filed a few bugs to track some obvious problems. What's shocking to me is that build/tools is only a 29 MB repo but it is transferring more than every repository except gaia-central. Ouch.
Comment 4•11 years ago
Total CPU time per repository / URL path:
38460 projects/fig/json-pushes
35368 build/tools
32895 integration/gaia-central
10609 integration/mozilla-inbound
8293 build/mozharness
5638 mozilla-central
4830 releases/mozilla-beta
3588 integration/gaia-2_1
3131 releases/mozilla-aurora
2842 try
2568 build/talos
1548 projects/maple
1528 try/json-pushes
1257 projects/ash
853 integration/gaia-2_0
682 projects/cypress/json-pushes
518 integration/fx-team
388 projects/cedar
371 integration/gaia-1_4
263 l10n-central/pl
223 comm-central
219 projects/holly/json-pushes
209 projects/pine/json-pushes
189 projects/ux/json-pushes
176 l10n-central/es-ES
174 releases/comm-beta
167 projects/ionmonkey/json-pushes
161 services/services-central/json-pushes
154 integration/b2g-inbound
144 users/stage-ffxbld/tools
135 projects/elm/json-pushes
135 mozilla-central/annotate/0bed6fc0b0b7/configure.in
133 projects/build-system/json-pushes
127 releases/mozilla-release
125 l10n-central/pt-BR
121 build/ash-mozharness
108 projects/larch/json-pushes
108 projects/ash/json-pushes
107 users/Callek_gmail.com/tools
106 gaia-l10n/en-US
IMO integration/mozilla-inbound should be #1. The items above it are all bugs.
Comment 5•11 years ago
Total response size (bytes) for all hgweb nodes for Nov 10 UTC:
1992687013298 integration/gaia-central
1049664755232 build/tools
762034508701 projects/fig/json-pushes
182886245712 mozilla-central
155743621405 releases/mozilla-beta
149080455046 integration/mozilla-inbound
122715571509 integration/gaia-2_1
114123852497 build/mozharness
101884717657 build/talos
86674227517 integration/gaia-2_0
79790138227 projects/maple
71777273460 releases/mozilla-aurora
34514508687 projects/cedar
28572272607 projects/ash
27587098727 integration/gaia-1_4
25165673944 projects/cypress/json-pushes
20745737350 try
9010094078 projects/pine/json-pushes
7524875339 projects/holly/json-pushes
6655613549 releases/comm-beta
6624339681 projects/elm/json-pushes
6346144306 integration/fx-team
5837963779 comm-central
5732022028 projects/ux/json-pushes
5433899292 projects/ionmonkey/json-pushes
5374096650 l10n-central/pl
5069931793 services/services-central/json-pushes
5037247503 integration/b2g-inbound
4298728516 users/Callek_gmail.com/tools
4205904989 users/stage-ffxbld/tools
4074232038 projects/ash/json-pushes
3750094202 projects/build-system/json-pushes
3571335520 projects/larch/json-pushes
3171262433 l10n-central/es-ES
2949379660 releases/mozilla-release
2672131460 releases/mozilla-b2g32_v2_0
2569241108 l10n-central/pt-BR
2538777823 releases/mozilla-b2g30_v1_4
2303840600 releases/comm-aurora
1973782077 projects/cedar/json-pushes
1950817230 projects/graphics/json-pushes
1950144199 projects/date/json-pushes
1781215071 releases/mozilla-esr31
1521218934 projects/jamun/json-pushes
1405078180 build/ash-mozharness
1159592537 gaia-l10n/en-US
1137746745 releases/l10n/mozilla-beta/pl
Firefox release automation is DoSing itself.
Comment 6•11 years ago
To put these numbers in perspective, a 1 Gbps link can deliver ~125 MB/s, or 10,800,000 MB = 10,800 GB (~10.8 TB) per day.
In terms of saturation of a 1 Gbps link over a day:
Contrib Cumul Repo/URL
18.45% 18.45% integration/gaia-central
9.72% 28.17% build/tools
7.06% 35.23% projects/fig/json-pushes
1.69% 36.92% mozilla-central
1.44% 38.36% releases/mozilla-beta
1.38% 39.74% integration/mozilla-inbound
1.14% 40.88% integration/gaia-2_1
1.06% 41.93% build/mozharness
0.94% 42.88% build/talos
0.80% 43.68% integration/gaia-2_0
0.74% 44.42% projects/maple
0.66% 45.08% releases/mozilla-aurora
0.32% 45.40% projects/cedar
0.26% 45.67% projects/ash
0.26% 45.92% integration/gaia-1_4
0.23% 46.16% projects/cypress/json-pushes
0.19% 46.35% try
So, yeah, cloning/pulling our top 17 repositories can get a 1 Gbps link to ~50% saturation.
This would arguably be tolerable. However, load isn't evenly distributed (it is concentrated when the sun is over California), and we have other things occurring over the link (like FTP traffic).
We need to significantly reduce the bandwidth used by Mercurial clones.
While we can deploy things on the Mercurial server to reduce bandwidth, I still think release automation has a long ways to go to cut down on excessive clones and pulls.
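For reference, a minimal sketch (assuming the 1 Gbps ~= 10,800,000 MB/day figure above; byte counts copied from the comment 5 totals) showing how the percentages in this table are derived:

# Sketch: reproduce the saturation table from per-repo daily byte totals.
# Assumes a 1 Gbps link ~= 125 MB/s ~= 10,800,000 MB/day.
LINK_BYTES_PER_DAY = 125_000_000 * 86400  # ~1.08e13 bytes

# (bytes, repo/URL) pairs from the Nov 10 totals in comment 5 (truncated).
totals = [
    (1992687013298, "integration/gaia-central"),
    (1049664755232, "build/tools"),
    (762034508701, "projects/fig/json-pushes"),
    (182886245712, "mozilla-central"),
]

cumulative = 0.0
print("Contrib  Cumul   Repo/URL")
for nbytes, repo in sorted(totals, reverse=True):
    contrib = 100.0 * nbytes / LINK_BYTES_PER_DAY
    cumulative += contrib
    print(f"{contrib:6.2f}% {cumulative:6.2f}% {repo}")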
Comment 7•11 years ago
For the official record and so credit is properly given, I didn't realize we were wasting quite so much bandwidth in Mercurial land until catlee pasted a link in IRC showing the overall traffic of hg.mozilla.org being greater than the overall traffic for ftp. I was shocked by this and quite frankly didn't believe it at first. Judging by catlee's reaction on IRC, I don't think he believed it either :) I wrote a quick script for dumping per-repository totals from the Mercurial serverlog extension logs to a) validate the claim and b) isolate where all that data was going. That quickly revealed some numbers (pasted above) that validated the data catlee pulled, were significantly larger than I expected, and made it really easy to identify some oddities.
As they say, data is king. I hope when all of this is over we have more accessible and easy-to-use dashboards for quickly finding critical data like this.
As an aside, those of us maintaining the Mercurial servers have been staring at the total bandwidth number for a while. You can clearly see it in the "Network Graph" at https://nigelbabu.github.io/hgstats/. However, the unit is simply "M." I assumed that was Mbps. Looking at it now, the underlying data source is in octets, so it is actually MBps - 8x larger than I thought. I wonder how many more years I need to be an engineer before I realize not to make assumptions about bits vs bytes.
Finally, https://graphite.mozilla.org/render?from=-6weeks&until=now&width=586&height=308&_salt=1408010393.594&yMax=&yMin=&target=sumSeries%28hosts.hgweb*_dmz_scl3_mozilla_com.interface.if_octets.bond0.tx%29&title=Outbound%20Network%20Traffic&hideLegend=true&_uniq=0.8170505894781112 shows that hg.mozilla.org peak network bandwidth increased dramatically around Oct 21. I wonder what triggered that...
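For posterity, a rough sketch of the kind of script described above (hypothetical, not the actual one; it assumes whitespace-delimited serverlog lines with the URL path and response size at known field offsets, which may not match the real log format):

#!/usr/bin/env python
# Sketch: sum response bytes per repository from serverlog-style access logs.
# Assumptions (adjust for the real format): whitespace-delimited lines, URL
# path in field index 2 and response size in bytes in field index 3.
# Usage: python repo_totals.py < access.log
import sys
from collections import defaultdict

totals = defaultdict(int)
for line in sys.stdin:
    fields = line.split()
    if len(fields) < 4:
        continue
    path, size = fields[2], fields[3]
    if not size.isdigit():
        continue
    # Normalize "/repo/cmd?args" down to the repository/URL path.
    repo = path.lstrip("/").split("?")[0]
    totals[repo] += int(size)

for repo, nbytes in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(nbytes, repo)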
Comment 9•11 years ago
Is the traffic coming from hg's and ftp's public facing interfaces?
host hg.mozilla.org
hg.mozilla.org has address 63.245.215.25
host ftp.mozilla.org
ftp.mozilla.org has address 63.245.215.56
ftp.mozilla.org has address 63.245.215.46
and going to destinations on the Internet?
Thanks.
Comment 10•11 years ago
My understanding is that traffic is going from hg.mozilla.org's public interfaces both to the public internet *and* to internal consumers, like fw1.releng.scl3.
One important thing I left out is that during "network events," traffic is slow both to fw1 *and* to the public internet. Assuming fw1 is not in the path of hg.mozilla.org -> EC2 releng infra, this seems to imply that there is a network issue outside of fw1.
Also, please read http://gregoryszorc.com/blog/2014/11/07/mercurial-server-hiccup-2014-11-06/ for my reasoning for pinning this on network capacity instead of something else. tl;dr: the hg servers have plenty of idle resources; they just can't seem to send bits to the network fast enough. Operations that commonly take 4s to transfer 20MB are taking >100s.
Comment 11•11 years ago
It would be good if we knew what percentage of the traffic is going from where to where.
Traffic going from hg.m.o and/or ftp.m.o out to the Internet... we need to look into that for sure. There should be no bottlenecks there. Really, outbound packets should go from server -> switch -> border router -> border router -> Internet, all at 10 Gbps, and the load on those three network devices is a small fraction of what they are capable of.
For traffic passing through fw1.releng.scl3, do we know the actual destination? I'm wondering if that traffic is going to AWS. If so, the path those packets take is complicated and disappears into the AWS cloud, which we don't have visibility into. =-(
Yeah, more info is needed here. I don't know how these servers are connected. I know *where* they are connected in SCL3, as in which VLAN, but not whether they are connected via 1G or 10G, or how the load balancers are involved (not pointing a finger, just including them in the path if they should be).
Yes, I read your blog yesterday. Good sleuthing on your part!
Hey, aren't you supposed to be taking the day off? =-)
Comment 12•11 years ago
Should add something:
> Assuming fw1 is not in the path of hg.mozilla.org -> EC2 releng infra,
> this seems to imply that there is a network issue outside of fw1.
It depends on which EC2 instance the packets are going to.
We have IPSec tunnels from both fw1.releng.scl3 and fw1.scl3 to AWS.
Here's the breakdown WRT destination IP addresses:
fw1.scl3:
-------------------
10.136/16
10.128/16
10.152/16
fw1.releng.scl3:
-------------------
10.132/16
10.130/16
10.134/16
This has got me curious, because if hg.m.o or ftp.m.o sent packets destined for any of the above-mentioned IP space, those packets would hit core1.scl3 or core2.scl3 (a mated pair of switch/routers), which do not have routes for any of these /16s.
I wonder what the routing table looks like (netstat -rn) on hg.m.o and ftp.m.o. Specifically, do they send traffic destined for this address space out an interface other than their public-facing one?
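As an aside, a quick sketch (a hypothetical helper using Python's stdlib ipaddress module; the /16 lists are the ones above, and the 10.x example address is made up) for checking which IPsec tunnel a destination address would fall under:

# Sketch: map a destination IP to the firewall whose IPsec tunnel covers it,
# based on the /16 breakdown above.
import ipaddress

TUNNELS = {
    "fw1.scl3": ["10.136.0.0/16", "10.128.0.0/16", "10.152.0.0/16"],
    "fw1.releng.scl3": ["10.132.0.0/16", "10.130.0.0/16", "10.134.0.0/16"],
}

def tunnel_for(ip):
    addr = ipaddress.ip_address(ip)
    for fw, nets in TUNNELS.items():
        if any(addr in ipaddress.ip_network(net) for net in nets):
            return fw
    return None  # not covered by either tunnel (e.g. public AWS space)

print(tunnel_for("10.134.48.20"))   # -> fw1.releng.scl3 (hypothetical address)
print(tunnel_for("54.148.43.165"))  # -> None (public AWS address from the logs)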
Comment 13•11 years ago
This is a list of times for hgweb1 to serve "bundle" requests for the build/tools repository on UTC day Nov 10 (partial day coverage only). A "bundle" request is an `hg clone` or `hg pull`. Most of these are clones (you can tell by the ~20MB response size).
Columns are:
* Wall time to serve request
* IP address of requester. IP comes from X-CLUSTER-CLIENT-IP HTTP request header or REMOTE_ADDR environment variable (presumably set by Apache to the remote IP address).
* Response bytes
* Time request arrived
Average response time: 40.5s
Median response time: 8.8s
As you can see, the response times at the bottom are quite extreme.
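A minimal sketch (assuming the attachment is a plain-text file named tools-bundle-times with the four whitespace-separated columns listed above, wall time first and no header line) of how the average/median were computed:

# Sketch: average and median wall time from the tools-bundle-times attachment.
# Assumed columns: wall_time requester_ip response_bytes start_time.
import statistics
import sys

times = []
with open(sys.argv[1] if len(sys.argv) > 1 else "tools-bundle-times") as fh:
    for line in fh:
        fields = line.split()
        if fields:
            times.append(float(fields[0]))

print(f"Average response time: {statistics.mean(times):.1f}s")
print(f"Median response time: {statistics.median(times):.1f}s")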
Comment 14•11 years ago
Note that hgwebN.dmz.scl3 are behind Zeus, so I don't think routing tables on those hosts matter too much.
I have no clue how Zeus is configured. bkero is the subject expert from my perspective.
Comment 15•11 years ago
Ah. Those requests from the 54/8 address space are from various AWS sites, and they are using their public-facing interface to avoid (offload) the IPsec tunnels.
And (as I'm sure you've already figured out) 63.245.214.82 is the NAT IP for fw1.releng.scl3.
If we ramp up traffic to/from the VLANs behind fw1.releng.scl3 to hg.m.o/ftp.m.o, then that firewall could be a bottleneck.
However, that doesn't explain why traffic between AWS (public space) and hg.m.o/ftp.m.o experiences an issue as well. If the slow connections happen at the same time, to both locations, that's a big data point to me. Before I write a script to order the lines in your attachment by date/time, do we already know this to be the case?
Thanks!
Comment 16•11 years ago
I am very curious about how Zeus is configured.
The hgwebN.dmz.scl3 hosts may sit behind a firewall (fw1.scl3.mozilla.net), depending on how Zeus connects to them.
Comment 17•11 years ago
awk '{print $4, $1, $2, $3}' tools-bundle-times | sort -n
Start looking around 11-10T14:00:00. Everything is fine.
We start seeing some long requests:
11-10T14:13:57 52.557 54.165.75.203 19956506
11-10T14:14:02 48.249 63.245.214.82 19956506
11-10T14:14:03 46.507 54.173.41.136 19956506
11-10T14:15:44 8.793 63.245.214.82 19956506
11-10T14:16:16 5.979 54.148.37.234 19956506
11-10T14:16:17 20.412 54.173.162.237 19956506
Then some very long ones:
11-10T14:24:07 361.713 54.148.53.5 19956506
11-10T14:27:13 433.966 63.245.214.82 19956506
11-10T14:33:47 193.295 63.245.214.82 19956506
11-10T14:34:06 200.85 54.148.52.195 19956506
11-10T14:34:21 719.954 63.245.214.82 19956506
11-10T14:34:46 207.176 54.173.165.17 19956506
11-10T14:36:03 41.733 63.245.214.82 19956506
11-10T14:36:20 34.411 63.245.214.82 19956506
11-10T14:37:02 8.252 63.245.214.82 19956506
11-10T14:37:17 765.684 63.245.214.82 19956506
11-10T14:37:18 757.881 63.245.214.82 19956506
By 15:00 things are in a royal mess:
11-10T15:05:56 999.702 54.148.43.165 14879752
11-10T15:05:56 999.899 63.245.214.82 15075600
11-10T15:06:01 1370.844 54.173.99.241 19956506
11-10T15:06:43 12.078 63.245.214.82 19956506
11-10T15:07:03 8.326 54.148.46.20 19956506
11-10T15:07:12 9.244 63.245.214.82 19956506
11-10T15:07:30 8.447 54.148.51.71 19956506
11-10T15:07:58 1319.753 63.245.214.82 18164098
11-10T15:08:00 7.995 54.173.59.194 19956506
11-10T15:08:09 999.944 63.245.214.82 14069530
11-10T15:08:32 14.494 54.173.189.124 19956506
11-10T15:09:14 8.768 54.173.180.137 19956506
11-10T15:09:24 1319.101 54.69.126.186 18565592
That's 1300s to transfer 20MB - ~16KB/s on average.
Comment 18•11 years ago
(In reply to Dave Curado :dcurado from comment #11)
> Yeah, more info is needed here. I don't know how these servers are connected.
> I know *where* they are connected in SCL3, as in which VLAN, but not whether
> they are connected via 1G or 10G, or how the load balancers are involved
> (not pointing a finger, just including them in the path if they should be).
For posterity - the load balancers are limited to 2Gbps by license, and all traffic to/from hg.m.o travels through them.
Comment 19•11 years ago
Apologies -- I don't have any idea how the load balancers are set up.
(I'm looking at the info in mana, but I think I need a network diagram...)
Does this mean that we can get 2 Gbps in total, or 2 Gbps through each device in the load balancer cluster? (And are there 4 in the SCL3 cluster?)
Thanks very much.
Comment 20•11 years ago
Good question, and I don't know either! :-) I think, because of the way we normally do things, that we're looking at 2 Gbps in total.
The mana doc says it's "per node" but the linked-to Stingray docs do not. If it's per Zeus node, the next question is: does that mean the ZLB VIP needs to be active on all Zeus nodes? The default is that the VIP runs active on one Zeus node and is passive on all others for failover. Perhaps if we ran active-active we would get (2 Gbps * nodes).
Dropping a needinfo on :jakem, since I think he'll have the answers we want.
Flags: needinfo?(nmaul)
Comment 21•11 years ago
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/82612530
Comment 22•11 years ago
In SCL3 we have 2 Gbps of bandwidth per ZLB node: a node can push up to 2 Gbps before Stingray starts throttling it internally. With 4 nodes, that means a total possible bandwidth of 8 Gbps.
However, as :fubar correctly ascertained, hg.m.o (and most) VIPs are active on only one node at a time, effectively limiting them to 2 Gbps.
ftp.m.o is one that is done differently... it is load balanced across 2 IPs, which are active on separate ZLB nodes, giving it an effective theoretical capacity of 4 Gbps. I say 'theoretical' because that 4 Gbps is obviously shared with everything else that happens to be on those 2 nodes.
There is also ftp-ssl.m.o, which is split among 3 VIPs (3 separate ZLB nodes). As I recall, this is purely a "workaround" name having something to do with saturation of the VPC (or perhaps a firewall), and is only used by RelEng... it may be a factor here as well.
hg.m.o is active on zlb5
ftp.m.o is active on zlb5 and zlb6
ftp-ssl.m.o is active on zlb1, zlb3, and zlb5
Therefore, hg.m.o's theoretical 2 Gbps max is shared with 1/2 of ftp.m.o's theoretical 4 Gbps max, as well as 1/3 of ftp-ssl.m.o's theoretical 6 Gbps max.
This does mean that if you're pulling heavily from both hg and ftp at the same time, they will tend to interfere with each other more than might otherwise be expected (that is, outside of any relationship they may or may not share with respect to firewalls).
We could add another IP to hg.m.o and distribute it similarly to the way ftp.m.o is. There are only 4 ZLB nodes in this cluster, so it's impossible for these 3 services to be entirely disjoint (unless we shrink one or more of them)... but that doesn't mean we can't do better than it currently is.
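To make the sharing concrete, a small sketch (assuming 2 Gbps per node and the VIP placements listed above) of which services contend for each ZLB node:

# Sketch of the ZLB capacity sharing described above.
# Assumptions: 2 Gbps per node, VIP placements as listed in this comment.
NODE_CAPACITY_GBPS = 2

vip_nodes = {
    "hg.m.o": ["zlb5"],
    "ftp.m.o": ["zlb5", "zlb6"],
    "ftp-ssl.m.o": ["zlb1", "zlb3", "zlb5"],
}

# Invert the mapping: which VIPs are active on each node?
node_vips = {}
for vip, nodes in vip_nodes.items():
    for node in nodes:
        node_vips.setdefault(node, []).append(vip)

for node, vips in sorted(node_vips.items()):
    print(f"{node}: {NODE_CAPACITY_GBPS} Gbps shared by {', '.join(vips)}")
# zlb5 ends up shared by hg.m.o, ftp.m.o, and ftp-ssl.m.o, which is why heavy
# ftp/ftp-ssl traffic eats into hg.m.o's effective 2 Gbps ceiling.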
Flags: needinfo?(nmaul)
Comment 23•11 years ago
I think perhaps we found the "network resistor" in this path?
We were pushing 5 Gbps out the front door of SCL3 at this time, and sending ~2 Gbps from the border routers towards fw1.releng.scl3.
That's what I tend to look at: all the interfaces on the 2 border routers at the head end of SCL3. Traffic from the VLAN(s) supporting the load balancers going into the releng networks (and vice versa) has to traverse those head-end border routers, so measuring throughput there gives a pretty accurate picture.
Comment 24•11 years ago
Thanks for the offer to distribute hg.mozilla.org across multiple IPs! However, unless it becomes SOP, I don't believe it is necessary anymore.
We identified a lot of "dumb" traffic we shouldn't have been generating. We've already made giant leaps toward reducing the traffic and we have more in the works (see dependent bugs).
You can clearly see a drop-off in hg.mozilla.org this week at https://graphite.mozilla.org/render?from=-14days&until=now&width=1000&height=600&_salt=1408010393.594&yMax=&yMin=&target=sumSeries%28hosts.hgweb*_dmz_scl3_mozilla_com.interface.if_octets.bond0.tx%29&title=Outbound%20Network%20Traffic&hideLegend=true&_uniq=0.8170505894781112. Expect a similar drop next week if the remaining patches all land and get deployed.
Comment 25•11 years ago
Wow, great work in reducing the traffic load!
Comment 26•11 years ago
(In reply to Gregory Szorc [:gps] from comment #24)
> We identified a lot of "dumb" traffic we shouldn't have been generating.
> We've already made giant leaps toward reducing the traffic and we have more
> in the works (see dependent bugs).
Awesome, that certainly helps buildduty! Keep up the good work! =)
Comment 27•11 years ago
I plotted the per-repository bandwidth usage over time. You can find results at https://docs.google.com/spreadsheets/d/1EMhKWnKLLA24jX4cPmLcBWq6Fuk4tVJBEN4BH1D25d0/pubchart?oid=1200300937&format=interactive
If you are running e10s, you may need to disable that. Or just load the link in another browser.
Google Docs won't let me save defaults for this chart type. You'll want to:
1) Change X axis (bottom axis) to "Order: Alphabetical"
2) Under "Size" in the top right, change to "Bytes Sent"
3) Optionally change "Color" to something else to make it slightly more readable
When you move the date slider, you'll notice a few things:
1) A handful of repositories are "long poles" (build/tools, integration/gaia-central, fig pushlog, etc) and are sucking up absurd amounts of bandwidth.
2) gaia-central appears to be the main culprit behind our 200 Mbps traffic increase starting Oct 21. Still no clue what changed.
3) It appears all the major bad actors are being tracked by this bug. Resolving the dependent bugs will drastically reduce overall traffic load.
Comment 28•11 years ago
Moving to a component appropriate for ongoing discussion and changing summary to be more specific.
Component: Buildduty → Mercurial: hg.mozilla.org
Product: Release Engineering → Developer Services
QA Contact: bugspam.Callek → hwine
Summary: Trees suffering on timeouts - Trees closed → Excessive requests to hg.mozilla.org exhaust network capacity
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4134]
Comment 29•11 years ago
> fw1.scl3:
> -------------------
> 10.136/16
> 10.128/16
> 10.152/16
>
> fw1.releng.scl3:
> -------------------
> 10.132/16
> 10.130/16
> 10.134/16
Just to toss this out there for future reference -- all releng buildslaves, and thus all releng traffic sources, are in 10.132/16 and 10.134/16 (releng.use1 and releng.usw2, respectively). 10.130/16 is releng.usw1, but is unused due to high instance prices.
Comment 30•10 years ago
Adjusting status to reflect current reality. This event is over, but we started using this bug as a tracker.
Severity: blocker → normal
Keywords: meta
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4134] → [tracker] [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4134]
Comment 31•9 years ago
Closing. Using bundles has moved a lot of traffic off Zeus, and I don't see a reason to keep this open, even as a tracker.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED