Closed
Bug 1165317
Opened 9 years ago
Closed 9 years ago
hg error 503 when trying to pull from mozilla-central etc.
Categories
(Infrastructure & Operations :: MOC: Problems, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Assigned: Usul)
References
Details
Attachments
(4 files)
Seems hg is running into problems currently:

abort: HTTP Error 503: Service Temporarily Unavailable

Trees are closed anyway for bug 1165311.
Comment 1 • 9 years ago (Assignee)
First alert: Fri 06:43:55 PDT [5092] hgweb9.dmz.scl3.mozilla.com:httpd max clients is CRITICAL: (Service Check Timed Out) (http://m.mozilla.org/httpd+max+clients)

This went through all webheads. I've been trying to restart httpd on all of them, but I'm not quick enough :(

https://graphite-scl3.mozilla.org/render/?width=1129&height=661&_salt=1431698425.33&from=-4hours&target=scaleToSeconds%28derivative%28sum%28zeus.zlb1_ops_scl3_mozilla_com.Gbit_in%2C%20zeus.zlb1_ops_scl3_mozilla_com.Gbit_out%29%29%2C1%29&target=scaleToSeconds%28derivative%28sum%28zeus.zlb2_ops_scl3_mozilla_com.Gbit_in%2C%20zeus.zlb2_ops_scl3_mozilla_com.Gbit_out%29%29%2C1%29&target=scaleToSeconds%28derivative%28sum%28zeus.zlb3_ops_scl3_mozilla_com.Gbit_in%2C%20zeus.zlb3_ops_scl3_mozilla_com.Gbit_out%29%29%2C1%29&target=scaleToSeconds%28derivative%28sum%28zeus.zlb4_ops_scl3_mozilla_com.Gbit_in%2C%20zeus.zlb4_ops_scl3_mozilla_com.Gbit_out%29%29%2C1%29&target=scaleToSeconds%28derivative%28sum%28zeus.zlb5_ops_scl3_mozilla_com.Gbit_in%2C%20zeus.zlb5_ops_scl3_mozilla_com.Gbit_out%29%29%2C1%29&target=scaleToSeconds%28derivative%28sum%28zeus.zlb6_ops_scl3_mozilla_com.Gbit_in%2C%20zeus.zlb6_ops_scl3_mozilla_com.Gbit_out%29%29%2C1%29
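Restarting the webheads one at a time is exactly the "not quick enough" problem above. A minimal sketch of doing the restarts in parallel instead; the host list and service command are illustrative assumptions, not the actual runbook:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical host list; the real pool is the hgwebN.dmz.scl3.mozilla.com webheads.
    HOSTS = [f"hgweb{n}.dmz.scl3.mozilla.com" for n in range(1, 13)]

    def restart_httpd(host):
        # Assumes ssh key access and sudo rights; the exact service command is a guess.
        return subprocess.run(["ssh", host, "sudo", "service", "httpd", "restart"],
                              capture_output=True, text=True)

    # Restart all webheads concurrently instead of one at a time.
    with ThreadPoolExecutor(max_workers=len(HOSTS)) as pool:
        for host, result in zip(HOSTS, pool.map(restart_httpd, HOSTS)):
            print(host, "ok" if result.returncode == 0 else "failed")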
Updated • 9 years ago (Assignee)
Component: Buildduty → MOC: Problems
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → lypulong
Version: unspecified → other
Comment 2 • 9 years ago (Assignee)
<michal`> longest connections are from AWS
<michal`> connections sending more data are from AWS
Comment 3 • 9 years ago (Assignee)
Usul: I'm trying to look at the hg logs at the load balancer. ericz is looking to see how many more connections than normal we're seeing.
Comment 4 • 9 years ago (Reporter)
(In reply to Carsten Book [:Tomcat] from comment #0)
> trees are closed anyway for bug 1165311

Note that was for the integration trees; now all trees are closed.
Comment 5 • 9 years ago (Assignee)
<Usul> did you guys do anything ?
<cyliang> Usul I don't think so. eric.z has moved on to looking at higher than normal download traffic.
<cyliang> I'm still trying to see if there's something odd in the requests being made (i.e. asking for more outgoing data)
<fubar|pto> http://nigelbabu.github.io/hgstats/#hours/12
<wesley> fubar|pto's shortened url is http://tinyurl.com/p3wtluu
<fubar|pto> outbound traffic went to 5g
<cyliang> I don't believe that I'm seeing that many more requests
<Usul> nope webheads recovered

Please leave open for now.
Comment 6 • 9 years ago (Reporter)
Trees reopened at 2015-05-15T14:44:49 UTC per request from usul; still monitoring the trees.
Comment 7 • 9 years ago (Assignee)
I've paged hwine @ 51

Fri 07:52:53 PDT [5359] hgweb10.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 64 out of 64 Clients (http://m.mozilla.org/httpd+max+clients)

<nagios-scl3> SMS from hwine: Yesterday we saw spike that self resolved. Potential dos attack
Comment 8 • 9 years ago (Reporter)
(In reply to Carsten Book [:Tomcat] from comment #6)
> trees reopened at 2015-05-15T14:44:49 UTC per request from usul, still
> montoring the trees

All trees closed again.
Comment 9 • 9 years ago (Assignee)
<ericz> [Fri May 15 14:58:39 2015] [error] [client 10.22.74.212] IOError: failed to write data
(script: '/repo_local/mozilla/webroot_wsgi/build/hgweb.wsgi')
Comment 10 • 9 years ago (Assignee)
http://nigelbabu.github.io/hgstats/#hours/24
Comment 11 • 9 years ago
From the perspective of the hg.mozilla.org servers, they behaved exactly as expected. The httpd worker pool was fully saturated *and* CPU was also fully pegged. They started issuing 503s, which is the universal HTTP signal for "back off." That's pretty much the best you can do at the application level when confronted with an effective DoS attack.

The servers were saturating their 1 Gbps network connections to the load balancer at times. Aggregate throughput to the load balancer exceeded 5.5 Gbps at one point. Unlike previous events where load balancer throughput (due to license limits) was an obvious cause, that isn't the case here (or at least it isn't apparent).

The increase in server traffic was driven by a drastic increase in requests to clone the integration/gaia-central repository. During UTC hours 1300-1500, response bytes for gaia-central were >10x more than for the next repository.

The clone volume was abnormally high. Properly behaving clients should not be performing full clones. Instead, they should be using bundle files to seed the local clone and then performing incremental pulls; see the sketch below. Eventually, this feature will be built into Mercurial so clients don't need to implement extra smarts to avoid putting too much load on the server.

tl;dr hg.mozilla.org was DoS'd by requests to clone gaia-central.
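For reference, a minimal sketch of the "seed from a bundle, then pull incrementally" pattern described above, driving plain hg commands from Python. The bundle URL is a hypothetical placeholder; the real bundle locations were published separately.

    import subprocess
    import urllib.request

    # Hypothetical bundle location used only for illustration.
    BUNDLE_URL = "https://example.com/bundles/gaia-central.hg"
    UPSTREAM = "https://hg.mozilla.org/integration/gaia-central"
    REPO = "gaia-central"

    # 1. Download the pre-generated bundle (a static file, cheap for the server to serve).
    urllib.request.urlretrieve(BUNDLE_URL, "gaia-central.bundle")

    # 2. Seed a local repository from the bundle instead of doing a full clone.
    subprocess.run(["hg", "init", REPO], check=True)
    subprocess.run(["hg", "-R", REPO, "unbundle", "gaia-central.bundle"], check=True)

    # 3. Catch up with anything newer than the bundle via a cheap incremental pull.
    subprocess.run(["hg", "-R", REPO, "pull", UPSTREAM], check=True)
    subprocess.run(["hg", "-R", REPO, "update"], check=True)

The expensive part (the bulk of the changegroup) comes from a static file, so only the small incremental pull hits the hgweb WSGI workers.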
Comment 12 • 9 years ago
Here are per-hour totals for gaia-central traffic from one server:

    UTC hour              repo                       requests  size          wall time  cpu time
    2015-05-15T00:00:00   integration/gaia-central   2843      820603537     1056       329
    2015-05-15T01:00:00   integration/gaia-central   2019      534118535     740        170
    2015-05-15T02:00:00   integration/gaia-central   2127      462797783     603        200
    2015-05-15T03:00:00   integration/gaia-central   1729      121510689     144        91
    2015-05-15T04:00:00   integration/gaia-central   1803      331652204     421        182
    2015-05-15T05:00:00   integration/gaia-central   584       208022903     302        73
    2015-05-15T06:00:00   integration/gaia-central   647       100133419     155        55
    2015-05-15T07:00:00   integration/gaia-central   585       273066165     383        93
    2015-05-15T08:00:00   integration/gaia-central   909       367346849     592        102
    2015-05-15T09:00:00   integration/gaia-central   1678      1166182896    1609       403
    2015-05-15T10:00:00   integration/gaia-central   2131      1572038970    1947       448
    2015-05-15T11:00:00   integration/gaia-central   1870      1106850862    1194       316
    2015-05-15T12:00:00   integration/gaia-central   2077      588593800     741        192
    2015-05-15T13:00:00   integration/gaia-central   1748      146887179884  72707      35033
    2015-05-15T14:00:00   integration/gaia-central   4496      250455249811  122510     60992
    2015-05-15T15:00:00   integration/gaia-central   2711      99468444396   21808      15014
    2015-05-15T16:00:00   integration/gaia-central   1390      375470364     575        157

You can see the per-hour transfer size increased by well over 100x compared with the baseline hours during the 1300-1500 UTC window.
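A rough sanity check of what the peak hour implies for a single server, assuming the size column is bytes served during that hour:

    # Two rows from the table above; "size" assumed to be bytes served in that hour.
    peak = 250_455_249_811       # 2015-05-15T14:00:00
    pre_event = 1_572_038_970    # 2015-05-15T10:00:00, before the spike

    print(f"increase over the pre-event hour: ~{peak / pre_event:.0f}x")   # ~159x
    print(f"average outbound rate: ~{peak * 8 / 3600 / 1e9:.2f} Gbps")     # ~0.56 Gbps sustained

That average alone is more than half of a webhead's 1 Gbps link, consistent with the saturation described in comment 11.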
Comment 13 • 9 years ago
Trees "reopened" for this cause at 0930 PT (1630 UTC)
Comment 14 • 9 years ago
(In reply to Ludovic Hirlimann [:Usul] from comment #10)
> http://nigelbabu.github.io/hgstats/#hours/24

That's a dynamic display; the attachment is a snapshot covering the outage period.
Comment 15 • 9 years ago
(In reply to Ludovic Hirlimann [:Usul] from comment #7)
> I've paged hwine @ 51
>
> Fri 07:52:53 PDT [5359] hgweb10.dmz.scl3.mozilla.com:httpd max clients is
> WARNING: Using 64 out of 64 Clients (http://m.mozilla.org/httpd+max+clients)
> <nagios-scl3> SMS from hwine: Yesterday we saw spike that self resolved.
> Potential dos attack

Yesterday's spike appeared to originate from user agent "Twitterbot/1.0" across multiple IPs. Several hundred requests were lobbed at hg.mozilla.org in a very narrow time window. This temporarily overwhelmed the httpd worker pool. Requests were made to robots.txt, but the apparent bot ignored its content (which says not to index). Yesterday's event and today's event are completely separate AFAICT.
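For comparison, a minimal sketch of what compliant crawler behaviour looks like; the user agent string comes from above, but the example URL is hypothetical:

    from urllib import robotparser

    # A well-behaved crawler consults robots.txt before fetching anything.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://hg.mozilla.org/robots.txt")
    rp.read()

    url = "https://hg.mozilla.org/integration/gaia-central/rev/tip"  # hypothetical target
    if rp.can_fetch("Twitterbot/1.0", url):
        print("allowed to fetch", url)
    else:
        print("robots.txt disallows", url)  # a compliant bot stops here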
Comment 16 • 9 years ago
Times in PT, data from webops
Comment 17 • 9 years ago
Times in PT, data from webops
Comment 18 • 9 years ago
Machines in scl3 behind a NAT that were hitting hg.m.o
Comment 19 • 9 years ago
If you tell me a destination and the time window, I might be able to tell you which machines behind NAT were hitting HG. Is the destination hg.mozilla.org or something else? Also, when is the information needed? Right away, or can it wait until Monday?
Comment 20 • 9 years ago
There was an issue this morning around 13:00Z where our us-east-1 S3 proxy became unavailable. This resulted in taskcluster-vcs cloning from hg.m.o instead of using a cached clone stored in S3. We have since disabled the option to use the proxy, redeployed it, and will re-enable it once sheriffs are satisfied with the trees. This certainly would have contributed to the increased load, especially for integration/gaia-central.
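A minimal sketch of the cache-then-fallback pattern described here (endpoints and helper names are hypothetical, not taskcluster-vcs's actual implementation); the point is that when the S3 proxy fails, every worker falls through to a full clone from hg.m.o at the same time:

    import subprocess
    import urllib.error
    import urllib.request

    # Hypothetical endpoints for illustration only.
    CACHE_URL = "https://s3-proxy.example.com/clones/gaia-central.tar.gz"
    UPSTREAM = "https://hg.mozilla.org/integration/gaia-central"

    def get_repo(dest="gaia-central"):
        try:
            # Preferred path: fetch a pre-built clone archive from the regional cache.
            urllib.request.urlretrieve(CACHE_URL, "gaia-central.tar.gz")
            subprocess.run(["tar", "-xzf", "gaia-central.tar.gz"], check=True)
        except urllib.error.URLError:
            # Fallback path: a full clone straight from hg.mozilla.org.
            # With the cache down, every worker takes this branch at once,
            # which is the extra load described above.
            subprocess.run(["hg", "clone", UPSTREAM, dest], check=True)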
Updated • 9 years ago (Assignee)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED