Closed Bug 1165317 Opened 9 years ago Closed 9 years ago

hg error 503 when trying to pull from mozilla-central etc

Categories

(Infrastructure & Operations :: MOC: Problems, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: Usul)

References

Details

Attachments

(4 files)

seems hg is currently running into problems: abort: HTTP Error 503: Service Temporarily Unavailable

trees are closed anyway for bug 1165311
Component: Buildduty → MOC: Problems
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → lypulong
Version: unspecified → other
<michal`> longest connections are from AWS
<michal`> connections sending more data are from AWS
Usul: I'm trying to look at hg logs at the load balancer. ericz is looking to see how many more connections than normal we're seeing.
(In reply to Carsten Book [:Tomcat] from comment #0)

> trees are closed anyway for bug 1165311
 
note that was for the integration trees, now all trees are closed
Depends on: 1165336
<Usul> did you guys do anything ?
<cyliang> Usul I don't think so.  eric.z has moved on to looking at higher than normal download traffic.
<cyliang> I'm still trying to see if there's something odd in the requests being made (i.e. asking for more outgoing data)
<fubar|pto> http://nigelbabu.github.io/hgstats/#hours/12
<wesley> fubar|pto's shortened url is http://tinyurl.com/p3wtluu
<fubar|pto> outbound traffic went to 5g
<cyliang> I don't believe that I'm seeing that many more requests 
<Usul> nope webheads recovered

Please leave open for now.
trees reopened at 2015-05-15T14:44:49 UTC per request from usul, still monitoring the trees
I've paged hwine @ 51

 Fri 07:52:53 PDT [5359] hgweb10.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 64 out of 64 Clients (http://m.mozilla.org/httpd+max+clients)
<nagios-scl3> SMS from hwine: Yesterday we saw spike that self resolved. Potential dos attack
(In reply to Carsten Book [:Tomcat] from comment #6)
> trees reopened at 2015-05-15T14:44:49 UTC per request from usul, still
> monitoring the trees

all trees closed again
'/repo_local/mozilla/webroot_wsgi/build/hgweb.wsgi'.
<ericz> [Fri May 15 14:58:39 2015] [error] [client 10.22.74.212] IOError: failed to write data
From the perspective of the hg.mozilla.org servers, they behaved exactly as expected. httpd worker pool was fully saturated *and* CPU was also fully pegged. They started issuing 503s, which is the universal HTTP signal for "back off." That's pretty much the best you can do at the application level when confronted with an effective DoS attack. The servers were saturating their 1Gbps network connection to the load balancer at times. Aggregate throughput to the load balancer exceeded 5.5 Gbps at one time. Unlike previous events where load balancer throughput (due to license limits) was an obvious cause, that isn't the case here (or at least it isn't apparent).
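For reference, honoring that "back off" signal on the client side looks roughly like the following, a minimal Python sketch assuming a job that wraps `hg pull` (the retry counts and delays are made up for illustration, not what any affected client actually ran):

import subprocess
import time

def pull_with_backoff(repo_url, dest, max_attempts=5, base_delay=30):
    """Run `hg pull`, backing off exponentially when the server answers 503."""
    for attempt in range(max_attempts):
        result = subprocess.run(
            ["hg", "pull", repo_url],
            cwd=dest,
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return True
        # hg surfaces the server response on stderr, e.g.
        # "abort: HTTP Error 503: Service Temporarily Unavailable"
        if "503" in result.stderr:
            delay = base_delay * (2 ** attempt)
            time.sleep(delay)
            continue
        # Any other failure is not something backoff will fix.
        raise RuntimeError(result.stderr.strip())
    return False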

The spike in server traffic was driven by a drastic increase in requests to clone the integration/gaia-central repository. During UTC hours 1300-1500, response bytes for gaia-central were >10x those of the next-busiest repository. The clone volume was abnormally high: properly behaving clients should not be performing full clones. Instead, they should use bundle files to seed the local clone and then perform incremental pulls. Eventually, this feature will be built into Mercurial so clients don't need to implement extra smarts to avoid putting too much load on the server.

tl;dr: hg.mozilla.org was DoS'd by requests to clone gaia-central.
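A minimal Python sketch of the bundle-seeding approach described above, assuming a pre-generated bundle file is published somewhere (the BUNDLE_URL below is hypothetical, not a real endpoint):

import os
import subprocess
import urllib.request

# Hypothetical bundle location, for illustration only.
BUNDLE_URL = "https://example.com/bundles/gaia-central.hg"
REPO_URL = "https://hg.mozilla.org/integration/gaia-central"

def seeded_clone(dest):
    """Seed a local clone from a bundle, then do an incremental pull."""
    subprocess.run(["hg", "init", dest], check=True)
    bundle_path = os.path.join(dest, ".seed.hg")
    # Fetching a static bundle is far cheaper for the origin server
    # than answering a full `hg clone`.
    urllib.request.urlretrieve(BUNDLE_URL, bundle_path)
    subprocess.run(["hg", "unbundle", ".seed.hg"], cwd=dest, check=True)
    # Record the real repository as the default path, then pull only the
    # changesets added since the bundle was generated.
    with open(os.path.join(dest, ".hg", "hgrc"), "w") as f:
        f.write("[paths]\ndefault = %s\n" % REPO_URL)
    subprocess.run(["hg", "pull"], cwd=dest, check=True)
    os.remove(bundle_path)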
Here are per-hour totals for gaia-central traffic from one server:

UTC hour                repo                        requests    size (bytes)    wall time (s)   cpu time (s)
2015-05-15T00:00:00	integration/gaia-central	2843	820603537	1056	329
2015-05-15T01:00:00	integration/gaia-central	2019	534118535	740	170
2015-05-15T02:00:00	integration/gaia-central	2127	462797783	603	200
2015-05-15T03:00:00	integration/gaia-central	1729	121510689	144	91
2015-05-15T04:00:00	integration/gaia-central	1803	331652204	421	182
2015-05-15T05:00:00	integration/gaia-central	584	208022903	302	73
2015-05-15T06:00:00	integration/gaia-central	647	100133419	155	55
2015-05-15T07:00:00	integration/gaia-central	585	273066165	383	93
2015-05-15T08:00:00	integration/gaia-central	909	367346849	592	102
2015-05-15T09:00:00	integration/gaia-central	1678	1166182896	1609	403
2015-05-15T10:00:00	integration/gaia-central	2131	1572038970	1947	448
2015-05-15T11:00:00	integration/gaia-central	1870	1106850862	1194	316
2015-05-15T12:00:00	integration/gaia-central	2077	588593800	741	192
2015-05-15T13:00:00	integration/gaia-central	1748	146887179884	72707	35033
2015-05-15T14:00:00	integration/gaia-central	4496	250455249811	122510	60992
2015-05-15T15:00:00	integration/gaia-central	2711	99468444396	21808	15014
2015-05-15T16:00:00	integration/gaia-central	1390	375470364	575	157

You can see that at the peak the request count roughly doubled over the baseline hours, while response bytes grew by more than two orders of magnitude.
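A quick check of those ratios, using the 12:00 hour as baseline and the 14:00 hour as peak (my choice of hours; figures copied from the table above):

# (requests, response bytes) per the table
baseline_requests, baseline_bytes = 2077, 588_593_800       # 2015-05-15T12:00 UTC
peak_requests, peak_bytes = 4496, 250_455_249_811           # 2015-05-15T14:00 UTC

print(round(peak_requests / baseline_requests, 1))   # ~2.2x the request count
print(round(peak_bytes / baseline_bytes))            # ~425x the response bytes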
Trees "reopened" for this cause at 0930 PT (1630 UTC)
(In reply to Ludovic Hirlimann [:Usul] from comment #10)
> http://nigelbabu.github.io/hgstats/#hours/24

that's a dynamic display; the attachment is a snapshot covering the outage period
(In reply to Ludovic Hirlimann [:Usul] from comment #7)
> I've paged hwine @ 51
> 
>  Fri 07:52:53 PDT [5359] hgweb10.dmz.scl3.mozilla.com:httpd max clients is
> WARNING: Using 64 out of 64 Clients (http://m.mozilla.org/httpd+max+clients)
> <nagios-scl3> SMS from hwine: Yesterday we saw spike that self resolved.
> Potential dos attack

Yesterday's spike appeared to originate from user agent "Twitterbot/1.0" across multiple IPs. Several hundred requests were lobbed at hg.mozilla.org in a very narrow time window. This temporarily overwhelmed the httpd worker pool. Requests were made to robots.txt, but the apparent bot ignored its content (which says not to index).

Yesterday's event and today's event are completely separate AFAICT.
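For comparison, a compliant crawler checks robots.txt before fetching. A minimal sketch using the Python standard library (the user agent string and URL are just examples, and this says nothing about how Twitterbot is actually implemented):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://hg.mozilla.org/robots.txt")
rp.read()

# A well-behaved crawler consults this before every fetch; the bot above apparently did not.
print(rp.can_fetch("Twitterbot/1.0", "https://hg.mozilla.org/mozilla-central"))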
Times in PT, data from webops
Times in PT, data from webops
machines in scl3 behind a nat that were hitting hg.m.o
If you tell me a destination and the time window I might be able to tell you which machines behind NAT were hitting HG.

Is the destination hg.mozilla.org or something else?

Also, when is the information necessary? Right away or can wait until Monday?
There was an issue this morning around 13:00Z where our us-east-1 s3 proxy became unavailable. This resulted in taskcluster-vcs cloning from hg.m.o instead of using a cached clone stored in s3. We have since disabled the option to use the proxy, redeployed it, and will re-enable it once sheriffs are satisfied with the trees.

This certainly would have contributed to the increased load, especially for integration/gaia-central.
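The failure mode described here is the classic cache-outage fallback: when the cached clone is unavailable, every consumer falls through to a full clone from the origin at once. A minimal Python sketch of the pattern, with hypothetical URLs and no relation to the actual taskcluster-vcs code:

import os
import subprocess
import urllib.error
import urllib.request

# Hypothetical endpoints, for illustration only.
CACHE_URL = "https://s3-proxy.example.com/clones/gaia-central.tar.gz"
REPO_URL = "https://hg.mozilla.org/integration/gaia-central"

def get_repo(dest):
    """Prefer the cached clone; fall back to hg.mozilla.org only if the cache fails."""
    try:
        archive, _ = urllib.request.urlretrieve(CACHE_URL)
        os.makedirs(dest, exist_ok=True)
        subprocess.run(["tar", "-xzf", archive, "-C", dest], check=True)
        # Catch up on changesets newer than the cached snapshot: cheap for the server.
        subprocess.run(["hg", "pull", REPO_URL], cwd=dest, check=True)
    except (urllib.error.URLError, subprocess.CalledProcessError):
        # Cache outage: this is the expensive path that, taken by every job
        # at once, drove the gaia-central clone spike described above.
        subprocess.run(["hg", "clone", REPO_URL, dest], check=True)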
Do we need to keep this open?
Assignee: nobody → ludovic
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED