Closed Bug 1165317 Opened 9 years ago Closed 9 years ago

hg error 503 when trying to pull from mozilla-central etc

Categories

(Infrastructure & Operations :: MOC: Problems, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: Usul)

References

Details

Attachments

(4 files)

seems hg is currently running into problems: abort: HTTP Error 503: Service Temporarily Unavailable

trees are closed anyway for bug 1165311
Component: Buildduty → MOC: Problems
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → lypulong
Version: unspecified → other
<michal`> longest connections are from AWS
<michal`> connections sending more data are from AWS
Usul: I'm trying to look at hg logs at the load balancer. ericz is looking to see how many more connections than normal we're seeing.
(In reply to Carsten Book [:Tomcat] from comment #0)

> trees are closed anyway for bug 1165311
 
note that was for the integration trees, now all trees are closed
Depends on: 1165336
<Usul> did you guys do anything ?
<cyliang> Usul I don't think so.  eric.z has moved on to looking at higher than normal download traffic.
<cyliang> I'm still trying to see if there's something odd in the requests being made (i.e. asking for more outgoing data)
<fubar|pto> http://nigelbabu.github.io/hgstats/#hours/12
<wesley> fubar|pto's shortened url is http://tinyurl.com/p3wtluu
<fubar|pto> outbound traffic went to 5g
<cyliang> I don't believe that I'm seeing that many more requests 
<Usul> nope webheads recovered

Please leave open for now.
trees reopened at 2015-05-15T14:44:49 UTC per request from usul, still monitoring the trees
I've paged hwine @ 51

 Fri 07:52:53 PDT [5359] hgweb10.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 64 out of 64 Clients (http://m.mozilla.org/httpd+max+clients)
<nagios-scl3> SMS from hwine: Yesterday we saw spike that self resolved. Potential dos attack
(In reply to Carsten Book [:Tomcat] from comment #6)
> trees reopened at 2015-05-15T14:44:49 UTC per request from usul, still
> monitoring the trees

all trees closed again
'/repo_local/mozilla/webroot_wsgi/build/hgweb.wsgi'.
<ericz> [Fri May 15 14:58:39 2015] [error] [client 10.22.74.212] IOError: failed to write data
From the perspective of the hg.mozilla.org servers, they behaved exactly as expected. httpd worker pool was fully saturated *and* CPU was also fully pegged. They started issuing 503s, which is the universal HTTP signal for "back off." That's pretty much the best you can do at the application level when confronted with an effective DoS attack. The servers were saturating their 1Gbps network connection to the load balancer at times. Aggregate throughput to the load balancer exceeded 5.5 Gbps at one time. Unlike previous events where load balancer throughput (due to license limits) was an obvious cause, that isn't the case here (or at least it isn't apparent).
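For reference, honoring that "back off" signal on the client side looks roughly like the following, a minimal Python sketch assuming a job that wraps `hg pull` (the retry counts and delays are made up for illustration, not what any affected client actually ran):

import subprocess
import time

def pull_with_backoff(repo_url, dest, max_attempts=5, base_delay=30):
    """Run `hg pull`, backing off exponentially when the server answers 503."""
    for attempt in range(max_attempts):
        result = subprocess.run(
            ["hg", "pull", repo_url],
            cwd=dest,
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return True
        # hg surfaces the server response on stderr, e.g.
        # "abort: HTTP Error 503: Service Temporarily Unavailable"
        if "503" in result.stderr:
            delay = base_delay * (2 ** attempt)
            time.sleep(delay)
            continue
        # Any other failure is not something backoff will fix.
        raise RuntimeError(result.stderr.strip())
    return False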

The spike in server traffic was driven by a drastic increase in requests to clone the integration/gaia-central repository. During UTC hours 1300-1500, response bytes for gaia-central were >10x those of the next-busiest repository. The clone volume was abnormally high: properly behaving clients should not be performing full clones. Instead, they should use bundle files to seed the local clone and then perform incremental pulls. Eventually, this feature will be built into Mercurial so clients don't need to implement extra smarts to avoid putting too much load on the server.

tl;dr: hg.mozilla.org was DoS'd by requests to clone gaia-central.
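A minimal Python sketch of the bundle-seeding approach described above, assuming a pre-generated bundle file is published somewhere (the BUNDLE_URL below is hypothetical, not a real endpoint):

import os
import subprocess
import urllib.request

# Hypothetical bundle location, for illustration only.
BUNDLE_URL = "https://example.com/bundles/gaia-central.hg"
REPO_URL = "https://hg.mozilla.org/integration/gaia-central"

def seeded_clone(dest):
    """Seed a local clone from a bundle, then do an incremental pull."""
    subprocess.run(["hg", "init", dest], check=True)
    bundle_path = os.path.join(dest, ".seed.hg")
    # Fetching a static bundle is far cheaper for the origin server
    # than answering a full `hg clone`.
    urllib.request.urlretrieve(BUNDLE_URL, bundle_path)
    subprocess.run(["hg", "unbundle", ".seed.hg"], cwd=dest, check=True)
    # Record the real repository as the default path, then pull only the
    # changesets added since the bundle was generated.
    with open(os.path.join(dest, ".hg", "hgrc"), "w") as f:
        f.write("[paths]\ndefault = %s\n" % REPO_URL)
    subprocess.run(["hg", "pull"], cwd=dest, check=True)
    os.remove(bundle_path)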
Here are per-hour totals for gaia-central traffic from one server:

UTC hour                repo                        requests    size (bytes)    wall time (s)   cpu time (s)
2015-05-15T00:00:00	integration/gaia-central	2843	820603537	1056	329
2015-05-15T01:00:00	integration/gaia-central	2019	534118535	740	170
2015-05-15T02:00:00	integration/gaia-central	2127	462797783	603	200
2015-05-15T03:00:00	integration/gaia-central	1729	121510689	144	91
2015-05-15T04:00:00	integration/gaia-central	1803	331652204	421	182
2015-05-15T05:00:00	integration/gaia-central	584	208022903	302	73
2015-05-15T06:00:00	integration/gaia-central	647	100133419	155	55
2015-05-15T07:00:00	integration/gaia-central	585	273066165	383	93
2015-05-15T08:00:00	integration/gaia-central	909	367346849	592	102
2015-05-15T09:00:00	integration/gaia-central	1678	1166182896	1609	403
2015-05-15T10:00:00	integration/gaia-central	2131	1572038970	1947	448
2015-05-15T11:00:00	integration/gaia-central	1870	1106850862	1194	316
2015-05-15T12:00:00	integration/gaia-central	2077	588593800	741	192
2015-05-15T13:00:00	integration/gaia-central	1748	146887179884	72707	35033
2015-05-15T14:00:00	integration/gaia-central	4496	250455249811	122510	60992
2015-05-15T15:00:00	integration/gaia-central	2711	99468444396	21808	15014
2015-05-15T16:00:00	integration/gaia-central	1390	375470364	575	157

You can see that at the peak the request count roughly doubled over the baseline hours, while response bytes grew by more than two orders of magnitude.
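A quick check of those ratios, using the 12:00 hour as baseline and the 14:00 hour as peak (my choice of hours; figures copied from the table above):

# (requests, response bytes) per the table
baseline_requests, baseline_bytes = 2077, 588_593_800       # 2015-05-15T12:00 UTC
peak_requests, peak_bytes = 4496, 250_455_249_811           # 2015-05-15T14:00 UTC

print(round(peak_requests / baseline_requests, 1))   # ~2.2x the request count
print(round(peak_bytes / baseline_bytes))            # ~425x the response bytes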
Trees "reopened" for this cause at 0930 PT (1630 UTC)
(In reply to Ludovic Hirlimann [:Usul] from comment #10)
> http://nigelbabu.github.io/hgstats/#hours/24

that's a dynamic display; the attachment is a snapshot covering the outage period
(In reply to Ludovic Hirlimann [:Usul] from comment #7)
> I've paged hwine @ 51
> 
>  Fri 07:52:53 PDT [5359] hgweb10.dmz.scl3.mozilla.com:httpd max clients is
> WARNING: Using 64 out of 64 Clients (http://m.mozilla.org/httpd+max+clients)
> <nagios-scl3> SMS from hwine: Yesterday we saw spike that self resolved.
> Potential dos attack

Yesterday's spike appeared to originate from user agent "Twitterbot/1.0" across multiple IPs. Several hundred requests were lobbed at hg.mozilla.org in a very narrow time window. This temporarily overwhelmed the httpd worker pool. Requests were made to robots.txt, but the apparent bot ignored its content (which says not to index).

Yesterday's event and today's event are completely separate AFAICT.
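For comparison, a compliant crawler checks robots.txt before fetching. A minimal sketch using the Python standard library (the user agent string and URL are just examples, and this says nothing about how Twitterbot is actually implemented):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://hg.mozilla.org/robots.txt")
rp.read()

# A well-behaved crawler consults this before every fetch; the bot above apparently did not.
print(rp.can_fetch("Twitterbot/1.0", "https://hg.mozilla.org/mozilla-central"))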
Times in PT, data from webops
Times in PT, data from webops
machines in scl3 behind a nat that were hitting hg.m.o
If you tell me a destination and the time window I might be able to tell you which machines behind NAT were hitting HG.

Is the destination hg.mozilla.org or something else?

Also, when is the information necessary? Right away or can wait until Monday?
There was an issue this morning around 13:00Z where our us-east-1 s3 proxy became unavailable. This resulted in taskcluster-vcs cloning from hg.m.o instead of using a cached clone stored in s3. We have since disabled the option to use the proxy, redeployed it, and will re-enable it once sheriffs are satisfied with the trees.

This certainly would have contributed to the increased load, especially for integration/gaia-central.
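The failure mode described here is the classic cache-outage fallback: when the cached clone is unavailable, every consumer falls through to a full clone from the origin at once. A minimal Python sketch of the pattern, with hypothetical URLs and no relation to the actual taskcluster-vcs code:

import os
import subprocess
import urllib.error
import urllib.request

# Hypothetical endpoints, for illustration only.
CACHE_URL = "https://s3-proxy.example.com/clones/gaia-central.tar.gz"
REPO_URL = "https://hg.mozilla.org/integration/gaia-central"

def get_repo(dest):
    """Prefer the cached clone; fall back to hg.mozilla.org only if the cache fails."""
    try:
        archive, _ = urllib.request.urlretrieve(CACHE_URL)
        os.makedirs(dest, exist_ok=True)
        subprocess.run(["tar", "-xzf", archive, "-C", dest], check=True)
        # Catch up on changesets newer than the cached snapshot: cheap for the server.
        subprocess.run(["hg", "pull", REPO_URL], cwd=dest, check=True)
    except (urllib.error.URLError, subprocess.CalledProcessError):
        # Cache outage: this is the expensive path that, taken by every job
        # at once, drove the gaia-central clone spike described above.
        subprocess.run(["hg", "clone", REPO_URL, dest], check=True)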
Do we need to keep this open?
Assignee: nobody → ludovic
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED