Closed Bug 1472271 Opened 6 years ago Closed 6 years ago

Decision task problem because of mercurial

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: apop, Assigned: apop)

Details

There is an issue with the hg servers that is affecting try and the decision tasks. Please check the error:

[task 2018-06-29T17:13:30.584Z] ReadTimeout: HTTPSConnectionPool(host='hg.mozilla.org', port=443): Read timed out. (read timeout=5)

Can you please check the issue?
Severity: normal → blocker
Priority: -- → P1
Assignee: nobody → apop
Sounds like ActiveData is taking down hgmo; this is probably related:

<nagios-scl3> Fri 17:09:01 UTC [5914] [Unknown] hgweb15.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 26 out of 26 Clients (http://m.mozilla.org/httpd+max+clients)
10:12:29 
<gps> ekyle-no-power: activedata is hammering hgmo
10:13:11 Usul, sal, dhouse: we can mitigate the load by banning UAs with "ActiveData-ETL" in them
10:15:02 
<nagios-scl3> Fri 17:15:01 UTC [5917] [Unknown] hgweb15.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 26 out of 26 Clients (http://m.mozilla.org/httpd+max+clients)
10:18:22 apop|away → apop|ciduty
10:18:53 
<nagios-scl3> Fri 17:18:52 UTC [5921] [devservices] hgweb14.dmz.scl3.mozilla.com:Load is CRITICAL: CRITICAL - load average: 47.23, 47.45, 46.96 (http://m.mozilla.org/Load)
10:20:38 
<sal> im not sure how to filter them 
10:21:01 
<nagios-scl3> Fri 17:21:00 UTC [5924] [Unknown] hgweb15.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 26 out of 26 Clients (http://m.mozilla.org/httpd+max+clients)
10:21:07 
<gps> i made noise in #sysadmins. usually someone there can take care of things
10:21:10 
<nagios-scl3> Fri 17:21:09 UTC [5927] [devservices] hgweb13.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 53 out of 53 Clients (http://m.mozilla.org/httpd+max+clients)
10:24:08 Fri 17:24:07 UTC [5932] [devservices] hgweb13.dmz.scl3.mozilla.com:Load is CRITICAL: CRITICAL - load average: 46.42, 47.28, 46.98 (http://m.mozilla.org/Load)
10:26:45 
<jlund> Jordan Lund hitting timeouts on at least try: https://treeherder.mozilla.org/#/jobs?repo=try
10:26:51 [task 2018-06-29T17:13:30.584Z] ReadTimeout: HTTPSConnectionPool(host='hg.mozilla.org', port=443): Read timed out. (read timeout=5)
10:27:01 
<nagios-scl3> Fri 17:27:00 UTC [5936] [Unknown] hgweb15.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 26 out of 26 Clients (http://m.mozilla.org/httpd+max+clients)
10:28:04 Fri 17:28:03 UTC [5939] [devservices] hgweb13.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 53 out of 53 Clients (http://m.mozilla.org/httpd+max+clients)
10:30:02 Fri 17:30:01 UTC [5942] [devservices] hgweb13.dmz.scl3.mozilla.com:httpd max clients is OK: Using 10 out of 53 Clients (http://m.mozilla.org/httpd+max+clients)
10:31:01 Fri 17:31:00 UTC [5945] [Unknown] hgweb15.dmz.scl3.mozilla.com:httpd max clients is OK: Using 5 out of 26 Clients (http://m.mozilla.org/httpd+max+clients)
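The mitigation gps suggests in the log above, banning requests whose User-Agent contains "ActiveData-ETL", amounts to a substring check on the header. The following is a minimal sketch of that idea only. The names `should_block` and `BLOCKED_UA_SUBSTRINGS` are hypothetical, and this bug does not show how the ban was actually applied on hg.mozilla.org (it would more likely live in web server or load balancer configuration than in Python).

```python
# Hypothetical sketch: decide whether to reject a request based on its
# User-Agent header. Only the "ActiveData-ETL" substring comes from the
# chat log; everything else here is illustrative.
BLOCKED_UA_SUBSTRINGS = ("ActiveData-ETL",)


def should_block(user_agent):
    """Return True if the User-Agent contains a blocked substring."""
    if not user_agent:
        # A missing or empty header is not treated as a match.
        return False
    return any(marker in user_agent for marker in BLOCKED_UA_SUBSTRINGS)
```

A request handler would call `should_block(headers.get("User-Agent"))` and reject the request when it returns True.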
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE
I have reopened the ticket because we need to document whether the problem on try has been resolved.
We will keep tracking it.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
The operational issue causing this problem has been resolved (in bug 1472251).

The service was effectively under a DoS attack due to extremely high concurrent load against API endpoints that require significant CPU to process. I believe the last time this specific event happened, we were able to ride it out. But the hg.mo service is at lowered capacity right now because we're in the middle of a datacenter migration and some of our high-capacity servers are not available to service requests. Plus we're in the middle of a work day. I think the last time ActiveData did this was when the sun was over the Pacific Ocean, which is a period of relative tranquility for the servers.
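Since the root cause described above is too many concurrent requests against CPU-heavy endpoints, one client-side safeguard is to cap the number of requests in flight. The following is a minimal sketch under stated assumptions, not what ActiveData or hg.mozilla.org actually does: `MAX_IN_FLIGHT`, `throttled`, and `fetch_all` are hypothetical names, the cap of 4 is arbitrary, and `fetch` stands in for whatever function performs the real HTTP request.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Assumed cap on simultaneous requests; not a value taken from this bug.
MAX_IN_FLIGHT = 4

# Each request must hold one slot for its whole duration.
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)


def throttled(fetch, url):
    """Call fetch(url) while holding one of the MAX_IN_FLIGHT slots."""
    with _slots:
        return fetch(url)


def fetch_all(fetch, urls, workers=16):
    """Fetch every URL using a thread pool, never exceeding the cap.

    Results are returned in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: throttled(fetch, u), urls))
```

Even with 16 worker threads, at most `MAX_IN_FLIGHT` calls to `fetch` run at once; the remaining workers wait on the semaphore instead of adding load to the server.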
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
We have received confirmation from the Sheriffs about the decision tasks. Everything looks fine now.