Closed Bug 1278722 Opened 8 years ago Closed 8 years ago

hgweb[11-14].dmz.scl3 hit max clients

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sal, Assigned: gps)

Details

16:31 <@nagios-scl3> Tue 16:31:48 PDT [5370] hgweb14.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 30 out of 30 Clients (http://m.mozilla.org/httpd+max+clients)
16:31 <@nagios-scl3> Tue 16:31:49 PDT [5371] hgweb12.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 30 out of 30 Clients (http://m.mozilla.org/httpd+max+clients)
16:31 <@nagios-scl3> Tue 16:31:49 PDT [5372] hgweb11.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 30 out of 30 Clients (http://m.mozilla.org/httpd+max+clients)
16:31 <@nagios-scl3> Tue 16:31:58 PDT [5373] hgweb13.dmz.scl3.mozilla.com:httpd max clients is WARNING: Using 30 out of 30 Clients (http://m.mozilla.org/httpd+max+clients)
16:34 <@nagios-scl3> Tue 16:34:38 PDT [5375] hgweb12.dmz.scl3.mozilla.com:httpd max clients is CRITICAL: (Service Check Timed Out) (http://m.mozilla.org/httpd+max+clients)
16:35 <@nagios-scl3> Tue 16:35:08 PDT [5376] hgweb14.dmz.scl3.mozilla.com:httpd max clients is CRITICAL: (Service Check Timed Out) (http://m.mozilla.org/httpd+max+clients)
16:35 <@nagios-scl3> Tue 16:35:08 PDT [5377] hgweb11.dmz.scl3.mozilla.com:httpd max clients is CRITICAL: (Service Check Timed Out) (http://m.mozilla.org/httpd+max+clients)
Group: mozilla-employee-confidential
Assignee: nobody → gps
Root cause was me syncing mozilla-central with the integration/autoland repo. Because the mozilla-central pushlog was synced, the 33 new pushes since the last sync were scheduled in buildbot. This spawned hundreds of jobs. KWierso cancelled them within minutes of creation. However, that didn't appear to be enough: enough jobs remained that they flooded hg.mozilla.org with clone requests and overwhelmed the servers.

Normally we don't have this problem because clones are served from S3/CDN. However, we don't yet generate clone bundles for the autoland repo, so the clone load went directly to the servers. Because each clone HTTP request transfers 1+ GB, the requests are long-running. They consumed all of the worker child processes, leaving no workers to handle new requests. Classic DoS.
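To make the exhaustion concrete, here's back-of-the-envelope arithmetic. The worker count matches the nagios alerts above; the job count and per-client transfer rate are illustrative assumptions, not measurements from the incident:

```shell
workers=30          # httpd max clients per hgweb host (from the alerts above)
jobs=300            # "hundreds" of build jobs cloning at once (assumed)
rate_mib=10         # assumed per-client transfer rate, MiB/s
clone_secs=$(( 1024 / rate_mib ))   # ~1 GiB clone held per worker

in_flight=$(( jobs < workers ? jobs : workers ))
queued=$(( jobs - in_flight ))
echo "each clone holds a worker ~${clone_secs}s; ${in_flight}/${workers} workers busy, ${queued} requests waiting"
```

With every worker pinned for minutes at a time, even health-check requests time out, which is exactly what the CRITICAL "(Service Check Timed Out)" alerts show.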

I fixed the issue by copying mozilla-central's clonebundles.manifest file to the autoland repo and bouncing httpd. When the clones were retried, the data was fetched from S3/CDN and it was business as usual.
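The manual fix can be sketched as follows. This runs against temp directories so it is self-contained; the real repo paths on the hgweb hosts, the manifest URL, and the restart command are assumptions, not the actual production values:

```shell
set -e
# Stand-in repo layout; real paths on the hgweb hosts differ.
BASE=$(mktemp -d)
SRC="$BASE/mozilla-central/.hg"
DST="$BASE/integration/autoland/.hg"
mkdir -p "$SRC" "$DST"

# Stand-in manifest; the real one points at the S3/CDN bundles.
printf '%s\n' 'https://cdn.example.com/mozilla-central.gzip-v2.hg BUNDLESPEC=gzip-v2' \
    > "$SRC/clonebundles.manifest"

# The fix: copy mozilla-central's manifest into the autoland repo...
cp "$SRC/clonebundles.manifest" "$DST/clonebundles.manifest"

# ...then bounce httpd on each hgweb host (not run here), e.g.:
#   systemctl restart httpd
cat "$DST/clonebundles.manifest"
```

Once the manifest is in place, retried clones fetch the bundle from the CDN and only pull the small incremental tail from the server.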

To mitigate this in the future, we need to advertise clone bundles for the autoland repo. We also likely want automation to bake a snapshot of Firefox into the base image so machines aren't cloning the entire repo when running their first job. That work has been blocked on exposing an appropriate base image snapshot, which in turn has been blocked on having a unified repo capable of generating said snapshot. Fortunately, I've been working on both of these in the past week. We now offer a snapshot of the experimental/firefox-unified repo on the CDN/S3, and I'm in the process of making that repo non-experimental so we can provide a non-experimental snapshot to bake into the base image.
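Advertising clone bundles amounts to serving a clonebundles.manifest from the repo: one bundle URL per line, followed by optional key=value attributes. A minimal sketch of the format (the URLs here are illustrative, not the real production entries):

```
https://cdn.example.com/mozilla-central.gzip-v2.hg BUNDLESPEC=gzip-v2
https://s3.example.com/mozilla-central.packed1.hg BUNDLESPEC=none-packed1 REQUIRESNI=true
```

Clients that support clone bundles download one of these pre-generated bundles first, then pull only the changesets added since the bundle was created, which is why the servers barely notice clone traffic when a manifest is in place.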

Anyway, this incident should be a one-off stemming from activity around changing our VCS hosting and consumption strategy; we shouldn't see these kinds of events going forward. Had the autoland repo been using mozilla-central's bundle manifest, this wouldn't have happened. That was an oversight, and I'll correct it shortly.
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/b9d459f1f9f4
ansible/hg-ssh: use mozilla-central's bundles for integration/autoland repo
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Bug 1232442 tracks seeding the AMIs with an appropriate snapshot. It was already on file and contains reasoning similar to what I wrote in this bug about using the unified repo as the base.