Closed Bug 1126825 Opened 9 years ago Closed 9 years ago

aus[1-4].webapp.phx1.mozilla.com:httpd max clients is CRITICAL

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rwatson, Assigned: nmaul)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/361] )

This seems to be an ongoing theme recently. Looking over the past 2 weeks' etherpad logs, it's happened on 4 people's shifts.

[15:30:21]  nagios-phx1	Wed 07:30:20 PST [1442] aus1.webapp.phx1.mozilla.com:httpd max clients is WARNING: Using 256 out of 256 Clients (http://m.mozilla.org/httpd+max+clients)
[15:30:31]  nagios-phx1	Wed 07:30:30 PST [1443] aus2.webapp.phx1.mozilla.com:httpd max clients is WARNING: Using 256 out of 256 Clients (http://m.mozilla.org/httpd+max+clients)
[15:31:00]  nagios-phx1	Wed 07:31:00 PST [1444] aus4.webapp.phx1.mozilla.com:httpd max clients is WARNING: Using 256 out of 256 Clients (http://m.mozilla.org/httpd+max+clients)


It usually recovers on its own, or someone restarts httpd.
Can this be investigated?
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/361]
Summary: aus1.webapp.phx1.mozilla.com:httpd max clients is CRITICAL → aus[1-4].webapp.phx1.mozilla.com:httpd max clients is CRITICAL
Looking into this. Getting New Relic set up to help identify what's going on with the service and what the root cause might be.
Assignee: server-ops-webops → nmaul
Depends on: 1128507
NR is set up. Now we just play the waiting game...
This is in New Relic here: https://rpm.newrelic.com/accounts/263620/applications/5929969.

Haven't caught anything yet... MOC, have you noticed any of these servers having any issues in the past ~2 weeks?

On Jan 14, we changed AUS4 to not use any threading in mod_wsgi (multi-process instead of multi-threaded). This was in response to an issue like this one, as a general possibility: we've seen threading be an issue before, in much the same way.

On Jan 28, we added the New Relic app name to the Apache config. This is trivial and extremely unlikely to have had any impact; just mentioning it for completeness.

On Feb 5, we changed the max-requests setting for the app. Previously it was 5,000, after which the app would restart a process. With the request rate we were getting, that equated to effectively a full restart of Apache (mod_wsgi) every 10 minutes. I raised it to 20,000. This *may* have had an impact (by reducing the chance of triggering a race/deadlock condition during the max-requests restart), but since we're using multi-process now rather than multi-threading, that seems unlikely.
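For reference, the change amounts to bumping maximum-requests on the daemon process directive, roughly like this (the surrounding options here reflect the current config; the exact line as of Feb 5 may have differed slightly):

    # before: recycle each daemon process after 5,000 requests
    WSGIDaemonProcess aus4 processes=8 threads=1 maximum-requests=5000 display-name=aus4

    # after: recycle 4x less often
    WSGIDaemonProcess aus4 processes=8 threads=1 maximum-requests=20000 display-name=aus4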


If it hasn't happened recently, I'm content to close this bug out and simply pretend all is well. If it ever crops back up, New Relic is now in place and should help us troubleshoot it.
Flags: needinfo?(ashish)
Last alert was on Feb 15. Prior to that, the only one was on Feb 5. Both times, it was only one host and not the whole cluster.
Flags: needinfo?(ashish)
I discovered (thanks to bug 1130379) that the 6 AUS nodes were not identical. Nodes 1-4 had a much older version of New Relic than 5 and 6. 5 and 6 weren't reporting at all due to a config mismatch (newer versions need a newer config to report properly).

I've just updated them all to be current, and they immediately started catching some new errors that had previously been undetected. I don't think anything I've seen so far is related, but it gives me confidence we are now catching things that we weren't on the 5th and 15th. There are also some fixes in NR itself that may be relevant.


Going back through the oncall etherpads back to the week of Jan 19, I see aus1, 2, 3, and 4, but not 5 or 6. If indeed 5 and 6 have never alerted, that *may* indicate that the older version of New Relic itself was responsible. If that is the case, then this should be fixed from now on. I doubt this, however, because the reason we set up NR in the first place was that we had this problem. It would mean we had a problem with threading (fixed on or before Jan 14), then broke it in a different way (but one that manifested similarly) on Feb 2. It's possible, but feels unlikely. It also means we shouldn't have had failures between Jan 14 and Feb 2... but we did, so that's another strike against this theory.


I guess back to the waiting game to see if the newer NR picks anything up next time a node breaks. If 5 and 6 have never failed (as kinda appears to be the case), we might also want to compare packages between them. Could be there's another difference.
I discovered that aus1-4 are running an older version of mod_wsgi than 5 and 6 are. This is a much more likely candidate for a root cause than the New Relic library version.

I've updated aus3 and 4 to the latest version. The installed versions are now:

aus1 - mod_wsgi-3.4-1.el6.rfx.x86_64
aus2 - mod_wsgi-3.4-1.el6.rfx.x86_64
aus3 - mod_wsgi-4.4.8-1.el6.x86_64
aus4 - mod_wsgi-4.4.8-1.el6.x86_64
aus5 - mod_wsgi-4.2.7-6.el6.x86_64
aus6 - mod_wsgi-4.2.7-6.el6.x86_64

Based on which nodes (if any) fail in the future, we'll know whether it appears to be tied to the New Relic version or the mod_wsgi version, and whether New Relic reports any errors.
Update - There have been two alerts since Comment 6 - aus2 on Feb 20 and aus4 on Mar 03
Hmm... apache uptime on all of them is 2 days, 4 hours... which was me restarting all of them and upgrading everything to 4.4.8 (I started a comment here but apparently never posted it).


Something's odd here. On the nagios alert history page for aus4.webapp.phx1.mozilla.com, I see a maxclients alert for the 3rd:

 Service Warning[03-03-2015 06:30:55] SERVICE ALERT: aus4.webapp.phx1.mozilla.com;httpd max clients;WARNING;HARD;2;Using 256 out of 256 Clients

However, that isn't anywhere in the Apache error_log.


I also see this on the 2nd:

 Service Warning[03-02-2015 11:04:19] SERVICE ALERT: aus4.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients

And this one *is* in the log (8-hour difference, server is in UTC, so I guess nagios is in PT):

[Mon Mar 02 19:03:49 2015] [error] server reached MaxClients setting, consider raising the MaxClients setting



Anyway, I've made some other changes and will leave a separate comment in a minute...
I have changed mod_wsgi further, based on extensive testing with one of the nodes. My theory is that MaxClients is reached when the mod_wsgi workers have to restart, due to either a deploy, Apache restarting, or the workers hitting the maximum-requests setting. When that happens, there's a delay before the restarted worker(s) start responding to requests normally again. During that time, Apache will continue to send them requests, not realizing that it will be a while before they can respond. This means Apache children are tied up waiting on a non-responsive mod_wsgi worker for a few seconds.

Under the right circumstances (load, # of workers restarting, time of day, dumb luck, etc), this can mean all the Apache children are waiting on mod_wsgi. More Apache children might help, but that only treats the symptom. The real problem is mod_wsgi being temporarily/intermittently unresponsive.

The change I made is:

From:
    WSGIDaemonProcess aus4 processes=8 threads=1 maximum-requests=20000 display-name=aus4
    WSGIProcessGroup aus4

To:
    WSGIDaemonProcess aus4 processes=2 threads=1 maximum-requests=80000 display-name=aus4
    WSGIImportScript /data/www/aus4.mozilla.org/app/balrog.wsgi process-group=aus4 application-group=aus4
    WSGIApplicationGroup aus4
    WSGIProcessGroup aus4


Having fewer procs means the warm-up time on startup is faster: 8 procs takes ~8.3s, 2 procs takes ~1.4s. This is also improved during normal proc cycling due to maximum-requests... those restarts take a bit less time on average due to less process thrashing. On this node, 2 procs actually results in better performance overall, too: more requests/sec, faster mean/median response times, lower std-dev, similar best-case, much better worst-case. As far as I can tell, on these nodes with this app, 2 procs is universally better than 8. There is no metric I've found where 8 procs is a better choice.

The WSGIImportScript causes mod_wsgi to preload the given script before taking any requests. This doesn't really help on a full apache restart (because that stuff has to be loaded anyway), but it does help on a maximum-requests process restart... the proc that's restarting won't take new requests until it finishes preloading, which reduces the apparent slowdown due to those restarts (as long as all procs don't need to restart at the same time!).


For this app in particular, it might actually be reasonable to drastically increase the maximum-requests limit, or potentially even get rid of it altogether. It really depends on how it behaves over a long time interval (does it leak memory, etc).

It might also be possible to revert back to using threads instead of processes. In my testing, 1 proc / 8 threads is measurably faster to start up (and performs better otherwise) than 2 procs / 1 thread. I haven't gone this far because I'm not 100% sure that threading is safe to do here.
Things seem to have gone south overnight. There was a fairly non-trivial code change that had a problem, so it's tempting to blame that, but even after that was fixed things weren't as peachy as I was hoping for. So I'm going to try some more.

I'm compromising and going back to 4 processes (16->8->2->4). I think these processes *are* behaving oddly after a long time serving content (as indicated by New Relic showing a non-zero "Request Queuing" value).

I'm also slightly tweaking Apache itself, raising MaxClients from 256 to 384 and using saner settings for the Start and Spare server counts.
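For what it's worth, the prefork tuning I have in mind looks roughly like this; MaxClients=384 is the real value, while the Start/Spare numbers below are illustrative rather than the exact ones deployed:

    <IfModule prefork.c>
        # keep a sane pool of idle children around so bursts don't
        # force a fork storm
        StartServers        32
        MinSpareServers     16
        MaxSpareServers     64
        ServerLimit         384
        MaxClients          384
        MaxRequestsPerChild 4000
    </IfModule>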

On Monday I'll evaluate again, and if it's still not quite where I want it I will try setting mod_wsgi to 1 proc / 8 threads.

I'm also tempted to switch apache to use mpm_worker instead of mpm_prefork, but that almost certainly won't help mod_wsgi behave any better. That's just sort of a "while I'm here..." change.
Catlee suggested that the amount of errors being sent to New Relic might be playing a factor here. There are a _lot_. I had thought that the vast majority should be ignored, but New Relic hooks in differently than Sentry, so that wasn't the case. I've filed bug 1141197 to fix that up, hopefully that will help.
Depends on: 1141197
(In reply to Ben Hearsum [:bhearsum] from comment #12)
> Catlee suggested that the amount of errors being sent to New Relic might be
> playing a factor here. There are a _lot_. I had thought that the vast
> majority should be ignored, but New Relic hooks in differently than Sentry,
> so that wasn't the case. I've filed bug 1141197 to fix that up, hopefully
> that will help.

Nick managed to bring the error rate down tremendously in bug 1139249. I'm not sure how to confirm whether it reduces how often we hit max clients, though, since this isn't easily reproducible.
No longer depends on: 1141197
I saw the steep drop-off in error rates, thank you! Very nice.

I'm still toying with the mod_wsgi settings on this. At the moment my settings are as follows:

    WSGIScriptAlias / /data/www/aus4.mozilla.org/app/balrog.wsgi
    WSGIDaemonProcess aus4 processes=1 threads=8 display-name=aus4
    WSGIImportScript /data/www/aus4.mozilla.org/app/balrog.wsgi process-group=aus4 application-group=aus4
    WSGIApplicationGroup aus4
    WSGIProcessGroup aus4

As you can see I'm trying threading again. From previous conversations we had suspected that this was a problem, but given that we've had issues for a while even without it, I've reversed my position.

I've also removed the maximum-requests limitation for now. I'm thinking this app is simple and well-behaved enough that it may never need to restart itself. That should make it perform more consistently, and hopefully avoid the odd spikes causing Apache to hit MaxClients.

If either of these causes problems, I'll revert them and try something else. In particular, I'm watching the "Request Queuing" stat in New Relic closely. It's still very small (<10ms), but seems to be growing steadily. This app is extremely fast compared to most of our stuff, so I'm really not sure what's a reasonable point for that to level off at. Shortly after a restart it's <1ms, but I'm not sure if that's generally a sustainable level. If it seems to grow unbounded, I'll go back to having a maximum-requests counter.
(In reply to Jake Maul [:jakem] from comment #14)
> If either of these causes problems, I'll revert them and try something else.
> In particular, I'm watching the "Request Queuing" stat in New Relic closely.
> It's still very small (<10ms), but seems to be growing steadily. This app is
> extremely fast compared to most of our stuff, so I'm really not sure what's
> a reasonable point for that to level off at. Shortly after a restart it's
> <1ms, but I'm not sure if that's generally a sustainable level. If it seems
> to grow unbounded, I'll go back to having a maximum-requests counter.

Looking over the past few days it's hard to find a pattern for the request queuing spikes. It seemed to spike shortly after March 11 @ 12am and then drop off about 12 hours later, then a quick large spike, then a drop off. After that, it had a slower creep until about March 12 @ 12am... and now it's leveled off for the past 6h.

It certainly doesn't correlate with requests per minute - the most recent spike happened during one of the lower request periods.
Yeah, you'll have trouble finding a pattern... I made several changes over the span of a week or more, and didn't document exactly when each change was made. :)

In any case, things remain stable thus far. I'll give it the weekend, and if things still seem good on Monday I'll close this out.
Still seems good. I'm going to close this out now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Sigh... not done, aus6.webapp.phx1 hit MaxClients.

The mod_wsgi daemon process wasn't running. Per suggestions from the mod_wsgi author (not directed at me personally, but relevant), I'm increasing LogLevel from warn to info, which should give more detail about why it failed.


I'm also setting this:

WSGIRestrictEmbedded on

which disables embedded mode. We're not using it (we're using daemon mode), so doing this saves around 2.5MB per Apache worker, or around 250MB RAM per server under normal conditions. Around 66% savings in Apache worker memory usage on the AUS4 cluster!
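In config terms, the httpd-side changes from this comment boil down to roughly the following (placement within the config files is approximate):

    # more verbose logging, to capture why the mod_wsgi daemon process dies
    LogLevel info

    # daemon mode only: skip initializing a Python interpreter in every
    # Apache worker (saves roughly 2.5MB per worker)
    WSGIRestrictEmbedded on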
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Failed again, this time on aus4.webapp.phx1. No additional output from the logs. As before, the mod_wsgi daemon process was no longer running, and apache was backed up waiting for it.

I have just upgraded the version of mod_cloudflare that was installed... it should fix a couple of known upstream bugs (https://github.com/cloudflare/mod_cloudflare/commit/3b6a43071153b3f8ad6db576c8430990b7a79bac). Whether this will have any impact, I don't know yet. That's all I have time for right now. Next week I suppose I will contact the author of mod_wsgi and see if he has any ideas on what it might be or how to troubleshoot it.
aus2.webapp.phx1 alerted. mod_cloudflare is the same version as on aus4. Since jakem is AFK, restarted httpd.
We were getting a lot of 503 errors from aus4.webapp.phx1 in some update tests, and there was a lot of this
[Fri Apr 10 09:35:28 2015] [error] [client 176.63.98.210] (11)Resource temporarily unavailable: mod_wsgi (pid=25266): Unable to connect to WSGI daemon process 'aus4' on '/var/run/wsgi.8272.1.1.sock
in /var/log/httpd/aus4.mozilla.org/error_log.

Grepping over older error_logs shows that error has been hit a few hundred times per hour in (very) roughly half the hours this month, but it really went bananas starting around 2015-04-09 16:00, with ~300k/hour.

We ended up restarting httpd, but I'm surprised Zeus didn't pull aus4 out of the pool, because this persisted for several hours. Are we missing some Zeus config, or maybe a health check?
Status: REOPENED → NEW
As noted in comment 21, this appears to be happening a lot. When it does, and a release is in process (at least 2 each week), we have a release build which needs manual intervention, usually at highly inconvenient hours of the day.

Should we break this up into multiple parts? FF RelEng would be much better off if the errant machine could be pulled from rotation automatically.
Flags: needinfo?(smani)
(In reply to Hal Wine [:hwine] (use needinfo) from comment #22)
> As noted in comment 21, this appears to be happening a lot. When it does,
> and a release is in process (at least 2 each week), we have a release build
> which needs manual intervention, usually at highly inconvenient hours of the
> day.
> 
> Should we break this up into multiple parts? FF RelEng would be much better
> off if the errant machine could be pulled from rotation automatically.

Hal/Nick,

Sorry about this.

We've filed Bug 1154390 to fix the health checks to be more accurate and Zeus should pull the failed nodes better next time. 

Also, we've been working on Bug 1148008 to get better data on these failures and so far don't have anything concrete to report. 

Let me know if this helps! Thanks.
Flags: needinfo?(smani)
(In reply to Shyam Mani [:fox2mike] from comment #23)
> (In reply to Hal Wine [:hwine] (use needinfo) from comment #22)
> > Should we break this up into multiple parts? FF RelEng would be much better
> > off if the errant machine could be pulled from rotation automatically.
> 
> Hal/Nick,
> 
> Sorry about this.
> 
> We've filed Bug 1154390 to fix the health checks to be more accurate and
> Zeus should pull the failed nodes better next time. 
>
> Let me know if this helps! Thanks.

We'll all be waiting for Bug 1154390 to be resolved for the aus hosts. :) That'll make a huge difference in the stability of the release process.
Happening again on aus3.webapp.phx1.mozilla.com.

Started at
[Tue Apr 21 02:50:15 2015] [error] [client XX.XX.XX.XX] (11)Resource temporarily unavailable: mod_wsgi (pid=7070): Unable to connect to WSGI daemon process 'aus4' on '/var/run/wsgi.16206.5.1.sock'.

I gracefully restarted Apache.
One interesting thing about this most recent occurrence is that AFAICT there's no spike in response time at the same time as the 503s. The requests that did get responses still got them in < 20ms.
After researching this further and discussing internally, we've decided to try using the httpd "worker" MPM instead of the traditional prefork. We tested this out on one node late last week and had no issues with functionality (it didn't crash either, but it was too short a time for that to mean much). I've just switched over all production nodes to be this way.

Performance-wise there's virtually no difference as all the time is spent in mod_wsgi, not Apache. We do save a bit of RAM by having fewer Apache processes, but this cluster isn't RAM-starved so that doesn't have much effect either.
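For reference, the worker-MPM settings are along these lines; the thread and process counts below are illustrative, with only MaxClients=384 being a value called out earlier in this bug:

    <IfModule worker.c>
        # fewer, larger processes: 384 clients / 64 threads per child
        # works out to at most 6 httpd processes
        StartServers         4
        MaxClients           384
        MinSpareThreads      64
        MaxSpareThreads      128
        ThreadsPerChild      64
        MaxRequestsPerChild  0
    </IfModule>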
We think comment 27 (worker MPM) did the trick. Things seem smooth ever since that change.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard