Closed Bug 760570 Opened 12 years ago Closed 12 years ago

bedrock*.webapp.(scl3|phx1) are processing requests too slowly

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ashish, Assigned: dmaher)

References

Details

Attachments

(1 file)

Requests for www.mozilla.org are taking too long to process in the Bedrock cluster. This is, in turn, timing out Watchmouse checks:

[ashish@bedrock1.webapp.phx1 ~]$ time curl -ILH "Host: www.mozilla.org" http://localhost:81/en-US/
HTTP/1.1 200 OK
Date: Fri, 01 Jun 2012 16:54:16 GMT
Server: Apache
x-frame-options: DENY
Expires: Fri, 01 Jun 2012 17:04:25 GMT
Cache-Control: max-age=600
X-Backend-Server: bedrock1.webapp.phx1.mozilla.com
Content-Length: 25155
Content-Type: text/html; charset=utf-8


real    0m8.476s
user    0m0.001s
sys     0m0.005s

[ashish@bedrock2.webapp.phx1 ~]$ time curl -ILH "Host: www.mozilla.org" http://localhost/en-US/
HTTP/1.1 200 OK
Date: Fri, 01 Jun 2012 16:48:49 GMT
Server: Apache
x-frame-options: DENY
Expires: Fri, 01 Jun 2012 16:58:59 GMT
Cache-Control: max-age=600
X-Backend-Server: bedrock2.webapp.phx1.mozilla.com
Content-Length: 25155
Content-Type: text/html; charset=utf-8


real    0m9.424s
user    0m0.000s
sys     0m0.006s

Oddly, these are HP blades, compared to bedrock3, which is a Seamicro Xeon:

[ashish@bedrock3.webapp.phx1 ~]$  time curl -ILH "Host: www.mozilla.org" http://localhost/en-US/
HTTP/1.1 200 OK
Date: Fri, 01 Jun 2012 16:55:05 GMT
Server: Apache
x-frame-options: DENY
Expires: Fri, 01 Jun 2012 17:05:08 GMT
Cache-Control: max-age=600
X-Backend-Server: bedrock3.webapp.phx1.mozilla.com
Content-Type: text/html; charset=utf-8


real    0m2.067s
user    0m0.003s
sys     0m0.000s
One relevant note: bedrock1 and 2 in PHX1 are much more powerful than bedrock3... consequently, Zeus uses a weighted algorithm and they get a lot more traffic than 3 does.

Also, the total traffic volume sent to bedrock is evenly split between SCL3 and PHX1, so these two nodes are probably hit more strongly than the 5 nodes in SCL3.

It appears that the WSGI processes are each maxing out a core. I will investigate further.
Assignee: server-ops → nmaul
Summary: bedrock[1-2].webapp.scl3 are processing requests too slowly → bedrock[1-2].webapp.phx1 are processing requests too slowly
Not all of the WSGI processes were maxing out a core, but many (8+) of them were. I restarted Apache on both of them, and they seem to be back to normal.

I didn't think to attach an strace to them before restarting... it might be instructive if it happens again.
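If it does, something along these lines could help pick out which workers to strace first. This is only a rough sketch: it assumes psutil is available on the webheads, and that the workers show up under the bedrock_prod display name from the mod_wsgi config.

#!/usr/bin/env python
"""Flag WSGI workers that are pegging a CPU core (sketch; assumes psutil)."""
import psutil

THRESHOLD = 90.0  # percent of a single core

for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
    try:
        name = proc.info['name'] or ''
        cmdline = ' '.join(proc.info['cmdline'] or [])
        # mod_wsgi daemon processes show up under their display-name.
        if 'bedrock_prod' not in name and 'bedrock_prod' not in cmdline:
            continue
        cpu = proc.cpu_percent(interval=1.0)  # sample this process for 1 second
        if cpu >= THRESHOLD:
            print("PID %d is at %.1f%% CPU - candidate for strace" % (proc.pid, cpu))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue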

It's worth noting that these 2 servers are each very powerful... 16 cores, 36GB RAM. They're ex-Zeus nodes. bedrock3 exists primarily as a fallback/failsafe mechanism... it's just a 2-core 4GB RAM VM. These 2 servers do all the real work in PHX1, and did *all* the work before bedrock was live in SCL3.

It might be worth considering replacing these 2 servers (and the VM) with 5 Seamicro Xeons, like SCL3 has. I don't think it would be a significant increase in available capacity, but it would make the 2 environments identical. That might be good or bad. :)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
This is a recurring issue... reopening.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31931 apache    20   0  473m  47m 4456 S 101.1  0.1  88:49.03 bedrock_prod
31923 apache    20   0  475m  47m 4488 S 99.4  0.1  92:31.04 bedrock_prod
30235 apache    20   0  327m  28m 7880 S 17.9  0.1   0:07.79 /usr/sbin/httpd
25843 apache    20   0  325m  27m 8024 S  9.8  0.1   0:31.21 /usr/sbin/httpd
29680 apache    20   0  320m  22m 8032 S  6.5  0.1   0:09.60 /usr/sbin/httpd

[root@bedrock1.webapp.phx1 ~]# strace -p 31931
Process 31931 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>

It just hangs there indefinitely, using 100% of 1 CPU core.

I'm seeing this on the new SCL3 Seamicro Xeon cluster also... exactly the same behavior.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Any ideas on this (jlongster / wenzel)? It's bad enough that I'm seriously considering a cron job to restart Apache regularly. :/

Changing the title... this affects nodes in SCL3 as well as PHX1. PHX1 nodes seem a bit harder hit, but I think that's just because there are fewer of them: basically 2 nodes vs. 5 for the same traffic volume. I've seen the same behavior on all of them.

Another note: CPU usage seems to gradually build up to 100% on a core. The worker seems to keep working, but in an elevated CPU usage state. There seem to be 4 threads per worker, although the setting specifies only one:

WSGIDaemonProcess bedrock_prod processes=16 threads=1 maximum-requests=15000 display-name=bedrock_prod

Perhaps the other 3 are not "worker" threads but some sort of overhead/management threads. Seeing this makes me want to retry the strace from comment 3, attaching to individual threads within one of the failing processes. I will try that next time this occurs.

I also increased LogLevel verbosity to "info". This might help catch something when this reoccurs.
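To make the per-thread strace easier next time, a quick sketch like the following (reading straight from /proc; the PID argument is whatever worker top flags) would list the task IDs of a worker so each can be attached to individually:

#!/usr/bin/env python
"""List the thread (task) IDs of a worker PID so each can be straced on its own."""
import os
import sys

pid = int(sys.argv[1])  # e.g. a bedrock_prod worker PID taken from top

for tid in sorted(os.listdir('/proc/%d/task' % pid), key=int):
    with open('/proc/%s/task/%s/stat' % (pid, tid)) as f:
        data = f.read()
    # /proc/<tid>/stat: "pid (comm) state ..." - the field after the comm is the state
    state = data.rsplit(')', 1)[1].split()[0]
    print('thread %s: state %s  ->  strace -p %s' % (tid, state, tid))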
Summary: bedrock[1-2].webapp.phx1 are processing requests too slowly → bedrock*.webapp.(scl3|phx1) are processing requests too slowly
The first of the 4 threads responds exactly as in comment 3... it seems to do nothing.


The second thread does almost nothing... it hangs with this:
[root@bedrock2.webapp.phx1 ~]# strace -p 29827
Process 29827 attached - interrupt to quit
select(0, NULL, NULL, NULL, {210, 845816}

No further output, after about a minute of waiting.


The third thread is a bit more exciting. It does this:

[root@bedrock2.webapp.phx1 ~]# strace -p 29828
Process 29828 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 907676}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)

These selects are all about 1 second apart, and it repeats... forever, I guess.


The fourth thread seems to be the actual worker thread. It generates a lot of output... too much to feasibly list here, but one line stands out as an exceedingly common error... perhaps much more so when the problem is occurring:

stat("/data/www/www.mozilla.org-django/bedrock/locale/en-US/firefox/new.lang", 0x7fc047f1d9e0) = -1 ENOENT (No such file or directory)

The exact file is almost always (~95%) new.lang, but others do show up occasionally. The error often occurs many times (dozens or hundreds) in a row, sometimes interspersed with the same error for a different file... but there's always a big chunk for new.lang.


I feel like if we could identify what causes these stat() calls, and why they're failing (there is no locale/en-US directory, for one thing), that might help to some extent.
Awesome detective work. The stat calls are coming from the l10n system that tries to load in translations. It checks to see if the .lang file exists, and if so, loads it in:

https://github.com/mozilla/bedrock/blob/master/lib/l10n_utils/dotlang.py#L19

If this is the problem, it will be fixed once we get memcache running in production. The l10n system caches the translation results, but as far as I know production is using a dummy cache backend that doesn't do anything.
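To make the mechanism concrete, here's a rough sketch of that pattern. It is simplified, not the actual dotlang.py code; the .lang parsing details and the settings.ROOT path are illustrative. The existence check is the stat() showing up in the strace, and with the dummy backend the cache.get() always misses, so every request repeats it.

import os

from django.conf import settings
from django.core.cache import cache


def parse_lang(path):
    """Parse a .lang file into a {source: translation} dict, if the file exists."""
    translations = {}
    if not os.path.exists(path):   # <-- the stat() seen in the strace output
        return translations
    with open(path) as f:
        source = None
        for line in f:
            line = line.strip()
            if line.startswith('#'):
                continue
            if line.startswith(';'):
                source = line[1:].strip()
            elif source:
                translations[source] = line
                source = None
    return translations


def load_lang_file(lang, name):
    key = 'dotlang-%s-%s' % (lang, name)
    cached = cache.get(key)        # always None with the dummy cache backend
    if cached is None:
        path = os.path.join(settings.ROOT, 'locale', lang, '%s.lang' % name)
        cached = parse_lang(path)
        cache.set(key, cached)     # a no-op on the dummy backend
    return cached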

I didn't think this would be a problem, considering that Zeus caches the page for 15 min, but maybe the stat calls are stalling the system more than expected.

memcache is mostly set up (see bug 738381). I wonder if we can get that live soon.
Depends on: 738381
A few weeks ago, we enabled the locmem backend cache. I think it was around the Firefox Mobile launch (2012-06-26). Even though it's not as powerful as memcache, this should have fixed the issue.
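For reference, the change amounts to swapping the cache backend in the Django settings, roughly like this (illustrative values, not the actual production settings file):

CACHES = {
    'default': {
        # Before: the dummy backend caches nothing, so dotlang re-stats the
        # .lang files on every request.
        # 'BACKEND': 'django.core.cache.backends.dummy.DummyCache',

        # After: a per-process in-memory cache. Unlike memcached it is not
        # shared across the WSGI workers, but it avoids repeating the
        # lookups within a single worker.
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
        'LOCATION': 'bedrock',
    }
}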
Assignee: nmaul → dmaher
I've added an account for :rik on the production webheads (which should be live within 30 minutes).

These are bedrock[1-3].webapp.phx1.mozilla.com and bedrock[1-5].webapp.scl3.mozilla.com.

They should be accessible through the mpt-vpn.
(In reply to Jake Maul [:jakem] from comment #3)

> [root@bedrock1.webapp.phx1 ~]# strace -p 31931
> Process 31931 attached - interrupt to quit
> restart_syscall(<... resuming interrupted call ...>
> 
> It just hangs there indefinitely, using 100% of 1 CPU core.
> 
> I'm seeing this on the new SCL3 Seamicro Xeon cluster also... exactly the
> same behavior.

Fwiw, I'm currently seeing this behaviour on _all_ of the bedrock_prod processes on bedrock1.webapp.phx1, so that's possible, which is unfortunate.
Thanks Corey. I've worked with Dan, so there's no need for access on prod.

What I've done so far:
- On the missing new.lang file: if a file was missing, our code would store {} in the cache, but it would keep checking for the file because we did not differentiate a cache miss (None) from a cached empty dict ({}). I fixed that.
- I added a 10-second timeout for our newsletter backend (ExactTarget). Our code had no timeout so far, and I suspect ExactTarget could be the cause here. We'll receive stack traces for requests to this backend that take more than 10 seconds. (Both changes are sketched below.)
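Roughly, the two changes look like this. It's a simplified sketch, not the actual bedrock code: the helper names and the ExactTarget URL are placeholders, and the timeout is shown with requests purely for illustration, whatever client library the real backend uses.

import requests
from django.core.cache import cache


def get_translations(key, load):
    """Distinguish "not cached yet" (None) from "cached, but file was missing" ({})."""
    cached = cache.get(key)
    if cached is not None:      # {} is a valid, negative, cached answer
        return cached
    result = load() or {}       # may legitimately be empty if the .lang file is absent
    cache.set(key, result)
    return result


def subscribe(email):
    """Cap the newsletter backend call at 10 seconds instead of blocking a worker."""
    return requests.post('https://newsletter.example.invalid/subscribe',
                         data={'email': email}, timeout=10)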

In the future, newsletter subscriptions should be handed off to a Celery task.
Depends on: 774312
Since yesterday, I've received 6 tracebacks for the ExactTarget 10s timeout: 3 from PHX1 and 3 from SCL3.

And right now, from PAR1, I'm seeing responses between 5s and 10s.

Seems like we did not fix it.
I have no clue what could cause this now.

One random idea: what if it was related to the proxying in AMS1? Maybe AMS1 issues too many requests to one server, bypassing the Zeus load balancing? What is the config of AMS1? I can't find it on mana.
:rik mentioned that he was seeing long response times from PHX1.  I logged on to bedrock[1-3].webapp.phx1 and ran this every 5 seconds for 30 minutes on each:
time -p -ao out curl -s -ILH "Host: www.mozilla.org" http://localhost/en-US/

Average response time to the above command on each:
* bedrock1 : 2.12 seconds
* bedrock2 : 2.60 seconds
* bedrock3 : 0.46 seconds

Could the end-user experience of slow response times be coming from something else entirely, perhaps something in between the webheads and the user?
It may be worth noting that there appears to be a correlation between the individual CPU usage per process and the response time per machine.  In other words, both response time AND CPU consumption (per bedrock_prod process) rank the same way:
* bedrock2 > bedrock1 > bedrock3
I ran the recurring curl mentioned in comment #15 from nladm01 overnight, directed at "www.mozilla.org" itself this time (of course).  The average response time was 1.67 seconds; however, the pattern of the responses was interesting: every response would be around the average time, except that, on average, 1 in every 14 attempts spiked to ~6.18 seconds.  I'm not sure if that means anything, but I thought I'd mention it at least.
Is the short term/long term fix more hardware?
(In reply to matthew zeier [:mrz] from comment #18)
> Is the short term/long term fix more hardware?

The short-term fix is to restart bedrock/httpd on a regular basis, since this "fixes" the problem - which is to say that it removes the symptoms temporarily.  After a while, each process will slowly eat more CPU, becoming slower and slower, until it needs to be restarted again.  We haven't put a regular cron job in place yet for a variety of reasons, not the least of which being that band-aids tend to become permanent - and this would be a very poor permanent solution indeed.

The long-term fix is to delve into the code of the bedrock application and deal with whatever root issue is actually causing the problem in the first place.  :rik has been working on this; last we spoke, he was going to put Ganglia metrics into place within the app itself, to try to pinpoint where the inefficiencies are occurring.
I don't know about Ganglia, I was rather suggesting using Graphite which we need to setup first. I'll open bugs about this.
(In reply to Anthony Ricaud (:rik) from comment #20)
> I don't know about Ganglia, I was rather suggesting using Graphite which we
> need to setup first. I'll open bugs about this.

Whoops, yes, I meant to type "Graphite", not "Ganglia".
Depends on: 775534
I just checked in a change to the mod_wsgi config... I reduced "maximum-requests" from 15000 to 1000. This should cause mod_wsgi to recycle its worker processes much more frequently. We are theorizing that this will result in worker processes being restarted before they spin out of control and use up CPU.

This doesn't fix whatever the underlying problem is (why CPU usage slowly climbs up to 100%), but does hopefully reduce the end-user-visible effects of it.
There may be more than one issue occurring here.

I just caught the situation with MaxClients being reached, and was able to get a dump of the server-status page from all of the web heads while it was ongoing.

The first thing I noticed is that KeepAlives are insignificant... virtually all of the workers are in the 'W' state ("Sending Reply", aka: working). Normally there are 30-60 workers alive at any one time... in this situation, it is typically maxed out at 512 workers. So it's rather abnormal.

In looking at the Requests currently being processed, there is comparatively little duplication... leading me to believe it's not a DoS attack. One would normally expect that to generate a huge volume of traffic to a particular URL... that's not what's happening here.

I did notice however that a large percentage of the workers were working on some variant of the /whatsnew/ page. The general URL pattern is like this:

/en-US/firefox/14.0.1/whatsnew/?oldversion=13.0.1

The locale, version, and oldversion strings vary, but approximately 82% of the GET requests being processed are for a /whatsnew/ page. I spot-checked the access log, and this appears to be far out of step with the relative number of hits to /whatsnew/ pages, so I don't think it's a normal bias caused simply by those pages receiving more traffic.
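For what it's worth, the spot-check was along these lines, a quick sketch rather than anything rigorous (the log path and combined log format on the webheads are assumptions):

#!/usr/bin/env python
"""Rough spot-check: what fraction of GETs in an access log are /whatsnew/ pages?"""
import re
import sys

WHATSNEW = re.compile(r'"GET [^"]*/whatsnew/')

total = hits = 0
with open(sys.argv[1]) as log:   # e.g. the www.mozilla.org access_log on a webhead
    for line in log:
        if '"GET ' not in line:
            continue
        total += 1
        if WHATSNEW.search(line):
            hits += 1

print('%d/%d GET requests (%.1f%%) were /whatsnew/'
      % (hits, total, 100.0 * hits / max(total, 1)))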
(In reply to Jake Maul [:jakem] from comment #23)
> There may be more than one issue occurring here.

This lines up with the behaviour noted in comment #15, whereby local requests are being processed much more rapidly than remote requests to the self-same machines. My gut instinct is to blame Zeus for at least a portion of the problem, but having taken a look at the configuration for Bedrock, nothing really seems to pop out immediately.  I'll take a second look, though - couldn't hurt.
The whatsnew page was launched on Bedrock on Tuesday. It appears a lot in the Apache logs because there are a lot of different URLs (which bypass the HTTP cache) and also because we have to use Vary: User-Agent, and there are a lot of different User-Agents out there.

So the whatsnew page may make the problem even worse because we get more hits on the webheads. But it's definitely not the root cause.

Users are also more likely to see this problem than before. Before, only POST requests would take a long time (because they are not cacheable) and we don't have a lot of POST requests (only newsletter subscriptions, AFAICT). But now we have this whatsnew page.
I think you missed part of my comment:

I spot-checked the access log, and this appears to be far out of step with the relative number of hits to /whatsnew/ pages, so I don't think it's a normal bias caused simply by those pages receiving more traffic.

This is looking at hits on the web node, not from Zeus. Thus it's only looking at things that do make it all the way to the web node.

This is not a case of simply getting more traffic. The steady state is nowhere near maxed out. Right now I see anywhere from 40-60 workers on each node, and most of those are in KeepAlive state. When the problem occurs, we're talking about *512 active workers*, with virtually none of them idling in KeepAlive.

So when I say that multiple servers are suddenly maxxed out, you can be sure this is not a case of just needing more servers, or more traffic getting through to the web nodes. And when 82% of those workers are working on a /whatsnew/ page, that seems highly coincidental.

If you have a better idea of what the problem might be I'm definitely interested to hear it, but this is what we've got. It's bad enough now we're actually debating on setting up a cron job to restart Apache... that should tell you how troublesome this has become.
Group: infra
So we worked on analyzing this a lot more today. We set up statsd/graphite to get data about what's going on (bug 775534). Right now, we're only tracking the time each request takes. From a quick look, we haven't found an obvious culprit. We'll need to take a closer look and/or add other metrics in there.
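The per-request timing is basically the standard statsd middleware pattern, something like the sketch below (assuming the Python statsd client; the metric name and the exact wiring in bedrock are assumptions):

import time

from statsd import StatsClient

statsd = StatsClient('localhost', 8125, prefix='bedrock')


class RequestTimingMiddleware(object):
    """Send the wall-clock time of every request to statsd/graphite."""

    def process_request(self, request):
        request._start_time = time.time()

    def process_response(self, request, response):
        start = getattr(request, '_start_time', None)
        if start is not None:
            statsd.timing('view.response_time', (time.time() - start) * 1000)
        return response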

Since the problem is not solved and everyone wants to have a good weekend, we're setting up a cron job to restart the WSGI processes.
So I took a look at our system metrics in Ganglia. The interesting part is that our network in/out traffic seems to track our load.
Do memcache requests count as outside traffic?
I think I found it!

Every time I refresh a page, I get two more calls to mozorg.dotlang.translate(). I don't know yet why we get two new calls for every request, but this will be easier to track now.
Commits pushed to master at https://github.com/mozilla/bedrock

https://github.com/mozilla/bedrock/commit/45435c143ef023a3fda374d03fd42ae4e2c19545
Bug 760570 - Copy the DOTLANG_FILES settings instead before appending to it

Without this, the DOTLANG_FILES setting would grow with each request.

https://github.com/mozilla/bedrock/commit/4b113717331e6f4e73b071e72ffaa9373394d64c
Merge pull request #282 from Rik/copy-settings

Bug 760570 - Copy the DOTLANG_FILES settings instead before appending to...
As you can see, we have a fix!

The setting listing all the L10n translation files would grow by two new entries with every request, so we would do more and more lookups on every request. The fix stops the setting from being modified on every request.
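In other words, the bug was shared mutable state. Simplified to its essentials (this is not the literal diff; see the commits above for the real change):

from django.conf import settings


def lang_files_before(extra_files):
    files = settings.DOTLANG_FILES
    files += extra_files      # += extends the shared settings list in place,
    return files              # so it grows by two entries on every request


def lang_files_after(extra_files):
    files = list(settings.DOTLANG_FILES)   # copy first...
    files += extra_files                   # ...so the additions stay per-request
    return files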
So we'll need a push to deploy this. We'll monitor the graphs for a bit to see if the problem shows up again. If it doesn't (fingers crossed), we won't need the cron job that was set up this weekend.
Push completed, please confirm things look good and I will remove the temp cron job.
Not sure if related to the current changes, but there are a ton of errors flying through www.mozilla.org error_log.
Chatted with :Rik on IRC; the errors that I am seeing are "normal", in that they were known before the current push.

Perhaps the existing errors should be fixed in a different bug?
The issue doesn't seem fixed. I'll do more research tomorrow.

So don't remove the cronjob yet.
As per :rik, it seems the fix wasn't enough.
bedrock1.webapp.phx1.mozilla.com:httpd max clients is WARNING: Using 512 out of 512 Clients 

Seeing the same thing jakem saw:
32686 apache     20   0  327M 29796  7844 S  0.0  0.1  0:37.67 │  ├─ /usr/sbin/httpd                                                    
30704 apache     20   0  479M 50692  4792 S 81.0  0.1 12:55.31 │  ├─ bedrock_prod
30733 apache     20   0  479M 50692  4792 R 84.0  0.1 12:51.67 │  │  ├─ bedrock_prod
30732 apache     20   0  479M 50692  4792 S  0.0  0.1  0:00.04 │  │  ├─ bedrock_prod
30731 apache     20   0  479M 50692  4792 S  0.0  0.1  0:00.00 │  │  └─ bedrock_prod

# strace -p 30704 (worker)
Process 30704 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>

{no further output, this is expected}

# strace -p 30733 (thread)
poll([{fd=20, events=POLLIN}], 1, 3000) = 1 ([{fd=20, revents=POLLIN}])
recvfrom(20, "VALUE bedrock_prod:1:dotlang-en-"..., 4096, 0, NULL, NULL) = 70
poll([{fd=20, events=POLLOUT}], 1, 3000) = 1 ([{fd=20, revents=POLLOUT}])
sendto(20, "get bedrock_prod:1:dotlang-en-US"..., 43, 0, NULL, 0) = 43

{non-stop output}


Will update with more findings if/when the issue recurs.
So looking at the output of the push that Jason did, I see:
You are not currently on a branch. Please specify which
branch you want to merge with. See git-pull(1) for details.

This leads me to believe that prod is not running my fix. Which would be cool :)
Aj did a 'git checkout master' before doing another deploy. And the curves look deceptively flat now. \o/

So I think we're done for real here! Will update tomorrow morning to let you know if we can remove the cron and if we can increase WSGIDaemon maximum-requests.
Looks like this is fixed. http://i.imgur.com/wT1RL.png

Can we remove the crontab and reset the WSGIDaemon maximum-requests to the default?
02:12:50 < Rik> phrawzty: peux-tu enlever la cron qui redémarre bedrock? bug 760570
02:17:18 < Rik> phrawzty: et si tu connais la valeur par défaut de WSGIDaemon maximum-requests, ce serait bien de la remettre aussi

Translation:
* Remove the cron job that restarts Bedrock.
* Set the WSGIDaemon value back to the default setting.

WIP.
(In reply to Daniel Maher [:phrawzty] from comment #41)
> * Remove the cron job that restarts Bedrock.
> * Set the WSGIDaemon value back to the default setting.

* Cron job removed.
* WSGIDaemon maximum-requests doesn't have a default value; I set it back to 15000.

Everything seems to be in order (finally, hurrah).  Hearty congratulations all around.  Closing the bug.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
Attached image Graph of response times

imgur might delete some images eventually, and I want to keep this graph around, so I'm uploading a copy of the graph from comment 40.
