Closed
Bug 760570
Opened 12 years ago
Closed 12 years ago
bedrock*.webapp.(scl3|phx1) are processing requests too slowly
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Infrastructure & Operations Graveyard
WebOps: Other
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ashish, Assigned: dmaher)
References
Details
Attachments
(1 file)
123.25 KB,
image/png
Details
Requests for www.mozilla.org are taking too long to process in the Bedrock cluster. This is in turn timing out Watchmouse checks:

[ashish@bedrock1.webapp.phx1 ~]$ time curl -ILH "Host: www.mozilla.org" http://localhost:81/en-US/
HTTP/1.1 200 OK
Date: Fri, 01 Jun 2012 16:54:16 GMT
Server: Apache
x-frame-options: DENY
Expires: Fri, 01 Jun 2012 17:04:25 GMT
Cache-Control: max-age=600
X-Backend-Server: bedrock1.webapp.phx1.mozilla.com
Content-Length: 25155
Content-Type: text/html; charset=utf-8

real    0m8.476s
user    0m0.001s
sys     0m0.005s

[ashish@bedrock2.webapp.phx1 ~]$ time curl -ILH "Host: www.mozilla.org" http://localhost/en-US/
HTTP/1.1 200 OK
Date: Fri, 01 Jun 2012 16:48:49 GMT
Server: Apache
x-frame-options: DENY
Expires: Fri, 01 Jun 2012 16:58:59 GMT
Cache-Control: max-age=600
X-Backend-Server: bedrock2.webapp.phx1.mozilla.com
Content-Length: 25155
Content-Type: text/html; charset=utf-8

real    0m9.424s
user    0m0.000s
sys     0m0.006s

Oddly, these are HP blades, compared to bedrock3 which is a Seamicro Xeon:

[ashish@bedrock3.webapp.phx1 ~]$ time curl -ILH "Host: www.mozilla.org" http://localhost/en-US/
HTTP/1.1 200 OK
Date: Fri, 01 Jun 2012 16:55:05 GMT
Server: Apache
x-frame-options: DENY
Expires: Fri, 01 Jun 2012 17:05:08 GMT
Cache-Control: max-age=600
X-Backend-Server: bedrock3.webapp.phx1.mozilla.com
Content-Type: text/html; charset=utf-8

real    0m2.067s
user    0m0.003s
sys     0m0.000s
Comment 1•12 years ago
One relevant note: bedrock1 and 2 in PHX1 are much more powerful than bedrock3... consequently, Zeus uses a weighted algorithm and they get a lot more traffic than 3 does. Also, the total traffic volume sent to bedrock is evenly split between SCL3 and PHX1, so these two nodes are probably hit more strongly than the 5 nodes in SCL3. It appears that the WSGI processes are each maxing out a core. I will investigate further.
Assignee: server-ops → nmaul
Updated•12 years ago
Summary: bedrock[1-2].webapp.scl3 are processing requests too slowly → bedrock[1-2].webapp.phx1 are processing requests too slowly
Comment 2•12 years ago
Not all of the WSGI processes were maxing out a core, but many (8+) of them were. I restarted Apache on both of them, and they seem to be back to normal. I didn't think to attach an strace to them before restarting... might be instructive if it happens again.

It's worth noting that these 2 servers are each very powerful... 16 cores, 36GB RAM. They're ex-Zeus nodes. bedrock3 exists primarily as a fallback/failsafe mechanism... it's just a 2-core 4GB RAM VM. These 2 servers do all the real work in PHX1, and did *all* the work before bedrock was live in SCL3.

It might be worth considering replacing these 2 servers (and the VM) with 5 Seamicro Xeons, like SCL3 has. I don't think it would be a significant increase in available capacity, but it would make the 2 environments identical. That might be good or bad. :)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 3•12 years ago
This is a recurring issue... reopening.

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
31931 apache    20   0  473m  47m 4456 S 101.1  0.1  88:49.03 bedrock_prod
31923 apache    20   0  475m  47m 4488 S  99.4  0.1  92:31.04 bedrock_prod
30235 apache    20   0  327m  28m 7880 S  17.9  0.1   0:07.79 /usr/sbin/httpd
25843 apache    20   0  325m  27m 8024 S   9.8  0.1   0:31.21 /usr/sbin/httpd
29680 apache    20   0  320m  22m 8032 S   6.5  0.1   0:09.60 /usr/sbin/httpd

[root@bedrock1.webapp.phx1 ~]# strace -p 31931
Process 31931 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>

It just hangs there indefinitely, using 100% of 1 CPU core.

I'm seeing this on the new SCL3 Seamicro Xeon cluster also... exactly the same behavior.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 4•12 years ago
Any ideas on this (jlongster / wenzel)? It's bad enough that I'm seriously considering a cron job to restart Apache regularly. :/

Changing the title... this affects nodes in SCL3 as well as PHX1. PHX1 nodes seem a bit harder hit, but I think that's just because there are fewer of them - basically 2 nodes vs 5, for the same traffic volume. I've seen the same behavior on all of them.

Another note: it seems to gradually build up to 100% usage on a core. It seems like the worker keeps working, but at an elevated CPU usage state.

There seem to be 4 threads per worker, although the setting specifies only one:

WSGIDaemonProcess bedrock_prod processes=16 threads=1 maximum-requests=15000 display-name=bedrock_prod

Perhaps the other 3 are not "worker" threads but some sort of overhead/management threads. Seeing this makes me want to retry the strace from comment 3, attaching to individual threads within one of the failing processes. I will try that next time this occurs.

I also increased LogLevel verbosity to "info". This might help catch something when this reoccurs.
Summary: bedrock[1-2].webapp.phx1 are processing requests too slowly → bedrock*.webapp.(scl3|phx1) are processing requests too slowly
Comment 5•12 years ago
The first of the 4 threads responds exactly as in comment 3... it seems to do nothing.

The second thread does almost nothing... it hangs with this:

[root@bedrock2.webapp.phx1 ~]# strace -p 29827
Process 29827 attached - interrupt to quit
select(0, NULL, NULL, NULL, {210, 845816}

No further output, after about a minute of waiting.

The third thread is a bit more exciting. It does this:

[root@bedrock2.webapp.phx1 ~]# strace -p 29828
Process 29828 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 907676}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)

These selects are all about 1 second apart, and it repeats... forever, I guess.

The fourth thread seems to be the actual worker thread. It generates a lot of output... too much to feasibly list here, but one line stands out as an exceedingly common error... perhaps much more so when the problem is occurring:

stat("/data/www/www.mozilla.org-django/bedrock/locale/en-US/firefox/new.lang", 0x7fc047f1d9e0) = -1 ENOENT (No such file or directory)

The exact file is almost always (~95%) new.lang, but others do show up occasionally. It often occurs many times (dozens/hundreds) in a row, often interspersed with the same error with a different file... but always a big chunk for new.lang.

I feel like if we could identify what causes these stat() calls, and why they're failing (there is no locale/en-US directory, for one thing), that might help to some extent.
Comment 6•12 years ago
Awesome detective work. The stat calls are coming from the l10n system that tries to load in translations. It checks to see if the .lang file exists, and if so, loads it in: https://github.com/mozilla/bedrock/blob/master/lib/l10n_utils/dotlang.py#L19 If this is the problem, this will be fixed once we get memcache running in production. It caches the translation results but as far as I know production is using a dummy cache that doesn't do anything. I didn't think this would be a problem, considering that Zeus caches the page for 15 min, but maybe the stat calls are stalling the system more than expected. memcache is mostly set up (see bug 738381). I wonder if we can get that live soon.
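To make the mechanism concrete, here is a minimal, hypothetical sketch (illustrative names, not the actual bedrock code) of why a dummy cache backend turns every translation lookup into a fresh stat() call:

```python
import os

def load_translations(path, cache):
    """Parse a .lang file, caching the result (even an empty one)."""
    cached = cache.get(path)
    if cached is not None:
        return cached                      # real cache: no filesystem access
    translations = {}
    if os.path.exists(path):               # the stat() call seen in strace
        with open(path) as f:
            for raw in f:
                line = raw.strip()
                if line and not line.startswith("#"):
                    translations[line] = line
    cache[path] = translations
    return translations

class DummyCache(dict):
    """Mimics a dummy cache backend: every store is silently dropped."""
    def __setitem__(self, key, value):
        pass
```

With DummyCache, cache.get() always returns None, so os.path.exists() runs on every single request; a real backend (memcache, locmem) would absorb all but the first lookup per file.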
Comment 9•12 years ago
A few weeks ago, we enabled the locmem cache backend. I think it was around the Firefox Mobile launch (2012-06-26). Even though it's not as powerful as memcache, this should have fixed the issue.
Assignee
Updated•12 years ago
Assignee: nmaul → dmaher
Comment 10•12 years ago
I've added an account for :rik on the production webheads (which should be live within 30 minutes). These are bedrock[1-3].webapp.phx1.mozilla.com and bedrock[1-5].webapp.scl3.mozilla.com; they should be accessible through the mpt-vpn.
Assignee
Comment 11•12 years ago
(In reply to Jake Maul [:jakem] from comment #3)
> [root@bedrock1.webapp.phx1 ~]# strace -p 31931
> Process 31931 attached - interrupt to quit
> restart_syscall(<... resuming interrupted call ...>
>
> It just hangs there indefinitely, using 100% of 1 CPU core.
>
> I'm seeing this on the new SCL3 Seamicro Xeon cluster also... exactly the
> same behavior.

Fwiw, I'm currently seeing this behaviour on _all_ of the bedrock_prod processes on bedrock1.webapp.phx1, so that's possible - which is unfortunate.
Comment 12•12 years ago
Thanks Corey. I've worked with Dan, so no need for access on prod.

What I've done so far:

- On the missing new.lang file: if a file was missing, our code would store {} in the cache, but it would keep checking for the file because we did not differentiate an empty cache (None) from an empty dict from the cache ({}). I fixed that.
- I added a 10-second timeout for our newsletter backend (ExactTarget). Our code had no timeout so far, and I suspect ExactTarget could be our cause here. We'll receive stacktraces for requests to this backend that take more than 10 seconds. In the future, our newsletter handling should be handed off to a Celery task.
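The None-vs-{} distinction described in the first point can be sketched like this (hypothetical code, not the actual patch):

```python
# Buggy version: a cached empty dict is falsy, so a missing .lang file
# is re-checked on disk on every request even though the miss was cached.
def get_translations_buggy(path, cache, loader):
    cached = cache.get(path)
    if cached:                 # {} is falsy -> treated as a cache miss
        return cached
    result = loader(path)      # hits the filesystem again
    cache[path] = result
    return result

# Fixed version: distinguish "not in cache" (None) from "cached empty" ({}).
def get_translations_fixed(path, cache, loader):
    cached = cache.get(path)
    if cached is not None:     # {} now counts as a valid cached result
        return cached
    result = loader(path)
    cache[path] = result
    return result
```

With the fixed version, a missing file costs exactly one filesystem lookup per cache lifetime instead of one per request.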
Comment 13•12 years ago
Since yesterday, I've received 6 tracebacks for the ExactTarget 10s timeout: 3 from PHX1 and 3 from SCL3. And right now, from PAR1, I'm seeing responses between 5s and 10s. Seems like we did not fix it.
Comment 14•12 years ago
I have no clue what could cause this now. One random idea: what if it was related to the proxying in AMS1? Maybe AMS1 issues too many requests to one server, bypassing the Zeus load balancing? What is the config of AMS1? I can't find it on mana.
Assignee
Comment 15•12 years ago
:rik mentioned that he was seeing long response times from PHX1. I logged on to bedrock[1-3].webapp.phx1 and ran this every 5 seconds, for 30 minutes, on each:

time -p -ao out curl -s -ILH "Host: www.mozilla.org" http://localhost/en-US/

Average response time to the above command on each:
* bedrock1 : 2.12 seconds
* bedrock2 : 2.60 seconds
* bedrock3 : 0.46 seconds

Could the end-user experience of slow response times be coming from something else entirely - perhaps something in between the webheads and the user?
Assignee
Comment 16•12 years ago
It may be worth noting that there appears to be a correlation between the individual CPU usage per process and the response time per machine. In other words, both response time AND CPU consumption (per bedrock_prod process) rank:

* bedrock2 > bedrock1 > bedrock3
Assignee
Comment 17•12 years ago
I ran the recurring curl mentioned in comment #15 from nladm01 overnight, directed at "www.mozilla.org" itself (of course). The average response time was 1.67 seconds; however, the pattern of the responses was interesting: every response would be around the average time, except that, on average, 1 in every 14 attempts spiked to ~6.18 seconds. I'm not sure if that means anything, but I thought I'd mention it at least.
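A rough Python equivalent of that overnight probe (the URL, counts, and thresholds are illustrative; the actual test used curl):

```python
import time
import urllib.request

def probe(url, n, interval=5.0):
    """Time n sequential GET requests; returns a list of seconds-per-request."""
    timings = []
    for _ in range(n):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                resp.read()
            timings.append(time.monotonic() - start)
        except OSError:
            pass                       # skip failed attempts
        time.sleep(interval)
    return timings

def summarize(timings):
    """Mean, max, and count of spikes (responses over twice the mean)."""
    mean = sum(timings) / len(timings)
    return {"mean": mean,
            "max": max(timings),
            "spikes": sum(1 for t in timings if t > 2 * mean)}
```

Against the numbers in this comment, thirteen responses around 1.32s plus one ~6.18s outlier give a mean of about 1.67s and one spike, which matches the pattern described.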
Comment 18•12 years ago
Is the short term/long term fix more hardware?
Assignee
Comment 19•12 years ago
(In reply to matthew zeier [:mrz] from comment #18)
> Is the short term/long term fix more hardware?

The short-term fix is to restart bedrock/httpd on a regular basis, since this "fixes" the problem - which is to say that it removes the symptoms temporarily. After a while, each process will slowly eat more CPU, becoming slower and slower, until it needs to be restarted again. We haven't put a regular cron job in place yet for a variety of reasons, not the least of which being that band-aids tend to become permanent - and this would be a very poor permanent solution indeed.

The long-term fix is to delve into the code of the bedrock application and deal with whatever root issue is actually causing the problem in the first place. :rik has been working on this; last we spoke, he was going to put Ganglia metrics into place within the app itself, in order to try to pinpoint where - within the application itself - inefficiencies are occurring.
Comment 20•12 years ago
I don't know about Ganglia, I was rather suggesting using Graphite which we need to setup first. I'll open bugs about this.
Assignee
Comment 21•12 years ago
(In reply to Anthony Ricaud (:rik) from comment #20)
> I don't know about Ganglia, I was rather suggesting using Graphite which we
> need to setup first. I'll open bugs about this.

Woops, yes, I meant to type "Graphite", not "Ganglia".
Comment 22•12 years ago
I just checked in a change to the mod_wsgi config... I reduced "maximum-requests" from 15000 to 1000. This should cause mod_wsgi to recycle its worker processes much more frequently. We are theorizing that this will result in worker processes being restarted before they spin out of control and use up CPU. This doesn't fix whatever the underlying problem is (why CPU usage slowly climbs up to 100%), but does hopefully reduce the end-user-visible effects of it.
Comment 23•12 years ago
There may be more than one issue occurring here. I just caught the situation with MaxClients being reached, and was able to get a dump of the server-status page from all of the webheads while it was ongoing.

The first thing I noticed is that KeepAlives are insignificant... virtually all of the workers are in the 'W' state ("Sending Reply", aka: working). Normally there are 30-60 workers alive at any one time... in this situation, it is typically maxed out at 512 workers. So it's rather abnormal.

In looking at the requests currently being processed, there is comparatively little duplication... leading me to believe it's not a DoS attack. One would normally expect that to generate a huge volume of traffic to a particular URL... that's not what's happening here.

I did notice however that a large percentage of the workers were working on some variant of the /whatsnew/ page. The general URL pattern is like this:

/en-US/firefox/14.0.1/whatsnew/?oldversion=13.0.1

The locale, version, and oldversion strings vary, but approximately 82% of the GET requests being processed are for a /whatsnew/ page. I spot-checked the access log, and this appears to be far out of step with the relative number of hits to /whatsnew/ pages, so I don't think it's a normal bias caused simply by those pages receiving more traffic.
Assignee
Comment 24•12 years ago
(In reply to Jake Maul [:jakem] from comment #23)
> There may be more than one issue occurring here.

This lines up with the behaviour noted in comment #15, whereby local requests are being processed much more rapidly than remote requests to the self-same machines. My gut instinct is to blame Zeus for at least a portion of the problem, but having taken a look at the configuration for Bedrock, nothing really seems to pop out immediately. I'll take a second look, though - couldn't hurt.
Comment 25•12 years ago
The whatsnew page was launched on Bedrock on Tuesday. It does appear a lot in Apache logs because there are a lot of different URLs (bypassing the HTTP cache) and also because we have to use Vary: User-Agent, and there's a lot of different User-Agents out there. So the whatsnew page may make the problem even worse, because we get more hits on the webheads. But it's definitely not the root cause.

Users are also more likely to see this problem than before. Before, only POST requests would take a long time (because they are not cacheable), and we don't have a lot of POST requests (only newsletter subscriptions, AFAICT). But now we have this whatsnew page.
Comment 26•12 years ago
I think you missed part of my comment:

> I spot-checked the access log, and this appears to be far out of step with the relative number of hits to /whatsnew/ pages, so I don't think it's a normal bias caused simply by those pages receiving more traffic.

This is looking at hits on the web node, not from Zeus. Thus it's only looking at things that do make it all the way to the web node. This is not a case of simply getting more traffic.

The steady state is nowhere near maxed out. Right now I see anywhere from 40-60 workers on each node, and most of those are in KeepAlive state. When the problem occurs, we're talking about *512 active workers*, with virtually none of them idling in KeepAlive. So when I say that multiple servers are suddenly maxed out, you can be sure this is not a case of just needing more servers, or more traffic getting through to the web nodes. And when 82% of those workers are working on a /whatsnew/ page, that seems highly coincidental.

If you have a better idea of what the problem might be I'm definitely interested to hear it, but this is what we've got. It's bad enough now that we're actually debating setting up a cron job to restart Apache... that should tell you how troublesome this has become.
Updated•12 years ago
Group: infra
Comment 27•12 years ago
So we worked on analyzing this a lot more today. We set up statsd/graphite to get data about what's going on (bug 775534). Right now, we're only tracking the time each request takes. From a quick look, we haven't found an obvious culprit. We'll need to take a closer look and/or add other metrics in there.

Since the problem is not solved and everyone wants to have a good weekend, we're setting up a cron to restart the wsgi processes.
Comment 28•12 years ago
So I took a look at our system metrics in Ganglia. The interesting part is that our network in/out traffic seems to match our load. Do memcache requests count as outside traffic?
Comment 29•12 years ago
I think I found it! Every time I refresh one page, I get two more calls to mozorg.dotlang.translate(). I don't know yet why we get those two new calls for every request, but this will be easier to track now.
Comment 30•12 years ago
Commits pushed to master at https://github.com/mozilla/bedrock

https://github.com/mozilla/bedrock/commit/45435c143ef023a3fda374d03fd42ae4e2c19545
Bug 760570 - Copy the DOTLANG_FILES settings instead before appending to it
Without this, the DOTLANG_FILES setting would grow with each request.

https://github.com/mozilla/bedrock/commit/4b113717331e6f4e73b071e72ffaa9373394d64c
Merge pull request #282 from Rik/copy-settings
Bug 760570 - Copy the DOTLANG_FILES settings instead before appending to...
Comment 31•12 years ago
As you can see, we have a fix! The setting listing all the L10N translation files would grow by two new entries with every request, so we would do more and more lookups on every request. With this fix, the setting is no longer modified on every request.
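The bug and the fix can be illustrated with a small, hypothetical sketch (DOTLANG_FILES here is a stand-in for the real Django setting, and the file names are made up):

```python
# Module-level setting shared by every request in the process.
DOTLANG_FILES = ["main", "download_button"]

def lang_files_buggy(extra):
    """Pre-fix behavior: binds to the setting itself and mutates it."""
    files = DOTLANG_FILES            # same list object as the setting!
    for name in extra:
        files.append(name)           # the setting grows on every request
    return files

def lang_files_fixed(extra):
    """Post-fix behavior: copy the setting before appending to it."""
    files = list(DOTLANG_FILES)      # fresh copy per request
    for name in extra:
        files.append(name)
    return files
```

Because WSGI workers are long-lived, the buggy version makes each process scan an ever-longer file list, which matches the slow CPU climb seen on the webheads.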
Comment 32•12 years ago
So we'll need a push to fix this. We'll monitor for a bit to see if the graphs show the problem again. If they don't (crosses fingers), we won't need the cron that was set up this weekend.
Comment 33•12 years ago
Push completed, please confirm things look good and I will remove the temp cron job.
Comment 34•12 years ago
Not sure if this is related to the current changes, but there are a ton of errors flying through the www.mozilla.org error_log.
Comment 35•12 years ago
Chatted with :Rik on IRC; the errors that I am seeing are "normal", as in they were known before the current push. Perhaps the existing errors should be fixed in a different bug?
Comment 36•12 years ago
The issue doesn't seem fixed. I'll do more research tomorrow, so don't remove the cronjob yet.
Comment 37•12 years ago
As per :rik, it seems the fix wasn't enough.

bedrock1.webapp.phx1.mozilla.com:httpd max clients is WARNING: Using 512 out of 512 Clients

Seeing the same thing jakem saw:

32686 apache 20 0 327M 29796 7844 S  0.0 0.1  0:37.67 │ ├─ /usr/sbin/httpd
30704 apache 20 0 479M 50692 4792 S 81.0 0.1 12:55.31 │ ├─ bedrock_prod
30733 apache 20 0 479M 50692 4792 R 84.0 0.1 12:51.67 │ │ ├─ bedrock_prod
30732 apache 20 0 479M 50692 4792 S  0.0 0.1  0:00.04 │ │ ├─ bedrock_prod
30731 apache 20 0 479M 50692 4792 S  0.0 0.1  0:00.00 │ │ └─ bedrock_prod

# strace -p 30704 (worker)
Process 30704 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>
{no further output, this is expected}

# strace -p 30733 (thread)
poll([{fd=20, events=POLLIN}], 1, 3000) = 1 ([{fd=20, revents=POLLIN}])
recvfrom(20, "VALUE bedrock_prod:1:dotlang-en-"..., 4096, 0, NULL, NULL) = 70
poll([{fd=20, events=POLLOUT}], 1, 3000) = 1 ([{fd=20, revents=POLLOUT}])
sendto(20, "get bedrock_prod:1:dotlang-en-US"..., 43, 0, NULL, 0) = 43
{non-stop output}

Will update with more findings if/when the issue recurs.
Comment 38•12 years ago
So looking at the output of the push that Jason did, I see:

You are not currently on a branch. Please specify which branch you want to merge with. See git-pull(1) for details.

This leads me to believe that prod does not run with my fix. Which would be cool :)
Comment 39•12 years ago
Aj did a 'git checkout master' before doing another deploy, and the curves look reassuringly flat now. \o/ So I think we're done for real here! I will update tomorrow morning to let you know if we can remove the cron and if we can increase the WSGIDaemon maximum-requests.
Comment 40•12 years ago
Looks like this is fixed. http://i.imgur.com/wT1RL.png Can we remove the crontab and reset the WSGIDaemon maximum-requests to the default?
Assignee
Comment 41•12 years ago
02:12:50 < Rik> phrawzty: peux-tu enlever la cron qui redémarre bedrock? bug 760570
02:17:18 < Rik> phrawzty: et si tu connais la valeur par défaut de WSGIDaemon maximum-requests, ce serait bien de la remettre aussi

Translation:
* Remove the cron job that restarts Bedrock.
* Set the WSGIDaemon value back to the default setting.

WIP.
Assignee
Comment 42•12 years ago
(In reply to Daniel Maher [:phrawzty] from comment #41)
> * Remove the cron job that restarts Bedrock.
> * Set the WSGIDaemon value back to the default setting.

* Cron job removed.
* WSGIDaemon maximum-requests doesn't have a default value; I set it back to 15000.

Everything seems to be in order (finally, hurrah). Hearty congratulations all around. Closing the bug.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
Comment 43•1 year ago
imgur might delete some images, and I want to keep this graph around, so I'm uploading a copy of the graph from comment 40.