760570 - bedrock*.webapp.(scl3|phx1) are processing requests too slowly

Reporter

Description

•

12 years ago

Requests for www.mozilla.org are taking too long to process in the Bedrock cluster. This is in-turn timing out Watchmouse checks:

[ashish@bedrock1.webapp.phx1 ~]$ time curl -ILH "Host: www.mozilla.org" http://localhost:81/en-US/
HTTP/1.1 200 OK
Date: Fri, 01 Jun 2012 16:54:16 GMT
Server: Apache
x-frame-options: DENY
Expires: Fri, 01 Jun 2012 17:04:25 GMT
Cache-Control: max-age=600
X-Backend-Server: bedrock1.webapp.phx1.mozilla.com
Content-Length: 25155
Content-Type: text/html; charset=utf-8


real    0m8.476s
user    0m0.001s
sys     0m0.005s

[ashish@bedrock2.webapp.phx1 ~]$ time curl -ILH "Host: www.mozilla.org" http://localhost/en-US/
HTTP/1.1 200 OK
Date: Fri, 01 Jun 2012 16:48:49 GMT
Server: Apache
x-frame-options: DENY
Expires: Fri, 01 Jun 2012 16:58:59 GMT
Cache-Control: max-age=600
X-Backend-Server: bedrock2.webapp.phx1.mozilla.com
Content-Length: 25155
Content-Type: text/html; charset=utf-8


real    0m9.424s
user    0m0.000s
sys     0m0.006s

Oddly, these are HP blades, compared to bedrock3 which is a seamicro-xeon:

[ashish@bedrock3.webapp.phx1 ~]$  time curl -ILH "Host: www.mozilla.org" http://localhost/en-US/
HTTP/1.1 200 OK
Date: Fri, 01 Jun 2012 16:55:05 GMT
Server: Apache
x-frame-options: DENY
Expires: Fri, 01 Jun 2012 17:05:08 GMT
Cache-Control: max-age=600
X-Backend-Server: bedrock3.webapp.phx1.mozilla.com
Content-Type: text/html; charset=utf-8


real    0m2.067s
user    0m0.003s
sys     0m0.000s

Jake Maul [:jakem]

Comment 1

•

12 years ago

One relevant note: bedrock1 and 2 in PHX1 are much more powerful than bedrock3... consequently, Zeus uses a weighted algorithm and they get a lot more traffic than 3 does.

Also, the total traffic volume sent to bedrock is evenly split between SCL3 and PHX1, so these two nodes are probably hit more strongly than the 5 nodes in SCL3.

It appears that the WSGI processes are each maxxing out a core. I will investigate further.

Assignee: server-ops → nmaul

Jake Maul [:jakem]

Updated

•

12 years ago

Summary: bedrock[1-2].webapp.scl3 are processing requests too slowly → bedrock[1-2].webapp.phx1 are processing requests too slowly

Jake Maul [:jakem]

Comment 2

•

12 years ago

Not all of the WSGI processes were maxxing out a core, but many (8+) of them were. I restarted Apache on both of them, and they seem to be back to normal.

I didn't think to attach an strace to them before restarting... might be instructive if it happens again.

It's worth noting that these 2 servers are each very powerful... 16 cores, 36GB RAM. They're ex-Zeus nodes. bedrock3 exists primarily as a fallback/failsafe mechanism... it's just a 2-core 4GB RAM VM. These 2 servers do all the real work in PHX1, and did *all* the work before bedrock was live in SCL3.

It might be worth considering replacing these 2 servers (and the VM) with 5 Seamicro Xeons, like SCL3 has. I don't think it would be a significant increase in available capacity, but it would make the 2 environments identical. That might be good or bad. :)

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Jake Maul [:jakem]

Comment 3

•

12 years ago

This is a recurring issue... reopening.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31931 apache    20   0  473m  47m 4456 S 101.1  0.1  88:49.03 bedrock_prod
31923 apache    20   0  475m  47m 4488 S 99.4  0.1  92:31.04 bedrock_prod
30235 apache    20   0  327m  28m 7880 S 17.9  0.1   0:07.79 /usr/sbin/httpd
25843 apache    20   0  325m  27m 8024 S  9.8  0.1   0:31.21 /usr/sbin/httpd
29680 apache    20   0  320m  22m 8032 S  6.5  0.1   0:09.60 /usr/sbin/httpd

[root@bedrock1.webapp.phx1 ~]# strace -p 31931
Process 31931 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>

It just hangs there indefinitely, using 100% of 1 CPU core.

I'm seeing this on the new SCL3 Seamicro Xeon cluster also... exactly the same behavior.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Jake Maul [:jakem]

Comment 4

•

12 years ago

Any ideas on this (jlongster / wenzel)? It's bad enough that I'm seriously considering a cron job to restart Apache regularly. :/

Changing the title... this affects nodes in SCL3 as well as PHX1. PHX1 nodes seem a bit harder hit, but I think that's just because there are fewer of them- basically 2 nodes vs 5, for the same traffic volume. I've seen the same behavior or all of them.

Another note: it seems to gradually build up to 100% usage on a core. It seems like the worker keeps working, but at an elevated CPU usage state. There seems to be 4 threads per worker, although the setting specifies only one:

WSGIDaemonProcess bedrock_prod processes=16 threads=1 maximum-requests=15000 display-name=bedrock_prod

Perhaps the other 3 are not "worker" threads but some sort of overhead/management threads. Seeing this makes me want to retry the strace from comment 3, attaching to individual threads within one of the failing processes. I will try that next time this occurs.

I also increased LogLevel verbosity to "info". This might help catch something when this reoccurs.

Summary: bedrock[1-2].webapp.phx1 are processing requests too slowly → bedrock*.webapp.(scl3|phx1) are processing requests too slowly

Jake Maul [:jakem]

Comment 5

•

12 years ago

The first of the 4 threads response exactly as in comment 3... it seems to do nothing.


The second thread does almost nothing... it hangs with this:
[root@bedrock2.webapp.phx1 ~]# strace -p 29827
Process 29827 attached - interrupt to quit
select(0, NULL, NULL, NULL, {210, 845816}

No further output, after about a minute of waiting.


The third thread is a bit more exciting. It does this:

[root@bedrock2.webapp.phx1 ~]# strace -p 29828
Process 29828 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 907676}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)

These selects are all about 1 second apart, and it repeats... forever, I guess.


The fourth thread seems to be the actual worker thread. It generates a lot of output... too much to feasibly list here, but one line stands out as an exceedingly common error... perhaps much more so when the problem is occurring:

stat("/data/www/www.mozilla.org-django/bedrock/locale/en-US/firefox/new.lang", 0x7fc047f1d9e0) = -1 ENOENT (No such file or directory)

The exact file is almost always (~95%) new.lang, but others do show up occasionally. It often occurs many times (dozens/hundreds) in a row, often interspersed with the same error with a different file... but always a big chunk for new.lang.


I feel like if we could identify what causes these stat() calls, and why they're failing (there is no locale/en-US directory, for one thing), that might help to some extent.

James Long (:jlongster)

Comment 6

•

12 years ago

Awesome detective work. The stat calls are coming from the l10n system that tries to load in translations. It checks to see if the .lang file exists, and if so, loads it in:

https://github.com/mozilla/bedrock/blob/master/lib/l10n_utils/dotlang.py#L19

If this is the problem, this will be fixed once we get memcache running in production. It caches the translation results but as far as I know production is using a dummy cache that doesn't do anything.

I didn't think this would be a problem, considering that Zeus caches the page for 15 min, but maybe the stat calls are stalling the system more than expected.

memcache is mostly set up (see bug 738381). I wonder if we can get that live soon.

Ashish Vijayaram [:ashish]

Reporter

Updated

•

12 years ago

Depends on: 738381

Anthony Ricaud (:rik)

Comment 9

•

12 years ago

A few weeks ago, we enabled the locmem backend cache. I think it was around the Firefox Mobile launch (2012-06-26). Even though it's not as powerful as memcache, this should have fixed this.

Daniel Maher [:phrawzty]

Assignee

Updated

•

12 years ago

Assignee: nmaul → dmaher

Corey Shields [:cshields]

Comment 10

•

12 years ago

I've added an account for :rik on the production webheads (which should be live within 30 minutes).

these are bedrock[1-3].webapp.phx1.mozilla.com and bedrock[1-5].webapp.scl3.mozilla.com

should be accessible through the mpt-vpn

Daniel Maher [:phrawzty]

Assignee

Comment 11

•

12 years ago

(In reply to Jake Maul [:jakem] from comment #3)

> [root@bedrock1.webapp.phx1 ~]# strace -p 31931
> Process 31931 attached - interrupt to quit
> restart_syscall(<... resuming interrupted call ...>
> 
> It just hangs there indefinitely, using 100% of 1 CPU core.
> 
> I'm seeing this on the new SCL3 Seamicro Xeon cluster also... exactly the
> same behavior.

Fwiw, I'm currently seeing this behaviour on _all_ of the bedrock_prod processes on bedrock1.webapp.phx1, so that's possible - which is unfortunate.

Anthony Ricaud (:rik)

Comment 12

•

12 years ago

Thanks Corey. I've worked with Dan so no need for an access on prod.

What I've done so far:
- On the new.lang missing file: If a file was missing, our code would store {} in the cache. But it would keep checking for the file because we did not differentiate an empty cache (None) from an empty dict from cache ({}). I fixed that
- I added a 10 second timeout for our newsletter backend (ExactTarget). Our code had no timeout so far and I suspect ExactTarget could be our cause here. We'll receive stacktraces for requests to this backend that take more than 10 seconds.

In the future, our newsletter thing should be handed off to a Celery task.

Anthony Ricaud (:rik)

Updated

•

12 years ago

Depends on: 774312

Anthony Ricaud (:rik)

Comment 13

•

12 years ago

Since yesterday, I've received 6 tracebacks for the ExactTarget 10s timeout. 3 from PHX1 and 3 from SLC3.

And right now, from PAR1, I'm seeing responses between 5s and 10s.

Seems like we did not fix it.

Anthony Ricaud (:rik)

Comment 14

•

12 years ago

I have no clue what could cause this now.

One random idea: what if it was related to the proxying in AMS1? Maybe AMS1 issues too many requests to one server, bypassing the Zeus load balancing? What is the config of AMS1? I can't find it on mana.

Daniel Maher [:phrawzty]

Assignee

Comment 15

•

12 years ago

:rik mentioned that he was seeing long response times from PHX1.  I logged on to bedrock[1-3].webapp.phx1 and ran this every 5 seconds, during 30 minutes, on each :
time -p -ao out curl -s -ILH "Host: www.mozilla.org" http://localhost/en-US/

Average response time to the above command on each :
* bedrock1 : 2.12 seconds
* bedrock2 : 2.60 seconds
* bedrock3 : 0.46 seconds

Could the end-user experience of slow response times be coming from something else entirely - perhaps something in between the webheads and the user ?

Daniel Maher [:phrawzty]

Assignee

Comment 16

•

12 years ago

It may be worth noting that there appears to be a correlation between the individual CPU usage per process and the response time per machine.  In other words both response time AND CPU consumption (per bedrock_prod) process are: 
* bedrock2 > bedrock1 > bedrock3

Daniel Maher [:phrawzty]

Assignee

Comment 17

•

12 years ago

I ran the recurring curl mentioned in comment #15 from nladm01 overnight, directed towards "www.mozilla.org" plainly (of course).  The average response time was 1.67 seconds; however, the pattern for the responses was interesting : every response would be around the average time, except that on average 1 in every 14 attempts it'd spike to ~ 6.18 seconds.  I'm not sure if that means anything, but I thought I'd mention it at least.

matthew zeier [:mrz]

Comment 18

•

12 years ago

Is the short term/long term fix more hardware?

Daniel Maher [:phrawzty]

Assignee

Comment 19

•

12 years ago

(In reply to matthew zeier [:mrz] from comment #18)
> Is the short term/long term fix more hardware?

The short-term fix is to restart bedrock/httpd on a regular basis, since this "fixes" the problem - which is to say that it removes the symptoms temporarily.  After a while, each process will slowly eat more CPU, becoming slower and slower, until it needs to be restarted again.  We haven't put a regular cron job in place yet for a variety of reasons, not the least of which being that band-aids tend to become permanent - and this would be a very poor permanent solution indeed.

The long-term fix is to delve into the code of the bedrock application and deal with whatever root issue is actually causing the problem in the first place.  :rik has been working on this; last we spoke, he was going to put Ganglia metrics into place within the app itself, in order to try to pinpoint where - within the application itself - inefficiencies are occurring.

Anthony Ricaud (:rik)

Comment 20

•

12 years ago

I don't know about Ganglia, I was rather suggesting using Graphite which we need to setup first. I'll open bugs about this.

Daniel Maher [:phrawzty]

Assignee

Comment 21

•

12 years ago

(In reply to Anthony Ricaud (:rik) from comment #20)
> I don't know about Ganglia, I was rather suggesting using Graphite which we
> need to setup first. I'll open bugs about this.

Woops, yes, I meant to type "Graphite", not "Ganglia".

Anthony Ricaud (:rik)

Updated

•

12 years ago

Depends on: 775534

Jake Maul [:jakem]

Comment 22

•

12 years ago

I just checked in a change to the mod_wsgi config... I reduced "maximum-requests" from 15000 to 1000. This should cause mod_wsgi to recycle its worker processes much more frequently. We are theorizing that this will result in worker processes being restarted before they spin out of control and use up CPU.

This doesn't fix whatever the underlying problem is (why CPU usage slowly climbs up to 100%), but does hopefully reduce the end-user-visible effects of it.

Jake Maul [:jakem]

Comment 23

•

12 years ago

There may be more than one issue occurring here.

I just caught the situation with MaxClients being reached, and was able to get a dump of the server-status page from all of the web heads while it was ongoing.

The first thing I noticed is that KeepAlives are insignificant... virtually all of the workers are in the 'W' state ("Sending Reply", aka: working). Normally there are 30-60 workers alive at any one time... in this situation, it typically is maxxed out at 512 workers. So it's rather abnormal.

In looking at the Requests currently being processed, there is comparatively little duplication... leading me to believe it's not a DoS attack. One would normally expect that to generate a huge volume of traffic to a particular URL... that's not what's happening here.

I did notice however that a large percentage of the workers were working on some variant of the /whatsnew/ page. The general URL pattern is like this:

/en-US/firefox/14.0.1/whatsnew/?oldversion=13.0.1

The locale, version, and oldversion strings vary, but approximately 82% of the GET requests being processed are for a /whatsnew/ page. I spot-checked the access log, and this appears to be far out of step with the relative number of hits to /whatsnew/ pages, so I don't think it's a normal bias caused simply by those pages receiving more traffic.

Daniel Maher [:phrawzty]

Assignee

Comment 24

•

12 years ago

(In reply to Jake Maul [:jakem] from comment #23)
> There may be more than one issue occurring here.

This lines up with the behaviour noted in comment #15, whereby local requests are being processed much more rapidly than remote requests to the self-same machines. My gut instinct is to blame Zeus for at least a portion of the problem, but having taken a look at the configuration for Bedrock, nothing really seems to pop out immediately.  I'll take a second look, though - couldn't hurt.

Anthony Ricaud (:rik)

Comment 25

•

12 years ago

The whatsnew page was launched on Bedrock on Tuesday. It does appear a lot in Apache logs because there are a lot of different URLs (bypassing the HTTP cache) and also because we have to use Vary: User-Agent and there's a lot of different User-Agents out there.

So the whatsnew page may make the problem even worse because we get more hits on the webheads. But it's definitely not the root cause.

Users are also more likely to see this problem than before. Before, only POST requests would take a long time (because they are not cacheable) and we don't have a lot of POST requests (only newsletter subscriptions, AFAICT). But now we have this whatsnew page.

Jake Maul [:jakem]

Comment 26

•

12 years ago

I think you missed part of my comment:

I spot-checked the access log, and this appears to be far out of step with the relative number of hits to /whatsnew/ pages, so I don't think it's a normal bias caused simply by those pages receiving more traffic.

This is looking at hits on the web node, not from Zeus. Thus it's only looking at things that do make it all the way to the web node.

This is not a case of simply getting more traffic. The steady state is nowhere near maxxed out. Right now I see anywhere from 40-60 workers on each node, and most of those are in KeepAlive state. When the problem occurs, we're talking about *512 active workers*, with virtually none of them idling in KeepAlive.

So when I say that multiple servers are suddenly maxxed out, you can be sure this is not a case of just needing more servers, or more traffic getting through to the web nodes. And when 82% of those workers are working on a /whatsnew/ page, that seems highly coincidental.

If you have a better idea of what the problem might be I'm definitely interested to hear it, but this is what we've got. It's bad enough now we're actually debating on setting up a cron job to restart Apache... that should tell you how troublesome this has become.

matthew zeier [:mrz]

Updated

•

12 years ago

Group: infra

Anthony Ricaud (:rik)

Comment 27

•

12 years ago

So we worked on analyzing this a lot more today. We set up statsd/graphite to get data about what's going on (bug 775534). Right now, we're only tracking the time each requests take. From a quick look, we haven't found an obvious culprit. We'll need to take a closer and/or add other metrics in there.

Since the problem is not solved and everyone wants to have a good weekend, we're setting up a cron to restart the wsgi processes.

Anthony Ricaud (:rik)

Comment 28

•

12 years ago

So I took a look at our system metrics in Ganglia. The interesting part is that our network in/out traffic seems to match our load traffic.
Does memcache requests count as outside traffic?

Anthony Ricaud (:rik)

Comment 29

•

12 years ago

I think I found it!

Everytime I refresh one page, I get two more calls to the mozorg.dotlang.translate(). I don't know yet why we get those two new calls for every request but this will be easier to track now.

[github robot]

Comment 30

•

12 years ago

Commits pushed to master at https://github.com/mozilla/bedrock

https://github.com/mozilla/bedrock/commit/45435c143ef023a3fda374d03fd42ae4e2c19545
Bug 760570 - Copy the DOTLANG_FILES settings instead before appending to it

Without this, the DOTLANG_FILES setting would grow with each request.

https://github.com/mozilla/bedrock/commit/4b113717331e6f4e73b071e72ffaa9373394d64c
Merge pull request #282 from Rik/copy-settings

Bug 760570 - Copy the DOTLANG_FILES settings instead before appending to...

Anthony Ricaud (:rik)

Comment 31

•

12 years ago

As you can see, we have a fix!

The setting with all the L10N translation files would expand with every request with two new entries. So we would do more and more lookups for every request. So this fix will not modify the setting for every request.

Anthony Ricaud (:rik)

Comment 32

•

12 years ago

So we'll need a push to fix this. We'll monitor a bit if the graphs show the problem again. If they don't (crosses fingers), we won't need the cron that was setup this weekend.

Jason Crowe [:jd]

Comment 33

•

12 years ago

Push completed, please confirm things look good and I will remove the temp cron job.

Adrian J Fernandez [:Aj]

Comment 34

•

12 years ago

Not sure if related to the current changes, but there are a ton of errors flying through www.mozilla.org error_log.

Adrian J Fernandez [:Aj]

Comment 35

•

12 years ago

Chatted with :Rik on irc, the errors that I am seeing are "normal" as in they have been known before the current push.

Perhaps the existing errors should be fixed in a different bug?

Anthony Ricaud (:rik)

Comment 36

•

12 years ago

The issue doesn't seem fix. I'll do more research tomorrow.

So don't remove the cronjob yet.

Adrian J Fernandez [:Aj]

Comment 37

•

12 years ago

As per :rik 
Seems the fix, wasn't enough.
bedrock1.webapp.phx1.mozilla.com:httpd max clients is WARNING: Using 512 out of 512 Clients 

Seeing the same thing jakem saw;
32686 apache     20   0  327M 29796  7844 S  0.0  0.1  0:37.67 │  ├─ /usr/sbin/httpd                                                    
30704 apache     20   0  479M 50692  4792 S 81.0  0.1 12:55.31 │  ├─ bedrock_prod
30733 apache     20   0  479M 50692  4792 R 84.0  0.1 12:51.67 │  │  ├─ bedrock_prod
30732 apache     20   0  479M 50692  4792 S  0.0  0.1  0:00.04 │  │  ├─ bedrock_prod
30731 apache     20   0  479M 50692  4792 S  0.0  0.1  0:00.00 │  │  └─ bedrock_prod

# strace -p 30704 (worker)
Process 30704 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>

{no further output, this is expected}

# strace -p 30733 (thread)
poll([{fd=20, events=POLLIN}], 1, 3000) = 1 ([{fd=20, revents=POLLIN}])
recvfrom(20, "VALUE bedrock_prod:1:dotlang-en-"..., 4096, 0, NULL, NULL) = 70
poll([{fd=20, events=POLLOUT}], 1, 3000) = 1 ([{fd=20, revents=POLLOUT}])
sendto(20, "get bedrock_prod:1:dotlang-en-US"..., 43, 0, NULL, 0) = 43

{none stop output}


Will update with more findings if/when the issue reoccurs again.

Anthony Ricaud (:rik)

Comment 38

•

12 years ago

So looking at the output of the push that Jason did, I see:
You are not currently on a branch. Please specify which
branch you want to merge with. See git-pull(1) for details.

This leads me to believe that prod does not run with my fix. Which would be cool :)

Anthony Ricaud (:rik)

Comment 39

•

12 years ago

Aj did a 'git checkout master' before doing another deploy. And the curves looks deceptively flat now. \o/

So I think we're done for real here! Will update tomorrow morning to let you know if we can remove the cron and if we can increase WSGIDaemon maximum-requests.

Anthony Ricaud (:rik)

Comment 40

•

12 years ago

Looks like this is fixed. http://i.imgur.com/wT1RL.png

Can we remove the crontab and reset the WSGIDaemon maximum-requests to the default?

Daniel Maher [:phrawzty]

Assignee

Comment 41

•

12 years ago

02:12:50 < Rik> phrawzty: peux-tu enlever la cron qui redémarre bedrock? bug 760570
02:17:18 < Rik> phrawzty: et si tu connais la valeur par défaut de WSGIDaemon maximum-requests, ce serait bien de la remettre aussi

Translation :
* Remove the cron job that restarts Bedrock.
* Set the WSGIDaemon value back to the default setting.

WIP.

Daniel Maher [:phrawzty]

Assignee

Comment 42

•

12 years ago

(In reply to Daniel Maher [:phrawzty] from comment #41)
> * Remove the cron job that restarts Bedrock.
> * Set the WSGIDaemon value back to the default setting.

* Cron job removed.
* WSGIDaemon maximum-requests doesn't have a default value; I set it back to 15000.

Everything seems to be in order (finally, hurrah).  Hearty congratulations all around.  Closing the bug.

Status: REOPENED → RESOLVED

Closed: 12 years ago → 12 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

11 years ago

Component: Server Operations: Web Operations → WebOps: Other

Product: mozilla.org → Infrastructure & Operations

BMO Automation

Updated

•

5 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

Anthony Ricaud (:rik)

Comment 43

•

1 year ago

Attached image Graph of response times — Details

imgur might delete some images and I want to keep this graph around so uploading a copy of comment 40.