Bug 746854 (Closed) - Celery alerts on lots of processes
Opened 13 years ago; Closed 13 years ago
Product/Component: mozilla.org Graveyard :: Server Operations (task)
Tracking: (Not tracked)
Status: RESOLVED FIXED
Reporter: ericz; Assignee: ericz
Description

Received this alert a lot today:
sumocelery1.webapp.phx1:sumocelery1.webapp.phx1.mozilla.com - celeryd is CRITICAL: PROCS CRITICAL: 78 processes with regex args celeryd
Restarting celeryd did not help, but Jason restarted supervisord then celery and it looks better now. Jason said a strace showed it wasn't doing much and this may be a bug with celery that should be fixed.
Updated•13 years ago
Component: Server Operations → Server Operations: Web Operations
QA Contact: phong → cshields
Comment 1•13 years ago
assigning to Chris to take a look, but it sounds like Jason already did - not sure what bug he was hinting at or how it should be fixed. Chris, sync with Jason on this.
cc'ing devs too in case they have input
Assignee: server-ops → cturra
Comment 2•13 years ago
I'm not entirely sure what this alert is checking, i.e. what's the threshold that it thinks is too high. Celery starts a number of separate processes, defined by the -c or --concurrency argument, +1 for the master process.
The first thing to check would be what that value is. It may be as high as 64, but can probably be tuned down to 32.
I wonder if our restarting celery so often is causing some zombie processes. It's probably worth upgrading celery just to pick up any improvements. I notice that the number of jobs[1] has been stuck at 2 for a long time, so celery may be having issues processing some things, which may be affecting shutdown/restart.
Is the python module setproctitle installed on the box? It's really valuable for helping debug celery processes. If so, we can look in the future and see whether any of the processes are actually stuck.
[1] https://ganglia-phx1.mozilla.org/ganglia/graph_all_periods.php?c=sumo&h=rabbit-sumo&v=2&m=kitsune_prod&r=week&z=default&jr=&js=&st=1334933725&vl=jobs&ti=kitsune_prod&z=large
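For reference, a minimal sketch of how the process count relates to the concurrency setting described above. The threshold of 64 workers (so 65 processes including the master) is an assumption for illustration; the real Nagios check's limits are not shown in this bug.

```python
# Sketch: relate celeryd's -c/--concurrency setting to the expected
# process count, and classify a measured count the way a Nagios-style
# check might. The default concurrency of 64 here is an assumption.

def expected_procs(concurrency: int) -> int:
    """celeryd starts `concurrency` worker processes plus one master."""
    return concurrency + 1

def classify(measured: int, concurrency: int = 64) -> str:
    limit = expected_procs(concurrency)
    return "OK" if measured <= limit else "CRITICAL"

print(expected_procs(64))  # 65
print(classify(78))        # CRITICAL: 78 exceeds 65, matching the alert
```

Under these assumed numbers, the alert's 78 processes would indeed exceed the expected 65, which is consistent with either a higher concurrency setting or leftover processes from a previous run.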
Comment 3•13 years ago
spent some time looking at sumocelery1.webapp.phx1. i do not see any evidence of zombie processes and the 64 process max seems reasonable for the server. it has loads of headroom from a resource standpoint.
> sumocelery1.webapp.phx1:sumocelery1.webapp.phx1.mozilla.com - celeryd is
> CRITICAL: PROCS CRITICAL: 78 processes with regex args celeryd
i would like to better understand this error. is there a reason why > 64 processes would be a critical issue? also, when this was alerting, was celery no longer working? ... and what is reporting this error?
Status: NEW → ASSIGNED
Comment 4•13 years ago (Assignee)
The load was a bit high (~13) _some_ of the times it alerted, but not all, as far as I recall. I don't know that the alert is terribly valid. I wasn't aware of any problems other than it complaining of running 78 processes. Hopefully someone who knows celery better can comment.
Comment 5•13 years ago
(In reply to James Socol [:jsocol, :james] from comment #2)
> The first thing to check would be what that value is. It may be as high as
> 64, but can probably be tuned down to 32.
i think 64 processes is reasonable for this server (at least from a resource standpoint).
> I wonder if our restarting celery so often is causing some zombie processes.
> It's probably worth our upgrading celery just to pick up any improvements. I
> notice that the number of jobs[1] has been stuck at 2 for a long time, so
> celery may be having issues processing some things. Which may be affecting a
> shutdown/restart.
after reviewing the ganglia link you sent, it looks like the jobs have been stuck at 2 since mid january? anyone on this thread know of any changes that were made around that time that may have caused this?
i am not convinced we're monitoring the right thing with this nagios check. rather than just looking at the number of celery processes running, i would like to write a check that actually monitors the jobs to see that they're actively being processed. this said, does anyone have advice on what this value would be?
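One way the "jobs are actively being processed" idea could be checked is to sample a completed-jobs counter over time and alert when it stops moving. This is only a hedged sketch; the counter source (e.g. the ganglia kitsune_prod metric mentioned above), the window, and the thresholds are all assumptions, not the real check.

```python
# Sketch of an activity-based check: instead of counting celeryd
# processes, sample a completed-jobs counter and alert if it stops
# moving. The sampling source and window size are assumptions.

def job_activity_status(samples, min_progress=1):
    """samples: completed-job counter readings, oldest first."""
    if len(samples) < 2:
        return "UNKNOWN"
    progress = samples[-1] - samples[0]
    return "OK" if progress >= min_progress else "CRITICAL"

# A counter stuck at 2 across the window, as the ganglia graph shows,
# would alert; a moving counter would not.
print(job_activity_status([2, 2, 2, 2]))  # CRITICAL
print(job_activity_status([2, 5, 9]))     # OK
```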
Comment 6•13 years ago
this has sat for a while, but i wanted to poke it again. in my last message i had asked about what an appropriate number of celery processes running might be. i don't see any evidence that greater than 64 processes is actually causing any issues other than creating a nagios alert.
any thoughts out there?
Comment 7•13 years ago
(In reply to Chris Turra [:cturra] from comment #6)
> any thoughts out there?
I've observed that this alert goes off when there are two "sets" of celery workers. Whether that has any production impact, I'm not sure. But I've generally killed off all the workers + parent from the older set of processes.
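The "two sets of workers" situation described above can be spotted by grouping celeryd processes by parent PID: two masters means an old set survived a restart, and the older master plus its children are the ones to kill. A sketch on made-up ps-style data (all PIDs and start times are sample values, not from this incident):

```python
# Sketch: group celeryd worker processes by parent PID. If two masters
# (processes parented by init, pid 1) exist, an old worker set survived
# a restart; kill the older master and its children. Sample data only.
from collections import defaultdict

procs = [
    # (pid, ppid, start_time) - start_time as a growing integer
    (100, 1,   1000),                     # old master
    (101, 100, 1001), (102, 100, 1001),  # old workers
    (200, 1,   2000),                     # new master
    (201, 200, 2001), (202, 200, 2001),  # new workers
]

by_parent = defaultdict(list)
start = {}
for pid, ppid, t in procs:
    by_parent[ppid].append(pid)
    start[pid] = t

masters = sorted(by_parent[1], key=start.get)  # oldest first
to_kill = []
if len(masters) > 1:
    old_master = masters[0]
    to_kill = [old_master] + by_parent[old_master]
print(to_kill)  # [100, 101, 102]
```

This mirrors the manual cleanup described in the comment: the newer set is left alone, and the stale parent plus its workers are removed.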
Comment 8•13 years ago
I believe (but am not 100% certain) this alert was created to address the problem Ashish noticed... sometimes celery can get restarted, and the old children don't die off properly.
As to what the effects of such non-dying (but not really "zombie") processes are, I can't really remember. Perhaps they were spinning and using up CPU or memory, and this was a convenient way to identify a problematic condition... that seems like the most likely scenario.
Comment 9•13 years ago
They do use some memory and CPU, even if tasks don't get assigned out to them. I guess it's mostly a memory-leak scenario.
I seem to recall, at least in old versions, that somehow those out-of-date processes could end up pulling jobs off the queue, but I may have been mistaken, or that may have been fixed a while ago.
I've just filed bug 785856 to upgrade to Celery 3, which may also help.
Comment 10•13 years ago (Assignee)
If, as per comment #8, it is a "problematic condition" but not a big problem, would it be reasonable to make it alert in #sysadmins only and document on mana how to clean up this situation when it happens?
Comment 11•13 years ago
:ericz - i agree. this is covered on the 'Common Nagios Pages' mana page at the link below. i am going to close this bug off r/wontfix.
https://mana.mozilla.org/wiki/display/SYSADMIN/Common+Nagios+Pages#CommonNagiosPages-celerydisCRITICAL
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX
Comment 12•13 years ago (Assignee)
I'm going to reopen and take this bug with the action to make it alert in #sysadmins only and not page oncall as per comment #10.
Assignee: cturra → eziegenhorn
Status: RESOLVED → REOPENED
Component: Server Operations: Web Operations → Server Operations
Resolution: WONTFIX → ---
Comment 13•13 years ago (Assignee)
$ svn ci services.pp -m'Make sumocelery celerdy alerts go to #sysadmins only as per bug 746854'
Sending services.pp
Transmitting file data .
Committed revision 49449.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard