Closed Bug 1131059 Opened 10 years ago Closed 10 years ago

Determine why there were zombie celery processes on some nodes

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: emorley, Unassigned)


Details

In bug 1129731, it was found that the number of running celery processes had increased significantly on prod rabbitmq1, to the point where the box was swapping. The zombie processes had lifetimes much longer than the most recent deploys, whereas the deploy script should have cleaned them up. We should:

a) Kill any other similar zombie celery processes on all prod/stage nodes, as a one-off.
b) Fix the restart-jobs script so that it doesn't leave these zombies behind.
c) Consider adding alerting to find zombie processes like this (depending on our confidence in #b); see the sketch below.
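As a rough illustration of what the alerting in (c) might look like, here is a minimal sketch that flags celery processes whose start time predates the last deploy. It is not the actual Treeherder tooling: psutil, the DEPLOY_MARKER path, and the simple "celery" command-line match are all assumptions.

#!/usr/bin/env python
# Hypothetical zombie-celery check; NOT the actual Treeherder deploy tooling.
# Assumes psutil is installed and that each deploy touches a marker file.
import os
import sys

import psutil

DEPLOY_MARKER = "/var/run/treeherder/last-deploy"  # assumed path


def find_stale_celery(last_deploy_ts):
    """Return (pid, cmdline) for celery processes started before the last deploy."""
    stale = []
    for proc in psutil.process_iter():
        try:
            cmdline = " ".join(proc.cmdline())
            if "celery" in cmdline and proc.create_time() < last_deploy_ts:
                stale.append((proc.pid, cmdline))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return stale


if __name__ == "__main__":
    stale = find_stale_celery(os.path.getmtime(DEPLOY_MARKER))
    for pid, cmdline in stale:
        print("stale celery pid %d: %s" % (pid, cmdline))
    # Non-zero exit so a cron job or monitoring check can alert on it.
    sys.exit(1 if stale else 0)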
Zombie procs on all three staging log processors, but none on prod. Running strace on the two on processor1 shows them both spinning on

[pid 22438] select(0, NULL, NULL, NULL, {0, 1886}) = 0 (Timeout)

where fd 0 is a pipe to the other process. A bunch of processes on processor2 are spinning on a select(0,...) to themselves. o.O

I feel as though a "fix" in restart-jobs (which is just calling supervisorctl) is a bandaid. It seems like something in the celery job isn't behaving correctly when told to stop.

I've left the zombie jobs running on staging processors 1&2 for the moment, if someone wants to look at them and try to get more info.
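For reference, the fd 0 pipe relationship described above can be confirmed by reading the symlinks under /proc. A minimal sketch, assuming a Linux /proc filesystem; the PID is just the one quoted in the strace output, used as a placeholder:

# Minimal /proc inspection sketch (Linux only); the PID below is a placeholder.
import os

def describe_fds(pid):
    """Print what each open file descriptor of a process points to."""
    fd_dir = "/proc/%d/fd" % pid
    for fd in sorted(os.listdir(fd_dir), key=int):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            target = "<gone>"
        print("pid %d fd %s -> %s" % (pid, fd, target))

# Pipes show up as "pipe:[inode]"; two processes whose fds share the same
# inode hold the two ends of that pipe.
describe_fds(22438)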
(In reply to Kendall Libby [:fubar] from comment #1)
> I feel as though a "fix" in restart-jobs (which is just calling
> supervisorctl) is a bandaid. It seems like something in the celery job isn't
> behaving correctly when told to stop.
>
> I've left the zombie jobs running on staging processors 1&2 for the moment,
> if someone wants to look at them and try to get more info.

Agree; thank you :-)
Summary: Clean up zombie celery processes on all nodes → Determine why there were zombie celery processes on some nodes
(In reply to Kendall Libby [:fubar] from comment #1)
> I've left the zombie jobs running on staging processors 1&2 for the moment,
> if someone wants to look at them and try to get more info.

I've had to kill these processes as part of bug 1133138 / bug 1119479, since we were getting errors in New Relic.

In addition, it appeared that some of the apache processes on the stage webheads had runtimes longer than the last restart-jobs; I'm not sure why. I restarted httpd manually to clear them, since they were similarly hitting issues.
There's nothing we can really do here now; bug 1140882 has been filed to stop us using gevent with Celery, which may help.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INCOMPLETE
This may have been caused by bug 1144138, which could have made some celery tasks appear "hung".
True - thank you for fixing that! :-)