Closed Bug 1131059 Opened 10 years ago Closed 10 years ago

Determine why there were zombie celery processes on some nodes

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: emorley, Unassigned)


Details

In bug 1129731, it was found that the number of running celery processes had increased significantly on prod rabbitmq1, to the point where the box was swapping. The zombie processes had lifetimes much longer than the most recent deploys, whereas the deploy script should have cleaned them up. We should:

a) Kill any other similar zombie celery processes on all prod/stage nodes, as a one-off.
b) Fix the restart-jobs script so that it doesn't leave these zombies behind.
c) Consider adding alerting to find zombie processes like this (depending on our confidence in #b); see the sketch below.
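As a rough illustration of what the alerting in (c) might look like, here is a minimal sketch that flags celery processes whose start time predates the last deploy. It is not the actual Treeherder tooling: psutil, the DEPLOY_MARKER path, and the simple "celery" command-line match are all assumptions.

#!/usr/bin/env python
# Hypothetical zombie-celery check; NOT the actual Treeherder deploy tooling.
# Assumes psutil is installed and that each deploy touches a marker file.
import os
import sys

import psutil

DEPLOY_MARKER = "/var/run/treeherder/last-deploy"  # assumed path


def find_stale_celery(last_deploy_ts):
    """Return (pid, cmdline) for celery processes started before the last deploy."""
    stale = []
    for proc in psutil.process_iter():
        try:
            cmdline = " ".join(proc.cmdline())
            if "celery" in cmdline and proc.create_time() < last_deploy_ts:
                stale.append((proc.pid, cmdline))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return stale


if __name__ == "__main__":
    stale = find_stale_celery(os.path.getmtime(DEPLOY_MARKER))
    for pid, cmdline in stale:
        print("stale celery pid %d: %s" % (pid, cmdline))
    # Non-zero exit so a cron job or monitoring check can alert on it.
    sys.exit(1 if stale else 0)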
Zombie procs on all three staging log processors, but none on prod. Running strace on the two on processor1 shows them both spinning on

[pid 22438] select(0, NULL, NULL, NULL, {0, 1886}) = 0 (Timeout)

where fd 0 is a pipe to the other process. A bunch of processes on processor2 are spinning on a select(0,...) to themselves. o.O

I feel as though a "fix" in restart-jobs (which is just calling supervisorctl) is a bandaid. It seems like something in the celery job isn't behaving correctly when told to stop.

I've left the zombie jobs running on staging processors 1&2 for the moment, if someone wants to look at them and try to get more info.
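For reference, the fd 0 pipe relationship described above can be confirmed by reading the symlinks under /proc. A minimal sketch, assuming a Linux /proc filesystem; the PID is just the one quoted in the strace output, used as a placeholder:

# Minimal /proc inspection sketch (Linux only); the PID below is a placeholder.
import os

def describe_fds(pid):
    """Print what each open file descriptor of a process points to."""
    fd_dir = "/proc/%d/fd" % pid
    for fd in sorted(os.listdir(fd_dir), key=int):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            target = "<gone>"
        print("pid %d fd %s -> %s" % (pid, fd, target))

# Pipes show up as "pipe:[inode]"; two processes whose fds share the same
# inode hold the two ends of that pipe.
describe_fds(22438)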
(In reply to Kendall Libby [:fubar] from comment #1)
> I feel as though a "fix" in restart-jobs (which is just calling
> supervisorctl) is a bandaid. It seems like something in the celery job isn't
> behaving correctly when told to stop.
>
> I've left the zombie jobs running on staging processors 1&2 for the moment,
> if someone wants to look at them and try to get more info.

Agree; thank you :-)
Summary: Clean up zombie celery processes on all nodes → Determine why there were zombie celery processes on some nodes
(In reply to Kendall Libby [:fubar] from comment #1)
> I've left the zombie jobs running on staging processors 1&2 for the moment,
> if someone wants to look at them and try to get more info.

I've had to kill these processes as part of bug 1133138 / bug 1119479, since we were getting errors in New Relic.

In addition, it appeared that some of the apache processes on the stage webheads had runtimes longer than the last restart-jobs; I'm not sure why. I restarted httpd manually to clear them, since they were similarly hitting issues.
There's nothing we can really do here now; bug 1140882 has been filed to stop us using gevent with Celery, which may help.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INCOMPLETE
This may have been caused by bug 1144138, which could have made some celery tasks appear "hung".
True - thank you for fixing that! :-)