Bug 1116167 (Closed) — Opened 10 years ago, Closed 10 years ago

SUMO: Dev celery workers ending with exit code -9

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: mythmon, Unassigned)

Details

(Whiteboard: [kanban:webops:https://kanbanize.com/ctrl_board/4/4164] )

Attachments

(2 files)

We've got a large wave of error emails from celery workers in the dev environment. This isn't hurting anything right now, but the emails are annoying. According to [0], exit code -9 means the worker process was killed with signal 9 (SIGKILL), which usually indicates the kernel killed it for using too much memory.

[0]: http://stackoverflow.com/questions/18529452/sudden-exit-with-status-of-9

Below is the full output of one of the emails:

Task kitsune.wiki.tasks._rebuild_kb_chunk with id 455cae9c-b336-4bca-ab37-6366793546da raised exception: "WorkerLostError('Worker exited prematurely (exitcode: -9).',)"

Task was called with args: [[17747L, 17748L, 17749L, 17750L, 17751L, 17752L, 17753L, 17754L, 17756L, 17757L, 17758L, 17759L, 17760L, 17763L, 17765L, 17766L, 17771L, 17772L, 17774L, 17778L, 17779L, 17780L, 17784L, 17785L, 17786L, 17787L, 17788L, 17793L, 17794L, 17795L, 17796L, 17797L, 17798L, 17799L, 17800L, 17801L, 17802L, 17810L, 17811L, 17812L, 17813L, 17814L, 17815L, 17817L, 17818L, 17821L, 17822L, 17823L, 17824L, 17825L, 17826L, 17827L, 17828L, 17829L, 17832L, 17833L, 17834L, 17835L, 17836L, 17837L, 17841L, 17842L, 17843L, 17845L, 17847L, 17849L, 17850L, 17851L, 17854L, 17855L, 17856L, 17859L, 17860L, 17861L, 17862L, 17863L, 17865L, 17867L, 17868L, 17870L, 17872L, 17875L, 17876L, 17877L, 17878L, 17879L, 17880L, 17882L, 17883L, 17884L, 17885L, 17886L, 17888L, 17890L, 17891L, 17892L, 17893L, 17894L, 17895L, 17896L]] kwargs: {}.

The contents of the full traceback was:

Traceback (most recent call last):
  File "/data/www/support-dev.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/billiard/pool.py", line 930, in _join_exited_workers
    lost_ret, ))
WorkerLostError: Worker exited prematurely (exitcode: -9).
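For context: the -9 in billiard's exitcode follows the Python multiprocessing/subprocess convention of reporting "terminated by signal N" as -N, and signal 9 is SIGKILL, the signal the kernel OOM killer sends. A minimal sketch demonstrating the convention (the sleep command is just a stand-in for a worker process, not anything from the actual celery setup):

```python
import signal
import subprocess

# Start a stand-in "worker" process and kill it the way the OOM killer would.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)
proc.wait()

# subprocess (like billiard/multiprocessing) reports "killed by signal N"
# as a negative return code, hence exitcode -9 for SIGKILL on POSIX systems.
print(proc.returncode)
```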
1) When did the email messages start, and 2) are y'all still getting emails about celery workers being killed?

New Relic shows a period of high CPU utilization on the support-celery1.dev server, from about 5:34 AM to about 7:19 AM PST. [1] (There is also a spike in network traffic to and from the server at this time.) Looking at the transactions, there is a sharp spike in /kitsune.question.views:question_list shortly before the period of high CPU utilization. [2]

I've pre-emptively stopped celery, killed some old celery processes when they didn't seem to want to die (they dated back to Nov 21), and restarted celery. This should give us a clean slate from which to see whether some process or processes are getting themselves into a wedged state.

[1] support-celery1.dev.png
[2] support-dev.allizom.org_transactions.png
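Stale workers like the Nov 21 stragglers mentioned above can be spotted by sorting the process table by elapsed time. A hypothetical sketch for Linux with procps-ng (the "celery" pattern is an assumption about how the workers appear in the process table):

```shell
# List processes sorted by elapsed time in seconds (longest-lived first),
# keeping the header plus any line mentioning "celery"; unusually large
# etimes values flag workers that survived earlier restarts.
ps -eo pid,etimes,comm --sort=-etimes | awk 'NR==1 || /celery/'
```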
Whiteboard: [kanban:webops:https://kanbanize.com/ctrl_board/4/4164]
@mythmon: Are you still seeing evidence of high CPU utilization? Things have looked quiet for the last week; I'm not sure what regular tests, etc., are run against the dev environment, so I can't tell whether it has been adequately exercised.
Flags: needinfo?(mcooper)
We don't do much on dev, so I think this was probably just a fluke of the holidays. If there isn't anything problematic on your end, we can close this out.
Flags: needinfo?(mcooper)
I haven't seen anything weird on this end. Closing this bug for now; if this crops back up, we can either re-open this bug or open up a new one.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
