Closed Bug 1167133 Opened 9 years ago Closed 9 years ago

Celery workers are sometimes left running after restart-jobs has run (causing rabbitmq backlogs)

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Usul, Assigned: fubar)

References

Details

(IRC) Thu 02:33:46 PDT [5014] treeherder-rabbitmq1.private.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - messages CRITICAL (790437) messages_ready CRITICAL (790377), messages_unacknowledged OK (60)

I tried to follow https://mana.mozilla.org/wiki/display/websites/treeherder.mozilla.org#treeherder.mozilla.org-Rabbitmqqueuesizealerts but :fubar was unavailable. 


http://treeherder-rabbitmq1.private.scl3.mozilla.com:15672/#/queues/treeherder/celeryev.03e85854-6f76-4796-b6d8-4e39f8afa615 had a 0 delivery rate and the IP of the consumer was processor3

I restarted the log parser worker:
[lhirlimann@treeherder-processor3.private.scl3 ~]$ sudo supervisorctl restart run_celery_worker_log_parser
run_celery_worker_log_parser: stopped
run_celery_worker_log_parser: started

but it didn't start consuming according to the rabbitmq UI. Filing this to make sure I didn't break anything, and to figure out how we might make the logging better.
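
A quick way to confirm from the rabbitmq node whether the restarted worker actually reattached as a consumer (a sketch, assuming the same treeherder vhost as in the alert above):

sudo rabbitmqctl list_queues -p treeherder name consumers messages | grep celeryev

A celeryev queue that still shows 0 consumers but a growing message count means nothing picked it back up.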
Component: General → Treeherder: Infrastructure
Product: Developer Services → Tree Management
QA Contact: laura
Version: unspecified → ---
I was about to file a bug for the underlying issue that causes this: sometimes when /root/bin/restart-jobs (which is run as part of the deploy script) tries to gracefully restart the workers on each node, celery workers on the rabbitmq node and on some of the log processor nodes get orphaned.

ie: they become detached from their parent process and are no longer under the control of supervisord.

You can spot them using:
ps ax -o ppid,stime,pid,command | egrep '^\s*1\s+.*[c]elery'

Their ppid is 1 (presumably because their parent has been killed), unlike the processes that have been started since.

eg:
https://emorley.pastebin.mozilla.org/8834246

The one good thing is that the resulting rabbitmq backlogs are generally harmless - they are on the old celery event queues for the stuck workers, not on the queues for actual jobs. eg:

[emorley@treeherder-rabbitmq1.private.scl3 ~]$ sudo rabbitmqctl list_queues -p treeherder
Listing queues ...
buildapi        0
calculate_eta   0
celery@buildapi.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox  0
celery@buildapi.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox  0
celery@default.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox      0
celery@hp.treeherder-rabbitmq1.private.scl3.mozilla.com.celery.pidbox   0
celery@log_parser.treeherder-processor1.private.scl3.mozilla.com.celery.pidbox  0
celery@log_parser.treeherder-processor2.private.scl3.mozilla.com.celery.pidbox  0
celery@log_parser.treeherder-processor3.private.scl3.mozilla.com.celery.pidbox  0
celery@pushlog.treeherder-etl1.private.scl3.mozilla.com.celery.pidbox   0
celery@pushlog.treeherder-etl2.private.scl3.mozilla.com.celery.pidbox   0
celeryev.03e85854-6f76-4796-b6d8-4e39f8afa615   316036
celeryev.17ca1c86-757e-46db-9c14-e76e15d49994   5975
celeryev.2ba1e0d5-df90-4af3-b807-1d066f0ec751   0
celeryev.31926728-42e8-4052-ae88-896641756cb1   316038
celeryev.3d4c044b-550c-4831-9b8b-88ee44baee5a   6011
celeryev.46bf1f0e-8572-45e4-99c1-e4626c7dd64b   0
celeryev.46c23755-ac4a-4952-8d64-5cc50a674e11   0
celeryev.6108d583-bdc1-497d-bac1-aa6314240440   316037
celeryev.6467369d-0aba-4b1d-b31b-484e75e8cea9   0
celeryev.80a72a14-14ad-4aa2-8fb0-386bf5099c15   0
celeryev.824bf102-ac2d-4ca4-9044-140922a182bc   0
celeryev.a35d6217-cb71-41ff-8118-1c9cbaeb4fa5   0
celeryev.a8e3be9b-8ad6-482c-baee-3b4947aa6fee   5976
celeryev.a933408f-6be9-44ce-9bd9-62e35c7bc881   0
celeryev.ae4af068-b5de-461c-89a5-c87a45fc5c6d   48673
celeryev.c5570c78-0447-42c1-8aaa-2ec4adbcbd52   0
celeryev.efa40a79-ac55-4646-8d9b-ddf0490683f8   4839
classification_mirroring        0
cycle_data      0
default 0
fetch_bugs      0
fetch_missing_push_logs 0
high_priority   0
log_parser      120
log_parser_fail 16
log_parser_hp   0
log_parser_json 4
populate_performance_series     9
process_objects 3
publish_to_pulse        0
pushlog 0
...done.

I have alerts set up on rabbitmq queue sizes, and what I've been doing so far (ie the last week or two; not sure why it wasn't occurring before?) is just leaving it an hour or two to see if the processes exit on their own, and if not, killing them manually.

It's a bit of a pain, so it would be good to figure this out :-)
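
For the record, the manual cleanup is roughly the following (a sketch; it reuses the ppid-1 filter from above - sanity-check the PID list before killing anything):

# list the orphaned workers (ppid 1) and send them a TERM; escalate to KILL only if they linger
ps ax -o ppid,pid,command | awk '$1 == 1 && /[c]elery/ {print $2}' | xargs -r sudo kill -TERM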
Priority: -- → P1
Summary: celerydev queues don't consume → Celery workers are sometimes left running after restart-jobs has run (causing rabbitmq backlogs)
After chatting with Mauro we have a few thoughts:
1) We can turn off the celery event queues, since it turns out that we don't actually use them. This won't fix the orphaned process problem, but will at least stop the noise on the rabbitmq queue sizes.
2) We should try to work out if restart-jobs' invocation of supervisorctl is hitting the "we've waited too long, let's kill the process non-gracefully" case (a log check for this is sketched below). If so, the SIGKILL likely isn't propagating to the child processes, due to:
http://celery.readthedocs.org/en/latest/userguide/workers.html#stopping-the-worker
"Also as processes can’t override the KILL signal, the worker will not be able to reap its children, so make sure to do so manually."
Depends on: 1167212
Depends on: 1167217
The supervisor configs for the workers use a stopsignal of TERM and stopwaitsecs (http://supervisord.org/configuration.html) of 90 seconds. ie a TERM is sent, then after 90s a SIGKILL if that didn't work.

The TERM is correct; we just need to either raise the 90s or reap the child workers ourselves, per http://celery.readthedocs.org/en/latest/userguide/workers.html#stopping-the-worker

I'd be inclined to raise the 90s at least short term, so we can see if we just have long-running tasks, or if something really is hung.
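
To see at a glance what each worker config currently has (a sketch; the conf.d path is assumed - adjust to wherever supervisord reads its program configs on these hosts):

grep -H -E 'stopsignal|stopwaitsecs' /etc/supervisord.conf.d/run_celery_worker*.conf

If the supervisor version on these hosts supports stopasgroup/killasgroup, setting those would also make sure the forked pool children receive the final signal rather than being left behind.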
(In reply to Ed Morley [:emorley] from comment #3)
> The supervisor configs for the workers use a stopsignal of TERM and
> stopwaitsecs (http://supervisord.org/configuration.html) of 90 seconds. ie a
> TERM is sent, then after 90s a SIGKILL if that didn't work.

I spoke too soon. It appears whilst this is set fine on stage, someone has changed prod to use KILL instead:
https://emorley.pastebin.mozilla.org/8834269

eg:
[treeherder-processor3.private.scl3.mozilla.com] out: ./run_celery_worker_log_parser.conf:stopsignal = KILL

This is what is causing the problem. And I presume it's only been changed recently, perhaps as part of trying to fix bug 1076881?

Unfortunately since we do not have access to the puppet repo we can't see when these changes were made. It makes me really cross every time we get bitten by secret puppet magic - these files should be in our repo :-(

I'm really really looking forward to when we use Heroku and things like this are in our control.
(In reply to Ed Morley [:emorley] from comment #4)
> I spoke too soon. It appears whilst this is set fine on stage, someone has
> changed prod to use KILL instead:
> https://emorley.pastebin.mozilla.org/8834269
> [...]
> Unfortunately since we do not have access to the puppet repo we can't see
> when these changes were made.

Any idea which file in the puppet repo I should look at? I have access....
Thank you - the files are:
/etc/supervisord.conf.d/run_celery_worker* on treeherder-rabbitmq1.private.scl3.mozilla.com and all three of the log processor nodes (treeherder-processor*.private.scl3.mozilla.com).

I'm not sure where they'd be in puppet - but presuming they are checked in as files (and not somewhere else) then a file search for run_celery_worker* might work?
[ludo@Oulanl trunk]$ find . -name "run_celery*"
[ludo@Oulanl trunk]$ pwd
/home/ludo/sysadmins/puppet/trunk

So I can't find them as files :( We'll need someone from webops to poke at this, as the puppet celery+rabbit and treeherder modules are far from simple.
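
If the configs are generated from a puppet template rather than checked in verbatim (just a guess), a content search might turn them up - a sketch:

grep -rIl -e 'run_celery_worker' -e 'stopwaitsecs' . | head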
I've just had another rabbitmq alert (due to stuck processes). Could we fix the prod supervisor configs please? (comment 4; need to use TERM not KILL - and do we know why the prod configs were changed, and different to stage?)
Flags: needinfo?(klibby)
Prod was changed in bug 1079270. History on that one looks confusing; I wonder if there's a missing IRC conversation somewhere. Anyway, reverted KILL signals back to TERM.
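
A sketch for confirming the revert took effect on a node (assumes puppet has already pushed the updated conf files):

sudo grep -H stopsignal /etc/supervisord.conf.d/run_celery_worker*.conf
sudo supervisorctl reread && sudo supervisorctl update

After that, a restart-jobs run should leave no ppid-1 celery processes behind.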
Assignee: nobody → klibby
Flags: needinfo?(klibby)
(In reply to Kendall Libby [:fubar] from comment #9)
> Prod was changed in bug 1079270. History on that one looks confusing; I
> wonder if there's a missing IRC conversation somewhere. Anyway, reverted
> KILL signals back to TERM.

Thank you :-)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.