Closed Bug 1072437 · Opened 10 years ago · Closed 10 years ago
The deploy script's |service celerybeat restart| isn't restarting the workers properly
Categories: Tree Management :: Treeherder: Infrastructure (defect, P1)
Tracking: (Not tracked) · Status: RESOLVED FIXED
People: (Reporter: emorley, Assigned: fubar)
Description
The cause of bug 1071570 was that while the updated credentials.json was present on both the admin node and the workers, we were still using stale values cached in memcached.
As part of the Chief deployment we should clear them out.
https://github.com/mozilla/treeherder-service/blob/master/deployment/update/update.py#L65
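A minimal sketch of what that clearing step could look like in update.py, assuming the credentials are cached through Django's cache API (the key name "repository_credentials" is hypothetical, for illustration only):

from django.core.cache import cache

def invalidate_cached_credentials():
    # Drop the stale cached copy so workers re-read credentials.json on
    # the next lookup. cache.clear() would be the blunter option if the
    # exact key is unknown.
    cache.delete("repository_credentials")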
Reporter · Comment 1 • 10 years ago
The new theory is that the restart of the workers requested by update.py sometimes isn't taking effect, so the workers don't pick up the updated repos/credentials.
Summary: Refresh the memcached repository credentials during deployment → Fix the deployment script so all workers processes are restarted & pick up new repos
Reporter · Updated • 10 years ago
Summary: Fix the deployment script so all workers processes are restarted & pick up new repos → Fix the deployment script so workers processes correctly pick up changes
Reporter · Comment 2 • 10 years ago
And again:
21:30 KWierso|sheriffduty camd: I'm seeing a bunch of "Log parsing not complete"
21:30 KWierso|sheriffduty should that be catching up?
21:32 mdoglio the log parser hasn't been restarted in the deployment script
21:33 mdoglio we need to find a reliable solution to restart those workers
21:33 mdoglio fubar: I'm so so sorry I need to ask you again... could you please kill the celery processes on the 2 processor machines?
21:33 jeads mdoglio: in theory isn’t that what this should do https://github.com/mozilla/treeher...deployment/update/update.py#L98
21:33 mdoglio for some reason they didn't get refreshed
21:34 mdoglio jeads: yes in theory that should work
21:34 mdoglio but I can see the celery processes on the processor machines are 5 hours old
21:35 fubar mdoglio: done
21:35 jeads mdoglio: I think celery just doesn’t really support a warm shutdown, when the workers are busy they probably ignore or never receive the shutdown signal
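If warm shutdowns are indeed being ignored by busy workers, one option (a sketch only, not the current deploy script) is to escalate from a warm to a cold shutdown. Celery workers treat SIGTERM as a warm shutdown (finish the running task, then exit) and SIGQUIT as a cold shutdown; the pidfile location and grace period below are assumptions:

import glob
import os
import signal
import time

def stop_workers(pidfile_glob="/var/run/celery/*.pid", grace=60):
    for pidfile in glob.glob(pidfile_glob):
        with open(pidfile) as f:
            pid = int(f.read().strip())
        try:
            os.kill(pid, signal.SIGTERM)    # request a warm shutdown
        except OSError:
            continue                        # process already gone
        deadline = time.time() + grace
        while time.time() < deadline:
            try:
                os.kill(pid, 0)             # probe: still alive?
            except OSError:
                break                       # exited cleanly
            time.sleep(1)
        else:
            # Still busy after the grace period: force a cold shutdown.
            try:
                os.kill(pid, signal.SIGQUIT)
            except OSError:
                pass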
Summary: Fix the deployment script so workers processes correctly pick up changes → The deploy script's |service celerybeat restart| isn't restarting the workers properly
Reporter · Comment 4 • 10 years ago
(In reply to Mauro Doglio [:mdoglio] from bug 1079701 comment #2)
> Restarting supervisord was our first idea to restart the celery workers,
> but for some reason it was sometimes creating zombie processes. I still
> think it's a good idea to broadcast a restart signal using celery itself;
> I just need to understand why that sometimes doesn't work. In order to do
> that, I need better monitoring of the workers; a good solution for that
> could be [1].
>
> Maybe fubar can help us deploy it on the admin node and make it
> accessible only under VPN?
>
> [1] http://celery.readthedocs.org/en/latest/userguide/monitoring.html#flower-real-time-celery-web-monitor
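For reference, broadcasting a restart over celery's remote-control channel might look roughly like this sketch (the app import path is an assumption, and celery only honours pool_restart when the workers run with the CELERYD_POOL_RESTARTS setting enabled):

from treeherder import celery_app as app  # hypothetical import path

def broadcast_pool_restart():
    # Ask every worker to replace its pool processes in place; "reload"
    # makes them re-import task modules. This command is ignored unless
    # CELERYD_POOL_RESTARTS is enabled on the workers.
    app.control.broadcast("pool_restart", arguments={"reload": True})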
Flags: needinfo?(klibby)
Assignee · Comment 5 • 10 years ago
It looks like it'd be fairly straightforward to do, but from the docs it looks like it only gives you access to the local workers? Am I misunderstanding, or would we need it on all of the non-web nodes?
Flags: needinfo?(klibby)
Comment 6 • 10 years ago
:fubar AFAIK you only need it on one node; it communicates with the workers via RabbitMQ.
Reporter · Updated • 10 years ago
Component: Treeherder → Treeherder: Infrastructure
Assignee · Comment 7 • 10 years ago
FYI, as part of bug 1112290 we reverted to using supervisorctl restart on the celery workers. I haven't seen zombie processes any time recently, and that's what I've been using whenever workers get disconnected.
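A rough sketch of what that deploy step amounts to (the supervisord group name "celery:" and the direct subprocess call are assumptions, not the actual update.py code):

from subprocess import check_call

def restart_celery_workers():
    # supervisorctl stops each program in the group (SIGTERM by default;
    # per-program config can escalate) and starts a fresh process, so the
    # workers come back with the new code and credentials.
    check_call(["supervisorctl", "restart", "celery:*"])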
Comment 8 • 10 years ago
Closing this as per the last comment by :fubar.
Assignee: nobody → klibby
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED