Bug 1072437 (Closed) · Opened 10 years ago · Closed 9 years ago

The deploy script's |service celerybeat restart| isn't restarting the workers properly

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: fubar)

References

Details

The cause of bug 1071570 was that while the updated credentials.json was present on both the admin node and the workers, we were still using the stale values cached in memcached.

As part of the Chief deployment, we should clear them out.

https://github.com/mozilla/treeherder-service/blob/master/deployment/update/update.py#L65
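For reference, a minimal sketch of what clearing the stale cached values during deployment could look like, assuming the deploy host can reach the same memcached cluster as the workers and that the project's Django cache settings point at it (this is a hedged illustration, not what update.py currently does):

    # Hedged sketch: flush the shared memcached cluster at deploy time so the
    # workers re-read repos/credentials instead of serving stale cached values.
    # Assumes the Django settings used here point at the workers' memcached.
    from django.core.cache import cache

    def clear_memcached():
        # clear() drops every key, not just the credentials; that bluntness is
        # acceptable as a one-off step during deployment.
        cache.clear()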
Blocks: 1072681
No longer blocks: treeherder-dev-transition
The new theory is that the restart of the workers requested by update.py sometimes isn't working, causing the workers not to pick up the updated repos/credentials.
Summary: Refresh the memcached repository credentials during deployment → Fix the deployment script so all worker processes are restarted & pick up new repos
Summary: Fix the deployment script so all worker processes are restarted & pick up new repos → Fix the deployment script so worker processes correctly pick up changes
And again:

21:30	KWierso|sheriffduty	camd: I'm seeing a bunch of "Log parsing not complete"
21:30	KWierso|sheriffduty	should that be catching up?
21:32	mdoglio	the log parser hasn't been restarted in the deployment script
21:33	mdoglio	we need to find a reliable solution to restart those workers
21:33	mdoglio	fubar: I'm so so sorry I need to ask you again... could you please kill the celery processes on the 2 processor machines?
21:33	jeads	mdoglio: in theory isn’t that what this should do https://github.com/mozilla/treeher...deployment/update/update.py#L98
21:33	mdoglio	for some reason they didn't get refreshed
21:34	mdoglio	jeads: yes in theory that should work
21:34	mdoglio	but I can see the celery processes on the processor machines are 5 hours old
21:35	fubar	mdoglio: done
21:35	jeads	mdoglio: I think celery just doesn’t really support a warm shutdown, when the workers are busy they probably ignore or never receive the shutdown signal
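For context on the "warm shutdown" point above: celery treats SIGTERM as a warm shutdown (finish in-flight tasks first) and SIGQUIT as a cold shutdown (exit immediately), so busy workers can outlive a restart that only sends TERM. A hedged sketch of escalating from one to the other; the process name pattern is an assumption about how the workers were started:

    # Hedged sketch: warm shutdown first, then cold shutdown for stragglers.
    # "celery worker" as the pkill pattern is an assumption about the workers'
    # command line.
    import subprocess
    import time

    subprocess.call(["pkill", "-TERM", "-f", "celery worker"])  # warm: finish current tasks
    time.sleep(30)                                              # grace period
    subprocess.call(["pkill", "-QUIT", "-f", "celery worker"])  # cold: exit immediately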
Summary: Fix the deployment script so worker processes correctly pick up changes → The deploy script's |service celerybeat restart| isn't restarting the workers properly
Depends on: 1079701
Blocks: 1080757
No longer blocks: 1072681
No longer depends on: 1079701
(In reply to Mauro Doglio [:mdoglio] from bug 1079701 comment #2)
> Restarting supervisord was our first idea to restart the celery workers, but
> for some reason it was sometimes creating zombie processes. I still think
> it's a good idea to broadcast a restart signal using celery itself, I just
> need to understand why sometimes that doesn't work. In order to do that, I
> need better monitoring of the workers; a good solution for that could be [1].
> 
> Maybe fubar can help us to deploy it on the admin node and make it
> accessible only under vpn?
> 
> [1] http://celery.readthedocs.org/en/latest/userguide/monitoring.html#flower-real-time-celery-web-monitor
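A hedged sketch of the broadcast-restart idea mentioned in the quote, assuming the workers run with CELERYD_POOL_RESTARTS enabled; the broker URL is a placeholder:

    # Hedged sketch: ask every connected worker, via the broker, to restart its
    # process pool. Requires CELERYD_POOL_RESTARTS = True in the worker config;
    # the broker URL below is a placeholder.
    from celery import Celery

    app = Celery(broker="amqp://guest:guest@rabbitmq-host:5672//")
    app.control.broadcast("pool_restart", arguments={"reload": True})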
Flags: needinfo?(klibby)
It looks like it'd be fairly straightforward to do, but from the docs it looks like it only gives you access to the local workers? Am I misunderstanding, or would we need it on all of the non-web nodes?
Flags: needinfo?(klibby)
:fubar AFAIK you only need it on one node; it communicates with the workers via RabbitMQ.
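A hedged sketch of running flower on a single (e.g. admin) node; the broker URL and port are placeholders. Because flower talks to the broker rather than to individual hosts, one instance can see all the workers:

    # Hedged sketch: launch flower against the shared RabbitMQ broker.
    # Broker URL and port are placeholders; 5555 is flower's default port.
    import subprocess

    subprocess.check_call([
        "celery", "flower",
        "--broker=amqp://guest:guest@rabbitmq-host:5672//",
        "--port=5555",
    ])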
Depends on: 1079270
Depends on: 1093760
No longer blocks: 1080757
Component: Treeherder → Treeherder: Infrastructure
FYI, as part of bug 1112290 we reverted to using supervisorctl restart on the celery workers. I haven't seen zombie processes recently, and that's what I've been using whenever workers get disconnected.
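For completeness, a hedged sketch of what a supervisorctl-based restart step could look like in a deploy script; the supervisord program/group name is an assumption:

    # Hedged sketch: restart the celery worker programs via supervisord from the
    # deploy script. "celery:*" as the program group name is an assumption.
    import subprocess

    subprocess.check_call(["supervisorctl", "restart", "celery:*"])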
Closing this as per the last comment by :fubar.
Assignee: nobody → klibby
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED