Bug 1072437 (Closed) · Opened 10 years ago · Closed 9 years ago

The deploy script's |service celerybeat restart| isn't restarting the workers properly

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: fubar)

References

Details

The cause of bug 1071570 was that while the updated credentials.json was present on both the admin node and the workers, we were still using the stale values cached in memcached.

As part of the Chief deployment, we should clear them out.

https://github.com/mozilla/treeherder-service/blob/master/deployment/update/update.py#L65
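For reference, a minimal sketch of what clearing the stale cached values during deployment could look like, assuming the deploy host can reach the same memcached cluster as the workers and that the project's Django cache settings point at it (this is a hedged illustration, not what update.py currently does):

    # Hedged sketch: flush the shared memcached cluster at deploy time so the
    # workers re-read repos/credentials instead of serving stale cached values.
    # Assumes the Django settings used here point at the workers' memcached.
    from django.core.cache import cache

    def clear_memcached():
        # clear() drops every key, not just the credentials; that bluntness is
        # acceptable as a one-off step during deployment.
        cache.clear()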
Blocks: 1072681
No longer blocks: treeherder-dev-transition
The new theory is that the restart of the workers requested by update.py sometimes isn't working, causing the workers not to pick up the updated repos/credentials.
Summary: Refresh the memcached repository credentials during deployment → Fix the deployment script so all worker processes are restarted & pick up new repos
Summary: Fix the deployment script so all worker processes are restarted & pick up new repos → Fix the deployment script so worker processes correctly pick up changes
And again:

21:30	KWierso|sheriffduty	camd: I'm seeing a bunch of "Log parsing not complete"
21:30	KWierso|sheriffduty	should that be catching up?
21:32	mdoglio	the log parser hasn't been restarted in the deployment script
21:33	mdoglio	we need to find a reliable solution to restart those workers
21:33	mdoglio	fubar: I'm so so sorry I need to ask you again... could you please kill the celery processes on the 2 processor machines?
21:33	jeads	mdoglio: in theory isn’t that what this should do https://github.com/mozilla/treeher...deployment/update/update.py#L98
21:33	mdoglio	for some reason they didn't get refreshed
21:34	mdoglio	jeads: yes in theory that should work
21:34	mdoglio	but I can see the celery processes on the processor machines are 5 hours old
21:35	fubar	mdoglio: done
21:35	jeads	mdoglio: I think celery just doesn’t really support a warm shutdown, when the workers are busy they probably ignore or never receive the shutdown signal
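For context on the "warm shutdown" point above: celery treats SIGTERM as a warm shutdown (finish in-flight tasks first) and SIGQUIT as a cold shutdown (exit immediately), so busy workers can outlive a restart that only sends TERM. A hedged sketch of escalating from one to the other; the process name pattern is an assumption about how the workers were started:

    # Hedged sketch: warm shutdown first, then cold shutdown for stragglers.
    # "celery worker" as the pkill pattern is an assumption about the workers'
    # command line.
    import subprocess
    import time

    subprocess.call(["pkill", "-TERM", "-f", "celery worker"])  # warm: finish current tasks
    time.sleep(30)                                              # grace period
    subprocess.call(["pkill", "-QUIT", "-f", "celery worker"])  # cold: exit immediately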
Summary: Fix the deployment script so worker processes correctly pick up changes → The deploy script's |service celerybeat restart| isn't restarting the workers properly
Depends on: 1079701
Blocks: 1080757
No longer blocks: 1072681
No longer depends on: 1079701
(In reply to Mauro Doglio [:mdoglio] from bug 1079701 comment #2)
> Restarting supervisord was our first idea to restart the celery workers, but
> for some reason it was sometimes creating zombie processes. I still think
> it's a good idea to broadcast a restart signal using celery itself, I just
> need to understand why sometimes that doesn't work. In order to do that, I
> need better monitoring of the workers; a good solution for that could be [1].
> 
> Maybe fubar can help us to deploy it on the admin node and make it
> accessible only under vpn?
> 
> [1] http://celery.readthedocs.org/en/latest/userguide/monitoring.html#flower-real-time-celery-web-monitor
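A hedged sketch of the broadcast-restart idea mentioned in the quote, assuming the workers run with CELERYD_POOL_RESTARTS enabled; the broker URL is a placeholder:

    # Hedged sketch: ask every connected worker, via the broker, to restart its
    # process pool. Requires CELERYD_POOL_RESTARTS = True in the worker config;
    # the broker URL below is a placeholder.
    from celery import Celery

    app = Celery(broker="amqp://guest:guest@rabbitmq-host:5672//")
    app.control.broadcast("pool_restart", arguments={"reload": True})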
Flags: needinfo?(klibby)
It looks like it'd be fairly straightforward to do, but from the docs it looks like it only gives you access to the local workers? Am I misunderstanding, or would we need it on all of the non-web nodes?
Flags: needinfo?(klibby)
:fubar AFAIK you only need it on one node; it communicates with the workers via RabbitMQ.
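A hedged sketch of running flower on a single (e.g. admin) node; the broker URL and port are placeholders. Because flower talks to the broker rather than to individual hosts, one instance can see all the workers:

    # Hedged sketch: launch flower against the shared RabbitMQ broker.
    # Broker URL and port are placeholders; 5555 is flower's default port.
    import subprocess

    subprocess.check_call([
        "celery", "flower",
        "--broker=amqp://guest:guest@rabbitmq-host:5672//",
        "--port=5555",
    ])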
Depends on: 1079270
Depends on: 1093760
No longer blocks: 1080757
Component: Treeherder → Treeherder: Infrastructure
FYI, as part of bug 1112290 we reverted to using supervisorctl restart on the celery workers. I haven't seen zombie processes recently, and that's what I've been using whenever workers get disconnected.
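For completeness, a hedged sketch of what a supervisorctl-based restart step could look like in a deploy script; the supervisord program/group name is an assumption:

    # Hedged sketch: restart the celery worker programs via supervisord from the
    # deploy script. "celery:*" as the program group name is an assumption.
    import subprocess

    subprocess.check_call(["supervisorctl", "restart", "celery:*"])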
Closing this as per the last comment by :fubar.
Assignee: nobody → klibby
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED