Closed
Bug 1129604
Opened 10 years ago
Closed 10 years ago
MDN Celery jobs stuck, worker unhealthy?
Categories
(Infrastructure & Operations Graveyard :: WebOps: Community Platform, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jezdez, Assigned: cliang)
References
Details
(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/421] )
I think one or more Celery workers are stuck. I'm seeing a bunch of jobs that are "pending" according to the Celery admin (https://developer.mozilla.org/admin/djcelery/taskstate/?state__exact=PENDING), which implies that they were acknowledged by the broker (they have an ID) but never handled by a worker.
I suspect celery3 has problems: I haven't seen any successful tasks processed by it, only by celery1 and celery2.
This is currently blocking an important recurring task from completing: we now use Celery chords, which need to process a list of tasks and then close the list off at the end for a complete run of the recurring task to succeed.
BTW, celery3 also doesn't show up in New Relic for the developer.mozilla.org app.
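For anyone unfamiliar with the pattern: a chord runs a group of header tasks and, once they all succeed, a single body callback that closes the run off. A minimal sketch, assuming Celery 3.x canvas primitives (the broker URL and task names here are illustrative, not kuma's actual tasks):

    from celery import Celery, chord

    app = Celery('sketch', broker='redis://localhost:6379/0')

    @app.task
    def render_document(doc_id):
        # stand-in for rendering one stale document
        return doc_id

    @app.task
    def finish_run(results):
        # body callback: runs once, only after every header task succeeds
        return len(results)

    # header tasks run across the workers; finish_run closes the run off
    chord(render_document.s(doc_id) for doc_id in [1, 2, 3])(finish_run.s())

If a worker acknowledges a header task and then never runs it, the body callback never fires, which matches the stuck "pending" jobs described above.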
Comment 1•10 years ago
FWIW, I'm starting to see some tasks succeeding from celery@developer-celery3.webapp.scl3.mozilla.com ...
https://developer.mozilla.org/admin/djcelery/taskstate/?state__exact=SUCCESS
I'll keep an eye on this too.
Comment 2•10 years ago
Another note: only our developer2 and developer3 celery nodes have started spitting out errors. Their error rates are 84% and 92% respectively, while the developer1 celery node is at 0%.
Comment 3•10 years ago
What is the value of CELERY_IGNORE_RESULT in the settings_local.py file on the celery nodes?
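(Context for the question: if CELERY_IGNORE_RESULT is True, workers don't store task results, and chords that rely on the result backend to count finished header tasks can stall. A hypothetical settings_local.py excerpt, using Celery 3.x-era setting names:)

    # settings_local.py -- hypothetical excerpt, not the actual file
    CELERY_IGNORE_RESULT = False          # results must be stored for chords to complete
    CELERY_RESULT_BACKEND = 'database'    # djcelery's database-backed result store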
Comment 4•10 years ago
My hypothesis is that the new render_stale_documents chord isn't properly locking and unlocking in coordination with the multiple celery nodes.
developer-celery[23] are frequently raising kuma.wiki.exceptions:StaleDocumentsRenderingInProgress errors. [1] It seems like developer-celery1 is always grabbing the render_stale_documents task, and the new chord implementation [2] isn't properly releasing the lock between chunks of renders?
I'm still digging, but I don't notice any effects on the site yet - just a bunch of noisy celery workers.
[1] https://rpm.newrelic.com/accounts/263620/applications/3172075/traced_errors/3035789659/similar_errors?original_error_id=3035789659
[2] https://github.com/mozilla/kuma/pull/3046/files#diff-3
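A rough sketch of the suspected failure mode, assuming a configured Django settings module, a cache backend with an atomic add (e.g. memcached), and a Celery app; the key name and timeout below are made up for illustration, see [2] for the real implementation:

    from celery import Celery
    from django.core.cache import cache
    from kuma.wiki.exceptions import StaleDocumentsRenderingInProgress

    app = Celery('sketch')
    LOCK_KEY = 'wiki-render-stale-documents'  # hypothetical key name

    @app.task
    def render_stale_documents():
        # cache.add() is atomic: it fails if the key already exists, so only
        # one node can start a run. If the chord never deletes the key between
        # chunks of renders, every later attempt on any node raises until the
        # timeout expires -- the noise we're seeing on developer-celery[23].
        if not cache.add(LOCK_KEY, True, timeout=60 * 60):
            raise StaleDocumentsRenderingInProgress()
        # ... kick off the chord of per-document render tasks ...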
Comment 5•10 years ago
(In reply to Luke Crouch [:groovecoder] from comment #4)
> developer-celery[23] are frequently raising
> kuma.wiki.exceptions:StaleDocumentsRenderingInProgress errors. [1] It seems
> like developer-celery1 is always grabbing the render_stale_documents task,
> and the new chord implementation [2] isn't properly releasing the lock
> between chunks of renders?
I didn't consider this in my review, sadly. I believe this is true: with Celery chords these run in parallel (the whole point of moving to chords), but the locking isn't set up to handle that correctly.
I believe we'd need to move to a pattern of something like:
chain(<acquire lock task>,
      chord(header=<render stale docs tasks>,
            body=<release lock task>))
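Concretely, something like the sketch below, reusing the hypothetical cache/lock helpers from the sketch in comment 4 (task names are illustrative; .si() keeps the signatures immutable so results aren't threaded between steps, and the release task runs as the chord body so the lock is dropped exactly once, at the very end):

    from celery import chain, chord

    @app.task
    def acquire_lock():
        # fail fast if another run already holds the lock
        if not cache.add(LOCK_KEY, True, timeout=60 * 60):
            raise StaleDocumentsRenderingInProgress()

    @app.task
    def release_lock(results=None):
        cache.delete(LOCK_KEY)

    def render_stale_documents_run(doc_ids):
        return chain(
            acquire_lock.si(),
            chord((render_document.si(d) for d in doc_ids),
                  release_lock.s()),
        ).apply_async()

One caveat with this shape: if any header task fails, the chord body may never run, so the lock would be held until its timeout expires.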
Reporter
Comment 6•10 years ago
Something is wrong when I deploy: celery3 doesn't seem to be restarting correctly, but I can't log in to check. See the end of the deploy log:
http://developeradm.private.scl3.mozilla.com/chief/developer.stage/logs/481c32e8c26c8adc794beee732999f110e65aec2.1423136096
Can anyone from webops help please?
Comment 7•10 years ago
Note: the production celery log showing the celery3 error is here:
http://developeradm.private.scl3.mozilla.com/chief/developer.prod/logs/481c32e8c26c8adc794beee732999f110e65aec2.1423136285
Bumping importance.
Severity: major → critical
Comment 8•10 years ago
Trying to get confirmation of how important this is before waking anyone up.
Can this wait 1-3 hours, or do I need to page someone?
Setting this to critical pages the on-call MOC.
Comment 9•10 years ago
Down to major now that we're past the spike in errors. 1-3 hours is fine.
Severity: critical → major
Reporter
Comment 10•10 years ago
Oh, thanks for the corrected link, :groovecoder.
Comment 11•10 years ago
Hopefully someone from webops will be along shortly to look at this.
Severity: major → normal
Assignee
Comment 12•10 years ago
Addressed for now with a manual restart of celery on developer-celery3. I still need to address the underlying issue.
On developer-celery3, I was able to get to the supervisorctl prompt and issue a 'mrestart celery*'. However, running '/usr/bin/supervisorctl mrestart celery*' from the command line produces the error listed above on developer-celery3; it does not on developer-celery2.
Assignee
Comment 13•10 years ago
TL;DR: You'll want to escape the '*' in the supervisorctl command:
/usr/bin/supervisorctl mrestart celery\*
If there is a file matching the celery* pattern in the home directory of the user running the supervisorctl command, the shell expands the wildcard to the matching filename before supervisorctl ever sees it. For example, if there is a file named 'celery_kuma.conf' in the directory, supervisorctl will attempt to restart 'celery_kuma.conf' and not 'celeryd*'.
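The underlying behavior is ordinary shell glob expansion, nothing supervisorctl-specific. A quick illustrative way to see what the shell would hand to supervisorctl:

    import glob
    import os

    # the directory the command is run from (the user's home, in this case)
    os.chdir(os.path.expanduser('~'))
    matches = glob.glob('celery*')
    # With 'celery_kuma.conf' present, the unescaped pattern expands to
    # ['celery_kuma.conf'] before supervisorctl sees any argument. With no
    # match, bash passes the literal string 'celery*' through (default
    # behavior without nullglob/failglob) -- which is why the same command
    # worked on developer-celery2 but not on developer-celery3.
    print(matches or ['celery*'])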
Assignee
Comment 14•10 years ago
(I have deleted the celery* file that was causing the issue on developer3. Escaping the wildcard in the deploy scripts, if possible, would proactively prevent this from happening again.)
Assignee
Updated•10 years ago
Assignee: server-ops-webops → cliang
Comment 15•10 years ago
I filed bug 1139291 to track improving the deploy scripts. Based on the comments, it appears that this issue was otherwise resolved in early February, so I'm marking this RESOLVED FIXED. Please feel free to REOPEN if further action needs to occur beyond the script improvements in bug 1139291.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard