Closed Bug 1129604 Opened 10 years ago Closed 10 years ago

MDN Celery jobs stuck, worker unhealthy?

Categories

(Infrastructure & Operations Graveyard :: WebOps: Community Platform, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jezdez, Assigned: cliang)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/421] )

I think one or more Celery workers are stuck. I'm seeing a bunch of jobs that are "pending" according to the Celery admin (https://developer.mozilla.org/admin/djcelery/taskstate/?state__exact=PENDING), which implies that they were acknowledged by the broker (because they have an ID) but never handled by a worker. I suspect celery3 has a problem, since I haven't seen any successful tasks processed by it, only by celery1 and celery2.

This is currently blocking an important recurring task from completing, since we're now using Celery's chords, which need to process a list of tasks in serial and close the list off at the end for a complete run of the recurring task to succeed.

BTW, celery3 also doesn't show up in New Relic for the developer.mozilla.org app.
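As an aside, a minimal sketch of how one could list the same pending rows from a Django shell, assuming the django-celery monitor models behind the admin URL above (TaskState field names per django-celery; this is just for illustration, not something from the actual report):

from djcelery.models import TaskState

# Tasks the monitor recorded but that never reached a terminal state.
for task in TaskState.objects.filter(state="PENDING"):
    print(task.task_id, task.name, task.tstamp, task.worker)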
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/421]
FWIW, I'm starting to see some tasks succeeding from celery@developer-celery3.webapp.scl3.mozilla.com ... https://developer.mozilla.org/admin/djcelery/taskstate/?state__exact=SUCCESS I'll keep an eye on this too.
Another note: only our developer2 and developer3 celery nodes started spitting out errors. Their error rates are 84% and 92% respectively, while the developer1 celery node is at 0%.
What is the value of CELERY_IGNORE_RESULT in the settings_local.py file on the celery nodes?
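For reference, a hypothetical settings_local.py excerpt showing only the setting being asked about (old-style Celery 3.x / django-celery name); the actual value on the nodes is what this comment is asking for:

# Hypothetical example, not the real settings_local.py contents.
CELERY_IGNORE_RESULT = False  # when True, task results are never stored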
My hypothesis is that the new render_stale_documents chord isn't properly locking and unlocking in coordination with the multiple celery nodes. developer-celery[23] are frequently raising kuma.wiki.exceptions:StaleDocumentsRenderingInProgress errors. [1] It seems like developer-celery1 is always grabbing the render_stale_documents task, and the new chord implementation [2] isn't properly releasing the lock between chunks of renders? I'm still digging, but I don't notice any effects on the site yet - just a bunch of noisy celery workers.

[1] https://rpm.newrelic.com/accounts/263620/applications/3172075/traced_errors/3035789659/similar_errors?original_error_id=3035789659
[2] https://github.com/mozilla/kuma/pull/3046/files#diff-3
(In reply to Luke Crouch [:groovecoder] from comment #4)
> developer-celery[23] are frequently raising
> kuma.wiki.exceptions:StaleDocumentsRenderingInProgress errors. [1] It seems
> like developer-celery1 is always grabbing the render_stale_documents task,
> and the new chord implementation [2] isn't properly releasing the lock
> between chunks of renders?

I didn't consider this in my review, sadly. I believe this is true: with Celery chords these would be run in parallel (the whole point of moving to chords), but the locking isn't set up to handle that correctly. I believe we'd need to move to a pattern of something like:

chain(<acquire lock task>, chord(header=<render stale docs tasks>, body=<release lock task>))
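For illustration, here is a minimal sketch of that pattern, assuming hypothetical task names (acquire_render_lock, render_document, release_render_lock); the real kuma tasks and the lock mechanism will differ:

from celery import chain, chord, shared_task

@shared_task
def acquire_render_lock():
    # Hypothetical: take the rendering lock (e.g. set a cache key) before
    # any renders start.
    pass

@shared_task
def render_document(doc_id):
    # Hypothetical: render a single stale document.
    pass

@shared_task
def release_render_lock(results):
    # Chord body: runs once after every header task has finished, which is
    # the point where the lock should be released.
    pass

def render_stale_documents(stale_doc_ids):
    # Acquire the lock first, then fan the renders out; the chord body
    # releases the lock only after all renders complete.
    workflow = chain(
        acquire_render_lock.si(),
        chord(
            [render_document.si(doc_id) for doc_id in stale_doc_ids],
            release_render_lock.s(),
        ),
    )
    return workflow.delay()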
Something is wrong when I deploy: celery3 doesn't seem to be restarting correctly, but I can't log in to check. See the end of the deploy log: http://developeradm.private.scl3.mozilla.com/chief/developer.stage/logs/481c32e8c26c8adc794beee732999f110e65aec2.1423136096 Can anyone from webops help, please?
Note: the production celery log showing the celery3 error is here: http://developeradm.private.scl3.mozilla.com/chief/developer.prod/logs/481c32e8c26c8adc794beee732999f110e65aec2.1423136285 Bumping importance.
Severity: major → critical
Trying to get confirmation of how important this is before waking anyone up. Can this wait 1-3 hours, or do I need to page someone? Setting this to critical pages the on-call MOC.
Down to major now that we're past the spike in errors. 1-3 hours is fine.
Severity: critical → major
Oh, thanks for the corrected link, :groovecoder.
Hopefully someone from webops will be along shortly to look at this.
Severity: major → normal
Addressed for now with a manual restart of celery on developer-celery3. I still need to address the underlying issue. On developer-celery3, I was able to get to the supervisorctl prompt and then issue 'mrestart celery*'. However, from the command line on developer-celery3, '/usr/bin/supervisorctl mrestart celery*' produces the error listed above; it does not on developer-celery2.
TL;DR: You'll want to escape the '*' in the supervisorctl command:

/usr/bin/supervisorctl mrestart celery\*

If there is a file matching the celery* pattern in the working directory (here, the home directory of the user running the supervisorctl command), the shell expands the wildcard to that filename before supervisorctl sees it. For example, if there is a file named 'celery_kuma.conf' in the directory, supervisorctl will attempt to restart 'celery_kuma.conf' and not 'celeryd*'.
(I have deleted the file matching the celery* pattern that was causing the issue on developer3. Escaping the wildcard in the deploy scripts, if possible, would proactively protect against this happening again.)
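A minimal sketch of one way a deploy script could avoid the problem, assuming it can invoke supervisorctl from Python: passing the arguments as a list (no shell) means 'celery*' reaches supervisorctl literally instead of being glob-expanded against files in the working directory. The path and pattern are the ones from the comments above; how the real deploy scripts call supervisorctl is not confirmed here.

import subprocess

# No shell is involved, so the wildcard is not expanded by bash; supervisorctl
# itself matches 'celery*' against its configured program names.
subprocess.check_call(["/usr/bin/supervisorctl", "mrestart", "celery*"])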
Assignee: server-ops-webops → cliang
See Also: → 1139291
I filed bug 1139291 to track improving the deploy scripts. Based on the comments, it appears that this issue was otherwise resolved in early February, so I'm marking this RESOLVED FIXED. Please feel free to REOPEN if further action needs to occur beyond the script improvements in bug 1139291.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard