Bug 929443 (Closed) · Opened 12 years ago · Closed 12 years ago

fetch_bugs cron not running in reps.allizom.org

Categories

(Infrastructure & Operations Graveyard :: WebOps: Community Platform, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nemo-yiannis, Assigned: cturra)

References

Details

It seems that the fetch_bugs cron is not running on reps.allizom.org. :cturra ran it manually on stage and got the following traceback: https://gist.github.com/cturra/7099571 The error indicates a celery connection error.
as discussed in #remo-dev, the net flow from the engagement admin host is not open to the engagement stage celery host. i am going to file a bug with netops to get this sorted.

[cturra@engagementadm.private.phx1 settings]$ nc -zv engagement-celery1.stage.seamicro.phx1.mozilla.com 5672
nc: connect to engagement-celery1.stage.seamicro.phx1.mozilla.com port 5672 (tcp) failed: Connection timed out
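(For reference, a minimal Python equivalent of the nc port probe used here; the hostname and port are from the comment above, the five-second timeout is an assumption.)

import socket

def port_open(host, port, timeout=5.0):
    # Attempt a plain TCP connection, mirroring `nc -zv host port`.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False

print(port_open("engagement-celery1.stage.seamicro.phx1.mozilla.com", 5672))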
Assignee: server-ops-webops → cturra
OS: Linux → All
Hardware: x86_64 → All
net flow has now been opened and i can manually connect successfully.

[cturra@engagementadm.private.phx1 ~]$ nc -zv -w5 engagement-celery1.stage.seamicro.phx1.mozilla.com 5672
Connection to engagement-celery1.stage.seamicro.phx1.mozilla.com 5672 port [tcp/amqp] succeeded!

additionally, i ran the stage `fetch_bugs` cron by hand and it completed without error.

[root@engagementadm.private.phx1 ~]# /usr/bin/flock -w 10 /var/lock/reps-stage-bugs /data/engagement-stage/src/reps.allizom.org/remo/manage.py fetch_bugs
/data/engagement-stage/src/reps.allizom.org/remo/vendor/lib/python/celery/loaders/default.py:64: NotConfigured: No 'celeryconfig' module found! Please make sure it exists and is available to Python.
  "is available to Python." % (configname, )))
[root@engagementadm.private.phx1 ~]# echo $?
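(For context, the flock wrapper in the command above prevents overlapping cron runs. A rough Python equivalent of that locking pattern; the lock path and command are copied from the comment, the rest is illustrative.)

import fcntl
import subprocess
import sys

LOCK_PATH = "/var/lock/reps-stage-bugs"
CMD = ["/data/engagement-stage/src/reps.allizom.org/remo/manage.py", "fetch_bugs"]

with open(LOCK_PATH, "w") as lock:
    try:
        # LOCK_NB fails immediately if another run holds the lock, where
        # `flock -w 10` would wait up to 10 seconds before giving up.
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("another fetch_bugs run holds the lock")
    sys.exit(subprocess.call(CMD))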
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
:nemo - is something still not working as expected?
Flags: needinfo?(jgiannelos)
Reopened this bug because we again have a problem with fetch_bugs on reps.allizom.org: it has stalled again. :cturra, is it possible to do a manual run to check for possible errors? Also, do you have in mind any way to monitor fetch_bugs behavior? Thanks!
Flags: needinfo?(jgiannelos)
curious. i am now seeing the same 'socket.error: [Errno 111] Connection refused' we saw yesterday, which i did *NOT* see earlier. when i manually connect to the celery node, i can establish a connection just fine:

[cturra@engagementadm.private.phx1 ~]$ nc -zv engagement-celery1.stage.seamicro.phx1.mozilla.com 5672
Connection to engagement-celery1.stage.seamicro.phx1.mozilla.com 5672 port [tcp/amqp] succeeded!

this host and port match what is defined in remo/settings/local.py (BROKER_HOST, BROKER_PORT). are other celery tasks in the stage environment working? any further insight into what the application is doing when it gets this error would also be appreciated.
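(For reference, the broker settings referenced here presumably look something like this in remo/settings/local.py; the values below are illustrative, not the real stage config.)

# remo/settings/local.py -- illustrative values only
BROKER_HOST = "engagement-celery1.stage.seamicro.phx1.mozilla.com"
BROKER_PORT = 5672        # AMQP default
BROKER_USER = "guest"     # assumption: real credentials differ
BROKER_PASSWORD = "guest"
BROKER_VHOST = "/"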
Flags: needinfo?(jgiannelos)
Other tasks that we have, besides fetch_bugs, are the following:
* send_remo_mail
* send_generic_mail
* send_voting_mail

The first two are triggered by user input (a button on the site). The last one is used with the celery eta flag to send out email notifications at a specific point in time; there should be around ~30 such tasks scheduled to run in the future (at least 5 months away). Right now, no active task should be running.

Further insight about the application when the fetch_bugs error is triggered:
* We run fetch_bugs, which loads bugs from the Bugzilla API.
* It adds some entries to our DB.
* These entries trigger send_voting_mail (a celery task).

Regarding the celery setup, I assume the settings are correct, since we are using them to send automated emails via send_voting_mail (tested it while typing). I hope I'm not distracting you with project-specific functionality.
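(For illustration, this is roughly what scheduling with the eta flag looks like; the import path and arguments below are assumptions, not the actual remo code.)

from datetime import datetime, timedelta

from remo.voting.tasks import send_voting_mail  # assumed import path

voting_id = 42  # hypothetical voting identifier

# apply_async with eta queues the task to run at a specific future time;
# celery holds it until then, which is why ~30 tasks sit scheduled.
send_voting_mail.apply_async(
    args=[voting_id],
    eta=datetime.utcnow() + timedelta(days=150),
)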
Something else we noticed while debugging: fetch_bugs stops running when we change the last-updated date to a date in the past. That means more bugs are requested and the fetch_bugs execution time is longer. Could this cause some sort of timeout?
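(One way to test the timeout hypothesis is to bound each Bugzilla API request explicitly. A rough Python sketch; the endpoint and parameter names are assumptions, not the actual fetch_bugs code.)

import requests

# Request only bugs changed since the stored last-updated date, in small
# pages, so a backdated timestamp cannot blow up a single request.
resp = requests.get(
    "https://bugzilla.mozilla.org/rest/bug",   # assumed endpoint
    params={"last_change_time": "2013-10-01",  # assumed parameter names
            "limit": 100, "offset": 0},
    timeout=30,  # fail fast instead of hanging the cron indefinitely
)
resp.raise_for_status()
bugs = resp.json()["bugs"]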
:giorgos added some additional logging around this error, and we found that the "connection refused" is the result of the fetch_bugs command not using the BROKER_HOST defined in the settings/local.py file, but instead attempting to connect to localhost:

socket.error: [Errno 111] Connection refused 127.0.0.1 5672

to get this sorted, we will need to update the fetch_bugs command to use the celery broker details defined in the settings/local.py file.
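(The fix described above would look roughly like this: make the Django settings, and therefore BROKER_HOST/BROKER_PORT from settings/local.py, visible to celery before any task runs, instead of letting its default loader fall back to localhost. A minimal sketch assuming a django-celery-era setup; the settings module path is an assumption.)

import os

# Point celery at the Django settings before any task is imported;
# without this, the default loader warns about a missing 'celeryconfig'
# module and silently connects to localhost:5672.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "remo.settings")

from django.conf import settings

# Verify which broker the producer will actually connect to.
print(settings.BROKER_HOST, settings.BROKER_PORT)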
Flags: needinfo?(jgiannelos)
Thanks :cturra. I am opening a bug to track the changes to the fetch_bugs connection settings and adding it as a dependency of this one.
Flags: needinfo?(jgiannelos)
Depends on: 930467
Hi :cturra. After some debugging locally, I reproduced the connection error we were encountering, and we have pushed a possible fix to reps.allizom.org. At first glance the fetch_bugs cron doesn't seem to run properly, but it might be a case of a stalled process. Can you please check that? Also, even if there aren't any stalled processed, can you do a manual run to see if we get a different traceback to investigate? Thanks!
Flags: needinfo?(cturra)
s/processed/processes/g :)
it was stalled again and i forcefully killed the processes. lets see if this change sorts everything out :)
Flags: needinfo?(cturra)
i have been keeping my eye on the fetch_bugs crons and haven't seen any "hung" processes since the last one was killed. i am going to mark this as r/fixed, but please reopen if this creeps back up on us. i will continue to keep an eye on it as well.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Thanks for keeping an eye on it, Chris. BTW, is there a way to get alerted (e.g. by email) when a process hangs, and/or is it possible to auto-restart it?
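(On the alerting question: one low-tech option is a watchdog wrapper that kills a hung run and emails on failure. A hedged sketch; the timeout value, SMTP host, and addresses are placeholders.)

import smtplib
import subprocess
from email.message import EmailMessage

CMD = ["/data/engagement-stage/src/reps.allizom.org/remo/manage.py", "fetch_bugs"]

def alert(body):
    # SMTP host and addresses are placeholders, not real config.
    msg = EmailMessage()
    msg["Subject"] = "fetch_bugs watchdog"
    msg["From"] = "cron@example.com"
    msg["To"] = "oncall@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

try:
    # Kill the run after 10 minutes instead of leaving a hung process.
    subprocess.run(CMD, check=True, timeout=600)
except subprocess.TimeoutExpired:
    alert("fetch_bugs exceeded 10 minutes and was killed")
except subprocess.CalledProcessError as exc:
    alert("fetch_bugs exited with status %d" % exc.returncode)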
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard