Bug 929443 (Closed): fetch_bugs cron not running in reps.allizom.org
Opened 12 years ago • Closed 12 years ago
Categories: Infrastructure & Operations Graveyard :: WebOps: Community Platform (task)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: nemo-yiannis; Assignee: cturra
Description
It seems that the fetch_bugs cron is not running on reps.allizom.org.
:cturra ran it manually on stage and got the following traceback:
https://gist.github.com/cturra/7099571
The error indicates a celery connection error.
Assignee • Comment 1 • 12 years ago
as discussed in #remo-dev, the net flow from the engagement admin host is not open to the engagement stage celery host. i am going to file a bug with netops to get this sorted.
[cturra@engagementadm.private.phx1 settings]$ nc -zv engagement-celery1.stage.seamicro.phx1.mozilla.com 5672
nc: connect to engagement-celery1.stage.seamicro.phx1.mozilla.com port 5672 (tcp) failed: Connection timed out
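For reference, the same TCP probe can be reproduced from Python when nc is not handy; a minimal sketch, reusing the stage celery host and amqp port from the command above:

import socket

host = "engagement-celery1.stage.seamicro.phx1.mozilla.com"
port = 5672  # amqp

try:
    # Open and immediately close a TCP connection, like `nc -zv host 5672`.
    with socket.create_connection((host, port), timeout=5):
        print("connection to %s:%d succeeded" % (host, port))
except OSError as exc:
    print("connection to %s:%d failed: %s" % (host, port, exc))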
Assignee: server-ops-webops → cturra
OS: Linux → All
Hardware: x86_64 → All
Assignee • Comment 2 • 12 years ago
net flow has now been opened and i can manually connect successfully.
[cturra@engagementadm.private.phx1 ~]$ nc -zv -w5 engagement-celery1.stage.seamicro.phx1.mozilla.com 5672
Connection to engagement-celery1.stage.seamicro.phx1.mozilla.com 5672 port [tcp/amqp] succeeded!
additionally, i ran the stage `fetch_bugs` cron by hand and it completed without error.
[root@engagementadm.private.phx1 ~]# /usr/bin/flock -w 10 /var/lock/reps-stage-bugs /data/engagement-stage/src/reps.allizom.org/remo/manage.py fetch_bugs
/data/engagement-stage/src/reps.allizom.org/remo/vendor/lib/python/celery/loaders/default.py:64: NotConfigured: No 'celeryconfig' module found! Please make sure it exists and is available to Python.
"is available to Python." % (configname, )))
[root@engagementadm.private.phx1 ~]# echo $?
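The /usr/bin/flock -w 10 wrapper in that cron line is what keeps overlapping runs from piling up: it waits up to 10 seconds for an exclusive lock on /var/lock/reps-stage-bugs and bails out otherwise. A rough Python equivalent, for illustration only (the lock path comes from the cron entry above; the invocation at the end is hypothetical):

import fcntl
import subprocess
import sys
import time

LOCK_PATH = "/var/lock/reps-stage-bugs"  # lock file named in the cron line above
WAIT_SECONDS = 10                        # mirrors flock's -w 10

with open(LOCK_PATH, "w") as lock_file:
    deadline = time.time() + WAIT_SECONDS
    while True:
        try:
            # Non-blocking exclusive lock; raises if another run still holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            break
        except OSError:
            if time.time() >= deadline:
                sys.exit("another fetch_bugs run still holds the lock")
            time.sleep(0.5)
    # Hypothetical invocation; the real cron entry calls manage.py directly.
    subprocess.check_call(["python", "manage.py", "fetch_bugs"])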
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Reporter • Updated 12 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee • Comment 3 • 12 years ago
:nemo - is something still not working as expected?
Flags: needinfo?(jgiannelos)
Reporter • Comment 4 • 12 years ago
Reopened this bug because we have a problem with fetch_bugs on reps.allizom.org again: it has stalled.
:cturra Is it possible to do a manual run to check for possible errors? Also, can you think of any way to monitor the behavior of fetch_bugs?
Thanks!
Reporter • Updated 12 years ago
Flags: needinfo?(jgiannelos)
Assignee • Comment 5 • 12 years ago
curious. i am now seeing the same 'socket.error: [Errno 111] Connection refused' we saw yesterday, which i did *NOT* see earlier. when i manually connect to the celery node, i can establish a connection just fine:
[cturra@engagementadm.private.phx1 ~]$ nc -zv engagement-celery1.stage.seamicro.phx1.mozilla.com 5672
Connection to engagement-celery1.stage.seamicro.phx1.mozilla.com 5672 port [tcp/amqp] succeeded!
this host and port are what are defined in remo/settings/local.py (BROKER_HOST, BROKER_PORT). are other celery tasks in the stage environment working? any further insight into what the application is doing when getting this error would also be appreciated.
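For reference, a minimal sketch of what those broker settings might look like in remo/settings/local.py, using the old-style BROKER_* names mentioned above; the host and port come from this bug, every other value is a placeholder rather than the real configuration:

# remo/settings/local.py (sketch; only the host and port are from this bug)
BROKER_HOST = "engagement-celery1.stage.seamicro.phx1.mozilla.com"
BROKER_PORT = 5672
BROKER_USER = "reps"         # placeholder
BROKER_PASSWORD = "secret"   # placeholder
BROKER_VHOST = "reps"        # placeholder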
Flags: needinfo?(jgiannelos)
Reporter • Updated 12 years ago
Flags: needinfo?(jgiannelos)
Reporter • Comment 6 • 12 years ago
Other tasks that we have, apart from fetch_bugs, are the following:
* send_remo_mail
* send_generic_mail
* send_voting_mail
The first two are triggered by user input (a button on the site).
The last one uses the celery eta flag to send out email notifications at a specific point in time. There should be around 30 tasks scheduled to run in the future (at least 5 months away). Right now, no active task should be running.
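A minimal sketch of that eta mechanism, assuming a hypothetical import path and arguments rather than the actual remo code:

from datetime import datetime, timedelta

from remo.voting.tasks import send_voting_mail  # hypothetical import path

voting_id = 42  # hypothetical identifier for the voting to notify about

# Queue the task now; celery delivers it at the eta, ~5 months in the future.
send_voting_mail.apply_async(args=[voting_id],
                             eta=datetime.utcnow() + timedelta(days=150))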
Further insight into what the application does when the fetch_bugs error is triggered:
* We run fetch_bugs, which loads bugs from the Bugzilla API.
* It adds some entries to our DB.
* These entries trigger send_voting_mail (a celery task).
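A very rough, self-contained outline of that flow; the Bugzilla endpoint and query parameters below are illustrative and not necessarily what fetch_bugs really uses, and in the real command the results become DB rows which in turn queue send_voting_mail:

from datetime import datetime, timedelta

import requests

# Placeholder for the "last updated" marker the command keeps between runs.
last_run = datetime.utcnow() - timedelta(days=1)

resp = requests.get(
    "https://bugzilla.mozilla.org/rest/bug",
    params={
        "product": "Mozilla Reps",                 # hypothetical filter
        "last_change_time": last_run.isoformat(),  # only bugs changed since then
    },
    timeout=30,
)
resp.raise_for_status()

for bug in resp.json().get("bugs", []):
    print(bug["id"], bug["summary"])  # stand-in for saving entries to the DB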
Regarding the celery setup, I assume the settings are correct, since we are using it to send automated emails via send_voting_mail (I tested this while typing).
I hope I don't distract you with project-specific functionality.
Reporter • Comment 7 • 12 years ago
Something else we noticed while debugging is that fetch_bugs stops running when we change the last-updated marker to a past date. That means more bugs are requested and the fetch_bugs execution time is longer.
Could this cause any sort of timeout?
Assignee • Comment 8 • 12 years ago
:giorgos added some additional logging around this error and we found that the "connection refused" is the result of the fetch_bugs command not using the BROKER_HOST defined in the settings/local.py file, but rather attempting to connect to localhost.
socket.error: [Errno 111] Connection refused 127.0.0.1 5672
to get this sorted, we will need to update the fetch_bugs command to use the celery broker details defined in the settings/local.py file.
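A sketch of the kind of change being described, written against the modern celery API rather than the django-celery setup of that era; the point is only that the broker URL should be built from the settings instead of falling back to celery's default of localhost:5672:

from celery import Celery
from django.conf import settings

app = Celery("remo")
# Build the broker URL from settings/local.py instead of letting celery
# fall back to its default of amqp://localhost:5672.
app.conf.broker_url = "amqp://%s:%s//" % (
    getattr(settings, "BROKER_HOST", "127.0.0.1"),
    getattr(settings, "BROKER_PORT", 5672),
)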
Flags: needinfo?(jgiannelos)
Reporter • Comment 9 • 12 years ago
Thanks :cturra.
I am opening a bug to track the changes to the fetch_bugs connection settings and will add it as a dependency of this one.
Flags: needinfo?(jgiannelos)
Reporter • Comment 10 • 12 years ago
Hi :cturra.
After some debugging locally, I reproduced the connection error we were encountering and we have pushed a possible fix to reps.allizom.org. At first glance the fetch_bugs cron doesn't seem to run properly, but it might just be a case of a stalled process.
Can you please check that? Also, even if there aren't any stalled processed can you do a manual run to see if we have a different traceback to investigate?
Thanks!
Flags: needinfo?(cturra)
Reporter • Comment 11 • 12 years ago
s/processed/processes/g :)
Assignee • Comment 12 • 12 years ago
it was stalled again and i forcefully killed the process. let's see if this change sorts everything out :)
Flags: needinfo?(cturra)
Assignee • Comment 13 • 12 years ago
i have been keeping my eye on the fetch_bugs crons and haven't seen any "hung" processes since the last one was killed. i am going to mark this as r/fixed, but please reopen if this creeps back up on us. i will continue to keep an eye on it as well.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 14 • 12 years ago
Thanks for keeping an eye on it, Chris.
BTW, is there a way to get alerted (e.g. by email) when a process hangs, and/or is it possible to auto-restart it?
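One possible approach, purely as an illustration (the addresses, command, and timeout below are made up and this is not what was deployed): wrap the cron command with a hard timeout and send an email whenever it fails or hangs.

import smtplib
import subprocess
from email.message import EmailMessage

try:
    subprocess.run(["python", "manage.py", "fetch_bugs"],
                   check=True,
                   timeout=15 * 60)  # abort the run if it hangs for over 15 minutes
except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
    msg = EmailMessage()
    msg["Subject"] = "fetch_bugs failed or hung on stage"
    msg["From"] = "cron@reps.allizom.org"   # placeholder addresses
    msg["To"] = "reps-dev@example.com"
    msg.set_content(str(exc))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)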
Updated • 7 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard