Closed Bug 774765 Opened 13 years ago Closed 13 years ago

celery jobs queue isn't showing data in ganglia

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

All
Other
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: willkg, Assigned: bkero)

Details

(Whiteboard: [triaged 20120824])

https://ganglia-phx1.mozilla.org/ganglia/graph_all_periods.php?c=sumo-web&h=rabbit-sumo&v=0&m=kitsune_prod&r=hour&z=small&jr=&js=&st=1338992640&vl=jobs&ti=kitsune_prod&z=large
According to that, we've had no jobs in the celery jobs queue for SUMO since 7/12/2012. That can't possibly be right. Can someone look into why it thinks there are no jobs in the job queue?
Assignee: server-ops-webops → cturra
i have reviewed the sumocelery1 server and celery seems to be actively working as expected. going to kick this over to server-ops to review the ganglia portion of this.
Assignee: cturra → server-ops
Component: Server Operations: Web Operations → Server Operations
QA Contact: cshields → phong
I don't know what "rabbit_kitsune_prod_celery_messages_unacknowledged" is graphing. Is "messages unacknowledged" the same thing as "jobs in the queue"? We had been using the "kitsune_prod" graph since "sumo" was switched to "sumo-web" and we'd been using that for a while, I think. That's definitely not working any more.
To clarify, we'd been looking at that one set of graphs (sumo-web -> kitsune_prod) because it showed the number of jobs waiting in the queue. I don't see any other graphs on https://ganglia-phx1.mozilla.org/ganglia/?c=sumo-web&h=rabbit-sumo&m=load_one&r=hour&s=by%20name&hc=4&mc=2 that suggest they show jobs in the queue. But that's what we need to watch, so if there's another way to see the number of jobs in the queue over time, that'd be super.
I had a chat with Jason about this. We checked up the server and "rabbit_kitsune_prod_celery_messages_unacknowledged" is the right graph to be watching out for and is current.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Huh. Well that seems poorly named. I never would have guessed that.
(In reply to Ashish Vijayaram [:ashish] from comment #5)
> I had a chat with Jason about this. We checked up the server and
> "rabbit_kitsune_prod_celery_messages_unacknowledged" is the right graph to
> be watching out for and is current.

I believe the sumo-web -> kitsune_prod graph is an older ganglia metric that contains the same data as 'rabbit_kitsune_prod_celery_messages_ready', i.e. jobs waiting in the queue. 'rabbit_kitsune_prod_celery_messages_unacknowledged' represents messages that are being processed by a worker and for which an acknowledgement of completion has not yet been received from the worker. This could also represent tasks that have failed to complete and are waiting to be reprocessed. [1]

I think one of the gmetrics, either the kitsune_prod graph or rabbit_kitsune_prod_celery*, should be disabled, as having both is confusing. I think rabbit_kitsune_prod_celery* is managed by the celery puppet module, so I would think that one should stay.

[1] http://www.rabbitmq.com/tutorials/tutorial-two-python.html
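To make the ready/unacknowledged distinction concrete, here is a minimal Python sketch that splits a queue listing into the two counts. It assumes the listing was produced by something like `rabbitmqctl list_queues name messages_ready messages_unacknowledged` (possibly with `-p kitsune_prod`; the vhost name is a guess based on the metric names). The function name `parse_queue_depths`, the queue name `celery`, and the numbers in `SAMPLE` are made up for illustration and are not data from this bug.

```python
from typing import Dict


def parse_queue_depths(listing: str) -> Dict[str, Dict[str, int]]:
    """Map queue name -> {"ready": n, "unacknowledged": m}.

    Expects whitespace-separated rows of: name, messages_ready,
    messages_unacknowledged. Rows that do not match are ignored.
    """
    depths: Dict[str, Dict[str, int]] = {}
    for line in listing.splitlines():
        parts = line.split()
        # Skip banner/footer lines such as "Listing queues ..." or "...done."
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        name, ready, unacked = parts
        depths[name] = {"ready": int(ready), "unacknowledged": int(unacked)}
    return depths


# Illustrative sample only; these values are invented, not taken from the bug.
SAMPLE = (
    "Listing queues ...\n"
    "celery\t12\t3\n"
    "sumocelery1.webapp.phx1.mozilla.com.celeryd.pidbox\t0\t0\n"
    "...done.\n"
)

if __name__ == "__main__":
    for queue, counts in parse_queue_depths(SAMPLE).items():
        # "ready" is what is waiting in the queue; "unacknowledged" is
        # what workers have taken but not yet acknowledged.
        print(queue, counts["ready"], counts["unacknowledged"])
```

In this reading, "jobs waiting in the queue" corresponds to the `ready` count only, which is the number the original kitsune_prod graph was being used for.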
So given that, 'rabbit_kitsune_prod_celery_messages_unacknowledged' is NOT the graph showing the number of things waiting in the queue, but 'rabbit_kitsune_prod_celery_messages_ready' is. If that's correct, aren't we back to where we started because the 'rabbit_kitsune_prod_celery_messages_ready' graphs are empty for the last two weeks? Just like the kitsune_prod graphs? Am I still confused?
I hate to be annoying, but I think it's worth reopening this because we still have the same problem.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I'll need some help from webops to understand what is happening here and what needs to be done. From ganglia's POV, the plugin being used emits stats for only two metrics: 'rabbit_kitsune_prod_celery_messages_ready' and 'rabbit_kitsune_prod_celery_messages_unacknowledged'. If the plugin needs to be modified, I'll be glad to help.
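The plugin itself is not shown in this bug, so the following is only a sketch of how the two metric names above could be assembled and handed to ganglia's `gmetric` command-line tool. The helper `gmetric_command`, the `rabbit_<vhost>_<queue>_<field>` naming scheme (inferred from the metric names), and the choice of `uint32` and `messages` for type and units are all assumptions.

```python
from typing import List


def gmetric_command(vhost: str, queue: str, field: str, value: int) -> List[str]:
    """Build (but do not run) a gmetric invocation for one queue counter.

    The naming scheme is inferred from the metric names in this bug and
    may not match what the real plugin does.
    """
    name = f"rabbit_{vhost}_{queue}_{field}"
    return [
        "gmetric",
        "--name", name,
        "--value", str(value),
        "--type", "uint32",      # assumed type for a message count
        "--units", "messages",   # assumed unit label
    ]


if __name__ == "__main__":
    # The zero values here are placeholders for whatever the broker reports.
    for field in ("messages_ready", "messages_unacknowledged"):
        print(" ".join(gmetric_command("kitsune_prod", "celery", field, 0)))
```

Passing the resulting list to `subprocess.run` on a host with gmond configured would publish the value; that step is left out here so the sketch has no side effects.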
Assignee: ashish → server-ops-webops
Status: REOPENED → NEW
Component: Server Operations → Server Operations: Web Operations
QA Contact: phong → cshields
Whiteboard: [pending triage]
Group: infra
Whiteboard: [pending triage] → [triaged 20120824]
Assignee: server-ops-webops → cshields
bkero was running a script to collect more detailed data and see if this was just consistently 0 due to the frequency of ganglia checks and fast clearing of the queue. Will punt this to him to see what he discovered.
Assignee: cshields → bkero
My script ran successfully and, over the course of several days, determined that the queue actually was empty.

[root@sumocelery1.webapp.phx1 tmp]# cat kitsune_prod.stats |grep sumocelery1|uniq
sumocelery1.webapp.phx1.mozilla.com.celeryd.pidbox 0
[root@sumocelery1.webapp.phx1 tmp]# cat kitsune_prod.stats |grep sumocelery1|wc -l
7040
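For anyone repeating this check, the same summary can be sketched in Python. This assumes each line of the stats file is a queue name followed by a message count, which is what the output above suggests; the collection script itself is not shown in this bug, and `summarize_samples` is a hypothetical helper name. Unlike `uniq`, which only collapses adjacent duplicates, the counter below tallies every distinct value across the whole file, and it skips lines whose last field is not a number.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize_samples(lines: Iterable[str], host: str) -> Tuple[int, Counter]:
    """Return (number of samples, tally of observed counts) for one host.

    A line is counted if it contains `host` and ends in an integer field.
    """
    total = 0
    values: Counter = Counter()
    for line in lines:
        if host not in line:
            continue  # same filter as `grep sumocelery1`
        parts = line.split()
        if len(parts) < 2 or not parts[-1].isdigit():
            continue  # ignore malformed lines
        total += 1
        values[int(parts[-1])] += 1
    return total, values


if __name__ == "__main__":
    # Assumed file name, taken from the shell session above.
    with open("kitsune_prod.stats") as stats:
        total, values = summarize_samples(stats, "sumocelery1")
    print(total, dict(values))
```

If the queue really was empty for the whole collection period, the tally would contain a single key, 0, with a count equal to the number of samples.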
I'm sorry--I should have updated this ages ago. The conclusion is two-fold:

1. when the graph says there are no jobs in the queue, there are actually no jobs in the queue despite our skepticism
2. the graph we were looking at before (kitsune_prod) is showing the correct thing

Given that, I'm going to close this bug out. I'm going to mark it as WORKSFORME since it turned out to be a non-issue. If that messes up your stats, though, feel free to mark it something different.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard