Closed
Bug 774765
Opened 13 years ago
Closed 13 years ago
celery jobs queue isn't showing data in ganglia
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: willkg, Assigned: bkero)
Details
(Whiteboard: [triaged 20120824])
https://ganglia-phx1.mozilla.org/ganglia/graph_all_periods.php?c=sumo-web&h=rabbit-sumo&v=0&m=kitsune_prod&r=hour&z=small&jr=&js=&st=1338992640&vl=jobs&ti=kitsune_prod&z=large
According to that, we've had no jobs in the celery jobs queue for SUMO since 7/12/2012.
That can't possibly be right.
Can someone look into why it thinks there are no jobs in the job queue?
Updated•13 years ago
Assignee: server-ops-webops → cturra
Comment 1•13 years ago
I have reviewed the sumocelery1 server, and celery seems to be actively working as expected. Going to kick this over to server-ops to review the ganglia portion of this.
Updated•13 years ago
Assignee: cturra → server-ops
Component: Server Operations: Web Operations → Server Operations
QA Contact: cshields → phong
Comment 2•13 years ago
Is https://ganglia-phx1.mozilla.org/ganglia/graph_all_periods.php?c=sumo-web&h=rabbit-sumo&v=0&m=rabbit_kitsune_prod_celery_messages_unacknowledged&r=hour&z=default&jr=&js=&st=1343741534&vl=messages&z=large what you're looking for? The graphs match up pretty well with the ones titled "kitsune_prod"...
Assignee: server-ops → ashish
Reporter
Comment 3•13 years ago
I don't know what "rabbit_kitsune_prod_celery_messages_unacknowledged" is graphing. Is "messages unacknowledged" the same thing as "jobs in the queue"?
We had been using the "kitsune_prod" graph since "sumo" was switched to "sumo-web" and we'd been using that for a while, I think. That's definitely not working any more.
Reporter
Comment 4•13 years ago
To clarify, we'd been looking at that one set of graphs (sumo-web -> kitsune_prod) because it showed the number of jobs waiting in the queue.
I don't see any other graphs on https://ganglia-phx1.mozilla.org/ganglia/?c=sumo-web&h=rabbit-sumo&m=load_one&r=hour&s=by%20name&hc=4&mc=2 that suggest they show jobs in the queue.
But that's what we need to watch, so if there's another way to see the number of jobs in the queue over time, that'd be super.
Comment 5•13 years ago
I had a chat with Jason about this. We checked the server, and "rabbit_kitsune_prod_celery_messages_unacknowledged" is the right graph to watch and is current.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Reporter
Comment 6•13 years ago
Huh. Well that seems poorly named. I never would have guessed that.
Comment 7•13 years ago
(In reply to Ashish Vijayaram [:ashish] from comment #5)
> I had a chat with Jason about this. We checked the server, and
> "rabbit_kitsune_prod_celery_messages_unacknowledged" is the right graph to
> watch and is current.
I believe the sumo-web -> kitsune_prod graph is an older ganglia metric that contains the same data as 'rabbit_kitsune_prod_celery_messages_ready', that is, jobs waiting in the queue.
'rabbit_kitsune_prod_celery_messages_unacknowledged' represents messages that are being processed by a worker and for which an acknowledgement of completion has not yet been received from the worker. This could also represent tasks that have failed to complete and are waiting to be reprocessed. [1]
I think one of the gmetrics, either the kitsune_prod graph or rabbit_kitsune_prod_celery*, should be disabled, since having both is confusing. I think rabbit_kitsune_prod_celery* is managed by the celery puppet module, so that is the one that should stay.
[1] http://www.rabbitmq.com/tutorials/tutorial-two-python.html
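A minimal sketch of how the two counters described in comment 7 could be read side by side, assuming whitespace-separated output in the shape produced by `rabbitmqctl list_queues name messages_ready messages_unacknowledged`. The sample text, the queue names, and the numbers are made up for illustration and are not taken from the SUMO servers; `parse_queue_counts` is a hypothetical helper, not part of the ganglia plugin.

```python
def parse_queue_counts(text):
    """Parse 'name ready unacked' lines into {name: (ready, unacked)}."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip banner or blank lines that are not 'name <int> <int>'.
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        counts[parts[0]] = (int(parts[1]), int(parts[2]))
    return counts


# Hypothetical sample output; queue names and numbers are invented.
SAMPLE = """Listing queues ...
celery\t12\t3
celeryd.pidbox\t0\t0
"""

counts = parse_queue_counts(SAMPLE)
ready, unacked = counts["celery"]
# 'ready' = jobs waiting in the queue; 'unacked' = delivered to a worker
# but not yet acknowledged.
print(f"waiting in queue: {ready}, in flight: {unacked}")
```

Under these assumptions, the "ready" column is the one that corresponds to "jobs waiting in the queue", while "unacknowledged" counts work already handed to a worker.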
Reporter
Comment 8•13 years ago
So given that, 'rabbit_kitsune_prod_celery_messages_unacknowledged' is NOT the graph showing the number of things waiting in the queue, but 'rabbit_kitsune_prod_celery_messages_ready' is.
If that's correct, aren't we back to where we started because the 'rabbit_kitsune_prod_celery_messages_ready' graphs are empty for the last two weeks? Just like the kitsune_prod graphs?
Am I still confused?
Reporter
Comment 9•13 years ago
I hate to be annoying, but I think it's worth reopening this because we still have the same problem.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 10•13 years ago
I'll need some help from webops to understand what is happening here and what needs to be done. From ganglia's POV, the plugin being used emits stats for only two metrics: 'rabbit_kitsune_prod_celery_messages_ready' and 'rabbit_kitsune_prod_celery_messages_unacknowledged'. If the plugin needs to be modified, I'll be glad to help.
Assignee: ashish → server-ops-webops
Status: REOPENED → NEW
Component: Server Operations → Server Operations: Web Operations
QA Contact: phong → cshields
Updated•13 years ago
Whiteboard: [pending triage]
Updated•13 years ago
Group: infra
Updated•13 years ago
Whiteboard: [pending triage] → [triaged 20120824]
Updated•13 years ago
Assignee: server-ops-webops → cshields
Comment 11•13 years ago
bkero was running a script to collect more detailed data and see if this was just consistently 0 due to the frequency of ganglia checks and fast clearing of the queue. Will punt this to him to see what he discovered.
Assignee: cshields → bkero
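The sampling concern in comment 11 can be illustrated with a small, self-contained sketch. It is not the script bkero ran; the timeline, the burst size, and the 15-second polling interval are all hypothetical values chosen only to show how a coarse poll can report zero while a short burst of jobs passes through the queue.

```python
def sample_depths(depth_by_second, interval):
    """Return the queue depths seen when polling every `interval` seconds."""
    return depth_by_second[::interval]


# Hypothetical timeline: 60 seconds of an empty queue, with a 3-second
# burst of 5 jobs starting at t=7.
timeline = [0] * 60
for t in range(7, 10):
    timeline[t] = 5

coarse = sample_depths(timeline, 15)  # polls at t=0, 15, 30, 45
fine = sample_depths(timeline, 1)     # polls every second

print("max depth seen by coarse polling:", max(coarse))
print("max depth seen by fine polling:", max(fine))
```

In this made-up timeline the burst falls entirely between two coarse polls, which is the kind of effect a more frequent collection script would rule in or out.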
Assignee
Comment 12•13 years ago
My script ran successfully and, over the course of several days, determined that the queue actually was empty.
[root@sumocelery1.webapp.phx1 tmp]# cat kitsune_prod.stats |grep sumocelery1|uniq
sumocelery1.webapp.phx1.mozilla.com.celeryd.pidbox 0
[root@sumocelery1.webapp.phx1 tmp]# cat kitsune_prod.stats |grep sumocelery1|wc -l
7040
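A rough Python equivalent of the two shell checks above, as a sketch only. It assumes each line of the stats file has the form "<queue name> <depth>", which is inferred from the single line of output shown; the in-memory list stands in for kitsune_prod.stats, and `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(lines, host):
    """Count samples per (queue name, depth) for lines mentioning `host`."""
    seen = Counter()
    for line in lines:
        parts = line.split()
        # Skip lines for other hosts or lines that are not 'name depth'.
        if host not in line or len(parts) != 2:
            continue
        seen[(parts[0], int(parts[1]))] += 1
    return seen


# Hypothetical stand-in for kitsune_prod.stats: the same sample repeated.
lines = ["sumocelery1.webapp.phx1.mozilla.com.celeryd.pidbox 0"] * 7040
summary = summarize(lines, "sumocelery1")

for (name, depth), n in summary.items():
    print(f"{name} depth={depth} samples={n}")
```

Unlike `uniq`, which only collapses adjacent duplicate lines, the counter would also reveal a non-zero depth that appears anywhere in the file, together with how often it occurred.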
Reporter
Comment 13•13 years ago
I'm sorry--I should have updated this ages ago.
The conclusion is two-fold:
1. when the graph says there are no jobs in the queue, there are actually no jobs in the queue despite our skepticism
2. the graph we were looking at before (kitsune_prod) is showing the correct thing
Given that, I'm going to close this bug out. I'm going to mark it as WORKSFORME since it turned out to be a non-issue. If that messes up your stats, though, feel free to mark it something different.
Status: NEW → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → WORKSFORME
Updated•12 years ago
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard