Closed Bug 643545 Opened 13 years ago Closed 12 years ago

graph socorro postgres server_status table with ganglia

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: ashish)

Socorro has a little built-in status page:
https://crash-stats.mozilla.com/status

This shows a small window of time (last hour) from this simple table:

"""
breakpad=> \d server_status
                                            Table "public.server_status"
         Column          |            Type             |                         Modifiers                          
-------------------------+-----------------------------+------------------------------------------------------------
 id                      | integer                     | not null default nextval('server_status_id_seq'::regclass)
 date_recently_completed | timestamp without time zone | 
 date_oldest_job_queued  | timestamp without time zone | 
 avg_process_sec         | real                        | 
 avg_wait_sec            | real                        | 
 waiting_job_count       | integer                     | not null
 processors_count        | integer                     | not null
 date_created            | timestamp without time zone | not null
Indexes:
    "server_status_pkey" PRIMARY KEY, btree (id)
    "idx_server_status_date" btree (date_created, id)
"""

This is fine for a small window of time, but ganglia would be a better tool for showing graphs over a longer period of time, like months or years. It'd be nice if this info was available right alongside all the other graphs we have in ganglia, too.

I think avg_process_sec, avg_wait_sec, waiting_job_count and processors_count would be handy to have in ganglia.
Assignee: server-ops → bkero
I should be able to integrate the stats you want from this into the pgstats.py that is collected anyway.

What query will get me these stats?
Rob,

Waiting job count is already reported in ganglia, I think.  The others would be new.  At least, the overall count.

Ben,

For the others, it's really easy; you just do one metric at a time:

SELECT avg(avg_process_sec) from server_status;
SELECT avg(avg_wait_sec) from server_status;
SELECT count(*) FROM server_status;

I think we already have a generic ganglia probe set up which runs arbitrary queries which return a number.
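
A probe of that kind (run a query, report the single number it returns) could look roughly like the Python sketch below. The function name `run_scalar_query` is hypothetical and is not taken from pgstats.py or the existing probe. An in-memory SQLite database with made-up values stands in for the breakpad PostgreSQL database so the example is self-contained; the same `execute`/`fetchone` pattern should apply to any DB-API connection.

```python
import sqlite3


def run_scalar_query(conn, sql):
    """Run a query expected to return one row with one numeric column.

    Returns the value as a float, or None if there is no row or the value is NULL.
    (Hypothetical helper, not part of any existing probe.)
    """
    cur = conn.cursor()
    try:
        cur.execute(sql)
        row = cur.fetchone()
    finally:
        cur.close()
    if row is None or row[0] is None:
        return None
    return float(row[0])


# Stand-in database: SQLite instead of PostgreSQL, sample values are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE server_status ("
    "id INTEGER PRIMARY KEY, avg_process_sec REAL, avg_wait_sec REAL)"
)
conn.executemany(
    "INSERT INTO server_status (avg_process_sec, avg_wait_sec) VALUES (?, ?)",
    [(2.0, 1.0), (4.0, 3.0)],
)

# One metric per query, as described above.
avg_process = run_scalar_query(conn, "SELECT avg(avg_process_sec) FROM server_status")
avg_wait = run_scalar_query(conn, "SELECT avg(avg_wait_sec) FROM server_status")
```

Returning None for a NULL result lets the caller skip a sample instead of reporting a misleading zero.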
Assignee: bkero → mpressman
I think this got misassigned. Giving to bkero.
Assignee: mpressman → bkero
Ashish is ramping up on ganglia and bkero isn't managing it anymore, so I'm going to punt this to Ashish.
Assignee: bkero → ashish
Shouldn't processors_count be <= 10?

breakpad=> SELECT count(*) FROM server_status;
 count
-------
 41354
(1 row)
Status: NEW → ASSIGNED
(In reply to Ashish Vijayaram [:ashish] from comment #5)
> Shouldn't processors_count be <= 10?
> 
> breakpad=> SELECT count(*) FROM server_status;
>  count
> -------
>  41354
> (1 row)

A row is inserted into server_status every 5 minutes, so you'd look at the "processors_count" column for the latest row to see how many processors were registered within the last 5 minutes.

Here is the table definition:

CREATE TABLE server_status (
    id integer NOT NULL,
    date_recently_completed timestamp with time zone,
    date_oldest_job_queued timestamp with time zone,
    avg_process_sec real,
    avg_wait_sec real,
    waiting_job_count integer NOT NULL,
    processors_count integer NOT NULL,
    date_created timestamp with time zone NOT NULL
);

It'd be nice to have long-term graphs of each column; you just need to pull the latest row every 5 mins.
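
Pulling the latest row could be sketched in Python as follows. The names `LATEST_STATUS_SQL` and `fetch_latest_status` are hypothetical, and an in-memory SQLite table with made-up values stands in for the real server_status table so the example runs on its own. The ordering on (date_created, id) is chosen to match the existing idx_server_status_date index.

```python
import sqlite3

# Select only the columns to be graphed, newest row first.
LATEST_STATUS_SQL = """
    SELECT avg_process_sec, avg_wait_sec, waiting_job_count, processors_count
    FROM server_status
    ORDER BY date_created DESC, id DESC
    LIMIT 1
"""


def fetch_latest_status(conn):
    """Return the newest server_status row as a dict keyed by column name,
    or None if the table is empty. (Hypothetical helper.)"""
    cur = conn.cursor()
    try:
        cur.execute(LATEST_STATUS_SQL)
        row = cur.fetchone()
        if row is None:
            return None
        names = [col[0] for col in cur.description]
    finally:
        cur.close()
    return dict(zip(names, row))


# Stand-in table: SQLite instead of PostgreSQL, all values are made up.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE server_status ("
    "id INTEGER PRIMARY KEY, avg_process_sec REAL, avg_wait_sec REAL, "
    "waiting_job_count INTEGER NOT NULL, processors_count INTEGER NOT NULL, "
    "date_created TEXT NOT NULL)"
)
demo.executemany(
    "INSERT INTO server_status (avg_process_sec, avg_wait_sec, "
    "waiting_job_count, processors_count, date_created) VALUES (?, ?, ?, ?, ?)",
    [
        (1.5, 0.5, 12, 10, "2000-01-01 00:00:00"),
        (2.5, 0.25, 7, 9, "2000-01-01 00:05:00"),
    ],
)

latest = fetch_latest_status(demo)
```

Keying the result by column name rather than position keeps the caller working if columns are later added or reordered.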
Thanks! These 3 metrics have been added as avg_process_sec, avg_wait_sec and processors_count.
Jun 13 07:00:12 tp-socorro01-master01 /usr/sbin/gmond[1698]: Unable to find the metric information for 'avg_process_sec'. Possible that the module has not been loaded.#012
Jun 13 07:00:12 tp-socorro01-master01 /usr/sbin/gmond[1698]: Unable to find the metric information for 'avg_wait_sec'. Possible that the module has not been loaded.#012
Jun 13 07:00:12 tp-socorro01-master01 /usr/sbin/gmond[1698]: Unable to find the metric information for 'processors_count'. Possible that the module has not been loaded.#012

Digging...
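
For context, gmond can only collect a metric that some loaded module declares. As far as I understand the gmond Python module interface, a module declares its metrics by returning descriptors from `metric_init`. The sketch below shows that shape for the three new metric names. It is not the actual pgstats.py: the `_latest` cache, the `get_value` callback, the group name and the descriptions are hypothetical placeholders, and a real module would fill the values from the database.

```python
# Hypothetical gmond Python module sketch, not the real pgstats.py.
# Placeholder values; a real module would refresh these from server_status.
_latest = {"avg_process_sec": 0.0, "avg_wait_sec": 0.0, "processors_count": 0}


def get_value(name):
    """Callback that gmond invokes with the metric name; returns its current value."""
    return _latest[name]


def metric_init(params):
    """Called once when the module loads; returns one descriptor per metric."""
    specs = [
        # (metric name, value type, printf-style format, units)
        ("avg_process_sec", "float", "%f", "seconds"),
        ("avg_wait_sec", "float", "%f", "seconds"),
        ("processors_count", "uint", "%u", "processors"),
    ]
    descriptors = []
    for name, value_type, fmt, units in specs:
        descriptors.append({
            "name": name,
            "call_back": get_value,
            "time_max": 600,           # assumed; rows arrive every 5 minutes
            "value_type": value_type,
            "units": units,
            "slope": "both",
            "format": fmt,
            "description": "server_status." + name,  # placeholder text
            "groups": "socorro",                     # assumed group name
        })
    return descriptors


def metric_cleanup():
    """Called when gmond shuts the module down; nothing to release in this sketch."""
    pass
```

The metric names in the module's .pyconf collection group have to match the `name` fields returned here; a mismatch, or a module that failed to load, is one plausible source of the message above, though the thread does not state the actual cause.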
I got the 3 new metrics added and working in [1]. The only metric not working there is jobs_in_queue, and from the looks of it, the query needs fixing:

breakpad=> SELECT * from jobs_in_queue;
ERROR:  attribute 8 has wrong type
DETAIL:  Table has type timestamp with time zone, but query expects timestamp without time zone.

[1] http://sp-admin01.phx1.mozilla.com/ganglia/?r=hour&cs=&ce=&m=&c=Socorro+Postgres&h=tp-socorro01-master01.phx1.mozilla.com&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS
(In reply to Ashish Vijayaram [:ashish] from comment #9)
> I got the 3 new metrics added and working in [1]. The only metric not
> working there is jobs_in_queue and from the looks of it, needs the query
> fixed:
> 
> breakpad=> SELECT * from jobs_in_queue;
> ERROR:  attribute 8 has wrong type
> DETAIL:  Table has type timestamp with time zone, but query expects
> timestamp without time zone.

Hmm, are you sure the above error is for that query? All jobs_in_queue has is a single bigint column:

breakpad=> \d jobs_in_queue
 View "public.jobs_in_queue"
 Column |  Type  | Modifiers 
--------+--------+-----------
 count  | bigint | 

 
> [1] http://sp-admin01.phx1.mozilla.com/ganglia/?r=hour&cs=&ce=&m=&c=Socorro+Postgres&h=tp-socorro01-master01.phx1.mozilla.com&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS

You could cast a column to just TIMESTAMP (which is without time zone), but if you're not using the column(s) in question, it's better to select only what you are using (this also protects your code from breaking if someone adds or reorders columns):

SELECT avg_process_sec, avg_wait_sec, processors_count FROM server_status;
Ok, so there's two issues here:

1) jobs_in_queue is broken.  I need to fix it.

2) I should write a view on top of server_status for ganglia to check.

I'll do both of those for the next release of socorro.
Assignee: ashish → josh
Depends on: 764468
Assignee: josh → ashish
QA Contact: mrz → jdow
Whiteboard: [blocked on rhelmer/berkus]
:selena provided updated SQL for jobs_in_queue and processors_count. All the ganglia graphs look good now.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard