Get data on the slowest DB queries in SUMO prod

Status: RESOLVED FIXED
Product: Data & BI Services Team
Component: DB: MySQL
Priority: --
Severity: minor
Reported: 7 years ago
Last modified: 4 years ago

People: (Reporter: rrosario, Assigned: sheeri)

Tracking: (Blocks: 2 bugs)

Details: (Whiteboard: [2014q1])

(Reporter)

Description

7 years ago
It would be awesome if we could get a report on the slowest queries we are making, especially those that happen frequently from the web app. I don't really care that much about those that happen from celery tasks, so if there is a way to separate them that would be splendid.
Assignee: server-ops → server-ops-database
Component: Server Operations → Server Operations: Database
(Assignee)

Updated

7 years ago
Assignee: server-ops-database → scabral
(Assignee)

Comment 1

7 years ago
Ricky, is there a way to tell which queries are from celery tasks?  Are they against a specific table/database, or do they have some kind of comment in them that identifies them as celery tasks?
(Reporter)

Comment 2

7 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #1)
> Ricky, is there a way to tell which queries are from celery tasks?  Are they
> against a specific table/database, or do they have some kind of comment in
> them that identifies them as celery tasks?

They come from a different server (sumocelery1). Other than that, I don't think there is anything. If you can't filter out the ones that come from there, I should be able to figure it out on my end by looking at what the query is doing.
(Assignee)

Comment 3

7 years ago
OK, I'll see what I can do.  For now I've lowered the long query time from the default of 10 seconds to the more reasonable 2 seconds.  If that's not enough to get a lot of the slowest queries, I'll lower it again.
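(For reference, this kind of threshold change is usually a one-liner from a client session; a minimal sketch, assuming the value also gets written to /etc/my.cnf so it survives a restart:)

  -- Lower the slow query threshold to 2 seconds at runtime.
  SET GLOBAL long_query_time = 2;

  -- Confirm the running value.
  SHOW GLOBAL VARIABLES LIKE 'long_query_time';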
(Assignee)

Comment 4

7 years ago
Hrm, the master is not logging slow queries or errors. I fixed this brokenness in many /etc/my.cnf files a few days ago and double-checked -- the /etc/my.cnf on the sumo production master has the proper values now.

We'll need to schedule a reboot of the sumo master for these changes to take effect; it should take less than 5 minutes. When can we schedule this?

[The slave is fine; I'll be taking a look at that next and hopefully sending out some slow query reports.]
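(For reference, a quick way to confirm what the running server thinks -- a sketch using the 5.0-era variable names; actually enabling the logs still needs the my.cnf change plus the restart described above:)

  -- Is the slow query log enabled, and where does the error log go?
  SHOW GLOBAL VARIABLES LIKE 'log_slow_queries';
  SHOW GLOBAL VARIABLES LIKE 'log_error';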
(Reporter)

Comment 5

7 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #4)
> We'll need to schedule a reboot of the sumo master for these changes to
> take effect; it should take less than 5 minutes. When can we schedule this?

Good question. We would need to put the site in read-only mode for this. I'll need to consult the SUMO team for when a good time would be. In the meantime, data from the slaves would be very useful and might be most or all of what we are interested in.
(Assignee)

Comment 6

7 years ago
Amazing news:

Between Feb 1st, when I turned the slow query logging down to 2 seconds on the slave, and today, there have been 0 app queries that triggered the slow query log.

In fact, here is the entirety of the slow query log on the slave:

[root@tm-sumo01-slave01 mysql]# more mysql-slow.log 
# Time: 120201 13:32:07
# User@Host: root[root] @ localhost []
# Query_time: 3  Lock_time: 0  Rows_sent: 1  Rows_examined: 0
select sleep(3);
# Time: 120207 11:43:52
# User@Host: root[root] @ localhost []
# Query_time: 15  Lock_time: 0  Rows_sent: 1  Rows_examined: 0
select benchmark(100000000,1/1);

Both those were tests by me to make sure the slow query log was functioning as expected.

I have turned the slow query logging down to 1 second on sumo-slave.
(Assignee)

Comment 7

7 years ago
FWIW that was on slave01; slave02 looks like it has some sphinx queries that are long in duration.
(Reporter)

Comment 8

7 years ago
Yay! The sphinx queries will be going away soon once we are 100% elastic search and turn it off :-)
(Assignee)

Comment 9

7 years ago
slave02 and slave03 are similar, and I've turned down the logging on them too.  The sphinx queries were from Feb 2011, so they're not even relevant now.

There's a problem, though: with this version of MySQL (5.0 -- the current GA is 5.5, which has been GA for 14 months now), the minimum value of long_query_time is 1 second. If that doesn't catch enough queries, we'll have to upgrade to go lower. I'd want to upgrade anyway, so it's not a huge problem, but it does mean we can't get sub-second slow query info unless we upgrade.
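(Roughly what that limitation looks like in practice -- the 0.5-second value below is just an illustration, not something proposed in this bug:)

  -- On this 5.0 server, 1 second is the floor:
  SET GLOBAL long_query_time = 1;

  -- On newer versions (5.1 and up), fractional thresholds are accepted, e.g.:
  -- SET GLOBAL long_query_time = 0.5;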
Status: NEW → ASSIGNED
(Assignee)

Comment 10

7 years ago
This just in: the actual production boxes are a different set of machines (sorry, I was logging into the tm-sumo machines, not the tp-sumo machines). I'm turning those thresholds down now and will check back tomorrow. At first glance there are some slow queries, but not many.
(Assignee)

Comment 11

7 years ago
Ricky, I have some reports, and I just e-mailed one of them to you - subject is

Slow query log report: HOSTNAME for the 06 week of 2012 

(where the hostname is the fqdn for support2, one of the read slaves).

I can go through the format with you; there's a lot of information there. If that works for you, what do you think about a weekly report like that from each SUMO machine? There are 4 SUMO machines.....(or maybe a monthly report would be more realistic?)
(Reporter)

Comment 12

7 years ago
Sheeri, I like it! Weekly is fine. I definitely need some help understanding it all :-)
(Assignee)

Comment 13

6 years ago
I have run the reports by hand (you might see some e-mails), and I have set this to run at 3 am on Sundays. I'll close this if it's working OK next week.
(Assignee)

Comment 14

6 years ago
Hrm, this script does not seem to be working. :(

I just tried running it now via cron and didn't see any errors in /var/log/cron:

May 23 15:20:01 support1 crond[3486]: (*system*) RELOAD (/etc/cron.d/slow_query_report)
May 23 15:20:01 support1 crond[18786]: (root) CMD (/usr/lib64/sa/sa1 1 1)
May 23 15:20:01 support1 crond[18787]: (root) CMD ([ -f "$HPHEALTH_RESET_REQUEST" ] && [ -x "$HPHEALTH_RESET_SCRIPT" ] && rm -f "$HPHEALTH_RESET_REQUEST" && "$HPHEALTH_RESET_SCRIPT")

:(
(Assignee)

Comment 15

5 years ago
Putting a sane date on this, because by then we'll be on MySQL 5.6, which includes some great query-digesting features in PERFORMANCE_SCHEMA (and ps_helper).
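(As a rough sketch of the kind of digest query 5.6 enables -- the table and column names below are the standard performance_schema ones, nothing SUMO-specific:)

  -- Top statements by total time, from the 5.6 statement digest summary.
  -- sum_timer_wait is reported in picoseconds, hence the division.
  SELECT schema_name,
         digest_text,
         count_star,
         sum_timer_wait / 1000000000000 AS total_time_sec
  FROM performance_schema.events_statements_summary_by_digest
  ORDER BY sum_timer_wait DESC
  LIMIT 10;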
Whiteboard: [2014q1]

Comment 16

I spent most of the morning/early afternoon comparing data sources here (New Relic, the slow query logs, and performance_schema), and it looks like New Relic is winning. :)

- slow query log: it gave a few useful queries, but the majority of it was locked queries.
  - The delete from django_session that runs once a day might do better run a few times a day, if possible.

- performance_schema
  - events_waits_history was empty on all systems, and the summary tables showed very reasonable waits. I didn't get anything I found helpful from them, but these systems are running pretty optimally.

- New Relic
  - a bit more detailed info; I'll get into some query specifics below.


#1.) delete from django_sessions where expire_date < CURDATE() - INTERVAL 1 DAY could be updated to run more regularly. Right now it's run once a day and averages 21-25 seconds.

#2.) SELECT (extract( year FROM created )) AS `year`,
            (extract( day FROM created )) AS `day`,
            (extract( month FROM created )) AS `month`,
            COUNT(`wiki_helpfulvote`.`created`) AS `count`
     FROM `wiki_helpfulvote`
     WHERE (`wiki_helpfulvote`.`created` >= %s AND `wiki_helpfulvote`.`helpful` = %s)
     GROUP BY (extract( year FROM created )), (extract( day FROM created )), (extract( month FROM created ))
     ORDER BY NULL;

     -- This query extracts every single thing it can out of a datetime column. It took 30-45 seconds per execution because of the lack of index usage. There are better options here; my suggestion would be to implement one of the two below (B is the better option):
         A.) parse out the day/month/year in the app after retrieving just the date, and store the year as an extra column to satisfy the GROUP BY.
         B.) actually store the day/month/year in the database separately so that indexes can be used.
     -- This query currently does a full table scan, which can be resolved by adding an additional index after implementing either of the solutions above.
     -- Based on the other queries going across this table, I believe option B is the best option.
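As a rough illustration of option B (the column and index names below are made up, existing rows would need a backfill, and the app would have to populate the new columns on insert -- this is a sketch, not a tested migration):

  -- Store the date parts explicitly so an index can serve the report query.
  ALTER TABLE wiki_helpfulvote
      ADD COLUMN created_year  SMALLINT UNSIGNED NOT NULL,
      ADD COLUMN created_month TINYINT  UNSIGNED NOT NULL,
      ADD COLUMN created_day   TINYINT  UNSIGNED NOT NULL;

  -- Covers both the filter (helpful, created) and the selected date parts,
  -- so the query no longer has to touch the base rows.
  ALTER TABLE wiki_helpfulvote
      ADD INDEX idx_helpfulvote_helpful_created
          (helpful, created, created_year, created_month, created_day);

  -- The report query can then group on the stored parts instead of extract():
  SELECT created_year AS `year`, created_day AS `day`, created_month AS `month`,
         COUNT(*) AS `count`
  FROM wiki_helpfulvote
  WHERE created >= %s AND helpful = %s
  GROUP BY created_year, created_day, created_month
  ORDER BY NULL;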

Almost all other queries surveyed were sub-second and ran pretty happily. I was very impressed by the results from New Relic during this process. :)

We can talk about a few other potential optimizations, but I don't think their impact would justify their cost. One example: the auth_user table quite frequently gets queries that retrieve every single column in the table, and it has the highest throughput of any of your tables. If there were a way to reduce the number of columns being returned so that an index could cover the query, that could help where it's right for the individual use cases.
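For example (a sketch only; the column choices and whether a suitable index exists are assumptions, not anything verified against SUMO's schema), narrowing the select list is what makes an index-only read possible:

  -- Current pattern: every column comes back, so the full row must be read.
  SELECT * FROM auth_user WHERE username = %s;

  -- If the caller only needs the id, and there is an index on username, the
  -- lookup can be satisfied from the index alone (InnoDB secondary indexes
  -- implicitly carry the primary key).
  SELECT id FROM auth_user WHERE username = %s;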
(Assignee)

Updated

4 years ago
Blocks: 986011
(Assignee)

Updated

4 years ago
Blocks: 986016
(Assignee)

Updated

4 years ago
Blocks: 986017
(Assignee)

Comment 17

4 years ago
I've made three dependent bugs out of this one, so I'm closing this one out; the goal was to find issues and make recommendations for SUMO.
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team