It would be awesome if we could get a report on the slowest queries we are making, especially those that happen frequently from the web app. I don't really care that much about those that happen from celery tasks, so if there is a way to separate them that would be splendid.
Assignee: server-ops → server-ops-database
Component: Server Operations → Server Operations: Database
Ricky, is there a way to tell which queries are from celery tasks? Are they against a specific table/database, or do they have some kind of comment in them that identifies them as celery tasks?
(In reply to Sheeri Cabral [:sheeri] from comment #1)
> Ricky, is there a way to tell which queries are from celery tasks? Are they
> against a specific table/database, or do they have some kind of comment in
> them that identifies them as celery tasks?

They come from a different server (sumocelery1). Other than that, I don't think there is anything. If you can't filter out the ones that come from there, I should be able to figure it out on my end by looking at what the query is doing.
OK, I'll see what I can do. For now I've lowered the long query time from the default of 10 seconds to the more reasonable 2 seconds. If that's not enough to get a lot of the slowest queries, I'll lower it again.
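For reference, a threshold change like that can be made at runtime without a restart; a minimal sketch (the exact commands used aren't recorded in this bug):

SHOW GLOBAL VARIABLES LIKE 'long_query_time';
-- default is 10 seconds; lower the global threshold to 2 seconds
SET GLOBAL long_query_time = 2;
-- only new connections pick up the new global value; existing sessions keep the old one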
Hrm, the master is not logging slow queries, nor error logs. I fixed this brokenness in many /etc/my.cnf files a few days ago, and double-checked -- the /etc/my.cnf on the sumo production master has the proper values now. We'll need to schedule a reboot of the sumo master for these changes to take effect; it should take less than 5 minutes to do. When can we schedule this? [The slave is fine, and I'll be taking a look at that next, and hopefully giving out some slow query reports.]
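For context, slow query and error logging on a 5.0-era server is controlled from /etc/my.cnf along these lines; the file paths below are illustrative assumptions, not the actual values from the sumo master, and since these options are not dynamic in 5.0, a restart is needed for them to take effect:

[mysqld]
# log queries slower than long_query_time to this file (assumed example path)
log-slow-queries = /var/lib/mysql/mysql-slow.log
long_query_time  = 2
# send server errors to a dedicated error log (assumed example path)
log-error        = /var/log/mysqld.log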
(In reply to Sheeri Cabral [:sheeri] from comment #4)
> We'll need to schedule a reboot of the sumo master for these changes to
> take effect; it should take less than 5 minutes to do. When can we schedule
> this?

Good question. We would need to put the site in read-only mode for this. I'll need to consult the SUMO team for when a good time would be. In the meantime, data from the slaves would be very useful and might be most or all of what we are interested in.
Amazing news: between Feb 1st, when I turned the slow query logging down to 2 seconds on the slave, and today, there have been 0 app queries that triggered the slow query log. In fact, here is the entirety of the slow query log on the slave:

[root@tm-sumo01-slave01 mysql]# more mysql-slow.log
# Time: 120201 13:32:07
# User@Host: root[root] @ localhost
# Query_time: 3  Lock_time: 0  Rows_sent: 1  Rows_examined: 0
select sleep(3);
# Time: 120207 11:43:52
# User@Host: root[root] @ localhost
# Query_time: 15  Lock_time: 0  Rows_sent: 1  Rows_examined: 0
select benchmark(100000000,1/1);

Both of those were tests by me to make sure the slow query log was functioning as expected. I have turned the slow query logging down to 1 second on sumo-slave.
FWIW that was on slave01; slave02 looks like it has some sphinx queries that are long in duration.
Yay! The sphinx queries will be going away soon once we are 100% on elastic search and turn sphinx off :-)
slave02 and slave03 are similar, and I've turned down the logging on them too. The sphinx queries were from Feb 2011, so they're not even relevant now. There's a problem, though: on this version of MySQL (5.0; the current GA is 5.5, and it's been GA for 14 months now), the minimum value of long_query_time is 1 second. If a 1-second threshold doesn't catch any queries, we'll have to upgrade to get that info. I'd want to upgrade anyway, so it's not a huge problem, but it does mean we can't capture sub-second slow queries without an upgrade.
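To illustrate the 5.0 limitation (this is a sketch of the expected behaviour, not output captured from these servers):

-- On MySQL 5.0, long_query_time is an integer with a minimum of 1,
-- so anything lower is clamped and a truncation warning is raised:
SET GLOBAL long_query_time = 0;
SHOW GLOBAL VARIABLES LIKE 'long_query_time';  -- still reports 1

-- On MySQL 5.1 and later, fractional (sub-second) thresholds are accepted:
SET GLOBAL long_query_time = 0.5;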
Status: NEW → ASSIGNED
This just in: there are also the actual production boxes, which are a different set (sorry, I was logging into the tm-sumo machines, not the tp-sumo machines). Turning the slow query logging down on those now and will check back in tomorrow. At first glance there are some queries, but not many.
Ricky, I have some reports, and I just e-mailed one of them to you - the subject is "Slow query log report: HOSTNAME for the 06 week of 2012" (where the hostname is the FQDN for support2, one of the read slaves). I can go through the format with you; there's a lot of information there. If that works for you, what do you think about a weekly report like that from each SUMO machine? There are 4 SUMO machines... (or maybe a monthly report would be more realistic?)
Sheeri, I like it! Weekly is fine. I definitely need some help understanding it all :-)
I have run the reports by hand, so you might see some e-mails, and I have set this to run at 3 am on Sundays. Will close this if it's working OK next week.
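The cron entry itself isn't pasted in this bug; a hypothetical /etc/cron.d/slow_query_report matching that schedule would look something like this (the script path is a placeholder, not the real one):

# run the slow query report at 3 am every Sunday (placeholder script path)
0 3 * * 0 root /usr/local/bin/slow_query_report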
Hrm, this script does not seem to be working. :( I just tried running it now via cron and didn't see any errors in /var/log/cron:

May 23 15:20:01 support1 crond: (*system*) RELOAD (/etc/cron.d/slow_query_report)
May 23 15:20:01 support1 crond: (root) CMD (/usr/lib64/sa/sa1 1 1)
May 23 15:20:01 support1 crond: (root) CMD ([ -f "$HPHEALTH_RESET_REQUEST" ] && [ -x "$HPHEALTH_RESET_SCRIPT" ] && rm -f "$HPHEALTH_RESET_REQUEST" && "$HPHEALTH_RESET_SCRIPT")

:(
Putting a sane date on this, because by then we'll be on MySQL 5.6, which includes some great query digesting tools in PERFORMANCE_SCHEMA (and ps_helper).
I spent most of the morning/early afternoon comparing data sources here (New Relic, slow query logs, and performance_schema), and it looks like New Relic is winning. :)

- Slow query log: it gave a few useful queries, but the majority of it was locked queries. The delete from django_session that runs once a day might do better run a few times a day if possible.
- performance_schema: events_waits_history was empty on all systems, and the summary tables showed very reasonable waits, etc. I really didn't get anything I found helpful from them, but these systems are running pretty optimally.
- New Relic: a bit more detailed info, and I'll get into some query specifics below.

#1.) delete from django_session where expire_date < CURDATE() - INTERVAL 1 DAY could be updated to run more regularly. Right now it runs once a day and averages 21-25 seconds.

#2.) SELECT (extract( year FROM created )) AS `year`, (extract( day FROM created )) AS `day`, (extract( month FROM created )) AS `month`, COUNT(`wiki_helpfulvote`.`created`) AS `count` FROM `wiki_helpfulvote` WHERE (`wiki_helpfulvote`.`created` >= %s AND `wiki_helpfulvote`.`helpful` = %s ) GROUP BY (extract( year FROM created )), (extract( day FROM created )), (extract( month FROM created )) ORDER BY NULL;

-- This query extracts every single thing it can out of a datetime column. It took 30-45 seconds per execution because of the lack of index usage. There are better options here; my suggestion would be to implement one of the below (B is the better option):
A.) Parse out the day/month/year in the app after retrieving just the date, and store the year as an extra column to satisfy the GROUP BY.
B.) Actually store the day/month/year in the database separately so that indexes can be used.
-- This query is a full table scan, and that can be resolved by adding an additional index after implementing either of the solutions above (a rough sketch follows after this comment).
-- Based upon the other queries going across this table, I believe option B is the best option.

Almost all other queries surveyed were subsecond and ran pretty happily. I was really impressed by the results from New Relic during this process. :)

We can talk about a few other potential optimizations, but I don't think their impact will be as beneficial as their cost. One of the ones I saw: the auth_user table quite frequently gets queries that retrieve every single column in the table, and it has the highest throughput of any of your tables. If there were a way to reduce the number of columns being returned so that we could use an index, that could help if it were right for the individual use cases.
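To make option B concrete, here is a rough SQL sketch; the new column names, types, and index name are assumptions for illustration, not taken from the actual SUMO schema, and the application (or a trigger) would need to keep the new columns populated on insert:

ALTER TABLE wiki_helpfulvote
  -- store the date parts explicitly so they can be indexed (hypothetical names/types)
  ADD COLUMN created_year SMALLINT UNSIGNED NOT NULL,
  ADD COLUMN created_month TINYINT UNSIGNED NOT NULL,
  ADD COLUMN created_day TINYINT UNSIGNED NOT NULL,
  -- covering index for the report query: filter on helpful + created, read the parts from the index
  ADD INDEX idx_helpful_created_parts (helpful, created, created_year, created_month, created_day);

-- the report query can then group on stored columns instead of extract(), avoiding the full table scan
SELECT created_year AS `year`, created_day AS `day`, created_month AS `month`,
       COUNT(*) AS `count`
  FROM wiki_helpfulvote
 WHERE created >= %s AND helpful = %s
 GROUP BY created_year, created_day, created_month
 ORDER BY NULL;

Option A would keep the same query shape but store only the extra year column, with the day/month split done in the app after fetching the date.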
I've made three dependent bugs of this one, so I'm closing this one out, as the goal was to find and make recommendations for SUMO.
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team