Optimize top 10 by count and top 10 by slowness addons production queries

Status: RESOLVED FIXED
Priority: P1    Severity: normal
Opened: 6 years ago    Last modified: 4 years ago

People: Reporter: scabral; Assigned: scabral

Whiteboard: [2013q3] August

Description (Assignee, 6 years ago)

Optimize top 10 by count and top 10 by slowness addons production queries

Updated (Assignee, 5 years ago):
Priority: -- → P1
Whiteboard: [2013q3] → [2013q3] August

Comment 1 (Assignee, 5 years ago)

Set long_query_time to 0 on db1 and db2.mktdb.services.phx1.mozilla.com. I will come back tomorrow, set it back to 2 on each server, and analyze the logs. If you get a monitoring alert for configuration differences, this is why (we no longer monitor the servers...).
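For reference, the toggle described here boils down to two statements per server; a sketch (the 2-second value is the normal threshold mentioned above):

```sql
-- Capture window: log every query to the slow log.
SET GLOBAL long_query_time = 0;

-- After ~24 hours, restore the normal threshold (2 seconds):
SET GLOBAL long_query_time = 2;

-- Note: long_query_time is also a session variable; connections opened
-- during the capture window keep the value they started with until
-- they reconnect.
```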

Updated (Assignee, 5 years ago):
Assignee: server-ops-database → scabral

Comment 2 (Assignee, 5 years ago)

We had 42G of slow-query log on db2 before the 24 hours were up, so we stopped logging then (db1 had just under 500M). Processing the files now.

Comment 3 (Assignee, 5 years ago)

I processed the files in order of longest total response time. I am processing them by count right now, in the background, so we have both.

Starting with db1, I'll put each query in its own comment so they can be referenced by comment number.

Query 1 ran a total of only 5 times, but it's a long query: total exec time was 961 seconds.

The query is:
SELECT
  personas.persona_id AS id,
  translations.localized_string AS name,
  users.display_name AS author,
  personas.addon_id
FROM personas
INNER JOIN addons ON personas.addon_id = addons.id
LEFT OUTER JOIN translations ON addons.name = translations.id
LEFT OUTER JOIN addons_users ON addons.id = addons_users.addon_id
LEFT OUTER JOIN users ON addons_users.user_id = users.id;

It's run by the metrics user. Since it's read-only, it should move to a slave.
The EXPLAIN shows an index used for every table except personas, which gets a full table scan. That is appropriate: this query has no filter and genuinely needs every row in personas.

# Query 1: 0.00 QPS, 0.07x concurrency, ID 0x9E5E14F18DEFC48B at byte 5895087714
# This item is included in the report because it matches --limit.
# Scores: Apdex = 0.00 [1.0]*, V/M = 1.52
# Query_time sparkline: |       ^|
# Time range: 2013-08-19 15:23:14 to 19:22:59
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0       5
# Exec time     16    961s    167s    216s    192s    213s     17s    184s
# Lock time      0   733us    93us   285us   146us   273us    67us   119us
# Rows sent      3   1.74M 357.12k 357.15k 357.13k 345.04k       0 345.04k
# Rows examine   0   8.70M   1.74M   1.74M   1.74M   1.69M       0   1.69M
# Query size     0   3.10k     634     634     634     634       0     634
# String:
# Databases    addons_mozilla_org
# Hosts
# Users        metrics
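The access pattern described above can be confirmed with a plain EXPLAIN; a sketch (exact plan output omitted):

```sql
-- Expect type=ALL (full scan) for personas and an index access
-- (eq_ref/ref) for the other four tables; the full scan is fine here
-- because the query has no WHERE clause.
EXPLAIN
SELECT personas.persona_id AS id,
       translations.localized_string AS name,
       users.display_name AS author,
       personas.addon_id
FROM personas
INNER JOIN addons ON personas.addon_id = addons.id
LEFT OUTER JOIN translations ON addons.name = translations.id
LEFT OUTER JOIN addons_users ON addons.id = addons_users.addon_id
LEFT OUTER JOIN users ON addons_users.user_id = users.id;
```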

Comment 4 (Assignee, 5 years ago)

db1, query 2 by total query time, ran 303 times with a total of 581s running time.

Recommendation (can be done live, since adding/removing an index is non-blocking): eliminate the duplicate index on name and add an index on (name, status, addontype_id):

alter table addons_mozilla_org.addons add index(name,status,addontype_id), drop index addons_ibfk_2;

Note: even if you do not add the new index, there is no reason to keep the plain index on name, since there is already a UNIQUE index on name.

----
the query itself:
SELECT DISTINCT `addons`.`id` FROM `addons` INNER JOIN `translations` ON (`addons`.`name` = `translations`.`id`) WHERE ("en-us"="en-us" AND NOT (`addons`.`status` = 11 ) AND `translations`.`localized_string` = 'SmartTemplate4'  AND NOT (`addons`.`addontype_id` = 11 ))\G
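Before dropping addons_ibfk_2, the redundancy can be double-checked; a sketch (index names as given above):

```sql
-- List all indexes that cover `name`; if both the UNIQUE index and
-- addons_ibfk_2 lead with it, the plain index is redundant.
SHOW INDEX FROM addons_mozilla_org.addons WHERE Column_name = 'name';

-- The recommended change, as a single online statement:
ALTER TABLE addons_mozilla_org.addons
  ADD INDEX (name, status, addontype_id),
  DROP INDEX addons_ibfk_2;
```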

Comment 5 (Assignee, 5 years ago)

db1, query 3 ran almost a million times, total exec time 395s

# Time range: 2013-08-19 14:48:31 to 20:19:16
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          6  969939
# Exec time      6    395s   172us    12ms   407us   690us   178us   348us
# Lock time     17    144s    63us     3ms   148us   273us    66us   125us
# Rows sent      4   1.92M       0     512    2.07    6.98    5.62    0.99
# Rows examine   0   8.61M       0   2.20k    9.30   31.70   27.05    3.89
# Query size    25   1.25G   1.34k   4.34k   1.35k   1.33k   34.83   1.33k

This query is optimized and uses appropriate keys; the longest it ran was 12 ms. Being called nearly a million times in just under 6 hours seems like a lot, but if that frequency is expected, there is nothing to optimize in the query itself.

SELECT `applications_versions`.`id`, `applications_versions`.`application_id`, `applications_versions`.`version_id`, `applications_versions`.`min`, `applications_versions`.`max`, `applications`.`id`, `applications`.`created`, `applications`.`modified`, `applications`.`guid`, `applications`.`supported`, `versions`.`id`, `versions`.`created`, `versions`.`modified`, `versions`.`addon_id`, `versions`.`license_id`, `versions`.`releasenotes`, `versions`.`approvalnotes`, `versions`.`version`, `versions`.`version_int`, `versions`.`nomination`, `versions`.`reviewed`, `versions`.`has_info_request`, `versions`.`has_editor_comment`, `versions`.`deleted`, `versions`.`supported_locales`, `versions`.`_developer_name`, `appversions`.`id`, `appversions`.`created`, `appversions`.`modified`, `appversions`.`application_id`, `appversions`.`version`, `appversions`.`version_int`, T5.`id`, T5.`created`, T5.`modified`, T5.`application_id`, T5.`version`, T5.`version_int` FROM `applications_versions` 
INNER JOIN `versions` ON (`applications_versions`.`version_id` = `versions`.`id`) 
INNER JOIN `applications` ON (`applications_versions`.`application_id` = `applications`.`id`) 
INNER JOIN `appversions` ON (`applications_versions`.`min` = `appversions`.`id`) 
INNER JOIN `appversions` T5 ON (`applications_versions`.`max` = T5.`id`) 

WHERE `applications_versions`.`version_id` IN (1452032, 127489, 1244163, 1445390, 1332737, 1467402, 1419790, 1341955, 1462290, 129043, 113178, 688475, 1467240, 38493, 1245232, 1369137, 119866, 129084, 1295943, 1295944/*... omitted 187 items ...*/)\G

Comment 6 (Assignee, 5 years ago)

db1, query 4 has a similar profile to query 3: it ran half a million times, and its total execution time of 381 seconds is due to frequency, not slowness.

# Time range: 2013-08-19 14:48:32 to 20:19:16
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          3  499690
# Exec time      6    381s   479us     9ms   762us     1ms   240us   690us
# Lock time      9     79s    87us     4ms   157us   273us    52us   138us
# Rows sent      0 277.77k       0       1    0.57    0.99    0.49    0.99
# Rows examine   0  10.30M       1   3.40k   21.62   51.63  159.07    4.96
# Query size    14 750.47M   1.53k   1.54k   1.54k   1.53k    0.76   1.53k

The query joins 8 tables and includes a dependent subquery, but it uses primary keys (1 row max) for 6 of the 8 tables; of the remaining two, the dependent subquery examines only 1 row and the other examines 5, which is why this is a speedy query.

No further optimization recommended.

Comment 7 (Assignee, 5 years ago)

db1, query 5 is similar to queries 3 and 4 - almost a million calls, 308s total. No further optimization is recommended; this query uses appropriate indexes.

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          6  970046
# Exec time      5    308s   123us     7ms   317us   541us   151us   273us
# Lock time     13    113s    43us     6ms   116us   260us    66us    93us
# Rows sent      3   1.77M       0     390    1.92    5.75    4.68    0.99
# Rows examine   0   2.72M       0     770    2.94   11.95    9.32    0.99
# Query size    17 909.15M     972   3.95k  982.76  964.41   35.34  964.41

SELECT `files`.`id`, `files`.`created`, `files`.`modified`, `files`.`version_id`, `files`.`platform_id`, `files`.`filename`, `files`.`size`, `files`.`hash`, `files`.`codereview`, `files`.`jetpack_version`, `files`.`builder_version`, `files`.`status`, `files`.`datestatuschanged`, `files`.`no_restart`, `files`.`strict_compatibility`, `files`.`requires_chrome`, `files`.`reviewed`, `files`.`binary`, `files`.`binary_components`, `files`.`uses_flash`, `versions`.`id`, `versions`.`created`, `versions`.`modified`, `versions`.`addon_id`, `versions`.`license_id`, `versions`.`releasenotes`, `versions`.`approvalnotes`, `versions`.`version`, `versions`.`version_int`, `versions`.`nomination`, `versions`.`reviewed`, `versions`.`has_info_request`, `versions`.`has_editor_comment`, `versions`.`deleted`, `versions`.`supported_locales`, `versions`.`_developer_name` FROM `files` INNER JOIN `versions` ON (`files`.`version_id` = `versions`.`id`) WHERE `files`.`version_id` IN (100355, 1429508, 1300486, 1282058, 1237003, 61455, 1234966, 1445912, 1325082, 1325085, 22559, 72152, 1464326, 45095, 1437736, 1378354, 1282101, 53315, 20554, 127054/*... omitted 360 items ...*/)\G

Comment 8 (Assignee, 5 years ago)

I'm going to skip the rest on db1 for now, because ranking by total execution time is not particularly useful here; most of these queries are fine. I'm also processing the files by average time and will use those figures once they're done.

I will check out the first few queries on db2 ordered by total time.

Comment 9 (Assignee, 5 years ago)

Recommendation: add 2 more slaves so you are n+1.

db2, query 1 - avg query time 6 seconds.

This query is slow because it has to read rows from disk. When the rows were cached, the query took about 1.5 seconds; when they were not, it took 6 seconds. On an unloaded server, it took 1.7 seconds when the rows were not cached and 0.7 seconds when they were.

It looks like Marketplace has 4 slaves, in contrast to the old addons' 6 slaves. With 6 slaves we were n+1; with 4 slaves you are n-1, so the slaves are slower than they ought to be - by about 3.5 times.

# Time range: 2013-08-19 14:50:32 to 20:14:08
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0    8913
# Exec time     55  57931s   787ms     25s      6s     15s      5s      6s
# Lock time      0   724ms    45us   425us    81us    98us    20us    76us
# Rows sent      0   8.70k       1       1       1       1       0       1
# Rows examine  38   5.61G 660.32k 660.37k 660.34k 650.62k       0 650.62k
# Query size     0   3.98M     464     470  468.34  463.90       0  463.90


SELECT COUNT(*) FROM `addons` INNER JOIN `versions` ON (`addons`.`current_version` = `versions`.`id`) WHERE ("ru"="ru" AND NOT (`addons`.`status` = 11 ) AND (`versions`.`id` IS NOT NULL AND `addons`.`inactive` = 0  AND `addons`.`status` IN (4)) AND `addons`.`addontype_id` = 9  AND NOT (`addons`.`id` IN (9449, 199424, 258053, 261458, 263347, 316928, 324462, 326593, 328076, 337547, 339842, 346226, 354399, 361313, 361551, 365177, 366659, 371268, 429014, 450340)))\G

The relevant index on addons:
  KEY `visible_idx` (`addontype_id`,`status`,`inactive`,`current_version`),

Comment 10 (Assignee, 5 years ago)

db2, query 2 has a *wide* variation in run times: min 263us, median 53ms, avg 817ms, max 27s. This is likely due to the wide variance in rows examined. It does use appropriate indexes, but running the query on an unloaded server vs. a loaded one is very different, even when the data is in the cache.

In my tests, cached runs took 2.5-6 seconds on a loaded server and 1-1.5 seconds on an unloaded one. With cold caches, the loaded server took between 1.5 and 2 minutes, and the unloaded server between 30 seconds and 1.5 minutes.

So again, I'd recommend having more slaves to offload the servers.

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0    9899
# Exec time      7   8088s   263us     27s   817ms      4s      2s    53ms
# Lock time      0   881ms    52us   608us    88us   108us    24us    84us
# Rows sent      0  94.35k       1      15    9.76   14.52    5.35   11.95
# Rows examine   6 1019.78M       3 804.51k 105.49k 399.43k 157.24k   4.07k
# Query size     0   5.15M     535     566  545.76  537.02    6.40  537.02

SELECT `addons`.`id` FROM `addons` USE INDEX (`downloads_type_idx`) INNER JOIN `versions`  ON (`addons`.`current_version` = `versions`.`id`) INNER JOIN `addons_collections`  ON (`addons`.`id` = `addons_collections`.`addon_id`) WHERE ("ga-ie"="ga-ie" AND NOT (`addons`.`status` = 11 ) AND (`versions`.`id` IS NOT NULL AND `addons`.`inactive` = 0  AND `addons`.`status` IN (1, 3, 4, 6, 7, 8, 9, 10)) AND "ga-ie"="ga-ie" AND NOT (`addons`.`status` = 11 ) AND `addons_collections`.`collection_id` = 5196082 ) ORDER BY `addons`.`weeklydownloads` DESC LIMIT 6\G

Comment 11 (Assignee, 5 years ago)

Recommendation: see if this query can be run less frequently. It currently runs about once every 23 seconds.

db2, query 3 by total time - the query ran fewer than 1000 times, with an average time of 8 seconds (minimum 2 seconds).

# Time range: 2013-08-19 14:50:57 to 19:54:39
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0     747
# Exec time      6   6262s      2s     42s      8s     26s      8s      5s
# Lock time      0    51ms    41us     1ms    68us    84us    53us    63us
# Rows sent      0     747       1       1       1       1       0       1
# Rows examine   7   1.14G   1.56M   1.56M   1.56M   1.53M       0   1.53M
# Query size     0 181.15k     246     253  248.32  246.02    3.50  246.02

 SELECT COUNT(*) FROM `collections` LEFT OUTER JOIN `applications` ON (`collections`.`application_id` = `applications`.`id`) WHERE ("pt-pt"="pt-pt" AND `collections`.`listed` = 1  AND (`collections`.`application_id` = 1  OR `applications`.`id` IS NULL))\G

The EXPLAIN uses the primary key on the applications table and the "listed" key on the collections table. Using "listed" is appropriate because it's a secondary index on a clustered table and so has application_id available within it. Unfortunately, it only narrows the filter to about half the table, because the split is roughly even:

MariaDB [addons_mozilla_org]> select count(*),listed from collections group by listed;
+----------+--------+
| count(*) | listed |
+----------+--------+
|  1558490 |      0 |
|  1519169 |      1 |
+----------+--------+
2 rows in set (1.10 sec)

Unfortunately this query is as good as it's going to get without major changes (say, splitting into collections_listed and collections_unlisted tables and changing all the code to match, so this query can select from the appropriate table).

The query is looking for collections where application_id is either NULL or 1, so it's very likely looking for a special kind of application. Perhaps it could be run once every 5 minutes and cached, with the every-23-seconds caller reading from the cache (whether the cache is another db table, memcached, or something else).
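A sketch of the cache idea, assuming a new single-row summary table and the MySQL event scheduler (table and event names are hypothetical, not from the AMO codebase):

```sql
-- Hypothetical one-row cache table, refreshed every 5 minutes.
-- Requires event_scheduler=ON.
CREATE TABLE collections_listed_count (
  id  TINYINT UNSIGNED NOT NULL PRIMARY KEY,
  cnt BIGINT UNSIGNED  NOT NULL
);

CREATE EVENT refresh_collections_listed_count
ON SCHEDULE EVERY 5 MINUTE
DO
  REPLACE INTO collections_listed_count (id, cnt)
  SELECT 1, COUNT(*)
  FROM collections
  LEFT OUTER JOIN applications
    ON collections.application_id = applications.id
  WHERE collections.listed = 1
    AND (collections.application_id = 1 OR applications.id IS NULL);

-- The every-23-seconds caller becomes a primary-key point read:
SELECT cnt FROM collections_listed_count WHERE id = 1;
```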

Comment 12 (Assignee, 5 years ago)

db2, query 4: the recommendation is again to reduce the number of queries hitting this server.

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0    6157
# Exec time      3   3501s     4ms     18s   569ms      2s   982ms   230ms
# Lock time      0   697ms    71us   530us   113us   138us    27us   108us
# Rows sent      0  19.94k       1      30    3.32   20.43    6.10    0.99
# Rows examine  18   2.69G     492   1.39M 457.76k   1.03M 270.64k 345.04k
# Query size     0   4.22M     708     741  719.15  719.66   14.69  685.39

In my tests this takes between 1 and 3 seconds depending on load, even when cached, but it is using appropriate indexes. Note the rows-examined figures: sometimes it looks at an enormous number of rows (up to 1.39M).

Comment 13 (Assignee, 5 years ago)

I'm now looking at queries sorted by average time (pt-query-digest cannot do this, but mysqldumpslow can).

The first query on db1 is the same as in comment 3.

db1, query 2 based on average time is an UPDATE with no WHERE clause:

  UPDATE collections SET weekly_subscribers = N, monthly_subscribers = N

It ran once, taking 34 seconds.

# User@Host: marketplace_prod[marketplace_prod] @  [10.32.126.20]
# Thread_id: 5024886  Schema: addons_mozilla_org  QC_hit: No
# Query_time: 34.144670  Lock_time: 0.000092  Rows_sent: 0  Rows_examined: 3075083
SET timestamp=1376926240;
UPDATE collections SET weekly_subscribers = 0, monthly_subscribers = 0;

Assuming this "reset" is appropriate, it has to update 3 million rows with no filter, so no index can help; I think a 34-second query time is acceptable.
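If the table-wide write ever does become a problem (replication lag, long row locks), the reset could be chunked; a sketch, to be driven from a script that repeats the statement until ROW_COUNT() returns 0 (batch size is illustrative):

```sql
-- Each pass resets at most 10k rows that still need it, keeping row
-- locks short; the WHERE skips rows already reset on earlier passes.
UPDATE collections
   SET weekly_subscribers = 0, monthly_subscribers = 0
 WHERE weekly_subscribers <> 0 OR monthly_subscribers <> 0
 LIMIT 10000;
```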

Comment 14 (Assignee, 5 years ago)

db1, query 3 based on average time is another query that ran only once.

After testing, it looks like changing the addon_id index to a composite index gets at least a 2x performance improvement:

alter table stats_share_counts drop index addon_id, add index(addon_id,service,count);


Here's the query:
# Thread_id: 5044985  Schema: addons_mozilla_org  QC_hit: No
# Query_time: 11.784338  Lock_time: 0.000131  Rows_sent: 0  Rows_examined: 739233
SET timestamp=1376933417;
REPLACE INTO stats_share_counts_totals (addon_id, service, count)
                (SELECT addon_id, service, SUM(count)
                 FROM stats_share_counts
                 RIGHT JOIN addons ON addon_id = addons.id
                 WHERE service IN ('digg','facebook','delicious','myspace','friendfeed','twitter')
                 GROUP BY addon_id, service);
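The expected win can be checked with EXPLAIN on the inner SELECT: with the composite (addon_id, service, count) index, the stats_share_counts row of the plan should show "Using index", meaning the aggregation is served entirely from the index with no row lookups. A sketch:

```sql
EXPLAIN
SELECT addon_id, service, SUM(count)
FROM stats_share_counts
RIGHT JOIN addons ON addon_id = addons.id
WHERE service IN ('digg','facebook','delicious','myspace',
                  'friendfeed','twitter')
GROUP BY addon_id, service;

-- The proposed change from above:
ALTER TABLE stats_share_counts
  DROP INDEX addon_id,
  ADD INDEX (addon_id, service, count);
```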

Comment 15 (Assignee, 5 years ago)

db1, query 4 was the same as comment 11.

Comment 16 (Assignee, 5 years ago)

db1, query 5 is similar to comments 11 and 15. It only ran twice, with an average runtime of 2.54 seconds, so I'm not sure it's worth caching.

Count: 2  Time=2.54s (5s)  Lock=0.00s (0s)  Rows=20.0 (40), marketplace_prod[marketplace_prod]@[10.32.126.20]
  SELECT `collections`.`id` FROM `collections` LEFT OUTER JOIN `applications` ON (`collections`.`application_id` = `applications`.`id`) WHERE ("S"="S" AND `collections`.`listed` = N  AND (`collections`.`application_id` = N  OR `applications`.`id` IS NULL)) ORDER BY `collections`.`modified` DESC LIMIT N;

Comment 17 (Assignee, 5 years ago)

Went to the services meeting yesterday and presented the grand finding that the databases are overloaded. Jason Thomas said he'd work with the Services folks to order more hardware. (4 slaves instead of 6 were originally chosen because the Marketplace hardware is bigger/better than what Addons had.)

Updated (Assignee, 5 years ago):
Blocks: 912482, 912484, 912485, 912486

Comment 18 (Assignee, 5 years ago)

Resolving; the August goal of optimizing the addons queries has been met, and the resulting actions are in the child bugs.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team