Closed Bug 992900 Opened 11 years ago Closed 11 years ago

[shipping] slow queries in signoff view for more pushes

Categories

(Webtools Graveyard :: Elmo, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: Pike, Unassigned)

References

Details

(Keywords: perf, regression)

We have introduced some new slow queries; it'd be nice if we could get rid of those, as they're causing multi-second load times. From mysql-slow-log:

# Time: 140407 5:57:31
# User@Host: l10nuser[l10nuser] @ l10n-dashboard1.webapp.scl3.mozilla.com [10.22.81.129]
# Query_time: 2.702376 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 1724956
SET timestamp=1396875451;
SELECT COUNT(DISTINCT `life_push`.`id`)
FROM `life_push`
INNER JOIN `life_push_changesets` ON (`life_push`.`id` = `life_push_changesets`.`push_id`)
INNER JOIN `life_changeset` ON (`life_push_changesets`.`changeset_id` = `life_changeset`.`id`)
WHERE (`life_changeset`.`branch_id` = 1
  AND ((`life_push`.`push_date` >= '2014-03-17 21:50:00' AND `life_push`.`repository_id` = 907)
    OR (`life_push`.`push_date` <= '2014-03-17 21:50:00' AND `life_push`.`push_date` >= '2014-02-03 22:45:00' AND `life_push`.`repository_id` = 803)
    OR (`life_push`.`push_date` <= '2014-02-03 22:45:00' AND `life_push`.`push_date` >= '2013-12-09 19:30:00' AND `life_push`.`repository_id` = 58))
  AND `life_push`.`push_date` < '2014-04-06 05:27:26');

# Time: 140407 5:57:34
# User@Host: l10nuser[l10nuser] @ l10n-dashboard1.webapp.scl3.mozilla.com [10.22.81.129]
# Query_time: 2.644111 Lock_time: 0.000000 Rows_sent: 10 Rows_examined: 1724977
SET timestamp=1396875454;
SELECT DISTINCT `life_push`.`id`, `life_push`.`push_date`
FROM `life_push`
INNER JOIN `life_push_changesets` ON (`life_push`.`id` = `life_push_changesets`.`push_id`)
INNER JOIN `life_changeset` ON (`life_push_changesets`.`changeset_id` = `life_changeset`.`id`)
WHERE (`life_changeset`.`branch_id` = 1
  AND ((`life_push`.`push_date` >= '2014-03-17 21:50:00' AND `life_push`.`repository_id` = 907)
    OR (`life_push`.`push_date` <= '2014-03-17 21:50:00' AND `life_push`.`push_date` >= '2014-02-03 22:45:00' AND `life_push`.`repository_id` = 803)
    OR (`life_push`.`push_date` <= '2014-02-03 22:45:00' AND `life_push`.`push_date` >= '2013-12-09 19:30:00' AND `life_push`.`repository_id` = 58))
  AND `life_push`.`push_date` < '2014-04-06 05:27:26')
ORDER BY `life_push`.`push_date` DESC LIMIT 10;

These got introduced in bug 658656; one of the two is in https://github.com/mozilla/elmo/blob/master/apps/shipping/views/signoff.py#L180
Before we attack this in Django, can we perhaps try to figure out whether we're using indexes to maximum benefit? I tried running EXPLAIN on the queries, but I don't really understand the output. Perhaps we can needinfo Sheeri.
Sheeri, can you help us check that we're making sufficient use of indexes on the queries above (there are two of them, but one is just the COUNT of the other)? We have some ideas on the Django side, but first we thought we'd check that the queries are as fast as they can be. Sometimes an index can make a massive difference in speed. The prod DB is l10n_site and it's running on bm-l10n-db.
Flags: needinfo?(scabral)
This is what I get on my local mysql 5.6.16 https://gist.github.com/peterbe/10023990
Table definitions:

mysql> show create table life_push\G
*************************** 1. row ***************************
       Table: life_push
Create Table: CREATE TABLE `life_push` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `repository_id` int(11) NOT NULL,
  `user` varchar(200) NOT NULL,
  `push_date` datetime NOT NULL,
  `push_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `life_push_repository_id` (`repository_id`),
  KEY `life_push_user` (`user`),
  KEY `life_push_push_date` (`push_date`),
  CONSTRAINT `repository_id_refs_id_20591d59` FOREIGN KEY (`repository_id`) REFERENCES `life_repository` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=165767 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql> show create table life_changeset\G
*************************** 1. row ***************************
       Table: life_changeset
Create Table: CREATE TABLE `life_changeset` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `revision` varchar(40) NOT NULL,
  `user` varchar(200) NOT NULL,
  `description` longtext,
  `branch_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `revision` (`revision`),
  KEY `life_changeset_user` (`user`),
  KEY `life_changeset_branch_id` (`branch_id`),
  CONSTRAINT `branch_id_refs_id_2146ec18` FOREIGN KEY (`branch_id`) REFERENCES `life_branch` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=381470 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
Flags: needinfo?(scabral)
Both queries examine almost 200,000 rows in the life_changeset table, and they both use the life_changeset_branch_id key on branch_id; both queries look for branch_id=1. 305,536 entries have branch_id=1, but there are over 800 branch ids overall, so the index isn't a bad one. I've tested with several other indexes (for example, branch_id plus id) and I don't see anything easy to optimize there. The real problem is that searching for branch_id=1 matches tons of entries; the query only touches branch_id and id in the life_changeset table, and I've already tried indexes on those columns.
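[Editorial aside: a runnable sketch of the composite (branch_id, id) index Sheeri describes trying. sqlite3 is used only as a stand-in so the sketch executes; the production DB is MySQL/InnoDB, and the index name and toy data here are made up for illustration.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE life_changeset (
        id INTEGER PRIMARY KEY,
        revision TEXT,
        branch_id INTEGER NOT NULL
    )
""")
# Composite (branch_id, id) index: covers both columns the hot query touches,
# so the lookup never has to visit the table rows themselves.
conn.execute(
    "CREATE INDEX life_changeset_branch_id_id"
    " ON life_changeset (branch_id, id)")
conn.executemany(
    "INSERT INTO life_changeset (revision, branch_id) VALUES (?, ?)",
    [("rev%d" % i, 1 if i % 4 else 2) for i in range(1000)])

# The query plan shows the lookup is satisfied entirely from the index.
for row in conn.execute(
        "EXPLAIN QUERY PLAN"
        " SELECT id FROM life_changeset WHERE branch_id = 1"):
    print(row[3])
```

As Sheeri notes, though, a covering index doesn't rescue this particular workload: with so many rows matching branch_id=1, the index scan itself is the cost.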
So basically, joining against changesets that all have branch_id=1 is what's causing the excessive strain. See: https://gist.github.com/peterbe/10411869

So about 25% of the life_changeset records have a branch_id that is NOT 1. I wonder if we could instead select everything from life_push and somehow "post-process" in Django, filtering out pushes that come from the wrong changesets. Ideas, Pike?
Yeah, something like that. The non-default-branch landings are actually pretty common on the beta branch, so the logic should assume that it's gotta ignore a few. I wonder if we could just get a bunch of push IDs, and then kick off a second query asking which of those have changesets on the default branch. That might make us transfer less data back and forth; not sure.
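[Editorial aside: the two-step idea above can be sketched as follows. sqlite3 is used as a runnable stand-in for MySQL, with toy data invented for the example; in Elmo this would be two Django ORM queries rather than raw SQL.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE life_push (id INTEGER PRIMARY KEY, repository_id INT, push_date TEXT);
    CREATE TABLE life_changeset (id INTEGER PRIMARY KEY, branch_id INT);
    CREATE TABLE life_push_changesets (push_id INT, changeset_id INT);
""")
conn.executemany("INSERT INTO life_push VALUES (?, ?, ?)", [
    (1, 907, "2014-03-20"),   # push with a default-branch changeset
    (2, 907, "2014-03-25"),   # push whose only changeset is on another branch
    (3, 803, "2014-02-10"),   # different repository, outside the filter below
])
conn.executemany("INSERT INTO life_changeset VALUES (?, ?)",
                 [(10, 1), (11, 2), (12, 1)])
conn.executemany("INSERT INTO life_push_changesets VALUES (?, ?)",
                 [(1, 10), (2, 11), (3, 12)])

# Step 1: candidate push ids, filtered on life_push columns only -- no join,
# so this can use the repository_id/push_date indexes directly.
candidates = [row[0] for row in conn.execute(
    "SELECT id FROM life_push"
    " WHERE repository_id = 907 AND push_date >= '2014-03-17'")]

# Step 2: of those candidates, which have a changeset on the default branch?
# The join now only touches the handful of candidate pushes.
placeholders = ",".join("?" * len(candidates))
on_default = [row[0] for row in conn.execute(
    "SELECT DISTINCT pc.push_id"
    " FROM life_push_changesets pc"
    " JOIN life_changeset c ON c.id = pc.changeset_id"
    " WHERE c.branch_id = 1 AND pc.push_id IN (%s)" % placeholders,
    candidates)]
print(sorted(on_default))  # prints [1]: push 2 had no default-branch changeset
```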
Here's one weird thing I found: just the join makes the query slow, not actually selecting anything from it.

mysql> SELECT COUNT(DISTINCT `life_push`.`id`) FROM `life_push`
       INNER JOIN `life_push_changesets` ON (`life_push`.`id` = `life_push_changesets`.`push_id`)
       INNER JOIN `life_changeset` ON (`life_push_changesets`.`changeset_id` = `life_changeset`.`id`)
       WHERE (((`life_push`.`push_date` <= '2014-03-17 21:50:00' AND `life_push`.`push_date` >= '2014-02-03 22:45:00' AND `life_push`.`repository_id` = 829)
           OR (`life_push`.`push_date` <= '2014-02-03 22:45:00' AND `life_push`.`push_date` >= '2013-12-09 19:30:00' AND `life_push`.`repository_id` = 83)
           OR (`life_push`.`push_date` >= '2014-03-17 21:50:00' AND `life_push`.`repository_id` = 934)));
+----------------------------------+
| COUNT(DISTINCT `life_push`.`id`) |
+----------------------------------+
|                               13 |
+----------------------------------+
1 row in set (3.18 sec)

mysql> SELECT COUNT(DISTINCT `life_push`.`id`) FROM `life_push`
       WHERE (((`life_push`.`push_date` <= '2014-03-17 21:50:00' AND `life_push`.`push_date` >= '2014-02-03 22:45:00' AND `life_push`.`repository_id` = 829)
           OR (`life_push`.`push_date` <= '2014-02-03 22:45:00' AND `life_push`.`push_date` >= '2013-12-09 19:30:00' AND `life_push`.`repository_id` = 83)
           OR (`life_push`.`push_date` >= '2014-03-17 21:50:00' AND `life_push`.`repository_id` = 934)));
+----------------------------------+
| COUNT(DISTINCT `life_push`.`id`) |
+----------------------------------+
|                               13 |
+----------------------------------+
1 row in set (0.23 sec)

The other thing of interest: I did dive in to spot how we actually regressed this, and found that we had the slow query before. The regression bit is that we're now exercising it twice instead of once.
I think that branches and pushes are really essential in elmo, which might justify adding some redundancy to the data.

Would it help to add a ManyToMany between Branch and Push?

Right now, it's Push <- MM -> Changeset -> Branch, and we could add a copy as Push <- MM -> Branch.
(In reply to Axel Hecht [:Pike] from comment #8)
> Here's one weird thing I found:
>
> Just the join makes the query slow, not actually selecting anything from it.
>
> mysql> SELECT COUNT(DISTINCT `life_push`.`id`) FROM `life_push` INNER JOIN
> `life_push_changesets` ON (`life_push`.`id` =
> `life_push_changesets`.`push_id`) INNER JOIN `life_changeset` ON
> (`life_push_changesets`.`changeset_id` = `life_changeset`.`id`) WHERE
> (((`life_push`.`push_date` <= '2014-03-17 21:50:00' AND
> `life_push`.`push_date` >= '2014-02-03 22:45:00' AND
> `life_push`.`repository_id` = 829 ) OR (`life_push`.`push_date` <=
> '2014-02-03 22:45:00' AND `life_push`.`push_date` >= '2013-12-09 19:30:00'
> AND `life_push`.`repository_id` = 83 ) OR (`life_push`.`push_date` >=
> '2014-03-17 21:50:00' AND `life_push`.`repository_id` = 934 )));
> +----------------------------------+
> | COUNT(DISTINCT `life_push`.`id`) |
> +----------------------------------+
> | 13 |
> +----------------------------------+
> 1 row in set (3.18 sec)
>
> mysql> SELECT COUNT(DISTINCT `life_push`.`id`) FROM `life_push` WHERE
> (((`life_push`.`push_date` <= '2014-03-17 21:50:00' AND
> `life_push`.`push_date` >= '2014-02-03 22:45:00' AND
> `life_push`.`repository_id` = 829 ) OR (`life_push`.`push_date` <=
> '2014-02-03 22:45:00' AND `life_push`.`push_date` >= '2013-12-09 19:30:00'
> AND `life_push`.`repository_id` = 83 ) OR (`life_push`.`push_date` >=
> '2014-03-17 21:50:00' AND `life_push`.`repository_id` = 934 )) );
> +----------------------------------+
> | COUNT(DISTINCT `life_push`.`id`) |
> +----------------------------------+
> | 13 |
> +----------------------------------+
> 1 row in set (0.23 sec)

Wow! I remember digging deep into similar optimization "problems" in Postgres, though I suspect the preconditions were quite different. The tl;dr was that Postgres is very smart about figuring out which parts of a join won't be needed, unless you join on non-unique columns.
> The other thing of interest, I did dive in to actually spot how we regressed
> this, and found that we had the slow query before. The regression bit is
> that we're exercising it twice now, instead of once.

That's funny! The Django ORM is so darn handy that you're tempted to write very inefficient queries.
(In reply to Axel Hecht [:Pike] from comment #9)
> I think that branches and pushes are really essential in elmo, which might
> satisfy to add some redundancy in the data.
>
> Would it help to add a ManyToMany between Branch and Push?
>
> Right now, it's Push <- MM -> Changeset -> Branch, and we could add a copy
> as Push <- MM -> Branch.

That's what I thought too! Redundant, but it would definitely solve all of this. So something like:

class Push(models.Model):
    """stores context of who pushed what when

    Fields:
    repository -- repository changesets were pushed to
    user -- person who did the push
    push_date -- date and time of the push
    push_id -- unique id of the push
    """
    objects = PushManager()
    repository = models.ForeignKey(Repository)
    changesets = models.ManyToManyField(Changeset, related_name="pushes")
+   branches = models.ManyToManyField(Branch, related_name="pushes")
    user = models.CharField(max_length=200, db_index=True)
    push_date = models.DateTimeField('date of push', db_index=True)
    push_id = models.PositiveIntegerField(default=0)

So a push can definitely span multiple branches? Somehow that tastes strange, but I'm no hg expert.
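[Editorial aside: a minimal sketch of what that denormalization buys. The life_push_branches table name is hypothetical (it would be whatever Django generates for the new ManyToMany), sqlite3 stands in for MySQL, and the toy data is invented: the point is that the signoff query can then skip life_changeset and life_push_changesets entirely.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE life_push (id INTEGER PRIMARY KEY, repository_id INT, push_date TEXT);
    -- hypothetical join table for the proposed Push <-> Branch ManyToMany
    CREATE TABLE life_push_branches (push_id INT, branch_id INT);
""")
conn.executemany("INSERT INTO life_push VALUES (?, ?, ?)",
                 [(1, 907, "2014-03-20"), (2, 907, "2014-03-25")])
# Rows would be populated at push-ingestion time, from the branches of the
# push's changesets; here push 1 is on the default branch, push 2 is not.
conn.executemany("INSERT INTO life_push_branches VALUES (?, ?)",
                 [(1, 1), (2, 2)])

# A single two-table join replaces the three-table join in the slow query.
rows = conn.execute("""
    SELECT DISTINCT p.id FROM life_push p
    JOIN life_push_branches pb ON pb.push_id = p.id
    WHERE pb.branch_id = 1 AND p.repository_id = 907
""").fetchall()
print(rows)  # prints [(1,)]: only push 1 is on the default branch
```

The trade-off is the usual one for denormalization: the redundant table has to be kept in sync whenever pushes are ingested.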
Adding Brandon because he's pretty awesome at puzzles like this.
Data design info: both hg push and git push can affect multiple branches and/or bookmarks, though branches are rather different things in hg and git.

What's not explicit in this bug, but is the reason we made it blocked by bug 915732: we don't see this query being slow in our local testing with MySQL 5.6, so we'd like to see if an updated mysql would "just fix" this.
Resolving WORKSFORME, now that we're on the new dbs.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Product: Webtools → Webtools Graveyard