Closed Bug 1179223 Opened 9 years ago Closed 9 years ago

Determine why the stage mysql bin logs are 222GB, compared to 12GB on prod

Categories: Tree Management :: Treeherder: Infrastructure, defect, P1
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: emorley, Assigned: emorley
Attachments: 2 files

New Relic is saying time to full for the stage DB is 7 days.

Looking at total disk usage vs usage by the binary logs (used for replication)...

Stage:

[emorley@treeherder1.stage.db.scl3 ~]$ sudo su mysql
bash-4.1$ cd /data/mysql/
bash-4.1$ du -ch | tail -n 1
279G    total
bash-4.1$ du -ch treeherder1-bin.* | tail -n 1
222G    total

Prod:

[emorley@treeherder1.db.scl3 ~]$ sudo su mysql
bash-4.1$ cd /data/mysql/
bash-4.1$ du -ch | tail -n 1
649G    total
bash-4.1$ du -ch treeherder1-bin.* | tail -n 1
12G     total
bash-4.1$ du -ch *-bin.* | tail -n 1
12G     total
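The pattern in the du output above — binlogs dominating the data directory on stage but barely registering on prod — can be quantified with a short script. A minimal sketch; the parsing helper is hypothetical and the figures are hard-coded from the output above:

```python
def parse_du_total(line):
    """Parse the final 'NNNG  total' line emitted by `du -ch | tail -n 1`."""
    size, label = line.split()
    assert label == "total"
    units = {"K": 1 / 1024 ** 2, "M": 1 / 1024, "G": 1, "T": 1024}
    return float(size[:-1]) * units[size[-1]]  # size in GB

# Figures copied from the du output in this bug
stage_total, stage_binlogs = parse_du_total("279G total"), parse_du_total("222G total")
prod_total, prod_binlogs = parse_du_total("649G total"), parse_du_total("12G total")
print(f"stage binlog share: {stage_binlogs / stage_total:.0%}")  # ~80% of the volume
print(f"prod binlog share:  {prod_binlogs / prod_total:.0%}")    # ~2% of the volume
```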


This really doesn't make sense. My understanding is that the binary logs are a copy of all non-read operations, and that we keep them for 24 hours on both stage and prod (I've checked the timestamps on both). So in theory, prod and stage should have a similar volume of logs, since even though stage only keeps 45 days of data, they see similar activity/churn over any 24-hour period. 

The disk usage seems to have started rising as of the 29th June:
https://rpm.newrelic.com/accounts/677903/servers/6106894/disks?tw[end]=1435756079&tw[start]=1435151279#id=5b2253797374656d2f46696c6573797374656d2f5e646174612f557365642f6279746573222c2253797374656d2f4469736b2f5e6465765e73646231225d

...though they were pretty large prior to that.
Bug 1179228 may be the reason for the recent 50GB increase on stage, but it still doesn't explain why stage's bin logs are 15-20x the size of prod.
Depends on: 1179228
Depends on: 1179239
Assignee: nobody → emorley
Status: NEW → ASSIGNED
You can use the mysqlbinlog utility to see what's in each binary log. 

e.g.

mysqlbinlog treeherder1-bin.010654 > 10654.sql

You can also use the mysqldumpslow utility to get a report of the queries. Both mysqlbinlog and mysqldumpslow have good man pages. There's also pt-query-digest installed on the machine, if you'd like to use that, but I think for this case mysqldumpslow will be enough. (pt-query-digest has documentation on the web, but there's a bit of a learning curve).

I'd recommend:
mysqldumpslow -s c 10654.sql > report.sql

this will give you the queries by count. There are three possible reasons why stage is using more binary log space:
1) stage is doing more non-read queries
2) each non-read query is bigger on stage (i.e. the statements themselves are longer)
3) a combination of 1 and 2
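To distinguish case 1 (more statements) from case 2 (bigger statements), one approach is to tally both the count and the total bytes per SQL verb in the decoded dump. A rough sketch only — it assumes one statement per line, whereas real mysqlbinlog output is messier and would need smarter parsing:

```python
from collections import Counter

def tally(lines):
    """Count statements and total bytes per leading SQL verb."""
    counts, nbytes = Counter(), Counter()
    for line in lines:
        verb = line.lstrip().split(" ", 1)[0].upper()
        if verb in {"INSERT", "UPDATE", "DELETE", "REPLACE"}:
            counts[verb] += 1
            nbytes[verb] += len(line)
    return counts, nbytes

# Toy sample standing in for the decoded binlog dump
sample = [
    "UPDATE performance_series SET blob = '...' WHERE id = 1",
    "INSERT INTO job VALUES (1, 'pending')",
    "UPDATE performance_series SET blob = '...' WHERE id = 2",
]
counts, nbytes = tally(sample)
print(counts)  # more UPDATEs than INSERTs, and their byte totals
```

Comparing the two Counters between a stage dump and a prod dump would show whether the difference is statement count, statement size, or both.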
Thank you - that gives me somewhere to start :-)

That said, I struggle to believe there is a 15-20x difference between stage and prod (one that's been going on for some time).

Is there definitely no difference in the settings between stage and prod? (e.g. as a lay person looking at https://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html, there seem to be several options that could affect bin log size - hence filing bug 1179239)
The only setting that matters is "expire_logs_days" which is set to 1 on both hosts. You can change the *size* of the binlogs, but that won't change the *total size* of the logs over the course of the same period of time. Both are set to log in MIXED log format, and there's no ignore/do options set, meaning that all writes will be saved.

/etc/my.cnf is world readable, so you can verify this.
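As a sketch of that verification step, the relevant options can be read straight out of /etc/my.cnf. The excerpt below is hypothetical (real my.cnf files often contain includes and duplicate keys that configparser cannot handle, so treat this as illustrative only):

```python
import configparser

# Hypothetical /etc/my.cnf excerpt; the real file is world-readable on the hosts
MY_CNF = """
[mysqld]
expire_logs_days = 1
binlog_format = MIXED
"""
cfg = configparser.ConfigParser(allow_no_value=True)
cfg.read_string(MY_CNF)
print(cfg["mysqld"]["expire_logs_days"])  # "1" on both stage and prod
print(cfg["mysqld"]["binlog_format"])     # "MIXED" on both stage and prod
```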
Blocks: 1182171
As of now:

Prod:

[emorley@treeherder2.db.scl3 mysql]$ sudo du -ch | tail -n 1
749G    total
[emorley@treeherder2.db.scl3 mysql]$ du -ch treeherder*-bin.* | tail -n 1
111G    total

Stage:

[emorley@treeherder2.stage.db.scl3 mysql]$ sudo du -ch | tail -n 1
340G    total
[emorley@treeherder2.stage.db.scl3 mysql]$ du -ch treeherder*-bin.* | tail -n 1
281G    total
Generated using:

[emorley@treeherder2.stage.db.scl3 mysql]$ sudo mysqlbinlog treeherder2-bin.011044 > ~/dump.sql
[emorley@treeherder2.stage.db.scl3 mysql]$ mysqldumpslow -s c ~/dump.sql > ~/report.sql

Reading mysql slow query log from /home/emorley/dump.sql
[emorley@treeherder2.stage.db.scl3 mysql]$ head ~/report.sql -n 200
Another binlog, picked at random.


ps: Stage "time to full" 11 days, sigh.
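A "time to full" figure like New Relic's is, roughly, remaining space divided by the recent growth rate. A sketch with partly illustrative numbers — the 394G volume size appears later in this bug, but the ~5G/day net growth rate is an assumption:

```python
def days_to_full(total_gb, used_gb, growth_gb_per_day):
    """Naive linear estimate of days until the volume fills."""
    return (total_gb - used_gb) / growth_gb_per_day

# 394G /data volume, 340G used (figures from this bug), assumed ~5G/day growth
print(round(days_to_full(394, 340, 5)))  # -> 11
```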
Depends on: 1182201
Adding the Pythian data folks, as treeherder stage is complaining about disk space.
Both treeherder1 and 2 were nearly 97% full. For now I have purged the binlogs from before midnight:

[root@treeherder2.stage.db.scl3 mysql]# mysql -e "show master status\G"
*************************** 1. row ***************************
             File: treeherder2-bin.011077
         Position: 420880684
     Binlog_Do_DB: 
 Binlog_Ignore_DB: 
Executed_Gtid_Set: 
[root@treeherder2.stage.db.scl3 mysql]# ls -ltr treeherder2-bin.011077
-rw-rw---- 1 mysql mysql 508337115 Jul  9 18:56 treeherder2-bin.011077


mysql> pager grep Seconds_Behind
PAGER set to 'grep Seconds_Behind'
mysql> show slave status\G
        Seconds_Behind_Master: 2

		
mysql> PURGE BINARY LOGS BEFORE '2015-07-8 23:59:59';
Query OK, 0 rows affected (2.81 sec)

mysql> quit
Bye
[root@treeherder2.stage.db.scl3 mysql]# df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       394G  308G   66G  83% /data

Treeherder1:

[root@treeherder1.stage.db.scl3 mysql]# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 95680098
Server version: 5.6.17-log MySQL Community Server (GPL)

Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> PURGE BINARY LOGS BEFORE '2015-07-08 23:59:59';
Query OK, 0 rows affected (2.71 sec)

mysql> quit
Bye
[root@treeherder1.stage.db.scl3 mysql]# df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       394G  308G   67G  83% /data
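The purge procedure above — check Seconds_Behind_Master, then PURGE BINARY LOGS only up to a point the replica has already consumed — can be sketched as a small safety check. The helper below is hypothetical, not an existing tool; it assumes you feed it the sorted binlog file names plus the Relay_Master_Log_File value from SHOW SLAVE STATUS, and that Seconds_Behind_Master has been confirmed to be low first:

```python
def purgeable(binlogs, replica_file):
    """Return the binlogs strictly older than the one the replica is reading.

    binlogs: sorted list of binlog file names on the master
    replica_file: Relay_Master_Log_File from SHOW SLAVE STATUS
    """
    return binlogs[: binlogs.index(replica_file)]

logs = [
    "treeherder2-bin.011075",
    "treeherder2-bin.011076",
    "treeherder2-bin.011077",
]
print(purgeable(logs, "treeherder2-bin.011077"))  # only the two older files
```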
Depends on: 1182282
Depends on: 1182302
Paul, thank you for purging the binlogs - I'm hopeful bug 1182201 (which is now on stage, shortly to be on prod) will reduce the size of the UPDATEs in the binlog, and so stop this being a problem. Will continue to monitor tomorrow.
One of the strange things about this is that the binlogs on stage were so much larger than prod, even though:
(a) there is less data on stage (apart from performance_series, where we still keep up to a year on stage; bug 1182302)
(b) the changes not yet deployed to prod were fairly innocuous (https://github.com/mozilla/treeherder/compare/c9143dd...e160de2)

Will, is it possible that the perf data on stage is somehow different to prod? I know recently we had a few bugs that were caught on stage that caused duplicates signatures etc - could these have left behind additional rows that we're still including in the UPDATEs to performance_series (or similar) and so explain why the DB writes are larger on stage? ie: prod only saw |change+fixup| whereas stage saw |change+deploy,then fixup+deploy| ?

eg:
https://github.com/mozilla/treeherder/commit/8ff009642d1c97a8563cf8e25588b6753cd47b6a
https://github.com/mozilla/treeherder/commit/4fb738c6c12867e473b3dd053d06e5fb689d973d
https://github.com/mozilla/treeherder/commit/4bafb7c9c620f7b5b86185f47dcf5820744e9c51
Flags: needinfo?(wlachance)
Depends on: 1182485
(In reply to Ed Morley [:emorley] from comment #12)
> Will, is it possible that the perf data on stage is somehow different to
> prod? I know recently we had a few bugs that were caught on stage that
> caused duplicates signatures etc - could these have left behind additional
> rows that we're still including in the UPDATEs to performance_series (or
> similar) and so explain why the DB writes are larger on stage? ie: prod only
> saw |change+fixup| whereas stage saw |change+deploy,then fixup+deploy| ?

Well, even if we did have old signatures lying around, they wouldn't be updated (we calculate the signature uniquely each time we update). I did a quick check and it seems like we have the exact same number of performance signatures on stage and production (as you would expect):

http://nbviewer.ipython.org/url/people.mozilla.org/~wlachance/number%20of%20perf%20signatures%20stage%20vs%20prod.ipynb

There's definitely something weird going on... but I don't think this is it.
Flags: needinfo?(wlachance)
[emorley@treeherder2.stage.db.scl3 mysql]$ du -ch treeherder*-bin.* | tail -n 1
24G     total

Much better than the 280GB previously :-)
I've just vetted all of the blob fields we use - all of them are now compressing the json blobs, apart from the objectstore json_blob field. However the objectstore is going away in bug 1140349, so once that happens the size (and number) of the writes in the binlogs should drop slightly more.

Stage:

[emorley@treeherder2.stage.db.scl3 mysql]$ du -ch treeherder*-bin.* | tail -n 1
25G     total

Prod:

[emorley@treeherder2.db.scl3 mysql]$ du -ch treeherder*-bin.* | tail -n 1
23G     total
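The blob compression that shrank the binlogs amounts to deflating the JSON before it is written to the table, so each UPDATE carries far fewer bytes. A minimal sketch of the idea — the payload shape is made up, and Treeherder's actual column names and compression settings may differ:

```python
import json
import zlib

# Hypothetical payload standing in for a Treeherder job/performance blob
payload = {"job_id": 123, "result": "success", "steps": ["step"] * 50}
raw = json.dumps(payload).encode("utf-8")
packed = zlib.compress(raw, 9)

print(len(raw), len(packed))  # repetitive JSON compresses many-fold
assert json.loads(zlib.decompress(packed)) == payload  # lossless round trip
```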
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED