Offline defrag/failover for Treeherder stage and prod DBs

RESOLVED FIXED

Status

Data & BI Services Team
DB: MySQL
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: emorley, Unassigned)

Tracking

Details

(Reporter)

Description

3 years ago
We'd like to do a defrag of all tables on Treeherder stage and Treeherder prod, since we've significantly reduced data usage in several tables - particularly on prod.

The defrag of the larger tables (eg job_artifact & performance_artifact) will likely need to be run offline with a failover (particularly on prod).

The nodes in question are:

prod...
treeherder1.db.scl3.mozilla.com
treeherder2.db.scl3.mozilla.com

stage...
treeherder1.stage.db.scl3.mozilla.com
treeherder2.stage.db.scl3.mozilla.com

Note: For prod, replication is unable to keep up if the load balancer points at both the RW and RO nodes, so both the RW and RO VIPs point at the same machine (ie the slave is just a hot spare, rather than being used for reads). As such, once this bug is fixed, we'll need it returned to this stage, and not with the RW and RO VIPS set to different nodes.

Many thanks!
(Reporter)

Comment 1

3 years ago
s/returned to this stage/returned to this state/

(In reply to Ed Morley [:emorley] from comment #0)
> The defrag of the larger tables (eg job_artifact & performance_artifact)
> will likely need to be run offline with a failover (particularly on prod).

These two tables are under each of the *_jobs_* databases.
(Reporter)

Comment 2

3 years ago
What's the best way to surface this in Pythian's queue? :-)
Sorry for the radio silence. Best way is to cc team73@pythian.com, which I've done here.

One question is, for how long have you reduced data usage? When's the best time to defrag for optimal use? (for instance, treeherder lifecycle is 6 months, though you may have changed that, we don't have details on that...but if it's 6 months, should we do it every month for 6 months? every 2 months? etc).
(Reporter)

Comment 4

3 years ago
Our prod data lifecycle is 4 months and the data reductions were made > 4 months ago (bug 1161618 comment 0) so all data has now expired - so we only need to do this once. Thank you for CCing Pythian :-)
Aha, perfect!

Pythian - as for order of operations, defrag (OPTIMIZE) tables on treeherder2.stage.db.scl3.mozilla.com first, taking it out of the "ro" load balancer VIP, and making sure treeherder1.stage.db.scl3.mozilla.com is in the read-only load balancer VIP. Make sure to set sql_log_bin=0 so that the defrag does NOT replicate.

Then go through the failover process so treeherder1.stage is the slave, and repeat the process. Then stage will be defrag'd.

And then do the whole process again in production.
(Reporter)

Comment 6

3 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #5)
> Pythian - as for order of operations, defrag (OPTIMIZE) tables on
> treeherder2.stage.db.scl3.mozilla.com first, ...

Is there a timeline set for this? :-)
Flags: needinfo?(team73)

Comment 7

3 years ago
We can start as soon as the load balancer access is cleared :)
(Reporter)

Comment 8

3 years ago
(In reply to Pythian Team73 from comment #7)
> We can start as soon as the load balancer access is cleared :)

I'm guessing that is bug 1187355?
Depends on: 1187355

Comment 9

3 years ago
Sorry Ed, it says we are not authorized to see that bug
Flags: needinfo?(team73)
(Reporter)

Comment 10

3 years ago
That bug has now been duped to bug 1182029 (which is roughly about "try and wrangle Zeus configs so Pythian's LDAP access works, but probably needs upstream Zeus bugs fixed").
Depends on: 1182029
No longer depends on: 1187355
There's back and forth between Riverbed, the company that owns Zeus.

In the meantime, there are some folks who can update load balancers for you. I will update this bug first thing tomorrow with how that's done.

Comment 12

3 years ago
Ack. Thanks Sheeri. We're waiting for the issue to be cleared then.
For now, I have taken treeherder2.stage.db.scl3.mozilla.com out of the load balancer. Please proceed with defrag, remembering to make sure NOT to have anything replicate back to treeherder1.stage.
(Reporter)

Updated

3 years ago
No longer depends on: 1182029

Comment 14

3 years ago
Optimized tables on treeherder2.stage.db.scl3.mozilla.com
Space before optimalization:
mysql> \! df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  2.5G   32G   8% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sda1       969M   67M  852M   8% /boot
/dev/sdb1       394G   90G  285G  24% /data
After optimalization:
mysql> \! df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  2.5G   32G   8% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sda1       969M   67M  852M   8% /boot
/dev/sdb1       394G   75G  299G  21% /data
So, next step is to failover stage, correct? 

Ed, when's a good time to failover stage? If there's no good time tomorrow (Friday), I can offer up Saturday almost any time, as there's a big tree-closing window on Saturday and I'm working anyway.
(Reporter)

Comment 16

3 years ago
You can failover stage whenever you like - just update this bug at the time, so if/when we get New Relic alerts, we know it's a false alarm :-)

Comment 17

3 years ago
Hi Sheeri, 

I will be assisting you on this on Saturday.
Please let me know if we can schedule a time window for this process on Saturday.


Thanks,
Rohit Kumar
Failing over stage now.
Got caught up in a meeting and just got back to it. In the past 5 minutes the treeherder stage databases were failed over, so you can ignore recent newrelice errors.

For Pythian, feel free to defragment treeherder1.stage.db.scl3.mozilla.com, making sure not to have any commands replicate.

Rohit, it would be great if you can still help tomorrow (Saturday), so we can defragment the production slave while the trees are closed.

Comment 20

3 years ago
Hello Sheeri,

Rohit will do the treeherder1.stage tomorrow, his daytime (India timezone).
I will be available during the US daytime, so ping me when we should do the defrag on prod slave, I will be available on IRC, but if I dont reply, send us a mail, it beeps on my phone

Thank you
Janos
Acknowledged, will do.

Comment 22

3 years ago
Hello, 

treeherder1.stage.db.scl3.mozilla.com tables optimization is complete.
Please let us know if you see any issue.

Before Optimization
mysql> \! df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  4.7G   30G  14% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sda1       969M   67M  852M   8% /boot
/dev/sdb1       394G   88G  287G  24% /data


After Optimization :
mysql> \! df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  4.7G   30G  14% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sda1       969M   67M  852M   8% /boot
/dev/sdb1       394G   69G  305G  19% /data


Thanks
Rohit Kumar
Thanks Rohit! 

I have put treeherder1.stage.db.scl3.mozilla.com back into the load balancer as the master, treeherder2.stage is the read-only slave.
Janos - treeherder1.db.scl3.mozilla.com is out of the load balancer, and you can defrag it (this is production) any time. I'm in #db-alerts and #data if you want to check in.

Comment 25

3 years ago
Acknowledged, Will start working on it in ~10 minutes

Janos

Comment 26

3 years ago
Maintenance completed on backup4.db (treeherder-stage)

[root@backup4.db.scl3 treeherder_stage]# pwd
/data/treeherder_stage
[root@backup4.db.scl3 treeherder_stage]# du -shc .
40G	.
40G	total
[root@backup4.db.scl3 treeherder_stage]#

Comment 27

3 years ago
Maintenance completed on backup5.db (treeherder-prod)

Comment 28

3 years ago
Maintenance completed on treeherder1 (treeherder-prod)

[root@treeherder1.db.scl3 ~]# df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  6.8G   28G  20% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sda1       969M   67M  852M   8% /boot
/dev/sdb1       985G   81G  854G   9% /data
[root@treeherder1.db.scl3 ~]#
We still have to defrag treeherder2, but that's the master, so we'll need to fail back sometime.

Comment 30

3 years ago
Hi Sheeri,

Please let us know when it is (treeherder2) ready to defrag.

Thanks
Depends on: 1202781

Comment 31

3 years ago
Hello Sheeri

FYI we cannot access bug: 1202781

Regards
Janos

Comment 32

3 years ago
Hi Sheeri,

treeherder2 defragmentation complete.


mysql> select @@hostname, now(), @@read_only, @@sql_log_bin;
+---------------------------------+---------------------+-------------+---------------+
| @@hostname                      | now()               | @@read_only | @@sql_log_bin |
+---------------------------------+---------------------+-------------+---------------+
| treeherder2.db.scl3.mozilla.com | 2015-09-15 06:39:39 |           1 |             0 |
+---------------------------------+---------------------+-------------+---------------+

Before:

mysql> \! df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  8.6G   26G  25% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sdb1       985G  321G  614G  35% /data
/dev/sda1       969M   67M  852M   8% /boot


After :

mysql> \! df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  8.6G   26G  25% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sdb1       985G   67G  868G   8% /data
/dev/sda1       969M   67M  852M   8% /boot


Thanks,
Rohit Kumar
Thanx! This is all complete, resolving.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Reporter)

Comment 34

3 years ago
Many thanks :-)
You need to log in before you can comment on or make changes to this bug.