Bug 1190361 Opened 9 years ago Closed 9 years ago

Offline defrag/failover for Treeherder stage and prod DBs

Category: Data & BI Services Team :: DB: MySQL
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: emorley
Assignee: Unassigned

We'd like to do a defrag of all tables on Treeherder stage and Treeherder prod, since we've significantly reduced data usage in several tables - particularly on prod.

The defrag of the larger tables (e.g. job_artifact & performance_artifact) will likely need to be run offline with a failover (particularly on prod).
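
A quick way to gauge which tables are worth an offline OPTIMIZE is the DATA_FREE column in information_schema: space allocated to an InnoDB tablespace but no longer used, i.e. roughly what a defrag would reclaim. A sketch:

mysql> SELECT table_schema, table_name,
    ->        ROUND((data_length + index_length)/1024/1024/1024, 1) AS size_gb,
    ->        ROUND(data_free/1024/1024/1024, 1) AS reclaimable_gb
    ->   FROM information_schema.TABLES
    ->  WHERE table_name IN ('job_artifact', 'performance_artifact')
    ->  ORDER BY data_free DESC;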

The nodes in question are:

prod...
treeherder1.db.scl3.mozilla.com
treeherder2.db.scl3.mozilla.com

stage...
treeherder1.stage.db.scl3.mozilla.com
treeherder2.stage.db.scl3.mozilla.com

Note: For prod, replication is unable to keep up if the load balancer points at both the RW and RO nodes, so both the RW and RO VIPs point at the same machine (i.e. the slave is just a hot spare, rather than being used for reads). As such, once this bug is fixed, we'll need it returned to this state, and not with the RW and RO VIPs set to different nodes.
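
A quick way to watch for the lag described above, on the slave:

mysql> SHOW SLAVE STATUS\G

keeping an eye on Seconds_Behind_Master (NULL means replication is broken rather than caught up) and on Slave_IO_Running / Slave_SQL_Running, which should both say "Yes".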

Many thanks!

(In reply to Ed Morley [:emorley] from comment #0)
> The defrag of the larger tables (eg job_artifact & performance_artifact)
> will likely need to be run offline with a failover (particularly on prod).

These two tables are under each of the *_jobs_* databases.
What's the best way to surface this in Pythian's queue? :-)
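
For reference, the affected schemas can be enumerated with a pattern match (the underscores are backslash-escaped so they match literally rather than as single-character wildcards; exact schema names vary by deployment):

mysql> SHOW DATABASES LIKE '%\_jobs\_%';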
Sorry for the radio silence. Best way is to cc team73@pythian.com, which I've done here.

One question: how long ago did you reduce data usage, and when is the best time to defrag for optimal effect? (For instance, the Treeherder data lifecycle is 6 months, though you may have changed that and we don't have the details; if it is 6 months, should we defrag every month for 6 months? Every 2 months? etc.)
Our prod data lifecycle is 4 months and the data reductions were made > 4 months ago (bug 1161618 comment 0), so all of the old data has now expired and we only need to do this once. Thank you for CCing Pythian :-)
Aha, perfect!

Pythian - as for order of operations, defrag (OPTIMIZE) tables on treeherder2.stage.db.scl3.mozilla.com first, taking it out of the "ro" load balancer VIP, and making sure treeherder1.stage.db.scl3.mozilla.com is in the read-only load balancer VIP. Make sure to set sql_log_bin=0 so that the defrag does NOT replicate.

Then go through the failover process so treeherder1.stage is the slave, and repeat the process. Then stage will be defrag'd.

And then do the whole process again in production.
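
For reference, a minimal sketch of the defrag session itself, run in a single connection on whichever node is out of the load balancer (the schema name below is illustrative; the real tables live under the *_jobs_* databases mentioned above). sql_log_bin is a session variable, so the OPTIMIZEs below stay out of the binlog and do not replicate:

mysql> SET SESSION sql_log_bin = 0;
mysql> OPTIMIZE TABLE example_jobs_1.job_artifact;
mysql> OPTIMIZE TABLE example_jobs_1.performance_artifact;

For InnoDB, OPTIMIZE is mapped to a table rebuild, so the client prints "Table does not support optimize, doing recreate + analyze instead"; that's expected output, not an error.
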
(In reply to Sheeri Cabral [:sheeri] from comment #5)
> Pythian - as for order of operations, defrag (OPTIMIZE) tables on
> treeherder2.stage.db.scl3.mozilla.com first, ...

Is there a timeline set for this? :-)
Flags: needinfo?(team73)
We can start as soon as the load balancer access is cleared :)
(In reply to Pythian Team73 from comment #7)
> We can start as soon as the load balancer access is cleared :)

I'm guessing that is bug 1187355?
Sorry Ed, it says we are not authorized to see that bug
Flags: needinfo?(team73)
That bug has now been duped to bug 1182029 (which is roughly about "try and wrangle Zeus configs so Pythian's LDAP access works, but probably needs upstream Zeus bugs fixed").
There's back and forth with Riverbed, the company that owns Zeus.

In the meantime, there are some folks who can update load balancers for you. I will update this bug first thing tomorrow with how that's done.
Ack. Thanks Sheeri. We're waiting for the issue to be cleared then.
For now, I have taken treeherder2.stage.db.scl3.mozilla.com out of the load balancer. Please proceed with defrag, remembering to make sure NOT to have anything replicate back to treeherder1.stage.
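
Before kicking off the OPTIMIZE run, a sanity check along these lines confirms you're on the right host, that it's read-only to applications, and that session binlogging is off so nothing replicates back to treeherder1.stage:

mysql> SET SESSION sql_log_bin = 0;
mysql> SELECT @@hostname, NOW(), @@read_only, @@sql_log_bin;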
Optimized tables on treeherder2.stage.db.scl3.mozilla.com.
Space before optimization:
mysql> \! df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  2.5G   32G   8% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sda1       969M   67M  852M   8% /boot
/dev/sdb1       394G   90G  285G  24% /data
After optimization:
mysql> \! df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  2.5G   32G   8% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sda1       969M   67M  852M   8% /boot
/dev/sdb1       394G   75G  299G  21% /data
So, the next step is to fail over stage, correct?

Ed, when's a good time to fail over stage? If there's no good time tomorrow (Friday), I can offer up Saturday almost any time, as there's a big tree-closing window on Saturday and I'm working anyway.
You can fail over stage whenever you like - just update this bug at the time, so if/when we get New Relic alerts, we know it's a false alarm :-)
Hi Sheeri, 

I will be assisting you on this on Saturday.
Please let me know if we can schedule a time window for this process on Saturday.


Thanks,
Rohit Kumar
Failing over stage now.
Got caught up in a meeting and just got back to it. In the past 5 minutes the treeherder stage databases were failed over, so you can ignore the recent New Relic errors.

For Pythian, feel free to defragment treeherder1.stage.db.scl3.mozilla.com, making sure not to have any commands replicate.

Rohit, it would be great if you can still help tomorrow (Saturday), so we can defragment the production slave while the trees are closed.
Hello Sheeri,

Rohit will do treeherder1.stage tomorrow, during his daytime (India timezone).
I will be available during the US daytime, so ping me when we should do the defrag on the prod slave. I will be on IRC, but if I don't reply, send us a mail; it beeps on my phone.

Thank you
Janos
Acknowledged, will do.
Hello, 

Table optimization on treeherder1.stage.db.scl3.mozilla.com is complete.
Please let us know if you see any issues.

Before Optimization:
mysql> \! df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  4.7G   30G  14% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sda1       969M   67M  852M   8% /boot
/dev/sdb1       394G   88G  287G  24% /data


After Optimization:
mysql> \! df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  4.7G   30G  14% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sda1       969M   67M  852M   8% /boot
/dev/sdb1       394G   69G  305G  19% /data


Thanks
Rohit Kumar
Thanks Rohit! 

I have put treeherder1.stage.db.scl3.mozilla.com back into the load balancer as the master; treeherder2.stage is the read-only slave.
Janos - treeherder1.db.scl3.mozilla.com is out of the load balancer, and you can defrag it (this is production) any time. I'm in #db-alerts and #data if you want to check in.
Acknowledged, will start working on it in ~10 minutes.

Janos
Maintenance completed on backup4.db (treeherder-stage)

[root@backup4.db.scl3 treeherder_stage]# pwd
/data/treeherder_stage
[root@backup4.db.scl3 treeherder_stage]# du -shc .
40G	.
40G	total
[root@backup4.db.scl3 treeherder_stage]#
Maintenance completed on backup5.db (treeherder-prod)
Maintenance completed on treeherder1 (treeherder-prod)

[root@treeherder1.db.scl3 ~]# df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  6.8G   28G  20% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sda1       969M   67M  852M   8% /boot
/dev/sdb1       985G   81G  854G   9% /data
[root@treeherder1.db.scl3 ~]#
We still have to defrag treeherder2, but that's the master, so we'll need to fail back sometime.
Hi Sheeri,

Please let us know when it (treeherder2) is ready to defrag.

Thanks
Depends on: 1202781
Hello Sheeri

FYI, we cannot access bug 1202781.

Regards
Janos
Hi Sheeri,

treeherder2 defragmentation complete.


mysql> select @@hostname, now(), @@read_only, @@sql_log_bin;
+---------------------------------+---------------------+-------------+---------------+
| @@hostname                      | now()               | @@read_only | @@sql_log_bin |
+---------------------------------+---------------------+-------------+---------------+
| treeherder2.db.scl3.mozilla.com | 2015-09-15 06:39:39 |           1 |             0 |
+---------------------------------+---------------------+-------------+---------------+

Before:

mysql> \! df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  8.6G   26G  25% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sdb1       985G  321G  614G  35% /data
/dev/sda1       969M   67M  852M   8% /boot


After:

mysql> \! df -hP
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        37G  8.6G   26G  25% /
tmpfs           4.9G     0  4.9G   0% /dev/shm
/dev/sdb1       985G   67G  868G   8% /data
/dev/sda1       969M   67M  852M   8% /boot


Thanks,
Rohit Kumar
Thanks! This is all complete, resolving.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Many thanks :-)