Bug 1190361 (Closed) - Offline defrag/failover for Treeherder stage and prod DBs
Opened 10 years ago; Closed 9 years ago
Component: Data & BI Services Team :: DB: MySQL (task)
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: emorley; Assignee: Unassigned
We'd like to do a defrag of all tables on Treeherder stage and Treeherder prod, since we've significantly reduced data usage in several tables - particularly on prod.
The defrag of the larger tables (eg job_artifact & performance_artifact) will likely need to be run offline with a failover (particularly on prod).
The nodes in question are:
prod...
treeherder1.db.scl3.mozilla.com
treeherder2.db.scl3.mozilla.com
stage...
treeherder1.stage.db.scl3.mozilla.com
treeherder2.stage.db.scl3.mozilla.com
Note: For prod, replication is unable to keep up if the load balancer points at both the RW and RO nodes, so both the RW and RO VIPs currently point at the same machine (i.e. the slave is just a hot spare, rather than being used for reads). As such, once this bug is fixed, we'll need it returned to this stage, and not with the RW and RO VIPs set to different nodes.
Many thanks!
Reporter
Comment 1•10 years ago
s/returned to this stage/returned to this state/
(In reply to Ed Morley [:emorley] from comment #0)
> The defrag of the larger tables (eg job_artifact & performance_artifact)
> will likely need to be run offline with a failover (particularly on prod).
These two tables are under each of the *_jobs_* databases.
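To gauge how much space a rebuild might reclaim per table, a query along these lines could be run first; data_free is only an estimate for InnoDB, and the schema pattern below is an assumption based on the *_jobs_* naming mentioned above:

mysql> SELECT table_schema, table_name,
    ->        ROUND(data_length/1024/1024/1024, 1) AS data_gb,
    ->        ROUND(data_free/1024/1024/1024, 1) AS approx_reclaimable_gb
    ->   FROM information_schema.tables
    ->  WHERE table_schema LIKE '%\_jobs\_%'
    ->    AND table_name IN ('job_artifact', 'performance_artifact');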
Reporter
Comment 2•10 years ago
What's the best way to surface this in Pythian's queue? :-)
Comment 3•10 years ago
Sorry for the radio silence. Best way is to cc team73@pythian.com, which I've done here.
One question: how long ago did you reduce data usage, and when would be the best time to defrag for optimal effect? (For instance, the Treeherder data lifecycle is 6 months, though you may have changed that; we don't have details. But if it is 6 months, should we defrag every month for 6 months? Every 2 months? Etc.)
Reporter
Comment 4•10 years ago
Our prod data lifecycle is 4 months and the data reductions were made more than 4 months ago (bug 1161618 comment 0), so all of the pre-reduction data has now expired and we only need to do this once. Thank you for CCing Pythian :-)
Comment 5•10 years ago
Aha, perfect!
Pythian - as for order of operations, defrag (OPTIMIZE) tables on treeherder2.stage.db.scl3.mozilla.com first, after taking it out of the "ro" load balancer VIP and making sure treeherder1.stage.db.scl3.mozilla.com is in the read-only load balancer VIP. Make sure to set sql_log_bin=0 so that the defrag does NOT replicate.
Then go through the failover process so that treeherder1.stage is the slave, and repeat the defrag there. At that point stage will be fully defragged.
Then do the whole process again in production.
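A minimal sketch of the per-table defrag step, run in the maintenance session on the node that is out of the load balancer; the schema name is a placeholder, since comment 1 only notes that the artifact tables live under each of the *_jobs_* databases:

-- Keep the rebuild out of the binary log so it never replicates.
mysql> SET SESSION sql_log_bin = 0;
-- For InnoDB, OPTIMIZE TABLE does a recreate + analyze; the freed space is
-- returned to the filesystem when innodb_file_per_table is in effect (the
-- before/after df output later in this bug suggests it is here).
mysql> OPTIMIZE TABLE <jobs_db>.job_artifact;
mysql> OPTIMIZE TABLE <jobs_db>.performance_artifact;
-- Re-enable binary logging for anything else run in this session.
mysql> SET SESSION sql_log_bin = 1;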
Reporter
Comment 6•10 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #5)
> Pythian - as for order of operations, defrag (OPTIMIZE) tables on
> treeherder2.stage.db.scl3.mozilla.com first, ...
Is there a timeline set for this? :-)
Flags: needinfo?(team73)
Comment 7•10 years ago
We can start as soon as the load balancer access is cleared :)
Reporter
Comment 8•10 years ago
(In reply to Pythian Team73 from comment #7)
> We can start as soon as the load balancer access is cleared :)
I'm guessing that is bug 1187355?
Comment 9•10 years ago
Sorry Ed, it says we are not authorized to see that bug
Flags: needinfo?(team73)
Reporter
Comment 10•10 years ago
That bug has now been duped to bug 1182029 (which is roughly "try to wrangle the Zeus configs so that Pythian's LDAP access works", though it probably needs upstream Zeus bugs fixed).
Comment 11•10 years ago
There is ongoing back and forth with Riverbed, the company that owns Zeus.
In the meantime, there are some folks who can update load balancers for you. I will update this bug first thing tomorrow with how that's done.
Comment 12•10 years ago
Ack. Thanks Sheeri. We're waiting for the issue to be cleared then.
Comment 13•10 years ago
For now, I have taken treeherder2.stage.db.scl3.mozilla.com out of the load balancer. Please proceed with the defrag, making sure NOT to have anything replicate back to treeherder1.stage.
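Before kicking off the OPTIMIZE run, a quick sanity check in the maintenance session is worthwhile (it is the same check Pythian shows later in comment 32): confirm the session is on the intended host and that binary logging is off for that session.

mysql> SELECT @@hostname, @@read_only, @@sql_log_bin;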
Comment 14•9 years ago
Optimized tables on treeherder2.stage.db.scl3.mozilla.com
Space before optimization:
mysql> \! df -hP
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 37G 2.5G 32G 8% /
tmpfs 4.9G 0 4.9G 0% /dev/shm
/dev/sda1 969M 67M 852M 8% /boot
/dev/sdb1 394G 90G 285G 24% /data
After optimization:
mysql> \! df -hP
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 37G 2.5G 32G 8% /
tmpfs 4.9G 0 4.9G 0% /dev/shm
/dev/sda1 969M 67M 852M 8% /boot
/dev/sdb1 394G 75G 299G 21% /data
Comment 15•9 years ago
So, the next step is to fail over stage, correct?
Ed, when's a good time to fail over stage? If there's no good time tomorrow (Friday), I can offer up Saturday almost any time, as there's a big tree-closing window on Saturday and I'm working anyway.
Reporter
Comment 16•9 years ago
You can fail over stage whenever you like - just update this bug at the time, so that if/when we get New Relic alerts, we know it's a false alarm :-)
Comment 17•9 years ago
Hi Sheeri,
I will be assisting you on this on Saturday.
Please let me know if we can schedule a time window for this process on Saturday.
Thanks,
Rohit Kumar
Comment 18•9 years ago
Failing over stage now.
Comment 19•9 years ago
Got caught up in a meeting and just got back to it. In the past 5 minutes the treeherder stage databases were failed over, so you can ignore the recent New Relic errors.
For Pythian, feel free to defragment treeherder1.stage.db.scl3.mozilla.com, making sure not to have any commands replicate.
Rohit, it would be great if you can still help tomorrow (Saturday), so we can defragment the production slave while the trees are closed.
Comment 20•9 years ago
Hello Sheeri,
Rohit will do treeherder1.stage tomorrow during his daytime (India timezone).
I will be available during the US daytime, so ping me when we should do the defrag on the prod slave. I will be on IRC, but if I don't reply, send us an email; it beeps on my phone.
Thank you
Janos
Comment 21•9 years ago
Acknowledged, will do.
Comment 22•9 years ago
Hello,
Table optimization on treeherder1.stage.db.scl3.mozilla.com is complete.
Please let us know if you see any issues.
Before optimization:
mysql> \! df -hP
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 37G 4.7G 30G 14% /
tmpfs 4.9G 0 4.9G 0% /dev/shm
/dev/sda1 969M 67M 852M 8% /boot
/dev/sdb1 394G 88G 287G 24% /data
After optimization:
mysql> \! df -hP
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 37G 4.7G 30G 14% /
tmpfs 4.9G 0 4.9G 0% /dev/shm
/dev/sda1 969M 67M 852M 8% /boot
/dev/sdb1 394G 69G 305G 19% /data
Thanks
Rohit Kumar
Comment 23•9 years ago
Thanks Rohit!
I have put treeherder1.stage.db.scl3.mozilla.com back into the load balancer as the master; treeherder2.stage is the read-only slave.
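A check worth running on the slave before (and periodically after) putting the nodes back behind the load balancer, to confirm replication is healthy; the fields to watch are the standard ones for MySQL of this vintage:

-- On treeherder2.stage (the read-only slave), confirm Slave_IO_Running: Yes,
-- Slave_SQL_Running: Yes, and Seconds_Behind_Master at or near 0.
mysql> SHOW SLAVE STATUS\G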
Comment 24•9 years ago
Janos - treeherder1.db.scl3.mozilla.com is out of the load balancer, and you can defrag it (this is production) any time. I'm in #db-alerts and #data if you want to check in.
Comment 25•9 years ago
Acknowledged, will start working on it in ~10 minutes.
Janos
Comment 26•9 years ago
Maintenance completed on backup4.db (treeherder-stage)
[root@backup4.db.scl3 treeherder_stage]# pwd
/data/treeherder_stage
[root@backup4.db.scl3 treeherder_stage]# du -shc .
40G .
40G total
[root@backup4.db.scl3 treeherder_stage]#
Comment 27•9 years ago
Maintenance completed on backup5.db (treeherder-prod)
Comment 28•9 years ago
Maintenance completed on treeherder1 (treeherder-prod)
[root@treeherder1.db.scl3 ~]# df -hP
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 37G 6.8G 28G 20% /
tmpfs 4.9G 0 4.9G 0% /dev/shm
/dev/sda1 969M 67M 852M 8% /boot
/dev/sdb1 985G 81G 854G 9% /data
[root@treeherder1.db.scl3 ~]#
Comment 29•9 years ago
We still have to defrag treeherder2, but that's the master, so we'll need to fail back sometime.
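For reference, a minimal manual fail-back sketch, assuming no automated failover tooling; the binlog file name and position are placeholders taken from SHOW MASTER STATUS, and the team's actual runbook (plus the Zeus VIP change) may differ:

-- On the current master (treeherder2): stop accepting writes, note the binlog position.
mysql> SET GLOBAL read_only = 1;
mysql> SHOW MASTER STATUS;
-- On the slave (treeherder1): wait until it has applied events up to that position.
mysql> SELECT MASTER_POS_WAIT('mysql-bin.000123', 4);
-- Promote the slave: allow writes, then repoint the RW VIP at it.
mysql> SET GLOBAL read_only = 0;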
Comment 30•9 years ago
Hi Sheeri,
Please let us know when treeherder2 is ready to defrag.
Thanks
Comment 31•9 years ago
Hello Sheeri
FYI, we cannot access bug 1202781.
Regards
Janos
Comment 32•9 years ago
Hi Sheeri,
treeherder2 defragmentation complete.
mysql> select @@hostname, now(), @@read_only, @@sql_log_bin;
+---------------------------------+---------------------+-------------+---------------+
| @@hostname | now() | @@read_only | @@sql_log_bin |
+---------------------------------+---------------------+-------------+---------------+
| treeherder2.db.scl3.mozilla.com | 2015-09-15 06:39:39 | 1 | 0 |
+---------------------------------+---------------------+-------------+---------------+
Before:
mysql> \! df -hP
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 37G 8.6G 26G 25% /
tmpfs 4.9G 0 4.9G 0% /dev/shm
/dev/sdb1 985G 321G 614G 35% /data
/dev/sda1 969M 67M 852M 8% /boot
After:
mysql> \! df -hP
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 37G 8.6G 26G 25% /
tmpfs 4.9G 0 4.9G 0% /dev/shm
/dev/sdb1 985G 67G 868G 8% /data
/dev/sda1 969M 67M 852M 8% /boot
Thanks,
Rohit Kumar
Comment 33•9 years ago
Thanks! This is all complete, resolving.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Reporter
Comment 34•9 years ago
Many thanks :-)