Closed Bug 1455741 Opened 6 years ago Closed 6 years ago

Decommission buildbot machines buildbot-master74, master76, master94

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: apop, Assigned: zfay)

Details

The following buildbot master machines need to be decommissioned:

 buildbot-master76.bb.releng.use1.mozilla.com
 buildbot-master94.bb.releng.use1.mozilla.com
 buildbot-master74.bb.releng.usw2.mozilla.com
Assignee: nobody → zfay
sanity check slavealloc entry is removed
sanity check buildbot db entry is removed
sanity check nagios monitoring is removed

anything from: https://wiki.mozilla.org/ReleaseEngineering/How_To/Decommission_buildbot_masters

terminate aws instance
We've verified every step up to the deletion of the master from the DB.

We connected to relengwebadm.private.scl3 and then ran this command: mysql -u buildslaves -p -h devtools-rw-vip.db.scl3.mozilla.com -D buildslaves

When we ran: DELETE from masters WHERE masterid = 215;

it returned the following error: ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails (`buildslaves`.`slaves`, CONSTRAINT `slaves_ibfk_10` FOREIGN KEY (`current_masterid`) REFERENCES `masters` (`masterid`)).

Is there anything we are missing? From our initial debugging, we believe that the masters we tried to delete are linked in some form to the slaves, which causes the failure.

After debugging, we ran the following:
mysql> select * from slaves where current_masterid = '215';

which returned:
+---------+---------------------+----------+--------+---------+-----------+------+---------+-------+--------+---------------+-----------------+---------+------------------+-------+--------------+
| slaveid | name                | distroid | bitsid | speedid | purposeid | dcid | trustid | envid | poolid | basedir       | locked_masterid | enabled | current_masterid | notes | custom_tplid |
+---------+---------------------+----------+--------+---------+-----------+------+---------+-------+--------+---------------+-----------------+---------+------------------+-------+--------------+
|    2591 | bld-linux64-ec2-301 |       15 |      2 |       9 |         5 |   21 |       5 |     2 |     41 | /builds/slave |            NULL |       1 |              215 |       |         NULL |
|    2593 | bld-linux64-ec2-302 |       15 |      2 |       9 |         5 |   21 |       5 |     2 |     41 | /builds/slave |            NULL |       1 |              215 |       |         NULL |
|    2633 | bld-linux64-ec2-312 |       15 |      2 |       9 |         5 |   21 |       5 |     2 |     41 | /builds/slave |            NULL |       1 |              215 | NULL  |         NULL |
|    2643 | bld-linux64-ec2-317 |       15 |      2 |       9 |         5 |   21 |       5 |     2 |     41 | /builds/slave |            NULL |       1 |              215 | NULL  |         NULL |
|    2645 | bld-linux64-ec2-318 |       15 |      2 |       9 |         5 |   21 |       5 |     2 |     41 | /builds/slave |            NULL |       1 |              215 | NULL  |         NULL |
+---------+---------------------+----------+--------+---------+-----------+------+---------+-------+--------+---------------+-----------------+---------+------------------+-------+--------------+
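For anyone hitting the same error, here is a minimal sketch of why the delete fails and how to list the rows that block it. It uses an in-memory SQLite database as a stand-in for the MySQL `buildslaves` database, so the error type differs from MySQL's ERROR 1451, and the schema is reduced to the columns named in the error message (both are assumptions made for illustration). The helper name `try_delete_master` is hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL `buildslaves` database (assumption).
# Only the columns involved in the foreign key are modelled.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE masters (masterid INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE slaves ("
    " slaveid INTEGER PRIMARY KEY,"
    " name TEXT,"
    " current_masterid INTEGER REFERENCES masters(masterid))"
)
conn.execute("INSERT INTO masters VALUES (215)")
conn.execute("INSERT INTO slaves VALUES (2591, 'bld-linux64-ec2-301', 215)")
conn.commit()


def try_delete_master(conn, masterid):
    """Attempt the delete; return (deleted, names of slaves blocking it)."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DELETE FROM masters WHERE masterid = ?", (masterid,))
        return True, []
    except sqlite3.IntegrityError:
        # The parent row is still referenced; report which slaves point at it.
        rows = conn.execute(
            "SELECT name FROM slaves WHERE current_masterid = ? ORDER BY name",
            (masterid,),
        ).fetchall()
        return False, [row[0] for row in rows]


deleted, blockers = try_delete_master(conn, 215)
print(deleted, blockers)
```

The delete only succeeds once no row in `slaves` has `current_masterid` set to the master being removed.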
Flags: needinfo?(jlund)
As per :fubar's and :catlee's suggestions, we have reached the following decision:
1) We will move the existing slaves that are attached to the masters being decommissioned to new masters.
2) Each slave will move to a new master that is in the same Region and Pool as its old master.
3) We will first move 2 slaves per old master to the new master, give them some time, and see how they behave. If nothing bad happens, we will finish the transfer on Monday.
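
Step 3 above amounts to splitting each old master's slave list into a small trial batch and the remainder. A minimal sketch, assuming the slave names are held in a plain dict keyed by old master (the function name `split_canary` and the dict layout are hypothetical):

```python
def split_canary(slaves_by_master, canary_size=2):
    """Split each old master's slave list into a trial batch and the rest."""
    plan = {}
    for master, slaves in slaves_by_master.items():
        plan[master] = {
            "canary": slaves[:canary_size],     # moved first and observed
            "remaining": slaves[canary_size:],  # moved once the trial looks fine
        }
    return plan


example = {
    "buildbot-master76": [
        "try-linux64-spot-004",
        "try-linux64-spot-005",
        "try-linux64-spot-007",
        "try-linux64-spot-010",
    ],
}
plan = split_canary(example)
print(plan["buildbot-master76"]["canary"])
```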

@fubar / @catlee : Do you guys have any concerns with our suggestions? 

So we came up with the following "battle plan":

OLD Master: buildbot-master94.bb.releng.use1.mozilla.com 
NEW Master: buildbot-master77.bb.releng.use1.mozilla.com 
For the following slaves:
bld-linux64-ec2-301
bld-linux64-ec2-302
bld-linux64-ec2-312
bld-linux64-ec2-317
bld-linux64-ec2-318



OLD Master: buildbot-master74.bb.releng.usw2.mozilla.com 
NEW Master: buildbot-master73.bb.releng.usw2.mozilla.com 
For the following slaves:
bld-linux64-ec2-301	
bld-linux64-ec2-302	
bld-linux64-ec2-312	
bld-linux64-ec2-317	
bld-linux64-ec2-318


OLD Master: buildbot-master76.bb.releng.use1.mozilla.com 
NEW Master: buildbot-master75.bb.releng.use1.mozilla.com 
For the following slaves:
try-linux64-spot-004	
try-linux64-spot-005	
try-linux64-spot-007	
try-linux64-spot-010	
b-2008-ec2-0004
y-2008-spot-010
y-2008-spot-011
y-2008-spot-012
y-2008-spot-013	
y-2008-spot-014	
y-2008-spot-015
y-2008-spot-016	
y-2008-spot-017	
y-2008-spot-018	
y-2008-spot-019	
y-2008-spot-022	
y-2008-spot-028	
y-2008-spot-029	
y-2008-spot-030	
y-2008-spot-033	
y-2008-spot-035	
y-2008-spot-037	
y-2008-spot-040	
y-2008-spot-041	
y-2008-spot-043	
y-2008-spot-046	
y-2008-spot-047	
y-2008-spot-048	
y-2008-spot-050	
y-2008-spot-051	
y-2008-spot-052	
y-2008-spot-053	
y-2008-spot-055	
y-2008-spot-061	
y-2008-spot-062	
y-2008-spot-065	
y-2008-spot-067	
y-2008-spot-072	
y-2008-spot-077	
y-2008-ec2-0003
Flags: needinfo?(klibby)
Flags: needinfo?(jlund)
Flags: needinfo?(catlee)
(In reply to Danut Labici [:dlabici] from comment #3)
> As per :fubar's and :catlee's suggestions, we have reached the following
> decision:
> 1) We are moving the existing slaves, that are attached to one of the
> masters to be decommissioned to new ones.
> 2) Move the slaves to a new master that is in the same Region and Pool as
> the masters listed above.
> 3) Move 2 slaves per old master to new master and give them some time and
> see how they act. If nothing bad happens, finish the transfer Monday.
> 
> @fubar / @catlee : Do you guys have any concerns with our suggestions? 
> 
> So we came up with the following "battle plan"
> 
> OLD Master: buildbot-master94.bb.releng.use1.mozilla.com 
> NEW Master: buildbot-master77.bb.releng.use1.mozilla.com 
> For the following slaves:
> bld-linux64-ec2-301
> bld-linux64-ec2-302
> bld-linux64-ec2-312
> bld-linux64-ec2-317
> bld-linux64-ec2-318

+1

> OLD Master: buildbot-master74.bb.releng.usw2.mozilla.com 
> NEW Master: buildbot-master73.bb.releng.usw2.mozilla.com 
> For the following slaves:
> bld-linux64-ec2-301	
> bld-linux64-ec2-302	
> bld-linux64-ec2-312	
> bld-linux64-ec2-317	
> bld-linux64-ec2-318

+1

> OLD Master: buildbot-master76.bb.releng.use1.mozilla.com 
> NEW Master: buildbot-master75.bb.releng.use1.mozilla.com 
> For the following slaves:
> try-linux64-spot-004	
> try-linux64-spot-005	
> try-linux64-spot-007	
> try-linux64-spot-010	

+1

> b-2008-ec2-0004
this should be moved to buildbot-master77

> y-2008-spot-010
> y-2008-spot-011
> y-2008-spot-012
> y-2008-spot-013	
> y-2008-spot-014	
> y-2008-spot-015
> y-2008-spot-016	
> y-2008-spot-017	
> y-2008-spot-018	
> y-2008-spot-019	
> y-2008-spot-022	
> y-2008-spot-028	
> y-2008-spot-029	
> y-2008-spot-030	
> y-2008-spot-033	
> y-2008-spot-035	
> y-2008-spot-037	
> y-2008-spot-040	
> y-2008-spot-041	
> y-2008-spot-043	
> y-2008-spot-046	
> y-2008-spot-047	
> y-2008-spot-048	
> y-2008-spot-050	
> y-2008-spot-051	
> y-2008-spot-052	
> y-2008-spot-053	
> y-2008-spot-055	
> y-2008-spot-061	
> y-2008-spot-062	
> y-2008-spot-065	
> y-2008-spot-067	
> y-2008-spot-072	
> y-2008-spot-077	
> y-2008-ec2-0003

+1
Flags: needinfo?(catlee)
r+!
Flags: needinfo?(klibby)
I've done the following: 

OLD Master: buildbot-master94.bb.releng.use1.mozilla.com 
NEW Master: buildbot-master77.bb.releng.use1.mozilla.com 
For the following slaves:

bld-linux64-ec2-301 ( mysql> update slaves SET current_masterid = '221' WHERE name = 'bld-linux64-ec2-301';)
bld-linux64-ec2-302 ( mysql> update slaves SET current_masterid = '221' WHERE name = 'bld-linux64-ec2-302';)

OLD Master: buildbot-master76.bb.releng.use1.mozilla.com 
NEW Master: buildbot-master75.bb.releng.use1.mozilla.com 
For the following slaves:

try-linux64-spot-004 (mysql> update slaves set current_masterid = '217' where name = 'try-linux64-spot-004';)
try-linux64-spot-005 (mysql> update slaves set current_masterid = '217' where name = 'try-linux64-spot-005';)
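
The per-slave updates above can also be scripted so that a mistyped name is caught instead of silently matching nothing. This is only a sketch: it runs against an in-memory SQLite stand-in for the `buildslaves` database with a reduced schema (both assumptions), and the helper `move_slaves` is hypothetical. The IDs 219 and 217 are the ones this bug uses for buildbot-master76 and buildbot-master75.

```python
import sqlite3

# In-memory SQLite stand-in with a reduced schema (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE masters (masterid INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE slaves ("
    " name TEXT PRIMARY KEY,"
    " current_masterid INTEGER REFERENCES masters(masterid))"
)
conn.executemany("INSERT INTO masters VALUES (?)", [(219,), (217,)])
conn.executemany(
    "INSERT INTO slaves VALUES (?, ?)",
    [("try-linux64-spot-004", 219), ("try-linux64-spot-005", 219)],
)
conn.commit()


def move_slaves(conn, names, new_masterid):
    """Point each named slave at new_masterid; roll back everything on error."""
    with conn:  # one transaction for the whole batch
        for name in names:
            cur = conn.execute(
                "UPDATE slaves SET current_masterid = ? WHERE name = ?",
                (new_masterid, name),
            )
            if cur.rowcount != 1:
                # A typo in the name would otherwise update zero rows unnoticed.
                raise ValueError(f"expected exactly one row for {name!r}")
    return len(names)


moved = move_slaves(conn, ["try-linux64-spot-004", "try-linux64-spot-005"], 217)
left_on_old = conn.execute(
    "SELECT COUNT(*) FROM slaves WHERE current_masterid = 219"
).fetchone()[0]
print(moved, left_on_old)
```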

Basically, I've moved the first 2 slaves from one master to another in the database to see how it goes. If everything goes smoothly, we'll proceed with moving the rest.
UPDATE: After initial checks, it seems that the machines we tested the move with took jobs and also got reassigned to a new master. From what we know, this is exactly the expected outcome. We'll proceed with the other slaves.
UPDATE: I've done the job, and b-2008-ec2-0004 went straight to buildbot-master77. Leaving this bug open for the final checks.

Buildduty team: Feel free to do the checks and close the bug after making sure that every slave got allocated to a new buildbot master and every slave is taking new jobs.
Hello. After checking the DB, I noticed the following:

OLD Master: buildbot-master94.bb.releng.use1.mozilla.com              current_masterid == 313
mysql> select * from slaves where current_masterid = 313;
+---------+---------------------+----------+--------+---------+-----------+------+---------+-------+--------+---------------+-----------------+---------+------------------+-------+--------------+
| slaveid | name                | distroid | bitsid | speedid | purposeid | dcid | trustid | envid | poolid | basedir       | locked_masterid | enabled | current_masterid | notes | custom_tplid |
+---------+---------------------+----------+--------+---------+-----------+------+---------+-------+--------+---------------+-----------------+---------+------------------+-------+--------------+
|   10527 | bld-linux64-ec2-003 |       15 |      2 |       9 |         5 |   19 |       5 |     2 |     37 | /builds/slave |            NULL |       0 |              313 | NULL  |         NULL |
|   10539 | bld-linux64-ec2-009 |       15 |      2 |       9 |         5 |   19 |       5 |     2 |     37 | /builds/slave |            NULL |       0 |              313 | NULL  |         NULL |
|   10541 | bld-linux64-ec2-010 |       15 |      2 |       9 |         5 |   19 |       5 |     2 |     37 | /builds/slave |            NULL |       0 |              313 | NULL  |         NULL |
|   10549 | bld-linux64-ec2-014 |       15 |      2 |       9 |         5 |   19 |       5 |     2 |     37 | /builds/slave |            NULL |       0 |              313 | NULL  |         NULL |
+---------+---------------------+----------+--------+---------+-----------+------+---------+-------+--------+---------------+-----------------+---------+------------------+-------+--------------+
4 rows in set (0,00 sec)

Should we move these slaves to the NEW Master: buildbot-master77.bb.releng.use1.mozilla.com ? ^

 
OLD Master: buildbot-master74.bb.releng.usw2.mozilla.com             current_masterid == 215
mysql> select * from slaves where current_masterid = 215;
Empty set (0,00 sec)
Could we delete this master from the buildslaves DB now? ^

OLD Master: buildbot-master76.bb.releng.use1.mozilla.com             current_masterid == 219
mysql> select * from slaves where current_masterid = 219;
Empty set (0,00 sec)
Could we delete this master from the buildslaves DB now? ^

Also, the slaves below appear in NEW Master: buildbot-master77.bb.releng.use1.mozilla.com 
bld-linux64-ec2-301         DC = us-west2
bld-linux64-ec2-302         DC = us-west2
bld-linux64-ec2-312         DC = us-west2
bld-linux64-ec2-317         DC = us-west2
bld-linux64-ec2-318         DC = us-west2
I've checked in Slave Allocator, and these slaves are in us-west-2.
Should we move these slaves to the NEW Master: buildbot-master73.bb.releng.usw2.mozilla.com?
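
A region mismatch like the one described above can be checked mechanically. The sketch below assumes that the `use1` and `usw2` labels in the master hostnames correspond to us-east-1 and us-west-2; that mapping, the dict of slave datacenters, and the function names are all assumptions for illustration.

```python
# Assumed mapping from the region label in a master's hostname to a
# datacenter name. The labels come from the hostnames in this bug; the
# mapping itself is an assumption.
REGION_BY_LABEL = {"use1": "us-east-1", "usw2": "us-west-2"}


def master_region(fqdn):
    """Return the region implied by a master FQDN, or None if unknown."""
    for label in fqdn.split("."):
        if label in REGION_BY_LABEL:
            return REGION_BY_LABEL[label]
    return None


def mismatched(slave_dcs, master_fqdn):
    """List slaves whose datacenter differs from the master's region."""
    region = master_region(master_fqdn)
    return sorted(name for name, dc in slave_dcs.items() if dc != region)


slave_dcs = {
    "bld-linux64-ec2-301": "us-west-2",
    "bld-linux64-ec2-302": "us-west-2",
}
on_77 = mismatched(slave_dcs, "buildbot-master77.bb.releng.use1.mozilla.com")
on_73 = mismatched(slave_dcs, "buildbot-master73.bb.releng.usw2.mozilla.com")
print(on_77, on_73)
```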
Flags: needinfo?(klibby)
Flags: needinfo?(jlund)
Flags: needinfo?(catlee)
Slaves moving masters (after we changed them) is an expected outcome; the fact that they have new (auto-assigned) masters tells us everything works without any issues.

I have moved the remaining slaves, re-enabled them in slavealloc and finished the decommission of the 3 masters.

mysql> DELETE from masters WHERE masterid = 313;
Query OK, 1 row affected (0,01 sec)

mysql> DELETE from masters WHERE masterid = 215;
Query OK, 1 row affected (0,00 sec)

mysql> DELETE from masters WHERE masterid = 219;
Query OK, 1 row affected (0,00 sec)
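
As a guard for future decommissions, the deletes above could be preceded by a check that no slave still references the master. This is a minimal sketch on an in-memory SQLite stand-in with a reduced schema (assumptions); `delete_unreferenced` is a hypothetical helper, and the single slave left on 313 is illustrative data showing the skip path, not the state of the real database at this point.

```python
import sqlite3

# In-memory SQLite stand-in with a reduced schema (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE masters (masterid INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE slaves ("
    " name TEXT PRIMARY KEY,"
    " current_masterid INTEGER REFERENCES masters(masterid))"
)
conn.executemany("INSERT INTO masters VALUES (?)", [(313,), (215,), (219,)])
# Illustrative: one slave still points at 313, so that master must be skipped.
conn.execute("INSERT INTO slaves VALUES ('bld-linux64-ec2-003', 313)")
conn.commit()


def delete_unreferenced(conn, masterids):
    """Delete only masters that no slave references; report both outcomes."""
    deleted, skipped = [], []
    with conn:  # single transaction for all deletes
        for masterid in masterids:
            (count,) = conn.execute(
                "SELECT COUNT(*) FROM slaves WHERE current_masterid = ?",
                (masterid,),
            ).fetchone()
            if count:
                skipped.append(masterid)
                continue
            conn.execute("DELETE FROM masters WHERE masterid = ?", (masterid,))
            deleted.append(masterid)
    return deleted, skipped


deleted, skipped = delete_unreferenced(conn, [313, 215, 219])
print(deleted, skipped)
```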


Closing this bug for now, as every step has been done and everything seems to work as expected.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(klibby)
Flags: needinfo?(jlund)
Flags: needinfo?(catlee)
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard