Closed Bug 1335435 Opened 6 years ago Closed 6 years ago

Reduce number of linux test masters in AWS

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: aselagea)

References

Details

Attachments

(2 files)

Now that most of our test load is in Taskcluster, we should be able to retire several linux test masters in AWS.

In the past two weeks we've peaked at just under 200 tst-linux64 instances, 200 tst-linux32 instances, and 100 tst-emulator64 instances.

We currently have 10 linux32 masters and 20 linux64 masters in AWS. I think we can reduce this to 4 for each type - 2 per AWS region.
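A rough sanity check of the proposed reduction, using the peak figures above (a sketch; it ignores the tst-emulator64 pool, whose master assignment isn't stated here):

```python
# Peak load per master, before and after the proposed reduction.
# Figures come from the description above; distribution across masters
# is assumed to be even.
peak_workers = {"tst-linux32": 200, "tst-linux64": 200}
masters_now = {"tst-linux32": 10, "tst-linux64": 20}
masters_proposed = 4  # 2 per AWS region for each type

for pool, peak in peak_workers.items():
    before = peak / masters_now[pool]
    after = peak / masters_proposed
    print(f"{pool}: ~{before:.0f} -> ~{after:.0f} workers per master at peak")
```

So each remaining master would see roughly 50 workers at peak, assuming an even spread.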
Assignee: nobody → aselagea
Linux 32 AWS masters:

us-east-1:

bm01-tests1-linux32
bm02-tests1-linux32
bm03-tests1-linux32
bm08-tests1-linux32
bm141-tests1-linux32

keep: bm01, bm02

us-west-2:
bm04-tests1-linux32
bm05-tests1-linux32
bm06-tests1-linux32
bm07-tests1-linux32
bm142-tests1-linux32

keep: bm04, bm05
Linux 64 AWS masters:

us-east-1:

bm51-tests1-linux64
bm52-tests1-linux64
bm67-tests1-linux64
bm113-tests1-linux64
bm114-tests1-linux64
bm117-tests1-linux64
bm120-tests1-linux64
bm121-tests1-linux64
bm124-tests1-linux64
bm130-tests1-linux64

keep: bm51, bm52

us-west-2:

bm53-tests1-linux64
bm54-tests1-linux64
bm68-tests1-linux64
bm115-tests1-linux64
bm116-tests1-linux64
bm118-tests1-linux64
bm122-tests1-linux64
bm123-tests1-linux64
bm125-tests1-linux64
bm131-tests1-linux64

keep: bm53, bm54
Attachment #8832432 - Flags: review?(kmoir)
Attachment #8832433 - Flags: review?(kmoir)
Depends on: 1335713
I think we should disable these masters first and see if the remaining ones can handle the load. Then wait for bug 1335713 (disabling the Nagios alerts) before actually removing the masters from our configs.
Comment on attachment 8832432 [details] [diff] [review]
bug1335435_puppet.patch

Looks good; make sure you gracefully shut down the masters and disable them in Nagios before you land the changes.
Attachment #8832432 - Flags: review?(kmoir) → review+
Attachment #8832433 - Flags: review?(kmoir) → review+
Disabled these masters in slavealloc and Nagios, and did a clean shutdown of each of them.
reminder to remove the master entries from slavealloc/inventory once they are completely decommissioned
Stopped all 22 masters in AWS. 
Per :catlee's suggestion in IRC, let's wait until next week to confirm that everything's OK before landing these patches.
One weird thing is that many spot instances are still attached to a master even though those instances were terminated or are stopped. As a result, some buildbot masters show 280+ connected instances while others show ~23, leaving the load unbalanced.

I assume an instance with a given FQDN only gets its "current_masterid" field updated in slavealloc when it's actually launched and connects to another master.

mysql> select masters.nickname, count(*), masters.enabled from slaves, masters where masters.nickname like '%linux%' and slaves.current_masterid = masters.masterid group by masters.nickname;
+----------------------+----------+---------+
| nickname             | count(*) | enabled |
+----------------------+----------+---------+
| bm01-tests1-linux32  |      162 |       1 |
| bm02-tests1-linux32  |      162 |       1 |
| bm03-tests1-linux32  |       37 |       0 |
| bm04-tests1-linux32  |      126 |       1 |
| bm05-tests1-linux32  |      126 |       1 |
| bm06-tests1-linux32  |       68 |       0 |
| bm07-tests1-linux32  |       64 |       0 |
| bm08-tests1-linux32  |       49 |       0 |
| bm103-tests1-linux   |       24 |       1 |
| bm104-tests1-linux   |       23 |       1 |
| bm105-tests1-linux   |       23 |       1 |
| bm113-tests1-linux64 |      151 |       0 |
| bm114-tests1-linux64 |      145 |       0 |
| bm115-tests1-linux64 |      184 |       0 |
| bm116-tests1-linux64 |      189 |       0 |
| bm117-tests1-linux64 |      157 |       0 |
| bm118-tests1-linux64 |      179 |       0 |
| bm120-tests1-linux64 |      149 |       0 |
| bm121-tests1-linux64 |      138 |       0 |
| bm122-tests1-linux64 |      184 |       0 |
| bm123-tests1-linux64 |      183 |       0 |
| bm124-tests1-linux64 |      143 |       0 |
| bm125-tests1-linux64 |      183 |       0 |
| bm130-tests1-linux64 |      150 |       0 |
| bm131-tests1-linux64 |      188 |       0 |
| bm141-tests1-linux32 |       40 |       0 |
| bm142-tests1-linux32 |       66 |       0 |
| bm51-tests1-linux64  |      281 |       1 |
| bm52-tests1-linux64  |      280 |       1 |
| bm53-tests1-linux64  |      223 |       1 |
| bm54-tests1-linux64  |      222 |       1 |
| bm67-tests1-linux64  |      131 |       0 |
| bm68-tests1-linux64  |      189 |       0 |
+----------------------+----------+---------+
33 rows in set (0.00 sec)
The low number of machines connected to bm103, bm104 and bm105 is understandable though, since those are hardware masters and only the 'talos-ix' pool uses them.
The way slavealloc works is that it tries to balance the number of workers across all the enabled masters. Unfortunately, it isn't aware that most of these instances are created and destroyed regularly; it assumes that all the workers are running all of the time.

We could try to do a few things:
- Reset the associated master to NULL for any workers attached to disabled masters
- Disable all but a few hundred workers in each region
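The stale counts in the table above can be illustrated with a minimal sketch (hypothetical data, not the real slavealloc code): terminated spot instances keep their current_masterid row, so disabled masters still appear to hold workers while new launches are balanced only across the enabled masters.

```python
from collections import Counter

# worker fqdn -> master nickname, mimicking slaves.current_masterid
current_masterid = {}

disabled = ["bm113-tests1-linux64", "bm114-tests1-linux64"]
enabled = ["bm51-tests1-linux64", "bm52-tests1-linux64"]

# Workers that connected before their master was disabled; the spot
# instances are long gone, but the rows remain.
for i in range(300):
    current_masterid[f"stale-worker-{i:03d}"] = disabled[i % 2]

# Newly launched instances are balanced across enabled masters only.
for i in range(500):
    current_masterid[f"live-worker-{i:03d}"] = enabled[i % 2]

counts = Counter(current_masterid.values())
for master in sorted(counts):
    print(master, counts[master])
```

The disabled masters still "hold" 150 workers each, which is why resetting their rows to NULL is the first suggestion above.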
(In reply to Chris AtLee [:catlee] from comment #12)
 
> We could try to do a few things:
> - Reset the associated master to NULL for any workers attached to disabled
> masters

mysql> update masters, slaves set slaves.current_masterid = NULL where masters.nickname like '%linux%' and slaves.current_masterid = masters.masterid and masters.enabled = 0;
Query OK, 2967 rows affected (0.13 sec)
Rows matched: 2967  Changed: 2967  Warnings: 0
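The effect of that reset can be sketched with an in-memory SQLite copy of the two tables (hypothetical schema and rows; note SQLite wants a subquery rather than MySQL's multi-table UPDATE form):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE masters (masterid INTEGER PRIMARY KEY, nickname TEXT, enabled INTEGER);
CREATE TABLE slaves  (name TEXT, current_masterid INTEGER);
INSERT INTO masters VALUES (1, 'bm51-tests1-linux64', 1),
                           (2, 'bm113-tests1-linux64', 0);
INSERT INTO slaves VALUES ('tst-linux64-spot-001', 1),
                          ('tst-linux64-spot-002', 2),
                          ('tst-linux64-spot-003', 2);
""")

# Detach workers from disabled linux masters (equivalent of the MySQL
# statement above, rewritten with a subquery for SQLite).
cur = db.execute("""
    UPDATE slaves SET current_masterid = NULL
    WHERE current_masterid IN (
        SELECT masterid FROM masters
        WHERE nickname LIKE '%linux%' AND enabled = 0
    )""")
print(cur.rowcount, "rows affected")
```

Only the workers pointing at the disabled master are detached; workers on enabled masters keep their assignment.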

> - Disable all but a few hundred workers in each region

Ran some queries to see how many linux servers are present in slavealloc (total/enabled):
tst-linux32-spot - total: 899; enabled: 899
tst-linux64-spot - total: 2599; enabled: 2599
tst-emulator64-spot - total: 1048; enabled: 1047
bld-linux64-spot - total: 698; enabled: 698
try-linux64-spot - total: 499; enabled: 499
av-linux64-spot - total: 4; enabled: 4

Any suggestions on how many we should keep enabled in each case?
Looking at the dashboard: https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/ec2-dashboard?from=now-7d&to=now&panelId=8&fullscreen

I think we should keep:

tst-linux32: 300
tst-linux64: 200
tst-emulator64: 200
bld-linux64: 200
try-linux64: 100
av-linux64: 4
Updated slavealloc to match the numbers in the comment above. The number of instances for each platform is evenly distributed between us-east-1 and us-west-2.
Attachment #8832433 - Flags: checked-in+
Attachment #8832432 - Flags: checked-in+
Terminated instances, removed inventory and slavealloc entries.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard