Closed Bug 1335435 Opened 6 years ago Closed 6 years ago

Reduce number of linux test masters in AWS

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: aselagea)

References

Details

Attachments

(2 files)

Now that most of our test load is in Taskcluster, we should be able to retire several linux test masters in AWS.

In the past two weeks we've peaked at just under 200 tst-linux64 instances, 200 tst-linux32 instances, and 100 tst-emulator64 instances.

We currently have 10 linux32 masters and 20 linux64 masters in AWS. I think we can reduce this to 4 for each type - 2 per AWS region.
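A rough sanity check of the proposed reduction, using the peak figures above (a sketch; it ignores the tst-emulator64 pool, whose master assignment isn't stated here):

```python
# Peak load per master, before and after the proposed reduction.
# Figures come from the description above; distribution across masters
# is assumed to be even.
peak_workers = {"tst-linux32": 200, "tst-linux64": 200}
masters_now = {"tst-linux32": 10, "tst-linux64": 20}
masters_proposed = 4  # 2 per AWS region for each type

for pool, peak in peak_workers.items():
    before = peak / masters_now[pool]
    after = peak / masters_proposed
    print(f"{pool}: ~{before:.0f} -> ~{after:.0f} workers per master at peak")
```

So each remaining master would see roughly 50 workers at peak, assuming an even spread.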
Assignee: nobody → aselagea
Linux 32 AWS masters:

us-east-1:

bm01-tests1-linux32
bm02-tests1-linux32
bm03-tests1-linux32
bm08-tests1-linux32
bm141-tests1-linux32

keep: bm01, bm02

us-west-2:
bm04-tests1-linux32
bm05-tests1-linux32
bm06-tests1-linux32
bm07-tests1-linux32
bm142-tests1-linux32

keep: bm04, bm05
Linux 64 AWS masters:

us-east-1:

bm51-tests1-linux64
bm52-tests1-linux64
bm67-tests1-linux64
bm113-tests1-linux64
bm114-tests1-linux64
bm117-tests1-linux64
bm120-tests1-linux64
bm121-tests1-linux64
bm124-tests1-linux64
bm130-tests1-linux64

keep: bm51, bm52

us-west-2:

bm53-tests1-linux64
bm54-tests1-linux64
bm68-tests1-linux64
bm115-tests1-linux64
bm116-tests1-linux64
bm118-tests1-linux64
bm122-tests1-linux64
bm123-tests1-linux64
bm125-tests1-linux64
bm131-tests1-linux64

keep: bm53, bm54
Attachment #8832432 - Flags: review?(kmoir)
Attachment #8832433 - Flags: review?(kmoir)
Depends on: 1335713
I think we should disable these masters first and see if the remaining ones can handle the load. Then wait for bug 1335713 (disabling the Nagios alerts) before actually removing the masters from our configs.
Comment on attachment 8832432 [details] [diff] [review]
bug1335435_puppet.patch

Looks good; make sure you gracefully shut down the masters and disable them in Nagios before you land the changes.
Attachment #8832432 - Flags: review?(kmoir) → review+
Attachment #8832433 - Flags: review?(kmoir) → review+
Disabled these masters in slavealloc and Nagios, and did a clean shutdown of each of them.
reminder to remove the master entries from slavealloc/inventory once they are completely decommissioned
Stopped all 22 masters in AWS. 
Per :catlee's suggestion in IRC, let's wait until next week to confirm that everything's OK before landing these patches.
One weird thing is that many spot instances are still attached to a master even though those instances were terminated or are stopped. As a result, some buildbot masters show 280+ connected instances while others show ~23, leaving the load unbalanced.

I assume an instance with a given FQDN only gets its "current_masterid" field updated in slavealloc when it's actually launched and connects to another master.

mysql> select masters.nickname, count(*), masters.enabled from slaves, masters where masters.nickname like '%linux%' and slaves.current_masterid = masters.masterid group by masters.nickname;
+----------------------+----------+---------+
| nickname             | count(*) | enabled |
+----------------------+----------+---------+
| bm01-tests1-linux32  |      162 |       1 |
| bm02-tests1-linux32  |      162 |       1 |
| bm03-tests1-linux32  |       37 |       0 |
| bm04-tests1-linux32  |      126 |       1 |
| bm05-tests1-linux32  |      126 |       1 |
| bm06-tests1-linux32  |       68 |       0 |
| bm07-tests1-linux32  |       64 |       0 |
| bm08-tests1-linux32  |       49 |       0 |
| bm103-tests1-linux   |       24 |       1 |
| bm104-tests1-linux   |       23 |       1 |
| bm105-tests1-linux   |       23 |       1 |
| bm113-tests1-linux64 |      151 |       0 |
| bm114-tests1-linux64 |      145 |       0 |
| bm115-tests1-linux64 |      184 |       0 |
| bm116-tests1-linux64 |      189 |       0 |
| bm117-tests1-linux64 |      157 |       0 |
| bm118-tests1-linux64 |      179 |       0 |
| bm120-tests1-linux64 |      149 |       0 |
| bm121-tests1-linux64 |      138 |       0 |
| bm122-tests1-linux64 |      184 |       0 |
| bm123-tests1-linux64 |      183 |       0 |
| bm124-tests1-linux64 |      143 |       0 |
| bm125-tests1-linux64 |      183 |       0 |
| bm130-tests1-linux64 |      150 |       0 |
| bm131-tests1-linux64 |      188 |       0 |
| bm141-tests1-linux32 |       40 |       0 |
| bm142-tests1-linux32 |       66 |       0 |
| bm51-tests1-linux64  |      281 |       1 |
| bm52-tests1-linux64  |      280 |       1 |
| bm53-tests1-linux64  |      223 |       1 |
| bm54-tests1-linux64  |      222 |       1 |
| bm67-tests1-linux64  |      131 |       0 |
| bm68-tests1-linux64  |      189 |       0 |
+----------------------+----------+---------+
33 rows in set (0.00 sec)
The low number of machines connected to bm103, bm104 and bm105 is understandable though, since those are hardware masters and only the 'talos-ix' pool uses them.
The way slavealloc works is that it tries to balance the number of workers across all the enabled masters. Unfortunately, it isn't aware that most of these instances are created and destroyed regularly; it assumes that all the workers are running all of the time.

We could try to do a few things:
- Reset the associated master to NULL for any workers attached to disabled masters
- Disable all but a few hundred workers in each region
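The stale counts in the table above can be illustrated with a minimal sketch (hypothetical data, not the real slavealloc code): terminated spot instances keep their current_masterid row, so disabled masters still appear to hold workers while new launches are balanced only across the enabled masters.

```python
from collections import Counter

# worker fqdn -> master nickname, mimicking slaves.current_masterid
current_masterid = {}

disabled = ["bm113-tests1-linux64", "bm114-tests1-linux64"]
enabled = ["bm51-tests1-linux64", "bm52-tests1-linux64"]

# Workers that connected before their master was disabled; the spot
# instances are long gone, but the rows remain.
for i in range(300):
    current_masterid[f"stale-worker-{i:03d}"] = disabled[i % 2]

# Newly launched instances are balanced across enabled masters only.
for i in range(500):
    current_masterid[f"live-worker-{i:03d}"] = enabled[i % 2]

counts = Counter(current_masterid.values())
for master in sorted(counts):
    print(master, counts[master])
```

The disabled masters still "hold" 150 workers each, which is why resetting their rows to NULL is the first suggestion above.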
(In reply to Chris AtLee [:catlee] from comment #12)
 
> We could try to do a few things:
> - Reset the associated master to NULL for any workers attached to disabled
> masters

mysql> update masters, slaves set slaves.current_masterid = NULL where masters.nickname like '%linux%' and slaves.current_masterid = masters.masterid and masters.enabled = 0;
Query OK, 2967 rows affected (0.13 sec)
Rows matched: 2967  Changed: 2967  Warnings: 0
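The effect of that reset can be sketched with an in-memory SQLite copy of the two tables (hypothetical schema and rows; note SQLite wants a subquery rather than MySQL's multi-table UPDATE form):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE masters (masterid INTEGER PRIMARY KEY, nickname TEXT, enabled INTEGER);
CREATE TABLE slaves  (name TEXT, current_masterid INTEGER);
INSERT INTO masters VALUES (1, 'bm51-tests1-linux64', 1),
                           (2, 'bm113-tests1-linux64', 0);
INSERT INTO slaves VALUES ('tst-linux64-spot-001', 1),
                          ('tst-linux64-spot-002', 2),
                          ('tst-linux64-spot-003', 2);
""")

# Detach workers from disabled linux masters (equivalent of the MySQL
# statement above, rewritten with a subquery for SQLite).
cur = db.execute("""
    UPDATE slaves SET current_masterid = NULL
    WHERE current_masterid IN (
        SELECT masterid FROM masters
        WHERE nickname LIKE '%linux%' AND enabled = 0
    )""")
print(cur.rowcount, "rows affected")
```

Only the workers pointing at the disabled master are detached; workers on enabled masters keep their assignment.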

> - Disable all but a few hundred workers in each region

Ran some queries to see how many linux servers are present in slavealloc (total/enabled):
tst-linux32-spot - total: 899; enabled: 899
tst-linux64-spot - total: 2599; enabled: 2599
tst-emulator64-spot - total: 1048; enabled: 1047
bld-linux64-spot - total: 698; enabled: 698
try-linux64-spot - total: 499; enabled: 499
av-linux64-spot - total: 4; enabled: 4

Any suggestions on how many we should keep enabled in each case?
Looking at the dashboard: https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/ec2-dashboard?from=now-7d&to=now&panelId=8&fullscreen

I think we should keep:

tst-linux32: 300
tst-linux64: 200
tst-emulator64: 200
bld-linux64: 200
try-linux64: 100
av-linux64: 4
Updated slavealloc to match the numbers in the comment above. The number of instances for each platform is evenly distributed between us-east-1 and us-west-2.
Attachment #8832433 - Flags: checked-in+
Attachment #8832432 - Flags: checked-in+
Terminated instances, removed inventory and slavealloc entries.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard