Closed Bug 1335435 · Opened 7 years ago · Closed 7 years ago

Reduce number of linux test masters in AWS

Categories: Infrastructure & Operations Graveyard :: CIDuty (task)
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: catlee
Assignee: aselagea
Attachments (2 files)

  patch, 9.01 KB  (kmoir: review+; aselagea: checked-in+)
  patch, 35.59 KB (kmoir: review+; aselagea: checked-in+)
Description:

Now that most of our test load is in Taskcluster, we should be able to retire several linux test masters in AWS. In the past two weeks we've peaked at just under 200 tst-linux64 instances, 200 tst-linux32 instances, and 100 tst-emulator64 instances. We currently have 10 linux32 masters and 20 linux64 masters in AWS. I think we can reduce this to 4 of each type: 2 per AWS region.
Updated • 7 years ago

Assignee: nobody → aselagea
Comment 1 • aselagea (Assignee) • 7 years ago

Linux 32 AWS masters:

us-east-1:
  bm01-tests1-linux32
  bm02-tests1-linux32
  bm03-tests1-linux32
  bm08-tests1-linux32
  bm141-tests1-linux32
keep: bm01, bm02

us-west-2:
  bm04-tests1-linux32
  bm05-tests1-linux32
  bm06-tests1-linux32
  bm07-tests1-linux32
  bm142-tests1-linux32
keep: bm04, bm05
Comment 2 • aselagea (Assignee) • 7 years ago

Linux 64 AWS masters:

us-east-1:
  bm51-tests1-linux64
  bm52-tests1-linux64
  bm67-tests1-linux64
  bm113-tests1-linux64
  bm114-tests1-linux64
  bm117-tests1-linux64
  bm120-tests1-linux64
  bm121-tests1-linux64
  bm124-tests1-linux64
  bm130-tests1-linux64
keep: bm51, bm52

us-west-2:
  bm53-tests1-linux64
  bm54-tests1-linux64
  bm68-tests1-linux64
  bm115-tests1-linux64
  bm116-tests1-linux64
  bm118-tests1-linux64
  bm122-tests1-linux64
  bm123-tests1-linux64
  bm125-tests1-linux64
  bm131-tests1-linux64
keep: bm53, bm54
Comment 3 • aselagea (Assignee) • 7 years ago

Attachment #8832432 - Flags: review?(kmoir)
Comment 4 • aselagea (Assignee) • 7 years ago

Attachment #8832433 - Flags: review?(kmoir)
Comment 5 • aselagea (Assignee) • 7 years ago

I think we should disable these masters first and see whether the remaining ones can handle the load. Then wait for bug 1335713 to disable the Nagios alerts before actually removing the masters from our configs.
Comment 6 • 7 years ago

Comment on attachment 8832432 [details] [diff] [review]
bug1335435_puppet.patch

Looks good. Make sure you gracefully shut down the masters and disable them in Nagios before you land the changes.

Attachment #8832432 - Flags: review?(kmoir) → review+
Updated • 7 years ago

Attachment #8832433 - Flags: review?(kmoir) → review+
Comment 7 • aselagea (Assignee) • 7 years ago

Disabled these masters in slavealloc and Nagios, then did a clean shutdown of each of them.
Comment 8 • 7 years ago

Reminder to remove the master entries from slavealloc/inventory once they are completely decommissioned.
Comment 9 • aselagea (Assignee) • 7 years ago

Stopped all 22 masters in AWS. Per :catlee's suggestion on IRC, let's wait until next week to confirm that everything is OK before landing these patches.
Comment 10 • aselagea (Assignee) • 7 years ago

One weird thing is that many spot instances are still attached to a master even though those instances were terminated or are stopped. As a result, some buildbot masters show 280+ connected instances while others show ~23, so the load is unbalanced. I assume an instance with a given fqdn only gets its "current_masterid" field updated in slavealloc when it is actually launched and connects to another master.

mysql> select masters.nickname, count(*), masters.enabled
    ->   from slaves, masters
    ->  where masters.nickname like '%linux%'
    ->    and slaves.current_masterid = masters.masterid
    ->  group by masters.nickname;
+----------------------+----------+---------+
| nickname             | count(*) | enabled |
+----------------------+----------+---------+
| bm01-tests1-linux32  |      162 |       1 |
| bm02-tests1-linux32  |      162 |       1 |
| bm03-tests1-linux32  |       37 |       0 |
| bm04-tests1-linux32  |      126 |       1 |
| bm05-tests1-linux32  |      126 |       1 |
| bm06-tests1-linux32  |       68 |       0 |
| bm07-tests1-linux32  |       64 |       0 |
| bm08-tests1-linux32  |       49 |       0 |
| bm103-tests1-linux   |       24 |       1 |
| bm104-tests1-linux   |       23 |       1 |
| bm105-tests1-linux   |       23 |       1 |
| bm113-tests1-linux64 |      151 |       0 |
| bm114-tests1-linux64 |      145 |       0 |
| bm115-tests1-linux64 |      184 |       0 |
| bm116-tests1-linux64 |      189 |       0 |
| bm117-tests1-linux64 |      157 |       0 |
| bm118-tests1-linux64 |      179 |       0 |
| bm120-tests1-linux64 |      149 |       0 |
| bm121-tests1-linux64 |      138 |       0 |
| bm122-tests1-linux64 |      184 |       0 |
| bm123-tests1-linux64 |      183 |       0 |
| bm124-tests1-linux64 |      143 |       0 |
| bm125-tests1-linux64 |      183 |       0 |
| bm130-tests1-linux64 |      150 |       0 |
| bm131-tests1-linux64 |      188 |       0 |
| bm141-tests1-linux32 |       40 |       0 |
| bm142-tests1-linux32 |       66 |       0 |
| bm51-tests1-linux64  |      281 |       1 |
| bm52-tests1-linux64  |      280 |       1 |
| bm53-tests1-linux64  |      223 |       1 |
| bm54-tests1-linux64  |      222 |       1 |
| bm67-tests1-linux64  |      131 |       0 |
| bm68-tests1-linux64  |      189 |       0 |
+----------------------+----------+---------+
33 rows in set (0.00 sec)
Comment 11 • aselagea (Assignee) • 7 years ago

bm103, bm104 and bm105 having a low number of connected machines is expected, though: they are hardware machines, and only the 'talos-ix' pool uses them.
Comment 12 • catlee (Reporter) • 7 years ago

The way slavealloc works is that it tries to balance the number of workers across all the enabled masters. Unfortunately, it isn't aware that most of these instances get created and destroyed regularly; it assumes that all the workers are running all of the time. We could try a few things:

- Reset the associated master to NULL for any workers attached to disabled masters
- Disable all but a few hundred workers in each region
Comment 13 • aselagea (Assignee) • 7 years ago

(In reply to Chris AtLee [:catlee] from comment #12)
> We could try to do a few things:
> - Reset the associated master to NULL for any workers attached to disabled
>   masters

mysql> update masters, slaves
    ->    set slaves.current_masterid = NULL
    ->  where masters.nickname like '%linux%'
    ->    and slaves.current_masterid = masters.masterid
    ->    and masters.enabled = 0;
Query OK, 2967 rows affected (0.13 sec)
Rows matched: 2967  Changed: 2967  Warnings: 0

> - Disable all but a few hundred workers in each region

Ran some queries to see how many linux workers are present in slavealloc (total/enabled):

  tst-linux32-spot    - total: 899;  enabled: 899
  tst-linux64-spot    - total: 2599; enabled: 2599
  tst-emulator64-spot - total: 1048; enabled: 1047
  bld-linux64-spot    - total: 698;  enabled: 698
  try-linux64-spot    - total: 499;  enabled: 499
  av-linux64-spot     - total: 4;    enabled: 4

Any suggestions on how many we should keep enabled in each case?
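A natural sanity check after that UPDATE, sketched here against the same tables used in the queries above (this follow-up query itself is not shown in the bug), is to re-run the per-master count restricted to disabled masters and confirm it comes back empty:

  -- Same schema as the query in comment 10; expect "Empty set" after the reset.
  select masters.nickname, count(*)
    from slaves, masters
   where masters.nickname like '%linux%'
     and slaves.current_masterid = masters.masterid
     and masters.enabled = 0
   group by masters.nickname;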
Comment 14 • catlee (Reporter) • 7 years ago

Looking at the dashboard:
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/ec2-dashboard?from=now-7d&to=now&panelId=8&fullscreen

I think we should keep:

  tst-linux32:    300
  tst-linux64:    200
  tst-emulator64: 200
  bld-linux64:    200
  try-linux64:    100
  av-linux64:     4
Comment 15 • aselagea (Assignee) • 7 years ago

Updated slavealloc to match the numbers in the comment above. The number of instances for each platform is evenly distributed between us-east-1 and us-west-2.
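For illustration, a minimal sketch of what that bulk disable might have looked like for one pool. The 'name' and 'enabled' columns on the slaves table are assumptions (the total/enabled counts in comment 13 suggest such a flag exists, but the actual column names and the per-region split are not shown in this bug):

  -- Hypothetical: 'name' and 'enabled' are assumed slaves columns.
  -- Keeps the 200 lowest-numbered tst-linux64 spot workers enabled and
  -- disables the remaining 2399 of the 2599 counted in comment 13;
  -- balancing between us-east-1 and us-west-2 is left out for brevity.
  update slaves
     set enabled = 0
   where name like 'tst-linux64-spot-%'
   order by name desc
   limit 2399;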
Updated • 7 years ago

Attachment #8832433 - Flags: checked-in+
Updated • 7 years ago

Attachment #8832432 - Flags: checked-in+
Comment 16 • aselagea (Assignee) • 7 years ago

Terminated the instances and removed the inventory and slavealloc entries.

Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated • 6 years ago

Product: Release Engineering → Infrastructure & Operations
Updated • 4 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard