investigate and determine if we can disable any masters after latest tcmigration cleanup

RESOLVED FIXED

People

(Reporter: jlund, Assigned: aobreja)

(3 attachments)

We disabled over 1000 machines from the bb infra [1]. Let's determine whether we can also disable some more masters.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1393774#c4
Assignee: nobody → aobreja
For the physical machines in scl3, I don't think we need to decomm any master (check all masters [1]); we already decommissioned some physical machines in bug 1376279 and bug 1383266.
We have some staging-personal masters which are used for tests, and a few masters remain for each OS.

for build and try in scl3:
bm82-build1
bm83-try1
bm84-build1
bm85-build1
bm86-build1
bm87-try1

for tests:

bm103-tests1-linux
bm104-tests1-linux
bm105-tests1-linux
bm106-tests1-macosx
bm107-tests1-macosx
bm109-tests1-windows
bm110-tests1-windows
bm111-tests1-windows


For AWS machines we have the following status:
use1:
   -for build and try:
       bm70-build1
       bm71-build1
       bm75-try1
       bm76-try1
       bm77-build1
       bm94-build1
  -tests:
      bm01-tests1-linux32
      bm02-tests1-linux32
      bm51-tests1-linux64
      bm52-tests1-linux64
      bm137-tests1-windows
      bm138-tests1-windows
usw2:
    -for build and try:
        bm72-build1
        bm73-build1   
        bm74-build1
        bm78-try1
        bm79-try1
        bm91-build1
  -tests:
       bm04-tests1-linux32
       bm05-tests1-linux32
       bm53-tests1-linux64
       bm54-tests1-linux64
       bm139-tests1-windows
       bm140-tests1-windows
       bm128-tests1-windows
       bm129-tests1-windows 


Maybe we can disable a few AWS masters, but I don't have a metric to measure how many masters will be needed when the release comes.
Adding an NI here for Kim and Chris; maybe based on this status they can give us some advice.

[1] https://secure.pub.build.mozilla.org/slavealloc/ui/#masters
Flags: needinfo?(kmoir)
Flags: needinfo?(catlee)
For the AWS masters, I don't think that we run tests on release, except for ESR releases, which are infrequent. For regular releases we still run a number of release-related jobs on buildbot that we are in the process of transitioning to tc. Have you looked at the number of jobs that run on the masters that are still in service?

There used to be a buildbot graph on these pages, but I don't see data for it anymore.

https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/buildbot-masters

https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/ec2-dashboard
Flags: needinfo?(kmoir)
Looking at the DB for jobs in the past 14 days, for SCL3 we see:
mysql> select claimed_by_name, count(*) as njobs from buildrequests where claimed_at > unix_timestamp(NOW() - INTERVAL 14 DAY) and claimed_by_name like '%scl3%' group by claimed_by_name order by njobs asc;
+--------------------------------------------------------------------------------------+-------+
| claimed_by_name                                                                      | njobs |
+--------------------------------------------------------------------------------------+-------+
| buildbot-master83.bb.releng.scl3.mozilla.com:/builds/buildbot/try1/master            |    11 |
| buildbot-master87.bb.releng.scl3.mozilla.com:/builds/buildbot/try1/master            |    27 |
| buildbot-master86.bb.releng.scl3.mozilla.com:/builds/buildbot/build1/master          |   188 |
| buildbot-master82.bb.releng.scl3.mozilla.com:/builds/buildbot/build1/master          |   234 |
| buildbot-master84.bb.releng.scl3.mozilla.com:/builds/buildbot/build1/master          |   329 |
| buildbot-master85.bb.releng.scl3.mozilla.com:/builds/buildbot/build1/master          |   336 |
| buildbot-master107.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-macosx/master  |   442 |
| buildbot-master106.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-macosx/master  |  3764 |
| buildbot-master105.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-linux/master   | 14469 |
| buildbot-master104.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-linux/master   | 16901 |
| buildbot-master103.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-linux/master   | 18054 |
| buildbot-master110.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-windows/master | 36771 |
| buildbot-master111.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-windows/master | 38695 |
| buildbot-master109.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-windows/master | 41896 |
+--------------------------------------------------------------------------------------+-------+

There's not a lot we can do here yet. It's interesting that the macosx test load is so different between bm106/107.
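
To make the bm106/bm107 imbalance easier to see, here is a minimal sketch that splits each claimed_by_name value into master number, region, and pool, then reports each master's share of its pool. The regular expression and the parse_master helper are assumptions for illustration, not part of any buildbot tooling; the job counts are copied from the query output above.

```python
import re
from collections import defaultdict

# Hypothetical pattern for the claimed_by_name values shown in the table.
CLAIMED_BY = re.compile(
    r"buildbot-master(?P<num>\d+)\.bb\.releng\.(?P<region>\w+)\.mozilla\.com:"
    r"/builds/buildbot/(?P<pool>[\w-]+)/master"
)


def parse_master(claimed_by_name):
    """Return (master number, region, pool) for one claimed_by_name value."""
    match = CLAIMED_BY.fullmatch(claimed_by_name.strip())
    if match is None:
        raise ValueError(f"unrecognised master name: {claimed_by_name!r}")
    return int(match["num"]), match["region"], match["pool"]


# 14-day job counts for the two SCL3 macosx test masters.
rows = [
    ("buildbot-master107.bb.releng.scl3.mozilla.com:"
     "/builds/buildbot/tests1-macosx/master", 442),
    ("buildbot-master106.bb.releng.scl3.mozilla.com:"
     "/builds/buildbot/tests1-macosx/master", 3764),
]

# Group the counts by (region, pool) so masters in one pool can be compared.
by_pool = defaultdict(dict)
for name, njobs in rows:
    num, region, pool = parse_master(name)
    by_pool[(region, pool)][f"bm{num}"] = njobs

for (region, pool), counts in by_pool.items():
    total = sum(counts.values())
    for master, njobs in sorted(counts.items()):
        share = njobs / total
        print(f"{region} {pool} {master}: {njobs} jobs ({share:.0%} of pool)")
```

The same loop works for the full table if every row of the query output is added to rows.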

For AWS we see:
mysql> select claimed_by_name, count(*) as njobs from buildrequests where claimed_at > unix_timestamp(NOW() - INTERVAL 14 DAY) and claimed_by_name not like '%scl3%' group by claimed_by_name order by njobs asc;
+--------------------------------------------------------------------------------------+-------+
| claimed_by_name                                                                      | njobs |
+--------------------------------------------------------------------------------------+-------+
| buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master            |     3 |
| buildbot-master75.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master            |     7 |
| buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master            |    20 |
| buildbot-master78.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master            |    58 |
| buildbot-master94.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   156 |
| buildbot-master52.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master  |   276 |
| buildbot-master128.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master |   285 |
| buildbot-master77.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   308 |
| buildbot-master137.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master |   315 |
| buildbot-master138.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master |   317 |
| buildbot-master72.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   327 |
| buildbot-master71.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   341 |
| buildbot-master70.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   494 |
| buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   517 |
| buildbot-master51.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master  |   596 |
| buildbot-master53.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux64/master  |   615 |
| buildbot-master91.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   636 |
| buildbot-master73.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   699 |
| buildbot-master01.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master  |   753 |
| buildbot-master54.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux64/master  |   805 |
| buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master |   883 |
| buildbot-master139.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master |   912 |
| buildbot-master140.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master |   997 |
| buildbot-master02.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master  |  1030 |
| buildbot-master05.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux32/master  |  1253 |
| buildbot-master04.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux32/master  |  1310 |
+--------------------------------------------------------------------------------------+-------+

We could probably turn off 1 try master per region, leaving only 1 per region.
I think we could turn off 2 build masters per region, leaving 2 online per region.
We could also turn off 1 windows test master per region, leaving 2 online per region.
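
As a rough sanity check of the proposal above, the following sketch estimates the average 14-day job count per master if only the proposed number of masters stays online in each pool. It assumes, hypothetically, that the total load of a pool stays constant and spreads evenly over the remaining masters, which the thread does not confirm. The function name is made up for illustration; the counts come from the AWS query output above.

```python
def projected_jobs_per_master(job_counts, keep):
    """Average jobs per remaining master if only `keep` masters stay online.

    Assumption: the pool's total load is unchanged and evenly distributed.
    """
    if not 0 < keep <= len(job_counts):
        raise ValueError("keep must be between 1 and the number of masters")
    return sum(job_counts.values()) / keep


# 14-day job counts per master, grouped by (region, pool).
pools = {
    ("use1", "try1"): {"bm75": 7, "bm76": 3},
    ("usw2", "try1"): {"bm78": 58, "bm79": 20},
    ("use1", "build1"): {"bm70": 494, "bm71": 341, "bm77": 308, "bm94": 156},
    ("usw2", "build1"): {"bm72": 327, "bm73": 699, "bm74": 517, "bm91": 636},
    ("use1", "tests1-windows"): {"bm128": 285, "bm137": 315, "bm138": 317},
    ("usw2", "tests1-windows"): {"bm129": 883, "bm139": 912, "bm140": 997},
}

# Number of masters left online per region for each pool, as proposed.
keep = {"try1": 1, "build1": 2, "tests1-windows": 2}

for (region, pool), counts in pools.items():
    busiest = max(counts.values())
    projected = projected_jobs_per_master(counts, keep[pool])
    print(f"{region} {pool}: busiest master now {busiest}, "
          f"projected average {projected:.1f} with {keep[pool]} online")
```

For comparison, the busiest SCL3 windows test master handled 41896 jobs in the same window, so the projected AWS figures are small next to it.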
Flags: needinfo?(catlee)
As Chris mentioned in Comment #3 we should probably decommission some masters from each pool so my suggestions are:

buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master

buildbot-master94.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master
buildbot-master77.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master 
buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
buildbot-master91.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master

buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master

I disabled all the masters above.
Chris, can we also remove these masters from tools, puppet, and nagios? Should I create the patches, or wait and monitor the logs until next week?
Since the disable, no major changes were found, so I think we'll be fine:

mysql> select claimed_by_name, count(*) as njobs from buildrequests where claimed_at > unix_timestamp(NOW() - INTERVAL 14 DAY) and claimed_by_name not like '%scl3%' group by claimed_by_name order by njobs asc;
+--------------------------------------------------------------------------------------+-------+
| claimed_by_name                                                                      | njobs |
+--------------------------------------------------------------------------------------+-------+
| buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master            |     5 |
| buildbot-master75.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master            |     8 |
| buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master            |    20 |
| buildbot-master78.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master            |    44 |
| buildbot-master94.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   183 |
| buildbot-master72.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   271 |
| buildbot-master52.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master  |   281 |
| buildbot-master128.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master |   289 |
| buildbot-master138.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master |   316 |
| buildbot-master137.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master |   317 |
| buildbot-master77.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   332 |
| buildbot-master71.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   416 |
| buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   447 |
| buildbot-master91.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   540 |
| buildbot-master70.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   576 |
| buildbot-master51.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master  |   618 |
| buildbot-master73.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   619 |
| buildbot-master53.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux64/master  |   627 |
| buildbot-master01.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master  |   814 |
| buildbot-master54.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux64/master  |   836 |
| buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master |   885 |
| buildbot-master139.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master |   978 |
| buildbot-master140.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master |  1078 |
| buildbot-master02.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master  |  1094 |
| buildbot-master05.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux32/master  |  1310 |
| buildbot-master04.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux32/master  |  1378 |
+--------------------------------------------------------------------------------------+-------+
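
The two AWS snapshots can be compared with a small helper like the one below; it is only an illustrative sketch, and job_deltas is a hypothetical name. One caveat: both queries cover a rolling 14-day window, so jobs claimed before a master was disabled still count toward its total, and a shorter window would show the effect of the disable more directly.

```python
def job_deltas(before, after):
    """Change in job count per master between two snapshots."""
    masters = sorted(set(before) | set(after))
    return {m: after.get(m, 0) - before.get(m, 0) for m in masters}


# Counts for the seven disabled masters, copied from the two query outputs.
before = {"bm76": 3, "bm79": 20, "bm94": 156, "bm77": 308,
          "bm74": 517, "bm91": 636, "bm129": 883}
after = {"bm76": 5, "bm79": 20, "bm94": 183, "bm77": 332,
         "bm74": 447, "bm91": 540, "bm129": 885}

for master, delta in job_deltas(before, after).items():
    print(f"{master}: {before[master]} -> {after[master]} ({delta:+d})")
```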
Flags: needinfo?(catlee)
Leaving them disabled for now is fine. We can shut off (but not terminate) the instances after the 57 change freeze is over.
Flags: needinfo?(catlee)
Now that the freeze is over, we should disable these. Andrei and I had a chat about this and he will make sure that we don't have any other services running on these masters.
Unfortunately, we still have the following services running on the build masters:

- buildbot-master94.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master   (funsize_scheduler)
- buildbot-master72.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/maste (selfserve_agent , buildbot_bridge,buildbot_bridge2)
- buildbot-master77.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master  (l10n_bumper)
- buildbot-master71.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master (selfserve_agent) 
- buildbot-master91.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master  (funsize_scheduler)
- buildbot-master70.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master (selfserve_agent)
- buildbot-master73.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master (selfserve_agent)

So I disabled:

- buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
- buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master
- buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
- buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master


The target would be to also select 2 use1 masters and 1 usw2 master from the first list to disable and shut down.
Chris, do you know if we can also shut down some masters which run services (funsize_scheduler, l10n_bumper, or selfserve_agent), or should we keep just the 4 disabled above?

The current status for the masters responsible for those services is:
- l10n_bumper
  - buildbot-master01.bb.releng.use1.mozilla.com - mozilla-beta
  - buildbot-master77.bb.releng.use1.mozilla.com - mozilla-central
- funsize_scheduler
  - buildbot-master91.bb.releng.usw2.mozilla.co
  - buildbot-master94.bb.releng.use1.mozilla.com
  - buildbot-master103.bb.releng.scl3.mozilla.com
- selfserve_agent
  - buildbot-master70.bb.releng.use1.mozilla.com
  - buildbot-master71.bb.releng.use1.mozilla.com
  - buildbot-master72.bb.releng.usw2.mozilla.com
  - buildbot-master73.bb.releng.usw2.mozilla.com
  - buildbot-master81.bb.releng.scl3.mozilla.com
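
The split between masters that can be shut down now and masters that are blocked by a service can be sketched as a simple membership check. This is a minimal illustration using the short bmNN names; split_candidates is a hypothetical helper, and the service map is copied from the status list above.

```python
# Service -> masters currently hosting it (from the status list above).
services = {
    "l10n_bumper": {"bm01", "bm77"},
    "funsize_scheduler": {"bm91", "bm94", "bm103"},
    "selfserve_agent": {"bm70", "bm71", "bm72", "bm73", "bm81"},
}

# Masters originally suggested for decommission earlier in this bug.
candidates = ["bm76", "bm79", "bm94", "bm77", "bm74", "bm91", "bm129"]


def split_candidates(candidates, services):
    """Return (service-free masters, {blocked master: [services on it]})."""
    safe, blocked = [], {}
    for master in candidates:
        running = sorted(s for s, hosts in services.items() if master in hosts)
        if running:
            blocked[master] = running
        else:
            safe.append(master)
    return safe, blocked


safe, blocked = split_candidates(candidates, services)
print("safe to shut down:", ", ".join(safe))
for master, running in blocked.items():
    print(f"{master} still runs: {', '.join(running)}")
```

With this input the service-free set is the same four masters listed as disabled above.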
Flags: needinfo?(catlee)
Rail, can you help me with a suggestion? I want to know if I can add to my "blacklist" other masters that are currently used for different services (l10n_bumper, selfserve_agent, funsize_scheduler), or should I only decommission those from the list below:

- buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
- buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master
- buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
- buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master

I added all the info about the current status in #c8.
Flags: needinfo?(rail)
Depends on: 1422872
(In reply to Andrei Obreja [:aobreja][:buildduty] from comment #8)

> The target would be to also select for disable and shutdown from the first
> list 2 use1 masters and 1 usw2 master.
> Chris do you know if we can also shutdown some masters which ran services
> like (funsize_scheduler, l10nbumper or selfserve)or should we keep just
> those 4 disabled above?
> 
> The current status for the masters responsible for those services is:
> - l10n_bumper
>   - buildbot-master01.bb.releng.use1.mozilla.com - mozilla-beta
>   - buildbot-master77.bb.releng.use1.mozilla.com - mozilla-central

> - funsize_scheduler
>   - buildbot-master91.bb.releng.usw2.mozilla.co
>   - buildbot-master94.bb.releng.use1.mozilla.com
>   - buildbot-master103.bb.releng.scl3.mozilla.com

bug 1422872 will remove funsize_scheduler

> - selfserve_agent
>   - buildbot-master70.bb.releng.use1.mozilla.com
>   - buildbot-master71.bb.releng.use1.mozilla.com
>   - buildbot-master72.bb.releng.usw2.mozilla.com
>   - buildbot-master73.bb.releng.usw2.mozilla.com
>   - buildbot-master81.bb.releng.scl3.mozilla.com

We can definitely reduce the amount of parallel selfserve_agents to, let's say 2?

(In reply to Andrei Obreja [:aobreja][:buildduty] from comment #9)
> Rail can you help me with a suggestion? I want to know if I can add to my
> "blacklist" other masters that are currently used for different
> services(l10n_bumper,self_agent,funsize_scheduler)

I'd add bma81, bm83, and bm85 - we use them for release runner.

>  or should I only decommission those from bellow list:
> 
> - buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
> -
> buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-
> windows/master
> - buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
> - buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master
> 
> I added all the info about the current status in #c8.

I just checked the masters you listed above in moco-nodes.pp, and it looks like it's safe to decommission them - nothing except buildbot is running there.

I hope it helps. Ping on IRC if you need more info.
Flags: needinfo?(rail)
Thank you Rail. After analysing all the options, I think the best one is to decomm the list of masters below:

- buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
- buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master
- buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
- buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master
- buildbot-master91.bb.releng.usw2.mozilla.com
- buildbot-master94.bb.releng.use1.mozilla.com
- buildbot-master70.bb.releng.use1.mozilla.com

>I'd add bma81, bm83, and bm85 - we use them for release runner.

As for the release runner masters, are we sure we don't need them anymore? These 3 machines are the only ones dedicated to release runner. I can add them for decommission if we are sure.
Flags: needinfo?(rail)
(In reply to Andrei Obreja [:aobreja][:buildduty] from comment #11)
> >I'd add bma81, bm83, and bm85 - we use them for release runner.
> 
> As for the release runner masters,are we sure we don't need them anymore?
> These 3 machines are the only one dedicated for release runner.I can add
> them for decommission if we are sure.

Sorry, I wasn't clear. We should keep them around, they are still in use.
Flags: needinfo?(rail)
Patch for puppet.
Attachment #8935337 - Flags: review?(rail)
Patch for tools.
Attachment #8935338 - Flags: review?(rail)
Patch for sysadmins puppet (nagios).
Attachment #8935339 - Flags: review?(rail)
Comment on attachment 8935337 [details] [diff] [review]
bug1410109_puppet.patch

Review of attachment 8935337 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM!
Attachment #8935337 - Flags: review?(rail) → review+
Attachment #8935338 - Flags: review?(rail) → review+
Attachment #8935339 - Flags: review?(rail) → review+
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
This briefly confused me with Thunderbird Release work, specifically the links in the wiki docs. I updated the wiki for now: https://wiki.mozilla.org/index.php?title=Release%3ARelease_Automation_on_Mercurial%3AUpdates_through_Shipping&type=revision&diff=1185649&oldid=1184682

This is mostly an FYI for next batch, so we can update the wiki if we kill off bm71
Flags: needinfo?(catlee)
Would have been good to remove the master entries in slavealloc too. I've dropped a Note for each master disabled here.

Product: Release Engineering → Infrastructure & Operations