Closed Bug 1410109 Opened 7 years ago Closed 7 years ago

investigate and determine if we can disable any masters after latest tcmigration cleanup

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: aobreja)

Attachments

(3 files)

We disabled over 1000 machines from the bb infra [1]. Let's determine if we can also disable some more masters.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1393774#c4
Assignee: nobody → aobreja
For the physical machines in scl3, I don't think we need to decommission any masters (check all masters [1]); we already decommissioned some physical machines in bug 1376279 and bug 1383266. We have some staging-personal masters which are used for tests, and a few masters remain for each OS.

scl3:
- for build and try: bm82-build1 bm83-try1 bm84-build1 bm85-build1 bm86-build1 bm87-try1
- for tests: bm103-tests1-linux bm104-tests1-linux bm105-tests1-linux bm106-tests1-macosx bm107-tests1-macosx bm109-tests1-windows bm110-tests1-windows bm111-tests1-windows

For AWS machines we have the following status:

use1:
- for build and try: bm70-build1 bm71-build1 bm75-try1 bm76-try1 bm76-try1 bm94-build1
- tests: bm01-tests1-linux32 bm02-tests1-linux32 bm51-tests1-linux64 bm52-tests1-linux64 bm137-tests1-windows bm138-tests1-windows

usw2:
- for build and try: bm72-build1 bm73-build1 bm74-build1 bm78-try1 bm79-try1 bm91-build1
- tests: bm04-tests1-linux32 bm05-tests1-linux32 bm53-tests1-linux64 bm54-tests1-linux64 bm139-tests1-windows bm140-tests1-windows bm128-tests1-windows bm129-tests1-windows

Maybe we can disable a few AWS masters, but I don't have a metric to measure how many masters will be needed when a release comes. Adding a NI here for Kim and Chris; maybe based on this status they can give us some advice.

[1] https://secure.pub.build.mozilla.org/slavealloc/ui/#masters
Flags: needinfo?(kmoir)
Flags: needinfo?(catlee)
For the AWS masters, I don't think that we run tests on release except for ESR releases, which are infrequent. For regular releases we still run a number of release-related jobs on buildbot that we are in the process of transitioning to tc. Have you looked at the number of jobs that run on the masters that are still in service? There used to be a buildbot graph on these pages, but I don't see data for it anymore:
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/buildbot-masters
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/ec2-dashboard
Flags: needinfo?(kmoir)
Looking at the DB for jobs in the past 14 days, for SCL3 we see:

mysql> select claimed_by_name, count(*) as njobs from buildrequests where claimed_at > unix_timestamp(NOW() - INTERVAL 14 DAY) and claimed_by_name like '%scl3%' group by claimed_by_name order by njobs asc;
+--------------------------------------------------------------------------------------+-------+
| claimed_by_name                                                                      | njobs |
+--------------------------------------------------------------------------------------+-------+
| buildbot-master83.bb.releng.scl3.mozilla.com:/builds/buildbot/try1/master            |    11 |
| buildbot-master87.bb.releng.scl3.mozilla.com:/builds/buildbot/try1/master            |    27 |
| buildbot-master86.bb.releng.scl3.mozilla.com:/builds/buildbot/build1/master          |   188 |
| buildbot-master82.bb.releng.scl3.mozilla.com:/builds/buildbot/build1/master          |   234 |
| buildbot-master84.bb.releng.scl3.mozilla.com:/builds/buildbot/build1/master          |   329 |
| buildbot-master85.bb.releng.scl3.mozilla.com:/builds/buildbot/build1/master          |   336 |
| buildbot-master107.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-macosx/master  |   442 |
| buildbot-master106.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-macosx/master  |  3764 |
| buildbot-master105.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-linux/master   | 14469 |
| buildbot-master104.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-linux/master   | 16901 |
| buildbot-master103.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-linux/master   | 18054 |
| buildbot-master110.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-windows/master | 36771 |
| buildbot-master111.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-windows/master | 38695 |
| buildbot-master109.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-windows/master | 41896 |
+--------------------------------------------------------------------------------------+-------+

There's not a lot we can do here yet. It's interesting that the macosx test load is so different between bm106/107.
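To put a number on how uneven the load is inside each SCL3 pool, here is a minimal Python sketch that computes each master's share of its pool's jobs. The counts are copied from the query output above; the short names (bm106-tests1-macosx and so on) follow the naming used earlier in this bug, and the helper names pool_of and pool_shares are made up for this illustration.

```python
from collections import defaultdict

# njobs per SCL3 master over the 14-day window, copied from the query output.
SCL3_JOBS = {
    "bm83-try1": 11,
    "bm87-try1": 27,
    "bm86-build1": 188,
    "bm82-build1": 234,
    "bm84-build1": 329,
    "bm85-build1": 336,
    "bm107-tests1-macosx": 442,
    "bm106-tests1-macosx": 3764,
    "bm105-tests1-linux": 14469,
    "bm104-tests1-linux": 16901,
    "bm103-tests1-linux": 18054,
    "bm110-tests1-windows": 36771,
    "bm111-tests1-windows": 38695,
    "bm109-tests1-windows": 41896,
}


def pool_of(short_name):
    """Return the pool part of a short name, e.g. 'tests1-macosx'."""
    return short_name.split("-", 1)[1]


def pool_shares(jobs):
    """Map each master to its fraction of the jobs claimed by its pool."""
    totals = defaultdict(int)
    for name, njobs in jobs.items():
        totals[pool_of(name)] += njobs
    return {name: njobs / totals[pool_of(name)] for name, njobs in jobs.items()}


SHARES = pool_shares(SCL3_JOBS)
for name in sorted(SHARES, key=lambda n: (pool_of(n), n)):
    print(f"{name:24s} {SCL3_JOBS[name]:6d} {SHARES[name]:6.1%}")
```

A share far from 1/N for a pool of N masters (as with the two macosx masters) points at an imbalance worth looking into before removing anything from that pool.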
For AWS we see:

mysql> select claimed_by_name, count(*) as njobs from buildrequests where claimed_at > unix_timestamp(NOW() - INTERVAL 14 DAY) and claimed_by_name not like '%scl3%' group by claimed_by_name order by njobs asc;
+--------------------------------------------------------------------------------------+-------+
| claimed_by_name                                                                      | njobs |
+--------------------------------------------------------------------------------------+-------+
| buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master            |     3 |
| buildbot-master75.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master            |     7 |
| buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master            |    20 |
| buildbot-master78.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master            |    58 |
| buildbot-master94.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   156 |
| buildbot-master52.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master  |   276 |
| buildbot-master128.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master |   285 |
| buildbot-master77.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   308 |
| buildbot-master137.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master |   315 |
| buildbot-master138.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master |   317 |
| buildbot-master72.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   327 |
| buildbot-master71.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   341 |
| buildbot-master70.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   494 |
| buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   517 |
| buildbot-master51.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master  |   596 |
| buildbot-master53.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux64/master  |   615 |
| buildbot-master91.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   636 |
| buildbot-master73.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   699 |
| buildbot-master01.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master  |   753 |
| buildbot-master54.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux64/master  |   805 |
| buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master |   883 |
| buildbot-master139.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master |   912 |
| buildbot-master140.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master |   997 |
| buildbot-master02.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master  |  1030 |
| buildbot-master05.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux32/master  |  1253 |
| buildbot-master04.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux32/master  |  1310 |
+--------------------------------------------------------------------------------------+-------+

We could probably turn off 1 try master per region, leaving only 1 per region. I think we could turn off 2 build masters per region, leaving 2 online per region. We could also turn off 1 windows test master per region, leaving 2 online per region.
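The per-region proposal above can be checked against the AWS numbers with a short Python sketch. It groups the masters by region and pool and applies the proposed turn-off counts. Picking the least-loaded masters first is only an illustrative heuristic assumed here, not something decided in this bug, and it ignores any other services a master may host. The names group_rows and plan are hypothetical.

```python
from collections import defaultdict

# "master-number region pool njobs", copied from the AWS query output.
AWS_ROWS = """
76 use1 try1 3
75 use1 try1 7
79 usw2 try1 20
78 usw2 try1 58
94 use1 build1 156
52 use1 tests1-linux64 276
128 use1 tests1-windows 285
77 use1 build1 308
137 use1 tests1-windows 315
138 use1 tests1-windows 317
72 usw2 build1 327
71 use1 build1 341
70 use1 build1 494
74 usw2 build1 517
51 use1 tests1-linux64 596
53 usw2 tests1-linux64 615
91 usw2 build1 636
73 usw2 build1 699
01 use1 tests1-linux32 753
54 usw2 tests1-linux64 805
129 usw2 tests1-windows 883
139 usw2 tests1-windows 912
140 usw2 tests1-windows 997
02 use1 tests1-linux32 1030
05 usw2 tests1-linux32 1253
04 usw2 tests1-linux32 1310
"""

# Masters to turn off per region for each pool, per the proposal.
TURN_OFF = {"try1": 1, "build1": 2, "tests1-windows": 1}


def group_rows(text):
    """Group (njobs, master) pairs by (region, pool)."""
    groups = defaultdict(list)
    for line in text.split("\n"):
        if not line.strip():
            continue
        num, region, pool, njobs = line.split()
        groups[(region, pool)].append((int(njobs), "bm" + num))
    return groups


def plan(groups, turn_off):
    """Assumed heuristic: turn off the least-loaded masters in each group."""
    result = {}
    for key, masters in groups.items():
        count = turn_off.get(key[1], 0)
        ordered = [name for _, name in sorted(masters)]
        result[key] = {"off": ordered[:count], "keep": ordered[count:]}
    return result


PLAN = plan(group_rows(AWS_ROWS), TURN_OFF)
for (region, pool), split in sorted(PLAN.items()):
    print(region, pool, "off:", split["off"], "keep:", split["keep"])
```

With these counts every affected pool ends up with the number of online masters named in the proposal (1 try, 2 build, 2 windows test per region); which specific masters to pick still needs a human check.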
Flags: needinfo?(catlee)
As Chris mentioned in Comment #3, we should probably decommission some masters from each pool, so my suggestions are:
- buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
- buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master
- buildbot-master94.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master
- buildbot-master77.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master
- buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
- buildbot-master91.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
- buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master

I disabled all the masters above.
Chris, can we also remove these masters from tools, puppet, and nagios? Should I create the patches, or wait and keep monitoring the logs until next week? No major changes have shown up since the masters were disabled, so I think we'll be fine:

mysql> select claimed_by_name, count(*) as njobs from buildrequests where claimed_at > unix_timestamp(NOW() - INTERVAL 14 DAY) and claimed_by_name not like '%scl3%' group by claimed_by_name order by njobs asc;
+--------------------------------------------------------------------------------------+-------+
| claimed_by_name                                                                      | njobs |
+--------------------------------------------------------------------------------------+-------+
| buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master            |     5 |
| buildbot-master75.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master            |     8 |
| buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master            |    20 |
| buildbot-master78.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master            |    44 |
| buildbot-master94.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   183 |
| buildbot-master72.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   271 |
| buildbot-master52.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master  |   281 |
| buildbot-master128.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master |   289 |
| buildbot-master138.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master |   316 |
| buildbot-master137.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master |   317 |
| buildbot-master77.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   332 |
| buildbot-master71.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   416 |
| buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   447 |
| buildbot-master91.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   540 |
| buildbot-master70.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master          |   576 |
| buildbot-master51.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master  |   618 |
| buildbot-master73.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master          |   619 |
| buildbot-master53.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux64/master  |   627 |
| buildbot-master01.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master  |   814 |
| buildbot-master54.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux64/master  |   836 |
| buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master |   885 |
| buildbot-master139.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master |   978 |
| buildbot-master140.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master |  1078 |
| buildbot-master02.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master  |  1094 |
| buildbot-master05.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux32/master  |  1310 |
| buildbot-master04.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux32/master  |  1378 |
+--------------------------------------------------------------------------------------+-------+
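For a quick comparison of the two AWS snapshots, a small Python sketch can list the change in njobs for the masters that were disabled. The counts are copied from the two query outputs in this bug. Both queries use a rolling 14-day window, so the windows overlap and a disabled master's count is expected to decay only gradually; the time elapsed between the two queries is not stated here, so the deltas are indicative at best.

```python
# njobs for the disabled masters, copied from the earlier and later query outputs.
BEFORE = {"bm76": 3, "bm79": 20, "bm94": 156, "bm77": 308,
          "bm74": 517, "bm91": 636, "bm129": 883}
AFTER = {"bm76": 5, "bm79": 20, "bm94": 183, "bm77": 332,
         "bm74": 447, "bm91": 540, "bm129": 885}


def deltas(before, after):
    """Change in njobs per master between the two snapshots."""
    return {name: after[name] - before[name] for name in before}


DELTAS = deltas(BEFORE, AFTER)
for name, change in sorted(DELTAS.items(), key=lambda item: item[1]):
    print(f"{name:6s} {BEFORE[name]:5d} -> {AFTER[name]:5d} ({change:+d})")
```

Re-running the query once 14 days have passed since the disable would be the cleaner check: by then the disabled masters should show no claimed jobs at all.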
Flags: needinfo?(catlee)
Leaving them disabled for now is fine. We can shut off (but not terminate) the instances after the 57 change freeze is over.
Flags: needinfo?(catlee)
Now that the freeze is over, we should disable these. Andrei and I had a chat about this and he will make sure that we don't have any other services running on these masters.
Unfortunately we still have the following services running on the build masters:
- buildbot-master94.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master (funsize_scheduler)
- buildbot-master72.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/maste (selfserve_agent, buildbot_bridge, buildbot_bridge2)
- buildbot-master77.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master (l10n_bumper)
- buildbot-master71.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master (selfserve_agent)
- buildbot-master91.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master (funsize_scheduler)
- buildbot-master70.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master (selfserve_agent)
- buildbot-master73.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master (selfserve_agent)

So I disabled:
- buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
- buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master
- buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
- buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master

The target would be to also select 2 use1 masters and 1 usw2 master from the first list to disable and shut down. Chris, do you know if we can also shut down some masters that run services like funsize_scheduler, l10n_bumper, or selfserve_agent, or should we keep just the 4 disabled above?

The current status for the masters responsible for those services is:
- l10n_bumper
  - buildbot-master01.bb.releng.use1.mozilla.com - mozilla-beta
  - buildbot-master77.bb.releng.use1.mozilla.com - mozilla-central
- funsize_scheduler
  - buildbot-master91.bb.releng.usw2.mozilla.co
  - buildbot-master94.bb.releng.use1.mozilla.com
  - buildbot-master103.bb.releng.scl3.mozilla.com
- selfserve_agent
  - buildbot-master70.bb.releng.use1.mozilla.com
  - buildbot-master71.bb.releng.use1.mozilla.com
  - buildbot-master72.bb.releng.usw2.mozilla.com
  - buildbot-master73.bb.releng.usw2.mozilla.com
  - buildbot-master81.bb.releng.scl3.mozilla.com
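The check described in this comment amounts to splitting the decommission candidates into those that host nothing besides buildbot and those that still host a service. A minimal Python sketch of that split follows; the candidate list and the service mapping are copied from this bug, while the short bmNN names and the helper split_candidates are only illustrative.

```python
# Masters proposed for decommission earlier in this bug.
CANDIDATES = ["bm76", "bm79", "bm94", "bm77", "bm74", "bm91", "bm129"]

# Extra services found on the build masters, as listed in this comment.
SERVICES = {
    "bm94": ["funsize_scheduler"],
    "bm72": ["selfserve_agent", "buildbot_bridge", "buildbot_bridge2"],
    "bm77": ["l10n_bumper"],
    "bm71": ["selfserve_agent"],
    "bm91": ["funsize_scheduler"],
    "bm70": ["selfserve_agent"],
    "bm73": ["selfserve_agent"],
}


def split_candidates(candidates, services):
    """Return (masters with no extra services, {master: blocking services})."""
    free = [name for name in candidates if not services.get(name)]
    blocked = {name: services[name] for name in candidates if services.get(name)}
    return free, blocked


FREE, BLOCKED = split_candidates(CANDIDATES, SERVICES)
print("safe to shut down:", FREE)
for name, running in BLOCKED.items():
    print(f"{name} still runs: {', '.join(running)}")
```

The free set matches the four masters that stayed disabled; the blocked ones need their services moved or retired first.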
Flags: needinfo?(catlee)
Rail, can you help me with a suggestion? I want to know if I can add to my "blacklist" other masters that are currently used for different services (l10n_bumper, selfserve_agent, funsize_scheduler), or should I only decommission those from the list below:
- buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
- buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master
- buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
- buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master

I added all the info about the current status in #c8.
Flags: needinfo?(rail)
Depends on: 1422872
(In reply to Andrei Obreja [:aobreja][:buildduty] from comment #8)
> The target would be to also select for disable and shutdown from the first
> list 2 use1 masters and 1 usw2 master.
> Chris do you know if we can also shutdown some masters which ran services
> like (funsize_scheduler, l10nbumper or selfserve)or should we keep just
> those 4 disabled above?
>
> The current status for the masters responsible for those services is:
> - l10n_bumper
>   - buildbot-master01.bb.releng.use1.mozilla.com - mozilla-beta
>   - buildbot-master77.bb.releng.use1.mozilla.com - mozilla-central
> - funsize_scheduler
>   - buildbot-master91.bb.releng.usw2.mozilla.co
>   - buildbot-master94.bb.releng.use1.mozilla.com
>   - buildbot-master103.bb.releng.scl3.mozilla.com

bug 1422872 will remove funsize_scheduler

> - selfserve_agent
>   - buildbot-master70.bb.releng.use1.mozilla.com
>   - buildbot-master71.bb.releng.use1.mozilla.com
>   - buildbot-master72.bb.releng.usw2.mozilla.com
>   - buildbot-master73.bb.releng.usw2.mozilla.com
>   - buildbot-master81.bb.releng.scl3.mozilla.com

We can definitely reduce the amount of parallel selfserve_agents to, let's say 2?

(In reply to Andrei Obreja [:aobreja][:buildduty] from comment #9)
> Rail can you help me with a suggestion? I want to know if I can add to my
> "blacklist" other masters that are currently used for different
> services(l10n_bumper,self_agent,funsize_scheduler)

I'd add bma81, bm83, and bm85 - we use them for release runner.

> or should I only decommission those from bellow list:
>
> - buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
> - buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master
> - buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
> - buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master
>
> I added all the info about the current status in #c8.

I just checked the masters you listed above in moco-nodes.pp, and it looks like it's safe to decommission them - nothing except buildbot is running there. I hope it helps. Ping on IRC if you need more info.
Flags: needinfo?(rail)
Thank you Rail. After analysing all the options, I think the best one is to decommission the masters listed below:
- buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
- buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master
- buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
- buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master
- buildbot-master91.bb.releng.usw2.mozilla.com
- buildbot-master94.bb.releng.use1.mozilla.com
- buildbot-master70.bb.releng.use1.mozilla.com

> I'd add bma81, bm83, and bm85 - we use them for release runner.

As for the release runner masters, are we sure we don't need them anymore? These 3 machines are the only ones dedicated to release runner. I can add them for decommission if we are sure.
Flags: needinfo?(rail)
(In reply to Andrei Obreja [:aobreja][:buildduty] from comment #11)
> >I'd add bma81, bm83, and bm85 - we use them for release runner.
>
> As for the release runner masters,are we sure we don't need them anymore?
> These 3 machines are the only one dedicated for release runner.I can add
> them for decommission if we are sure.

Sorry, I wasn't clear. We should keep them around, they are still in use.
Flags: needinfo?(rail)
Patch for puppet.
Attachment #8935337 - Flags: review?(rail)
Patch for tools.
Attachment #8935338 - Flags: review?(rail)
Patch for sysadmins puppet (nagios).
Attachment #8935339 - Flags: review?(rail)
Comment on attachment 8935337
bug1410109_puppet.patch

Review of attachment 8935337:
-----------------------------------------------------------------

LGTM!
Attachment #8935337 - Flags: review?(rail) → review+
Attachment #8935338 - Flags: review?(rail) → review+
Attachment #8935339 - Flags: review?(rail) → review+
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
This briefly confused me with Thunderbird Release work, specifically the links in the wiki docs. I updated the wiki for now:
https://wiki.mozilla.org/index.php?title=Release%3ARelease_Automation_on_Mercurial%3AUpdates_through_Shipping&type=revision&diff=1185649&oldid=1184682

This is mostly an FYI for the next batch, so we can update the wiki if we kill off bm71.
Flags: needinfo?(catlee)
Would have been good to remove the master entries in slavealloc too. I've dropped a Note for each master disabled here.
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard