Closed
Bug 1410109
Opened 7 years ago
Closed 7 years ago
investigate and determine if we can disable any masters after latest tcmigration cleanup
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jlund, Assigned: aobreja)
References
Details
Attachments
(3 files)
4.17 KB, patch | rail: review+ | aobreja: checked-in+
10.04 KB, patch | rail: review+ | aobreja: checked-in+
3.37 KB, patch | rail: review+ | aobreja: checked-in+
We disabled over 1000 machines from the bb infra [1]. Let's determine whether we can also disable some more masters.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1393774#c4
Reporter | ||
Updated•7 years ago
|
Blocks: tcmigration_cleanup
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → aobreja
Assignee | ||
Comment 1•7 years ago
|
||
For the physical machines in scl3, I don't think we need to decomm any masters (see the full list of masters [1]); we already decommissioned some physical machines in bug 1376279 and bug 1383266.
We have some staging-personal masters that are used for tests, and a few masters remain for each OS.
for build and try in scl3:
bm82-build1
bm83-try1
bm84-build1
bm85-build1
bm86-build1
bm87-try1
for tests:
bm103-tests1-linux
bm104-tests1-linux
bm105-tests1-linux
bm106-tests1-macosx
bm107-tests1-macosx
bm109-tests1-windows
bm110-tests1-windows
bm111-tests1-windows
For AWS machines we have the following status:
use1:
-for build and try:
bm70-build1
bm71-build1
bm75-try1
bm76-try1
bm94-build1
-tests:
bm01-tests1-linux32
bm02-tests1-linux32
bm51-tests1-linux64
bm52-tests1-linux64
bm137-tests1-windows
bm138-tests1-windows
usw2:
-for build and try:
bm72-build1
bm73-build1
bm74-build1
bm78-try1
bm79-try1
bm91-build1
-tests:
bm04-tests1-linux32
bm05-tests1-linux32
bm53-tests1-linux64
bm54-tests1-linux64
bm139-tests1-windows
bm140-tests1-windows
bm128-tests1-windows
bm129-tests1-windows
Maybe we can disable a few AWS masters, but I don't have a metric to measure how many masters will be needed when the release comes.
Adding an NI here for Kim and Chris; maybe they can give us some advice based on this status.
[1] https://secure.pub.build.mozilla.org/slavealloc/ui/#masters
Flags: needinfo?(kmoir)
Flags: needinfo?(catlee)
Comment 2•7 years ago
|
||
For the AWS masters, I don't think we run tests on releases, except ESR releases, which are infrequent. For regular releases we still run a number of release-related jobs on buildbot that we are in the process of transitioning to tc. Have you looked at the number of jobs that run on the masters that are still in service?
There used to be a buildbot graph on these pages, but I don't see data for it anymore:
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/buildbot-masters
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/ec2-dashboard
Flags: needinfo?(kmoir)
Comment 3•7 years ago
|
||
Looking at the DB for jobs in the past 14 days, for SCL3 we see:
mysql> select claimed_by_name, count(*) as njobs from buildrequests where claimed_at > unix_timestamp(NOW() - INTERVAL 14 DAY) and claimed_by_name like '%scl3%' group by claimed_by_name order by njobs asc;
+--------------------------------------------------------------------------------------+-------+
| claimed_by_name | njobs |
+--------------------------------------------------------------------------------------+-------+
| buildbot-master83.bb.releng.scl3.mozilla.com:/builds/buildbot/try1/master | 11 |
| buildbot-master87.bb.releng.scl3.mozilla.com:/builds/buildbot/try1/master | 27 |
| buildbot-master86.bb.releng.scl3.mozilla.com:/builds/buildbot/build1/master | 188 |
| buildbot-master82.bb.releng.scl3.mozilla.com:/builds/buildbot/build1/master | 234 |
| buildbot-master84.bb.releng.scl3.mozilla.com:/builds/buildbot/build1/master | 329 |
| buildbot-master85.bb.releng.scl3.mozilla.com:/builds/buildbot/build1/master | 336 |
| buildbot-master107.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-macosx/master | 442 |
| buildbot-master106.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-macosx/master | 3764 |
| buildbot-master105.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-linux/master | 14469 |
| buildbot-master104.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-linux/master | 16901 |
| buildbot-master103.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-linux/master | 18054 |
| buildbot-master110.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-windows/master | 36771 |
| buildbot-master111.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-windows/master | 38695 |
| buildbot-master109.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-windows/master | 41896 |
+--------------------------------------------------------------------------------------+-------+
There's not a lot we can do here yet. It's interesting that the macosx test load is so different between bm106/107.
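To put a number on the bm106/bm107 difference, here is a minimal sketch that parses the claimed_by_name values from the SCL3 output above and computes, per pool, the ratio between the busiest and the least busy master. The helper names (parse_master, imbalance_by_pool) and the regular expression are illustrative assumptions, not part of any releng tooling.

```python
import re
from collections import defaultdict

# Job counts copied from the SCL3 query output above (14-day window).
PREFIX = "buildbot-master{n}.bb.releng.scl3.mozilla.com:/builds/buildbot/{pool}/master"
SCL3_JOBS = {
    PREFIX.format(n=n, pool=pool): njobs
    for n, pool, njobs in [
        (83, "try1", 11), (87, "try1", 27),
        (86, "build1", 188), (82, "build1", 234),
        (84, "build1", 329), (85, "build1", 336),
        (107, "tests1-macosx", 442), (106, "tests1-macosx", 3764),
        (105, "tests1-linux", 14469), (104, "tests1-linux", 16901),
        (103, "tests1-linux", 18054),
        (110, "tests1-windows", 36771), (111, "tests1-windows", 38695),
        (109, "tests1-windows", 41896),
    ]
}

# Hypothetical pattern: master number, region, and pool from claimed_by_name.
NAME_RE = re.compile(
    r"buildbot-master(\d+)\.bb\.releng\.(\w+)\.mozilla\.com:/builds/buildbot/([\w-]+)/master"
)


def parse_master(name):
    """Split a claimed_by_name value into (short name, region, pool)."""
    m = NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognised master name: {name}")
    return f"bm{m.group(1)}", m.group(2), m.group(3)


def imbalance_by_pool(jobs):
    """Return {pool: max_jobs / min_jobs} across the masters of each pool."""
    per_pool = defaultdict(list)
    for name, njobs in jobs.items():
        _, _, pool = parse_master(name)
        per_pool[pool].append(njobs)
    return {pool: max(counts) / min(counts) for pool, counts in per_pool.items()}


ratios = imbalance_by_pool(SCL3_JOBS)
for pool, ratio in sorted(ratios.items()):
    print(f"{pool}: busiest master claimed {ratio:.1f}x the jobs of the least busy")
```

With these figures the macosx pool stands out: its busiest master claimed roughly 8.5 times as many jobs as the other one, while the linux and windows test pools stay well under 1.5.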
For AWS we see:
mysql> select claimed_by_name, count(*) as njobs from buildrequests where claimed_at > unix_timestamp(NOW() - INTERVAL 14 DAY) and claimed_by_name not like '%scl3%' group by claimed_by_name order by njobs asc;
+--------------------------------------------------------------------------------------+-------+
| claimed_by_name | njobs |
+--------------------------------------------------------------------------------------+-------+
| buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master | 3 |
| buildbot-master75.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master | 7 |
| buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master | 20 |
| buildbot-master78.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master | 58 |
| buildbot-master94.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master | 156 |
| buildbot-master52.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master | 276 |
| buildbot-master128.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master | 285 |
| buildbot-master77.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master | 308 |
| buildbot-master137.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master | 315 |
| buildbot-master138.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master | 317 |
| buildbot-master72.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master | 327 |
| buildbot-master71.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master | 341 |
| buildbot-master70.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master | 494 |
| buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master | 517 |
| buildbot-master51.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master | 596 |
| buildbot-master53.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux64/master | 615 |
| buildbot-master91.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master | 636 |
| buildbot-master73.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master | 699 |
| buildbot-master01.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master | 753 |
| buildbot-master54.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux64/master | 805 |
| buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master | 883 |
| buildbot-master139.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master | 912 |
| buildbot-master140.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master | 997 |
| buildbot-master02.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master | 1030 |
| buildbot-master05.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux32/master | 1253 |
| buildbot-master04.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux32/master | 1310 |
+--------------------------------------------------------------------------------------+-------+
We could probably turn off 1 try master per region, leaving only 1 per region.
I think we could turn off 2 build masters per region, leaving 2 online per region.
We could also turn off 1 windows test master per region, leaving 2 online per region.
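The three proposals above can be cross-checked against the AWS query output with a small sketch. The master, region, and pool tuples are transcribed from the table; the function name and the data layout are assumptions made for illustration only. The linux32 and linux64 test pools are left out because no reduction is proposed for them.

```python
from collections import Counter

# (master, region, pool) for the AWS try, build, and windows test masters
# listed in the query output above.
AWS_MASTERS = [
    ("bm76", "use1", "try1"), ("bm75", "use1", "try1"),
    ("bm79", "usw2", "try1"), ("bm78", "usw2", "try1"),
    ("bm94", "use1", "build1"), ("bm77", "use1", "build1"),
    ("bm71", "use1", "build1"), ("bm70", "use1", "build1"),
    ("bm72", "usw2", "build1"), ("bm74", "usw2", "build1"),
    ("bm91", "usw2", "build1"), ("bm73", "usw2", "build1"),
    ("bm128", "use1", "tests1-windows"), ("bm137", "use1", "tests1-windows"),
    ("bm138", "use1", "tests1-windows"), ("bm129", "usw2", "tests1-windows"),
    ("bm139", "usw2", "tests1-windows"), ("bm140", "usw2", "tests1-windows"),
]

# Number of masters to turn off per region for each pool, as proposed above.
TURN_OFF_PER_REGION = {"try1": 1, "build1": 2, "tests1-windows": 1}


def remaining_after_reduction(masters, turn_off):
    """Return {(region, pool): masters left online} after the proposed cut."""
    current = Counter((region, pool) for _, region, pool in masters)
    return {key: n - turn_off.get(key[1], 0) for key, n in current.items()}


remaining = remaining_after_reduction(AWS_MASTERS, TURN_OFF_PER_REGION)
for (region, pool), left in sorted(remaining.items()):
    print(f"{region} {pool}: {left} left online")
```

The result matches the numbers stated above: 1 try master, 2 build masters, and 2 windows test masters left online in each of use1 and usw2.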
Flags: needinfo?(catlee)
Assignee | ||
Comment 4•7 years ago
|
||
As Chris mentioned in Comment #3 we should probably decommission some masters from each pool so my suggestions are:
buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master
buildbot-master94.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master
buildbot-master77.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master
buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
buildbot-master91.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master
I disabled all the masters above.
Assignee | ||
Comment 5•7 years ago
|
||
Chris, can we also remove these masters from tools, puppet, and nagios? Should I create the patches, or wait and monitor the logs until next week?
No major changes have been seen since the disable, so I think we'll be fine:
mysql> select claimed_by_name, count(*) as njobs from buildrequests where claimed_at > unix_timestamp(NOW() - INTERVAL 14 DAY) and claimed_by_name not like '%scl3%' group by claimed_by_name order by njobs asc;
+--------------------------------------------------------------------------------------+-------+
| claimed_by_name | njobs |
+--------------------------------------------------------------------------------------+-------+
| buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master | 5 |
| buildbot-master75.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master | 8 |
| buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master | 20 |
| buildbot-master78.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master | 44 |
| buildbot-master94.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master | 183 |
| buildbot-master72.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master | 271 |
| buildbot-master52.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master | 281 |
| buildbot-master128.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master | 289 |
| buildbot-master138.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master | 316 |
| buildbot-master137.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-windows/master | 317 |
| buildbot-master77.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master | 332 |
| buildbot-master71.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master | 416 |
| buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master | 447 |
| buildbot-master91.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master | 540 |
| buildbot-master70.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master | 576 |
| buildbot-master51.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux64/master | 618 |
| buildbot-master73.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master | 619 |
| buildbot-master53.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux64/master | 627 |
| buildbot-master01.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master | 814 |
| buildbot-master54.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux64/master | 836 |
| buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master | 885 |
| buildbot-master139.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master | 978 |
| buildbot-master140.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master | 1078 |
| buildbot-master02.bb.releng.use1.mozilla.com:/builds/buildbot/tests1-linux32/master | 1094 |
| buildbot-master05.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux32/master | 1310 |
| buildbot-master04.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-linux32/master | 1378 |
+--------------------------------------------------------------------------------------+-------+
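For reference, a minimal sketch comparing the two snapshots for the seven masters disabled in comment 4. The counts are copied from the tables in comment 3 and in this comment; the function name is hypothetical. Both queries cover a rolling 14-day window, so the second snapshot may still include jobs claimed before the masters were disabled, and a small delta on its own says little.

```python
# 14-day job counts for the masters disabled in comment 4:
# (count from comment 3, count from this comment).
SNAPSHOTS = {
    "bm76": (3, 5),
    "bm79": (20, 20),
    "bm94": (156, 183),
    "bm77": (308, 332),
    "bm74": (517, 447),
    "bm91": (636, 540),
    "bm129": (883, 885),
}


def deltas(snapshots):
    """Return {master: later count minus earlier count}."""
    return {master: after - before for master, (before, after) in snapshots.items()}


changes = deltas(SNAPSHOTS)
for master, change in changes.items():
    print(f"{master}: {change:+d} jobs between the two snapshots")
```

A more direct check, assuming the disable time is known, would be to restrict claimed_at to the period after that time and confirm that the disabled masters claim nothing new.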
Flags: needinfo?(catlee)
Comment 6•7 years ago
|
||
Leaving them disabled for now is fine. We can shut off (but not terminate) the instances after the 57 change freeze is over.
Flags: needinfo?(catlee)
Reporter | ||
Comment 7•7 years ago
|
||
Now that the freeze is over, we should disable these. Andrei and I had a chat about this and he will make sure that we don't have any other services running on these masters.
Assignee | ||
Comment 8•7 years ago
|
||
Unfortunately we still have the following services running on the build masters:
- buildbot-master94.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master (funsize_scheduler)
- buildbot-master72.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/maste (selfserve_agent , buildbot_bridge,buildbot_bridge2)
- buildbot-master77.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master (l10n_bumper)
- buildbot-master71.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master (selfserve_agent)
- buildbot-master91.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master (funsize_scheduler)
- buildbot-master70.bb.releng.use1.mozilla.com:/builds/buildbot/build1/master (selfserve_agent)
- buildbot-master73.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master (selfserve_agent)
So I disabled:
- buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
- buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master
- buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
- buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master
The target would be to also select 2 use1 masters and 1 usw2 master from the first list to disable and shut down.
Chris, do you know if we can also shut down some masters that run services like funsize_scheduler, l10n_bumper, or selfserve_agent, or should we keep just those 4 disabled above?
The current status for the masters responsible for those services is:
- l10n_bumper
- buildbot-master01.bb.releng.use1.mozilla.com - mozilla-beta
- buildbot-master77.bb.releng.use1.mozilla.com - mozilla-central
- funsize_scheduler
- buildbot-master91.bb.releng.usw2.mozilla.co
- buildbot-master94.bb.releng.use1.mozilla.com
- buildbot-master103.bb.releng.scl3.mozilla.com
- selfserve_agent
- buildbot-master70.bb.releng.use1.mozilla.com
- buildbot-master71.bb.releng.use1.mozilla.com
- buildbot-master72.bb.releng.usw2.mozilla.com
- buildbot-master73.bb.releng.usw2.mozilla.com
- buildbot-master81.bb.releng.scl3.mozilla.com
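The split between the masters that still host services and the ones that were disabled can be sketched as follows. The service map is transcribed from the first list in this comment and the candidates from comment 4; the function name is hypothetical, and the sketch assumes that a master absent from the service map runs nothing besides buildbot.

```python
# Services still running on build masters, per the first list in this comment.
SERVICES = {
    "bm94": ["funsize_scheduler"],
    "bm72": ["selfserve_agent", "buildbot_bridge", "buildbot_bridge2"],
    "bm77": ["l10n_bumper"],
    "bm71": ["selfserve_agent"],
    "bm91": ["funsize_scheduler"],
    "bm70": ["selfserve_agent"],
    "bm73": ["selfserve_agent"],
}

# Masters proposed for decommission in comment 4.
CANDIDATES = ["bm76", "bm79", "bm94", "bm77", "bm74", "bm91", "bm129"]


def split_candidates(candidates, services):
    """Separate candidates with no known services from those still hosting some."""
    safe = [m for m in candidates if not services.get(m)]
    blocked = {m: services[m] for m in candidates if services.get(m)}
    return safe, blocked


safe, blocked = split_candidates(CANDIDATES, SERVICES)
print("no extra services:", ", ".join(safe))
for master, running in blocked.items():
    print(f"{master} still runs: {', '.join(running)}")
```

The masters with no extra services are the same four disabled above (bm74, bm129, bm76, bm79); bm94, bm77, and bm91 need their services moved or retired first.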
Flags: needinfo?(catlee)
Assignee | ||
Comment 9•7 years ago
|
||
Rail, can you help me with a suggestion? I want to know if I can add to my "blacklist" other masters that are currently used for different services (l10n_bumper, selfserve_agent, funsize_scheduler), or should I only decommission those from the list below:
- buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
- buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master
- buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
- buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master
I added all the info about the current status in #c8.
Flags: needinfo?(rail)
Comment 10•7 years ago
|
||
(In reply to Andrei Obreja [:aobreja][:buildduty] from comment #8)
> The target would be to also select for disable and shutdown from the first
> list 2 use1 masters and 1 usw2 master.
> Chris do you know if we can also shutdown some masters which ran services
> like (funsize_scheduler, l10nbumper or selfserve)or should we keep just
> those 4 disabled above?
>
> The current status for the masters responsible for those services is:
> - l10n_bumper
> - buildbot-master01.bb.releng.use1.mozilla.com - mozilla-beta
> - buildbot-master77.bb.releng.use1.mozilla.com - mozilla-central
> - funsize_scheduler
> - buildbot-master91.bb.releng.usw2.mozilla.co
> - buildbot-master94.bb.releng.use1.mozilla.com
> - buildbot-master103.bb.releng.scl3.mozilla.com
bug 1422872 will remove funsize_scheduler
> - selfserve_agent
> - buildbot-master70.bb.releng.use1.mozilla.com
> - buildbot-master71.bb.releng.use1.mozilla.com
> - buildbot-master72.bb.releng.usw2.mozilla.com
> - buildbot-master73.bb.releng.usw2.mozilla.com
> - buildbot-master81.bb.releng.scl3.mozilla.com
We can definitely reduce the number of parallel selfserve_agents to, let's say, 2?
(In reply to Andrei Obreja [:aobreja][:buildduty] from comment #9)
> Rail can you help me with a suggestion? I want to know if I can add to my
> "blacklist" other masters that are currently used for different
> services(l10n_bumper,self_agent,funsize_scheduler)
I'd add bma81, bm83, and bm85 - we use them for release runner.
> or should I only decommission those from bellow list:
>
> - buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
> -
> buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-
> windows/master
> - buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
> - buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master
>
> I added all the info about the current status in #c8.
I just checked the masters you listed above in moco-nodes.pp, and it looks like it's safe to decommission them - nothing except buildbot is running there.
I hope it helps. Ping on IRC if you need more info.
Flags: needinfo?(rail)
Assignee | ||
Comment 11•7 years ago
|
||
Thank you, Rail. After analysing all the options, I think the best one is to decomm the masters in the list below:
- buildbot-master74.bb.releng.usw2.mozilla.com:/builds/buildbot/build1/master
- buildbot-master129.bb.releng.usw2.mozilla.com:/builds/buildbot/tests1-windows/master
- buildbot-master76.bb.releng.use1.mozilla.com:/builds/buildbot/try1/master
- buildbot-master79.bb.releng.usw2.mozilla.com:/builds/buildbot/try1/master
- buildbot-master91.bb.releng.usw2.mozilla.com
- buildbot-master94.bb.releng.use1.mozilla.com
- buildbot-master70.bb.releng.use1.mozilla.com
>I'd add bma81, bm83, and bm85 - we use them for release runner.
As for the release runner masters, are we sure we don't need them anymore? These 3 machines are the only ones dedicated to release runner. I can add them for decommission if we are sure.
Flags: needinfo?(rail)
Comment 12•7 years ago
|
||
(In reply to Andrei Obreja [:aobreja][:buildduty] from comment #11)
> >I'd add bma81, bm83, and bm85 - we use them for release runner.
>
> As for the release runner masters,are we sure we don't need them anymore?
> These 3 machines are the only one dedicated for release runner.I can add
> them for decommission if we are sure.
Sorry, I wasn't clear. We should keep them around, they are still in use.
Flags: needinfo?(rail)
Assignee | ||
Comment 15•7 years ago
|
||
Patch for sysadmins puppet (nagios).
Attachment #8935339 -
Flags: review?(rail)
Comment 16•7 years ago
|
||
Comment on attachment 8935337 [details] [diff] [review]
bug1410109_puppet.patch
Review of attachment 8935337 [details] [diff] [review]:
-----------------------------------------------------------------
LGTM!
Attachment #8935337 -
Flags: review?(rail) → review+
Updated•7 years ago
|
Attachment #8935338 -
Flags: review?(rail) → review+
Updated•7 years ago
|
Attachment #8935339 -
Flags: review?(rail) → review+
Assignee | ||
Comment 17•7 years ago
|
||
Comment on attachment 8935337 [details] [diff] [review]
bug1410109_puppet.patch
https://hg.mozilla.org/build/puppet/rev/a3e596e1c0c9f7ffb149eefe05233e023ce21e79
Attachment #8935337 -
Flags: checked-in+
Assignee | ||
Updated•7 years ago
|
Attachment #8935339 -
Flags: checked-in+
Assignee | ||
Updated•7 years ago
|
Attachment #8935338 -
Flags: checked-in+
Assignee | ||
Updated•7 years ago
|
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment 18•7 years ago
|
||
This briefly confused me with Thunderbird Release work, specifically the links in the wiki docs. I updated the wiki for now: https://wiki.mozilla.org/index.php?title=Release%3ARelease_Automation_on_Mercurial%3AUpdates_through_Shipping&type=revision&diff=1185649&oldid=1184682
This is mostly an FYI for next batch, so we can update the wiki if we kill off bm71
Updated•7 years ago
|
Flags: needinfo?(catlee)
Comment 19•7 years ago
|
||
Would have been good to remove the master entries in slavealloc too. I've dropped a Note for each master disabled here.
Updated•7 years ago
|
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard