Decommission VMWare VM builders - production slaves (non-try)

Status: RESOLVED FIXED
Product: Infrastructure & Operations
Component: Virtualization
Priority: P2
Severity: normal
Reported: 5 years ago
Last updated: 3 years ago

People

(Reporter: nthomas, Assigned: dustin)

Tracking

Keywords: spring-cleaning

Details

(Reporter)

Description

5 years ago
The non-try VMWare VMs are still building things like
* beta/release on-change
* beta/release/esr releases
* inbound spidermonkey
* fuzzing (preferentially over ix hardware)
This will get smaller with bug 798361 (bug 772446) but 
<nthomas>	how much of a difference would disabling them make ?
<lerxst>	nthomas: gigantic. those 50 build VMs together are causing the same IO load you'd expect from 500 normal VMs
but we're ok right now, where "ok" is defined as "overloaded but not currently on fire"

Going to disable them in slavealloc initially to help the netapp, and monitor the wait times for issues.

Slave list (from buildbot-configs):
bld-centos5-32-vmw-001
bld-centos5-32-vmw-002
bld-centos5-32-vmw-003
bld-centos5-32-vmw-004
bld-centos5-32-vmw-005
bld-centos5-32-vmw-006
bld-centos5-32-vmw-007
bld-centos5-32-vmw-008
bld-centos5-32-vmw-009
bld-centos5-32-vmw-010
bld-centos5-32-vmw-011
bld-centos5-32-vmw-012
bld-centos5-32-vmw-013
bld-centos5-32-vmw-014
bld-centos5-32-vmw-015
bld-centos5-32-vmw-016
bld-centos5-32-vmw-017
bld-centos5-32-vmw-018
bld-centos5-32-vmw-019
bld-centos5-32-vmw-020
bld-centos5-32-vmw-021
bld-centos5-32-vmw-022
bld-centos5-64-vmw-001
bld-centos5-64-vmw-002
bld-centos5-64-vmw-003
bld-centos5-64-vmw-004
bld-centos5-64-vmw-005
bld-centos5-64-vmw-006

Those are all in 'production', there are no staging slaves of this class.
(Reporter)

Comment 1

5 years ago
Disabled in slavealloc, and did a graceful buildbot shutdown on them all to catch the fuzzer jobs.
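A pass like this (disable in slavealloc, then graceful shutdown) is easy to script against the host list in comment 0. A minimal sketch that only expands the naming scheme from the slave list above; the actual disable/shutdown mechanism isn't shown in this bug, so it is left out:

```python
# Expand the slave names from comment 0 into a flat host list,
# e.g. for iterating over when disabling slaves in slavealloc or
# requesting graceful buildbot shutdowns.
def vmw_hosts():
    hosts = ["bld-centos5-32-vmw-%03d" % i for i in range(1, 23)]  # 001..022
    hosts += ["bld-centos5-64-vmw-%03d" % i for i in range(1, 7)]  # 001..006
    return hosts

for host in vmw_hosts():
    print(host)  # 28 hosts total
```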
Summary: Decommision VMWare VM builders - production slaves (non-try) → Decommission VMWare VM builders - production slaves (non-try)
(Reporter)

Comment 2

5 years ago
... which is RelEng-speak for 'everything in comment #0 is now idle'.
(Reporter)

Comment 3

5 years ago
Re-enabled these four:
bld-centos5-32-vmw-001
bld-centos5-32-vmw-002
bld-centos5-64-vmw-001
bld-centos5-64-vmw-002
since the spidermonkey builds don't happen without them. They'll mostly be busy with fuzzing, which is not I/O intensive.

Comment 4

5 years ago
From a conversation with :arr on 6 November --
18:39:26 arr: let us know if/when we should reenable the rest of the slaves

In bug 809105 I had the slaves moved over to SAS drives, so this batch should be allowed to start back up. Piecemeal would be nice, if feasible.
(Reporter)

Comment 5

5 years ago
(In reply to Nick Thomas [:nthomas] from comment #0)
> The non-try VMWare VMs are still building things like
> * beta/release on-change
> * beta/release/esr releases
> * inbound spidermonkey
> * fuzzing  (preferentially over ix hardware)

* beta will move off these slaves after the code merge Nov 19 (Fx18, bug 798361)
* release will move off on the merge around the new year (Fx18, riding the trains)
* esr is targeting moving off at the same time as release (backporting at 17.0.1 esr)
* spidermonkey moved off these slaves (bug 794339) recently.
* fuzzing still needs to move (bug 803764)

As it is, with these slaves off, the only thing hurting right now is the lack of linux fuzzing.

Comment 6

5 years ago
OK, bring back what's hurting you.  You know we'll yell if we have to. :)
(Reporter)

Comment 7

5 years ago
These slaves are re-enabled in slavealloc, which means they're doing work. This changed on Nov 9. kmoir, any info on that? gcox, how's the netapp finding that? Hopefully it's mostly CPU load rather than I/O.

In the short term we should probably enable N 32-bit VMs, lock them to a single master, and turn off the rest. That gives us N-3 dedicated machines for fuzzing. The offset of 3 comes from http://mxr.mozilla.org/build/source/buildbot-configs/mozilla/production_config.py#179, where each buildbot master keeps idle_slaves machines free so that non-fuzzing work can start quickly. We could reduce this value.
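The capacity math above can be sketched as follows. This is a toy illustration: the real idle_slaves value lives in production_config.py, and the function name here is invented:

```python
# Each buildbot master holds back `idle_slaves` machines so that
# non-fuzzing work can start quickly; the rest of the N enabled VMs
# end up dedicated to fuzzing. The value 3 matches the offset
# mentioned above.
IDLE_SLAVES = 3

def dedicated_fuzzers(enabled_vms, idle_slaves=IDLE_SLAVES):
    """How many VMs remain for fuzzing after the idle reservation."""
    return max(0, enabled_vms - idle_slaves)

# e.g. six 32-bit VMs locked to a single master:
print(dedicated_fuzzers(6))  # -> 3
```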
(Reporter)

Comment 8

5 years ago
OK, I've left these six enabled and locked to buildbot-master30:
bld-centos5-32-vmw-001
bld-centos5-32-vmw-002
bld-centos5-32-vmw-003
bld-centos5-32-vmw-004
bld-centos5-32-vmw-005
bld-centos5-32-vmw-006

The remainder (bld-centos5-32-vmw-007 to 022, bld-centos5-64-vmw-001 to 011) are disabled in slavealloc, but the VMs are still on.
Can you please change their trustlevel to decomm on slavealloc? IIUC it will prevent those slaves from showing up on the last job per slave report.
(Reporter)

Comment 10

5 years ago
Is the comment and disabled state not sufficient for buildduty to ignore any state in the last job report? They're not really decommissioned yet.
If we change the status of "Trust" to "decomm" it becomes hidden on the report and no one has to figure out what to do with them.

What is the plan for these slaves? What is left to do to decommission them?
Are we planning to put them back, or is that not clear yet?
(Reporter)

Comment 12

4 years ago
This can happen once the fuzzing jobs move off to mock.
Assignee: nthomas → nobody
Depends on: 803764

Comment 13

4 years ago
Lint: moving to 'new' since assignee went to 'nobody'.
Cross-bug reference: bld-centos5-32-vmw-007 to 022, bld-centos5-64-vmw-001 to 011 are gone due to bug 841331.
Status: ASSIGNED → NEW
According to catlee in bug 805587 we're good to decommission the rest of these vms:

bld-centos5-32-vmw-001
bld-centos5-32-vmw-002
bld-centos5-32-vmw-003
bld-centos5-32-vmw-004
bld-centos5-32-vmw-005
bld-centos5-32-vmw-006
Assignee: nobody → server-ops-virtualization
Component: Release Engineering: Automation (General) → Server Operations: Virtualization

Comment 15

4 years ago
This bug lists a dependency on bug 803764, though.
Error in dependency, or premature decom request?
catlee, is 803764 already done?
Flags: needinfo?(catlee)
No, we need to get that done before we can decomm these VMs.
Flags: needinfo?(catlee)
Blocks: 863236
We have 6 linux-ix and 6 linux64-ix machines around in bug 780022 until October, which is when esr17 dies.
If we had 6 32-bit VMs and 6 64-bit VMs we could re-purpose those iX machines into something useful without having to wait until October.

These machines would be barely in use if we moved the fuzzer and nanojit jobs to our hp/ec2 infrastructure.
Depends on: 840303
The iX machines in bug 780027 will be decommissioned when they're no longer being used for ESR, not repurposed.  They're out of warranty and in mountain view.
Any update on this bug? Anything more need doing?
:arr, can you update us on the status of this bug? can we delete some VMs here?
Flags: needinfo?(arich)
Not that I know of.  Releng still hasn't given the okay.
Flags: needinfo?(arich) → needinfo?(catlee)
correct, they're still being used for linux fuzzing jobs. we could probably reduce the number of VMs though if that would help?
Flags: needinfo?(catlee)
(In reply to Chris AtLee [:catlee] from comment #23)
> correct, they're still being used for linux fuzzing jobs. we could probably
> reduce the number of VMs though if that would help?

Are we not fuzzing on other linux platforms? Are fuzzing results from this particular linux variant critical?
No, IIRC, we're only fuzzing on one platform. We should move these to the newer hardware, but that raises the issue of how to schedule them. We don't really have idle slaves in AWS.
(In reply to Chris AtLee [:catlee] from comment #25)
> No, IIRC, we're only fuzzing on one platform. We should move these to the
> newer hardware, but that raises the issue of how to schedule them. We don't
> really have idle slaves in AWS.

But we do have some in-house hardware capacity on linux, yes? Or is the issue that we prioritize those slaves so heavily that they are effectively never idle?
Also keep in mind that these VMs are running 32-bit CentOS 5, not 64-bit CentOS 6 like all the modern builders. We have linux-ix-slave01 - linux-ix-slave06 on hardware, but AFAIK they are riding the trains and will be decommissioned in Q1 when we move out of SCL1.

Is there a reason we're still running spidermonkey and fuzzing on CentOS 5?
Blocks: 947426
No longer blocks: 863236
Where are we here?

Is running the jobs on the talos-ix machines when they're idle an option?
The 6 bld-centos5-32-vmw slaves are not connected to buildbot.

Are we done in here?
Keywords: spring-cleaning
I have rebooted the 6 bld-centos5-32-vm hosts as the buildbot-master had shut them off (on the 7th) after 6 hours of idleness.

The Linux fuzzers are supposed to run on those VMs plus linux-ix-### (which have been re-purposed).
The Linux64 fuzzers are supposed to run on linux64-ix-### (which have been re-purposed).
I'm going to remove the linux-ix and linux64-ix machines from the configs on bug 933768.

I think there isn't enough work to keep them busy, and the buildbot masters shut the CentOS VMs down due to the lack of jobs (since they don't take any other kind of job).
We might not have the same issue with in-house machines since, unlike the bld-centos5-32-vm machines, they will not sit idle for 6 hours, which is what causes the master to shut them off.

I think it is best to fix bug 803764 and kill these VMs.
Bug 803764 is fixed now, so I've disabled the 6 bld-centos5-32-vm slaves.

We can safely decomm these machines now.
(Assignee)

Comment 32

3 years ago
I can handle that.  I'll ping gcox first to be sure it's not going to cause problems.
Assignee: server-ops-virtualization → dustin
(Assignee)

Comment 33

3 years ago
Gone!
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations