Closed Bug 804766 Opened 12 years ago Closed 10 years ago

Decommission VMWare VM builders - production slaves (non-try)

Categories

(Infrastructure & Operations :: Virtualization, task, P2)

x86
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: dustin)

References

Details

(Keywords: spring-cleaning)

The non-try VMWare VMs are still building things like
* beta/release on-change
* beta/release/esr releases
* inbound spidermonkey
* fuzzing  (preferentially over ix hardware)
This will get smaller with bug 798361 (bug 772446), but:
<nthomas>	how much of a difference would disabling them make ?
<lerxst>	nthomas: gigantic. those 50 build VMs together are causing the same IO load you'd expect from 500 normal VMs
but we're ok right now, where "ok" is defined as "overloaded but not currently on fire"

Going to disable them in slavealloc initially to help the netapp, and monitor the wait times for issues.
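For the record, a minimal sketch of what scripting that disable step could look like against a slavealloc HTTP API; the host, endpoint path, and JSON field names below are assumptions for illustration only (in practice this can equally be done through the slavealloc web UI):

# Hypothetical sketch: the slavealloc URL, endpoint, and JSON field names
# are assumptions, not the documented API.
import requests

SLAVEALLOC = "https://slavealloc.example.mozilla.org/api"  # hypothetical host

def disable_slave(name, reason):
    # Mark the slave disabled so masters stop handing it new jobs.
    resp = requests.put(
        "%s/slaves/%s" % (SLAVEALLOC, name),
        json={"enabled": False, "notes": reason},
        timeout=30,
    )
    resp.raise_for_status()

for n in range(1, 23):
    disable_slave("bld-centos5-32-vmw-%03d" % n, "bug 804766: reduce netapp I/O load")
for n in range(1, 7):
    disable_slave("bld-centos5-64-vmw-%03d" % n, "bug 804766: reduce netapp I/O load")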

Slave list (from buildbot-configs):
bld-centos5-32-vmw-001
bld-centos5-32-vmw-002
bld-centos5-32-vmw-003
bld-centos5-32-vmw-004
bld-centos5-32-vmw-005
bld-centos5-32-vmw-006
bld-centos5-32-vmw-007
bld-centos5-32-vmw-008
bld-centos5-32-vmw-009
bld-centos5-32-vmw-010
bld-centos5-32-vmw-011
bld-centos5-32-vmw-012
bld-centos5-32-vmw-013
bld-centos5-32-vmw-014
bld-centos5-32-vmw-015
bld-centos5-32-vmw-016
bld-centos5-32-vmw-017
bld-centos5-32-vmw-018
bld-centos5-32-vmw-019
bld-centos5-32-vmw-020
bld-centos5-32-vmw-021
bld-centos5-32-vmw-022
bld-centos5-64-vmw-001
bld-centos5-64-vmw-002
bld-centos5-64-vmw-003
bld-centos5-64-vmw-004
bld-centos5-64-vmw-005
bld-centos5-64-vmw-006

Those are all in 'production'; there are no staging slaves of this class.
Disabled in slavealloc, and did a graceful buildbot shutdown on them all to catch the fuzzer jobs.
Summary: Decommision VMWare VM builders - production slaves (non-try) → Decommission VMWare VM builders - production slaves (non-try)
... which is RelEng-speak for 'everything in comment #0 is now idle'.
Re-enabled these four:
bld-centos5-32-vmw-001
bld-centos5-32-vmw-002
bld-centos5-64-vmw-001
bld-centos5-64-vmw-002
since the spidermonkey builds don't happen without them. Otherwise they'll be busy with fuzzing, which is not I/O intensive.
From a conversation with :arr on 6 November --
18:39:26 arr: let us know if/when we should reenable the rest of the slaves

In bug 809105 I had the slaves moved over to SAS drives, so this batch should be allowed to start back up. Piecemeal would be nice, if feasible.
(In reply to Nick Thomas [:nthomas] from comment #0)
> The non-try VMWare VMs are still building things like
> * beta/release on-change
> * beta/release/esr releases
> * inbound spidermonkey
> * fuzzing  (preferentially over ix hardware)

* beta will move off these slaves after the code merge Nov 19 (Fx18, bug 798361)
* release will move off on the merge around the new year (Fx18, riding the trains)
* esr is targeting moving off at the same time as release (backporting at 17.0.1 esr)
* spidermonkey moved off these slaves (bug 794339) recently.
* fuzzing still needs to move (bug 803764)

As it is, with these slaves off, the only thing hurting right now is the lack of Linux fuzzing.
OK, bring back what's hurting you.  You know we'll yell if we have to. :)
These slaves are re-enabled in slavealloc again, which means they're doing work. This changed on Nov 9. kmoir, any info on that? gcox, how's the netapp finding that? Hopefully it's mostly CPU load rather than I/O.

In the short term what we probably should do is enable N 32-bit VMs and lock them to a single master, and turn off the rest. Then we have N-3 dedicated machines for fuzzing. The offset of 3 comes from http://mxr.mozilla.org/build/source/buildbot-configs/mozilla/production_config.py#179, where each buildbot master maintains idle_slaves so that non-fuzzing work can start quickly. We could reduce this value.
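To make the arithmetic concrete, a tiny sketch (the value of 3 comes from the production_config.py line linked above; the variable names are illustrative, not the actual config layout):

# Illustrative only: models the "N - 3 dedicated fuzzing machines" arithmetic.
IDLE_SLAVES = 3          # per production_config.py#179, kept idle per master
                         # so non-fuzzing work can start quickly

enabled_32bit_vms = 6    # e.g. bld-centos5-32-vmw-001 through -006
dedicated_to_fuzzing = enabled_32bit_vms - IDLE_SLAVES
print(dedicated_to_fuzzing)   # -> 3; lowering IDLE_SLAVES raises this number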
OK, I've left these six enabled and locked to buildbot-master30:
bld-centos5-32-vmw-001
bld-centos5-32-vmw-002
bld-centos5-32-vmw-003
bld-centos5-32-vmw-004
bld-centos5-32-vmw-005
bld-centos5-32-vmw-006

The remainder (bld-centos5-32-vmw-007 to 022, bld-centos5-64-vmw-001 to 011) are disabled in slavealloc, but the VMs are still on.
Can you please change their trustlevel to decomm on slavealloc? IIUC it will prevent those slaves from showing up on the last job per slave report.
Is the comment and disabled state not sufficient for buildduty to ignore any state in the last job report? They're not really decommissioned yet.
If we change the status of "Trust" to "decomm" it becomes hidden on the report and no one has to figure out what to do with them.
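Reusing the same hypothetical slavealloc sketch from earlier, the requested change would be a single field update per slave (the "trustlevel" field name and "decomm" value come from the comments above; the endpoint itself remains an assumption):

# Same hypothetical endpoint as the earlier sketch; only the field differs.
import requests

SLAVEALLOC = "https://slavealloc.example.mozilla.org/api"  # hypothetical host

def set_trustlevel(name, level="decomm"):
    # Moving a slave to 'decomm' hides it from the last-job-per-slave report.
    resp = requests.put(
        "%s/slaves/%s" % (SLAVEALLOC, name),
        json={"trustlevel": level},
        timeout=30,
    )
    resp.raise_for_status()

set_trustlevel("bld-centos5-32-vmw-007")   # and so on for the disabled slaves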

What is the plan for these slaves? What is left to decommission them?
Are we planning to put them back, or is it not clear yet?
This can happen once the fuzzing jobs move off to mock.
Assignee: nthomas → nobody
Lint: moving to 'new' since assignee went to 'nobody'.
Cross-bug reference: bld-centos5-32-vmw-007 to 022, bld-centos5-64-vmw-001 to 011 are gone due to bug 841331.
Status: ASSIGNED → NEW
According to catlee in bug 805587 we're good to decommission the rest of these vms:

bld-centos5-32-vmw-001
bld-centos5-32-vmw-002
bld-centos5-32-vmw-003
bld-centos5-32-vmw-004
bld-centos5-32-vmw-005
bld-centos5-32-vmw-006
Assignee: nobody → server-ops-virtualization
Component: Release Engineering: Automation (General) → Server Operations: Virtualization
This bug lists a dependency on bug 803764, though.
Error in dependency, or premature decom request?
catlee, is 803764 already done?
Flags: needinfo?(catlee)
No, we need to get that done before we can decomm these VMs.
Flags: needinfo?(catlee)
Blocks: 863236
We have 6 linux-ix and 6 linux64-ix machines around in bug 780022 until October, which is when esr17 dies.
If we had 6 32-bit VMs and 6 64-bit VMs we could then go ahead and re-purpose those iX machines into something useful without having to wait until October.

These machines would be barely in use if we moved the fuzzer and nanojit jobs to our hp/ec2 infrastructure.
Depends on: 840303
The iX machines in bug 780027 will be decommissioned when they're no longer being used for ESR, not repurposed. They're out of warranty and in Mountain View.
Any update on this bug? Anything more need doing?
:arr, can you update us on the status of this bug? can we delete some VMs here?
Flags: needinfo?(arich)
Not that I know of.  Releng still hasn't given the okay.
Flags: needinfo?(arich) → needinfo?(catlee)
correct, they're still being used for linux fuzzing jobs. we could probably reduce the number of VMs though if that would help?
Flags: needinfo?(catlee)
(In reply to Chris AtLee [:catlee] from comment #23)
> correct, they're still being used for linux fuzzing jobs. we could probably
> reduce the number of VMs though if that would help?

Are we not fuzzing on other linux platforms? Are fuzzing results from this particular linux variant critical?
No, IIRC, we're only fuzzing on one platform. We should move these to the newer hardware, but that raises the issue of how to schedule them. We don't really have idle slaves in AWS.
(In reply to Chris AtLee [:catlee] from comment #25)
> No, IIRC, we're only fuzzing on one platform. We should move these to the
> newer hardware, but that raises the issue of how to schedule them. We don't
> really have idle slaves in AWS.

But we do have some in-house hardware capacity on linux, yes? Or is the issue that we prioritize those slaves so heavily that they are effectively never idle?
Also keep in mind that these VMs are running 32-bit CentOS 5, not 64-bit CentOS 6 like all the modern builders. We have linux-ix-slave01 through linux-ix-slave06 on hardware, but AFAIK they are riding the trains and will be decommissioned in Q1 when we move out of SCL1.

Is there a reason we're still running spidermonkey and fuzzing on CentOS 5?
Blocks: 947426
No longer blocks: 863236
Where are we here?

Is running the jobs on the talos-ix machines when they're idle an option?
The 6 bld-centos5-32-vmw slaves are not connected to buildbot.

Are we done in here?
I have rebooted the 6 bld-centos5-32-vm hosts as the buildbot-master had shut them off (on the 7th) after 6 hours of idleness.

The Linux fuzzers are supposed to run on those VMs plus linux-ix-### (which have been re-purposed).
The Linux64 fuzzers are supposed to run on linux64-ix-### (which have been re-purposed).
I'm going to remove the linux-ix and linux64-ix machines from the configs on bug 933768.

I think there aren't enough fuzzing jobs coming in, so the buildbot masters shut the CentOS VMs off for lack of work (since they don't take any other jobs).
We might not have the same issue on in-house machines because, unlike the bld-centos5-32-vm machines, they would not sit idle for 6 hours, which is what causes the master to shut them off.
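As a rough model of the idle-shutdown behaviour being described (the 6-hour threshold comes from the comments above; the function and its inputs are illustrative, not the actual master-side tooling):

# Illustrative sketch only: models "shut off a slave after 6 hours of idleness".
import datetime

IDLE_LIMIT = datetime.timedelta(hours=6)

def should_shut_down(last_job_finished, now=None):
    # A fuzzing-only VM with no work for longer than the limit
    # would be powered off by the master.
    now = now or datetime.datetime.utcnow()
    return now - last_job_finished > IDLE_LIMIT

seven_hours_ago = datetime.datetime.utcnow() - datetime.timedelta(hours=7)
print(should_shut_down(seven_hours_ago))   # -> True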

I think it is best to fix bug 803764 and kill these VMs.
Bug 803764 is fixed now, so I've disabled the 6 bld-centos5-32-vm slaves.

We can safely decomm these machines now.
I can handle that.  I'll ping gcox first to be sure it's not going to cause problems.
Assignee: server-ops-virtualization → dustin
Gone!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations