Bug 804766 - Decommission VMWare VM builders - production slaves (non-try)
Opened 12 years ago • Closed 11 years ago
Categories: Infrastructure & Operations :: Virtualization, task, P2
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: nthomas, Assigned: dustin
Keywords: spring-cleaning
The non-try VMWare VMs are still building things like
* beta/release on-change
* beta/release/esr releases
* inbound spidermonkey
* fuzzing (preferentially over ix hardware)
This will get smaller with bug 798361 (bug 772446), but:
<nthomas> how much of a difference would disabling them make ?
<lerxst> nthomas: gigantic. those 50 build VMs together are causing the same IO load you'd expect from 500 normal VMs
but we're ok right now, where "ok" is defined as "overloaded but not currently on fire"
Going to disable them in slavealloc initially to help the netapp, and monitor the wait times for issues.
Slave list (from buildbot-configs):
bld-centos5-32-vmw-001
bld-centos5-32-vmw-002
bld-centos5-32-vmw-003
bld-centos5-32-vmw-004
bld-centos5-32-vmw-005
bld-centos5-32-vmw-006
bld-centos5-32-vmw-007
bld-centos5-32-vmw-008
bld-centos5-32-vmw-009
bld-centos5-32-vmw-010
bld-centos5-32-vmw-011
bld-centos5-32-vmw-012
bld-centos5-32-vmw-013
bld-centos5-32-vmw-014
bld-centos5-32-vmw-015
bld-centos5-32-vmw-016
bld-centos5-32-vmw-017
bld-centos5-32-vmw-018
bld-centos5-32-vmw-019
bld-centos5-32-vmw-020
bld-centos5-32-vmw-021
bld-centos5-32-vmw-022
bld-centos5-64-vmw-001
bld-centos5-64-vmw-002
bld-centos5-64-vmw-003
bld-centos5-64-vmw-004
bld-centos5-64-vmw-005
bld-centos5-64-vmw-006
Those are all in 'production'; there are no staging slaves of this class.
Comment 1 • 12 years ago (Reporter)
Disabled in slavealloc, and did a graceful buildbot shutdown on them all to catch the fuzzer jobs.
Summary: Decommision VMWare VM builders - production slaves (non-try) → Decommission VMWare VM builders - production slaves (non-try)
Comment 2 • 12 years ago (Reporter)
... which is RelEng-speak for 'everything in comment #0 is now idle'.
Comment 3 • 12 years ago (Reporter)
Re-enabled these four:
bld-centos5-32-vmw-001
bld-centos5-32-vmw-002
bld-centos5-64-vmw-001
bld-centos5-64-vmw-002
since the spidermonkey builds don't happen otherwise. They'll mostly be busy with fuzzing, which is not I/O intensive.
Comment 4 • 12 years ago
From a conversation with :arr on 6 November --
18:39:26 arr: let us know if/when we should reenable the rest of the slaves
In bug 809105 the slaves were moved over to SAS drives, so this batch can start back up. Piecemeal would be nice, if feasible.
Comment 5 • 12 years ago (Reporter)
(In reply to Nick Thomas [:nthomas] from comment #0)
> The non-try VMWare VMs are still building things like
> * beta/release on-change
> * beta/release/esr releases
> * inbound spidermonkey
> * fuzzing (preferentially over ix hardware)
* beta will move off these slaves after the code merge Nov 19 (Fx18, bug 798361)
* release will move off on the merge around the new year (Fx18, riding the trains)
* esr is targeting moving off at the same time as release (backporting at 17.0.1 esr)
* spidermonkey moved off these slaves (bug 794339) recently.
* fuzzing still needs to move (bug 803764)
As it is, with these slaves off, the only thing that's hurting is the lack of Linux fuzzing right now.
Comment 6 • 12 years ago
OK, bring back what's hurting you. You know we'll yell if we have to. :)
Comment 7 • 12 years ago (Reporter)
These slaves are re-enabled in slavealloc, which means they're doing work. This changed on Nov 9. kmoir, any info on that? gcox, how's the netapp finding that? Hopefully it's mostly CPU load rather than I/O.
In the short term what we probably should do is enable N 32-bit VMs and lock them to a single master, and turn off the rest. Then we have N-3 dedicated machines for fuzzing. The offset of 3 comes from http://mxr.mozilla.org/build/source/buildbot-configs/mozilla/production_config.py#179, where each buildbot master maintains idle_slaves so that non-fuzzing work can start quickly. We could reduce this value.
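For illustration only, the knob in question looks roughly like the following sketch (the variable name and surrounding structure are assumptions, not copied from production_config.py):

  # Hypothetical sketch of the idle-slave reservation described above.
  # Each master holds back this many slaves so that non-fuzzing work can
  # start quickly; with N VMs enabled, roughly N - idle_slaves are left
  # free to take fuzzing jobs.
  GLOBAL_VARS = {
      # ... other production settings ...
      'idle_slaves': 3,  # lowering this frees more VMs for fuzzing
  }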
Comment 8 • 12 years ago (Reporter)
OK, I've left these six enabled and locked to buildbot-master30:
bld-centos5-32-vmw-001
bld-centos5-32-vmw-002
bld-centos5-32-vmw-003
bld-centos5-32-vmw-004
bld-centos5-32-vmw-005
bld-centos5-32-vmw-006
The remainder (bld-centos5-32-vmw-007 to 022, bld-centos5-64-vmw-001 to 011) are disabled in slavealloc, but the VMs are still on.
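As a quick cross-check, here is a throwaway sketch of the split (plain Python that just enumerates the hostnames above; it does not talk to slavealloc, where the actual enable/disable and lock flags live):

  # Enumerate which VM slaves stay enabled (locked to buildbot-master30)
  # and which are disabled in slavealloc but left powered on.
  keep_enabled = ['bld-centos5-32-vmw-%03d' % n for n in range(1, 7)]
  all_slaves = (['bld-centos5-32-vmw-%03d' % n for n in range(1, 23)] +
                ['bld-centos5-64-vmw-%03d' % n for n in range(1, 12)])
  to_disable = [s for s in all_slaves if s not in keep_enabled]
  print('enabled, locked to buildbot-master30:', keep_enabled)
  print('disabled in slavealloc (VMs still on):', to_disable)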
Comment 9 • 12 years ago
Can you please change their trustlevel to decomm on slavealloc? IIUC it will prevent those slaves from showing up on the last job per slave report.
Comment 10 • 12 years ago (Reporter)
Aren't the comment and the disabled state sufficient for buildduty to ignore any state in the last job report? They're not really decommissioned yet.
Comment 11 • 12 years ago
If we change the "Trust" status to "decomm" they become hidden on the report and no one has to figure out what to do with them.
What is the plan for these slaves? What is left to do to decommission them?
Are we planning to put them back, or is that not clear yet?
Comment 12 • 12 years ago (Reporter)
This can happen once the fuzzing jobs move off to mock.
Assignee: nthomas → nobody
Comment 13 • 12 years ago
Lint: moving to 'new' since assignee went to 'nobody'.
Cross-bug reference: bld-centos5-32-vmw-007 to 022, bld-centos5-64-vmw-001 to 011 are gone due to bug 841331.
Status: ASSIGNED → NEW
Comment 14 • 12 years ago
According to catlee in bug 805587 we're good to decommission the rest of these VMs:
bld-centos5-32-vmw-001
bld-centos5-32-vmw-002
bld-centos5-32-vmw-003
bld-centos5-32-vmw-004
bld-centos5-32-vmw-005
bld-centos5-32-vmw-006
Assignee: nobody → server-ops-virtualization
Component: Release Engineering: Automation (General) → Server Operations: Virtualization
Comment 15 • 12 years ago
This bug lists a dependency on bug 803764, though.
Error in dependency, or premature decom request?
Comment 17 • 12 years ago
No, we need to get that done before we can decomm these VMs.
Flags: needinfo?(catlee)
Comment 18 • 12 years ago
We have 6 linux-ix and 6 linux64-ix machines around in bug 780022 until October, which is when esr17 dies.
If we had 6 32-bit VMs and 6 64-bit VMs, we could then go ahead and re-purpose those iX machines into something useful without having to wait until October.
These machines would be barely in use if we moved the fuzzer and nanojit jobs to our hp/ec2 infrastructure.
Depends on: 840303
Comment 19 • 12 years ago
The iX machines in bug 780027 will be decommissioned when they're no longer being used for ESR, not repurposed. They're out of warranty and in Mountain View.
Comment 20 • 12 years ago
Any update on this bug? Anything more need doing?
Comment 21 • 11 years ago
:arr, can you update us on the status of this bug? can we delete some VMs here?
Flags: needinfo?(arich)
Comment 22 • 11 years ago
Not that I know of. Releng still hasn't given the okay.
Flags: needinfo?(arich) → needinfo?(catlee)
Comment 23 • 11 years ago
correct, they're still being used for linux fuzzing jobs. we could probably reduce the number of VMs though if that would help?
Flags: needinfo?(catlee)
Comment 24 • 11 years ago
(In reply to Chris AtLee [:catlee] from comment #23)
> correct, they're still being used for linux fuzzing jobs. we could probably
> reduce the number of VMs though if that would help?
Are we not fuzzing on other linux platforms? Are fuzzing results from this particular linux variant critical?
Comment 25 • 11 years ago
No, IIRC, we're only fuzzing on one platform. We should move these to the newer hardware, but that raises the issue of how to schedule them. We don't really have idle slaves in AWS.
Comment 26 • 11 years ago
(In reply to Chris AtLee [:catlee] from comment #25)
> No, IIRC, we're only fuzzing on one platform. We should move these to the
> newer hardware, but that raises the issue of how to schedule them. We don't
> really have idle slaves in AWS.
But we do have some in-house hardware capacity on linux, yes? Or is the issue that we prioritize those slaves so heavily that they are effectively never idle?
Comment 27 • 11 years ago
Also keep in mind that these VMs are running 32-bit CentOS 5, not 64-bit CentOS 6 like all the modern builders. We have linux-ix-slave01 through linux-ix-slave06 on hardware, but AFAIK they are riding the trains and will be decommissioned in Q1 when we move out of SCL1.
Is there a reason we're still running spidermonkey and fuzzing on CentOS 5?
Updated • 11 years ago
Comment 28 • 11 years ago
Where are we here?
Is running the jobs on the talos-ix machines when they're idle an option?
Comment 29 • 11 years ago
The 6 bld-centos5-32-vmw slaves are not connected to buildbot.
Are we done here?
Updated • 11 years ago
Keywords: spring-cleaning
Comment 30 • 11 years ago
I have rebooted the 6 bld-centos5-32-vm hosts as the buildbot-master had shut them off (on the 7th) after 6 hours of idleness.
The Linux fuzzers are supposed to run on those VMs plus linux-ix-### (which have been re-purposed).
The Linux64 fuzzers are supposed to run on linux64-ix-### (which have been re-purposed).
I'm going to remove the linux-ix and linux64-ix machines from the configs on bug 933768.
I think the fuzzing jobs don't come often enough, so the buildbot masters shut the CentOS VMs down for lack of work (since they don't take any other kind of job).
We might not have the same issue on in-house machines, since, unlike the bld-centos5-32-vm machines, they would not sit idle for 6 hours and get shut off by the master.
I think it is best to fix bug 803764 and kill these VMs.
Comment 31 • 11 years ago
Bug 803764 is fixed now, so I've disabled the 6 bld-centos5-32-vm slaves.
We can safely decomm these machines now.
Comment 32 • 11 years ago (Assignee)
I can handle that. I'll ping gcox first to be sure it's not going to cause problems.
Assignee: server-ops-virtualization → dustin
Comment 33 • 11 years ago (Assignee)
Gone!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated • 10 years ago
Product: mozilla.org → Infrastructure & Operations