Closed Bug 772593 (Opened 12 years ago, Closed 12 years ago)

Setup 7 Hardware machines as Linux Foopies in mtv1

Categories

(Infrastructure & Operations :: RelOps: General, task)

Platform: x86_64 Windows 7
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: arich)

References

Details

So, in the coming weeks, we're gearing up to need Linux Foopies brought up.

I am working on the puppetizing process on a VM right now, but we want the Linux foopies to run on physical hardware; I suggest the HP DL120 G7s we have.

These will be set up with the PuppetAgain process (and a bit of hands-on work to get the tegras attached once fully up). I am aiming for us to try setting up 5 at the outset:

* 1 for a linux-foopy staging
* 3-4 for our new batch of tegra hosting
* 1 for migrating existing tegras off mac foopies **or** panda/beagle board foopy setup testing, whichever comes first/makes the most sense.

I'm not yet asking for anything on the buildbot master side, while we await data on how much additional buildbot master load we'll have with these new systems. My theory is we'll need at least 1 more master.

Hopefully this work can happen in parallel, and we can morph this into a tracker if need be.
What hardware specifications are required for:

CPU speed (and multi-core vs single core, e.g. is this multi-threaded?)
amount of memory
amount of disk space (and what sort of io processing load these machines will incur).

Based on the answers to these questions, a vm may fit the bill, or we'll choose the appropriate hardware.  We don't have spare hardware, so we'll need to spec and purchase something new if that's the case.

Are we to assume centos 6.2 for the OS based on your puppetagain statement?
(In reply to Amy Rich [:arich] [:arr] from comment #1)
> Are we to assume centos 6.2 for the OS based on your puppetagain statement?

Yes.

> What hardware specifications are required for:
> 
> CPU speed (and multi-core vs single core, e.g. is this multi-threaded?)

The faster the better, and multi-core is preferred, since we'll be running 2 clientproxy processes per attached tegra, as well as any related processes for the tests (Talos/Robocop, etc.).

> amount of memory
> amount of disk space (and what sort of io processing load these machines
> will incur).

Disk space is not our largest factor here, but we do a lot of downloading/unpacking of APKs/test packages.

> Based on the answers to these questions, a vm may fit the bill, or we'll
> choose the appropriate hardware.  We don't have spare hardware, so we'll
> need to spec and purchase something new if that's the case.
> 

The more CPU/memory/network throughput we have, the more tegras we can run *reliably* on a single machine. I would say the CPU/memory specs of our newest Mac foopies are a good baseline for our minimum requirement here.

I'll let bear/armen/someone-else chime in here on their thoughts though.
(In reply to Justin Wood (:Callek) from comment #2)
> (In reply to Amy Rich [:arich] [:arr] from comment #1)
> > Are we to assume centos 6.2 for the OS based on your puppetagain statement?
> 
> Yes.
> 
> > What hardware specifications are required for:
> > 
> > CPU speed (and multi-core vs single core, e.g. is this multi-threaded?)
> 
> The faster the better, and multi-core is preferred, since we'll be running
> 2 clientproxy processes per attached tegra, as well as any related
> processes for the tests (Talos/Robocop, etc.).

Each tegra requires 2 clientproxy processes, as Callek mentioned, plus a buildbot process and anywhere from 0 to 3 additional processes depending on the test being run.
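For a rough sense of scale, a back-of-the-envelope count (a sketch only; the 16-tegra target comes from later in this bug, and the per-tegra worst case of 2 clientproxy + 1 buildbot + 3 test processes is taken from the comment above):

# rough worst-case process count for a foopy hosting 16 tegras
TEGRAS=16
PER_TEGRA=$((2 + 1 + 3))   # 2 clientproxy + 1 buildbot + up to 3 test processes
echo "worst case: $((TEGRAS * PER_TEGRA)) processes"   # prints: worst case: 96 processes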

> 
> > amount of memory

Memory is not the limiting factor, so no less than what the minis currently have is a good starting benchmark.

> > amount of disk space (and what sort of io processing load these machines
> > will incur).
> 
> Disk space is not our largest factor here, but we do a lot of
> downloading/unpacking of APKs/test packages.

Disk space, while not the largest factor, is a big one because of the supporting files and the increased number of projects/platforms that will be running; remember that each tegra has its own build environment.

The largest issue is disk I/O: each tegra does a *ton* of disk I/O, and the biggest reason we have had to reduce the tegra-per-foopy ratio is disk I/O.

If we are using VMs for these, I would worry about the accumulated I/O on the host server.
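For what it's worth, aggregate disk I/O on the host is easy to watch with the standard sysstat tools once tegras are attached (a sketch; assumes the sysstat package is installed on the CentOS machine):

# extended per-device stats (utilization, await, kB/s), refreshed every 5 seconds
iostat -dxk 5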


> 
> > Based on the answers to these questions, a vm may fit the bill, or we'll
> > choose the appropriate hardware.  We don't have spare hardware, so we'll
> > need to spec and purchase something new if that's the case.
> > 
> 
> The more CPU/memory/network throughput we have, the more tegras we can run
> *reliably* on a single machine. I would say the CPU/memory specs of our
> newest Mac foopies are a good baseline for our minimum requirement here.
> 
> I'll let bear/armen/someone-else chime in here on their thoughts though.

Nothing that I am aware of prevents using VMs for these. We should spin one up now so that the final puppet checks can be done on this VM to confirm that it won't be an issue. We can also burn in / stage the new tegras on this VM to get some ganglia metrics on what the I/O rate is in reality.
We have a test vm up for this right now.  Can you install/configure ganglia on it and move some tegras over so we can get an idea of actual load?  We can tune the vm's CPU and RAM allocation based on that.
Assignee: server-ops-releng → arich
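A minimal sketch of what the ganglia agent setup on the test VM might look like (assumes the ganglia-gmond package is available, e.g. from EPEL, and that /etc/ganglia/gmond.conf is pointed at the existing RelEngMTV1 cluster; the real deployment would presumably go through PuppetAgain instead):

yum install -y ganglia-gmond
# edit /etc/ganglia/gmond.conf to match the existing cluster/aggregator settings,
# then enable and start the agent (CentOS 6 init-style services):
chkconfig gmond on
service gmond start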
(In reply to Amy Rich [:arich] [:arr] from comment #4)
> We have a test vm up for this right now.  Can you install/configure ganglia
> on it and move some tegras over so we can get an idea of actual load?  We
> can tune the vm's CPU and RAM allocation based on that.

OK, I stuck 13 tegras on it (which is the number we have allocated to the beefier Mac foopies). I'd like to be able to allocate a minimum of 16 tegras to each of these.

I ran start_cp.sh for all these tegras, and many have buildbot running already. They are not yet doing jobs properly / with all the required load, though.

I am certainly seeing some slowdown here with all 13 of these running so far, and we'll see more load once we get the buildbot changes done, so that these actually start passing jobs.

http://ganglia3.build.mtv1.mozilla.com/ganglia/?r=hour&cs=&ce=&m=load_one&s=by+name&c=RelEngMTV1&h=linux-foopy-test.build.mtv1.mozilla.com&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
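For reference, bringing clientproxy up for a batch of tegras could look roughly like this (hypothetical invocation; the tegra ID range, working directory, and start_cp.sh arguments are placeholders and may not match the real sut_tools usage):

# hypothetical: start clientproxy for a block of tegra IDs from the builds directory
cd /builds
for t in $(seq 233 245); do
    sh sut_tools/start_cp.sh tegra-$t
done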
FYI as well:

[cltbld@linux-foopy-test builds]$ python sut_tools/tegra_powercycle.py tegra-233
snmpset: Timeout
[cltbld@linux-foopy-test builds]$ python sut_tools/tegra_powercycle.py tegra-233
snmpset: Timeout
[cltbld@linux-foopy-test builds]$ python sut_tools/tegra_powercycle.py tegra-233
snmpset: Timeout

Doing it from foopy06 worked fine, though. Is this a case of the PDU needing to know this host, or is the Linux foopy too loaded to powercycle?
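One way to tell those two cases apart would be to run the same SNMP query against the PDU from both hosts (a net-snmp sketch; the PDU hostname and community string below are placeholders):

# query sysDescr (1.3.6.1.2.1.1.1.0) with a short timeout from the linux foopy
# and from foopy06; a timeout only from the linux foopy points at a missing
# network flow rather than host load
snmpget -v1 -c public -t 2 -r 1 pdu.example.mozilla.com 1.3.6.1.2.1.1.1.0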
I installed iotop and htop to get some realtime stats when we see it under load (it's not right now).
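For example, a handy invocation once it is under load (standard iotop options: only show processes actually doing I/O, accumulate totals, one row per process; needs root):

# accumulated per-process I/O, hiding idle processes
iotop -o -a -P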

I'll leave bear to decipher what comment 6 means.
(In reply to Amy Rich [:arich] [:arr] from comment #7)
> I installed iotop and htop to get some realtime stats when we see it under
> load (it's not right now).
> 
> I'll leave bear to decipher what comment 6 means.

It just means that the VM the Linux foopy is currently on does not have a network flow allowing SNMP traffic to the PDUs.
Depends on: 774318
I'm not sure that the DL120s are a long-term solution, but if we need to, we might (for space reasons) be able to swap the iX machines that are moving to scl1 to become w64 builders with some of the HPs that are currently in scl1 acting as b2g machines. Then we could use the DL120s as foopies.
Depends on: 776977
Depends on: 777768
Summary: Setup 5 Hardware machines as Linux Foopies in mtv1 → Setup 7 Hardware machines as Linux Foopies in mtv1
The following machines were kickstarted in mtv1 this morning:

foopy26
foopy27
foopy28
foopy29
foopy30
foopy31
foopy32
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations