Closed Bug 501222 Opened 13 years ago Closed 13 years ago

Bump up RAM and update VMware tools on linux slaves

Categories: Release Engineering :: General, defect
Platform: All / Other
Priority: Not set
Severity: minor

Tracking: (Not tracked)
Status: RESOLVED FIXED

People

(Reporter: catlee, Assigned: bhearsum)

References

Details

Currently our linux build slaves have 768 MB of RAM and 512 MB of swap.

We're hitting swap when linking some of the larger libraries, and we suspect that we're actually running out of memory in some cases, especially on Try.

Can we bump the RAM on all our linux slaves up to 2 GB?
Assignee: server-ops → phong
This will require the VMs to be shut down. This will also put additional load on our ESX hosts, which have to allocate more memory for all the VMs.
(In reply to comment #1)
> This will require the VMs to be shutdown.  This will also put additional load
> on our ESX hosts to allocated more memory for all the VMs.

Do we have enough RAM in the ESX hosts to cover this?  We have at least 44 linux slaves right now that would need this additional RAM, giving a total increase of 55 GB on the ESX hosts.

What happens if we overcommit?
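For reference, the 55 GB figure works out as a back-of-envelope calculation (the 768 MB current and 2 GB target allocations are from this bug; the per-slave delta is just arithmetic):

```python
# Rough check of the total RAM increase across the ESX hosts.
slaves = 44             # linux slaves needing the bump (from this bug)
current_mb = 768        # current allocation per slave
target_mb = 2 * 1024    # proposed allocation per slave

extra_gb = slaves * (target_mb - current_mb) / 1024
print(extra_gb)         # 55.0
```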
Duplicate of this bug: 503038
(In reply to comment #2)
> (In reply to comment #1)
> > This will require the VMs to be shutdown.  This will also put additional load
> > on our ESX hosts to allocated more memory for all the VMs.
> 
> Do we have enough RAM in the ESX hosts to cover this?  We have at least 44
> linux slaves right now that would need this additional RAM, giving a total
> increase of 55 GB on the ESX hosts.

Per discussion with phong in last week's group meeting, we have 70+ GB RAM available, so we can increase RAM like this without fear of overcommitting.

Note: Phong wanted to wait until after the ESX upgrades completed before doing this for all the linux VMs. However, he was fine with us doing this for a few staging linux VMs right away if we want to do some testing in staging first.
Blocks: 500699
The staging slave moz2-linux-slave17.b.m.o now has 2 GB RAM, rebooted, and looks like it is processing jobs just fine. I'll let it run over the weekend before declaring it safe.
(In reply to comment #5)
> The staging slave moz2-linux-slave17.b.m.o now has 2GB RAM, rebooted and looks
> like it is processing jobs just fine. I'll leave it run over the weekend before
> declaring it safe.

I don't see any errors on this slave related to memory, let's go ahead and do the rest.
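For anyone double-checking a slave after the bump, a quick sanity check from inside the guest (a generic Linux check, not a command from this bug) is to read the total memory visible to the OS:

```shell
# Total memory visible to the OS, in MB; expect roughly 2000 after the 2 GB bump.
awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo
```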

Phong, I think it's actually easier for us to pull out slaves one by one and do them, so I'm moving this bug back to RelEng.
Assignee: phong → nobody
Component: Server Operations: Tinderbox Maintenance → Release Engineering
QA Contact: mrz → release
Got moz2-linux-slave02 today.
moz2-linux-slave03
moz2-linux-slave04
moz2-linux-slave17
...all now have 2 GB RAM and VMware tools installed.

Also, went back to moz2-linux-slave02, verified it has 2 GB RAM, and then installed VMware tools on it (unclear whether VMware tools was already there or not).
Summary: Bump up RAM on linux slaves → Bump up RAM and install VMware tools on linux slaves
From bug#503392, comment#0:

Linux like this
1, login as root, cd /etc, cp fstab fstab.bak
2, Using the VI client, do automatic upgrade of VMware tools
3, back as root, cp fstab.bak fstab, edit fstab to remove these three lines
 # Beginning of the block added by the VMware software
 .host:/ /mnt/hgfs       vmhgfs  defaults,ttl=5  0       0
 # End of the block added by the VMware software
4, reboot
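Step 3 above can be sketched as a one-liner. This demonstrates the edit against a sample fstab in a temp file; on a real slave the target would be /etc/fstab, backed up first as in step 1:

```shell
# Build a sample fstab containing the block the VMware tools installer adds.
FSTAB=$(mktemp)
cat > "$FSTAB" <<'EOF'
/dev/sda1 / ext3 defaults 1 1
# Beginning of the block added by the VMware software
.host:/ /mnt/hgfs       vmhgfs  defaults,ttl=5  0       0
# End of the block added by the VMware software
EOF

# Delete everything between (and including) the two marker comments.
sed -i '/Beginning of the block added by the VMware/,/End of the block added by the VMware/d' "$FSTAB"
cat "$FSTAB"
```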

Buildbot was shut down in each case. I think we can roll this out gradually to the slaves, but will have to schedule downtimes for masters.
Summary: Bump up RAM and install VMware tools on linux slaves → Bump up RAM and update VMware tools on linux slaves
While you're doing this, can you move some of them to the INTEL2 cluster? I've added bm-vmware08 to that cluster.
Armen noticed that the staging slaves were AWOL. I suspect they didn't get the reboot step after the VMware tools upgrade. That's what happened on moz2-linux-slave02, which was also not responding. Each machine was not running vmware-tools and the network was down, which is typical after the tools upgrade. A reboot fixed them up.
VMware tools done on sm-try-master. Left it at 1 GB RAM.
I'm going to try and finish these up today.
Assignee: nobody → bhearsum
Status: NEW → ASSIGNED
RAM and VMware tools upgrades are done on moz2-linux-slave01 -> 25 and try-linux-slave01 -> 19. I still need to go through the other VMs and do tools upgrades.
moz2-linux64-slave01 is done too.
We're going to do the rest of these updates in the downtime tomorrow.
Only machines left to do are:
production-1.9-master
qm-buildbot01
qm-rhel02
staging-1.9-master
staging-master
staging-try-master
talos-master
talos-staging-master
cruncher
production-opsi
production-prometheus-vm
production-puppet
prometheus-vm
staging-opsi
staging-puppet
staging-stage
Linux ref platform
(In reply to comment #17)
> Only machines left to do are:
> production-1.9-master
> qm-buildbot01
> staging-1.9-master
> staging-master
> staging-try-master
> talos-staging-master
> cruncher
> production-prometheus-vm
> production-puppet
> Linux ref platform

> prometheus-vm
> staging-puppet
> staging-stage

The following VMs need downtime to do the tools upgrade:
qm-rhel02
talos-master

production and staging opsi still need VMware tools, too, but they gave me an error when I tried to do the install, "A general system error occurred: Internal error". This might be because they were cloned from a Virtual Appliance? I'm not sure.
Whiteboard: still to do: talos-master, qm-rhel02, production-opsi, staging-opsi
(In reply to comment #18)
> (In reply to comment #17)
> production and staging opsi still need VMware tools, too, but they gave me an
> error when I tried to do the install, "A general system error occurred:
> Internal error". This might be because they were cloned from a Virtual
> Appliance? I'm not sure.

Phong: any idea what might be causing this?
I pinged Phong about this on Friday, actually, and he confirmed my theory about it happening because they were cloned from a Virtual Appliance. I'm not entirely certain what to do with them at this point.
(In reply to comment #18)
> The following VMs need downtime to do the tools uprgade:
> qm-rhel02
> talos-master

I've done these two while recovering from today's air-con outage.
(In reply to comment #21)
> (In reply to comment #18)
> > The following VMs need downtime to do the tools uprgade:
> > qm-rhel02
> > talos-master
> 
> I've done these two while recovering from today's air-con outage.

Nice, thanks Nick! 

bhearsum, any chance we can upgrade production-opsi without a downtime?
Whiteboard: still to do: talos-master, qm-rhel02, production-opsi, staging-opsi → still to do: production-opsi, staging-opsi
(In reply to comment #22)
> (In reply to comment #21)
> > (In reply to comment #18)
> > > The following VMs need downtime to do the tools uprgade:
> > > qm-rhel02
> > > talos-master
> > 
> > I've done these two while recovering from today's air-con outage.
> 
> Nice, thanks Nick! 
> 
> bhearsum, any chance we can upgrade production-opsi without a downtime?

Sure... the problem with staging and production-opsi is that VMware tools refuses to install on them, though, per comment 18.
Whiteboard: still to do: production-opsi, staging-opsi → still to do: vmware tools on production-opsi, staging-opsi
It's confusing to have this bug open still. I filed bug 511442 (in the future) to track getting vmware tools installed on the remaining two machines. The rest of this bug is FIXED.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: still to do: vmware tools on production-opsi, staging-opsi
I finally managed to get VMware tools installed on an OPSI server. Seems that having the Operating System set to 'Other' causes VMware to barf. Changing it to Linux -> Other 32-bit let me mount the VMware tools CD and do the install by hand.
Product: mozilla.org → Release Engineering