Closed Bug 467634 Opened 16 years ago Closed 16 years ago

build ESX hosts under way-too-high-load

Categories: mozilla.org Graveyard :: Server Operations
Type: task
Priority: Not set
Severity: critical

Tracking: (Not tracked)

Status: RESOLVED FIXED

People: (Reporter: bhearsum, Assigned: phong)

Attachments: (1 file)

We moved production-master to better storage today in an effort to clear up the load problems on it. After moving it, the load issues seem unchanged. I had a look in the VI client and noticed that all of the hosts in the Intel cluster are in a 'warning' state. All of them have over 70% CPU usage, and I often see them spike to 100%. Memory usage is > 80% on all of them.

This is starting to cause pretty serious issues on production-master now - we see > 1.00 load average most of the time, and have slaves disconnecting often.

I'm talking through getting rid of some VMs with the rest of RelEng, and I think some of the VMs should be on developer ESX hosts. Ultimately, we need to get rid of some VMs, get more ESX hosts, or both.
OS: Mac OS X → All
Hardware: PC → All
Assignee: server-ops → mrz
Our VMware installation is configured to move VMs from overloaded ESX servers to less loaded ESX servers dynamically. (Dunno if VMware can move VMs from Intel to AMD dynamically - mrz?)

As far as I can tell in VI, the VMs themselves seem fine. Some run "hot", but those are slaves, which always run at 100% CPU when building. And always have.

Getting rid of unwanted VMs is always a good idea, but I'm not seeing the data that shows this will fix the problem we've been hitting recently with production-master. Can you provide data?
ESX can't VMotion across different CPU architectures.

Phong's looking at this and will provide recommendations.
Assignee: mrz → phong
IT's going to set up performance trending on production-master to see if that helps narrow down perf issues.
(In reply to comment #1)
> Getting rid of unwanted VMs is always a good idea, but I'm not seeing the data
> that shows this will fix the problem we've been hitting recently with
> production-master. Can you provide data?

As I outlined in comment #0, ALL of the Intel VM hosts are permanently in a 'warning' state (yellow icon and flag) - every few minutes one spikes to critical. You can see this by changing the view in the VI client to 'Hosts and Clusters', clicking on 'INTEL-01' and then changing to the 'Hosts' tab.

Most of the VMs don't actually trip any alarms themselves, but my theory is that the hosts are so overworked that no single VM can get enough juice to trip one.

Another datapoint here (courtesy of Nick):
Before we had these clusters we were told not to have more than 6 VMs per host. Right now in the Intel cluster we have 7 hosts and 86 VMs. That's an average of 12 VMs per host. We're way over our limit.

On the AMD cluster we're at 4 hosts and 35 VMs (average of over 8 per host) - so we're doing a little better there.
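As a quick sanity check of those per-host averages (a trivial sketch; the only inputs are the VM and host counts quoted above):

 # VM density per host in each cluster, using the counts above.
 intel_vms, intel_hosts = 86, 7
 amd_vms, amd_hosts = 35, 4

 print(intel_vms / intel_hosts)  # ~12.3 VMs per Intel host
 print(amd_vms / amd_hosts)      # 8.75 VMs per AMD host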
(In reply to comment #4)
> Another datapoint here (courtesy of Nick):
> Before we had these clusters we were told not to have more than 6 VMs per host.
> Right now in the Intel cluster we have 7 hosts and 86 VMs. That's an average of
> 12 VMs per host. We're way over our limit.

That's an old metric based on old CPUs.  I'm budgeting 20 VMs per 8-core host.

We used to have dual single-core Xeons and 8GB RAM.  The Intel blades are 2x quad-core Xeons and 16GB RAM.

Phong's been doing some data gathering on this and I'll let him comment with his findings instead of trying to remember what he told me.
Supplementary evidence that systems have been slower in the last couple of weeks than they were:
* increasingly frequent timeouts when linking xul.dll on win32 m-c, after 3600 seconds
* windows try server timeouts when cleaning up previous builds, after 1200 seconds
The ESX hosts are being overloaded.  If there are certain VMs that are super important, we can set up reservations for them.  Those VMs will have priority over the shared resources.
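For illustration, here is roughly what giving a single VM a guaranteed slice of CPU and memory looks like through the vSphere API. This is only a sketch using the (much newer) pyVmomi SDK - the vCenter hostname, credentials, and reservation sizes are placeholders, and in practice this was done through the VI client:

 # Sketch: set CPU/memory reservations on one VM via the vSphere API.
 # Hostname, credentials, and reservation values are placeholders.
 from pyVim.connect import SmartConnect, Disconnect
 from pyVmomi import vim

 si = SmartConnect(host="virtualcenter.example.com", user="admin", pwd="secret")
 content = si.RetrieveContent()

 # Walk the inventory to find the VM by name.
 view = content.viewManager.CreateContainerView(
     content.rootFolder, [vim.VirtualMachine], True)
 vm = next(v for v in view.view if v.name == "production-master")

 spec = vim.vm.ConfigSpec(
     cpuAllocation=vim.ResourceAllocationInfo(reservation=2000),    # MHz guaranteed
     memoryAllocation=vim.ResourceAllocationInfo(reservation=2048)) # MB guaranteed
 vm.ReconfigVM_Task(spec=spec)  # the host scheduler now guarantees these resources
 Disconnect(si)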
Can we bump up production-master's priority please? I don't think anything else needs it, though.
fyi, working on adding two more ESX hosts to the Intel DRS cluster.
That's good to hear. For our part, we'll try not to overload them again ;-). Thanks for the update mrz
(Pasted from mail for completeness)
I was fixing up the log removal cronjob on production-master (and deleting a lot of mozilla-1.9.1 logs) when I noticed that our cleanup job has a very significant impact on load average. Running it for 30s brings the load average from 0.50 to 1.00, for example. Our cleanup jobs are only set to run nightly, but we do have hourly jobs which make copies of unittest logs.

My current theory, though it's a long shot, is that this log copying hits the disk hard enough to spike the load average. The jobs are run with 'nice -n 19', but AFAIK that can only throttle CPU time, which doesn't help if we're saturating our disk.

This still doesn't explain how this behaviour would cause Buildbot to chew the CPU when we load the waterfall, though.

In any case, I've adjusted those jobs to only run on a nightly basis. If they are indeed the cause, we should see an improvement immediately. If they don't make a difference I'll restore them to their hourly schedule.
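To illustrate the nice-vs-disk point above: on Linux the copy job's I/O priority can be lowered as well, not just its CPU priority. A minimal sketch of the idea (the paths and the psutil dependency are assumptions, not what the actual cronjob does):

 # 'nice -n 19' only lowers CPU priority; ionice (here via psutil) also tells
 # the kernel to give this process disk time only when the disk is otherwise idle.
 import os, shutil, psutil

 proc = psutil.Process(os.getpid())
 proc.nice(19)                          # same effect as running under nice -n 19
 proc.ionice(psutil.IOPRIO_CLASS_IDLE)  # Linux-only: lowest I/O scheduling class

 src, dst = "/builds/logs", "/builds/log-backups"   # placeholder paths
 for name in os.listdir(src):
     shutil.copy2(os.path.join(src, name), os.path.join(dst, name))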
I've set up the reservations for production-master.  Let me know if this helps with performance.
Things have been a little better for me the past 24h. I've rarely seen a pageload for the waterfall take over 10s, and normally it happens in less than 5. We've still had some slave disconnects, but it seems like fewer than before (I haven't taken a tally so I'm not sure exactly). Both the reservations and the cronjob change-up happened around the same time, so it's hard to tell which helped, but here are some specifics about the cronjobs and load:

I checked the load periodically before & after the cronjobs ran last night, and the two jobs which back up unittest logs are definitely spiking our load.

Load average is around 0.30 until 4:02am (2 minutes after the jobs start), when it rises to about 1.50 for 15 minutes.

These jobs used to run on an hourly basis, so they probably had the same effect for a shorter period of time. Perhaps only 5 or 10 minutes out of every hour? Maybe less?

I've set these up to be staggered now, so that _should_ help the load a little bit when the jobs fire.
(In reply to comment #7)
> The ESX hosts are being overloaded.  

My understanding was that there was not enough spare capacity on the ESX servers to load-balance VMs if one of the ESX servers died. This is not great, and we should get extra ESX servers to handle that situation. However, that's different from the current set of ESX servers being overloaded.

Totally possible I misunderstood, can you clarify?


> If there are certain VMs that are super
> important, we can set up reservations for them.  Those VMs will have priority
> over the shared resources.

Very cool. Thanks for doing that with production-master, let's see if it helps. Can you also do the same for qm-rhel02.mozilla.org? It's the talos buildbot master, and it hit problems this morning - see details in bug#468859.
> Very cool. Thanks for doing that with production-master, let's see if it helps.
> Can you also do the same for qm-rhel02.mozilla.org? It's the talos buildbot
> master, and it hit problems this morning - see details in bug#468859.

Can this be done ASAP, please?  Or move qm-rhel02.mozilla.org back to another VM host.  Slaves are dropping like flies since qm-rhel02 was migrated this morning.
Mentioned to Alice that DRS does dynamic-resource-scheduling on its own and moves VMs around based on load requirements.  Very normal to see VMs move throughout the day.
According to the VI client web interface, this is the first time since October that this VM has been migrated.  And since the migration at 8:58 this morning we've had a disproportionate number of slave disconnect problems.
Two additional ESX servers on order, ETA 12/17.
We had a couple of glitches at about 16:30 where two linux slaves had build errors:
moz2-linux-slave02 building mozilla-central:
 rm -f libxul.so
 <link libxul.so command>
 ../../staticlib/components/libcaps.a: member ../../staticlib/components/libcaps.a(nsPrincipal.o) in archive is not an object
 collect2: ld returned 1 exit status
http://tinderbox.mozilla.org/showlog.cgi?tree=Firefox&errorparser=unix&logfile=1228955094.1228955887.24379.gz&buildtime=1228955094&buildname=Linux%20mozilla-central%20build&fulltext=1

moz2-linux-slave16 building mozilla-central debug:
 rm -f libzipwriter.so
 <link libzipwriter.so command>
 nsZipWriter.o: In function `nsZipWriter::AddEntryStream(nsACString_internal const&, long long, int, nsIInputStream*, int, unsigned int)':
 /builds/moz2_slave/mozilla-central-linux-debug/build/modules/libjar/zipwriter/src/nsZipWriter.cpp:536: undefined reference to `nsZipDataStream::Init(nsZipWriter*, nsIOutputStream*, nsZipHeader*, int)'
 /builds/moz2_slave/mozilla-central-linux-debug/build/modules/libjar/zipwriter/src/nsZipWriter.cpp:542: undefined reference to `nsZipDataStream::ReadStream(nsIInputStream*)'
 nsZipWriter.o: In function `nsZipDataStream':
 /builds/moz2_slave/mozilla-central-linux-debug/build/modules/libjar/zipwriter/src/nsZipDataStream.h:56: undefined reference to `vtable for nsZipDataStream'
nsZipWriter.o: In function `nsZipWriter::BeginProcessingAddition(nsZipQueueItem*, int*)':
 /builds/moz2_slave/mozilla-central-linux-debug/build/modules/libjar/zipwriter/src/nsZipWriter.cpp:912: undefined reference to `nsZipDataStream::Init(nsZipWriter*, nsIOutputStream*, nsZipHeader*, int)'
 /builds/moz2_slave/mozilla-central-linux-debug/build/modules/libjar/zipwriter/src/nsZipWriter.cpp:935: undefined reference to `nsZipDataStream::Init(nsZipWriter*, nsIOutputStream*, nsZipHeader*, int)'
collect2: ld returned 1 exit status
http://tinderbox.mozilla.org/showlog.cgi?tree=Firefox&errorparser=unix&logfile=1228955439.1228955689.23985.gz&buildtime=1228955439&buildname=Linux%20mozilla-central%20leak%20test%20build&fulltext=1

Both of these look like file system corruption. Phong says there were load spikes around the time the builds bailed out, and asks us to look for machines that we can power down until more capacity is available. At first glance there are some machines whose build frequency we can back off, and some staging builds we can disable for a while. More details to follow.
Started feeling these load spikes as well, and the VMs I'm working on are not critical, so I've taken the following VMs down for the time being; hopefully that will free up a little CPU.  None of them should have been doing anything resource-intensive, but every little bit helps:

test-mgmt.build
test-winslave.build
test-linslave.build
test-opsi.build

If someone could check these from the vmware console and make sure they shut down I'd appreciate it.

In addition, if it would be possible to get the Windows reference image (the VMware image of it), I could continue working on that front without these machines for the time being.
(In reply to comment #20)
> If someone could check these from the vmware console and make sure they shut
> down I'd appreciate it.

test-winslave needed some help, but it's having a lie down now. Thanks Corey.
Made these changes:

Mozilla1.8-l10n: build once a day via --interval 86400 in multi-config.pl
* karma (AMD), cerberus-vm (Intel)

staging-1.8-master (test builds):
* disable fx "nightly" builds by shutting down:
 * staging-prometheus-vm02, staging-pacifica-vm02 (both Intel)
* disable mac builds by commenting out "c['schedulers'].append(depend_scheduler)" - a small reduction of load on staging-master (Intel); see the sketch after this list
* turn off idle staging slaves:
 * staging-pacifica-vm, staging-patrocles, staging-crazyhorse (all Intel)

* other idle machines: mozillabuild-builder (Intel)
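For anyone following along, here is roughly what that scheduler change looks like in a Buildbot 0.7-style master.cfg. This is only a sketch - the scheduler and builder names are made up, not the real staging-1.8-master config:

 # master.cfg fragment (sketch). Commenting out the append is what drops
 # the dependent mac builds from the schedule.
 from buildbot.scheduler import Scheduler, Dependent

 c = BuildmasterConfig = {}
 c['schedulers'] = []

 full = Scheduler(name="fx-full", branch=None, treeStableTimer=5*60,
                  builderNames=["fx-linux", "fx-win32"])
 depend_scheduler = Dependent(name="fx-mac-depend", upstream=full,
                              builderNames=["fx-macosx"])

 c['schedulers'].append(full)
 # c['schedulers'].append(depend_scheduler)   # disabled to shed load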

But I think the real problem is that we have many, very busy, win32 slaves since the 1.9.1 branch opened (see attachment). Right now there are a total of 10 win32 builds running - 5 mozilla-central builds running, 8 mozilla-central leak tests pending, 3 mozilla-1.9.1 builds running, and 2 mozilla-1.9.1 leak tests running.
If I can get the ESX licenses soon, we can spin up two blades to fill in for a bit.  They'll be 4 core, 4GB RAM boxes but that might be enough to hold until next Wednesday.

cshields - you were asking for a local copy of the VM?  Which one did you want exported?
I'd like the win2k3 build slave reference image if it can be exported.  I can do local work and testing off of that for probably a week or two.

tia
I should clarify... it doesn't have to be the existing test-winslave image I've been working on... Just a clean copy of the reference image would be perfect (in fact it would be better - it gives me a clean slate to work on this new agent I'm trying).
To be clear, win2k3sp2-vc8tools-scrubbed-ref-vm is the VM to export for Corey.
John - are you okay with your MSDN key in that ref image going offsite?
I found some more stuff we can back off the build frequency on, and move to a less loaded place:
* moz1.8 - moved patrocles & crazyhorse to AMD cluster, set --interval 3600 for now, and this can probably be higher later. Also set the PeriodicScheduler for the Fx2 builds to 2 hours instead of 5 minutes (see the sketch after this list)
* moz1.9.0 - moved fxdbug-win32-tbox, fx-win32-tbox, and fx-linux-tbox from Intel to AMD cluster
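As a rough illustration of that PeriodicScheduler change, again in Buildbot 0.7-style master.cfg syntax (the scheduler and builder names here are assumptions):

 # master.cfg fragment (sketch): fire the Fx2 builders every 2 hours
 # instead of every 5 minutes.
 from buildbot.scheduler import Periodic

 c = BuildmasterConfig = {}
 c['schedulers'] = [
     Periodic(name="fx2-periodic",
              builderNames=["fx2-win32", "fx2-linux", "fx2-mac"],
              periodicBuildTimer=2*60*60),   # was 5*60
 ]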

Another source of more capacity would be to ask QA nicely if we could borrow one or both of the two QA hosts. They look like a similar spec to the 4 boxes in the build AMD cluster, and do maybe a couple of builds a day according to the CPU graphs.
The QA ESX hosts aren't in the same DRS cluster, so there's some amount of work to use them. Keep in mind that you'll have two more 8-core boxes on Wednesday.

qw-vmware02 is planned to join the AMD DRS cluster though. I forget what sort of work that involves.
bm-vmware12     IN A 10.2.71.212
bm-vmware12-vmotion     IN A 10.2.71.213
bm-vmware13     IN A 10.2.71.214
bm-vmware13-vmotion     IN A 10.2.71.215

bm-vmware12.san IN A 10.253.0.218
bm-vmware12-sc  IN A 10.253.0.219
bm-vmware13 IN A 10.253.0.220
bm-vmware13-sc  IN A 10.253.0.221
Woohoo!
bm-vmware12/13 have been added to the cluster.  VMs were automatically migrated to the new ESX hosts.
Is there enough capacity now to bring a few test VMs back online?
Yes.  Ping me (or phong) on IRC.

I'd like to call this bug closed...
(In reply to comment #35)
> I'd like to call this bug closed...

Me too. Please close as FIXED! Thank you, mrz and phong, for all your help.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard