Closed Bug 467634 Opened 16 years ago Closed 16 years ago

build ESX hosts under way-too-high-load

Categories: mozilla.org Graveyard :: Server Operations
Type: task
Priority: Not set
Severity: critical

Tracking: (Not tracked)

Status: RESOLVED FIXED

People: (Reporter: bhearsum, Assigned: phong)

Attachments: (1 file)

We moved production-master to better storage today in an effort to clear up the load problems on it. After moving it, the load issues seem unchanged. I had a look in the VI client and noticed that all of the hosts in the Intel cluster are in a 'warning' state. All of them have over 70% CPU usage, and I often see them spike to 100%. Memory usage is > 80% on all of them.

This is starting to cause pretty serious issues on production-master now - we see > 1.00 load average most of the time, and have slaves disconnecting often.

I'm talking through getting rid of some VMs with the rest of RelEng, and I think some of the VMs should be on developer ESX hosts. Ultimately, we need to get rid of some VMs, get more ESX hosts, or both.
OS: Mac OS X → All
Hardware: PC → All
Assignee: server-ops → mrz
Our VMware installation is configured to move VMs from overloaded ESX servers to less loaded ESX servers dynamically. (Dunno if VMware can move VMs from Intel to AMD dynamically - mrz?)

As far as I can tell in VI, the VMs themselves seem fine. Some run "hot", but those are slaves, which always run at 100% CPU when building. And always have.

Getting rid of unwanted VMs is always a good idea, but I'm not seeing the data that shows this will fix the problem we've been hitting recently with production-master. Can you provide data?
ESX can't VMotion across different CPU architectures.

Phong's looking at this and will provide recommendations.
Assignee: mrz → phong
IT's going to set up performance trending on production-master to see if that helps narrow down perf issues.
(In reply to comment #1)
> Getting rid of unwanted VMs is always a good idea, but I'm not seeing the data
> that shows this will fix the problem we've been hitting recently with
> production-master. Can you provide data?

As I outlined in comment #0, ALL of the Intel VM hosts are permanently in a 'warning' state (yellow icon and flag) - every few minutes one spikes to critical. You can see this by changing the view in the VI client to 'Hosts and Clusters', clicking on 'INTEL-01' and then changing to the 'Hosts' tab.

Most of the VMs don't actually trip any alarms themselves, but my theory is that the hosts are so overworked that no single VM can get enough juice to trip one.

Another datapoint here (courtesy of Nick):
Before we had these clusters we were told not to have more than 6 VMs per host. Right now in the Intel cluster we have 7 hosts and 86 VMs. That's an average of 12 VMs per host. We're way over our limit.

On the AMD cluster we're at 4 hosts and 35 VMs (average of over 8 per host) - so we're doing a little better there.
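As a quick sanity check of those per-host averages (a trivial sketch; the only inputs are the VM and host counts quoted above):

 # VM density per host in each cluster, using the counts above.
 intel_vms, intel_hosts = 86, 7
 amd_vms, amd_hosts = 35, 4

 print(intel_vms / intel_hosts)  # ~12.3 VMs per Intel host
 print(amd_vms / amd_hosts)      # 8.75 VMs per AMD host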
(In reply to comment #4)
> Another datapoint here (courtesy of Nick):
> Before we had these clusters we were told not to have more than 6 VMs per host.
> Right now in the Intel cluster we have 7 hosts and 86 VMs. That's an average of
> 12 VMs per host. We're way over our limit.

That's an old metric based on old CPUs.  I'm budgeting 20 VMs per 8-core host.

We used to have dual single-core Xeons and 8GB RAM.  The Intel blades are 2x quad-core Xeons and 16GB RAM.

Phong's been doing some data gathering on this and I'll let him comment with his findings instead of trying to remember what he told me.
Supplementary evidence that systems have been slower in the last couple of weeks than they were:
* increasingly frequent timeouts when linking xul.dll on win32 m-c, after 3600 seconds
* windows try server timeouts when cleaning up previous builds, after 1200 seconds
The ESX hosts are being overloaded.  If there are certain VMs that are super important, we can set up reservations for them.  Those VMs will have priority over the shared resources.
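For illustration, here is roughly what giving a single VM a guaranteed slice of CPU and memory looks like through the vSphere API. This is only a sketch using the (much newer) pyVmomi SDK - the vCenter hostname, credentials, and reservation sizes are placeholders, and in practice this was done through the VI client:

 # Sketch: set CPU/memory reservations on one VM via the vSphere API.
 # Hostname, credentials, and reservation values are placeholders.
 from pyVim.connect import SmartConnect, Disconnect
 from pyVmomi import vim

 si = SmartConnect(host="virtualcenter.example.com", user="admin", pwd="secret")
 content = si.RetrieveContent()

 # Walk the inventory to find the VM by name.
 view = content.viewManager.CreateContainerView(
     content.rootFolder, [vim.VirtualMachine], True)
 vm = next(v for v in view.view if v.name == "production-master")

 spec = vim.vm.ConfigSpec(
     cpuAllocation=vim.ResourceAllocationInfo(reservation=2000),    # MHz guaranteed
     memoryAllocation=vim.ResourceAllocationInfo(reservation=2048)) # MB guaranteed
 vm.ReconfigVM_Task(spec=spec)  # the host scheduler now guarantees these resources
 Disconnect(si)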
Can we bump up production-master's priority please? I don't think anything else needs it, though.
fyi, working on adding two more ESX hosts to the Intel DRS cluster.
That's good to hear. For our part, we'll try not to overload them again ;-). Thanks for the update mrz
(Pasted from mail for completeness)
I was fixing up the log removal cronjob on production-master (and deleting a lot of mozilla-1.9.1 logs) when I noticed that our cleanup job has a very significant impact on load average. Running it for 30s brings the load average from 0.50 to 1.00, for example. Our cleanup jobs are only set to run nightly, but we do have hourly jobs which make copies of unittest logs.

My current theory, though it's a long shot, is that this log copying hits the disk hard enough to spike the load average. The jobs are run with 'nice -n 19', but AFAIK that can only throttle CPU time, which doesn't help if we're saturating our disk.

This still doesn't explain how this behaviour would cause Buildbot to chew the CPU when we load the waterfall, though.

In any case, I've adjusted those jobs to only run on a nightly basis. If they are indeed the cause, we should see an improvement immediately. If they don't make a difference I'll restore them to their hourly schedule.
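To illustrate the nice-vs-disk point above: on Linux the copy job's I/O priority can be lowered as well, not just its CPU priority. A minimal sketch of the idea (the paths and the psutil dependency are assumptions, not what the actual cronjob does):

 # 'nice -n 19' only lowers CPU priority; ionice (here via psutil) also tells
 # the kernel to give this process disk time only when the disk is otherwise idle.
 import os, shutil, psutil

 proc = psutil.Process(os.getpid())
 proc.nice(19)                          # same effect as running under nice -n 19
 proc.ionice(psutil.IOPRIO_CLASS_IDLE)  # Linux-only: lowest I/O scheduling class

 src, dst = "/builds/logs", "/builds/log-backups"   # placeholder paths
 for name in os.listdir(src):
     shutil.copy2(os.path.join(src, name), os.path.join(dst, name))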
I've set up the reservations for production-master.  Let me know if this helps with performance.
Things have been a little better for me the past 24h. I've rarely seen a pageload for the waterfall take over 10s, and normally it happens in less than 5. We've still had some slave disconnects, but it seems like fewer than before (I haven't taken a tally so I'm not sure exactly). Both the reservations and the cronjob change-up happened around the same time, so it's hard to tell which helped, but here are some specifics about the cronjobs and load:

I checked the load periodically before & after the cronjobs ran last night, and the two jobs which back up unittest logs are definitely spiking our load.

Load average is around 0.30 until 4:02am (2 minutes after the jobs start), when it rises to about 1.50 for 15 minutes.

These jobs used to run on an hourly basis, so they probably had the same effect for a shorter period of time. Perhaps only 5 or 10 minutes out of every hour? Maybe less?

I've set these up to be staggered now, so that _should_ help the load a little bit when the jobs fire.
(In reply to comment #7)
> The ESX hosts are being overloaded.  

My understanding was that there was not enough spare capacity on the ESX servers to load-balance VMs if one of the ESX servers died. This is not great, and we should get extra ESX servers to handle that situation. However, that's different from the current set of ESX servers being overloaded.

Totally possible I misunderstood, can you clarify?


> If there are certain VMs that are super
> important, we can set up reservations for them.  Those VMs will have priority
> over the shared resources.

Very cool. Thanks for doing that with production-master, let's see if it helps. Can you also do the same for qm-rhel02.mozilla.org? It's the talos buildbot master, and it hit problems this morning - see details in bug#468859.
> Very cool. Thanks for doing that with production-master, let's see if it helps.
> Can you also do the same for qm-rhel02.mozilla.org? It's the talos buildbot
> master, and it hit problems this morning - see details in bug#468859.

Can this be done ASAP, please?  Or move qm-rhel02.mozilla.org back to another VM host.  Slaves are dropping like flies since qm-rhel02 was migrated this morning.
Mentioned to Alice that DRS does dynamic-resource-scheduling on its own and moves VMs around based on load requirements.  Very normal to see VMs move throughout the day.
According to the VI client web interface, this is the first time since October that this VM has been migrated.  And since the migration at 8:58 this morning we've had a disproportionate number of slave disconnect problems.
Two additional ESX servers on order, ETA 12/17.
We had a couple of glitches at about 16:30 where two linux slaves had build errors:
moz2-linux-slave02 building mozilla-central:
 rm -f libxul.so
 <link libxul.so command>
 ../../staticlib/components/libcaps.a: member ../../staticlib/components/libcaps.a(nsPrincipal.o) in archive is not an object
 collect2: ld returned 1 exit status
http://tinderbox.mozilla.org/showlog.cgi?tree=Firefox&errorparser=unix&logfile=1228955094.1228955887.24379.gz&buildtime=1228955094&buildname=Linux%20mozilla-central%20build&fulltext=1

moz2-linux-slave16 building mozilla-central debug:
 rm -f libzipwriter.so
 <link libzipwriter.so command>
 nsZipWriter.o: In function `nsZipWriter::AddEntryStream(nsACString_internal const&, long long, int, nsIInputStream*, int, unsigned int)':
 /builds/moz2_slave/mozilla-central-linux-debug/build/modules/libjar/zipwriter/src/nsZipWriter.cpp:536: undefined reference to `nsZipDataStream::Init(nsZipWriter*, nsIOutputStream*, nsZipHeader*, int)'
 /builds/moz2_slave/mozilla-central-linux-debug/build/modules/libjar/zipwriter/src/nsZipWriter.cpp:542: undefined reference to `nsZipDataStream::ReadStream(nsIInputStream*)'
 nsZipWriter.o: In function `nsZipDataStream':
 /builds/moz2_slave/mozilla-central-linux-debug/build/modules/libjar/zipwriter/src/nsZipDataStream.h:56: undefined reference to `vtable for nsZipDataStream'
nsZipWriter.o: In function `nsZipWriter::BeginProcessingAddition(nsZipQueueItem*, int*)':
 /builds/moz2_slave/mozilla-central-linux-debug/build/modules/libjar/zipwriter/src/nsZipWriter.cpp:912: undefined reference to `nsZipDataStream::Init(nsZipWriter*, nsIOutputStream*, nsZipHeader*, int)'
 /builds/moz2_slave/mozilla-central-linux-debug/build/modules/libjar/zipwriter/src/nsZipWriter.cpp:935: undefined reference to `nsZipDataStream::Init(nsZipWriter*, nsIOutputStream*, nsZipHeader*, int)'
collect2: ld returned 1 exit status
http://tinderbox.mozilla.org/showlog.cgi?tree=Firefox&errorparser=unix&logfile=1228955439.1228955689.23985.gz&buildtime=1228955439&buildname=Linux%20mozilla-central%20leak%20test%20build&fulltext=1

Both of these look like file system corruption. Phong says there were load spikes around the time the builds bailed out, and asks us to look for machines that we can power down until more capacity is available. At first glance there are some machines whose build frequency we can back off, and some staging builds we can disable for a while. More details to follow.
Started feeling these load spikes as well, and the VMs I'm working on are not critical, so I've taken the following VMs down for the time being; hopefully that will free up a little CPU.  None of them should have been doing anything resource-intensive, but every little bit helps:

test-mgmt.build
test-winslave.build
test-linslave.build
test-opsi.build

If someone could check these from the vmware console and make sure they shut down I'd appreciate it.

In addition, if it would be possible to get the Windows reference image (the VMware image of it), I could continue working on that front without these machines for the time being.
(In reply to comment #20)
> If someone could check these from the vmware console and make sure they shut
> down I'd appreciate it.

test-winslave needed some help, but it's having a lie down now. Thanks Corey.
Made these changes:

Mozilla1.8-l10n: build once a day via --interval 86400 in multi-config.pl
* karma (AMD), cerberus-vm (Intel)

staging-1.8-master (test builds):
* disable fx "nightly" builds by shutting down:
 * staging-prometheus-vm02, staging-pacifica-vm02 (both Intel)
* disable mac builds by commenting out "c['schedulers'].append(depend_scheduler)" - a small reduction of load on staging-master (Intel); see the sketch after this list
* turn off idle staging slaves:
 * staging-pacifica-vm, staging-patrocles, staging-crazyhorse (all Intel)

* other idle machines: mozillabuild-builder (Intel)
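For anyone following along, here is roughly what that scheduler change looks like in a Buildbot 0.7-style master.cfg. This is only a sketch - the scheduler and builder names are made up, not the real staging-1.8-master config:

 # master.cfg fragment (sketch). Commenting out the append is what drops
 # the dependent mac builds from the schedule.
 from buildbot.scheduler import Scheduler, Dependent

 c = BuildmasterConfig = {}
 c['schedulers'] = []

 full = Scheduler(name="fx-full", branch=None, treeStableTimer=5*60,
                  builderNames=["fx-linux", "fx-win32"])
 depend_scheduler = Dependent(name="fx-mac-depend", upstream=full,
                              builderNames=["fx-macosx"])

 c['schedulers'].append(full)
 # c['schedulers'].append(depend_scheduler)   # disabled to shed load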

But I think the real problem is that we have many, very busy, win32 slaves since the 1.9.1 branch opened (see attachment). Right now there are a total of 10 win32 builds running - 5 mozilla-central builds running, 8 mozilla-central leak tests pending, 3 mozilla-1.9.1 builds running, and 2 mozilla-1.9.1 leak tests running.
If I can get the ESX licenses soon, we can spin up two blades to fill in for a bit.  They'll be 4 core, 4GB RAM boxes but that might be enough to hold until next Wednesday.

cshields - you were asking for a local copy of the VM?  Which one did you want exported?
I'd like the win2k3 build slave reference image if it can be exported.  I can do local work and testing off of that for probably a week or two.

tia
I should clarify... it doesn't have to be the existing test-winslave image I've been working on... Just a clean copy of the reference image would be perfect (in fact it would be better - it gives me a clean slate to work on this new agent I'm trying).
To be clear, win2k3sp2-vc8tools-scrubbed-ref-vm is the VM to export for Corey.
John - are you okay with your MSDN key in that ref image going offsite?
I found some more stuff we can back off the build frequency on, and move to a less loaded place:
* moz1.8 - moved patrocles & crazyhorse to AMD cluster, set --interval 3600 for now, and this can probably be higher later. Also set the PeriodicScheduler for the Fx2 builds to 2 hours instead of 5 minutes (see the sketch after this list)
* moz1.9.0 - moved fxdbug-win32-tbox, fx-win32-tbox, and fx-linux-tbox from Intel to AMD cluster
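As a rough illustration of that PeriodicScheduler change, again in Buildbot 0.7-style master.cfg syntax (the scheduler and builder names here are assumptions):

 # master.cfg fragment (sketch): fire the Fx2 builders every 2 hours
 # instead of every 5 minutes.
 from buildbot.scheduler import Periodic

 c = BuildmasterConfig = {}
 c['schedulers'] = [
     Periodic(name="fx2-periodic",
              builderNames=["fx2-win32", "fx2-linux", "fx2-mac"],
              periodicBuildTimer=2*60*60),   # was 5*60
 ]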

Another source of more capacity would be to ask QA nicely if we could borrow one or both of the two QA hosts. They look like a similar spec to the 4 boxes in the build AMD cluster, and do maybe a couple of builds a day according to the CPU graphs.
The QA ESX hosts aren't in the same DRS cluster, so there's some amount of work to use them. Keep in mind that you'll have two more 8-core boxes on Wednesday.

qw-vmware02 is planned to join the AMD DRS cluster though. I forget what sort of work that involves.
bm-vmware12     IN A 10.2.71.212
bm-vmware12-vmotion     IN A 10.2.71.213
bm-vmware13     IN A 10.2.71.214
bm-vmware13-vmotion     IN A 10.2.71.215

bm-vmware12.san IN A 10.253.0.218
bm-vmware12-sc  IN A 10.253.0.219
bm-vmware13 IN A 10.253.0.220
bm-vmware13-sc  IN A 10.253.0.221
Woohoo!
bm-vmware12/13 have been added to the cluster.  VMs were automatically migrated to the new ESX hosts.
Is there enough capacity now to bring a few test VMs back online?
Yes.  Ping me (or phong) on IRC.

I'd like to call this bug closed...
(In reply to comment #35)
> I'd like to call this bug closed...

Me too. Please close as FIXED! Thank you, mrz and phong, for all your help.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard