Closed Bug 479146 Opened 11 years ago Closed 10 years ago
set fixed resource allocations for unit test VMs
In order to reduce the variability in timeout and timing-sensitive test failures, I would like us to set fixed resource limits on the VMs running the unit tests. We should lock them each to 1/5th of a machine or whatever, both max and min, so that our timeout problems will at least become more consistent, and isolated from load on other guests. I'm not sure how much control over I/O capacity we have in our current config, but ESX gives some knobs for cases dealing with local disk (and I think iSCSI); obviously the other side of the SAN is harder to apportion as crisply.
The current pool-of-slaves implementation means that any slave can be given a unit test job, so all slaves would need a fixed resource allocation (which somewhat reduces the flexibility of having a pool in the first place). Can we add an allocation dynamically? If so, the build process could signal to the ESX host that it is starting unit tests and needs guaranteed resources, and give them back afterwards.
I'm assuming you're talking about the Firefox, Firefox3.1, and Tracemonkey trees. Dedicated VMs are used for Firefox3.0.
We don't want a unit test VM to get more than its allocation either, because that leads to "false" passes that turn into failures when things are more contended. I don't care about idle cycles nearly as much as I care about consistent results, but as long as nothing ever gets more or less than it should, we can do whatever we want with the allocations. :)
Setting up reservations for the VMs has caused some problems. Since not all of the ESX hosts have identical CPU and RAM, these VMs have issues migrating between ESX hosts. What we should do instead is give the important VMs higher priority for resources. We should also set up rules that keep these VMs on separate ESX hosts.
I don't understand how giving them higher priority solves this problem. Having too many resources available is as bad as having too little, because it's variance that causes the unreliability.
Phong, as an aside, can you give us a complete list of the VMs we have reservations on?
high share priority:
production master qm-rhel02 qm-buildbot01 try-master
reservations limit CPU:
staging-prometheus-vm production-prometheus-vm fx-win32-1.9-slave2 production-prometheus-vm02 staging-try-master tb-linux-tbox fx-linux-1.9-slave08 moz2-linux-slave06 fx-linux-1.9-slave2 fxdbug-linux-tbox bm-l10n-centos5-01 fx-linux-1.9-slave1 moz2-linux-slave15 bm-centos5-unittest-01 fx-linux-1.9-slave03/04/07/09 moz2-linuxnonsse-slave01 moz2-linux-slave(01-19) murali-experiment staging-1.9-master staging-master test-linslave test-mgmt try-linux-slave(01-05) xr-linux-tbox
(In reply to comment #7)
> reservations limit CPU:
Phong, what does this actually mean? Is this an upper limit on CPU usage, or a lower limit?
That means those VMs are set for an upper limit on CPU usage. I think it should be the opposite.
I'm pretty sure we want both high and low limits, and we want them to be identical.
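To make "identical high and low limits" concrete, here is a minimal sketch of the arithmetic, not anything taken from our ESX configuration. The names `CpuAllocation`, `fixed_allocation` and `smallest_host_allocation`, and the MHz figures in the usage note, are hypothetical. Sizing from the smallest host is an assumption, made so that the same allocation would still fit after a VM migrates between non-identical hosts.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CpuAllocation:
    """Hypothetical per-VM CPU setting; both values are in MHz."""
    reservation_mhz: int  # guaranteed minimum
    limit_mhz: int        # hard maximum


def fixed_allocation(host_mhz: int, vms_per_host: int) -> CpuAllocation:
    """Split one host evenly and pin each VM to exactly its share (min == max)."""
    if vms_per_host <= 0:
        raise ValueError("vms_per_host must be positive")
    share = host_mhz // vms_per_host
    return CpuAllocation(reservation_mhz=share, limit_mhz=share)


def smallest_host_allocation(host_capacities_mhz: Sequence[int],
                             vms_per_host: int) -> CpuAllocation:
    """Size the share from the weakest host so it fits on every host."""
    return fixed_allocation(min(host_capacities_mhz), vms_per_host)
```

For example, with hypothetical hosts of 12000 MHz and 10000 MHz and five VMs per host, `smallest_host_allocation([12000, 10000], 5)` gives every VM a reservation and a limit of 2000 MHz.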
We replaced the old AMD cluster with 4 new Intel blades. This gives us more resources to share amongst the virtual machines. Ideally we would let VMWare DRS handle resource allocations for the VMs. If we set a lower limit, then those cycles will be reserved for the VM even if it's idle and not using them. When we set an upper limit, the VM won't be able to use more than what is allocated even if there are more resources available.
(In reply to comment #11)
> If we set a lower limit, then those cycle will be reserve for the VM even if
> it's idle and not using it. When we set an upper limit, the VM won't be able
> to use more than what is allocated even if there is more resources available.
Yes, that's exactly the point of what I'm asking for. I want to make sure that if a unit test is running on a VM, it always gets the same CPU resources -- never extra, never fewer. Otherwise, anything that has a timing element will vary according to what happens to be running next to it.
(In reply to comment #12)
> (In reply to comment #11)
> > If we set a lower limit, then those cycle will be reserve for the VM even if
> > it's idle and not using it. When we set an upper limit, the VM won't be able
> > to use more than what is allocated even if there is more resources available.
>
> Yes, that's exactly the point of what I'm asking for. I want to make sure that
> if a unit test is running on a VM, it always gets the same CPU resources --
> never extra, never fewer. Otherwise, anything that has a timing element will
> vary according to what happens to be running next to it.
We don't have dedicated unittest machines. This will make the machines unable to burst during periods when there are free resources. Do you really want that?
Well, I want there to be as little timing variance in our unit test machines as possible. I don't care if the unit test VMs' resources are used if there is no unit test being run, but when they're running I want them to always run with the same (fixed) parameters. Fixed/permanent min=max limits are one way to address that, but if you have others (such as using the VMWare guest APIs to change the reservations when the unit test runs start and end) then go for it. How often are they idle?
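The dynamic option (pin the VM only while a unit test run is in progress, then release it) could be sketched roughly as below. This is an illustration under stated assumptions, not the VMWare guest API: `ResourceController`, `set_cpu`, `pinned_cpu` and `RecordingController` are all hypothetical names, and a real implementation would have to wrap whatever interface the ESX hosts actually expose.

```python
from contextlib import contextmanager
from typing import Iterator, List, Optional, Protocol, Tuple


class ResourceController(Protocol):
    """Hypothetical interface to whatever adjusts a VM's CPU settings."""

    def set_cpu(self, vm_name: str, reservation_mhz: int,
                limit_mhz: Optional[int]) -> None: ...


@contextmanager
def pinned_cpu(controller: ResourceController, vm_name: str,
               mhz: int) -> Iterator[None]:
    """Hold reservation == limit for the duration of a unit test run."""
    controller.set_cpu(vm_name, reservation_mhz=mhz, limit_mhz=mhz)
    try:
        yield
    finally:
        # Give the cycles back even if the test run fails or times out.
        controller.set_cpu(vm_name, reservation_mhz=0, limit_mhz=None)


class RecordingController:
    """Stand-in that only records calls, so the sketch can run anywhere."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, int, Optional[int]]] = []

    def set_cpu(self, vm_name: str, reservation_mhz: int,
                limit_mhz: Optional[int]) -> None:
        self.calls.append((vm_name, reservation_mhz, limit_mhz))
```

A build step would wrap the test run in `with pinned_cpu(controller, "moz2-linux-slave06", 2000): ...`, so the slave is fixed at a constant allocation during the run and can burst again once it is idle.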
I might be disconnected but I thought unit tests were on the Mac Minis and not at all on the VMs. Am I wrong?
Stalled on some input - which VMs are doing unit tests that need this?
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Fixing this in bug#548768, as a side-effect of moving unittests from VMs to minis. Running unittests on identical hardware, one process per machine, should remove hardware/VM differences as a source of variability in timeout and timing-sensitive test failures. I'll close this as DUP, because that seems closest; we are fixing the underlying problem in another bug, just differently from how it was originally asked here.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 548768
Product: mozilla.org → Release Engineering