Bug 428865 (Closed): Opened 16 years ago, Closed 16 years ago

qm-centos5-01 failing lots of timing-related unittests

Categories
(Release Engineering :: General, defect, P2)

Platform
x86 Linux

Tracking
(Not tracked)

Status
RESOLVED INCOMPLETE

People
(Reporter: Gavin, Assigned: lsblakk)

Details

Recently, a lot of our tests that rely on timers have been failing, but only on qm-centos5-01 (e.g. bug 292789, bug 423833). One possible cause is that the machine is really bogged down, and that's causing code to run so slowly that the timers aren't firing in the right order. I haven't seen any similar issues on the Windows or Mac unit test machines. Would it be possible to investigate ways of speeding this machine up? If I recall correctly, it's a VM, so perhaps its CPU or RAM allocation could be increased?

(Ideally, our tests would never rely on timers, but sometimes it's the most straightforward way of testing something.)
This machine's a VM, so I'm throwing this over to server.ops for a consult. I'm not sure if we can give this VM a higher priority than the others, but that'd be what we need.
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → justin
er, english fail. "That would be what we need". :)
Anything that relies on timing shouldn't be on a VM. I can't guarantee a clock cycle.

This is why all the performance boxes are standalone physical boxes.
Assignee: server-ops → nobody
Component: Server Operations → Release Engineering
QA Contact: justin → release
We'll have 3 Linux unit test VMs on trunk soon (bug 425791) so we can at least compare results between them, if that's any consolation.
(In reply to comment #4)
> We'll have 3 Linux unit test VMs on trunk soon (bug 425791) so we can at least
> compare results between them, if that's any consolation.

Are they all running on the same VMHost? Maybe they'll all fail differently at the same time. ;)

Seriously though, this is why we run the Windows unittest machines on physical hardware. We've been down this road before. Should we start allocating machines to replace qm-centos5-01?
(In reply to comment #5)
> (In reply to comment #4)
> > We'll have 3 Linux unit test VMs on trunk soon (bug 425791) so we can at least
> > compare results between them, if that's any consolation.
> 
> Are they all running on the same VMHost? Maybe they'll all fail differently at
> the same time. ;)

No, in fact they run on three separate ESX servers.
Perfect, thanks.
Priority: -- → P3
Back to comment #3:

(In reply to comment #3)
> Anything that relies on timing shouldn't be on a VM.  I can't guarantee a clock
> cycle
> 
> This is why all the performance boxes are standalone physical boxes.

We're still seeing lots of intermittent errors on all the unittest VMs: qm-win2k3-pgo01, the moz2 unittest boxes (bug 435064), and qm-centos5-02 (bug 431745). After wrestling with these for months, it's becoming increasingly clear that VMs are not going to be able to give us the reliability we need to run these.
Assignee: nobody → rcampbell
Priority: P3 → P1
Summary: qm-centos5-01 failing lots of timing-related tests → qm-centos5-01 failing lots of timing-related unittests
1) Should this bug be closed as a dup of one of the others?

2) Do we know *which* specific tests are failing intermittently? If it's a specific few, maybe those tests need to be less time-sensitive?

3) I note that these VMs are all running with 1GB RAM, but the physical boxes that were used on mozilla1.9 had 4GB. Have you tried bumping RAM to 4GB to see if that helps?
Actually, qm-centos5-01 through qm-centos5-03 have 512MB of RAM. qm-centos5-moz2-01 has 768MB of RAM. I believe 768MB has generally been considered adequate for building and unittesting. I'd rather move to physical hardware than continue twiddling switches on these.

most recent random failure on this was (browser chrome):

chrome://mochikit/content/browser/browser/components/preferences/tests/browser_bug410900.js
	FAIL - Pref window opened - chrome://mochikit/content/browser/browser/components/preferences/tests/browser_bug410900.js

and on qm-centos5-03 browser chrome:
chrome://mochikit/content/browser/browser/components/preferences/tests/browser_bug410900.js
	PASS - Pref window opened
	PASS - Specified pane was opened
	FAIL - handlersView is present - chrome://mochikit/content/browser/browser/components/preferences/tests/browser_bug410900.js
	FAIL - App handler list populated - chrome://mochikit/content/browser/browser/components/preferences/tests/browser_bug410900.js

Machines went green on subsequent runs. These failures were not related to a recent checkin.

I don't have a recent list of failures on this particular box, but bug 435064 contains a laundry list of failures on moz2 on VMs.
I think if we're going to dup any of these bugs, we should consolidate them into a bug with the summary "unittests that don't work on VMs".
(In reply to comment #9)
> 3) I note that these VMs are all running with 1GB ram, but the physical boxes
> that were used on mozilla1.9 had 4GB. 
Correction, the physical boxes have 3.5GB RAM.


> Have you tried bumping RAM to 4GB to see if that helps?
Seems worth a quick try, imho.
Status: NEW → ASSIGNED
Priority: P1 → P3
Priority: P3 → P1
No longer blocks: 422754
Assignee: rcampbell → lukasblakk
Blocks: 438871
Status: ASSIGNED → NEW
Priority: P1 → P2
Status: NEW → ASSIGNED
Doesn't ESX have the ability to provide pretty-hard CPU, memory, and I/O bandwidth guarantees per VM (network as well, possibly)?  I thought they could let you set min, max, or both for those resources to within pretty fine tolerances.

What are the current min/max limits for the test VMs?
Well, IMO, we're supporting 180 million daily users at this point, and allocating a single piece of hardware for these tests seems like the right thing to do; it's a drop in the bucket compared to the overall picture. I wouldn't mess with tweaking the settings to make things look like a real box when we could have a real box. Is the cost of supporting a VM that much cheaper than supporting a single machine in this case? We've got developers ignoring test failures at this point and others just waiting.

Still, I'm in the "don't run unit tests in VMs" camp until someone really convinces me otherwise.
Please do not read me as voting for VMs-for-all; I am all for reducing the number of factors that can influence our test infrastructure. But if we're married to ESX, it seems like we should be able to give pretty tight resource guarantees. If that's not working, and we've escalated to VMware without result, then I dunno what to say other than "hello, ghost-and-minis!". :(
(In reply to comment #13)
> Doesn't ESX have the ability to provide pretty-hard CPU, memory, and I/O
> bandwidth guarantees per VM (network as well, possibly)?  I thought they could
> let you set min, max, or both for those resources to within pretty fine
> tolerances.

I haven't looked at that recently, but back when preed was here I thought we all came to the conclusion that things that required timing didn't work regardless of which knobs were turned, because you were still time-sliced with other VMs.

I believe that's also why there aren't any Talos boxes on VMs. 

> What are the current min/max limits for the test VMs?

There aren't any.

My understanding from the ESX whitepapers is that it provides pretty fine and predictable timeslicing, and good APIs to control it. Seems like a straightforward thing to test, to see how stable the numbers are with some tight-loop code and different workloads in the other guests on the system. If we have nguests == ncpus, then it almost seems trivial, and you still get the amortization of maintenance plus the ability to migrate, right?

(I mean, you get time slicing vs. the kernel and other processes on the system with physical hardware too; that's why we need to measure more than once for perf numbers.  I don't think that a variance of a single hypervisor scheduling quantum is going to disrupt unit test correctness, though!)

http://www.vmware-tsx.com/download.php?asset_id=39 is a presentation about how the CPU is scheduled in ESX, but I'm sure we have people at VMware whom we can ask specific questions.
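For what it's worth, here is a minimal sketch of the kind of tight-loop stability check being suggested above, assuming nothing beyond a stock Python interpreter on the guest (the interval, iteration count, and output format are arbitrary choices, not anything specified in this bug): it repeatedly sleeps for a fixed interval and reports how far the observed wall-clock intervals drift, which is roughly the scheduling variance that timer-dependent tests would see.

```python
#!/usr/bin/env python
# Hypothetical sketch: measure how much a guest oversleeps a fixed interval.
# Large or wildly varying deviations would suggest the hypervisor is starving
# the VM enough to upset timer-dependent unittests.
import time

INTERVAL = 0.05   # seconds requested per sleep (arbitrary choice)
ITERATIONS = 200  # number of samples (arbitrary choice)

deviations = []
for _ in range(ITERATIONS):
    start = time.time()
    time.sleep(INTERVAL)
    deviations.append((time.time() - start) - INTERVAL)

mean = sum(deviations) / len(deviations)
worst = max(deviations)
print("mean oversleep: %.4fs  worst oversleep: %.4fs" % (mean, worst))
```

Running this on a VM while its ESX neighbours are busy, and again on a physical box, would give a rough number to compare against whatever resource guarantees get configured.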
I agree - seems straightforward to test.  I'll try to find you online.
A couple of other details from today's investigations:

1) You can plot VM RAM usage via VI - we looked at a couple of machines, and they went a little bit over 50% RAM usage only once in the past week. So this is likely not the cause.

2) We can make our tests less sensitive by either removing timeouts or upping their limits (see bug 443493 for a specific example); a sketch of the general pattern follows this list.

3) I did a quick investigation of the Mac (which is a mini) vs the Linux boxes over the last month or so. The Mac is failing once a day or so with logs like http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla2/1214839075.1214845013.3569.gz, and folks are somewhat stumped as to the cause. These errors also show up on the Linux boxes. The issues on the Mac go back to late May.

4) We did try to set the VM scheduling stuff for Talos runs - it was not accurate enough for perf (but it might make the unit tests better).

5) We will happily spend on hardware and people time if physical hardware is the answer - but I want us to be very clear about what problems we are testing for. In the bit I've looked into this, I've found problems in lots of different places. My 30-second analysis is that the VMs, due to higher variance in CPU scheduling, tend to tickle sensitive test failures more often - but at least some of those *also* fail on physical boxes. We are setting up a few minis to run Linux side-by-side with the VMs to test for this.

6) There are a bunch of other bugs open on getting us better diagnostics - which is critical, because we will always run into tests or code that is flaky even if the systems are stable.
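As a companion to point 2 above, here is a minimal, hypothetical sketch of the "poll until the condition holds, with a generous deadline" pattern that makes a check less time-sensitive than "sleep a fixed amount, then assert". The real tests in question are JS mochitests, so this Python version only illustrates the shape of the fix; `window_is_open` is a made-up placeholder, not anything from the harness.

```python
import time

def wait_for(condition, timeout=30.0, poll=0.1):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition became true in time, False otherwise.
    A slow or heavily loaded machine just takes longer; the check no longer
    fails simply because a hard-coded delay turned out to be too short.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(poll)
    return False

# Instead of:  time.sleep(2); assert window_is_open(), "Pref window opened"
# prefer:      assert wait_for(window_is_open), "Pref window opened"
```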
See bug 438871 as well
Closing this, since qm-centos5-01 seems to be doing no worse than any of the other Linux VMs lately with regard to intermittent test failures, and since there are many bugs tracking the overall process of improving unittest runs, including running them on physical boxes.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
No longer blocks: 433384, 438871
Resolution: FIXED → INCOMPLETE
Product: mozilla.org → Release Engineering