Bug 428865 (Closed): Opened 16 years ago, Closed 16 years ago

qm-centos5-01 failing lots of timing-related unittests

Categories
(Release Engineering :: General, defect, P2)

Platform
x86 Linux

Tracking
(Not tracked)

Status
RESOLVED INCOMPLETE

People
(Reporter: Gavin, Assigned: lsblakk)

Details

Recently, a lot of our tests that rely on timers have been failing, but only on qm-centos5-01 (e.g. bug 292789, bug 423833). One possible cause is that the machine is really bogged down, and that's causing code to run so slowly that the timers aren't firing in the right order. I haven't seen any similar issues on the Windows or Mac unit test machines. Would it be possible to investigate ways of speeding this machine up? If I recall correctly, it's a VM, so perhaps its CPU or RAM allocation could be increased?

(Ideally, our tests would never rely on timers, but sometimes it's the most straightforward way of testing something.)
This machine's a VM, so I'm throwing this over to server.ops for a consult. I'm not sure if we can give this VM a higher priority than the others, but that'd be what we need.
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → justin
er, english fail. "That would be what we need". :)
Anything that relies on timing shouldn't be on a VM. I can't guarantee a clock cycle.

This is why all the performance boxes are standalone physical boxes.
Assignee: server-ops → nobody
Component: Server Operations → Release Engineering
QA Contact: justin → release
We'll have 3 Linux unit test VMs on trunk soon (bug 425791) so we can at least compare results between them, if that's any consolation.
(In reply to comment #4)
> We'll have 3 Linux unit test VMs on trunk soon (bug 425791) so we can at least
> compare results between them, if that's any consolation.

Are they all running on the same VMHost? Maybe they'll all fail differently at the same time. ;)

Seriously though, this is why we run the Windows unittest machines on physical hardware. We've been down this road before. Should we start allocating machines to replace qm-centos5-01?
(In reply to comment #5)
> (In reply to comment #4)
> > We'll have 3 Linux unit test VMs on trunk soon (bug 425791) so we can at least
> > compare results between them, if that's any consolation.
> 
> Are they all running on the same VMHost? Maybe they'll all fail differently at
> the same time. ;)

No, in fact they run on three separate ESX servers.
Perfect, thanks.
Priority: -- → P3
Back to comment #3:

(In reply to comment #3)
> Anything that relies on timing shouldn't be on a VM.  I can't guarantee a clock
> cycle
> 
> This is why all the performance boxes are standalone physical boxes.

We're still seeing lots of intermittent errors on all the unittest VMs: qm-win2k3-pgo01, the moz2 unittest boxes (bug 435064), and qm-centos5-02 (bug 431745). After wrestling with these for months, it's becoming increasingly clear that VMs are not going to be able to give us the reliability we need to run these.
Assignee: nobody → rcampbell
Priority: P3 → P1
Summary: qm-centos5-01 failing lots of timing-related tests → qm-centos5-01 failing lots of timing-related unittests
1) Should this bug be closed as a dup of one of the others?

2) Do we know *which* specific tests are failing intermittently? If it's a specific few, maybe those tests need to be less time-sensitive?

3) I note that these VMs are all running with 1GB RAM, but the physical boxes that were used on mozilla1.9 had 4GB. Have you tried bumping RAM to 4GB to see if that helps?
Actually, qm-centos5-01 through qm-centos5-03 have 512MB of RAM. qm-centos5-moz2-01 has 768MB of RAM. I believe 768MB has generally been considered adequate for building and unittesting. I'd rather move to physical hardware than continue twiddling switches on these.

most recent random failure on this was (browser chrome):

chrome://mochikit/content/browser/browser/components/preferences/tests/browser_bug410900.js
	FAIL - Pref window opened - chrome://mochikit/content/browser/browser/components/preferences/tests/browser_bug410900.js

and on qm-centos5-03 browser chrome:
chrome://mochikit/content/browser/browser/components/preferences/tests/browser_bug410900.js
	PASS - Pref window opened
	PASS - Specified pane was opened
	FAIL - handlersView is present - chrome://mochikit/content/browser/browser/components/preferences/tests/browser_bug410900.js
	FAIL - App handler list populated - chrome://mochikit/content/browser/browser/components/preferences/tests/browser_bug410900.js

Machines went green on subsequent runs. These failures were not related to a recent checkin.

I don't have a recent list of failures on this particular box, but bug 435064 contains a laundry list of failures on moz2 on VMs.
I think if we're going to dup any of these bugs, we should consolidate them into a bug with the summary "unittests that don't work on VMs".
(In reply to comment #9)
> 3) I note that these VMs are all running with 1GB ram, but the physical boxes
> that were used on mozilla1.9 had 4GB. 
Correction, the physical boxes have 3.5GB RAM.


> Have you tried bumping RAM to 4GB to see if that helps?
Seems worth a quick try, imho.
Status: NEW → ASSIGNED
Priority: P1 → P3
Priority: P3 → P1
No longer blocks: 422754
Assignee: rcampbell → lukasblakk
Blocks: 438871
Status: ASSIGNED → NEW
Priority: P1 → P2
Status: NEW → ASSIGNED
Doesn't ESX have the ability to provide pretty-hard CPU, memory, and I/O bandwidth guarantees per VM (network as well, possibly)?  I thought they could let you set min, max, or both for those resources to within pretty fine tolerances.

What are the current min/max limits for the test VMs?
Well, IMO, we're supporting 180 million daily users at this point, and allocating a single piece of hardware for these tests seems like the right thing to do; it's a drop in the bucket compared to the overall picture. I wouldn't mess with tweaking the settings to make things look like a real box when we could have a real box. Is the cost of supporting a VM that much cheaper than supporting a single machine in this case? We've got developers ignoring test failures at this point and others just waiting.

Still, I'm in the "don't run unit tests in VMs" camp until someone really convinces me otherwise.
Please do not read me as voting for VMs-for-all; I am all for reducing the number of factors that can influence our test infrastructure. But if we're married to ESX, it seems like we should be able to give pretty tight resource guarantees. If that's not working, and we've escalated to VMware without result, then I dunno what to say other than "hello, ghost-and-minis!". :(
(In reply to comment #13)
> Doesn't ESX have the ability to provide pretty-hard CPU, memory, and I/O
> bandwidth guarantees per VM (network as well, possibly)?  I thought they could
> let you set min, max, or both for those resources to within pretty fine
> tolerances.

I haven't looked at that recently, but back when preed was here I thought we all came to the conclusion that things that required timing didn't work regardless of which knobs were turned, because you were still time-sliced with other VMs.

I believe that's also why there aren't any Talos boxes on VMs. 

> What are the current min/max limits for the test VMs?

There aren't any.

My understanding from the ESX whitepapers is that it provides pretty fine and predictable timeslicing, and good APIs to control it. Seems like a straightforward thing to test, to see how stable the numbers are with some tight-loop code and different workloads in the other guests on the system. If we have nguests == ncpus, then it almost seems trivial, and you still get the amortization of maintenance plus the ability to migrate, right?

(I mean, you get time slicing vs. the kernel and other processes on the system with physical hardware too; that's why we need to measure more than once for perf numbers.  I don't think that a variance of a single hypervisor scheduling quantum is going to disrupt unit test correctness, though!)

http://www.vmware-tsx.com/download.php?asset_id=39 is a presentation about how the CPU is scheduled in ESX, but I'm sure we have people at VMware whom we can ask specific questions.
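For what it's worth, here is a minimal sketch of the kind of tight-loop stability check being suggested above, assuming nothing beyond a stock Python interpreter on the guest (the interval, iteration count, and output format are arbitrary choices, not anything specified in this bug): it repeatedly sleeps for a fixed interval and reports how far the observed wall-clock intervals drift, which is roughly the scheduling variance that timer-dependent tests would see.

```python
#!/usr/bin/env python
# Hypothetical sketch: measure how much a guest oversleeps a fixed interval.
# Large or wildly varying deviations would suggest the hypervisor is starving
# the VM enough to upset timer-dependent unittests.
import time

INTERVAL = 0.05   # seconds requested per sleep (arbitrary choice)
ITERATIONS = 200  # number of samples (arbitrary choice)

deviations = []
for _ in range(ITERATIONS):
    start = time.time()
    time.sleep(INTERVAL)
    deviations.append((time.time() - start) - INTERVAL)

mean = sum(deviations) / len(deviations)
worst = max(deviations)
print("mean oversleep: %.4fs  worst oversleep: %.4fs" % (mean, worst))
```

Running this on a VM while its ESX neighbours are busy, and again on a physical box, would give a rough number to compare against whatever resource guarantees get configured.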
I agree - seems straightforward to test.  I'll try to find you online.
A couple of other details from today's investigations:

1) You can plot VM RAM usage via VI - we looked at a couple of machines, and they went a little bit over 50% RAM usage only once in the past week. So this is likely not the cause.

2) We can make our tests less sensitive by either removing timeouts or upping their limits (see bug 443493 for a specific example); a sketch of the general pattern follows this list.

3) I did a quick investigation of the Mac (which is a mini) vs the Linux boxes over the last month or so. The Mac is failing once a day or so with logs like http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla2/1214839075.1214845013.3569.gz, and folks are somewhat stumped as to the cause. These errors also show up on the Linux boxes. The issues on the Mac go back to late May.

4) We did try to set the VM scheduling stuff for Talos runs - it was not accurate enough for perf (but it might make the unit tests better).

5) We will happily spend on hardware and people time if physical hardware is the answer - but I want us to be very clear about what problems we are testing for. In the bit I've looked into this, I've found problems in lots of different places. My 30-second analysis is that the VMs, due to higher variance in CPU scheduling, tend to tickle sensitive test failures more often - but at least some of those *also* fail on physical boxes. We are setting up a few minis to run Linux side-by-side with the VMs to test for this.

6) There are a bunch of other bugs open on getting us better diagnostics - which is critical, because we will always run into tests or code that is flaky even if the systems are stable.
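As a companion to point 2 above, here is a minimal, hypothetical sketch of the "poll until the condition holds, with a generous deadline" pattern that makes a check less time-sensitive than "sleep a fixed amount, then assert". The real tests in question are JS mochitests, so this Python version only illustrates the shape of the fix; `window_is_open` is a made-up placeholder, not anything from the harness.

```python
import time

def wait_for(condition, timeout=30.0, poll=0.1):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition became true in time, False otherwise.
    A slow or heavily loaded machine just takes longer; the check no longer
    fails simply because a hard-coded delay turned out to be too short.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(poll)
    return False

# Instead of:  time.sleep(2); assert window_is_open(), "Pref window opened"
# prefer:      assert wait_for(window_is_open), "Pref window opened"
```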
See bug 438871 as well
Closing this, since qm-centos5-01 seems to be doing no worse than any of the other Linux VMs lately with regard to intermittent test failures, and since there are many bugs tracking the overall process of improving unittest runs, including running them on physical boxes.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
No longer blocks: 433384, 438871
Resolution: FIXED → INCOMPLETE
Product: mozilla.org → Release Engineering