Closed Bug 477885 Opened 11 years ago Closed 11 years ago

investigate whether mozilla-central Linux unit test is on an overloaded VM infra

Categories

(Release Engineering :: General, defect, major)

Hardware: x86 Linux
Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dbaron, Unassigned)

References

Details

(Keywords: intermittent-failure)

For the past week or so, the Linux unit test box on mozilla-central has been orange very frequently (maybe 1 out of every 3 or 4 runs) due to hangs during mochitest.  The error is simply:

buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 29865

at some point during mochitest... it's different every time.


Could this box be running on a VM server that's too overloaded, or something like that?

In any case, this has been one of the major sources of random orange recently.  It's possible it could be a code problem... but if it is, we'll probably need some on-the-unit-test-box debugging assistance to figure out why it's hanging.
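(Editorial note: the error quoted above is buildbot's output watchdog, which kills the test process once it has produced no output for the configured timeout. A minimal Python sketch of that idea follows; it is not buildbot's actual implementation, and the command and 300-second value are just placeholders.)

import subprocess
import threading
import time

def run_with_output_watchdog(cmd, idle_timeout=300):
    """Run cmd and kill it if it produces no output for idle_timeout seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_output = time.time()

    def watchdog():
        # Poll until the process exits or goes silent for too long.
        while proc.poll() is None:
            if time.time() - last_output > idle_timeout:
                print("command timed out: %d seconds without output, "
                      "killing pid %d" % (idle_timeout, proc.pid))
                proc.kill()  # SIGKILL, hence "process killed by signal 9"
                return
            time.sleep(5)

    threading.Thread(target=watchdog, daemon=True).start()
    for line in proc.stdout:
        last_output = time.time()  # any output resets the idle clock
        print(line, end="")
    return proc.wait()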
Are you seeing similar issues on mozilla-1.9.1?
"Linux mozilla-central unit test" is a pool of about 15 slaves. This is true all of the other types of builds on 1.9.1/1.9.2/tracemonkey. IT manages the VM to VM host ratio - they would know if we're in an overloaded state.
As Chris pointed out in #developers, if we were having load issues, intermittent failures would be happening on all branches, not just mozilla-central.

Nevertheless, it'd be good to check the VM : host ratio. Phong or mrz, can either of you have a look and see whether we're putting too much load on our VM hosts?
(In reply to comment #1)
> Are you seeing similar issues on mozilla-1.9.1?

Yes.

Maybe I should also see if it's particular slaves.
The past 12 hours, the Linux unit test runs on 1.9.1 and m-c were:

1.9.1:
  moz2-linux-slave06 OK
  moz2-linux-slave07 OK (one mochitest timeout)
  moz2-linux-slave14 OK (one crashtest timeout)
  moz2-linux-slave14 HANG
  moz2-linux-slave06 OK (some test failures)
  moz2-linux-slave15 OK
  moz2-linux-slave06 OK
  moz2-linux-slave09 OK
  moz2-linux-slave15 OK
  
mozilla-central:
  moz2-linux-slave02 OK
  moz2-linux-slave08 HANG
  moz2-linux-slave15 OK
  moz2-linux-slave14 HANG
  moz2-linux-slave10 OK
  moz2-linux-slave08 HANG
  moz2-linux-slave15 HANG
  moz2-linux-slave16 OK
  moz2-linux-slave02 OK
  moz2-linux-slave05 OK
  moz2-linux-slave10 HANG
  moz2-linux-slave01 HANG
  moz2-linux-slave16 OK
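(Editorial note: to check whether the hangs cluster on particular slaves, a quick tally like the following would do. This is a minimal sketch with the mozilla-central results above transcribed by hand as (slave, result) pairs rather than pulled from buildbot.)

from collections import Counter

# (slave, result) pairs transcribed from the mozilla-central runs listed above.
runs = [
    ("moz2-linux-slave02", "OK"),   ("moz2-linux-slave08", "HANG"),
    ("moz2-linux-slave15", "OK"),   ("moz2-linux-slave14", "HANG"),
    ("moz2-linux-slave10", "OK"),   ("moz2-linux-slave08", "HANG"),
    ("moz2-linux-slave15", "HANG"), ("moz2-linux-slave16", "OK"),
    ("moz2-linux-slave02", "OK"),   ("moz2-linux-slave05", "OK"),
    ("moz2-linux-slave10", "HANG"), ("moz2-linux-slave01", "HANG"),
    ("moz2-linux-slave16", "OK"),
]

hangs = Counter(slave for slave, result in runs if result == "HANG")
totals = Counter(slave for slave, result in runs)

for slave in sorted(totals):
    print("%s: %d/%d hangs" % (slave, hangs[slave], totals[slave]))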
The rates do seem different, though.  1.9.1 has hit a hang in 4 out of 34 builds in the past 48 hours.


That said, I'm not even 100% sure this is a hang rather than a crash.  The error messages are:

command timed out: 300 seconds without output, killing pid 29865
process killed by signal 9
program finished with exit code -1
elapsedTime=719.994465
TinderboxPrint: mochitest<br/><em class="testfail">FAIL</em>
NEXT ERROR buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 29865
TinderboxPrint: mochitest <em class="testfail">timeout</em><br/>

I think that if it were a crash, though, there would be a message from the mochitest harness giving the exit code; I've seen that in other cases, but I didn't see it in these cases.
(In reply to comment #5)
> command timed out: 300 seconds without output, killing pid 29865
> process killed by signal 9
> program finished with exit code -1

Right, this is a hang. The first line is from buildbot, then the third line is runtests.py telling you that the program exited with an error (because it was killed). If it was a crash, you would just see something like the third line. (And on Linux, you would see some kind of output indicating a segfault or however it crashed.)
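(Editorial note: the hang-vs-crash distinction is visible in the child's exit status. When a Python harness like runtests.py waits on the process, a negative return code means the process died from a signal: -9 for the SIGKILL that buildbot sends on timeout, -11 for a segfault. A minimal sketch, with a hypothetical command standing in for the real mochitest invocation:)

import signal
import subprocess

# Hypothetical command standing in for the mochitest invocation.
proc = subprocess.run(["./run-some-tests.sh"])

if proc.returncode < 0:
    sig = -proc.returncode
    if sig == signal.SIGKILL:
        print("killed by SIGKILL - likely the buildbot output timeout (a hang)")
    elif sig == signal.SIGSEGV:
        print("died with SIGSEGV - a crash")
    else:
        print("killed by signal %d" % sig)
else:
    print("exited normally with code %d" % proc.returncode)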
Whiteboard: [random-orange]
Duplicate of this bug: 477771
Duplicate of this bug: 477083
IT, please see comment #2. We've also seen balsa-18branch go nuts quite a lot recently (similar to bug 461685), and a lot of buildbot slave disconnects (like bug 467634, worked around in bug 476677).
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Summary: investigate whether mozilla-central Linux unit test box is on an overloaded VM → investigate whether mozilla-central Linux unit test is on an overloaded VM infra
Not a blocker, but a serious PITA for developers.
Severity: blocker → major
I've had to reset balsa-18branch three times today, and at least a couple more over the previous two days. Would appreciate someone looking into the state of ESX and SAN loading.
(In reply to comment #11)
> I've had to reset balsa-18branch three times today, and at least a couple more
> over the previous two days. Would appreciate someone looking into the state of
> ESX and SAN loading.

The Intel DRS pool has plenty of capacity after the work Phong did in December (we added extra ESX hosts). The datastore is on the EqualLogic cluster, which does its own load balancing, so I wouldn't suspect the problem is there (and if it were, it'd be more widespread).

You said you "reset" balsa - does that mean you rebooted the OS?

I do show in the performance graph that from 9:35 to 9:50am CPU climbed to 100%.
(In reply to comment #12)
> You said you "reset" balsa - does that mean you rebooted the OS?
> 
> I do show in the performance graph that from 9:35 to 9:50am CPU climbed to
> 100%.

Yes, balsa requires hard reboots, as the typical failure mode is CPU usage going to 100%. We suspect this is due to not getting enough CPU or I/O to complete disk operations in a timely way, but once it's at 100% CPU it's not possible to interact with it. Attempts to add more logging have also been unsuccessful. 

It's not that balsa is a critical machine, but I've come to the conclusion that it's a particularly sensitive test for latency in the ESX setup. It got steadily less reliable until the two ESX hosts were added in December (bug 467634), then was solid during January, and increasingly flaky during February. There's other evidence too - some timeouts on win32 VMs (taking more than 5400 seconds to relink Firefox), and potentially the Linux unit tests (still hanging in about 1 of every 4 builds, less frequently on win32, not happening on Mac).

Between the end of December and Feb 04 I count another 10 VMs being added, probably more like 15 by now. What proportion of bm-vmware12 and 13's capacity does that account for? Also, there seem to be gaps in the longer-term CPU usage data in VI, e.g.
  http://people.mozilla.org/~nthomas/bogus.png
which makes it hard to see trends.
I just noticed that fx-win32-1.9-slave2 took 4h50min to build 3.0.7build2. This is almost twice as long as it takes to build the nightly (which took 2h50min). This machine is on bm-vmware01 currently, and uses eq01-bm01 for storage.

Is that VM host or storage array overloaded?
(In reply to comment #14)
Some time difference is expected, as the nightly uses make -j5 and the release -j1.
Assignee: server-ops → phong
Whiteboard: [random-orange] → [orange]
I've turned off balsa-18branch - can't be bothered resetting it 3 times a day.
We've increased the capacity of the ESX cluster. Moving this to release.
Assignee: phong → administration
QA Contact: mrz → release
Assignee: administration → nobody
Component: Server Operations → Release Engineering
From my discussion with ReleaseEng, this problem is no longer occurring. Going to mark this Fixed; please reopen if there are issues to be addressed.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Perhaps this issue should be removed from the Tinderbox page:

> There have been lots of random hangs during mochitest on the Linux unit test box lately.
Comment 19 - done.
Whiteboard: [orange]
Product: mozilla.org → Release Engineering