Closed Bug 477885 Opened 11 years ago Closed 11 years ago

investigate whether mozilla-central Linux unit test is on an overloaded VM infra

Categories

(Release Engineering :: General, defect, major)

Hardware: x86 Linux
Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dbaron, Unassigned)

References

Details

(Keywords: intermittent-failure)

For the past week or so, the Linux unit test box on mozilla-central has been orange very frequently (maybe 1 out of every 3 or 4 runs) due to hangs during mochitest.  The error is simply:

buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 29865

at some point during mochitest... it's different every time.


Could this box be running on a VM server that's too overloaded, or something like that?

In any case, this has been one of the major sources of random orange recently.  It's possible it could be a code problem... but if it is, we'll probably need some on-the-unit-test-box debugging assistance to figure out why it's hanging.
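(Editorial note: the error quoted above is buildbot's output watchdog, which kills the test process once it has produced no output for the configured timeout. A minimal Python sketch of that idea follows; it is not buildbot's actual implementation, and the command and 300-second value are just placeholders.)

import subprocess
import threading
import time

def run_with_output_watchdog(cmd, idle_timeout=300):
    """Run cmd and kill it if it produces no output for idle_timeout seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_output = time.time()

    def watchdog():
        # Poll until the process exits or goes silent for too long.
        while proc.poll() is None:
            if time.time() - last_output > idle_timeout:
                print("command timed out: %d seconds without output, "
                      "killing pid %d" % (idle_timeout, proc.pid))
                proc.kill()  # SIGKILL, hence "process killed by signal 9"
                return
            time.sleep(5)

    threading.Thread(target=watchdog, daemon=True).start()
    for line in proc.stdout:
        last_output = time.time()  # any output resets the idle clock
        print(line, end="")
    return proc.wait()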
Are you seeing similar issues on mozilla-1.9.1?
"Linux mozilla-central unit test" is a pool of about 15 slaves. This is true all of the other types of builds on 1.9.1/1.9.2/tracemonkey. IT manages the VM to VM host ratio - they would know if we're in an overloaded state.
As Chris pointed out in #developers, if we were having load issues, intermittent failures would be happening on all branches, not just mozilla-central.

Nevertheless, it'd be good to check the VM : host ratio. Phong or mrz, can either of you have a look and see whether we're putting too much load on our VM hosts?
(In reply to comment #1)
> Are you seeing similar issues on mozilla-1.9.1?

Yes.

Maybe I should also see if it's particular slaves.
The past 12 hours, the Linux unit test runs on 1.9.1 and m-c were:

1.9.1:
  moz2-linux-slave06 OK
  moz2-linux-slave07 OK (one mochitest timeout)
  moz2-linux-slave14 OK (one crashtest timeout)
  moz2-linux-slave14 HANG
  moz2-linux-slave06 OK (some test failures)
  moz2-linux-slave15 OK
  moz2-linux-slave06 OK
  moz2-linux-slave09 OK
  moz2-linux-slave15 OK
  
mozilla-central:
  moz2-linux-slave02 OK
  moz2-linux-slave08 HANG
  moz2-linux-slave15 OK
  moz2-linux-slave14 HANG
  moz2-linux-slave10 OK
  moz2-linux-slave08 HANG
  moz2-linux-slave15 HANG
  moz2-linux-slave16 OK
  moz2-linux-slave02 OK
  moz2-linux-slave05 OK
  moz2-linux-slave10 HANG
  moz2-linux-slave01 HANG
  moz2-linux-slave16 OK
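(Editorial note: to check whether the hangs cluster on particular slaves, a quick tally like the following would do. This is a minimal sketch with the mozilla-central results above transcribed by hand as (slave, result) pairs rather than pulled from buildbot.)

from collections import Counter

# (slave, result) pairs transcribed from the mozilla-central runs listed above.
runs = [
    ("moz2-linux-slave02", "OK"),   ("moz2-linux-slave08", "HANG"),
    ("moz2-linux-slave15", "OK"),   ("moz2-linux-slave14", "HANG"),
    ("moz2-linux-slave10", "OK"),   ("moz2-linux-slave08", "HANG"),
    ("moz2-linux-slave15", "HANG"), ("moz2-linux-slave16", "OK"),
    ("moz2-linux-slave02", "OK"),   ("moz2-linux-slave05", "OK"),
    ("moz2-linux-slave10", "HANG"), ("moz2-linux-slave01", "HANG"),
    ("moz2-linux-slave16", "OK"),
]

hangs = Counter(slave for slave, result in runs if result == "HANG")
totals = Counter(slave for slave, result in runs)

for slave in sorted(totals):
    print("%s: %d/%d hangs" % (slave, hangs[slave], totals[slave]))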
The rates do seem different, though.  1.9.1 has hit a hang in 4 out of 34 builds in the past 48 hours.


That said, I'm not even 100% sure this is a hang rather than a crash.  The error messages are:

command timed out: 300 seconds without output, killing pid 29865
process killed by signal 9
program finished with exit code -1
elapsedTime=719.994465
TinderboxPrint: mochitest<br/><em class="testfail">FAIL</em>
NEXT ERROR buildbot.slave.commands.TimeoutError: command timed out: 300 seconds without output, killing pid 29865
TinderboxPrint: mochitest <em class="testfail">timeout</em><br/>

I think that if it were a crash, though, there would be a message from the mochitest harness giving the exit code; I've seen that in other cases, but I didn't see it in these cases.
(In reply to comment #5)
> command timed out: 300 seconds without output, killing pid 29865
> process killed by signal 9
> program finished with exit code -1

Right, this is a hang. The first line is from buildbot, then the third line is runtests.py telling you that the program exited with an error (because it was killed). If it was a crash, you would just see something like the third line. (And on Linux, you would see some kind of output indicating a segfault or however it crashed.)
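(Editorial note: the hang-vs-crash distinction is visible in the child's exit status. When a Python harness like runtests.py waits on the process, a negative return code means the process died from a signal: -9 for the SIGKILL that buildbot sends on timeout, -11 for a segfault. A minimal sketch, with a hypothetical command standing in for the real mochitest invocation:)

import signal
import subprocess

# Hypothetical command standing in for the mochitest invocation.
proc = subprocess.run(["./run-some-tests.sh"])

if proc.returncode < 0:
    sig = -proc.returncode
    if sig == signal.SIGKILL:
        print("killed by SIGKILL - likely the buildbot output timeout (a hang)")
    elif sig == signal.SIGSEGV:
        print("died with SIGSEGV - a crash")
    else:
        print("killed by signal %d" % sig)
else:
    print("exited normally with code %d" % proc.returncode)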
Whiteboard: [random-orange]
Duplicate of this bug: 477771
Duplicate of this bug: 477083
IT, please see comment #2. We've also seen balsa-18branch go nuts quite a lot recently (similar to bug 461685), and a lot of buildbot slave disconnects (like bug 467634, worked around in bug 476677).
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Summary: investigate whether mozilla-central Linux unit test box is on an overloaded VM → investigate whether mozilla-central Linux unit test is on an overloaded VM infra
Not a blocker, but a serious PITA for developers.
Severity: blocker → major
I've had to reset balsa-18branch three times today, and at least a couple more over the previous two days. Would appreciate someone looking into the state of ESX and SAN loading.
(In reply to comment #11)
> I've had to reset balsa-18branch three times today, and at least a couple more
> over the previous two days. Would appreciate someone looking into the state of
> ESX and SAN loading.

The Intel DRS pool has plenty of capacity after the work Phong did in December (we added extra ESX hosts). The datastore is on the EqualLogic cluster, which does its own load balancing, so I wouldn't suspect the problem is there (and if it were, it'd be more widespread).

You said you "reset" balsa - does that mean you rebooted the OS?

I do show in the performance graph that from 9:35 to 9:50am CPU climbed to 100%.
(In reply to comment #12)
> You said you "reset" balsa - does that mean you rebooted the OS?
> 
> I do show in the performance graph that from 9:35 to 9:50am CPU climbed to
> 100%.

Yes, balsa requires hard reboots, as the typical failure mode is CPU usage going to 100%. We suspect this is due to not getting enough CPU or I/O to complete disk operations in a timely way, but once it's at 100% CPU it's not possible to interact with it. Attempts to add more logging have also been unsuccessful. 

It's not that balsa is a critical machine, but I've come to the conclusion that it's a particularly sensitive test for latency in the ESX setup. It got steadily less reliable until the two ESX hosts were added in December (bug 467634), then was solid during January, and increasingly flaky during February. There's other evidence too - some timeouts on win32 VMs (taking more than 5400 seconds to relink Firefox), and potentially the Linux unit tests (still hanging in about 1 of every 4 builds, less frequently on win32, not happening on Mac).

Between the end of December and Feb 04 I count another 10 VMs being added, probably more like 15 by now. What proportion of bm-vmware12 and 13's capacity does that account for? Also, there seem to be gaps in the longer-term CPU usage data in VI, e.g.
  http://people.mozilla.org/~nthomas/bogus.png
which makes it hard to see trends.
I just noticed that fx-win32-1.9-slave2 took 4h50min to build 3.0.7build2. This is almost twice as long as it takes to build the nightly (which took 2h50min). This machine is on bm-vmware01 currently, and uses eq01-bm01 for storage.

Is that VM host or storage array overloaded?
(In reply to comment #14)
Some time difference is expected, as the nightly uses make -j5 and the release -j1.
Assignee: server-ops → phong
Whiteboard: [random-orange] → [orange]
I've turned off balsa-18branch - can't be bothered resetting it 3 times a day.
We've increased the capacity of the ESX cluster. Moving this to release.
Assignee: phong → administration
QA Contact: mrz → release
Assignee: administration → nobody
Component: Server Operations → Release Engineering
From my discussion with ReleaseEng, this problem is no longer occurring. Going to mark this Fixed; please reopen if there are issues to be addressed.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Perhaps this issue should be removed from the Tinderbox page:

> There have been lots of random hangs during mochitest on the Linux unit test box lately.
Comment 19 - done.
Whiteboard: [orange]
Product: mozilla.org → Release Engineering