Figure out why kill doesn't work consistently on MoMo Windows builders

RESOLVED FIXED

Status

Mozilla Messaging
Release Engineering
7 years ago
7 years ago

People

(Reporter: standard8, Unassigned)

Details

(Reporter)

Description

7 years ago
On the Windows builders, when a test times out, the result is shown as busted rather than as a timeout. For example:

http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1291121146.1291123031.14958.gz#err0

command timed out: 1200 seconds without output
SIGKILL failed to kill process
using fake rc=-1
program finished with exit code -1

remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last):
Failure: buildslave.commands.base.TimeoutError: SIGKILL failed to kill process
]

I think there may be one builder where SIGKILL seems to work (though I can't remember which one).

MoCo machines seem to work fine wrt SIGKILL (or whatever buildbot is doing).

I suspect this may also be why, when we interrupt Windows builds, they don't always fail or stop straight away.
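
The log lines above are consistent with a "kill, then wait a bounded time for the process to exit" pattern that reports a placeholder exit code when the process does not go away. The following is only a minimal sketch of that pattern using the Python standard library, not buildbot's actual code: the function names, the 5-second default and the returned tuple are all assumptions made for illustration.

```python
import subprocess
import sys


def start_sleeping_child(seconds=60):
    """Hypothetical helper: start a child process that just sleeps."""
    code = "import time; time.sleep(%d)" % seconds
    return subprocess.Popen([sys.executable, "-c", code])


def kill_with_backup_timeout(proc, backup_timeout=5.0):
    """Kill `proc` and wait up to `backup_timeout` seconds for it to exit.

    Returns (exit_code, died). If the process is still alive after the
    wait, a fake exit code of -1 is returned and `died` is False, which
    mirrors the "using fake rc=-1" line in the log.
    """
    proc.kill()  # SIGKILL on POSIX, TerminateProcess on Windows
    try:
        rc = proc.wait(timeout=backup_timeout)
        return rc, True
    except subprocess.TimeoutExpired:
        # The kill did not take effect in time; give up and report a fake rc.
        return -1, False
```

Note that `proc.kill()` only targets the process that was started directly. If that process has spawned children of its own (for example a shell running make), they are not covered by this call, which is one way a build can keep running after an interrupt.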
See https://wiki.mozilla.org/Build:TryServer:Maintenance (relevant section pasted below):

=========================
Build exceptions on Win32
Symptoms

    * Purple boxes on Waterfall display that say "exception"
    * "SIGKILL failed to kill process" errors 

Cause

    * Hanging cygwin processes 

Solution

   1. Logon to try1-win32-slave
   2. Stop the buildslave
   3. Do a 'ps' and kill all processes (ps only shows cygwin processes).
   4. Double check in Task Manager that there are no instances of 'make', 'sh', 'mkdepend', or other cygwin processes.
   5. Restart the slave 
=========================
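
Steps 3 and 4 above could be partly scripted. The sketch below is an illustration only: it assumes the process list comes from the Windows `tasklist /FO CSV /NH` command, and the function names and the set of process names (taken from step 4) are hypothetical. It only parses text and builds a `taskkill` command line; it does not kill anything itself.

```python
import csv
import io

# Process names from step 4 of the wiki excerpt; extend as needed.
LEFTOVER_NAMES = {"make", "sh", "mkdepend"}


def find_leftovers(tasklist_csv, names=LEFTOVER_NAMES):
    """Return (image_name, pid) pairs for processes whose name is in `names`.

    `tasklist_csv` is the text printed by `tasklist /FO CSV /NH`, where the
    first two columns are the image name and the PID.
    """
    found = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        image, pid = row[0], row[1]
        base = image.lower()
        if base.endswith(".exe"):
            base = base[:-4]
        if base in names and pid.isdigit():
            found.append((image, int(pid)))
    return found


def taskkill_command(pid):
    """Build a taskkill command that force-kills `pid` and its child processes."""
    return ["taskkill", "/F", "/T", "/PID", str(pid)]
```

On a builder, the input text could be obtained with `subprocess.check_output(["tasklist", "/FO", "CSV", "/NH"], text=True)` and each command returned by `taskkill_command` passed to `subprocess.call`, after stopping the buildslave as in step 2.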

Given that we reboot after every build, I don't know that there's much we can do about this problem.
(Reporter)

Comment 2

7 years ago
I don't think it's that. Our boxes have been like this pretty much since day one.

iirc -10 or -13 may behave properly, but I wouldn't like to say what's different.
(Reporter)

Comment 3

7 years ago
Looking at older bugs, bug 420216 is about the only one that could be similar to this issue.

I'm also cc'ing some of the Firefox and other folks who may be able to:

a) Verify that I'm not mad and confirm that when a test (e.g. xpcshell) times out on Windows the Firefox builders do actually kill the test gracefully and show a timeout.

b) Suggest fixes for it, if bug 420216 doesn't contain the pointer to the fix.
(Reporter)

Comment 4

7 years ago
So today I was trying to kill some processes, and most of the problem builders (momo-vm-win2k3-09, -10, -11, -12, -13, -15) wouldn't kill the build when asked from buildbot: the stdout said "command interrupted" but nothing happened.

momo-vm-win2k3-08 was the only one that worked, but even then it still gave this:

command interrupted
SIGKILL failed to kill process
using fake rc=-1
program finished with exit code -1

remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last):
Failure: buildslave.commands.base.TimeoutError: SIGKILL failed to kill process
]

and resulted in a build busted result.

The Linux & Mac builders all resulted in a test fail result (as the step being aborted was a test step).

For more info, look at builds 1307 - 1311 of the "WINNT 5.2 comm-central leak test build" builder; 1308 was the build done by momo-vm-win2k3-08.
catlee on #build says: I think we installed twisted 10.1 and pulled in the dumbwin32proc from one of our earlier patched versions.
Haven't seen this issue since reimaging d:\mozilla-build on the win2k3 builders.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED