Closed Bug 539691 Opened 15 years ago Closed 14 years ago

automation.py should not try to invoke breakpad when a debugger is specified

Categories: Testing :: Mochitest, defect
Platform: x86 / All
Severity: normal
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: bc, Assigned: jruderman)
Attachments: (2 files, 1 obsolete file)

When running tests and a test times out, automation.py.in will try to invoke breakpad by sending a SIGSEGV to the process. However, when a debugger is specified, the signal goes to the debugger instead. In the case of valgrind 3.5.0, this results in valgrind working on a defunct process, and neither valgrind nor the test terminates.

make -C firefox-debug EXTRA_TEST_ARGS='--debugger=valgrind --debugger-args="--leak-check=full --trace-children=yes" --timeout=600' TEST_PATH=js/src/tests/js1_5/Function/jstests.list jstestbrowser
Version: 1.9.2 Branch → Trunk
Blocks: 521331
> When running tests and a test times out, automation.py.in will try to 
> invoke breakpad by sending a SIGSEGV to the process.

Yeah, the condition in triggerBreakpad is wrong.  Instead of just "self.CRASHREPORTER", it should also check for "not debuggerInfo", similar to the call to environment().
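The corrected guard described here might look roughly like this (a minimal sketch; the function and argument names are illustrative, not the actual automation.py.in code):

```python
import os
import signal


def trigger_breakpad(proc, crashreporter_enabled, debugger_info):
    """Hypothetical sketch of the fixed condition.

    Only ask Breakpad for a minidump when the crash reporter is active
    AND no debugger is wrapping the process; otherwise the SIGSEGV would
    be delivered to (or intercepted by) the debugger instead of Firefox.
    Returns True if the signal was sent.
    """
    if crashreporter_enabled and not debugger_info:
        os.kill(proc.pid, signal.SIGSEGV)
        return True
    return False
```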

> However when a debugger is
> specified, the signal goes to the debugger instead. In the case of valgrind
> 3.5.0, it will result in valgrind working on a defunct process and not
> terminate valgrind or the test.

Are you sure this is what's happening?  I thought Valgrind did all its emulation in-process.  The os.kill should go to Valgrind itself, and it shouldn't matter whether it's a SEGV or a KILL.
Attached patch patch to automation.py.in (obsolete) — Splinter Review
Assignee: nobody → jruderman
Status: NEW → ASSIGNED
Attachment #422344 - Flags: review?(ted.mielczarek)
(In reply to comment #1)

> Are you sure this is what's happening?

I think so.

I tried your patch on Mac with a relatively short timeout of 60 seconds and it killed the test run before it could really get started. I've started a different run with a longer timeout of 300 seconds but haven't hit the timeout condition on Mac yet.

I also tried your patch on Linux with a timeout of 300 seconds and still get the failure situation. I'll attach the ps aux output for the relevant processes.

I've been thinking about this whole situation and I think bug 501034 is wrong in its use of the --timeout value. The way I understand it, --timeout is used by the test frameworks to decide when to time out a particular test and fail it before moving on to the next test. The way bug 501034's fix works, it instead attempts to kill the entire test run when any single test times out. I think it should instead use a different timeout value to decide whether the test run has actually hung, as opposed to a single test timing out.
I might have been wrong about Linux in the previous comment. The test run recovered and is testing the remainder of the tests. I won't have access to the log until it completes though. I'll update the bug later tonight when it does complete.
(In reply to comment #3)
> I've been thinking about this whole situation and I think bug 501034 is wrong
> in its use of the --timeout value. The way I understand it, --timeout is used
> by the test frameworks to decide when to timeout a particular test and fail it
> before moving on to the next test. The way bug 501034's fix works, it instead
> attempts to kill the entire test run when any single test times out. I think it
> should instead use a different timeout value to decide if the test run has
> actually hung instead of a single test timing out.

I tried to make this smart, and make the "hang" timeout equal to the harnesses' internal timeouts + 30 seconds, to give each harness 30 seconds to deal with tests that time out, and only then kill them.

See:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py.in#462
http://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/runreftest.py#138
30s might not be enough when Valgrind is involved.
Well, your patch here will make that moot anyway. :)
Jesse: The patch appears to work fine for me by not attempting to kill the process when a debugger is specified. Thanks.

Ted: Yeah, this makes the issue about killing a timed-out process moot for cases where a debugger is involved. But I think instead of adding a fixed 30 seconds to the user-specified timeout, scaling the user timeout would be a little bit better. For example, if you used twice the user-specified timeout, the additional wait would not be that long relative to a normal test and would, I think, reduce the cases where the entire run was killed due to a particularly slow test+recovery.
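The two alternatives being discussed (a fixed grace period versus scaling the per-test timeout) can be sketched as follows. This is a hypothetical helper, not code from either patch:

```python
def hang_timeout(per_test_timeout_s, grace_s=30, scale=None):
    """Return the outer "hang" timeout for the whole test run.

    With scale=None this is the fixed-grace approach: the per-test
    timeout plus a grace period for the harness to clean up a timed-out
    test. With a scale factor, it is the alternative suggested above:
    multiply the user-specified timeout instead.
    """
    if scale is not None:
        return per_test_timeout_s * scale
    return per_test_timeout_s + grace_s
```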
It could, sure, but if the test harness isn't responding within 30 seconds, I don't have a lot of faith in it doing anything useful.
> Jesse: The patch appears to work fine for me by not attempting to kill the
> process when a debugger is specified. Thanks.

Eh?  It should still kill the process, just using SIGKILL rather than SIGSEGV.
Crap, you're right of course. I got confused by the timeout used in automation.py.in for detecting timeouts in reading lines from the output, rather than the test itself timing out. I'll try the mochitest-plain tests, which is where I originally saw the issue, but it will take quite a while to run. You might just want to go with Ted's r.
Comment on attachment 422344 [details] [diff] [review]
patch to automation.py.in

I don't think this is what we want. I think we should just ignore the "timeout" value (set it to None) when a debugger is attached. Presumably we don't want to send SIGKILL to the user's debugger if they're sitting there in gdb.
Attachment #422344 - Flags: review?(ted.mielczarek) → review-
For non-interactive debuggers, you really want the timeouts to continue working.  Especially Valgrind.

I understand your concern about interactive debuggers.  Perhaps we should add:

if debuggerInfo and debuggerInfo["interactive"]:
  timeout = None
  maxTime = None

We could even change automationutils.py so unrecognized debuggers are assumed to be interactive.  But those changes could just as easily be made in a separate bug and patch.
Attachment #422344 - Flags: review- → review?(ted.mielczarek)
Comment on attachment 422344 [details] [diff] [review]
patch to automation.py.in

This still doesn't fix the problem, in that you're still going to call proc.kill() if the process times out, and kill the user's debugger. I really think the right fix here is to set the timeout to None if a debugger is attached. If you want to special-case interactive debuggers, that's fine, but we should just not be running the timeout handling code *at all* in this case.
Attachment #422344 - Flags: review?(ted.mielczarek) → review-
Attached patch patch v2 — Splinter Review
Don't time out when using an interactive debugger. Kill non-interactive debuggers in a cleaner way.
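The behavior described for patch v2 might be sketched like this (illustrative names, not the actual patch code): interactive debuggers disable timeout handling entirely, while non-interactive wrappers such as valgrind are killed directly with SIGKILL instead of being poked with SIGSEGV for Breakpad:

```python
import signal


def on_timeout(proc, debugger_info):
    """Hypothetical sketch of the patch v2 timeout handler.

    If an interactive debugger (e.g. gdb) is attached, do nothing: the
    harness runs with no timeout, so the user's session is never killed.
    Otherwise, send SIGKILL to the wrapper process for a clean kill
    rather than a SIGSEGV aimed at Breakpad.
    """
    if debugger_info and debugger_info.get("interactive"):
        return
    proc.send_signal(signal.SIGKILL)
```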
Attachment #422344 - Attachment is obsolete: true
Attachment #428861 - Flags: review?(ted.mielczarek)
Comment on attachment 428861 [details] [diff] [review]
patch v2

Ok, this makes sense now. We should probably make more things into member variables at some point (like debuggerInfo) so we don't have to pass them around via class methods.
Attachment #428861 - Flags: review?(ted.mielczarek) → review+
http://hg.mozilla.org/mozilla-central/rev/659d517acd61
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED