Closed Bug 501034 Opened 11 years ago Closed 11 years ago

(Packaged) Unit tests boxes should be able to get a stack trace from browser hangs/timeouts

Categories

(Testing :: Mochitest, enhancement)

enhancement
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.9.3a1

People

(Reporter: sgautherie, Assigned: ted)

References

Details

Attachments

(3 files, 4 obsolete files)

Currently, the (running) test suite is abruptly aborted by buildbot, with a timeout message.

A stack would help figuring out what is happening.

***

Ftr,
maybe the test suites themselves should have something like this too, when their own timeout is triggered.
But that would be a different/(future) bug, as that case is rarer and less painful. (though interesting to solve too.)
Bug 483968 comment 8:
{
From  Ted Mielczarek (:luser)   2009-06-28 12:17:22 PDT

(In reply to comment #7)
> Could buildbot do the same for test suites (at least)?

Wouldn't do us any good, because it doesn't have the PID of the actual browser
process, just the PID of the test harness.
You can file a separate bug on implementing something similar in the test harnesses.
}

*****

Ftr, reftests and mochitests output
"INFO | automation.py | Application pid: Nnnnn"
which might be usable to get the stack!?
Component: Release Engineering → Mochitest
Product: mozilla.org → Testing
QA Contact: release → mochitest
Version: other → Trunk
If we fix this, it's going to be in the test harnesses themselves, not on the buildbot side.
Blocks: 505718, 471085
Duplicate of this bug: 520838
Just FWIW, fixing this on Linux should be fairly easy. We can just "kill -SEGV" the browser process and that will trigger Breakpad properly. On Windows and OS X, it's not so easy. I think on Windows we can use CreateRemoteThread and load a DLL that intentionally crashes us. I'm not so sure about OS X.
Here's a WIP, it works on Linux only. This isn't suitable for checkin because the changes will break Windows, since select() doesn't work on file handles there. I can probably rework it a little so that it just doesn't try to catch hangs on Windows at all and we could check this part in. This code should be fine for detecting hangs on OS X, it just won't be able to trigger Breakpad. I don't know of any way to make that happen yet. Making this code detect hangs on Windows will be a PITA, since we'd need an equivalent to select() that works on a pipe handle.
Assignee: nobody → ted.mielczarek
Ok, I hacked something in for Windows, so we can detect hangs on all platforms, but we'll still only get a Breakpad stack on Linux. We can get this landed, and figure out Windows/OS X later.
Attachment #405676 - Attachment is obsolete: true
Some necessary fixup for Windows. I'm not sure if the timeout value is really right. On my last try server run, we hit a timeout in crashtest. I think the right behavior there is probably to set the automation.py timeout to the reftest timeout + N (I don't know what N should be), since reftests don't individually produce output, so we should leave enough time for the reftest harness to time out individual tests before we decide that it's hung. Probably want to do the same thing for Mochitest, now that I think about it, even though they tend to produce more output.
Attachment #405877 - Attachment is obsolete: true
Preliminary runs on the try server look mostly good code-wise.
Win32 is ok except we timed out in both mochitest and mochitest-chrome:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1255467299.1255474014.27607.gz
Mac looks fine, just some unrelated test failures:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1255467299.1255475562.11997.gz
Linux has an interesting failure here, it looks like the app hung on startup, and the harness noticed and tried to kill it, but failed, so buildbot killed it:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1255467299.1255475996.16839.gz#err12

I'm not sure exactly what happened there, I'll have to look into it. It definitely seems like the timeouts need some work, as per my previous comment.
This applies on top of the previous patch, and adds a crashinject.exe that can inject a DLL into another process to crash it. This seems to work well in my testing.

Just need to sort out the correct values for timeouts, and I think we'll be good to go.
Fixed up the timeouts, this will need fixing again when bug 521130 lands.
Attachment #406082 - Attachment is obsolete: true
Attached patch Win32 bits β€” β€” Splinter Review
Just a little fixup from changes in the previous patch. I pushed both of these to the try server a little bit ago, so I'm awaiting the results of that.
Attachment #406458 - Attachment is obsolete: true
Comment on attachment 406494 [details] [diff] [review]
Detect hangs in automation.py and trigger Breakpad (on Linux)

Ok, these were clean on the try server.
Attachment #406494 - Flags: review?(jwalden+bmo)
Attachment #406497 - Flags: review?(jwalden+bmo)
Waldo: I don't expect you to thoroughly review the Win32 C++ bits if you don't want to, it's just a tool, and it serves its purpose. If you have comments though feel free to review it.
Status: NEW → ASSIGNED
Attachment #406494 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 406494 [details] [diff] [review]
Detect hangs in automation.py and trigger Breakpad (on Linux)

>bug 501034 - Unit tests boxes should be able to get a stack trace from browser hangs/timeouts. add hang detection + breakpad triggering to automation.py. WIP, Linux only for now.
>* * *
>

hg qfo mangling, also want to capitalize bug I assume

>diff --git a/build/automation.py.in b/build/automation.py.in

>+if IS_WIN32:
>+  import ctypes, time, msvcrt
>+  PeekNamedPipe = ctypes.windll.kernel32.PeekNamedPipe
>+  GetLastError = ctypes.windll.kernel32.GetLastError
>+  def read_with_timeout(f, timeout):

readWithTimeout please, this underscore-naming is out of place in this file last I recall.


>+  def read_with_timeout(f, timeout):
>+    """Try to read a line of output from the file object |f|. If no output
>+    is received within |timeout| seconds, return a blank line.
>+    Returns a tuple (line, did_timeout), where |did_timeout| is True
>+    if the read timed out, and False otherwise."""
>+    (r,w,e) = select.select([f], [], [], timeout)

Spaces in that there tuple please!


>@@ -507,6 +548,7 @@
>   log.info("INFO | automation.py | Application pid: %d", proc.pid)
> 
>   stackFixerProcess = None
>+  did_timeout = False

timedOut please, nothing else in here last I recall uses underscores in variable names.


>diff --git a/layout/tools/reftest/runreftest.py b/layout/tools/reftest/runreftest.py
>--- a/layout/tools/reftest/runreftest.py
>+++ b/layout/tools/reftest/runreftest.py
>@@ -163,7 +163,10 @@
>                                ["-reftest", reftestlist],
>                                utilityPath = options.utilityPath,
>                                xrePath=options.xrePath,
>-                               symbolsPath=options.symbolsPath)
>+                               symbolsPath=options.symbolsPath,
>+                               # give the JS harness 30 seconds to deal
>+                               # with its own timeouts
>+                               timeout=options.timeout/1000.0 + 30.0)

Spaces around '/'.
Comment on attachment 406497 [details] [diff] [review]
Win32 bits

># HG changeset patch
># User Ted Mielczarek <ted.mielczarek@gmail.com>
># Date 1255626935 14400
># Node ID 968b620846026a5853383f4a9c1c2a5549903ecb
># Parent  30d000bf54b98c31f75672f36cde079bb18cf2d5
>bug 501034 - add a Win32 helper app to crash a process, and make automation.py use it when the app hangs
>* * *


More hg qfo mangling, plus "bug" capitalization?

Holding my nose and plussing at as great a distance from the keyboard and this bug as I can manage...
Attachment #406497 - Flags: review?(jwalden+bmo) → review+
We'll punt OS X to another bug. Pushed these two patches to m-c:
http://hg.mozilla.org/mozilla-central/rev/c641519baa90
http://hg.mozilla.org/mozilla-central/rev/3b932018ce6a
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Blocks: 525296
Target Milestone: --- → mozilla1.9.3a1
Bustage fix, apparently the Linux codepath supported None, but not the Win32 codepath, and I didn't test the PGO script after I wrote the Windows part:
http://hg.mozilla.org/mozilla-central/rev/091e9e8cba01
Blocks: 525370
I disabled the Win32 code path for right now, since it seems to be timing out like crazy, but not giving stacks:
http://hg.mozilla.org/mozilla-central/rev/0de15169a880

Filed bug 525370 on sorting that out.
No longer blocks: 525370
Depends on: 525370
Blocks: 525370
No longer depends on: 525370
Duplicate of this bug: 513010
Duplicate of this bug: 504361
Mochitests and reftests were fixed.

What is the situation is for compiled and xpcshell tests ?
(feature already there ? needs separate bug ?)
I don't think it's worth fixing compiled code tests. You could file a new bug for xpcshell tests, it wouldn't be very hard to move the code from automation.py to automationutils.py, and hook it up in runxpcshelltests.py.
Blocks: 525520
(In reply to comment #22)

> I don't think it's worth fixing compiled code tests.

Agreed.

> You could file a new bug for xpcshell tests

I filed bug 525520.
The win32 patch broke crosscompilation on Linux with mingw. Lower case should be used to include windows headers instead (in this case it's windows.h instead of Windows.h).
Attachment #409983 - Flags: review?(ted.mielczarek)
Attachment #409983 - Flags: review?(ted.mielczarek) → review+
Thanks for review, Ted. Could someone please add checkin-needed keyword? I don't have rights to do it myself.
You've been granted "editbugs" so you should be able to twiddle various bug flags now.
Thanks.

Setting the flag.
Keywords: checkin-needed
Comment on attachment 409983 [details] [diff] [review]
Fixed includes on case sensitive systems
[Checkin: See comment 28+32]


http://hg.mozilla.org/mozilla-central/rev/700b6d0b5b43
Attachment #409983 - Attachment description: Fixed includes on case sensitive systems. → Fixed includes on case sensitive systems [Checkin: Comment 28]
Keywords: checkin-needed
(In reply to comment #28)
> (From update of attachment 409983 [details] [diff] [review])
> 
> http://hg.mozilla.org/mozilla-central/rev/700b6d0b5b43

Doh: empty checkin! :-[

Ted, please, s/CR+LF/CR/g on (all) the files ... then check Jacek's patch in.
(In reply to comment #29)
> Ted, please, s/CR+LF/CR/g on (all) the files ... then check Jacek's patch in.

Ted: ping.
Keywords: checkin-needed
Whiteboard: [c-n: see comment 29]
Sorry, haven't had time. Feel free to do this yourself.
Comment on attachment 409983 [details] [diff] [review]
Fixed includes on case sensitive systems
[Checkin: See comment 28+32]


http://hg.mozilla.org/mozilla-central/rev/2a7fb310722e
s/CrLf/Lf/g, EOL whitespace removals

http://hg.mozilla.org/mozilla-central/rev/24ddb9b3d3ed
Fixed includes on case sensitive systems. (actually, this time)
Attachment #409983 - Attachment description: Fixed includes on case sensitive systems [Checkin: Comment 28] → Fixed includes on case sensitive systems [Checkin: See comment 28+32]
Keywords: checkin-needed
Whiteboard: [c-n: see comment 29]
Depends on: 530453
Blocks: 540369
Duplicate of this bug: 481207
You need to log in before you can comment on or make changes to this bug.