Closed
Bug 905628
Opened 12 years ago
Closed 12 years ago
mozharness should dump stacks when a mochitest test hangs
Categories
(Release Engineering :: Applications: MozharnessCore, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jimm, Assigned: jimm)
References
Details
Attachments
(1 file, 2 obsolete files)
6.52 KB,
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
Not sure if this is possible or not, but we already dump stacks in some cases, so I'm wondering if it's possible here. I bug 891892 we are seeing occasional browser hangs:
timed out after 1000 seconds of no output
Return code: 572
command timed out: 1200 seconds without output, attempting to kill
This showed up on a big mc merge from last weekend with a bunch of stuff in it, including gfx related work. We can't reproduce locally. Filing this bug to see if we can figure out some way to trigger stack dumps when these tests hang.
![]() |
Assignee | |
Comment 1•12 years ago
|
||
Ted do you know if there's a way to do this for hung browsers through mozharness? Or does this need to happen from within the app?
Flags: needinfo?(ted)
Comment 2•12 years ago
|
||
If this is running in the Mochitest harness then we're supposed to be handling that internally:
http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#744
If this isn't working it's possible we have the harness timeouts configured badly, so that mozharness times out before the harness gets a chance to handle it. I don't see anything in the commandline that would lead me to believe that's the case, though. If you run these locally and put an intentional hang in the code somewhere, what happens? (I don't have a Win8 install to test on.)
Flags: needinfo?(ted)
![]() |
Assignee | |
Comment 3•12 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2)
> If this is running in the Mochitest harness then we're supposed to be
> handling that internally:
> http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#744
>
> If this isn't working it's possible we have the harness timeouts configured
> badly, so that mozharness times out before the harness gets a chance to
> handle it. I don't see anything in the commandline that would lead me to
> believe that's the case, though. If you run these locally and put an
> intentional hang in the code somewhere, what happens? (I don't have a Win8
> install to test on.)
Ok, so this looks like it's coming from mozharness vs the test harness -
https://tbpl.mozilla.org/php/getParsedLog.php?id=26610876&tree=Profiling
16:16:46 ERROR - timed out after 1000 seconds of no output
^ That looks like a mozharness message. So maybe mozharness is killing the app w/ the test harness still running.
![]() |
Assignee | |
Comment 4•12 years ago
|
||
If I'm reading that right - test harness would dump something like
TEST-UNEXPECTED-FAIL | %s | application timed out after %d seconds with no output
![]() |
Assignee | |
Comment 5•12 years ago
|
||
hmm, or maybe we failed to get the stack -
Traceback (most recent call last):
Failure: exceptions.RuntimeError
![]() |
Assignee | |
Comment 6•12 years ago
|
||
I think we have two problems here, one the mozharness timeout is too short -
http://mxr.mozilla.org/build/search?find=%2Fmozharness%2F&string=timeout
and also we use a harness to launch to browser, so even if we did try to get a stack, it'd probably try to get a stack of the harness. Will do some local testing to see what happens in runtests.
Comment 7•12 years ago
|
||
The default timeout ought to be only 330 seconds:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#590
And the log says it's hitting 1000 seconds without output, so that doesn't seem right. Maybe the timeout logic isn't working properly on Metro builds?
![]() |
Assignee | |
Comment 8•12 years ago
|
||
Assignee: nobody → jmathies
Attachment #791611 -
Flags: review?(ted)
![]() |
Assignee | |
Comment 9•12 years ago
|
||
added some additional commenting in automation.py.
Attachment #791611 -
Attachment is obsolete: true
Attachment #791611 -
Flags: review?(ted)
Attachment #791613 -
Flags: review?(ted)
![]() |
Assignee | |
Comment 10•12 years ago
|
||
Full windows test run -
https://tbpl.mozilla.org/?tree=Try&showall=0&rev=202e3a6fe2c1
I was concerned that line buffering change might break something, but it doesn't appear to.
Comment 11•12 years ago
|
||
It's still a behavior change, which weirds me out a bit. The rest of automation.py is expecting readWithTimeout to return a line at a time, but your patch changes that. Why is that a prerequisite for this patch?
![]() |
Assignee | |
Comment 12•12 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #11)
> It's still a behavior change, which weirds me out a bit. The rest of
> automation.py is expecting readWithTimeout to return a line at a time, but
> your patch changes that. Why is that a prerequisite for this patch?
The output from the browser has to travel through a pipe via metrotestharness to get here. That breaks the line buffered nature of console output. I can probably fix this in metrotestharness as well by parsing output and flushing.
Generally though I don't think we should assume line buffering here on Windows, but maybe that can be addressed if and when we run into a problem with it.
Comment 13•12 years ago
|
||
That's a good point, since I know for a fact that Windows doesn't have line buffering for piped output. (We've been bitten by this in the past when piping stdout and stderr to the same stream.) I think the proper fix would be to make readWithTimeout do line-buffering internally.
![]() |
Assignee | |
Comment 14•12 years ago
|
||
Here's an update with better readWithTimeout for Windows.
Attachment #791613 -
Attachment is obsolete: true
Attachment #791613 -
Flags: review?(ted)
Attachment #792236 -
Flags: review?(ted)
![]() |
Assignee | |
Updated•12 years ago
|
Attachment #792236 -
Attachment is patch: true
Comment 15•12 years ago
|
||
Comment on attachment 792236 [details] [diff] [review]
patch v.2
Review of attachment 792236 [details] [diff] [review]:
-----------------------------------------------------------------
This whole thing is a little gross. I get the feeling this is going to make refactoring this code a pain in the ass in the future.
Could we do the things that metrotestharness.exe is doing in Python using ctypes, or does it really need to be an EXE?
::: build/automation.py.in
@@ +547,3 @@
> |f| must be a pipe, like the |stdout| member of a subprocess.Popen
> + object created with stdout=PIPE. Returns a tuple (line, did_timeout),
> + where |did_timeout| is True if the read timed out, and False otherwise.
I think you should keep the verbiage about "|timeout| seconds".
@@ +566,5 @@
> self.log.error("readWithTimeout got error: %d", err)
> + # read a character at a time, checking for eol. Return once we get there.
> + index = 0
> + while index < l.value:
> + char = f.read(1)
Do we need to handle EOF here, or are we covered by the checks on PeekNamedPipe above/the length we get from that?
@@ +760,5 @@
> hitMaxTime = True
> self.log.info("TEST-UNEXPECTED-FAIL | %s | application ran for longer than allowed maximum time of %d seconds", self.lastTestSeen, int(maxTime))
> self.killAndGetStack(proc.pid, utilityPath, debuggerInfo)
> if didTimeout:
> + if len(line) > 0:
You can just say "if line:"
@@ +769,5 @@
> + self.killAndGetStack(proc.pid, utilityPath, debuggerInfo)
> + else:
> + # kill the browser process we were handed back via the metro harness. killAndGetStack
> + # will try to ping the crash reporter in the metro browser.
> + self.killAndGetStack(browserProcessId, utilityPath, debuggerInfo)
I'd probably just either write:
if browserProcessId == -1:
browserProcessId = proc.pid
or:
self.killAndGetStack(proc.pid if browserProcessId == -1 else browserProcessId , ...)
Attachment #792236 -
Flags: review?(ted) → review+
![]() |
Assignee | |
Comment 16•12 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15)
> Comment on attachment 792236 [details] [diff] [review]
> patch v.2
>
> Review of attachment 792236 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> This whole thing is a little gross. I get the feeling this is going to make
> refactoring this code a pain in the ass in the future.
>
> Could we do the things that metrotestharness.exe is doing in Python using
> ctypes, or does it really need to be an EXE?
Not sure -
http://mxr.mozilla.org/mozilla-central/source/browser/metro/shell/testing/metrotestharness.cpp
can python do basic com, file io, pipes, and windows message handling via ctypes? If so then probably. (Although nobody on the metro team has time for something like that currently.)
> @@ +566,5 @@
> > self.log.error("readWithTimeout got error: %d", err)
> > + # read a character at a time, checking for eol. Return once we get there.
> > + index = 0
> > + while index < l.value:
> > + char = f.read(1)
>
> Do we need to handle EOF here, or are we covered by the checks on
> PeekNamedPipe above/the length we get from that?
This is handled a little ways up using the return result from PeekNamedPipe.
![]() |
Assignee | |
Comment 17•12 years ago
|
||
Comment 18•12 years ago
|
||
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
![]() |
Assignee | |
Comment 19•12 years ago
|
||
(In reply to Jim Mathies [:jimm] from comment #16)
> (In reply to Ted Mielczarek [:ted.mielczarek] from comment #15)
> > Comment on attachment 792236 [details] [diff] [review]
> > patch v.2
> >
> > Review of attachment 792236 [details] [diff] [review]:
> > -----------------------------------------------------------------
> >
> > This whole thing is a little gross. I get the feeling this is going to make
> > refactoring this code a pain in the ass in the future.
> >
> > Could we do the things that metrotestharness.exe is doing in Python using
> > ctypes, or does it really need to be an EXE?
>
> Not sure -
>
> http://mxr.mozilla.org/mozilla-central/source/browser/metro/shell/testing/
> metrotestharness.cpp
filed bug 909398 on this.
Updated•11 years ago
|
Component: Release Automation → Mozharness
You need to log in
before you can comment on or make changes to this bug.
Description
•