Closed Bug 905628 Opened 7 years ago Closed 7 years ago

mozharness should dump stacks when a mochitest test hangs

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

x86_64
Windows 8
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jimm, Assigned: jimm)

References

Details

Attachments

(1 file, 2 obsolete files)

Not sure if this is possible or not, but we already dump stacks in some cases, so I'm wondering if it's possible here. I bug 891892 we are seeing occasional browser hangs: 

timed out after 1000 seconds of no output
Return code: 572
command timed out: 1200 seconds without output, attempting to kill

This showed up on a big mc merge from last weekend with a bunch of stuff in it, including gfx related work. We can't reproduce locally. Filing this bug to see if we can figure out some way to trigger stack dumps when these tests hang.
Ted do you know if there's a way to do this for hung browsers through mozharness? Or does this need to happen from within the app?
Flags: needinfo?(ted)
If this is running in the Mochitest harness then we're supposed to be handling that internally:
http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#744

If this isn't working it's possible we have the harness timeouts configured badly, so that mozharness times out before the harness gets a chance to handle it. I don't see anything in the commandline that would lead me to believe that's the case, though. If you run these locally and put an intentional hang in the code somewhere, what happens? (I don't have a Win8 install to test on.)
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2)
> If this is running in the Mochitest harness then we're supposed to be
> handling that internally:
> http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#744
>
> If this isn't working it's possible we have the harness timeouts configured
> badly, so that mozharness times out before the harness gets a chance to
> handle it. I don't see anything in the commandline that would lead me to
> believe that's the case, though. If you run these locally and put an
> intentional hang in the code somewhere, what happens? (I don't have a Win8
> install to test on.)

Ok, so this looks like it's coming from mozharness vs the test harness -

https://tbpl.mozilla.org/php/getParsedLog.php?id=26610876&tree=Profiling

16:16:46    ERROR - timed out after 1000 seconds of no output

^ That looks like a mozharness message. So maybe mozharness is killing the app w/ the test harness still running.
If I'm reading that right - test harness would dump something like

TEST-UNEXPECTED-FAIL | %s | application timed out after %d seconds with no output
hmm, or maybe we failed to get the stack - 

Traceback (most recent call last):
Failure: exceptions.RuntimeError
I think we have two problems here, one the mozharness timeout is too short - 

http://mxr.mozilla.org/build/search?find=%2Fmozharness%2F&string=timeout

and also we use a harness to launch to browser, so even if we did try to get a stack, it'd probably try to get a stack of the harness. Will do some local testing to see what happens in runtests.
The default timeout ought to be only 330 seconds:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#590

And the log says it's hitting 1000 seconds without output, so that doesn't seem right. Maybe the timeout logic isn't working properly on Metro builds?
Attached patch patch v.1 (obsolete) — Splinter Review
works - 
https://tbpl.mozilla.org/?tree=Try&showall=0&rev=449ae65479bf
Assignee: nobody → jmathies
Attachment #791611 - Flags: review?(ted)
Attached patch patch v.1 (obsolete) — Splinter Review
added some additional commenting in automation.py.
Attachment #791611 - Attachment is obsolete: true
Attachment #791611 - Flags: review?(ted)
Attachment #791613 - Flags: review?(ted)
Full windows test run - 

https://tbpl.mozilla.org/?tree=Try&showall=0&rev=202e3a6fe2c1

I was concerned that line buffering change might break something, but it doesn't appear to.
It's still a behavior change, which weirds me out a bit. The rest of automation.py is expecting readWithTimeout to return a line at a time, but your patch changes that. Why is that a prerequisite for this patch?
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #11)
> It's still a behavior change, which weirds me out a bit. The rest of
> automation.py is expecting readWithTimeout to return a line at a time, but
> your patch changes that. Why is that a prerequisite for this patch?

The output from the browser has to travel through a pipe via metrotestharness to get here. That breaks the line buffered nature of console output. I can probably fix this in metrotestharness as well by parsing output and flushing.

Generally though I don't think we should assume line buffering here on Windows, but maybe that can be addressed if and when we run into a problem with it.
That's a good point, since I know for a fact that Windows doesn't have line buffering for piped output. (We've been bitten by this in the past when piping stdout and stderr to the same stream.) I think the proper fix would be to make readWithTimeout do line-buffering internally.
Attached patch patch v.2Splinter Review
Here's an update with better readWithTimeout for Windows.
Attachment #791613 - Attachment is obsolete: true
Attachment #791613 - Flags: review?(ted)
Attachment #792236 - Flags: review?(ted)
Attachment #792236 - Attachment is patch: true
Comment on attachment 792236 [details] [diff] [review]
patch v.2

Review of attachment 792236 [details] [diff] [review]:
-----------------------------------------------------------------

This whole thing is a little gross. I get the feeling this is going to make refactoring this code a pain in the ass in the future.

Could we do the things that metrotestharness.exe is doing in Python using ctypes, or does it really need to be an EXE?

::: build/automation.py.in
@@ +547,3 @@
>        |f| must be a  pipe, like the |stdout| member of a subprocess.Popen
> +      object created with stdout=PIPE. Returns a tuple (line, did_timeout),
> +      where |did_timeout| is True if the read timed out, and False otherwise.

I think you should keep the verbiage about "|timeout| seconds".

@@ +566,5 @@
>              self.log.error("readWithTimeout got error: %d", err)
> +        # read a character at a time, checking for eol. Return once we get there.
> +        index = 0
> +        while index < l.value:
> +          char = f.read(1)

Do we need to handle EOF here, or are we covered by the checks on PeekNamedPipe above/the length we get from that?

@@ +760,5 @@
>            hitMaxTime = True
>            self.log.info("TEST-UNEXPECTED-FAIL | %s | application ran for longer than allowed maximum time of %d seconds", self.lastTestSeen, int(maxTime))
>            self.killAndGetStack(proc.pid, utilityPath, debuggerInfo)
>        if didTimeout:
> +        if len(line) > 0:

You can just say "if line:"

@@ +769,5 @@
> +          self.killAndGetStack(proc.pid, utilityPath, debuggerInfo)
> +        else:
> +          # kill the browser process we were handed back via the metro harness. killAndGetStack
> +          # will try to ping the crash reporter in the metro browser.
> +          self.killAndGetStack(browserProcessId, utilityPath, debuggerInfo)

I'd probably just either write:
if browserProcessId == -1:
  browserProcessId = proc.pid

or:
self.killAndGetStack(proc.pid if browserProcessId == -1 else browserProcessId , ...)
Attachment #792236 - Flags: review?(ted) → review+
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15)
> Comment on attachment 792236 [details] [diff] [review]
> patch v.2
> 
> Review of attachment 792236 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This whole thing is a little gross. I get the feeling this is going to make
> refactoring this code a pain in the ass in the future.
> 
> Could we do the things that metrotestharness.exe is doing in Python using
> ctypes, or does it really need to be an EXE?

Not sure - 

http://mxr.mozilla.org/mozilla-central/source/browser/metro/shell/testing/metrotestharness.cpp

can python do basic com, file io, pipes, and windows message handling via ctypes? If so then probably. (Although nobody on the metro team has time for something like that currently.)

> @@ +566,5 @@
> >              self.log.error("readWithTimeout got error: %d", err)
> > +        # read a character at a time, checking for eol. Return once we get there.
> > +        index = 0
> > +        while index < l.value:
> > +          char = f.read(1)
> 
> Do we need to handle EOF here, or are we covered by the checks on
> PeekNamedPipe above/the length we get from that?

This is handled a little ways up using the return result from PeekNamedPipe.
https://hg.mozilla.org/mozilla-central/rev/e7fb3a5d2293
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
(In reply to Jim Mathies [:jimm] from comment #16)
> (In reply to Ted Mielczarek [:ted.mielczarek] from comment #15)
> > Comment on attachment 792236 [details] [diff] [review]
> > patch v.2
> > 
> > Review of attachment 792236 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > This whole thing is a little gross. I get the feeling this is going to make
> > refactoring this code a pain in the ass in the future.
> > 
> > Could we do the things that metrotestharness.exe is doing in Python using
> > ctypes, or does it really need to be an EXE?
> 
> Not sure - 
> 
> http://mxr.mozilla.org/mozilla-central/source/browser/metro/shell/testing/
> metrotestharness.cpp

filed bug 909398 on this.
Component: Release Automation → Mozharness
You need to log in before you can comment on or make changes to this bug.