905628 - mozharness should dump stacks when a mochitest test hangs

Assignee

Description

•

12 years ago

Not sure if this is possible or not, but we already dump stacks in some cases, so I'm wondering if it's possible here. I bug 891892 we are seeing occasional browser hangs: timed out after 1000 seconds of no output Return code: 572 command timed out: 1200 seconds without output, attempting to kill This showed up on a big mc merge from last weekend with a bunch of stuff in it, including gfx related work. We can't reproduce locally. Filing this bug to see if we can figure out some way to trigger stack dumps when these tests hang.

Jim Mathies [:jimm]

Assignee

Comment 1

•

12 years ago

Ted do you know if there's a way to do this for hung browsers through mozharness? Or does this need to happen from within the app?

Flags: needinfo?(ted)

(not currently active) Ted Mielczarek

Comment 2

•

12 years ago

If this is running in the Mochitest harness then we're supposed to be handling that internally: http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#744 If this isn't working it's possible we have the harness timeouts configured badly, so that mozharness times out before the harness gets a chance to handle it. I don't see anything in the commandline that would lead me to believe that's the case, though. If you run these locally and put an intentional hang in the code somewhere, what happens? (I don't have a Win8 install to test on.)

Flags: needinfo?(ted)

Jim Mathies [:jimm]

Assignee

Comment 3

•

12 years ago

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2) > If this is running in the Mochitest harness then we're supposed to be > handling that internally: > http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#744 > > If this isn't working it's possible we have the harness timeouts configured > badly, so that mozharness times out before the harness gets a chance to > handle it. I don't see anything in the commandline that would lead me to > believe that's the case, though. If you run these locally and put an > intentional hang in the code somewhere, what happens? (I don't have a Win8 > install to test on.) Ok, so this looks like it's coming from mozharness vs the test harness - https://tbpl.mozilla.org/php/getParsedLog.php?id=26610876&tree=Profiling 16:16:46 ERROR - timed out after 1000 seconds of no output ^ That looks like a mozharness message. So maybe mozharness is killing the app w/ the test harness still running.

Jim Mathies [:jimm]

Assignee

Comment 4

•

12 years ago

If I'm reading that right - test harness would dump something like TEST-UNEXPECTED-FAIL | %s | application timed out after %d seconds with no output

Jim Mathies [:jimm]

Assignee

Comment 5

•

12 years ago

hmm, or maybe we failed to get the stack - Traceback (most recent call last): Failure: exceptions.RuntimeError

Jim Mathies [:jimm]

Assignee

Comment 6

•

12 years ago

I think we have two problems here, one the mozharness timeout is too short - http://mxr.mozilla.org/build/search?find=%2Fmozharness%2F&string=timeout and also we use a harness to launch to browser, so even if we did try to get a stack, it'd probably try to get a stack of the harness. Will do some local testing to see what happens in runtests.

(not currently active) Ted Mielczarek

Comment 7

•

12 years ago

The default timeout ought to be only 330 seconds: http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#590 And the log says it's hitting 1000 seconds without output, so that doesn't seem right. Maybe the timeout logic isn't working properly on Metro builds?

Jim Mathies [:jimm]

Assignee

Comment 8

•

12 years ago

Attached patch patch v.1 (obsolete) — Details — Splinter Review

works - https://tbpl.mozilla.org/?tree=Try&showall=0&rev=449ae65479bf

Assignee: nobody → jmathies

Attachment #791611 - Flags: review?(ted)

Jim Mathies [:jimm]

Assignee

Comment 9

•

12 years ago

Attached patch patch v.1 (obsolete) — Details — Splinter Review

added some additional commenting in automation.py.

Attachment #791611 - Attachment is obsolete: true

Attachment #791611 - Flags: review?(ted)

Attachment #791613 - Flags: review?(ted)

Jim Mathies [:jimm]

Assignee

Comment 10

•

12 years ago

Full windows test run - https://tbpl.mozilla.org/?tree=Try&showall=0&rev=202e3a6fe2c1 I was concerned that line buffering change might break something, but it doesn't appear to.

(not currently active) Ted Mielczarek

Comment 11

•

12 years ago

It's still a behavior change, which weirds me out a bit. The rest of automation.py is expecting readWithTimeout to return a line at a time, but your patch changes that. Why is that a prerequisite for this patch?

Jim Mathies [:jimm]

Assignee

Comment 12

•

12 years ago

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #11) > It's still a behavior change, which weirds me out a bit. The rest of > automation.py is expecting readWithTimeout to return a line at a time, but > your patch changes that. Why is that a prerequisite for this patch? The output from the browser has to travel through a pipe via metrotestharness to get here. That breaks the line buffered nature of console output. I can probably fix this in metrotestharness as well by parsing output and flushing. Generally though I don't think we should assume line buffering here on Windows, but maybe that can be addressed if and when we run into a problem with it.

(not currently active) Ted Mielczarek

Comment 13

•

12 years ago

That's a good point, since I know for a fact that Windows doesn't have line buffering for piped output. (We've been bitten by this in the past when piping stdout and stderr to the same stream.) I think the proper fix would be to make readWithTimeout do line-buffering internally.

Jim Mathies [:jimm]

Assignee

Comment 14

•

12 years ago

Attached patch patch v.2 — Details — Splinter Review

Here's an update with better readWithTimeout for Windows.

Attachment #791613 - Attachment is obsolete: true

Attachment #791613 - Flags: review?(ted)

Attachment #792236 - Flags: review?(ted)

Jim Mathies [:jimm]

Assignee

Updated

•

12 years ago

Attachment #792236 - Attachment is patch: true

(not currently active) Ted Mielczarek

Comment 15

•

12 years ago

Comment on attachment 792236 [details] [diff] [review] patch v.2 Review of attachment 792236 [details] [diff] [review]: ----------------------------------------------------------------- This whole thing is a little gross. I get the feeling this is going to make refactoring this code a pain in the ass in the future. Could we do the things that metrotestharness.exe is doing in Python using ctypes, or does it really need to be an EXE? ::: build/automation.py.in @@ +547,3 @@ > |f| must be a pipe, like the |stdout| member of a subprocess.Popen > + object created with stdout=PIPE. Returns a tuple (line, did_timeout), > + where |did_timeout| is True if the read timed out, and False otherwise. I think you should keep the verbiage about "|timeout| seconds". @@ +566,5 @@ > self.log.error("readWithTimeout got error: %d", err) > + # read a character at a time, checking for eol. Return once we get there. > + index = 0 > + while index < l.value: > + char = f.read(1) Do we need to handle EOF here, or are we covered by the checks on PeekNamedPipe above/the length we get from that? @@ +760,5 @@ > hitMaxTime = True > self.log.info("TEST-UNEXPECTED-FAIL | %s | application ran for longer than allowed maximum time of %d seconds", self.lastTestSeen, int(maxTime)) > self.killAndGetStack(proc.pid, utilityPath, debuggerInfo) > if didTimeout: > + if len(line) > 0: You can just say "if line:" @@ +769,5 @@ > + self.killAndGetStack(proc.pid, utilityPath, debuggerInfo) > + else: > + # kill the browser process we were handed back via the metro harness. killAndGetStack > + # will try to ping the crash reporter in the metro browser. > + self.killAndGetStack(browserProcessId, utilityPath, debuggerInfo) I'd probably just either write: if browserProcessId == -1: browserProcessId = proc.pid or: self.killAndGetStack(proc.pid if browserProcessId == -1 else browserProcessId , ...)

Attachment #792236 - Flags: review?(ted) → review+

Jim Mathies [:jimm]

Assignee

Comment 16

•

12 years ago

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15) > Comment on attachment 792236 [details] [diff] [review] > patch v.2 > > Review of attachment 792236 [details] [diff] [review]: > ----------------------------------------------------------------- > > This whole thing is a little gross. I get the feeling this is going to make > refactoring this code a pain in the ass in the future. > > Could we do the things that metrotestharness.exe is doing in Python using > ctypes, or does it really need to be an EXE? Not sure - http://mxr.mozilla.org/mozilla-central/source/browser/metro/shell/testing/metrotestharness.cpp can python do basic com, file io, pipes, and windows message handling via ctypes? If so then probably. (Although nobody on the metro team has time for something like that currently.) > @@ +566,5 @@ > > self.log.error("readWithTimeout got error: %d", err) > > + # read a character at a time, checking for eol. Return once we get there. > > + index = 0 > > + while index < l.value: > > + char = f.read(1) > > Do we need to handle EOF here, or are we covered by the checks on > PeekNamedPipe above/the length we get from that? This is handled a little ways up using the return result from PeekNamedPipe.

Jim Mathies [:jimm]

Assignee

Comment 17

•

12 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/e7fb3a5d2293

Ryan VanderMeulen [:RyanVM]

Comment 18

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/e7fb3a5d2293

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Jim Mathies [:jimm]

Assignee

Comment 19

•

12 years ago

(In reply to Jim Mathies [:jimm] from comment #16) > (In reply to Ted Mielczarek [:ted.mielczarek] from comment #15) > > Comment on attachment 792236 [details] [diff] [review] > > patch v.2 > > > > Review of attachment 792236 [details] [diff] [review]: > > ----------------------------------------------------------------- > > > > This whole thing is a little gross. I get the feeling this is going to make > > refactoring this code a pain in the ass in the future. > > > > Could we do the things that metrotestharness.exe is doing in Python using > > ctypes, or does it really need to be an EXE? > > Not sure - > > http://mxr.mozilla.org/mozilla-central/source/browser/metro/shell/testing/ > metrotestharness.cpp filed bug 909398 on this.

Chris Cooper [:coop] (he/him)

Updated

•

11 years ago

Component: Release Automation → Mozharness

patch v.1 12 years ago Jim Mathies [:jimm] 5.60 KB, patch		Details \| Diff \| Splinter Review
patch v.1 12 years ago Jim Mathies [:jimm] 5.65 KB, patch		Details \| Diff \| Splinter Review
patch v.2 12 years ago Jim Mathies [:jimm] 6.52 KB, patch	ted : review+	Details \| Diff \| Splinter Review