Closed Bug 1143547 Opened 5 years ago Closed 4 years ago

[e10s] When test harness kills browser due to timeout we don't get a stack for the content process

Categories

(Testing :: General, defect)

x86
macOS
defect
Not set

Tracking

(e10s+, firefox46 fixed, firefox47 fixed)

RESOLVED FIXED
mozilla47

People

(Reporter: bholley, Assigned: chmanchester)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files)

Bug 1135424 got backed out because of a very intermittent mochitest-e10s-3 test hang. The stack was garbage each time, so I had to do a bunch of guessing and round-tripping through try to develop and confirm my theory that the main thread was hanging indefinitely in a call to MediaTaskQueue::AwaitShutdownAndIdle.

If the stacks had been correct, I could have instantly diagnosed the failure when it first happened on inbound on Friday. Instead, I had to spend a weekend (and a lot of try resources) chasing it down. That's a pretty high cost. I don't know whose radar this should be on exactly, but assuming it's not just a fluke with my particular push (or base revisions), it needs to be investigated and fixed.

This happened for both opt and debug builds - see the crashes in test_played.html and test_playback.html.

Original inbound push that got backed out: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=584d91ffdf88&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending

My logging try pushes while iterating on this:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dcc31427f8cf
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e3c9fea3891c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=95e73d129de9
NI jgriffin for next steps.
Flags: needinfo?(jgriffin)
So a little bit of this is bug 528231, in that we don't have symbols for system libs like libc so the stack traces suck.

Another part of this is that when the chrome process kills the content process it attributes the crash to the main thread, because we couldn't figure out any other sane way to do it. You're force-killing a running process, so every thread is running and it's a crapshoot as to what thread you'd get if you just picked the active instruction pointer, so we decided "blame the main thread" was as sane as anything else.

I'm not really sure what we can do to make this better. If we could profile the content process before killing it and print the profile, that might give you a better idea of what it's actually doing vs. "here's some thread spinning an event loop". Although if the problem is that some random thread actually is just spinning an event loop, I'm not sure that it'd be any better!

I think we can do a better job with the symbols nowadays: Socorro's symbols are all in S3, and we could fetch them via HTTP, so maybe that's a good place to start.
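For illustration, fetching one of those symbol files over HTTP could look roughly like this (a sketch only: the server URL and fetch_symbol_file helper are assumptions, but the store layout follows the standard Breakpad convention of <debug_file>/<debug_id>/<sym_name>):

import urllib2

def fetch_symbol_file(debug_file, debug_id,
                      server='https://symbols.mozilla.org'):
    # Breakpad symbol stores are laid out as
    # <debug_file>/<debug_id>/<sym_name>, where <sym_name> is the
    # debug file name with a trailing ".pdb" swapped for ".sym".
    if debug_file.endswith('.pdb'):
        sym_name = debug_file[:-4] + '.sym'
    else:
        sym_name = debug_file + '.sym'
    url = '%s/%s/%s/%s' % (server, debug_file, debug_id, sym_name)
    return urllib2.urlopen(url).read()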
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2)
> So a little bit of this is bug 528231, in that we don't have symbols for
> system libs like libc so the stack traces suck.

This isn't an issue of missing system lib symbols - the gecko stack is garbage.

See this log:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bobbyholley@gmail.com-95e73d129de9/try-linux64-debug/try_ubuntu64_vm-debug_test-mochitest-e10s-3-bm117-tests1-linux64-build63.txt.gz

Right before the crash, you see the line:

20:08:57     INFO -  Decoder=0x7f51ba994980 nsDecoderDisoseEvent::Run Awaiting Shutdown+Idle

That call happens on the main thread, and we never return from it:

https://hg.mozilla.org/try/file/95e73d129de9/dom/media/MediaDecoderStateMachine.cpp#l2548

So at minimum, the stack should have nsDecoderDisposeEvent::Run and MediaTaskQueue::AwaitShutdownAndIdle. It has none of that, and instead shows things like TimerThread::RemoveTimer spinning the event loop, which I don't think can happen. I think the nsLocalFile::AppendRelativePath is probably garbage too.
 
> Another part of this is that when the chrome process kills the content
> process it attributes the crash to the main thread, because we couldn't
> figure out any other sane way to do it. You're force-killing a running
> process, so every thread is running and it's a crapshoot as to what thread
> you'd get if you just picked the active instruction pointer, so we decided
> "blame the main thread" was as sane as anything else.

That's totally fine with me. I just want the main thread stack to not lie. In this case, it _was_ the main thread that was hanging.

> I think we can do a better job with the symbols nowadays, Socorro's symbols
> are all in S3 and we could fetch them via HTTP, so maybe that's a good place
> to start.

Would that likely solve the above?
Flags: needinfo?(ted)
That is possible. I'm going to see if I can process the minidump with the full set of symbols it wants and see what the stack looks like. When we don't have symbols for system libraries, the stack unwinding often misses things.
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #4)
> That is possible, I am going to see if I can process the minidump with the
> full set of symbols it wants and see what the stack looks like. When we
> don't have symbols for system libraries often the stack unwinding misses
> things.

Awesome, thank you!
Bill, per our discussion today do you want to file a bug about that ipc assertion?
Flags: needinfo?(wmccloskey)
I took one of the minidumps from the test logs you linked, scraped symbols out of the system libraries from a test machine and reprocessed it. This is the full symbolicated stack. I still don't see MediaTaskQueue anywhere in there, and all the stacks look sane, so I'm not sure what's going on there.
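Roughly, the scrape-and-reprocess flow looks like this (a sketch, not the exact commands used here; scrape_system_symbols is a hypothetical helper, and the MODULE-line parsing relies on the standard Breakpad .sym header):

import os
import subprocess

def scrape_system_symbols(binary, symbol_dir):
    # dump_syms emits a Breakpad .sym file whose first line is
    # "MODULE <os> <arch> <debug_id> <debug_file>"; those two fields
    # determine where the file belongs in the symbol store.
    sym_data = subprocess.check_output(['dump_syms', binary])
    first = sym_data.splitlines()[0].decode('utf-8').split()
    debug_id, debug_file = first[3], first[4]
    out_dir = os.path.join(symbol_dir, debug_file, debug_id)
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    # (For a .pdb the stored name would drop the extension; system
    # libs on the test machines are ELF, so the full name is kept.)
    with open(os.path.join(out_dir, debug_file + '.sym'), 'wb') as f:
        f.write(sym_data)

# Then reprocess against the augmented store, e.g.:
#   minidump_stackwalk crash.dmp /path/to/symbol_dir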
Clearing NI, since Ted is on the case.
Flags: needinfo?(jgriffin)
(In reply to Bobby Holley (:bholley) from comment #6)
> Bill, per our discussion today do you want to file a bug about that ipc
> assertion?

Looks like it's already filed as bug 1065892.
Flags: needinfo?(wmccloskey)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #7)
> Created attachment 8578618 [details]
> Fully symbolicated crash stack
> 
> I took one of the minidumps from the test logs you linked, scraped symbols
> out of the system libraries from a test machine and reprocessed it. This is
> the full symbolicated stack. I still don't see MediaTaskQueue anywhere in
> there, and all the stacks look sane, so I'm not sure what's going on there.

Was this from the hang in comment 3? It seems pretty clear from the log that we're never returning from that function. Any other ideas?

If nothing else, somehow making the work you did in comment 7 automatic would be a big help. Is that feasible / is there a bug?
Flags: needinfo?(ted)
Um, this was from the link in comment 0. I will figure out which one shortly, and see if I can do the same thing to the other failure logs.

Bug 528231 has been on file for the system symbols for ages, but I think I can feasibly fix it nowadays, so I will pursue that.
Flags: needinfo?(ted)
Great, thanks Ted!
So I was symbolicating that stack and the answer hit me pretty quickly--the minidumps you're getting here are from the chrome process, not the content process. (Should have been obvious from having firefox and not plugin-container on the stack!) What's happening is that the process is hanging and the test harness is killing it:
20:14:27     INFO -  2837 ERROR TEST-UNEXPECTED-TIMEOUT | dom/media/test/test_played.html | application timed out after 330 seconds with no output
20:14:28     INFO -  Xlib:  extension "RANDR" missing on display ":0".
20:14:28     INFO -  TEST-INFO | screentopng: exit 0
20:14:29     INFO -  TEST-INFO | Main app process: killed by SIGIOT (this is confusing, it's SIGABRT).

We're sending this to the main process, so we're getting a dump out of it, but you probably really want dumps for all the processes in this case. I'm not totally sure how to handle this right now: we either need to track all the child processes and kill them (children first, so the dumps get written properly), or we need a way to tell the parent process that we want it to terminate its children.
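A minimal sketch of the "children first" ordering (hypothetical helper; assumes we already know the child PIDs, and that SIGABRT trips the crash reporter the way the SIGIOT/SIGABRT line above shows):

import os
import signal

def abort_for_dumps(child_pids, parent_pid):
    # Abort the content processes first, while the parent is still
    # alive to service the crash reporter and write their minidumps,
    # then abort the parent itself.
    for pid in child_pids:
        os.kill(pid, signal.SIGABRT)
    # (Real harness code would wait for the dumps to hit disk here.)
    os.kill(parent_pid, signal.SIGABRT)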

Resummarizing to match.

Bill: any thoughts on the best way to handle this?
Flags: needinfo?(wmccloskey)
Summary: Garbage stack dump when test harness kills content process due to timeout → [e10s] When test harness kills browser due to timeout we don't get a stack for the content process
Ted's comment 13 reminded me that I worked on something about a year ago that might be related to this. I never finished it, unfortunately.

We have test harness code here that kills all existing child processes and gets their stacks:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#1582

Unfortunately, I think it's broken. One piece is bug 950401, which I never had time to finish. However, I also have a vague memory that the code for isPidAlive is wrong. I think the problem is that it calls waitpid on the content process. However, since the content process is never a child of the test harness, this always fails (waitpid only works for direct children IIRC). So I think it always treats every child process as if it's dead, even when it's not.

For this bug, I think you would also need to change handleTimeout to call checkForZombies.

Sorry this is so vague. I thought I had filed a bug (and even checked in a fix for) the isPidAlive issue, but I can't find anything now. Maybe it's not even related to this though.
Flags: needinfo?(wmccloskey)
That isPidAlive code apparently dates back to bug 523208, and I reviewed it but I don't see any comments on the waitpid bit. (I suggested the os.kill approach.) That waitpid code is definitely not correct, and it's always going to return errno.ECHILD, causing us to ignore child processes completely. :-(
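For reference, a minimal sketch of the os.kill approach (helper name hypothetical; signal 0 performs only the existence/permission check, so unlike waitpid it works for processes we didn't spawn):

import errno
import os

def is_pid_alive(pid):
    # os.waitpid(pid, ...) raises ECHILD for any process we didn't
    # spawn ourselves, which is why the harness ends up treating every
    # content process as already dead. Signal 0 just checks existence.
    # (Unix-only; Windows needs a different mechanism.)
    try:
        os.kill(pid, 0)
        return True
    except OSError as e:
        if e.errno == errno.ESRCH:   # no such process
            return False
        if e.errno == errno.EPERM:   # exists, but not ours to signal
            return True
        raise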
Blocks: e10s-tests
Blocks: e10s-harness
tracking-e10s: --- → ?
I have a patch for this that works on Linux, but in addition to bug 950401, I don't see us getting child PIDs out of the pid logfile on Windows consistently.
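For illustration, pulling child PIDs out of the process log might look like this (a sketch; it assumes the "==> process N launched child process M" lines Gecko writes via MOZ_PROCESS_LOG, and the exact format may differ):

import re

def extract_child_pids(process_log):
    # Assumed format: "==> process <ppid> launched child process <pid>"
    child_pid_re = re.compile(r'launched child process (\d+)')
    pids = []
    with open(process_log) as f:
        for line in f:
            m = child_pid_re.search(line)
            if m:
                pids.append(int(m.group(1)))
    return pids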
Depends on: 950401
Depends on: 1193861
Depends on: 1196393
No longer depends on: 1193861
No longer depends on: 950401
Bug 1143547 - Get a stack out of child processes before killing the parent from the mochitest harness in response to the global output timeout. r=ted
Attachment #8654336 - Flags: review?(ted)
Assignee: nobody → cmanchester
Comment on attachment 8654336 [details]
MozReview Request: Bug 1143547 - Get a stack out of child processes before killing the parent from the mochitest harness in response to the global output timeout. r=ted

https://reviewboard.mozilla.org/r/17657/#review20379

::: testing/mochitest/runtests.py:1626
(Diff revision 1)
> -        processList = []
> +        processList = self.extract_child_pids(processLog)

So this is going to continue to not work on most platforms, but that's exactly what's happening now, right?

::: testing/mochitest/runtests.py:2358
(Diff revision 1)
> +            while self.count_minidumps() < (minidump_count + 1):

This is sort of awful. :-/ Could we use psutil.wait_procs here instead if we have psutil? (and just fall back to waiting for a fixed timeout if not)
http://pythonhosted.org/psutil/#psutil.wait_procs
Attachment #8654336 - Flags: review?(ted)
https://reviewboard.mozilla.org/r/17657/#review20379

> So this is going to continue to not work on most platforms, but that's exactly what's happening now, right?

Right, this patch doesn't fix checkForZombies. It's called after the parent is dead, so it's not very helpful. That call can be fixed, though.

> This is sort of awful. :-/ Could we use psutil.wait_procs here instead if we have psutil? (and just fall back to waiting for a fixed timeout if not)
> http://pythonhosted.org/psutil/#psutil.wait_procs

We have psutil on all the desktop platforms now, so that should work fine!
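Something like the following (sketch only; assumes psutil is importable and falls back to a fixed wait otherwise, per Ted's suggestion):

import time

def wait_for_child_exits(child_pids, timeout=30):
    try:
        import psutil
    except ImportError:
        # No psutil: just wait out the fixed timeout.
        time.sleep(timeout)
        return
    procs = []
    for pid in child_pids:
        try:
            procs.append(psutil.Process(pid))
        except psutil.NoSuchProcess:
            pass  # already gone
    # Blocks until all procs exit or the timeout elapses.
    psutil.wait_procs(procs, timeout=timeout)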
Comment on attachment 8654336 [details]
MozReview Request: Bug 1143547 - Get a stack out of child processes before killing the parent from the mochitest harness in response to the global output timeout. r=ted

Bug 1143547 - Get a stack out of child processes before killing the parent from the mochitest harness in response to the global output timeout. r=ted
Attachment #8654336 - Flags: review?(ted)
Comment on attachment 8654336 [details]
MozReview Request: Bug 1143547 - Get a stack out of child processes before killing the parent from the mochitest harness in response to the global output timeout. r=ted

https://reviewboard.mozilla.org/r/17657/#review29001

Just a couple of nits here. Again, sorry for the delay!

::: testing/mochitest/runtests.py:2369
(Diff revision 2)
> +                self.killAndGetStack(pid, utilityPath, debuggerInfo,

A log message indicating that we're attempting to kill child process N would be good here.

::: testing/mochitest/runtests.py:2378
(Diff revision 2)
> +            for pid in child_pids:

Similarly, a log message here indicating that we're punting because we don't have psutil would be good. Maybe even put some text saying to file a bug if this message shows up in automation, since that should never happen?
Attachment #8654336 - Flags: review?(ted) → review+
https://hg.mozilla.org/mozilla-central/rev/0983990d3768
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47