Closed Bug 1143547 Opened 5 years ago Closed 4 years ago

[e10s] When test harness kills browser due to timeout we don't get a stack for the content process

Categories

(Testing :: General, defect)

x86
macOS
defect
Not set

Tracking

(e10s+, firefox46 fixed, firefox47 fixed)

RESOLVED FIXED
mozilla47

People

(Reporter: bholley, Assigned: chmanchester)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files)

Bug 1135424 got backed out because of a very intermittent mochitest-e10s-3 test hang. The stack was garbage each time, so I had to do a bunch of guessing and round-tripping through try to develop and confirm my theory that the main thread was hanging indefinitely in a call to MediaTaskQueue::AwaitShutdownAndIdle.

If the stacks had been correct, I could have instantly diagnosed the failure when it first happened on inbound on Friday. Instead, I had to spend a weekend (and a lot of try resources) chasing it down. That's a pretty high cost. I don't know whose radar this should be on exactly, but assuming it's not just a fluke with my particular push (or base revisions), it needs to be investigated and fixed.

This happened for both opt and debug builds - see the crashes in test_played.html and test_playback.html.

Original inbound push that got backed out: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=584d91ffdf88&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending

My logging try pushes while iterating on this:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dcc31427f8cf
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e3c9fea3891c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=95e73d129de9
NI jgriffin for next steps.
Flags: needinfo?(jgriffin)
So a little bit of this is bug 528231, in that we don't have symbols for system libs like libc so the stack traces suck.

Another part of this is that when the chrome process kills the content process it attributes the crash to the main thread, because we couldn't figure out any other sane way to do it. You're force-killing a running process, so every thread is running and it's a crapshoot as to what thread you'd get if you just picked the active instruction pointer, so we decided "blame the main thread" was as sane as anything else.

I'm not really sure what we can do to make this better. If we could profile the content process before killing it and print the profile, that might give you a better idea of what it's actually doing vs. "here's some thread spinning an event loop". Although if the problem is that some random thread actually is just spinning an event loop, I'm not sure that it'd be any better!

I think we can do a better job with the symbols nowadays: Socorro's symbols are all in S3, and we could fetch them via HTTP, so maybe that's a good place to start.
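For illustration, fetching one of those symbol files over HTTP could look roughly like this (a sketch only: the server URL and fetch_symbol_file helper are assumptions, but the store layout follows the standard Breakpad convention of <debug_file>/<debug_id>/<sym_name>):

import urllib2

def fetch_symbol_file(debug_file, debug_id,
                      server='https://symbols.mozilla.org'):
    # Breakpad symbol stores are laid out as
    # <debug_file>/<debug_id>/<sym_name>, where <sym_name> is the
    # debug file name with a trailing ".pdb" swapped for ".sym".
    if debug_file.endswith('.pdb'):
        sym_name = debug_file[:-4] + '.sym'
    else:
        sym_name = debug_file + '.sym'
    url = '%s/%s/%s/%s' % (server, debug_file, debug_id, sym_name)
    return urllib2.urlopen(url).read()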
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2)
> So a little bit of this is bug 528231, in that we don't have symbols for
> system libs like libc so the stack traces suck.

This isn't an issue of missing system lib symbols - the gecko stack is garbage.

See this log:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bobbyholley@gmail.com-95e73d129de9/try-linux64-debug/try_ubuntu64_vm-debug_test-mochitest-e10s-3-bm117-tests1-linux64-build63.txt.gz

Right before the crash, you see the line:

20:08:57     INFO -  Decoder=0x7f51ba994980 nsDecoderDisoseEvent::Run Awaiting Shutdown+Idle

That call happens on the main thread, and we never return from it:

https://hg.mozilla.org/try/file/95e73d129de9/dom/media/MediaDecoderStateMachine.cpp#l2548

So at minimum, the stack should have nsDecoderDisposeEvent::Run and MediaTaskQueue::AwaitShutdownAndIdle. It has none of that, and instead shows things like TimerThread::RemoveTimer spinning the event loop, which I don't think can happen. I think the nsLocalFile::AppendRelativePath is probably garbage too.
 
> Another part of this is that when the chrome process kills the content
> process it attributes the crash to the main thread, because we couldn't
> figure out any other sane way to do it. You're force-killing a running
> process, so every thread is running and it's a crapshoot as to what thread
> you'd get if you just picked the active instruction pointer, so we decided
> "blame the main thread" was as sane as anything else.

That's totally fine with me. I just want the main thread stack to not lie. In this case, it _was_ the main thread that was hanging.

> I think we can do a better job with the symbols nowadays, Socorro's symbols
> are all in S3 and we could fetch them via HTTP, so maybe that's a good place
> to start.

Would that likely solve the above?
Flags: needinfo?(ted)
That is possible. I'm going to see if I can process the minidump with the full set of symbols it wants and see what the stack looks like. When we don't have symbols for system libraries, the stack unwinding often misses things.
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #4)
> That is possible, I am going to see if I can process the minidump with the
> full set of symbols it wants and see what the stack looks like. When we
> don't have symbols for system libraries often the stack unwinding misses
> things.

Awesome, thank you!
Bill, per our discussion today do you want to file a bug about that ipc assertion?
Flags: needinfo?(wmccloskey)
I took one of the minidumps from the test logs you linked, scraped symbols out of the system libraries from a test machine and reprocessed it. This is the full symbolicated stack. I still don't see MediaTaskQueue anywhere in there, and all the stacks look sane, so I'm not sure what's going on there.
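Roughly, the scrape-and-reprocess flow looks like this (a sketch, not the exact commands used here; scrape_system_symbols is a hypothetical helper, and the MODULE-line parsing relies on the standard Breakpad .sym header):

import os
import subprocess

def scrape_system_symbols(binary, symbol_dir):
    # dump_syms emits a Breakpad .sym file whose first line is
    # "MODULE <os> <arch> <debug_id> <debug_file>"; those two fields
    # determine where the file belongs in the symbol store.
    sym_data = subprocess.check_output(['dump_syms', binary])
    first = sym_data.splitlines()[0].decode('utf-8').split()
    debug_id, debug_file = first[3], first[4]
    out_dir = os.path.join(symbol_dir, debug_file, debug_id)
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    # (For a .pdb the stored name would drop the extension; system
    # libs on the test machines are ELF, so the full name is kept.)
    with open(os.path.join(out_dir, debug_file + '.sym'), 'wb') as f:
        f.write(sym_data)

# Then reprocess against the augmented store, e.g.:
#   minidump_stackwalk crash.dmp /path/to/symbol_dir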
Clearing NI, since Ted is on the case.
Flags: needinfo?(jgriffin)
(In reply to Bobby Holley (:bholley) from comment #6)
> Bill, per our discussion today do you want to file a bug about that ipc
> assertion?

Looks like it's already filed as bug 1065892.
Flags: needinfo?(wmccloskey)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #7)
> Created attachment 8578618 [details]
> Fully symbolicated crash stack
> 
> I took one of the minidumps from the test logs you linked, scraped symbols
> out of the system libraries from a test machine and reprocessed it. This is
> the full symbolicated stack. I still don't see MediaTaskQueue anywhere in
> there, and all the stacks look sane, so I'm not sure what's going on there.

Was this from the hang in comment 3? It seems pretty clear from the log that we're never returning from that function. Any other ideas?

If nothing else, somehow making the work you did in comment 7 automatic would be a big help. Is that feasible / is there a bug?
Flags: needinfo?(ted)
Um, this was from the link in comment 0. I will figure out which one shortly, and see if I can do the same thing to the other failure logs.

Bug 528231 has been on file for the system symbols for ages, but I think I can feasibly fix it nowadays, so I will pursue that.
Flags: needinfo?(ted)
Great, thanks Ted!
So I was symbolicating that stack and the answer hit me pretty quickly--the minidumps you're getting here are from the chrome process, not the content process. (Should have been obvious from having firefox and not plugin-container on the stack!) What's happening is that the process is hanging and the test harness is killing it:
20:14:27     INFO -  2837 ERROR TEST-UNEXPECTED-TIMEOUT | dom/media/test/test_played.html | application timed out after 330 seconds with no output
20:14:28     INFO -  Xlib:  extension "RANDR" missing on display ":0".
20:14:28     INFO -  TEST-INFO | screentopng: exit 0
20:14:29     INFO -  TEST-INFO | Main app process: killed by SIGIOT (this is confusing, it's SIGABRT).

We're sending this to the main process, so we're getting a dump out of it, but you probably really want dumps for all the processes in this case. I'm not totally sure how to handle this right now: we either need to track all the child processes and kill them (children first, so the dumps get written properly), or we need a way to tell the parent process that we want it to terminate its children.
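A minimal sketch of the "children first" ordering (hypothetical helper; assumes we already know the child PIDs, and that SIGABRT trips the crash reporter the way the SIGIOT/SIGABRT line above shows):

import os
import signal

def abort_for_dumps(child_pids, parent_pid):
    # Abort the content processes first, while the parent is still
    # alive to service the crash reporter and write their minidumps,
    # then abort the parent itself.
    for pid in child_pids:
        os.kill(pid, signal.SIGABRT)
    # (Real harness code would wait for the dumps to hit disk here.)
    os.kill(parent_pid, signal.SIGABRT)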

Resummarizing to match.

Bill: any thoughts on the best way to handle this?
Flags: needinfo?(wmccloskey)
Summary: Garbage stack dump when test harness kills content process due to timeout → [e10s] When test harness kills browser due to timeout we don't get a stack for the content process
Ted's comment 13 reminded me that I worked on something about a year ago that might be related to this. I never finished it, unfortunately.

We have test harness code here that kills all existing child processes and gets their stacks:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#1582

Unfortunately, I think it's broken. One piece is bug 950401, which I never had time to finish. However, I also have a vague memory that the code for isPidAlive is wrong. I think the problem is that it calls waitpid on the content process. However, since the content process is never a child of the test harness, this always fails (waitpid only works for direct children IIRC). So I think it always treats every child process as if it's dead, even when it's not.

For this bug, I think you would also need to change handleTimeout to call checkForZombies.

Sorry this is so vague. I thought I had filed a bug (and even checked in a fix for) the isPidAlive issue, but I can't find anything now. Maybe it's not even related to this though.
Flags: needinfo?(wmccloskey)
That isPidAlive code apparently dates back to bug 523208, and I reviewed it but I don't see any comments on the waitpid bit. (I suggested the os.kill approach.) That waitpid code is definitely not correct, and it's always going to return errno.ECHILD, causing us to ignore child processes completely. :-(
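For reference, a minimal sketch of the os.kill approach (helper name hypothetical; signal 0 performs only the existence/permission check, so unlike waitpid it works for processes we didn't spawn):

import errno
import os

def is_pid_alive(pid):
    # os.waitpid(pid, ...) raises ECHILD for any process we didn't
    # spawn ourselves, which is why the harness ends up treating every
    # content process as already dead. Signal 0 just checks existence.
    # (Unix-only; Windows needs a different mechanism.)
    try:
        os.kill(pid, 0)
        return True
    except OSError as e:
        if e.errno == errno.ESRCH:   # no such process
            return False
        if e.errno == errno.EPERM:   # exists, but not ours to signal
            return True
        raise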
Blocks: e10s-tests
Blocks: e10s-harness
tracking-e10s: --- → ?
I have a patch for this that works on Linux, but in addition to bug 950401, I don't see us getting child PIDs out of the pid logfile on Windows consistently.
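For illustration, pulling child PIDs out of the process log might look like this (a sketch; it assumes the "==> process N launched child process M" lines Gecko writes via MOZ_PROCESS_LOG, and the exact format may differ):

import re

def extract_child_pids(process_log):
    # Assumed format: "==> process <ppid> launched child process <pid>"
    child_pid_re = re.compile(r'launched child process (\d+)')
    pids = []
    with open(process_log) as f:
        for line in f:
            m = child_pid_re.search(line)
            if m:
                pids.append(int(m.group(1)))
    return pids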
Depends on: 950401
Depends on: 1193861
Depends on: 1196393
No longer depends on: 1193861
No longer depends on: 950401
Bug 1143547 - Get a stack out of child processes before killing the parent from the mochitest harness in response to the global output timeout. r=ted
Attachment #8654336 - Flags: review?(ted)
Assignee: nobody → cmanchester
Comment on attachment 8654336 [details]
MozReview Request: Bug 1143547 - Get a stack out of child processes before killing the parent from the mochitest harness in response to the global output timeout. r=ted

https://reviewboard.mozilla.org/r/17657/#review20379

::: testing/mochitest/runtests.py:1626
(Diff revision 1)
> -        processList = []
> +        processList = self.extract_child_pids(processLog)

So this is going to continue to not work on most platforms, but that's exactly what's happening now, right?

::: testing/mochitest/runtests.py:2358
(Diff revision 1)
> +            while self.count_minidumps() < (minidump_count + 1):

This is sort of awful. :-/ Could we use psutil.wait_procs here instead if we have psutil? (and just fall back to waiting for a fixed timeout if not)
http://pythonhosted.org/psutil/#psutil.wait_procs
Attachment #8654336 - Flags: review?(ted)
https://reviewboard.mozilla.org/r/17657/#review20379

> So this is going to continue to not work on most platforms, but that's exactly what's happening now, right?

Right, this patch doesn't fix checkForZombies. It's called after the parent is dead, so it's not very helpful. That call can be fixed, though.

> This is sort of awful. :-/ Could we use psutil.wait_procs here instead if we have psutil? (and just fall back to waiting for a fixed timeout if not)
> http://pythonhosted.org/psutil/#psutil.wait_procs

We have psutil on all the desktop platforms now, so that should work fine!
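Something like the following (sketch only; assumes psutil is importable and falls back to a fixed wait otherwise, per Ted's suggestion):

import time

def wait_for_child_exits(child_pids, timeout=30):
    try:
        import psutil
    except ImportError:
        # No psutil: just wait out the fixed timeout.
        time.sleep(timeout)
        return
    procs = []
    for pid in child_pids:
        try:
            procs.append(psutil.Process(pid))
        except psutil.NoSuchProcess:
            pass  # already gone
    # Blocks until all procs exit or the timeout elapses.
    psutil.wait_procs(procs, timeout=timeout)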
Comment on attachment 8654336 [details]
MozReview Request: Bug 1143547 - Get a stack out of child processes before killing the parent from the mochitest harness in response to the global output timeout. r=ted

Bug 1143547 - Get a stack out of child processes before killing the parent from the mochitest harness in response to the global output timeout. r=ted
Attachment #8654336 - Flags: review?(ted)
Comment on attachment 8654336 [details]
MozReview Request: Bug 1143547 - Get a stack out of child processes before killing the parent from the mochitest harness in response to the global output timeout. r=ted

https://reviewboard.mozilla.org/r/17657/#review29001

Just a couple of nits here. Again, sorry for the delay!

::: testing/mochitest/runtests.py:2369
(Diff revision 2)
> +                self.killAndGetStack(pid, utilityPath, debuggerInfo,

A log message indicating that we're attempting to kill child process N would be good here.

::: testing/mochitest/runtests.py:2378
(Diff revision 2)
> +            for pid in child_pids:

Similarly, a log message here indicating that we're punting because we don't have psutil would be good. Maybe even put some text saying to file a bug if this message shows up in automation, since that should never happen?
Attachment #8654336 - Flags: review?(ted) → review+
https://hg.mozilla.org/mozilla-central/rev/0983990d3768
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47