Closed Bug 528144 Opened 13 years ago Closed 13 years ago

Missing IPC messages cause hang during xpcshell test

Categories

(Core :: IPC, defect)

Other Branch
x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jduell.mcbugs, Assigned: benjamin)

References

Details

Attachments

(2 files, 9 obsolete files)

82.71 KB, patch
Details | Diff | Splinter Review
1.78 KB, patch
Details | Diff | Splinter Review
I've discovered that my necko http e10s code hangs when run under xpcshell under the following scenario:

child creates PNecko protocol
child creates PHttpChannel subprotocol
Child calls SendAsyncOpen on the newly created subprotocol.

From debugging this with cjones, I can see that the IPC messages are sent, and that the I/O thread in the parent receives them and queues them for the main thread.  The main thread, however, winds up processing only the first two, and then hangs, "waiting" for a message that it actually sitting in its queue (we have verified that the msg is sitting there).  This happens every time on my Linux build, except that when I run the code in the debugger, sometimes all three msgs are processed.   The likely candidate is a missing (or inappropriately placed) notify() call somewhere.

cjones tells me that the event loop code for xpcshell is somewhat different than for firefox-bin (see, for instance, ipc/glue/MessagePump.cpp:172), and that that's the likely place to look for the culprit. 

I'm going to attach a series of patches that are needed to replicate this.
This is important:

#
(gdb) p MessageLoop::current().work_queue_
#
$6 = std::queue wrapping: std::deque with 1 elements = {{task = 0x7f5ee800c230, delayed_run_time = {static kMillisecondsPerSecond = 1000,
#
      static kMicrosecondsPerMillisecond = 1000, static kMicrosecondsPerSecond = 1000000, static kMicrosecondsPerMinute = 60000000,
#
      static kMicrosecondsPerHour = -694967296, static kMicrosecondsPerDay = 86400000000, static kMicrosecondsPerWeek = <optimized out>,
#
      static kNanosecondsPerMicrosecond = 1000, static kNanosecondsPerSecond = <optimized out>, static kTimeTToMicrosecondsOffset = 0, us_ = 0},
#
    sequence_num = 0, nestable = true}}
#
(gdb) p MessageLoop::current().delayed_work_queue_
#
$7 = std::priority_queue wrapping: std::vector of length 0, capacity 0
#
(gdb) p MessageLoop::current().deferred_non_nestable_work_queue_
#
$8 = std::queue wrapping: std::deque with 0 elements
#
(gdb) p MessageLoop::current().incoming_queue_
#
$9 = std::queue wrapping: std::deque with 0 elements

The IPC event posted by the IO thread is getting lost in the UI MessageLoop's work_queue_.  I don't understand how this can happen, since we post one XPCOM event for each IO event.
Attachment #411901 - Attachment description: HTTP e10s networking patch. → test patch #3: HTTP e10s networking patch.
Comment on attachment 411905 [details] [diff] [review]
test patch #6: printfs that say when IPC msgs have arrived at I/O thread, and when dispatched by main thread.

To reproduce bug:

1) Apply test patches #1-6.

2) Set SOLO_FILE=test_simple_wrap.js in your environment

2) go to netwerk/test, and run "make check-interactive", and run _execute_test();

You should see that the parent receives 3 msgs in a row: PNecko constructor, PHttpConstructor, and AsyncOpen.  It only processes the first two, then hangs.
Attachment #411905 - Attachment description: printfs that say when IPC msgs have arrived at I/O thread, and when dispatched by main thread. → test patch #6: printfs that say when IPC msgs have arrived at I/O thread, and when dispatched by main thread.
bent, if you're busy with other m-c-merge blockers, I can take this tomorrow.  I couldn't debug by thought, but now there's a testcase ;).  Need to turn in for the night though.
Another option is to fix bug 528147, which blocks running XPCOM on the "main" thread in subprocesses.  If we did that, we could drop the layers of indirection between receiving an IPC message and processing it in xpcshell (IO thread --> chromium UI queue + XPCOM queue, XPCOM invokes chromium invokes xpcshell).
Pushed changeset fad684be12cf to electrolysis.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Alas, it looks like our fix for this is incomplete.  The parent is no longer hanging when it gets (PHttpChannel constructor, AsyncOpen) from the child, but the child is now hanging when the parent replies with (OnStartRequest, OnDataAvailable, OnStopRequest).   The debug msgs in patch #6 show that the child process's IO thread is receiving the msgs from the parent (i.e. OnMessageReceived is called), but OnDispatchMessage in the main thread is never called for any of the three msgs.

So this basically looks like the same bug, but this time the child process is the one that's hanging with msgs lost in its in queue.

I'll attach a updated version of patch #3 that has the additional IPC traffic.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I believe that you can just replace patch #3 with this patch (patch #7) and it will work with the other patches.  If not, let me know.
> I believe that you can just replace patch #3 with this patch (patch #7) and it
> will work with the other patches.  If not, let me know.

Works for me, trying it out.
If I let the child delay for 30s before starting, I see

============ContentProcessParent::AllocPTestShell called
===== AsyncChannel::OnMessageReceived: pid=29115
===== AsyncChannel::OnMessageReceived: pid=29115
===== AsyncChannel::OnDispatchMessage: pid=29115
===== AsyncChannel::OnDispatchMessage: pid=29115
parent: TEST-INFO | (xpcshell/head.js) | test 2 pending
===== AsyncChannel::OnMessageReceived: pid=29115
parent: TEST-INFO | (xpcshell/head.js) | test 2 finished
parent: TEST-INFO | (xpcshell/head.js) | running event loop
===== AsyncChannel::OnDispatchMessage: pid=29115
============== START ==========
child: TEST-INFO | (xpcshell/head.js) | test 1 pending
============== setup_test: in
===== AsyncChannel::OnMessageReceived: pid=29110
===== AsyncChannel::OnDispatchMessage: pid=29110
============ContentProcessParent::AllocPNecko called
AAAAAA  HttpChannelChild constructor called
======== HttpChannelStub::AsyncOpen: charset=UTF-8
===== AsyncChannel::OnMessageReceived: pid=29110
===== AsyncChannel::OnDispatchMessage: pid=29110
AAAAAA  HttpChannelParent constructor called
============== setup_test: out
child: TEST-INFO | (xpcshell/head.js) | test 2 pending
===== AsyncChannel::OnMessageReceived: pid=29110
child: TEST-INFO | (xpcshell/head.js) | test 2 finished
child: TEST-INFO | (xpcshell/head.js) | running event loop
===== AsyncChannel::OnDispatchMessage: pid=29110
==== HttpChannelParent::RecvAsyncOpen called

followed by nothingness.  Is this the bug?

Note if I don't let the child wait, I see

============ContentProcessParent::AllocPTestShell called
===== AsyncChannel::OnMessageReceived: pid=29165
===== AsyncChannel::OnMessageReceived: pid=29165
===== AsyncChannel::OnDispatchMessage: pid=29165
===== AsyncChannel::OnDispatchMessage: pid=29165
parent: TEST-INFO | (xpcshell/head.js) | test 2 pending
parent: TEST-INFO | (xpcshell/head.js) | test 2 finished
parent: TEST-INFO | (xpcshell/head.js) | running event loop
===== AsyncChannel::OnMessageReceived: pid=29165
===== AsyncChannel::OnDispatchMessage: pid=29165
============== START ==========
child: TEST-INFO | (xpcshell/head.js) | test 1 pending
============== setup_test: in
WARNING: failed to bind socket: file /home/cjones/mozilla/electrolysis/netwerk/base/src/nsServerSocket.cpp, line 317
child: TEST-UNEXPECTED-FAIL | (xpcshell/head.js) | 2147746065
===== AsyncChannel::OnMessageReceived: pid=29160
===== AsyncChannel::OnDispatchMessage: pid=29160
parent: TEST-INFO | (xpcshell/head.js) | test 1 finished
parent: TEST-INFO | (xpcshell/head.js) | exiting test
parent: TEST-INFO | (xpcshell/head.js) | No (+ 0) checks actually run
jduell, I don't see any non-trivial code in HttpChannelParent::RecvAsyncOpen():

bool HttpChannelParent::RecvAsyncOpen(const nsCString& uriSpec, 
                                      const nsCString &originCharset)
{
printf("==== HttpChannelParent::RecvAsyncOpen called\n");
fflush(stdout);
//sleep(4);
//printf("=====  HttpChannelParent RecvAsyncOpen uri=%s (%s)\n", uriSpec.get(), originCharset.get());
  return true;
}

Shouldn't the request be kicked off here?  Or more generally, what's the expected behavior of this test?
There's also

protocol PHttpChannel
{
  manager PNecko;

parent:
  AsyncOpen(nsCString uriSpec, nsCString originCharset);
};

Should OnStart() et al. be here?
Whoops--attached an old version of the HTTP patch.  Sorry chris!
Attachment #411901 - Attachment is obsolete: true
Attachment #412785 - Attachment is obsolete: true
OK, I've rolled my patches into one big patch that obsoletes all the previous patches.   It applies cleanly for me on the head of e10s, and replicates my client-side hang.
Attachment #411898 - Attachment is obsolete: true
Attachment #411900 - Attachment is obsolete: true
Attachment #411902 - Attachment is obsolete: true
Attachment #411903 - Attachment is obsolete: true
Attachment #411905 - Attachment is obsolete: true
Attachment #412880 - Attachment is obsolete: true
The child-side stack that's causing this is:

#0  MessageLoop::DoWork (this=0x7f2e9b636eb0) at ../../../src/ipc/chromium/src/base/message_loop.cc:416

#1  0x00007f2e9d38fe9e in DoWorkRunnable::Run (this=0x176db90) at ../../../src/ipc/glue/MessagePump.cpp:76

#2  0x00007f2e9d4a2c10 in nsThread::ProcessNextEvent (this=0x1770f50, mayWait=1, result=0x7f2e9b635448) at ../../../src/xpcom/threads/nsThread.cpp:527

#3  0x00007f2e9d4bd5dc in NS_InvokeByIndex_P (that=0x1770f50, methodIndex=8, paramCount=2, params=0x7f2e9b635430)

    at ../../../../../../../src/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_x86_64_unix.cpp:208

#4  0x00007f2e9c208108 in XPCWrappedNative::CallMethod (ccx=@0x7f2e9b6358f0, mode=XPCWrappedNative::CALL_METHOD)

    at ../../../../../src/js/src/xpconnect/src/xpcwrappednative.cpp:2727

#5  0x00007f2e9c214f9d in XPC_WN_CallMethod (cx=0x17ff210, obj=0x7f2e9384c740, argc=1, argv=0x18ed320, vp=0x7f2e9b635aa8)

    at ../../../../../src/js/src/xpconnect/src/xpcwrappednativejsops.cpp:1756

#6  0x0000000000ff6eea in js_Invoke (cx=0x17ff210, argc=1, vp=0x18ed310, flags=2) at ../../../src/js/src/jsinterp.cpp:1375

#7  0x0000000000fe1db9 in js_Interpret (cx=0x17ff210) at ../../../src/js/src/jsops.cpp:2309

#8  0x0000000000ff612a in js_Execute (cx=0x17ff210, chain=0x7f2e93863240, script=0x194e5f0, down=0x0, flags=0, result=0x7f2e9b636748)

    at ../../../src/js/src/jsinterp.cpp:1616

#9  0x0000000000f5ea13 in JS_ExecuteScript (cx=0x17ff210, obj=0x7f2e93863240, script=0x194e5f0, rval=0x7f2e9b636748) at ../../../src/js/src/jsapi.cpp:4968

#10 0x00007f2e9d42bd36 in mozilla::ipc::XPCShellEnvironment::EvaluateString (this=0x18ae8e0, aString=@0x7f2e9b636830, aResult=0x7f2e9b6367f0)

    at ../../../src/ipc/testshell/XPCShellEnvironment.cpp:1243

#11 0x00007f2e9d429bb1 in mozilla::ipc::TestShellChild::RecvPTestShellCommandConstructor (this=0x18545c0, aActor=0x1936320, aCommand=@0x7f2e9b636830)

    at ../../../src/ipc/testshell/TestShellChild.cpp:83

#12 0x00007f2e9d42a297 in mozilla::ipc::PTestShellChild::OnMessageReceived (this=0x18545c0, msg=@0x1930aa0)

    at ../../ipc/ipdl/_ipdlheaders/mozilla/ipc/PTestShellChild.h:141

#13 0x00007f2e9d34e42a in mozilla::dom::PContentProcessChild::OnMessageReceived (this=0x176bf98, msg=@0x1930aa0)

    at ../../ipc/ipdl/_ipdlheaders/mozilla/dom/PContentProcessChild.h:162

#14 0x00007f2e9d38bdbf in mozilla::ipc::AsyncChannel::OnDispatchMessage (this=0x176bfa8, msg=@0x1930aa0) at ../../../src/ipc/glue/AsyncChannel.cpp:178

#15 0x00007f2e9d38c60c in DispatchToMethod<mozilla::ipc::AsyncChannel, void (mozilla::ipc::AsyncChannel::*)(IPC::Message const&), IPC::Message> (obj=0x176bfa8, 

    method=0x7f2e9d38bcdc <mozilla::ipc::AsyncChannel::OnDispatchMessage(IPC::Message const&)>, arg=@0x1930aa0) at ../../../src/ipc/chromium/src/base/tuple.h:393

#16 0x00007f2e9d38c648 in RunnableMethod<mozilla::ipc::AsyncChannel, void (mozilla::ipc::AsyncChannel::*)(IPC::Message const&), Tuple1<IPC::Message> >::Run (

    this=0x1930a70) at ../../../src/ipc/chromium/src/base/task.h:307

#17 0x00007f2e9d3bec5a in MessageLoop::RunTask (this=0x7f2e9b636eb0, task=0x1930a70) at ../../../src/ipc/chromium/src/base/message_loop.cc:326

#18 0x00007f2e9d3bf0a6 in MessageLoop::DeferOrRunPendingTask (this=0x7f2e9b636eb0, pending_task=@0x7f2e9b636af0) at ../../../src/ipc/chromium/src/base/message_loop.cc:334

#19 0x00007f2e9d3bf3a7 in MessageLoop::DoWork (this=0x7f2e9b636eb0) at ../../../src/ipc/chromium/src/base/message_loop.cc:434

#20 0x00007f2e9d38fe9e in DoWorkRunnable::Run (this=0x176db90) at ../../../src/ipc/glue/MessagePump.cpp:76

#21 0x00007f2e9d4a2c10 in nsThread::ProcessNextEvent (this=0x1770f50, mayWait=0, result=0x7f2e9b636c1c) at ../../../src/xpcom/threads/nsThread.cpp:527

#22 0x00007f2e9d43ac42 in NS_ProcessNextEvent_P (thread=0x1770f50, mayWait=0) at nsThreadUtils.cpp:250

#23 0x00007f2e9d38f664 in mozilla::ipc::MessagePump::Run (this=0x176db50, aDelegate=0x7f2e9b636eb0) at ../../../src/ipc/glue/MessagePump.cpp:115

#24 0x00007f2e9d38f95d in mozilla::ipc::MessagePumpForChildProcess::Run (this=0x176db50, aDelegate=0x7f2e9b636eb0) at ../../../src/ipc/glue/MessagePump.cpp:208

#25 0x00007f2e9d3bf729 in MessageLoop::RunInternal (this=0x7f2e9b636eb0) at ../../../src/ipc/chromium/src/base/message_loop.cc:211

#26 0x00007f2e9d3bf741 in MessageLoop::RunHandler (this=0x7f2e9b636eb0) at ../../../src/ipc/chromium/src/base/message_loop.cc:194

#27 0x00007f2e9d3bf7a2 in MessageLoop::Run (this=0x7f2e9b636eb0) at ../../../src/ipc/chromium/src/base/message_loop.cc:168

#28 0x00007f2e9d225a11 in nsBaseAppShell::Run (this=0x1962940) at ../../../../src/widget/src/xpwidgets/nsBaseAppShell.cpp:174

#29 0x00007f2e9c1a994e in XRE_RunAppShell () at ../../../src/toolkit/xre/nsEmbedFunctions.cpp:446

#30 0x00007f2e9d38f89e in mozilla::ipc::MessagePumpForChildProcess::Run (this=0x176db50, aDelegate=0x7f2e9b636eb0) at ../../../src/ipc/glue/MessagePump.cpp:194

#31 0x00007f2e9d3bf729 in MessageLoop::RunInternal (this=0x7f2e9b636eb0) at ../../../src/ipc/chromium/src/base/message_loop.cc:211

#32 0x00007f2e9d3bf741 in MessageLoop::RunHandler (this=0x7f2e9b636eb0) at ../../../src/ipc/chromium/src/base/message_loop.cc:194

#33 0x00007f2e9d3bf7a2 in MessageLoop::Run (this=0x7f2e9b636eb0) at ../../../src/ipc/chromium/src/base/message_loop.cc:168

#34 0x00007f2e9d3e2c5c in base::Thread::ThreadMain (this=0x176bef0) at ../../../src/ipc/chromium/src/base/thread.cc:165

#35 0x00007f2e9d4126ac in ThreadFunc (closure=0x176bef0) at ../../../src/ipc/chromium/src/base/platform_thread_posix.cc:26

#36 0x000000336ba073da in start_thread () from /lib64/libpthread.so.0

#37 0x000000336aee627d in clone () from /lib64/libc.so.6

JS is spinning a nested loop, which Chromium thinks isn't ok, so it's never dispatching the messages in the MessageLoop.

<cjones> bsmedberg, looks like JS is spinning a nested loop
<bsmedberg> yeah, that's how xpcshell works
<cjones> sure
<cjones> there's a Loop::SetNestedAllowed(bool) method or something, i think we just need to call that
<cjones> is there any way to tell whether we're in a nested JS loop?
<cjones> we might need this for firefox-bin too, for extensions et al.
<bsmedberg> well, we could just always say that nesting is ok...
* jduell_ has quit (Quit: Leaving)
* bsmedberg isn't sure what the nesting thing is supposed to be solving
<cjones> preventing nested event loops when they're unexpected
<bsmedberg> e.g. have DoWorkRunnable::Run call SetNestedAllowed
<cjones> ?
<bsmedberg> when are they unexpected?
<cjones> when they're not expected ;)
Assignee: bent.mozilla → benjamin
Attachment #412948 - Flags: review?(jones.chris.g)
Attachment #412948 - Flags: review?(bent.mozilla)
Comment on attachment 412948 [details] [diff] [review]
Allow nested MessageLoop processing, rev. 1

Thanks for diagnosing this!

I think we should do something slightly different though. Patch in a sec.
Attachment #412948 - Flags: review?(bent.mozilla) → review-
Attachment #412948 - Attachment is obsolete: true
Attachment #412948 - Flags: review?(jones.chris.g)
Comment on attachment 412955 [details] [diff] [review]
Allow nested MessageLoop processing, rev. 2

Pushed as changeset 090fc166ea4b.
Hopefully that does it.
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Attachment #412955 - Flags: review?(benjamin)
You need to log in before you can comment on or make changes to this bug.