Closed Bug 588216 Opened 9 years ago Closed 9 years ago

HttpChannelChild OnStopRequest message received but never processed by event loop

Categories

(Core :: IPC, defect)

x86_64
Linux
defect
Not set

Tracking


RESOLVED FIXED
Tracking Status
blocking2.0 --- beta5+
blocking1.9.2 --- .11+
status1.9.2 --- .11-fixed

People

(Reporter: dwitte, Assigned: cjones)

References

Details

(Whiteboard: [qa-ntd-192])

Attachments

(3 files)

Over in bug 580525 we have an intermittent hang that affects all the necko e10s tests. It hangs at various points and is reproducible across pretty much all platforms.

After some debugging I've found the following:

* HttpChannelParent::OnStopRequest() is called, which calls SendOnStopRequest to the child
* This call returns true
* HttpChannelChild::RecvOnStopRequest() is never called.

IPC log looks like:

[time:1282083965611089][PHttpChannelParent] Sending Msg_OnStartRequest([TODO])
[time:1282083965611266][PHttpChannelParent] Sending Msg_OnStatus([TODO])
[time:1282083965611286][PHttpChannelParent] Sending Msg_OnProgress([TODO])
[time:1282083965611357][PHttpChannelParent] Sending Msg_OnDataAvailable([TODO])
[time:1282083965611472][PHttpChannelParent] Sending Msg_OnStopRequest([TODO])
[time:1282083965611611][PHttpChannelChild] Received Msg_OnStatus([TODO])
[time:1282083965611730][PHttpChannelChild] Received Msg_OnStartRequest([TODO])
onStartRequest [[xpconnect wrapped (nsISupports, nsIChannel, nsIRequest) @ 0x7f76d80c3120 (native @ 0x7f76d7006430)]]
onStartRequest end [[xpconnect wrapped (nsISupports, nsIChannel, nsIRequest) @ 0x7f76d80c3120 (native @ 0x7f76d7006430)]]
[time:1282083965612259][PHttpChannelChild] Received Msg_OnStatus([TODO])
[time:1282083965612435][PHttpChannelChild] Received Msg_OnProgress([TODO])
[time:1282083965612520][PHttpChannelChild] Received Msg_OnDataAvailable([TODO])
onDataAvailable [[xpconnect wrapped (nsISupports, nsIChannel, nsIRequest) @ 0x7f76d80c3120 (native @ 0x7f76d7006430)]]
onDataAvailable end [[xpconnect wrapped (nsISupports, nsIChannel, nsIRequest) @ 0x7f76d80c3120 (native @ 0x7f76d7006430)]]

IPC guys, any thoughts?
Attached patch: debug printfs
With this patch, I see (in both a passing and a failing test):

===== RPCChannel::OnMessageReceived begin
===== RPCChannel::OnMessageReceived pushed
===== RPCChannel::OnMessageReceived posting task

So the RPCChannel is receiving the message and posting it to the main thread, but the main thread never gets to it. (11274 is the OnStopRequest message type.)
Confirmed that RPCChannel::OnMaybeDequeueOne() for that message never gets called by the event loop.
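To make the dispatch path concrete, here is a minimal, self-contained sketch of the pattern in play. The names (EventQueue, Channel, Post, and so on) are made up for illustration and are not the real RPCChannel/nsThread code: the IO thread pushes the message into a pending queue and posts a "dequeue one" task to the main thread, and the hang means that posted task never runs.

// Illustrative sketch only: hypothetical names, not the actual RPCChannel code.
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

struct EventQueue {                       // stands in for the XPCOM main-thread queue
  std::mutex lock;
  std::queue<std::function<void()>> tasks;

  void Post(std::function<void()> task) {
    std::lock_guard<std::mutex> guard(lock);
    tasks.push(std::move(task));
    // A real queue must also wake whatever loop drains it; the hang in this
    // bug is about that wakeup never happening, not about the push itself.
  }
};

struct Channel {                          // stands in for RPCChannel
  std::mutex lock;
  std::queue<std::string> mPending;       // messages received on the IO thread
  EventQueue* mainQueue = nullptr;

  // IO thread: the "begin" / "pushed" / "posting task" steps in the printfs above.
  void OnMessageReceived(std::string msg) {
    {
      std::lock_guard<std::mutex> guard(lock);
      mPending.push(std::move(msg));
    }
    mainQueue->Post([this] { OnMaybeDequeueOne(); });
  }

  // Main thread: in the hang, this task sits in the queue and never runs.
  void OnMaybeDequeueOne() {
    std::lock_guard<std::mutex> guard(lock);
    if (!mPending.empty())
      mPending.pop();                     // process one pending message
  }
};

int main() {
  EventQueue mainThread;
  Channel channel;
  channel.mainQueue = &mainThread;
  channel.OnMessageReceived("Msg_OnStopRequest");
  // Draining the queue is the step the stuck event loop never gets around to.
  while (!mainThread.tasks.empty()) {
    mainThread.tasks.front()();
    mainThread.tasks.pop();
  }
}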
Summary: HttpChannelChild OnStopRequest message sent but never received → HttpChannelChild OnStopRequest message received but never processed by event loop
I doubt this is terribly relevant, but the IPC logs from good/bad have a different sequence of "OnStatus" messages just before the hang.  It looks like in the "hang" case, we have one less OnStatus.

good:
 P                                   C
-----------                         -----------
 Sending Msg_Redirect3Complete
 Sending Msg___delete__
                                     Received Msg_Redirect3Complete
                                     Received Msg___delete__
 Sending Msg_OnStatus
 Sending Msg_OnStatus
                                     Received Msg_OnStatus
 Sending Msg_OnStatus
                                     Received Msg_OnStatus
                                     Received Msg_OnStatus
 Sending Msg_OnStartRequest
 Sending Msg_OnStatus
 Sending Msg_OnProgress
 Sending Msg_OnDataAvailable
 Sending Msg_OnStopRequest
                                     Received Msg_OnStartRequest
                                     Received Msg_OnStatus
                                     Received Msg_OnProgress
                                     Received Msg_OnDataAvailable
                                     Received Msg_OnStopRequest

bad:
 P                                   C
-----------                         -----------
 Sending Msg_Redirect3Complete
 Sending Msg___delete__
                                     Received Msg_Redirect3Complete
 Sending Msg_OnStatus
                                     Received Msg___delete__
 Sending Msg_OnStatus
                                     Received Msg_OnStatus
                                     Received Msg_OnStatus
 Sending Msg_OnStartRequest
 Sending Msg_OnStatus
 Sending Msg_OnProgress
 Sending Msg_OnDataAvailable
 Sending Msg_OnStopRequest
                                     Received Msg_OnStartRequest
                                     Received Msg_OnStatus
                                     Received Msg_OnProgress
                                     Received Msg_OnDataAvailable

Just FYI.  Still hunting underlying problem.
Again just FTR, bad news in nested xpcshell/XPCOM/glib/chromium Frankenventloop land:

(gdb) f 24
#24 0x00002ab536cc1ffa in mozilla::ipc::RPCChannel::OnMaybeDequeueOne (this=0xdd3e40) at /home/cjones/mozilla/mozilla-central/ipc/glue/RPCChannel.cpp:438
(gdb) p this->mPending.size()
$138 = 1
(gdb) p this->mDequeueOneTask
$139 = {
  mRawPtr = 0xddedf0
}
-----
(gdb) f 7
#7  0x00002ab536ed53cf in nsThread::ProcessNextEvent (this=0xde4b70, mayWait=1, result=0x7fff97163108) at /home/cjones/mozilla/mozilla-central/xpcom/threads/nsThread.cpp:517
(gdb) p mEvents->HasPendingEvent()
$142 = 1


The Necko IPC message came through: RPCChannel has it in its pending queue and has a task enqueued to process it, and the XPCOM thread queue has an event enqueued to run that RPCChannel task. But the appshell is spinning in the glib event loop, which implies to me that when the XPCOM event was enqueued, the pipe-write that's supposed to wake up the glib loop was somehow missed or flubbed. Will keep looking. Still not enough information for me to make a call on blocking-2.0a1.
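For context on that "pipe-write": the native glib loop sleeps in poll() and is woken by writing to a pipe (or eventfd) it watches. The sketch below shows the general self-pipe technique with a plain poll() loop; it is an assumption-laden illustration, not the actual appshell/glib GSource code, but it shows why a skipped write leaves the loop asleep even though work is already queued.

// Illustrative self-pipe wakeup; not the real appshell integration.
#include <poll.h>
#include <unistd.h>
#include <cstdio>

int wakeup_pipe[2];   // [0] = read end watched by the loop, [1] = write end

// Called from any thread after enqueuing work for the main thread.
void WakeMainLoop() {
  char byte = 1;
  // If this write is skipped (or raced away), the poll() below keeps sleeping
  // even though an event is already queued -- the symptom seen here.
  (void)write(wakeup_pipe[1], &byte, 1);
}

void MainLoopIteration() {
  struct pollfd fds[] = {{ wakeup_pipe[0], POLLIN, 0 }};
  poll(fds, 1, -1);                                      // blocks until woken
  char drain[64];
  (void)read(wakeup_pipe[0], drain, sizeof drain);       // clear the wakeup
  // ...now drain the XPCOM event queue...
}

int main() {
  if (pipe(wakeup_pipe) != 0)
    return 1;
  WakeMainLoop();        // demo: wake ourselves so the iteration returns
  MainLoopIteration();
  std::puts("woke up and processed events");
}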

This is a ridiculous setup.  Rewriting this mess is now high on my post-4.0 TODO list.
This looks like a general (possibly/likely still only xpcshell) problem that happens to be especially biting OnStop. I just caught PCookieServiceChild::GetCookieString being dropped on the *parent* side, leading to a deadlock. Similar-looking symptoms on the parent side to those in comment 4.

(gdb) p &mozilla::dom::ContentParent::gSingleton->mChannel
$3 = (mozilla::ipc::RPCChannel *) 0x213e9b0
(gdb) p $3->mPending.size()
$4 = 1
(gdb) p $3->mDequeueOneTask
$5 = {
  mRawPtr = 0x21497d0
}


There's a pending message, and a task enqueued to process it, but parent is spinning at

(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2
#1  0x00002aac5b0bad2a in PR_WaitCondVar (cvar=0x1e4c860, timeout=4294967295) at /home/cjones/mozilla/cedar/nsprpub/pr/src/pthreads/ptsynch.c:417
#2  0x00002aac5b0bb4bc in PR_Wait (mon=0x1e55e50, timeout=4294967295) at /home/cjones/mozilla/cedar/nsprpub/pr/src/pthreads/ptsynch.c:614
#3  0x00002aac579ec280 in nsAutoMonitor::Wait (this=0x7fff7c405080, interval=4294967295) at ../../../dist/include/nsAutoLock.h:346
#4  0x00002aac58f4dd7d in nsEventQueue::GetEvent (this=0x1e55d48, mayWait=1, result=0x7fff7c405140) at /home/cjones/mozilla/cedar/xpcom/threads/nsEventQueue.cpp:85
#5  0x00002aac58f50cbc in nsThread::nsChainedEventQueue::GetEvent (this=0x1e55d38, mayWait=1, event=0x7fff7c405140) at /home/cjones/mozilla/cedar/xpcom/threads/nsThread.h:112
#6  0x00002aac58f5033f in nsThread::ProcessNextEvent (this=0x1e55d00, mayWait=1, result=0x7fff7c4053c8) at /home/cjones/mozilla/cedar/xpcom/threads/nsThread.cpp:530


It looks like there's some subtle case in which posting a chromium task fails to awaken the XPCOM loop.
Mentioned this on IRC to dwitte and jst: because the bug here looks to be missed/fumbled chromium->XPCOM task posting, it's possible that it's latent in browser but never surfaces because we have so many timers and other stuff going on.  We wouldn't have seen it in plugin processes or IPDL tests because they're both non-XPCOM.

That said, I'm working on a test to repro this better than the 1/50-1/200 rate I'm seeing in the necko test, then I'll be able to check my initial hypothesis.
(In reply to comment #6)
> Mentioned this on IRC to dwitte and jst: because the bug here looks to be
> missed/fumbled chromium->XPCOM task posting,

Rather, missed XPCOM wakeup on a posted chromium task.
Attached patch: Simpler testcase
This fails pretty consistently, in about 1 in 20 runs on my machine. Hopefully I can write a better test when I have more info.
Assignee: nobody → jones.chris.g
I should add that the child process spinning a nested XPCOM event loop (like xpcshell-bin) is intentional; I couldn't repro the lost task when the child process used the normal event loop.  Not sure what that means yet.
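For readers unfamiliar with the phrasing, "spinning a nested XPCOM event loop" means an event handler re-enters ProcessNextEvent() and keeps draining the same thread's queue until some condition holds, which is how xpcshell tests wait for async answers. Below is a self-contained sketch of that shape; ThreadQueue, Dispatch, and ProcessNextEvent are illustrative names, not the real nsThread API. The point is that the nested loop only wakes on its own queue's notification, so it is entirely dependent on whoever posts the next event actually signalling it.

// Illustrative nested event loop; made-up names, not Gecko code.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

struct ThreadQueue {
  std::mutex lock;
  std::condition_variable wake;
  std::queue<std::function<void()>> events;

  void Dispatch(std::function<void()> e) {
    { std::lock_guard<std::mutex> g(lock); events.push(std::move(e)); }
    wake.notify_one();             // the wakeup the (nested) loop depends on
  }

  // ProcessNextEvent(mayWait = true): blocks until an event is available.
  void ProcessNextEvent() {
    std::function<void()> e;
    {
      std::unique_lock<std::mutex> g(lock);
      wake.wait(g, [&] { return !events.empty(); });
      e = std::move(events.front());
      events.pop();
    }
    e();                           // the event itself may spin another nested loop
  }
};

int main() {
  ThreadQueue mainThread;
  bool done = false;
  // An event that spins a nested loop until it gets its answer, the way an
  // xpcshell test waits for an async reply.
  mainThread.Dispatch([&] {
    while (!done)
      mainThread.ProcessNextEvent();   // nested spin
  });
  mainThread.Dispatch([&] { done = true; });
  mainThread.ProcessNextEvent();       // outer loop runs the nesting event
}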
OK, here's the bug:

diff --git a/ipc/glue/MessagePump.cpp b/ipc/glue/MessagePump.cpp
--- a/ipc/glue/MessagePump.cpp
+++ b/ipc/glue/MessagePump.cpp
@@ -146,16 +146,17 @@ MessagePump::Run(MessagePump::Delegate* 
 
   keep_running_ = true;
 }
 
 void
 MessagePump::ScheduleWork()
 {
   if (gRunningSetNestableTasksAllowed) {
+    printf("\n    RUNNING SETNESTABLETASKSALLOWED\n\n");
     return;
   }
 
   // Make sure the event loop wakes up.
   if (mThread) {
     mThread->Dispatch(mDoWorkEvent, NS_DISPATCH_NORMAL);
   }
   else {


We get "    RUNNING SETNESTABLETASKSALLOWED" on hung trials but not on others.  This means that the IO thread is trying to ScheduleWork() while the main thread is running MessageLoop::SetNestableTasksAllowed().  It's a good ol' fashioned shared-memory race, and we won't be able to write a deterministic test for it.  However, the fix is easy.
Also, I believe we're susceptible to this bug in nested XPCOM loops in the browser, but apparently haven't hit it because "other stuff" eventually wakes us up and the message gets processed.
This bug prevents us from getting useful xpcshell test results.
tracking-fennec: --- → ?
Ran this patch against the attached test 500 times without a hang.

bent, we need this reviewed ASAP because it's splatting orange all over repos that run OOP xpcshell tests.
Attachment #467348 - Flags: review?(bent.mozilla)
Comment on attachment 467348 [details] [diff] [review]
Avoid race between IO-thread loop->PostTask() and main-thread loop->SetNestableTasksAllowed() that led to Tasks being ignored

Ok, looks good.
Attachment #467348 - Flags: review?(bent.mozilla) → review+
blocking2.0: --- → beta5+
There's a chance that this bug is causing delays in processing plugin in-messages in firefox-bin; the patch should be safe for the branch.
blocking1.9.2: --- → ?
http://hg.mozilla.org/mozilla-central/rev/69c6ce104f45
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Is the branch patch going to be the same? If so please request approval1.9.2.10? on it.
blocking1.9.2: ? → needed
Comment on attachment 467348 [details] [diff] [review]
Avoid race between IO-thread loop->PostTask() and main-thread loop->SetNestableTasksAllowed() that led to Tasks being ignored

Er, sorry.  Yes, it will be the same.
Attachment #467348 - Flags: approval1.9.2.10?
Could this have caused a Dromaeo regression? See dev.tree-management thread "Dromaeo (V8) decrease 9.99% on Win7 Firefox"
Looking at the graph it seems that the regression magically disappeared. Some other cause perhaps?
bent responded to beltzner, but the CC to the mailing list didn't reach the newsgroup. He said:

"Hm. I see the dip, but it looks like the numbers have all come back up since then? This wasn't backed out, so maybe we're seeing something else?

Though I must confess I may be misreading this graph..."

It's basically impossible for this to have regressed Dromaeo unless Dromaeo is loading OOPPs, and if that's true then it needs to be taken out and shot. This patch touches code that runs even without OOPP, but it actually reduces the number of instructions executed by 3 or 4 when we jump from the chromium->XPCOM loop.

I call BS.
Did the test land on trunk as well? We'd want a test with this for branch. Please do a rollup patch or ask for approval for the test patch as well.
Whiteboard: [needs answer to comment 23]
Flags: in-testsuite?
Comment on attachment 467348 [details] [diff] [review]
Avoid race between IO-thread loop->PostTask() and main-thread loop->SetNestableTasksAllowed() that led to Tasks being ignored

Clearing approval request pending answer to test question in comment 23
Attachment #467348 - Flags: approval1.9.2.10? → approval1.9.2.10-
As I said in comment 10, there's no way to write a deterministic test for this bug. The "test" I wrote isn't better than the other multiprocess xpcshell tests in that sense. I just twiddled its parameters so that it tickled the bug more frequently on my computer; mileage will vary on tinderboxen.
Comment on attachment 467348 [details] [diff] [review]
Avoid race between IO-thread loop->PostTask() and main-thread loop->SetNestableTasksAllowed() that led to Tasks being ignored

Renominating; the test is unreliable and not something we can actually check in.
Attachment #467348 - Flags: approval1.9.2.10- → approval1.9.2.10?
Comment on attachment 467348 [details] [diff] [review]
Avoid race between IO-thread loop->PostTask() and main-thread loop->SetNestableTasksAllowed() that led to Tasks being ignored

Approved for 1.9.2.10, a=dveditz for release-drivers

Do we need to be concerned that our patches make us diverge from the upstream IPC, making it hard to take future upstream improvements? I know the <video> code is set up to apply our fixes as .patch files to the upstream.
Attachment #467348 - Flags: approval1.9.2.10? → approval1.9.2.10+
Whiteboard: [needs answer to comment 23]
blocking1.9.2: needed → .10+
(In reply to comment #27)
> Do we need to be concerned that our patches make us diverge from the upstream
> IPC, making it hard to take future upstream improvements? I know the <video>
> code is set up to apply our fixes as .patch files to the upstream.

I don't think so.  The "upstream" IPC code isn't designed to be a standalone library, there are no source or binary compat guarantees, and we don't plan on using this code post-4.x.  We've cherry-picked bug fixes from chromium as needed.

http://hg.mozilla.org/releases/mozilla-1.9.2/rev/e2aa5b39803f
So, if I'm reading this right, I believe there is no deterministic way to trigger this issue for the purposes of QA verification of the fix. Is this correct?
Whiteboard: [qa-examined-192] [qa-needs-str]
> I believe there is no deterministic way to trigger this issue for the purposes
> of QA verification of the fix.

Correct.
Marking this as having nothing for QA to do.
Whiteboard: [qa-examined-192] [qa-needs-str] → [qa-ntd-192]
tracking-fennec: ? → ---