Closed Bug 588216 Opened 9 years ago Closed 9 years ago

HttpChannelChild OnStopRequest message received but never processed by event loop

Categories

(Core :: IPC, defect)

x86_64
Linux
defect
Not set

Tracking


RESOLVED FIXED
Tracking Status
blocking2.0 --- beta5+
blocking1.9.2 --- .11+
status1.9.2 --- .11-fixed

People

(Reporter: dwitte, Assigned: cjones)

References

Details

(Whiteboard: [qa-ntd-192])

Attachments

(3 files)

Over in bug 580525 we have an intermittent hang that affects all the necko e10s tests. It hangs at various points and is reproducible across pretty much all platforms.

After some debugging I've found the following:

* HttpChannelParent::OnStopRequest() is called, which calls SendOnStopRequest to the child
* This call returns true
* HttpChannelChild::RecvOnStopRequest() is never called.

IPC log looks like:

[time:1282083965611089][PHttpChannelParent] Sending Msg_OnStartRequest([TODO])
[time:1282083965611266][PHttpChannelParent] Sending Msg_OnStatus([TODO])
[time:1282083965611286][PHttpChannelParent] Sending Msg_OnProgress([TODO])
[time:1282083965611357][PHttpChannelParent] Sending Msg_OnDataAvailable([TODO])
[time:1282083965611472][PHttpChannelParent] Sending Msg_OnStopRequest([TODO])
[time:1282083965611611][PHttpChannelChild] Received Msg_OnStatus([TODO])
[time:1282083965611730][PHttpChannelChild] Received Msg_OnStartRequest([TODO])
onStartRequest [[xpconnect wrapped (nsISupports, nsIChannel, nsIRequest) @ 0x7f76d80c3120 (native @ 0x7f76d7006430)]]
onStartRequest end [[xpconnect wrapped (nsISupports, nsIChannel, nsIRequest) @ 0x7f76d80c3120 (native @ 0x7f76d7006430)]]
[time:1282083965612259][PHttpChannelChild] Received Msg_OnStatus([TODO])
[time:1282083965612435][PHttpChannelChild] Received Msg_OnProgress([TODO])
[time:1282083965612520][PHttpChannelChild] Received Msg_OnDataAvailable([TODO])
onDataAvailable [[xpconnect wrapped (nsISupports, nsIChannel, nsIRequest) @ 0x7f76d80c3120 (native @ 0x7f76d7006430)]]
onDataAvailable end [[xpconnect wrapped (nsISupports, nsIChannel, nsIRequest) @ 0x7f76d80c3120 (native @ 0x7f76d7006430)]]

IPC guys, any thoughts?
Attached patch: debug printfs
With this patch, I see (in both a passing and a failing test):

===== RPCChannel::OnMessageReceived begin
===== RPCChannel::OnMessageReceived pushed
===== RPCChannel::OnMessageReceived posting task

So the RPCChannel is receiving the message and posting it to the main thread, but the main thread never gets to it. (11274 is the OnStopRequest message type.)
Confirmed that RPCChannel::OnMaybeDequeueOne() for that message never gets called by the event loop.
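To make the dispatch path concrete, here is a minimal, self-contained sketch of the pattern in play. The names (EventQueue, Channel, Post, and so on) are made up for illustration and are not the real RPCChannel/nsThread code: the IO thread pushes the message into a pending queue and posts a "dequeue one" task to the main thread, and the hang means that posted task never runs.

// Illustrative sketch only: hypothetical names, not the actual RPCChannel code.
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

struct EventQueue {                       // stands in for the XPCOM main-thread queue
  std::mutex lock;
  std::queue<std::function<void()>> tasks;

  void Post(std::function<void()> task) {
    std::lock_guard<std::mutex> guard(lock);
    tasks.push(std::move(task));
    // A real queue must also wake whatever loop drains it; the hang in this
    // bug is about that wakeup never happening, not about the push itself.
  }
};

struct Channel {                          // stands in for RPCChannel
  std::mutex lock;
  std::queue<std::string> mPending;       // messages received on the IO thread
  EventQueue* mainQueue = nullptr;

  // IO thread: the "begin" / "pushed" / "posting task" steps in the printfs above.
  void OnMessageReceived(std::string msg) {
    {
      std::lock_guard<std::mutex> guard(lock);
      mPending.push(std::move(msg));
    }
    mainQueue->Post([this] { OnMaybeDequeueOne(); });
  }

  // Main thread: in the hang, this task sits in the queue and never runs.
  void OnMaybeDequeueOne() {
    std::lock_guard<std::mutex> guard(lock);
    if (!mPending.empty())
      mPending.pop();                     // process one pending message
  }
};

int main() {
  EventQueue mainThread;
  Channel channel;
  channel.mainQueue = &mainThread;
  channel.OnMessageReceived("Msg_OnStopRequest");
  // Draining the queue is the step the stuck event loop never gets around to.
  while (!mainThread.tasks.empty()) {
    mainThread.tasks.front()();
    mainThread.tasks.pop();
  }
}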
Summary: HttpChannelChild OnStopRequest message sent but never received → HttpChannelChild OnStopRequest message received but never processed by event loop
I doubt this is terribly relevant, but the IPC logs from good/bad have a different sequence of "OnStatus" messages just before the hang.  It looks like in the "hang" case, we have one less OnStatus.

good:
 P                                   C
-----------                         -----------
 Sending Msg_Redirect3Complete
 Sending Msg___delete__
                                     Received Msg_Redirect3Complete
                                     Received Msg___delete__
 Sending Msg_OnStatus
 Sending Msg_OnStatus
                                     Received Msg_OnStatus
 Sending Msg_OnStatus
                                     Received Msg_OnStatus
                                     Received Msg_OnStatus
 Sending Msg_OnStartRequest
 Sending Msg_OnStatus
 Sending Msg_OnProgress
 Sending Msg_OnDataAvailable
 Sending Msg_OnStopRequest
                                     Received Msg_OnStartRequest
                                     Received Msg_OnStatus
                                     Received Msg_OnProgress
                                     Received Msg_OnDataAvailable
                                     Received Msg_OnStopRequest

bad:
 P                                   C
-----------                         -----------
 Sending Msg_Redirect3Complete
 Sending Msg___delete__
                                     Received Msg_Redirect3Complete
 Sending Msg_OnStatus
                                     Received Msg___delete__
 Sending Msg_OnStatus
                                     Received Msg_OnStatus
                                     Received Msg_OnStatus
 Sending Msg_OnStartRequest
 Sending Msg_OnStatus
 Sending Msg_OnProgress
 Sending Msg_OnDataAvailable
 Sending Msg_OnStopRequest
                                     Received Msg_OnStartRequest
                                     Received Msg_OnStatus
                                     Received Msg_OnProgress
                                     Received Msg_OnDataAvailable

Just FYI.  Still hunting underlying problem.
Again just FTR, bad news in nested xpcshell/XPCOM/glib/chromium Frankenventloop land:

(gdb) f 24
#24 0x00002ab536cc1ffa in mozilla::ipc::RPCChannel::OnMaybeDequeueOne (this=0xdd3e40) at /home/cjones/mozilla/mozilla-central/ipc/glue/RPCChannel.cpp:438
(gdb) p this->mPending.size()
$138 = 1
(gdb) p this->mDequeueOneTask
$139 = {
  mRawPtr = 0xddedf0
}
-----
(gdb) f 7
#7  0x00002ab536ed53cf in nsThread::ProcessNextEvent (this=0xde4b70, mayWait=1, result=0x7fff97163108) at /home/cjones/mozilla/mozilla-central/xpcom/threads/nsThread.cpp:517
(gdb) p mEvents->HasPendingEvent()
$142 = 1


The Necko IPC message came through: RPCChannel has it in its pending queue and has a task enqueued to process it, and the XPCOM thread queue has an event enqueued to run that RPCChannel task. But the appshell is spinning in the glib event loop, which implies to me that when the XPCOM event was enqueued, the pipe-write that's supposed to wake up the glib loop was somehow missed or flubbed. Will keep looking. Still not enough information for me to make a call on blocking-2.0a1.
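For context on that "pipe-write": the native glib loop sleeps in poll() and is woken by writing to a pipe (or eventfd) it watches. The sketch below shows the general self-pipe technique with a plain poll() loop; it is an assumption-laden illustration, not the actual appshell/glib GSource code, but it shows why a skipped write leaves the loop asleep even though work is already queued.

// Illustrative self-pipe wakeup; not the real appshell integration.
#include <poll.h>
#include <unistd.h>
#include <cstdio>

int wakeup_pipe[2];   // [0] = read end watched by the loop, [1] = write end

// Called from any thread after enqueuing work for the main thread.
void WakeMainLoop() {
  char byte = 1;
  // If this write is skipped (or raced away), the poll() below keeps sleeping
  // even though an event is already queued -- the symptom seen here.
  (void)write(wakeup_pipe[1], &byte, 1);
}

void MainLoopIteration() {
  struct pollfd fds[] = {{ wakeup_pipe[0], POLLIN, 0 }};
  poll(fds, 1, -1);                                      // blocks until woken
  char drain[64];
  (void)read(wakeup_pipe[0], drain, sizeof drain);       // clear the wakeup
  // ...now drain the XPCOM event queue...
}

int main() {
  if (pipe(wakeup_pipe) != 0)
    return 1;
  WakeMainLoop();        // demo: wake ourselves so the iteration returns
  MainLoopIteration();
  std::puts("woke up and processed events");
}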

This is a ridiculous setup.  Rewriting this mess is now high on my post-4.0 TODO list.
This looks like a general (possibly/likely still only xpcshell) problem that happens to be especially biting OnStop. I just caught PCookieServiceChild::GetCookieString being dropped on the *parent* side, leading to a deadlock. Similar-looking symptoms on the parent side to those in comment 4.

(gdb) p &mozilla::dom::ContentParent::gSingleton->mChannel
$3 = (mozilla::ipc::RPCChannel *) 0x213e9b0
(gdb) p $3->mPending.size()
$4 = 1
(gdb) p $3->mDequeueOneTask
$5 = {
  mRawPtr = 0x21497d0
}


There's a pending message, and a task enqueued to process it, but parent is spinning at

(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2
#1  0x00002aac5b0bad2a in PR_WaitCondVar (cvar=0x1e4c860, timeout=4294967295) at /home/cjones/mozilla/cedar/nsprpub/pr/src/pthreads/ptsynch.c:417
#2  0x00002aac5b0bb4bc in PR_Wait (mon=0x1e55e50, timeout=4294967295) at /home/cjones/mozilla/cedar/nsprpub/pr/src/pthreads/ptsynch.c:614
#3  0x00002aac579ec280 in nsAutoMonitor::Wait (this=0x7fff7c405080, interval=4294967295) at ../../../dist/include/nsAutoLock.h:346
#4  0x00002aac58f4dd7d in nsEventQueue::GetEvent (this=0x1e55d48, mayWait=1, result=0x7fff7c405140) at /home/cjones/mozilla/cedar/xpcom/threads/nsEventQueue.cpp:85
#5  0x00002aac58f50cbc in nsThread::nsChainedEventQueue::GetEvent (this=0x1e55d38, mayWait=1, event=0x7fff7c405140) at /home/cjones/mozilla/cedar/xpcom/threads/nsThread.h:112
#6  0x00002aac58f5033f in nsThread::ProcessNextEvent (this=0x1e55d00, mayWait=1, result=0x7fff7c4053c8) at /home/cjones/mozilla/cedar/xpcom/threads/nsThread.cpp:530


It looks like there's some subtle case in which posting a chromium task fails to awaken the XPCOM loop.
Mentioned this on IRC to dwitte and jst: because the bug here looks to be missed/fumbled chromium->XPCOM task posting, it's possible that it's latent in browser but never surfaces because we have so many timers and other stuff going on.  We wouldn't have seen it in plugin processes or IPDL tests because they're both non-XPCOM.

That said, I'm working on a test to repro this better than the 1/50-1/200 rate I'm seeing in the necko test, then I'll be able to check my initial hypothesis.
(In reply to comment #6)
> Mentioned this on IRC to dwitte and jst: because the bug here looks to be
> missed/fumbled chromium->XPCOM task posting,

Rather, missed XPCOM wakeup on a posted chromium task.
Attached patch: Simpler testcase
This fails pretty consistently, in about 1 in 20 runs on my machine. Hopefully I can write a better test when I have more info.
Assignee: nobody → jones.chris.g
I should add that the child process spinning a nested XPCOM event loop (like xpcshell-bin) is intentional; I couldn't repro the lost task when the child process used the normal event loop.  Not sure what that means yet.
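For readers unfamiliar with the phrasing, "spinning a nested XPCOM event loop" means an event handler re-enters ProcessNextEvent() and keeps draining the same thread's queue until some condition holds, which is how xpcshell tests wait for async answers. Below is a self-contained sketch of that shape; ThreadQueue, Dispatch, and ProcessNextEvent are illustrative names, not the real nsThread API. The point is that the nested loop only wakes on its own queue's notification, so it is entirely dependent on whoever posts the next event actually signalling it.

// Illustrative nested event loop; made-up names, not Gecko code.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

struct ThreadQueue {
  std::mutex lock;
  std::condition_variable wake;
  std::queue<std::function<void()>> events;

  void Dispatch(std::function<void()> e) {
    { std::lock_guard<std::mutex> g(lock); events.push(std::move(e)); }
    wake.notify_one();             // the wakeup the (nested) loop depends on
  }

  // ProcessNextEvent(mayWait = true): blocks until an event is available.
  void ProcessNextEvent() {
    std::function<void()> e;
    {
      std::unique_lock<std::mutex> g(lock);
      wake.wait(g, [&] { return !events.empty(); });
      e = std::move(events.front());
      events.pop();
    }
    e();                           // the event itself may spin another nested loop
  }
};

int main() {
  ThreadQueue mainThread;
  bool done = false;
  // An event that spins a nested loop until it gets its answer, the way an
  // xpcshell test waits for an async reply.
  mainThread.Dispatch([&] {
    while (!done)
      mainThread.ProcessNextEvent();   // nested spin
  });
  mainThread.Dispatch([&] { done = true; });
  mainThread.ProcessNextEvent();       // outer loop runs the nesting event
}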
OK, here's the bug:

diff --git a/ipc/glue/MessagePump.cpp b/ipc/glue/MessagePump.cpp
--- a/ipc/glue/MessagePump.cpp
+++ b/ipc/glue/MessagePump.cpp
@@ -146,16 +146,17 @@ MessagePump::Run(MessagePump::Delegate* 
 
   keep_running_ = true;
 }
 
 void
 MessagePump::ScheduleWork()
 {
   if (gRunningSetNestableTasksAllowed) {
+    printf("\n    RUNNING SETNESTABLETASKSALLOWED\n\n");
     return;
   }
 
   // Make sure the event loop wakes up.
   if (mThread) {
     mThread->Dispatch(mDoWorkEvent, NS_DISPATCH_NORMAL);
   }
   else {


We get "    RUNNING SETNESTABLETASKSALLOWED" on hung trials but not on others.  This means that the IO thread is trying to ScheduleWork() while the main thread is running MessageLoop::SetNestableTasksAllowed().  It's a good ol' fashioned shared-memory race, and we won't be able to write a deterministic test for it.  However, the fix is easy.
Also, I believe we're susceptible to this bug in nested XPCOM loops in the browser, but apparently haven't hit it because "other stuff" eventually wakes us up and the message gets processed.
This bug prevents us from getting useful xpcshell test results.
tracking-fennec: --- → ?
Ran this patch against the attached test 500 times without a hang.

bent, we need this reviewed ASAP because it's splatting orange all over repos that run OOP xpcshell tests.
Attachment #467348 - Flags: review?(bent.mozilla)
Comment on attachment 467348 [details] [diff] [review]
Avoid race between IO-thread loop->PostTask() and main-thread loop->SetNestableTasksAllowed() that led to Tasks being ignored

Ok, looks good.
Attachment #467348 - Flags: review?(bent.mozilla) → review+
blocking2.0: --- → beta5+
There's a chance that this bug is causing delays in processing plugin in-messages in firefox-bin; the patch should be safe for the branch.
blocking1.9.2: --- → ?
http://hg.mozilla.org/mozilla-central/rev/69c6ce104f45
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Is the branch patch going to be the same? If so please request approval1.9.2.10? on it.
blocking1.9.2: ? → needed
Comment on attachment 467348 [details] [diff] [review]
Avoid race between IO-thread loop->PostTask() and main-thread loop->SetNestableTasksAllowed() that led to Tasks being ignored

Er, sorry.  Yes, it will be the same.
Attachment #467348 - Flags: approval1.9.2.10?
Could this have caused a Dromaeo regression? See dev.tree-management thread "Dromaeo (V8) decrease 9.99% on Win7 Firefox"
Looking at the graph it seems that the regression magically disappeared. Some other cause perhaps?
bent responded to beltzner, but the CC to the mailing list didn't reach the newsgroup. He said:

"Hm. I see the dip, but it looks like the numbers have all come back up since then? This wasn't backed out, so maybe we're seeing something else?

Though I must confess I may be misreading this graph..."

It's basically impossible for this to have regressed Dromaeo unless Dromaeo is loading OOPPs, and if that's true then it needs to be taken out and shot. This patch touches code that runs even without OOPP, but it actually reduces the number of instructions executed by 3 or 4 when we jump from the chromium->XPCOM loop.

I call BS.
Did the test land on trunk as well? We'd want a test with this for branch. Please do a rollup patch or ask for approval for the test patch as well.
Whiteboard: [needs answer to comment 23]
Flags: in-testsuite?
Comment on attachment 467348 [details] [diff] [review]
Avoid race between IO-thread loop->PostTask() and main-thread loop->SetNestableTasksAllowed() that led to Tasks being ignored

Clearing approval request pending answer to test question in comment 23
Attachment #467348 - Flags: approval1.9.2.10? → approval1.9.2.10-
As I said in comment 10, there's no way to write a deterministic test for this bug. The "test" I wrote isn't better than the other multiprocess xpcshell tests in that sense. I just twiddled its parameters so that it tickled the bug more frequently on my computer; mileage will vary on tinderboxen.
Comment on attachment 467348 [details] [diff] [review]
Avoid race between IO-thread loop->PostTask() and main-thread loop->SetNestableTasksAllowed() that led to Tasks being ignored

Renominating; the test is unreliable and not something we can actually check in.
Attachment #467348 - Flags: approval1.9.2.10- → approval1.9.2.10?
Comment on attachment 467348 [details] [diff] [review]
Avoid race between IO-thread loop->PostTask() and main-thread loop->SetNestableTasksAllowed() that led to Tasks being ignored

Approved for 1.9.2.10, a=dveditz for release-drivers

Do we need to be concerned that our patches make us diverge from the upstream IPC, making it hard to take future upstream improvements? I know the <video> code is set up to apply our fixes as .patch files to the upstream.
Attachment #467348 - Flags: approval1.9.2.10? → approval1.9.2.10+
Whiteboard: [needs answer to comment 23]
blocking1.9.2: needed → .10+
(In reply to comment #27)
> Do we need to be concerned that our patches make us diverge from the upstream
> IPC, making it hard to take future upstream improvements? I know the <video>
> code is set up to apply our fixes as .patch files to the upstream.

I don't think so.  The "upstream" IPC code isn't designed to be a standalone library, there are no source or binary compat guarantees, and we don't plan on using this code post-4.x.  We've cherry-picked bug fixes from chromium as needed.

http://hg.mozilla.org/releases/mozilla-1.9.2/rev/e2aa5b39803f
So, if I'm reading this right, I believe there is no deterministic way to trigger this issue for the purposes of QA verification of the fix. Is this correct?
Whiteboard: [qa-examined-192] [qa-needs-str]
> I believe there is no deterministic way to trigger this issue for the purposes
> of QA verification of the fix.

Correct.
Marking this as having nothing for QA to do.
Whiteboard: [qa-examined-192] [qa-needs-str] → [qa-ntd-192]
tracking-fennec: ? → ---