Closed Bug 558986 Opened 15 years ago Closed 15 years ago

crash [@ ntdll.dll@0x200fd ] [@ hang | KiFastSystemCallRet | NtUserPeekMessage]Crashes in TSF with Flash/Silverlight on tablets (OLE/COM nested event loop)

Categories

(Core Graveyard :: Plug-ins, defect)

x86
Windows 7
defect
Not set
critical

Tracking

(blocking1.9.2 .4+, status1.9.2 .4-fixed)

RESOLVED FIXED
mozilla1.9.3a5
Tracking Status
blocking1.9.2 --- .4+
status1.9.2 --- .4-fixed

People

(Reporter: chofmann, Assigned: jimm)

References

()

Details

(Keywords: crash, inputmethod, intermittent-failure)

Crash Data

Attachments

(2 files, 3 obsolete files)

#21 on the current 3.6.3plugin1 list lots of interesting stuff on the stack that looks like http://crash-stats.mozilla.com/report/index/3cff4484-1d92-48aa-9269-bd5aa2100412 Frame Module Signature [Expand] Source 0 ntdll.dll ntdll.dll@0x200fd 1 kernel32.dll kernel32.dll@0x1162c 2 user32.dll RealMsgWaitForMultipleObjectsEx 3 ole32.dll CCliModalLoop::BlockFn 4 ole32.dll ModalLoop 5 ole32.dll ThreadSendReceive 6 ole32.dll CRpcChannelBuffer::SwitchAptAndDispatchCall 7 ole32.dll CRpcChannelBuffer::SendReceive2 8 ole32.dll CCliModalLoop::SendReceive 9 ole32.dll CAptRpcChnl::SendReceive 10 ole32.dll CCtxComChnl::SendReceive 11 ole32.dll NdrExtpProxySendReceive 12 rpcrt4.dll rpcrt4.dll@0x3414a 13 ole32.dll ObjectStublessClient 14 ole32.dll ObjectStubless 15 ole32.dll RemoteReleaseRifRefHelper 16 ole32.dll RemoteReleaseRifRef 17 ole32.dll CStdMarshal::DisconnectCliIPIDs 18 ole32.dll CStdMarshal::Disconnect 19 ole32.dll CStdIdentity::~CStdIdentity 20 ole32.dll CStdIdentity::`scalar deleting destructor' 21 ole32.dll CInternalPageAllocator::ReleaseEntryList 22 ole32.dll IUnknown_Release_Proxy 23 oleacc.dll AccWrap_AddIAccProp::CanFakeIAccIdentity 24 oleacc.dll AccWrap_Base::QueryService 25 oleacc.dll AccWrap_Annotate::Init 26 oleacc.dll AccWrap_Annotate::GetGenericProp 27 oleacc.dll AccWrap_Annotate::get_accState Active Accessibility Core Component 28 tiptsf.dll CARET::StopTimer 29 tiptsf.dll CARET::UpdateEditFieldState 30 tiptsf.dll TabletShellProc 31 tiptsf.dll PrxDllCanUnloadNow TabletPC Input Panel component, 32 user32.dll __ClientCallWinEventProc 33 ntdll.dll ntdll.dll@0x100e5 34 npctrl.dll npctrl.dll@0x511ec 35 user32.dll InternalCallWinProc 36 user32.dll UserCallWinProcCheckWow 37 user32.dll DispatchMessageWorker 38 user32.dll DispatchMessageW 39 xul.dll base::MessagePumpForUI::ProcessMessageHelper ipc/chromium/src/base/message_pump_win.cc:361 40 xul.dll base::MessagePumpForUI::ProcessNextWindowsMessage ipc/chromium/src/base/message_pump_win.cc:336 41 xul.dll base::MessagePumpForUI::DoRunLoop ipc/chromium/src/base/message_pump_win.cc:205 42 xul.dll base::MessagePumpWin::RunWithDispatcher ipc/chromium/src/base/message_pump_win.cc:52 43 xul.dll base::MessagePumpWin::Run ipc/chromium/src/base/message_pump_win.h:78 44 xul.dll MessageLoop::RunInternal ipc/chromium/src/base/message_loop.cc:216 45 xul.dll MessageLoop::RunHandler 46 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:173 47 xul.dll base::Thread::ThreadMain ipc/chromium/src/base/thread.cc:165 48 xul.dll `anonymous namespace'::ThreadFunc ipc/chromium/src/base/platform_thread_win.cc:26 49 kernel32.dll kernel32.dll@0x13676 50 ntdll.dll ntdll.dll@0x39d71 51 ntdll.dll ntdll.dll@0x39d44 more at http://crash-stats.mozilla.com/report/list?range_value=2&range_unit=weeks&signature=ntdll.dll%400x200fd&version=Firefox%3A3.6.3plugin1 trunk and plugin1 branch and only windows 7 checking --- ntdll.dll@0x200fd 20100411-crashdata.csv found in: 3.7a5pre 3.6.3plugin1 release total-crashes ntdll.dll@0x200fd crashes pct. all 309235 6 1.94027e-05 3.7a5pre 1408 3 0.00213068 3.6.3plugin1 1304 3 0.00230061 os breakdown ntdll.dll@0x200fdTotal 6 Win5.1 0.00 Win6.0 0.00 Win6.1 1.00 no urls
Keywords: crash
I can do some testing on this as I have a tablet handy from our touch input work.
Blocks: OOPP
Component: Plug-ins → Silverlight (Microsoft)
Product: Core → Plugins
QA Contact: plugins → microsoft-silverlight
Version: Trunk → unspecified
Summary: crash [@ ntdll.dll@0x200fd ] → crash [@ ntdll.dll@0x200fd ] Crashes in TSF/Silverlight on tablets
Component: Silverlight (Microsoft) → Plug-ins
Product: Plugins → Core
QA Contact: microsoft-silverlight → plugins
Summary: crash [@ ntdll.dll@0x200fd ] Crashes in TSF/Silverlight on tablets → crash [@ ntdll.dll@0x200fd ] Crashes in TSF with Flash/Silverlight on tablets
Version: unspecified → Trunk
Blocking? Based on hang stats, this comes up as about 3.5% of our hangs from yesterday. I'm not so worried about that number directly, but I am worried that people with tablets or other special system drivers will either get this hang all the time, or a significant portion of the time.
blocking1.9.2: --- → ?
Summary: crash [@ ntdll.dll@0x200fd ] Crashes in TSF with Flash/Silverlight on tablets → crash [@ ntdll.dll@0x200fd ] [@ hang | KiFastSystemCallRet | NtUserPeekMessage]Crashes in TSF with Flash/Silverlight on tablets
I've this today on a Windows 7 vm, so it may not be on tablets only: http://crash-stats.mozilla.com/report/index/bp-7269b4df-783f-4974-9b89-73fac2100506 I was trying to reproduce bug 563936 when I encountered this hang. I had an existing Silverglight 3 installation, and I uninstalled it using the option in the Control Panel. Then I ran a Silverlight 4 installer I had previously downloaded. Then I launched Fx 3.6.4build3 and tried to play one of the videos from the link below, then while it seemed to be loading I tried right clicking to bring up the context menu. Generally that's how I got into that hang. The browser window froze, but I was able to launch another window by double clicking on my Fx desktop icon. http://www.silverlight.net/learn/videos/silverlight-4-videos/
That's a different bug yet again, and worth filing separately. Let's keep this one focused on OLE32 hangs where it's spinning a nested PeekMessage loop to handle multi-process COM.
Summary: crash [@ ntdll.dll@0x200fd ] [@ hang | KiFastSystemCallRet | NtUserPeekMessage]Crashes in TSF with Flash/Silverlight on tablets → crash [@ ntdll.dll@0x200fd ] [@ hang | KiFastSystemCallRet | NtUserPeekMessage]Crashes in TSF with Flash/Silverlight on tablets (OLE/COM nested event loop)
From the duplicate bug, this happens when trying to exit fullscreen videos on Spike.com, flash 10.1
->me for investigation, I don't know how we'd know that this nested loop is entered when it only peeks the special OLE messages.
Assignee: nobody → benjamin
Not sure if this is biting us but it sure doesn't fill me with confidence: "Programs May Function Erratically or Close Unexpectedly on the Tablet PC" http://support.microsoft.com/kb/827220
Just to clarify, the basic cause of this hang is that CARET::UpdateMSAAEditFieldState (tiptsf.dll) is calling into accessibility code. That code is calling QueryInterface on a COM object which is hosted in a different process. The OLE RPC code communicates using windows messages and enters a nested event loop of sorts here: CRpcChannelBuffer::SwitchAptAndDispatchCall WdtpInterfacePointer_UserUnmarshalWorker ModalLoop CCliModalLoop::BlockFn That code is *probably* waiting for Mozilla to respond to the OLE call, but Mozilla is blocked. It doesn't appear that the plugin is processing the incoming RPC call when the hangs happens, although I'm not sure of that because the stack gets lost in Flash. This means the nested event loop detection mechanism that normally operates within an RPC call is not active, and this is basically just a race where both processes are waiting for RPC responses from the other process, using different RPC mechanisms.
That's very helpful thanks. (Aside: the stacks I posted seemed similar to the duped spike.com bug)
My crash bug is on an HP touchsmart 300-1000 all in one PC. It has a touchscreen, but I was not using it. To my knowlege, I was not running any other program, such as IE. My background programs were: Rainmeter COMODO internet security Microsoft Security Essentuals ATI Caytlist control center v. 10.2
I meant to say I was not using any other browser. Firefox's UI was also slow, and CPU Usage spiked at 50-60%, while on Opera and Chrome, CPU usage was a good 10-20%. I also had Task manager open, Firefox, and not Flash, via the plugin container was using the extra CPU.
Does someone have a reproducible crash with a Silvelright application in the stack?
STR: * Open the Tablet PC Input Panel and write something in it * Launch Firefox, go to spike.com * View one of the movie trailiers, open it full-screen * click out of full-screen mode I still really really don't know how to fix it.
(In reply to comment #11) > Not sure if this is biting us but it sure doesn't fill me with confidence: > "Programs May Function Erratically or Close Unexpectedly on the Tablet PC" > http://support.microsoft.com/kb/827220 maybe the same, but that seems to be talking about an XP tablet version. The ntdll.dll@0x200fd is 100% Windows NT 6.1.7600 (windows 7) for all 152 reports we have seen since the first of may. The hang | KiFastSystemCallRet signature is spread over a larger number of OS versions. 8305 hang | KiFastSystemCallRet Windows NT 5.1.2600 2081 hang | KiFastSystemCallRet Windows NT 6.0.6002 1937 hang | KiFastSystemCallRet Windows NT 6.1.7600 575 hang | KiFastSystemCallRet Windows NT 6.0.6001 390 KiFastSystemCallRet Windows NT 5.1.2600 339 hang | KiFastSystemCallRet Windows NT 6.0.6000 49 hang | KiFastSystemCallRet Windows NT 6.1.7100 23 hang | KiFastSystemCallRet Windows NT 5.2.3790 10 KiFastSystemCallRet Windows NT 6.0.6002 8 KiFastSystemCallRet Windows NT 6.1.7600 7 KiFastSystemCallRet Windows NT 6.0.6001 4 KiFastSystemCallRet Windows NT 6.0.6000
Yeah, this is not a crash, and isn't related to that KB article or the chromium crashes much.
http://crash-stats.mozilla.com/report/index/52248f1d-747e-4d94-9dc5-2828f2100507 Interesting but normal stack soon before the hang, in the parent process: > xul.dll!nsAccessibleWrap::QueryInterface(iid={...}, ppv=0x003dea34) Line 107 C++ xul.dll!nsHyperTextAccessibleWrap::QueryInterface(iid={...}, ppv=0x003dea34) Line 49 C++ xul.dll!nsDocAccessibleWrap::QueryInterface(iid={...}, ppv=0x003dea34) Line 90 C++ ole32.dll!CDestObjectWrapper::MarshalInterface() ole32.dll!_CoMarshalInterface@24() OLEACC.dll!MarshalInterface() OLEACC.dll!_LresultFromObject@12() xul.dll!nsWindow::LresultFromObject(riid={...}, wParam=0x00000b80, pAcc=0x049d055c) Line 6515 C++ xul.dll!__SEH_epilog4_GS() C++ xul.dll!nsWindow::WindowProc(hWnd=0x00000001, msg=0x0000003d, wParam=0x00000b80, lParam=0xfffffffc) Line 3725 C++ USER32.dll!_InternalCallWinProc@20() USER32.dll!_UserCallWinProcCheckWow@32() USER32.dll!_DispatchClientMessage@20() USER32.dll!___fnDWORD@4() ntdll.dll!_KiUserCallbackDispatcher@12() USER32.dll!_NtUserPeekMessage@20() USER32.dll!__PeekMessage@24() USER32.dll!_PeekMessageW@20() ole32.dll!CCliModalLoop::MyPeekMessage() ole32.dll!CCliModalLoop::PeekRPCAndDDEMessage() ole32.dll!CCliModalLoop::FindMessage() ole32.dll!CCliModalLoop::HandleWakeForMsg() ole32.dll!CCliModalLoop::BlockFn() ole32.dll!ModalLoop() ole32.dll!ThreadSendReceive() ole32.dll!CRpcChannelBuffer::SwitchAptAndDispatchCall() ole32.dll!CRpcChannelBuffer::SendReceive2() ole32.dll!CCliModalLoop::SendReceive() ole32.dll!CAptRpcChnl::SendReceive() ole32.dll!CCtxComChnl::SendReceive() RPCRT4.dll!_NdrProxySendReceive@8() RPCRT4.dll!@NdrpProxySendReceive@4() RPCRT4.dll!_NdrClientCall2() RPCRT4.dll!_ObjectStublessClient@8() RPCRT4.dll!_ObjectStubless@0() tiptsf.dll!CCPenIMX::UpdateStateSink() tiptsf.dll!CCPenIMX::ActivateUI() tiptsf.dll!CPenIMX_WindowProc() USER32.dll!_InternalCallWinProc@20() USER32.dll!_UserCallWinProcCheckWow@32() USER32.dll!_DispatchMessageWorker@8() USER32.dll!_DispatchMessageW@4() xul.dll!nsAppShell::ProcessNextNativeEvent(mayWait=0x00000000) Line 179 C++
blocking1.9.2: ? → .4+
(In reply to comment #13) > It doesn't appear that the plugin is processing the incoming RPC call when the > hangs happens, although I'm not sure of that because the stack gets lost in > Flash. Can we get debugging help from Adobe to confirm this?
So, I tried to understand what this bug report is all about, and how I can help with that. So, first, let me try to explain what I understood from this bug report, because if my understanding is not correct, the rest of my message can be fed to birds (or pigs). The stack in comment 0 is coming from our child process. Our child process's message loop is dispatching a message which is handled by the Tablet PC Input control thing, which calls IAccessible::get_accState, which leads to a call being dispatched to our parent process. That is done using a sync COM message to the message loop (on our main thread?) which is probably sent using some clone of SendMessage, which blocks the caller, and hence our child process. However, in the mean time, the parent is also blocked on something (I don't know what that is, is it the child process? I also don't know how we're being blocked, like are we waiting on a handle?), which leads to a deadlock, which users experience as a hang. If that is the case, we're being bitten by the COM threading model suckiness! I think we can fix this using the approach laid out in this thread (read the whole thread): http://www.tech-archive.net/Archive/Development/microsoft.public.win32.programmer.ole/2007-05/msg00070.html To summarize, it means implementing IMessageFilter in our parent process, and registering it with COM so that it doesn't expect us to respond to messages when we can't. Am I on the right track here?
Yes, you are! I have a patch which implements this using IMessageFilter::MessagePending, although it currently does it in a hamfisted way by spinning the chromium event loop in the child if an incoming message races with the COM call. Not only does this sound dangerous, but MessageLoop::RunAllPending doesn't work (it never exits). So I'm going to try a more targeted approach which detects RPC races specifically and only processes those.
So is Benjamin or Ehsan working on this particular bug? Or, hope against hope, both?
(In reply to comment #24) > Yes, you are! > > I have a patch which implements this using IMessageFilter::MessagePending, > although it currently does it in a hamfisted way by spinning the chromium event > loop in the child if an incoming message races with the COM call. Not only does > this sound dangerous, but MessageLoop::RunAllPending doesn't work (it never > exits). So I'm going to try a more targeted approach which detects RPC races > specifically and only processes those. Could you attach your patch, so that I can take a look at it? (In reply to comment #25) > So is Benjamin or Ehsan working on this particular bug? Or, hope against hope, > both? Benjamin is, but I can step in too if needed. I may need someone to hold my hand a bit as I go through our ipc code for the first time, though. :-) Benjamin, do we have a set of STR to reporduce this problem on non-tablet PCs? Because I don't have access to one...
comment 18 is still correct, you don't need a tablet PC (and I don't have one either)
Attached patch WIP, broken (obsolete) — Splinter Review
I'm seeing some very strange behavior currently, so I tried #if 0 all the interesting code. So you'll need to remove the #if 0 bits.
Some discussions on the problem and possible solution: http://etherpad.mozilla.org:9000/win32-plugin-rpc-deadlock
Attachment #445127 - Attachment is obsolete: true
Attachment #445185 - Flags: review?(jones.chris.g)
Attachment #445185 - Flags: review?(jmathies)
Attachment #445185 - Flags: review?(ehsan)
Comment on attachment 445185 [details] [diff] [review] Appears to work!, rev. 2 >diff --git a/dom/plugins/COMMessageFilter.cpp b/dom/plugins/COMMessageFilter.cpp >+DWORD >+COMMessageFilter::RetryRejectedCall(HTASK htaskCallee, >+ DWORD dwTickCount, >+ DWORD dwRejectType) >+{ >+ if (mPreviousFilter) >+ return mPreviousFilter->RetryRejectedCall(htaskCallee, dwTickCount, >+ dwRejectType); >+ return 0; MSDN says that the default value here is DWORD(-1). I know that this doesn't matter because we hand this off to the previous filter (which is the OS provided one), but can you change this return value for safety? >diff --git a/dom/plugins/PluginModuleChild.cpp b/dom/plugins/PluginModuleChild.cpp >--- a/dom/plugins/PluginModuleChild.cpp >+++ b/dom/plugins/PluginModuleChild.cpp >@@ -59,16 +59,20 @@ > #include "mozilla/plugins/StreamNotifyChild.h" > #include "mozilla/plugins/BrowserStreamChild.h" > #include "mozilla/plugins/PluginStreamChild.h" > #include "mozilla/plugins/PluginThreadChild.h" > #include "PluginIdentifierChild.h" > > #include "nsNPAPIPlugin.h" > >+#ifdef XP_WIN >+#include "COMMessageFilter.h" >+#endif >+ > using namespace mozilla::plugins; > > #if defined(XP_WIN) > #ifndef WM_MOUSEHWHEEL > #define WM_MOUSEHWHEEL 0x020E > #endif > #endif > >@@ -120,16 +124,18 @@ PluginModuleChild::current() > bool > PluginModuleChild::Init(const std::string& aPluginFilename, > base::ProcessHandle aParentProcessHandle, > MessageLoop* aIOLoop, > IPC::Channel* aChannel) > { > PLUGIN_LOG_DEBUG_METHOD; > >+ COMMessageFilter::Initialize(this); >+ Won't this fail to compile on Mac/Linux? You need to wrap it in a XP_WIN check I think. r=me on with these two issues addressed. I only reviewed the messagefilter code thoroughly, and barely reviewed the message loop change. I didn't look at the rest because I don't understand that code anyways. :-)
Attachment #445185 - Flags: review?(ehsan) → review+
Attachment #445185 - Flags: review?(jmathies) → review+
Comment on attachment 445185 [details] [diff] [review] Appears to work!, rev. 2 >diff --git a/ipc/glue/RPCChannel.cpp b/ipc/glue/RPCChannel.cpp >--- a/ipc/glue/RPCChannel.cpp >+++ b/ipc/glue/RPCChannel.cpp > void >+RPCChannel::FlushPendingRPCQueue() >+{ >+ AssertWorkerThread(); >+ mMutex.AssertNotCurrentThreadOwns(); >+ >+ { >+ MutexAutoLock lock(mMutex); >+ >+ if (mDeferred.empty()) { >+ if (mPending.empty()) >+ return; >+ >+ Message last = mPending.back(); This needs to be |const Message& last = mPending.back();| or else you're going to copy the entire message payload. >+ if (!last.is_rpc() || last.is_reply()) >+ return; This test makes me a bit nervous because of Windows wakeup insanity. IIRC it's possible on the browser side to wake up from an RPC wait because of a windows message (in the spin loop) and then dispatch any arbitrary XPCOM task. If the task sends an an async message, say stream data delivery, then the back of the queue might be an async message even though there's an RPC message earlier in the queue. Also note that this wakeup insanity might result in there being more than one RPC message in the queue. >+bool > RPCChannel::OnMaybeDequeueOne() > { > // XXX performance tuning knob: could process all or k pending > // messages here > > AssertWorkerThread(); > mMutex.AssertNotCurrentThreadOwns(); > > Message recvd; > { > MutexAutoLock lock(mMutex); > > if (!Connected()) { > ReportConnectionError("RPCChannel"); >- return; >+ return false; > } > > if (!mDeferred.empty()) > return MaybeProcessDeferredIncall(); > I think this may be a bug carried over from the patch I sent you. If in the wonderful world of Windows insanity the following sequence of events is possible browser plugin ------------ --------------- rpc Foo--> <--rpc Bar [plugin wins] [defer Foo] process Bar asyc A--> [wakeup for A] call Baz--> <--OLE RPC then mDeferred will be non-empty, but (deferred) Foo won't be processed because the time isn't right. Then we'll return here without processing Baz and deadlock on the OLE RPC. So, we could (1) File this theoretical case away until we see it in the wild (2) Change the logic to if (!mDeferred.empty() && MaybeProcessDeferredIncall()) return true; I'm fine with either. >--- a/ipc/glue/RPCChannel.h >+++ b/ipc/glue/RPCChannel.h >@@ -161,16 +161,22 @@ public: > > // Override the SyncChannel handler so we can dispatch RPC > // messages. Called on the IO thread only. > NS_OVERRIDE > virtual void OnMessageReceived(const Message& msg); > NS_OVERRIDE > virtual void OnChannelError(); > >+ /** >+ * When we detect that an outbound OLE RPC call is being made, we may need to unblock the parent >+ * by processing inbound RPC calls (and any preceding asynchronous messages). >+ */ Two points: first, this method name doesn't describe what your patch does, wrt the comment above. Second, for better or worse, this is a generic interface now, and the comment should describe what the method does rather than how it's used. >+ /** >+ * Process one deferred or pending RPC message. >+ * @return true if a message was processed >+ */ >+ bool OnMaybeDequeueOne(); >+ This comment is wrong, async and sync messages are dequeued and processed as well. Leaving r? pending how you want to address the review comments above. As it stands I would r+ with the Message copy fixed and inaccurate comments and method names edited.
> >+ if (!last.is_rpc() || last.is_reply()) > >+ return; > > This test makes me a bit nervous because of Windows wakeup insanity. IIRC it's > possible on the browser side to wake up from an RPC wait because of a windows > message (in the spin loop) and then dispatch any arbitrary XPCOM task. Hrm, I really don't think this is possible. The small set of sent messages that we allow to be processed shouldn't ever re-enter XPCOM tasks or send more messages. And we can't enter the full-on nested loop until we've processed the incoming RPC message (in which case it will no longer be pending!) > I think this may be a bug carried over from the patch I sent you. If in the > wonderful world of Windows insanity the following sequence of events is > possible > > browser plugin > ------------ --------------- > rpc Foo--> <--rpc Bar > [plugin wins] [defer Foo] > process Bar > asyc A--> [wakeup for A] > call Baz--> <--OLE RPC > > then mDeferred will be non-empty, but (deferred) Foo won't be processed because > the time isn't right. Then we'll return here without processing Baz and > deadlock on the OLE RPC. So, we could > > (1) File this theoretical case away until we see it in the wild > (2) Change the logic to > > if (!mDeferred.empty() && MaybeProcessDeferredIncall()) > return true; Unless you're extremely confident that #2 won't mess up in-order delivery, I choose #1 here: I think that the chances of a plugin ending up in an OLE call while in an RPC frame are pretty slim to begin with.
Attached patch Updated, rev. 2.1 (obsolete) — Splinter Review
Attachment #445354 - Flags: review?(jones.chris.g)
(In reply to comment #34) > > the following sequence of events is > > possible > > > > browser plugin > > ------------ --------------- > > rpc Foo--> <--rpc Bar > > [plugin wins] [defer Foo] > > process Bar > > asyc A--> [wakeup for A] > > call Baz--> <--OLE RPC > > > > then mDeferred will be non-empty, but (deferred) Foo won't be processed because > > the time isn't right. Then we'll return here without processing Baz and > > deadlock on the OLE RPC. So, we could > > > > (1) File this theoretical case away until we see it in the wild > > (2) Change the logic to > > > > if (!mDeferred.empty() && MaybeProcessDeferredIncall()) > > return true; > > Unless you're extremely confident that #2 won't mess up in-order delivery, I > choose #1 here: I think that the chances of a plugin ending up in an OLE call > while in an RPC frame are pretty slim to begin with. It wouldn't, but I'm perfectly fine playing it safe here until crash-stats dictates otherwise.
Attachment #445354 - Flags: review?(jones.chris.g) → review+
Attached patch merged to trunk (obsolete) — Splinter Review
- added RetryRejectedCall fix
Attachment #445354 - Attachment is obsolete: true
Attachment #445185 - Flags: review?(jones.chris.g)
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Attachment #445725 - Flags: approval1.9.2.4?
http://hg.mozilla.org/mozilla-central/rev/78923ef103cb follow up build bustage fix for linux/osx.
Comment on attachment 445725 [details] [diff] [review] merged to trunk >+ * The Initial Developer of the Original Code is >+ * the Mozilla Foundation <http://www.mozilla.org/>. Almost all the new files I see seem to use Mozilla Corporation, which seems more accurate (though ultimately equivalent). Did we change our policy?
The policy has always been to use "The Mozilla Foundation", which is the actual copyright holder/assignee for all MoCo code.
Backed out due to various test failure issues. This needs some try server goodness first!
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Bug 566395 - intermittent orange: modules/plugin/test/test_windowed_invalidate.html | Test timed out. Bug 566392 - REFTEST TEST-UNEXPECTED-FAIL | file:///e:/builds/moz2_slave/mozilla-central-win32-debug-unittest-crashtest/build/reftest/tests/layout/base/crashtests/325967-1.html | timed out waiting for reftest-wait to be removed (after onload fired)
... and many of the debug mochitest runs had leaks.
Comment on attachment 445725 [details] [diff] [review] merged to trunk a=LegNeato for 1.9.2.4. Please land on both mozilla-1.9.2 default and GECKO1924_20100413_RELBRANCH
Attachment #445725 - Flags: approval1.9.2.4? → approval1.9.2.4+
Also please remember to include/land the follow-up bustage fix.
Looks like this also caused a 50% (!) Tp4 regression on Windows.
Assignee: benjamin → jmathies
Looks like this is our current remaining blocker. I'll do a little investigating today to see if there's something obvious regarding the Tp4 regression. If not I suppose we'll have to look for a different approach, possible an api hook into the call chain.
The only thing which could realistically effect Tp is posting WM_NULL in MessagePumpForUI::ScheduleWork.
(In reply to comment #51) > The only thing which could realistically effect Tp is posting WM_NULL in > MessagePumpForUI::ScheduleWork. I was wondering if there might be heavy COM overhead added when installing the filter?
The filter is only installed once per plugin process: even if there were heavy overhead, it wouldn't show up in Tp4.
(In reply to comment #53) > The filter is only installed once per plugin process: even if there were heavy > overhead, it wouldn't show up in Tp4. Couldn't we test that theory by removing it a posting to try for Tp4 runs? bsmedberg, curious if you're working on this or if you want me to look int it? I have the theming work to continue on so i don't want to overlap here..
I just got back in, I probably won't have a chance to look at it until tomorrow. If you have theories, please test them.
Also, seems strange this would regress only on XP...
(In reply to comment #54) > (In reply to comment #53) > > The filter is only installed once per plugin process: even if there were heavy > > overhead, it wouldn't show up in Tp4. > > Couldn't we test that theory by removing it a posting to try for Tp4 runs? with PostMessage(WM_NULL): WINNT 6.0 try talos tp4 tp4: 859.03 (details) WINNT 5.1 try talos tp4 tp4: 554.97 (details) without PostMessage(WM_NULL): WINNT 6.0 try talos tp4 tp4: 769.42 (details) WINNT 5.1 try talos tp4 tp4: 443.22 (details) Theory proven!
Do we have any data on how many times that message is posted? Also, do we have any data on how expensive handling it is? Can we special case its handling and avoid doing anything (even calling DefWindowProc)?
I pushed http://hg.mozilla.org/try/rev/a1d7df2d945f WINNT 6.0 try talos tp4 tp4: 921.81 WINNT 5.1 try talos tp4 tp4: 523.48 That avoids the rest of the patch and just posts WM_NULL and special-cases WM_NULL when we get it in the chromium hidden windowproc. Life is still bad. We post this WM_NULL at the same frequency we post kMsgHaveWork (WM_USER + 1 in chromium-land): chromium is designed so that there is only ever one of these in the queue at a time. I really don't know where to go from here.
You're returning 1 from the window proc on WM_NULL. According to the MSDN <http://msdn.microsoft.com/en-us/library/ms632637%28VS.85%29.aspx>, you're supposed to be returning 0 if you "process" the null message (what's there to process anyway?!). I'm not sure what kinds of crazy things this might be triggering inside windows, but it might be worth trying a version of the patch which returns 0.
Here is another idea. Try registering a custom message using RegisterWindowMessage and use that message code instead of WM_NULL.
(In reply to comment #61) > Here is another idea. Try registering a custom message using > RegisterWindowMessage and use that message code instead of WM_NULL. The reason I'm suggesting this is because it seems that WM_NULL actually has grown to have a meaning in Windows. Nowadays, apparently task manager sends a WM_NULL message to an application's main thread with a timeout to determine if its "hung" or not. It still doesn't explain the tp4 regression, but it suggests to me that there may be custom processing going on inside windows for WM_NULL.
Well, the private message we're currently using (WM_USER + 1) doesn't wake up the MsgWaitForMultipleObjects call that the OLE code is blocked on. It doesn't *look* like RegisterWindowMessage would unblock that call either, although I'm not certain.
(In reply to comment #63) > Well, the private message we're currently using (WM_USER + 1) doesn't wake up > the MsgWaitForMultipleObjects call that the OLE code is blocked on. It doesn't > *look* like RegisterWindowMessage would unblock that call either, although I'm > not certain. Ah, yes, you're right. :( I spent a lot of time searching the web for possible problems with WM_NULL, and I found multiple places where WM_NULL overflowing the message queue size was discussed. Like this one: <http://trac.wxwidgets.org/ticket/9053>. Could this be an issue? Could we peek the message queue beforehand, and for example log whether there are WM_NULL messages pending to determine if this could be an issue? Also, do we have any profile data on what happens with and without the WM_NULL message sending?
By posting two messages here, I think we're basically flooding the message queue with useless wm_nulls. (see ProcessPumpReplacementMessage()) A possible solution would be to get rid of the wm_null post message and instead, change out the kMsgHaveWork schedule work message with something that will wake up COM's MsgWaitForMultipleObjects, which wm_user+1 clearly isn't doing.
How are we "flooding" it? The chromium logic guarantees that there is only ever one pending kMsgHaveWork at a time, and therefore there should also be only one pending WM_NULL (since they are posted at the same time and processed at roughly the same time).
(In reply to comment #66) > How are we "flooding" it? The chromium logic guarantees that there is only ever > one pending kMsgHaveWork at a time, and therefore there should also be only one > pending WM_NULL (since they are posted at the same time and processed at > roughly the same time). Look at ProcessPumpReplacementMessage() - http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_pump_win.cc#368 Aren't we always going to get a WM_NULL in that peek call with the current patch? ProcessMessageHelper() is the main dispatch loop. A less intrusive fix than the previous one might be: in ProcessPumpReplacementMessage(), if we receive a wm_null message in that peek, dump it a peek again for the next message?
Is the 50% regression for all users or only users affected buy the original bug? Does our automation use the accessibility apis such that this always manifests for our tests? Would users without accessibility support enabled still see the regression?
I talked to Alice about the talos setup on XP and she said the OS install is pretty vanilla, so whatever accessibility settings are default there apply. We also set a few Firefox prefs http://mxr.mozilla.org/mozilla/source/testing/performance/talos/sample.config#29 To summarize the Talos Tp increase - using [1,2] to look up the test results for the revision before the landing in comment #39 (30960da07185) and the bustage fix(78923ef103cb) yields WINNT 5.1 WINNT 6.1 MacOSX 10.5.8 MacOSX 10.6.2 before 283 661 488 491 after 423 628 489 481 And to break that down by site in the suite (for XP) you can look at http://people.mozilla.org/~jmuizelaar/graph-server/detail.html?host=graphs&old=4018370&new=4018400 So that looks like a slowdown across the board, although nine sites are faster. [1] http://graphs.mozilla.org/api/test/runs/revisions?revision=30960da07185 [2] http://graphs.mozilla.org/api/test/runs/revisions?revision=78923ef103cb
I have a patch that may address this. Unfortunately try server seems to be having a fit at the moment, once things clear up I'll get some tp4 stats on my patch. I've got some other ideas too if that doesn't work.
Whiteboard: [orange]
(In reply to comment #57) > with PostMessage(WM_NULL): > WINNT 6.0 try talos tp4 > tp4: 769.42 (details) > WINNT 5.1 try talos tp4 > tp4: 443.22 (details) Seems we lost winnt 6.0 try talos runs at a rather inopportune time, but we still have 5.1 builds to compare. I think my slight modification addressed the problem: old patch: WINNT 6.0 try talos tp4 tp4: 859.03 (details) WINNT 5.1 try talos tp4 tp4: 554.97 (details) new patch: WINNT 6.1 tryserver talos tp4 tp4: 626.44 (details) WINNT 5.1 tryserver talos tp4 tp4: 279.93 (details) some other tryserver build: WINNT 6.1 tryserver talos tp4 tp4: 647.96 (details) WINNT 5.1 tryserver talos tp4 tp4: 269.65 (details)
On try, this experienced one test failure, bug 559981. Not sure if this patch caused that or if it was random orange. I submitted it again to see if it repeats. The addition I made was: MSG msg; bool have_message = (0 != PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)); + + if (have_message && msg.message == WM_NULL) + have_message = (0 != PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)); + DCHECK(!have_message || kMsgHaveWork != msg.message || msg.hwnd != message_hwnd_); Which skips over the null event in ProcessPumpReplacementMessage(). http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_pump_win.cc#367
Attachment #445725 - Attachment is obsolete: true
Attachment #446693 - Flags: review?(benjamin)
Attachment #446693 - Flags: review?(benjamin) → review+
Severity: normal → critical
Status: REOPENED → RESOLVED
Closed: 15 years ago15 years ago
Resolution: --- → FIXED
Attachment #446693 - Flags: approval1.9.2.4?
Target Milestone: --- → mozilla1.9.3a5
Is this going into 1.9.2 and the 1.9.2.4 release branch today since we were going to start building at 5:00 PM pacific?
(In reply to comment #74) > Is this going into 1.9.2 and the 1.9.2.4 release branch today since we were > going to start building at 5:00 PM pacific? We need a driver to a+ the patch. I'll be happy to land it this afternoon.
Comment on attachment 446693 [details] [diff] [review] modified patch v.1 In lieu of any other available parties, I'll put on my driver hat and approve this. We need this for 1.9.2.4 release.
Attachment #446693 - Flags: approval1.9.2.4? → approval1.9.2.4+
default and GECKO1924_20100413_RELBRANCH branches in mozilla-1.9.2 please.
Crash Signature: [@ ntdll.dll@0x200fd ] [@ hang | KiFastSystemCallRet | NtUserPeekMessage]
hi guys, after a long search i finally found this issue resembling problems i regularly see when running silverlight in firefox. ff4,5 and 6 keep consistently hanging in that exact scenario if a silverlight application is unloaded and for some reason accessibility has been loaded into the silverlight process. the stack in the main thread of the plugin-container.exe always looks like this: > ntdll.dll!_NtWaitForMultipleObjects@20() + 0x15 bytes ntdll.dll!_NtWaitForMultipleObjects@20() + 0x15 bytes kernel32.dll!_WaitForMultipleObjectsExImplementation@20() + 0x8e bytes user32.dll!_RealMsgWaitForMultipleObjectsEx@20() + 0xe2 bytes ole32.dll!CCliModalLoop::BlockFn() + 0x96 bytes ole32.dll!ModalLoop() + 0x52 bytes ole32.dll!ThreadSendReceive() + 0x8a bytes ole32.dll!CRpcChannelBuffer::SwitchAptAndDispatchCall() + 0xb5 bytes ole32.dll!CRpcChannelBuffer::SendReceive2() + 0xa6 bytes ole32.dll!CCliModalLoop::SendReceive() + 0x1e bytes ole32.dll!CAptRpcChnl::SendReceive() + 0x72 bytes ole32.dll!CCtxComChnl::SendReceive() + 0x47 bytes ole32.dll!NdrExtpProxySendReceive() + 0x43 bytes rpcrt4.dll!@NdrpProxySendReceive@4() + 0xe bytes rpcrt4.dll!_NdrClientCall2() + 0x144 bytes ole32.dll!_ObjectStublessClient@8() + 0x7a bytes ole32.dll!_ObjectStubless@0() + 0xf bytes ole32.dll!RemoteReleaseRifRefHelper() + 0x3c bytes ole32.dll!RemoteReleaseRifRef() + 0xa5 bytes ole32.dll!CStdMarshal::DisconnectCliIPIDs() - 0xea bytes ole32.dll!CStdMarshal::Disconnect() + 0x160 bytes ole32.dll!CStdIdentity::~CStdIdentity() + 0x7d bytes ole32.dll!CStdIdentity::`vector deleting destructor'() + 0xd bytes ole32.dll!CStdIdentity::CInternalUnk::Release() - 0x1d726 bytes ole32.dll!_IUnknown_Release_Proxy@4() + 0x11 bytes oleacc.dll!AccWrap_Base::~AccWrap_Base() + 0x3d bytes oleacc.dll!AccWrap_Annotate::`scalar deleting destructor'() + 0xd bytes oleacc.dll!AccWrap_Base::Release() + 0x257 bytes mscorwks.dll!ReleaseTransitionHelper() + 0x5f bytes mscorwks.dll!SafeReleaseHelper() + 0x70 bytes mscorwks.dll!SafeRelease() + 0x2f bytes mscorwks.dll!IUnkEntry::Free() + 0x50 bytes mscorwks.dll!RCW::ReleaseAllInterfaces() + 0x18 bytes mscorwks.dll!RCW::ReleaseAllInterfacesCallBack() + 0x2f bytes mscorwks.dll!RCW::Cleanup() + 0x22 bytes mscorwks.dll!RCWCleanupList::ReleaseRCWListRaw() + 0x16 bytes mscorwks.dll!RCWCleanupList::ReleaseRCWListInCorrectCtx() + 0x7c bytes mscorwks.dll!CtxEntry::EnterContextCallback() + 0x8c bytes ole32.dll!CRemoteUnknown::DoCallback() + 0x3b bytes rpcrt4.dll!_Invoke@12() + 0x2a bytes rpcrt4.dll!_NdrStubCall2@16() + 0x256 bytes ole32.dll!_CStdStubBuffer_Invoke@12() + 0x70 bytes ole32.dll!SyncStubInvoke() + 0x34 bytes ole32.dll!StubInvoke() + 0x7b bytes ole32.dll!CCtxComChnl::ContextInvoke() + 0xe6 bytes ole32.dll!MTAInvoke() + 0x1a bytes ole32.dll!STAInvoke() + 0x4a bytes ole32.dll!AppInvoke() + 0x92 bytes ole32.dll!ComInvokeWithLockAndIPID() + 0x27c bytes ole32.dll!ComInvoke() + 0x71 bytes ole32.dll!ThreadDispatch() + 0x1a bytes ole32.dll!ThreadWndProc() + 0xa0 bytes user32.dll!_InternalCallWinProc@20() + 0x23 bytes user32.dll!_UserCallWinProcCheckWow@32() + 0xb7 bytes user32.dll!_DispatchMessageWorker@8() + 0xed bytes user32.dll!_DispatchMessageW@4() + 0xf bytes xul.dll!base::MessagePumpForUI::ProcessMessageHelper(msg) Line 368 C++ xul.dll!base::MessagePumpForUI::ProcessNextWindowsMessage() Line 340 + 0xc bytes C++ xul.dll!base::MessagePumpForUI::DoRunLoop() Line 209 + 0x6 bytes C++ xul.dll!base::MessagePumpWin::RunWithDispatcher(delegate, dispatcher) Line 54 C++ xul.dll!base::MessagePumpWin::Run(delegate) Line 78 + 0xc bytes C++ xul.dll!MessageLoop::RunInternal() Line 219 C++ xul.dll!MessageLoop::RunHandler() + 0x2081f6 bytes C++ xul.dll!MessageLoop::Run() Line 177 C++ xul.dll!XRE_InitChildProcess(aArgc, aArgv, aProcess) Line 514 C++
by the way: this happens on win7 and winvista systems
Whiteboard: [orange]
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: