Closed
Bug 558986
Opened 15 years ago
Closed 15 years ago
crash [@ ntdll.dll@0x200fd ] [@ hang | KiFastSystemCallRet | NtUserPeekMessage]Crashes in TSF with Flash/Silverlight on tablets (OLE/COM nested event loop)
Categories
(Core Graveyard :: Plug-ins, defect)
Tracking
(blocking1.9.2 .4+, status1.9.2 .4-fixed)
RESOLVED
FIXED
mozilla1.9.3a5
People
(Reporter: chofmann, Assigned: jimm)
References
()
Details
(Keywords: crash, inputmethod, intermittent-failure)
Crash Data
Attachments
(2 files, 3 obsolete files)
|
16.70 KB,
patch
|
ehsan.akhgari
:
review+
jimm
:
review+
|
Details | Diff | Splinter Review |
|
17.26 KB,
patch
|
benjamin
:
review+
abillings
:
approval1.9.2.4+
|
Details | Diff | Splinter Review |
#21 on the current 3.6.3plugin1 list
lots of interesting stuff on the stack that looks like
http://crash-stats.mozilla.com/report/index/3cff4484-1d92-48aa-9269-bd5aa2100412
Frame Module Signature [Expand] Source
0 ntdll.dll ntdll.dll@0x200fd
1 kernel32.dll kernel32.dll@0x1162c
2 user32.dll RealMsgWaitForMultipleObjectsEx
3 ole32.dll CCliModalLoop::BlockFn
4 ole32.dll ModalLoop
5 ole32.dll ThreadSendReceive
6 ole32.dll CRpcChannelBuffer::SwitchAptAndDispatchCall
7 ole32.dll CRpcChannelBuffer::SendReceive2
8 ole32.dll CCliModalLoop::SendReceive
9 ole32.dll CAptRpcChnl::SendReceive
10 ole32.dll CCtxComChnl::SendReceive
11 ole32.dll NdrExtpProxySendReceive
12 rpcrt4.dll rpcrt4.dll@0x3414a
13 ole32.dll ObjectStublessClient
14 ole32.dll ObjectStubless
15 ole32.dll RemoteReleaseRifRefHelper
16 ole32.dll RemoteReleaseRifRef
17 ole32.dll CStdMarshal::DisconnectCliIPIDs
18 ole32.dll CStdMarshal::Disconnect
19 ole32.dll CStdIdentity::~CStdIdentity
20 ole32.dll CStdIdentity::`scalar deleting destructor'
21 ole32.dll CInternalPageAllocator::ReleaseEntryList
22 ole32.dll IUnknown_Release_Proxy
23 oleacc.dll AccWrap_AddIAccProp::CanFakeIAccIdentity
24 oleacc.dll AccWrap_Base::QueryService
25 oleacc.dll AccWrap_Annotate::Init
26 oleacc.dll AccWrap_Annotate::GetGenericProp
27 oleacc.dll AccWrap_Annotate::get_accState Active Accessibility Core Component
28 tiptsf.dll CARET::StopTimer
29 tiptsf.dll CARET::UpdateEditFieldState
30 tiptsf.dll TabletShellProc
31 tiptsf.dll PrxDllCanUnloadNow TabletPC Input Panel component,
32 user32.dll __ClientCallWinEventProc
33 ntdll.dll ntdll.dll@0x100e5
34 npctrl.dll npctrl.dll@0x511ec
35 user32.dll InternalCallWinProc
36 user32.dll UserCallWinProcCheckWow
37 user32.dll DispatchMessageWorker
38 user32.dll DispatchMessageW
39 xul.dll base::MessagePumpForUI::ProcessMessageHelper ipc/chromium/src/base/message_pump_win.cc:361
40 xul.dll base::MessagePumpForUI::ProcessNextWindowsMessage ipc/chromium/src/base/message_pump_win.cc:336
41 xul.dll base::MessagePumpForUI::DoRunLoop ipc/chromium/src/base/message_pump_win.cc:205
42 xul.dll base::MessagePumpWin::RunWithDispatcher ipc/chromium/src/base/message_pump_win.cc:52
43 xul.dll base::MessagePumpWin::Run ipc/chromium/src/base/message_pump_win.h:78
44 xul.dll MessageLoop::RunInternal ipc/chromium/src/base/message_loop.cc:216
45 xul.dll MessageLoop::RunHandler
46 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:173
47 xul.dll base::Thread::ThreadMain ipc/chromium/src/base/thread.cc:165
48 xul.dll `anonymous namespace'::ThreadFunc ipc/chromium/src/base/platform_thread_win.cc:26
49 kernel32.dll kernel32.dll@0x13676
50 ntdll.dll ntdll.dll@0x39d71
51 ntdll.dll ntdll.dll@0x39d44
more at
http://crash-stats.mozilla.com/report/list?range_value=2&range_unit=weeks&signature=ntdll.dll%400x200fd&version=Firefox%3A3.6.3plugin1
trunk and plugin1 branch and only windows 7
checking --- ntdll.dll@0x200fd 20100411-crashdata.csv
found in: 3.7a5pre 3.6.3plugin1
release total-crashes
ntdll.dll@0x200fd crashes
pct.
all 309235 6 1.94027e-05
3.7a5pre 1408 3 0.00213068
3.6.3plugin1 1304 3 0.00230061
os breakdown
ntdll.dll@0x200fdTotal 6
Win5.1 0.00
Win6.0 0.00
Win6.1 1.00
no urls
| Assignee | ||
Comment 1•15 years ago
|
||
I can do some testing on this as I have a tablet handy from our touch input work.
Blocks: OOPP
| Assignee | ||
Comment 2•15 years ago
|
||
| Assignee | ||
Comment 3•15 years ago
|
||
Component: Plug-ins → Silverlight (Microsoft)
Product: Core → Plugins
QA Contact: plugins → microsoft-silverlight
Version: Trunk → unspecified
| Assignee | ||
Updated•15 years ago
|
Summary: crash [@ ntdll.dll@0x200fd ] → crash [@ ntdll.dll@0x200fd ] Crashes in TSF/Silverlight on tablets
| Assignee | ||
Updated•15 years ago
|
Component: Silverlight (Microsoft) → Plug-ins
Product: Plugins → Core
QA Contact: microsoft-silverlight → plugins
Summary: crash [@ ntdll.dll@0x200fd ] Crashes in TSF/Silverlight on tablets → crash [@ ntdll.dll@0x200fd ] Crashes in TSF with Flash/Silverlight on tablets
Version: unspecified → Trunk
Comment 4•15 years ago
|
||
Blocking? Based on hang stats, this comes up as about 3.5% of our hangs from yesterday. I'm not so worried about that number directly, but I am worried that people with tablets or other special system drivers will either get this hang all the time, or a significant portion of the time.
blocking1.9.2: --- → ?
Comment 5•15 years ago
|
||
Updated•15 years ago
|
Summary: crash [@ ntdll.dll@0x200fd ] Crashes in TSF with Flash/Silverlight on tablets → crash [@ ntdll.dll@0x200fd ] [@ hang | KiFastSystemCallRet | NtUserPeekMessage]Crashes in TSF with Flash/Silverlight on tablets
Comment 6•15 years ago
|
||
I've this today on a Windows 7 vm, so it may not be on tablets only:
http://crash-stats.mozilla.com/report/index/bp-7269b4df-783f-4974-9b89-73fac2100506
I was trying to reproduce bug 563936 when I encountered this hang. I had an existing Silverglight 3 installation, and I uninstalled it using the option in the Control Panel. Then I ran a Silverlight 4 installer I had previously downloaded. Then I launched Fx 3.6.4build3 and tried to play one of the videos from the link below, then while it seemed to be loading I tried right clicking to bring up the context menu. Generally that's how I got into that hang. The browser window froze, but I was able to launch another window by double clicking on my Fx desktop icon.
http://www.silverlight.net/learn/videos/silverlight-4-videos/
Comment 7•15 years ago
|
||
That's a different bug yet again, and worth filing separately. Let's keep this one focused on OLE32 hangs where it's spinning a nested PeekMessage loop to handle multi-process COM.
Summary: crash [@ ntdll.dll@0x200fd ] [@ hang | KiFastSystemCallRet | NtUserPeekMessage]Crashes in TSF with Flash/Silverlight on tablets → crash [@ ntdll.dll@0x200fd ] [@ hang | KiFastSystemCallRet | NtUserPeekMessage]Crashes in TSF with Flash/Silverlight on tablets (OLE/COM nested event loop)
Comment 9•15 years ago
|
||
From the duplicate bug, this happens when trying to exit fullscreen videos on Spike.com, flash 10.1
Comment 10•15 years ago
|
||
->me for investigation, I don't know how we'd know that this nested loop is entered when it only peeks the special OLE messages.
Assignee: nobody → benjamin
Comment 11•15 years ago
|
||
Not sure if this is biting us but it sure doesn't fill me with confidence:
"Programs May Function Erratically or Close Unexpectedly on the Tablet PC"
http://support.microsoft.com/kb/827220
Comment 12•15 years ago
|
||
Note possibly similar chrome bugs:
http://code.google.com/p/chromium/issues/detail?id=3011
http://code.google.com/p/chromium/issues/detail?id=29138
Comment 13•15 years ago
|
||
Just to clarify, the basic cause of this hang is that CARET::UpdateMSAAEditFieldState (tiptsf.dll) is calling into accessibility code. That code is calling QueryInterface on a COM object which is hosted in a different process. The OLE RPC code communicates using windows messages and enters a nested event loop of sorts here:
CRpcChannelBuffer::SwitchAptAndDispatchCall
WdtpInterfacePointer_UserUnmarshalWorker
ModalLoop
CCliModalLoop::BlockFn
That code is *probably* waiting for Mozilla to respond to the OLE call, but Mozilla is blocked.
It doesn't appear that the plugin is processing the incoming RPC call when the hangs happens, although I'm not sure of that because the stack gets lost in Flash. This means the nested event loop detection mechanism that normally operates within an RPC call is not active, and this is basically just a race where both processes are waiting for RPC responses from the other process, using different RPC mechanisms.
Comment 14•15 years ago
|
||
That's very helpful thanks. (Aside: the stacks I posted seemed similar to the duped spike.com bug)
Comment 15•15 years ago
|
||
My crash bug is on an HP touchsmart 300-1000 all in one PC.
It has a touchscreen, but I was not using it.
To my knowlege, I was not running any other program, such as IE.
My background programs were:
Rainmeter
COMODO internet security
Microsoft Security Essentuals
ATI Caytlist control center v. 10.2
Comment 16•15 years ago
|
||
I meant to say I was not using any other browser.
Firefox's UI was also slow, and CPU Usage spiked at 50-60%, while on Opera and Chrome, CPU usage was a good 10-20%.
I also had Task manager open, Firefox, and not Flash, via the plugin container was using the extra CPU.
Comment 17•15 years ago
|
||
Does someone have a reproducible crash with a Silvelright application in the stack?
Comment 18•15 years ago
|
||
STR:
* Open the Tablet PC Input Panel and write something in it
* Launch Firefox, go to spike.com
* View one of the movie trailiers, open it full-screen
* click out of full-screen mode
I still really really don't know how to fix it.
| Reporter | ||
Comment 19•15 years ago
|
||
(In reply to comment #11)
> Not sure if this is biting us but it sure doesn't fill me with confidence:
> "Programs May Function Erratically or Close Unexpectedly on the Tablet PC"
> http://support.microsoft.com/kb/827220
maybe the same, but that seems to be talking about an XP tablet version.
The ntdll.dll@0x200fd is 100% Windows NT 6.1.7600 (windows 7) for all 152 reports we have seen since the first of may.
The hang | KiFastSystemCallRet signature is spread over a larger number of OS versions.
8305 hang | KiFastSystemCallRet Windows NT 5.1.2600
2081 hang | KiFastSystemCallRet Windows NT 6.0.6002
1937 hang | KiFastSystemCallRet Windows NT 6.1.7600
575 hang | KiFastSystemCallRet Windows NT 6.0.6001
390 KiFastSystemCallRet Windows NT 5.1.2600
339 hang | KiFastSystemCallRet Windows NT 6.0.6000
49 hang | KiFastSystemCallRet Windows NT 6.1.7100
23 hang | KiFastSystemCallRet Windows NT 5.2.3790
10 KiFastSystemCallRet Windows NT 6.0.6002
8 KiFastSystemCallRet Windows NT 6.1.7600
7 KiFastSystemCallRet Windows NT 6.0.6001
4 KiFastSystemCallRet Windows NT 6.0.6000
Comment 20•15 years ago
|
||
Yeah, this is not a crash, and isn't related to that KB article or the chromium crashes much.
status1.9.2:
--- → wanted
Comment 21•15 years ago
|
||
http://crash-stats.mozilla.com/report/index/52248f1d-747e-4d94-9dc5-2828f2100507
Interesting but normal stack soon before the hang, in the parent process:
> xul.dll!nsAccessibleWrap::QueryInterface(iid={...}, ppv=0x003dea34) Line 107 C++
xul.dll!nsHyperTextAccessibleWrap::QueryInterface(iid={...}, ppv=0x003dea34) Line 49 C++
xul.dll!nsDocAccessibleWrap::QueryInterface(iid={...}, ppv=0x003dea34) Line 90 C++
ole32.dll!CDestObjectWrapper::MarshalInterface()
ole32.dll!_CoMarshalInterface@24()
OLEACC.dll!MarshalInterface()
OLEACC.dll!_LresultFromObject@12()
xul.dll!nsWindow::LresultFromObject(riid={...}, wParam=0x00000b80, pAcc=0x049d055c) Line 6515 C++
xul.dll!__SEH_epilog4_GS() C++
xul.dll!nsWindow::WindowProc(hWnd=0x00000001, msg=0x0000003d, wParam=0x00000b80, lParam=0xfffffffc) Line 3725 C++
USER32.dll!_InternalCallWinProc@20()
USER32.dll!_UserCallWinProcCheckWow@32()
USER32.dll!_DispatchClientMessage@20()
USER32.dll!___fnDWORD@4()
ntdll.dll!_KiUserCallbackDispatcher@12()
USER32.dll!_NtUserPeekMessage@20()
USER32.dll!__PeekMessage@24()
USER32.dll!_PeekMessageW@20()
ole32.dll!CCliModalLoop::MyPeekMessage()
ole32.dll!CCliModalLoop::PeekRPCAndDDEMessage()
ole32.dll!CCliModalLoop::FindMessage()
ole32.dll!CCliModalLoop::HandleWakeForMsg()
ole32.dll!CCliModalLoop::BlockFn()
ole32.dll!ModalLoop()
ole32.dll!ThreadSendReceive()
ole32.dll!CRpcChannelBuffer::SwitchAptAndDispatchCall()
ole32.dll!CRpcChannelBuffer::SendReceive2()
ole32.dll!CCliModalLoop::SendReceive()
ole32.dll!CAptRpcChnl::SendReceive()
ole32.dll!CCtxComChnl::SendReceive()
RPCRT4.dll!_NdrProxySendReceive@8()
RPCRT4.dll!@NdrpProxySendReceive@4()
RPCRT4.dll!_NdrClientCall2()
RPCRT4.dll!_ObjectStublessClient@8()
RPCRT4.dll!_ObjectStubless@0()
tiptsf.dll!CCPenIMX::UpdateStateSink()
tiptsf.dll!CCPenIMX::ActivateUI()
tiptsf.dll!CPenIMX_WindowProc()
USER32.dll!_InternalCallWinProc@20()
USER32.dll!_UserCallWinProcCheckWow@32()
USER32.dll!_DispatchMessageWorker@8()
USER32.dll!_DispatchMessageW@4()
xul.dll!nsAppShell::ProcessNextNativeEvent(mayWait=0x00000000) Line 179 C++
Comment 22•15 years ago
|
||
(In reply to comment #13)
> It doesn't appear that the plugin is processing the incoming RPC call when the
> hangs happens, although I'm not sure of that because the stack gets lost in
> Flash.
Can we get debugging help from Adobe to confirm this?
Comment 23•15 years ago
|
||
So, I tried to understand what this bug report is all about, and how I can help with that. So, first, let me try to explain what I understood from this bug report, because if my understanding is not correct, the rest of my message can be fed to birds (or pigs).
The stack in comment 0 is coming from our child process. Our child process's message loop is dispatching a message which is handled by the Tablet PC Input control thing, which calls IAccessible::get_accState, which leads to a call being dispatched to our parent process. That is done using a sync COM message to the message loop (on our main thread?) which is probably sent using some clone of SendMessage, which blocks the caller, and hence our child process. However, in the mean time, the parent is also blocked on something (I don't know what that is, is it the child process? I also don't know how we're being blocked, like are we waiting on a handle?), which leads to a deadlock, which users experience as a hang.
If that is the case, we're being bitten by the COM threading model suckiness! I think we can fix this using the approach laid out in this thread (read the whole thread):
http://www.tech-archive.net/Archive/Development/microsoft.public.win32.programmer.ole/2007-05/msg00070.html
To summarize, it means implementing IMessageFilter in our parent process, and registering it with COM so that it doesn't expect us to respond to messages when we can't.
Am I on the right track here?
Comment 24•15 years ago
|
||
Yes, you are!
I have a patch which implements this using IMessageFilter::MessagePending, although it currently does it in a hamfisted way by spinning the chromium event loop in the child if an incoming message races with the COM call. Not only does this sound dangerous, but MessageLoop::RunAllPending doesn't work (it never exits). So I'm going to try a more targeted approach which detects RPC races specifically and only processes those.
Comment 25•15 years ago
|
||
So is Benjamin or Ehsan working on this particular bug? Or, hope against hope, both?
Comment 26•15 years ago
|
||
(In reply to comment #24)
> Yes, you are!
>
> I have a patch which implements this using IMessageFilter::MessagePending,
> although it currently does it in a hamfisted way by spinning the chromium event
> loop in the child if an incoming message races with the COM call. Not only does
> this sound dangerous, but MessageLoop::RunAllPending doesn't work (it never
> exits). So I'm going to try a more targeted approach which detects RPC races
> specifically and only processes those.
Could you attach your patch, so that I can take a look at it?
(In reply to comment #25)
> So is Benjamin or Ehsan working on this particular bug? Or, hope against hope,
> both?
Benjamin is, but I can step in too if needed. I may need someone to hold my hand a bit as I go through our ipc code for the first time, though. :-)
Benjamin, do we have a set of STR to reporduce this problem on non-tablet PCs? Because I don't have access to one...
Comment 27•15 years ago
|
||
comment 18 is still correct, you don't need a tablet PC (and I don't have one either)
Comment 28•15 years ago
|
||
I'm seeing some very strange behavior currently, so I tried #if 0 all the interesting code. So you'll need to remove the #if 0 bits.
Comment 29•15 years ago
|
||
I'm testing with http://www.spike.com/video/getting-comfy-with/3387077
Comment 30•15 years ago
|
||
Some discussions on the problem and possible solution:
http://etherpad.mozilla.org:9000/win32-plugin-rpc-deadlock
Comment 31•15 years ago
|
||
Attachment #445127 -
Attachment is obsolete: true
Attachment #445185 -
Flags: review?(jones.chris.g)
Attachment #445185 -
Flags: review?(jmathies)
Attachment #445185 -
Flags: review?(ehsan)
Comment 32•15 years ago
|
||
Comment on attachment 445185 [details] [diff] [review]
Appears to work!, rev. 2
>diff --git a/dom/plugins/COMMessageFilter.cpp b/dom/plugins/COMMessageFilter.cpp
>+DWORD
>+COMMessageFilter::RetryRejectedCall(HTASK htaskCallee,
>+ DWORD dwTickCount,
>+ DWORD dwRejectType)
>+{
>+ if (mPreviousFilter)
>+ return mPreviousFilter->RetryRejectedCall(htaskCallee, dwTickCount,
>+ dwRejectType);
>+ return 0;
MSDN says that the default value here is DWORD(-1). I know that this doesn't matter because we hand this off to the previous filter (which is the OS provided one), but can you change this return value for safety?
>diff --git a/dom/plugins/PluginModuleChild.cpp b/dom/plugins/PluginModuleChild.cpp
>--- a/dom/plugins/PluginModuleChild.cpp
>+++ b/dom/plugins/PluginModuleChild.cpp
>@@ -59,16 +59,20 @@
> #include "mozilla/plugins/StreamNotifyChild.h"
> #include "mozilla/plugins/BrowserStreamChild.h"
> #include "mozilla/plugins/PluginStreamChild.h"
> #include "mozilla/plugins/PluginThreadChild.h"
> #include "PluginIdentifierChild.h"
>
> #include "nsNPAPIPlugin.h"
>
>+#ifdef XP_WIN
>+#include "COMMessageFilter.h"
>+#endif
>+
> using namespace mozilla::plugins;
>
> #if defined(XP_WIN)
> #ifndef WM_MOUSEHWHEEL
> #define WM_MOUSEHWHEEL 0x020E
> #endif
> #endif
>
>@@ -120,16 +124,18 @@ PluginModuleChild::current()
> bool
> PluginModuleChild::Init(const std::string& aPluginFilename,
> base::ProcessHandle aParentProcessHandle,
> MessageLoop* aIOLoop,
> IPC::Channel* aChannel)
> {
> PLUGIN_LOG_DEBUG_METHOD;
>
>+ COMMessageFilter::Initialize(this);
>+
Won't this fail to compile on Mac/Linux? You need to wrap it in a XP_WIN check I think.
r=me on with these two issues addressed.
I only reviewed the messagefilter code thoroughly, and barely reviewed the message loop change. I didn't look at the rest because I don't understand that code anyways. :-)
Attachment #445185 -
Flags: review?(ehsan) → review+
| Assignee | ||
Updated•15 years ago
|
Attachment #445185 -
Flags: review?(jmathies) → review+
Comment on attachment 445185 [details] [diff] [review]
Appears to work!, rev. 2
>diff --git a/ipc/glue/RPCChannel.cpp b/ipc/glue/RPCChannel.cpp
>--- a/ipc/glue/RPCChannel.cpp
>+++ b/ipc/glue/RPCChannel.cpp
> void
>+RPCChannel::FlushPendingRPCQueue()
>+{
>+ AssertWorkerThread();
>+ mMutex.AssertNotCurrentThreadOwns();
>+
>+ {
>+ MutexAutoLock lock(mMutex);
>+
>+ if (mDeferred.empty()) {
>+ if (mPending.empty())
>+ return;
>+
>+ Message last = mPending.back();
This needs to be |const Message& last = mPending.back();| or else you're going to copy the entire message payload.
>+ if (!last.is_rpc() || last.is_reply())
>+ return;
This test makes me a bit nervous because of Windows wakeup insanity. IIRC it's possible on the browser side to wake up from an RPC wait because of a windows message (in the spin loop) and then dispatch any arbitrary XPCOM task. If the task sends an an async message, say stream data delivery, then the back of the queue might be an async message even though there's an RPC message earlier in the queue. Also note that this wakeup insanity might result in there being more than one RPC message in the queue.
>+bool
> RPCChannel::OnMaybeDequeueOne()
> {
> // XXX performance tuning knob: could process all or k pending
> // messages here
>
> AssertWorkerThread();
> mMutex.AssertNotCurrentThreadOwns();
>
> Message recvd;
> {
> MutexAutoLock lock(mMutex);
>
> if (!Connected()) {
> ReportConnectionError("RPCChannel");
>- return;
>+ return false;
> }
>
> if (!mDeferred.empty())
> return MaybeProcessDeferredIncall();
>
I think this may be a bug carried over from the patch I sent you. If in the wonderful world of Windows insanity the following sequence of events is possible
browser plugin
------------ ---------------
rpc Foo--> <--rpc Bar
[plugin wins] [defer Foo]
process Bar
asyc A--> [wakeup for A]
call Baz--> <--OLE RPC
then mDeferred will be non-empty, but (deferred) Foo won't be processed because the time isn't right. Then we'll return here without processing Baz and deadlock on the OLE RPC. So, we could
(1) File this theoretical case away until we see it in the wild
(2) Change the logic to
if (!mDeferred.empty() && MaybeProcessDeferredIncall())
return true;
I'm fine with either.
>--- a/ipc/glue/RPCChannel.h
>+++ b/ipc/glue/RPCChannel.h
>@@ -161,16 +161,22 @@ public:
>
> // Override the SyncChannel handler so we can dispatch RPC
> // messages. Called on the IO thread only.
> NS_OVERRIDE
> virtual void OnMessageReceived(const Message& msg);
> NS_OVERRIDE
> virtual void OnChannelError();
>
>+ /**
>+ * When we detect that an outbound OLE RPC call is being made, we may need to unblock the parent
>+ * by processing inbound RPC calls (and any preceding asynchronous messages).
>+ */
Two points: first, this method name doesn't describe what your patch does, wrt the comment above. Second, for better or worse, this is a generic interface now, and the comment should describe what the method does rather than how it's used.
>+ /**
>+ * Process one deferred or pending RPC message.
>+ * @return true if a message was processed
>+ */
>+ bool OnMaybeDequeueOne();
>+
This comment is wrong, async and sync messages are dequeued and processed as well.
Leaving r? pending how you want to address the review comments above. As it stands I would r+ with the Message copy fixed and inaccurate comments and method names edited.
Comment 34•15 years ago
|
||
> >+ if (!last.is_rpc() || last.is_reply())
> >+ return;
>
> This test makes me a bit nervous because of Windows wakeup insanity. IIRC it's
> possible on the browser side to wake up from an RPC wait because of a windows
> message (in the spin loop) and then dispatch any arbitrary XPCOM task.
Hrm, I really don't think this is possible. The small set of sent messages that we allow to be processed shouldn't ever re-enter XPCOM tasks or send more messages. And we can't enter the full-on nested loop until we've processed the incoming RPC message (in which case it will no longer be pending!)
> I think this may be a bug carried over from the patch I sent you. If in the
> wonderful world of Windows insanity the following sequence of events is
> possible
>
> browser plugin
> ------------ ---------------
> rpc Foo--> <--rpc Bar
> [plugin wins] [defer Foo]
> process Bar
> asyc A--> [wakeup for A]
> call Baz--> <--OLE RPC
>
> then mDeferred will be non-empty, but (deferred) Foo won't be processed because
> the time isn't right. Then we'll return here without processing Baz and
> deadlock on the OLE RPC. So, we could
>
> (1) File this theoretical case away until we see it in the wild
> (2) Change the logic to
>
> if (!mDeferred.empty() && MaybeProcessDeferredIncall())
> return true;
Unless you're extremely confident that #2 won't mess up in-order delivery, I choose #1 here: I think that the chances of a plugin ending up in an OLE call while in an RPC frame are pretty slim to begin with.
Comment 35•15 years ago
|
||
Attachment #445354 -
Flags: review?(jones.chris.g)
(In reply to comment #34)
> > the following sequence of events is
> > possible
> >
> > browser plugin
> > ------------ ---------------
> > rpc Foo--> <--rpc Bar
> > [plugin wins] [defer Foo]
> > process Bar
> > asyc A--> [wakeup for A]
> > call Baz--> <--OLE RPC
> >
> > then mDeferred will be non-empty, but (deferred) Foo won't be processed because
> > the time isn't right. Then we'll return here without processing Baz and
> > deadlock on the OLE RPC. So, we could
> >
> > (1) File this theoretical case away until we see it in the wild
> > (2) Change the logic to
> >
> > if (!mDeferred.empty() && MaybeProcessDeferredIncall())
> > return true;
>
> Unless you're extremely confident that #2 won't mess up in-order delivery, I
> choose #1 here: I think that the chances of a plugin ending up in an OLE call
> while in an RPC frame are pretty slim to begin with.
It wouldn't, but I'm perfectly fine playing it safe here until crash-stats dictates otherwise.
Updated•15 years ago
|
Attachment #445354 -
Flags: review?(jones.chris.g) → review+
| Assignee | ||
Comment 37•15 years ago
|
||
- added RetryRejectedCall fix
Attachment #445354 -
Attachment is obsolete: true
| Assignee | ||
Updated•15 years ago
|
Attachment #445185 -
Flags: review?(jones.chris.g)
| Assignee | ||
Comment 38•15 years ago
|
||
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
| Assignee | ||
Updated•15 years ago
|
Attachment #445725 -
Flags: approval1.9.2.4?
| Assignee | ||
Comment 39•15 years ago
|
||
http://hg.mozilla.org/mozilla-central/rev/78923ef103cb
follow up build bustage fix for linux/osx.
Comment 40•15 years ago
|
||
Comment on attachment 445725 [details] [diff] [review]
merged to trunk
>+ * The Initial Developer of the Original Code is
>+ * the Mozilla Foundation <http://www.mozilla.org/>.
Almost all the new files I see seem to use Mozilla Corporation, which seems more accurate (though ultimately equivalent). Did we change our policy?
Comment 41•15 years ago
|
||
The policy has always been to use "The Mozilla Foundation", which is the actual copyright holder/assignee for all MoCo code.
| Assignee | ||
Comment 42•15 years ago
|
||
Backed out due to various test failure issues. This needs some try server goodness first!
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
| Assignee | ||
Comment 43•15 years ago
|
||
Bug 566395 - intermittent orange: modules/plugin/test/test_windowed_invalidate.html | Test timed out.
Bug 566392 - REFTEST TEST-UNEXPECTED-FAIL | file:///e:/builds/moz2_slave/mozilla-central-win32-debug-unittest-crashtest/build/reftest/tests/layout/base/crashtests/325967-1.html | timed out waiting for reftest-wait to be removed (after onload fired)
... and many of the debug mochitest runs had leaks.
Comment 46•15 years ago
|
||
Comment on attachment 445725 [details] [diff] [review]
merged to trunk
a=LegNeato for 1.9.2.4. Please land on both mozilla-1.9.2 default and GECKO1924_20100413_RELBRANCH
Attachment #445725 -
Flags: approval1.9.2.4? → approval1.9.2.4+
Comment 47•15 years ago
|
||
Also please remember to include/land the follow-up bustage fix.
Looks like this also caused a 50% (!) Tp4 regression on Windows.
| Assignee | ||
Comment 50•15 years ago
|
||
Looks like this is our current remaining blocker. I'll do a little investigating today to see if there's something obvious regarding the Tp4 regression. If not I suppose we'll have to look for a different approach, possible an api hook into the call chain.
Comment 51•15 years ago
|
||
The only thing which could realistically effect Tp is posting WM_NULL in MessagePumpForUI::ScheduleWork.
| Assignee | ||
Comment 52•15 years ago
|
||
(In reply to comment #51)
> The only thing which could realistically effect Tp is posting WM_NULL in
> MessagePumpForUI::ScheduleWork.
I was wondering if there might be heavy COM overhead added when installing the filter?
Comment 53•15 years ago
|
||
The filter is only installed once per plugin process: even if there were heavy overhead, it wouldn't show up in Tp4.
| Assignee | ||
Comment 54•15 years ago
|
||
(In reply to comment #53)
> The filter is only installed once per plugin process: even if there were heavy
> overhead, it wouldn't show up in Tp4.
Couldn't we test that theory by removing it a posting to try for Tp4 runs?
bsmedberg, curious if you're working on this or if you want me to look int it? I have the theming work to continue on so i don't want to overlap here..
Comment 55•15 years ago
|
||
I just got back in, I probably won't have a chance to look at it until tomorrow. If you have theories, please test them.
Comment 56•15 years ago
|
||
Also, seems strange this would regress only on XP...
| Assignee | ||
Comment 57•15 years ago
|
||
(In reply to comment #54)
> (In reply to comment #53)
> > The filter is only installed once per plugin process: even if there were heavy
> > overhead, it wouldn't show up in Tp4.
>
> Couldn't we test that theory by removing it a posting to try for Tp4 runs?
with PostMessage(WM_NULL):
WINNT 6.0 try talos tp4
tp4: 859.03 (details)
WINNT 5.1 try talos tp4
tp4: 554.97 (details)
without PostMessage(WM_NULL):
WINNT 6.0 try talos tp4
tp4: 769.42 (details)
WINNT 5.1 try talos tp4
tp4: 443.22 (details)
Theory proven!
Comment 58•15 years ago
|
||
Do we have any data on how many times that message is posted? Also, do we have any data on how expensive handling it is? Can we special case its handling and avoid doing anything (even calling DefWindowProc)?
Comment 59•15 years ago
|
||
I pushed http://hg.mozilla.org/try/rev/a1d7df2d945f
WINNT 6.0 try talos tp4
tp4: 921.81
WINNT 5.1 try talos tp4
tp4: 523.48
That avoids the rest of the patch and just posts WM_NULL and special-cases WM_NULL when we get it in the chromium hidden windowproc. Life is still bad.
We post this WM_NULL at the same frequency we post kMsgHaveWork (WM_USER + 1 in chromium-land): chromium is designed so that there is only ever one of these in the queue at a time.
I really don't know where to go from here.
Comment 60•15 years ago
|
||
You're returning 1 from the window proc on WM_NULL. According to the MSDN <http://msdn.microsoft.com/en-us/library/ms632637%28VS.85%29.aspx>, you're supposed to be returning 0 if you "process" the null message (what's there to process anyway?!). I'm not sure what kinds of crazy things this might be triggering inside windows, but it might be worth trying a version of the patch which returns 0.
Comment 61•15 years ago
|
||
Here is another idea. Try registering a custom message using RegisterWindowMessage and use that message code instead of WM_NULL.
Comment 62•15 years ago
|
||
(In reply to comment #61)
> Here is another idea. Try registering a custom message using
> RegisterWindowMessage and use that message code instead of WM_NULL.
The reason I'm suggesting this is because it seems that WM_NULL actually has grown to have a meaning in Windows. Nowadays, apparently task manager sends a WM_NULL message to an application's main thread with a timeout to determine if its "hung" or not. It still doesn't explain the tp4 regression, but it suggests to me that there may be custom processing going on inside windows for WM_NULL.
Comment 63•15 years ago
|
||
Well, the private message we're currently using (WM_USER + 1) doesn't wake up the MsgWaitForMultipleObjects call that the OLE code is blocked on. It doesn't *look* like RegisterWindowMessage would unblock that call either, although I'm not certain.
Comment 64•15 years ago
|
||
(In reply to comment #63)
> Well, the private message we're currently using (WM_USER + 1) doesn't wake up
> the MsgWaitForMultipleObjects call that the OLE code is blocked on. It doesn't
> *look* like RegisterWindowMessage would unblock that call either, although I'm
> not certain.
Ah, yes, you're right. :(
I spent a lot of time searching the web for possible problems with WM_NULL, and I found multiple places where WM_NULL overflowing the message queue size was discussed. Like this one: <http://trac.wxwidgets.org/ticket/9053>. Could this be an issue? Could we peek the message queue beforehand, and for example log whether there are WM_NULL messages pending to determine if this could be an issue?
Also, do we have any profile data on what happens with and without the WM_NULL message sending?
| Assignee | ||
Comment 65•15 years ago
|
||
By posting two messages here, I think we're basically flooding the message queue with useless wm_nulls. (see ProcessPumpReplacementMessage()) A possible solution would be to get rid of the wm_null post message and instead, change out the kMsgHaveWork schedule work message with something that will wake up COM's MsgWaitForMultipleObjects, which wm_user+1 clearly isn't doing.
Comment 66•15 years ago
|
||
How are we "flooding" it? The chromium logic guarantees that there is only ever one pending kMsgHaveWork at a time, and therefore there should also be only one pending WM_NULL (since they are posted at the same time and processed at roughly the same time).
| Assignee | ||
Comment 67•15 years ago
|
||
(In reply to comment #66)
> How are we "flooding" it? The chromium logic guarantees that there is only ever
> one pending kMsgHaveWork at a time, and therefore there should also be only one
> pending WM_NULL (since they are posted at the same time and processed at
> roughly the same time).
Look at ProcessPumpReplacementMessage() -
http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_pump_win.cc#368
Aren't we always going to get a WM_NULL in that peek call with the current patch? ProcessMessageHelper() is the main dispatch loop.
A less intrusive fix than the previous one might be: in ProcessPumpReplacementMessage(), if we receive a wm_null message in that peek, dump it a peek again for the next message?
Comment 68•15 years ago
|
||
Is the 50% regression for all users or only users affected buy the original
bug? Does our automation use the accessibility apis such that this always
manifests for our tests? Would users without accessibility support enabled
still see the regression?
Comment 69•15 years ago
|
||
I talked to Alice about the talos setup on XP and she said the OS install is pretty vanilla, so whatever accessibility settings are default there apply. We also set a few Firefox prefs http://mxr.mozilla.org/mozilla/source/testing/performance/talos/sample.config#29
To summarize the Talos Tp increase - using [1,2] to look up the test results for the revision before the landing in comment #39 (30960da07185) and the bustage fix(78923ef103cb) yields
WINNT 5.1 WINNT 6.1 MacOSX 10.5.8 MacOSX 10.6.2
before 283 661 488 491
after 423 628 489 481
And to break that down by site in the suite (for XP) you can look at
http://people.mozilla.org/~jmuizelaar/graph-server/detail.html?host=graphs&old=4018370&new=4018400
So that looks like a slowdown across the board, although nine sites are faster.
[1] http://graphs.mozilla.org/api/test/runs/revisions?revision=30960da07185
[2] http://graphs.mozilla.org/api/test/runs/revisions?revision=78923ef103cb
| Assignee | ||
Comment 70•15 years ago
|
||
I have a patch that may address this. Unfortunately try server seems to be having a fit at the moment, once things clear up I'll get some tp4 stats on my patch. I've got some other ideas too if that doesn't work.
Updated•15 years ago
|
Whiteboard: [orange]
| Assignee | ||
Comment 71•15 years ago
|
||
(In reply to comment #57)
> with PostMessage(WM_NULL):
> WINNT 6.0 try talos tp4
> tp4: 769.42 (details)
> WINNT 5.1 try talos tp4
> tp4: 443.22 (details)
Seems we lost winnt 6.0 try talos runs at a rather inopportune time, but we still have 5.1 builds to compare. I think my slight modification addressed the problem:
old patch:
WINNT 6.0 try talos tp4
tp4: 859.03 (details)
WINNT 5.1 try talos tp4
tp4: 554.97 (details)
new patch:
WINNT 6.1 tryserver talos tp4
tp4: 626.44 (details)
WINNT 5.1 tryserver talos tp4
tp4: 279.93 (details)
some other tryserver build:
WINNT 6.1 tryserver talos tp4
tp4: 647.96 (details)
WINNT 5.1 tryserver talos tp4
tp4: 269.65 (details)
| Assignee | ||
Comment 72•15 years ago
|
||
On try, this experienced one test failure, bug 559981. Not sure if this patch caused that or if it was random orange. I submitted it again to see if it repeats.
The addition I made was:
MSG msg;
bool have_message = (0 != PeekMessage(&msg, NULL, 0, 0, PM_REMOVE));
+
+ if (have_message && msg.message == WM_NULL)
+ have_message = (0 != PeekMessage(&msg, NULL, 0, 0, PM_REMOVE));
+
DCHECK(!have_message || kMsgHaveWork != msg.message ||
msg.hwnd != message_hwnd_);
Which skips over the null event in ProcessPumpReplacementMessage().
http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_pump_win.cc#367
Attachment #445725 -
Attachment is obsolete: true
Attachment #446693 -
Flags: review?(benjamin)
Updated•15 years ago
|
Attachment #446693 -
Flags: review?(benjamin) → review+
Updated•15 years ago
|
Severity: normal → critical
Comment 73•15 years ago
|
||
Status: REOPENED → RESOLVED
Closed: 15 years ago → 15 years ago
Resolution: --- → FIXED
Updated•15 years ago
|
Attachment #446693 -
Flags: approval1.9.2.4?
Updated•15 years ago
|
Target Milestone: --- → mozilla1.9.3a5
Comment 74•15 years ago
|
||
Is this going into 1.9.2 and the 1.9.2.4 release branch today since we were going to start building at 5:00 PM pacific?
| Assignee | ||
Comment 75•15 years ago
|
||
(In reply to comment #74)
> Is this going into 1.9.2 and the 1.9.2.4 release branch today since we were
> going to start building at 5:00 PM pacific?
We need a driver to a+ the patch. I'll be happy to land it this afternoon.
Comment 76•15 years ago
|
||
Comment on attachment 446693 [details] [diff] [review]
modified patch v.1
In lieu of any other available parties, I'll put on my driver hat and approve this. We need this for 1.9.2.4 release.
Attachment #446693 -
Flags: approval1.9.2.4? → approval1.9.2.4+
Comment 77•15 years ago
|
||
default and GECKO1924_20100413_RELBRANCH branches in mozilla-1.9.2 please.
| Assignee | ||
Comment 78•15 years ago
|
||
http://hg.mozilla.org/releases/mozilla-1.9.2/rev/32a26e725dac
http://hg.mozilla.org/releases/mozilla-1.9.2/rev/0383745fc4de
and a minor touch up for the default branch from an initial merge of the patch to 1.9.2:
http://hg.mozilla.org/releases/mozilla-1.9.2/rev/e928c8cf24c3
Updated•15 years ago
|
Keywords: inputmethod
Updated•14 years ago
|
Crash Signature: [@ ntdll.dll@0x200fd ]
[@ hang | KiFastSystemCallRet | NtUserPeekMessage]
Comment 79•14 years ago
|
||
hi guys,
after a long search i finally found this issue resembling problems i regularly see when running silverlight in firefox.
ff4,5 and 6 keep consistently hanging in that exact scenario if a silverlight application is unloaded and for some reason accessibility has been loaded into the silverlight process. the stack in the main thread of the plugin-container.exe always looks like this:
> ntdll.dll!_NtWaitForMultipleObjects@20() + 0x15 bytes
ntdll.dll!_NtWaitForMultipleObjects@20() + 0x15 bytes
kernel32.dll!_WaitForMultipleObjectsExImplementation@20() + 0x8e bytes
user32.dll!_RealMsgWaitForMultipleObjectsEx@20() + 0xe2 bytes
ole32.dll!CCliModalLoop::BlockFn() + 0x96 bytes
ole32.dll!ModalLoop() + 0x52 bytes
ole32.dll!ThreadSendReceive() + 0x8a bytes
ole32.dll!CRpcChannelBuffer::SwitchAptAndDispatchCall() + 0xb5 bytes
ole32.dll!CRpcChannelBuffer::SendReceive2() + 0xa6 bytes
ole32.dll!CCliModalLoop::SendReceive() + 0x1e bytes
ole32.dll!CAptRpcChnl::SendReceive() + 0x72 bytes
ole32.dll!CCtxComChnl::SendReceive() + 0x47 bytes
ole32.dll!NdrExtpProxySendReceive() + 0x43 bytes
rpcrt4.dll!@NdrpProxySendReceive@4() + 0xe bytes
rpcrt4.dll!_NdrClientCall2() + 0x144 bytes
ole32.dll!_ObjectStublessClient@8() + 0x7a bytes
ole32.dll!_ObjectStubless@0() + 0xf bytes
ole32.dll!RemoteReleaseRifRefHelper() + 0x3c bytes
ole32.dll!RemoteReleaseRifRef() + 0xa5 bytes
ole32.dll!CStdMarshal::DisconnectCliIPIDs() - 0xea bytes
ole32.dll!CStdMarshal::Disconnect() + 0x160 bytes
ole32.dll!CStdIdentity::~CStdIdentity() + 0x7d bytes
ole32.dll!CStdIdentity::`vector deleting destructor'() + 0xd bytes
ole32.dll!CStdIdentity::CInternalUnk::Release() - 0x1d726 bytes
ole32.dll!_IUnknown_Release_Proxy@4() + 0x11 bytes
oleacc.dll!AccWrap_Base::~AccWrap_Base() + 0x3d bytes
oleacc.dll!AccWrap_Annotate::`scalar deleting destructor'() + 0xd bytes
oleacc.dll!AccWrap_Base::Release() + 0x257 bytes
mscorwks.dll!ReleaseTransitionHelper() + 0x5f bytes
mscorwks.dll!SafeReleaseHelper() + 0x70 bytes
mscorwks.dll!SafeRelease() + 0x2f bytes
mscorwks.dll!IUnkEntry::Free() + 0x50 bytes
mscorwks.dll!RCW::ReleaseAllInterfaces() + 0x18 bytes
mscorwks.dll!RCW::ReleaseAllInterfacesCallBack() + 0x2f bytes
mscorwks.dll!RCW::Cleanup() + 0x22 bytes
mscorwks.dll!RCWCleanupList::ReleaseRCWListRaw() + 0x16 bytes
mscorwks.dll!RCWCleanupList::ReleaseRCWListInCorrectCtx() + 0x7c bytes
mscorwks.dll!CtxEntry::EnterContextCallback() + 0x8c bytes
ole32.dll!CRemoteUnknown::DoCallback() + 0x3b bytes
rpcrt4.dll!_Invoke@12() + 0x2a bytes
rpcrt4.dll!_NdrStubCall2@16() + 0x256 bytes
ole32.dll!_CStdStubBuffer_Invoke@12() + 0x70 bytes
ole32.dll!SyncStubInvoke() + 0x34 bytes
ole32.dll!StubInvoke() + 0x7b bytes
ole32.dll!CCtxComChnl::ContextInvoke() + 0xe6 bytes
ole32.dll!MTAInvoke() + 0x1a bytes
ole32.dll!STAInvoke() + 0x4a bytes
ole32.dll!AppInvoke() + 0x92 bytes
ole32.dll!ComInvokeWithLockAndIPID() + 0x27c bytes
ole32.dll!ComInvoke() + 0x71 bytes
ole32.dll!ThreadDispatch() + 0x1a bytes
ole32.dll!ThreadWndProc() + 0xa0 bytes
user32.dll!_InternalCallWinProc@20() + 0x23 bytes
user32.dll!_UserCallWinProcCheckWow@32() + 0xb7 bytes
user32.dll!_DispatchMessageWorker@8() + 0xed bytes
user32.dll!_DispatchMessageW@4() + 0xf bytes
xul.dll!base::MessagePumpForUI::ProcessMessageHelper(msg) Line 368 C++
xul.dll!base::MessagePumpForUI::ProcessNextWindowsMessage() Line 340 + 0xc bytes C++
xul.dll!base::MessagePumpForUI::DoRunLoop() Line 209 + 0x6 bytes C++
xul.dll!base::MessagePumpWin::RunWithDispatcher(delegate, dispatcher) Line 54 C++
xul.dll!base::MessagePumpWin::Run(delegate) Line 78 + 0xc bytes C++
xul.dll!MessageLoop::RunInternal() Line 219 C++
xul.dll!MessageLoop::RunHandler() + 0x2081f6 bytes C++
xul.dll!MessageLoop::Run() Line 177 C++
xul.dll!XRE_InitChildProcess(aArgc, aArgv, aProcess) Line 514 C++
Comment 80•14 years ago
|
||
by the way: this happens on win7 and winvista systems
Updated•13 years ago
|
Keywords: intermittent-failure
Updated•13 years ago
|
Whiteboard: [orange]
Updated•3 years ago
|
Product: Core → Core Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•