Closed Bug 522414 Opened 15 years ago Closed 15 years ago

RPC and Sync channels will deadlock easily on Windows

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: bent.mozilla, Assigned: bent.mozilla)

References

(Depends on 1 open bug)

Details

Attachments

(4 files, 2 obsolete files)

Patch, v1 15 years ago Ben Turner (not reading bugmail, use the needinfo flag!) 14.46 KB, patch		Details \| Diff \| Splinter Review
Sample deadlock 15 years ago Ben Turner (not reading bugmail, use the needinfo flag!) 7.11 KB, text/plain		Details
Patch, v1.1 15 years ago Ben Turner (not reading bugmail, use the needinfo flag!) 15.95 KB, patch	jimm : review+	Details \| Diff \| Splinter Review
Patch, v1.2 15 years ago Ben Turner (not reading bugmail, use the needinfo flag!) 17.34 KB, patch	jimm : review+ robert.strong.bugs : review+	Details \| Diff \| Splinter Review
Followup patch to assert that a message never makes it to nsWindow::WindowProc during IPC calls 15 years ago Ben Turner (not reading bugmail, use the needinfo flag!) 4.76 KB, patch	jimm : review+ robert.strong.bugs : review+	Details \| Diff \| Splinter Review
Message log for normal browsing 15 years ago Ben Turner (not reading bugmail, use the needinfo flag!) 13.24 KB, text/plain		Details

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Description

•

15 years ago

Attached patch Patch, v1 (obsolete) — Details — Splinter Review

Due to the craziness that is the Windows message system we can very easily deadlock when we have native widgets in two processes that are also allowed to synchronously call each other via IPC. The only way around this as far as I know is to spin a mini-event loop during synchronous calls that will only respond to certain Windows messages. Patch attached that fixes everything, I think.

Attachment #406383 - Flags: review?(jmathies)

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Updated

•

15 years ago

Attachment #406383 - Flags: review?(robert.bugzilla)

Jim Mathies [:jimm]

Comment 1

•

15 years ago

Checked out the code base today and have a build going so I can get into this to understand what you're trying to do here. Have a blocking top crasher for 1.9.1 though I'm working on, so it will be a day or so until I can get to this review.

Robert Strong (they/them - no direct email)

Comment 2

•

15 years ago

Comment on attachment 406383 [details] [diff] [review] Patch, v1 >diff --git a/ipc/glue/SyncChannel.cpp b/ipc/glue/SyncChannel.cpp >--- a/ipc/glue/SyncChannel.cpp >+++ b/ipc/glue/SyncChannel.cpp >... >+static inline bool >+WindowIsMozillaWindow(HWND hwnd) >+{ >+ PRUnichar buffer[256] = {0}; >+ int length = GetClassNameW(hwnd, (wchar_t*)buffer, sizeof(buffer) - 1); >+ if (length <= 0) { >+ NS_WARNING("Failed to get class name?!"); >+ return false; >+ } >+ >+ // Cheap test >+ if (buffer[0] != PRUnichar('M') && buffer[0] != PRUnichar('G') && >+ buffer[0] != PRUnichar('n') && buffer[0] != PRUnichar('F')) { >+ return false; >+ } >+ >+ // More expensive test >+ nsDependentString className(buffer, length); >+ if (!StringBeginsWith(className, NS_LITERAL_STRING("Mozilla")) && >+ !StringBeginsWith(className, NS_LITERAL_STRING("Gecko")) && >+ !className.EqualsLiteral("nsToolkitClass") && >+ !className.EqualsLiteral("nsAppShell:EventWindowClass") && >+ !className.EqualsLiteral("FirefoxMessageWindow")) { >+ return false; From nsNativeAppSupportWin.cpp for setting the MessageWindow... it is different for each app static const PRUnichar *className() { static PRUnichar classNameBuffer[128]; static PRUnichar *mClassName = 0; if ( !mClassName ) { ::_snwprintf(classNameBuffer, 128, // size of classNameBuffer in PRUnichars L"%s%s", NS_ConvertUTF8toUTF16(gAppData->name).get(), L"MessageWindow" ); mClassName = classNameBuffer; } return mClassName; }

Rob Arnold [:robarnold]

Comment 3

•

15 years ago

(In reply to comment #0) > Created an attachment (id=406383) [details] > Patch, v1 > > Due to the craziness that is the Windows message system we can very easily > deadlock when we have native widgets in two processes that are also allowed to > synchronously call each other via IPC. The only way around this as far as I > know is to spin a mini-event loop during synchronous calls that will only > respond to certain Windows messages. Patch attached that fixes everything, I > think. Can you be more specific as to how the deadlock occurs? Why are we using window messages for IPC? I don't understand under what circumstances we have two native widgets in different processes sending messages to each other.

Rob Arnold [:robarnold]

Comment 4

•

15 years ago

After some discussion with cjones over IRC, I have an idea. This all bases on the theory that SendMessage is implemented by queuing an APC (http://msdn.microsoft.com/en-us/library/ms684954(VS.85).aspx) to the thread that owns the window - a call stack would confirm this I think. APCs are not delivered until the thread is in an alertable state and many wait/synchronization calls allow you to specify if you want to be alertable or not. So long as we are not alertable when we block in the child, the APCs will not be dispatched and so we will not have these odd calls into window procedures when we intend to be blocked. After unblocking, a simple call to check the message queue should allow the APCs to be dispatched. I don't have details on how to do this right now because I don't know your IPC that well and I'd like to confirm the theory first. So that would solve the child unblocking itself, but we really shouldn't call SendMessage (sync IPC) to the child process - can we figure out where in the parent we are doing this? There's SendMessageCallback which is probably a better option (if not just PostMessage). Note that some SendMessage calls may be hiding inside API functions (SetWindowText is an example).

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 5

•

15 years ago

(In reply to comment #4) > After some discussion with cjones over IRC, I have an idea. This all bases on > the theory that SendMessage is implemented by queuing an APC No. We're not calling SendMessage manually at all. Here's an example of the problem: 1. Browser wants to create a plugin window. It makes a new HWND in the browser process. 2. Browser calls SetWindow on the plugin process via IPC. This is a synchronous message and so the browser process blocks. 3. Plugin process creates its own plugin window that it can actually subclass, draw into, etc. It then calls SetParent to make the plugin window a child of the browser's plugin window. 4. Windows generates a WM_PARENTNOTIFY message that it sends *synchronously* to the browser's plugin HWND from _within_ the SetParent call. This causes the plugin process to block. 5. Because the browser process is blocked the WM_PARENTNOTIFY message is never serviced. We're hosed. This is just one example, and, incidentally, I was able to work around it by making the plugin's window set the WS_EX_NOPARENTNOTIFY style bit. But the general problem remains - Windows sends events synchronously between parent and child windows without notifying us beforehand. We cannot avoid deadlocks in all cases unless we have a minimal Windows message loop going while we are waiting for synchronous IPC to complete.

Jim Mathies [:jimm]

Comment 6

•

15 years ago

(In reply to comment #5) > 1. Browser wants to create a plugin window. It makes a new HWND in the browser > process. > 2. Browser calls SetWindow on the plugin process via IPC. This is a synchronous > message and so the browser process blocks. > 3. Plugin process creates its own plugin window that it can actually subclass, > draw into, etc. It then calls SetParent to make the plugin window a child of > the browser's plugin window. Under what circumstance does this happen? I'm playing with a windowed flash video from you tube. I see the set window call from parent to child, and things return fine and execution continues. I'm not 100% on how the new plugin code works - in the original implementation, the window was created by the browser and set for the plugin, which then subclased that window. The plugin never created its own base 'browser' window to draw in. Does this match our current electrolysis impl? - 1) create the plugin window in the browser process [by browser, I mean the browser shell (firefox.exe), w/mozillaruntime.exe holding the tab in a child process] 2) hand the plugin window hwnd via ipc to the child plugin shim via NPP_SetWindow. 3) child shim loads the plugin, and calls NPP_SetWindow on flash. 4) flash subclasses that window, returns. 5) execution unwinds back up to the browser and continue normally. I'm asking this because, w/out your patch, I'm not seeing any deadlock issues with my test video. I haven't tested creating a child window of the base plugin window (in the child process) yet, will try to get that going here in a bit with the test plugin.

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 7

•

15 years ago

I got my example wrong in comment 5. It isn't the SetParent call that causes problems, it's the CreateWindow call when we give a parent HWND. We've since changed the way this works a little, but here's what the docs say for WM_PARENTNOTIFY: When the child window is being created, the system sends WM_PARENTNOTIFY just before the CreateWindow or CreateWindowEx function that creates the window returns. That happens when we send the parent HWND to the CreateWindow call, which we no longer do, so I guess this wasn't the best example. The general problem remains, though. Once the parent/child relationship is established Windows will freely send synchronous messages when we're not expecting them. It's interesting that you don't see any deadlocks... I can't even load a youtube video without hitting one. I'll attach a stack in a sec.

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 8

•

15 years ago

Attached file Sample deadlock — Details

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 9

•

15 years ago

(In reply to comment #6) > Does this match our current electrolysis impl? Not exactly. For plugins let's assume that we don't have tabs in their own processes yet. I've never even tried running with both prefs enabled. So here's how we do it currently: 1. firefox.exe creates a plugin window, let's call it the "parent plugin window". It is created as a child of the main firefox window. 2. HWND of parent plugin window is sent via IPC to the mozilla-runtime.exe process through the NPP_SetWindow call. 3. mozilla-runtime.exe creates its own plugin window, let's call it the "child plugin window". It must be created by the mozilla-runtime.exe process because you cannot subclass windows that belong to another process. 4. mozilla-runtime.exe calls SetParent to make the child plugin window the child of the parent plugin window. The child plugin window is resized to fill the parent plugin window and made visible. 5. mozilla-runtime.exe loads flash and calls NPP_SetWindow with the child plugin window as its parameter. 6. Flash subclasses the child plugin window to do its thing.

Rob Arnold [:robarnold]

Comment 10

•

15 years ago

Ben and I talked about this over IRC. There are few things that I'd like to see more comments/documentation on: a) Why MsgWaitForMultipleObjects checks the queue in both processes. Ben pointed me at http://msdn.microsoft.com/en-us/library/ms810439.aspx which I intend to read. b) A larger block comment somewhere that describes the race and the solution here (namely pumping the message loop with all the WndProcs substituted for DefWindowProc). Before this gets checked in, I'd like to see an audit of our WndProcs to check for important messages sent to them that we cannot miss in our code (ex: theme change notification). If something turns up, we'll need a better solution (will probably depend on the message).

Jim Mathies [:jimm]

Comment 11

•

15 years ago

I've been able to reproduce the deadlock, at least on the destroy message after removing WS_EX_NOPARENTNOTIFY from the styles. The destroy deadlock was easy to fix by setting the parent to null before calling DestroyWindow. Generally speaking, I'm not convinced we need this message trap. For example, why remove WS_EX_NOPARENTNOTIFY from the style? The purpose there is to prevent parent notifications, which is exactly what we want to do here on create and destroy. I searched around for additional notification messages that might crop up independent of those that WS_EX_NOPARENTNOTIFY filters, but didn't find much. Windows notifications are listed here - http://msdn.microsoft.com/en-us/library/dd469354(VS.85).aspx If the only deadlock problem we have is with create, destroy, and some mouse events, and those can be addressed with WS_EX_NOPARENTNOTIFY filtering, what's the point of the trap? Also, if we see deadlocks from other events caused by NPP calls, why can't we make those calls async? I'm a little hesitant to sign onto dumping windowing events for periods of time. I suppose if we have to do it, so be it, but I would hope we could address this without that. bent, I'd be more than happy to do some more debugging, looking for solutions independent of this approach if we can come up with specific cases where they occur. One additional note - on Vista, the SetParent deadlock doesn't occur, which obviously doesn't mitigate problems on XP but I find it interesting that changes in the os apparently correct the problem.

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 12

•

15 years ago

> If the only deadlock problem we have is with create, destroy, and some mouse > events, and those can be addressed with WS_EX_NOPARENTNOTIFY filtering, what's > the point of the trap? The WM_PARENTNOTIFY message is only one message that is sent synchronously by Windows. And yes, in this case we can avoid it, though it's always possible that we could one day want to catch those messages. The general problem, however, remains. The stack I attached is in a paint message, for instance. > Also, if we see deadlocks from other events caused by NPP calls, why can't > we make those calls async? The entire plugin API is synchronous. A web page can call into a plugin and we cannot suspend the execution of the page while we wait for the plugin to respond.

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 13

•

15 years ago

> I'm a little hesitant to sign onto dumping windowing events for periods of > time. I suppose if we have to do it, so be it, but I would hope we could > address this without that. Also, the "nice" thing about this patch is that we have a single spot to handle particular messages that we don't want to discard. So if you're nervous about ignoring a particular message then we don't have to. We can either handle them immediately if we know that it is safe to do so or we can 'fake' them by posting replacement messages to be processed as soon as we return to the message loop. None of that is particularly tricky, we just need to know which messages to care about.

Jim Mathies [:jimm]

Comment 14

•

15 years ago

+LRESULT CALLBACK +NeuteredWindowProc(HWND hwnd, + UINT uMsg, + WPARAM wParam, + LPARAM lParam) +{ + WNDPROC oldWndProc = (WNDPROC)GetProp(hwnd, kOldWndProcProp); Maybe add a warning in here that prints these messages out to the console? I noticed at least a few set focus events that got dropped, some set icon and set cursor events, but nothing critical. We should probably warn people that this is taking place, at least initially. + if (!currentWndProc) { + DWORD lastError = GetLastError(); + if (!lastError) { if (ERROR_SUCCESS != GetLastError()) { + if (SetProp(hwnd, kOldWndProcProp, (HANDLE)currentWndProc) && + SetProp(hwnd, kOldWndProcPropSet, (HANDLE)1)) { + return TRUE; + } Do we need both of these properties? Looks like you could key off of the presence of kOldWndProcProp. + NS_ASSERTION(mEventLoopDepth > 0, "Huh?!"); Generally, make these assert events more detailed. And as rob pointed out, a nice comment block explaining why we're doing this up top would probably be a good idea.

Jim Mathies [:jimm]

Comment 15

•

15 years ago

(In reply to comment #14) > if (ERROR_SUCCESS != GetLastError()) { eh! if (ERROR_SUCCESS == GetLastError()) {

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 16

•

15 years ago

Attached patch Patch, v1.1 (obsolete) — Details — Splinter Review

Here's a patch with loads more comments and fixes for the suggestions given by rstrong and jmathies.

Attachment #406383 - Attachment is obsolete: true

Attachment #408118 - Flags: review?(tellrob)

Attachment #408118 - Flags: review?(robert.bugzilla)

Attachment #408118 - Flags: review?(jmathies)

Attachment #406383 - Flags: review?(robert.bugzilla)

Attachment #406383 - Flags: review?(jmathies)

Jim Mathies [:jimm]

Updated

•

15 years ago

Attachment #408118 - Flags: review?(jmathies) → review+

Robert Strong (they/them - no direct email)

Comment 17

•

15 years ago

Comment on attachment 408118 [details] [diff] [review] Patch, v1.1 >+static inline bool >+WindowIsMozillaWindow(HWND hwnd) >+{ >+ PRUnichar buffer[256] = { 0 }; >+ int length = GetClassNameW(hwnd, (wchar_t*)buffer, sizeof(buffer) - 1); >+ if (length <= 0) { >+ NS_WARNING("Failed to get class name!"); >+ return false; >+ } >+ >+ nsDependentString className(buffer, length); >+ if (StringBeginsWith(className, NS_LITERAL_STRING("Mozilla")) || >+ StringBeginsWith(className, NS_LITERAL_STRING("Gecko")) || >+ className.EqualsLiteral("nsToolkitClass") || >+ className.EqualsLiteral("nsAppShell:EventWindowClass") || >+ className.EqualsLiteral("FirefoxMessageWindow")) { Did you mean to keep FirefoxMessageWindow to try to improve perf?

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 18

•

15 years ago

Attached patch Patch, v1.2 — Details — Splinter Review

This version uses a WH_CALLWNDPROC hook instead of enumerating all the thread's windows to figure out which window procedures should be swapped out. Thanks to Rob Arnold for suggesting it. I also fixed the 'FirefoxMessageWindow' thing that Rob Strong pointed out, it was not supposed to hang around.

Attachment #408118 - Attachment is obsolete: true

Attachment #408481 - Flags: review?(tellrob)

Attachment #408481 - Flags: review?(robert.bugzilla)

Attachment #408481 - Flags: review?(jmathies)

Attachment #408118 - Flags: review?(tellrob)

Attachment #408118 - Flags: review?(robert.bugzilla)

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 19

•

15 years ago

Attached patch Followup patch to assert that a message never makes it to nsWindow::WindowProc during IPC calls — Details — Splinter Review

Attachment #408490 - Flags: review?(tellrob)

Attachment #408490 - Flags: review?(robert.bugzilla)

Attachment #408490 - Flags: review?(jmathies)

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 20

•

15 years ago

(In reply to comment #10) > Before this gets checked in, I'd like to see an audit of our WndProcs to check > for important messages sent to them that we cannot miss in our code (ex: theme > change notification). I mentioned this on IRC today but it needs to be recorded here: I don't know of any way to do this systematically. We're talking about an unknown set of windows messages that are delivered within a tiny window of time in response to an unknown set of user and system events. This is not something we can audit conclusively in any way that I know. As an example, I don't see any messages that get hooked the majority of the time. If I run something that uses IPC heavily (YouTube is a good test) and switch back and forth between several different applications I can see a few focus related messages that get hooked. Sometimes I see IME messages. I don't change themes all that often but I'd be surprised if windows decided that that should be a "nonqueued" message. Trying to change themes at precisely the right moment would most likely lead to a mental illness, in any case. My point is that I don't believe an audit like you're requesting is possible. I believe that the best we can do is watch for bugs and make sure that the logs get analyzed (they currently note every message that gets dropped). Here's a list of all the messages I've seen while hooked during my testing, in case any of these are worrisome: WM_ACTIVATE WM_ACTIVATEAPP WM_SETFOCUS WM_KILLFOCUS WM_IME_SETCONTEXT WM_IME_NOTIFY WM_ERASEBKGND WM_WINDOWPOSCHANGED WM_NCPAINT

Jim Mathies [:jimm]

Comment 21

•

15 years ago

(In reply to comment #20) > WM_ACTIVATE > WM_ACTIVATEAPP > WM_SETFOCUS > WM_KILLFOCUS > WM_IME_SETCONTEXT > WM_IME_NOTIFY > WM_ERASEBKGND > WM_WINDOWPOSCHANGED > WM_NCPAINT If we're dropping these on a regular basis, we're going to need a way to queue some of these up and get them sent. (or a different approach all together) WM_WINDOWPOSCHANGED is critical for top level windows when we are setting window state, or size. It passes in a WINDOWPOS structure. WM_ACTIVATE, WM_SETFOCUS, and WM_KILLFOCUS are critical for focus changes and passes information through flags. The IME messages get thrown around a lot, not sure how loosing these is going to effect native and 3rd party imes.

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 22

•

15 years ago

Attached file Message log for normal browsing — Details

Here's a list of messages that we see with the WH_CALLWNDPROC hook running normally for every window in the system (on Windows 7). I even switched themes a few times and did the taskbar preview thing. Oh, and the InSendMessage API call returns false for the (vast) majority of these, so there's clearly a separate mechanism at work here. So what do you guys think? Do we need handlers for some of these? Are there any messages here that we absolutely could not miss?

Robert Strong (they/them - no direct email)

Comment 23

•

15 years ago

Comment on attachment 408481 [details] [diff] [review] Patch, v1.2 r=me for this only way to accomplish this with our current architecture patch that doesn't make me terribly happy and I know makes you even less happy.

Attachment #408481 - Flags: review?(robert.bugzilla) → review+

Robert Strong (they/them - no direct email)

Updated

•

15 years ago

Attachment #408490 - Flags: review?(robert.bugzilla) → review+

Benjamin Smedberg

Updated

•

15 years ago

Blocks: OOPP

Robert Strong (they/them - no direct email)

Comment 24

•

15 years ago

I should also mention that I think it would be good to get this landed sooner rather than later and to fix things like perhaps storing deferred messages in an array and handling them prior to handling new messages in a followup bug.

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Updated

•

15 years ago

Blocks: 526361

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 25

•

15 years ago

Filed bug 526361, patch attached for deferring messages. Jim, Rob Arnold, can you please review the patch in this bug so that we can start getting feedback on this approach? Thanks a bunch for spelunking with me.

Jim Mathies [:jimm]

Updated

•

15 years ago

Attachment #408490 - Flags: review?(jmathies) → review+

Jim Mathies [:jimm]

Updated

•

15 years ago

Attachment #408481 - Flags: review?(jmathies) → review+

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Comment 26

•

15 years ago

Landed changeset 478360ac91c4 on electrolysis. Rob Arnold said that he would followup if he sees anything lacking.

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Benjamin Smedberg

Updated

•

15 years ago

Depends on: 527925

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Updated

•

15 years ago

Attachment #408490 - Flags: review?(tellrob)

Ben Turner (not reading bugmail, use the needinfo flag!)

Assignee

Updated

•

15 years ago

Attachment #408481 - Flags: review?(tellrob)

(Away)

Updated

•

8 years ago

Depends on: 1334324

You need to log in before you can comment on or make changes to this bug.