Closed Bug 896896 Opened 6 years ago Closed 6 years ago

We sometimes call WaitMessage with a (non-new) message in the queue due to concurrency and Chromium IPC message loop

Categories

(Core :: Widget: Win32, defect)

21 Branch
x86_64
Windows 7
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla25
Tracking Status
firefox25 --- wontfix
firefox26 --- wontfix
firefox27 --- wontfix
firefox28 --- fixed
firefox32 --- wontfix
firefox33 --- fixed
firefox34 --- fixed
firefox35 --- fixed

People

(Reporter: nrc, Assigned: nrc)

References

(Depends on 1 open bug)

Details

(Whiteboard: [qa-])

Attachments

(2 files, 1 obsolete file)

In nsAppShell::ProcessNextNativeEvent we check PeekMessage and if there are no messages in our queue, we call WaitMessage. However, between these two things we can get a new message in our queue. If we hit SyncChannel::WaitForNotify, it will do its own PeekMessage so that new message in the queue is no longer new and so our call to WaitMessage will just wait for more messages and we never deal with the message on the queue.

Specifically, this causes problems when rendering using OMTC. We had intermittent mochi test failure because a timeout was not getting triggered because the wakeup message was the 'hidden' message in the queue. This occurred because we got a WM_PAINT message with null invalid region which was processed from the PeekMessage call, calling into out WndProc which called our OnPaint routine which made sync calls to the compositor thread which triggered SyncChannel::WaitForNotify which in turn called PeekMessage again and read the wakeup message which had been posted in the mean time. That meant that when we returned from our own PeekMessage call (which told us there were no messages in the queue), there was in fact a message in the queue and it was not new and so WaitMessage did just wait and so we would freeze, never executing the timeout until we got a new message, for example the user moving the mouse. On Try, our mochitest never got any more messages and so timed out.
Is the fix to just call WinUtils::PeekMessage(&msg, NULL, 0, 0, PM_REMOVE) before the WaitMessage again? Or to avoid calling PeekMessage when mayWait is true and instead first call WaitMessage?
(In reply to Brian R. Bondy [:bbondy] from comment #1)
> Is the fix to just call WinUtils::PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)
> before the WaitMessage again? Or to avoid calling PeekMessage when mayWait
> is true and instead first call WaitMessage?

I don't think either of these would work. In the first case, our thread could still be interupted between the two calls, although it is less likely (I used that code (in part) to verify this bug). I don't think the latter works because we could have a non-new message in the queue before the WaitMessage and PeekMessage so we would still hide it.

We have a solution, we think, although it is a little fragile. Patch coming up later today.
The basic problem here is that Win32 PeekMessage purports to guarantee that, if it returns false, any messages in the message queue now or later are treated as "new messages" by WaitMessage. But that guarantee doesn't hold if PeekMessage synchronously dispatches a message and the processing of that message itself does a PeekMessage. The fact that this can happen is super-broken on Windows' part, IMHO.
See https://bugzilla.gnome.org/show_bug.cgi?id=552041 for relevant discussion. GTK seem to have hit exactly the same problem.
Attached patch patch (obsolete) — Splinter Review
Attachment #780265 - Flags: review?(roc)
Attachment #780265 - Flags: feedback?(benjamin)
Comment on attachment 780265 [details] [diff] [review]
patch

Review of attachment 780265 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/windows/WinUtils.cpp
@@ +36,5 @@
>  namespace mozilla {
>  namespace widget {
>  
> +#ifdef DEBUG
> +  bool gInGetQueueStatus = false;

Don't indent.

@@ +133,5 @@
> +    // We rely on GetQueueStatus not sending any messages, lets assert that in
> +    // our WindowProcs to make sure.
> +    gInGetQueueStatus = true;
> +#endif
> +    result = ::GetQueueStatus(QS_ALLINPUT);

Returning true without actually filling in aMsg won't work. If this returns true you need to rerun the PeekMessageW.

Also I think we should only do this if aFirstMessage/aLastMessage cover all messages. If they don't, it's likely GetQueueStatus will return true but we shouldn't.

::: widget/windows/WinUtils.h
@@ +32,5 @@
>  namespace widget {
>  
> +#ifdef DEBUG
> +  extern bool gInGetQueueStatus;
> +#endif

Don't indent. Also make this DebugOnly<bool>.
Comment on attachment 780265 [details] [diff] [review]
patch

There appears to be a lot of history behind the current setup, a lot of which appears pretty suspect. This bug certainly does explain something that we saw a lot in the early days of OOPP where people complained that Firefox didn't act normally unless you constantly moved the mouse over it!

Why do we need WaitMessage at all? ISTM after reading the WaitMessage docs and the analysis here that WaitMessage is an inherently-racy footgun and rather than trying to hack around it with GetQueueStatus, we could just remove the WaitMessage calls (replacing them with GetMessage or MsgWaitForMultipleObjects as appropriate). GetMessage does not suffer from the weird "read/unread" message dichotomy which appears to be solely related to WaitMessage.

We call WaitMessage in three places:

http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/ipc/chromium/src/base/message_pump_win.cc#l285

This came from Chromium and just looks wrong. We're issuing a WaitMessage only after checking the queue for mouse messages. I don't think I understand why the attached input queues has any relation to the check in question, and we should consider removing this entirely.

http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/nsAppShell.cpp#l228

This will trigger the WaitMessage race, and I believe it can easily be replaced with a GetMessage.

http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/winrt/MetroAppShell.cpp#l146

For metro, this is kinda weird, but it looks like we could replace it with a "process one message even if not present" call.

Technically I worry about this patch a lot because you're not looking at the aFirstMessage/aLastMessage params, and because you can return TRUE without setting *aMsg, which will confuse anyone who passes in PM_REMOVE expecting a message.
Attachment #780265 - Flags: feedback?(benjamin) → feedback-
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #8)
> Comment on attachment 780265 [details] [diff] [review]
> patch
> 
> There appears to be a lot of history behind the current setup, a lot of
> which appears pretty suspect. This bug certainly does explain something that
> we saw a lot in the early days of OOPP where people complained that Firefox
> didn't act normally unless you constantly moved the mouse over it!
> 
> Why do we need WaitMessage at all? ISTM after reading the WaitMessage docs
> and the analysis here that WaitMessage is an inherently-racy footgun and
> rather than trying to hack around it with GetQueueStatus, we could just
> remove the WaitMessage calls (replacing them with GetMessage or
> MsgWaitForMultipleObjects as appropriate). GetMessage does not suffer from
> the weird "read/unread" message dichotomy which appears to be solely related
> to WaitMessage.
> 

Note that MsgWaitForMultipleObjects does seem to have the same drawback as WaitMessage. From MSDN:

"MsgWaitForMultipleObjects does not return if there is unread input of the specified type in the message queue after the thread has called a function to check the queue. This is because functions such as PeekMessage, GetMessage, GetQueueStatus, and WaitMessage check the queue and then change the state information for the queue so that the input is no longer considered new. A subsequent call to MsgWaitForMultipleObjects will not return until new input of the specified type arrives. The existing unread input (received prior to the last time the thread checked the queue) is ignored."

> We call WaitMessage in three places:
> 
> http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/ipc/chromium/src/
> base/message_pump_win.cc#l285
> 
> This came from Chromium and just looks wrong. We're issuing a WaitMessage
> only after checking the queue for mouse messages. I don't think I understand
> why the attached input queues has any relation to the check in question, and
> we should consider removing this entirely.
> 

I think that is OK, because any non-mouse messages in the queue won't be marked as 'seen' and so WaitMessage will return early. But I don't really understand what is going on there. From the comment, it sounds intentional to only check mouse messages.

Even if this event loop misses a message, then our main one should still wake up from its GetMessage (if we change it to that).

> http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> nsAppShell.cpp#l228
> 
> This will trigger the WaitMessage race, and I believe it can easily be
> replaced with a GetMessage.

Yep, this is where we've been getting the errors. I agree that it should be easy to replace with GetMessage.

> 
> http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> winrt/MetroAppShell.cpp#l146
> 
> For metro, this is kinda weird, but it looks like we could replace it with a
> "process one message even if not present" call.
> 

I don't think that is an option, the only choice which includes a wait is 'ProcessOneAndAllPending' which waits or processes all events. I can't tell from the code whether that is OK or if we rely on only processing one event at a time.

jimm: do you know what we can do here?

> Technically I worry about this patch a lot because you're not looking at the
> aFirstMessage/aLastMessage params, and because you can return TRUE without
> setting *aMsg, which will confuse anyone who passes in PM_REMOVE expecting a
> message.

roc also pointed these out. Fixed in a new version.
Flags: needinfo?(jmathies)
Attached patch patchSplinter Review
This is a corrected version of the old patch. But I am also going to experiment with the new approach suggested by bsmedberg.
Attachment #780265 - Attachment is obsolete: true
Attachment #780265 - Flags: review?(roc)
I think trying ProcessOneAndAllPending for starters would be ok, from the docs:

"Dispatch all currently available events in the queue. If no events are pending, wait for the next new event."

Sounds like a good replacement for PeekMessage + WaitMessage + ProcessOneNativeEventIfPresent().

However, with this we bypass masayuki's TSF dispatching. Not sure what the side effects of that might be. We might have to add that in here as well.

If that gives us headaches we could try to bypass the winrt dispatch calls altogether and do our own dispatching. I'd like to avoid that though because we don't know what ICoreDispatcher might be doing under the hood besides dispatching messages. (Some assembly inspection might shed some light on that if need be.)
Flags: needinfo?(jmathies)
Flags: needinfo?(masayuki)
(In reply to Nick Cameron [:nrc] from comment #9)
> Note that MsgWaitForMultipleObjects does seem to have the same drawback as
> WaitMessage. From MSDN:
> 
> "MsgWaitForMultipleObjects does not return if there is unread input of the
> specified type in the message queue after the thread has called a function
> to check the queue. This is because functions such as PeekMessage,
> GetMessage, GetQueueStatus, and WaitMessage check the queue and then change
> the state information for the queue so that the input is no longer
> considered new. A subsequent call to MsgWaitForMultipleObjects will not
> return until new input of the specified type arrives. The existing unread
> input (received prior to the last time the thread checked the queue) is
> ignored."

Yes.

However, MsgWaitForMultipleObjectsEx has a flag MWMO_INPUTAVAILABLE that disables this behavior. See http://msdn.microsoft.com/en-us/library/windows/desktop/ms684245%28v=vs.85%29.aspx. I think maybe we could use that instead of WaitMessage.

> > We call WaitMessage in three places:
> > 
> > http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/ipc/chromium/src/
> > base/message_pump_win.cc#l285
> > 
> > This came from Chromium and just looks wrong. We're issuing a WaitMessage
> > only after checking the queue for mouse messages. I don't think I understand
> > why the attached input queues has any relation to the check in question, and
> > we should consider removing this entirely.
> 
> I think that is OK, because any non-mouse messages in the queue won't be
> marked as 'seen' and so WaitMessage will return early.

I'm not so sure about that!

> > http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> > nsAppShell.cpp#l228
> > 
> > This will trigger the WaitMessage race, and I believe it can easily be
> > replaced with a GetMessage.
> 
> Yep, this is where we've been getting the errors. I agree that it should be
> easy to replace with GetMessage.

GetMessage would require restructuring logic in the nsAppShell event loop. I'm not so keen on doing that.

My current favourite approach is to replace usage of WaitMessage with MsgWaitForMultipleObjectsEx(MWMO_INPUTAVAILABLE).
But if we do that we should probably duplicate the Chromium logic to avoid spurious "message pending" results.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> (In reply to Nick Cameron [:nrc] from comment #9)
> > > http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> > > nsAppShell.cpp#l228
> > > 
> > > This will trigger the WaitMessage race, and I believe it can easily be
> > > replaced with a GetMessage.
> > 
> > Yep, this is where we've been getting the errors. I agree that it should be
> > easy to replace with GetMessage.
> 
> GetMessage would require restructuring logic in the nsAppShell event loop.
> I'm not so keen on doing that.

Did you mean that we need to use PeekMassage() when mayWait is false? Otherwise, I have no idea why there is the requirement restructing it. Currently, we don't need to use PM_NOREMOVE in nsAppShell...
Flags: needinfo?(masayuki)
(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #14)
> (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> > (In reply to Nick Cameron [:nrc] from comment #9)
> > > > http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> > > > nsAppShell.cpp#l228
> > > > 
> > > > This will trigger the WaitMessage race, and I believe it can easily be
> > > > replaced with a GetMessage.
> > > 
> > > Yep, this is where we've been getting the errors. I agree that it should be
> > > easy to replace with GetMessage.
> > 
> > GetMessage would require restructuring logic in the nsAppShell event loop.
> > I'm not so keen on doing that.
> 
> Did you mean that we need to use PeekMassage() when mayWait is false?
> Otherwise, I have no idea why there is the requirement restructing it.
> Currently, we don't need to use PM_NOREMOVE in nsAppShell...

See https://bugzilla.mozilla.org/show_bug.cgi?id=896896#c0

The favourite plan for now is to use MsgWaitForMultipleObjectsEx rather than WaitMessage, and make the dispatcher process all messages at once and wait, rather than one at a time and make explicit calls to WaitMessage.

This means we will no longer call WinUtils::WaitMessage and so miss out on the TSF stuff. How big of an issue is that?
Flags: needinfo?(masayuki)
(In reply to Nick Cameron [:nrc] from comment #15)
> The favourite plan for now is to use MsgWaitForMultipleObjectsEx rather than
> WaitMessage, and make the dispatcher process all messages at once and wait,
> rather than one at a time and make explicit calls to WaitMessage.

I wouldn't change anything except calling MsgWaitForMultipleObjectsEx instead of WaitMessage. I would not change the way messages are actually processed. This should not affect TSF.
(In reply to Nick Cameron [:nrc] from comment #15)
> (In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #14)
> > (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> > > (In reply to Nick Cameron [:nrc] from comment #9)
> > > > > http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> > > > > nsAppShell.cpp#l228
> > > > > 
> > > > > This will trigger the WaitMessage race, and I believe it can easily be
> > > > > replaced with a GetMessage.
> > > > 
> > > > Yep, this is where we've been getting the errors. I agree that it should be
> > > > easy to replace with GetMessage.
> > > 
> > > GetMessage would require restructuring logic in the nsAppShell event loop.
> > > I'm not so keen on doing that.
> > 
> > Did you mean that we need to use PeekMassage() when mayWait is false?
> > Otherwise, I have no idea why there is the requirement restructing it.
> > Currently, we don't need to use PM_NOREMOVE in nsAppShell...
> 
> See https://bugzilla.mozilla.org/show_bug.cgi?id=896896#c0
> 
> The favourite plan for now is to use MsgWaitForMultipleObjectsEx rather than
> WaitMessage, and make the dispatcher process all messages at once and wait,
> rather than one at a time and make explicit calls to WaitMessage.
> 
> This means we will no longer call WinUtils::WaitMessage and so miss out on
> the TSF stuff. How big of an issue is that?

Sorry s/WinUtils::WaitMessage/WinUtils::PeekMessage
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #16)
> (In reply to Nick Cameron [:nrc] from comment #15)
> > The favourite plan for now is to use MsgWaitForMultipleObjectsEx rather than
> > WaitMessage, and make the dispatcher process all messages at once and wait,
> > rather than one at a time and make explicit calls to WaitMessage.
> 
> I wouldn't change anything except calling MsgWaitForMultipleObjectsEx
> instead of WaitMessage. I would not change the way messages are actually
> processed. This should not affect TSF.

Yes, I was thinking we would have to do something extra for Metro, but it doesn't look like it.
Flags: needinfo?(masayuki)
(In reply to Nick Cameron [:nrc] from comment #15)
> (In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #14)
> > (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> > > (In reply to Nick Cameron [:nrc] from comment #9)
> > > > > http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> > > > > nsAppShell.cpp#l228
> > > > > 
> > > > > This will trigger the WaitMessage race, and I believe it can easily be
> > > > > replaced with a GetMessage.
> > > > 
> > > > Yep, this is where we've been getting the errors. I agree that it should be
> > > > easy to replace with GetMessage.
> > > 
> > > GetMessage would require restructuring logic in the nsAppShell event loop.
> > > I'm not so keen on doing that.
> > 
> > Did you mean that we need to use PeekMassage() when mayWait is false?
> > Otherwise, I have no idea why there is the requirement restructing it.
> > Currently, we don't need to use PM_NOREMOVE in nsAppShell...
> 
> See https://bugzilla.mozilla.org/show_bug.cgi?id=896896#c0
> 
> The favourite plan for now is to use MsgWaitForMultipleObjectsEx rather than
> WaitMessage, and make the dispatcher process all messages at once and wait,
> rather than one at a time and make explicit calls to WaitMessage.

I'm being confused, the favorite plan is not the attached patch?

> This means we will no longer call WinUtils::PeekMessage and so miss out on
> the TSF stuff. How big of an issue is that?

It completely breaks current TSF message handling. ITfMessagePump::PeekMessage() gets message *before* TSF handles it. WinAPI's PeekMessage() gets message *after* TSF handles it. This is very important for DOM keydown event implementation (in the future, I'm working for it now).
Attachment #782308 - Flags: review?(roc)
Attachment #782308 - Flags: feedback?(benjamin)
Try push: https://tbpl.mozilla.org/?tree=Try&rev=6655175742ab

I have verified locally that this fixes the problem with OMTC mochitest 5. I can't verify this on Try just yet because my patches have regressed horribly.
Comment on attachment 782308 [details] [diff] [review]
patch using MsgWaitForMultipleObjectsEx

I'm still a little worried about the WaitMessage in the mouse loop, but I can't think of a way around it right now.
Attachment #782308 - Flags: feedback?(benjamin) → feedback+
https://hg.mozilla.org/mozilla-central/rev/4887f7d34df2
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla25
I filed bug 937306 to work on a new version of this, but which will hopefully not break things like bug 933733.
Let's backout of mozilla-release as well (see https://bugzilla.mozilla.org/show_bug.cgi?id=933733#c34)
Keywords: verifyme
Backed out of Fx27 too: https://hg.mozilla.org/releases/mozilla-beta/rev/b952d0e9de86

(Carrying a=lsblakk from the other backouts, hope that is ok).
Why don't you just back out everywhere until bug 937306 is fixed?
(In reply to Masatoshi Kimura [:emk] from comment #30)
> Why don't you just back out everywhere until bug 937306 is fixed?

Because we need this fix for Windows OMTC and bug 933733 affects a minority of users (none on Aurora/Nightly that we know of).
Is there any manual testing needed here? I see this bug caused failures of automated tests, so the fix should already be verified by those tests.
Flags: needinfo?(ncameron)
(In reply to Ioana Budnar, QA [:ioana] from comment #32)
> Is there any manual testing needed here? I see this bug caused failures of
> automated tests, so the fix should already be verified by those tests.

Not at the moment, thanks.
Flags: needinfo?(ncameron)
Keywords: verifyme
Given comment 33 this fix will not be manually verified by QA. If you believe this warrants extra QA attention please nominate for testing by removing this whiteboard tag and adding the verifyme keyword. Please also provide any details you have that may inform our testing.
Whiteboard: [qa-]
Blocks: 937306
You need to log in before you can comment on or make changes to this bug.