We sometimes call WaitMessage with a (non-new) message in the queue due to concurrency and Chromium IPC message loop

RESOLVED FIXED in Firefox 28

Status

()

Core
Widget: Win32
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: nrc, Assigned: nrc)

Tracking

(Depends on: 1 bug)

21 Branch
mozilla25
x86_64
Windows 7
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox25 wontfix, firefox26 wontfix, firefox27 wontfix, firefox28 fixed, firefox32 wontfix, firefox33 fixed, firefox34 fixed, firefox35 fixed)

Details

(Whiteboard: [qa-])

Attachments

(2 attachments, 1 obsolete attachment)

(Assignee)

Description

4 years ago
In nsAppShell::ProcessNextNativeEvent we check PeekMessage and if there are no messages in our queue, we call WaitMessage. However, between these two things we can get a new message in our queue. If we hit SyncChannel::WaitForNotify, it will do its own PeekMessage so that new message in the queue is no longer new and so our call to WaitMessage will just wait for more messages and we never deal with the message on the queue.

Specifically, this causes problems when rendering using OMTC. We had intermittent mochi test failure because a timeout was not getting triggered because the wakeup message was the 'hidden' message in the queue. This occurred because we got a WM_PAINT message with null invalid region which was processed from the PeekMessage call, calling into out WndProc which called our OnPaint routine which made sync calls to the compositor thread which triggered SyncChannel::WaitForNotify which in turn called PeekMessage again and read the wakeup message which had been posted in the mean time. That meant that when we returned from our own PeekMessage call (which told us there were no messages in the queue), there was in fact a message in the queue and it was not new and so WaitMessage did just wait and so we would freeze, never executing the timeout until we got a new message, for example the user moving the mouse. On Try, our mochitest never got any more messages and so timed out.
Is the fix to just call WinUtils::PeekMessage(&msg, NULL, 0, 0, PM_REMOVE) before the WaitMessage again? Or to avoid calling PeekMessage when mayWait is true and instead first call WaitMessage?
(Assignee)

Comment 2

4 years ago
(In reply to Brian R. Bondy [:bbondy] from comment #1)
> Is the fix to just call WinUtils::PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)
> before the WaitMessage again? Or to avoid calling PeekMessage when mayWait
> is true and instead first call WaitMessage?

I don't think either of these would work. In the first case, our thread could still be interupted between the two calls, although it is less likely (I used that code (in part) to verify this bug). I don't think the latter works because we could have a non-new message in the queue before the WaitMessage and PeekMessage so we would still hide it.

We have a solution, we think, although it is a little fragile. Patch coming up later today.
The basic problem here is that Win32 PeekMessage purports to guarantee that, if it returns false, any messages in the message queue now or later are treated as "new messages" by WaitMessage. But that guarantee doesn't hold if PeekMessage synchronously dispatches a message and the processing of that message itself does a PeekMessage. The fact that this can happen is super-broken on Windows' part, IMHO.
(Assignee)

Comment 4

4 years ago
See https://bugzilla.gnome.org/show_bug.cgi?id=552041 for relevant discussion. GTK seem to have hit exactly the same problem.
(Assignee)

Comment 5

4 years ago
https://tbpl.mozilla.org/?tree=Try&rev=72363962ff49
(Assignee)

Comment 6

4 years ago
Created attachment 780265 [details] [diff] [review]
patch
Attachment #780265 - Flags: review?(roc)
Attachment #780265 - Flags: feedback?(benjamin)
Comment on attachment 780265 [details] [diff] [review]
patch

Review of attachment 780265 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/windows/WinUtils.cpp
@@ +36,5 @@
>  namespace mozilla {
>  namespace widget {
>  
> +#ifdef DEBUG
> +  bool gInGetQueueStatus = false;

Don't indent.

@@ +133,5 @@
> +    // We rely on GetQueueStatus not sending any messages, lets assert that in
> +    // our WindowProcs to make sure.
> +    gInGetQueueStatus = true;
> +#endif
> +    result = ::GetQueueStatus(QS_ALLINPUT);

Returning true without actually filling in aMsg won't work. If this returns true you need to rerun the PeekMessageW.

Also I think we should only do this if aFirstMessage/aLastMessage cover all messages. If they don't, it's likely GetQueueStatus will return true but we shouldn't.

::: widget/windows/WinUtils.h
@@ +32,5 @@
>  namespace widget {
>  
> +#ifdef DEBUG
> +  extern bool gInGetQueueStatus;
> +#endif

Don't indent. Also make this DebugOnly<bool>.

Comment 8

4 years ago
Comment on attachment 780265 [details] [diff] [review]
patch

There appears to be a lot of history behind the current setup, a lot of which appears pretty suspect. This bug certainly does explain something that we saw a lot in the early days of OOPP where people complained that Firefox didn't act normally unless you constantly moved the mouse over it!

Why do we need WaitMessage at all? ISTM after reading the WaitMessage docs and the analysis here that WaitMessage is an inherently-racy footgun and rather than trying to hack around it with GetQueueStatus, we could just remove the WaitMessage calls (replacing them with GetMessage or MsgWaitForMultipleObjects as appropriate). GetMessage does not suffer from the weird "read/unread" message dichotomy which appears to be solely related to WaitMessage.

We call WaitMessage in three places:

http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/ipc/chromium/src/base/message_pump_win.cc#l285

This came from Chromium and just looks wrong. We're issuing a WaitMessage only after checking the queue for mouse messages. I don't think I understand why the attached input queues has any relation to the check in question, and we should consider removing this entirely.

http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/nsAppShell.cpp#l228

This will trigger the WaitMessage race, and I believe it can easily be replaced with a GetMessage.

http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/winrt/MetroAppShell.cpp#l146

For metro, this is kinda weird, but it looks like we could replace it with a "process one message even if not present" call.

Technically I worry about this patch a lot because you're not looking at the aFirstMessage/aLastMessage params, and because you can return TRUE without setting *aMsg, which will confuse anyone who passes in PM_REMOVE expecting a message.
Attachment #780265 - Flags: feedback?(benjamin) → feedback-
(Assignee)

Comment 9

4 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #8)
> Comment on attachment 780265 [details] [diff] [review]
> patch
> 
> There appears to be a lot of history behind the current setup, a lot of
> which appears pretty suspect. This bug certainly does explain something that
> we saw a lot in the early days of OOPP where people complained that Firefox
> didn't act normally unless you constantly moved the mouse over it!
> 
> Why do we need WaitMessage at all? ISTM after reading the WaitMessage docs
> and the analysis here that WaitMessage is an inherently-racy footgun and
> rather than trying to hack around it with GetQueueStatus, we could just
> remove the WaitMessage calls (replacing them with GetMessage or
> MsgWaitForMultipleObjects as appropriate). GetMessage does not suffer from
> the weird "read/unread" message dichotomy which appears to be solely related
> to WaitMessage.
> 

Note that MsgWaitForMultipleObjects does seem to have the same drawback as WaitMessage. From MSDN:

"MsgWaitForMultipleObjects does not return if there is unread input of the specified type in the message queue after the thread has called a function to check the queue. This is because functions such as PeekMessage, GetMessage, GetQueueStatus, and WaitMessage check the queue and then change the state information for the queue so that the input is no longer considered new. A subsequent call to MsgWaitForMultipleObjects will not return until new input of the specified type arrives. The existing unread input (received prior to the last time the thread checked the queue) is ignored."

> We call WaitMessage in three places:
> 
> http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/ipc/chromium/src/
> base/message_pump_win.cc#l285
> 
> This came from Chromium and just looks wrong. We're issuing a WaitMessage
> only after checking the queue for mouse messages. I don't think I understand
> why the attached input queues has any relation to the check in question, and
> we should consider removing this entirely.
> 

I think that is OK, because any non-mouse messages in the queue won't be marked as 'seen' and so WaitMessage will return early. But I don't really understand what is going on there. From the comment, it sounds intentional to only check mouse messages.

Even if this event loop misses a message, then our main one should still wake up from its GetMessage (if we change it to that).

> http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> nsAppShell.cpp#l228
> 
> This will trigger the WaitMessage race, and I believe it can easily be
> replaced with a GetMessage.

Yep, this is where we've been getting the errors. I agree that it should be easy to replace with GetMessage.

> 
> http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> winrt/MetroAppShell.cpp#l146
> 
> For metro, this is kinda weird, but it looks like we could replace it with a
> "process one message even if not present" call.
> 

I don't think that is an option, the only choice which includes a wait is 'ProcessOneAndAllPending' which waits or processes all events. I can't tell from the code whether that is OK or if we rely on only processing one event at a time.

jimm: do you know what we can do here?

> Technically I worry about this patch a lot because you're not looking at the
> aFirstMessage/aLastMessage params, and because you can return TRUE without
> setting *aMsg, which will confuse anyone who passes in PM_REMOVE expecting a
> message.

roc also pointed these out. Fixed in a new version.
Flags: needinfo?(jmathies)
(Assignee)

Comment 10

4 years ago
Created attachment 780623 [details] [diff] [review]
patch

This is a corrected version of the old patch. But I am also going to experiment with the new approach suggested by bsmedberg.
Attachment #780265 - Attachment is obsolete: true
Attachment #780265 - Flags: review?(roc)

Comment 11

4 years ago
I think trying ProcessOneAndAllPending for starters would be ok, from the docs:

"Dispatch all currently available events in the queue. If no events are pending, wait for the next new event."

Sounds like a good replacement for PeekMessage + WaitMessage + ProcessOneNativeEventIfPresent().

However, with this we bypass masayuki's TSF dispatching. Not sure what the side effects of that might be. We might have to add that in here as well.

If that gives us headaches we could try to bypass the winrt dispatch calls altogether and do our own dispatching. I'd like to avoid that though because we don't know what ICoreDispatcher might be doing under the hood besides dispatching messages. (Some assembly inspection might shed some light on that if need be.)
Flags: needinfo?(jmathies)

Updated

4 years ago
Flags: needinfo?(masayuki)
(In reply to Nick Cameron [:nrc] from comment #9)
> Note that MsgWaitForMultipleObjects does seem to have the same drawback as
> WaitMessage. From MSDN:
> 
> "MsgWaitForMultipleObjects does not return if there is unread input of the
> specified type in the message queue after the thread has called a function
> to check the queue. This is because functions such as PeekMessage,
> GetMessage, GetQueueStatus, and WaitMessage check the queue and then change
> the state information for the queue so that the input is no longer
> considered new. A subsequent call to MsgWaitForMultipleObjects will not
> return until new input of the specified type arrives. The existing unread
> input (received prior to the last time the thread checked the queue) is
> ignored."

Yes.

However, MsgWaitForMultipleObjectsEx has a flag MWMO_INPUTAVAILABLE that disables this behavior. See http://msdn.microsoft.com/en-us/library/windows/desktop/ms684245%28v=vs.85%29.aspx. I think maybe we could use that instead of WaitMessage.

> > We call WaitMessage in three places:
> > 
> > http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/ipc/chromium/src/
> > base/message_pump_win.cc#l285
> > 
> > This came from Chromium and just looks wrong. We're issuing a WaitMessage
> > only after checking the queue for mouse messages. I don't think I understand
> > why the attached input queues has any relation to the check in question, and
> > we should consider removing this entirely.
> 
> I think that is OK, because any non-mouse messages in the queue won't be
> marked as 'seen' and so WaitMessage will return early.

I'm not so sure about that!

> > http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> > nsAppShell.cpp#l228
> > 
> > This will trigger the WaitMessage race, and I believe it can easily be
> > replaced with a GetMessage.
> 
> Yep, this is where we've been getting the errors. I agree that it should be
> easy to replace with GetMessage.

GetMessage would require restructuring logic in the nsAppShell event loop. I'm not so keen on doing that.

My current favourite approach is to replace usage of WaitMessage with MsgWaitForMultipleObjectsEx(MWMO_INPUTAVAILABLE).
But if we do that we should probably duplicate the Chromium logic to avoid spurious "message pending" results.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> (In reply to Nick Cameron [:nrc] from comment #9)
> > > http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> > > nsAppShell.cpp#l228
> > > 
> > > This will trigger the WaitMessage race, and I believe it can easily be
> > > replaced with a GetMessage.
> > 
> > Yep, this is where we've been getting the errors. I agree that it should be
> > easy to replace with GetMessage.
> 
> GetMessage would require restructuring logic in the nsAppShell event loop.
> I'm not so keen on doing that.

Did you mean that we need to use PeekMassage() when mayWait is false? Otherwise, I have no idea why there is the requirement restructing it. Currently, we don't need to use PM_NOREMOVE in nsAppShell...
Flags: needinfo?(masayuki)
(Assignee)

Comment 15

4 years ago
(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #14)
> (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> > (In reply to Nick Cameron [:nrc] from comment #9)
> > > > http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> > > > nsAppShell.cpp#l228
> > > > 
> > > > This will trigger the WaitMessage race, and I believe it can easily be
> > > > replaced with a GetMessage.
> > > 
> > > Yep, this is where we've been getting the errors. I agree that it should be
> > > easy to replace with GetMessage.
> > 
> > GetMessage would require restructuring logic in the nsAppShell event loop.
> > I'm not so keen on doing that.
> 
> Did you mean that we need to use PeekMassage() when mayWait is false?
> Otherwise, I have no idea why there is the requirement restructing it.
> Currently, we don't need to use PM_NOREMOVE in nsAppShell...

See https://bugzilla.mozilla.org/show_bug.cgi?id=896896#c0

The favourite plan for now is to use MsgWaitForMultipleObjectsEx rather than WaitMessage, and make the dispatcher process all messages at once and wait, rather than one at a time and make explicit calls to WaitMessage.

This means we will no longer call WinUtils::WaitMessage and so miss out on the TSF stuff. How big of an issue is that?
Flags: needinfo?(masayuki)
(In reply to Nick Cameron [:nrc] from comment #15)
> The favourite plan for now is to use MsgWaitForMultipleObjectsEx rather than
> WaitMessage, and make the dispatcher process all messages at once and wait,
> rather than one at a time and make explicit calls to WaitMessage.

I wouldn't change anything except calling MsgWaitForMultipleObjectsEx instead of WaitMessage. I would not change the way messages are actually processed. This should not affect TSF.
(Assignee)

Comment 17

4 years ago
(In reply to Nick Cameron [:nrc] from comment #15)
> (In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #14)
> > (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> > > (In reply to Nick Cameron [:nrc] from comment #9)
> > > > > http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> > > > > nsAppShell.cpp#l228
> > > > > 
> > > > > This will trigger the WaitMessage race, and I believe it can easily be
> > > > > replaced with a GetMessage.
> > > > 
> > > > Yep, this is where we've been getting the errors. I agree that it should be
> > > > easy to replace with GetMessage.
> > > 
> > > GetMessage would require restructuring logic in the nsAppShell event loop.
> > > I'm not so keen on doing that.
> > 
> > Did you mean that we need to use PeekMassage() when mayWait is false?
> > Otherwise, I have no idea why there is the requirement restructing it.
> > Currently, we don't need to use PM_NOREMOVE in nsAppShell...
> 
> See https://bugzilla.mozilla.org/show_bug.cgi?id=896896#c0
> 
> The favourite plan for now is to use MsgWaitForMultipleObjectsEx rather than
> WaitMessage, and make the dispatcher process all messages at once and wait,
> rather than one at a time and make explicit calls to WaitMessage.
> 
> This means we will no longer call WinUtils::WaitMessage and so miss out on
> the TSF stuff. How big of an issue is that?

Sorry s/WinUtils::WaitMessage/WinUtils::PeekMessage
(Assignee)

Comment 18

4 years ago
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #16)
> (In reply to Nick Cameron [:nrc] from comment #15)
> > The favourite plan for now is to use MsgWaitForMultipleObjectsEx rather than
> > WaitMessage, and make the dispatcher process all messages at once and wait,
> > rather than one at a time and make explicit calls to WaitMessage.
> 
> I wouldn't change anything except calling MsgWaitForMultipleObjectsEx
> instead of WaitMessage. I would not change the way messages are actually
> processed. This should not affect TSF.

Yes, I was thinking we would have to do something extra for Metro, but it doesn't look like it.
Flags: needinfo?(masayuki)
(In reply to Nick Cameron [:nrc] from comment #15)
> (In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) from comment #14)
> > (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> > > (In reply to Nick Cameron [:nrc] from comment #9)
> > > > > http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/widget/windows/
> > > > > nsAppShell.cpp#l228
> > > > > 
> > > > > This will trigger the WaitMessage race, and I believe it can easily be
> > > > > replaced with a GetMessage.
> > > > 
> > > > Yep, this is where we've been getting the errors. I agree that it should be
> > > > easy to replace with GetMessage.
> > > 
> > > GetMessage would require restructuring logic in the nsAppShell event loop.
> > > I'm not so keen on doing that.
> > 
> > Did you mean that we need to use PeekMassage() when mayWait is false?
> > Otherwise, I have no idea why there is the requirement restructing it.
> > Currently, we don't need to use PM_NOREMOVE in nsAppShell...
> 
> See https://bugzilla.mozilla.org/show_bug.cgi?id=896896#c0
> 
> The favourite plan for now is to use MsgWaitForMultipleObjectsEx rather than
> WaitMessage, and make the dispatcher process all messages at once and wait,
> rather than one at a time and make explicit calls to WaitMessage.

I'm being confused, the favorite plan is not the attached patch?

> This means we will no longer call WinUtils::PeekMessage and so miss out on
> the TSF stuff. How big of an issue is that?

It completely breaks current TSF message handling. ITfMessagePump::PeekMessage() gets message *before* TSF handles it. WinAPI's PeekMessage() gets message *after* TSF handles it. This is very important for DOM keydown event implementation (in the future, I'm working for it now).
(Assignee)

Comment 20

4 years ago
Created attachment 782308 [details] [diff] [review]
patch using MsgWaitForMultipleObjectsEx
Attachment #782308 - Flags: review?(roc)
Attachment #782308 - Flags: feedback?(benjamin)
(Assignee)

Comment 21

4 years ago
Try push: https://tbpl.mozilla.org/?tree=Try&rev=6655175742ab

I have verified locally that this fixes the problem with OMTC mochitest 5. I can't verify this on Try just yet because my patches have regressed horribly.
Attachment #782308 - Flags: review?(roc) → review+

Comment 22

4 years ago
Comment on attachment 782308 [details] [diff] [review]
patch using MsgWaitForMultipleObjectsEx

I'm still a little worried about the WaitMessage in the mouse loop, but I can't think of a way around it right now.
Attachment #782308 - Flags: feedback?(benjamin) → feedback+
(Assignee)

Comment 23

4 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/4887f7d34df2
https://hg.mozilla.org/mozilla-central/rev/4887f7d34df2
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla25

Updated

4 years ago
Depends on: 933733
Backed out from Fx26 due to bug 933733.
https://hg.mozilla.org/releases/mozilla-beta/rev/b71b215ff897
status-firefox25: --- → fixed
status-firefox26: --- → wontfix
(Assignee)

Comment 26

4 years ago
I filed bug 937306 to work on a new version of this, but which will hopefully not break things like bug 933733.
Let's backout of mozilla-release as well (see https://bugzilla.mozilla.org/show_bug.cgi?id=933733#c34)
Fx25 backout: https://hg.mozilla.org/releases/mozilla-release/rev/33b48d5a8be3
status-firefox25: fixed → wontfix
status-firefox27: --- → fixed

Updated

4 years ago
Keywords: verifyme
(Assignee)

Comment 29

4 years ago
Backed out of Fx27 too: https://hg.mozilla.org/releases/mozilla-beta/rev/b952d0e9de86

(Carrying a=lsblakk from the other backouts, hope that is ok).
status-firefox27: fixed → wontfix
status-firefox28: --- → fixed
Why don't you just back out everywhere until bug 937306 is fixed?
(Assignee)

Comment 31

4 years ago
(In reply to Masatoshi Kimura [:emk] from comment #30)
> Why don't you just back out everywhere until bug 937306 is fixed?

Because we need this fix for Windows OMTC and bug 933733 affects a minority of users (none on Aurora/Nightly that we know of).

Comment 32

4 years ago
Is there any manual testing needed here? I see this bug caused failures of automated tests, so the fix should already be verified by those tests.
Flags: needinfo?(ncameron)
(Assignee)

Comment 33

4 years ago
(In reply to Ioana Budnar, QA [:ioana] from comment #32)
> Is there any manual testing needed here? I see this bug caused failures of
> automated tests, so the fix should already be verified by those tests.

Not at the moment, thanks.
Flags: needinfo?(ncameron)

Updated

4 years ago
Keywords: verifyme
Given comment 33 this fix will not be manually verified by QA. If you believe this warrants extra QA attention please nominate for testing by removing this whiteboard tag and adding the verifyme keyword. Please also provide any details you have that may inform our testing.
Whiteboard: [qa-]
(Assignee)

Updated

4 years ago
Blocks: 937306
Relanded: https://hg.mozilla.org/integration/mozilla-inbound/rev/62d1670e37d9
https://hg.mozilla.org/mozilla-central/rev/62d1670e37d9
This was backed from 32 as well.
https://hg.mozilla.org/releases/mozilla-release/rev/e99c2d9a3243
status-firefox32: --- → wontfix
status-firefox33: --- → fixed
status-firefox34: --- → fixed
status-firefox35: --- → fixed
Depends on: 1105386
You need to log in before you can comment on or make changes to this bug.