Properly fix our WaitMessage use

RESOLVED FIXED in Firefox 33

Status

()

Core
Widget: Win32
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: nrc, Assigned: aklotz)

Tracking

(Depends on: 1 bug)

Trunk
mozilla35
x86_64
Windows 7
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox33 fixed, firefox34 fixed, firefox35 fixed)

Details

Attachments

(2 attachments, 1 obsolete attachment)

(Reporter)

Description

4 years ago
Bug 896896 repaced WaitMessage with MsgWaitForMultipleObjectsEx, that was good. However, it broke things for some users (bug 933733), which is bad. Let's try and come up with a better fix here.
Do we know what went wrong? Was there a nested event loop, or did the problem occur in the toplevel event loop? Can we give the users who experienced problems a special build with extra logging and ask them to send us their logs?
(Reporter)

Comment 2

4 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #1)
> Do we know what went wrong? Was there a nested event loop, or did the
> problem occur in the toplevel event loop? Can we give the users who
> experienced problems a special build with extra logging and ask them to send
> us their logs?

ATM, I have no idea what went wrong. At least some of the users have been helpful so far in running the moz regression finder tool, so I expect they would be willing and able to run a build with logging. I'll put something together for that today.
(Reporter)

Comment 3

4 years ago
I've been banging my head against the code a little bit. I can't repro bug 933733 and I can't work out what could be special about those users for that bug to happen.

I have verified that we do hit the ::WaitMessage in the patch from 896896, so that means we could end up with the same problem as that tried to fix if the PeekMessage dispatches a sent message and in dealing with that we hit the Chromium ipc message loop. However, I don't see why that should happen more frequently calling WinUtils::WaitMessage rather than ::WaitMessage - I would expect to see it less often. Perhaps it happens less often, but the effects are more dramatic when it does? I can't see how that would come about though.

I would like to replace the idiom around the ::WaitMessage with something else. I don't know what though. I guess I would also like to verify if we need it at all and how often it is hit. I'll rollup some experimental builds with logging and try and get some of the users who reported 933733 to help out.
(Reporter)

Comment 4

4 years ago
I added links to a build with some logging around our uses of WaitMessage, and a build without the extra check and call to WaitMessage to bug 933733 and asked for volunteers.

I used a build without that extra check for a bunch of browsing and found no problems (but then I can't repro 933733). Try push with the same change looks good: https://tbpl.mozilla.org/?tree=Try&rev=fd2184b61c29. Of course, it's only a guess that this might help, but lets see.

(And just to preserve the link, here is a link to builds with logging on try: https://tbpl.mozilla.org/?tree=Try&rev=ed7bd96a1254)
See also bug 912521.
(Reporter)

Comment 6

4 years ago
I got some info back from a user over on bug 933733 and via email. The speculative fix of removing the new ::WaitMessage call did not help. We are hitting that code path, but (according to the logs) we are never actually waiting.

Next step is to log other places where we might be waiting, I guess.
(Reporter)

Comment 7

4 years ago
The last round of logging turned up something vaguely useful - when we lock up it is in MsgWaitForMultipleObjectsEx called from WinUtils::WaitForMessage. If the wakeup message is in the queue, MsgWaitForMultipleObjectsEx should _always_ return. So I wonder if we are missing it somehow, or if we are hitting some rare bug in Windows code where it is not returning? I suspect the former.
(Reporter)

Comment 8

4 years ago
Do you have any ideas for things to try here? We know where the problem is now (see comment 7), but I have no idea why it is happening, or how to fix it.
Flags: needinfo?(benjamin)
(Reporter)

Updated

4 years ago
Depends on: 896896, 933733
(Reporter)

Comment 9

4 years ago
For the benefit of the gfx folks, the status of this bug is that we need some solution to enable OMTC on Windows. The previous setup which we have on beta and release makes us fail tests with OMTC turned on. I addressed that in bug 896896, but the new version caused us to regress in some rare cases (bug 933733), which is why we reverted for beta and release.

So currently we can choose from failing reftest (old version) or user errors (new version). This bug is to find a solution which has neither problem, we need that before Windows OMTC rides the trains (not necessarily before OMTC gets turned on for nightly, because we have had no reports of problems with the new version on Aurora or Nightly).
Blocks: 899787, 899785
I don't have any great suggestions.

If we wanted to deeply rearchitect this, we could stop using messages to wake up the event loop and instead using an event object. That way when we post an XPCOM event to the main event loop, we'd signal the event object, and in MsgWaitForMultipleObjectsEx we'd pass the event into the handle array. I bet that would be a pretty risky change, though.
Flags: needinfo?(benjamin)
(Reporter)

Updated

4 years ago
Assignee: ncameron → nobody
Now that bug 896896 was backed out in bug 933733, I don't know if you want to use this as the "fix and re-land" bug or if you want to just reopen bug 896896 and do the work there.
Blocks: 912521
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #10)
> I don't have any great suggestions.
> 
> If we wanted to deeply rearchitect this, we could stop using messages to
> wake up the event loop and instead using an event object. That way when we
> post an XPCOM event to the main event loop, we'd signal the event object,
> and in MsgWaitForMultipleObjectsEx we'd pass the event into the handle
> array. I bet that would be a pretty risky change, though.

Well, I just wasted almost a week trying to find this problem (for some reason it doesn't repro on my machines, probably because it's timing problems). We'll need to address it, should I implement this solution or do you have any other ideas?
Assignee: nobody → bas
Status: NEW → ASSIGNED
Flags: needinfo?(benjamin)

Comment 13

3 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #10)
> I don't have any great suggestions.
> 
> If we wanted to deeply rearchitect this, we could stop using messages to
> wake up the event loop and instead using an event object. That way when we
> post an XPCOM event to the main event loop, we'd signal the event object,
> and in MsgWaitForMultipleObjectsEx we'd pass the event into the handle
> array. I bet that would be a pretty risky change, though.

I'd be interested to see what happens if we're in a modal dialog dispatch loop with this, I wonder if this would get picked up at all. If it does, that would be great. You might have to use some sort of hybrid system that uses both based on the nesting level.

Updated

3 years ago
Blocks: 1005317
Which kind of modal dialog dispatch loop? Any of our standard loops would need to be corrected. Filtered nested loops such as that used by MessageBox probably shouldn't wake up in those conditions anyway, should they?

Bas, since we really don't why it's broken, we don't know whether the approach from comment 10 will actually work.
Flags: needinfo?(benjamin)

Comment 15

3 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #14)
> Filtered nested loops such as that used by MessageBox
> probably shouldn't wake up in those conditions anyway, should they?

I think you were suggesting we replace the posted event in ScheduleNativeEventCallback with the signaled event, or am I missing something? With that, Windows modal loops won't generate callbacks for us and xpcom events won't get processed.
What Windows modal loops remain in the browser? I suspect that any of those that remain *should* be blocking XPCOM (or we should continue removing them).

Comment 17

3 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #16)
> What Windows modal loops remain in the browser? I suspect that any of those
> that remain *should* be blocking XPCOM (or we should continue removing them).

Good question, I don't know of every modal dialog use off hand. A couple I do know of - file and color pickers make sync calls on desktop.
Fwiw, my suggestion for now is to reland Nick Cameron's patch on nightly for now (it's sat on nightly without significant issues for over 3 months), and then work from there to find a way to resolve the problems some people experience because of it.
I don't understand your suggestion. We know that nrc's patch causes regressions. What makes that acceptable?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #19)
> I don't understand your suggestion. We know that nrc's patch causes
> regressions. What makes that acceptable?

The fact Nick Cameron's patch is the only thing we have that fixes -this- problem and we currently have no way to determine what the problem's with it are. We also know that we had this patch on nightly -and- aurora for several months without issues, and only on Beta were we able to single out users who ran into trouble with it.

In order to get a significant amount of the cycle for OMTC to be on in order for people to report any issues they have, I believe the prudent thing to do would be to reland nrc's patch, turn on OMTC, and contact those people experiencing issues with this patch again to try and resolve them before the merge to Aurora. (Which we know is well before a significant amount of people is affected, after all this patch stuck on m-c for 3-4 months before)
Do people fee; that is an acceptable approach?

Comment 22

3 years ago
> The fact Nick Cameron's patch is the only thing we have that fixes -this-
> problem and we currently have no way to determine what the problem's with it
> are. We also know that we had this patch on nightly -and- aurora for several
> months without issues, and only on Beta were we able to single out users who
> ran into trouble with it.

On beta we get a big jump in adis so I'm not surprising we started to see reports of the problem there. 

If we land this what's the plan on tracking down the remaining issue with nrc's patch?

Comment 23

3 years ago
(In reply to Jim Mathies [:jimm] from comment #22)
> If we land this what's the plan on tracking down the remaining issue with
> nrc's patch?

(In reply to Bas Schouten (:bas.schouten) from comment #20)
> In order to get a significant amount of the cycle for OMTC to be on in order
> for people to report any issues they have, I believe the prudent thing to do
> would be to reland nrc's patch, turn on OMTC, and contact those people
> experiencing issues with this patch again to try and resolve them before the
> merge to Aurora. (Which we know is well before a significant amount of
> people is affected, after all this patch stuck on m-c for 3-4 months before)

Sorry I was skimming and missed thisd part. With your approach I'm concerned that we'll just punt on the remaining problem until it becomes a crisis on aurora or beta prior to a merge.
(In reply to Jim Mathies [:jimm] from comment #23)
> (In reply to Jim Mathies [:jimm] from comment #22)
> > If we land this what's the plan on tracking down the remaining issue with
> > nrc's patch?
> 
> (In reply to Bas Schouten (:bas.schouten) from comment #20)
> > In order to get a significant amount of the cycle for OMTC to be on in order
> > for people to report any issues they have, I believe the prudent thing to do
> > would be to reland nrc's patch, turn on OMTC, and contact those people
> > experiencing issues with this patch again to try and resolve them before the
> > merge to Aurora. (Which we know is well before a significant amount of
> > people is affected, after all this patch stuck on m-c for 3-4 months before)
> 
> Sorry I was skimming and missed thisd part. With your approach I'm concerned
> that we'll just punt on the remaining problem until it becomes a crisis on
> aurora or beta prior to a merge.

Well, I don't think that will happen here since we -need- this for OMTC to stay on. So it will become the highest priority issue once we switch OMTC on (hopefully this week). I'm also fine with doing something else, but this is a -hard- blocker for OMTC on windows. So we have to fix this somehow, and I don't see other routes that can bring progress.

Comment 25

3 years ago
(In reply to Bas Schouten (:bas.schouten) from comment #24)
> Well, I don't think that will happen here since we -need- this for OMTC to
> stay on. So it will become the highest priority issue once we switch OMTC on
> (hopefully this week). I'm also fine with doing something else, but this is
> a -hard- blocker for OMTC on windows. So we have to fix this somehow, and I
> don't see other routes that can bring progress.

Ok. Note bug 1005317 has a good test case for situations where WaitMessage doesn't free up, which nrc's patch should fix. But it could be useful in figuring out which calls are doing the message updating.

FWIW I did some testing and was able to rule out the HangMonitor (which has a bunch of PeekMessage calls in it) and WindowsMessageLoop, which didn't seem to interact with the omtc channel (I'm guessing it's not a ui loop).
(In reply to Jim Mathies [:jimm] from comment #25)
> (In reply to Bas Schouten (:bas.schouten) from comment #24)
> > Well, I don't think that will happen here since we -need- this for OMTC to
> > stay on. So it will become the highest priority issue once we switch OMTC on
> > (hopefully this week). I'm also fine with doing something else, but this is
> > a -hard- blocker for OMTC on windows. So we have to fix this somehow, and I
> > don't see other routes that can bring progress.
> 
> Ok. Note bug 1005317 has a good test case for situations where WaitMessage
> doesn't free up, which nrc's patch should fix. But it could be useful in
> figuring out which calls are doing the message updating.
> 
> FWIW I did some testing and was able to rule out the HangMonitor (which has
> a bunch of PeekMessage calls in it) and WindowsMessageLoop, which didn't
> seem to interact with the omtc channel (I'm guessing it's not a ui loop).

Well, I think Nick's patch is a good change regardless. I suspected WindowsMessageLoop myself, it's interesting that that isn't the one, but I suppose maybe you're right.

Right now I'm mainly wondering what causes Nick's patch to cause trouble for such a small set of people. But I'm hoping we can figure that out once we land it.
We couldn't figure it out the first time, even with extensive logging. What makes you think this time will be different?

As a rule we don't land known-broken code into the tree. If you're going to fix the brokenness of this path, why not do it now regardless, by making more try builds with logging as nick did in bug 933733 and diagnosing the root issue? Or trying my alternate suggestion and see if it works.

I really don't want to be in a position where we turn on OMTC on nightly but have to turn it off repeatedly in Aurora because we'd be regressing bug 933733.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #27)
> We couldn't figure it out the first time, even with extensive logging. What
> makes you think this time will be different?
> 
> As a rule we don't land known-broken code into the tree. If you're going to
> fix the brokenness of this path, why not do it now regardless, by making
> more try builds with logging as nick did in bug 933733 and diagnosing the
> root issue? Or trying my alternate suggestion and see if it works.
> 
> I really don't want to be in a position where we turn on OMTC on nightly but
> have to turn it off repeatedly in Aurora because we'd be regressing bug
> 933733.

Well, we might have to turn it off once (although I hope not). But that's better than the situation we're in now.

But the main problem is we never really focused too much on fixing it, because OMTC wasn't on by default on windows. Currently every week we get approximately 2-3 new regressions with OMTC on, meaning we're just playing catch up all the time. Once OMTC is on in central this can no longer happen and we can -actually- spend time on fixing bug 933733. I'm even willing to land this and then completely rewrite out event processing code to subsequently land that. But at the moment I can't do any of this because I'm spending all my time fixing regressions when OMTC is on. I want to get us out of that state first.
This bug appears to have disappeared.. that's not entirely satisfying, but at this point if noone experiences it there's not much we can do!
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Updated

3 years ago
No longer blocks: 912521

Updated

3 years ago
No longer blocks: 1005317
Duplicate of this bug: 1005317
Created attachment 8487480 [details] [diff] [review]
WinUtils::WaitForMessage improvements

I'm not really happy with the way the Chromium-style code works.

In particular, executing a PeekMessage followed by a WaitMessage could be bad if the peek found mouse messages but there were other input messages in the queue. These unprocessed messages are no longer considered "new" after the peek, so WaitMessage will not resume the thread for them.

Furthermore, there are far more messages involved with synchronized input queues than just mouse messages.

The code in this patch is very similar to the waiting code for synchronous ipc in ipc/glue/WindowsMessageLoop.cpp.

I think that this is a better solution because it takes into account all possible messages that could be sent to a child window with an attached input queue. When it determines that the thread resumption was due to another thread's messages, it yields to allow those other threads to handle the messages, thereby preventing busy waiting.
Assignee: bas → aklotz
Status: RESOLVED → REOPENED
Attachment #8487480 - Flags: review?(jmathies)
Resolution: FIXED → ---

Comment 32

3 years ago
Comment on attachment 8487480 [details] [diff] [review]
WinUtils::WaitForMessage improvements

Review of attachment 8487480 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/windows/WinUtils.cpp
@@ +622,5 @@
>  void
>  WinUtils::WaitForMessage()
>  {
> +  while (true) {
> +    DWORD result = ::MsgWaitForMultipleObjectsEx(0, NULL, INFINITE, QS_ALLINPUT,

please use MOZ_QS_ALLEVENT (defined in WinUtils.h)

@@ +638,3 @@
>      MSG msg = {0};
> +    if (haveSentMessagesPending ||
> +        ::PeekMessageW(&msg, nullptr, 0, 0, PM_NOREMOVE)) {

I'm confused, if we have send messages pending, we don't call PeekMessage?
(In reply to Jim Mathies [:jimm] from comment #32)
> Comment on attachment 8487480 [details] [diff] [review]
> 
> please use MOZ_QS_ALLEVENT (defined in WinUtils.h)
> 

Will do.

> I'm confused, if we have send messages pending, we don't call PeekMessage?

It should be called once we've returned from this function and the full message loop has run. My intention here was to ensure that, when possible, this function truly does nothing but waiting and leaves that actual PeekMessage call to the caller.

Comment 34

3 years ago
Comment on attachment 8487480 [details] [diff] [review]
WinUtils::WaitForMessage improvements

Ok, I see what you're doing here. This looks good, lets test it. I'd love to see a full try push.
Attachment #8487480 - Flags: review?(jmathies) → review+
Let's see how this goes...
https://tbpl.mozilla.org/?tree=Try&rev=72e7a82b7ca3
Created attachment 8487500 [details] [diff] [review]
WinUtils::WaitForMessage improvements

Fixed nit and updated commit msg. Carrying forward r+.
Attachment #8487480 - Attachment is obsolete: true
Attachment #8487500 - Flags: review+
Philipp, once this try build comes back, could you have our helpful user test with it? :-)
Flags: needinfo?(madperson)
Any news here? We really shouldn't let this sit here :). Your try run is all in the green!
Flags: needinfo?(aklotz)
Flags: needinfo?(aklotz)
Keywords: checkin-needed
https://hg.mozilla.org/integration/mozilla-inbound/rev/a925804b5cb2
Keywords: checkin-needed
had feedback from one affected user testing the try-build from https://tbpl.mozilla.org/?tree=Try&rev=72e7a82b7ca3
he said the situation improved, but hangs were still occurring occasionally.
Flags: needinfo?(madperson)
https://hg.mozilla.org/mozilla-central/rev/a925804b5cb2
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35

Comment 42

3 years ago
Is that something we need to get to 33 if we want to ship OMTC there?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #42)
> Is that something we need to get to 33 if we want to ship OMTC there?

Judging from comment 40. It improves things so we probably should. We just need to hope it fixes things 'sufficiently'. Otherwise we'll need more improvements. Maybe Jimm's patch from Bug 1066242 will also help.
Great to get this feedback Philipp, we have one extra patch on central now and this one has landed. Could you see if current central is improved further for the affected user?
Flags: needinfo?(madperson)
Comment on attachment 8487500 [details] [diff] [review]
WinUtils::WaitForMessage improvements

Approval Request Comment
[Feature/regressing bug #]: Windows OMTC
[User impact if declined]: Browser unresponsive for some users unless there is mouse activity.
[Describe test coverage new/current, TBPL]: Landed on m-c
[Risks and why]: Moderate; This patch affects the main thread event loop on Windows builds. OTOH it is based on a well-tested implementation in our IPC code.
[String/UUID change made/needed]: None
Attachment #8487500 - Flags: approval-mozilla-beta?
Attachment #8487500 - Flags: approval-mozilla-aurora?
status-firefox33: --- → affected
status-firefox34: --- → affected
status-firefox35: --- → fixed
Comment on attachment 8487500 [details] [diff] [review]
WinUtils::WaitForMessage improvements

I am going to take it for aurora to have more testing coverage.
If no critical issues are found until then, I will take it for beta 5 (gtb next Thursday).
Attachment #8487500 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/7546fedad918
status-firefox34: affected → fixed
Comment on attachment 8487500 [details] [diff] [review]
WinUtils::WaitForMessage improvements

Taking it now to have it in beta 5.
Attachment #8487500 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/60aecc9d11ab
status-firefox33: affected → fixed
I hit this again today, using the 2014-09-18 nightly on Win 7. It seems to have gone away, but it was definitely present. I had to move the mouse in order to get keyboard-initiated scrolling to happen.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Created attachment 8494880 [details] [diff] [review]
Make chromium message pump call into WinUtils::WaitForMessage

Given Ted's report, I've done some more thinking on this issue and I think that we should have the Chromium message pump call into the same implementation that we're using for the main thread. Otherwise a chromium UI thread syncs up with the main thread (likely), we could see the same problem.
Attachment #8494880 - Flags: review?(jmathies)
Comment on attachment 8494880 [details] [diff] [review]
Make chromium message pump call into WinUtils::WaitForMessage

I'm going to move this to a separate bug, it'll be easier to track that way.
Attachment #8494880 - Flags: review?(jmathies)
Depends on: 1072752
See bug 1072752.
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago3 years ago
Resolution: --- → FIXED
requested further feedback from a handful of affected users, unfortunately none of them followed up on it and replied...
Flags: needinfo?(madperson)
Depends on: 1105386
You need to log in before you can comment on or make changes to this bug.