Closed Bug 1209464 Opened 9 years ago Closed 9 years ago

Window neutering missing from MessageChannel::WaitForInterruptNotify

Categories

(Core Graveyard :: Plug-ins, defect)

42 Branch
All
Windows
defect
Not set
normal

Tracking

(firefox41+ wontfix, firefox42+ fixed, firefox43+ fixed, firefox44 fixed)

RESOLVED FIXED
mozilla44

People

(Reporter: bobowen, Assigned: bugzilla)

References

Details

Attachments

(5 files)

While testing bugs 1185639 and 1189709 applied against release, I noticed that the right-click Flash settings menu often takes a while to open.

Things seem to hang while it responds, although it always comes back eventually.

I'm occasionally seeing the plugin hang dialog with this as well.

I want to get bugs 1185639 and 1189709 uplifted to release to fix more serious hangs.
This is far less serious, but annoying.

This is happening in beta, aurora and nightly with e10s disabled.
Doesn't happen when e10s is enabled.
This is just on windowless plugins.
Summary: Right click menu on Flash is often slow to open on Windows. → Right click menu on windowless Flash is often slow to open on Windows.
Stack from firefox.exe main thread when waiting for menu.
Stack from plugin-container.exe main thread when waiting for menu.
These two stacks look more like some sort of deadlock.
The corresponding plugin-container.exe one.

If I click on another window the menu appears and the browser is freed up.
Looks like a PluginInstanceChild::SendShow call is having its reply delayed. Everything looks timely on the firefox side, but the reply isn't making it back to the main plugin-container thread in a timely fashion. More investigation is necessary, but I don't consider this serious enough to hold back uplift of those other patches at the moment.
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #6)
> Looks like a PluginInstanceChild::SendShow call is having its reply delayed.
> Everything looks timely on the firefox side, but the reply isn't making it
> back to the main plugin-container thread in a timely fashion. More
> investigation is necessary, but I don't consider this serious enough to hold
> back uplift of those other patches at the moment.

I don't think this is quite right.

When you right click, you get a WM_RBUTTONDOWN, a WM_KILLFOCUS, and 2 or 3 WM_IME_SETCONTEXT messages sent down to the plugin with a load of nested CallNPP_HandleEvents.

At the same time, the plugin process is sending back a SendShow.

If all the replies (and the Show message at the end) are in mPending before we get to [1], then we satisfy the latest call in the nested CallNPP_HandleEvent and end up at [2].
But the message from SendShow is still in mPending.
So, we then go back up the stack to the previous nesting of CallNPP_HandleEvent and end up at [3].
But, now there are no messages returning to signal mEvent.
You only get out of this when some other Windows event causes another nested CallNPP_HandleEvent, and when the reply comes back the Show message is the first in mPending.

Not sure all the details are correct there, but I think that's the general issue.


[1] https://dxr.mozilla.org/mozilla-central/rev/891ee0d0ba3ec42b6484cf0205b3c95e21c58f74/ipc/glue/WindowsMessageLoop.cpp#1149
[2] https://dxr.mozilla.org/mozilla-central/rev/891ee0d0ba3ec42b6484cf0205b3c95e21c58f74/ipc/glue/MessageChannel.cpp#1126
[3] https://dxr.mozilla.org/mozilla-central/rev/891ee0d0ba3ec42b6484cf0205b3c95e21c58f74/ipc/glue/WindowsMessageLoop.cpp#1145
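The scenario above can be modelled with a toy single-signal queue. All names here are invented stand-ins for mPending/mEvent, and the batching behaviour is a simplification of the real logic in MessageChannel.cpp and WindowsMessageLoop.cpp: a batch of arrivals signals the shared event once, the innermost nested wait consumes that signal along with its own reply, and the outer nesting then blocks even though its data is already queued.

```cpp
#include <cassert>
#include <deque>
#include <vector>

struct Msg { int seqno; bool isReply; };

std::deque<Msg> mPending;  // shared incoming-message queue
int gEventSignals = 0;     // times the shared event has been signalled

// A batch of messages arrives and the shared event fires once.
void Deliver(const std::vector<Msg>& batch) {
  for (const Msg& m : batch) mPending.push_back(m);
  ++gEventSignals;
}

// One nested wait: consume a signal and pull out the matching reply.
// Returns false when no signal is left, i.e. the wait would hang.
bool WaitForReply(int seqno) {
  if (gEventSignals == 0) return false;  // nothing will ever wake us
  --gEventSignals;
  for (auto it = mPending.begin(); it != mPending.end(); ++it) {
    if (it->isReply && it->seqno == seqno) {
      mPending.erase(it);
      return true;
    }
  }
  return false;
}
```

In this model the Show message (and even the outer nesting's own reply) can sit in mPending indefinitely, matching the "no messages returning to signal mEvent" symptom described above.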
If I change:
https://dxr.mozilla.org/mozilla-central/rev/891ee0d0ba3ec42b6484cf0205b3c95e21c58f74/ipc/glue/WindowsMessageLoop.cpp#1140

to:
      if (!Connected() ||
          (AwaitingInterruptReply() && InterruptEventOccurred())) {

We check to see whether we've actually had an out-of-order reply to the message we are about to wait for, or whether another one is pending.

This seems to do the trick.

What do you think?
Flags: needinfo?(aklotz)
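The proposed condition can be isolated as a small predicate; this is a sketch only, with the real MessageChannel methods AwaitingInterruptReply() and InterruptEventOccurred() reduced to booleans:

```cpp
#include <cassert>

// Sketch of the proposed loop-exit test: stop waiting either when the
// channel has disconnected (the original check), or when the reply we
// are about to wait for -- or another interrupt event -- has already
// arrived, e.g. as an out-of-turn reply drained by a deeper nesting.
bool ShouldStopWaiting(bool connected,
                       bool awaitingInterruptReply,
                       bool interruptEventOccurred) {
  return !connected ||
         (awaitingInterruptReply && interruptEventOccurred);
}
```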
(In reply to Bob Owen (:bobowen) from comment #8)

>       if (!Connected() ||
>           (AwaitingInterruptReply() && InterruptEventOccurred())) {

Just thought, it probably makes sense (if it makes sense at all) to have this new check at the end of the loop, as it should never be true to start with.
This code is going round in my head now!

The whole way we handle the interrupt replies seems a bit odd.

If mInterruptStack were changed to be a vector instead of a stack, we could hold a pair of Messages, for the request and reply.
When an interrupt reply comes in we could use the (abs(msg.seqno() - mInterruptStack.front().seqno())) to find the position in the vector to store our reply.
Then we would immediately know when our reply was in and we wouldn't need mOutOfTurnReplies.

mPending would then only be for incoming requests (as we handle incoming sync replies differently).

This would be a bigger change, so not suitable for a release fix.
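The restructuring floated here might look roughly like this. It is only a sketch: Message is reduced to a sequence number, and the indexing assumes nested requests get consecutive seqnos, per the abs(...) expression above.

```cpp
#include <cassert>
#include <cstdlib>
#include <utility>
#include <vector>

// Stand-in for IPC::Message: a sequence number plus a flag so an
// empty reply slot can be told apart from a filled one.
struct Message {
  int seqno = 0;
  bool valid = false;
};

struct InterruptStack {
  // Pairs of {request, reply}, oldest outstanding request first.
  std::vector<std::pair<Message, Message>> calls;

  void PushRequest(int seqno) {
    calls.push_back({Message{seqno, true}, Message{}});
  }

  // File an incoming reply straight into its slot, computed from its
  // seqno distance to the oldest outstanding request; this is what
  // would replace the separate mOutOfTurnReplies map.
  void StoreReply(int seqno) {
    size_t idx = std::abs(seqno - calls.front().first.seqno);
    calls[idx].second = Message{seqno, true};
  }

  // The innermost call can now tell immediately whether its own reply
  // has arrived, without scanning mPending.
  bool TopReplyReady() const { return calls.back().second.valid; }
};
```

With replies filed by slot, the wait loop would never be fooled by a reply that arrived for a different nesting level, which is exactly the confusion described in comment 7.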
OK, what I'm seeing here is actually a regression from bug 1189709. That patch neglected to add neutering back to the appropriate location in MessageChannel::WaitForInterruptNotify when the scope of the NeuteredWindowRegion was reduced in MessageChannel::Call.
Flags: needinfo?(aklotz)
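The regression and its fix can be pictured with a toy RAII region. The class name mirrors the real NeuteredWindowRegion but the body here is invented: before bug 1189709, MessageChannel::Call held the region across the whole call, so the interrupt wait always ran neutered; after the scope was narrowed, WaitForInterruptNotify needed its own region, which is what this patch restores.

```cpp
#include <cassert>

// Toy neutering model: a depth counter stands in for the real
// per-thread window-neutering state.
static int gNeuterDepth = 0;

struct NeuteredWindowRegion {
  NeuteredWindowRegion() { ++gNeuterDepth; }
  ~NeuteredWindowRegion() { --gNeuterDepth; }
};

static bool WindowsNeutered() { return gNeuterDepth > 0; }

// With the fix: the wait re-establishes neutering for its duration.
bool WaitWithFix() {
  NeuteredWindowRegion region;  // the region the patch restores
  return WindowsNeutered();
}

// The regressed version: the wait runs with windows un-neutered,
// letting synchronous Win32 messages re-enter and deadlock.
bool WaitRegressed() {
  return WindowsNeutered();
}
```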
Attached patch PatchSplinter Review
Assignee: nobody → aklotz
Status: NEW → ASSIGNED
Attachment #8669530 - Flags: review?(jmathies)
Ah I see, this seems to stop it from getting into the nested intr messages in the first place.

I wonder if we can ever get into the nested circumstance; if we can, the issue I saw could still occur.
[Tracking Requested - why for this release]: This is serious enough that I think it should go into a 41 chemspill. I also think it will take care of bug 1210665 and others.
Updating the summary as the consequences are much more significant than Flash context menus.
Summary: Right click menu on windowless Flash is often slow to open on Windows. → Window neutering missing from MessageChannel::WaitForInterruptMessage
Summary: Window neutering missing from MessageChannel::WaitForInterruptMessage → Window neutering missing from MessageChannel::WaitForInterruptNotify
[Tracking Requested - why for this release]:
See comment #14 - not sure if it's bad enough to trigger a point-release by itself, but we should track it for Dev Edition and Beta.
This makes Flash settings slow to open (see comment #0) and some sites like in bug 1210665 can get Firefox to freeze up completely.
Indeed, tracking for 42 & 43; cf. comment #17 for the rationale.
Attachment #8669530 - Flags: review?(jmathies) → review+
(In reply to Bob Owen (:bobowen) from comment #8)
> If I change:
> https://dxr.mozilla.org/mozilla-central/rev/
> 891ee0d0ba3ec42b6484cf0205b3c95e21c58f74/ipc/glue/WindowsMessageLoop.cpp#1140
> 
> to:
>       if (!Connected() ||
>           (AwaitingInterruptReply() && InterruptEventOccurred())) {
> 
> We check to see if we've actually had an out of order reply to the message
> we are about to wait for or if another one is pending.
> 
> This seems to do the trick.
> 
> What do you think?

do we need a follow up on this? It sounds like something you should discuss with billm.
Flags: needinfo?(bobowen.code)
Comment on attachment 8669530 [details] [diff] [review]
Patch

Approval Request Comment
[Feature/regressing bug #]: bug 1189709
[User impact if declined]: Plugin hangs undetectable by crash reporter
[Describe test coverage new/current, TreeHerder]: Plugin test suite, locally with test case in bug
[Risks and why]: Low, this reverts back to previous behaviour that already existed prior to the regression.
[String/UUID change made/needed]: None
Attachment #8669530 - Flags: approval-mozilla-release?
Attachment #8669530 - Flags: approval-mozilla-beta?
Attachment #8669530 - Flags: approval-mozilla-aurora?
Depends on: 1211958
(In reply to Jim Mathies [:jimm] from comment #19)

> do we need a follow up on this? It sounds like something you should discuss
> with billm.

Filed bug 1212308 and NIed billm.
Flags: needinfo?(bobowen.code)
https://hg.mozilla.org/mozilla-central/rev/f641c7662bdd
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
Comment on attachment 8669530 [details] [diff] [review]
Patch

Sure, fix a regression, taking it. Should be in 42 beta 5.
Attachment #8669530 - Flags: approval-mozilla-beta?
Attachment #8669530 - Flags: approval-mozilla-beta+
Attachment #8669530 - Flags: approval-mozilla-aurora?
Attachment #8669530 - Flags: approval-mozilla-aurora+
Bob wonders whether this bug is responsible for the recent spike in Netflix's F7353 errors (bug 1160439).
Chris, the fix landed in 42.0b5 (AFAIK). It would be nice to know if the fix helped the Netflix errors. I am considering taking this fix into the next 41 dot release (sometime this week) and trying to confirm whether this fix is safe.
Flags: needinfo?(cpeterson)
(In reply to Ritu Kothari (:ritu) from comment #31)
> Chris, the fix landed in 42.0b5 (AFAIK). It would be nice to know if the fix
> helped the Netflix errors. I am considering taking this fix into the next 41
> dot release (sometime this week) and trying to confirm whether this fix is
> safe.

Netflix's F7353 errors have spiked (from ~0% to 2.48% of views) in their recent beta tests of Firefox 42. There could be multiple causes for error F7353, so I don't know if this fix had any effect.

Netflix is not going to launch EME with Firefox 41, so there is no need to uplift this fix to 41 solely for Netflix's benefit.
Flags: needinfo?(cpeterson)
Ritu, other definite hangs in 41.0.1 have been fixed by this though.
And this patch fixes a known regression -- this is how things worked in 41.0.0 but was broken in 41.0.1.
Comment on attachment 8669530 [details] [diff] [review]
Patch

As I mentioned in comment 35, this fix has been verified by a few sources already. It seems safe to uplift to m-r for inclusion in 41.0.2
Attachment #8669530 - Flags: approval-mozilla-release? → approval-mozilla-release+
As far as I can tell, this issue cannot be verified on Firefox for Android.
I veto taking this on release and actually request we back it out from beta, as it looks like this caused a 10% rise in crashes.
(In reply to Chris Peterson [:cpeterson] from comment #32)
> Netflix's F7353 errors have spiked (from ~0% to 2.48% of views) in their
> recent beta tests of Firefox 42. There could be multiple causes for error
> F7353, so I don't know if this fix had any effect.

Does this mean that there is a good chance that this fix made Netflix errors jump up? That's more reason to consider backing this fix out from 42, to find out if that's true and whether the errors subside again.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #39)
> I veto taking this on release and actually request we back it out from beta,
> as it looks like this caused a 10% rise in crashes.

Would you link us to the data that makes you believe this?

(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #40)
> (In reply to Chris Peterson [:cpeterson] from comment #32)
> > Netflix's F7353 errors have spiked (from ~0% to 2.48% of views) in their
> > recent beta tests of Firefox 42. There could be multiple causes for error
> F7353, so I don't know if this fix had any effect.
> 
> Does this mean that there is a good chance that this fix made Netflix errors
> jump up? That's more reason to consider this fix be backed out from 42 to
> find out if that's true and the errors subside again.

No, there was a small possibility that the patch from bug 1189709 was causing problems for GMPs; if it was, this would have been fixing that, not making it worse.
Flags: needinfo?(kairo)
One thing to note is that this patch fixes some Firefox UI hangs or lockups that were not being caught by the crash reporter; I don't know if that might be skewing the stats.
(In reply to Bob Owen (:bobowen) from comment #41)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #39)
> > I veto taking this on release and actually request we back it out from beta,
> > as it looks like this caused a 10% rise in crashes.
> 
> Would you link us to the data that makes you believe this?

The data is not 100% straight-forward to understand, unfortunately. https://arewestableyet.com/graph/?fxbeta tells me the crash rate of around 1.20 crashes / 100 ADI of Mon/Tues this week are about 10% worse than the 1.08 or so a week previously (before that, we had another spike that we had fixed). https://crash-analysis.mozilla.com/rkaiser/2015-10-13/2015-10-13.firefox.beta.explosiveness.html tells me that a number of signatures did rise, many of them with some window stuff in the signature, which would be fitting. I reported on that in bug 1213567 comment #5. The same bug points to the patch here as the possible culprit and as the rise is in 42.0b5, which shipped on 10/09 and for which this patch landed, it's my primary suspect for causing the 10% crash volume increase.

Now, we need to find out if that's possible or even probable based on the crashes reported, if we can find a different culprit among the patches that landed for b5, and what to do about this. Until we have done that, I do not think we should ship this patch here to users.
Flags: needinfo?(kairo)
I am strongly opposed to the backout proposal:

1) This patch restores code that was present in 41.0.0 and every version prior to that since OOP plugins were released. When the changes in bug 1189709 were made in 42, they mistakenly removed the code affected by this bug. You are asking us to remove code that has been in every released version prior to 41.0.1.

2) I strongly recommend that you compare the crash rate of 41.0.2 against versions that still shipped with this code (such as 39), instead of our upcoming prerelease channels that were also affected by this regression.

3) This code is very important to fix Flash performance and correctness problems, most of which are undetectable by crash reporter. They are either temporary hangs or not the types of hangs that our hang detection code is equipped to notice. What you have probably seen is a shift of many of these hangs from the undetectable kind to the ones that are actually visible to our reporting infrastructure.
Flags: needinfo?(kairo)
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #45)
> 2) I strongly recommend that you compare the crash rate of 41.0.2 against
> versions that still shipped with this code (such as 39), instead of our
> upcoming prerelease channels that were also affected by this regression.


To be clear: Was this code also present in 40 release? When did the removal of this code land? In what bug?
Flags: needinfo?(kairo)
Yes, it was present in 40 release and 41 release. It was removed by bug 1185639 and bug 1189709, which were uplifted for the 41.0.1 chemspill.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #40)
> (In reply to Chris Peterson [:cpeterson] from comment #32)
> > Netflix's F7353 errors have spiked (from ~0% to 2.48% of views) in their
> > recent beta tests of Firefox 42. There could be multiple causes for error
> F7353, so I don't know if this fix had any effect.
> 
> Does this mean that there is a good chance that this fix made Netflix errors
> jump up? That's more reason to consider this fix be backed out from 42 to
> find out if that's true and the errors subside again.

I don't think the jump in Netflix F7353 errors is related to this change landing on Beta. The jump happened between 2015-08-24 (Beta 41) and 2015-09-29 (Beta 42), but this change landed on Beta on 2015-10-08. And the Netflix errors have decreased a little (2.48% to 1.98%) since 2015-10-08. However, there are many contributing factors, including a new Adobe CDM build.
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #47)
> Yes, it was present in 40 release and 41 release. It was removed by bug
> 1185639 and bug 1189709, which were uplifted for the 41.0.1 chemspill.

So why didn't 41.0.1 then make crashes like those found now decrease? I do not see a decrease in any window-related crashes or overall crash volume there. Does re-adding this now trigger some different code that causes the crashes now? Do we need fixes for that?
Flags: needinfo?(aklotz)
(In reply to Chris Peterson [:cpeterson] from comment #48)
> I don't think the jump in Netflix F7353 errors is related to this change
> landing on Beta.

Thanks, that's good to know. One worry less.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #49)
> So why did 41.0.1 then not make crashes like found now decrease? I do not
> see a decrease in any window-related crashes or overall crash volume there.
> Does re-adding this now trigger some different code that causes the crashes
> now? Do we need fixes for that?

The code was refactored a bit to make it more reusable but should be doing the exact same thing as before. Can you point me to the specific stacks that increased?
Flags: needinfo?(aklotz) → needinfo?(kairo)
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #51)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #49)
> > So why did 41.0.1 then not make crashes like found now decrease? I do not
> > see a decrease in any window-related crashes or overall crash volume there.
> > Does re-adding this now trigger some different code that causes the crashes
> > now? Do we need fixes for that?
> 
> The code was refactored a bit to make it more reusable but should be doing
> the exact same thing as before. Can you point me to the specific stacks that
> increased?

Aaron, I believe bug 1213567 is where the crash signatures that have spiked up are tracked.
(In reply to Ritu Kothari (:ritu) from comment #52)
> Aaron, I believe it is bug 1213567 where the crash signatures are that have
> spiked up.

OK. As I said on IRC, I don't see directly how this patch could be causing it but it is suspicious enough that I am OK to leave it out of 41.0.2. I'd like to leave it in beta though so that I can continue to investigate.
Flags: needinfo?(kairo)
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #51)
> The code was refactored a bit to make it more reusable but should be doing
> the exact same thing as before. Can you point me to the specific stacks that
> increased?

As I mentioned above, bug 1213567 comment #5 points to all those. It's suspicious that there's so many references to window stuff in those crashes.
Looks like the PeekMessageW in WaitForInterruptNotify is causing exception stack overflows somehow.
Yeah it looks like maybe something is being "doubly neutered" even though that isn't supposed to be possible (sounds painful).
OK, thanks, makes me feel somewhat better at least that we can confirm the suspicions that this patch is related. Now I hope we can find a fix for that in 42 as well - but maybe we should take that to bug 1213567.
Depends on: 1213567
Product: Core → Core Graveyard