Closed Bug 1209464 Opened 9 years ago Closed 9 years ago

Window neutering missing from MessageChannel::WaitForInterruptNotify

Categories

(Core Graveyard :: Plug-ins, defect)

42 Branch
All
Windows
defect
Not set
normal

Tracking

(firefox41+ wontfix, firefox42+ fixed, firefox43+ fixed, firefox44 fixed)

RESOLVED FIXED
mozilla44

People

(Reporter: bobowen, Assigned: bugzilla)

References

Details

Attachments

(5 files)

While testing bugs 1185639 and 1189709 applied against release, I noticed that the right-click Flash settings menu often takes a while to open.

Things seem to hang while it responds, although it always comes back eventually.

I'm occasionally seeing the plugin hang dialog with this as well.

I want to get bugs 1185639 and 1189709 uplifted to release to fix more serious hangs.
This is far less serious, but annoying.

This is happening in beta, aurora and nightly with e10s disabled.
Doesn't happen when e10s is enabled.
This is just on windowless plugins.
Summary: Right click menu on Flash is often slow to open on Windows. → Right click menu on windowless Flash is often slow to open on Windows.
Stack from firefox.exe main thread when waiting for menu.
Stack from plugin-container.exe main thread when waiting for menu.
These two stacks look more like some sort of deadlock.
The corresponding plugin-container.exe one.

If I click on another window the menu appears and the browser is freed up.
Looks like a PluginInstanceChild::SendShow call is having its reply delayed. Everything looks timely on the firefox side, but the reply isn't making it back to the main plugin-container thread in a timely fashion. More investigation is necessary, but I don't consider this serious enough to hold back uplift of those other patches at the moment.
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #6)
> Looks like a PluginInstanceChild::SendShow call is having its reply delayed.
> Everything looks timely on the firefox side, but the reply isn't making it
> back to the main plugin-container thread in a timely fashion. More
> investigation is necessary, but I don't consider this serious enough to hold
> back uplift of those other patches at the moment.

I don't think this is quite right.

When you right click, you get a WM_RBUTTONDOWN, a WM_KILLFOCUS, and 2 or 3 WM_IME_SETCONTEXT messages sent down to the plugin with a load of nested CallNPP_HandleEvents.

At the same time, the plugin process is sending back a SendShow.

If all the replies (and the Show message at the end) are in mPending before we get to [1], then we satisfy the latest call in the nested CallNPP_HandleEvent and end up at [2].
But the message from SendShow is still in mPending.
So, we then go back up the stack to the previous nesting of CallNPP_HandleEvent and end up at [3].
But, now there are no messages returning to signal mEvent.
You only get out of this when some other Windows event causes another nested CallNPP_HandleEvent, and when the reply comes back the Show message is the first in mPending.

Not sure all the details are correct there, but I think that's the general issue.


[1] https://dxr.mozilla.org/mozilla-central/rev/891ee0d0ba3ec42b6484cf0205b3c95e21c58f74/ipc/glue/WindowsMessageLoop.cpp#1149
[2] https://dxr.mozilla.org/mozilla-central/rev/891ee0d0ba3ec42b6484cf0205b3c95e21c58f74/ipc/glue/MessageChannel.cpp#1126
[3] https://dxr.mozilla.org/mozilla-central/rev/891ee0d0ba3ec42b6484cf0205b3c95e21c58f74/ipc/glue/WindowsMessageLoop.cpp#1145
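The scenario above can be modelled with a toy single-signal queue. All names here are invented stand-ins for mPending/mEvent, and the batching behaviour is a simplification of the real logic in MessageChannel.cpp and WindowsMessageLoop.cpp: a batch of arrivals signals the shared event once, the innermost nested wait consumes that signal along with its own reply, and the outer nesting then blocks even though its data is already queued.

```cpp
#include <cassert>
#include <deque>
#include <vector>

struct Msg { int seqno; bool isReply; };

std::deque<Msg> mPending;  // shared incoming-message queue
int gEventSignals = 0;     // times the shared event has been signalled

// A batch of messages arrives and the shared event fires once.
void Deliver(const std::vector<Msg>& batch) {
  for (const Msg& m : batch) mPending.push_back(m);
  ++gEventSignals;
}

// One nested wait: consume a signal and pull out the matching reply.
// Returns false when no signal is left, i.e. the wait would hang.
bool WaitForReply(int seqno) {
  if (gEventSignals == 0) return false;  // nothing will ever wake us
  --gEventSignals;
  for (auto it = mPending.begin(); it != mPending.end(); ++it) {
    if (it->isReply && it->seqno == seqno) {
      mPending.erase(it);
      return true;
    }
  }
  return false;
}
```

In this model the Show message (and even the outer nesting's own reply) can sit in mPending indefinitely, matching the "no messages returning to signal mEvent" symptom described above.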
If I change:
https://dxr.mozilla.org/mozilla-central/rev/891ee0d0ba3ec42b6484cf0205b3c95e21c58f74/ipc/glue/WindowsMessageLoop.cpp#1140

to:
      if (!Connected() ||
          (AwaitingInterruptReply() && InterruptEventOccurred())) {

We check to see whether we've actually had an out-of-order reply to the message we are about to wait for, or whether another one is pending.

This seems to do the trick.

What do you think?
Flags: needinfo?(aklotz)
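The proposed condition can be isolated as a small predicate; this is a sketch only, with the real MessageChannel methods AwaitingInterruptReply() and InterruptEventOccurred() reduced to booleans:

```cpp
#include <cassert>

// Sketch of the proposed loop-exit test: stop waiting either when the
// channel has disconnected (the original check), or when the reply we
// are about to wait for -- or another interrupt event -- has already
// arrived, e.g. as an out-of-turn reply drained by a deeper nesting.
bool ShouldStopWaiting(bool connected,
                       bool awaitingInterruptReply,
                       bool interruptEventOccurred) {
  return !connected ||
         (awaitingInterruptReply && interruptEventOccurred);
}
```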
(In reply to Bob Owen (:bobowen) from comment #8)

>       if (!Connected() ||
>           (AwaitingInterruptReply() && InterruptEventOccurred())) {

Just thought, it probably makes sense (if it makes sense at all) to have this new check at the end of the loop, as it should never be true to start with.
This code is going round in my head now!

The whole way we handle the interrupt replies seems a bit odd.

If mInterruptStack were changed to be a vector instead of a stack, we could hold a pair of Messages, for the request and reply.
When an interrupt reply comes in we could use the (abs(msg.seqno() - mInterruptStack.front().seqno())) to find the position in the vector to store our reply.
Then we would immediately know when our reply was in and we wouldn't need mOutOfTurnReplies.

mPending would then only be for incoming requests (as we handle incoming sync replies differently).

This would be a bigger change, so not suitable for a release fix.
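The restructuring floated here might look roughly like this. It is only a sketch: Message is reduced to a sequence number, and the indexing assumes nested requests get consecutive seqnos, per the abs(...) expression above.

```cpp
#include <cassert>
#include <cstdlib>
#include <utility>
#include <vector>

// Stand-in for IPC::Message: a sequence number plus a flag so an
// empty reply slot can be told apart from a filled one.
struct Message {
  int seqno = 0;
  bool valid = false;
};

struct InterruptStack {
  // Pairs of {request, reply}, oldest outstanding request first.
  std::vector<std::pair<Message, Message>> calls;

  void PushRequest(int seqno) {
    calls.push_back({Message{seqno, true}, Message{}});
  }

  // File an incoming reply straight into its slot, computed from its
  // seqno distance to the oldest outstanding request; this is what
  // would replace the separate mOutOfTurnReplies map.
  void StoreReply(int seqno) {
    size_t idx = std::abs(seqno - calls.front().first.seqno);
    calls[idx].second = Message{seqno, true};
  }

  // The innermost call can now tell immediately whether its own reply
  // has arrived, without scanning mPending.
  bool TopReplyReady() const { return calls.back().second.valid; }
};
```

With replies filed by slot, the wait loop would never be fooled by a reply that arrived for a different nesting level, which is exactly the confusion described in comment 7.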
OK, what I'm seeing here is actually a regression from bug 1189709. That patch neglected to add neutering back to the appropriate location in MessageChannel::WaitForInterruptNotify when the scope of the NeuteredWindowRegion was reduced in MessageChannel::Call.
Flags: needinfo?(aklotz)
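The regression and its fix can be pictured with a toy RAII region. The class name mirrors the real NeuteredWindowRegion but the body here is invented: before bug 1189709, MessageChannel::Call held the region across the whole call, so the interrupt wait always ran neutered; after the scope was narrowed, WaitForInterruptNotify needed its own region, which is what this patch restores.

```cpp
#include <cassert>

// Toy neutering model: a depth counter stands in for the real
// per-thread window-neutering state.
static int gNeuterDepth = 0;

struct NeuteredWindowRegion {
  NeuteredWindowRegion() { ++gNeuterDepth; }
  ~NeuteredWindowRegion() { --gNeuterDepth; }
};

static bool WindowsNeutered() { return gNeuterDepth > 0; }

// With the fix: the wait re-establishes neutering for its duration.
bool WaitWithFix() {
  NeuteredWindowRegion region;  // the region the patch restores
  return WindowsNeutered();
}

// The regressed version: the wait runs with windows un-neutered,
// letting synchronous Win32 messages re-enter and deadlock.
bool WaitRegressed() {
  return WindowsNeutered();
}
```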
Attached patch PatchSplinter Review
Assignee: nobody → aklotz
Status: NEW → ASSIGNED
Attachment #8669530 - Flags: review?(jmathies)
Ah I see, this seems to stop it from getting into the nested intr messages in the first place.

I wonder if we can ever get into the nested circumstance; if we can, the issue I saw could still occur.
[Tracking Requested - why for this release]: This is serious enough that I think it should go into a 41 chemspill. I also think it will take care of bug 1210665 and others.
Updating the summary as the consequences are much more significant than Flash context menus.
Summary: Right click menu on windowless Flash is often slow to open on Windows. → Window neutering missing from MessageChannel::WaitForInterruptMessage
Summary: Window neutering missing from MessageChannel::WaitForInterruptMessage → Window neutering missing from MessageChannel::WaitForInterruptNotify
[Tracking Requested - why for this release]:
See comment #14 - not sure if it's bad enough to trigger a point-release by itself, but we should track it for Dev Edition and Beta.
This makes Flash settings slow to open (see comment #0) and some sites like in bug 1210665 can get Firefox to freeze up completely.
Indeed, tracking for 42 & 43; cf. comment #17 for the rationale.
Attachment #8669530 - Flags: review?(jmathies) → review+
(In reply to Bob Owen (:bobowen) from comment #8)
> If I change:
> https://dxr.mozilla.org/mozilla-central/rev/
> 891ee0d0ba3ec42b6484cf0205b3c95e21c58f74/ipc/glue/WindowsMessageLoop.cpp#1140
> 
> to:
>       if (!Connected() ||
>           (AwaitingInterruptReply() && InterruptEventOccurred())) {
> 
> We check to see if we've actually had an out of order reply to the message
> we are about to wait for or if another one is pending.
> 
> This seems to do the trick.
> 
> What do you think?

do we need a follow up on this? It sounds like something you should discuss with billm.
Flags: needinfo?(bobowen.code)
Comment on attachment 8669530 [details] [diff] [review]
Patch

Approval Request Comment
[Feature/regressing bug #]: bug 1189709
[User impact if declined]: Plugin hangs undetectable by crash reporter
[Describe test coverage new/current, TreeHerder]: Plugin test suite, locally with test case in bug
[Risks and why]: Low, this reverts back to previous behaviour that already existed prior to the regression.
[String/UUID change made/needed]: None
Attachment #8669530 - Flags: approval-mozilla-release?
Attachment #8669530 - Flags: approval-mozilla-beta?
Attachment #8669530 - Flags: approval-mozilla-aurora?
Depends on: 1211958
(In reply to Jim Mathies [:jimm] from comment #19)

> do we need a follow up on this? It sounds like something you should discuss
> with billm.

Filed bug 1212308 and NIed billm.
Flags: needinfo?(bobowen.code)
https://hg.mozilla.org/mozilla-central/rev/f641c7662bdd
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
Comment on attachment 8669530 [details] [diff] [review]
Patch

Sure, fix a regression, taking it. Should be in 42 beta 5.
Attachment #8669530 - Flags: approval-mozilla-beta?
Attachment #8669530 - Flags: approval-mozilla-beta+
Attachment #8669530 - Flags: approval-mozilla-aurora?
Attachment #8669530 - Flags: approval-mozilla-aurora+
Bob wonders whether this bug is responsible for the recent spike in Netflix's F7353 errors (bug 1160439).
Chris, the fix landed in 42.0b5 (AFAIK). It would be nice to know if the fix helped the Netflix errors. I am considering taking this fix into the next 41 dot release (sometime this week) and trying to confirm whether this fix is safe.
Flags: needinfo?(cpeterson)
(In reply to Ritu Kothari (:ritu) from comment #31)
> Chris, the fix landed in 42.0b5 (AFAIK). It would be nice to know if the fix
> helped the Netflix errors. I am considering taking this fix into the next 41
> dot release (sometime this week) and trying to confirm whether this fix is
> safe.

Netflix's F7353 errors have spiked (from ~0% to 2.48% of views) in their recent beta tests of Firefox 42. There could be multiple causes for error F7353, so I don't know if this fix had any effect.

Netflix is not going to launch EME with Firefox 41, so there is no need to uplift this fix to 41 solely for Netflix's benefit.
Flags: needinfo?(cpeterson)
Ritu, other definite hangs in 41.0.1 have been fixed by this though.
And this patch fixes a known regression -- this is how things worked in 41.0.0 but was broken in 41.0.1.
Comment on attachment 8669530 [details] [diff] [review]
Patch

As I mentioned in comment 35, this fix has been verified by a few sources already. It seems safe to uplift to m-r for inclusion in 41.0.2
Attachment #8669530 - Flags: approval-mozilla-release? → approval-mozilla-release+
As far as I can tell, this issue cannot be verified on Firefox for Android.
I veto taking this on release and actually request we back it out from beta, as it looks like this caused a 10% rise in crashes.
(In reply to Chris Peterson [:cpeterson] from comment #32)
> Netflix's F7353 errors have spiked (from ~0% to 2.48% of views) in their
> recent beta tests of Firefox 42. There could be multiple causes for error
> F7353, so I don't know if this fix had any effect.

Does this mean that there is a good chance that this fix made Netflix errors jump up? That's more reason to consider backing this fix out from 42, to find out if that's true and whether the errors subside again.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #39)
> I veto taking this on release and actually request we back it out from beta,
> as it looks like this caused a 10% rise in crashes.

Would you link us to the data that makes you believe this?

(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #40)
> (In reply to Chris Peterson [:cpeterson] from comment #32)
> > Netflix's F7353 errors have spiked (from ~0% to 2.48% of views) in their
> > recent beta tests of Firefox 42. There could be multiple causes for error
> F7353, so I don't know if this fix had any effect.
> 
> Does this mean that there is a good chance that this fix made Netflix errors
> jump up? That's more reason to consider this fix be backed out from 42 to
> find out if that's true and the errors subside again.

No, there was a small possibility that the patch from bug 1189709 was causing problems for GMPs; if it was, this would have been fixing that, not making it worse.
Flags: needinfo?(kairo)
One thing to note is that this patch fixes some Firefox UI hangs or lockups that were not being caught by the crash reporter; I don't know if that might be skewing the stats.
(In reply to Bob Owen (:bobowen) from comment #41)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #39)
> > I veto taking this on release and actually request we back it out from beta,
> > as it looks like this caused a 10% rise in crashes.
> 
> Would you link us to the data that makes you believe this?

The data is not 100% straight-forward to understand, unfortunately. https://arewestableyet.com/graph/?fxbeta tells me the crash rate of around 1.20 crashes / 100 ADI of Mon/Tues this week are about 10% worse than the 1.08 or so a week previously (before that, we had another spike that we had fixed). https://crash-analysis.mozilla.com/rkaiser/2015-10-13/2015-10-13.firefox.beta.explosiveness.html tells me that a number of signatures did rise, many of them with some window stuff in the signature, which would be fitting. I reported on that in bug 1213567 comment #5. The same bug points to the patch here as the possible culprit and as the rise is in 42.0b5, which shipped on 10/09 and for which this patch landed, it's my primary suspect for causing the 10% crash volume increase.

Now, we need to find out if that's possible or even probable based on the crashes reported, if we can find a different culprit among the patches that landed for b5, and what to do about this. Until we have done that, I do not think we should ship this patch here to users.
Flags: needinfo?(kairo)
I am strongly opposed to the backout proposal:

1) This patch restores code that was present in 41.0.0 and every version prior to that since OOP plugins were released. When the changes in bug 1189709 were made in 42, they mistakenly removed the code affected by this bug. You are asking us to remove code that has been in every released version prior to 41.0.1.

2) I strongly recommend that you compare the crash rate of 41.0.2 against versions that still shipped with this code (such as 39), instead of our upcoming prerelease channels that were also affected by this regression.

3) This code is very important to fix Flash performance and correctness problems, most of which are undetectable by crash reporter. They are either temporary hangs or not the types of hangs that our hang detection code is equipped to notice. What you have probably seen is a shift of many of these hangs from the undetectable kind to the ones that are actually visible to our reporting infrastructure.
Flags: needinfo?(kairo)
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #45)
> 2) I strongly recommend that you compare the crash rate of 41.0.2 against
> versions that still shipped with this code (such as 39), instead of our
> upcoming prerelease channels that were also affected by this regression.


To be clear: Was this code also present in 40 release? When did the removal of this code land? In what bug?
Flags: needinfo?(kairo)
Yes, it was present in 40 release and 41 release. It was removed by bug 1185639 and bug 1189709, which were uplifted for the 41.0.1 chemspill.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #40)
> (In reply to Chris Peterson [:cpeterson] from comment #32)
> > Netflix's F7353 errors have spiked (from ~0% to 2.48% of views) in their
> > recent beta tests of Firefox 42. There could be multiple causes for error
> F7353, so I don't know if this fix had any effect.
> 
> Does this mean that there is a good chance that this fix made Netflix errors
> jump up? That's more reason to consider this fix be backed out from 42 to
> find out if that's true and the errors subside again.

I don't think the jump in Netflix F7353 errors is related to this change landing on Beta. The jump happened between 2015-08-24 (Beta 41) and 2015-09-29 (Beta 42), but this change landed on Beta on 2015-10-08. And the Netflix errors have decreased a little (2.48% to 1.98%) since 2015-10-08. However, there are many contributing factors, including a new Adobe CDM build.
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #47)
> Yes, it was present in 40 release and 41 release. It was removed by bug
> 1185639 and bug 1189709, which were uplifted for the 41.0.1 chemspill.

So why didn't 41.0.1 then make crashes like those found now decrease? I do not see a decrease in any window-related crashes or overall crash volume there. Does re-adding this now trigger some different code that causes the crashes now? Do we need fixes for that?
Flags: needinfo?(aklotz)
(In reply to Chris Peterson [:cpeterson] from comment #48)
> I don't think the jump in Netflix F7353 errors is related to this change
> landing on Beta.

Thanks, that's good to know. One worry less.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #49)
> So why did 41.0.1 then not make crashes like found now decrease? I do not
> see a decrease in any window-related crashes or overall crash volume there.
> Does re-adding this now trigger some different code that causes the crashes
> now? Do we need fixes for that?

The code was refactored a bit to make it more reusable but should be doing the exact same thing as before. Can you point me to the specific stacks that increased?
Flags: needinfo?(aklotz) → needinfo?(kairo)
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #51)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #49)
> > So why did 41.0.1 then not make crashes like found now decrease? I do not
> > see a decrease in any window-related crashes or overall crash volume there.
> > Does re-adding this now trigger some different code that causes the crashes
> > now? Do we need fixes for that?
> 
> The code was refactored a bit to make it more reusable but should be doing
> the exact same thing as before. Can you point me to the specific stacks that
> increased?

Aaron, I believe bug 1213567 is where the crash signatures that have spiked up are tracked.
(In reply to Ritu Kothari (:ritu) from comment #52)
> Aaron, I believe it is bug 1213567 where the crash signatures are that have
> spiked up.

OK. As I said on IRC, I don't see directly how this patch could be causing it but it is suspicious enough that I am OK to leave it out of 41.0.2. I'd like to leave it in beta though so that I can continue to investigate.
Flags: needinfo?(kairo)
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #51)
> The code was refactored a bit to make it more reusable but should be doing
> the exact same thing as before. Can you point me to the specific stacks that
> increased?

As I mentioned above, bug 1213567 comment #5 points to all those. It's suspicious that there's so many references to window stuff in those crashes.
Looks like the PeekMessageW in WaitForInterruptNotify is causing exception stack overflows somehow.
Yeah it looks like maybe something is being "doubly neutered" even though that isn't supposed to be possible (sounds painful).
OK, thanks, makes me feel somewhat better at least that we can confirm the suspicions that this patch is related. Now I hope we can find a fix for that in 42 as well - but maybe we should take that to bug 1213567.
Depends on: 1213567
Product: Core → Core Graveyard