Closed Bug 831768 Opened 7 years ago Closed 7 years ago

Consider delaying async painting calls into NPP_HandleEvent while we're in another call

Categories

(Core :: Plug-ins, defect, P1)

x86_64
Windows 7
defect

Tracking

()

RESOLVED FIXED
mozilla22
Tracking Status
firefox21 --- fixed
firefox22 --- fixed

People

(Reporter: benjamin, Assigned: benjamin)

References

Details

Attachments

(2 files)

From Adobe:

We’ve looked at Firefox’s top hangs from the nightly reports (https://crash-analysis.mozilla.com/hang-reports/2013/01-13/nightly.html). All the top hangs seem to have the same root cause.

Sandbox process calls into the PlugIn Container(PIC) via some NPN call and waits for a response. PIC sends along the NPN request to Firefox and waits for a response. However, PIC gets woken by a Firefox request to process a NPP_HandleEvent call. PIC forwards the request to the Sandbox to handle. Unfortunately NPP_HandleEvent cannot wake up the Sandbox process. We now have a deadlock. Sandbox is waiting for PIC to handle the NPN request; PIC is waiting for Sandbox to handle the NPP_HandleEvent call.

By design, NPP calls cannot wake up the Sandbox process. This is to prevent the browser from changing the player’s state from under its feet. NPN calls should return to the Sandbox in the same state in which the call left from. The player code assumes no state change occurs for calling out NPN calls. Violations of this assumption can corrupt the player’s state and end up crashing later on.

So if possible, we would like Mozilla to change their code so that NPP calls cannot interrupt pending NPN calls in the PIC. If this behavior change isn’t possible for all NPP calls, we would like Mozilla to at least try to have NPP_HandleEvent calls wait for pending NPN calls to complete.

***

This is a followup from a message I sent previously:

I have loaded:

https://crash-analysis.mozilla.com/hang-reports/2012/10-21/hr-20121021-547b66a2-380c-4b0d-95c2-4b4cfeb04be6.html

If you expand the "metadata" section, it says that the CPU usage of plugin-contain and both flash processes appeared to be 0 at the time of the hang, which may indicate a deadlock situation (CpuUsageFlashProcess2 == sandbox, CpuUsageFlashProcess1 == broker, PluginCpuUsage = plugin-container).

Right now I'm reporting the stack of thread 0 for both of the FlashPlayerPlugin processes, although you can expand the "Frames" table to see the other threads.

In all reports it appears that the broker is pretty much always in F_399546359

The sandbox is doing something, then calling SendMessage from F_216007423. This reenters and ends up calling NPN_GetURLNotify (probably triggered by AS, but I'm not certain).

Plugin-container is in its event loop, is processing a windows message which presumably has received the NPN_GetURLNotify call from the sandbox (mozilla::plugins::child::_geturlnotify). It makes its IPC call to the browser and gets an unrelated IPC message in return, RecvAsyncSetWindow which is part of the asynchronous painting loop. That calls NPP_HandleEvent with WM_WINDOWPOSCHANGED here:

http://hg.mozilla.org/mozilla-central/annotate/ff4af83233dc/dom/plugins/ipc/PluginInstanceChild.cpp#l3192

Flash code ends up at F_1152915508 (IPC stuff)

Meanwhile the browser process has been happily oblivious to most of this: it has been sending asynchronous painting updates and network stream updates and is now starting to deliver a stream to the plugin (calling NPP_NewStream which is a synchronous call).

So if I were to guess, I think that FP doesn't expect to be receiving the NPP_HandleEvent(WM_WINDOWPOSCHANGED) call nested within NPN_GetURLNotify and is deadlocking somehow.

***

If this is *only* caused by reentrance through the async setwindow messages through RecvAsyncSetWindow, then we could try to delay the processing of RecvAsyncSetWindow until we get back to the top of the event loop. (There is also PPluginInstance::UpdateBackground which is also async and would therefore have similar reentrancy issues).
got r+ from gfritzsche who's sitting next to me. Yay pair programming.
Assignee: nobody → benjamin
Status: NEW → ASSIGNED
Attachment #726157 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/9c6474e8d3c8
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
Comment on attachment 726157 [details] [diff] [review]
Always delay PluginInstanceChild::RecvAsyncSetWindow, rev. 1

Bug caused by (feature/regressing bug #): Longstanding issue
User impact if declined: Java crashes more when run OOP (on mac). Taking this should also improve the Flash hang situation, but we don't know whether the improvement will be 15% or 60% of Flash hangs ;-)
Testing completed (on m-c, etc.): Passes tests, in nightlies without reported regressions.
Risk to taking this patch (and alternatives if risky): The main risk of this patch is that painting Flash may be slightly jankier or that scrolling/resizing may tear a little more.
String or UUID changes made by this patch: None
Attachment #726157 - Flags: approval-mozilla-aurora?
Benjamin, do we have any hang volume comparisons from before/after this patch on Nightly? Hangs still go to your machinery, so I don't have any Socorro data to compare here.
https://crash-analysis.mozilla.com/bsmedberg/graphing/ has the volume numbers. There was no noticeable hang volume change on nightly. But there was also no volume change when the hangui timeout changed, so I'm not sure how much I trust the data.
Looking at https://crash-analysis.mozilla.com/hang-reports/2013/03-26/hr-20130326-84b291f4-44b1-43f5-a949-476138157eb4.html I've found another way to trigger similar problems through RecvUpdateBackground.

I think we should continue to uplift the current patches, but I'm going to reopen this for another patch as well.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #3)
> User impact if declined: Java crashes more when run OOP (on mac). Taking
> this should also improve the Flash hang situation, but we don't know whether
> the improvement will be 15% or 60% of Flash hangs ;-)

> Risk to taking this patch (and alternatives if risky): The main risk of this
> patch is that painting Flash may be slightly jankier or that
> scrolling/resizing may tear a little more.

Are we sure we want to uplift a potentially risky change when we can't measure whether or not we've moved the needle? I'm leaning towards letting this ride the trains, given it's the end of the Aurora cycle and doesn't have measurable impact. May give us more time to analyze why our data isn't showing what we'd like.
Or if we had reproducible hangs that we could show no longer occurred (even if that's not represented in crash-stats), that would at least be a start.
Note: the primary reason for uplift is to solve Java crashes, not to solve Flash hangs. We have demonstrated that this significantly improves Java stability on mac.

In terms of hangs, this has definitely reduced the percentage of hangs with the "signature" adbe-3355131: https://crash-analysis.mozilla.com/hang-reports/2013/03-26/nightly.html

Those are inherently race conditions, so it's very hard to prove that they are better with a testcase.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #10)
> Note: the primary reason for uplift is to solve Java crashes, not to solve
> Flash hangs. We have demonstrated that this significantly improves Java
> stability on mac.

Benjamin, can you please point us to bugs/crash-stats that this patch is helping with ? 
Please correct me in case I am wrong here, but have not seen mac Java crashes to be a top concern and want to be sure the patch is the right tradeoff of risk-reward before approving ;-) 

> 
> In terms of hangs, this has definitely reduced the percentage of hangs with
> the "signature" adbe-3355131:
> https://crash-analysis.mozilla.com/hang-reports/2013/03-26/nightly.html
> 
> Those are inherently race conditions, so it's very hard to prove that they
> are better with a testcase.
See the 6 dependent bugs.
Comment on attachment 730189 [details] [diff] [review]
part 2 - Defer painting from RecvBackgroundUpdate as well, rev. 1

Review of attachment 730189 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me.
Attachment #730189 - Flags: review?(georg.fritzsche) → review+
(In reply to bhavana bajaj [:bajaj] from comment #11)
> (In reply to Benjamin Smedberg  [:bsmedberg] from comment #10)
> > Note: the primary reason for uplift is to solve Java crashes, not to solve
> > Flash hangs. We have demonstrated that this significantly improves Java
> > stability on mac.
> 
> Benjamin, can you please point us to bugs/crash-stats that this patch is
> helping with ? 

Bug 823559, which is currently in Aurora, made Java plugins OOP again on non-Windows platforms, which triggered a set of connected bugs. This patch fixed all of those we can reproduce and seems to take care of the rest as well.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #12)
> See the 6 dependent bugs.

Thanks, Looks like the 841914,841916,845735 have been resolved by this which was overlooked..& I see comment 14 shedding more light on the need of the uplift here due to landing of Bug 823559 .

Approving for uplift based on supporting comments now :)
Attachment #726157 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/mozilla-central/rev/51be4c6ccf8e
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Depends on: 872969
You need to log in before you can comment on or make changes to this bug.