Closed Bug 878195 Opened 11 years ago Closed 11 years ago

Quakelive instances not being cleaned up properly

Categories

(Core Graveyard :: Plug-ins, defect, P1)

x86_64
Windows 7
defect

Tracking

(firefox21 affected, firefox22- affected, firefox23- affected, firefox24- affected)

RESOLVED WONTFIX
Tracking Status
firefox21 --- affected
firefox22 - affected
firefox23 - affected
firefox24 - affected

People

(Reporter: mike.rubits, Assigned: johns)

References

()

Details

(Keywords: regression)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31

Steps to reproduce:

We use an NPAPI plugin for http://www.quakelive.com and once 21 was rolled out, we began receiving reports from users. We use a mutex to prevent multiple plugin instances running at once, that is released in NPP_Destroy when the user closes the tab or browses away from the page.


Actual results:

Navigating away, closing the tab, or refreshing the page will cause NPP_Destroy to never be called. Removing the <object> tag from the document through Firebug is similarly unsuccessful. The user must close the browser entirely, at which time NPP_Destroy is called and the resource is released.

I've used mozregression to narrow it down to Last Good: 2013-03-04 to First Bad: 2013-03-05 with this pushlog: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=86c98c4d36da&tochange=015da7030aab


Expected results:

NPP_Destroy should be called under the situations mentioned here https://developer.mozilla.org/en-US/docs/NPP_Destroy
Component: Untriaged → Plug-ins
Product: Firefox → Core
Yikes, that's not good.

That range includes bug 827158 and bug 829557, which would be the main suspects.
Assignee: nobody → jschoenick
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Trying to load the quakelive plugin in a debug build yields:

> ###!!! ASSERTION: Plugin using object not created with NPN_CreateObject?: 'Error', file dom/plugins/ipc/PluginModuleChild.cpp, line 767

This does not occur in in-process plugin mode. I would guess then that 827158 is the culprit, and this error is corrupting the plugin's state such that it is not cleaned up properly (also a bug).

Mike, if you are feeling up to debugging, you can grab a debug build of firefox which will throw up an assertion dialog when it hits this error. If you attach to plugin-container.exe at that point you should be able to get a backtrace of what the plugin is doing to upset NPAPI

http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32-debug/1370032792/

I'm guessing this error puts the plugin in a bad state that causes it to not get cleaned up when it should, but I will need to take a closer look monday.
I can't see an obvious way for the assertion to be related to the lack of NPP_Destroy, although it's something that quakelive should fix.

I'm sure that we are calling NPP_Destroy for other plugins, so this is not pervasive.
Comment 0 suggests that FF21 is in fact affected, even though this first regressed in FF22 on central. That implies an uplift. Bug 824069?
(In reply to Alex Keybl [:akeybl] from comment #4)
> Comment 0 suggests that FF21 is in fact affected, even though this first
> regressed in FF22 on central. That implies an uplift. Bug 824069?

I don't think so - if that was the trigger it would have been crashing with a null-pointer access before that change. Any other uplift candidates?
So we did some debugging on IRC yesterday, what it looks like is happening is quakelive doesn't use NPN_CreateObject, resulting in us losing track of the object. In Firefox 20, this would hit bug 824069, and crash plugin-container on shutdown. So bug 824069 caused us to *not* crash the container, causing it to survive, exposing the issue.

Mike thinks that there might be a refcounting issue in the plugin, which could be related to some calls besides NPP_Destroy not making it, do to the CreateObject issue, making it believe it was not free'd.

The bug on our side here is that we should kill plugins that should've shut down but did not, and probably make the assertion about invalid objects a more visible error or abort in release builds.
Blocks: 824069
No longer blocks: 827158
We weren't tracking bug 824069 for release - even if it's more correct to leave in, should we consider backing out at this point for sake of quakelive? Probably best to move forward with a fix plugin-side, of course..
(In reply to Alex Keybl [:akeybl] from comment #7)
> We weren't tracking bug 824069 for release - even if it's more correct to
> leave in, should we consider backing out at this point for sake of
> quakelive? Probably best to move forward with a fix plugin-side, of course..

The problem in bug 824069 is that we get into a bad state due to objects not being created with NPN_CreateObject, and then crash. Restoring a crash into plugin-container to *fix* quakelive is a bit odd, but if there's no other realworld issues it would be possible to make that assert a runtime abort.

Long term, we should probably be killing the plugin when the initial CreateObject() badness occurs, so issues like this don't go undetected.
(In reply to John Schoenick [:johns] from comment #6)
> The bug on our side here is that we should kill plugins that should've shut
> down but did not, and probably make the assertion about invalid objects a
> more visible error or abort in release builds.

An abort would bring us right back to a variation of what the patch for bug 824069 fixed.
The "NPObject not in object table" assertion didn't show up?
(In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> An abort would bring us right back to a variation of what the patch for bug
> 824069 fixed.

But 824069 was fixing a test suite issue, if there are no real-world crashes happening on that signature it might be worth backing out for a release to give quakelive time to update.

> The "NPObject not in object table" assertion didn't show up?

It did, the full assertions a plugin lifecycle hits are:

< First script access >

###!!! ASSERTION: Plugin using object not created with NPN_CreateObject?: 'Error', file g:/moz-git/dom/plugins/ipc/PluginModuleChild.cpp, line 767
###!!! ASSERTION: NPObject not in object table: 'Error', file g:/moz-git/dom/plugins/ipc/PluginModuleChild.cpp, line 736
###!!! ASSERTION: Out of memory?: 'Error', file g:/moz-git/dom/plugins/ipc/PluginScriptableObjectChild.cpp, line 480
###!!! ASSERTION: Releasing object not in mObjectMap?: Error', file g:/moz-git/dom/plugins/ipc/PluginModuleChild.cpp, line 2144

<Stopping plugin instance>

###!!! ASSERTION: NPObject not in object table: 'd', file g:/moz-git/dom/plugins/ipc/PluginModuleChild.cpp, line 752
###!!! ASSERTION: Releasing object not in mObjectMap?: 'Error', file g:/moz-git/dom/plugins/ipc/PluginModuleChild.cpp, line 2144
(In reply to John Schoenick [:johns] from comment #10)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> > An abort would bring us right back to a variation of what the patch for bug
> > 824069 fixed.
> 
> But 824069 was fixing a test suite issue, if there are no real-world crashes
> happening on that signature it might be worth backing out for a release to
> give quakelive time to update.

Right, i thought you meant making this a (permanent) abort. Not sure if that issue is limited to tests though as that combination wasn't on release or beta yet.

Per IRC it turns out that QuakeLive does require users to always have the latest version installed, so if they have a fix out soon we probably won't have to worry about this.
I'm rolling out a workaround on our plugin to our playtesters, since there does seem to be some reference counter behavior that's different when comparing to Chrome. Even though native plugins are much less common on the web nowadays, I'd hate to have to rely on broken behavior existing in Firefox and possibly affecting other plugins. We'd definitely like to fix this on our side.

From what I could tell by debugging our plugin again, we increase the ref counter on NPP_New, and GetValue, and decrease it on deallocate, and destroy. We only have one property on the plugin, and that is the plugin's version number. This seems to get called a couple of times, but I wasn't seeing our deallocate function being called even when closing the tab. I'm not sure if this is indicative of another change within those two nightlies, or just us using NPAPI incorrectly, but worth noting for the future
(In reply to Mike Rubits from comment #12)
> I'm rolling out a workaround on our plugin to our playtesters, since there
> does seem to be some reference counter behavior that's different when
> comparing to Chrome. Even though native plugins are much less common on the
> web nowadays, I'd hate to have to rely on broken behavior existing in
> Firefox and possibly affecting other plugins. We'd definitely like to fix
> this on our side.
> 
> From what I could tell by debugging our plugin again, we increase the ref
> counter on NPP_New, and GetValue, and decrease it on deallocate, and
> destroy. We only have one property on the plugin, and that is the plugin's
> version number. This seems to get called a couple of times, but I wasn't
> seeing our deallocate function being called even when closing the tab. I'm
> not sure if this is indicative of another change within those two nightlies,
> or just us using NPAPI incorrectly, but worth noting for the future

I think this is a side effect of the lack of NPN_CreateObject:

http://dxr.mozilla.org/mozilla-central/source/dom/plugins/ipc/PluginModuleChild.cpp#l2144

We raise an error and return early from that function, so deallocate would never be called. It looks like, however, we could still continue with DeallocNPObject in that case. We should file a bug on either tolerating a lack of NPN_CreateObject better, or being more strict about requiring it.
Blocks: 879567
I filed bug 879567 for the NPN_CreateObject issue causing a lack of deallocate callbacks and other potential issues
Has anyone investigated the root problem here (lack of NPP_Destroy)? I think we've explained why fixing bug 824069 actually stopped a crash in quakelive which then exposed the current behavior, but that doesn't really help explain why we aren't calling NPP_Destroy in the first place.

Note that it's safe from a security perspective to back out bug 824069, but also note that we probably wouldn't get crash-stats from it because it is a crash that occurs on unload of a plugin instance, and we wouldn't have any place to show UI for these crashes so they would not typically be reported.

But that also means that they aren't typically a big deal to users.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #15)
> Has anyone investigated the root problem here (lack of NPP_Destroy)? I think
> we've explained why fixing bug 824069 actually stopped a crash in quakelive
> which then exposed the current behavior, but that doesn't really help
> explain why we aren't calling NPP_Destroy in the first place.

Actually, I think we determined on IRC that NPP_Destroy was being called, and that the refcount mismatch was due to the missing deallocate calls.
Summary: NPP_Destroy not being called in NPAPI plugins on page navigation. → Quakelive instances not being cleaned up properly
I'm going to mark this wontfix on our end since it sounds like a fix is ready on quakelive's end, and backing out bug 824069 is not ideal. Bug 879567 was filed to issue more explicit warnings about NPAPI usage errors like skipping NPN_CreateObject, which should help detect issues like this sooner.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.