Bug 768651 - Firefox started with "cfx run" hangs for some URLs: Windows only
Status: RESOLVED FIXED
Product: Add-on SDK
Classification: Client Software
Component: General
Version: unspecified
Platform: x86 Windows 7
Importance: P1 normal with 4 votes
Target Milestone: mozilla25
Assigned To: Gabor Krizsanits [:krizsa :gabor]
Duplicates: 916849
Depends on: 897683
Reported: 2012-06-26 14:41 PDT by Will Bamberg [:wbamberg]
Modified: 2014-01-05 15:30 PST


Attachments
Pointer to Github pull request: https://github.com/mozilla/addon-sdk/pull/1104/files (378 bytes, text/html)
2013-07-15 03:46 PDT, Gabor Krizsanits [:krizsa :gabor]
dtownsend: review+

Description Will Bamberg [:wbamberg] 2012-06-26 14:41:41 PDT
Reported in the forum by Konrad Gorski: https://forums.mozilla.org/addons/viewtopic.php?f=27&t=10904&p=22901#p22901

1) On Windows 7, create a minimal add-on, for example:

const widgets = require("widget");
const tabs = require("tabs");

var widget = widgets.Widget({
  id: "mozilla-link",
  label: "Mozilla website",
  contentURL: "http://www.mozilla.org/favicon.ico",
  onClick: function() {
    tabs.open("http://www.mozilla.org/");
  }
});

console.log("The add-on is running.");

2) Run it using "cfx run", then navigate to certain URLs, for example:
http://www.ebay.com/ctg/Nikon-D5100-162-MP-Digital-SLR-Camera-Black-Kit-w-AFS-1855mm-VR-Lens-/101827356?_pcatid=782&_pcategid=31388&_dmpt=Digital_Cameras&_dashexp=1

http://www.ebay.com/itm/COBRA-ESD-9275-Digital-6-Band-Laser-Radar-Detector-w-Safety-Alert-LaserEye-/350560167708?pt=LH_DefaultDomain_0&hash=item519f03a71c

3) Firefox will hang.

If instead you:
1) build the add-on with "cfx xpi"
2) run Firefox
3) install the add-on
4) visit those pages
...then everything's fine.
Comment 1 Wes Kocher (:KWierso) 2012-06-26 14:46:03 PDT
As a guess, one of the SDK's default preferences that change Firefox's defaults for cfx run/test?
Comment 2 Wes Kocher (:KWierso) 2012-07-05 12:27:37 PDT
(In reply to Wes Kocher (:KWierso) from comment #1)
> As a guess, one of the SDK's default preferences that change Firefox's
> defaults for cfx run/test?

I've tested with all of the preferences that the SDK sets commented out, and this still happens.

I've tested with all of the environment variables that the SDK sets commented out, and this still happens.

The suggestion in the last post of the forum thread, that a Windows update may be causing this, seems plausible.
Comment 3 Dave Townsend [:mossop] 2012-07-05 13:10:15 PDT
I think if you set hangmonitor.timeout in about:config to a number of seconds, a crash report will be submitted once Firefox has been hung for that long. That could at least give us a stack trace so we know what is hanging here.
Comment 5 Dave Townsend [:mossop] 2012-07-05 13:19:57 PDT
Looks like it's hanging on plugin IPC stuff. I imagine disabling plugins would make the problem go away. Might need some of the plugin guys to look at this.
Comment 6 Dave Townsend [:mossop] 2012-07-05 13:20:35 PDT
Maybe Eddy might have some quick thoughts here
Comment 7 Eddy Bruel [:ejpbruel] 2012-07-05 15:45:26 PDT
(In reply to Dave Townsend (:Mossop) from comment #6)
> Maybe Eddy might have some quick thoughts here

nsNPAPIPlugin::CreatePlugin is returning with error code NS_ERROR_OUT_OF_MEMORY (see thread 0, stackframe 14). That looks very suspicious.
Comment 8 Alexandre Poirot [:ochameau] PTO, back on 1st 2012-07-07 16:57:16 PDT
Wouldn't it be related to bug 771847?
Is firefox crashing in bug 771847? If yes, we may have to tweak our test runner in order to retrieve the crash report somehow!!
Comment 9 Konstantin 2012-08-06 09:01:02 PDT
I have the same problem with Firefox 14.0.1 on Windows 7.
Comment 10 Konstantin 2012-08-06 10:39:58 PDT
I turned off all plugins, and now I can develop without problems.
Comment 11 Anant Narayanan [:anant] 2012-08-20 21:43:21 PDT
I can reproduce this. The problem is specifically with the Flash plugin (so far I've seen crashes on last.fm and amazon.com/cloudplayer). Disabling Flash makes the problem go away.
Comment 12 John Nagle 2013-03-22 16:44:42 PDT
I'm seeing this now with Firefox 19.0.2 under "cfx" on Windows 7 Pro.  Firefox hangs on pages which have Flash. Pages without Flash work fine.  One or two copies of "plugin-container.exe" are running.  Firefox does not respond to input, the cursor does not change on mouse-over, and the Firefox window cannot be closed.  

Killing the "plugin-container.exe" processes will sometimes get Firefox going again.
Comment 13 John Nagle 2013-03-22 16:51:55 PDT
Tried opening the error console, then loading a page with Flash with the browser running under "cfx": 
Errors:

Timestamp: 3/22/2013 4:46:55 PM
Warning: ReferenceError: reference to undefined property aEvent.button
Source File: chrome://browser/content/browser.js
Line: 8893
Timestamp: 3/22/2013 4:46:55 PM
Warning: ReferenceError: reference to undefined property e.button
Source File: chrome://browser/content/utilityOverlay.js
Line: 148

Running the same version of Firefox, not under "cfx", works fine for the same pages.
Comment 14 John Nagle 2013-03-22 17:01:01 PDT
The above is with Windows 7 Pro, Firefox 19.0.2, Flash 11.6.602.180, SDK 1.13.2.  
So it's broken with the latest and greatest versions of everything.
Comment 15 Jeff Griffiths (:canuckistani) (:⚡︎) 2013-03-24 16:55:43 PDT
cfx run seems flaky, I wonder if cfx.js will help us here eventually. My workaround is just to use Wladimir's add-on:

https://addons.mozilla.org/en-US/firefox/addon/autoinstaller/

In my mind, the class of problems you might run into because you aren't using a clean Firefox profile is exceedingly rare.
Comment 16 Gabor Krizsanits [:krizsa :gabor] 2013-07-02 07:05:48 PDT
Here is what happens:

- nsNPAPIPlugin::CreatePlugin sends out an RPC init request for the Flash plugin
- Flash returns an error code from NP_Initialize (error code value: 1)
- (regardless of the error code, on Windows we send out an async SendSetAudioSessionData right after; I turned this off and it did not change a thing)
- PPluginModuleParent::CallNP_Shutdown gets called because of the error code, and an RPC NP_Shutdown request is sent out
- PluginModuleChild::AnswerNP_Shutdown (this is the plugin-container process) calls mShutdownFunc(), which is a function in the Flash plugin itself, and it never returns, so our main thread in the Firefox process is blocked waiting for the response to the RPC command...

Problems: 

1., I don't think it's a good idea that the shutdown part is RPC... can we do this part async? Or can we use some kind of timeout and shut it down forcefully after a while (I know this sounds terrible too...)? Letting the plugin-container process deadlock our main thread this way is quite bad. Is there a chance we can fix this somehow? I know how difficult it is to give a good answer to this... I'm just desperately trying to figure out a workaround, since this bug is hurting add-on developers a lot. Any smart hack would be great... (the real solution would be never to block the main thread of the Firefox process, I guess, but that's been on all of our wish lists for a long time and is very hard in practice, right?)

2., If we could figure out why the Flash plugin fails to init in this setup, that would be great. I've been playing with procmon to figure out the root of the failure, and changing things like the cwd of the Firefox process we start up with cfx, but no luck. Should we try to assemble a minimal example and send it to the Flash plugin developers?
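
As a rough illustration of the timeout idea in point 1, here is a Python analogy (it is not Gecko IPC code). The helper name call_with_timeout and the two stand-in shutdown functions are hypothetical. The sketch only shows how a caller can stop waiting on a call that never returns; the stuck worker is abandoned, not killed.

```python
import threading


def call_with_timeout(func, timeout_s):
    """Run func on a daemon thread and wait at most timeout_s seconds.

    Returns (True, value) if func returned in time, (False, None) otherwise.
    The stuck thread is abandoned, not killed.
    """
    box = {}

    def worker():
        box["value"] = func()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        return False, None
    return True, box.get("value")


def shutdown_that_hangs():
    # Hypothetical stand-in for a plugin shutdown function that never returns.
    threading.Event().wait()


def shutdown_that_returns():
    # Hypothetical stand-in for a well-behaved shutdown function.
    return "ok"
```

With these stand-ins, call_with_timeout(shutdown_that_hangs, 0.1) gives up after the timeout instead of blocking the caller forever, which is the behaviour being asked for above.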
Comment 17 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2013-07-02 07:19:18 PDT
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #16)
> Here is what happens:
> 
> - nsNPAPIPlugin::CreatePlugin sends out an rpc init request for the flash
> plugin
> - flash returns an error code from the NP_Initialize (error code value: 1)
> (- regardless of the error code on windows we send out an async
> SendSetAudioSessionData right after, but I turned this off and did not
> change a thing)
> - PPluginModuleParent::CallNP_Shutdown gets called because of the error
> code, an rpc NP_Shutdown request is sent out

This is a bug. If NP_Initialize fails, NP_Shutdown should not be called. Please file a separate bug for that.

> 1., I don't think it's a good idea that the shutdown part is rpc... can we
> do this part async?

Not really, no.

> Or can we use some kind of timeout and shut it down
> forcefully after a while (I know this sounds terrible too...)?

We already have a timeout for RPC calls; 45 seconds in release builds and infinite in debug builds. It should also show the plugin hang UI on Windows.

> 
> 2., If we could figure out why the flash plugin fails to init in this setup
> that would be great. I've been playing with procmon to figure out
> the root of the failure, and changing stuff like the cwd of the firefox
> process we start up with cfx, but no luck. Should we try and assemble a
> minimal example, and send it to the flash plugin developers?

It is unlikely that this will get attention from Adobe. It sounds like your test runner is setting some environment variable or other setup which causes Flash to fail. You can of course just debug the Flash player at the point we call NP_Initialize to see if you can figure out what's going on.
Comment 18 Gabor Krizsanits [:krizsa :gabor] 2013-07-02 11:26:25 PDT
Thanks for all the useful info.

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #17)
> This is a bug. If NP_Initialize fails, NP_Shutdown should not be called.
> Please file a separate bug for that.

Bug 889480.

> We already have a timeout for RPC calls; 45 seconds in release builds and
> infinite in debug builds. It should also show the plugin hang UI on Windows.

Alright, I don't think I have anything better than that...

> It is unlikely that this will get attention from Adobe.

I was afraid you were going to say this :)

> You can of course just debug the Flash player at the point we
> call NP_Initialize to see if you can figure out what's going on.

I wish I had a debug version of the Flash player, or the source code... I feel like I'm poking a black box with a stick and hoping that it starts to work by accident...
Comment 19 Gabor Krizsanits [:krizsa :gabor] 2013-07-02 11:34:31 PDT
How much of a win would it be if, instead of hanging on sites for a long time, the Flash plugin simply failed to load in this setup? Hopefully we can figure out why Flash fails to init (and I think we should); I'm just trying to estimate how much we would gain if Bug 889480 were fixed.
Comment 20 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2013-07-02 11:37:25 PDT
cfx is a Python tool, right? The slightly tedious way to do this is to launch Firefox from Python using subprocess (or whatever cfx is doing) and progressively add special environment setup until Flash fails. Some possibilities:

* security descriptors on process launch
* Custom profile, files or permissions
* Unusual prefs
* Environment variables
* Process groups

In fact, if cfx is using process groups at all, it's possible that is the cause. Flash probably uses groups in its own way to enable its sandbox, and if cfx is setting up its own process groups that could interfere.
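
A minimal sketch of that bisection approach, assuming nothing about what cfx actually sets up: the step labels, the EXAMPLE_VAR variable, and the helper names cumulative_setups and launch are all hypothetical. Each yielded setup contains every earlier step plus one new one, so the first setup in which Flash fails points at the step that was just added.

```python
import os
import subprocess


def cumulative_setups(steps):
    """Yield (label, env_overrides, popen_kwargs), adding one step each time."""
    env_overrides, popen_kwargs = {}, {}
    for label, env_add, kwargs_add in steps:
        env_overrides = {**env_overrides, **env_add}
        popen_kwargs = {**popen_kwargs, **kwargs_add}
        yield label, env_overrides, popen_kwargs


def launch(cmd, env_overrides, popen_kwargs):
    """Start cmd with the parent's environment plus the given overrides."""
    env = dict(os.environ)
    env.update(env_overrides)
    return subprocess.Popen(cmd, env=env, **popen_kwargs)


# Hypothetical steps; replace them with whatever cfx really configures.
EXAMPLE_STEPS = [
    ("environment variable", {"EXAMPLE_VAR": "1"}, {}),
    ("custom working directory", {}, {"cwd": os.getcwd()}),
]
```

To use it, loop over cumulative_setups(EXAMPLE_STEPS), call launch() with the path to the Firefox binary for each setup, and load a Flash page by hand each time.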
Comment 21 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2013-07-02 11:37:55 PDT
When 889480 is fixed Flash will no longer hang, it just won't work.
Comment 22 Gabor Krizsanits [:krizsa :gabor] 2013-07-02 11:55:07 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #20)
> cfx is a python tool, right? The slightly tedious way to do this is to
> launch Firefox from Python using subprocess (or whatever cfx is doing) and
> progressively add special environment setup until Flash fails. Some
> possibilities:
> 
> * security descriptors on process launch
> * Custom profile, files or permissions
> * Unusual prefs
> * Environment variables
> * Process groups
> 
> In fact, if cfx is using process groups at all, it's possible that is the
> cause. Flash probably uses groups in its own way to enable its sandbox, and
> if cfx is setting up its own process groups that could interfere.

I don't see much there yet:
env vars:
https://github.com/mozilla/addon-sdk/blob/master/python-lib/cuddlefish/runner.py#L502
process start:
https://github.com/mozilla/addon-sdk/blob/master/python-lib/mozrunner/__init__.py#L56-L59

The no-remote flag could be interesting...

I'm sure you know a lot more about killableprocess.py than I do :) Does it do anything special in the way it starts up the process?
Comment 23 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2013-07-02 12:07:40 PDT
Holy crap, are you guys still using killableprocess?

https://github.com/mozilla/addon-sdk/blob/master/python-lib/mozrunner/killableprocess.py#L119 could well be the cause of this. Try removing that flag and see what happens.
Comment 24 Dave Townsend [:mossop] 2013-07-02 12:09:04 PDT
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #19)
> How much of a win would be if instead of hanging on sites for a long long
> time, flash plugin would simply just fail to load instead in this setup?
> Hopefully we can figure out why flash fails to init (and I think we should),
> just trying to estimate how much we would gain if Bug 889480 were fixed.

I'll certainly take that over hanging any day of the week. I do think having flash working is useful for certain use cases (add-ons that affect youtube f.e.)
Comment 25 Dave Townsend [:mossop] 2013-07-02 12:10:07 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #23)
> Holy crap, are you guys still using killableprocess?
> 
> https://github.com/mozilla/addon-sdk/blob/master/python-lib/mozrunner/
> killableprocess.py#L119 could well be the cause of this. Try removing that
> flag and see what happens.

We made the mistake of forking mozrunner some time ago, and we haven't taken the time to get ourselves onto a more recent version :(
Comment 26 Gabor Krizsanits [:krizsa :gabor] 2013-07-02 12:18:47 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #23)
> Holy crap, are you guys still using killableprocess?
> 
> https://github.com/mozilla/addon-sdk/blob/master/python-lib/mozrunner/
> killableprocess.py#L119 could well be the cause of this. Try removing that
> flag and see what happens.

It didn't help, but at least I know where to focus now. I know a bit about Windows processes; I'm just not familiar with all the Python code in the SDK, and have been trying to find my way around it so far... I'll play with it some more tomorrow.
Comment 27 Gabor Krizsanits [:krizsa :gabor] 2013-07-03 05:26:07 PDT
CREATE_BREAKAWAY_FROM_JOB is the main problem; the Flash plugin starts a child process as well at some point. If I comment it out, I get all sorts of errors because we cannot access the process and some of its handles...

  File "c:\development\addonsdk\trunk\addon-sdk\python-lib\cuddlefish\runner.py", line 708, in run_app
    runner.start()
  File "c:\development\addonsdk\trunk\addon-sdk\python-lib\mozrunner\__init__.py", line 532, in start
    self.process_handler = run_command(self.command+self.cmdargs, self.env, **self.kp_kwargs)
  File "c:\development\addonsdk\trunk\addon-sdk\python-lib\mozrunner\__init__.py", line 60, in run_command
    return killableprocess.Popen(cmd, cwd="c:/Development/mozilla/mozilla-central3/obj/dist/bin", env=env, **killable_kwargs)
  File "c:\mozilla-build\python\lib\subprocess.py", line 679, in __init__
    errread, errwrite)
  File "c:\development\addonsdk\trunk\addon-sdk\python-lib\mozrunner\killableprocess.py", line 165, in _execute_child
    winprocess.AssignProcessToJobObject(self._job, int(hp))
  File "c:\development\addonsdk\trunk\addon-sdk\python-lib\mozrunner\winprocess.py", line 51, in ErrCheckBool
    raise WinError()
WindowsError: [Error 5] Access is denied.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "c:\mozilla-build\python\lib\atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "c:\development\addonsdk\trunk\addon-sdk\python-lib\cuddlefish\runner.py", line 536, in maybe_remove_outfile
    os.remove(outfile)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\garbo\\appdata\\local\\temp\\harness-stdout-m7uwfe'
Error in sys.exitfunc:
Traceback (most recent call last):
  File "c:\mozilla-build\python\lib\atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "c:\development\addonsdk\trunk\addon-sdk\python-lib\cuddlefish\runner.py", line 536, in maybe_remove_outfile
    os.remove(outfile)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\garbo\\appdata\\local\\temp\\harness-stdout-m7uwfe'

But if I comment out CREATE_SUSPENDED, then the process starts regardless, and Flash works in it. I have to comment out the in/out/err channels as well to have the console working as before, though, and the closing of the process is less than ideal...

There should be a way to set a flag on the job object (JOB_OBJECT_LIMIT_BREAKAWAY_OK) so that no subsequent child process will be part of this job. That might be enough to make Flash work and would be a minimal change... but I don't know how to do that from Python... Frankly, I don't know much about Windows jobs; I just found this flag on MSDN.
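
For reference, a sketch of how a limit flag could be set on a job object from Python with plain ctypes. This is not code from killableprocess.py; the helper names make_limits and set_job_limits are hypothetical, and it only shows the mechanics of SetInformationJobObject, not a fix for the hang. Note that, per the Windows documentation, JOB_OBJECT_LIMIT_BREAKAWAY_OK only lets children created with CREATE_BREAKAWAY_FROM_JOB leave the job, while JOB_OBJECT_LIMIT_SILENT_BREAKAWAY_OK is the flag that keeps every child out of the job.

```python
import ctypes
import sys

# Values from the Windows SDK headers (winnt.h).
JOB_OBJECT_LIMIT_BREAKAWAY_OK = 0x00000800
JOB_OBJECT_LIMIT_SILENT_BREAKAWAY_OK = 0x00001000
JOB_OBJECT_EXTENDED_LIMIT_INFORMATION_CLASS = 9  # JobObjectExtendedLimitInformation


class IO_COUNTERS(ctypes.Structure):
    _fields_ = [(name, ctypes.c_uint64) for name in (
        "ReadOperationCount", "WriteOperationCount", "OtherOperationCount",
        "ReadTransferCount", "WriteTransferCount", "OtherTransferCount")]


class JOBOBJECT_BASIC_LIMIT_INFORMATION(ctypes.Structure):
    _fields_ = [
        ("PerProcessUserTimeLimit", ctypes.c_int64),
        ("PerJobUserTimeLimit", ctypes.c_int64),
        ("LimitFlags", ctypes.c_uint32),
        ("MinimumWorkingSetSize", ctypes.c_size_t),
        ("MaximumWorkingSetSize", ctypes.c_size_t),
        ("ActiveProcessLimit", ctypes.c_uint32),
        ("Affinity", ctypes.c_size_t),
        ("PriorityClass", ctypes.c_uint32),
        ("SchedulingClass", ctypes.c_uint32),
    ]


class JOBOBJECT_EXTENDED_LIMIT_INFORMATION(ctypes.Structure):
    _fields_ = [
        ("BasicLimitInformation", JOBOBJECT_BASIC_LIMIT_INFORMATION),
        ("IoInfo", IO_COUNTERS),
        ("ProcessMemoryLimit", ctypes.c_size_t),
        ("JobMemoryLimit", ctypes.c_size_t),
        ("PeakProcessMemoryUsed", ctypes.c_size_t),
        ("PeakJobMemoryUsed", ctypes.c_size_t),
    ]


def make_limits(flags):
    """Build a zeroed extended-limit structure with only LimitFlags set."""
    info = JOBOBJECT_EXTENDED_LIMIT_INFORMATION()
    info.BasicLimitInformation.LimitFlags = flags
    return info


def set_job_limits(job_handle, flags):
    """Apply limit flags to an existing job object (Windows only)."""
    if sys.platform != "win32":
        raise OSError("job objects exist only on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.SetInformationJobObject.argtypes = [
        ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p, ctypes.c_uint32]
    kernel32.SetInformationJobObject.restype = ctypes.c_int
    info = make_limits(flags)
    ok = kernel32.SetInformationJobObject(
        job_handle, JOB_OBJECT_EXTENDED_LIMIT_INFORMATION_CLASS,
        ctypes.byref(info), ctypes.sizeof(info))
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
```

The job handle would come from wherever the job is created (in this tree, the winprocess.py wrapper around CreateJobObject), and the flags have to be set before the process is assigned to the job.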

Also, I'm not really sure why we need this whole killableprocess thing. It would probably be great to do something simpler...
Comment 28 Ted Mielczarek [:ted.mielczarek] 2013-07-03 06:50:51 PDT
I suspect most of it stems from the days when Firefox would restart itself out from under you, to ensure that you could kill the actual process.
Comment 29 Gabor Krizsanits [:krizsa :gabor] 2013-07-08 06:44:46 PDT
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #28)
> I suspect most of it stems from the days when Firefox would restart itself
> out from under you, to ensure that you could kill the actual process.

So it seems that if I don't set the CREATE_BREAKAWAY_FROM_JOB flag and don't create a Windows job object (and assign the process to it), Flash just works. What might be the downside of this approach (if there is any) in the current world?
Comment 30 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2013-07-08 07:06:49 PDT
The point of killableprocess is that it would not only kill the process you launch, but also any *other* child process that gets launched. Removing the job/CREATE_BREAKAWAY_FROM_JOB code will basically mean that killableprocess at best only kills the one process it launches, and not any subprocesses such as plugin processes.
Comment 31 Gabor Krizsanits [:krizsa :gabor] 2013-07-08 07:30:44 PDT
I think that's still better than the current version we have. I wish I could come up with something better, but I've tried out pretty much all the flags Windows has to offer here for the job object, and either Flash does not like them or we get an access denied error when trying to access the process handle. So I can keep playing with it for a while, but I'm getting a bit pessimistic about finding a better solution here. The good thing is that hitting the close button of the browser kills everything nicely, at least, but if add-ons get their own separate process in the future, this is going to be a problem. What do you think, Dave?
Comment 32 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2013-07-08 07:39:27 PDT
It's possible you can experiment with giving the Firefox process the JOB_OBJECT_LIMIT_BREAKAWAY_OK permission, but that assumes that Flash creates its subprocesses with CREATE_BREAKAWAY_FROM_JOB, which is... unlikely.
Comment 33 Gabor Krizsanits [:krizsa :gabor] 2013-07-08 07:49:05 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #32)
> Its possible you can experiment with giving the Firefox process the
> JOB_OBJECT_LIMIT_BREAKAWAY_OK permission, but that assumes that Flash
> creates its subprocesses with CREATE_BREAKAWAY_FROM_JOB, which is...
> unlikely.

Exactly. I was desperate enough to try that, but it is not the case, and it's unlikely that anyone will make that change... Furthermore, any other plugin can create custom processes.
Comment 34 Gabor Krizsanits [:krizsa :gabor] 2013-07-08 10:55:01 PDT
I talked to Dave about this one, and he agreed that this approach is still a lot better than what we have right now. But we were wondering how the current mozrunner solves the shutdown problem without killableprocess, and whether we could integrate it or migrate to it easily.
Comment 35 Gabor Krizsanits [:krizsa :gabor] 2013-07-10 08:59:41 PDT
So the current version of mozrunner does the same thing (creating a job and all), so in this respect it's very similar. We might want to migrate to it at some point, but it won't fix our problem.

On try my fix seems to be green...

https://tbpl.mozilla.org/?tree=Try&rev=a61bff0e0abb
Comment 36 Gabor Krizsanits [:krizsa :gabor] 2013-07-15 03:46:55 PDT
Created attachment 775577
Pointer to Github pull request: https://github.com/mozilla/addon-sdk/pull/1104/files
Comment 37 [github robot] 2013-07-17 14:30:27 PDT
Commits pushed to master at https://github.com/mozilla/addon-sdk

https://github.com/mozilla/addon-sdk/commit/465b49e7c0d74899e1fef7ad2f8425b4c2e85fa0
Bug 768651 - Cfx run hangs on windows on sites using flash

https://github.com/mozilla/addon-sdk/commit/980641c74debec6ed76a021fc5be13124d59164a
Merge pull request #1104 from krizsa/master

Fixing Bug 768651 - cfx run hangs on windows on sites with flash. r=Mossop
Comment 38 Wes Kocher (:KWierso) 2013-09-16 08:51:17 PDT
*** Bug 916849 has been marked as a duplicate of this bug. ***
Comment 39 cprcrack 2013-10-14 11:11:23 PDT
It's still hanging for me on youtube.com with Firefox Setup Stub 25.0b7
Comment 40 Dave Townsend [:mossop] 2013-10-14 11:13:54 PDT
(In reply to cprcrack from comment #39)
> It's still hanging for me on youtube.com with Firefox Setup Stub 25.0b7

Which version of the SDK?
Comment 41 John Nagle 2013-10-14 11:59:13 PDT
It's not supposed to be "fixed" until Firefox 25. The release channel is still at 24.
Comment 42 cprcrack 2013-10-14 17:12:27 PDT
(In reply to John Nagle from comment #41)
> It's not supposed to be "fixed" until Firefox 25. The release channel is
> still at 24.

I downloaded from the beta channel at <http://www.mozilla.org/firefox/beta/>, version 25.0 beta 7. It's supposed to be fixed there right?

(In reply to Dave Townsend (:Mossop) from comment #40)
> Which version of the SDK?

The latest: addon-sdk-1.14
Comment 43 Dave Townsend [:mossop] 2013-10-15 08:59:04 PDT
(In reply to cprcrack from comment #42)
> (In reply to John Nagle from comment #41)
> > It's not supposed to be "fixed" until Firefox 25. The release channel is
> > still at 24.
> 
> I downloaded from the beta channel at
> <http://www.mozilla.org/firefox/beta/>, version 25.0 beta 7. It's supposed
> to be fixed there right?
> 
> (In reply to Dave Townsend (:Mossop) from comment #40)
> > Which version of the SDK?
> 
> The latest: addon-sdk-1.14

Unfortunately, the current release version of the SDK is too old to contain this fix. We're working on releasing 1.15 soon to address this.
Comment 44 tim 2014-01-05 15:30:54 PST
I'm running 1.15 and FF 26 on Win 7 pro and still appear to be getting the error.  Any recommendations to investigate?
