Closed Bug 959551 Opened 10 years ago Closed 10 years ago

Mozmill keeps a zombie process open after a restart of Firefox

Categories

(Testing Graveyard :: Mozmill, defect)

All
macOS
defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

(Whiteboard: [mozmill-2.0.4+])

Attachments

(4 files, 7 obsolete files)

As what I have seen recently is that Mozmill keeps a zombie process active when Firefox gets restarted. This happens at least with debug builds.

After the first start:

$ ps -ef |grep firefox
  501 30598 25557   0  1:26PM ttys000    0:00.60 /usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python /usr/local/bin/mozmill -t ../mozmill-tests/nightly/firefox/tests/functional/restartTests/testAddons_enableDisableExtension/ -b FirefoxNightlyDebug.app/ --debug
  501 30604 30598   0  1:26PM ttys000    0:11.54 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox -jsconsole -foreground -profile /var/folders/wd/zmy4z7xn7wd7sjq90z1y52f80000gn/T/tmpAVbia4.mozrunner
  501 30611 14614   0  1:26PM ttys001    0:00.00 grep firefox

After the restart:

$ ps -ef |grep firefox
  501 30598 25557   0  1:26PM ttys000    0:00.62 /usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python /usr/local/bin/mozmill -t ../mozmill-tests/nightly/firefox/tests/functional/restartTests/testAddons_enableDisableExtension/ -b FirefoxNightlyDebug.app/ --debug
  501 30604 30598   0  1:26PM ttys000    0:00.00 (firefox)
  501 30615     1   0  1:26PM ttys000    0:00.02 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox

As you can see the parent process with id 30598 which is mozmill, keeps a firefox process active. All this might be the reason why we are failing on bug 956315.
So i added some debug lines in the mozmill code to figure out what's going on. I was surprised to see the following:

1. With the first start of firefox we spawn the process with id 5902, whereby 5896 is the pid of mozmill:

  501  5902  5896   0 11:31PM ttys000    0:00.88 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox -foreground -profile /var/folders/wd/zmy4z7xn7wd7sjq90z1y52f80000gn/T/tmp0lMGCM.mozrunner

2. When Firefox restarts itself we have the following situation for a split of a second. Means it spawns a new process as child of the first process. Which is ok.

  501  5902  5896   0 11:31PM ttys000    0:00.00 (firefox)
  501  5930  5902   0 11:32PM ttys000    0:00.02 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox

3. But a couple of milliseconds later the parent process changes, so that is no longer a child of the first process. But that one keeps around because we don't call process.wait() on it.

  501  5902  5896   0 11:31PM ttys000    0:00.00 (firefox)
  501  5930     1   0 11:32PM ttys000    0:00.30 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox

Situation 3 will remain until the new process gets stopped. Then both processes are gone.

As it looks like mozmill has to trigger mozrunner or even the process handler to update the pid for the new process. Joel, given that you are very familiar with mozprocess, do you have an idea?
Flags: needinfo?(jmaher)
Attached file wip (using mozprocess directly) (obsolete) —
I wrote a little script by using mozprocess directly and that seems to solve the problem perfectly:

501 10355 10349   0 12:02AM ttys000    0:06.65 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox -foreground -profile /var/folders/wd/zmy4z7xn7wd7sjq90z1y52f80000gn/T/tmp2IFn5I.mozrunner

501 10374     1   0 12:03AM ttys000    0:00.06 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox

There is no zombie process hanging around. So it looks like that we indeed have to call wait() on the process everytime Firefox gets restarted.
Flags: needinfo?(jmaher)
I got it temporarily fixed in mozmill! Tomorrow I will work on a final patch which will hopefully handle all the different cases of restarts and aborts. Challenging!
Attached patch WIP v1.0 (obsolete) — Splinter Review
This fixes the problem for me on OS X with debug builds. Andrei, can you also please have a look? Also it would be great if you could get it tested on Windows with opt and debug builds. I will test Linux.
Attachment #8360703 - Attachment is obsolete: true
Attachment #8360924 - Flags: feedback?(andrei.eftimie)
Comment on attachment 8360924 [details] [diff] [review]
WIP v1.0

Review of attachment 8360924 [details] [diff] [review]:
-----------------------------------------------------------------

I see no more zombie processes with this WIP patch on OSX while running tests via mozmill.

But this fails while running tests through mozmill-automation:
> ************************************************************
> * Call to xpconnect wrapped JSObject produced this error:  *
> [Exception... "'Error: PR_Bind() failed (-5982)' when calling method: [nsIObserver::observe]"  nsresult: "0x8057001c (NS_ERROR_XPC_JS_THREW_JS_OBJECT)"  location: "native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0"  data: no]
> ************************************************************

I'll shortly test this on Windows as well.

This has feedback+ as I think the approach is good.
We need to properly wait and close FF processes before continuing.
Attachment #8360924 - Flags: feedback?(andrei.eftimie) → feedback+
(In reply to Andrei Eftimie from comment #5)
> But this fails while running tests through mozmill-automation:
> > ************************************************************
> > * Call to xpconnect wrapped JSObject produced this error:  *
> > [Exception... "'Error: PR_Bind() failed (-5982)' when calling method: [nsIObserver::observe]"  nsresult: "0x8057001c (NS_ERROR_XPC_JS_THREW_JS_OBJECT)"  location: "native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0"  data: no]
> > ************************************************************

Is is totally unrelated to this bug. And i know that this exists because bug 956315 has still not been fixed. Thanks for testing!
And some bad news.

This fails in Windows:
> Parent process 82316 exited with children alive:
> PIDS: 82248
> Attempting to kill them...
> Parent process 82316 exited with children alive:
> PIDS: 82248
> Attempting to kill them...
> Error Code 6 trying to query IO Completion Port, exiting
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File "C:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\Lib\threading.py", line 551, in __bootstrap_inner
>     self.run()
>   File "C:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\Lib\threading.py", line 504, in run
>     self.__target(*self.__args, **self.__kwargs)
>   File "c:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\lib\site-packages\mozprocess\processhandler.py", line 302, in _procmgr
>     self._poll_iocompletion_port()
>   File "c:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\lib\site-packages\mozprocess\processhandler.py", line 352, in _poll_iocompletion_port
>     raise WinError(errcode)
> WindowsError: [Error 6] The handle is invalid.
There is a lack of context again. When did that happen? Which test? Which kind of restart or shutdown? The information above and the log is not helpful. You know that I always ask for a full log.
While thinking about the failure on Windows I think the io completion port takes already care of it, so we do not have to call self.runner.wait(1.). Andrei, can you please remove that line and try again?
There's not much more here.
Its the first attempted restart.

On Windows 8 with a regular Nightly build.
(I can't run debug builds ATM duew to some missing DLL, I think I require Visual Studio installed) 

> $ mozmill -m /c/work/mozmill-tests/firefox/tests/functional/restartTests/manifest.ini -b /c/Mozilla/Nightly/en-US/firefox/firefox.exe
> TEST-SKIPPED | test1.js | Bug 931704 - plainTheme is undefined.
> TEST-SKIPPED | test2.js | Bug 931704 - plainTheme is undefined.
> TEST-SKIPPED | test3.js | Bug 931704 - plainTheme is undefined.
> 
> TEST-START | testAddons_enableDisableExtension\test1.js | setupModule
> TEST-START | testAddons_enableDisableExtension\test1.js | testInstallAddon
> TEST-PASS | testAddons_enableDisableExtension\test1.js | testInstallAddon
> TEST-START | testAddons_enableDisableExtension\test1.js | teardownModule
> TEST-END | testAddons_enableDisableExtension\test1.js | finished in 2241ms
> Parent process 82736 exited with children alive:
> PIDS: 83736
> Attempting to kill them...
> Parent process 82736 exited with children alive:
> PIDS: 83736
> Attempting to kill them...
> Error Code 6 trying to query IO Completion Port, exiting
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File "C:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\Lib\threading.py", line 551, in __bootstrap_inner
>     self.run()
>   File "C:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\Lib\threading.py", line 504, in run
>     self.__target(*self.__args, **self.__kwargs)
>   File "c:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\lib\site-packages\mozprocess\processhandler.py", line 302, in _procmgr
>     self._poll_iocompletion_port()
>   File "c:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\lib\site-packages\mozprocess\processhandler.py", line 352, in _poll_iocompletion_port
>     raise WinError(errcode)
> WindowsError: [Error 6] The handle is invalid.
>
Yep, works fine without calling self.runner.wait(0.1) for restarts
Good to know. Thanks! I will make this call Unix dependent then. Also noticed that the same call has to be inserted into run_test_file() for the new type of restart tests.
Severity: normal → major
Attached patch WIP 1.1 (obsolete) — Splinter Review
Updated patch which should also work with mutt tests and the new restart test type. I will do further testing in a bit.
Attachment #8360924 - Attachment is obsolete: true
Attachment #8361073 - Flags: feedback?(andrei.eftimie)
Still testing this.

One thing I noticed is that in some circumstances we don't remain with a zombie process open, but (I assume) we fail to reconnect to the Firefox process and we never timeout (I had to use Ctrl+C to end the run).

Here is the log: https://pastebin.mozilla.org/4035241
This is on OSX with a Debug build.
Andrei, we have a timeout of 60s for reconnecting the socket. As the times show you weren't waiting long enough here. And again, what you are seeing is the result of bug 956315.
I just run into the jsbridge disconnect issue because the socket is still in use. So as the output here shows we indeed have a socket remaining, even with my patch applied. So I'm not sure right now, why this is still happening exactly at this moment. But it's interesting that we are hanging in processhandler to wait for the process. Sadly we do not see the pid, so I will add more debugging output.

But at least my suspicion seem to be right.
Attachment #8361751 - Attachment mime type: text/x-log → text/plain
Actually the above log comes from a test which is doing an user restart:

https://github.com/mozilla/mozmill/blob/master/mutt/mutt/tests/python/js-modules/restartTests/testCountTestPasses.js#L31

The usage of usershutdown is not allowed as of now because we know it is broken and bug 886360 should handle that. Exactly that might also be the reason why user restart tests are broken.

I will fix the test for now and make a comment on bug 886360.
Blocks: 886360
Blocks: 794020
Attached patch patch v1 (obsolete) — Splinter Review
This should be the final patch for this bug. It works pretty fine for me across platforms with the Mutt tests and the tests from the mozmill-tests repository.

Cosmin, can you please run some additional testruns for the above tests across platform? That would be pretty helpful. Thanks!
Attachment #8361073 - Attachment is obsolete: true
Attachment #8361073 - Flags: feedback?(andrei.eftimie)
Attachment #8363988 - Flags: review?(dave.hunt)
Attachment #8363988 - Flags: feedback?(cosmin.malutan)
Comment on attachment 8363988 [details] [diff] [review]
patch v1

Python mutt test /mozmill/mutt/mutt/tests/python/test_restart.py fails with the patch applied on Mac 10.7.5 and on Ubuntu 12.04  with Nightly. On windows I had diffrent errors with and without the patch applied. 
I will ran some testruns too.
Comment on attachment 8363988 [details] [diff] [review]
patch v1

It crashed with Firefox already open in testrun both on mac 10.7.5 and on windows 8.1, with Nightly. 
If you're sure it works for you I will check again.
Attachment #8363988 - Flags: feedback?(cosmin.malutan) → feedback-
(In reply to Cosmin Malutan from comment #19)
> Python mutt test /mozmill/mutt/mutt/tests/python/test_restart.py fails with
> the patch applied on Mac 10.7.5 and on Ubuntu 12.04  with Nightly. On
> windows I had diffrent errors with and without the patch applied. 

All this information is pretty useless without any logs or details. Not sure how often I have to request such kind of information again and again. Please provide those details by yourself!
Flags: needinfo?(cosmin.malutan)
(In reply to Cosmin Malutan from comment #20)
> It crashed with Firefox already open in testrun both on mac 10.7.5 and on
> windows 8.1, with Nightly. 

Crashed what? Again, I cannot action on it because of none information.
Comment on attachment 8363988 [details] [diff] [review]
patch v1

Review of attachment 8363988 [details] [diff] [review]:
-----------------------------------------------------------------

The code changes look okay to me, with a couple of comments. We will need more information regarding the test results with this patch applied though.

::: mozmill/mozmill/__init__.py
@@ +482,5 @@
> +            # To avoid having zombie processes we have to release the old process
> +            # by calling wait(). There has to be a timeout so we do not wait forever
> +            # for the new pid, but the value doesn't matter because we will wait
> +            # in create_network for a new connection within 60s.
> +            if not mozinfo.isWin:

Could we add something to the comment to explain why this is not needed for Windows?

@@ +488,5 @@
> +        else:
> +            # It's a JS triggered shutdown of the application. We
> +            # have to wait until the process is gone. Otherwise we
> +            # try to use a profile which is still in use
> +            self.runner.wait()

Why is the timeout not needed here?
Attachment #8363988 - Flags: review?(dave.hunt) → review+
Attached file mac_failure (socket still in use) (obsolete) —
(In reply to Henrik Skupin (:whimboo) from comment #21)
> All this information is pretty useless without any logs or details. Not sure
> how often I have to request such kind of information again and again. Please
> provide those details by yourself!
This was on osx 10.7.5 with latest Nightly 

Here is what I did:
  * I download and installed mozmill
  * I applied the patch
  * I download and installed mozmill-automation
  * I ran 'testran_functional --report=http://mozmill-crowd.blargon7.com/db/ /Applications/FirefoxNightly.app/'

Mutt tests I ran as:
  * mutt testall -b /Applications/FirefoxNightly.app/
  * it failed on  /mozmill/mutt/mutt/tests/python/test_restart.py
Flags: needinfo?(cosmin.malutan)
Attached file windows8.1_failure (obsolete) —
Here is the report from windows 8.1 with latest Nightly
I did the same steps
Comment on attachment 8364261 [details]
mac_failure (socket still in use)

(In reply to Cosmin Malutan from comment #24)
>   * I download and installed mozmill
>   * I applied the patch
>   * I download and installed mozmill-automation
>   * I ran 'testran_functional --report=http://mozmill-crowd.blargon7.com/db/
> /Applications/FirefoxNightly.app/'
 
This patch is not going to fix the socket issue, you are hitting here. This is for not having any zombie process. So while running the test you should do a 'ps aux | grep firefox' and check for zombie processes after Firefox gets restarted by the tests.

> Mutt tests I ran as:
>   * mutt testall -b /Applications/FirefoxNightly.app/
>   * it failed on  /mozmill/mutt/mutt/tests/python/test_restart.py

And? I still don't see why it failed. Still no helpful information has been specified here! But I assume it is the same issue from above. If you want to be sure all is fixed you might want to also apply my patch from bug 956315.

Also if something is unclear how to execute specific tests I would love if you could ask on IRC.
Attachment #8364261 - Attachment description: mac_failure → mac_failure (socket still in use)
Flags: needinfo?(cosmin.malutan)
Comment on attachment 8364264 [details]
windows8.1_failure

(In reply to Cosmin Malutan from comment #25)
> Here is the report from windows 8.1 with latest Nightly
> I did the same steps

This is something totally different and should be tracked separately. But I assume it's bug 794020. Something in testGreenLarry.js is failing here, and it not related to restarts at all. Please read through the bug or ask on IRC if steps or expected results are unclear.
Attachment #8364264 - Attachment is obsolete: true
(In reply to Dave Hunt (:davehunt) from comment #23)
> The code changes look okay to me, with a couple of comments. We will need
> more information regarding the test results with this patch applied though.

I can run all of the tests again and push the results to the dashboard.

> > +            # To avoid having zombie processes we have to release the old process
> > +            # by calling wait(). There has to be a timeout so we do not wait forever
> > +            # for the new pid, but the value doesn't matter because we will wait
> > +            # in create_network for a new connection within 60s.
> > +            if not mozinfo.isWin:
> 
> Could we add something to the comment to explain why this is not needed for
> Windows?

Sure. Will do so.

> > +        else:
> > +            # It's a JS triggered shutdown of the application. We
> > +            # have to wait until the process is gone. Otherwise we
> > +            # try to use a profile which is still in use
> > +            self.runner.wait()
> 
> Why is the timeout not needed here?

Hm, it's not a change made in this patch, but good question. I think we should give it a timeout similar to the normal stop. But personally I feel a bit anxious to use 10 seconds here. If Firefox takes a bit longer to shutdown, we force kill the process. What do you think about using the standard 60 seconds of timeout here too?
Attached file mac_failure (obsolete) —
With both patches it passed on windows and linux, though on mac I still get a disconnect when running a Nightly build.
When I ran with a Nightly-debug build it passed, also I checked the zombie processes while running the tests(ps aux | grep firefox) and it cleans them, and I saw them without the patch applied.
I will test it again tomorrow probably is something with my machine.
Attachment #8364261 - Attachment is obsolete: true
Flags: needinfo?(cosmin.malutan)
(In reply to Henrik Skupin (:whimboo) from comment #28)
> Hm, it's not a change made in this patch, but good question. I think we
> should give it a timeout similar to the normal stop. But personally I feel a
> bit anxious to use 10 seconds here. If Firefox takes a bit longer to
> shutdown, we force kill the process. What do you think about using the
> standard 60 seconds of timeout here too?

Dave, what do you think?
Flags: needinfo?(dave.hunt)
Attached patch Patch v2Splinter Review
Just updated the patch. Dave, is that better now?
Attachment #8363988 - Attachment is obsolete: true
Attachment #8364971 - Flags: review+
Hi henrik, an update about the failure I see.
I made a new virtualenv i cloned mozmill, I applied the two patches and I ran tests. It passed every single time.
I I cloned mozbase too and I installed it, it still passed.
I cloned mozmill-automation and I installed it and tests "FAILS" every single time.

So as an input I guess is a dependency installed by mozmill-automation that makes the tests to fail after we install it.
Comment on attachment 8364971 [details] [diff] [review]
Patch v2

Yeah, looks good. Thanks Henrik!
Flags: needinfo?(dave.hunt)
(In reply to Cosmin Malutan from comment #32)
> Hi henrik, an update about the failure I see.
> I made a new virtualenv i cloned mozmill, I applied the two patches and I
> ran tests. It passed every single time.
> I I cloned mozbase too and I installed it, it still passed.
> I cloned mozmill-automation and I installed it and tests "FAILS" every
> single time.

Ok, I would suggest that you run pip freeze and tell me all the dependencies. But given that this is a separate project and we indeed could have version conflicts, we should handle it in a separate issue, if it still happens once we bumped all deps for mozbase.

I will go ahead and get this patch landed. Thanks for all the testing and your feedback!
https://github.com/mozilla/mozmill/commit/429e4951d7bf12ca5b085df8aaddecc831d4d1f6 (master)

The test change caused a merge conflict. I'm going to fix that and upload an updated version of the patch.
Flags: in-testsuite+
Backport patch for hotfix-2.0 branch.
Attachment #8365012 - Flags: review?(dave.hunt)
Attached file dependencies
Here are the dependencies when it works and after we install mozmill-automation.
Th one that changes are:
mozfile==1.1 > mozfile==1.0
mozlog==1.5 > mozlog==1.2
Attachment #8364395 - Attachment is obsolete: true
Attachment #8365012 - Flags: review?(dave.hunt) → review+
(In reply to Cosmin Malutan from comment #37)
> mozmill-automation.
> Th one that changes are:
> mozfile==1.1 > mozfile==1.0
> mozlog==1.5 > mozlog==1.2

This should only cause a problem with removing the files after a shutdown of the application. But that should never cause an application disconnect. So lets keep an eye on it over the next couple of days.
https://github.com/mozilla/mozmill/commit/5c254c3c99b59441341b5b9ab56064ff2426ba8a (hotfix-2.0)
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: [mozmill-2.0.4+]
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: