Closed
Bug 959551
Opened 10 years ago
Closed 10 years ago
Mozmill keeps a zombie process open after a restart of Firefox
Categories
(Testing Graveyard :: Mozmill, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: whimboo, Assigned: whimboo)
References
Details
(Whiteboard: [mozmill-2.0.4+])
Attachments
(4 files, 7 obsolete files)
44.61 KB,
text/plain
|
Details | |
4.69 KB,
patch
|
whimboo
:
review+
|
Details | Diff | Splinter Review |
4.71 KB,
patch
|
davehunt
:
review+
|
Details | Diff | Splinter Review |
887 bytes,
text/plain
|
Details |
As what I have seen recently is that Mozmill keeps a zombie process active when Firefox gets restarted. This happens at least with debug builds. After the first start: $ ps -ef |grep firefox 501 30598 25557 0 1:26PM ttys000 0:00.60 /usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python /usr/local/bin/mozmill -t ../mozmill-tests/nightly/firefox/tests/functional/restartTests/testAddons_enableDisableExtension/ -b FirefoxNightlyDebug.app/ --debug 501 30604 30598 0 1:26PM ttys000 0:11.54 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox -jsconsole -foreground -profile /var/folders/wd/zmy4z7xn7wd7sjq90z1y52f80000gn/T/tmpAVbia4.mozrunner 501 30611 14614 0 1:26PM ttys001 0:00.00 grep firefox After the restart: $ ps -ef |grep firefox 501 30598 25557 0 1:26PM ttys000 0:00.62 /usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python /usr/local/bin/mozmill -t ../mozmill-tests/nightly/firefox/tests/functional/restartTests/testAddons_enableDisableExtension/ -b FirefoxNightlyDebug.app/ --debug 501 30604 30598 0 1:26PM ttys000 0:00.00 (firefox) 501 30615 1 0 1:26PM ttys000 0:00.02 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox As you can see the parent process with id 30598 which is mozmill, keeps a firefox process active. All this might be the reason why we are failing on bug 956315.
Assignee | ||
Comment 1•10 years ago
|
||
So i added some debug lines in the mozmill code to figure out what's going on. I was surprised to see the following: 1. With the first start of firefox we spawn the process with id 5902, whereby 5896 is the pid of mozmill: 501 5902 5896 0 11:31PM ttys000 0:00.88 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox -foreground -profile /var/folders/wd/zmy4z7xn7wd7sjq90z1y52f80000gn/T/tmp0lMGCM.mozrunner 2. When Firefox restarts itself we have the following situation for a split of a second. Means it spawns a new process as child of the first process. Which is ok. 501 5902 5896 0 11:31PM ttys000 0:00.00 (firefox) 501 5930 5902 0 11:32PM ttys000 0:00.02 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox 3. But a couple of milliseconds later the parent process changes, so that is no longer a child of the first process. But that one keeps around because we don't call process.wait() on it. 501 5902 5896 0 11:31PM ttys000 0:00.00 (firefox) 501 5930 1 0 11:32PM ttys000 0:00.30 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox Situation 3 will remain until the new process gets stopped. Then both processes are gone. As it looks like mozmill has to trigger mozrunner or even the process handler to update the pid for the new process. Joel, given that you are very familiar with mozprocess, do you have an idea?
Assignee | ||
Updated•10 years ago
|
Flags: needinfo?(jmaher)
Assignee | ||
Comment 2•10 years ago
|
||
I wrote a little script by using mozprocess directly and that seems to solve the problem perfectly: 501 10355 10349 0 12:02AM ttys000 0:06.65 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox -foreground -profile /var/folders/wd/zmy4z7xn7wd7sjq90z1y52f80000gn/T/tmp2IFn5I.mozrunner 501 10374 1 0 12:03AM ttys000 0:00.06 /Volumes/data/code/mozmill/FirefoxNightlyDebug.app/Contents/MacOS/firefox There is no zombie process hanging around. So it looks like that we indeed have to call wait() on the process everytime Firefox gets restarted.
Assignee | ||
Updated•10 years ago
|
Flags: needinfo?(jmaher)
Assignee | ||
Comment 3•10 years ago
|
||
I got it temporarily fixed in mozmill! Tomorrow I will work on a final patch which will hopefully handle all the different cases of restarts and aborts. Challenging!
Assignee | ||
Comment 4•10 years ago
|
||
This fixes the problem for me on OS X with debug builds. Andrei, can you also please have a look? Also it would be great if you could get it tested on Windows with opt and debug builds. I will test Linux.
Attachment #8360703 -
Attachment is obsolete: true
Attachment #8360924 -
Flags: feedback?(andrei.eftimie)
Comment 5•10 years ago
|
||
Comment on attachment 8360924 [details] [diff] [review] WIP v1.0 Review of attachment 8360924 [details] [diff] [review]: ----------------------------------------------------------------- I see no more zombie processes with this WIP patch on OSX while running tests via mozmill. But this fails while running tests through mozmill-automation: > ************************************************************ > * Call to xpconnect wrapped JSObject produced this error: * > [Exception... "'Error: PR_Bind() failed (-5982)' when calling method: [nsIObserver::observe]" nsresult: "0x8057001c (NS_ERROR_XPC_JS_THREW_JS_OBJECT)" location: "native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0" data: no] > ************************************************************ I'll shortly test this on Windows as well. This has feedback+ as I think the approach is good. We need to properly wait and close FF processes before continuing.
Attachment #8360924 -
Flags: feedback?(andrei.eftimie) → feedback+
Assignee | ||
Comment 6•10 years ago
|
||
(In reply to Andrei Eftimie from comment #5) > But this fails while running tests through mozmill-automation: > > ************************************************************ > > * Call to xpconnect wrapped JSObject produced this error: * > > [Exception... "'Error: PR_Bind() failed (-5982)' when calling method: [nsIObserver::observe]" nsresult: "0x8057001c (NS_ERROR_XPC_JS_THREW_JS_OBJECT)" location: "native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0" data: no] > > ************************************************************ Is is totally unrelated to this bug. And i know that this exists because bug 956315 has still not been fixed. Thanks for testing!
Comment 7•10 years ago
|
||
And some bad news.
This fails in Windows:
> Parent process 82316 exited with children alive:
> PIDS: 82248
> Attempting to kill them...
> Parent process 82316 exited with children alive:
> PIDS: 82248
> Attempting to kill them...
> Error Code 6 trying to query IO Completion Port, exiting
> Exception in thread Thread-1:
> Traceback (most recent call last):
> File "C:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\Lib\threading.py", line 551, in __bootstrap_inner
> self.run()
> File "C:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\Lib\threading.py", line 504, in run
> self.__target(*self.__args, **self.__kwargs)
> File "c:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\lib\site-packages\mozprocess\processhandler.py", line 302, in _procmgr
> self._poll_iocompletion_port()
> File "c:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\lib\site-packages\mozprocess\processhandler.py", line 352, in _poll_iocompletion_port
> raise WinError(errcode)
> WindowsError: [Error 6] The handle is invalid.
Assignee | ||
Comment 8•10 years ago
|
||
There is a lack of context again. When did that happen? Which test? Which kind of restart or shutdown? The information above and the log is not helpful. You know that I always ask for a full log.
Assignee | ||
Comment 9•10 years ago
|
||
While thinking about the failure on Windows I think the io completion port takes already care of it, so we do not have to call self.runner.wait(1.). Andrei, can you please remove that line and try again?
Comment 10•10 years ago
|
||
There's not much more here.
Its the first attempted restart.
On Windows 8 with a regular Nightly build.
(I can't run debug builds ATM duew to some missing DLL, I think I require Visual Studio installed)
> $ mozmill -m /c/work/mozmill-tests/firefox/tests/functional/restartTests/manifest.ini -b /c/Mozilla/Nightly/en-US/firefox/firefox.exe
> TEST-SKIPPED | test1.js | Bug 931704 - plainTheme is undefined.
> TEST-SKIPPED | test2.js | Bug 931704 - plainTheme is undefined.
> TEST-SKIPPED | test3.js | Bug 931704 - plainTheme is undefined.
>
> TEST-START | testAddons_enableDisableExtension\test1.js | setupModule
> TEST-START | testAddons_enableDisableExtension\test1.js | testInstallAddon
> TEST-PASS | testAddons_enableDisableExtension\test1.js | testInstallAddon
> TEST-START | testAddons_enableDisableExtension\test1.js | teardownModule
> TEST-END | testAddons_enableDisableExtension\test1.js | finished in 2241ms
> Parent process 82736 exited with children alive:
> PIDS: 83736
> Attempting to kill them...
> Parent process 82736 exited with children alive:
> PIDS: 83736
> Attempting to kill them...
> Error Code 6 trying to query IO Completion Port, exiting
> Exception in thread Thread-1:
> Traceback (most recent call last):
> File "C:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\Lib\threading.py", line 551, in __bootstrap_inner
> self.run()
> File "C:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\Lib\threading.py", line 504, in run
> self.__target(*self.__args, **self.__kwargs)
> File "c:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\lib\site-packages\mozprocess\processhandler.py", line 302, in _procmgr
> self._poll_iocompletion_port()
> File "c:\mozmill-env\2.0.3-windows_patch\mozmill-env\python\lib\site-packages\mozprocess\processhandler.py", line 352, in _poll_iocompletion_port
> raise WinError(errcode)
> WindowsError: [Error 6] The handle is invalid.
>
Comment 11•10 years ago
|
||
Yep, works fine without calling self.runner.wait(0.1) for restarts
Assignee | ||
Comment 12•10 years ago
|
||
Good to know. Thanks! I will make this call Unix dependent then. Also noticed that the same call has to be inserted into run_test_file() for the new type of restart tests.
Assignee | ||
Updated•10 years ago
|
Severity: normal → major
Assignee | ||
Comment 13•10 years ago
|
||
Updated patch which should also work with mutt tests and the new restart test type. I will do further testing in a bit.
Attachment #8360924 -
Attachment is obsolete: true
Attachment #8361073 -
Flags: feedback?(andrei.eftimie)
Comment 14•10 years ago
|
||
Still testing this. One thing I noticed is that in some circumstances we don't remain with a zombie process open, but (I assume) we fail to reconnect to the Firefox process and we never timeout (I had to use Ctrl+C to end the run). Here is the log: https://pastebin.mozilla.org/4035241 This is on OSX with a Debug build.
Assignee | ||
Comment 15•10 years ago
|
||
Andrei, we have a timeout of 60s for reconnecting the socket. As the times show you weren't waiting long enough here. And again, what you are seeing is the result of bug 956315.
Assignee | ||
Comment 16•10 years ago
|
||
I just run into the jsbridge disconnect issue because the socket is still in use. So as the output here shows we indeed have a socket remaining, even with my patch applied. So I'm not sure right now, why this is still happening exactly at this moment. But it's interesting that we are hanging in processhandler to wait for the process. Sadly we do not see the pid, so I will add more debugging output. But at least my suspicion seem to be right.
Assignee | ||
Updated•10 years ago
|
Attachment #8361751 -
Attachment mime type: text/x-log → text/plain
Assignee | ||
Comment 17•10 years ago
|
||
Actually the above log comes from a test which is doing an user restart: https://github.com/mozilla/mozmill/blob/master/mutt/mutt/tests/python/js-modules/restartTests/testCountTestPasses.js#L31 The usage of usershutdown is not allowed as of now because we know it is broken and bug 886360 should handle that. Exactly that might also be the reason why user restart tests are broken. I will fix the test for now and make a comment on bug 886360.
Assignee | ||
Comment 18•10 years ago
|
||
This should be the final patch for this bug. It works pretty fine for me across platforms with the Mutt tests and the tests from the mozmill-tests repository. Cosmin, can you please run some additional testruns for the above tests across platform? That would be pretty helpful. Thanks!
Attachment #8361073 -
Attachment is obsolete: true
Attachment #8361073 -
Flags: feedback?(andrei.eftimie)
Attachment #8363988 -
Flags: review?(dave.hunt)
Attachment #8363988 -
Flags: feedback?(cosmin.malutan)
Comment 19•10 years ago
|
||
Comment on attachment 8363988 [details] [diff] [review] patch v1 Python mutt test /mozmill/mutt/mutt/tests/python/test_restart.py fails with the patch applied on Mac 10.7.5 and on Ubuntu 12.04 with Nightly. On windows I had diffrent errors with and without the patch applied. I will ran some testruns too.
Comment 20•10 years ago
|
||
Comment on attachment 8363988 [details] [diff] [review] patch v1 It crashed with Firefox already open in testrun both on mac 10.7.5 and on windows 8.1, with Nightly. If you're sure it works for you I will check again.
Attachment #8363988 -
Flags: feedback?(cosmin.malutan) → feedback-
Assignee | ||
Comment 21•10 years ago
|
||
(In reply to Cosmin Malutan from comment #19) > Python mutt test /mozmill/mutt/mutt/tests/python/test_restart.py fails with > the patch applied on Mac 10.7.5 and on Ubuntu 12.04 with Nightly. On > windows I had diffrent errors with and without the patch applied. All this information is pretty useless without any logs or details. Not sure how often I have to request such kind of information again and again. Please provide those details by yourself!
Flags: needinfo?(cosmin.malutan)
Assignee | ||
Comment 22•10 years ago
|
||
(In reply to Cosmin Malutan from comment #20) > It crashed with Firefox already open in testrun both on mac 10.7.5 and on > windows 8.1, with Nightly. Crashed what? Again, I cannot action on it because of none information.
Comment 23•10 years ago
|
||
Comment on attachment 8363988 [details] [diff] [review] patch v1 Review of attachment 8363988 [details] [diff] [review]: ----------------------------------------------------------------- The code changes look okay to me, with a couple of comments. We will need more information regarding the test results with this patch applied though. ::: mozmill/mozmill/__init__.py @@ +482,5 @@ > + # To avoid having zombie processes we have to release the old process > + # by calling wait(). There has to be a timeout so we do not wait forever > + # for the new pid, but the value doesn't matter because we will wait > + # in create_network for a new connection within 60s. > + if not mozinfo.isWin: Could we add something to the comment to explain why this is not needed for Windows? @@ +488,5 @@ > + else: > + # It's a JS triggered shutdown of the application. We > + # have to wait until the process is gone. Otherwise we > + # try to use a profile which is still in use > + self.runner.wait() Why is the timeout not needed here?
Attachment #8363988 -
Flags: review?(dave.hunt) → review+
Comment 24•10 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #21) > All this information is pretty useless without any logs or details. Not sure > how often I have to request such kind of information again and again. Please > provide those details by yourself! This was on osx 10.7.5 with latest Nightly Here is what I did: * I download and installed mozmill * I applied the patch * I download and installed mozmill-automation * I ran 'testran_functional --report=http://mozmill-crowd.blargon7.com/db/ /Applications/FirefoxNightly.app/' Mutt tests I ran as: * mutt testall -b /Applications/FirefoxNightly.app/ * it failed on /mozmill/mutt/mutt/tests/python/test_restart.py
Flags: needinfo?(cosmin.malutan)
Comment 25•10 years ago
|
||
Here is the report from windows 8.1 with latest Nightly I did the same steps
Assignee | ||
Comment 26•10 years ago
|
||
Comment on attachment 8364261 [details] mac_failure (socket still in use) (In reply to Cosmin Malutan from comment #24) > * I download and installed mozmill > * I applied the patch > * I download and installed mozmill-automation > * I ran 'testran_functional --report=http://mozmill-crowd.blargon7.com/db/ > /Applications/FirefoxNightly.app/' This patch is not going to fix the socket issue, you are hitting here. This is for not having any zombie process. So while running the test you should do a 'ps aux | grep firefox' and check for zombie processes after Firefox gets restarted by the tests. > Mutt tests I ran as: > * mutt testall -b /Applications/FirefoxNightly.app/ > * it failed on /mozmill/mutt/mutt/tests/python/test_restart.py And? I still don't see why it failed. Still no helpful information has been specified here! But I assume it is the same issue from above. If you want to be sure all is fixed you might want to also apply my patch from bug 956315. Also if something is unclear how to execute specific tests I would love if you could ask on IRC.
Attachment #8364261 -
Attachment description: mac_failure → mac_failure (socket still in use)
Flags: needinfo?(cosmin.malutan)
Assignee | ||
Comment 27•10 years ago
|
||
Comment on attachment 8364264 [details] windows8.1_failure (In reply to Cosmin Malutan from comment #25) > Here is the report from windows 8.1 with latest Nightly > I did the same steps This is something totally different and should be tracked separately. But I assume it's bug 794020. Something in testGreenLarry.js is failing here, and it not related to restarts at all. Please read through the bug or ask on IRC if steps or expected results are unclear.
Attachment #8364264 -
Attachment is obsolete: true
Assignee | ||
Comment 28•10 years ago
|
||
(In reply to Dave Hunt (:davehunt) from comment #23) > The code changes look okay to me, with a couple of comments. We will need > more information regarding the test results with this patch applied though. I can run all of the tests again and push the results to the dashboard. > > + # To avoid having zombie processes we have to release the old process > > + # by calling wait(). There has to be a timeout so we do not wait forever > > + # for the new pid, but the value doesn't matter because we will wait > > + # in create_network for a new connection within 60s. > > + if not mozinfo.isWin: > > Could we add something to the comment to explain why this is not needed for > Windows? Sure. Will do so. > > + else: > > + # It's a JS triggered shutdown of the application. We > > + # have to wait until the process is gone. Otherwise we > > + # try to use a profile which is still in use > > + self.runner.wait() > > Why is the timeout not needed here? Hm, it's not a change made in this patch, but good question. I think we should give it a timeout similar to the normal stop. But personally I feel a bit anxious to use 10 seconds here. If Firefox takes a bit longer to shutdown, we force kill the process. What do you think about using the standard 60 seconds of timeout here too?
Comment 29•10 years ago
|
||
With both patches it passed on windows and linux, though on mac I still get a disconnect when running a Nightly build. When I ran with a Nightly-debug build it passed, also I checked the zombie processes while running the tests(ps aux | grep firefox) and it cleans them, and I saw them without the patch applied. I will test it again tomorrow probably is something with my machine.
Attachment #8364261 -
Attachment is obsolete: true
Flags: needinfo?(cosmin.malutan)
Assignee | ||
Comment 30•10 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #28) > Hm, it's not a change made in this patch, but good question. I think we > should give it a timeout similar to the normal stop. But personally I feel a > bit anxious to use 10 seconds here. If Firefox takes a bit longer to > shutdown, we force kill the process. What do you think about using the > standard 60 seconds of timeout here too? Dave, what do you think?
Flags: needinfo?(dave.hunt)
Assignee | ||
Comment 31•10 years ago
|
||
Just updated the patch. Dave, is that better now?
Attachment #8363988 -
Attachment is obsolete: true
Attachment #8364971 -
Flags: review+
Comment 32•10 years ago
|
||
Hi henrik, an update about the failure I see. I made a new virtualenv i cloned mozmill, I applied the two patches and I ran tests. It passed every single time. I I cloned mozbase too and I installed it, it still passed. I cloned mozmill-automation and I installed it and tests "FAILS" every single time. So as an input I guess is a dependency installed by mozmill-automation that makes the tests to fail after we install it.
Comment 33•10 years ago
|
||
Comment on attachment 8364971 [details] [diff] [review] Patch v2 Yeah, looks good. Thanks Henrik!
Flags: needinfo?(dave.hunt)
Assignee | ||
Comment 34•10 years ago
|
||
(In reply to Cosmin Malutan from comment #32) > Hi henrik, an update about the failure I see. > I made a new virtualenv i cloned mozmill, I applied the two patches and I > ran tests. It passed every single time. > I I cloned mozbase too and I installed it, it still passed. > I cloned mozmill-automation and I installed it and tests "FAILS" every > single time. Ok, I would suggest that you run pip freeze and tell me all the dependencies. But given that this is a separate project and we indeed could have version conflicts, we should handle it in a separate issue, if it still happens once we bumped all deps for mozbase. I will go ahead and get this patch landed. Thanks for all the testing and your feedback!
Assignee | ||
Comment 35•10 years ago
|
||
https://github.com/mozilla/mozmill/commit/429e4951d7bf12ca5b085df8aaddecc831d4d1f6 (master) The test change caused a merge conflict. I'm going to fix that and upload an updated version of the patch.
Flags: in-testsuite+
Assignee | ||
Comment 36•10 years ago
|
||
Backport patch for hotfix-2.0 branch.
Attachment #8365012 -
Flags: review?(dave.hunt)
Comment 37•10 years ago
|
||
Here are the dependencies when it works and after we install mozmill-automation. Th one that changes are: mozfile==1.1 > mozfile==1.0 mozlog==1.5 > mozlog==1.2
Attachment #8364395 -
Attachment is obsolete: true
Updated•10 years ago
|
Attachment #8365012 -
Flags: review?(dave.hunt) → review+
Assignee | ||
Comment 38•10 years ago
|
||
(In reply to Cosmin Malutan from comment #37) > mozmill-automation. > Th one that changes are: > mozfile==1.1 > mozfile==1.0 > mozlog==1.5 > mozlog==1.2 This should only cause a problem with removing the files after a shutdown of the application. But that should never cause an application disconnect. So lets keep an eye on it over the next couple of days.
Assignee | ||
Comment 39•10 years ago
|
||
https://github.com/mozilla/mozmill/commit/5c254c3c99b59441341b5b9ab56064ff2426ba8a (hotfix-2.0)
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: [mozmill-2.0.4+]
Updated•8 years ago
|
Product: Testing → Testing Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•