Closed Bug 658509 Opened 13 years ago Closed 12 years ago

Failure in spawning the process: "Resource temporarily unavailable"

Categories

(Testing :: Mozbase, defect)

Platform: All
OS: macOS
Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: cmtalbert)

References

Details

(Whiteboard: [mozmill-1.5.7+])

Attachments

(3 files, 3 obsolete files)

While running my all-locales testrun for our functional tests, at some point the following failure appears during the restart tests:

   Process spawn failed with code 35!Sorry, cannot connect to jsbridge extension, port 24242

The teardown code in our automation script runs successfully:

   *** Removing old installation at /var/folders/2M/2MxHJNLXHluoVFD5WLTavE+++TI/-Tmp-/tmpykN51B.binary/Firefox.app
   *** Removing repository '/var/folders/2M/2MxHJNLXHluoVFD5WLTavE+++TI/-Tmp-/tmpd9oQQE.mozmill-tests'


But after that, any further attempt to mount a dmg image fails:

   *** Mounting firefox-5.0b2-build1.de.mac.dmg => /var/folders/2M/2MxHJNLXHluoVFD5WLTavE+++TI/-Tmp-/tmpONMcn1
   hdiutil: attach failed - Resource temporarily unavailable
   No section: 'App'

Right now I'm not sure why that happens; it needs some more investigation. The jsbridge issue only occurs on OS X; Windows and Linux do not show this problem. Further, I'm also not sure why the mounting doesn't work anymore, and whether that has anything to do with the jsbridge issue.
Well, I'm not sure where this error message comes from, but it has to be part of Mozmill. Jeff, do you know?
Component: Mozmill Automation → Mozmill Utilities
Product: Mozilla QA → Testing
QA Contact: mozmill-automation → mozmill-utilities
It really happens in the middle of a restart test:

TEST-START | /var/folders/2M/2MxHJNLXHluoVFD5WLTavE+++TI/-Tmp-/tmpd9oQQE.mozmill-tests/tests/functional/restartTests/testRestartChangeArchitecture/test2.js | testRestartedNormally
TEST-START | /var/folders/2M/2MxHJNLXHluoVFD5WLTavE+++TI/-Tmp-/tmpd9oQQE.mozmill-tests/tests/functional/restartTests/testRestartChangeArchitecture/test2.js | teardownTest
TEST-PASS | /var/folders/2M/2MxHJNLXHluoVFD5WLTavE+++TI/-Tmp-/tmpd9oQQE.mozmill-tests/tests/functional/restartTests/testRestartChangeArchitecture/test2.js | teardownTest
TEST-PASS | /var/folders/2M/2MxHJNLXHluoVFD5WLTavE+++TI/-Tmp-/tmpd9oQQE.mozmill-tests/tests/functional/restartTests/testRestartChangeArchitecture/test2.js | testRestartedNormally
Process spawn failed with code 35!Sorry, cannot connect to jsbridge extension, port 24242


It's not always reproducible and seems to require that a couple of other locales have been run first.
This is in jsbridge:

jsbridge/jsbridge/__init__.py:

        sleep(.25)
        ttl += .25
    if ttl == timeout:
        raise Exception("Sorry, cannot connect to jsbridge extension, port %s" % port)

    back_channel, bridge = create_network(host, port)
    sleep(.5)
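
For context, these lines are the tail of a poll-and-timeout loop; the part above the excerpt plausibly reads like this (a sketch only; check_port is an assumed helper, and the actual hotfix-1.5 source may differ):

    def wait_and_create_network(host, port, timeout=60):
        ttl = 0
        # poll until something is listening on the jsbridge port, or give up
        while not check_port(host, port) and ttl < timeout:
            sleep(.25)
            ttl += .25
        # ... continues with the timeout check and the create_network()
        # call quoted in the excerpt above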

In my experience, this occurs in one of two circumstances:
1. the port is not available
2. there is a failure in the harness code

We can probably safely eliminate 2. This leaves 1. or another error I have not yet seen. 

My blind guess is that, however this might happen, Firefox isn't fully shut down before the restart is called.

The other possibility is that the timeout for creating the network is exceeded. This is 60s by default.
(In reply to comment #3)
> We can probably safely eliminate 2. This leaves 1. or another error I have
> not yet seen. 

Why can we eliminate 2? You know we are still using 1.5.x for our testing, and there could still be some erroneous code floating around, something like what Clint is working on for OS X.

> My blind guess is that, however this would happen, Firefox isn't being
> shutdown 100% before the restart is called.

What's the best way to get this debugged? Do you have some hints?

> The other possibility is that the time to create the network is exceeded.
> This is 60s by default.

I don't think that this is the case.
Whiteboard: [mozmill-1.5.4?]
Can I get you to update the bug with the checkouts and the branches you are testing against as well as the command line you're using so we have some hope of reproducing it ourselves?

This error message "Process spawn failed" is not a Mozmill message; it's not in any part of our codebase.  That, combined with the fact that you're unable to mount a dmg, makes me suspect memory/space/resource issues on the machine.  That's the only thing I can think of that would cause the issues you're seeing.  I also can't find any mention of an error code 35 for Mac: http://support.apple.com/kb/ht2088, so I'm rather stumped.
Interesting! That's XRE code which has been added for Firefox 4:
http://mxr.mozilla.org/mozilla2.0/source/toolkit/xre/MacLaunchHelper.mm

Not sure yet what the error code 35 means, but I'm quite sure that the Mozmill test Dave added in bug 631052 is causing this problem now. It changes the architecture and restarts Firefox in 64-bit vs. 32-bit mode.

Josh, can you easily say what this error code means? The documentation for Darwin isn't that helpful:

http://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/posix_spawn.2.html
Clint, also: why do we start the teardown function before the test function finishes? See the execution log in comment 2. That looks somewhat wrong.
Looks like error 35 is either EWOULDBLOCK or EAGAIN, which is the same "resource temporarily unavailable" problem that you're getting when trying to mount the dmg.
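
For reference, the mapping is easy to confirm from Python on the affected machine (on Darwin errno 35 is EAGAIN, with EWOULDBLOCK as an alias; the same number means something different on Linux):

    import errno
    import os

    print errno.errorcode[35]  # 'EAGAIN' on Darwin
    print os.strerror(35)      # 'Resource temporarily unavailable'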
(In reply to comment #4)
> (In reply to comment #3)
> > We can probably safely eliminate 2. This leaves 1. or another error I have
> > not yet seen. 
> 
> Why can we eliminate 2? You know we are still using 1.5.x for our testing.
> And there should still be some erroneous code floating around. Something
> like Clint is working on for OS X.

I am not aware of any harness failures that would affect this issue.  It is not impossible, hence the "probably", but I think it is less likely.  When I say harness failures, I ultimately mean genuine uncaught exceptions.

> > My blind guess is that, however this would happen, Firefox isn't being
> > shutdown 100% before the restart is called.
> 
> What's the best way to get this debugged? Do you have some hints?

I would put in a bunch of dump statements.  I realize that is vague and hand-wavy, but I'm not sure if there is much I can recommend methodologically.

> > The other possibility is that the time to create the network is exceeded.
> > This is 60s by default.
> 
> I don't think that this is the case.

I doubt it as well.
Simply save this file into the mozmill-automation root folder and run it with "all_locales_bft.py mac" on OS X. It should hopefully break after a dozen tests. Keep in mind that such a testrun can take a really long time.
(In reply to comment #8)
> Looks like error 35 is either EWOULDBLOCK or EAGAIN which is the same
> resource is unavailable problem that you're getting trying to mount the dmg.

Thanks Dave! Josh, do you have an idea when this could happen on OS X?
This failure always happens with the testRestartChangeArchitecture restart test. It's not bound to a specific locale but happens after a couple of those have been run for the functional tests. In the case of the link below it broke with the 'hi-IN' locale:

http://mozmill-archive.brasstacks.mozilla.com/#/functional/reports?branch=All&platform=All&from=2011-05-28&to=2011-05-29

You can see that because there are only results for the non-restart tests but not for the restart tests. Surprisingly, a follow-up 'it' locale could be mounted and run for the non-restart tests, but it too eventually failed in testRestartChangeArchitecture, right after test1.js.

I assume that there is a bad interaction between restarting Firefox with another architecture and JSBridge.
(In reply to comment #12)
> This failure always happens with the testRestartChangeArchitecture restart
> test. It's not bound to a specific locale but happens after a couple of
> those have been run for the functional tests. In the case of the link below
> it broke with the 'hi-IN' locale:
> 
> http://mozmill-archive.brasstacks.mozilla.com/#/functional/
> reports?branch=All&platform=All&from=2011-05-28&to=2011-05-29
> 
> You can see that because there are only results for the non-restart tests
> but not for the restart tests. Surprisingly a follow-up 'it' locale could be
> mounted and run for the non-restart tests. But it also failed finally in
> testRestartChangeArchitecture, right after test1.js.
> 
> I assume that there is a bad interaction between restarting Firefox with
> another architecture and JSBridge.

Awesome detective work Henrik. That's quite actionable.  I'll try running it on my mac and see if I can repro locally.
If we don't have a solution for that issue when I'm back from vacation, I will probably have to disable those architecture tests for the time being.
Clint, have you had a chance in the last two weeks to take a look at this?
To the best of my knowledge, Clint is out of town.  However, IIRC, he did put quite a bit of time/effort into reproducing this locally and was unable to.  That's about all I know.
Yeah, it came up at the last meeting. Clint said he couldn't reproduce it with the test run you gave him.

He indicated that having a very long multi-part restart test could help, and I discussed writing a quickie script that took part1 of some test and cloned it into part2-N. I dropped the ball on giving him that script, so he may have been blocked on me. :|

I'll make sure he has it by the time he's back.
I always hit this issue on qa-horus and qa-mozmill when running the update tests for all locales. As we agreed earlier, we should just use one of those boxes for the investigation, with qa-mozmill preferred.
This failure was machine related. I ran into even more trouble yesterday on qa-horus, which forced me to restart the machine. Afterward I haven't seen this issue anymore. That means it has nothing to do with Mozmill => closing as invalid.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
Whiteboard: [mozmill-1.5.4?]
Happened again today on qa-horus. I will reopen this bug for further investigation and to track progress. Therefore moving it into Mozmill Automation.
Status: RESOLVED → REOPENED
Component: Mozmill Utilities → Mozmill Automation
Product: Testing → Mozilla QA
QA Contact: mozmill-utilities → mozmill-automation
Resolution: INVALID → ---
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/15490945
Henrik Skupin changed story state to started in Pivotal Tracker
Running into the same issue on qa-masterblaster now: it happens during the all_locales_bft.py testrun, reliably after the DA locale is tested.

This needs to be bumped in priority as it blocks us from reliably running l10n tests during releases.
Attachment #534886 - Attachment mime type: text/x-python-script → text/plain
Is there anything more in the logs than what is in the above comments?  I have not been able to reproduce this error, so any hints and reproduction steps would be appreciated. I do not really have access to a Mac for testing, so this is hard to diagnose over the wire.

I would also recommend porting the tests (and whatever harness pieces) to Mozmill 2 and trying that.  It handles restarting/stopping a bit more elegantly and should have better diagnostics for failures.
@Jeff, there is nothing more in the logs that I can point to. Henrik is on PTO until next week and might be able to provide more feedback at that time.

I was able to reproduce this on qa-mozmill as well last night, but it happened much later (IT locale instead of DA locale).

I'm guessing that this is something garbage-collection related (i.e. something is not being cleaned up properly and builds up with each iteration until the system can't take any more); highly speculative, I know.
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #25)
> I was able to reproduce this on qa-mozmill as well last night but it
> happened much later (IT locale instead of DA locale).

I have never seen this on qa-mozmill before, when OS X 10.5 was installed. Seems like after the upgrade to 10.7 it has also started on this box. I get the feeling that 10.6 introduced something which breaks us here.

(In reply to Henrik Skupin (:whimboo) from comment #6)
> Interesting! That's XRE code which has been added for Firefox 4:
> http://mxr.mozilla.org/mozilla2.0/source/toolkit/xre/MacLaunchHelper.mm
> 
> Not sure yet, what the error code 35 means but I'm certainly sure that the
> Mozmill test Dave has added on bug 631052 is causing this problem now. It
> changes the architecture and restarts Firefox in 64bit vs. 32bit.
> 
> Josh, can you easily say what this error code means? The documentation for
> Darwin isn't that helpful:
> 
> http://developer.apple.com/library/mac/#documentation/Darwin/Reference/
> ManPages/man2/posix_spawn.2.html

Josh or Steven, do either of you have an idea what this error message could mean?

Here is the non-verbose output from hdiutil:

*** Mounting firefox-8.0b3-build1.zu.mac.dmg => /var/folders/yc/yc-yt4kPHweMwoUftOHQgE+++TQ/-Tmp-/tmpK9nEx4
2011-10-12 16:02:20.652 hdiutil[36754:2503] timed out waiting for helper registration
hdiutil: attach failed - Operation timed out
Assignee: nobody → hskupin
Status: REOPENED → ASSIGNED
Whiteboard: [see comment 6]
> hdiutil: attach failed - Operation timed out

We use hdiutil (on the Mac) to build *.dmg images when running "make package".

Here it seems to be failing in an attempt to mount a *.dmg image.

More than that I can't say.
Thanks Steven. Given that the failure is located in the call to hdiutil, I have now started an all-locales test-run on qa-masterblaster with hdiutil's '-debug' option set. Let's see if I can reproduce it and get more information.
This is strange. Given the console output during the test-run, I do not think this issue is related to hdiutil, but rather to how we restart Firefox during the restart tests. I already mentioned that suspicion in comment 2, but at that time we had a different error message.

With the latest test-run I executed, the failure happens for the de build between the restart tests for the default bookmarks and the master password. Please note the 'sh: fork: Resource temporarily unavailable' error right in between:

TEST-START | /var/folders/yc/yc-yt4kPHweMwoUftOHQgE+++TQ/-Tmp-/tmp6_pqWO.mozmill-tests/tests/functional/restartTests/testDefaultBookmarks/test1.js | teardownModule
TEST-PASS | /var/folders/yc/yc-yt4kPHweMwoUftOHQgE+++TQ/-Tmp-/tmp6_pqWO.mozmill-tests/tests/functional/restartTests/testDefaultBookmarks/test1.js | test1.js::teardownModule
sh: fork: Resource temporarily unavailable
TEST-START | /var/folders/yc/yc-yt4kPHweMwoUftOHQgE+++TQ/-Tmp-/tmp6_pqWO.mozmill-tests/tests/functional/restartTests/testMasterPassword/test1.js | setupModule
TEST-PASS | /var/folders/yc/yc-yt4kPHweMwoUftOHQgE+++TQ/-Tmp-/tmp6_pqWO.mozmill-tests/tests/functional/restartTests/testMasterPassword/test1.js | test1.js::setupModule

Jeff, when Firefox gets restarted (not a user restart), is mozrunner involved in any way? Or is that completely done by Firefox? I wonder where this error message comes from. Currently the output is not very helpful.
Attached file hdiutil (debug) output
Debug output of a broken mount action with hdiutil.
(In reply to Henrik Skupin (:whimboo) from comment #29)
> This is strange. Given the console output during the test-run I do not think
> that this issue is related to hdiutil, but really how we restart Firefox
> during the restart tests. I have mentioned that suspicion already earlier in
> comment 2, but that time we had a different error message.

Is there anything in particular about the way we restart Firefox that raises these suspicions, or is it a gut feeling?  I'll say up front that the MozMillRestart logic from 1.5 is questionable, though it's actually better than the parent MozMill class.  In 2.0, both of these are better.
 
> With the latest test-run I have executed, the failure happens for the de
> build between the restart tests for default bookmarks and the master
> password. Please see the 'sh: fork: Resource temporarily unavailable' error
> right in between:
> 
> TEST-START |
> /var/folders/yc/yc-yt4kPHweMwoUftOHQgE+++TQ/-Tmp-/tmp6_pqWO.mozmill-tests/
> tests/functional/restartTests/testDefaultBookmarks/test1.js | teardownModule
> TEST-PASS |
> /var/folders/yc/yc-yt4kPHweMwoUftOHQgE+++TQ/-Tmp-/tmp6_pqWO.mozmill-tests/
> tests/functional/restartTests/testDefaultBookmarks/test1.js |
> test1.js::teardownModule
> sh: fork: Resource temporarily unavailable
> TEST-START |
> /var/folders/yc/yc-yt4kPHweMwoUftOHQgE+++TQ/-Tmp-/tmp6_pqWO.mozmill-tests/
> tests/functional/restartTests/testMasterPassword/test1.js | setupModule
> TEST-PASS |
> /var/folders/yc/yc-yt4kPHweMwoUftOHQgE+++TQ/-Tmp-/tmp6_pqWO.mozmill-tests/
> tests/functional/restartTests/testMasterPassword/test1.js |
> test1.js::setupModule

This looks like a system error message.  I don't know anything about Macs, but googling reveals, amongst other things, http://blog.ghostinthemachines.com/2010/01/19/mac-os-x-fork-resource-temporarily-unavailable/

> Jeff, when Firefox gets restarted (no user restart), is mozrunner involved
> in any way? Or is that completely done by Firefox? I wonder where this error
> messages comes from. Currently the output is not very helpful.

Other than user restart, mozrunner is always involved in process starting and stopping.
Thanks Jeff for the pointer! I was able to make more progress on it. I have now added a 'ps -ef | grep firefox' call right before each test-run, which gets called from the wrapper script. As a result we get hundreds of Firefox processes listed after a couple of test-runs.

After the first run:
  503 77339 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77355 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77363 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77370 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77377 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77384 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77391 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77398 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77405 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77412 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77419 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77426 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77433 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77440 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77454 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)

After the second run:
  503 77339 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77355 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77363 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77370 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77377 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77384 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77391 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77398 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77405 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77412 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77419 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77426 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77433 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77440 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77454 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77461 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77491 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77509 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77517 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77524 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77531 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77538 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77545 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77552 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77559 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77566 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77574 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77581 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77588 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77595 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77609 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)
  503 77616 77316   0   0:00.00 ttys005    0:00.00 (firefox-bin)

That means for each build under test we almost always leave 16 additional dead Firefox processes behind (15, 32, 48, 64, 80, ...). This increment stays constant, so I will now try to nail down whether it is related to the restart tests or to our handling of the Firefox process in general.
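
To make the growth easy to spot, the debug hook can count the defunct entries directly (a hypothetical helper; the wrapper itself just shells out to 'ps -ef | grep firefox'):

    import subprocess

    def count_dead_firefox():
        # defunct processes show up as parenthesized "(firefox-bin)" in ps
        ps = subprocess.Popen(["ps", "-ef"], stdout=subprocess.PIPE)
        out, _ = ps.communicate()
        return len([line for line in out.splitlines() if "(firefox-bin)" in line])

    print count_dead_firefox()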
My assumption was correct. When I remove the restart tests from the test-run, the number of left-over Firefox processes drops. Now I will try to figure out whether that is related to our automation scripts, or whether it also happens when calling mozmill directly from the command line.
So this issue is absolutely related to our implementation of the automation scripts, and probably highly dependent on the MozmillWrapper class.

I'm in limbo now about how best to proceed. One option would be to figure out what's wrong with our implementation for Mozmill 1.5.x and fix it; the other is to directly make all the changes needed to use the Mozmill 2 API. The latter I will have to do anyway.

Option 1 seems kinda hard because, as Jeff always says, the API on the hotfix-1.5 branch is awkward, and we will probably not get any further fixes or enhancements for it. Everyone is concentrating on Mozmill 2 and Mozmill 2.1 now.

I also don't want to spend my time fixing code which we will get rid of fairly soon. Until we have made a decision, I will check the implementation of runtest.py for the Thunderbird tests. That would hopefully give us a fix in the short term.
Summary: Failure in spawning the process (code 35) results in no jsbridge connection → Failure in spawning the process: "Resource temporarily unavailable"
Whiteboard: [see comment 6]
Ok, so I have re-arranged the scripts and got it partly working, but not by using mozmill directly. One way would be to subclass CLI and RestartCLI.

> class FirefoxCLI(mozmill.CLI):
>     parser_options = copy.copy(mozmill.CLI.parser_options)
>     parser_options[("--entities",)] = dict(dest="entities")
>     parser_options.pop(("-b", "--binary"))
> 
>     def __init__(self, *args, **kwargs):
>         mozmill.CLI.__init__(self, *args, **kwargs)
> 
>         self.options.debug = True
>         self.options.binary = self.args[0]
> 
> 
> FirefoxCLI().run()
> FirefoxCLI().run()
> FirefoxCLI().run()
> 
> cmdArgs = ["ps", "-ef"]
> subprocess.call(cmdArgs)

BUT, the code snippet above produces the same results:

 501 31587   298   0  6:13PM ttys000    0:00.60 python ./_test.py /Applications/Firefox/Nightly.app/ -t ../mozmill-tests/nightly/tests/functional/
  501 31590 31587   0  6:13PM ttys000    0:00.00 (firefox-bin)
  501 31601 31587   0  6:13PM ttys000    0:00.00 (firefox-bin)
  501 31610 31587   0  6:13PM ttys000    0:00.00 (firefox-bin)
 
There are still dead Firefox instances around when we are back in the shell. So I no longer believe that the leak is in our own automation scripts; it is in Mozmill itself. It seems the application processes never really get shut down.

I will move this over to Mozbase now. Jeff or Clint, can one of you please check the process handling, at least on OS X? Something is kinda broken here and it blocks us from running tests for all locales of Firefox. Thanks.
Assignee: hskupin → nobody
Component: Mozmill Automation → Mozbase
Product: Mozilla QA → Testing
QA Contact: mozmill-automation → mozbase
The solution can be found here:
http://bytes.com/topic/python/answers/44787-defunct-processes-subprocess-popen

We are simply missing a final os.wait() call.
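
The effect is easy to demonstrate outside of Mozmill with plain subprocess (a minimal sketch, not the actual mozrunner code path):

    import os
    import signal
    import subprocess

    proc = subprocess.Popen(["sleep", "60"])
    os.kill(proc.pid, signal.SIGTERM)

    # Without the following line the killed child lingers as a defunct
    # "(sleep)" entry in ps until the parent exits; waitpid() reaps it.
    os.waitpid(proc.pid, 0)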
Assignee: nobody → hskupin
If kill() is called, either as

    os.kill(process.pid, signal.SIGTERM)

or

    process.kill()

we should call wait() afterwards. Not calling wait() can leave a number of zombie processes that the OS can be really slow to clean up.

http://code.google.com/p/selenium/source/browse/trunk/py/selenium/webdriver/firefox/firefox_binary.py?spec=svn14564&r=14564#49 is my WebDriver code for killing the browser when we don't need it anymore. This works cross platform.
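
In the same spirit, a POSIX kill-and-reap helper might look like this (a sketch of the pattern, not a copy of the linked WebDriver code; on Windows you would go through Popen.kill() instead):

    import os
    import signal
    import time

    def kill_and_reap(pid, timeout=5):
        os.kill(pid, signal.SIGTERM)
        # poll with WNOHANG so the child is reaped as soon as it exits
        for _ in xrange(timeout * 10):
            if os.waitpid(pid, os.WNOHANG) != (0, 0):
                return
            time.sleep(0.1)
        os.kill(pid, signal.SIGKILL)  # it ignored SIGTERM; force it
        os.waitpid(pid, 0)            # final reap, so no zombie remains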
Is it known which version of which piece of software is being talked about here?
(In reply to Jeff Hammel [:jhammel] from comment #38)
> Is it known which version of which piece of software is being talked about
> here?

Jeff, please read my comment 35. This is mozrunner/killableprocess land.
Unassigning myself because this looks to be a simple fix which can be done by Jeff or Clint, who are both more familiar with the mozrunner/killableprocess code.
Assignee: hskupin → nobody
Status: ASSIGNED → NEW
Clint, can you please give us an update on this bug? When will your team have time to fix this issue, as mentioned in the last Mozmill meeting? Thanks.
Ok, here is a fix for Mozmill 1.5. Calling os.wait() doesn't work on Linux no matter what I do; I keep getting a traceback with "error 10: no child process found". But since the issue only appears on Mac OS X, I'm somewhat confident in this fix, which only calls os.wait() if we are on Mac OS X.
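
In outline, the gated call looks something like this (illustrative only, not the attached patch itself):

    import os
    import mozinfo

    def reap_children():
        # os.wait() on Linux raises OSError errno 10 (ECHILD) once no
        # children are left, so only reap on Mac OS X where the zombies
        # actually pile up
        if mozinfo.isMac:
            try:
                os.wait()
            except OSError:
                pass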

Henrik, do you have any means to test this?  No one on my team has ever been able to consistently reproduce this bug, but none of us are heavy Mac OS X users, so that is likely why.
Assignee: nobody → ctalbert
Status: NEW → ASSIGNED
Attachment #575557 - Flags: review?(jhammel)
Attachment #575557 - Flags: feedback?(hskupin)
Comment on attachment 575557 [details] [diff] [review]
Mozmill 1.5 Fix for mac os x

This works for me.  I have not tested it, though.
Attachment #575557 - Flags: review?(jhammel) → review+
Attached patch Patch for hotfix-2.0 (obsolete) — Splinter Review
Jeff, I think I'm hitting an issue you've seen.  With a stock checkout of hotfix-2.0, I'm getting an exception at the end of my test:
TEST-START | mutt/mutt/tests/js/test_nothing.js | setupModule
TEST-START | mutt/mutt/tests/js/test_nothing.js | test_nothing
INFO | Step Pass: {"function": "Controller.open()"}
INFO | Step Pass: {"function": "controller.waitFor()"}
INFO | Step Pass: {"function": "controller.waitForPageLoad()"}
TEST-PASS | mutt/mutt/tests/js/test_nothing.js | test_nothing
INFO | Timeout: bridge.execFunction("85439636-123e-11e1-aed1-000c29f1d61d", bridge.registry["{b421fada-0211-4e8a-9ae9-dfdff206e79b}"]["cleanQuit"], [])
INFO | 
INFO | Passed: 1
INFO | Failed: 0
INFO | Skipped: 0

Note that timeout.  It happens whether I have this patch active or not, and whether I put os.wait() directly after our os.kill statements or not.  So, right now I'm being cautious and just making exactly the same patch for the 2.0 codebase that I made for hotfix-1.5.  It doesn't seem to change the above error in any way on Linux (which I wouldn't expect it to, being gated with the mozinfo.isMac property).  So, I guess it's safe.  Is there a bug filed for the above error?
Attachment #575589 - Flags: review?(jhammel)
I have this filed as bug https://bugzilla.mozilla.org/show_bug.cgi?id=696468, though that includes all the weirdiosity I've seen on a particular machine and may conflate issues.  FWIW, that is the message I get with every shutdown.
Comment on attachment 575589 [details] [diff] [review]
Patch for hotfix-2.0

Looks good to me.  The other issue scares the hell out of me though
Attachment #575589 - Flags: review?(jhammel) → review+
Attached file testcase
Clint, this is a testcase we could probably easily transform into a Mutt test. Simply execute it with -b and -t. Before the script exits it calls 'ps -ef' via os.system(). As you can see, with Mozmill 1.5.6 there is one process listed:

501 10642 10639   0 12:53PM ttys001    0:00.00 (firefox-bin)

When you uncomment the os.wait() call in the test, it will solve the problem. I think it's a good candidate for a test.
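
For reference, the attached testcase is roughly of this shape (a sketch based on the description above and on the Mozmill 1.5 CLI usage shown earlier in this bug; the real attachment may differ):

    import os
    import sys
    import mozmill

    # run a single test via the Mozmill 1.5 CLI (invoked with -b and -t)
    mozmill.CLI(args=sys.argv[1:]).run()

    # os.wait()  # uncommenting this reaps the defunct firefox-bin child

    os.system("ps -ef | grep firefox")  # one "(firefox-bin)" zombie remains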

I will now try your patch and check if it works.
Attachment #534886 - Attachment is obsolete: true
Comment on attachment 575557 [details] [diff] [review]
Mozmill 1.5 Fix for mac os x

Looks good and solves the problem on OS X. I have also tested on Linux, which I hadn't done before, and I'm not able to see this issue there. So it's really OS X only.

It would be great if we could immediately push a new version of mozmill/mozrunner to pypi.
Attachment #575557 - Flags: feedback?(hskupin) → feedback+
So we want mozrunner/mozmill bumped for the hotfix-1.5 branch and released to pypi?
(In reply to Jeff Hammel [:jhammel] from comment #49)
> So we want mozrunner/mozmill bumped for the hotfix-1.5 branch and released
> to pypi?

If we do, please file a bug and assign it to me.
see bug 704342 for the version bumping bug
There is a bug on test_driver.js in hotfix-2.0. I need to figure this out before landing there.
Clint, it looks like we haven't covered every possible path where defunct processes are left behind. Today I saw that when the jsbridge port is already in use, we horribly end up with a lot of such processes when trying to execute Mozmill a second time in parallel. It won't happen that often, but given that our process handling on 1.5 is broken and that we don't shut down Firefox in case of unhandled unexpected modal dialogs, we could run into these issues. So we should take care of this and test it for Mozmill 2.0.
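
A cheap guard against the port-in-use case would be to probe the jsbridge port before starting a second instance (a hypothetical check, not existing Mozmill code):

    import socket

    def jsbridge_port_free(host="127.0.0.1", port=24242):
        # if we can bind the port ourselves, no other jsbridge instance owns it
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((host, port))
            return True
        except socket.error:
            return False
        finally:
            s.close()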
Whiteboard: [mozmill-2.0?]
Whiteboard: [mozmill-2.0?] → [mozmill-2.0+]
Whiteboard: [mozmill-2.0+] → [mozmill-2.0+][mozmill-1.5.7+]
Depends on: 718496
See bug 718496 for the regression introduced by the checkin into hotfix-1.5. Clint, what is the status of this bug for Mozmill 2.0?
Assignee: ctalbert → hskupin
Attached file Forward port to Mozbase master (v1) (obsolete) —
Pointer to Github pull-request
Attachment #622960 - Attachment description: Pointer to Github pull request: https://github.com/mozilla/mozbase/pull/12 → Forward port to Mozbase master (v1)
Attachment #622960 - Flags: review?(jhammel)
Attachment #575648 - Attachment mime type: text/x-python-script → text/plain
Comment on attachment 622960 [details]
Forward port to Mozbase master (v1)

comment in pull request
Attachment #622960 - Flags: review?(jhammel) → review+
So I should have tested it with my testcase attached to the bug first. This problem doesn't occur anymore with the latest mozprocess version on master.

import mozrunner
import os
import sys

for i in range(2):
  try:
    mozrunner.CLI(args=sys.argv[1:]).run()
  except Exception, e:
    print str(e)
  finally:
    pass

os.system("ps -ef | grep firefox")


Results in:

  501 48807 48799   0  9:03AM ttys007    0:00.00 sh -c ps -ef | grep firefox
  501 48809 48807   0  9:03AM ttys007    0:00.00 grep firefox

So I don't think we have to fix anything here for mozprocess.

Jeff and Clint, if you disagree feel free to reopen the bug.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → FIXED
Whiteboard: [mozmill-2.0+][mozmill-1.5.7+] → [mozmill-1.5.7+]
Attachment #622960 - Attachment is obsolete: true
Attachment #575589 - Attachment is obsolete: true
Assignee: hskupin → ctalbert