Closed Bug 504440 Opened 15 years ago Closed 14 years ago

Use a global timeout for command line tests to kill the process if Firefox doesn't respond

Categories

(Testing Graveyard :: Mozmill, defect, P1)

defect

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: whimboo, Assigned: k0scist)

References

Details

(Whiteboard: [qae-p1][mozmill-1.4.2+])

Attachments

(2 files, 1 obsolete file)

Given by our yesterdays meeting we should have a frozen check for Mozmill. Under some situations it can happen that the network connection is dropped and commands cannot be transferred anymore by jsbridge. Firefox will stay open forever and you have to kill it manually. There should be a default timeout which users can modify via a command line option.

This would be necessary for the integration into the build system (see bug 457265). Adding it on the blocking list for now.
This has to be implemented on Python end and will be a global timeout when no events have occured for a specific amount of time.
Summary: Add frozen check to mozmill → Add global timeout to mozmill command line
Needs to be done ASAP for more reliable runs on buildbot.
Whiteboard: [mozmill-1.2.1]
Setting to P1.
Severity: normal → critical
Priority: -- → P1
Blocks: 504563
Blocks: 504571
Summary: Add global timeout to mozmill command line → Use a global timeout for command line tests to kill the process if Firefox doesn't respond
Whiteboard: [mozmill-1.2.1] → [mozmill-1.2.2]
Blocks: 516012
Severity: critical → blocker
Whiteboard: [mozmill-1.2.2]
As discussed with Clint on phone today we would have to raise it as a major thing to fix. While running our daily Mozmill tests on the QA lab machine now we still have failures and a couple of process stay open. Given that I have to check the box each day after the test run.

It's eventually something testdev could work on next week.
Whiteboard: [qae-p1]
Solving this in two parts:
* Part 1: between firefox and mozmill.  I've put something in place that destroys mozmill if it detects that no action is happening in the test framework - no new tests starting, no tests ending etc.

* Part 2: if a modal dialog is orphaned and thus causing tests to timeout, we should trap that and try to identify if an unknown modal dialog is hanging out.  If a dialog is hanging out, we should kill all open windows (except main window, if we can determine that) and we will then try to keep going with the tests.
Forgot to Add in the info for the git stuff.  it's on this branch: http://github.com/ctalbert/mozmill/commits/foobar2.
As talked on IRC with Mikeal, part 1 needs to handle the command line option and should not be enabled by default. Otherwise for users who run the command "mozmill -b %app%", Mozmill will be killed after 30 minutes.
Clint, when you will be able to continue with the remaining stuff?

Also something which came into my mind and would be really useful: When we kill the Firefox process, do we know which test gets run actually? If yes, we should add an error too.
Assignee: nobody → ctalbert
Status: NEW → ASSIGNED
Since a couple of days I get the following error when Firefox gets closed after a test-run. Could it be related to the already checked-in first patch?

Exception in thread Thread-6 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 460, in __bootstrap
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 625, in run
  File "/Volumes/data/build/tools/mozmill/mozmill/__init__.py", line 277, in stop
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Unhandled exception in thread started by 
Error in sys.excepthook:

Original exception was:
No longer blocks: 457265
Blocks: 516984
Whiteboard: [qae-p1] → [qae-p1][mozmill-1.4.2+]
Blocks: 574043
The global timeout is addressed in (obsolescent) bug 574043.  Repeating from 
https://bugzilla.mozilla.org/show_bug.cgi?id=574043#c2


> I've added a global timeout to jsbridge:
> 
> http://github.com/k0s/jsbridge/commit/39f6b9ea983cb9e469751f7ee012218f0d085ab8
> http://github.com/k0s/jsbridge/commit/bd49c30218388c80c83e0ba8e46f99b7e41fdfbd
> http://github.com/k0s/jsbridge/commit/0388256d3de02c5ae4ae785ae28e8e78362efeb0
> 
> This seems to work correctly;  on my ubuntu machine, the tests run through
> passing, as they did before.  On the CentOS VM, instead of the tests hanging,
> the exception is thrown and the run ends with "Application quit unexpectedly."
> 
> There are a few more things that need to be done:
>  - a CLI switch in mozmill should be added that sets this on the bridge
> object(s)
>  - there should be a per-test timeout in mozmill/extension's JS; currently, the
> whole run stops (as jsbridge doesn't know anything about "tests") whereas, for
> reproducibility, individual tests should timeout.  The default value (whatever
> it might be) should be overridable on a per-test basis.
>  - mozmill/mozrunner should ensure that Firefox is killed when Something Bad
> Happens; currently, the jsbridge stuff kills the run successfully, but Firefox
> will remain open, which would be bad for buildbot

The CLI switch is also done: http://github.com/k0s/mozmill/commit/945cc4b8654330a9c72a28f5071eb4b9d41fb161

So what remains to be done is better cleanup of Firefox upon failure.  I'm guessing this is just a missing try: except: condition.
Assignee: ctalbert → jhammel
Attachment #454555 - Flags: review?(ctalbert)
This patch does a few things:

 * front end to setting the global timeout for communication in jsbridge
 * makes both of the wait JS functions go through the same function;  seems to work, but the comment scares me
 * better cleanup of firefox upon disconnect failure;  again, seems to work, but may not be adequate for all cases [hard to test]
 * cleanup
Attachment #454574 - Flags: review?(harthur)
Comment on attachment 454574 [details] [diff] [review]
patch to mozmill to support global timeout


>+    parser_options[("--timeout",)] = dict(dest="timeout", type="float",
>+                                          default=60., 
>+                                          help="seconds before harness timeout if no communication is taking place")

why float?

>             except JSBridgeDisconnectError:
>-                print 'Application unexpectedly closed'
>+                print >> sys.stderr, 'Disconnect Error: Application unexpectedly closed'

we print all our other errors to stdout, why did you choose stderr for this one?

>+  while(gWaitOnLocked) {
>+    throw 'waitOn locked';
>+    dump("Waiting...");
>   }
> 
>+  gWaitOnLocked = true;

Did this work? I don't know about adding this if it doesn't help. Also, this is kind of for a separate bug.

>+
>   // Calls repeatedly every X milliseconds until clearInterval is called
>-  var interval = hwindow.setInterval(wait, milliseconds);
>+  var interval = hwindow.setInterval(callback, milliseconds);
> 
>   var thread = Components.classes["@mozilla.org/thread-manager;1"]
>             .getService()
>@@ -409,14 +408,30 @@ function sleep (milliseconds) {
>   // This blocks execution until our while loop condition is invalidated.  Note
>   // that you must use a simple boolean expression for the loop, a function call
>   // will not work.
>-  while(!self.timeup)
>-    thread.processNextEvent(true);
>+  while(condition(self)) { 
>+    // see nsIThread.idl
>+    // https://developer.mozilla.org/en/nsIThread
>+    var retval = thread.processNextEvent(false);
>+  }
>+

I don't see 'retval' used anywhere


Other than these questions it looks great.
Comment on attachment 454555 [details] [diff] [review]
patch to jsbridge to have a global timeout

>--- a/jsbridge/network.py
>+++ b/jsbridge/network.py
>@@ -45,15 +45,16 @@ from threading import Thread
>@@ -170,24 +167,40 @@ class Bridge(Telnet):
>     
>+
>+        # XXX we've actually already connected in Telnet
>+        self.connect((host, port))
Why the XXX comment here?  If this is a condition we should handle in the future let's either go ahead and do it now (it would seem to be easy to me to print an error if we find that we are already connected and exit) or file a follow on bug to handle it in the future.  But since all we really need here is a simple error message, we should just ensure we have that for now.
>     
>     def handle_connect(self):
>         self.register()
> 
>     def run(self, _uuid, exec_string, interval=.2, raise_exeption=True):
>+        global ctr
>+
As discussed, I think this is more readable (to me, if it's better python to do it this way, then maybe we should keep it) as a member of the base class, probably with a better variable name to indicate that it is the current "uptime" between messages from the javascript end that is running tests.

r=me with those nits.
Attachment #454555 - Flags: review?(ctalbert) → review+
(In reply to comment #13)
> (From update of attachment 454574 [details] [diff] [review])
> 
> >+    parser_options[("--timeout",)] = dict(dest="timeout", type="float",
> >+                                          default=60., 
> >+                                          help="seconds before harness timeout if no communication is taking place")
> 
> why float?

As opposed to int?  Why not?

> 
> >             except JSBridgeDisconnectError:
> >-                print 'Application unexpectedly closed'
> >+                print >> sys.stderr, 'Disconnect Error: Application unexpectedly closed'
> 
> we print all our other errors to stdout, why did you choose stderr for this
> one?

I can do stdout if desired.  I just wanted to ensure things were unbuffered.

> >+  while(gWaitOnLocked) {
> >+    throw 'waitOn locked';
> >+    dump("Waiting...");
> >   }
> > 
> >+  gWaitOnLocked = true;
> 
> Did this work? I don't know about adding this if it doesn't help. Also, this is
> kind of for a separate bug.

It is separate.  As far as if it works, it doesn't seem to break anything.  This is *not* to say that it in fact works.  I'd be happy to revert this change, if desired, as it is actually a separate issue.

> >+
> >   // Calls repeatedly every X milliseconds until clearInterval is called
> >-  var interval = hwindow.setInterval(wait, milliseconds);
> >+  var interval = hwindow.setInterval(callback, milliseconds);
> > 
> >   var thread = Components.classes["@mozilla.org/thread-manager;1"]
> >             .getService()
> >@@ -409,14 +408,30 @@ function sleep (milliseconds) {
> >   // This blocks execution until our while loop condition is invalidated.  Note
> >   // that you must use a simple boolean expression for the loop, a function call
> >   // will not work.
> >-  while(!self.timeup)
> >-    thread.processNextEvent(true);
> >+  while(condition(self)) { 
> >+    // see nsIThread.idl
> >+    // https://developer.mozilla.org/en/nsIThread
> >+    var retval = thread.processNextEvent(false);
> >+  }
> >+
> 
> I don't see 'retval' used anywhere

retval isn't used.  It originally was for debugging purposes.  Should I nix?
 
> 
> Other than these questions it looks great.
(In reply to comment #14)
> (From update of attachment 454555 [details] [diff] [review])
> >--- a/jsbridge/network.py
> >+++ b/jsbridge/network.py
> >@@ -45,15 +45,16 @@ from threading import Thread
> >@@ -170,24 +167,40 @@ class Bridge(Telnet):
> >     
> >+
> >+        # XXX we've actually already connected in Telnet
> >+        self.connect((host, port))
> Why the XXX comment here?  If this is a condition we should handle in the
> future let's either go ahead and do it now (it would seem to be easy to me to
> print an error if we find that we are already connected and exit) or file a
> follow on bug to handle it in the future.  But since all we really need here is
> a simple error message, we should just ensure we have that for now.

I put the XXX comment in as there is some duplication here:  see 

http://github.com/mozautomation/mozmill/blob/master/jsbridge/jsbridge/network.py#L59
vs. 
http://github.com/mozautomation/mozmill/blob/master/jsbridge/jsbridge/network.py#L176

I assume that this might have been done for a reason.  I have not exhaustively debugged whether the dual-implementation solved some race condition or not.  In general, if this sort of thing is intentional, the intention should be documented such that future hackers don't have to guess what it is.

I would gladly comment out either/or legacy connect, though I can't guess if this will break anything in practice.

> >     
> >     def handle_connect(self):
> >         self.register()
> > 
> >     def run(self, _uuid, exec_string, interval=.2, raise_exeption=True):
> >+        global ctr
> >+
> As discussed, I think this is more readable (to me, if it's better python to do
> it this way, then maybe we should keep it) as a member of the base class,
> probably with a better variable name to indicate that it is the current
> "uptime" between messages from the javascript end that is running tests.
> 
> r=me with those nits.

K;  I'll do so.  As mentioned, the only reason I kept it global was to specifically note that it is global.  Putting it as an attribute on jsbridge will not change its scope, I just wanted to be explicit about it.
(In reply to comment #15)
> (In reply to comment #13)
<snip/>
> > >+  while(gWaitOnLocked) {
> > >+    throw 'waitOn locked';
> > >+    dump("Waiting...");
> > >   }
> > > 
> > >+  gWaitOnLocked = true;
> > 
> > Did this work? I don't know about adding this if it doesn't help. Also, this is
> > kind of for a separate bug.
> 
> It is separate.  As far as if it works, it doesn't seem to break anything. 
> This is *not* to say that it in fact works.  I'd be happy to revert this
> change, if desired, as it is actually a separate issue.

Moved to bug 575405
a new patch with the refactoring of waitForEval + sleep moved to bug 575405

anything else I need to do to get an r+?
Attachment #454574 - Attachment is obsolete: true
Attachment #454653 - Flags: review?(harthur)
Attachment #454574 - Flags: review?(harthur)
Attachment #454653 - Flags: review?(harthur) → review+
looks good, push it
Why do we still use a float as timeout value?
(In reply to comment #20)
> Why do we still use a float as timeout value?

Why shouldn't we?
Solved the question on IRC. Python is using seconds instead of milliseconds per default. So float is the correct way to go. Sorry.
pushed to http://github.com/mozautomation/mozmill/tree/1.4.2 ; should i close or wait until this is pushed to master and/or until 1.4.2 is put in m-c?
What blocks us from pushing it to the master? Once this is done we will mark it as fixed. m-c doesn't play into account here.
(In reply to comment #24)
> What blocks us from pushing it to the master? Once this is done we will mark it
> as fixed. m-c doesn't play into account here.

I marked as fixed anyway.  I'm not sure what the workflow is for pushing mozauto 1.4.2 -> mozauto master, but I'm guessing it should only be done when we're ready to release (does that sound right?)
I don't think so. Once a feature has been reviewed and has been finalized by the developer it should be also pushed to the master immediately. That way we can get wider exposure.
Forgot to mark this as fixed for 1.4.2
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #26)
> I don't think so. Once a feature has been reviewed and has been finalized by
> the developer it should be also pushed to the master immediately. That way we
> can get wider exposure.
What Henrik said.  We should land on both branches simultaneously when fixing issues on 1.4.2.  When doing 2.0 work on Master, that will not be landed on 1.4.2 (perhaps obviously), but everything that gets changed on 1.4.2 should also be changed on master.
Hasn't been landed on the master yet -> reopen.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #29)
> Hasn't been landed on the master yet -> reopen.

:ctalbert, :henrik:  well, this really depresses me.  I've tried really hard to follow what development guidelines we have for mozmill and 1.4.2 and future.  However this isn't documented:

https://wiki.mozilla.org/Auto-tools/Projects/Mozmill/RepoSetup#Setting_up_for_Development_.28for_1.4.2.29

We should document our workflow so that dummies like me have some set of instructions in order to follow.

Also, I wasn't going to bring it up, but what is the purpose of pushing to two branches?  In most development methodologies I am familiar with (both single-trunk VCS and mercuria/bzr DVCS) there is a trunk/master which is either stable or unstable.  If the trunk is stable, the development is done on whatever active branch until it is collapsed to another branch or to trunk, and then the development branch is closed except for backports. 

In the other model, the trunk is unstable and development is done on the trunk.  The changes are, sometimes selectively, moved over to a release branch at a time when a release is near.  

Are we following either of these models?  What is the point of pushing to both branches instead of doing so?  Why isn't this documented?

I'm sorry to be an ass, but quite frrankly I try really hard to learn and use a version control system well that I don't like very much.  When on the day before a summit I find that my bugs are reopened, I wonder if I am really as much of a horrible programmer and possibly person as I feel like right now.
(In reply to comment #30)
> https://wiki.mozilla.org/Auto-tools/Projects/Mozmill/RepoSetup#Setting_up_for_Development_.28for_1.4.2.29
> 
> We should document our workflow so that dummies like me have some set of
> instructions in order to follow.

Well, we miss the part for the master branch. I will leave the documentation up to Clint.

> Also, I wasn't going to bring it up, but what is the purpose of pushing to two
> branches?  In most development methodologies I am familiar with (both
> single-trunk VCS and mercuria/bzr DVCS) there is a trunk/master which is either
> stable or unstable.  If the trunk is stable, the development is done on
> whatever active branch until it is collapsed to another branch or to trunk, and
> then the development branch is closed except for backports. 

We should discuss that offline and not on this bug. 

> Are we following either of these models?  What is the point of pushing to both
> branches instead of doing so?  Why isn't this documented?

Master will get refactoring patches soon which will not land on the 1.4.2 branch. We can't use 1.4.2 for mozmill-2.0 development. Because of that you will have to push to both master and 1.4.2.

> version control system well that I don't like very much.  When on the day
> before a summit I find that my bugs are reopened, I wonder if I am really as
> much of a horrible programmer and possibly person as I feel like right now.

I don't see a reason why it has something todo with the summit here. You have only fixed the problem on 1.4.2 and not master. I have already stated that on comment 26. So once you have pushed the fix to the master you can mark this bug as closed.
Status: REOPENED → ASSIGNED
Jeff, when will you push the patch to the master? Even there is no documentation it will work the same way as for 1.4.2. Thanks.
No longer blocks: 574043
(In reply to comment #32)
> Jeff, when will you push the patch to the master? Even there is no
> documentation it will work the same way as for 1.4.2. Thanks.

done, http://github.com/mozautomation/mozmill/commit/e99a3d115217709e0114578bd9fe744da69c9f0f
Status: ASSIGNED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
Blocks: 579791
Blocks: 579792
Tested locally and on our qa-horus box. There is no hanging test-run anymore. I'm impressed. Even with that verification I have filed other bugs I have discovered as follow-up bugs. See the dependency list.

Marking this one as verified fixed.
Status: RESOLVED → VERIFIED
Comment on attachment 454653 [details] [diff] [review]
patch to mozmill to support global timeout

>@@ -502,16 +518,20 @@ class CLI(jsbridge.CLI):
>                                          help="Report the results. Requires url to results server.")
>     parser_options[("--showall",)] = dict(dest="showall", default=False, action="store_true",
>                                          help="Show all test output.")
>+    parser_options[("--timeout",)] = dict(dest="timeout", type="float",
>+                                          default=60., 
>+                                          help="seconds before harness timeout if no communication is taking place")
> 
>     def __init__(self, *args, **kwargs):
>         super(CLI, self).__init__(*args, **kwargs)
>         self.mozmill = self.mozmill_class(runner_class=mozrunner.FirefoxRunner,
>                                           profile_class=mozrunner.FirefoxProfile,
>-                                          jsbridge_port=int(self.options.port))
>+                                          jsbridge_port=int(self.options.port),
>+                                          jsbridge_timeout=self.options.timeout
>+                                          )

Not a very friendly way to set the options. Derived classes do not have a chance to modify those values. I will try how to workaround that issue tomorrow. If needed I will file a follow-up bug.
(In reply to comment #35)
> Not a very friendly way to set the options. Derived classes do not have a
> chance to modify those values. I will try how to workaround that issue
> tomorrow. If needed I will file a follow-up bug.

'self.mozmill.jsbridge_timeout = value' did the trick for now. I hope that it will become more friendly with the reorganization of the classes for Mozmill 2.0.

For the software update tests I'm able now to set a longer timeout value. So we are good here.
(In reply to comment #35)
> Comment on attachment 454653 [details] [diff] [review]
> patch to mozmill to support global timeout
> 
> >@@ -502,16 +518,20 @@ class CLI(jsbridge.CLI):
> >                                          help="Report the results. Requires url to results server.")
> >     parser_options[("--showall",)] = dict(dest="showall", default=False, action="store_true",
> >                                          help="Show all test output.")
> >+    parser_options[("--timeout",)] = dict(dest="timeout", type="float",
> >+                                          default=60., 
> >+                                          help="seconds before harness timeout if no communication is taking place")
> > 
> >     def __init__(self, *args, **kwargs):
> >         super(CLI, self).__init__(*args, **kwargs)
> >         self.mozmill = self.mozmill_class(runner_class=mozrunner.FirefoxRunner,
> >                                           profile_class=mozrunner.FirefoxProfile,
> >-                                          jsbridge_port=int(self.options.port))
> >+                                          jsbridge_port=int(self.options.port),
> >+                                          jsbridge_timeout=self.options.timeout
> >+                                          )
> 
> Not a very friendly way to set the options. Derived classes do not have a
> chance to modify those values. I will try how to workaround that issue
> tomorrow. If needed I will file a follow-up bug.

I'm not sure what you mean by this.  Are you talking about derived classes to the CLI interface?  Ultimately, CLI shouldn't be a controller, which it currently is, if that's what you're referring to.  Why not instantiate the MozMill object in the derived class instead of going through CLI?
(In reply to comment #36)
> (In reply to comment #35)
> > Not a very friendly way to set the options. Derived classes do not have a
> > chance to modify those values. I will try how to workaround that issue
> > tomorrow. If needed I will file a follow-up bug.
> 
> 'self.mozmill.jsbridge_timeout = value' did the trick for now. I hope that it
> will become more friendly with the reorganization of the classes for Mozmill
> 2.0.

"Touching" values with more than one dot is ultimately subject to implementation changes.

> For the software update tests I'm able now to set a longer timeout value. So we
> are good here.
(In reply to comment #37)
> I'm not sure what you mean by this.  Are you talking about derived classes to
> the CLI interface?  Ultimately, CLI shouldn't be a controller, which it
> currently is, if that's what you're referring to.  Why not instantiate the
> MozMill object in the derived class instead of going through CLI?

Well, with your refactoring of the CLI stuff into the Mozmill classes it should be possible now. There wasn't a way for me with Mozmill 1.4.1. I will have to check that once Mozmill 1.4.2 has been released and I could update our automation scripts.

(In reply to comment #38)
> "Touching" values with more than one dot is ultimately subject to
> implementation changes.

That's my concern too. But to stay compatible with Mozmill 1.4.1 for now, I do not have another option. I have to subclass the CLI interface. Once the above step has been done, this dependency would have been killed.
Depends on: 586652
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: