Closed Bug 574043 Opened 14 years ago Closed 14 years ago

mozmill fails in tests on CentOS ; timeouts should be introduced to correct

Categories

(Testing Graveyard :: Mozmill, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: k0scist, Assigned: k0scist)

References

Details

Attachments

(1 file)

In trying to get mozmill to work on buildbot (bug 516984) there are several difficulties encountered.  I will recount them all here, even though several of them are unrelated, in order to isolate the work necessary to debug mozmill and firefox on the CentOS vm that I'm using.

Firstly, the Firefox window for minefield is not painted/rendered.  This does not seem to affect the tests, but it makes debugging very difficult.  I do not know if minefield works on this VM for others.  If it does, then I should have access to a setup where I can actually see firefox painting its windows.  If it does not, then this bug should be filed separately and corrected.

Initially, using the mozmill packaged tests as part of m-c (http://mxr.mozilla.org/mozilla-central/source/testing/mozmill/), I consistently encountered the issue where mozmill would hang/wait indefinitely and the invoked version of Firefox would become unresponsive.  This was observed on every run of `mozmill -b firefox/firefox -t mozmill/tests/firefox` using the buildslave files.

Changing some apparently unrelated code (see attachment) fixes the firefox becoming unresponsive problem.  As :ctalbert pointed out, this is likely a race condition.  But a race condition of what? 
TODO:  slowly back out these changes and see what actually breaks/fixes things.

After this part was fixed, mozmill still hung indefinitely.  It dies on the testDownloadStates.js test, and very specifically, on this line: http://hg.mozilla.org/qa/mozmill-tests/file/14acc3e342ba/shared-modules/testDownloadsAPI.js#l365  Mozmill waits indefinitely for the click event, which never occurs, assumedly because the element isn't correctly selected.  Once again, it would be much easier to diagnose if I could actually see what Firefox was doing as I can on my laptop.

The exact cause of why this hangs I have not yet determined.

A two prong approach should be taken to ensure mozmill does not hang indefinitely:

 - there should be a deadman timeout (say 5 min) on the python side that is reset each time the jsbridge python gets a message from the JS counterpart.  

 - there should be a per-test timeout in the mozmill javascript so that hanging tests can be bypassed.  The deadman timeout is a good safety precaution, but it will kill all subsequent tests instead of trying to progress through them, yielding widely inconsistent results if the timeout condition is seen.

The unreponsive firefox issue should be assessed and diagnosed as well.

In general, tests should never hang the harness.  We need to do our best to ensure this.

This bug is some overlap with issues in bug 571630.  It has been spun off as a separate issue in belief that the root causes may be different.  I do not know this to be true or untrue.
Assignee: nobody → jhammel
Whiteboard: [mozmill-1.4.2?]
Blocks: 516984
Drilling down even further, the failure is actually at a sleep(0) line:

http://mxr.mozilla.org/mozilla-central/source/testing/mozmill/mozmill-1.4.1/mozmill/extension/resource/modules/controller.js#273

Of course, there's something weird, too.  Putting a dump() statement in the `wait()` function ( http://mxr.mozilla.org/mozilla-central/source/testing/mozmill/mozmill-1.4.1/mozmill/extension/resource/modules/utils.js#399 ) and in the while loop ( http://mxr.mozilla.org/mozilla-central/source/testing/mozmill/mozmill-1.4.1/mozmill/extension/resource/modules/utils.js#412 ) , the dump in wait gets hit (many many times) and the dump in the while loop gets hit only once.

A e10s issue?
See Also: → 422451
Depends on: 504440
I've added a global timeout to jsbridge:

http://github.com/k0s/jsbridge/commit/39f6b9ea983cb9e469751f7ee012218f0d085ab8
http://github.com/k0s/jsbridge/commit/bd49c30218388c80c83e0ba8e46f99b7e41fdfbd
http://github.com/k0s/jsbridge/commit/0388256d3de02c5ae4ae785ae28e8e78362efeb0

This seems to work correctly;  on my ubuntu machine, the tests run through passing, as they did before.  On the CentOS VM, instead of the tests hanging, the exception is thrown and the run ends with "Application quit unexpectedly."

There are a few more things that need to be done:
 - a CLI switch in mozmill should be added that sets this on the bridge object(s)
 - there should be a per-test timeout in mozmill/extension's JS; currently, the whole run stops (as jsbridge doesn't know anything about "tests") whereas, for reproducibility, individual tests should timeout.  The default value (whatever it might be) should be overridable on a per-test basis.
 - mozmill/mozrunner should ensure that Firefox is killed when Something Bad Happens; currently, the jsbridge stuff kills the run successfully, but Firefox will remain open, which would be bad for buildbot
(In reply to comment #2)
> I've added a global timeout to jsbridge:
<snip/> 
> There are a few more things that need to be done:
>  - a CLI switch in mozmill should be added that sets this on the bridge
> object(s)

Done:

http://github.com/k0s/mozmill/commit/945cc4b8654330a9c72a28f5071eb4b9d41fb161

I've tested manually and it works.  I would doubt there's side effects.
Summary: mozmill fails in test runs on CentOS → mozmill fails in tests on CentOS ; timeouts should be introduced to correct
(In reply to comment #1)
<snip/>
> A e10s issue?

No!  Foolish me.  e10s isn't even on trunk yet.
(In reply to comment #2)
<snip/> 
> There are a few more things that need to be done:
<snip/>
>  - there should be a per-test timeout in mozmill/extension's JS; currently, the
> whole run stops (as jsbridge doesn't know anything about "tests") whereas, for
> reproducibility, individual tests should timeout.  

I have tried to solve this two ways, neither of which seems to work.

Firstly, I tried to use a hidden window and using a setTimeout on that.  

http://github.com/k0s/mozmill/commit/716e7389a7168ce35c3d02f724ebfea4bc317c56

This seems to have no effect whatsoever

Secondly, I tried to use nsITimer with a callback function.

http://github.com/k0s/mozmill/commit/6fb0ec3a55fd42e0da670f1c0466e712e339e875  

Again, this doesn't seem to work either.

Both of these rely on throw.  Are they throwing in the wrong thread?  In either case, they do not halt execution (while throwing directly in the code of course does).

I'm kinda at a loss here.  The only other thing I know to try is to introduce a worker/master thread model into mozmill on the JS side (and make the master "unsinkable").  This would require a pretty vast overhaul.

The "works today, breaks tomorrow" alternative is to just not do per-test timeouts right now.  This will lead to inconsistent results.  Still, cleanup would have to be done to better ensure browser-window closing, as this will kill buildbot.  

> The default value (whatever
> it might be) should be overridable on a per-test basis.

It was decided (via IRC and IRL) that this wasn't needed up front.
(In reply to comment #5)
<snip/>
> I'm kinda at a loss here.  The only other thing I know to try is to introduce a
> worker/master thread model into mozmill on the JS side (and make the master
> "unsinkable").  This would require a pretty vast overhaul.

Maybe not that vast.  Maybe each test could be done in a worker thread and that is enough.  I'm not sure if this will work.  I am somewhat confused by the threading model still.
Except the rendering issue, this bug has been broken into components:

https://bugzilla.mozilla.org/show_bug.cgi?id=504440
https://bugzilla.mozilla.org/show_bug.cgi?id=574871

Closing
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
No longer blocks: 516984
No longer depends on: 504440
Whiteboard: [mozmill-1.4.2?]
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: