574043 - mozmill fails in tests on CentOS ; timeouts should be introduced to correct

Assignee

Description

•

14 years ago

Attached patch weird code that fixes firefox unresponsive — Details — Splinter Review

In trying to get mozmill to work on buildbot (bug 516984) there are several difficulties encountered. I will recount them all here, even though several of them are unrelated, in order to isolate the work necessary to debug mozmill and firefox on the CentOS vm that I'm using. Firstly, the Firefox window for minefield is not painted/rendered. This does not seem to affect the tests, but it makes debugging very difficult. I do not know if minefield works on this VM for others. If it does, then I should have access to a setup where I can actually see firefox painting its windows. If it does not, then this bug should be filed separately and corrected. Initially, using the mozmill packaged tests as part of m-c (http://mxr.mozilla.org/mozilla-central/source/testing/mozmill/), I consistently encountered the issue where mozmill would hang/wait indefinitely and the invoked version of Firefox would become unresponsive. This was observed on every run of `mozmill -b firefox/firefox -t mozmill/tests/firefox` using the buildslave files. Changing some apparently unrelated code (see attachment) fixes the firefox becoming unresponsive problem. As :ctalbert pointed out, this is likely a race condition. But a race condition of what? TODO: slowly back out these changes and see what actually breaks/fixes things. After this part was fixed, mozmill still hung indefinitely. It dies on the testDownloadStates.js test, and very specifically, on this line: http://hg.mozilla.org/qa/mozmill-tests/file/14acc3e342ba/shared-modules/testDownloadsAPI.js#l365 Mozmill waits indefinitely for the click event, which never occurs, assumedly because the element isn't correctly selected. Once again, it would be much easier to diagnose if I could actually see what Firefox was doing as I can on my laptop. The exact cause of why this hangs I have not yet determined. A two prong approach should be taken to ensure mozmill does not hang indefinitely: - there should be a deadman timeout (say 5 min) on the python side that is reset each time the jsbridge python gets a message from the JS counterpart. - there should be a per-test timeout in the mozmill javascript so that hanging tests can be bypassed. The deadman timeout is a good safety precaution, but it will kill all subsequent tests instead of trying to progress through them, yielding widely inconsistent results if the timeout condition is seen. The unreponsive firefox issue should be assessed and diagnosed as well. In general, tests should never hang the harness. We need to do our best to ensure this. This bug is some overlap with issues in bug 571630. It has been spun off as a separate issue in belief that the root causes may be different. I do not know this to be true or untrue.

Jeff Hammel

Assignee

Updated

•

14 years ago

Assignee: nobody → jhammel

Jeff Hammel

Assignee

Updated

•

14 years ago

Whiteboard: [mozmill-1.4.2?]

Jeff Hammel

Assignee

Updated

•

14 years ago

Blocks: 516984

Jeff Hammel

Assignee

Comment 1

•

14 years ago

Drilling down even further, the failure is actually at a sleep(0) line: http://mxr.mozilla.org/mozilla-central/source/testing/mozmill/mozmill-1.4.1/mozmill/extension/resource/modules/controller.js#273 Of course, there's something weird, too. Putting a dump() statement in the `wait()` function ( http://mxr.mozilla.org/mozilla-central/source/testing/mozmill/mozmill-1.4.1/mozmill/extension/resource/modules/utils.js#399 ) and in the while loop ( http://mxr.mozilla.org/mozilla-central/source/testing/mozmill/mozmill-1.4.1/mozmill/extension/resource/modules/utils.js#412 ) , the dump in wait gets hit (many many times) and the dump in the while loop gets hit only once. A e10s issue?

Jeff Hammel

Assignee

Updated

•

14 years ago

Updated

•

14 years ago

Depends on: 504440

Jeff Hammel

Assignee

Comment 2

•

14 years ago

I've added a global timeout to jsbridge: http://github.com/k0s/jsbridge/commit/39f6b9ea983cb9e469751f7ee012218f0d085ab8 http://github.com/k0s/jsbridge/commit/bd49c30218388c80c83e0ba8e46f99b7e41fdfbd http://github.com/k0s/jsbridge/commit/0388256d3de02c5ae4ae785ae28e8e78362efeb0 This seems to work correctly; on my ubuntu machine, the tests run through passing, as they did before. On the CentOS VM, instead of the tests hanging, the exception is thrown and the run ends with "Application quit unexpectedly." There are a few more things that need to be done: - a CLI switch in mozmill should be added that sets this on the bridge object(s) - there should be a per-test timeout in mozmill/extension's JS; currently, the whole run stops (as jsbridge doesn't know anything about "tests") whereas, for reproducibility, individual tests should timeout. The default value (whatever it might be) should be overridable on a per-test basis. - mozmill/mozrunner should ensure that Firefox is killed when Something Bad Happens; currently, the jsbridge stuff kills the run successfully, but Firefox will remain open, which would be bad for buildbot

Jeff Hammel

Assignee

Comment 3

•

14 years ago

(In reply to comment #2) > I've added a global timeout to jsbridge: <snip/> > There are a few more things that need to be done: > - a CLI switch in mozmill should be added that sets this on the bridge > object(s) Done: http://github.com/k0s/mozmill/commit/945cc4b8654330a9c72a28f5071eb4b9d41fb161 I've tested manually and it works. I would doubt there's side effects.

Jeff Hammel

Assignee

Updated

•

14 years ago

Summary: mozmill fails in test runs on CentOS → mozmill fails in tests on CentOS ; timeouts should be introduced to correct

Jeff Hammel

Assignee

Comment 4

•

14 years ago

(In reply to comment #1) <snip/> > A e10s issue? No! Foolish me. e10s isn't even on trunk yet.

Jeff Hammel

Assignee

Comment 5

•

14 years ago

(In reply to comment #2) <snip/> > There are a few more things that need to be done: <snip/> > - there should be a per-test timeout in mozmill/extension's JS; currently, the > whole run stops (as jsbridge doesn't know anything about "tests") whereas, for > reproducibility, individual tests should timeout. I have tried to solve this two ways, neither of which seems to work. Firstly, I tried to use a hidden window and using a setTimeout on that. http://github.com/k0s/mozmill/commit/716e7389a7168ce35c3d02f724ebfea4bc317c56 This seems to have no effect whatsoever Secondly, I tried to use nsITimer with a callback function. http://github.com/k0s/mozmill/commit/6fb0ec3a55fd42e0da670f1c0466e712e339e875 Again, this doesn't seem to work either. Both of these rely on throw. Are they throwing in the wrong thread? In either case, they do not halt execution (while throwing directly in the code of course does). I'm kinda at a loss here. The only other thing I know to try is to introduce a worker/master thread model into mozmill on the JS side (and make the master "unsinkable"). This would require a pretty vast overhaul. The "works today, breaks tomorrow" alternative is to just not do per-test timeouts right now. This will lead to inconsistent results. Still, cleanup would have to be done to better ensure browser-window closing, as this will kill buildbot. > The default value (whatever > it might be) should be overridable on a per-test basis. It was decided (via IRC and IRL) that this wasn't needed up front.

Jeff Hammel

Assignee

Comment 6

•

14 years ago

(In reply to comment #5) <snip/> > I'm kinda at a loss here. The only other thing I know to try is to introduce a > worker/master thread model into mozmill on the JS side (and make the master > "unsinkable"). This would require a pretty vast overhaul. Maybe not that vast. Maybe each test could be done in a worker thread and that is enough. I'm not sure if this will work. I am somewhat confused by the threading model still.

Jeff Hammel

Assignee

Comment 7

•

14 years ago

Except the rendering issue, this bug has been broken into components: https://bugzilla.mozilla.org/show_bug.cgi?id=504440 https://bugzilla.mozilla.org/show_bug.cgi?id=574871 Closing

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → WORKSFORME

Jeff Hammel

Assignee

Updated

•

14 years ago

No longer blocks: 516984

No longer depends on: 504440

Whiteboard: [mozmill-1.4.2?]

Nobody; OK to take it and work on it

Updated

•

8 years ago

Product: Testing → Testing Graveyard

Bugzilla

mozmill fails in tests on CentOS ; timeouts should be introduced to correct

Categories

(Testing Graveyard :: Mozmill, defect)

Tracking

(Not tracked)

People

(Reporter: k0scist, Assigned: k0scist)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Updated

Updated

Updated

Comment 1

Updated

Updated

Comment 2

Comment 3

Updated

Comment 4

Comment 5

Comment 6

Comment 7

Updated

Updated

Attachment

General

Description

File Name

Content Type