Closed Bug 615662 Opened 14 years ago Closed 13 years ago

In case of disconnect errors Firefox does not always get killed

Categories

(Testing Graveyard :: Mozmill, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: k0scist)

References

()

Details

(Whiteboard: [mozmill-1.5.2-][mozmill-2.0+])

Attachments

(4 files, 5 obsolete files)

As seen on different platforms and for builds on different branches, we sometimes fail to correctly kill the browser. The application stays open, while the Python side still thinks the browser has been killed and exits normally. The results are also sent to the report server (see the URL field).
Attached image screenshot
The open instance I have seen today stopped when downloading a theme. In that case we probably tried to close the add-ons manager, which caused another modal dialog. That one doesn't get handled by our tests. The global timeout didn't kill the browser.
Depends on: 616195
Henrik, what is the command line used to launch that test? Does it happen if you run only that test, or if you run all the tests? Every time? What are the steps and the command line to repro this? What OSes and platform versions does the hang occur most frequently on? We need much better data in order to track this down and fix it.

Does it occur once a day, several times a day, once a week?  etc.
Attached patch patch for demonstration (obsolete) — Splinter Review
This test shows that we do not kill the browser, but the mozmill script exits normally with an error.

1. Apply the patch and run the testThemeInstallUninstall test
2. Don't manually close the modal dialog which pops up when the browser is closed.
Attached patch patch for real (obsolete) — Splinter Review
Now, the correct patch.
Attachment #495674 - Attachment is obsolete: true
Something I also forgot to note: this patch is for the mozilla1.9.2 branch, which means it has to be run with a Namoroka or any Firefox 3.6.x release build.
I can't make this work.  Steps I followed:
1. Apply patch to current mozmill-tests repo
2. Comment out the __force-skip__ lines at the bottom of test1, test2, and test3.js in restartTests/testThemeInstallUninstall
3. Ran it with Firefox 3.6: mozmill-restart -b "C:\Program Files (x86)\Firefox3.6\firefox.exe" -t ../../mozmill-tests/firefox/restartTests/testThemeInstallUninstall

The tests are still being skipped, and therefore the problem is not reproducing itself.  Henrik, what is incorrect in my steps?
As I said in the last comment, this patch is for the 1.9.2 branch of our repository. You should switch branches before applying any patch to the tree. The addons.js module is completely different on the default branch, so the approach you used will not work.

https://developer.mozilla.org/en/Mozmill_Tests#Handling_branches
Whiteboard: [mozmill-1.5.2?] → [mozmill-1.5.2+]
Assignee: nobody → jhammel
Another way of reproducing: launch `mozmill` with no command line arguments and hit ctrl+c. Firefox will not get shut down. (Feature? Or bug?)
This seems related to, if not the same issue as, bug 619110. Does that seem about right?

In both cases, I cannot reproduce the issue on Ubuntu Linux. I'm not sure what the root cause is. That would be good to know, as it's difficult to verify that we have fixed the root cause unless we know what it is.

Is this the same symptom as sending SIGINT (ctrl+c) to the mozmill process before the test loop is entered (but after mozrunner starts)? This should mostly be corrected for on master ( https://github.com/mozautomation/mozmill/commit/5a58949f9628d7ebbcbe64ede2674c2100a8ea0a ). I can backport these changes, but it would be nice to know what was actually happening here.
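For context, here is a minimal self-contained sketch of that window (names and structure invented for illustration; this is not the backported change itself): if Ctrl+C arrives after the child process exists but before the test loop's own teardown is in place, something still has to kill the child.

import subprocess
import sys

# Hypothetical sketch: keep a Ctrl+C that arrives between "browser started"
# and "test loop entered" from orphaning the child process.
# (Press Ctrl+C within 30 seconds to exercise the cleanup path.)
process = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(30)"])
try:
    # Stand-in for connecting jsbridge and entering the test loop.
    process.wait()
except KeyboardInterrupt:
    process.kill()   # without this, the "browser" would outlive mozmill
    process.wait()
    sys.exit(1)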
I've opened bug 619410 about backporting the above changes to 1.5.2.  I opened this as a separate issue as I'm not sure if it will help this bug or not.  Probably not.  Still can't repro though.
Let's add the dependency. I can figure this out once a patch is ready.
Depends on: 619410
No longer depends on: 619410
Still reproducible on OS X with a Namoroka build and latest revision of Mozmill on hotfix-1.5.2.
Tried this on hotfix-1.5.2 today with fresh mozmill-tests on the mozilla1.9.2 branch and a fresh Firefox 3.6. Could not reproduce.
I'd recommend providing a verbose install script if this continues to be an issue (the more verbose, the better). Trying to reproduce these things is time-consuming and I've had a pretty low hit/miss ratio. For example:

virtualenv mozmill
cd mozmill
. bin/activate
mkdir src
git clone http://github.com/mozautomation/mozmill.git
cd mozmill
git remote add mozauto git://github.com/mozautomation/mozmill.git
git fetch mozauto 
git branch hotfix-1.5.2 mozauto/hotfix-1.5.2
git checkout hotfix-1.5.2
for i in mozrunner jsbridge mozmill
do
  cd $i
  python setup.py develop
  cd ..
done
# fetch firefox 3.6 [TODO]
hg clone http://hg.mozilla.org/qa/mozmill-tests
cd mozmill-tests
hg checkout -C mozilla1.9.2
curl https://bugzilla.mozilla.org/attachment.cgi?id=495675 | patch -p1
mozmill -t firefox/restartTests/testThemeInstallUninstall/ --show-all -b /Applications/Firefox.app

If this is too much to type we can make a template. However, without verbose reproduction instructions I can only wonder... did I do something wrong? Or is it environment specific? It also helps to narrow down culprits if you include verbose environment information.
Jeff, as we discussed during the all-hands, this issue is probably related to slower connections. Please try to reproduce it outside of the office, on a stressed WLAN connection, or even via your smartphone. Let me check if I have a better testcase for you.
No other/easier testcase possible. You should really run this test with a slow network connection. The interesting part for this bug is probably the second output here:

Timeout: bridge.execFunction("054dc282-1d91-11e0-9de9-002332b130e8", bridge.registry["{0d8b7cb8-2906-ed4f-a6cd-94f7589492c1}"]["cleanQuit"], [])

Timeout: bridge.set("341a94fa-1d91-11e0-9de9-002332b130e8", Components.utils.import('resource://mozmill/modules/mozmill.js'))

This is not visible in all other cases. So something gets transferred and probably stops us from cleanly shutting down the browser.
I was unable to reproduce this yesterday on Snow Leopard/hotfix-1.5.2/mozilla 1.9.2
Andrew, have you run this outside of the office?
No I haven't. It could still be related to slower networks. I just realized that I had forgotten to comment in the bug that I couldn't reproduce.
Andrew and Jeff, when you watch the test is the download of the theme completed?
It goes by too fast to tell
Then choose a network with a slower connection. That is most likely the reason. :)
see also bug 626777
The Cooliris add-on is 4+ MB in size, so maybe it will suffice even on a fast network.
Attached patch testcase — Splinter Review
Ok, this testcase now installs the Cooliris extension, which should reliably trigger this failure. Once the download has started, the test finishes and Firefox should be closed. In this state we cancel the download and you will see the modal dialog with the warning. Don't click it away; wait the 60s for the disconnect. You will notice that the application is still present in the dock.
Attachment #495675 - Attachment is obsolete: true
I still can't reproduce this.  I may not understand reproduction steps.  I have tried both running the test through as well as issuing <command>+q while CoolIris is downloading.  In both cases, the browser gets killed correctly on mac 10.5.8.  For the latter case, when I <command>+Q during the download, the "Cancel All Downloads" dialogue gets opened, but mozmill eventually quits with a jsbridge timeout and Fx 3.6 gets closed.

Maybe the thing to do is to endeavor to fix bug 626777 and hope that fixes the issue?
I can clearly reproduce this behavior on qa-horus which runs 10.6. Not sure if it's because of Snow Leopard. But you can get access to that machine to dig into the problem.
bug 626777 turns out to be invalid. We *are* in fact sending SIGKILL. I'm not sure what stronger action we can take. Maybe we're missing a PID, or maybe we're hitting the except clause somehow.
So, finally getting somewhere, this is related to our horribly complicated stopping logic:

https://github.com/mozautomation/mozmill/blob/hotfix-1.5.2/mozmill/mozmill/__init__.py#L475

A few things to think about:
 - why do we pass on JSBridgeDisconnect errors but kill the process on other errors?
 - what's with all of the other special casing?
I think that "pass" on a JSBridgeDisconnect error is, quite simply, wrong.

The if not close_bridge: segment doesn't make sense to me. This code is never called with close_bridge==False, so that doesn't really seem to be necessary either. For some quick testing, how about the following changes (sketched below; then make it pretty afterward if the tests work):
* comment out the except(JSBridgeDisconnect, blah): pass business, let JSBridgeDisconnect errors fail like everything else
* change if not close_bridge to if False just to be absolutely certain that code is never executed and see what happens (that code inside the if block doesn't appear to ever be executed as I'm grepping through the code).
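For illustration, here is a self-contained toy (all names are invented; this is not the actual mozmill code) showing why swallowing the disconnect leaves the browser alive while Python exits normally, and what the first experiment changes:

import subprocess
import sys

class FakeBridgeDisconnect(Exception):
    """Stands in for the JSBridge disconnect error."""

class FakeRunner:
    """Stands in for mozrunner: owns a long-running child process."""
    def __init__(self):
        self.process = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(3600)"])
    def kill(self):
        self.process.kill()
        self.process.wait()

def stop_runner(runner, swallow_disconnect=True):
    # Mimics the criticized control flow: ask the app to quit, then either
    # swallow a disconnect (current behaviour) or treat it as fatal.
    try:
        raise FakeBridgeDisconnect("cleanQuit never answered")
    except FakeBridgeDisconnect:
        if swallow_disconnect:
            pass            # current 1.5.2 behaviour: the child keeps running
        else:
            runner.kill()   # experiment 1: a disconnect also kills the child
            raise

runner = FakeRunner()
try:
    stop_runner(runner, swallow_disconnect=True)
finally:
    # With swallow_disconnect=True the child is still alive here, which is
    # exactly the "browser stays open while Python exits" symptom.
    print("child still alive?", runner.process.poll() is None)
    runner.kill()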
The code is called with close_bridge=False (implicitly, via default arguments) from the MozMillRestart.run_dir method:

https://github.com/mozautomation/mozmill/blob/hotfix-1.5.2/mozmill/mozmill/__init__.py#L618
Can't reproduce at the moment... the URL referenced in the test, https://addons.allizom.org/en-US/firefox/addon/cooliris/ , gives a 500 right now.

It might be possible to add a self.runner.cleanup() in the passing except clause in https://github.com/mozautomation/mozmill/blob/hotfix-1.5.2/mozmill/mozmill/__init__.py#L475 . Again, I have no idea why the code is structured this way, nor can I guess whether this would cause any other issues.

See https://bugzilla.mozilla.org/show_bug.cgi?id=627160 for a more 2.0 solution. Regardless of whether or not we can tackle this bug for 1.5.x, I would highly advise the following course of action:

- code freeze on stop_runner and related code for 1.x. This means nothing gets changed here unless it's absolutely necessary. It is impossible to fix this code without a prodigious amount of refactoring.
- complete overhaul of Mozmill stopping conditions for 2.0 as suggested by bug 627160 
- tests that try to break things. One of the reasons that it is hard to fix the stopping bugs is that they only come up when something bad happens. We need automated tests that make sure we don't regress (or an army of manual testers).
- a CI system to run said tests
This doesn't fix the problem.
allizom.org is currently having problems: https://addons.allizom.org/en-US/firefox/addon/cooliris/ (I get a 500).  I have been unable to reproduce since.

Do we know the conditions of the hang? If there is an appropriate HTTP response to be issued, it should be simple to make a dummy server so that we're not dependent on allizom.org. If anyone knows what causes the failure, it should be doable to get an easier test case.

If the above patch doesn't work, I don't really have a best guess on where the code is failing. I was able to determine that runner.stop() was not being called in the failing case. This means that our kills are fine, but we're just not hitting them due to one of the meandering paths through MozMillRestart.run_dir or MozMill.stop_runner. We'll need to figure out which path is actually affected to determine what the fix needs to be. Without knowing that, and without knowing why the various paths in the stopper code exist, I can't guess whether an easy fix is possible or whether it requires a massive refactor of the stopping code.
So, more mysteries.

allizom is working again, so I was sorta able to reproduce on qa-horus.  This morning, I was also (intermittently) able to reproduce on timesink, but I have not been able to do so since allizom came back (not sure why).

on qa-horus, using the system mozmill, I have been able to reproduce again.  So I was going to start putting in print and pdb statements to trace when exactly Something Bad Happens.  I made a new virtualenv and grabbed a tarball from github of the 1.5.2 branch (the system doesn't have git on the path, it seems).  I `python setup.py develop`ed mozrunner, jsbridge, and mozmill in that order in the virtualenv.  And just for the sake of sanity, I tried to reproduce the bug again.  I couldn't.  With the new setup, the browser does close.

This is really strange.

I'd be likely to chalk it up to some systems issue, except that in the case where I could (albeit briefly, though multiple times) reproduce on timesink, it *was* running the latest 1.5.2 code, identically installed(!)

This really leaves me with more questions than answers, sadly.

I do think it would be a good idea to make a more focused test case. I'm not sure what exactly causes the case to fail, but it should be isolated into a small, self-contained test case that is not so dependent on a particular web service.
I have tried to reproduce this on timesink today.  The certificates issue no longer happens (looks like they were doing something with their server which is now clear) and it appears that we encounter the same timeout and hang that we have encountered before. 

However, the browser does get closed when MozMill quits.  So again, I'm a bit clueless here.

I'd be curious whether anyone else can reproduce, and in what circumstances. To summarize, I have been able to reproduce:

- yesterday morning on timesink
- yesterday morning on qa-horus using the system mozmill

I have not been able to reproduce:

- this morning on timesink (using identical code!)
- yesterday on qa-horus using a `python setup.py develop`ed version of the mozmill tarball in a fresh virtualenv (which *should* be identical to the system mozmill)
I think I'm going to punt on this for now. I really don't know the sensible next steps here. I can't reproduce in any way that allows me to edit and inspect the code path. I don't really know what is wrong -- I do know that our stopping path is awful and should be improved, and that alone should be reason enough to switch towards 2.0 development.

We need a better test case. I started working on http://k0s.org/mozilla/hg/DownloadInfinity . However, killing FF during the resulting download does not prompt with "Are you sure?". Not sure if I'm missing some headers or what not. I haven't looked in the Fx code to see what the conditions for this are. Maybe worth doing if we want a solid test here, but, as I said, I haven't been able to reproduce even when I get this dialog (the bridge times out and the browser closes), so maybe not.
Have you tried slowing down the connection the way I sent you yesterday? As you know, with that setting we were always able to reproduce on qa-horus.
On timesink, it didn't seem to matter whether I throttled the bandwidth or not. I *do* get the dialog box when the test tries to quit before the download is done, and I get a disconnect error, but, again, the browser does get correctly killed.

I am unaware of the virtualenv sandbox code on qa-horus being tested with the bandwidth throttler. In the several runs that I did, in all cases the symptoms were identical to what occurs on timesink: the dialog pops up, the bridge times out, Fx gets killed as mozmill exits. Since the events are the same and the dialog pops up, I don't know what difference throttling the bandwidth would make.
Attached patch WIP (obsolete) — Splinter Review
So a little bit more...executive summary="still no fix in sight" :(

I put up a WIP patch (attachment 505602 [details] [diff] [review]), but it is not viable. It *will* ensure the browser is closed and it *will* work correctly for not-broken mozmill tests, but it *will not* work with not-broken mozmill-restart tests.

As best I can tell, we *always* hit the JSBridgeDisconnectError exception at https://github.com/mozautomation/mozmill/blob/hotfix-1.5.2/mozmill/mozmill/__init__.py#L480 . I don't know why. I haven't introspected the code (next on my tasklist), but it would be nice if it said "# cleanQuit will always raise a disconnect error" or "# cleanQuit will raise a disconnect error on such and such condition". But it doesn't. I have never hit the socket.io error there to date, so I'm not sure what that is supposed to protect against. Without knowing the answers to these questions, it is hard to know what edge cases I'm dealing with. Suffice it to say, the WIP is not a viable solution. The one-line patch is probably also not a viable solution. Again, I believe it will break restart tests.

I'll look into this a little more, but it's looking more and more like a complete overhaul. No one understands the intention of the stopping code. In fact, there probably isn't much intention except for adding more and more try: except: clauses to deal with each failure. The WIP is an example of these. I don't really know what to do except a real overhaul at this point.
(In reply to comment #43)
> complete overhaul.  No one understands the intention of the stopping code.  In
> fact, there probably isn't much intention except for adding more and more try:
> except: clauses to deal with each failure.  The WIP is an example of these.  I
> don't really know what to do except a real overhaul at this point.

Mikeal, can you remember who has originally written the stopping code in Mozmill? Was it you or Adam? Would be great if we could get some insight from you guys. Thanks.
Status: NEW → ASSIGNED
Attached patch patch to mozrunner (obsolete) — Splinter Review
This adds a stop call to the runner whenever start is called. It's not really an ideal solution and may screw up tests, as (ABICT) there is no coherent way of telling whether a restart test actually failed with the current control flow. However, fixing that is probably out of scope for 1.5.2.

I've tested this code as best I could. However, since this, like all the stopping code, handles an edge case, it is possible that A) it may misbehave and B) it may fail under certain conditions.
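For readers following along, a minimal sketch of the idea (simplified, assumed names; not the actual mozrunner diff): make start() kill any leftover process before launching a new one.

import subprocess
import sys

# Simplified sketch of the "stop before start" idea -- not the real
# mozrunner.Runner, which carries more state than shown here.
class Runner:
    def __init__(self, command):
        self.command = command
        self.process = None

    def stop(self):
        # Kill whatever is left over from a previous start(), if anything.
        if self.process is not None and self.process.poll() is None:
            self.process.kill()
            self.process.wait()
        self.process = None

    def start(self):
        # Ensure a stale browser never survives a restart or a failed shutdown.
        self.stop()
        self.process = subprocess.Popen(self.command)

runner = Runner([sys.executable, "-c", "import time; time.sleep(3600)"])
runner.start()
runner.start()   # the second start kills the first child instead of leaking it
runner.stop()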
That code is part of killableprocess, which started at, like, Netscape, or maybe Mozilla right after Netscape, was then forked by people at OSAF and improved, then forked by me with additions and better wait code.

It's really old and kinda crazy but it basically works. If you plan on re-writing it, be aware that there are dozens of edge cases baked into that code that you'll have to deal with.
I wish the edge cases had been documented, as now I am just left guessing at the intention here. In any case, I can't think of a case where you wouldn't want to kill the existing process (if any) before starting a new one. As said in this and other bugs (e.g. bug 627794), the stopping code is all kinds of messed up, up the chain from killableprocess to mozrunner to mozmill. It needs to be rewritten with a clear intent, documented, and unified.

Do you have any recommendations on how this could be done better? Or are there any edge cases you know about that this patch might negatively expose?
Attachment #505889 - Flags: review?(ahalberstadt)
Attachment #505889 - Flags: feedback?(ctalbert)
Attached patch unbitrot patch — Splinter Review
Attachment #505889 - Attachment is obsolete: true
Attachment #505984 - Flags: review?(ahalberstadt)
Attachment #505984 - Flags: feedback?(ctalbert)
Attachment #505889 - Flags: review?(ahalberstadt)
Attachment #505889 - Flags: feedback?(ctalbert)
Comment on attachment 505984 [details] [diff] [review]
unbitrot patch

I managed to reproduce the bug (without this patch) using the CoolIris test case.  Applying this patch fixed the issue. This was on Snow Leopard with 1.9.2.

I couldn't find any new breakage in other tests either, looks good.
Attachment #505984 - Flags: review?(ahalberstadt) → review+
pushed to hotfix-1.5.2: https://github.com/mozautomation/mozmill/commit/215651a4051d12d00999b07baacb1fa64fa79036
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Attachment #505167 - Attachment is obsolete: true
Attachment #505602 - Attachment is obsolete: true
I have tested this patch as landed on 10.6 and shutting down the browser works fine, at least for our known condition. But, sorry to say, we do not report disconnects as failures anymore. This is critical for our result analysis. Reopening.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
My patch on bug 627983 should make this failure appear less often now.
Can you provide output on what has changed and the command line used to run?  Again, need repro steps.  

For Mozmill, start should only be called once, so if this is an issue with non-restart tests, I'm not really sure what to do. The only code path that changed is the one where runner.start() is called multiple times.

For MozmillRestart, this could indeed affect reporting. Without a substantive API refactor, I'm sorry to say that it is likely a choice between shutting down the browser and getting good results.  Not sure if we can do the right thing without breaking the API.
Use the command line we have been using over the last few days to reproduce this bug. You will see in the final summary that the test hasn't been marked as failed. The same happens when using --report=file://report.json.

If we can't fix this we should probably back out this patch from 1.5.2pre. As said before, bug 627983 will make it happen less often now. I have only tested with restart tests and not with non-restart tests.
Backed out the one line of code that did anything; the rest is fine to leave.

https://github.com/mozautomation/mozmill/commit/52c0bc36bc48bb735a1da7aa3caf1093e414310e

(on hotfix-1.5.2, btw)

Should we wontfix or port this patch to 2.0?
Whiteboard: [mozmill-1.5.2+] → [mozmill-1.5.2-][mozmill-2.0?]
Would it be necessary for 2.0? If so, I would be in favor, but with a complete patch that includes the failure notice in the report.
Status: REOPENED → ASSIGNED
It should be part of 2.0, though it's merely part of bug 627160.
So let's add the dependency so we don't forget about it.
Depends on: 627160
No longer depends on: 627160
Attachment #505984 - Flags: feedback?(ctalbert) → feedback+
We should take this for 2.0 (which probably means making it applicable to 2.0 rather than taking the exact patch).
Whiteboard: [mozmill-1.5.2-][mozmill-2.0?] → [mozmill-1.5.2-][mozmill-2.0+]
Attached patch patch for 2.0 — Splinter Review
Seems to work for 2.0. I don't see any side effects, but they'd only show up in rare cases anyway.

Note that this is orthogonal to the problem of whether a disconnect error should cause us to proceed to the next test or not.
Attachment #516432 - Flags: review?(ahalberstadt)
Comment on attachment 516432 [details] [diff] [review]
patch for 2.0

Looks alright to me,
r+
Attachment #516432 - Flags: review?(ahalberstadt) → review+
pushed to master: https://github.com/mozautomation/mozmill/commit/fd628947995d7896fea476c2eb81d58c171e9a14
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard