Due to changes in Python 2.7.2 the EBADF error is handled internally now but causes JSBridge to stay in an endless loop

RESOLVED FIXED

Status

defect
--
critical
RESOLVED FIXED
7 years ago
3 years ago

People

(Reporter: whimboo, Assigned: whimboo)

Tracking

({regression})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [mozmill-1.5.9+][mozmill-2.0+])

Attachments

(1 attachment, 1 obsolete attachment)

I'm not sure what's going on here. I have never seen this before until William pointed me to this problem yesterday. On Windows if you run restart tests Firefox is not closed correctly, a disconnect gets fired, and Mozmill hangs. After a while Firefox starts again but port 24242 is still in use.

This is a regression by something unknown for now. It can be reproduced with our new Mozmill environment but also with MozillaBuild. Could this be Python 2.7 related? I haven't tested 2.6 thought yet.

Steps:
1. Get our environment: http://people.mozilla.com/~hskupin/downloads/mozmill-env/1.5.8-win.zip
2. Execute run.cmd
3. Run 'hg clone http://hg.mozilla.org/qa/mozmill-tests'
4. Run 'mozmill-restart --show-all -t tests/functional/restartTests -b %Nightly%

After the first restart attempt Firefox is not closed correctly and Mozmill hangs.

Clint or Jeff, can one of you please have a look? This kinda blocks us from running DTC tests. I'm away this evening so I would have to leave this in your hands.
Looks like the problem as given my William on Linux is caused by the exact same underlying cause.

On the console I can see:

*** e = [Exception... "Component returned failure code: 0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE) [nsIJSCID.getService]"  nsresult: "0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE)"  location: "JS frame :: chrome://browser/content/utilityOverlay.js :: getShellService :: line 339"  data: no]

Not sure if this is somewhat related to our code or not.
Summary: On Windows mozmill-restart causes a hang on shutdown of Firefox - Process at 99% cpu load → mozmill-restart causes a hang on shutdown of Firefox - Process at 99% cpu load
This is not a problem on OS X 10.7.

Comment 3

7 years ago
Posted patch Instrumentation patch (obsolete) — Splinter Review
So, this is the closest to one of those "how the hell did this ever work" moments I've ever seen. 

tldr
Something changed on windows and linux such that we are hanging onto this socket and are unable to close it.  Probably something down inside the libraries somewhere.  I haven't dug in there yet to figure that out. I'd recommend that for the next person that takes a look at this.

= The details =
So, in network.py, we have this global thread. Take a look at create_network. You can't miss it.  We start up our bridge and our backchannel bridge then we spin up this thread with asyncore.loop(). asyncore.loop is a blocking call, we'll only return from it once the sockets that asyncore is using are all dead. Specifically that means the sockets for the bridge and the backchannel (which are managed by the Telnet class up at the top of the file). 

Now, that's all good and well as long as you *kill the socket*. If you run this instrumentation with PYTHONUNBUFFERED set on your environment, you'll see that once we stop running the first test and start to spin up jsbridge to run the second test, we try to close the socket. Something prevents us from closing the socket and the test will hang when it tries to create the two bridges (note how the thread is still active and we hit the else case).

Now, on mac OS X, things can't be that bad, can they? Why it works on mac! Well...sorta.  You see, on mac, we try to close the socket, and somewhere down in the deep guts of asyncore we throw an exception.  That exception causes the asyncore.loop function to throw, which in turn causes the global thread to exit, thus allowing us to actually keep running.

I see a few ways of fixing this:
1. Rewrite jsbridge from the ground up
2. Figure out what changed low level and/or what is causing the exception to be thrown on mac, and throw/cause that exception in the Telnet class when it handles close. (You can't just throw any old exception, I tried that already)
3. Figure out a way to hold the python sockets open between runs of the browser - this is actually the same as point 1, and is enough of a difference that we might as well rewrite the whole thing
4. Implement our own inherited version of asyncore.loop for the global thread with a message loop that we can break out of.
5. Switch jsbridge to use multiprocessing over threads and forceably kill the asyncore process (not sure if this would work at all or if this would destroy jsbridge as we know it, thus leading to solution 1).
I think 2 is probably our best bet, at least figuring out what the exception is on mac might tell us what has changed on windows/linux that is causing this behavior. And maybe from there, we can figure out what to do next.

= How to run (assuming you have virutalenv already installed and have never checked out the mozmill tree, and that you don't have write access to mozmill - you probabaly don't want to set it up this way if you have write access) =
== Setup ==
hg clone http://hg.mozilla.org/qa/mozmill-tests
git clone git://github.com/mozautomation/mozmill.git
cd mozmill
git fetch -a
git branch --track hotfix-1.5 origin/hotfix-1.5
git checkout hotfix-1.5
mkvirtualenv mozmill-1.5
python setup_development.py

= Ok, run the test =
(my nightly build of firefox lives in ~/tools/nightly, change that to where yours is.)
mozmill-restart -b ~/tools/nightly/firefox -t ../mozmill-tests/tests/functional/restartTests/testMasterPassword 2>&1|tee foo.log

You should see the browser start three times.  If it hangs, you will see it start the second time and hang until it times out.
(In reply to Clint Talbert ( :ctalbert ) from comment #3)
> 2. Figure out what changed low level and/or what is causing the exception to
> be thrown on mac, and throw/cause that exception in the Telnet class when it
> handles close. (You can't just throw any old exception, I tried that already)
[..]
> I think 2 is probably our best bet, at least figuring out what the exception
> is on mac might tell us what has changed on windows/linux that is causing
> this behavior. And maybe from there, we can figure out what to do next.

Wow this is hilarious! Thanks for this deep analysis Clint. I agree that we have to figure out where those problems have been introduced. Especially because of we haven't seen this at anytime before.

I will CC a couple of our folks who work on Mozmill, so we can get more eyes on it. Some scenarios we should test are:

* An older version of virtualenv
* Different versions of Python
* Old versions of our preconfigured VMs for Windows and Linux without the lastest OS updates

It would be great if everyone could have a look at this issue because it would stop our capability to run Mozmill tests.
I have tested various ways and something I have seen is that it works fine with Python 2.6.5, which we are still using on our box for daily tests. On the same box it breaks with Python 2.7.2 installed. So I will now test Python versions in between to figure out if this is a regression in Python.
This is a regression in Python 2.7.2. I have build an environment based on Python 2.7.1 and everything works as expected. That's also one reason why we don't see it on Mac because Lion is using 2.6.7.

http://python.org/download/releases/2.7.2/
Python website tells that the current stable version for Python is 2.7.1 for Python 2 for Windows and Linux.
It might be worth to test the latest stable version which is 3.2.
Quoted from http://wiki.python.org/moin/BeginnersGuide/Download

I'm not getting the point in using an unstable version.
No, the latest stable version of Python 2.7 is 2.7.2 as you can see on the download page: http://python.org/download/

Also do not support Python >3 yet for Mozmill so we have to stick with Python 2.7 or 2.6.

(In reply to Remus Pop (:RemusPop) from comment #7)
> I'm not getting the point in using an unstable version.

Not sure what you mean with this sentence.
So I managed to get the tests working under Ubuntu Linux by creating a virtualenv running Python 2.6 (installable via the python2.6 package). The steps to doing so are pretty much identical to Clint's, except you pass a '-p' parameter to virtualenv, like this:

virtualenv -p python2.6 mozmill-1.5

(not sure what that mkvirtualenv program is, I presume that it has a similar option)
(In reply to Henrik Skupin (:whimboo) from comment #8)
> No, the latest stable version of Python 2.7 is 2.7.2 as you can see on the
> download page: http://python.org/download/
Sorry about that. I correlated that and the fact that my Ubuntu showed 2.7.1 as being the latest version and being unable to force other version install (still not sure why I can't update through synaptic)

> 
> (In reply to Remus Pop (:RemusPop) from comment #7)
> > I'm not getting the point in using an unstable version.
> 
> Not sure what you mean with this sentence.

Was related to my belief of v 2.7.1 being the latest.
So I have bisected Python between 2.7.1 and 2.7.2 with about 280 changesets and found the regressor:

The first bad revision is:
changeset:   68064:965ab9911fcd
branch:      2.7
parent:      68057:1797300f87b9
date:        Thu Mar 03 14:17:51 2011 +0000
summary:     Merged revisions 88722 via svnmerge from

Details can be found here: http://hg.python.org/cpython/log?rev=965ab9911fcd

Interesting changes are the following:

asyncore:  http://hg.python.org/cpython/rev/965ab9911fcd
telnetlib: http://hg.python.org/cpython/rev/1797300f87b9
So the telnetlib change is a noop because it's a test only. So I have investigated the changes in asyncore and there are two additions: EPIPE and EBADF. 

Reverting code to the former version showed me that the regression has been caused by 1.47: http://hg.python.org/cpython/rev/965ab9911fcd#l1.47

1.47 -            if why.args[0] in [ECONNRESET, ENOTCONN, ESHUTDOWN, ECONNABORTED]:
1.48 +            if why.args[0] in _DISCONNECTED:

Removing EPIPE from _DISCONNECTED disconnected doesn't make a difference, but removing EBADF made it work again. So the addition of EBADF in dispatcher.recv is causing us problems here.
So this seems to be an easy one to fix. Look at our Telnet class and the loop which is used to read the data:

https://github.com/mozautomation/mozmill/blob/master/jsbridge/jsbridge/network.py#L80

We break this loop when we receive a socket.error. But as you can see in the Python code change they are now covering the EBADF condition for socket.error themselves:

http://hg.python.org/cpython/file/965ab9911fcd/Lib/asyncore.py#l380

So no exception is raised anymore here if the socket descriptor is invalid. Instead they are closing the socket and return an empty string.
Fix looks easy. Taking.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Summary: mozmill-restart causes a hang on shutdown of Firefox - Process at 99% cpu load → Due to changes in Python 2.7.2 the EBADF error is handled internally now but causes JSBridge to stay in an endless loop
Posted patch Patch v1Splinter Review
Simple fix which will gracefully handle a closed socket now. This works with any version of Python.
Attachment #593306 - Attachment is obsolete: true
Attachment #593604 - Flags: review?(ctalbert)

Comment 16

7 years ago
Comment on attachment 593604 [details] [diff] [review]
Patch v1

So freaking awesome!

And the gold medal goes to Henrik.

I bet this code is in 2.0 as well, so we should probably check on whether a patch is required there too.

Thanks so much!!!
Attachment #593604 - Flags: review?(ctalbert) → review+
This needs a fix across all versions of Mozmill.
Whiteboard: [mozmill-1.5.9?][mozmill-2.0?]

Updated

7 years ago
Whiteboard: [mozmill-1.5.9?][mozmill-2.0?] → [mozmill-1.5.9+][mozmill-2.0+]
Landed as:

master: https://github.com/mozautomation/mozmill/commit/4a18584f00b18d3b750e5e2bb4ea25870f30362b

hotfix-2.0: https://github.com/mozautomation/mozmill/commit/116cf77b0e706606a7e2bfbfa77e76f6d2d2d5e0

hotfix-1.5: https://github.com/mozautomation/mozmill/commit/912d97b6632988de5569d2a788ef7a7de67061ae
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Component: Mozbase → Mozmill
QA Contact: mozbase → mozmill
Resolution: --- → FIXED
Blocks: 723296

Comment 19

5 years ago
Hi, 

I am new to mozilla and i was trying to start contributing to bug fixing. while checking out code on my mac 10.9.4 which has python version 2.7.5 , command 'python client.py checkout' hangs eternally . 

I was searching if somebody else faced this problem and ended here. If this is not relevant to this bug please ignore. 

I was following instructions from  https://developer.mozilla.org/en-US/docs/Simple_Thunderbird_build , while i faced this problem. 

Thanks 
Prashant
Prashant, your issue is different, and I don't know what it is. Here we fixed a bug in jsbridge a long time ago. You might wanna join IRC and ask this question in #maildev where people hang out who also work on testing Thunderbird.
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.