Closed Bug 740360 Opened 13 years ago Closed 13 years ago

balrog needs to make sure connections are always closed

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

Attachments

(1 file, 1 obsolete file)

We hit an issue this morning where Rail's Balrog client was getting "Bad Status Line" error when making requests. After poking around the server I found that there was a bunch of seemingly hung httpd workers, as well as a bunch of open sockets. Here's everything I found: - 21 connections to the mysql server in a CLOSE_WAIT state. -- Sheeri looked at the mysql server, and found no connections from the aus4-dev node, so as far as that server is concerned those connections are dead - ~50 connections to zlb nodes in a TIME_WAIT state (actual number varies a little bit) - This exception in the application log: ERROR:auslib.web.base:Exception on /releases/Firefox-mozilla-aurora-nightly-20120329042012/builds/Linux_x86-gcc3/kk [PUT] Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/flask/app.py", line 1292, in wsgi_app response = self.full_dispatch_request() File "/usr/lib/python2.6/site-packages/flask/app.py", line 1062, in full_dispatch_request rv = self.handle_user_exception(e) File "/usr/lib/python2.6/site-packages/flask/app.py", line 1060, in full_dispatch_request rv = self.dispatch_request() File "/usr/lib/python2.6/site-packages/flask/app.py", line 1047, in dispatch_request return self.view_functions[rule.endpoint](**req.view_args) File "/usr/lib/python2.6/site-packages/flask/views.py", line 56, in view return self.dispatch_request(*args, **kwargs) File "/usr/lib/python2.6/site-packages/flask/views.py", line 112, in dispatch_request return meth(*args, **kwargs) File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/web/views/base.py", line 40, in put return self._put(*args, transaction=trans, **kwargs) File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/web/views/base.py", line 40, in put return self._put(*args, transaction=trans, **kwargs) File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/web/views/base.py", line 11, in decorated return f(*args, changed_by=username, **kwargs) File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/web/views/base.py", line 24, in decorated if not db.permissions.hasUrlPermission(username, url, method, urlOptions=extra): File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/db.py", line 752, in hasUrlPermission if self.select(where=[self.username==username, self.permission=='admin'], transaction=transaction): File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/db.py", line 22, in convertRows for row in fn(*args, **kwargs): File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/db.py", line 191, in select return query.execute().fetchall() File "/usr/lib/python2.6/site-packages/sqlalchemy/sql/expression.py", line 2667, in execute return e._execute_clauseelement(self, multiparams, params) File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 2297, in _execute_clauseelement connection = self.contextual_connect(close_with_result=True) File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 2334, in contextual_connect self.pool.connect(), File "/usr/lib/python2.6/site-packages/sqlalchemy/pool.py", line 210, in connect return _ConnectionFairy(self).checkout() File "/usr/lib/python2.6/site-packages/sqlalchemy/pool.py", line 371, in __init__ rec = self._connection_record = pool._do_get() File "/usr/lib/python2.6/site-packages/sqlalchemy/pool.py", line 685, in _do_get (self.size(), self.overflow(), self._timeout)) TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30 ERROR:auslib.web.views.releases:Caught ISA 500 error. DEBUG:auslib.web.views.releases:Balrog version is: 0.1.1 DEBUG:auslib.web.views.releases:Request path is: /releases/Firefox-mozilla-aurora-nightly-20120329042012/builds/Linux_x86-gcc3/kk - Immediately before the above exception, a IndexError: list index out of range Exception, which is caused by bug 738755 (which has been fixed by not pulled onto the aus4-dev nodes yet) - In the access log: 10.8.33.240 - stage-ffxbld [29/Mar/2012:06:43:25 -0700] "PUT /releases/Firefox-mozilla-central-dummy-nightly-20120329031156/builds/Linux_x86_64-gcc3/ko HTTP/1.1" 500 541 "-" "python-requests/0.10.8" - In the error log: [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] mod_wsgi (pid=9620): Exception occurred processing WSGI script '/data/www/aus4-admin-dev.allizom.org/aus4-admin/admin.wsgi'. [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] Traceback (most recent call last): [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] File "/usr/lib/python2.6/site-packages/flask/app.py", line 1306, in __call__ [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] return self.wsgi_app(environ, start_response) [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] File "/usr/lib/python2.6/site-packages/flask/app.py", line 1294, in wsgi_app [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] response = self.make_response(self.handle_exception(e)) [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] File "/usr/lib/python2.6/site-packages/flask/app.py", line 1120, in make_response [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] return self.response_class.force_type(rv, request.environ) [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/vendor/lib/python/werkzeug/wrappers.py", line 711, in force_type [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] response = BaseResponse(*_run_wsgi_app(response, environ)) [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/vendor/lib/python/werkzeug/test.py", line 818, in run_wsgi_app [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] app_iter = app(environ, start_response) [Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] TypeError: 'TimeoutError' object is not callable
One thing we did find was that the admin application was pointed at a read-only database node. That node happened to be accidentally configured to be rw, but it makes me wonder if something else about this caused issues. solarce changed that to point at the proper node and kicked apache, so at this point we're back to a working state. Not much more we can do unless the problem happens again.
Catlee thought that ISA 500 errors might be somehow hanging up db connections or something like that. We tested this theory by running a couple of hundred requests that we know will generate 500 errors quickly, and indeed, we got the server hung up the same way again. I'm going to be getting sudo access to the nodes, and should be able to debug more efficiently with that. I'll see if i can figure out what's up!
sudo has been given via puppet changes.
After a ton more debugging I finally figured out (with tons of help from arr, sheeri, solarce, and catlee) that the code that manages the Connection objects was never calling .close(), which makes perfect sense given the CLOSE_WAIT sockets. After changing the application code to do that, I haven't been able to reproduce the problem. One thing I'm still a bit unsure of is whether or not there's an issue with some networking equipment killing some connections. While debugging, sheeri found errors like this on the mysql server: 120330 9:48:20 [Warning] Aborted connection 30397518 to db: 'aus4_dev' user: 'aus4_dev' host: '10.8.70.207' (Got timeout reading communication packets) And I'm not sure if those are explained by having too many open connections already. I don't think we need to debug that further for now, but if more issues crop up it's something to look into it.
Summary: balrog admin application gets hung sometimes → balrog needs to make sure connections are always closed
I also switched insert/delete/update to use the context manager to simplify those sections of code.
Attachment #610997 - Flags: review?(catlee)
16:42 <@catlee> so are you sure you want transactions to always close their connections? 16:44 < bhearsum> pretty sure. an AUSTransaction is basically a Connection + Transaction - the connection is only held by the AUSTransaction 16:44 < bhearsum> i guess that could change at some point 16:45 < bhearsum> but the reason i wrote AUSTransaction was to let all the code that executes things work with one object, rather than managing both a Connection and Transaction 16:45 <@catlee> ok 16:46 <@catlee> so it's weird that it's not the one opening the connection 16:46 < bhearsum> hmmm 16:46 < bhearsum> yeah, i can see that 16:46 < bhearsum> maybe it should take an engine instead, and open the connection itself? 16:46 <@catlee> yeah, that would make more sense I think 16:50 < bhearsum> alright, i'll do that
Attachment #610997 - Attachment is obsolete: true
Attachment #610997 - Flags: review?(catlee)
Attachment #611021 - Flags: review?(catlee)
Attachment #611021 - Flags: review?(catlee) → review+
Comment on attachment 611021 [details] [diff] [review] pass engine to AUSTransaction Landed. Jenkins run is here: https://jenkins.mozilla.org/job/Balrog/68/ (oddly, I had to force it).
Attachment #611021 - Flags: checked-in+
Thanks again everyone for all your help diagnosing this.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: