Closed Bug 619419 Opened 14 years ago Closed 14 years ago

SQL connection errors from SUMO

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: All
OS: Other
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jsocol, Assigned: justdave)

Details

Attachments

(1 file)

We seem to get a relatively low but steady stream of stack traces from SUMO production that end with this error from MySQL:

  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/backends/mysql/base.py", line 297, in _cursor
    self.connection = Database.connect(**kwargs)
  File "/usr/lib64/python2.6/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/MySQLdb/connections.py", line 188, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
OperationalError: (2013, "Lost connection to MySQL server at 'reading authorization packet', system error: 0")

I'm not sure what the issue is, or if it's solvable, but if it is, we should solve it.
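Aside: error 2013 at 'reading authorization packet' means the TCP connection was accepted but the MySQL handshake never completed, which is consistent with a proxy accepting the socket while no backend answers in time. Since the failure is transient, one client-side mitigation is to retry the connect. The sketch below is a hypothetical illustration, not code SUMO runs; the local `OperationalError` class stands in for MySQLdb's so the example is self-contained.

```python
import time


class OperationalError(Exception):
    """Stand-in for MySQLdb's OperationalError, so this sketch runs without a DB."""


def connect_with_retry(connect, retries=3, delay=0.1, **kwargs):
    """Retry transient handshake failures such as MySQL error 2013
    ("Lost connection to MySQL server at 'reading authorization packet'").
    Non-transient errors (wrong password, etc.) are re-raised immediately.
    """
    last = None
    for attempt in range(retries):
        try:
            return connect(**kwargs)
        except OperationalError as e:
            if e.args and e.args[0] != 2013:
                raise  # not the transient handshake failure; don't mask it
            last = e
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise last
```

With the real driver you would pass `MySQLdb.connect` as the `connect` argument; the retry count and backoff here are illustrative values.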
Assignee: server-ops → justdave
Any insight or ideas here?
We've had 70 of these in less than 24 hours. It is by far the most common issue with SUMO.
(In reply to comment #1)
> Any insight or ideas here?

Hrrrm. That looks like it has an issue connecting to the phx slaves. I'm not even sure sumo's using those. Dave?
(In reply to comment #3)
> (In reply to comment #1)
> > Any insight or ideas here?
>
> Hrrrm. That looks like it has an issue connecting to the phx slaves. I'm not
> even sure sumo's using those.

Why PHX? SUMO is using Zeus as a proxy to the databases, and that error message *usually* means Zeus didn't have any backends available when the connection was attempted. Zeus has a max connections setting of its own; it's possible we're hitting Zeus' max connections limit. Do you know whether the connection in question is for the master or the slave databases? They're different pools in Zeus.
(In reply to comment #4)
> Do you know whether the connection in question is for the master or the slave
> databases? They're different pools in Zeus.

James, alternatively, if we can find out whether the stack trace was from a write or a read-only op, that'd help too (if not the server name).
Here's a full stack trace. I can tell that this one was a read to the slaves (we do read from master immediately after writes to avoid rep lag).

Traceback (most recent call last):
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/core/handlers/base.py", line 100, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/views/decorators/http.py", line 37, in inner
    return func(request, *args, **kwargs)
  File "/data/www/support.mozilla.com/kitsune/apps/wiki/views.py", line 128, in document
    return jingo.render(request, 'wiki/document.html', data)
  File "/data/www/support.mozilla.com/kitsune/vendor/src/jingo/jingo/__init__.py", line 78, in render
    rendered = render_to_string(request, template, context)
  File "/data/www/support.mozilla.com/kitsune/vendor/src/jingo/jingo/__init__.py", line 96, in render_to_string
    return template.render(**get_context())
  File "/usr/lib/python2.6/site-packages/jinja2/environment.py", line 891, in render
    return self.environment.handle_exception(exc_info, True)
  File "/data/www/support.mozilla.com/kitsune/apps/wiki/templates/wiki/document.html", line 10, in top-level template code
    {% set localizable_url = url('wiki.document', document.parent.slug, locale=settings.WIKI_DEFAULT_LANGUAGE) %}
  File "/data/www/support.mozilla.com/kitsune/apps/wiki/templates/wiki/base.html", line 15, in top-level template code
    {% set top_text = _('Firefox Help') %}
  File "/data/www/support.mozilla.com/kitsune/templates/layout/base.html", line 58, in top-level template code
    {% block content_area %}
  File "/data/www/support.mozilla.com/kitsune/apps/wiki/templates/wiki/base.html", line 34, in block "content_area"
    {% block content %}
  File "/data/www/support.mozilla.com/kitsune/apps/wiki/templates/wiki/document.html", line 15, in block "content"
    {% if related %}
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/models/query.py", line 112, in __nonzero__
    iter(self).next()
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/models/query.py", line 106, in _result_iter
    self._fill_cache()
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/models/query.py", line 760, in _fill_cache
    self._result_cache.append(self._iter.next())
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django-cache-machine/caching/base.py", line 127, in __iter__
    obj = iterator.next()
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/models/query.py", line 269, in iterator
    for row in compiler.results_iter():
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/models/sql/compiler.py", line 672, in results_iter
    for rows in self.execute_sql(MULTI):
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/models/sql/compiler.py", line 726, in execute_sql
    cursor = self.connection.cursor()
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/backends/__init__.py", line 75, in cursor
    cursor = self._cursor()
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/backends/mysql/base.py", line 297, in _cursor
    self.connection = Database.connect(**kwargs)
  File "/usr/lib64/python2.6/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/MySQLdb/connections.py", line 188, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
OperationalError: (2013, "Lost connection to MySQL server at 'reading authorization packet', system error: 0")
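Aside: the "read from master immediately after writes" behavior mentioned above is a read-your-writes pattern. In Django it is usually implemented as a database router plus per-request pin state. The sketch below is a hypothetical illustration of the idea, not kitsune's actual router; the names `PinningRouter`, `pin_to_master`, and the alias `'slave'` are assumptions.

```python
import threading

# Per-thread pin state; in a real deployment this would be reset per request
# (e.g. by middleware) so one request's write doesn't pin every later request.
_local = threading.local()


def pin_to_master():
    """Mark this thread so its subsequent reads go to the master."""
    _local.pinned = True


def unpin():
    """Clear the pin (would normally happen at request start/end)."""
    _local.pinned = False


class PinningRouter:
    """Django-style DB router: reads go to a slave unless a recent write
    on this thread pinned it to the master, avoiding replication lag."""

    def db_for_read(self, model, **hints):
        return 'default' if getattr(_local, 'pinned', False) else 'slave'

    def db_for_write(self, model, **hints):
        pin_to_master()  # reads that follow this write will see its effects
        return 'default'
```

In settings, such a router would be listed in `DATABASE_ROUTERS` with `'default'` (master) and `'slave'` entries in `DATABASES`.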
AFAICT, they're coming from more than just one webhead:

(~7:20 AM Pacific)
'platform.name': 'pm-app-sumo02.mozilla.org',
'REMOTE_ADDR': '10.2.81.102',
'SERVER_ADDR': '10.2.81.141',

(~2:50 AM Pacific)
'platform.name': 'pm-app-sumo03.mozilla.org',
'REMOTE_ADDR': '10.2.81.100',
'SERVER_ADDR': '10.2.81.142',
OK, there is no max connections limit set up on the sumo database pools. There is, however, a 4-second timeout on connections to the backends. I bumped that to 10 seconds on the master pool, since there's only one backend there. On the slave pool, it cycles through the backends if that 4 seconds expires hitting one of them. There are only 3 backends in the slave pool, and you said these are slave connections, so it's possible all three slaves were slow at some point; I've bumped it to 10 seconds there as well. Let me know if it still happens at all.
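For completeness: there is a client-side knob too. Django's MySQL backend passes the `OPTIONS` dict straight through to `MySQLdb.connect()`, so the webheads can cap how long they wait on the handshake themselves. This fragment is purely illustrative; the host name and values are assumptions, not SUMO's real configuration.

```python
# settings.py fragment (illustrative values, not kitsune's actual config)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'HOST': 'zeus-master-vip.example.internal',  # hypothetical Zeus VIP
        'NAME': 'kitsune',
        'USER': 'kitsune',
        'PASSWORD': '...',
        'OPTIONS': {
            # Forwarded verbatim to MySQLdb.connect(): fail fast instead of
            # hanging while the proxy hunts for a healthy backend.
            'connect_timeout': 10,
        },
    },
}
```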
Seen it happen at least 4 times since you made the change, so about the same rate.

Occasionally we see a similar type of error from search, though I don't know if we're connecting to Sphinx through Zeus or Netscaler. If it's Zeus, might that point to trouble with Zeus? Here's the error; it looks like sock.recv(8) is failing. It might be a red herring, but I figure more information is better.

  File "/data/www/support.mozilla.com/kitsune/apps/search/sphinxapi.py", line 223, in _GetResponse
    (status, ver, length) = unpack('>2HL', sock.recv(8))
error: unpack requires a string argument of length 8
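Aside: that `unpack requires a string argument of length 8` error means `sock.recv(8)` returned fewer than 8 bytes. A short read is legal TCP behavior, so this may be a client bug rather than a Zeus problem per se: the caller should loop until the full header arrives. A sketch of the fix (written in Python 3 for clarity; sphinxapi.py itself is Python 2, and the 8-byte `>2HL` header layout is taken from the trace above):

```python
from struct import unpack


def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv() may return a short read."""
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:  # peer closed the connection mid-header
            raise ConnectionError(
                'connection closed after %d of %d bytes' % (len(buf), n))
        buf += chunk
    return buf


def read_sphinx_header(sock):
    # searchd reply header per the traceback: status (u16), version (u16),
    # body length (u32), big-endian.
    status, ver, length = unpack('>2HL', recv_exact(sock, 8))
    return status, ver, length
```

Even with this fix in place, a header arriving in pieces (or not at all) within the proxy's timeout window would still be worth investigating on the Zeus side.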
I think there are more now than before, actually. Last one (~2:11pm):

Traceback (most recent call last):
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/models/manager.py", line 132, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/models/query.py", line 336, in get
    num = len(clone)
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/models/query.py", line 81, in __len__
    self._result_cache = list(self.iterator())
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django-cache-machine/caching/base.py", line 127, in __iter__
    obj = iterator.next()
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/models/query.py", line 269, in iterator
    for row in compiler.results_iter():
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/models/sql/compiler.py", line 672, in results_iter
    for rows in self.execute_sql(MULTI):
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/models/sql/compiler.py", line 726, in execute_sql
    cursor = self.connection.cursor()
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/backends/__init__.py", line 75, in cursor
    cursor = self._cursor()
  File "/data/www/support.mozilla.com/kitsune/vendor/src/django/django/db/backends/mysql/base.py", line 297, in _cursor
    self.connection = Database.connect(**kwargs)
  File "/usr/lib64/python2.6/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/MySQLdb/connections.py", line 188, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
OperationalError: (2013, "Lost connection to MySQL server at 'reading authorization packet', system error: 0")
The rate of this error has definitely gone up in the past few hours.
Is it possible the problem is locking up resources? That would explain why the increased timeout made things worse. Since it's on the slaves, I imagine not; reads are non-blocking, right?
some interesting spikes and dips in here...
oh, image didn't include the legend.... The green line is the RO pool, the purple one is the RW pool.
How's it looking so far, any change? Do the errors you're still getting seem to come in clusters or are they evenly spread out?
I have actually seen *nothing* today. See screenshot. All times are PDT, yesterday. http://grab.by/8lUg Looks promising. You can close this if you want. Hope there will be nothing new by push time tomorrow, too.
In general, I wouldn't say they came in clusters, but they rarely came alone. It was typical to see 1-3 of these in a 2-3 minute period, but it wasn't like every request would cause it for 2 minutes. I haven't seen any since around 4:15pm PT yesterday.
Haven't seen this in a week. I'm not sure what y'all did, but thanks!
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard