Closed Bug 1221538 Opened 10 years ago Closed 10 years ago

Autophone - AutophonePulseMonitor falls down and can't get up after SSLError: [Errno 8] _ssl.c:510: EOF occurred in violation of protocol

Categories

(Testing Graveyard :: Autophone, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

Attachments

(2 files)

Yesterday Autophone suffered a Pulse failure which left it unable to consume Pulse messages. Autophone did not recover properly from the error. The queue filled and was deleted, and new builds were not detected afterwards.

2015-11-03 14:23:29,160|76|PulseMonitorThread|root|ERROR|AutophonePulseMonitor Exception
Traceback (most recent call last):
  File "/mozilla/projects/autophone/src/autophone/autophonepulsemonitor.py", line 245, in listen
    auto_declare=False)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 652, in Consumer
    return Consumer(channel or self, queues, *args, **kwargs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/messaging.py", line 359, in __init__
    self.revive(self.channel)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/messaging.py", line 364, in revive
    channel = self.channel = maybe_channel(channel)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 1054, in maybe_channel
    return channel.default_channel
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 756, in default_channel
    self.connection
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection
    self._connection = self._establish_connection()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection
    conn = self.transport.establish_connection()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 112, in establish_connection
    conn = self.Connection(**opts)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
    self.transport = self.Transport(host, connect_timeout, ssl)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
    return create_transport(host, connect_timeout, ssl)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/amqp/transport.py", line 297, in create_transport
    return SSLTransport(host, connect_timeout, ssl)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/amqp/transport.py", line 199, in __init__
    super(SSLTransport, self).__init__(host, connect_timeout)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/amqp/transport.py", line 102, in __init__
    self._setup_transport()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/amqp/transport.py", line 206, in _setup_transport
    self.sock = ssl.wrap_socket(self.sock)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 392, in wrap_socket
    ciphers=ciphers)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 148, in __init__
    self.do_handshake()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 310, in do_handshake
    self._sslobj.do_handshake()
SSLError: [Errno 8] _ssl.c:510: EOF occurred in violation of protocol

2015-11-03 14:23:33,857|76|PulseMonitorThread|root|ERROR|AutophonePulseMonitor Exception
2015-11-03 14:23:33,857|76|PulseMonitorThread|root|ERROR|AutophonePulseMonitor Exception
Traceback (most recent call last):
  File "/mozilla/projects/autophone/src/autophone/autophonepulsemonitor.py", line 245, in listen
    auto_declare=False)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 652, in Consumer
    return Consumer(channel or self, queues, *args, **kwargs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/messaging.py", line 359, in __init__
    self.revive(self.channel)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/messaging.py", line 364, in revive
    channel = self.channel = maybe_channel(channel)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 1054, in maybe_channel
    return channel.default_channel
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 758, in default_channel
    self._default_channel = self.channel()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 242, in channel
    chan = self.transport.create_channel(self.connection)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 88, in create_channel
    return connection.channel()
AttributeError: 'NoneType' object has no attribute 'channel'

mcote: do you have any insights into what is happening on Pulse Guardian? Have there been known issues or upgrades recently?
Flags: needinfo?(mcote)
Summary: Autophone - AutophonePulseMonitor falls down and can't get uip after SSLError: [Errno 8] _ssl.c:510: EOF occurred in violation of protocol → Autophone - AutophonePulseMonitor falls down and can't get up after SSLError: [Errno 8] _ssl.c:510: EOF occurred in violation of protocol
mcote: also, can we increase the size of the queues before they are deleted? The current load means they are deleted if the consumer is down for an hour. I frequently get warnings during restarts when the consumer is paused for the orderly shutdown of the workers.
mcote increased the queue size to 16,000 and the warning limit to 4,000.

The code at https://github.com/mozilla/autophone/blob/master/autophonepulsemonitor.py#L268 releases the Kombu Connection when an error occurs but does not recreate it. I'm thinking that I just need to re-create it afterwards, as in https://github.com/mozilla/autophone/blob/master/autophonepulsemonitor.py#L197
Flags: needinfo?(mcote)
This patch recovers from exceptions by releasing the previous connection, recreating it, then starting a new listening thread as the old one exits. Tested locally with mcote's help.
Attachment #8683179 - Flags: review?(mcote)
Comment on attachment 8683179
bug-1221538-pulse-connection-errors.patch

Review of attachment 8683179:
-----------------------------------------------------------------

As discussed on IRC, starting a new thread from within the existing thread just before it exits is kind of weird, and its side-effects are not entirely obvious. Would be better to loop in listen(), ideally creating the connection and related objects at the beginning of the function.
Attachment #8683179 - Flags: review?(mcote) → review-
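For illustration only, the structure suggested in the review (loop inside listen() and create the connection at the top of each pass) might look roughly like the sketch below. This is not the Autophone patch: `connect`, `consume`, `should_stop`, `retry_delay` and `sleep` are hypothetical parameters standing in for the real Kombu setup, and the only thing assumed of the connection object is a `release()` method, which is the call the monitor already uses.

```python
import logging
import time

logger = logging.getLogger(__name__)


def listen(connect, consume, should_stop, retry_delay=1.0, sleep=time.sleep):
    """Consume messages until should_stop() is true, reconnecting on errors.

    connect, consume and should_stop are hypothetical callables supplied
    by the caller; they are placeholders, not Autophone or Kombu APIs.
    """
    while not should_stop():
        connection = None
        failed = False
        try:
            # A fresh connection is created on every pass, so a failure
            # never leaves a released or half-built connection behind.
            connection = connect()
            consume(connection)
        except Exception:
            logger.exception('AutophonePulseMonitor Exception')
            failed = True
        finally:
            # Release whatever was opened, whether or not consume() failed.
            if connection is not None:
                connection.release()
        if failed:
            # Pause briefly so a persistent outage does not cause a busy loop.
            sleep(retry_delay)
```

Because the same thread keeps running, nothing has to be handed over to a replacement thread when an exception occurs.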
Blocks: 1220762
Comment on attachment 8683270
bug-1221538-pulse-connection-errors-v2.patch

Review of attachment 8683270:
-----------------------------------------------------------------

::: autophonepulsemonitor.py
@@ +274,5 @@
> + logger.exception('AutophonePulseMonitor Exception')
> + if connection:
> +     connection.release()
> + restart = True
> + time.sleep(1)

This should probably have some sort of back-off timer, but I'm fine with that being a follow-up.
Attachment #8683270 - Flags: review?(mcote) → review+
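As a rough sketch of the back-off timer mentioned in the review, the fixed `time.sleep(1)` could be replaced by delays that grow after each consecutive failure, up to a cap. The generator name `backoff_delays` and the values for `base`, `factor` and `maximum` are assumptions chosen for illustration; they are not taken from the Autophone source or from any follow-up patch.

```python
def backoff_delays(base=1.0, factor=2.0, maximum=60.0):
    """Yield retry delays in seconds: base, base*factor, ... capped at maximum.

    Hypothetical helper; the defaults are illustrative only.
    """
    delay = base
    while True:
        yield delay
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * factor, maximum)
```

A retry loop would call `next()` on the generator after each failure and create a new generator after a successful connection, so the delay starts again from `base`.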
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard