Closed Bug 1221538 Opened 9 years ago Closed 9 years ago

Autophone - AutophonePulseMonitor falls down and can't get up after SSLError: [Errno 8] _ssl.c:510: EOF occurred in violation of protocol

Categories

(Testing Graveyard :: Autophone, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

Attachments

(2 files)

Yesterday Autophone suffered a Pulse failure that left it unable to consume Pulse messages. Autophone did not recover properly from the error. The queue filled and was deleted, and new builds were not detected afterwards.

2015-11-03 14:23:29,160|76|PulseMonitorThread|root|ERROR|AutophonePulseMonitor Exception
Traceback (most recent call last):
  File "/mozilla/projects/autophone/src/autophone/autophonepulsemonitor.py", line 245, in listen
    auto_declare=False)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 652, in Consumer
    return Consumer(channel or self, queues, *args, **kwargs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/messaging.py", line 359, in __init__
    self.revive(self.channel)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/messaging.py", line 364, in revive
    channel = self.channel = maybe_channel(channel)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 1054, in maybe_channel
    return channel.default_channel
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 756, in default_channel
    self.connection
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 741, in connection
    self._connection = self._establish_connection()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 696, in _establish_connection
    conn = self.transport.establish_connection()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 112, in establish_connection
    conn = self.Connection(**opts)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
    self.transport = self.Transport(host, connect_timeout, ssl)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
    return create_transport(host, connect_timeout, ssl)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/amqp/transport.py", line 297, in create_transport
    return SSLTransport(host, connect_timeout, ssl)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/amqp/transport.py", line 199, in __init__
    super(SSLTransport, self).__init__(host, connect_timeout)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/amqp/transport.py", line 102, in __init__
    self._setup_transport()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/amqp/transport.py", line 206, in _setup_transport
    self.sock = ssl.wrap_socket(self.sock)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 392, in wrap_socket
    ciphers=ciphers)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 148, in __init__
    self.do_handshake()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 310, in do_handshake
    self._sslobj.do_handshake()
SSLError: [Errno 8] _ssl.c:510: EOF occurred in violation of protocol
2015-11-03 14:23:33,857|76|PulseMonitorThread|root|ERROR|AutophonePulseMonitor Exception

2015-11-03 14:23:33,857|76|PulseMonitorThread|root|ERROR|AutophonePulseMonitor Exception
Traceback (most recent call last):
  File "/mozilla/projects/autophone/src/autophone/autophonepulsemonitor.py", line 245, in listen
    auto_declare=False)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 652, in Consumer
    return Consumer(channel or self, queues, *args, **kwargs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/messaging.py", line 359, in __init__
    self.revive(self.channel)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/messaging.py", line 364, in revive
    channel = self.channel = maybe_channel(channel)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 1054, in maybe_channel
    return channel.default_channel
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 758, in default_channel
    self._default_channel = self.channel()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/connection.py", line 242, in channel
    chan = self.transport.create_channel(self.connection)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 88, in create_channel
    return connection.channel()
AttributeError: 'NoneType' object has no attribute 'channel'
mcote: do you have any insights into what is happening on Pulse Guardian? Have there been known issues or upgrades recently?
Flags: needinfo?(mcote)
Summary: Autophone - AutophonePulseMonitor falls down and can't get uip after SSLError: [Errno 8] _ssl.c:510: EOF occurred in violation of protocol → Autophone - AutophonePulseMonitor falls down and can't get up after SSLError: [Errno 8] _ssl.c:510: EOF occurred in violation of protocol
mcote: also, can we increase the size of the queues before they are deleted? The current load means they are deleted if the consumer is down for an hour. I frequently get warnings during restarts when the consumer is paused for the orderly shutdown of the workers.
mcote increased the queue size to 16,000 and the warning limit to 4,000.

https://github.com/mozilla/autophone/blob/master/autophonepulsemonitor.py#L268 releases the Kombu Connection when an error occurs but does not recreate it. I'm thinking that I just need to re-create it afterwards as in https://github.com/mozilla/autophone/blob/master/autophonepulsemonitor.py#L197
Flags: needinfo?(mcote)
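
To illustrate the failure mode described above, here is a toy model and not Autophone's or kombu's code. ToyConnection, make_connection and consume are hypothetical names invented for this sketch. It only shows why an object that has been released keeps failing until a fresh one is built.

```python
class ToyConnection(object):
    """Hypothetical stand-in for a broker connection; not kombu's class."""

    def __init__(self):
        self._transport = object()  # pretend this is a live socket

    def release(self):
        # After release the underlying transport is gone.
        self._transport = None

    def consume(self):
        if self._transport is None:
            raise RuntimeError('connection was released')
        return 'message'


def make_connection():
    # Hypothetical factory; real code would construct its broker connection here.
    return ToyConnection()


connection = make_connection()
connection.release()              # what the error handler does

try:
    connection.consume()          # reusing the released object fails every time
    reused_ok = True
except RuntimeError:
    reused_ok = False

connection = make_connection()    # the proposed fix: build a fresh connection
recreated_ok = connection.consume() == 'message'
```

In this toy, every retry on the released object raises again, which matches the symptom of the monitor never getting back up.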
This patch recovers from exceptions by releasing the previous connection, recreating it, then starting a new listening thread as the old one exits.

Tested locally with mcote's help.
Attachment #8683179 - Flags: review?(mcote)
Comment on attachment 8683179 [details] [diff] [review]
bug-1221538-pulse-connection-errors.patch

Review of attachment 8683179 [details] [diff] [review]:
-----------------------------------------------------------------

As discussed on IRC, starting a new thread from within the existing thread just before it exits is kind of weird, and its side effects are not entirely obvious. It would be better to loop in listen(), ideally creating the connection and related objects at the beginning of the function.
Attachment #8683179 - Flags: review?(mcote) → review-
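
The shape suggested in this review, a loop inside listen() that builds the connection at the top of each pass, might look roughly like the sketch below. It is an illustration under assumptions and not the patch that landed: make_connection, handle_message, should_stop, the messages() method and the FlakyConnection fake are hypothetical stand-ins rather than kombu APIs.

```python
import logging
import time

logger = logging.getLogger(__name__)


def listen(make_connection, handle_message, should_stop,
           retry_delay=1.0, sleep=time.sleep):
    """Consume messages until should_stop() returns True.

    A new connection is created at the start of every pass, so an
    error only costs one pass instead of killing the listener.
    """
    while not should_stop():
        connection = None
        try:
            connection = make_connection()
            for message in connection.messages():
                handle_message(message)
                if should_stop():
                    break
        except Exception:
            logger.exception('AutophonePulseMonitor Exception')
            sleep(retry_delay)
        finally:
            # Always drop the old connection before the next pass.
            if connection is not None:
                connection.release()


class FlakyConnection(object):
    """Hypothetical fake: the first instance fails, later ones deliver."""
    created = 0
    released = 0

    def __init__(self):
        FlakyConnection.created += 1
        self.fail = FlakyConnection.created == 1

    def messages(self):
        if self.fail:
            raise IOError('EOF occurred in violation of protocol')
        return iter(['build-1', 'build-2'])

    def release(self):
        FlakyConnection.released += 1


received = []
delays = []
listen(FlakyConnection,
       received.append,
       should_stop=lambda: len(received) >= 2,
       sleep=delays.append)
```

Because recovery happens in the same thread, no second thread is started while the first is exiting, which is the side effect the review was concerned about.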
Blocks: 1220762
Comment on attachment 8683270 [details] [diff] [review]
bug-1221538-pulse-connection-errors-v2.patch

Review of attachment 8683270 [details] [diff] [review]:
-----------------------------------------------------------------

::: autophonepulsemonitor.py
@@ +274,5 @@
> +                logger.exception('AutophonePulseMonitor Exception')
> +                if connection:
> +                    connection.release()
> +                restart = True
> +                time.sleep(1)

This should probably have some sort of back-off timer, but I'm fine with that being a follow-up.
Attachment #8683270 - Flags: review?(mcote) → review+
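
One possible form for the back-off timer mentioned in this review is a capped exponential delay. The sketch below is only an assumption about what such a follow-up could look like; backoff_delays and its parameters are hypothetical and do not come from the Autophone source.

```python
import random


def backoff_delays(base=1.0, factor=2.0, maximum=60.0, jitter=0.0,
                   rng=random.random):
    """Yield an endless series of retry delays in seconds.

    Each delay is `factor` times the previous one, capped at `maximum`.
    """
    delay = base
    while True:
        # Optional jitter adds up to jitter * delay extra seconds so that
        # several clients do not all retry at the same instant.
        yield delay + delay * jitter * rng()
        delay = min(delay * factor, maximum)


delays = backoff_delays()
first_eight = [next(delays) for _ in range(8)]
```

A caller would replace the fixed time.sleep(1) with time.sleep(next(delays)) and create a fresh generator after a successful reconnect so the delay starts again from the base value.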
https://github.com/mozilla/autophone/commit/44f7029f481dc9f38eb8aa70c6019a35b902ad5b

Filed Bug 1221723
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard