Closed Bug 628720 Opened 14 years ago Closed 14 years ago

twistd.log exceptions from amqp-related (pulse) failures

Categories

(Release Engineering :: General, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: catlee)

References

Details

(Whiteboard: [pulse])

Attachments

(2 files)

We're getting a lot of mail with tracebacks in amqplib or pulse code. I *suspect* that these are harmless and the messages are safely re-transmitted later. If that's the case, it would be great to handle the exceptions smoothly so we don't get emails about them. Exception in /builds/buildbot/build_master3/master/twistd.log: 2011-01-25 08:12:04-0800 [-] Unhandled Error Traceback (most recent call last): File "/tools/python-2.6.5/lib/python2.6/threading.py", line 484, in run self.__target(*self.__args, **self.__kwargs) File "/builds/buildbot/build_master3/lib/python2.6/site-packages/twisted/python/threadpool.py", line 207, in _worker result = context.call(ctx, function, *args, **kwargs) File "/builds/buildbot/build_master3/lib/python2.6/site-packages/twisted/python/context.py", line 59, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "/builds/buildbot/build_master3/lib/python2.6/site-packages/twisted/python/context.py", line 37, in callWithContext return func(*args,**kw) --- <exception caught here> --- File "/builds/buildbot/build_master3/lib/python2.6/site-packages/buildbotcustom/status/pulse.py", line 95, in threadedPush self.publisher.publish(msg) File "/builds/buildbot/build_master3/lib/python2.6/site-packages/mozillapulse/publishers.py", line 66, in publish routing_key=message.routing_key) File "/builds/buildbot/build_master3/lib/python2.6/site-packages/carrot/messaging.py", line 657, in __init__ self.declare() File "/builds/buildbot/build_master3/lib/python2.6/site-packages/carrot/messaging.py", line 669, in declare auto_delete=self.auto_delete) File "/builds/buildbot/build_master3/lib/python2.6/site-packages/carrot/backends/pyamqplib.py", line 253, in exchange_declare return self.channel.exchange_declare(exchange=exchange, File "/builds/buildbot/build_master3/lib/python2.6/site-packages/carrot/backends/pyamqplib.py", line 188, in channel self._channel_ref = weakref.ref(connection.channel()) File "/builds/buildbot/build_master3/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 230, in channel return Channel(self, channel_id) File "/builds/buildbot/build_master3/lib/python2.6/site-packages/amqplib/client_0_8/channel.py", line 82, in __init__ self._x_open() File "/builds/buildbot/build_master3/lib/python2.6/site-packages/amqplib/client_0_8/channel.py", line 471, in _x_open (20, 11), # Channel.open_ok File "/builds/buildbot/build_master3/lib/python2.6/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait self.channel_id, allowed_methods) File "/builds/buildbot/build_master3/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method self.method_reader.read_method() File "/builds/buildbot/build_master3/lib/python2.6/site-packages/amqplib/client_0_8/method_framing.py", line 215, in read_method raise m socket.error: [Errno 110] Connection timed out Exception in /builds/buildbot/try_master1/master/twistd.log: 2011-01-25 01:14:06-0800 [-] Unhandled Error Traceback (most recent call last): File "/tools/python-2.6.5/lib/python2.6/threading.py", line 484, in run self.__target(*self.__args, **self.__kwargs) File "/builds/buildbot/try_master1/lib/python2.6/site-packages/twisted/python/threadpool.py", line 207, in _worker result = context.call(ctx, function, *args, **kwargs) File "/builds/buildbot/try_master1/lib/python2.6/site-packages/twisted/python/context.py", line 59, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "/builds/buildbot/try_master1/lib/python2.6/site-packages/twisted/python/context.py", line 37, in callWithContext return func(*args,**kw) --- <exception caught here> --- File "/builds/buildbot/try_master1/buildbotcustom/status/pulse.py", line 95, in threadedPush self.publisher.publish(msg) File "build/bdist.linux-i686/egg/mozillapulse/publishers.py", line 67, in publish
Assignee: nobody → catlee
Priority: -- → P2
Whiteboard: [pulse]
Summary: twistd.log exceptions from amqp-related failures → twistd.log exceptions from amqp-related (pulse) failures
Depends on: 629654
Attached patch pimped out pulseSplinter Review
A lot has changed. It may be easier to look at the whole file here: http://hg.mozilla.org/users/catlee_mozilla.com/buildbotcustom/file/b2f71b37bba8/status/pulse.py These errors are due to the firewall shutting down our connection to pulse after some maximum amount of time (used to be 30 minutes, now 12h). To work around this, we need to be proactive about shutting our connection down. Two timers have been added: * _disconnect_timer, which fires after a connection has been active more than max_connect_time seconds * _idle_timer, which fires after a connection has been idle more than max_idle_time seconds When either of these fire, they force us to disconnect. Subsequent messages will automatically reconnect. I've also added a heartbeat message, sent every 15 minutes by default.
Attachment #509890 - Flags: review?(dustin)
Comment on attachment 509890 [details] [diff] [review] pimped out pulse From IRC, this is still hanging reconfigs, so no review: 09:41 < catlee> dustin: so ignore my pulse patch 09:41 < catlee> I'm still getting hangs sometimes on reconfig
Attachment #509890 - Flags: review?(dustin)
Comment on attachment 509890 [details] [diff] [review] pimped out pulse can't reproduce the problem I was having before in staging. it's most likely due to an older instance of the pulse plugin that was still running.
Attachment #509890 - Flags: review?(dustin)
Comment on attachment 509890 [details] [diff] [review] pimped out pulse + a connection to the broken open. r^
Attachment #509890 - Flags: review?(dustin) → review+
Some fallout resulting in this exception: 2011-03-08 11:32:30-0800 [-] Pulse <0x23dfeb6c>: Disconnecting after 600 seconds 2011-03-08 11:32:30-0800 [-] Unhandled Error Traceback (most recent call last): File "/builds/buildbot/tests_master3/lib/python2.6/site-packages/buildbotcustom/status/pulse.py", line 277, in _finished File "/builds/buildbot/tests_master3/lib/python2.6/site-packages/twisted/internet/defer.py", line 1122, in release d.callback(self) File "/builds/buildbot/tests_master3/lib/python2.6/site-packages/twisted/internet/defer.py", line 318, in callback self._startRunCallbacks(result) File "/builds/buildbot/tests_master3/lib/python2.6/site-packages/twisted/internet/defer.py", line 424, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File "/builds/buildbot/tests_master3/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks self.result = callback(self.result, *args, **kw) File "/builds/buildbot/tests_master3/lib/python2.6/site-packages/buildbotcustom/status/pulse.py", line 193, in set_timers return self File "/builds/buildbot/tests_master3/lib/python2.6/site-packages/twisted/internet/base.py", line 89, in cancel raise error.AlreadyCalled twisted.internet.error.AlreadyCalled: Tried to cancel an already-called event. will try and repro in staging and come up with a fix.
in cases where the network is dropping packets, we can get blocked elsewhere trying to send events or disconnect. if the idle timer fires during this period we won't handle it due to the thread lock in place, and it's possible that the next pending call on the thread lock is another pushEvents call which will try and cancel the (already called, but not handled) idle timer.
Attachment #517992 - Flags: review?(dustin)
Attachment #517992 - Flags: review?(dustin) → review+
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: