Closed Bug 791909 (opened 12 years ago, closed 12 years ago)

pulsebuildmonitor timed out and never reconnected

Categories

(Webtools :: Pulse, defect)

Platform: x86, macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mcote, Assigned: jgriffin)


I'm not sure if this is a bug or a feature request, but both the autophone production server (Mountain View) and staging server (Montreal) died this weekend with this exception:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/threading.py", line 522, in __bootstrap_inner
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/threading.py", line 477, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Users/mozauto/pulsebuildmonitor/pulsebuildmonitor/pulsebuildmonitor.py", line 95, in listen
    self.pulse.listen()
  File "/Library/Python/2.6/site-packages/mozillapulse/consumers.py", line 136, in listen
    self.consumer.wait()
  File "/Library/Python/2.6/site-packages/carrot/messaging.py", line 446, in wait
    it.next()
  File "/Library/Python/2.6/site-packages/carrot/backends/pyamqplib.py", line 300, in consume
    self.channel.wait()
  File "/Library/Python/2.6/site-packages/amqplib/client_0_8/abstract_channel.py", line 95, in wait
    self.channel_id, allowed_methods)
  File "/Library/Python/2.6/site-packages/amqplib/client_0_8/connection.py", line 202, in _wait_method
    self.method_reader.read_method()
  File "/Library/Python/2.6/site-packages/amqplib/client_0_8/method_framing.py", line 221, in read_method
    raise m
error: [Errno 60] Operation timed out

The staging server last found a build on Sept 14 at 21:34 EDT, and the production server on Sept 15 at 20:38 PDT. 

I'm not sure if the pulsebuildmonitor is *supposed* to continue to reconnect or not, but it would be nice if it did. :)
We do want it to reconnect.  I've updated pulsebuildmonitor on PyPI (v0.64) to include the fix from bug 788580.  Hopefully, if you update your pulsebuildmonitor, this will stop happening.
I was using the latest code from the repo when I saw this exception, though.
Oh, right.  That fix requires the caller to wrap listen() in try/except and, if desired, call listen() again after a failure.  I can fix this so that it happens automatically... I guess there is no reason we'd want to propagate the exception to the caller.
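For reference, a caller-side workaround along these lines might look like the sketch below. It is only an illustration, not code from pulsebuildmonitor: `listen_with_retry`, `retry_delay`, and `max_attempts` are hypothetical names, and `listen` stands for whatever blocking call the caller runs (for example a monitor's listen method). It assumes connection failures surface as `socket.error`, as in the traceback above.

```python
import logging
import socket
import time


def listen_with_retry(listen, retry_delay=10.0, max_attempts=None):
    """Call the blocking `listen` callable, retrying after socket errors.

    Hypothetical helper: returns the number of attempts made. It stops
    when `listen` returns normally or when `max_attempts` is reached
    (None means retry indefinitely).
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            listen()
            return attempts  # listen() returned without raising
        except socket.error as exc:
            # e.g. "[Errno 60] Operation timed out" or
            # "[Errno 54] Connection reset by peer"
            logging.warning("listen failed (%s); retrying in %ss",
                            exc, retry_delay)
            time.sleep(retry_delay)
    return attempts
```

A caller would pass its own blocking call, e.g. `listen_with_retry(monitor.listen)`, where `monitor` is whatever object the caller already has.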
Just got the same traceback, though with "[Errno 54] Connection reset by peer" this time.

Yeah, I think it makes sense to do this in pulsebuildmonitor. Even if pulse goes down for a day or two, ideally I wouldn't have to restart my listeners. I can't think of a good reason to make the user shut down and restart either the program or the listener thread, unless maybe I have configured a timeout.
What do you think about something like this?
Attachment #664944 - Flags: review?(jgriffin)
Comment on attachment 664944 [details] [diff] [review]
Relaunch listener if exception detected

Review of attachment 664944 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the fix!  Looks good with the fix below.

::: pulsebuildmonitor/pulsebuildmonitor.py
@@ +95,4 @@
>      self.make_pulse_consumer()
> +    while True:
> +      try: 
> +        self.pulse.listen()

You should pull self.make_pulse_consumer() into the try clause, so that it gets called before self.pulse.listen().

Otherwise, the amqp lib may attempt to re-use an existing dead connection, and it will not be successful in reconnecting.  Creating a new pulse consumer works around this problem.
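To illustrate the shape being suggested, here is a minimal, self-contained sketch. It is not the patch that landed: `ReconnectingListener`, `consumer_factory`, and `retry_delay` are hypothetical names, the retry delay is an assumption the review does not mention, and returning when listen() finishes normally is added only so the loop can terminate in a test. The point is that the consumer is rebuilt inside the try clause on every pass, so a dead connection is never reused.

```python
import socket
import time


class ReconnectingListener(object):
    """Hypothetical stand-in for the monitor's listener thread target."""

    def __init__(self, consumer_factory, retry_delay=5.0):
        # consumer_factory: callable returning an object with listen()
        self.consumer_factory = consumer_factory
        self.retry_delay = retry_delay
        self.pulse = None

    def make_pulse_consumer(self):
        # Always build a new consumer rather than reusing the old one.
        self.pulse = self.consumer_factory()

    def listen(self):
        while True:
            try:
                # Recreate the consumer on every attempt, inside the
                # try clause, before listening again.
                self.make_pulse_consumer()
                self.pulse.listen()
                return  # assumption: a normal return ends the loop
            except (socket.error, IOError):
                # Connection dropped or timed out; wait, then reconnect.
                time.sleep(self.retry_delay)
```

With this structure, each failure discards the old consumer and the next iteration starts from a fresh connection.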
Attachment #664944 - Flags: review?(jgriffin) → review+
Cool, fixed and pushed: http://hg.mozilla.org/automation/pulsebuildmonitor/rev/6e94fe6db44c
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED