Closed Bug 1093046 Opened 10 years ago Closed 5 years ago

pulsebuildmonitor better timestamps and logged messages

Categories

(Webtools :: Pulse, defect)

x86
All
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: andrei, Unassigned)

Details

As seen in bug 1083166 we're experiencing timeouts on a regular basis in the last couple of weeks.

Unfortunately the messaging and timestamps for messaging regarding connection lost/reestablished are a bit lacking.

This is what we see:
> 2014-11-02T22:42:46Z INFO:automation:Connecting to Mozilla Pulse as "qa-auto@mozilla.com|mozmill_release|mm-ci-staging.qa.scl3.mozilla.com"...                                                                     
> 2014-11-02T22:42:46Z INFO:automation:Connected                                                                                                                                                                     
> 2014-11-02T22:47:49Z ERROR:automation:Socket closed                                                                                                                                                                
> Traceback (most recent call last):                                                                                                                                                                                 
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/pulsebuildmonitor/pulsebuildmonitor.py", line 103, in listen                                                                        
>     self.pulse.listen(on_connect_callback=self.on_connect_callback)                                                                                                                                                
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/mozillapulse/consumers.py", line 175, in listen                                                                                     
>     self.connection.drain_events()                                                                                                                                                                                 
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/kombu/connection.py", line 275, in drain_events                                                                                     
>     return self.transport.drain_events(self.connection, **kwargs)                                                                                                                                                  
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 91, in drain_events                                                                                
>     return connection.drain_events(**kwargs)                                                                                                                                                                       
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqp/connection.py", line 302, in drain_events                                                                                      
>     chanmap, None, timeout=timeout,                                                                                                                                                                                
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqp/connection.py", line 365, in _wait_multiple                                                                                    
>     channel, method_sig, args, content = read_timeout(timeout)                                                                                                                                                     
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqp/connection.py", line 329, in read_timeout                                                                                      
>     return self.method_reader.read_method()                                                                                                                                                                        
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method                                                                                   
>     raise m                                                                                                                                                                                                        
> IOError: Socket closed                                                                                                                                                                                             
> 2014-11-02T22:47:49Z INFO:automation:Connecting to Mozilla Pulse as "qa-auto@mozilla.com|mozmill_release|mm-ci-staging.qa.scl3.mozilla.com"...                                                                     
> 2014-11-02T22:47:49Z INFO:automation:Connected                                                                                                                                                                     
> 2014-11-02T23:24:43Z ERROR:automation:Socket closed                                                                                                                                                                
> Traceback (most recent call last):                                                                                                                                                                                 
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/pulsebuildmonitor/pulsebuildmonitor.py", line 103, in listen                                                                        
>     self.pulse.listen(on_connect_callback=self.on_connect_callback)                                                                                                                                                
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/mozillapulse/consumers.py", line 175, in listen                                                                                     
>     self.connection.drain_events()                                                                                                                                                                                 
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/kombu/connection.py", line 275, in drain_events                                                                                     
>     return self.transport.drain_events(self.connection, **kwargs)                                                                                                                                                  
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 91, in drain_events                                                                                
>     return connection.drain_events(**kwargs)                                                                                                                                                                       
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqp/connection.py", line 302, in drain_events                                                                                      
>     chanmap, None, timeout=timeout,                                                                                                                                                                                
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqp/connection.py", line 365, in _wait_multiple                                                                                    
>     channel, method_sig, args, content = read_timeout(timeout)                                                                                                                                                     
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqp/connection.py", line 329, in read_timeout                                                                                      
>     return self.method_reader.read_method()                                                                                                                                                                        
>   File "/home/mozauto/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method                                                                                   
>     raise m                                                                                                                                                                                                        
> IOError: Socket closed                                                                                                                                                                                             
> 2014-11-02T23:24:43Z INFO:automation:Connecting to Mozilla Pulse as "qa-auto@mozilla.com|mozmill_release|mm-ci-staging.qa.scl3.mozilla.com"...         

The "Connecting to" message has the same timestamp as the "Socket closed" message, which infers that we only know when we try to reconnect, not when the connection dropped

Due to the 5 min sleep time when the connection is failing, am I correctly inferring that if we see a "Connecting to" + "Connected" message followed by a "Socket closed" message 5 minutes later the connection was actually never established?

With pulsebuildmonitor 0.90 we were seeing timeouts regularly every 5 minutes.

=====

What we would like:
- better timestamps. Is it possible to see more clearly when the connection dropped and when it was re-established again?
- better wording since "Connected" from what I see now might not actually mean that we have connected succesfully
> The "Connecting to" message has the same timestamp as the "Socket closed"
> message, which infers that we only know when we try to reconnect, not when
> the connection dropped

It has the same timestamp as the *previous* "socket closed" message.  That just sounds like it's reconnecting immediately after the connection is dropped.  The "Connected" message is also in the same second, so it again sounds like it has no problems reconnecting.

> Due to the 5 min sleep time when the connection is failing, am I correctly
> inferring that if we see a "Connecting to" + "Connected" message followed by
> a "Socket closed" message 5 minutes later the connection was actually never
> established?

I don't think this is a sleep; this is pulsebuildmonitor connected and waiting for messages.  Nothing is sleeping afaik.  It sounds like a normal reconnect cycle:

* connection established
* some time passes with no messages
* zlb times out and closes connection
* pulsebuildmonitor detects the closed connection and immediately reconnects
* repeat

> With pulsebuildmonitor 0.90 we were seeing timeouts regularly every 5
> minutes.

Undoubtedly due to the zlb (Zeus load balancer); see bug 1091730.  Now if you believe you haven't seen this before a few weeks ago, *that's* strange, since the zlb has been in front of pulse for a long time, unless you were somehow connecting via the VPN straight to one of the Pulse machines (which seems unlikely).

> What we would like:
> - better timestamps. Is it possible to see more clearly when the connection
> dropped and when it was re-established again?
> - better wording since "Connected" from what I see now might not actually
> mean that we have connected succesfully

To the best of my knowledge, the logs are currently accurately reflecting both those points.

The bigger question is whether you think you are missing messages.  Without a durable queue, the only time you run a risk of missing a message is if it arrives during the very short interval between being disconnected and establishing a new connection (and queue).  It's possible, but it should be rare.

To me, it *sounds* like everything is working normally, that is, you are being disconnected due to the zlb's timeouts, but you are immediately reconnecting.  Using a durable queue would avoid the possibility of missed messages; disabling the zlb's timeouts by itself will avoid frequent disconnects, but it's not a full solution since I'm sure you will still periodically be disconnected just due to random network glitches.

If there are other symptoms (like missing messages when it appears that you are connected, i.e. between the "Connected" and "Socket closed" timestamps), let me know.  I would be happy to continue debugging with you.

I believe pulsebuildmonitor has been retired.. please reopen if not!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.