Closed Bug 879257 (Opened 11 years ago, Closed 11 years ago)

Pulse listener is failing with connection timeouts

Categories

(Testing :: General, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: davehunt, Unassigned)

Details

(Whiteboard: [qa-automation-blocked])

The Mozmill pulse listeners on mm-ci-master.qa.scl3.mozilla.com are failing with connection timeouts. This was first noticed on June 3rd, and the pulse listeners were restarted. It was then noticed again today (June 4th).

The following are seen multiple times in the terminal:

INFO:automation:Connecting to Mozilla Pulse as "qa-auto@mozilla.com|mozmill_release|mm-ci-master1.qa.scl3.mozilla.com"...
ERROR:automation:[Errno 110] Connection timed out
Traceback (most recent call last):
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/pulsebuildmonitor/pulsebuildmonitor.py", line 100, in listen
    self.pulse.listen()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/mozillapulse/consumers.py", line 136, in listen
    self.consumer.wait()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/carrot/messaging.py", line 446, in wait
    it.next()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/carrot/backends/pyamqplib.py", line 300, in consume
    self.channel.wait()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/amqplib/client_0_8/abstract_channel.py", line 95, in wait
    self.channel_id, allowed_methods)
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/amqplib/client_0_8/connection.py", line 202, in _wait_method
    self.method_reader.read_method()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/amqplib/client_0_8/method_framing.py", line 221, in read_method
    raise m

INFO:automation:Connecting to Mozilla Pulse as "qa-auto@mozilla.com|mozmill_release|mm-ci-master1.qa.scl3.mozilla.com"...
Exception socket.error: (107, 'Transport endpoint is not connected') in <bound method TCPTransport.__del__ of <amqplib.client_0_8.transport.TCPTransport object at 0xb65361ac>> ignored

Also, when shutting down the pulse listener, we see the following several times:

Exception socket.error: (107, 'Transport endpoint is not connected') in <bound method TCPTransport.__del__ of <amqplib.client_0_8.transport.TCPTransport object at 0xb653642c>> ignored
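
One consumer-side mitigation for drops like this is to wrap the blocking listen() call in a reconnect loop with backoff, so a single timeout does not take the listener down until someone restarts it by hand. Below is a minimal sketch of that idea, assuming a listen_once callable as a stand-in for the pulsebuildmonitor/mozillapulse listen call; it is illustrative only, not the code currently running on mm-ci-master.

    import logging
    import socket
    import time

    log = logging.getLogger("automation")

    def listen_forever(listen_once, max_backoff=300):
        # Keep a blocking listen call alive across connection timeouts,
        # reconnecting with exponential backoff. listen_once is a
        # hypothetical stand-in for the real pulse listen() call.
        backoff = 5
        while True:
            try:
                listen_once()      # blocks until the connection drops
                backoff = 5        # reset after a clean return
            except (socket.error, IOError) as exc:
                log.error("Pulse connection lost: %s; retrying in %ss",
                          exc, backoff)
                time.sleep(backoff)
                backoff = min(backoff * 2, max_backoff)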
Severity: normal → blocker
Whiteboard: [qa-automation-blocked]
Anything on the releng side for us to look into?
Not releng/relops.
Assignee: server-ops-releng → server-ops-devservices
Severity: blocker → major
Component: Server Operations: RelEng → Server Operations: Developer Services
QA Contact: arich → shyam
Known, pulse has been broken for a while. See the bug this one is dependent on.
Assignee: server-ops-devservices → shyam
Status: NEW → ASSIGNED
Depends on: 866467
This is pretty much blocking all QA-related automated testing. I'm not sure what's going on here, but the same happens for our staging instance in MV, so it is not a network issue in SCL3.

As Dave pointed out, we noticed it yesterday, but it started to fail on June 1st. Since then no further Pulse messages have made it through. The exact time might be a little earlier than 2013-06-01T14:12:18.000Z.

Shyam, the referenced bug is secured so I cannot look into it. Is there any ETA when this will work again?
Severity: major → blocker
(In reply to Henrik Skupin (:whimboo) from comment #4)

> Shyam, the referenced bug is secured so I cannot look into it. Is there any
> ETA when this will work again?

Sorry, I just opened that up. It's a problem in the code, it'll work when that's fixed.
I don't think these are related.  The server side of pulse is mostly made of two things: (a) a RabbitMQ service and (b) a Django web app.  Almost everything uses (a).  (b) is mostly static content (docs) with a couple dynamic pages.  It has been down since the end of April and is not particularly important.

(a) appears to be working, based on the dashboard, though one component of the whole system--the pulse translator--appears to be stuck.  I think I need jgriffin to look at this.

We will also fully document pulse so that future problems can be handled even if we aren't around.
Mark, does only Jonathan have access to the VM the pulse-translator runs on? When will he be back? I tried to find him yesterday on IRC, but he was always afk. Is it just a restart of the daemon? I would assume someone else could do that too.
Yeah I think he is the only one.  I don't know much about the pulsetranslator but I believe it is just one process.  Shyam, can you either give me access to pulsetranslator.ateam.phx1.mozilla.com (better for the long run) or run "/etc/init.d/pulsetranslator restart" on it?
Flags: needinfo?(shyam)
I did some work on pulsetranslator a couple of weeks back, so if something is broken please let me know and I can help get it fixed.
Would the translator make sense as a pulse shim, along with the hg shim, ftp shim, etc.?  I realize many of those shims aren't in use, but at least they have a nice slot for being managed, upgraded, monitored, etc.
No longer depends on: 866467
Dustin, I'm not entirely sure I understand you, but as far as I know (and I'm only really learning about the whole system now), the translator is a service that listens for build messages, normalizes them into a consistent format, and then republishes them to *.normalized queues. So it pretty much has to be running as a separate service somewhere.
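
To make that concrete, here is a minimal sketch of the normalize-and-republish pattern described above. The field names, the routing-key scheme, and the publish callback are assumptions for illustration, not taken from the actual pulsetranslator code.

    import json

    def normalize(raw_msg):
        # Map a raw build message onto a consistent schema.
        # The field names here are hypothetical, not the real pulse payload.
        payload = raw_msg.get("payload", {})
        return {
            "buildid": payload.get("buildid"),
            "tree": payload.get("tree"),
            "platform": payload.get("platform"),
            "buildurl": payload.get("packageUrl"),
        }

    def translate(raw_msg, publish):
        # Republish a normalized copy under a *.normalized routing key.
        # `publish` stands in for whatever AMQP publisher is in use.
        normalized = normalize(raw_msg)
        routing_key = "build.%s.%s.normalized" % (
            normalized["tree"], normalized["platform"])
        publish(routing_key, json.dumps(normalized))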
Oh, ok.  The shims are all crontasks that poll things.  They run on the app host.

    # activate some shims - these should each only be on one host!
    include webapp::pulse::shim::heartbeat
    webapp::pulse::shim::hg {
        'mozilla-central': ;
        'users/jgriffin_mozilla.com/synkme': ;
    }   
    # these will likely not be used, and are not implemented in puppet
    #include webapp::pulse::shim::bugzilla
    #include webapp::pulse::shim::ftp

Anyway, we could certainly run the translator on the app host as well, even if it is a service rather than a crontask. That would let new versions be deployed using the "normal" webapp push process, just like for the Django app. It's up to you, really; just an idea.
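
For comparison with running the translator as a service, a cron-driven shim of the kind listed above usually boils down to "remember what was seen last, poll for anything newer, publish it". A rough sketch, with fetch_pushes and publish as hypothetical stand-ins for the real pushlog fetch and pulse publisher:

    import json
    import os

    STATE_FILE = "/tmp/hg_shim_state.json"  # hypothetical location

    def poll_and_publish(fetch_pushes, publish):
        # Skeleton of a cron-driven shim: remember the last push seen,
        # fetch anything newer, and publish one message per new push.
        last_id = 0
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                last_id = json.load(f).get("last_id", 0)

        for push_id, push in sorted(fetch_pushes(since=last_id).items()):
            publish("hg.mozilla-central.push", json.dumps(push))
            last_id = max(last_id, int(push_id))

        with open(STATE_FILE, "w") as f:
            json.dump({"last_id": last_id}, f)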
Component: Server Operations: Developer Services → General
Flags: needinfo?(shyam)
Product: mozilla.org → Testing
QA Contact: shyam
Version: other → unspecified
Whee. This isn't dev services. I'll kick the service for now, Mark, and see if I can get you access.
Assignee: shyam → nobody
And done. 

[shyam@pulsetranslator.ateam.phx1 ~]$ sudo /etc/init.d/pulsetranslator restart
Stopping pulsetranslator:                                  [  OK  ]
Starting pulsetranslator: daemon --user webtools /home/webtools/apps/pulsetranslator/bin/runtranslator --daemon --pidfile /home/webtools/apps/pulsetranslator/translator.pid --logfile /home/webtools/apps/pulsetranslator/stdout.log --durable --logdir /home/webtools/apps/pulsetranslator/logs
                                                           [  OK  ]
Dustin: oh yeah, we would definitely like to transition this to a completely IT-managed system. We have plans to fix and tighten up a variety of pulse-related stuff, all of which ended up in our lap a while ago.

Shyam: thanks, I can see the translator queue being drained.  I don't know for sure if this will fix this particular bug, but it will probably fix bug 879204 at least.
The pulsetranslator being down would absolutely not cause connection timeouts for a pulse consumer, since consumers neither connect to nor have any knowledge of the pulsetranslator.

Connection timeouts mean either pulse is having problems itself, or there are some transient network problems between the consumer and pulse.
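
A quick way to tell those two cases apart from the consumer host is to try a plain TCP connection to the broker, independent of the AMQP library. A minimal check, assuming the broker is reachable at pulse.mozilla.org on the default AMQP port 5672:

    import socket

    def check_broker(host="pulse.mozilla.org", port=5672, timeout=10):
        # Crude reachability check: can we open a TCP connection at all?
        # Host and port are assumptions (default AMQP port).
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except socket.error as exc:
            print("Cannot reach %s:%d: %s" % (host, port, exc))
            return False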
Just to let you know, our mozmill-ci consumer is working again and has already received a couple of notifications. Thanks.
Severity: blocker → critical
What's left to do on this bug? Jonathan or Shyam?
There doesn't seem to be anything actionable on the pulse side.  WORKSFORME?
Marking as fixed by the restart of the daemon in comment 14.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED