Closed
Bug 879257
Opened 11 years ago
Closed 11 years ago
Pulse listener is failing with connection timeouts
Categories
(Testing :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: davehunt, Unassigned)
Details
(Whiteboard: [qa-automation-blocked])
The Mozmill pulse listeners on mm-ci-master.qa.scl3.mozilla.com are failing with connection timeouts. This was first noticed on June 3rd, and the pulse listeners were restarted. It was then noticed again today (June 4th).

The following are seen multiple times in the terminal:

INFO:automation:Connecting to Mozilla Pulse as "qa-auto@mozilla.com|mozmill_release|mm-ci-master1.qa.scl3.mozilla.com"...
ERROR:automation:[Errno 110] Connection timed out
Traceback (most recent call last):
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/pulsebuildmonitor/pulsebuildmonitor.py", line 100, in listen
    self.pulse.listen()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/mozillapulse/consumers.py", line 136, in listen
    self.consumer.wait()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/carrot/messaging.py", line 446, in wait
    it.next()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/carrot/backends/pyamqplib.py", line 300, in consume
    self.channel.wait()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/amqplib/client_0_8/abstract_channel.py", line 95, in wait
    self.channel_id, allowed_methods)
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/amqplib/client_0_8/connection.py", line 202, in _wait_method
    self.method_reader.read_method()
  File "/home/mozauto/mozmill-ci/jenkins-env/lib/python2.7/site-packages/amqplib/client_0_8/method_framing.py", line 221, in read_method
    raise m
INFO:automation:Connecting to Mozilla Pulse as "qa-auto@mozilla.com|mozmill_release|mm-ci-master1.qa.scl3.mozilla.com"...
Exception socket.error: (107, 'Transport endpoint is not connected') in <bound method TCPTransport.__del__ of <amqplib.client_0_8.transport.TCPTranspor at 0xb65361ac>> ignored

Also, when shutting down the pulse listener, we see the following several times:

Exception socket.error: (107, 'Transport endpoint is not connected') in <bound method TCPTransport.__del__ of <amqplib.client_0_8.transport.TCPTranspor at 0xb653642c>> ignored
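On Linux, errno 110 is ETIMEDOUT and errno 107 is ENOTCONN, so both messages come from the TCP layer rather than from AMQP itself. One way to help separate a network problem from a broker problem is a plain TCP probe run from the affected host. The following is only a minimal sketch using the Python standard library: the name `probe_tcp` is hypothetical, and the broker host and port are not given in this report, so they have to be supplied by whoever runs it.

```python
import errno
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try a single TCP connect and return a short status string.

    Hypothetical helper for illustration; it is not part of
    pulsebuildmonitor, mozillapulse, carrot, or amqplib.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        # The connect did not complete within `timeout` seconds.
        return "timeout"
    except OSError as exc:
        if exc.errno == errno.ETIMEDOUT:
            # The kernel gave up on the connect (errno 110 on Linux).
            return "timeout"
        if exc.errno == errno.ECONNREFUSED:
            # The host answered, but nothing is listening on that port.
            return "refused"
        return "error: %s" % exc
    sock.close()
    return "ok"
```

A "timeout" result from both SCL3 and MV would point away from a single-site network issue, though a successful connect does not prove that the broker is delivering messages.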
Reporter
Updated•11 years ago
Severity: normal → blocker
Whiteboard: [qa-automation-blocked]
Comment 1•11 years ago
Anything on the releng side for us to look into?
Comment 2•11 years ago
Not releng/relops.
Assignee: server-ops-releng → server-ops-devservices
Severity: blocker → major
Component: Server Operations: RelEng → Server Operations: Developer Services
QA Contact: arich → shyam
Comment 3•11 years ago
Known, pulse has been broken for a while. See the bug this one is dependent on.
Comment 4•11 years ago
This is kinda blocking any QA-related automated testing. Not sure what's going on here, but the same happens for our staging instance in MV, so it is not a network issue in SCL3. As Dave pointed out, we noticed it yesterday, but it started to fail on June 1st. Since then no other Pulse message has made it through. The exact time might be a little bit earlier than 2013-06-01T14:12:18.000Z. Shyam, the referenced bug is secured so I cannot look into it. Is there any ETA when this will work again?
Severity: major → blocker
Comment 5•11 years ago
(In reply to Henrik Skupin (:whimboo) from comment #4)
> Shyam, the referenced bug is secured so I cannot look into it. Is there any
> ETA when this will work again?

Sorry, I just opened that up. It's a problem in the code, it'll work when that's fixed.
Comment 6•11 years ago
I don't think these are related. The server side of pulse is mostly made of two things: (a) a RabbitMQ service and (b) a Django web app. Almost everything uses (a). (b) is mostly static content (docs) with a couple dynamic pages. It has been down since the end of April and is not particularly important. (a) appears to be working, based on the dashboard, though one component of the whole system--the pulse translator--appears to be stuck. I think I need jgriffin to look at this. We will also fully document pulse for future problems if we aren't around.
Comment 7•11 years ago
Mark, does only Jonathan have access to the VM pulse-translator runs on? When will he be back? I tried to find him yesterday on IRC but he was always afk. Is it just a restart of the daemon? I would assume someone else could do that too.
Comment 8•11 years ago
Yeah I think he is the only one. I don't know much about the pulsetranslator but I believe it is just one process. Shyam, can you either give me access to pulsetranslator.ateam.phx1.mozilla.com (better for the long run) or run "/etc/init.d/pulsetranslator restart" on it?
Flags: needinfo?(shyam)
Comment 9•11 years ago
I did some work on pulsetranslator a couple of weeks back. So if something is broken please let me know and I can help to get it fixed.
Comment 10•11 years ago
Would the translator make sense as a pulse shim, along with the hg shim, ftp shim, etc.? I realize many of those shims aren't in use, but at least they have a nice slot for being managed, upgraded, monitored, etc.
Comment 11•11 years ago
Dustin, not entirely sure if I understand you, but as far as I know (and I'm only really learning about the whole system now), the translator is a service that listens for build messages, normalizes them into a consistent format, then republishes them to *.normalized queues. So it pretty much has to be running as a separate service somewhere.
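To make the "normalize, then republish" step concrete, here is a rough sketch of what such a transformation could look like. It is not the pulsetranslator code: every field name below (`platform`, `branch`, `buildid`, `result`) is hypothetical, and appending `.normalized` to the routing key is an assumption based only on the "*.normalized" naming mentioned above.

```python
def normalize_build_message(routing_key, payload):
    """Map one raw build message to (new_routing_key, normalized_payload).

    Illustration only: the input and output field names are hypothetical
    and do not describe the real pulsetranslator message format.
    """
    result = payload.get("result")
    if result is None:
        status = "unknown"
    elif result == 0:
        status = "success"
    else:
        status = "failure"

    normalized = {
        # Lower-case the platform so consumers can compare it directly.
        "platform": str(payload.get("platform", "unknown")).lower(),
        "tree": str(payload.get("branch", "unknown")),
        "buildid": str(payload.get("buildid", "")),
        "status": status,
    }
    # Assumption: the normalized message is republished under the
    # original routing key with a ".normalized" suffix.
    return routing_key + ".normalized", normalized
```

Because a step like this sits between the raw build messages and the consumers, a stuck translator would mean consumers of the normalized messages receive nothing, even while the broker itself stays up.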
Comment 12•11 years ago
Oh, ok. The shims are all crontasks that poll things. They run on the app host.

    # activate some shims - these should each only be on one host!
    include webapp::pulse::shim::heartbeat
    webapp::pulse::shim::hg {
        'mozilla-central': ;
        'users/jgriffin_mozilla.com/synkme': ;
    }
    # these will likely not be used, and are not implemented in puppet
    #include webapp::pulse::shim::bugzilla
    #include webapp::pulse::shim::ftp

Anyway, we could certainly run the translator on the app host, as well, even if it is a service rather than a crontask. That would let new versions be deployed using the "normal" webapp push process, just like for the Django app. It's up to you, really - just an idea.
Updated•11 years ago
Component: Server Operations: Developer Services → General
Flags: needinfo?(shyam)
Product: mozilla.org → Testing
QA Contact: shyam
Version: other → unspecified
Comment 13•11 years ago
Whee. This isn't dev services. I'll kick the service for now Mark, and see if I can get you access.
Assignee: shyam → nobody
Comment 14•11 years ago
And done.

[shyam@pulsetranslator.ateam.phx1 ~]$ sudo /etc/init.d/pulsetranslator restart
Stopping pulsetranslator: [ OK ]
Starting pulsetranslator: daemon --user webtools /home/webtools/apps/pulsetranslator/bin/runtranslator --daemon --pidfile /home/webtools/apps/pulsetranslator/translator.pid --logfile /home/webtools/apps/pulsetranslator/stdout.log --durable --logdir /home/webtools/apps/pulsetranslator/logs [ OK ]
Comment 15•11 years ago
Dustin: oh yeah, definitely we would like to transition this to a completely IT-managed system. We have plans to fix and tighten up a variety of pulse-related stuff, all of which ended up in our lap a while ago. Shyam: thanks, I can see the translator queue being drained. I don't know for sure if this will fix this particular bug, but it will probably fix bug 879204 at least.
Comment 16•11 years ago
The pulsetranslator being down would absolutely not cause connection timeouts for a pulse consumer, since consumers neither connect to nor have any knowledge of the pulsetranslator. Connection timeouts mean either pulse is having problems itself, or there are some transient network problems between the consumer and pulse.
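For the transient-network case, a consumer would typically wrap its blocking listen call in a bounded retry with exponential backoff. The log in the description suggests the listener already reconnects in some form, so this is not a description of what pulsebuildmonitor does. It is a generic sketch: `listen_with_retry` and its parameters are hypothetical, and `listen` stands for any callable that blocks while consuming and raises a socket-level error on failure.

```python
import time


def listen_with_retry(listen, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call listen() until it returns, retrying socket-level errors.

    Hypothetical helper for illustration. Returns the number of attempts
    used, and re-raises the last error once max_attempts is exhausted.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            listen()
            return attempt
        except OSError:
            # socket.error is an alias of OSError in Python 3.
            if attempt == max_attempts:
                raise
            sleep(delay)
            # Double the wait each time, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

A retry loop like this only masks short outages. If every attempt times out for days, as reported here, the cause has to be found on the pulse or network side.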
Comment 17•11 years ago
Just to let you know our mozmill-ci consumer is working again and already got a couple of notifications. Thanks.
Updated•11 years ago
Severity: blocker → critical
Comment 18•11 years ago
What's left to do on this bug? Jonathan or Shyam?
Comment 19•11 years ago
There doesn't seem to be anything actionable on the pulse side. WORKSFORME?
Comment 20•11 years ago
Marking as fixed by the restart of the daemon in comment 14.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED