Closed Bug 968169 Opened 11 years ago Closed 11 years ago

Pulse times out when trying to connect

Categories

(Mozilla QA Graveyard :: Infrastructure, defect)

defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: AndreeaMatei, Assigned: whimboo)

References

Details

(Keywords: regression)

I see no jobs got triggered today; the last of them ran 20h ago, but we have available builds. I checked master and it showed an "error: [Errno 110] Connection timed out" error, so I tried to reconnect, but it stays at: "Connecting to Mozilla Pulse as "qa-auto@mozilla.com|mozmill_daily.." Might this be related to the network issues we had?
Priority: -- → P1
You should include the failures you are facing. Otherwise how should we work on this? So here are the lines that can be seen:

Exception socket.error: (107, 'Transport endpoint is not connected') in
<bound method TCPTransport.__del__ of
<amqplib.client_0_8.transport.TCPTransport object at 0xb5b3450c>> ignore

Jonathan, is that a Pulse or PulseMonitor issue in trying to reconnect after a network failure?
Flags: needinfo?(jgriffin)
Priority: P1 → --
Not a P1 given that the restart fixed it, right?
(In reply to Henrik Skupin (:whimboo) from comment #1)
> Exception socket.error: (107, 'Transport endpoint is not connected') in
> <bound method TCPTransport.__del__ of
> <amqplib.client_0_8.transport.TCPTransport object at 0xb5b3450c>> ignore

This actually only showed up when we hit Ctrl+C to cancel the current job to restart the Pulse listener. Not sure that this message is related to the actual problem.
That message was from when I stopped the command. Here is what was before:

Traceback (most recent call last):
  File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/pulsebuil
    self.pulse.listen()
  File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/mozillapu
    self.consumer.wait()
  File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/carrot/me
    it.next()
  File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/carrot/ba
    self.channel.wait()
  File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqplib/c
    self.channel_id, allowed_methods)
  File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqplib/c
    self.method_reader.read_method()
  File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqplib/c
    raise m
error: [Errno 110] Connection timed out
If you only see this error when you CTRL+C, it probably means we're not shutting down cleanly somehow. It's not a big concern if you only see it in that context.
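For illustration, a clean shutdown on Ctrl+C could look roughly like the sketch below. The `listen` and `close` callables are hypothetical stand-ins for whatever the Pulse listener and its connection expose; this is not the actual pulsebuildmonitor code.

```python
def run_listener(listen, close):
    """Run a blocking listener and always close the connection afterwards.

    Both arguments are hypothetical callables: listen() blocks while
    consuming messages, close() releases the underlying connection.
    Returns True if the listener was stopped with Ctrl+C.
    """
    interrupted = False
    try:
        listen()
    except KeyboardInterrupt:
        # Ctrl+C is the expected way to stop the listener, so swallow it.
        interrupted = True
    finally:
        # Close explicitly instead of leaving cleanup to __del__.
        close()
    return interrupted
```

Closing in a `finally` block means the connection is released in one known place, rather than at whatever point the interpreter happens to garbage-collect the transport object.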
Flags: needinfo?(jgriffin)
Jonathan, for the real problem please see comment 5. That happened each time we tried to reconnect.
Severity: normal → major
(In reply to Henrik Skupin (:whimboo) from comment #1)
> You should include the failures you are facing. Otherwise how should we work
> on this? So here the lines which can be seen:
>
> Exception socket.error: (107, 'Transport endpoint is not connected') in
> <bound method TCPTransport.__del__ of
> <amqplib.client_0_8.transport.TCPTransport object at 0xb5b3450c>> ignore
>
> Jonathan, is that a Pulse or PulseMonitor issue in trying to reconnect after
> a network failure?

If this is the error you mean, it means that the socket was closed before we attempted to clean it up; I believe this error is harmless. If you mean the "error: [Errno 110] Connection timed out" error, that looks like some kind of networking problem that we didn't auto-recover from.
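As a rough sketch of what auto-recovery from such a timeout could look like, the snippet below retries a blocking listen call with exponential backoff. The `listen` callable, the function name, and the retry parameters are all assumptions made for illustration; none of them come from MozillaPulse or pulsebuildmonitor.

```python
import errno
import socket
import time


def listen_with_reconnect(listen, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call listen() and retry with exponential backoff on socket errors.

    listen is a hypothetical callable that (re)connects, blocks while
    consuming messages, and raises socket.error when the connection drops.
    """
    attempt = 0
    while True:
        try:
            return listen()
        except socket.error:
            attempt += 1
            if attempt >= max_attempts:
                # Give up and let the caller see the last error.
                raise
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = []

    def flaky_listen():
        # Simulate two timeouts (Errno 110 on Linux) before a clean run.
        calls.append(1)
        if len(calls) < 3:
            raise socket.error(errno.ETIMEDOUT, "Connection timed out")
        return "connected"

    print(listen_with_reconnect(flaky_listen, base_delay=0.01))
```

The `sleep` parameter is injectable only so the backoff can be exercised without real waiting; a retry loop like this would not help if the remote service itself is down.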
Yes, it's about the Errno 110 problem. We restarted our listeners yesterday but this is still not working! So something lower in the stack is having massive problems reconnecting. We are still not getting any Pulse messages for builds. Can anyone from the A-team please have a look at whether the pulsetranslator is working as expected? I assume the problem lies there.
Assignee: nobody → dkl
Severity: major → blocker
Component: Infrastructure → Pulse
Product: Mozilla QA → Webtools
Whiteboard: [automation-blocked]
Version: unspecified → other
Someone should try to check the status of pulsetranslator:

host: pulsetranslator.ateam.phx1.mozilla.com
command: sudo /etc/init.d/pulsetranslator status

If it doesn't work we have to get it restarted:

> sudo /etc/init.d/pulsetranslator stop
> sudo /etc/init.d/pulsetranslator start
Ludo, I cannot find anyone who could have a look at this problem. Could you check that please? It looks similar to bug 893851.
Depends on: 893851
Flags: needinfo?(ludovic)
Thankfully I got help from Ashish who restarted pulsetranslator for me. Now everything should work as expected. Thanks a lot!
Assignee: dkl → nobody
Status: NEW → RESOLVED
Closed: 11 years ago
Component: Pulse → Infrastructure
Flags: needinfo?(ludovic)
Product: Webtools → Mozilla QA
Resolution: --- → FIXED
Whiteboard: [automation-blocked]
Version: other → unspecified
Assignee: nobody → ashish
I'm not sure if it's still related to pulsetranslator, but we're still not triggering any jobs, although builds are available. I can still see the same error in the terminal, followed by the connection, but without timestamps I can't tell whether it failed during the night and tried to reconnect on its own. Were there any more network issues?
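One way to get timestamps next to such errors is Python's standard `logging` module. The sketch below is only an assumption about how the listener output could be wrapped; the logger name and the helper function are hypothetical.

```python
import logging


def make_timestamped_logger(name="pulse_listener", stream=None):
    """Return a logger that prefixes every message with a timestamp.

    The name is hypothetical; stream defaults to sys.stderr when None.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    # %(asctime)s renders like "2014-02-05 09:15:02,123" by default.
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    # Avoid duplicate lines if the root logger also has handlers.
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_timestamped_logger()
    log.error("[Errno 110] Connection timed out")
    log.info("Connecting to Mozilla Pulse")
```

With output like this, it becomes possible to see when a timeout happened and how long afterwards the reconnect attempt started.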
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Jonathan, this is a real blocker for us. After yesterday's fix (the restart of pulsetranslator) we got a couple of build notifications for l10n builds, but then everything stopped again. Can you please have a look at this today? It's a hard blocker for us because we cannot run any tests at all! Thanks!
Flags: needinfo?(jgriffin)
Assignee: ashish → nobody
Jonathan fixed the issue for now. I'm not closing this bug yet; first I want to know what the problems were and that it has been fixed for real.
Assignee: nobody → jgriffin
With the fix for pulsetranslator pushed live yesterday it seems to work again. As Jonathan noticed, we had an old version of MozillaPulse running on our side, which might also have caused problems. So I upgraded it from 0.5 to 0.80, which meant I had to restart Jenkins too. Andreea, please check that all is working fine tomorrow.
Assignee: jgriffin → hskupin
Flags: needinfo?(jgriffin) → needinfo?(andreea.matei)
It's working, jobs for aurora and esr24 were triggered, yey!
Status: REOPENED → RESOLVED
Closed: 11 years ago
Flags: needinfo?(andreea.matei)
Resolution: --- → FIXED
Product: Mozilla QA → Mozilla QA Graveyard