Bug 968169
Opened 11 years ago
Closed 11 years ago
Pulse times out when trying to connect
Categories
(Mozilla QA Graveyard :: Infrastructure, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: AndreeaMatei, Assigned: whimboo)
References
Details
(Keywords: regression)
No jobs were triggered today; the last ones ran 20h ago, even though builds are available. I checked the master and it showed an "error: [Errno 110] Connection timed out" error, so I tried to reconnect, but it stays at:
"Connecting to Mozilla Pulse as "qa-auto@mozilla.com|mozmill_daily.."
Might be related to the network issues we had?
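The symptom here (builds available, but no job triggered for 20h) is a silent failure, so a staleness check on the listener could surface it earlier. A minimal sketch, assuming the listener records the time of the last Pulse message it handled; the function name and the 6-hour threshold are hypothetical and not part of mozmill-ci:

```python
from datetime import datetime, timedelta


def listener_is_stale(last_message_at, now, max_silence=timedelta(hours=6)):
    """Return True when no message has arrived within `max_silence`.

    `last_message_at` and `now` are datetime objects; the 6-hour default
    is an assumed threshold, not a value taken from this bug.
    """
    return now - last_message_at > max_silence
```

With the 20h gap reported above, such a check would have flagged the listener long before anyone looked at the terminal.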
Reporter
Updated•11 years ago
Priority: -- → P1
Assignee
Comment 1•11 years ago
You should include the failures you are facing; otherwise, how should we work on this? Here are the lines that can be seen:
Exception socket.error: (107, 'Transport endpoint is not connected') in <bound method TCPTransport.__del__ of <amqplib.client_0_8.transport.TCPTransport object at 0xb5b3450c>> ignore
Jonathan, is that a Pulse or PulseMonitor issue in trying to reconnect after a network failure?
Flags: needinfo?(jgriffin)
Priority: P1 → --
Assignee
Comment 2•11 years ago
Not a P1 given that the restart fixed it, right?
Comment 3•11 years ago
(In reply to Henrik Skupin (:whimboo) from comment #1)
> Exception socket.error: (107, 'Transport endpoint is not connected') in
> <bound method TCPTransport.__del__ of
> <amqplib.client_0_8.transport.TCPTransport object at 0xb5b3450c>> ignore
This actually only showed up when we hit Ctrl+C to cancel the current job to restart the Pulse listener. Not sure that this message is related to the actual problem.
Reporter
Comment 4•11 years ago
That message appeared when I stopped the command.
Here is what was shown before that:
Traceback (most recent call last):
File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/pulsebuil
self.pulse.listen()
File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/mozillapu
self.consumer.wait()
File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/carrot/me
it.next()
File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/carrot/ba
self.channel.wait()
File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqplib/c
self.channel_id, allowed_methods)
File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqplib/c
self.method_reader.read_method()
File "/data/mozmill-ci/jenkins-env/local/lib/python2.7/site-packages/amqplib/c
raise m
error: [Errno 110] Connection timed out
Comment 5•11 years ago
If you only see this error when you CTRL+C, it probably means we're not shutting down cleanly somehow. It's not a big concern if you only see it in that context.
Flags: needinfo?(jgriffin)
Assignee
Comment 6•11 years ago
Jonathan, for the real problem please see the traceback in comment 4. That happened each time we tried to reconnect.
Assignee
Updated•11 years ago
Severity: normal → major
Comment 7•11 years ago
(In reply to Henrik Skupin (:whimboo) from comment #1)
> You should include the failures you are facing. Otherwise how should we work
> on this? So here the lines which can be seen:
>
> Exception socket.error: (107, 'Transport endpoint is not connected') in
> <bound method TCPTransport.__del__ of
> <amqplib.client_0_8.transport.TCPTransport object at 0xb5b3450c>> ignore
>
> Jonathan, is that a Pulse or PulseMonitor issue in trying to reconnect after
> a network failure?
If this is the error you mean, it means that the socket was closed before we attempted to clean it up; I believe this error is harmless.
If you mean the "error: [Errno 110] Connection timed out" error, that looks like some kind of networking problem that we didn't auto-recover from.
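For the "didn't auto-recover" case, one common approach is to wrap the blocking listen call in a reconnect loop with exponential backoff. The following is only a sketch under stated assumptions: `connect` is a hypothetical callable returning an object with a blocking `listen()` method, and none of these names come from mozillapulse, carrot or amqplib.

```python
import socket
import time


def listen_with_reconnect(connect, max_failures=5, base_delay=1.0,
                          max_delay=60.0, sleep=time.sleep):
    """Run a blocking listener and reconnect after socket errors.

    `connect` (hypothetical) opens a fresh connection and returns an
    object whose `listen()` blocks until shutdown or failure.
    """
    failures = 0
    while True:
        try:
            consumer = connect()
            consumer.listen()
            return  # listen() returned normally: clean shutdown
        except socket.error:
            # Covers errors such as "[Errno 110] Connection timed out".
            failures += 1
            if failures >= max_failures:
                raise  # give up and surface the error to the operator
            # Wait 1s, 2s, 4s, ... capped at max_delay before retrying.
            sleep(min(max_delay, base_delay * 2 ** (failures - 1)))
```

The failure counter is never reset in this sketch; a long-running service would probably want to reset it once a connection has stayed healthy for some time.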
Assignee
Comment 8•11 years ago
Yes, it's about the Errno 110 problem. We restarted our listeners yesterday, but this is still not working! So something lower in the stack is having massive problems reconnecting. We are still not getting any Pulse messages for builds.
Can anyone from the A-team please check whether the pulsetranslator is working as expected? I assume the problem lies there.
Assignee: nobody → dkl
Severity: major → blocker
Component: Infrastructure → Pulse
Product: Mozilla QA → Webtools
Whiteboard: [automation-blocked]
Version: unspecified → other
Assignee
Comment 9•11 years ago
Someone should try to check the status of pulsetranslator:
host: pulsetranslator.ateam.phx1.mozilla.com
command: sudo /etc/init.d/pulsetranslator status
If it doesn't work we have to get it restarted:
> sudo /etc/init.d/pulsetranslator stop
> sudo /etc/init.d/pulsetranslator start
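The check-then-restart steps above can be sketched as a small script. It assumes the init script follows the LSB convention of exiting 0 from `status` only when the service is running, which is not confirmed for pulsetranslator; the function name and the injectable `run` parameter are hypothetical.

```python
import subprocess

INIT_SCRIPT = "/etc/init.d/pulsetranslator"  # path quoted in this comment


def restart_if_stopped(run=subprocess.call):
    """Restart pulsetranslator when `status` reports it is not running.

    Assumption: `status` exits 0 only while the service is running
    (LSB convention). Returns True if a restart was issued.
    """
    if run(["sudo", INIT_SCRIPT, "status"]) == 0:
        return False  # reported as running, nothing to do
    run(["sudo", INIT_SCRIPT, "stop"])
    run(["sudo", INIT_SCRIPT, "start"])
    return True
```

A process that is running but hung would still pass this status check, so it does not replace looking at whether messages are actually flowing.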
Assignee
Comment 10•11 years ago
Ludo, I cannot find anyone who could have a look at this problem. Could you check that please?
It looks similar to bug 893851.
Depends on: 893851
Flags: needinfo?(ludovic)
Assignee
Comment 11•11 years ago
Thankfully I got help from Ashish who restarted pulsetranslator for me. Now everything should work as expected. Thanks a lot!
Assignee: dkl → nobody
Status: NEW → RESOLVED
Closed: 11 years ago
Component: Pulse → Infrastructure
Flags: needinfo?(ludovic)
Product: Webtools → Mozilla QA
Resolution: --- → FIXED
Whiteboard: [automation-blocked]
Version: other → unspecified
Assignee
Updated•11 years ago
Assignee: nobody → ashish
Reporter
Comment 12•11 years ago
I'm not sure if it's still related to pulsetranslator, but we're still not triggering any jobs, although builds are available.
I can still see the same error in the terminal, followed by the connection message, but since there are no timestamps I can't tell whether it failed during the night and reconnected on its own.
Were there any more network issues?
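On the missing timestamps: if the listener's output went through Python's `logging` module with `%(asctime)s` in the format string, each error and reconnect would carry a time. A minimal sketch; the logger name and helper function are hypothetical and not taken from the listener's code:

```python
import logging


def make_listener_logger(stream=None, name="pulse_listener"):
    """Return a logger whose lines start with a timestamp.

    `stream=None` makes the handler write to stderr (the default of
    logging.StreamHandler). Call this once per logger name, since each
    call attaches another handler.
    """
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

Calling `make_listener_logger().error("connection lost")` would then prefix the message with the date and time, which is enough to tell whether a failure happened overnight.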
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 13•11 years ago
Jonathan, this is a real blocker for us. After yesterday's fix (the restart of pulsetranslator) we got a couple of build notifications for l10n builds, but then everything stopped again. Can you please have a look at this today? It's a hard blocker for us because we cannot run any tests at all! Thanks!
Flags: needinfo?(jgriffin)
Updated•11 years ago
Assignee: ashish → nobody
Assignee
Comment 14•11 years ago
Jonathan fixed the issue for now. I won't close this bug yet; first I want to understand what the problems were and confirm that it has been fixed for real.
Assignee: nobody → jgriffin
Assignee
Comment 15•11 years ago
With the fix for pulsetranslator pushed live yesterday, it seems to work again. As Jonathan noticed, we had an old version of MozillaPulse running on our side, which might also have caused problems. So I upgraded it from 0.5 to 0.80, which meant I had to restart Jenkins too.
Andreea, please check tomorrow that everything is working fine.
Assignee: jgriffin → hskupin
Flags: needinfo?(jgriffin) → needinfo?(andreea.matei)
Reporter
Comment 16•11 years ago
It's working, jobs for aurora and esr24 were triggered, yey!
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Flags: needinfo?(andreea.matei)
Resolution: --- → FIXED
Assignee
Updated•11 years ago
Keywords: regression
Updated•6 years ago
Product: Mozilla QA → Mozilla QA Graveyard