Closed Bug 1215464 Opened 9 years ago Closed 9 years ago

mozmill-ci Pulse listeners are no longer able to connect to broker running on pulse.mozilla.org

Categories

(Mozilla QA Graveyard :: Infrastructure, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

(Keywords: regression)

Our Pulse listeners have been timing out since yesterday due to a broken connection to the broker. Sadly they do not manage to re-connect on their own. Not sure yet if it is related to some recent changes on the Pulse side. As of now more than 3000 jobs are in the queue...

2015-10-15T00:08:58Z WARNING kombu.mixins: Connection to broker lost. Trying to re-establish the connection...
2015-10-15T00:09:13Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 2.0 seconds.
2015-10-15T00:09:30Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 4.0 seconds.
2015-10-15T00:09:49Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 6.0 seconds.
2015-10-15T00:10:10Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 8.0 seconds.
2015-10-15T00:10:33Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 10.0 seconds.
2015-10-15T00:10:58Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 12.0 seconds.
2015-10-15T00:11:25Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 14.0 seconds.
2015-10-15T00:11:54Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 16.0 seconds.
2015-10-15T00:12:25Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 18.0 seconds.
2015-10-15T00:12:58Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 20.0 seconds.
2015-10-15T00:13:33Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 22.0 seconds.
2015-10-15T00:14:10Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 24.0 seconds.
2015-10-15T00:14:49Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 26.0 seconds.
2015-10-15T00:15:30Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 28.0 seconds.
2015-10-15T00:16:13Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 30.0 seconds.
2015-10-15T00:16:58Z WARNING kombu.mixins: Broker connection error: error(timeout('timed out',),). Trying again in 32.0 seconds.
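For reference, the timestamps in the log above are consistent with a retry wait that grows by 2 seconds per attempt, plus roughly 15 seconds spent in each failed connection attempt. The following is only a sketch of that schedule in plain Python. It does not use kombu, the function names are hypothetical, and the 15 second connect timeout is inferred from the gaps between log lines rather than taken from the mozmill-ci configuration.

```python
from datetime import datetime, timedelta


def retry_intervals(start=2.0, step=2.0, count=16):
    """Return the linearly growing wait times (in seconds) seen in the log."""
    return [start + step * i for i in range(count)]


def expected_error_times(first_error, intervals, connect_timeout=15.0):
    """Predict when each following connection error would be logged.

    Each retry waits `wait` seconds and then blocks for `connect_timeout`
    seconds before the attempt times out. The timeout value is an
    assumption derived from the log timestamps.
    """
    times = [first_error]
    for wait in intervals:
        times.append(times[-1] + timedelta(seconds=wait + connect_timeout))
    return times
```

Calling `expected_error_times(datetime(2015, 10, 15, 0, 9, 13), retry_intervals(count=15))` yields the same sequence of times as the "Broker connection error" lines above, which suggests the listeners were retrying as designed and every attempt simply timed out.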
Interestingly, it started at midnight between Oct 14th and Oct 15th.
Severity: normal → critical
Version: Firefox 43 → unspecified
I can successfully connect with those credentials from my local machine. So I would assume there is a problem with the network in qa.scl3.mozilla.org which doesn't let us connect to Pulse.
As mentioned in bug 1094272 comment 30, this is most likely due to the move to CloudAMQP, which left us behind: we can no longer reach the new hosts, which are restricted by the ACL entries for our network.

I will have to check which hosts exist now and get new ACL entries added.

In general I wish we would handle transitions like that better. :(
Summary: Pulse listeners are timing out on both staging and production → mozmill-ci Pulse listeners are no longer able to reach pulse.mozilla.org
Currently we have the following ACL setting:

source: * ; port: * --> dest: pulse.mozilla.org ; port 5671, 5672

Trying to connect to those ports times out:

XYZ@mm-ci-production:/data/mozmill-ci$ telnet pulse.mozilla.org 5671
Trying 54.215.253.142...
^C
XYZ@mm-ci-production:/data/mozmill-ci$ telnet pulse.mozilla.org 5672
Trying 54.215.254.97...
^C

I can ping the host, but traceroute does not show any useful data either.

Here are the current DNS settings:

$ nslookup pulse.mozilla.org
Server:         127.0.1.1
Address:        127.0.1.1#53

Name:   pulse.mozilla.org
Address: 54.215.253.142
Name:   pulse.mozilla.org
Address: 54.215.223.66
Name:   pulse.mozilla.org
Address: 54.215.254.97
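
Since the name resolves to several addresses, the telnet check above only exercises whichever address happens to be tried first. A small script can probe every resolved address on both AMQP ports. This is a minimal sketch using only the Python standard library; the helper names and the 5 second timeout are assumptions for illustration and are not part of mozmill-ci.

```python
import socket


def resolve_ipv4(host):
    """Return the sorted, de-duplicated IPv4 addresses for a host name."""
    infos = socket.getaddrinfo(host, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def can_connect(address, port, timeout=5.0):
    """Return True if a TCP connection succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False


def probe(host, ports, timeout=5.0):
    """Map every (address, port) pair for `host` to its reachability."""
    return {
        (address, port): can_connect(address, port, timeout)
        for address in resolve_ipv4(host)
        for port in ports
    }
```

Running `probe("pulse.mozilla.org", (5671, 5672))` from the affected host would show which address and port combinations are blocked, which is the list needed for the new ACL entries.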
Summary: mozmill-ci Pulse listeners are no longer able to reach pulse.mozilla.org → mozmill-ci Pulse listeners are no longer able to connect to broker running on pulse.mozilla.org
So this actually got busted most likely by the work on bug 1205867.
Blocks: 1205867
Keywords: regression
It's working again. But interestingly I do not see any drop in the number of messages in the queues.

queue/mozauto/mm-ci-production.qa.scl3.mozilla.com/production_update 51 messages
queue/mozauto/mm-ci-production.qa.scl3.mozilla.com/production_build 3397 messages 42%
queue/mozauto/mm-ci-staging.qa.scl3.mozilla.com/staging_update 51 messages
queue/mozauto/mm-ci-staging.qa.scl3.mozilla.com/staging_build 3397 messages 42%

How often is the number of messages in the queues updated on PulseGuardian? Marc, can you have a look at it please?
Flags: needinfo?(mcote)
Sorry, PulseGuardian has been having issues.  It's sorted out now, so your queue lengths should now be valid.

And sorry about the net-flow confusion.  We had updated the Buildbot flows, but I neglected to mention in my migration announcements that other data-centre apps would have to have theirs updated as well.  Note that the IPs will change occasionally (very infrequently).  We're coming up with a policy for that in bug 1208600.  It's worded just for Buildbot right now, but we'll expand it to include other data-centre apps.
Flags: needinfo?(mcote)
Thanks Marc! All looks good again. So closing this bug out.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Mozilla QA → Mozilla QA Graveyard