Closed Bug 1297547 Opened 8 years ago Closed 5 years ago

pulse outage on 2016-08-23 around 22:53 UTC

Categories: Webtools :: Pulse
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: gps
Assignee: Unassigned

Multiple services experienced some kind of outage/network event to pulse.mozilla.org around 2016-08-23T22:53Z.

Trees are closed.

hg.mo pulse writing has recovered. TaskCluster is still experiencing issues...
Underlying issue appears to be DNS misconfiguration.

pulse.mozilla.org is currently advertising 3 A records:

pulse.mozilla.org.      60      IN      A       54.215.223.66
pulse.mozilla.org.      60      IN      A       54.215.253.142
pulse.mozilla.org.      60      IN      A       54.215.254.97

Only 54.215.254.97 is currently accepting connections.
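A minimal sketch of how one might verify which of those advertised addresses actually accept connections, assuming the broker listens on AMQPS port 5671 (an assumption; none of this code is from the incident itself):

# Probe each A record advertised for pulse.mozilla.org and report which ones
# accept TCP connections. Port 5671 (AMQP over TLS) is assumed.
import socket

HOSTNAME = 'pulse.mozilla.org'
PORT = 5671  # assumed AMQPS port

def advertised_addresses(hostname):
    # gethostbyname_ex returns (canonical name, aliases, list of A records)
    _, _, addresses = socket.gethostbyname_ex(hostname)
    return sorted(set(addresses))

def accepts_connections(address, port, timeout=5):
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == '__main__':
    for addr in advertised_addresses(HOSTNAME):
        status = 'OK' if accepts_connections(addr, PORT) else 'UNREACHABLE'
        print('%s:%d %s' % (addr, PORT, status))

At the time of the outage, a check like this would have reported two of the three advertised addresses as unreachable.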

According to mcote, our CloudAMQP server is running at orange-antelope.rmq.cloudamqp.com. This resolves as:

orange-antelope.rmq.cloudamqp.com. 28 IN CNAME  ec2-54-215-254-97.us-west-1.compute.amazonaws.com.
ec2-54-215-254-97.us-west-1.compute.amazonaws.com. 3568 IN A 54.215.254.97

Who knows what CloudAMQP did to DNS and their servers in the last ~hour. It shouldn't matter: I think pulse.mozilla.org should be a CNAME to orange-antelope.rmq.cloudamqp.com.
To be on the conservative side, we may want to just drop the 54.215.223.66 and 54.215.253.142 A records until we talk to CloudAMQP and figure out what's going on.
To get around the immediate problem, I've deleted the A records for 54.215.223.66 and 54.215.253.142. CNAME chains are not good practice, so I didn't want to just blow away all the A records and create CNAMEs for them. That's probably a larger discussion where we figure out what we should be pointing at for production use.
FYI the addresses of the three nodes are orange-antelope-01.rmq.cloudamqp.com, orange-antelope-02.rmq.cloudamqp.com, and orange-antelope-03.rmq.cloudamqp.com, which currently resolve to 54.215.253.142, 54.215.223.66, and 54.215.254.97.  So it doesn't appear to be a DNS issue; rather, the first two hosts appear to be unreachable.
According to mcote, the 3 servers we're configured to use are:

  orange-antelope-0[123].rmq.cloudamqp.com

These resolve to the 3 IPs previously listed in this bug. However, orange-antelope.rmq.cloudamqp.com currently only appears to be advertising 54.215.254.97.

My guess is they either took the other 2 out of service or they went down. I'm not sure if their SLA only guarantees the orange-antelope.rmq.cloudamqp.com endpoint, or whether all of the orange-antelope-0[123].rmq.cloudamqp.com endpoints are supposed to work. If it's just the former, our statically configured A records aren't sufficient: we need some kind of active monitoring to dynamically update our DNS A records if CloudAMQP takes a server out of their DNS rotation.
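A minimal sketch of that active-monitoring idea (an assumption about how it could work, not an existing tool): compare the A records we publish for pulse.mozilla.org against whatever orange-antelope.rmq.cloudamqp.com currently resolves to, and flag anything CloudAMQP has dropped from rotation. Wiring the output into alerting or automatic DNS updates would be the real work.

# Compare our published A records (from earlier in this bug) against the
# addresses CloudAMQP currently advertises behind its CNAME.
import socket

OUR_PUBLISHED_A_RECORDS = {
    '54.215.223.66',
    '54.215.253.142',
    '54.215.254.97',
}
CLOUDAMQP_ENDPOINT = 'orange-antelope.rmq.cloudamqp.com'

def current_rotation(hostname):
    # gethostbyname_ex follows the CNAME and returns the A records behind it.
    _, _, addresses = socket.gethostbyname_ex(hostname)
    return set(addresses)

def main():
    in_rotation = current_rotation(CLOUDAMQP_ENDPOINT)
    stale = OUR_PUBLISHED_A_RECORDS - in_rotation
    missing = in_rotation - OUR_PUBLISHED_A_RECORDS
    if stale:
        # Addresses we advertise that CloudAMQP no longer does; a real monitor
        # would alert or update DNS here instead of printing.
        print('stale A records: %s' % ', '.join(sorted(stale)))
    if missing:
        print('addresses in rotation we do not publish: %s' % ', '.join(sorted(missing)))
    if not stale and not missing:
        print('DNS in sync')

if __name__ == '__main__':
    main()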
Sadly CloudAMQP's chat support doesn't appear to be active right now.  I just sent the following to support@cloudamqp.com:

Hi, my cluster, orange-antelope.rmq.cloudamqp.com, appears to be operating in a degraded fashion. Only orange-antelope-03.rmq.cloudamqp.com appears to be available right now; orange-antelope-01.rmq.cloudamqp.com and orange-antelope-02.rmq.cloudamqp.com are throwing 502 Bad Gateway errors. I don't see any errors showing up on the RabbitMQ nodes page, nor anything on http://status.cloudamqp.com/. Is there some sort of maintenance going on? This has caused some severe infrastructure problems over here.

Thanks,
Mark Côté
Engineering Manager, Engineering Productivity
Mozilla
We have a new wrinkle: data loss.

hg.mozilla.org claims to be sending messages to pulse.mozilla.org without error. However, glandium's pulsebot isn't seeing them (or at least not all of them). And I've had pulse inspector - https://tools.taskcluster.net/pulse-inspector/#!((exchange:exchange/hgpushes/v1,routingKeyPattern:%23)) - open for several minutes, and it is definitely missing messages I've seen hg.mozilla.org send.

It's possible hg.mozilla.org isn't sending messages properly. But we've never seen this type of behavior before. And given that CloudAMQP appears to be in some kind of degraded state, I think the events are connected.
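One way to rule out the publishing side independently of pulse inspector is a throwaway consumer bound directly to exchange/hgpushes/v1. The sketch below uses the pika client rather than Mozilla's own pulse tooling; the credentials and queue name are placeholders, and the queue/<user>/ naming convention and AMQPS port are assumptions. If pushes show up here but not in pulsebot's queue, the problem is on the broker/queue side rather than in hg.mozilla.org's publishing.

# Throwaway consumer bound to exchange/hgpushes/v1 on pulse.mozilla.org.
# PULSE_USER/PULSE_PASSWORD and the queue name below are placeholders.
import ssl
import pika  # third-party AMQP client; an assumption, not what pulsebot uses

PULSE_USER = 'example-user'          # placeholder pulse account
PULSE_PASSWORD = 'example-password'  # placeholder

params = pika.ConnectionParameters(
    host='pulse.mozilla.org',
    port=5671,  # assumed AMQPS port
    virtual_host='/',
    credentials=pika.PlainCredentials(PULSE_USER, PULSE_PASSWORD),
    ssl_options=pika.SSLOptions(ssl.create_default_context()),
)
connection = pika.BlockingConnection(params)
channel = connection.channel()

# Pulse convention (assumed here) is that queue names start with queue/<user>/.
queue_name = 'queue/%s/hgpushes-debug' % PULSE_USER
channel.queue_declare(queue=queue_name, exclusive=True, auto_delete=True)
channel.queue_bind(queue=queue_name,
                   exchange='exchange/hgpushes/v1',
                   routing_key='#')

print('waiting for hgpushes messages...')
for method, properties, body in channel.consume(queue_name, inactivity_timeout=120):
    if method is None:
        # Nothing arrived within two minutes of inactivity.
        print('no messages seen')
        break
    print(body)
    channel.basic_ack(method.delivery_tag)

connection.close()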
See Also: → 1297560
Apparently the service powering pulse inspector was restarted. So that may explain why it was not seeing updates.

glandium still reports the pulsebot queue isn't accumulating messages, however. So something is wonky.
It has been 140 minutes since mcote initially emailed CloudAMQP's 24/7 support line. Still no reply. http://status.cloudamqp.com/ claims their service is fine.

Try is open but Treeherder is reporting no TC activity. That's either a problem with TC not running or not reporting task activity (via pulse), or with Treeherder not consuming it (again from pulse).

The issue with pulsebot's durable queue losing data is still unresolved. I sent a follow-up to mcote's email to CloudAMQP support on that particular issue. If pulsebot's durable queue is losing data, who knows what else may have lost or will lose data.
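For the durable-queue question specifically, the RabbitMQ management HTTP API (which the management plugin provides and CloudAMQP instances typically expose over HTTPS) can show whether messages are actually accumulating in pulsebot's queue or being delivered and lost elsewhere. A minimal sketch, where the management hostname, vhost, queue name, and credentials are all placeholders rather than values from this bug:

# Query a queue's depth via the RabbitMQ management API.
# Hostname, vhost, queue name, and credentials below are placeholders.
from urllib.parse import quote
import requests  # third-party HTTP client

MANAGEMENT_HOST = 'orange-antelope.rmq.cloudamqp.com'  # assumed management endpoint
VHOST = '/'                      # placeholder vhost
QUEUE = 'queue/pulsebot/pulse'   # placeholder queue name
AUTH = ('example-user', 'example-password')  # placeholder credentials

url = 'https://%s/api/queues/%s/%s' % (
    MANAGEMENT_HOST, quote(VHOST, safe=''), quote(QUEUE, safe=''))

resp = requests.get(url, auth=AUTH, timeout=30)
resp.raise_for_status()
info = resp.json()

# messages_ready: queued and waiting; messages_unacknowledged: delivered, not yet acked.
print('ready: %d, unacked: %d, consumers: %d' % (
    info['messages_ready'], info['messages_unacknowledged'], info['consumers']))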

lizzard pinged me earlier about kicking off a release build (not sure which channel). I informed her that it is blocked on this issue.

At this time, we have no ETA for reopening the trees because it appears pulse is still in a degraded state and we have no indication when things will change since their support isn't responding.
Release build would have been 48.0.2 on the release channel, I believe.
(In reply to Gregory Szorc [:gps] from comment #10)
> Try is open but Treeherder is reporting no TC activity. That's either a
> problem with TC not running or reporting task activity (via pulse) or
> Treeherder not consuming it (again from pulse).

Try is now closed, since it was just cruel to leave it open: buildbot builds can compile, but without taskcluster working they can't upload, and thus can't run tests. The only thing leaving Try open allowed was builds without tests, for people willing to dig through the full log to figure out whether they got a genuinely failed build or a successful build marked as failed because of an upload failure.
Trees reopened; leaving the bug open since I've no idea if there's follow-up work.
See Also: → 1297759
Received an email from support at 11:48 am PDT: "Ok, it's up. Investigating"

MOC is being cc'd on those emails. We should follow up with them if we don't hear anything back later today, since this has now happened twice within about 12 hours.
(In reply to Mark Côté [:mcote] from comment #15)
> Received an email from support at 11:48 am PDT: "Ok, it's up. Investigating"

Sorry, that should have been 11:43 pm EDT.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED