Closed Bug 1537756 Opened 5 years ago Closed 5 years ago

pulse is down, causing cascading failures

Categories

(Webtools :: Pulse, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Unassigned)

Pulse appears to be overloaded and mostly failing. I can't reliably load the admin UI, but it did show me a graph with two nodes down and one overloaded.

This is causing Taskcluster to fail.

Severity: normal → blocker
Priority: -- → P1

I've contacted CloudAMQP support with an urgent request. Their admin console is not responsive. Our nodes are showing high memory utilization, and two of them appear to be unresponsive. I rebooted them, but this did not address the problem.

From CloudAMQP support:
Your consumer count has gone from 110 to almost 1000, and the same for queues. This is the reason for the increased memory usage, which stopped the servers from processing messages.

They restarted the nodes and messages started being processed again.

Trees are closed until the backlog is caught up; I will defer to ciduty to decide when to reopen.
I will schedule a postmortem for this outage.
We should investigate the root cause of the increase in volume:
- Are they valid consumers/messages?
- Is the current server configuration we have at CloudAMQP valid, or do we need to change it to accommodate our volume?
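
As a starting point for the investigation, here is a rough sketch of a check that flags the kind of growth support described (consumers going from about 110 to almost 1000). It assumes the broker exposes the standard RabbitMQ management HTTP API, whose `/api/overview` response includes an `object_totals` object. The function names, the 3x threshold, and the queue baseline of 110 are all hypothetical choices for illustration, not anything currently deployed.

```python
import base64
import json
import urllib.request
from typing import Dict, List

# Baselines: the consumer figure comes from the support message; the queue
# figure is an assumption based on "the same for queues".
BASELINE = {"consumers": 110, "queues": 110}


def find_growth(object_totals: Dict[str, int],
                baseline: Dict[str, int],
                factor: float = 3.0) -> List[str]:
    """Return the sorted names of counters above baseline * factor.

    `factor` is a hypothetical alert threshold, not a tuned value.
    """
    flagged = []
    for name, base in baseline.items():
        current = object_totals.get(name, 0)
        if current > base * factor:
            flagged.append(name)
    return sorted(flagged)


def fetch_object_totals(base_url: str, user: str, password: str) -> Dict[str, int]:
    """Read object_totals from the management API (assumed to be reachable)."""
    request = urllib.request.Request(f"{base_url}/api/overview")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["object_totals"]
```

Calling `find_growth(fetch_object_totals(url, user, password), BASELINE)` on a schedule would return the counters that have grown past the threshold. Note that during this outage the admin console itself was unresponsive, so a check like this is only useful as an early warning before the nodes run out of memory.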

See Also: → 1538914

This occurred again today. Root-cause work and fix development are in bug 1538961.

Occurred again today. The remediation is in bug 1540758.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED