Closed Bug 1537756 Opened 7 years ago Closed 7 years ago

pulse is down, causing cascading failures

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: dustin, Unassigned)

References

Details

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Description

•

7 years ago

Pulse appears to be overloaded and mostly failing. I can't really load the admin UI but it did show me a graph with two nodes down and one overloaded.

This is causing Taskcluster to fail.

Chris AtLee [:catlee]

Updated

•

7 years ago

Severity: normal → blocker

Priority: -- → P1

Kim Moir [:kmoir] ET

Comment 2

•

7 years ago

I've contacted CloudAMQP support with an urgent request. Their admin console is not responsive. Our nodes have high memory utilization, and two of them appear to be unresponsive. I rebooted them, but this did not address the problem.

Kim Moir [:kmoir] ET

Comment 3

•

7 years ago

From CloudAMQP support:
Your consumer count has gone from 110 to almost up to a 1000 and the same for queues, this is the reason for the increased memory usage which stopped the servers from processing messages.

They restarted the nodes and messages started being processed again.

Trees are closed until the backlog is caught up; I will defer to ciduty to decide when to reopen.
I will schedule a postmortem for this outage.
We should investigate the root cause of the increase in volume:
Are these valid consumers and messages?
Is our current server configuration at CloudAMQP adequate, or do we need to change it to accommodate our volume?

Roland Mutter Michael (:rmutter)

Updated

•

7 years ago

Blocks: 1537762

Jordan Lund (:jlund)

Updated

•

7 years ago

Comment 4

•

7 years ago

This occurred again today. Root-cause work and fix development are in bug 1538961.

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 5

•

7 years ago

Occurred again today. The remediation is in bug 1540758.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED


Categories

(Webtools :: Pulse, defect, P1)
