Closed Bug 1540696 Opened 6 years ago Closed 6 years ago

[tc-queue] PulsePublisher.sendDeadline exceeded knocked out queue this morning

Categories

(Taskcluster :: Services, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

34,000+ exceptions with this title.

Assignee: nobody → dustin

And a bunch more just now. RabbitMQ shows tc-queue's connections in the "flow" state, which means it's rate-limiting them.

Blocks: 1540742

This is RabbitMQ's flow control and occurred because there were some queues (queue/releng-services-{production,staging}/exchange/taskcluster-queue/v1/task-group-resolved) with about 2.5 million unread messages. I deleted those queues and tc-queue is no longer being rate-limited.
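A backlog like this can be spotted by scanning queue depths before flow control kicks in. Here is a minimal sketch, assuming queue records shaped like those returned by the RabbitMQ management API's `/api/queues` endpoint (each with a `name` and a `messages` count). The function name `find_backlogged_queues`, the threshold, and the sample data are hypothetical, not something tc-queue or pulseguardian is known to use.

```python
from typing import Iterable, List, Mapping


def find_backlogged_queues(queues: Iterable[Mapping], threshold: int = 1_000_000) -> List[str]:
    """Return names of queues whose message count exceeds `threshold`, largest first."""
    over = [q for q in queues if q.get("messages", 0) > threshold]
    over.sort(key=lambda q: q["messages"], reverse=True)
    return [q["name"] for q in over]


# Illustrative records only; names and counts are made up.
sample = [
    {"name": "queue/example-a/busy", "messages": 2_500_000},
    {"name": "queue/example-b/idle", "messages": 12},
]
backlogged = find_backlogged_queues(sample)
```

Fetching the records and deciding what to do with an oversized queue (alert, delete, contact the owner) is left out; the sketch only covers the filtering step.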

It's an open question why pulseguardian didn't delete those queues.

I'm not sure there's much Taskcluster could do about this -- at best we could allow createTask to succeed without sending pulse messages. But that's still failing to fulfill an API promise (that we'll send messages about tasks). We could potentially queue the messages in some other service (redis?) but that just moves the problem: if whatever consumes from redis and publishes to pulse gets backed up, then eventually redis will fill up and fail. That might buy us more time to diagnose the problem, but at a big cost in complexity.
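To illustrate why an intermediate store only moves the problem, here is a minimal sketch of a bounded buffer sitting between the API call and the publisher. `BoundedOutbox`, `OutboxFull`, and the capacity are hypothetical stand-ins for something like redis with finite memory; nothing here reflects how tc-queue is actually implemented.

```python
from collections import deque
from typing import Callable


class OutboxFull(Exception):
    """Raised when the intermediate buffer has no room left."""


class BoundedOutbox:
    """Hypothetical stand-in for an intermediate store with finite capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._messages = deque()

    def enqueue(self, message: str) -> None:
        # Would be called on the createTask path instead of publishing directly.
        if len(self._messages) >= self.capacity:
            raise OutboxFull("buffer full; the publisher is not draining fast enough")
        self._messages.append(message)

    def drain(self, publish: Callable[[str], None], limit: int) -> int:
        """Publish up to `limit` buffered messages; return how many were sent."""
        sent = 0
        while self._messages and sent < limit:
            # Publish first, then remove, so a failed publish keeps the message.
            publish(self._messages[0])
            self._messages.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._messages)
```

If the publisher is being rate-limited and `drain` falls behind, `enqueue` starts raising `OutboxFull` once `capacity` messages have accumulated, and the caller is back to choosing between failing the request and dropping the message. The buffer delays the failure by however long it takes to fill, which is the extra diagnosis time mentioned above.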

The pulseguardian issue is bug 1540758.

The cascading failures we see here (causing hook failures for example) are handled in bug 1540697. I don't think we can make this "just work", but we can at least make the failures more tractable.

Dustin-- In PulseGuardian, there is a checkbox to prevent deleting the queue, even if it's in violation. It will have a badge next to the name saying "Unbounded". We have that set for Treeherder. Maybe that got set accidentally for the queues in question? I don't have access to see the queues in question, only the Treeherder ones.

Depends on: 1540697, 1540758
Blocks: 1542805

(In reply to Cameron Dawson [:camd] from comment #6)

Dustin-- In PulseGuardian, there is a checkbox to prevent deleting the queue, even if it's in violation. It will have a badge next to the name saying "Unbounded". We have that set for Treeherder. Maybe that got set accidentally for the queues in question? I don't have access to see the queues in question, only the Treeherder ones.

That may be worth checking once the worker is back up -- but from what I can tell, nothing is being deleted right now. But that's bug 1540758. This one is done.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED