[tc-queue] PulsePublisher.sendDeadline exceeded knocked out queue this morning
Categories
(Taskcluster :: Services, defect)
Tracking
(Not tracked)
People
(Reporter: dustin, Assigned: dustin)
34,000+ exceptions with this title.
Assignee | ||
Comment 1•6 years ago
|
||
And a bunch more just now. RabbitMQ shows queue's connections in the "flow" state, which means it's rate-limiting them.
Assignee | ||
Comment 3•6 years ago
|
||
This is RabbitMQ's flow control, and it occurred because some queues (queue/releng-services-{production,staging}/exchange/taskcluster-queue/v1/task-group-resolved) had about 2.5 million unread messages. I deleted those queues and tc-queue is no longer being rate-limited.
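For anyone wanting to spot this kind of backlog earlier, here is a minimal sketch that lists queues by unread message count. It assumes the RabbitMQ management plugin is enabled and reachable. `GET /api/queues` and its `name` and `messages` fields come from that plugin's HTTP API. The function names, the base URL, the credentials, and the 1,000,000 threshold are illustrative assumptions, not anything from the pulse deployment.

```python
import base64
import json
import urllib.request


def fetch_queues(base_url, user, password):
    """Fetch queue metadata from the RabbitMQ management API (GET /api/queues).

    base_url, user and password are placeholders supplied by the caller.
    """
    req = urllib.request.Request(f"{base_url}/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def find_backlogged_queues(queues, threshold=1_000_000):
    """Return (name, messages) pairs at or above threshold, largest first.

    The threshold is an arbitrary illustrative value, not a RabbitMQ limit.
    """
    backlog = [
        (q["name"], q.get("messages", 0))
        for q in queues
        if q.get("messages", 0) >= threshold
    ]
    return sorted(backlog, key=lambda item: item[1], reverse=True)
```

Something like `find_backlogged_queues(fetch_queues(url, user, password))` would then surface queues of the size described above, though whether the management API is accessible to Taskcluster operators is an open assumption.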
It's an open question why pulseguardian didn't delete those queues.
I'm not sure there's much Taskcluster could do about this -- at best we could allow createTask to succeed without sending pulse messages. But that's still failing to fulfill an API promise (that we'll send messages about tasks). We could potentially queue the messages in some other service (redis?) but that just moves the problem: if whatever consumes from redis and publishes to pulse gets backed up, then eventually redis will fill up and fail. That might buy us more time to diagnose the problem, but at a big cost in complexity.
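To illustrate the "just moves the problem" point, here is a toy simulation, not a proposal. `BoundedBuffer` is a hypothetical stand-in for an intermediate store such as redis, and the capacities and rates are made-up numbers. It only shows that a finite buffer delays the failure when the forwarder falls behind, rather than preventing it.

```python
from collections import deque


class BoundedBuffer:
    """Hypothetical stand-in for an intermediate store with finite capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def put(self, message):
        # Once the buffer is full the producer fails again, just later.
        if len(self.items) >= self.capacity:
            raise OverflowError("buffer full: forwarder is not keeping up")
        self.items.append(message)

    def drain(self, n):
        """Forward up to n messages downstream; return how many were forwarded."""
        forwarded = min(n, len(self.items))
        for _ in range(forwarded):
            self.items.popleft()
        return forwarded


def ticks_until_overflow(capacity, produce_per_tick, drain_per_tick, max_ticks=10_000):
    """Return the tick at which a put first fails, or None if none fails in max_ticks."""
    buf = BoundedBuffer(capacity)
    for tick in range(max_ticks):
        try:
            for i in range(produce_per_tick):
                buf.put((tick, i))
        except OverflowError:
            return tick
        buf.drain(drain_per_tick)
    return None
```

With the forwarder fully stalled (`drain_per_tick=0`), the buffer fills after `capacity / produce_per_tick` ticks. A larger capacity buys more time to diagnose the problem, but the producer still ends up failing.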
Assignee | ||
Comment 4•6 years ago
|
||
pulseguardian issue is bug 1540758.
Assignee | ||
Comment 5•6 years ago
|
||
The cascading failures we see here (causing hook failures for example) are handled in bug 1540697. I don't think we can make this "just work", but we can at least make the failures more tractable.
Comment 6•6 years ago
|
||
Dustin-- In PulseGuardian, there is a checkbox to prevent deleting the queue, even if it's in violation. It will have a badge next to the name saying "Unbounded". We have that set for Treeherder. Maybe that got set accidentally for the queues in question? I don't have access to see the queues in question, only the Treeherder ones.
Assignee | ||
Comment 9•6 years ago
|
||
(In reply to Cameron Dawson [:camd] from comment #6)
Dustin-- In PulseGuardian, there is a checkbox to prevent deleting the queue, even if it's in violation. It will have a badge next to the name saying "Unbounded". We have that set for Treeherder. Maybe that got set accidentally for the queues in question? I don't have access to see the queues in question, only the Treeherder ones.
That may be worth checking once the worker is back up -- but from what I can tell, nothing is being deleted right now. But that's bug 1540758. This one is done.