Detect RabbitMQ flow-control and alert on it
Categories: Taskcluster :: Services, defect
Tracking: Not tracked
People: Reporter: dustin, Unassigned
When consumers don't consume quickly enough, RabbitMQ's solution is to slow producers down (flow control). That means the queue and other services can't publish messages to pulse as quickly. Unfortunately, HTTP requests to the queue then time out and all manner of ugliness ensues.
But RabbitMQ apparently tells us when this happens, so we could respond by:
- logging (so that we can set up a stackdriver alert)
- immediately failing any publish attempts in such a way that they go back to the user as a 503 that indicates pulse is down.
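A minimal sketch of those two responses, assuming the connection object emits `blocked` and `unblocked` events the way amqplib connections do. The `BlockedTracker` class and its `assertPublishable` method are hypothetical names, not part of tc-lib-pulse, and a plain `EventEmitter` stands in for a real connection. As far as I know, this notification is tied to broker resource alarms, so whether it fires for the flow state described here is an open question.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: remembers whether the broker has blocked this connection.
class BlockedTracker {
  constructor(connection, log = console.log) {
    this.blocked = false;
    this.reason = null;
    connection.on('blocked', reason => {
      this.blocked = true;
      this.reason = reason;
      // A log line like this is what an alert could be built on.
      log(`pulse connection blocked: ${reason}`);
    });
    connection.on('unblocked', () => {
      this.blocked = false;
      this.reason = null;
      log('pulse connection unblocked');
    });
  }

  // Fail fast with a 503-style error instead of publishing into a blocked connection.
  assertPublishable() {
    if (this.blocked) {
      const err = new Error(`pulse is unavailable: ${this.reason}`);
      err.statusCode = 503;
      throw err;
    }
  }
}

// Stand-in for a real AMQP connection (assumption: same event names).
const conn = new EventEmitter();
const tracker = new BlockedTracker(conn, () => {});
conn.emit('blocked', 'low on memory');

try {
  tracker.assertPublishable();
} catch (err) {
  console.log('publish refused with status', err.statusCode);
}
```

Keeping the state in one small object means both the logging and the fail-fast check read from the same source of truth.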
It would be great to make this available from the tc-lib-pulse publisher so that API methods which do a bunch of stuff and then send a pulse message can avoid doing the "bunch of stuff" first, if desired. For example, queue.createTask could check whether pulse is OK before creating the task, thereby avoiding the case where a task is created but no pulse message about it is delivered (as occurred in our recent two pulse outages).
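The ordering described above might look roughly like this. Everything here is a stand-in for illustration: `pulse.isBlocked()`, `pulse.publish()`, and `db.insertTask()` are hypothetical and are not the real queue service or tc-lib-pulse API.

```javascript
// Hypothetical handler: check pulse health before doing any durable work.
async function createTask(taskId, definition, { pulse, db }) {
  if (pulse.isBlocked()) {
    // Nothing has been written yet, so the caller can safely retry later.
    return { status: 503, body: { message: 'pulse is unavailable, try again later' } };
  }
  await db.insertTask(taskId, definition);
  await pulse.publish('task-defined', { taskId });
  return { status: 200, body: { taskId } };
}
```

The check only narrows the window: pulse could still become unavailable between the check and the publish, so this reduces rather than eliminates the "task created, no message sent" case.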
Updated • 6 years ago (Reporter)
Comment 1 • 6 years ago (Reporter)
Worth noting: I didn't see "flow" in the channel status for the queue during today's outage.
Comment 2 • 6 years ago (Reporter)
Comment 3 • 6 years ago (Reporter)
Deployed to notify and it's OK. I'll try queue shortly.
Comment 4 • 6 years ago (Reporter)
Great success on queue.
Comment 5 • 5 years ago (Reporter)
These messages do not appear to be sent.
Comment 6 • 5 years ago (Reporter)
I'm going to wait until CloudAMQP support helps me understand what happened here, then reproduce it in our dev RabbitMQ cluster and use that to verify that the log messages are being produced.
Comment 7 • 5 years ago (Reporter)
The dev environment is not running the same version of RabbitMQ, so it's less helpful. I can't reproduce the problem with the obvious approach of sending faster than consuming:
    const pulse = require('taskcluster-lib-pulse');
    const {defaultMonitorManager} = require('taskcluster-lib-monitor');

    const monitorManager = defaultMonitorManager.configure({
      serviceName: 'queue',
    });

    // raw AMQP credentials
    const credentials = pulse.pulseCredentials({
      username: 'dustin',
      password: '<mumble>',
      hostname: 'pulse.mozilla.org',
      vhost: '/',
    });

    const main = async () => {
      const monitor = monitorManager.setup({
        processName: 'pulse',
        verify: false,
      });

      const client = new pulse.Client({
        namespace: 'dustin',
        credentials, // from above
        monitor,
      });

      // make sure the test exchange exists
      await client.withChannel(async chan => {
        await chan.assertExchange('exchange/dustin/testproducer', 'topic');
      });

      // slow consumer: one message at a time, 200ms each
      await pulse.consume({
        client,
        bindings: [{exchange: 'exchange/dustin/testproducer', routingKeyPattern: '#'}],
        queueName: 'testcons',
        prefetch: 1,
      }, async ({payload}) => {
        console.log('got', payload);
        await new Promise(resolve => setTimeout(resolve, 200));
      });

      // fast producer: one message every 10ms, forever
      await client.withChannel(async chan => {
        let i = 0;
        while (true) {
          console.log('send', i);
          await chan.publish('exchange/dustin/testproducer', '-', Buffer.from(JSON.stringify({count: i++})));
          await new Promise(resolve => setTimeout(resolve, 10));
        }
      });
    };

    main().then(console.log, console.log);
I don't see any of the connection, the channel, or the queue going into flow state.
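One way to check for that state without watching the management UI is to poll the management HTTP API. This is a rough sketch, assuming the management plugin is enabled and that the objects returned by `/api/connections`, `/api/channels`, and `/api/queues` carry a `state` field with values such as `running`, `flow`, `blocking`, or `blocked`. The function names, base URL, and credentials are placeholders, and `fetch` is assumed to be the global one in recent Node.js versions.

```javascript
// States that suggest the broker is throttling a publisher (assumed values).
const THROTTLED_STATES = new Set(['flow', 'blocking', 'blocked']);

// Pure helper: pick out the entries whose state indicates throttling.
function findThrottled(items) {
  return items
    .filter(item => THROTTLED_STATES.has(item.state))
    .map(item => ({ name: item.name, state: item.state }));
}

// Query the management API for each kind of object and report throttled ones.
async function checkBroker(baseUrl, user, password) {
  const auth = 'Basic ' + Buffer.from(`${user}:${password}`).toString('base64');
  const result = {};
  for (const kind of ['connections', 'channels', 'queues']) {
    const res = await fetch(`${baseUrl}/api/${kind}`, { headers: { authorization: auth } });
    result[kind] = findThrottled(await res.json());
  }
  return result;
}
```

Running `checkBroker` in a loop alongside the producer above would record any transient flow state that a manual look at the UI could miss.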
Comment 8 • 5 years ago (Reporter)
> Great success on queue.
I think that meant "didn't crash".
https://www.rabbitmq.com/changelog.html suggests that the notification functionality was added in 3.2.0, which is earlier than the 3.5.7 we're running. It's possible that this support was improved in a later version? bug 1603633 should help discover that.
As it stands, I can't manage to trigger this issue to verify that these messages are being sent, so that's about all I can figure out.
Comment 9 • 5 years ago (Reporter)
I've marked this as depending on upgrading RabbitMQ on pulse, and hopefully it will magically work better after that.
Updated • 4 years ago (Reporter)