Closed Bug 1543378 Opened 6 years ago Closed 4 years ago

Detect RabbitMQ flow-control and alert on it

Categories

(Taskcluster :: Services, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Unassigned)

When consumers don't consume quickly enough, RabbitMQ's solution is to throttle producers (flow control). That means services such as the queue can't publish messages to pulse as quickly as they need to. Unfortunately, HTTP requests to the queue then time out and all manner of ugliness ensues.

But RabbitMQ apparently notifies us when this happens, so we could respond by:

  • logging (so that we can set up a Stackdriver alert)
  • immediately failing any publish attempts in such a way that they go back to the user as a 503 indicating that pulse is down.

It would be great to make this available from the tc-lib-pulse publisher so that API methods which do a bunch of stuff and then send a pulse message can avoid doing the "bunch of stuff" first, if desired. For example, queue.createTask could check whether pulse is OK before creating the task, thereby avoiding the case where a task is created but no pulse message about it is delivered (as occurred in our recent two pulse outages).
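
A minimal sketch of that idea, assuming the notification in question is the one amqplib surfaces as 'blocked' and 'unblocked' events on the connection. FlowAwarePublisher, assertAvailable, the log shape, the in-memory store, and the exchange name are all hypothetical names for illustration; this is not the tc-lib-pulse API.

```javascript
const {EventEmitter} = require('events');

// Hypothetical wrapper (not the tc-lib-pulse API): remembers whether the
// broker has told us the connection is blocked, logs the transitions, and
// fails fast with a 503-style error while blocked.
class FlowAwarePublisher {
  constructor({connection, channel, log = console.log}) {
    this.channel = channel;
    this.log = log;
    this.blockedReason = null; // null means "not blocked"

    // amqplib emits these on the connection object when the broker sends
    // connection.blocked / connection.unblocked.
    connection.on('blocked', reason => {
      this.blockedReason = reason || 'unknown';
      this.log({type: 'pulse-blocked', reason: this.blockedReason});
    });
    connection.on('unblocked', () => {
      this.blockedReason = null;
      this.log({type: 'pulse-unblocked'});
    });
  }

  // Throws an error carrying statusCode 503 if the connection is blocked.
  assertAvailable() {
    if (this.blockedReason !== null) {
      const err = new Error(`pulse is unavailable: ${this.blockedReason}`);
      err.statusCode = 503;
      throw err;
    }
  }

  async publish(exchange, routingKey, payload) {
    this.assertAvailable();
    return this.channel.publish(
      exchange, routingKey, Buffer.from(JSON.stringify(payload)));
  }
}

// Hypothetical API method: check pulse first, then do the "bunch of stuff",
// then publish. `store` stands in for whatever persists the task.
const createTask = async (publisher, store, task) => {
  publisher.assertAvailable();
  store.push(task);
  await publisher.publish('exchange/example/task-defined', task.taskId, task);
};
```

With this shape, an API layer could map err.statusCode onto the HTTP response, and the log entries give an alerting rule something to match on. The check is only advisory: the connection can still become blocked between assertAvailable() and the publish.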

Assignee: nobody → dustin

Worth noting: I didn't see "flow" in the channel status for the queue during today's outage.

Deployed to notify and it's OK. I'll try queue shortly.

Great success on queue.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

These messages do not appear to be sent.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

I'm going to wait until CloudAMQP support helps me to understand what happened here, and then reproduce it in our dev RabbitMQ cluster, and use that to verify that the log messages are being produced.

The dev environment is not running the same version of RabbitMQ, so it is less helpful. I can't repro using the obvious send-faster-than-consuming approach:

const pulse = require('taskcluster-lib-pulse');

const {defaultMonitorManager} = require('taskcluster-lib-monitor');

const monitorManager = defaultMonitorManager.configure({
  serviceName: 'queue',
});

let credentials;

// raw AMQP credentials
credentials = pulse.pulseCredentials({
  username: 'dustin',
  password: '<mumble>',
  hostname: 'pulse.mozilla.org',
  vhost: '/',
});

const main = async () => {
  const monitor = monitorManager.setup({
    processName: 'pulse',
    verify: false,
  });
  const client = new pulse.Client({
    namespace: 'dustin',
    credentials, // from above
    monitor,
  });

  await client.withChannel(async chan => {
    await chan.assertExchange('exchange/dustin/testproducer', 'topic');
  });

  let pc = await pulse.consume({
    client,
    bindings: [{exchange: 'exchange/dustin/testproducer', routingKeyPattern: '#'}],
    queueName: 'testcons',
    prefetch: 1,
  }, async ({payload, exchange, routingKey, redelivered, routes, routing}) => {
    console.log('got', payload);
    await new Promise(resolve => setTimeout(resolve, 200));
  });

  await client.withChannel(async chan => {
    let i = 0;
    // publish every 10ms, about 20x faster than the consumer above
    // (prefetch 1, 200ms per message) can handle
    while (true) {
      console.log('send', i);
      await chan.publish('exchange/dustin/testproducer', '-', Buffer.from(JSON.stringify({count: i++})));
      await new Promise(resolve => setTimeout(resolve, 10));
    }
  });
};

main().then(console.log, console.log);

I don't see any of the connection, the channel, or the queue going into flow state.

> Great success on queue.

I think that meant "didn't crash".

https://www.rabbitmq.com/changelog.html suggests that the notification functionality was added in 3.2.0, which is earlier than the 3.5.7 we're running. It's possible that this support was improved in a later version? bug 1603633 should help discover that.

As it stands, I can't manage to trigger this issue to verify that these messages are being sent, so that's about all I can figure out.

I've marked this as depending on upgrading RabbitMQ on pulse, and hopefully it will magically work better after that.

Assignee: dustin → nobody
Mentor: dustin
Depends on: 1603633
Status: REOPENED → RESOLVED
Closed: 6 years ago → 4 years ago
Resolution: --- → FIXED