Closed Bug 951558 Opened 11 years ago Closed 11 years ago

buildapi-web2 RabbitMQ queue is high

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ashish, Assigned: dustin)

References

Details

Attachments

(3 files)

00:14:13 < nagios-scl3> | (IRC) Wed 00:14:13 PST [5397] rabbit1.releng.webapp.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - messages CRITICAL (982) messages_ready CRITICAL (982), messages_unacknowledged OK (0) (http://m.allizom.org/Rabbit+Unread+Messages)
00:08:24 < nagios-scl3> | (IRC) Wed 00:08:24 PST [5395] rabbit2.releng.webapp.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - messages CRITICAL (965) messages_ready CRITICAL (965), messages_unacknowledged OK (0) (http://m.allizom.org/Rabbit+Unread+Messages)

Opening a bug as per alert documentation. Unsure of severity, please adjust accordingly.
I was unable to find the build duty personnel but I shall mention this bug in #buildduty for better visibility.
Looks like fallout from bug 946334 to me. From buildbot-master{61,62,65,66,81}:/builds/selfserve-agent/agent.log the self-serve messages are still getting through, and it worked for me too, so downgrading this to major.

Dustin and catlee (in SF this week) would know more about this.
Blocks: 946334
Severity: critical → major
Group: infra
Dustin said yesterday: in the old arrangement, we were using a single rabbit server.  In the new, there are two behind a load balancer.  As far as I can tell, Buildapi does all of the right things to make this work (durable messages, queues, and exchanges)
Assignee: nobody → dustin
There were no connections to the buildapi-web2 queue, which is what buildapi (not self-serve) consumes from.  I restarted the buildapi daemon.  Unfortunately, due to bug 806777, there's no logging that might indicate why it stopped.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
This started happening again:

(IRC) Wed 15:58:39 PST [5543] rabbit1.releng.webapp.scl3.mozilla.com:Rabbit Unread Messages is WARNING: RABBITMQ_OVERVIEW WARNING - messages WARNING (306) messages_ready WARNING (306), messages_unacknowledged OK (0) (http://m.allizom.org/Rabbit+Unread+Messages)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
We do have buildapi logs this time, but nothing obviously wrong. This is the last message back from self-serve that gets processed:

2013-12-18 08:09:41,251 INFO  [buildapi.lib.mq] [Dummy-1] Got {u'body': {u'msg': u'Ok', u'errors': False}, u'request_id': 426451}

The next message out to self-serve is:
2013-12-18 08:42:51,599 INFO  [buildapi.lib.mq] [worker 18] Sending {'action': 'rebuild_build', 'body': {'priority': 0, 'bid': 34405899, 'when': 1387384971.5606539, 'request_id': 426452}, 'who': 'ryanvm@gmail.com'}

Maybe a 30min TCP timeout ? It's the first gap of that size size the restart in comment #4.
Indeed - nice find!

[root@buildapi01 ~]# netstat -ap | grep rabbit
tcp        0      0 buildapi01.build.scl1:56297 releng-rabbitmq-zlb.we:amqp ESTABLISHED 28855/python

but

[root@zlb2.ops.scl3 dmitchell]# netstat -ap | grep buildapi01 | grep -v mysql
[root@zlb2.ops.scl3 dmitchell]#


RabbitMQ should fix this by suggesting heartbeats every 580s to the client.  I have no idea why that (9 minutes 40 seconds) is the default, but there you are.  At any rate, amqplib ignores this value, even in the latest version, so that doesn't help.  Hooray for crappy tools.

I changed the idle connection timeout in zeus to 300s, which should initiate a regular TCP FIN sequence at 300s, well before any firewalls in the way decide to expire the state.
Maybe we should switch to py-amqp or librabbitmq.. I'll file a bug.

I restarted buildapi again, so let's hope this is solved this time.
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
..and it's back..
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Same symptoms: no connection on the zlb, but a connection present on buildapi01.

In the logs:

Unhandled exception in thread started by <bound method LoggingJobRequestDoneConsumer.wait of <buildapi.lib.mq.LoggingJobRequestDoneConsumer object at 0x6572310>>
Traceback (most recent call last):
  File "/home/buildapi/src/buildapi/lib/mq.py", line 105, in wait
    consumer.wait(limit)
  File "/home/buildapi/lib/python2.6/site-packages/carrot/messaging.py", line 446, in wait
    it.next()
  File "/home/buildapi/lib/python2.6/site-packages/carrot/backends/pyamqplib.py", line 300, in consume
    self.channel.wait()
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait
    self.channel_id, allowed_methods)
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method
    self.method_reader.read_method()
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/method_framing.py", line 215, in read_method
    raise m
IOError: Socket closed

So, yeah, amqplib.. you are NOT PRODUCTION READY.
Depends on: 952266
The connection timeout is still 28800s.  I'm not sure if my change didn't "take", or if it got changed back, but either way that's where it should be, lest the socket be closed after only 5m of idle.
Restarted buildapi in response to nagios alerting again.
Thanks!  I suspect that 952266 is the fix here.
Restarted again.  Sigh.
Ah, I think there's a fairly simple fix here - ReconnectingConsumer needs to handle IOError.  I still hate amqplib and carrot :)
Attached patch bug951558.patchSplinter Review
In my testing at home (without a rabbitmq server), I get IOError.  With this patch, it keeps trying.
Attachment #8350650 - Flags: review?(jhopkins)
Attachment #8350650 - Flags: review?(jhopkins) → review+
That helped, but now:

Unhandled exception in thread started by <bound method LoggingJobRequestDoneConsumer.wait of <buildapi.lib.mq.LoggingJobRequestDoneConsumer object at 0x351c290>>
Traceback (most recent call last):
  File "/home/buildapi/src/buildapi/lib/mq.py", line 108, in wait
    self.disconnect()
  File "/home/buildapi/src/buildapi/lib/mq.py", line 88, in disconnect
    self.connection.close()
  File "/home/buildapi/lib/python2.6/site-packages/carrot/connection.py", line 180, in close
    backend.close_connection(self._connection)
  File "/home/buildapi/lib/python2.6/site-packages/carrot/backends/pyamqplib.py", line 212, in close_connection
    connection.close()
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 301, in close
    (10, 61),    # Connection.close_ok
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait
    self.channel_id, allowed_methods)
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method
    self.method_reader.read_method()
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/method_framing.py", line 215, in read_method
    raise m
IOError: Socket closed
Comment on attachment 8350664 [details] [diff] [review]
bug951558-p2.patch

from jhopkins in irc.

This seems more durable.  I'm also going to bring the ZLB timeout back down to 5m, so we can avoid firewall TCP session timeouts.
Attachment #8350664 - Flags: review+
2013-12-20 09:17:11,005 INFO  [buildapi.lib.mq] [Dummy-1] Lost connection, trying again in 1s
Traceback (most recent call last):
  File "/home/buildapi/src/buildapi/lib/mq.py", line 108, in wait
    consumer.wait(limit)
  File "/home/buildapi/lib/python2.6/site-packages/carrot/messaging.py", line 446, in wait
    it.next()
  File "/home/buildapi/lib/python2.6/site-packages/carrot/backends/pyamqplib.py", line 300, in consume
    self.channel.wait()
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait
    self.channel_id, allowed_methods)
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method
    self.method_reader.read_method()
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/method_framing.py", line 215, in read_method
    raise m
IOError: Socket closed
2013-12-20 09:17:11,006 INFO  [buildapi.lib.mq] [Dummy-1] error while disconnecting; ignored
Traceback (most recent call last):
  File "/home/buildapi/src/buildapi/lib/mq.py", line 89, in disconnect
    self.connection.close()
  File "/home/buildapi/lib/python2.6/site-packages/carrot/connection.py", line 180, in close
    backend.close_connection(self._connection)
  File "/home/buildapi/lib/python2.6/site-packages/carrot/backends/pyamqplib.py", line 212, in close_connection
    connection.close()
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 301, in close
    (10, 61),    # Connection.close_ok
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait
    self.channel_id, allowed_methods)
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method
    self.method_reader.read_method()
  File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/method_framing.py", line 215, in read_method
    raise m
IOError: Socket closed


.. promising!
No longer depends on: 952266
..deploy to selfserve
Attachment #8350681 - Flags: review?(jhopkins)
Attachment #8350681 - Flags: review?(jhopkins) → review+
This is causing selfserve to exit every 5m, since it doesn't use the ReconnectingConsumer class (?).  However, it seems stable this way, so let's leave well enough alone.
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: