Closed
Bug 951558
Opened 11 years ago
Closed 11 years ago
buildapi-web2 RabbitMQ queue is high
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ashish, Assigned: dustin)
References
Details
Attachments
(3 files)
1.25 KB,
patch
|
jhopkins
:
review+
|
Details | Diff | Splinter Review |
745 bytes,
patch
|
dustin
:
review+
|
Details | Diff | Splinter Review |
1.08 KB,
patch
|
jhopkins
:
review+
|
Details | Diff | Splinter Review |
00:14:13 < nagios-scl3> | (IRC) Wed 00:14:13 PST [5397] rabbit1.releng.webapp.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - messages CRITICAL (982) messages_ready CRITICAL (982), messages_unacknowledged OK (0) (http://m.allizom.org/Rabbit+Unread+Messages) 00:08:24 < nagios-scl3> | (IRC) Wed 00:08:24 PST [5395] rabbit2.releng.webapp.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - messages CRITICAL (965) messages_ready CRITICAL (965), messages_unacknowledged OK (0) (http://m.allizom.org/Rabbit+Unread+Messages) Opening a bug as per alert documentation. Unsure of severity, please adjust accordingly.
Reporter | ||
Comment 1•11 years ago
|
||
I was unable to find the build duty personnel but I shall mention this bug in #buildduty for better visibility.
Comment 2•11 years ago
|
||
Looks like fallout from bug 946334 to me. From buildbot-master{61,62,65,66,81}:/builds/selfserve-agent/agent.log the self-serve messages are still getting through, and it worked for me too, so downgrading this to major. Dustin and catlee (in SF this week) would know more about this.
Blocks: 946334
Severity: critical → major
Reporter | ||
Updated•11 years ago
|
Group: infra
Comment 3•11 years ago
|
||
Dustin said yesterday: in the old arrangement, we were using a single rabbit server. In the new, there are two behind a load balancer. As far as I can tell, Buildapi does all of the right things to make this work (durable messages, queues, and exchanges)
Assignee | ||
Updated•11 years ago
|
Assignee: nobody → dustin
Assignee | ||
Comment 4•11 years ago
|
||
There were no connections to the buildapi-web2 queue, which is what buildapi (not self-serve) consumes from. I restarted the buildapi daemon. Unfortunately, due to bug 806777, there's no logging that might indicate why it stopped.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 5•11 years ago
|
||
This started happening again: (IRC) Wed 15:58:39 PST [5543] rabbit1.releng.webapp.scl3.mozilla.com:Rabbit Unread Messages is WARNING: RABBITMQ_OVERVIEW WARNING - messages WARNING (306) messages_ready WARNING (306), messages_unacknowledged OK (0) (http://m.allizom.org/Rabbit+Unread+Messages)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 6•11 years ago
|
||
We do have buildapi logs this time, but nothing obviously wrong. This is the last message back from self-serve that gets processed: 2013-12-18 08:09:41,251 INFO [buildapi.lib.mq] [Dummy-1] Got {u'body': {u'msg': u'Ok', u'errors': False}, u'request_id': 426451} The next message out to self-serve is: 2013-12-18 08:42:51,599 INFO [buildapi.lib.mq] [worker 18] Sending {'action': 'rebuild_build', 'body': {'priority': 0, 'bid': 34405899, 'when': 1387384971.5606539, 'request_id': 426452}, 'who': 'ryanvm@gmail.com'} Maybe a 30min TCP timeout ? It's the first gap of that size size the restart in comment #4.
Assignee | ||
Comment 7•11 years ago
|
||
Indeed - nice find! [root@buildapi01 ~]# netstat -ap | grep rabbit tcp 0 0 buildapi01.build.scl1:56297 releng-rabbitmq-zlb.we:amqp ESTABLISHED 28855/python but [root@zlb2.ops.scl3 dmitchell]# netstat -ap | grep buildapi01 | grep -v mysql [root@zlb2.ops.scl3 dmitchell]# RabbitMQ should fix this by suggesting heartbeats every 580s to the client. I have no idea why that (9 minutes 40 seconds) is the default, but there you are. At any rate, amqplib ignores this value, even in the latest version, so that doesn't help. Hooray for crappy tools. I changed the idle connection timeout in zeus to 300s, which should initiate a regular TCP FIN sequence at 300s, well before any firewalls in the way decide to expire the state.
Assignee | ||
Comment 8•11 years ago
|
||
Maybe we should switch to py-amqp or librabbitmq.. I'll file a bug. I restarted buildapi again, so let's hope this is solved this time.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 9•11 years ago
|
||
..and it's back..
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 10•11 years ago
|
||
Same symptoms: no connection on the zlb, but a connection present on buildapi01. In the logs: Unhandled exception in thread started by <bound method LoggingJobRequestDoneConsumer.wait of <buildapi.lib.mq.LoggingJobRequestDoneConsumer object at 0x6572310>> Traceback (most recent call last): File "/home/buildapi/src/buildapi/lib/mq.py", line 105, in wait consumer.wait(limit) File "/home/buildapi/lib/python2.6/site-packages/carrot/messaging.py", line 446, in wait it.next() File "/home/buildapi/lib/python2.6/site-packages/carrot/backends/pyamqplib.py", line 300, in consume self.channel.wait() File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait self.channel_id, allowed_methods) File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method self.method_reader.read_method() File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/method_framing.py", line 215, in read_method raise m IOError: Socket closed So, yeah, amqplib.. you are NOT PRODUCTION READY.
Assignee | ||
Comment 11•11 years ago
|
||
The connection timeout is still 28800s. I'm not sure if my change didn't "take", or if it got changed back, but either way that's where it should be, lest the socket be closed after only 5m of idle.
Comment 12•11 years ago
|
||
Restarted buildapi in response to nagios alerting again.
Assignee | ||
Comment 13•11 years ago
|
||
Thanks! I suspect that 952266 is the fix here.
Assignee | ||
Comment 15•11 years ago
|
||
Restarted again. Sigh.
Assignee | ||
Comment 16•11 years ago
|
||
Ah, I think there's a fairly simple fix here - ReconnectingConsumer needs to handle IOError. I still hate amqplib and carrot :)
Assignee | ||
Comment 17•11 years ago
|
||
In my testing at home (without a rabbitmq server), I get IOError. With this patch, it keeps trying.
Attachment #8350650 -
Flags: review?(jhopkins)
Updated•11 years ago
|
Attachment #8350650 -
Flags: review?(jhopkins) → review+
Assignee | ||
Comment 18•11 years ago
|
||
That helped, but now: Unhandled exception in thread started by <bound method LoggingJobRequestDoneConsumer.wait of <buildapi.lib.mq.LoggingJobRequestDoneConsumer object at 0x351c290>> Traceback (most recent call last): File "/home/buildapi/src/buildapi/lib/mq.py", line 108, in wait self.disconnect() File "/home/buildapi/src/buildapi/lib/mq.py", line 88, in disconnect self.connection.close() File "/home/buildapi/lib/python2.6/site-packages/carrot/connection.py", line 180, in close backend.close_connection(self._connection) File "/home/buildapi/lib/python2.6/site-packages/carrot/backends/pyamqplib.py", line 212, in close_connection connection.close() File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 301, in close (10, 61), # Connection.close_ok File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait self.channel_id, allowed_methods) File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method self.method_reader.read_method() File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/method_framing.py", line 215, in read_method raise m IOError: Socket closed
Assignee | ||
Comment 19•11 years ago
|
||
Assignee | ||
Comment 20•11 years ago
|
||
Comment on attachment 8350664 [details] [diff] [review] bug951558-p2.patch from jhopkins in irc. This seems more durable. I'm also going to bring the ZLB timeout back down to 5m, so we can avoid firewall TCP session timeouts.
Attachment #8350664 -
Flags: review+
Assignee | ||
Comment 21•11 years ago
|
||
2013-12-20 09:17:11,005 INFO [buildapi.lib.mq] [Dummy-1] Lost connection, trying again in 1s Traceback (most recent call last): File "/home/buildapi/src/buildapi/lib/mq.py", line 108, in wait consumer.wait(limit) File "/home/buildapi/lib/python2.6/site-packages/carrot/messaging.py", line 446, in wait it.next() File "/home/buildapi/lib/python2.6/site-packages/carrot/backends/pyamqplib.py", line 300, in consume self.channel.wait() File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait self.channel_id, allowed_methods) File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method self.method_reader.read_method() File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/method_framing.py", line 215, in read_method raise m IOError: Socket closed 2013-12-20 09:17:11,006 INFO [buildapi.lib.mq] [Dummy-1] error while disconnecting; ignored Traceback (most recent call last): File "/home/buildapi/src/buildapi/lib/mq.py", line 89, in disconnect self.connection.close() File "/home/buildapi/lib/python2.6/site-packages/carrot/connection.py", line 180, in close backend.close_connection(self._connection) File "/home/buildapi/lib/python2.6/site-packages/carrot/backends/pyamqplib.py", line 212, in close_connection connection.close() File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 301, in close (10, 61), # Connection.close_ok File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait self.channel_id, allowed_methods) File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method self.method_reader.read_method() File "/home/buildapi/lib/python2.6/site-packages/amqplib/client_0_8/method_framing.py", line 215, in read_method raise m IOError: Socket closed .. promising!
Assignee | ||
Comment 22•11 years ago
|
||
..deploy to selfserve
Attachment #8350681 -
Flags: review?(jhopkins)
Updated•11 years ago
|
Attachment #8350681 -
Flags: review?(jhopkins) → review+
Assignee | ||
Comment 23•11 years ago
|
||
This is causing selfserve to exit every 5m, since it doesn't use the ReconnectingConsumer class (?). However, it seems stable this way, so let's leave well enough alone.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•