Closed Bug 946196 Opened 12 years ago Closed 11 years ago

self-serve agent on bm66 eating jobs

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Unassigned)

Details

Attachments

(2 files)

the self-serve agent on bm66 was taking messages and failing to operate on them due to sql exceptions:
2013-12-04 05:05:54,385 rebuilding build by ryanvm@gmail.com of 33597270
2013-12-04 05:06:24,385 Error processing message
Traceback (most recent call last):
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 233, in receive_message
    retval = action_func(message_data, message)
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 292, in do_rebuild_build
    ), bid=bid).fetchone()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1787, in execute
    connection = self.contextual_connect(close_with_result=True)
  File "/builds/selfserve-agent/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1829, in contextual_connect
    self.pool.connect(),
  File "/builds/selfserve-agent/lib/python2.7/site-packages/sqlalchemy/pool.py", line 182, in connect
    return _ConnectionFairy(self).checkout()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/sqlalchemy/pool.py", line 369, in __init__
    rec = self._connection_record = pool.get()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/sqlalchemy/pool.py", line 213, in get
    return self.do_get()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/sqlalchemy/pool.py", line 722, in do_get
    (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
I killed the agent, and supervisor started it back up.
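For readers unfamiliar with the error text: "size 5 overflow 10 ... timeout 30" means that all 15 connections the pool may hand out were checked out and the caller gave up after waiting 30 seconds, which matches the 30-second gap between the two log timestamps above. The sketch below is a toy model of that behaviour built on the standard library only. `ToyPool` and `ToyPoolTimeout` are hypothetical names, this is not SQLAlchemy's implementation, and it says nothing about why the agent's connections were not being returned.

```python
import queue


class ToyPoolTimeout(Exception):
    """Hypothetical stand-in for the pool timeout seen in the traceback."""


class ToyPool:
    """Toy bounded pool: at most size + max_overflow connections checked out."""

    def __init__(self, size=5, max_overflow=10, timeout=30):
        self.size = size
        self.max_overflow = max_overflow
        self.timeout = timeout
        # One token per connection the pool is allowed to hand out.
        self._slots = queue.Queue()
        for n in range(size + max_overflow):
            self._slots.put(n)

    def get(self):
        try:
            # Block until a connection is returned or the timeout expires.
            return self._slots.get(timeout=self.timeout)
        except queue.Empty:
            raise ToyPoolTimeout(
                "limit of size %d overflow %d reached, timeout %s"
                % (self.size, self.max_overflow, self.timeout))

    def put(self, conn):
        # Returning a connection frees a slot for the next caller.
        self._slots.put(conn)


# Short timeout so the demonstration does not wait 30 seconds.
pool = ToyPool(size=5, max_overflow=10, timeout=0.1)
held = [pool.get() for _ in range(15)]  # every slot is now checked out
try:
    pool.get()
    exhausted = False
except ToyPoolTimeout:
    exhausted = True
```

Once any held connection is handed back with `pool.put(...)`, the next `pool.get()` succeeds again, so in this model the error only persists while nothing is ever returned.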
Any idea how to fix the root cause?
A simple way would be to catch this and then abort the process.
[10:52] <rail> catlee: http://hg.mozilla.org/build/buildapi/file/b60098b51361/buildapi/scripts/selfserve-agent.py#l232 it fails here. it sounds like the messages shouldn't be acked in case of exception and shouldn't be dropped in this case. Am I following the code?
[10:54] <catlee> rail: yeah, I wonder if they get retried eventually...
(In reply to Chris AtLee [:catlee] from comment #3)
> A simple way would be to catch this and then abort the process.
This will help with the following requests. Let's doo eet!
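A minimal sketch of the "catch this and then abort the process" idea, for illustration only. It is not the attached patch and not the real `selfserve_agent.py` code: `PoolTimeout`, the `ack` callback, and the `abort` parameter are hypothetical names introduced here so the example runs on its own. It assumes a process supervisor restarts the agent after a non-zero exit, as happened when the agent was killed by hand. Whether an unacked message is actually redelivered is the open question raised in the IRC exchange above, and the sketch does not settle it.

```python
import sys


class PoolTimeout(Exception):
    """Hypothetical stand-in for the pool TimeoutError in the traceback."""


def receive_message(action_func, message_data, message, ack, abort=sys.exit):
    """Run one action; on a pool timeout, abort without acking the message."""
    try:
        retval = action_func(message_data, message)
    except PoolTimeout:
        # Do not ack here, so the message is not silently consumed.
        # Exit non-zero so the process supervisor restarts the agent.
        abort(1)
        return None
    # Only acknowledge messages whose action completed.
    ack(message)
    return retval


def failing_action(message_data, message):
    # Simulates the database pool being exhausted while handling a request.
    raise PoolTimeout("QueuePool limit reached")


# Record acks and exit codes in lists instead of really exiting.
acked, exit_codes = [], []
receive_message(failing_action, {}, "msg-1", acked.append, exit_codes.append)
```

With the default `abort=sys.exit` the process ends at the first pool timeout, so the restarted agent begins with a fresh connection pool and later requests are not swallowed by a wedged process.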
Attached patch: catch it!
Attachment #8342442 - Flags: review?(catlee)
Attachment #8342442 - Flags: review?(catlee) → review+
Attached patch: puppet
... once the tarball is synced
Attachment #8342531 - Flags: review?(catlee)
Attachment #8342531 - Flags: review?(catlee) → review+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: General Automation → General