Closed Bug 946196 Opened 11 years ago Closed 10 years ago

self-serve agent on bm66 eating jobs

Product/Component: Release Engineering :: General
Type: defect
Platform: x86_64, Linux
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: catlee
Assignee: Unassigned
Attachments: 2 files

The self-serve agent on bm66 was taking messages and then failing to act on them because of SQLAlchemy exceptions (connection pool timeouts):

2013-12-04 05:05:54,385 rebuilding build by ryanvm@gmail.com of 33597270
2013-12-04 05:06:24,385 Error processing message
Traceback (most recent call last):
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 233, in receive_message
    retval = action_func(message_data, message)
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 292, in do_rebuild_build
    ), bid=bid).fetchone()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1787, in execute
    connection = self.contextual_connect(close_with_result=True)
  File "/builds/selfserve-agent/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1829, in contextual_connect
    self.pool.connect(),
  File "/builds/selfserve-agent/lib/python2.7/site-packages/sqlalchemy/pool.py", line 182, in connect
    return _ConnectionFairy(self).checkout()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/sqlalchemy/pool.py", line 369, in __init__
    rec = self._connection_record = pool.get()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/sqlalchemy/pool.py", line 213, in get
    return self.do_get()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/sqlalchemy/pool.py", line 722, in do_get
    (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
I killed the agent, and supervisor started it back up.
Any idea how to fix the root cause?
A simple way would be to catch this and then abort the process.
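A minimal sketch of that idea, not the attached patch: wrap the action call, and when the connection pool times out, log it and exit so that supervisor restarts the agent with a fresh pool. The names `PoolTimeout`, `fatal`, and `exit_func` are hypothetical; in the real agent the exception to catch would presumably be `sqlalchemy.exc.TimeoutError`, and the stand-in class is used here only so the sketch runs without SQLAlchemy installed.

```python
import logging
import sys

log = logging.getLogger("selfserve_agent_sketch")


class PoolTimeout(Exception):
    """Hypothetical stand-in for sqlalchemy.exc.TimeoutError."""


def receive_message(action_func, message_data, message,
                    fatal=(PoolTimeout,), exit_func=sys.exit):
    """Run one action; abort the process if the DB pool is exhausted.

    `fatal` and `exit_func` are assumptions added for illustration and
    testing; they are not part of the real selfserve agent.
    """
    try:
        return action_func(message_data, message)
    except fatal:
        # The pool is wedged, so later messages would fail the same way.
        # Exit and let the process supervisor start a clean agent.
        log.exception("DB connection pool exhausted; exiting for restart")
        exit_func(1)
    except Exception:
        # Any other failure is logged and the agent keeps running.
        log.exception("Error processing message")
        return None
```

Exiting on this one error class keeps a wedged agent from continuing to pull messages it cannot handle; it does not explain why the pool ran out of connections in the first place.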
[10:52] <rail> catlee: http://hg.mozilla.org/build/buildapi/file/b60098b51361/buildapi/scripts/selfserve-agent.py#l232 it fails here. it sounds like the messages shouldn't be acked in case of exception and shouldn't be dropped in this case. Am I following the code?
[10:54] <catlee> rail: yeah, I wonder if they get retried eventually...
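
The behaviour rail describes (ack only when the action succeeds) could look roughly like the sketch below. `FakeMessage`, `handle`, and the `requeue` call are hypothetical names for illustration; the agent's real message class and its ack semantics are not shown in this bug, so treat this as an assumption about the shape of the fix rather than the actual code.

```python
class FakeMessage(object):
    """Hypothetical message object with ack/requeue, loosely AMQP-like."""

    def __init__(self, body):
        self.body = body
        self.state = "pending"

    def ack(self):
        self.state = "acked"

    def requeue(self):
        self.state = "requeued"


def handle(message, action_func):
    """Ack only after the action succeeds; otherwise hand the message back."""
    try:
        result = action_func(message.body)
    except Exception:
        # Do not ack: return the message to the queue so it is not lost,
        # then let the caller decide whether to keep running or abort.
        message.requeue()
        raise
    message.ack()
    return result
```

With this ordering a failed rebuild request stays on the queue and can be retried by a healthy agent instead of being silently dropped.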


(In reply to Chris AtLee [:catlee] from comment #3)
> A simple way would be to catch this and then abort the process.

This will help with subsequent requests. Let's doo eet!
Attached patch: catch it!
Attachment #8342442 - Flags: review?(catlee)
Attachment #8342442 - Flags: review?(catlee) → review+
Attached patch: puppet
... once the tarball is synced
Attachment #8342531 - Flags: review?(catlee)
Attachment #8342531 - Flags: review?(catlee) → review+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General