Closed
Bug 1250184
Opened 8 years ago
Closed 8 years ago
add Buildbot Bridge Reflector exception handling
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Callek, Assigned: rail)
References
Details
Attachments
(1 file)
50 bytes,
text/x-github-pull-request
|
bhearsum
:
review+
rail
:
checked-in+
|
Details | Review |
This bug should be used to harden the reflector of buildbotbridge, but the following is a documentation of what went wrong.
Reporter | ||
Comment 1•8 years ago
|
||
When I logged in to see why my code wasn't running in buildbot I saw the tclistener see it and claim it, then I saw the reflector log with [cltbld@buildbot-master84.bb.releng.scl3.mozilla.com bbb]$ cat reflector.log | grep -U50 Exception | head -n50 2016-02-21 03:45:21,656 - bbb.runner - Running reflector service 2016-02-21 03:45:21,656 - bbb.services - Starting reflector 2016-02-21 03:45:21,659 - bbb.services - Processing task: EUj1rLkNT12T4wPPT07BUA 2016-02-21 03:45:21,689 - bbb.services - BuildRequest disappeared before starting, cancelling task 2016-02-21 03:45:21,708 - requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): queue.taskcluster.net 2016-02-21 03:45:22,345 - sqlalchemy.pool.QueuePool - Exception during reset or similar Traceback (most recent call last): File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 567, in _finalize_fairy fairy._reset(pool) File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 701, in _reset pool._dialect.do_rollback(self) File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2519, in do_rollback dbapi_connection.rollback() ProgrammingError: (2014, "Commands out of sync; you can't run this command now") On restarting it via supervisorctl I saw this spewing over and over again in the log: 2016-02-22 07:41:07,077 - bbb.runner - Running reflector service 2016-02-22 07:41:07,077 - bbb.services - Starting reflector 2016-02-22 07:41:07,080 - bbb.services - Processing task: EUj1rLkNT12T4wPPT07BUA 2016-02-22 07:41:07,113 - bbb.services - BuildRequest disappeared before starting, cancelling task 2016-02-22 07:41:07,134 - requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): queue.taskcluster.net 2016-02-22 07:41:07,688 - sqlalchemy.pool.QueuePool - Exception during reset or similar Traceback (most recent call last): File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 567, in _finalize_fairy fairy._reset(pool) File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 701, in _reset pool._dialect.do_rollback(self) File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2519, in do_rollback dbapi_connection.rollback() ProgrammingError: (2014, "Commands out of sync; you can't run this command now") in the supervisor reflector log we had lines like this (I'm only copying the last one, for my own sanity, all were like this though) [root@buildbot-master84.bb.releng.scl3.mozilla.com ~]# tail -n300 /var/log/supervisor/reflector.log #.... Exception _mysql_exceptions.InterfaceError: (0, '') in <bound method SSCursor.__del__ of <MySQLdb.cursors.SSCursor object at 0 x23f94d0>> ignored Traceback (most recent call last): File "/builds/bbb/bin/buildbot-bridge", line 9, in <module> load_entry_point('bbb==1.5.4', 'console_scripts', 'buildbot-bridge')() File "/builds/bbb/lib/python2.7/site-packages/bbb/runner.py", line 85, in main service.start() File "/builds/bbb/lib/python2.7/site-packages/bbb/services.py", line 272, in start self.reflectTasks() File "/builds/bbb/lib/python2.7/site-packages/bbb/services.py", line 301, in reflectTasks self.tc_queue.cancelTask(t.taskId) File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall return self._makeApiCall(e, *args, **kwargs) File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall return self._makeHttpRequest(entry['method'], route, payload) File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest superExc=rerr taskcluster.exceptions.TaskclusterRestFailure: Task SUK9Dvw1TD6att21p-HbxA Can't be cancelled past it's deadline of 2016-02-19T23:05:28.784Z. ---- errorCode: RequestConflict statusCode: 409 statusCode: 409 requestInfo: method: cancelTask params: {"taskId":"SUK9Dvw1TD6att21p-HbxA"} payload: {} time: 2016-02-22T15:59:26.233Z details: { "taskId": "SUK9Dvw1TD6att21p-HbxA", "deadline": "2016-02-19T23:05:28.784Z" } Which was also spewing out of control. Via https://github.com/mozilla/buildbot-bridge/blob/2c38c205635ba018598f412e288f7b10a3201100/bbb/services.py#L276 I noticed that this is a TODO'd error condition :( So after a bit of understanding I logged into the bbb [staging] mysql db and deleted the offending task. (I also saw more failing tasks as we went and deleted those too) mysql> select * from tasks where taskId LIKE "EUj1rLkNT12T4wPPT07BUA"; +----------------+------------------------+-------+-------------+---------------+------------+ | buildrequestId | taskId | runId | createdDate | processedDate | takenUntil | +----------------+------------------------+-------+-------------+---------------+------------+ | 98479944 | EUj1rLkNT12T4wPPT07BUA | 0 | 1455909718 | 1455909722 | NULL | +----------------+------------------------+-------+-------------+---------------+------------+ 1 row in set (0.00 sec) mysql> delete from tasks where taskId LIKE "EUj1rLkNT12T4wPPT07BUA"; Query OK, 1 row affected (0.00 sec) mysql> select * from tasks where taskId LIKE "fM1xuI5jSI2KIElyw-pB8g"; +----------------+------------------------+-------+-------------+---------------+------------+ | buildrequestId | taskId | runId | createdDate | processedDate | takenUntil | +----------------+------------------------+-------+-------------+---------------+------------+ | 98496696 | fM1xuI5jSI2KIElyw-pB8g | 0 | 1455918584 | 1455918587 | NULL | +----------------+------------------------+-------+-------------+---------------+------------+ 1 row in set (0.01 sec) mysql> delete from tasks where taskId LIKE "fM1xuI5jSI2KIElyw-pB8g"; Query OK, 1 row affected (0.00 sec) mysql> select * from tasks where taskId LIKE "SUK9Dvw1TD6att21p-HbxA"; +----------------+------------------------+-------+-------------+---------------+------------+ | buildrequestId | taskId | runId | createdDate | processedDate | takenUntil | +----------------+------------------------+-------+-------------+---------------+------------+ | 98498362 | SUK9Dvw1TD6att21p-HbxA | 0 | 1455919528 | 1455919532 | NULL | +----------------+------------------------+-------+-------------+---------------+------------+ 1 row in set (0.00 sec) mysql> delete from tasks where taskId LIKE "SUK9Dvw1TD6att21p-HbxA"; Query OK, 1 row affected (0.00 sec) =========== I also then saw that the bblistener was not up, last log message related to watching pulse messages on the 21'st I restarted it too.
Comment 2•8 years ago
|
||
this bug should track better exception handling as suggested in comment 1. starting with this block https://github.com/mozilla/buildbot-bridge/blob/2c38c205635ba018598f412e288f7b10a3201100/bbb/services.py#L275
Summary: Buildbot Bridge Reflector was down, unable to proceed after restarted → add Buildbot Bridge Reflector exception handling
Updated•8 years ago
|
Assignee: nobody → rail
Assignee | ||
Comment 3•8 years ago
|
||
The easiest way to see the actual changes (without whitespace changes) would be pulling the change and running `git show -b`.
Attachment #8725488 -
Flags: review?(bhearsum)
Updated•8 years ago
|
Attachment #8725488 -
Flags: review?(bhearsum) → review+
Assignee | ||
Updated•8 years ago
|
Attachment #8725488 -
Flags: checked-in+
Comment 4•8 years ago
|
||
Comment on attachment 8725488 [details] [review] handle bbb exceptions harder deployed on dev bbb instance: https://hg.mozilla.org/build/puppet/rev/b785ca55e9d9 bbb services restarted
Assignee | ||
Comment 5•8 years ago
|
||
Jordan, can you see if the dev instance behaves ok (I cancelled tons of tasks today in TC). If everything is ok, can you deploy it to production?
Flags: needinfo?(jlund)
Comment 6•8 years ago
|
||
logs looked on dev. I added this to prod instances and restarted them.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(jlund)
Resolution: --- → FIXED
Updated•6 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•