Closed Bug 1250184 Opened 8 years ago Closed 8 years ago

add Buildbot Bridge Reflector exception handling

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: rail)

References

Details

Attachments

(1 file)

This bug should be used to harden the reflector of buildbotbridge, but the following is a documentation of what went wrong.
When I logged in to see why my code wasn't running in buildbot I saw the tclistener see it and claim it, then I saw the reflector log with

[cltbld@buildbot-master84.bb.releng.scl3.mozilla.com bbb]$ cat reflector.log | grep -U50 Exception | head -n50
2016-02-21 03:45:21,656 - bbb.runner - Running reflector service
2016-02-21 03:45:21,656 - bbb.services - Starting reflector
2016-02-21 03:45:21,659 - bbb.services - Processing task: EUj1rLkNT12T4wPPT07BUA
2016-02-21 03:45:21,689 - bbb.services - BuildRequest disappeared before starting, cancelling task
2016-02-21 03:45:21,708 - requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): queue.taskcluster.net
2016-02-21 03:45:22,345 - sqlalchemy.pool.QueuePool - Exception during reset or similar
Traceback (most recent call last):
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 567, in _finalize_fairy
    fairy._reset(pool)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 701, in _reset
    pool._dialect.do_rollback(self)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2519, in do_rollback
    dbapi_connection.rollback()
ProgrammingError: (2014, "Commands out of sync; you can't run this command now")

On restarting it via supervisorctl I saw this spewing over and over again in the log:

2016-02-22 07:41:07,077 - bbb.runner - Running reflector service
2016-02-22 07:41:07,077 - bbb.services - Starting reflector
2016-02-22 07:41:07,080 - bbb.services - Processing task: EUj1rLkNT12T4wPPT07BUA
2016-02-22 07:41:07,113 - bbb.services - BuildRequest disappeared before starting, cancelling task
2016-02-22 07:41:07,134 - requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): queue.taskcluster.net
2016-02-22 07:41:07,688 - sqlalchemy.pool.QueuePool - Exception during reset or similar
Traceback (most recent call last):
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 567, in _finalize_fairy
    fairy._reset(pool)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 701, in _reset
    pool._dialect.do_rollback(self)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2519, in do_rollback
    dbapi_connection.rollback()
ProgrammingError: (2014, "Commands out of sync; you can't run this command now")


in the supervisor reflector log we had lines like this (I'm only copying the last one, for my own sanity, all were like this though)

[root@buildbot-master84.bb.releng.scl3.mozilla.com ~]# tail -n300 /var/log/supervisor/reflector.log
#....
Exception _mysql_exceptions.InterfaceError: (0, '') in <bound method SSCursor.__del__ of <MySQLdb.cursors.SSCursor object at 0
x23f94d0>> ignored
Traceback (most recent call last):
  File "/builds/bbb/bin/buildbot-bridge", line 9, in <module>
    load_entry_point('bbb==1.5.4', 'console_scripts', 'buildbot-bridge')()
  File "/builds/bbb/lib/python2.7/site-packages/bbb/runner.py", line 85, in main
    service.start()
  File "/builds/bbb/lib/python2.7/site-packages/bbb/services.py", line 272, in start
    self.reflectTasks()
  File "/builds/bbb/lib/python2.7/site-packages/bbb/services.py", line 301, in reflectTasks
    self.tc_queue.cancelTask(t.taskId)
  File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall
    return self._makeApiCall(e, *args, **kwargs)
  File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall
    return self._makeHttpRequest(entry['method'], route, payload)
  File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest
    superExc=rerr
taskcluster.exceptions.TaskclusterRestFailure: Task SUK9Dvw1TD6att21p-HbxA Can't be cancelled past it's deadline of 2016-02-19T23:05:28.784Z.
----
errorCode:  RequestConflict
statusCode: 409
statusCode: 409
requestInfo:
  method:   cancelTask
  params:   {"taskId":"SUK9Dvw1TD6att21p-HbxA"}
  payload:  {}
  time:     2016-02-22T15:59:26.233Z
details:
{
  "taskId": "SUK9Dvw1TD6att21p-HbxA",
  "deadline": "2016-02-19T23:05:28.784Z"
}

Which was also spewing out of control.

Via https://github.com/mozilla/buildbot-bridge/blob/2c38c205635ba018598f412e288f7b10a3201100/bbb/services.py#L276 I noticed that this is a TODO'd error condition :(

So after a bit of understanding I logged into the bbb [staging] mysql db and deleted the offending task.  (I also saw more failing tasks as we went and deleted those too)

mysql> select * from tasks where taskId LIKE "EUj1rLkNT12T4wPPT07BUA";
+----------------+------------------------+-------+-------------+---------------+------------+
| buildrequestId | taskId                 | runId | createdDate | processedDate | takenUntil |
+----------------+------------------------+-------+-------------+---------------+------------+
|       98479944 | EUj1rLkNT12T4wPPT07BUA |     0 |  1455909718 |    1455909722 |       NULL |
+----------------+------------------------+-------+-------------+---------------+------------+
1 row in set (0.00 sec)

mysql> delete from tasks where taskId LIKE "EUj1rLkNT12T4wPPT07BUA";
Query OK, 1 row affected (0.00 sec)

mysql> select * from tasks where taskId LIKE "fM1xuI5jSI2KIElyw-pB8g";
+----------------+------------------------+-------+-------------+---------------+------------+
| buildrequestId | taskId                 | runId | createdDate | processedDate | takenUntil |
+----------------+------------------------+-------+-------------+---------------+------------+
|       98496696 | fM1xuI5jSI2KIElyw-pB8g |     0 |  1455918584 |    1455918587 |       NULL |
+----------------+------------------------+-------+-------------+---------------+------------+
1 row in set (0.01 sec)

mysql> delete from tasks where taskId LIKE "fM1xuI5jSI2KIElyw-pB8g";
Query OK, 1 row affected (0.00 sec)

mysql> select * from tasks where taskId LIKE "SUK9Dvw1TD6att21p-HbxA";
+----------------+------------------------+-------+-------------+---------------+------------+
| buildrequestId | taskId                 | runId | createdDate | processedDate | takenUntil |
+----------------+------------------------+-------+-------------+---------------+------------+
|       98498362 | SUK9Dvw1TD6att21p-HbxA |     0 |  1455919528 |    1455919532 |       NULL |
+----------------+------------------------+-------+-------------+---------------+------------+
1 row in set (0.00 sec)

mysql> delete from tasks where taskId LIKE "SUK9Dvw1TD6att21p-HbxA";
Query OK, 1 row affected (0.00 sec)

===========

I also then saw that the bblistener was not up, last log message related to watching pulse messages on the 21'st I restarted it too.
this bug should track better exception handling as suggested in comment 1.

starting with this block https://github.com/mozilla/buildbot-bridge/blob/2c38c205635ba018598f412e288f7b10a3201100/bbb/services.py#L275
Summary: Buildbot Bridge Reflector was down, unable to proceed after restarted → add Buildbot Bridge Reflector exception handling
Assignee: nobody → rail
Blocks: 1252062
Depends on: 1252682
The easiest way to see the actual changes (without whitespace changes) would be pulling the change and running `git show -b`.
Attachment #8725488 - Flags: review?(bhearsum)
Attachment #8725488 - Flags: review?(bhearsum) → review+
Attachment #8725488 - Flags: checked-in+
Comment on attachment 8725488 [details] [review]
handle bbb exceptions harder

deployed on dev bbb instance: https://hg.mozilla.org/build/puppet/rev/b785ca55e9d9

bbb services restarted
Jordan, can you see if the dev instance behaves ok (I cancelled tons of tasks today in TC). If everything is ok, can you deploy it to production?
Flags: needinfo?(jlund)
logs looked on dev. I added this to prod instances and restarted them.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(jlund)
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: