Closed Bug 1197841 Opened 10 years ago Closed 10 years ago

investigate issues with stuck bbb tasks

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

Details

I first noticed this on buildbot-master82, which was alerting about the reflector not running. I went to compare to bm72, and noticed that its reflector service was continually restarting, eg: 2015-08-24 08:02:59,417 - bbb.runner - Running reflector service 2015-08-24 08:02:59,417 - bbb.services - Starting reflector 2015-08-24 08:02:59,444 - bbb.services - Processing task: 4N-0yTkmQzuNmq3zjLf7cw 2015-08-24 08:02:59,546 - bbb.services - BuildRequest 77842006 is done. BBListener should process it soon, reclaiming in the meantime 2015-08-24 08:02:59,547 - bbb.services - Processing task: sKklWUdZS1mircngEJBBzw 2015-08-24 08:02:59,652 - bbb.services - BuildRequest 78070838 is done. BBListener should process it soon, reclaiming in the meantime 2015-08-24 08:02:59,652 - bbb.services - Processing task: _dgDktc4R0GytOXdydnIpQ 2015-08-24 08:02:59,754 - bbb.services - BuildRequest 78092858 is done. BBListener should process it soon, reclaiming in the meantime 2015-08-24 08:02:59,755 - bbb.services - Processing task: JJM9oHrARx6fsTffPT99aA 2015-08-24 08:02:59,885 - bbb.services - BuildRequest 78109284 is done. BBListener should process it soon, reclaiming in the meantime 2015-08-24 08:02:59,885 - bbb.services - Processing task: A18IB6cARBGyKrV7w5YAMA 2015-08-24 08:02:59,990 - bbb.services - BuildRequest disappeared before starting, cancelling task 2015-08-24 08:03:00,009 - requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): queue.taskcluster.net 2015-08-24 08:03:00,608 - sqlalchemy.pool.QueuePool - Exception during reset or similar Traceback (most recent call last): File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 567, in _finalize_fairy fairy._reset(pool) File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 701, in _reset pool._dialect.do_rollback(self) File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2519, in do_rollback dbapi_connection.rollback() ProgrammingError: (2014, "Commands out of sync; you can't run this command now") 2015-08-24 08:03:04,472 - bbb.runner - Running reflector service 2015-08-24 08:03:04,472 - bbb.services - Starting reflector 2015-08-24 08:03:04,499 - bbb.services - Processing task: 4N-0yTkmQzuNmq3zjLf7cw 2015-08-24 08:03:06,311 - bbb.services - BuildRequest 77842006 is done. BBListener should process it soon, reclaiming in the meantime 2015-08-24 08:03:06,311 - bbb.services - Processing task: sKklWUdZS1mircngEJBBzw 2015-08-24 08:03:06,526 - bbb.services - BuildRequest 78070838 is done. BBListener should process it soon, reclaiming in the meantime 2015-08-24 08:03:06,526 - bbb.services - Processing task: _dgDktc4R0GytOXdydnIpQ 2015-08-24 08:03:06,884 - bbb.services - BuildRequest 78092858 is done. BBListener should process it soon, reclaiming in the meantime 2015-08-24 08:03:06,885 - bbb.services - Processing task: JJM9oHrARx6fsTffPT99aA 2015-08-24 08:03:07,009 - bbb.services - BuildRequest 78109284 is done. BBListener should process it soon, reclaiming in the meantime 2015-08-24 08:03:07,009 - bbb.services - Processing task: A18IB6cARBGyKrV7w5YAMA 2015-08-24 08:03:07,118 - bbb.services - BuildRequest disappeared before starting, cancelling task 2015-08-24 08:03:07,129 - requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): queue.taskcluster.net 2015-08-24 08:03:07,724 - sqlalchemy.pool.QueuePool - Exception during reset or similar Traceback (most recent call last): File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 567, in _finalize_fairy fairy._reset(pool) File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 701, in _reset pool._dialect.do_rollback(self) File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2519, in do_rollback dbapi_connection.rollback() ProgrammingError: (2014, "Commands out of sync; you can't run this command now") Over and over and over. Googling the exception makes it sound like there's an issue with some of the code not closing cursor before starting new ones (eg: http://stackoverflow.com/questions/11583083/python-commands-out-of-sync-you-cant-run-this-command-now). However, the only changes in the past couple of months haven't touched Reflector code at all, so that's quite strange. The Bridge is getting more use now so it may be something that has only come up now that we're getting more load.
After upping the log level I found this: 2015-08-24 08:24:59,886 - bbb.services - Processing task: A18IB6cARBGyKrV7w5YAMA 2015-08-24 08:24:59,982 - bbb.services - Task info: (78316076L, 'A18IB6cARBGyKrV7w5YAMA', 0L, 1440173702L, 1440173708L, None) 2015-08-24 08:24:59,982 - bbb.services - BuildRequest 78316076 is complete 2015-08-24 08:24:59,982 - bbb.services - BuildRequest disappeared before starting, cancelling task ... 2015-08-24 08:25:00,595 - taskcluster.utils - Received HTTP Status: 409 2015-08-24 08:25:00,596 - taskcluster.utils - Received HTTP Headers: {'content-length': '124', 'via': '1.1 vegur', 'x-powered-by': 'Express', 'server': 'Cowboy', 'access-control-request-method': '*', 'connection': 'keep-alive', 'date': 'Mon, 24 Aug 2015 15:25:00 GMT', 'access-control-allow-origin': '*', 'access-control-allow-methods': 'OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT', 'content-type': 'application/json; charset=utf-8', 'access-control-allow-headers': 'X-Requested-With,Content-Type,Authorization,Accept,Origin'} 2015-08-24 08:25:00,596 - taskcluster.utils - Received HTTP Payload: { "message": "Task can't be cancel task past it's deadline", "error": { "deadline": "2015-08-21T17:15:02.980Z" } } (limit 1024 char) 2015-08-24 08:25:00,596 - sqlalchemy.pool.QueuePool - Exception during reset or similar Traceback (most recent call last): File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 567, in _finalize_fairy fairy._reset(pool) File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 701, in _reset pool._dialect.do_rollback(self) File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2519, in do_rollback dbapi_connection.rollback() ProgrammingError: (2014, "Commands out of sync; you can't run this command now") Which happens near this code self.tc_queue.cancelTask(t.taskId) self.bbb_db.deleteBuildRequest(t.buildrequestId) And also these two things in stdout (_not_ the log, which is weird): Exception _mysql_exceptions.InterfaceError: (0, '') in <bound method SSCursor.__del__ of <MySQLdb.cursors.SSCursor object at 0x1830310>> ignored Traceback (most recent call last): File "/builds/bbb/bin/buildbot-bridge", line 9, in <module> load_entry_point('bbb==1.2', 'console_scripts', 'buildbot-bridge')() File "/builds/bbb/lib/python2.7/site-packages/bbb/runner.py", line 82, in main service.start() File "/builds/bbb/lib/python2.7/site-packages/bbb/services.py", line 268, in start self.reflectTasks() File "/builds/bbb/lib/python2.7/site-packages/bbb/services.py", line 297, in reflectTasks self.tc_queue.cancelTask(t.taskId) File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall return self._makeApiCall(e, *args, **kwargs) File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall return self._makeHttpRequest(entry['method'], route, payload) File "/builds/bbb/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest superExc=rerr taskcluster.exceptions.TaskclusterRestFailure: Task can't be cancel task past it's deadline Since it always seems to be happening righta fter the TaskclusterRestFailure, it makes me think something isn't cleaning up from that properly. The fact that that exception ends up in stdout may be indicative of it going uncaught, which may make the SQLAlchemy pool to close the cursor properly. However, the SSCursor exception happening prior to that Traceback makes me think otherwise. I need to poke a bit this some more still...
Digging further, I found that the tasks there's actually no truely active tasks right now, just 5 that are stuck in some way: +----------------+------------------------+----------------------------+------------------------------+ | buildrequestId | taskId | FROM_UNIXTIME(createdDate) | FROM_UNIXTIME(processedDate) | +----------------+------------------------+----------------------------+------------------------------+ | 77842006 | 4N-0yTkmQzuNmq3zjLf7cw | 2015-08-17 07:30:38 | 2015-08-17 07:30:49 | | 78070838 | sKklWUdZS1mircngEJBBzw | 2015-08-19 06:16:54 | 2015-08-19 06:16:58 | | 78092858 | _dgDktc4R0GytOXdydnIpQ | 2015-08-19 11:11:12 | 2015-08-19 11:11:17 | | 78109284 | JJM9oHrARx6fsTffPT99aA | 2015-08-19 13:58:43 | 2015-08-19 13:58:48 | | 78316076 | A18IB6cARBGyKrV7w5YAMA | 2015-08-21 09:15:02 | 2015-08-21 09:15:08 | +----------------+------------------------+----------------------------+------------------------------+ I know at least one of them was created by me, and then cancelled with the TC UI, so there's probably a bug around how we handle that. Perhaps it doesn't work correctly if the job has already started in buildbot. For now, I've deleted these tasks from the bbb db to unstick the processes.
Here's an analysis for each of the tasks: 4N-0yTkmQzuNmq3zjLf7cw (August 17th): * 7:30am - TCListener got a task-pending event, successfully created BuildRequest for it. ** 7;31:21am - Reflector ran and ignored the task because the Build hadn't started yet * 7:31:23am - BBListener got a build start event and claimed the Task * 7:32-7:36am - Reflector ran and did nothing, because the Build was still in progress, but the claim was still long enough not to reclaim * 7:37am - Reflector started noticing that the Build was done, and didn't reclaim it (although the log message said it did) * 7:51am - Taskcluster expired the current Run and started a new one; TCListener got a task-pending event and updated the runId * 8:05am - TCListener got a task-exception event because the second run was cancelled in TC and attempted to cancel the Build via buildapi. * 1:31pm - TCListener got another task-exception event, due to a Taskcluster bug: https://github.com/taskcluster/taskcluster-queue/pull/51 The root issue here was the "build finished" message never getting through because of bug 1195751. Even though that's fixed, there's a couple of things that the Bridge can do better: 1) Actually reclaim tasks where the Build is done but the build finished message hasn't been received. This would allow Buildbot more time to get the build finished message sent out, and prevent Taskcluster from starting a new run. 2) Have the Reflector forget about Tasks that are past their deadline. That would prevent the case where it spins on Tasks that are days old. 3) Enforce "product" being in payload, since it needs to be for the build finished event to get through (already done in https://github.com/mozilla/buildbot-bridge/pull/1) _dgDktc4R0GytOXdydnIpQ and JJM9oHrARx6fsTffPT99aA are the same as above, with a slighty different timeline. sKklWUdZS1mircngEJBBzw is the same as above, except there was only one run. A18IB6cARBGyKrV7w5YAMA (August 21nd): * 9:15am - TCListener got a task-pending event and created a BuildRequest * 12:53pm - BBListener tried to claim the Task but failed because it was past the deadline. * 12:54pm and onwards - BBListener continually tries to attach properties to the Task, fails. (Normally this would happen hours after the claim, but the Build failed rapidly). * 12:54pm and onwards - The Reflector continually tries to cancel the task, because it's seen that the BuildRequest disappeared before starting. So the root of the issue here is that the Task went past its deadline before the Build even started. A couple of things will help cope with this better: 1) Handle task-exception w/ reason set to deadline-exceeded. The Bridge should cancel the BuildRequest and/or Build in this case. This would prevent the case where BBListener is trying to claim a Task that is past its deadline, and the subsequent attaching of properties. 2) Fix "BuildRequest cancelled" logic in the Reflector. Right now, it considers a Task whose takenUntil is unset, and whose BuildRequest is complete, to be cancelled. This breaks down in a case like this where the Build completes before BBListener has had a chance to claim it. This should be adjusted to look for completed BuildRequests _without_ Builds. I'm going to file separate bugs for all of these issues.
OK, I've filed bugs for all of the issues I discovered. I'm going to scope this bug as an investigation bug, since the original summary will be fixed by one of the issues I found. (In reply to Ben Hearsum (:bhearsum) from comment #4) > 1) Actually reclaim tasks where the Build is done but the build finished > message hasn't been received. This would allow Buildbot more time to get the > build finished message sent out, and prevent Taskcluster from starting a new > run. https://bugzilla.mozilla.org/show_bug.cgi?id=1198238 > 2) Have the Reflector forget about Tasks that are past their deadline. That > would prevent the case where it spins on Tasks that are days old. > 1) Handle task-exception w/ reason set to deadline-exceeded. The Bridge > should cancel the BuildRequest and/or Build in this case. This would prevent > the case where BBListener is trying to claim a Task that is past its > deadline, and the subsequent attaching of properties. Both of these are https://bugzilla.mozilla.org/show_bug.cgi?id=1197855 > 2) Fix "BuildRequest cancelled" logic in the Reflector. Right now, it > considers a Task whose takenUntil is unset, and whose BuildRequest is > complete, to be cancelled. This breaks down in a case like this where the > Build completes before BBListener has had a chance to claim it. This should > be adjusted to look for completed BuildRequests _without_ Builds. https://bugzilla.mozilla.org/show_bug.cgi?id=1198240 I also filed https://bugzilla.mozilla.org/show_bug.cgi?id=1198242 and https://bugzilla.mozilla.org/show_bug.cgi?id=1198241, both of which would've made debugging easier.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Summary: buildbot bridge reflector services continually restarting → investigate issues with stuck bbb tasks
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.