Closed Bug 869447 Opened 11 years ago Closed 8 years ago

zombie jobs when sql query fails

Product/Component: Release Engineering :: General
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED INCOMPLETE
Reporter: catlee
Assignee: Unassigned
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2075] [buildbot]

The report at http://cruncher.build.mozilla.org/~catlee/reportor/2013-05-07:14/long_jobs/long_jobs.html lists jobs that have been running for more than 12 hours.

Many of these jobs have actually completed. However, the update to the DB failed:
2013-05-06 11:20:56-0700 [Broker,21007,10.26.56.167]  <Build Ubuntu HW 12.04 birch pgo talos chromez>: build finished
2013-05-06 11:21:03-0700 [Broker,21007,10.26.56.167]  setting expectations for next time
2013-05-06 11:21:03-0700 [Broker,21007,10.26.56.167] new expectations: 518.070086718 seconds
2013-05-06 11:21:04-0700 [Broker,21007,10.26.56.167] rollback failed, will reconnect next query 
2013-05-06 11:21:04-0700 [Broker,21007,10.26.56.167] Unhandled Error 
    Traceback (most recent call last):
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
        self.result = callback(self.result, *args, **kw) 
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_bccbfc2a314f_production_0.8-py2.7.egg/buildbot/process/builder.py", line 934, in buildFinished
        self.db.builds_finished(bids)
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_bccbfc2a314f_production_0.8-py2.7.egg/buildbot/db/connector.py", line 931, in builds_finished
        return self.runInteractionNow(self._txn_build_finished, bids) 
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_bccbfc2a314f_production_0.8-py2.7.egg/buildbot/db/connector.py", line 212, in runInteractionNow
        return self._runInteractionNow(interaction, *args, **kwargs)
    --- <exception caught here> ---
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_bccbfc2a314f_production_0.8-py2.7.egg/buildbot/db/connector.py", line 244, in _runInteractionNow
        conn.rollback()
    _mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')
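
The traceback shows the rollback itself raising error 2006, which appears to be what left the finish update unrecorded. A minimal sketch of one way to tolerate that, not Buildbot's actual code: treat a failed rollback as a dead connection, reconnect, and retry the interaction. `run_with_retry`, `TransientDBError`, and the `connect` callable are hypothetical names for illustration; a real version would catch the driver's own exception (here `_mysql_exceptions.OperationalError`) instead.

```python
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for a driver error such as MySQL error 2006."""


def run_with_retry(connect, interaction, max_attempts=3, delay=0.0):
    """Run interaction(conn) in a transaction, reconnecting on transient errors.

    connect     -- hypothetical callable returning a fresh DB connection
    interaction -- callable taking the connection and doing the work
    """
    last_error = None
    for _ in range(max_attempts):
        conn = connect()  # new connection for every attempt
        try:
            result = interaction(conn)
            conn.commit()
            return result
        except TransientDBError as exc:
            last_error = exc
            try:
                conn.rollback()
            except TransientDBError:
                # The connection is already gone, so there is nothing to
                # roll back. Swallow this and retry on a new connection
                # instead of letting it escape as an unhandled error.
                pass
            time.sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

The point of the design is that a failure inside `rollback()` no longer ends the attempt to record the finished build; the caller only sees an error after all attempts are exhausted.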

The current state of the DB for this build request is:
mysql> select * from buildrequests where id=23866634;
+----------+------------+-----------------------------------------+----------+------------+------------------------------------------------------------------------------------+------------------------+----------+---------+--------------+-------------+
| id       | buildsetid | buildername                             | priority | claimed_at | claimed_by_name                                                                    | claimed_by_incarnation | complete | results | submitted_at | complete_at |
+----------+------------+-----------------------------------------+----------+------------+------------------------------------------------------------------------------------+------------------------+----------+---------+--------------+-------------+
| 23866634 |    6294394 | Ubuntu HW 12.04 birch pgo talos chromez |        0 | 1367936901 | buildbot-master52.srv.releng.use1.mozilla.com:/builds/buildbot/tests1-linux/master | pid621-boot1366727682  |        0 |    NULL |   1367863931 |        NULL |
+----------+------------+-----------------------------------------+----------+------------+------------------------------------------------------------------------------------+------------------------+----------+---------+--------------+-------------+
1 row in set (0.00 sec)

mysql> select * from builds where brid=23866634;
+----------+--------+----------+------------+-------------+
| id       | number | brid     | start_time | finish_time |
+----------+--------+----------+------------+-------------+
| 24100785 |      5 | 23866634 | 1367863938 |        NULL |
+----------+--------+----------+------------+-------------+
1 row in set (0.00 sec)

Interestingly, the master is still claiming the build, so it's not getting automatically re-built.
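
A query along these lines could list candidates for this state: requests still claimed and incomplete whose build started more than 12 hours ago and has no finish time. It is a sketch against the two tables and columns shown above, using an in-memory SQLite database with a reduced schema so it runs standalone; `find_stale_builds`, `make_demo_db`, and the cutoff parameter are hypothetical, and the result cannot tell a zombie apart from a build that is genuinely still running.

```python
import sqlite3

TWELVE_HOURS = 12 * 60 * 60  # same threshold as the long_jobs report


def make_demo_db():
    """Build an in-memory DB holding the one request/build pair from this bug.

    Only a subset of the real columns is created (an assumption made to
    keep the sketch short).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE buildrequests (id INTEGER, buildername TEXT, "
        "claimed_at INTEGER, complete INTEGER)"
    )
    conn.execute(
        "CREATE TABLE builds (id INTEGER, number INTEGER, brid INTEGER, "
        "start_time INTEGER, finish_time INTEGER)"
    )
    conn.execute(
        "INSERT INTO buildrequests VALUES (?, ?, ?, ?)",
        (23866634, "Ubuntu HW 12.04 birch pgo talos chromez", 1367936901, 0),
    )
    conn.execute(
        "INSERT INTO builds VALUES (?, ?, ?, ?, ?)",
        (24100785, 5, 23866634, 1367863938, None),
    )
    return conn


def find_stale_builds(conn, now, max_age=TWELVE_HOURS):
    """Return (request id, builder, build id, start_time) for stale builds."""
    query = """
        SELECT br.id, br.buildername, b.id, b.start_time
        FROM buildrequests AS br
        JOIN builds AS b ON b.brid = br.id
        WHERE br.complete = 0
          AND br.claimed_at != 0      -- assumes unclaimed requests store 0
          AND b.finish_time IS NULL
          AND b.start_time < ?
        ORDER BY b.start_time
    """
    return conn.execute(query, (now - max_age,)).fetchall()


if __name__ == "__main__":
    # Using the request's claimed_at value as "now" for the demonstration.
    for row in find_stale_builds(make_demo_db(), now=1367936901):
        print(row)
```

With `now` set to the `claimed_at` value above, the build from this bug is roughly 20 hours old and is returned as a candidate.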
Whiteboard: [buildbot]
All of the jobs above finished around 11:21am yesterday. The SQL server or network must have hiccuped at that time.
What time did you reconfig yesterday?
On bm52, the reconfig happened from 10:54:37 to 10:59:34.
Product: mozilla.org → Release Engineering
Whiteboard: [buildbot] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2065] [buildbot]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2065] [buildbot] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2075] [buildbot]
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Component: General Automation → General