Closed Bug 1164819 Opened 9 years ago Closed 7 years ago

buildbot bridge dies when it loses connection to mysql

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

References

Details

(Whiteboard: [bbb])

This is less frequent now that pool recycle is set, but I'm still seeing errors like:
Traceback (most recent call last):
  File "/builds/bbb/bin/buildbot-bridge", line 9, in <module>
    load_entry_point('bbb==0.3', 'console_scripts', 'buildbot-bridge')()
  File "/builds/bbb/lib/python2.7/site-packages/bbb/runner.py", line 82, in main
    service.start()
  File "/builds/bbb/lib/python2.7/site-packages/bbb/services.py", line 180, in start
    self.reflectTasks()
  File "/builds/bbb/lib/python2.7/site-packages/bbb/services.py", line 186, in reflectTasks
    for t in self.bbb_db.tasks:
  File "/builds/bbb/lib/python2.7/site-packages/bbb/servicebase.py", line 40, in tasks
    tasks = self.tasks_table.select().execute().fetchall()
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/sql/base.py", line 386, in execute
    return e._execute_clauseelement(self, multiparams, params)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1984, in _execute_clauseelement
    connection = self.contextual_connect(close_with_result=True)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2026, in contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2065, in _wrap_pool_connect
    e, dialect, self)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1395, in _handle_dbapi_exception_noconnection
    exc_info
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2061, in _wrap_pool_connect
    return fn()
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 337, in connect
    return _ConnectionFairy._checkout(self)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 644, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 444, in checkout
    rec.checkin()
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 441, in checkout
    dbapi_connection = rec.get_connection()
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 525, in get_connection
    self.connection = self.__connect()
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 538, in __connect
    connection = self.__pool._creator()
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 89, in connect
    return dialect.connect(*cargs, **cparams)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 377, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/builds/bbb/lib/python2.7/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/builds/bbb/lib/python2.7/site-packages/MySQLdb/connections.py", line 193, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0")


from time to time. Maybe this isn't a big deal because supervisord should be restarting it....
No longer blocks: bbb
Whiteboard: [bbb]
(In reply to Ben Hearsum (:bhearsum) from comment #0)
> This is less frequent now that pool recycle is set, but I'm still seeing
> errors like:
...
> from time to time. Maybe this isn't a big deal because supervisord should be
> restarting it....

ftr: mysql disconnect neither detected nor fixed by supervisord on Friday Aug 7
Hit this from a different type of error over the weekend:
2015-08-21 12:54:27,937 - bbb.runner - Running reflector service
2015-08-21 12:54:27,937 - bbb.services - Starting reflector
2015-08-21 12:54:27,940 - bbb.services - Processing task: 4N-0yTkmQzuNmq3zjLf7cw
2015-08-21 12:54:27,947 - bbb.services - BuildRequest 77842006 is done. BBListener should process it soon, reclaiming in the meantime
2015-08-21 12:54:27,947 - bbb.services - Processing task: sKklWUdZS1mircngEJBBzw
2015-08-21 12:54:27,954 - bbb.services - BuildRequest 78070838 is done. BBListener should process it soon, reclaiming in the meantime
2015-08-21 12:54:27,954 - bbb.services - Processing task: _dgDktc4R0GytOXdydnIpQ
2015-08-21 12:54:27,961 - bbb.services - BuildRequest 78092858 is done. BBListener should process it soon, reclaiming in the meantime
2015-08-21 12:54:27,961 - bbb.services - Processing task: JJM9oHrARx6fsTffPT99aA
2015-08-21 12:54:27,967 - bbb.services - BuildRequest 78109284 is done. BBListener should process it soon, reclaiming in the meantime
2015-08-21 12:54:27,967 - bbb.services - Processing task: A18IB6cARBGyKrV7w5YAMA
2015-08-21 12:54:27,972 - bbb.services - BuildRequest disappeared before starting, cancelling task
2015-08-21 12:54:27,983 - requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): queue.taskcluster.net
2015-08-21 12:54:28,493 - sqlalchemy.pool.QueuePool - Exception during reset or similar
Traceback (most recent call last):
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 567, in _finalize_fairy
    fairy._reset(pool)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 701, in _reset
    pool._dialect.do_rollback(self)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2519, in do_rollback
    dbapi_connection.rollback()
ProgrammingError: (2014, "Commands out of sync; you can't run this command now")
2015-08-21 12:54:32,688 - bbb.runner - Running reflector service
2015-08-21 12:54:32,688 - bbb.services - Starting reflector
2015-08-21 12:54:32,690 - bbb.services - Processing task: 4N-0yTkmQzuNmq3zjLf7cw
2015-08-21 12:54:32,697 - bbb.services - BuildRequest 77842006 is done. BBListener should process it soon, reclaiming in the meantime
2015-08-21 12:54:32,698 - bbb.services - Processing task: sKklWUdZS1mircngEJBBzw
2015-08-21 12:54:32,704 - bbb.services - BuildRequest 78070838 is done. BBListener should process it soon, reclaiming in the meantime
2015-08-21 12:54:32,705 - bbb.services - Processing task: _dgDktc4R0GytOXdydnIpQ
2015-08-21 12:54:32,711 - bbb.services - BuildRequest 78092858 is done. BBListener should process it soon, reclaiming in the meantime
2015-08-21 12:54:32,711 - bbb.services - Processing task: JJM9oHrARx6fsTffPT99aA
2015-08-21 12:54:32,717 - bbb.services - BuildRequest 78109284 is done. BBListener should process it soon, reclaiming in the meantime
2015-08-21 12:54:32,717 - bbb.services - Processing task: A18IB6cARBGyKrV7w5YAMA
2015-08-21 12:54:32,723 - bbb.services - BuildRequest disappeared before starting, cancelling task
2015-08-21 12:54:32,734 - requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): queue.taskcluster.net
2015-08-21 12:54:33,245 - sqlalchemy.pool.QueuePool - Exception during reset or similar
Traceback (most recent call last):
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 567, in _finalize_fairy
    fairy._reset(pool)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line 701, in _reset
    pool._dialect.do_rollback(self)
  File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2519, in do_rollback
    dbapi_connection.rollback()
ProgrammingError: (2014, "Commands out of sync; you can't run this command now")


Looksl ike supervisor tried to restart it at least once based on the timestamps.
(In reply to Ben Hearsum (:bhearsum) from comment #2)
> Hit this from a different type of error over the weekend:
> 2015-08-21 12:54:27,937 - bbb.runner - Running reflector service
> 2015-08-21 12:54:27,937 - bbb.services - Starting reflector
> 2015-08-21 12:54:27,940 - bbb.services - Processing task:
> 4N-0yTkmQzuNmq3zjLf7cw
> 2015-08-21 12:54:27,947 - bbb.services - BuildRequest 77842006 is done.
> BBListener should process it soon, reclaiming in the meantime
> 2015-08-21 12:54:27,947 - bbb.services - Processing task:
> sKklWUdZS1mircngEJBBzw
> 2015-08-21 12:54:27,954 - bbb.services - BuildRequest 78070838 is done.
> BBListener should process it soon, reclaiming in the meantime
> 2015-08-21 12:54:27,954 - bbb.services - Processing task:
> _dgDktc4R0GytOXdydnIpQ
> 2015-08-21 12:54:27,961 - bbb.services - BuildRequest 78092858 is done.
> BBListener should process it soon, reclaiming in the meantime
> 2015-08-21 12:54:27,961 - bbb.services - Processing task:
> JJM9oHrARx6fsTffPT99aA
> 2015-08-21 12:54:27,967 - bbb.services - BuildRequest 78109284 is done.
> BBListener should process it soon, reclaiming in the meantime
> 2015-08-21 12:54:27,967 - bbb.services - Processing task:
> A18IB6cARBGyKrV7w5YAMA
> 2015-08-21 12:54:27,972 - bbb.services - BuildRequest disappeared before
> starting, cancelling task
> 2015-08-21 12:54:27,983 - requests.packages.urllib3.connectionpool -
> Starting new HTTPS connection (1): queue.taskcluster.net
> 2015-08-21 12:54:28,493 - sqlalchemy.pool.QueuePool - Exception during reset
> or similar
> Traceback (most recent call last):
>   File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line
> 567, in _finalize_fairy
>     fairy._reset(pool)
>   File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line
> 701, in _reset
>     pool._dialect.do_rollback(self)
>   File
> "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py",
> line 2519, in do_rollback
>     dbapi_connection.rollback()
> ProgrammingError: (2014, "Commands out of sync; you can't run this command
> now")
> 2015-08-21 12:54:32,688 - bbb.runner - Running reflector service
> 2015-08-21 12:54:32,688 - bbb.services - Starting reflector
> 2015-08-21 12:54:32,690 - bbb.services - Processing task:
> 4N-0yTkmQzuNmq3zjLf7cw
> 2015-08-21 12:54:32,697 - bbb.services - BuildRequest 77842006 is done.
> BBListener should process it soon, reclaiming in the meantime
> 2015-08-21 12:54:32,698 - bbb.services - Processing task:
> sKklWUdZS1mircngEJBBzw
> 2015-08-21 12:54:32,704 - bbb.services - BuildRequest 78070838 is done.
> BBListener should process it soon, reclaiming in the meantime
> 2015-08-21 12:54:32,705 - bbb.services - Processing task:
> _dgDktc4R0GytOXdydnIpQ
> 2015-08-21 12:54:32,711 - bbb.services - BuildRequest 78092858 is done.
> BBListener should process it soon, reclaiming in the meantime
> 2015-08-21 12:54:32,711 - bbb.services - Processing task:
> JJM9oHrARx6fsTffPT99aA
> 2015-08-21 12:54:32,717 - bbb.services - BuildRequest 78109284 is done.
> BBListener should process it soon, reclaiming in the meantime
> 2015-08-21 12:54:32,717 - bbb.services - Processing task:
> A18IB6cARBGyKrV7w5YAMA
> 2015-08-21 12:54:32,723 - bbb.services - BuildRequest disappeared before
> starting, cancelling task
> 2015-08-21 12:54:32,734 - requests.packages.urllib3.connectionpool -
> Starting new HTTPS connection (1): queue.taskcluster.net
> 2015-08-21 12:54:33,245 - sqlalchemy.pool.QueuePool - Exception during reset
> or similar
> Traceback (most recent call last):
>   File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line
> 567, in _finalize_fairy
>     fairy._reset(pool)
>   File "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/pool.py", line
> 701, in _reset
>     pool._dialect.do_rollback(self)
>   File
> "/builds/bbb/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py",
> line 2519, in do_rollback
>     dbapi_connection.rollback()
> ProgrammingError: (2014, "Commands out of sync; you can't run this command
> now")
> 
> 
> Looksl ike supervisor tried to restart it at least once based on the
> timestamps.

This might actually be something different...the service refuses to restart, with the same error every time. I filed bug 1197855 for it.
After loss of mysql during outage in bug 1241334, all 3 BBB hosts showed up in nagios as errors (only 1 process running).

Restarting on all 3 showed the same symptoms:
[root@buildbot-master72.bb.releng.usw2.mozilla.com ~]# for i in bblistener tclistener reflector; do supervisorctl restart buildbot_bridge_$i; done
buildbot_bridge_bblistener: ERROR (not running)
buildbot_bridge_bblistener: started
buildbot_bridge_tclistener: stopped
buildbot_bridge_tclistener: started
buildbot_bridge_reflector: ERROR (not running)
buildbot_bridge_reflector: started
[root@buildbot-master72.bb.releng.usw2.mozilla.com ~]#
At about 05:40 this morning (2016-01-31) there was some mysql issue ('Too many connections' in nagios and buildbot logs), and the bblistener and reflector failed on bm72 and bm82. Supervisord tried to restart them but failed, possibly gave up too soon. Restarted all three services on both hosts, with the command from comment #4.
Ben: do we still need to build in some resiliency here?
Flags: needinfo?(bhearsum)
(In reply to Chris Cooper [:coop] from comment #6)
> Ben: do we still need to build in some resiliency here?

It would probably be good to have.
Flags: needinfo?(bhearsum)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.