Closed Bug 702393 Opened 13 years ago Closed 13 years ago

masters and schedulers are waiting on db connection

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: mpressman)

References

Details

(Whiteboard: [buildmasters][scheduler][mysql])

Attachments

(1 file)

I was trying to load several slave pages on buildbot-master15 but the pages never loaded.
I looked at ganglia and noticed that there had been 30 minutes without CPU usage.
Then I looked at twistd.log; the output ended with the following and was not growing:
> 2011-11-14 11:43:31-0800 [-] starting build <Build Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-2/5>.. pinging the slave <SlaveBuilder builder='Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-2/5' slave='talos-r3-xp-047'>
> 2011-11-14 11:43:31-0800 [-] sending ping

I was checking 30 minutes after 11:43, so things were clearly not good.

With catlee I did this:
[cltbld@buildbot-master15 tests1-windows]$ cat master/twistd.pid 
15573[cltbld@buildbot-master15 tests1-windows]$ strace -p 15573
Process 15573 attached - interrupt to quit
read(10,  <unfinished ...>
Process 15573 detached
[cltbld@buildbot-master15 tests1-windows]$ /usr/sbin/lsof -p 15573 | grep 10u
buildbot 15573 cltbld   10u  IPv4 76882967               TCP buildbot-master15.build.scl1.mozilla.com:49388->tm-b01-master01.mozilla.org:mysql (ESTABLISHED)
[cltbld@buildbot-master15 tests1-windows]$
buildbot-master16 as well:
> 2011-11-14 11:47:32-0800 [-] command '['python', 'tools/buildfarm/maintenance/count_and_reboot.py', '-f', '../reboot_count.txt', '-n', '1', '-z']' in dir '.'
> 2011-11-14 11:47:32-0800 [-] LoggedRemoteCommand.start

[cltbld@buildbot-master16 master]$ cat twistd.pid 
9886[cltbld@buildbot-master16 master]$ strace -p 9886
Process 9886 attached - interrupt to quit
read(10,  <unfinished ...>
Process 9886 detached
[cltbld@buildbot-master16 master]$ /usr/sbin/lsof -p 9886 | grep 10u
buildbot 9886 cltbld   10u  IPv4 72380858               TCP buildbot-master16.build.scl1.mozilla.com:37156->tm-b01-master01.mozilla.org:mysql (ESTABLISHED)
buildbot 9886 cltbld  110u  IPv4 72330374               TCP buildbot-master16.build.scl1.mozilla.com:wap-wsp-wtp->talos-r3-w7-030.build.scl1.mozilla.com:49184 (ESTABLISHED)
Summary: buildbot-master15 is stuck → buildbot-master15 and buildbot-master16 are stuck
Not sure yet what the severity of this is (it could be higher than 'major').
Assignee: nobody → server-ops-database
Severity: normal → major
Component: Release Engineering → Server Operations: Database
QA Contact: release → cshields
Summary: buildbot-master15 and buildbot-master16 are stuck → buildbot-master15 and buildbot-master16 stuck - waiting on db connection
Raising severity after developers commented in #build:

mbrubeck: On tbpl it looks like builds aren't getting run for pushes in the past hour, e.g. https://tbpl.mozilla.org/?tree=Mozilla-Inbound and https://tbpl.mozilla.org/?tree=Birch&noignore=1
[3:39pm] mbrubeck: Is this known?  Should I close the tree?
[3:40pm] mbrubeck: also https://tbpl.mozilla.org/?tree=Try
[3:40pm] catlee: mbrubeck: we're having issues with our database connections that I think could be causing that
Severity: major → blocker
There are litmus queries that are not using indexes; the resulting table scans are eating up memory and causing the rest of the databases on the host to suffer.
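Roughly how such a query shows up (illustrative only; the actual litmus queries aren't quoted in this bug, so the 'litmus' database name and the table/column names below are placeholders):
# Ask the planner how it would execute a suspect query on the db host.
# "type: ALL", "key: NULL" and a large "rows" estimate mean a full table scan with no index used.
$ mysql -h tm-b01-master01.mozilla.org litmus \
    -e "EXPLAIN SELECT * FROM test_results WHERE testcase_id = 1234\G"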
Assignee: server-ops-database → mpressman
changed title to show how major this is

was: buildbot-master15 and buildbot-master16 stuck
Summary: buildbot-master15 and buildbot-master16 stuck - waiting on db connection → masters and schedulers are waiting on db connection
Per the releng meeting: developers have closed the trees because of the large number of pending jobs caused by this db problem.
It appears the litmus queries are coming in repeatedly. The primary query that was running (and locking) when I took a snapshot, although fully using indexes, was still scanning a lot of rows and still has the same effect on memory usage. The tables could probably stand to be optimized, which will help, but the continual repetition of the same queries is what's really killing this.
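One quick way to see that repetition (an illustrative sketch; the slow-log path is an assumption and may differ on this host):
# Count identical statements currently running against the db host (column 8 of the
# tab-separated SHOW FULL PROCESSLIST output is the query text).
$ mysql -h tm-b01-master01.mozilla.org -e "SHOW FULL PROCESSLIST" \
    | awk -F'\t' '{print $8}' | sort | uniq -c | sort -rn | head
# If the slow query log is enabled, summarize it by count to find the most-repeated offenders.
$ mysqldumpslow -s c -t 10 /var/lib/mysql/slow-query.log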
From buildbot-master15's logs:


2011-11-14 11:43:31-0800 [-] starting build <Build Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-2/5>.. pinging the slave <SlaveBuilder builder='Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-2/5' slave='talos-r3-xp-047'>
2011-11-14 11:43:31-0800 [-] sending ping
2011-11-14 14:00:43-0800 [-] rollback failed, will reconnect next query
2011-11-14 14:00:43-0800 [-] Unhandled Error
        Traceback (most recent call last):
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 125, in maybeDeferred
            result = f(*args, **kw)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/builder.py", line 681, in reclaimAllBuilds
            self.master_incarnation, brids, am_reclaiming=True):
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 853, in claim_buildrequests
            am_reclaiming)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 212, in runInteractionNow
            return self._runInteractionNow(interaction, *args, **kwargs)
        --- <exception caught here> ---
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 244, in _runInteractionNow
            conn.rollback()
        _mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')

2011-11-14 14:00:43-0800 [-] Error in reclaimAllBuilds
2011-11-14 14:00:43-0800 [-] Unhandled Error
        Traceback (most recent call last):
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/base.py", line 1174, in mainLoop
            self.runUntilCurrent()
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/base.py", line 796, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/task.py", line 194, in __call__
            d = defer.maybeDeferred(self.f, *self.a, **self.kw)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 125, in maybeDeferred
            result = f(*args, **kw)
        --- <exception caught here> ---
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/builder.py", line 681, in reclaimAllBuilds
            self.master_incarnation, brids, am_reclaiming=True):
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 853, in claim_buildrequests
            am_reclaiming)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 212, in runInteractionNow
            return self._runInteractionNow(interaction, *args, **kwargs)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 237, in _runInteractionNow
            result = interaction(c, *args, **kwargs)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 873, in _txn_claim_buildrequests
            t.execute(q, qargs)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/MySQLdb/cursors.py", line 174, in execute
            self.errorhandler(self, exc, value)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
            raise errorclass, errorvalue
        _mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')

2011-11-14 14:01:06-0800 [-] closing log <buildbot.status.builder.LogFile instance at 0x2aaae30b9680>
2011-11-14 14:01:06-0800 [-] releaseLocks(<buildbotcustom.steps.misc.UnpackFile instance at 0x2aaae9d7bef0>): []
2011-11-14 14:01:06-0800 [-]  step 'unpack_symbols' complete: success
The long litmus query has been optimized and is now returning in about 6 seconds; still extremely slow, but much better than where it was. There is also a very poorly performing query hitting graphs_mozilla_org_new that I'll check to see whether maintenance can help it.
Here is my final conclusion: the litmus tables need some maintenance, as do buildbot_schedulers and buildbot.
More importantly, the key_buffer_size variable is set to 120MB, which is way, way, way too small; I am upping it to 1G, which will allow more indexes to be stored in memory. This will have a huge performance benefit, as will upping the table_cache from its 1024 to 2048; the lower value forces a table/file handler to be closed whenever a new connection uses a new table, which eats up time and memory. It will take a little while for the key_buffer_size to allocate and fill optimally, but once that completes we should see decent performance, save for poor queries hitting the db.
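For reference, the changes described above map to something like this (the my.cnf path is an assumption; the variable names are the standard MySQL ones):
# Runtime change, no restart needed (both variables are dynamic):
$ mysql -e "SET GLOBAL key_buffer_size = 1073741824"   # up to 1G as described above
$ mysql -e "SET GLOBAL table_cache = 2048"             # up from 1024
# Persisted in /etc/my.cnf so the values survive a restart:
#   [mysqld]
#   key_buffer_size = 1G
#   table_cache     = 2048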
(In reply to Matt Pressman [:mpressman] from comment #10)
> the long litmus query has been optimized and is now returning in about 6
> seconds, still extremely slow, but much better from where it was at, there
> is also a very poorly running query hitting graphs_mozilla_org_new that I'll
> check and see if maintenance can help it

I assume you optimized this by adding an index/indices. Do you know which fields were changed so I can update https://hg.mozilla.org/webtools/litmus/file/default/schema.pl ?
Blocks: 702400
Update for those developers playing along at home, or work - mpressman and RelEng are working to understand why the db performance has regressed today. The litmus issues above are now resolved but we are still seeing slow performance, which is causing issues for the buildbot masters. More news when we know it.
(In reply to Chris Cooper [:coop] from comment #12)
> (In reply to Matt Pressman [:mpressman] from comment #10)
> > the long litmus query has been optimized and is now returning in about 6
> > seconds, still extremely slow, but much better from where it was at, there
> > is also a very poorly running query hitting graphs_mozilla_org_new that I'll
> > check and see if maintenance can help it
> 
> I assume you optimized this by adding an index/indices. Do you know which
> fields were changed so I can update
> https://hg.mozilla.org/webtools/litmus/file/default/schema.pl ?
No new indexes were added. I just used the OPTIMIZE TABLE command to defrag the tables.
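For the record, that step looks roughly like this (the table name is a placeholder; the bug doesn't list which litmus tables were optimized):
# Data_free > 0 on a MyISAM table indicates reclaimable (fragmented) space.
$ mysql litmus -e "SHOW TABLE STATUS LIKE 'test_results'\G" | grep -E 'Engine|Data_free'
# Rebuild the table and its indexes in place; note this holds a lock on the table while it runs.
$ mysql litmus -e "OPTIMIZE TABLE test_results"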
from irc, trees were reopened at ~18:25 PDT
And we're all clear with respect to having all the finished builds available to tbpl.
Here is a summary of events regarding this issue. Ultimately the problem was due to fragmented MyISAM tables, which caused the query execution planner to choose poor indexes, leading to a larger number of rows being scanned and more memory being needed to complete each query. To prevent this from recurring, a weekly maintenance schedule for the MyISAM tables should be employed to keep fragmentation from reaching these levels.
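A minimal sketch of such a schedule, assuming cron on the db host and the stock mysqlcheck client (the timing, file path, and log location are suggestions, not something set up as part of this bug):
# /etc/cron.d/mysql-weekly-optimize (hypothetical): defragment all tables early Sunday
# morning, in a low-traffic window, since OPTIMIZE locks each table while it runs.
0 3 * * 0  root  mysqlcheck --optimize --all-databases >> /var/log/mysql-optimize.log 2>&1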

-issue #1 litmus
(identification)
Initially the load coming from litmus was very high. I identified this from the queries/second rate along with the length of time queries took to complete, which pointed in the direction of the litmus database. To further verify that litmus was the cause, I ran explain plans for the top litmus queries and counted the number of rows projected for query execution. This confirmed that litmus was causing the slowdown, as a significant number of rows needed to be scanned. The large majority of the litmus tables use the MyISAM storage engine, and the typical cause of slowdown on MyISAM tables is fragmentation.
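Roughly the kind of checks described here (a sketch; the exact commands used weren't recorded in this bug, and the 'litmus' database name is an assumption):
# Queries per second: sample the status counters at a 10-second interval (-r shows deltas).
$ mysqladmin -h tm-b01-master01.mozilla.org -r -i 10 extended-status | grep -w Questions
# Which litmus tables are MyISAM, and therefore candidates for fragmentation:
$ mysql litmus -e "SHOW TABLE STATUS" | awk -F'\t' '$2 == "MyISAM" {print $1}'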

(fix)
I optimized the tables identified above using the MySQL OPTIMIZE TABLE command. I re-ran the explain plans, and the query execution projections looked much better, picking more optimal indexes.

-success
At this point arr and bear reported that the speed of the masters and schedulers was back in line and no perceived slowness remained.

-issue #2 maintenance leading to further swapping/load issues
Although the slowness was reportedly better, I still noticed that the database server wasn't in very good shape. Query times were still significantly high, as were load and system swap. I decided to run further maintenance on the tables behind the other top queries, as their explain plans showed poor execution and index choices. Although regular maintenance on MyISAM tables is absolutely necessary, I believe it was running this optimization at that moment that accelerated the server instability. While a table is being optimized, it holds an exclusive lock and each query yields until the lock is released. Because of the time an optimization takes to complete, I chose not to run it on the larger tables. Unfortunately, the locking caused more memory to be used as queries built up and the server slowed down. This was mistaken for a continuation of the previous issue, when in fact optimizing buildbot.buildbot_properties would have solved the problem; this was not identified until later. The error pushed me down a path of tuning the server because of low buffers and caches.
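That build-up is visible from the process list while an optimization runs (illustrative only):
# Queries piling up behind the exclusive lock sit in the "Locked" state on MyISAM tables.
$ mysql -e "SHOW FULL PROCESSLIST" | grep -c Locked
$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"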

I discovered the key_buffer_size was set way too low and, after hearing that the system had been swapping for the last couple of weeks, decided to increase it from 128MB to 1GB. I also increased the table_cache from 1024 to 2048. Because of the query build-up, especially combined with my tuning changes, even more memory went to swap until the kernel's out-of-memory killer finally killed the mysqld process. Upon restart, I brought the innodb_buffer_pool down from 6GB to 4GB; a 6GB setting on a server with only 8GB of memory is far too high. It was then discovered that the buildbot_properties table had crashed when the mysql server went down, which is not uncommon for in-use MyISAM tables in a server crash. The repair I performed had the same effect as an optimize, and the issue was successfully closed.
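For reference, the restart-time change and the repair map to something like this (the my.cnf path is assumed; innodb_buffer_pool_size requires a restart to change in this MySQL generation):
# /etc/my.cnf, [mysqld] section: leave room on the 8GB host for MyISAM key buffers and the OS.
#   innodb_buffer_pool_size = 4G
# After the restart, check and repair the MyISAM table that crashed:
$ mysql buildbot -e "CHECK TABLE buildbot_properties"
$ mysql buildbot -e "REPAIR TABLE buildbot_properties"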

Fortunately, the tuning options employed are very beneficial and the server is performing extremely well.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [buildmasters][scheduler][mysql]
Bug 703365 has been filed for the longer-term fix to prevent something like this from tearing everything down.
Product: mozilla.org → Data & BI Services Team