Bug 702393 (Closed) - masters and schedulers are waiting on db connection
Opened 13 years ago • Closed 13 years ago
Status: RESOLVED FIXED
Component: Data & BI Services Team :: DB: MySQL (task)
Tracking: (Not tracked)
Reporter: armenzg • Assignee: mpressman
Whiteboard: [buildmasters][scheduler][mysql]
Attachments: 1 file (image/png, 39.65 KB)
I was trying to load several slaves' pages on buildbot-master15, but the pages were never loading.
I looked at ganglia and noticed there had been 30 minutes without any CPU usage.
Then I looked at twistd.log; the output ended with the following and was not growing:
> 2011-11-14 11:43:31-0800 [-] starting build <Build Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-2/5>.. pinging the slave <SlaveBuilder builder='Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-2/5' slave='talos-r3-xp-047'>
> 2011-11-14 11:43:31-0800 [-] sending ping
I was checking 30 minutes after 11:43, so things were clearly not good.
With catlee I ran the following:
[cltbld@buildbot-master15 tests1-windows]$ cat master/twistd.pid
15573
[cltbld@buildbot-master15 tests1-windows]$ strace -p 15573
Process 15573 attached - interrupt to quit
read(10, <unfinished ...>
Process 15573 detached
[cltbld@buildbot-master15 tests1-windows]$ /usr/sbin/lsof -p 15573 | grep 10u
buildbot 15573 cltbld 10u IPv4 76882967 TCP buildbot-master15.build.scl1.mozilla.com:49388->tm-b01-master01.mozilla.org:mysql (ESTABLISHED)
[cltbld@buildbot-master15 tests1-windows]$
Reporter
Comment 1 • 13 years ago

Reporter
Comment 2 • 13 years ago
buildbot-master16 as well:
> 2011-11-14 11:47:32-0800 [-] command '['python', 'tools/buildfarm/maintenance/count_and_reboot.py', '-f', '../reboot_count.txt', '-n', '1', '-z']' in dir '.'
> 2011-11-14 11:47:32-0800 [-] LoggedRemoteCommand.start
[cltbld@buildbot-master16 master]$ cat twistd.pid
9886
[cltbld@buildbot-master16 master]$ strace -p 9886
Process 9886 attached - interrupt to quit
read(10, <unfinished ...>
Process 9886 detached
[cltbld@buildbot-master16 master]$ /usr/sbin/lsof -p 9886 | grep 10u
buildbot 9886 cltbld 10u IPv4 72380858 TCP buildbot-master16.build.scl1.mozilla.com:37156->tm-b01-master01.mozilla.org:mysql (ESTABLISHED)
buildbot 9886 cltbld 110u IPv4 72330374 TCP buildbot-master16.build.scl1.mozilla.com:wap-wsp-wtp->talos-r3-w7-030.build.scl1.mozilla.com:49184 (ESTABLISHED)
Summary: buildbot-master15 is stucked → buildbot-master15 and buildbot-master16 are stucked
Reporter
Comment 3 • 13 years ago
Not sure yet what the severity of this is (it could be higher than 'major').
Assignee: nobody → server-ops-database
Severity: normal → major
Component: Release Engineering → Server Operations: Database
QA Contact: release → cshields
Updated • 13 years ago
Summary: buildbot-master15 and buildbot-master16 are stucked → buildbot-master15 and buildbot-master16 stuck - waiting on db connection
Reporter
Comment 4 • 13 years ago
Raising criticality after developers commented on #build:
[3:39pm] mbrubeck: On tbpl it looks like builds aren't getting run for pushes in the past hour, e.g. https://tbpl.mozilla.org/?tree=Mozilla-Inbound and https://tbpl.mozilla.org/?tree=Birch&noignore=1
[3:39pm] mbrubeck: Is this known? Should I close the tree?
[3:40pm] mbrubeck: also https://tbpl.mozilla.org/?tree=Try
[3:40pm] catlee: mbrubeck: we're having issues with our database connections that I think could be causing that
Severity: major → blocker
Assignee
Comment 5 • 13 years ago
There are litmus queries that are not using indexes; the resulting table scans are eating up memory and causing the rest of the databases on the host to suffer.
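A table scan of this kind shows up directly in EXPLAIN output. A minimal sketch of the check; the table and column names here are hypothetical, not the actual litmus schema:

```sql
-- Hypothetical query filtered on an unindexed column.
-- "type: ALL" in the EXPLAIN output means a full table scan,
-- and the "rows" column estimates how many rows MySQL expects
-- to examine; large values here compete for buffer memory.
EXPLAIN SELECT * FROM test_results WHERE build_id = 12345;

-- One general remedy (not necessarily what was done in this bug)
-- is an index that turns the scan into a lookup:
-- CREATE INDEX idx_build_id ON test_results (build_id);
```

The "rows" estimate is what the comment above refers to as the number of rows projected for query execution.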
Updated • 13 years ago
Assignee: server-ops-database → mpressman
Comment 6 • 13 years ago
Changed the title to show how major this is. It was: buildbot-master15 and buildbot-master16 stuck
Summary: buildbot-master15 and buildbot-master16 stuck - waiting on db connection → masters and schedulers are waiting on db connection
Comment 7 • 13 years ago
Per the RelEng meeting: developers have closed the trees because of the large number of pending jobs caused by this db problem.
Assignee
Comment 8 • 13 years ago
It appears the litmus queries are coming in repeatedly. The primary query that was running (and locking) when I took a snapshot, although fully using indexes, was still scanning a lot of rows and still has the same effect on memory usage. The tables could probably stand to be optimized, which will help, but the sheer number of repeated runs of the same queries is what's really killing this.
Reporter
Comment 9 • 13 years ago
From buildbot-master15's logs:

2011-11-14 11:43:31-0800 [-] starting build <Build Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-2/5>.. pinging the slave <SlaveBuilder builder='Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-2/5' slave='talos-r3-xp-047'>
2011-11-14 11:43:31-0800 [-] sending ping
2011-11-14 14:00:43-0800 [-] rollback failed, will reconnect next query
2011-11-14 14:00:43-0800 [-] Unhandled Error
	Traceback (most recent call last):
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 125, in maybeDeferred
	    result = f(*args, **kw)
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/builder.py", line 681, in reclaimAllBuilds
	    self.master_incarnation, brids, am_reclaiming=True):
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 853, in claim_buildrequests
	    am_reclaiming)
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 212, in runInteractionNow
	    return self._runInteractionNow(interaction, *args, **kwargs)
	--- <exception caught here> ---
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 244, in _runInteractionNow
	    conn.rollback()
	_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')
2011-11-14 14:00:43-0800 [-] Error in reclaimAllBuilds
2011-11-14 14:00:43-0800 [-] Unhandled Error
	Traceback (most recent call last):
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/base.py", line 1174, in mainLoop
	    self.runUntilCurrent()
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/base.py", line 796, in runUntilCurrent
	    call.func(*call.args, **call.kw)
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/task.py", line 194, in __call__
	    d = defer.maybeDeferred(self.f, *self.a, **self.kw)
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 125, in maybeDeferred
	    result = f(*args, **kw)
	--- <exception caught here> ---
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/builder.py", line 681, in reclaimAllBuilds
	    self.master_incarnation, brids, am_reclaiming=True):
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 853, in claim_buildrequests
	    am_reclaiming)
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 212, in runInteractionNow
	    return self._runInteractionNow(interaction, *args, **kwargs)
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 237, in _runInteractionNow
	    result = interaction(c, *args, **kwargs)
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 873, in _txn_claim_buildrequests
	    t.execute(q, qargs)
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/MySQLdb/cursors.py", line 174, in execute
	    self.errorhandler(self, exc, value)
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
	    raise errorclass, errorvalue
	_mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')
2011-11-14 14:01:06-0800 [-] closing log <buildbot.status.builder.LogFile instance at 0x2aaae30b9680>
2011-11-14 14:01:06-0800 [-] releaseLocks(<buildbotcustom.steps.misc.UnpackFile instance at 0x2aaae9d7bef0>): []
2011-11-14 14:01:06-0800 [-] step 'unpack_symbols' complete: success
Assignee
Comment 10 • 13 years ago
The long litmus query has been optimized and is now returning in about 6 seconds. That is still extremely slow, but much better than where it was. There is also a very poorly performing query hitting graphs_mozilla_org_new; I'll check and see whether maintenance can help it.
Assignee
Comment 11 • 13 years ago
Here is my final conclusion: the litmus tables need some maintenance, as do buildbot_schedulers and buildbot. More importantly, the key_buffer_size variable is set to 128MB, which is way too small. I am upping it to 1GB, which allows more indexes to be stored in memory and will have a huge performance benefit. I am also upping table_cache from 1024 to 2048; with too small a table cache, a table/file handler has to be closed whenever a new connection uses a new table, which eats up time and memory. It will take a little while for the key_buffer_size to allocate and align itself optimally, but once that completes we should see decent performance, save for poor queries hitting the db.
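As a sketch, the tuning described above corresponds to the following MySQL statements (values taken from the comment; note that the variable is named table_cache on MySQL 5.0/5.1-era servers and table_open_cache on later versions):

```sql
-- Raise the MyISAM key buffer to 1GB so more index blocks
-- stay cached in memory instead of being re-read from disk.
SET GLOBAL key_buffer_size = 1073741824;  -- 1GB

-- Double the table cache so open table/file handlers are not
-- closed and reopened each time a connection touches a new table.
SET GLOBAL table_cache = 2048;

-- Verify the running values:
SHOW VARIABLES LIKE 'key_buffer_size';
SHOW VARIABLES LIKE 'table_cache';
```

SET GLOBAL only affects the running server; the same values would also need to go into my.cnf (under [mysqld]) to survive a restart.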
Comment 12 • 13 years ago
(In reply to Matt Pressman [:mpressman] from comment #10)
> the long litmus query has been optimized and is now returning in about 6
> seconds, still extremely slow, but much better from where it was at, there
> is also a very poorly running query hitting graphs_mozilla_org_new that I'll
> check and see if maintenance can help it

I assume you optimized this by adding an index/indices. Do you know which fields were changed so I can update https://hg.mozilla.org/webtools/litmus/file/default/schema.pl ?
Comment 13 • 13 years ago
Update for those developers playing along at home (or work): mpressman and RelEng are working to understand why the db performance has regressed today. The litmus issues above are now resolved, but we are still seeing slow performance, which is causing issues for the buildbot masters. More news when we know it.
Assignee
Comment 14 • 13 years ago
(In reply to Chris Cooper [:coop] from comment #12)
> (In reply to Matt Pressman [:mpressman] from comment #10)
> > the long litmus query has been optimized and is now returning in about 6
> > seconds, still extremely slow, but much better from where it was at, there
> > is also a very poorly running query hitting graphs_mozilla_org_new that I'll
> > check and see if maintenance can help it
>
> I assume you optimized this by adding an index/indices. Do you know which
> fields were changed so I can update
> https://hg.mozilla.org/webtools/litmus/file/default/schema.pl ?

No new indexes were added. I just used the OPTIMIZE TABLE command to defrag the tables.
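The defragmentation mentioned here is plain OPTIMIZE TABLE. A minimal sketch, with a hypothetical table name since the bug does not list the specific litmus tables:

```sql
-- Gauge fragmentation first: a Data_free value that is large
-- relative to Data_length means the table has unreclaimed holes
-- left behind by deletes and updates.
SHOW TABLE STATUS LIKE 'test_results';

-- Rebuild the table and sort its indexes. For MyISAM this takes
-- an exclusive lock, so queries against the table block until
-- the rebuild finishes.
OPTIMIZE TABLE test_results;
```

Because no schema changed, there is nothing to update in schema.pl; the rebuild only compacts the existing data and indexes, which is why the planner's row estimates improved.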
Comment 15 • 13 years ago
From IRC: trees were reopened at ~18:25 PDT.
Comment 16 • 13 years ago
And we're all clear with respect to having all the finished builds available to tbpl.
Assignee
Comment 17 • 13 years ago
Here is a summary of events regarding this issue.

Ultimately the problem was due to fragmented MyISAM tables, which caused the query execution planner to choose poor indexes. That led to a larger number of rows being scanned, which in turn required more memory per query. To prevent this from recurring, a weekly maintenance schedule for the MyISAM tables should be put in place to keep fragmentation down.

Issue #1: litmus

(identification) Initially the load coming from litmus was very high. I identified this using the queries/second rate along with the length of time queries took to complete, which pointed in the direction of the litmus database. To further verify that litmus was the cause, I ran explain plans for the top litmus queries and counted the number of rows projected for query execution. This confirmed litmus as the cause of the slowdown, since a significant number of rows needed to be scanned. The large majority of the litmus tables use the MyISAM storage engine, and the typical cause of slowdown on MyISAM tables is fragmentation.

(fix) I optimized the tables identified above using the MySQL OPTIMIZE command. I reran the explain plans and the query execution projections looked much better, picking more optimal indexes.

(success) At this point arr and bear reported that the speed of the masters and schedulers was back in line and no perceived slowness remained.

Issue #2: maintenance leading to further swapping/load issues

Although the slowness was reportedly better, I still noticed that the database server wasn't in very good shape: query times were still significantly high, as were load and system swap. I decided to run further maintenance on the tables behind the other top queries, since their explain plans showed poor execution and index choices. Although regular maintenance is absolutely necessary for MyISAM storage engines, I believe it was my optimization at this point that caused the accelerated server instability. While a table is being optimized, an exclusive lock is held on it, and each query yields until the lock is released. Because of the amount of time the optimization takes to complete, I chose not to run it on the larger tables. Unfortunately, the locking caused more memory to be used as queries built up, and the server slowed down. This was mistaken for a continuation of the previous issue, when in fact, had I optimized buildbot.buildbot_properties at this point, it would have solved the problem; that was not identified until later.

This error pushed me down a path of tuning the server because of low buffers and caches. I discovered that key_buffer_size was set far too low and, after hearing that the system had been swapping for the last couple of weeks, decided to increase it from 128MB to 1GB. I also increased table_cache from 1024 to 2048. Because of the query build-up, especially combined with my tuning options, even more memory swapped out until the Linux OOM killer finally killed the mysqld process. Upon restart, I lowered innodb_buffer_pool_size from 6GB to 4GB; 6GB on a server with only 8GB of memory is far too high. It was then discovered that the buildbot_properties table had crashed when the MySQL server went down, which is not uncommon for in-use MyISAM tables during a server crash. The repair I performed had the same effect as an optimize, and the issue was successfully closed. Fortunately, the tuning options employed are very beneficial, and the server is now performing extremely well.
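The crashed-table recovery described in the summary maps onto standard MyISAM maintenance statements. A sketch, using the table name given in the comment (the exact invocation used is not recorded in the bug):

```sql
-- Detect the crash: a crashed MyISAM table reports
-- "marked as crashed" style errors from CHECK TABLE.
CHECK TABLE buildbot.buildbot_properties;

-- Rebuild the data file and indexes. Like OPTIMIZE TABLE, the
-- rebuild defragments the table as a side effect, which is why
-- the repair resolved the remaining slowness.
REPAIR TABLE buildbot.buildbot_properties;
```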
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated • 13 years ago
Whiteboard: [buildmasters][scheduler][mysql]
Comment 18 • 13 years ago
Bug 703365 has been filed for the longer-term fix to prevent something like this from tearing everything down.
Updated • 10 years ago
Product: mozilla.org → Data & BI Services Team