Closed Bug 702393 Opened 13 years ago Closed 13 years ago

masters and schedulers are waiting on db connection

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: mpressman)

References

Details

(Whiteboard: [buildmasters][scheduler][mysql])

Attachments

(1 file)

I was trying to load several slave pages on buildbot-master15 but the pages never loaded.
I looked at ganglia and noticed that there had been 30 minutes without CPU usage.
Then I looked at twistd.log; the output ended with the following and was not growing:
> 2011-11-14 11:43:31-0800 [-] starting build <Build Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-2/5>.. pinging the slave <SlaveBuilder builder='Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-2/5' slave='talos-r3-xp-047'>
> 2011-11-14 11:43:31-0800 [-] sending ping

I was checking 30 minutes after 11:43, so things were clearly not good.

With catlee I did this:
[cltbld@buildbot-master15 tests1-windows]$ cat master/twistd.pid 
15573[cltbld@buildbot-master15 tests1-windows]$ strace -p 15573
Process 15573 attached - interrupt to quit
read(10,  <unfinished ...>
Process 15573 detached
[cltbld@buildbot-master15 tests1-windows]$ /usr/sbin/lsof -p 15573 | grep 10u
buildbot 15573 cltbld   10u  IPv4 76882967               TCP buildbot-master15.build.scl1.mozilla.com:49388->tm-b01-master01.mozilla.org:mysql (ESTABLISHED)
[cltbld@buildbot-master15 tests1-windows]$
buildbot-master16 as well:
> 2011-11-14 11:47:32-0800 [-] command '['python', 'tools/buildfarm/maintenance/count_and_reboot.py', '-f', '../reboot_count.txt', '-n', '1', '-z']' in dir '.'
> 2011-11-14 11:47:32-0800 [-] LoggedRemoteCommand.start

[cltbld@buildbot-master16 master]$ cat twistd.pid 
9886[cltbld@buildbot-master16 master]$ strace -p 9886
Process 9886 attached - interrupt to quit
read(10,  <unfinished ...>
Process 9886 detached
[cltbld@buildbot-master16 master]$ /usr/sbin/lsof -p 9886 | grep 10u
buildbot 9886 cltbld   10u  IPv4 72380858               TCP buildbot-master16.build.scl1.mozilla.com:37156->tm-b01-master01.mozilla.org:mysql (ESTABLISHED)
buildbot 9886 cltbld  110u  IPv4 72330374               TCP buildbot-master16.build.scl1.mozilla.com:wap-wsp-wtp->talos-r3-w7-030.build.scl1.mozilla.com:49184 (ESTABLISHED)
Summary: buildbot-master15 is stuck → buildbot-master15 and buildbot-master16 are stuck
Not sure yet what the severity of this is (it could be higher than 'major').
Assignee: nobody → server-ops-database
Severity: normal → major
Component: Release Engineering → Server Operations: Database
QA Contact: release → cshields
Summary: buildbot-master15 and buildbot-master16 are stuck → buildbot-master15 and buildbot-master16 stuck - waiting on db connection
Raising severity after developers commented in #build:

mbrubeck: On tbpl it looks like builds aren't getting run for pushes in the past hour, e.g. https://tbpl.mozilla.org/?tree=Mozilla-Inbound and https://tbpl.mozilla.org/?tree=Birch&noignore=1
[3:39pm] mbrubeck: Is this known?  Should I close the tree?
[3:40pm] mbrubeck: also https://tbpl.mozilla.org/?tree=Try
[3:40pm] catlee: mbrubeck: we're having issues with our database connections that I think could be causing that
Severity: major → blocker
There are litmus queries that are not using indexes; the resulting table scans are eating up memory and causing the rest of the databases on the host to suffer.
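Roughly how such a query shows up (illustrative only; the actual litmus queries aren't quoted in this bug, so the 'litmus' database name and the table/column names below are placeholders):
# Ask the planner how it would execute a suspect query on the db host.
# "type: ALL", "key: NULL" and a large "rows" estimate mean a full table scan with no index used.
$ mysql -h tm-b01-master01.mozilla.org litmus \
    -e "EXPLAIN SELECT * FROM test_results WHERE testcase_id = 1234\G"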
Assignee: server-ops-database → mpressman
changed title to show how major this is

was: buildbot-master15 and buildbot-master16 stuck
Summary: buildbot-master15 and buildbot-master16 stuck - waiting on db connection → masters and schedulers are waiting on db connection
Per the releng meeting: developers have closed the trees because of the large number of pending jobs caused by this db problem.
It appears the litmus queries are coming in repeatedly. The primary query that was running (and locking) when I took a snapshot, although fully using indexes, was still scanning a lot of rows and still has the same effect on memory usage. The tables could probably stand to be optimized, which will help, but the continual repetition of the same queries is what's really killing this.
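One quick way to see that repetition (an illustrative sketch; the slow-log path is an assumption and may differ on this host):
# Count identical statements currently running against the db host (column 8 of the
# tab-separated SHOW FULL PROCESSLIST output is the query text).
$ mysql -h tm-b01-master01.mozilla.org -e "SHOW FULL PROCESSLIST" \
    | awk -F'\t' '{print $8}' | sort | uniq -c | sort -rn | head
# If the slow query log is enabled, summarize it by count to find the most-repeated offenders.
$ mysqldumpslow -s c -t 10 /var/lib/mysql/slow-query.log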
From buildbot-master15's logs:


2011-11-14 11:43:31-0800 [-] starting build <Build Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-2/5>.. pinging the slave <SlaveBuilder builder='Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-2/5' slave='talos-r3-xp-047'>
2011-11-14 11:43:31-0800 [-] sending ping
2011-11-14 14:00:43-0800 [-] rollback failed, will reconnect next query
2011-11-14 14:00:43-0800 [-] Unhandled Error
        Traceback (most recent call last):
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 125, in maybeDeferred
            result = f(*args, **kw)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/builder.py", line 681, in reclaimAllBuilds
            self.master_incarnation, brids, am_reclaiming=True):
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 853, in claim_buildrequests
            am_reclaiming)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 212, in runInteractionNow
            return self._runInteractionNow(interaction, *args, **kwargs)
        --- <exception caught here> ---
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 244, in _runInteractionNow
            conn.rollback()
        _mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')

2011-11-14 14:00:43-0800 [-] Error in reclaimAllBuilds
2011-11-14 14:00:43-0800 [-] Unhandled Error
        Traceback (most recent call last):
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/base.py", line 1174, in mainLoop
            self.runUntilCurrent()
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/base.py", line 796, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/task.py", line 194, in __call__
            d = defer.maybeDeferred(self.f, *self.a, **self.kw)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 125, in maybeDeferred
            result = f(*args, **kw)
        --- <exception caught here> ---
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/builder.py", line 681, in reclaimAllBuilds
            self.master_incarnation, brids, am_reclaiming=True):
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 853, in claim_buildrequests
            am_reclaiming)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 212, in runInteractionNow
            return self._runInteractionNow(interaction, *args, **kwargs)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 237, in _runInteractionNow
            result = interaction(c, *args, **kwargs)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 873, in _txn_claim_buildrequests
            t.execute(q, qargs)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/MySQLdb/cursors.py", line 174, in execute
            self.errorhandler(self, exc, value)
          File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
            raise errorclass, errorvalue
        _mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')

2011-11-14 14:01:06-0800 [-] closing log <buildbot.status.builder.LogFile instance at 0x2aaae30b9680>
2011-11-14 14:01:06-0800 [-] releaseLocks(<buildbotcustom.steps.misc.UnpackFile instance at 0x2aaae9d7bef0>): []
2011-11-14 14:01:06-0800 [-]  step 'unpack_symbols' complete: success
The long litmus query has been optimized and is now returning in about 6 seconds; still extremely slow, but much better than where it was. There is also a very poorly performing query hitting graphs_mozilla_org_new that I'll check to see whether maintenance can help it.
Here is my final conclusion: the litmus tables need some maintenance, as do buildbot_schedulers and buildbot.
More importantly, the key_buffer_size variable is set to 120MB, which is way, way, way too small; I am upping it to 1G, which will allow more indexes to be stored in memory. This will have a huge performance benefit, as will upping the table_cache from its 1024 to 2048; the lower value forces a table/file handler to be closed whenever a new connection uses a new table, which eats up time and memory. It will take a little while for the key_buffer_size to allocate and fill optimally, but once that completes we should see decent performance, save for poor queries hitting the db.
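For reference, the changes described above map to something like this (the my.cnf path is an assumption; the variable names are the standard MySQL ones):
# Runtime change, no restart needed (both variables are dynamic):
$ mysql -e "SET GLOBAL key_buffer_size = 1073741824"   # up to 1G as described above
$ mysql -e "SET GLOBAL table_cache = 2048"             # up from 1024
# Persisted in /etc/my.cnf so the values survive a restart:
#   [mysqld]
#   key_buffer_size = 1G
#   table_cache     = 2048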
(In reply to Matt Pressman [:mpressman] from comment #10)
> the long litmus query has been optimized and is now returning in about 6
> seconds, still extremely slow, but much better from where it was at, there
> is also a very poorly running query hitting graphs_mozilla_org_new that I'll
> check and see if maintenance can help it

I assume you optimized this by adding an index/indices. Do you know which fields were changed so I can update https://hg.mozilla.org/webtools/litmus/file/default/schema.pl ?
Blocks: 702400
Update for those developers playing along at home, or work - mpressman and RelEng are working to understand why the db performance has regressed today. The litmus issues above are now resolved but we are still seeing slow performance, which is causing issues for the buildbot masters. More news when we know it.
(In reply to Chris Cooper [:coop] from comment #12)
> (In reply to Matt Pressman [:mpressman] from comment #10)
> > the long litmus query has been optimized and is now returning in about 6
> > seconds, still extremely slow, but much better from where it was at, there
> > is also a very poorly running query hitting graphs_mozilla_org_new that I'll
> > check and see if maintenance can help it
> 
> I assume you optimized this by adding an index/indices. Do you know which
> fields were changed so I can update
> https://hg.mozilla.org/webtools/litmus/file/default/schema.pl ?
No new indexes were added. I just used the OPTIMIZE TABLE command to defrag the tables.
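For the record, that step looks roughly like this (the table name is a placeholder; the bug doesn't list which litmus tables were optimized):
# Data_free > 0 on a MyISAM table indicates reclaimable (fragmented) space.
$ mysql litmus -e "SHOW TABLE STATUS LIKE 'test_results'\G" | grep -E 'Engine|Data_free'
# Rebuild the table and its indexes in place; note this holds a lock on the table while it runs.
$ mysql litmus -e "OPTIMIZE TABLE test_results"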
from irc, trees were reopened at ~18:25 PDT
And we're all clear with respect to having all the finished builds available to tbpl.
Here is a summary of events regarding this issue. Ultimately the problem was due to fragmented MyISAM tables, which caused the query execution planner to choose poor indexes, leading to a larger number of rows being scanned and more memory being needed to complete each query. To prevent this from recurring, a weekly maintenance schedule for the MyISAM tables should be employed to keep fragmentation from reaching these levels.
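A minimal sketch of such a schedule, assuming cron on the db host and the stock mysqlcheck client (the timing, file path, and log location are suggestions, not something set up as part of this bug):
# /etc/cron.d/mysql-weekly-optimize (hypothetical): defragment all tables early Sunday
# morning, in a low-traffic window, since OPTIMIZE locks each table while it runs.
0 3 * * 0  root  mysqlcheck --optimize --all-databases >> /var/log/mysql-optimize.log 2>&1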

-issue #1 litmus
(identification)
Initially the load coming from litmus was very high. I identified this from the queries/second rate along with the length of time queries took to complete, which pointed in the direction of the litmus database. To further verify that litmus was the cause, I ran explain plans for the top litmus queries and counted the number of rows projected for query execution. This confirmed that litmus was causing the slowdown, as a significant number of rows needed to be scanned. The large majority of the litmus tables use the MyISAM storage engine, and the typical cause of slowdown on MyISAM tables is fragmentation.
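Roughly the kind of checks described here (a sketch; the exact commands used weren't recorded in this bug, and the 'litmus' database name is an assumption):
# Queries per second: sample the status counters at a 10-second interval (-r shows deltas).
$ mysqladmin -h tm-b01-master01.mozilla.org -r -i 10 extended-status | grep -w Questions
# Which litmus tables are MyISAM, and therefore candidates for fragmentation:
$ mysql litmus -e "SHOW TABLE STATUS" | awk -F'\t' '$2 == "MyISAM" {print $1}'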

(fix)
I optimized the tables identified above using the MySQL OPTIMIZE TABLE command. I re-ran the explain plans, and the query execution projections looked much better, picking more optimal indexes.

-success
At this point arr and bear reported that the speed of the masters and schedulers was back in line and no perceived slowness remained.

-issue #2 maintenance leading to further swapping/load issues
Although the slowness was reportedly better, I still noticed that the database server wasn't in very good shape. Query times were still significantly high, as were load and system swap. I decided to run further maintenance on the tables behind the other top queries, as their explain plans showed poor execution and index choices. Although regular maintenance on MyISAM tables is absolutely necessary, I believe it was running this optimization at that moment that accelerated the server instability. While a table is being optimized, it holds an exclusive lock and each query yields until the lock is released. Because of the time an optimization takes to complete, I chose not to run it on the larger tables. Unfortunately, the locking caused more memory to be used as queries built up and the server slowed down. This was mistaken for a continuation of the previous issue, when in fact optimizing buildbot.buildbot_properties would have solved the problem; this was not identified until later. The error pushed me down a path of tuning the server because of low buffers and caches.
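That build-up is visible from the process list while an optimization runs (illustrative only):
# Queries piling up behind the exclusive lock sit in the "Locked" state on MyISAM tables.
$ mysql -e "SHOW FULL PROCESSLIST" | grep -c Locked
$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"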

I discovered the key_buffer_size was set way too low and, after hearing that the system had been swapping for the last couple of weeks, decided to increase it from 128MB to 1GB. I also increased the table_cache from 1024 to 2048. Because of the query build-up, especially combined with my tuning changes, even more memory went to swap until the kernel's out-of-memory killer finally killed the mysqld process. Upon restart, I brought the innodb_buffer_pool down from 6GB to 4GB; a 6GB setting on a server with only 8GB of memory is far too high. It was then discovered that the buildbot_properties table had crashed when the mysql server went down, which is not uncommon for in-use MyISAM tables in a server crash. The repair I performed had the same effect as an optimize, and the issue was successfully closed.
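For reference, the restart-time change and the repair map to something like this (the my.cnf path is assumed; innodb_buffer_pool_size requires a restart to change in this MySQL generation):
# /etc/my.cnf, [mysqld] section: leave room on the 8GB host for MyISAM key buffers and the OS.
#   innodb_buffer_pool_size = 4G
# After the restart, check and repair the MyISAM table that crashed:
$ mysql buildbot -e "CHECK TABLE buildbot_properties"
$ mysql buildbot -e "REPAIR TABLE buildbot_properties"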

Fortunately, the tuning options employed are very beneficial and the server is performing extremely well.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [buildmasters][scheduler][mysql]
Bug 703365 has been filed for the longer-term fix to prevent something like this from tearing everything down.
Product: mozilla.org → Data & BI Services Team