Created attachment 8806212 [details]
Newrelic graphs

Newrelic is reporting high I/O utilisation on the buildbot r/w master (see attachment), which is impacting some of our services. Could you please investigate what is causing that? We are not aware of any changes on our side, and can't find any requests against the services which would cause long-running queries. 'show processlist' as firstname.lastname@example.org looks like normal buildbot work.
Hi,

We found that IO load has increased since Oct 29th. Also, when the checksum job runs we see high IO load from 6AM-10AM and again from 6PM-9PM. We also found some SELECT queries hitting the server that create temporary tables and use the filesort algorithm, which is expensive in disk IO and contributes to the load. We need to look into those queries for optimization possibilities as well. It would be great if we could get some details from the New Relic graphs on data size growth, though it looks like it has not increased much. We will dig further for more details. We have also created a Pythian ticket for internal tracking: https://secure.pythian.com/track/cr/1101708

Thanks,
Rohit Kumar
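For anyone following along, the expensive pattern described above can be confirmed with EXPLAIN, and the on-disk temp-table spill rate is visible in the server status counters. A minimal sketch (the table and columns in the EXPLAIN are hypothetical, just to illustrate the "Using temporary; Using filesort" signature):

```sql
-- If EXPLAIN shows "Using temporary; Using filesort" in the Extra column,
-- the query may build an on-disk temp table and sort it, which is heavy on I/O.
-- (hypothetical query/table for illustration, not the actual buildbot schema)
EXPLAIN SELECT buildername, COUNT(*)
FROM builds
GROUP BY buildername
ORDER BY COUNT(*) DESC;

-- How often temp tables spill to disk vs stay in memory:
SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';
SHOW GLOBAL STATUS LIKE 'Created_tmp_tables';
```

A rising Created_tmp_disk_tables/Created_tmp_tables ratio during the 6AM-10AM window would corroborate the filesort theory.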
(Opening this bug up, so the sheriffs can check status)

Space used on /data has been pretty constant at 340GB (67% used). No big changes there. gcox looked at the cron log, and found the db checksums job had been running prior to the 29th, and also looked for puppet changes: 'I see nothing changing in manifests/nodes/db_mysql since 2016-10-21, so that's not interesting from what I see.'

We're still seeing timeouts on the webservice (eg https://secure.pub.build.mozilla.org/buildapi/running, requires LDAP), but the checksums load has subsided. I read this a while back: "While a transaction is open, the system keeps a snapshot of the data as seen at the beginning of the transaction, which can cause substantial overhead if the system inserts, updates, and deletes millions of rows while a stray transaction keeps running. Thus, take care to avoid transactions that run for too long" (https://dev.mysql.com/doc/refman/5.6/en/converting-tables-to-innodb.html#innodb-convert-transactions)

Is there any way to check for long-lived db connections which might be holding a snapshot? It's possible we have badly behaved code from MyISAM days, where there were no transactions. With connections going via the Zeus VIP we may need to talk to moc/webops to work out which host is making the connections. Also, what is the global value of autocommit?
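The questions above can be answered directly on the server. A sketch against MySQL 5.6's information_schema (thresholds are arbitrary examples):

```sql
-- Transactions that have been open for a while and may be pinning
-- an InnoDB read-view snapshot:
SELECT trx_id, trx_started, trx_mysql_thread_id, trx_rows_modified
FROM information_schema.INNODB_TRX
WHERE trx_started < NOW() - INTERVAL 10 MINUTE
ORDER BY trx_started;

-- Connections that have been idle a long time (candidates for
-- "open transaction, went to sleep" code from the MyISAM days):
SELECT id, user, host, time, state
FROM information_schema.PROCESSLIST
WHERE command = 'Sleep' AND time > 600;

-- And the global autocommit value:
SHOW GLOBAL VARIABLES LIKE 'autocommit';
```

The PROCESSLIST host column shows the connecting host as the server sees it, so if everything comes in via the Zeus VIP it will indeed only show the VIP side, and mapping back to the real client would need moc/webops as noted.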
We discovered that one of the two webservers was not functioning properly, despite the newrelic data saying it was OK but a bit slow. Restarting apache solved the intermittent timeouts (bug 1314130). Should have caught that quicker, sorry. I'm still curious what happens with disk I/O on buildbot3, whether checksums continue to give it a hard time, and whether there are quick answers to my questions in comment #2.
There was a replication issue on buildbot2 which was fixed on the 29th. buildbot3 sent almost 20 days of binary logs to buildbot2, which would explain the IO issues during this time. We can stop replication on buildbot2 since it's out of traffic. Should we decom it as well?
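For the record, stopping replication ahead of the decomm is a two-statement operation on the replica, and the master can then drop the binlogs it was retaining for that replica. A sketch in MySQL 5.6 syntax (the 7-day retention window is an example, not policy):

```sql
-- On buildbot2: stop replication and discard the replica configuration.
STOP SLAVE;
RESET SLAVE ALL;

-- On buildbot3, once no replica still needs them, old binary logs
-- can be purged to reclaim space and reduce I/O, e.g.:
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;
```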
I checked with catlee, and Releng agrees with stopping replication to buildbot2 and going ahead with decomm. There is no possibility we'll go back to MyISAM at this point.
Stopped replication on buildbot1,2.
Looks like this has been agreed to by Releng.
I guess we need another bug for decomm; can we close this bug?
Hi,

Please let us know if any further work is needed on this matter.

Thank you
[email@example.com]
We haven't seen any production-impacting issues since the replication backlog to buildbot2 was identified. The disk I/O on buildbot3 still seems uncomfortably high when the checksums check runs (6am/6pm UTC); high enough that newrelic alerts at > 90%.
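If it helps quantify that, the InnoDB data I/O counters can be sampled before and during the checksum window; the delta between two samples taken N seconds apart gives the read/write/fsync rates. A minimal sketch:

```sql
-- Cumulative InnoDB file I/O counters; sample twice and diff:
SHOW GLOBAL STATUS LIKE 'Innodb_data_reads';
SHOW GLOBAL STATUS LIKE 'Innodb_data_writes';
SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs';

-- Non-zero pending counts mean the disk is saturated right now:
SHOW GLOBAL STATUS LIKE 'Innodb_data_pending_%';
```

If the checksum job is pt-table-checksum or similar, it may also be worth throttling it (smaller chunks, or pausing when server load rises) rather than letting it run flat out twice a day.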
[beebe.pythian] Per discussion with matt, closing. If there are further issues or incidents, please feel free to reopen.
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED