Investigate high I/O utilisation on buildbot3.db.scl3

Status

RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: nthomas, Assigned: team73)

Tracking

Details

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
Created attachment 8806212 [details]
Newrelic graphs

Newrelic is reporting high I/O utilisation on the buildbot r/w master (see attachment), which is impacting some of our services. Could you please investigate what is causing that?

We are not aware of any changes on our side, and can't find any requests against the services which would cause long running queries. 'show processlist' as buildbot@buildbot-rw-vip.db.scl3 looks like normal buildbot work.
(Assignee)

Comment 1

2 years ago
Hi,

We found that IO load has increased since Oct 29th.
Also, when the checksum job runs we see high IO load from 6AM-10AM and again from 6PM-9PM.

We also found some select queries hitting the server that create temporary tables and use the filesort algorithm, which is expensive in disk IO and contributes to the load. We need to look into these queries for optimization possibilities as well.
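As an illustration of how such queries can be spotted (the table and columns below are hypothetical, not the actual buildbot schema), EXPLAIN reports them in the Extra column:

```sql
-- Hypothetical example: a grouped, sorted query that MySQL cannot
-- satisfy from an index, so it builds a temporary table and sorts it.
EXPLAIN SELECT builder_id, COUNT(*) AS runs
FROM   builds
GROUP  BY builder_id
ORDER  BY runs DESC;
-- Look for "Using temporary; Using filesort" in the Extra column.
-- When the intermediate result exceeds the in-memory tmp-table and
-- sort-buffer limits, it spills to disk, which costs disk IO.
```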

It would be great if we could get some details from the New Relic graphs on data size growth; it does not appear to have increased much.

We will dig further for more details.
We have also created Pythian ticket for internal tracking:
https://secure.pythian.com/track/cr/1101708


Thanks,
Rohit Kumar
(Reporter)

Comment 2

2 years ago
(Opening this bug up, so the sheriffs can check status)

Space used on /data has been pretty constant at 340GB (67% used). No big changes there.

gcox looked at the cron log, and found the db checksums job had been running prior to the 29th, and also looked for puppet changes:
'I see nothing changing in manifests/nodes/db_mysql since 2016-10-21, so that's not interesting from what I see.'

We're still seeing timeouts on the webservice (eg https://secure.pub.build.mozilla.org/buildapi/running, requires LDAP), but the checksums load has subsided.

I read this a while back:
"While a transaction is open, the system keeps a snapshot of the data as seen at the beginning of the transaction, which can cause substantial overhead if the system inserts, updates, and deletes millions of rows while a stray transaction keeps running. Thus, take care to avoid transactions that run for too long" 
(https://dev.mysql.com/doc/refman/5.6/en/converting-tables-to-innodb.html#innodb-convert-transactions)

Is there any way to check for long-lived db connections which might be holding a snapshot? It's possible we have badly behaved code from MyISAM days, where there were no transactions. With connections going via the Zeus VIP we may need to talk to moc/webops to resolve the host making the connections. Also, what is the global value of autocommit?
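For reference, on MySQL 5.6 the questions above can be answered server-side; a sketch (the 10-minute threshold is an arbitrary choice for illustration):

```sql
-- Transactions that have been open longer than 10 minutes and may be
-- pinning an old snapshot (InnoDB only):
SELECT trx_id, trx_started, trx_mysql_thread_id, trx_query
FROM   information_schema.INNODB_TRX
WHERE  trx_started < NOW() - INTERVAL 10 MINUTE;

-- The global autocommit setting:
SHOW GLOBAL VARIABLES LIKE 'autocommit';

-- Idle connections that might still hold an open transaction:
SELECT id, user, host, time, state
FROM   information_schema.PROCESSLIST
WHERE  command = 'Sleep' AND time > 600;
```

The PROCESSLIST host column shows the connecting host as the server sees it, which with a Zeus VIP in the path may still need webops to map back to the real client.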
Group: metrics-private
(Reporter)

Comment 3

2 years ago
We discovered that one of the two webservers was not functioning properly, despite the newrelic data saying it was OK but a bit slow. Restarting apache solved the intermittent timeouts (bug 1314130). Should have caught that quicker, sorry.

I'm still curious to see what happens with disk I/O on buildbot3, to see if checksums continue to give it a hard time, and to know if there are quick answers to my questions in comment #2.
(Assignee)

Comment 4

2 years ago
There was a replication issue on buildbot2 which was fixed on the 29th. buildbot3 sent almost 20 days of binary logs to buildbot2, which can explain the IO issues during this period.

We can stop replication on buildbot2 since it's out of traffic. 

Should we decom it as well?
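For the record, verifying and stopping that replication is a two-step check on the replica (standard MySQL 5.6 statements, run on buildbot2):

```sql
-- Check how far behind the replica is and how much relay log it holds
-- (see Seconds_Behind_Master and Relay_Log_Space in the output):
SHOW SLAVE STATUS\G

-- Stop replication once the host is confirmed to be out of traffic:
STOP SLAVE;
```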
Flags: needinfo?(mpressman)
(Reporter)

Comment 5

2 years ago
I checked with catlee; Releng agrees with stopping replication to buildbot2 and going ahead with the decomm. There is no possibility we'll go back to MyISAM at this point.
(Assignee)

Comment 6

2 years ago
Stopped replication on buildbot1 and buildbot2.
Looks like this has been agreed to by Releng.
Flags: needinfo?(mpressman)
(Assignee)

Comment 8

2 years ago
I guess we need another bug for the decomm; can we close this one?
(Assignee)

Comment 9

2 years ago
Hi

Please let us know if any further work is needed on this matter.

Thank you
[goncalves@pythian.com]
(Reporter)

Comment 10

2 years ago
We haven't seen any production-impacting issues since the replication to buildbot2 was identified as the cause. The disk I/O on buildbot3 still seems too high for comfort when the checksums check runs (6am/6pm UTC); high enough that newrelic alerts on > 90% utilisation.
(Assignee)

Updated

2 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 11

2 years ago
[beebe.pythian]
Per discussion with Matt, closing. If there are further issues or incidents, please feel free to reopen.
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED