Closed Bug 1153144 Opened 10 years ago Closed 10 years ago

Reduce CPU impact of twice daily Treeherder DB checksum task

Categories

(Data & BI Services Team :: DB: MySQL, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

References

Details

In bug 1153142, we're seeing the Treeherder prod DBs max out at 100% CPU and also hit high disk IO, e.g.:
https://rpm.newrelic.com/accounts/677903/servers/6106888?tw[dur]=last_60_minutes&tw[end]=1428648870

I'd like to increase CPUs from 2 to 4, and also possibly increase the amount of RAM (currently 10GB), presuming that would give MySQL more to work with for cache and so let it hit the disk less.

The nodes are:
treeherder1.db.scl3.mozilla.com
treeherder2.db.scl3.mozilla.com

Many thanks! :-)
For context, the TBPL DBs were something like 8 or 12 CPU and 32GB RAM, so even with this increase, we're still well under what they were using.
Unfortunately, you can't just compare physical to virtual - in phys land it's all about what you've bought, in VM land it's all about what you NEED.

We're taking a look at the VMs, and while we see memory constraints, the CPU picture is more muddled - CPU idle on the master is 20-50%, which indicates real usage of up to 80% or so. Depending on load and other factors that MAY be a concern - but it's not 100%.

The memory constraints look more clear cut. However, we're not DB admins - we'd like to ask the DBAs to take a look, as they've had a LOT of experience right-sizing DB VMs for the data they serve, as well as configuring the DB parameters to play well in a VM.

:sheeri/:mpressman - can you offer any guidance on suggested sizes/configurations?
I agree it's not a direct comparison, but note I'm not asking for more CPU/RAM _just because_ we had more before - I'm asking because at certain times of day (for up to an hour) the CPU usage _is_ pegged at 100% (and not 80%) - from my interpretation of New Relic anyway :-)

> The memory constraints look more clear cut. However, we're not DB admins -
> we'd like to ask the DBAs to take a look, as they've had a LOT of
> experience right-sizing DB VMs for the data they serve, as well as
> configuring the DB parameters to play well in a VM.

Agreed, good to loop them in - thank you :-)
Totally - wasn't saying it wasn't justified - just that there's a disagreement between New Relic and the box itself.

I am seeing a period of high CPU (not quite pegged, but closer than perhaps we'd like) around 1800-1930 UTC (some days much less; I picked the longest range). Does that match your times? Is there anything maintenance-wise going on at those times?

Memory-wise, it's just always high usage - hence why I'd like a DBA opinion on making sure A) we give you enough, and B) you're configured to make best use of it.
:-) I believe it is the checksum task, which runs twice a day; a previous IRC conversation with the DBAs said it can't be made any less aggressive. For this morning's run, for example, we saw the fairly constant 75ms average web app transaction time jump to an average of 250ms, with spikes as high as 400ms, over the interval when New Relic reported the high CPU usage.
So the problem is that the checksum goes through ALL the data - I'm not sure more RAM will actually help, since it needs to go through hundreds of gigs of data, so increasing from 8 to 10G won't necessarily buy much. We can add CPU - though I'm not confident that will help either, since a single MySQL query only runs on one CPU at a time, so the checksum maxes out one core before spilling onto the next.

Alternatively, what if we cut the checksum down to weekly, on Saturdays (or a specific time that's off-peak for Treeherder... Saturdays are a generic off-peak time for Mozilla properties)?

Right now the checksum runs twice daily, at 0600 and 1800 UTC. New Relic shows CPU peaks at both those times - but you didn't mention the 0600 timeframe. Maybe we could cut it down to once a day at 0600 UTC? (Or weekly - we're fine with that - the caveat is that if the data ever gets out of sync it could take up to a week for us to know.)
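[Editorial illustration of why the checksum is heavy: it effectively reads every row of every table. Below is a minimal Python sketch of a throttled per-table pass using PyMySQL; the host, credentials, and 30-second pause are hypothetical, and the real production job is a dedicated DBA-managed tool rather than this script.]

```python
# Hypothetical sketch only: checksum each table in turn, sleeping between
# tables so load is spread out instead of sustained at 100% CPU.
import time
import pymysql

# Host, user, password, and database are illustrative placeholders.
conn = pymysql.connect(host="treeherder1.db.scl3.mozilla.com",
                       user="checksum", password="change-me",
                       database="treeherder")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES")
        tables = [row[0] for row in cur.fetchall()]
        for table in tables:
            # CHECKSUM TABLE scans the whole table - this full-data scan is
            # what makes the task expensive on hundreds of GB of data.
            cur.execute("CHECKSUM TABLE `%s`" % table)  # sketch; not injection-safe
            print(cur.fetchone())
            time.sleep(30)  # trade wall-clock time for lower peak load
finally:
    conn.close()
```

[Even spread out like this, a full pass still reads all the data, which is why the proposal above is to reduce frequency rather than make the checksum itself less aggressive.]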
Reason I didn't mention the 0600 spike is that it is marginally smaller in time and size - though I can see sar logs showing more activity then. (Once I know what timeframes to look at, the analysis gets easier. :)

From a sysadmin side, it does look like it's tight (not exhausted) on RAM; however, I bow to your expertise.

:emorley - do any of these sound good to you as options?
Flags: needinfo?(emorley)
(In reply to Sheeri Cabral [:sheeri] from comment #6)
> Right now the checksum runs twice daily, at 0600 and 1800 UTC. New Relic
> shows CPU peaks at both those times - but you didn't mention the 0600
> timeframe. Maybe we could cut it down to once a day at 0600 UTC?

Let's try once a day at 0600 to start - thank you :-)
Flags: needinfo?(emorley)
Based on this no longer being a resource add, but instead potentially needing continued analysis, we're moving this to the data team. Obviously, if you need our consultation, we're happy to help.
Assignee: server-ops-virtualization → server-ops-database
Group: metrics-private
Component: Virtualization → Database Operations
Product: Infrastructure & Operations → Data & BI Services Team
QA Contact: cshields → scabral
Group: metrics-private
Summary: Increase CPUs and RAM on Treeherder prod DBs → Reduce CPU impact of twice daily Treeherder DB checksum task
I have changed the crons (via Puppet) to run only once daily, at 6 am UTC. Please verify and close in a few days if all is well.
We just had another spike: https://rpm.newrelic.com/accounts/677903/incidents/14935737 Is there definitely nothing else running at that time?
Nobody was doing anything on Treeherder today, so it should be the same as every day.

The checksums used to run at 6 am and 6 pm server time, which is UTC, so that was 11 pm and 11 am Pacific. We changed it to run at 6 am UTC only, i.e. 11 pm Pacific. Now you're asking about a spike from 9 am - 10:30 am Pacific, which is not one of the time periods we had talked about, so it's orthogonal to the checksum issue.

https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores#/overview/MySQL?value=calls_per_minute shows some of the slow queries. Have you looked to see whether more traffic was being sent?

Today's spike looks completely different - https://rpm.newrelic.com/accounts/677903/servers/6106888? - there's no disk I/O utilization, and CPU is higher than normal. It looks like this was right after an infrasec ssh: https://rpm.newrelic.com/accounts/677903/servers/6106888/processes#id=605230663&sort_by=cpu Maybe infrasec did something?
Flags: needinfo?(jvehent)
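[Editorial note: the timezone reasoning in the comment above can be double-checked with a small Python sketch; the date used is illustrative.]

```python
# Confirm that the 0600/1800 UTC cron times map to 11 pm / 11 am Pacific,
# well outside the 9:00-10:30 am Pacific window of the new spike.
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

utc = ZoneInfo("UTC")
pacific = ZoneInfo("America/Los_Angeles")

for hour in (6, 18):
    t = datetime(2015, 4, 23, hour, 0, tzinfo=utc)  # date is illustrative
    print(t.strftime("%H:%M UTC"), "->",
          t.astimezone(pacific).strftime("%H:%M Pacific"))
# 06:00 UTC -> 23:00 Pacific (previous day), 18:00 UTC -> 11:00 Pacific (during DST)
```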
I don't think it's related to our security tools. The last Nexpose scan on that box ran at 2:19:32 PM UTC. The last MIG Action ran at 2015-04-23 15:07:20.566949 +0000 UTC (and doesn't use SSH).
Flags: needinfo?(jvehent)
Looks better now, thank you :-)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED