Closed Bug 1153144 Opened 10 years ago Closed 10 years ago

Reduce CPU impact of twice daily Treeherder DB checksum task

Categories

(Data & BI Services Team :: DB: MySQL, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

References

Details

In bug 1153142, we're seeing the Treeherder prod DBs max out at 100% CPU and also hit high disk IO, e.g.:
https://rpm.newrelic.com/accounts/677903/servers/6106888?tw[dur]=last_60_minutes&tw[end]=1428648870

I'd like to increase CPUs from 2 to 4, and also possibly increase the amount of RAM (currently 10GB), presuming that would give MySQL more to work with for cache and so let it hit the disk less.

The nodes are:
treeherder1.db.scl3.mozilla.com
treeherder2.db.scl3.mozilla.com

Many thanks! :-)
For context, the TBPL DBs were something like 8 or 12 CPU and 32GB RAM, so even with this increase, we're still well under what they were using.
Unfortunately, you can't just compare physical to virtual - in phys land it's all about what you've bought, in VM land it's all about what you NEED.

We're taking a look at the VMs, and while we see memory constraints, the CPU picture is more muddled - CPU idle on the master is 20-50%, which indicates real usage of up to 80% or so. Depending on load and other factors that MAY be a concern - but it's not 100%.

The memory constraints look more clear cut. However, we're not DB admins - we'd like to ask the DBAs to take a look, as they've had a LOT of experience right-sizing DB VMs for the data they serve, as well as configuring the DB parameters to play well in a VM.

:sheeri/:mpressman - can you offer any guidance on suggested sizes/configurations?
I agree it's not a direct comparison, but note I'm not asking for more CPU/RAM _just because_ we had more before - I'm asking because at certain times of day (for up to an hour) the CPU usage _is_ pegged at 100% (and not 80%) - from my interpretation of New Relic anyway :-)

> The memory constraints look more clear cut. However, we're not DB admins -
> we'd like to ask the DBAs to take a look, as they've had a LOT of
> experience right-sizing DB VMs for the data they serve, as well as
> configuring the DB parameters to play well in a VM.

Agreed, good to loop them in - thank you :-)
Totally - wasn't saying it wasn't justified - just that there's a disagreement between New Relic and the box itself.

I am seeing a period of high CPU (not quite pegged, but closer than perhaps we'd like) around 1800-1930 UTC (some days much less; I picked the longest range). Does that match your times? Is there anything maintenance-wise going on at those times?

Memory-wise, it's just always high usage - hence why I'd like a DBA opinion on making sure A) we give you enough, and B) you're configured to make best use of it.
:-) I believe it is the checksum task, which runs twice a day; a previous IRC conversation with the DBAs said it can't be made any less aggressive. For this morning's run, for example, we saw the fairly constant 75ms average web app transaction time jump to an average of 250ms, with spikes as high as 400ms, over the interval when New Relic reported the high CPU usage.
So the problem is that the checksum goes through ALL the data - I'm not sure more RAM will actually help, since it needs to go through hundreds of gigs of data, so increasing from 8 to 10G won't necessarily buy much. We can add CPU - though I'm not confident that will help either, since a single MySQL query only runs on one CPU at a time, so the checksum maxes out one core before spilling onto the next.

Alternatively, what if we cut the checksum down to weekly, on Saturdays (or a specific time that's off-peak for Treeherder... Saturdays are a generic off-peak time for Mozilla properties)?

Right now the checksum runs twice daily, at 0600 and 1800 UTC. New Relic shows CPU peaks at both those times - but you didn't mention the 0600 timeframe. Maybe we could cut it down to once a day at 0600 UTC? (Or weekly - we're fine with that - the caveat is that if the data ever gets out of sync it could take up to a week for us to know.)
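[Editorial illustration of why the checksum is heavy: it effectively reads every row of every table. Below is a minimal Python sketch of a throttled per-table pass using PyMySQL; the host, credentials, and 30-second pause are hypothetical, and the real production job is a dedicated DBA-managed tool rather than this script.]

```python
# Hypothetical sketch only: checksum each table in turn, sleeping between
# tables so load is spread out instead of sustained at 100% CPU.
import time
import pymysql

# Host, user, password, and database are illustrative placeholders.
conn = pymysql.connect(host="treeherder1.db.scl3.mozilla.com",
                       user="checksum", password="change-me",
                       database="treeherder")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES")
        tables = [row[0] for row in cur.fetchall()]
        for table in tables:
            # CHECKSUM TABLE scans the whole table - this full-data scan is
            # what makes the task expensive on hundreds of GB of data.
            cur.execute("CHECKSUM TABLE `%s`" % table)  # sketch; not injection-safe
            print(cur.fetchone())
            time.sleep(30)  # trade wall-clock time for lower peak load
finally:
    conn.close()
```

[Even spread out like this, a full pass still reads all the data, which is why the proposal above is to reduce frequency rather than make the checksum itself less aggressive.]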
Reason I didn't mention the 0600 spike is that it is marginally smaller in time and size - though I can see sar logs showing more activity then. (Once I know what timeframes to look at, the analysis gets easier. :)

From a sysadmin side, it does look like it's tight (not exhausted) on RAM; however, I bow to your expertise.

:emorley - do any of these sound good to you as options?
Flags: needinfo?(emorley)
(In reply to Sheeri Cabral [:sheeri] from comment #6)
> Right now the checksum runs twice daily, at 0600 and 1800 UTC. New Relic
> shows CPU peaks at both those times - but you didn't mention the 0600
> timeframe. Maybe we could cut it down to once a day at 0600 UTC?

Let's try once a day at 0600 to start - thank you :-)
Flags: needinfo?(emorley)
Based on this no longer being a resource add, but instead potentially needing continued analysis, we're moving this to the data team. Obviously, if you need our consultation, we're happy to help.
Assignee: server-ops-virtualization → server-ops-database
Group: metrics-private
Component: Virtualization → Database Operations
Product: Infrastructure & Operations → Data & BI Services Team
QA Contact: cshields → scabral
Group: metrics-private
Summary: Increase CPUs and RAM on Treeherder prod DBs → Reduce CPU impact of twice daily Treeherder DB checksum task
I have changed the crons (via Puppet) to run only once daily, at 6 am UTC. Please verify and close in a few days if all is well.
We just had another spike: https://rpm.newrelic.com/accounts/677903/incidents/14935737 Is there definitely nothing else running at that time?
Nobody was doing anything on Treeherder today, so it should be the same as every day.

The checksums used to run at 6 am and 6 pm server time, which is UTC, so that was 11 pm and 11 am Pacific. We changed it to run at 6 am UTC only, i.e. 11 pm Pacific. Now you're asking about a spike from 9 am - 10:30 am Pacific, which is not one of the time periods we had talked about, so it's orthogonal to the checksum issue.

https://rpm.newrelic.com/accounts/677903/applications/4180461/datastores#/overview/MySQL?value=calls_per_minute shows some of the slow queries. Have you looked to see whether more traffic was being sent?

Today's spike looks completely different - https://rpm.newrelic.com/accounts/677903/servers/6106888? - there's no disk I/O utilization, and CPU is higher than normal. It looks like this was right after an infrasec ssh: https://rpm.newrelic.com/accounts/677903/servers/6106888/processes#id=605230663&sort_by=cpu Maybe infrasec did something?
Flags: needinfo?(jvehent)
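[Editorial note: the timezone reasoning in the comment above can be double-checked with a small Python sketch; the date used is illustrative.]

```python
# Confirm that the 0600/1800 UTC cron times map to 11 pm / 11 am Pacific,
# well outside the 9:00-10:30 am Pacific window of the new spike.
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

utc = ZoneInfo("UTC")
pacific = ZoneInfo("America/Los_Angeles")

for hour in (6, 18):
    t = datetime(2015, 4, 23, hour, 0, tzinfo=utc)  # date is illustrative
    print(t.strftime("%H:%M UTC"), "->",
          t.astimezone(pacific).strftime("%H:%M Pacific"))
# 06:00 UTC -> 23:00 Pacific (previous day), 18:00 UTC -> 11:00 Pacific (during DST)
```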
I don't think it's related to our security tools. The last Nexpose scan on that box ran at 2:19:32 PM UTC. The last MIG Action ran at 2015-04-23 15:07:20.566949 +0000 UTC (and doesn't use SSH).
Flags: needinfo?(jvehent)
Looks better now, thank you :-)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED