Closed Bug 1604489 Opened 6 years ago Closed 6 years ago

Socorro -new-prod: high cpu usage every day at 4:00am

Categories

(Socorro :: Infra, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: brian)

Details

Last week, I fixed verifyprocessed, which runs at 4:00am every day. Before the fix, it checked that every raw crash for the current day had a corresponding processed crash. Now it checks the previous day, which is a completed day. That makes it take a lot longer to run, since it's looking at a full day of crashes rather than just the first 4 hours.
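For readers unfamiliar with the check, here is a minimal sketch of the idea, not Socorro's actual code. The function names `day_to_check` and `find_missing` are hypothetical, and the sketch assumes crash IDs for a day can be listed as plain strings.

```python
from datetime import date, timedelta
from typing import Iterable, List


def day_to_check(today: date) -> date:
    """Return the previous day, which is already complete when the job runs."""
    return today - timedelta(days=1)


def find_missing(raw_ids: Iterable[str], processed_ids: Iterable[str]) -> List[str]:
    """Return IDs of raw crashes that have no corresponding processed crash."""
    return sorted(set(raw_ids) - set(processed_ids))


# Hypothetical data: crash "b" was never processed.
missing = find_missing(["a", "b", "c"], ["a", "c"])
```

The set difference grows with the number of crashes in the day, which is consistent with a full day taking much longer than 4 hours' worth.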

Looking at the cronrun logs, it takes between 8 and 10 minutes for verifyprocessed to run. During that time, it's pegging the CPU of the crontabber node at 100%. If cronrun runs any other tasks during that cycle, the node will exceed 10 minutes of continuous 100% CPU usage.

We have an alert that triggers if the average CPU is over 80% for 10 minutes on a non-processor node. So now it's triggering every morning at 4:00am.

This bug covers figuring out what to do about that.

I might be able to tinker with how verifyprocessed is running so as to reduce the CPU usage, but it might cause it to take longer to run. That possibly creates other issues. I don't know what to tinker with offhand and it's tricky to test outside of prod. We're in a changefreeze, so I'd have to wait on tinkering until January.
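One generic way to trade run time for lower average CPU is to pause between batches of work. This is only a sketch of that tradeoff, not something verifyprocessed is known to support; `process_throttled`, its parameters, and the batching itself are assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_throttled(
    batches: Iterable[Sequence[T]],
    work: Callable[[Sequence[T]], R],
    pause_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Run `work` on each batch, sleeping after each one to yield the CPU.

    The pauses lower average CPU usage but add pause_seconds per batch
    to the total run time.
    """
    results = []
    for batch in batches:
        results.append(work(batch))
        sleep(pause_seconds)  # idle time between batches
    return results
```

The `sleep` parameter is injectable so the behavior can be checked without actually waiting.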

Maybe we should decide this is the new normal and adjust the alert?

I'm not sure what other options we have.

I've bumped the alert to require 30 minutes of average CPU over 70% before sending an email.
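To illustrate why the new rule stays quiet for the morning run, here is a minimal sketch comparing the two rules. It assumes the alert evaluates one average-CPU sample per minute and fires when enough consecutive samples exceed the threshold; the real alerting system may evaluate this differently. The function name and the sample trace are hypothetical.

```python
from typing import Sequence


def alert_fires(samples: Sequence[float], threshold: float, minutes: int) -> bool:
    """Return True if `minutes` consecutive per-minute samples all exceed `threshold`."""
    run = 0
    for cpu in samples:
        # Extend the current run of high samples, or reset it.
        run = run + 1 if cpu > threshold else 0
        if run >= minutes:
            return True
    return False


# Hypothetical trace: idle at 5%, then 12 minutes pegged at 100%, then idle.
trace = [5.0] * 30 + [100.0] * 12 + [5.0] * 30

old_rule = alert_fires(trace, threshold=80, minutes=10)  # over 80% for 10 minutes
new_rule = alert_fires(trace, threshold=70, minutes=30)  # over 70% for 30 minutes
```

Under these assumptions, the 12-minute spike satisfies the old rule but not the new one, since the run of high samples is shorter than 30 minutes.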

Assignee: nobody → bpitts
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED