Socorro -new-prod: high cpu usage every day at 4:00am
Categories
(Socorro :: Infra, defect, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: brian)
Details
Last week, I fixed verifyprocessed, which runs at 4:00am every day. Before the fix, it checked that every raw crash for the current day had a corresponding processed crash, which at 4:00am covers only 4 hours of crashes. It now checks the previous day, which is a completed day. That makes it take a lot longer to run, since it's looking at a full day of crashes rather than just 4 hours.
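As a rough illustration of the check described above, the sketch below picks the previous completed day and lists raw crash IDs that have no processed counterpart. The function names, the in-memory sets, and the example date are assumptions for illustration only; the actual verifyprocessed job reads from Socorro's storage and its code is not shown in this bug.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def previous_day(now: datetime) -> date:
    """Return the last fully completed calendar day before `now`."""
    return now.date() - timedelta(days=1)


def find_missing_processed(raw_ids: Iterable[str], processed_ids: Iterable[str]) -> List[str]:
    """Return raw crash IDs that have no corresponding processed crash."""
    return sorted(set(raw_ids) - set(processed_ids))


# Hypothetical run time and crash IDs, for illustration only.
run_time = datetime(2020, 1, 15, 4, 0)   # the job fires at 4:00am
day_to_check = previous_day(run_time)     # a full day, not the 4 hours so far
raw = {"crash-a", "crash-b", "crash-c"}
processed = {"crash-a", "crash-c"}
missing = find_missing_processed(raw, processed)
print(day_to_check, missing)
```

Checking a completed day means the set of raw crashes is no longer growing while the job runs, at the cost of comparing a much larger set.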
Looking at the cronrun logs, verifyprocessed takes between 8 and 10 minutes to run. During that time, it pegs the CPU of the crontabber node at 100%. If cronrun runs any other tasks during that cycle, the node exceeds 10 minutes of continuous 100% CPU usage.
We have an alert that triggers if the average CPU is over 80% for 10 minutes on a non-processor node, so it's now triggering every morning at 4:00am.
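To make the alert condition concrete, here is a minimal sketch that assumes the alert evaluates the mean of per-minute CPU samples over a sliding 10-minute window. That evaluation model, the function name, and all sample values are assumptions; the real monitoring rule is not shown in this bug.

```python
def alert_fires(samples, window, threshold):
    """Return True if any `window` consecutive samples average above `threshold`."""
    for start in range(len(samples) - window + 1):
        chunk = samples[start:start + window]
        if sum(chunk) / window > threshold:
            return True
    return False


# Hypothetical per-minute CPU percentages for the crontabber node.
idle = [5] * 10
verify = [100] * 9        # verifyprocessed pegging the CPU for 9 minutes
other_task = [100] * 3    # another cron task in the same cycle

timeline = idle + verify + other_task + idle
short_burst = idle + [100] * 5 + idle

print(alert_fires(timeline, window=10, threshold=80))     # True: 12 busy minutes
print(alert_fires(short_burst, window=10, threshold=80))  # False: best window averages 52.5
```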
This bug covers figuring out what to do about that.
Comment 1 • 6 years ago (Reporter)
I might be able to tinker with how verifyprocessed runs to reduce the CPU usage, but that might make it take longer to run, which possibly creates other issues. I don't know what to tinker with offhand, and it's tricky to test outside of prod. We're in a change freeze, so I'd have to wait until January to tinker.
Maybe we should decide this is the new normal and adjust the alert?
I'm not sure what other options we have.
Comment 2 • 6 years ago (Assignee)
I've bumped the alert to require 30 minutes of average cpu over 70% before sending an email.
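For a sense of how much headroom the new rule gives, the sketch below works through the arithmetic under stated assumptions: the job holds the CPU at 100% while it runs, and the node otherwise sits at a hypothetical 5% baseline. Neither the baseline nor the helper names come from this bug.

```python
def window_average(busy_minutes, window_minutes, busy_cpu=100.0, baseline_cpu=5.0):
    """Average CPU over a window with `busy_minutes` at busy_cpu and the rest at baseline_cpu."""
    idle_minutes = window_minutes - busy_minutes
    return (busy_minutes * busy_cpu + idle_minutes * baseline_cpu) / window_minutes


def min_busy_minutes_to_alert(window_minutes, threshold, busy_cpu=100.0, baseline_cpu=5.0):
    """Smallest whole number of busy minutes that pushes the window average above threshold."""
    for busy in range(window_minutes + 1):
        if window_average(busy, window_minutes, busy_cpu, baseline_cpu) > threshold:
            return busy
    return None


# A 10-minute run at 100% inside a 30-minute window, with the assumed 5% baseline.
print(round(window_average(10, 30), 1))       # 36.7, well under the 70% threshold
print(min_busy_minutes_to_alert(30, 70))      # 21 busy minutes needed to trip the alert
```

Under these assumptions, the daily 8 to 10 minute run stays far below the new threshold, while a job that pegs the CPU for more than 20 minutes would still send an email.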
Updated • 6 years ago (Assignee)