Closed Bug 774161 Opened 13 years ago Closed 13 years ago

reduce the # of cron errors sent to infra-dbnotices

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cshields, Assigned: scabral)

Details

(Whiteboard: [2012q3])

(internal IT goal) Focus some time in Q3 on reducing the number of these errors. If some help is needed in adding a bit more intelligence to our cron scripts so they are more resilient w/r/t error cases, reach out to Rob for help.
The goal is to "reduce the number of these errors." How are we going to measure success here, since it doesn't appear to be an all-or-nothing task? Is there some percentage of cleanup we are looking for?
So, some numbers: from a random sampling of Jul 1-7th (one week), there were 49 different e-mail threads containing 2,048 messages in total (each e-mail thread had multiple messages). But there are only a few distinct errors behind them. For example, the kill_pigs-from-cron.pl script accounted for 34 of the 49 threads. So I'd say a 50% reduction is a reasonable target.
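To make the proposed target concrete, here is a minimal sketch of the arithmetic using only the figures quoted above. Reading "50% reduction" as "at most half the sampled weekly thread count" is an assumption, not something the comment spells out.

```python
# Figures quoted in the comment above (Jul 1-7 sample, one week).
total_threads = 49
kill_pigs_threads = 34

# Share of all threads produced by kill_pigs-from-cron.pl.
kill_pigs_share = kill_pigs_threads / total_threads

# Threads that would remain if the kill_pigs output stopped being mailed.
remaining_threads = total_threads - kill_pigs_threads

# Assumed reading of the "50% reduction" goal: half the sampled thread count.
target_threads = total_threads * 0.5

print(f"kill_pigs share of threads: {kill_pigs_share:.1%}")
print(f"threads left without kill_pigs: {remaining_threads}")
print(f"threads allowed under a 50% target: {target_threads}")
```

Under that reading, silencing the kill_pigs output alone would already bring the thread count below the target.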
While looking into the numbers, I'm noticing that the kill_pigs-from-cron.pl messages are all coming from app-bugs01, app-bugs02, and app-bugs03. It should be relatively easy to stop cron from sending these. They are actually not errors, they're output, but the output is relatively useless: all we get is the PID and the # of seconds, for example: "killing pid 22635 at 360 seconds." That doesn't really help unless we have the general log turned on, because the slow query log doesn't log a query until it finishes (only then does it know how long the whole query took), and a killed query never finishes. If the script showed us what query was being run, it'd be more useful. I think it's reasonable to set a goal of optimizing the killed queries so the kills happen less frequently. If they're just crazy searches, we may want to consider logging the kill information locally on the machine rather than sending it to infra-dbnotices (if it's truly just noise).
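A rough sketch of the two ideas in this comment: include the query text in the kill message, and write it to a local log instead of standard output, since cron mails whatever a job prints. This is not the actual kill_pigs-from-cron.pl logic (that script is Perl and its source is not shown here). The function names, the logger name, and the idea of passing rows as dicts are all assumptions for illustration; the `Id`, `Time`, and `Info` keys follow the column names of MySQL's `SHOW FULL PROCESSLIST`.

```python
import logging
import re

# Matches the existing output line, e.g. "killing pid 22635 at 360 seconds."
KILL_LINE = re.compile(r"killing pid (\d+) at (\d+) seconds")


def parse_kill_line(line):
    """Return (pid, seconds) from an existing output line, or None if no match."""
    match = KILL_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def describe_kill(row, max_query_chars=200):
    """Build a kill message that includes the query text.

    `row` is a dict keyed by SHOW FULL PROCESSLIST column names.
    Info can be NULL (None) when the thread is not running a statement.
    """
    query = row.get("Info") or "<no query text>"
    # Collapse whitespace so a multi-line query fits on one log line.
    query = " ".join(query.split())[:max_query_chars]
    return f"killing pid {row['Id']} at {row['Time']} seconds: {query}"


def make_local_logger(path):
    """Return a logger that appends to a local file (hypothetical path).

    Nothing is written to stdout or stderr, so cron has no output to mail.
    """
    logger = logging.getLogger("kill_pigs_sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.FileHandler(path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Example row with made-up query text, for illustration only.
    example_row = {"Id": 22635, "Time": 360, "Info": "SELECT *\n  FROM some_table"}
    print(describe_kill(example_row))
```

In a real script the message would go through `make_local_logger(...).info(...)` rather than `print`, so that the kill record stays on the machine and only genuine failures reach infra-dbnotices.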
Adding bug 775248 - another frequent cron error (every 15 minutes) is that addons1 can't copy binary logs to db-backup1.ops.phx1.mozilla.com for incremental backup purposes. Once that's fixed, we should change the script to copy from addons7 anyway.
As for the backup logs, both the name and the VLAN have changed, so I changed the script to set: BACKUPDEST="backup1.db.phx1.mozilla.com". I changed this on addons1 itself, and am working on changing puppet now.
This file is not under puppet control, so that's not a problem. This particular cron script is A-OK.
We are down to 10 threads, with 38 new messages, from Aug 16-23rd. I'm calling this resolved. Next quarter we can work on the specific errors that haven't been resolved here, now that we can actually handle reading the threads.
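For reference, a small sketch of how the before and after samples compare. It assumes the Jul 1-7 and Aug 16-23 windows are roughly comparable weekly samples, and that the "38 new messages" count is measured the same way as the earlier 2,048; neither is confirmed in the thread.

```python
def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# Thread and message counts quoted in the comments above.
threads_reduction = percent_reduction(49, 10)
messages_reduction = percent_reduction(2048, 38)

print(f"threads:  {threads_reduction:.1f}% fewer")
print(f"messages: {messages_reduction:.1f}% fewer")
```

Both figures are well past the 50% reduction suggested earlier in the bug.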
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team