Closed Bug 513590 Opened 15 years ago Closed 15 years ago

[AMO] Getting MySQL errors in production

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

All
Other
task
Not set
major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: clouserw, Assigned: chizu)

References

Details

(Whiteboard: [blocked on locks])

We're getting a lot[1] of errors in production on the cron jobs we run, mostly MySQL Error 0 which I think usually means that the SQL server has gone away or somehow timed out / hit a limit.  I thought we upped the timeouts on the server but maybe that got reverted or we're beginning to hit the new limits?  Can you tell me what the timeouts are in the config?

Also, this only appears to be happening on production and not preview or our dev boxes.  Production does have more data but we resync preview every month or two so it seems like this would happen on at least 1 cron job there.  Anyway, can you verify all settings between preview and production are the same and let us know what they are?  Thanks

[1] I'd guess double or triple digits of failed queries per day.
Assignee: server-ops → justdave
We're averaging 9 emails per day.  A few have 1 error in them and 5 (the ADU jobs) have hundreds per mail.
No response for 11 days.  Bumping to major.
Severity: normal → major
Wil, could you forward a copy of the errors or attach a copy?
They are all available at https://mail.mozilla.org/private/amo-developers/

Examples:

PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='1680.2473'  WHERE `id`='4'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123


PHP Notice:  MySQL Error 0: 
Query was: [
            DELETE users
            FROM users
            LEFT JOIN addons_users on users.id = addons_users.user_id
            WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK)
            AND confirmationcode != ''
            AND addons_users.user_id IS NULL
        ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
Which database do the cron jobs actually hit?  Do the hit a slave or the master?
The amo master has a wait timeout of 820 seconds.  staging master (where preview points) is set for 120 seconds.

Off to check the slaves next.
Hmm, the slaves are all set at 820 across the board, too, so I guess that wouldn't matter which one you hit if it was really a timeout.

On the other hand, the slave connections go through the load balancer...  We did have an issue with the Netscaler a year or two ago where it would randomly drop connections in mysql pools...  thought that was long gone though.
We're only getting timeouts on the master. (All reads go to the slaves, writes to the master)

We're doing some really long winded queries and were doing some that are a gazillion 1 row updates in a loop.  We can make them more efficient if we have to, but I'd like to figure out what the problem is.
Any updates on this?  It's been a few weeks and we're getting a lot of broken queries every day
I think we're all focusing on comment #4 but Wil said that was just an example.  Looking at the archives,

PHP Notice:  MySQL Error 1451: Cannot delete or update a parent row: a foreign key constraint fails (`addons_reskin/api_auth_tokens`, CONSTRAINT `api_auth_tokens_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`))
Query was: [
            DELETE users
            FROM users
            LEFT JOIN addons_users on users.id = addons_users.user_id
            WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK)
            AND confirmationcode != ''
            AND addons_users.user_id IS NULL
        ] in /data/amo/www/addons.mozilla.org-preview/bin/database.class.php on line 123




PHP Notice:  MySQL Error 0: 
Query was: [
            DELETE users
            FROM users
            LEFT JOIN addons_users on users.id = addons_users.user_id
            WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK)
            AND confirmationcode != ''
            AND addons_users.user_id IS NULL
        ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123

PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='1660.2775'  WHERE `id`='4'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
For the record, we can handle the foreign key problems - that's one of the reasons we have the logging setup in the first place.  It's the Error 0 problems I'm concerned about.  Thanks.
Assignee: justdave → thardcastle
Error 0 is pointing to connection issues. There's a large spike in network traffic on the master when some of the error 0 queries failed. But other than that, there's not an obvious load problem. Queries do appear to work fine (fast, no errors) if they can get a connection.
the error 0 ones - did those actually fail? How would I check?
(In reply to comment #13)
> the error 0 ones - did those actually fail? How would I check?

I'm pretty sure they are failing.  Taking the example from comment #10, I think this should be returning zero:

SELECT count(id) FROM users
LEFT JOIN addons_users on users.id = addons_users.user_id
WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK)
AND confirmationcode != ''
AND addons_users.user_id IS NULL

but it returns almost 30000.  Change it to INTERVAL 3 WEEK if you want to be extra sure and it's still over 25k.
What is the next step here?  justdave/chizu?
Doesn't look like munin is monitoring connection counts at all.  That'd be a good thing to have record of, so then we can see if there's any correlation between these errors and the DB server hitting MaxClients or something.  Although it seems like running out of connections would generate a real error code and not 0.
Comments from Mark Smith - 

Couple thoughts.

The error 0 is a connection issue, but it sounds like you've got that
already (read the bug).  One general purpose comment before diving in:

The connection itself can change the timeout.  I.e., if you set your
connection timeout on the server to 820 seconds, but the MySQL client
connection (from PHP) uses "SET wait_timeout = 5" or something low,
then the timeout will be 5 seconds.

This means, you SHOULD set the connection timeout to be LOW on the
server, and then tell the clients [cron jobs, AMO, etc] to set a HIGH
wait_timeout if they need it.  This makes sure that things like web
servers that don't need long running connections don't sit and eat up
the connection limit, and still lets cron jobs do useful things
without timing out.

Now, as to this particular error...

I'd put my money on this error being caused by connections sitting
open while log processing is being done.  I'm pretty sure that the AMO
log processing cron job works something like this (from memory):

1. Connect to the database
2. Look at directory of log files
3. Process one log file
4. Save results to database
5. Repeat from #2 until done
6. Disconnect from the database

The problem is if step #3 takes more than your current timeout value,
the connection goes away.  So, most of the time, if processing a log
file is pretty fast, it's no issue.  Then on really big files, it will
error out.

There are ways of dealing with this.  You can move step 1 to be part
of step 4 (so it connects to the database for every log file), which
is probably what I'd do.

They can verify this behavior by having their stats system print out
logs of every query it executes, which database it executes it on, and
what time it executes at.  I.e., something like,

[2009-09-30 20:37:02 slave] SELECT ...
[2009-09-30 20:37:03 master] UPDATE ...

Put that in the database layer that actually issues the query.  (I
recall AMO has a nice DB abstraction to do the memcache storage, which
should make this fairly trivial to implement.)

The other thing that could be happening is if they do (instead of log
processing) lots of SELECTs on the *slave*, but don't actually update
the master very much.  Basically, it comes down to the fact that if
they ever take more than the timeout value of time between using a
connection, it gets disconnected.  And that's pretty easy to do.
> This means, you SHOULD set the connection timeout to be LOW on the
> server, and then tell the clients [cron jobs, AMO, etc] to set a HIGH
> wait_timeout if they need it.  This makes sure that things like web
> servers that don't need long running connections don't sit and eat up
> the connection limit, and still lets cron jobs do useful things
> without timing out.

We're not changing this on the client.  If we think it will help, I'm happy to do it though.

> I'd put my money on this error being caused by connections sitting
> open while log processing is being done.  I'm pretty sure that the AMO
> log processing cron job works something like this (from memory):

AMO doesn't do log processing like that any more, but it does do some long running jobs.  Several of the form:

1) SELECT a gazillion rows, iterate over each one doing an INSERT
2) Do a long running UPDATE that affects a gazillion rows

It looks like we're having failures of both types to me.

> They can verify this behavior by having their stats system print out
> logs of every query it executes, which database it executes it on, and
> what time it executes at.

We have some amount of debugging that we can flip on, which I'll do now.

> The other thing that could be happening is if they do (instead of log
> processing) lots of SELECTs on the *slave*, but don't actually update
> the master very much.  Basically, it comes down to the fact that if
> they ever take more than the timeout value of time between using a
> connection, it gets disconnected.  And that's pretty easy to do.

This is an interesting idea, but not something that I think would happen.  I'll turn on debugging though and hopefully that can help us figure out timings.  For the record, all SELECTs go to the slaves and any updating goes to the master.
Depends on: 519909
Alright, we had debugging turned on for a day and a half.  There are over 1300 cron jobs per day that run, very few actually fail.  Performance of each cron job can vary wildly.  This cron job runs every 10 minutes and affects 22k rows:

UPDATE collections SET rating= IFNULL(CAST(upvotes - downvotes AS SIGNED) * LN(upvotes + downvotes),0)

Sometimes it takes about 2 minutes and other times it takes well over an hour to complete.  There is no locking in our maintenance script so chances are jobs like that are tripping over themselves.

Regarding this bug specifically, and Error 0, it seems to happen most often in the morning (5-8am) but that's not a rule.  Also, it doesn't appear to be related to timing out specifically as a lot of these jobs are short.  Some examples from last night:

--------------------------------------------------------------------------------
08:15:03 Updating average ratings...
PHP Notice:  MySQL Error 0: 
Query was: [
            UPDATE addons AS a
            INNER JOIN (
                SELECT
                    versions.addon_id as addon_id,
                    AVG(rating) as avg_rating
                FROM reviews
                INNER JOIN versions ON reviews.version_id = versions.id
                WHERE reviews.reply_to IS NULL
                    AND reviews.rating > 0
                GROUP BY versions.addon_id
            ) AS c ON (a.id = c.addon_id)
            SET a.averagerating = ROUND(c.avg_rating, 2)
        ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
08:16:01 Updating bayesian ratings...
08:16:19 Affected rows: 1
08:16:19 Time: 75.707126140594
08:16:19 Exiting ...
--------------------------------------------------------------------------------
This died in under 75 seconds.  There was no read query here, it's just a single query to the master.


--------------------------------------------------------------------------------
08:15:03 Updating average ratings...
PHP Notice:  MySQL Error 0: 
Query was: [
            UPDATE addons AS a
            INNER JOIN (
                SELECT
                    versions.addon_id as addon_id,
                    AVG(rating) as avg_rating
                FROM reviews
                INNER JOIN versions ON reviews.version_id = versions.id
                WHERE reviews.reply_to IS NULL
                    AND reviews.rating > 0
                GROUP BY versions.addon_id
            ) AS c ON (a.id = c.addon_id)
            SET a.averagerating = ROUND(c.avg_rating, 2)
        ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
08:16:01 Updating bayesian ratings...
08:16:19 Affected rows: 1
08:16:19 Time: 75.707126140594
08:16:19 Exiting ...
--------------------------------------------------------------------------------
This died in under 75 seconds.  There was no read query here, it's just a single query to the master.  This job also died at 7:15 and 7:35, but it ran fine (couple dozen times) the rest of the night.


--------------------------------------------------------------------------------
04:35:04 Retrieving ADU counts from `update_counts` ...
PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='1643.8061'  WHERE `id`='4'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='439.8699'  WHERE `id`='5'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='5198.8214'  WHERE `id`='6'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='234207.2500'  WHERE `id`='7'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123

<...many more pages...>

PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='1.0000'  WHERE `id`='14634'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
PHP Notice:  MySQL Error 0: 
Query was: [UPDATE `config` SET `value`='0' WHERE `key`='stats_updating'] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
05:18:02 Affected rows: 10799
05:18:02 Time: 2580.116437912
05:18:02 Exiting ...
--------------------------------------------------------------------------------
This one points to the jobs tripping over themselves.  This job runs every 30 minutes and every time it took under 30 minutes (sometimes as low as 4 minutes) it worked fine.  Every time it went over, we got pages of these errors.  This is a job that retrieves 10k rows, loops over them, and does an UPDATE on each.


--------------------------------------------------------------------------------
05:05:02 Starting review total updates...
PHP Notice:  MySQL Error 0: 
Query was: [
            UPDATE addons AS a
            INNER JOIN (
                SELECT
                    versions.addon_id as addon_id,
                    COUNT(*) as count
                FROM reviews
                INNER JOIN versions ON reviews.version_id = versions.id
                WHERE reviews.reply_to IS NULL
                    AND reviews.rating > 0
                GROUP BY versions.addon_id
            ) AS c ON (a.id = c.addon_id)
            SET a.totalreviews = c.count
        ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
05:05:54 Affected rows: 1
05:05:54 Time: 52.831606864929
05:05:54 Exiting ...
--------------------------------------------------------------------------------
This job runs every 20 minutes, it's a single UPDATE statement written to the master.  The rest of the night it took 20-40 seconds to run.  At 5:05am it failed in 52 seconds with Error 0.  Definitely not tripping over itself - perhaps tripping over another job?


--------------------------------------------------------------------------------
07:10:04 Retrieving all add-on ids...
07:10:04 Retrieving seven-day counts from `download_counts` ...
07:10:05 Updating seven day counts in `main` ...
PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `weeklydownloads`='86' WHERE `id`='151'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
07:19:24 Affected rows: 10034
07:19:24 Time: 560.40586519241
07:19:24 Exiting ...
--------------------------------------------------------------------------------
Again, this ran every 20 minutes all night until 7ish and then failed in 2 consecutive jobs.  It took 560 seconds here but other times that night it took longer and ran fine.


--------------------------------------------------------------------------------
06:13:01 Starting global stats update...
06:13:01 Updating addon_total_downloads...
06:15:57 Updating addon_downloads_new...
06:15:57 Updating addon_count_public...
06:15:57 Updating addon_count_pending...
06:15:57 Updating addon_count_experimental...
06:15:57 Updating addon_count_nominated...
06:15:57 Updating addon_count_new...
06:15:58 Updating version_count_new...
06:15:58 Updating user_count_total...
PHP Notice:  MySQL Error 0: 
Query was: [REPLACE INTO global_stats (name, count, date) VALUES ('user_count_total', (SELECT COUNT(*) FROM users), '2009-10-02')] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
06:16:23 Updating user_count_new...
06:16:33 Updating review_count_total...
06:16:34 Updating review_count_new...
06:16:34 Updating collection_count_total...
06:16:34 Updating collection_count_new...
06:16:34 Updating collection_count_private...
06:16:34 Updating collection_count_public...
06:16:34 Updating collection_count_autopublishers...
06:16:34 Updating collection_count_editorspicks...
06:16:34 Updating collection_count_normal...
06:16:34 Updating collection_addon_downloads...
06:16:43 Updating collector_total_downloads...
06:16:44 Updating addon_total_updatepings...
06:16:44 Updating collector_updatepings...
06:16:44 Affected rows: 23
06:16:44 Time: 222.846118927
06:16:44 Exiting ...
--------------------------------------------------------------------------------
This job does a REPLACE with a nested SELECT for each of the values up there ^^.  You can see this failed within a few seconds.  That particular query was doing a "SELECT count(*) from users" which is one of our biggest table and may have been in use by another cron job.


--------------------------------------------------------------------------------
00:42:01 Removing user accounts that haven't been confirmed for two weeks...
PHP Notice:  MySQL Error 0: 
Query was: [
            DELETE users
            FROM users
            LEFT JOIN addons_users on users.id = addons_users.user_id
            WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK)
            AND confirmationcode != ''
            AND addons_users.user_id IS NULL
        ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
00:43:12 Affected rows: 1
00:43:12 Time: 70.957583904266
00:43:12 Exiting ...
--------------------------------------------------------------------------------
This job only runs once per day but seems to fail every time.  It only takes 70 seconds to fail.


So, it seems like we have some jobs tripping over their older selves, and perhaps we have jobs tripping over different jobs that are using the same tables (not sure why that would be causing failures).  I think a good first step right now would be to add a locking flag to our maintenance script to keep jobs from tripping over themselves.  Hopefully that will reduce the crazy differences in query execution time.  What do you think?

> Doesn't look like munin is monitoring connection counts at all.  That'd be a
> good thing to have record of, so then we can see if there's any correlation
> between these errors and the DB server hitting MaxClients or something. 
> Although it seems like running out of connections would generate a real error
> code and not 0.

What did you find out here - are we hitting the max_connections?
> What did you find out here - are we hitting the max_connections?

oremj, can that be trended in munin?
15 cron jobs failed in about the past hour.  Any updates here?
(In reply to comment #21)
> 15 cron jobs failed in about the past hour.  Any updates here?

I wonder if it's correlated with the new sphinx indexer?
Any updates?  Haven't heard anything for 10 days
We looked at this when the Percona guy was here this morning, and it does turn out that the cron jobs are backing up all over themselves.  There were 43 copies of share_count_totals running and 39 copies of collections_ratings, all running concurrently.
There should have been a large number fail this morning because we manually killed them off.
The sphinx indexer was hitting the database hard for about an hour when it was deployed. Since then, it hasn't contributed much to load.

i/o wait spikes during or around cron jobs, and often leaves MySQL threads piling up. To fix this, we'll need to reduce queries to the master or increase capacity somehow (faster hardware, sharding, clustering). Memory might help a little, stale pages do get swapped out:

             total       used       free     shared    buffers     cached
Mem:          3953       3917         35          0        138        723
-/+ buffers/cache:       3055        897
Swap:         1983         50       1933
I've got bug 522112 filed to request additional RAM for the AMO DB servers, based on the recommendation from Percona this morning.
bug 522073 is tracking locks on cron jobs.
Depends on: 522112
Depends on: 522073
(In reply to comment #28)
> bug 522073 is tracking locks on cron jobs.

Should we keep this bug open until that one is resolved?  Or is this bug mostly resolved?
We are still getting many error 0s.  We don't actually know if cron locks will fix this, so I don't think we should close this.

We've moved a bunch of the cron jobs to python and they are using (cron) locks and they are failing just as often with (mysql) lock issues.  There are a couple other open IT bugs about this including adjusting the timing in hopes of having less conflicts (bug 526553) and potentially being out of disk space (bug 526414).
Whiteboard: [blocked on locks]
Update: we haven't seen any cron errors since the changes in bug 529413 were applied.
Since we don't know what causes the error 0s it's tough to call this fixed.  On the other hand, there is no LOOKSLIKEITSWORKINGNOW, so I guess FIXED is as good as it gets.

Either way, I'm happy to not have cron email every 10 minutes. :)
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
(In reply to comment #32)
> Since we don't know what causes the error 0s it's tough to call this fixed.  On
> the other hand, there is no LOOKSLIKEITSWORKINGNOW, so I guess FIXED is as good
> as it gets.

LOOKSLIKEITSWORKINGNOW is called WORKSFORME. :)
Resolution: FIXED → WORKSFORME
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.