Closed
Bug 513590
Opened 15 years ago
Closed 15 years ago
[AMO] Getting MySQL errors in production
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: clouserw, Assigned: chizu)
References
Details
(Whiteboard: [blocked on locks])
We're getting a lot[1] of errors in production on the cron jobs we run, mostly MySQL Error 0, which I think usually means the SQL server has gone away or somehow timed out / hit a limit. I thought we upped the timeouts on the server, but maybe that got reverted or we're beginning to hit the new limits? Can you tell me what the timeouts are in the config?

Also, this only appears to be happening on production, not on preview or our dev boxes. Production does have more data, but we resync preview every month or two, so it seems like this would happen on at least one cron job there. Anyway, can you verify that all settings between preview and production are the same and let us know what they are? Thanks.

[1] I'd guess double or triple digits of failed queries per day.
Updated•15 years ago
Assignee: server-ops → justdave
Reporter
Comment 1•15 years ago
We're averaging 9 emails per day. A few have 1 error in them, and 5 (the ADU jobs) have hundreds per email.
Reporter
Comment 2•15 years ago
No response for 11 days. Bumping to major.
Severity: normal → major
Comment 3•15 years ago
Wil, could you forward a copy of the errors or attach a copy?
Reporter
Comment 4•15 years ago
They are all available at https://mail.mozilla.org/private/amo-developers/

Examples:

PHP Notice: MySQL Error 0: Query was: [ UPDATE `addons` SET `average_daily_users`='1680.2473' WHERE `id`='4' ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123

PHP Notice: MySQL Error 0: Query was: [ DELETE users FROM users LEFT JOIN addons_users on users.id = addons_users.user_id WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK) AND confirmationcode != '' AND addons_users.user_id IS NULL ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
Comment 5•15 years ago
Which database do the cron jobs actually hit? Do they hit a slave or the master?
Comment 6•15 years ago
The amo master has a wait timeout of 820 seconds. The staging master (where preview points) is set to 120 seconds. Off to check the slaves next.
Comment 7•15 years ago
Hmm, the slaves are all set at 820 across the board, too, so I guess that wouldn't matter which one you hit if it was really a timeout. On the other hand, the slave connections go through the load balancer... We did have an issue with the Netscaler a year or two ago where it would randomly drop connections in mysql pools... thought that was long gone though.
Reporter
Comment 8•15 years ago
We're only getting timeouts on the master. (All reads go to the slaves, writes to the master.) We're doing some really long-winded queries, and were doing some that are a gazillion one-row updates in a loop. We can make them more efficient if we have to, but I'd like to figure out what the problem is.
Reporter
Comment 9•15 years ago
Any updates on this? It's been a few weeks and we're getting a lot of broken queries every day.
Comment 10•15 years ago
I think we're all focusing on comment #4, but Wil said that was just an example. Looking at the archives:

PHP Notice: MySQL Error 1451: Cannot delete or update a parent row: a foreign key constraint fails (`addons_reskin/api_auth_tokens`, CONSTRAINT `api_auth_tokens_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`)) Query was: [ DELETE users FROM users LEFT JOIN addons_users on users.id = addons_users.user_id WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK) AND confirmationcode != '' AND addons_users.user_id IS NULL ] in /data/amo/www/addons.mozilla.org-preview/bin/database.class.php on line 123

PHP Notice: MySQL Error 0: Query was: [ DELETE users FROM users LEFT JOIN addons_users on users.id = addons_users.user_id WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK) AND confirmationcode != '' AND addons_users.user_id IS NULL ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123

PHP Notice: MySQL Error 0: Query was: [ UPDATE `addons` SET `average_daily_users`='1660.2775' WHERE `id`='4' ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
Reporter
Comment 11•15 years ago
For the record, we can handle the foreign key problems - that's one of the reasons we have the logging setup in the first place. It's the Error 0 problems I'm concerned about. Thanks.
Assignee
Comment 12•15 years ago
Error 0 is pointing to connection issues. There's a large spike in network traffic on the master when some of the error 0 queries failed. But other than that, there's not an obvious load problem. Queries do appear to work fine (fast, no errors) if they can get a connection.
Comment 13•15 years ago
The error 0 ones - did those actually fail? How would I check?
Reporter
Comment 14•15 years ago
(In reply to comment #13)
> the error 0 ones - did those actually fail? How would I check?

I'm pretty sure they are failing. Taking the example from comment #10, I think this should return zero:

SELECT count(id)
FROM users
LEFT JOIN addons_users on users.id = addons_users.user_id
WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK)
AND confirmationcode != ''
AND addons_users.user_id IS NULL

but it returns almost 30000. Change it to INTERVAL 3 WEEK if you want to be extra sure and it's still over 25k.
Comment 15•15 years ago
What is the next step here? justdave/chizu?
Comment 16•15 years ago
Doesn't look like munin is monitoring connection counts at all. That'd be a good thing to have a record of, so we can see if there's any correlation between these errors and the DB server hitting MaxClients or something. Although it seems like running out of connections would generate a real error code and not 0.
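A minimal sketch of the kind of check munin could graph, assuming the plugin is fed the rows from SHOW GLOBAL STATUS and SHOW VARIABLES; the helper name and input shape are illustrative, not AMO's actual tooling:

```python
# Hypothetical helper for a munin-style plugin. The real plugin would run
# SHOW GLOBAL STATUS and SHOW VARIABLES against the live server; here the
# rows are passed in as (name, value) pairs so the logic is self-contained.

def connection_usage(status_rows, variable_rows):
    """Return (threads_connected, max_connections, percent_of_limit)."""
    status = dict(status_rows)
    variables = dict(variable_rows)
    connected = int(status["Threads_connected"])
    limit = int(variables["max_connections"])
    return connected, limit, 100.0 * connected / limit
```

Graphing a number like this over time would show whether the Error 0 bursts line up with the server approaching its connection limit.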
Comment 17•15 years ago
Comments from Mark Smith:

Couple thoughts. The error 0 is a connection issue, but it sounds like you've got that already (read the bug).

One general-purpose comment before diving in: the connection itself can change the timeout. I.e., if you set your connection timeout on the server to 820 seconds, but the MySQL client connection (from PHP) uses "SET wait_timeout = 5" or something low, then the timeout will be 5 seconds. This means you SHOULD set the connection timeout to be LOW on the server, and then tell the clients [cron jobs, AMO, etc.] to set a HIGH wait_timeout if they need it. This makes sure that things like web servers that don't need long-running connections don't sit and eat up the connection limit, and still lets cron jobs do useful things without timing out.

Now, as to this particular error... I'd put my money on this error being caused by connections sitting open while log processing is being done. I'm pretty sure that the AMO log processing cron job works something like this (from memory):

1. Connect to the database
2. Look at directory of log files
3. Process one log file
4. Save results to database
5. Repeat from #2 until done
6. Disconnect from the database

The problem is if step #3 takes more than your current timeout value, the connection goes away. So, most of the time, if processing a log file is pretty fast, it's no issue. Then on really big files, it will error out. There are ways of dealing with this. You can move step 1 to be part of step 4 (so it connects to the database for every log file), which is probably what I'd do.

They can verify this behavior by having their stats system print out logs of every query it executes, which database it executes it on, and what time it executes at. I.e., something like:

[2009-09-30 20:37:02 slave] SELECT ...
[2009-09-30 20:37:03 master] UPDATE ...

Put that in the database layer that actually issues the query. (I recall AMO has a nice DB abstraction to do the memcache storage, which should make this fairly trivial to implement.)

The other thing that could be happening is if they do (instead of log processing) lots of SELECTs on the *slave*, but don't actually update the master very much. Basically, it comes down to the fact that if they ever take more than the timeout value of time between using a connection, it gets disconnected. And that's pretty easy to do.
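The restructuring suggested above (connecting per log file instead of once up front) can be sketched like this; the function and callback names are illustrative, not AMO's actual cron code:

```python
# Sketch of the suggested fix: hold no database connection while a
# (possibly slow) log file is being processed, and open a short-lived
# connection only to save each file's results. A connection that sits
# idle past wait_timeout would otherwise be dropped by the server.

def process_log_files(log_files, connect, process_file, save_results):
    for path in log_files:
        results = process_file(path)  # may exceed wait_timeout; no conn held
        conn = connect()              # fresh connection per file
        try:
            save_results(conn, results)
        finally:
            conn.close()
```

The trade-off is one extra connect per file, which is cheap next to a cron job erroring out halfway through.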
Reporter
Comment 18•15 years ago
> This means, you SHOULD set the connection timeout to be LOW on the
> server, and then tell the clients [cron jobs, AMO, etc] to set a HIGH
> wait_timeout if they need it. This makes sure that things like web
> servers that don't need long running connections don't sit and eat up
> the connection limit, and still lets cron jobs do useful things
> without timing out.

We're not changing this on the client. If we think it will help, I'm happy to do it though.

> I'd put my money on this error being caused by connections sitting
> open while log processing is being done. I'm pretty sure that the AMO
> log processing cron job works something like this (from memory):

AMO doesn't do log processing like that any more, but it does do some long-running jobs. Several of the form:

1) SELECT a gazillion rows, iterate over each one doing an INSERT
2) Do a long-running UPDATE that affects a gazillion rows

It looks like we're having failures of both types to me.

> They can verify this behavior by having their stats system print out
> logs of every query it executes, which database it executes it on, and
> what time it executes at.

We have some amount of debugging that we can flip on, which I'll do now.

> The other thing that could be happening is if they do (instead of log
> processing) lots of SELECTs on the *slave*, but don't actually update
> the master very much. Basically, it comes down to the fact that if
> they ever take more than the timeout value of time between using a
> connection, it gets disconnected. And that's pretty easy to do.

This is an interesting idea, but not something that I think would happen. I'll turn on debugging though and hopefully that can help us figure out timings. For the record, all SELECTs go to the slaves and any updating goes to the master.
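One way to shrink the "gazillion rows, one UPDATE each" pattern is to collapse each batch into a single statement. This is a sketch under the assumption that the per-row updates key on id, as in the average_daily_users examples in this bug; the function name is hypothetical:

```python
# Build one UPDATE per batch using a CASE expression, instead of one
# round-trip to the master per row. Values are assumed to be pre-escaped
# SQL literals here for brevity; real code would use parameterized queries.

def batched_update_sql(table, column, pairs, batch_size=1000):
    statements = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        cases = " ".join("WHEN %d THEN %s" % (pk, val) for pk, val in batch)
        ids = ", ".join(str(pk) for pk, _ in batch)
        statements.append(
            "UPDATE %s SET %s = CASE id %s END WHERE id IN (%s)"
            % (table, column, cases, ids))
    return statements

stmts = batched_update_sql("addons", "average_daily_users",
                           [(4, "'1680.2473'"), (5, "'439.8699'")])
```

Fewer, larger statements hold locks longer per statement but cut total round-trips and connection idle time dramatically, which matters when the failure mode is the connection going away between queries.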
Reporter
Comment 19•15 years ago
Alright, we had debugging turned on for a day and a half. There are over 1300 cron jobs per day that run, very few actually fail. Performance of each cron job can vary wildly. This cron job runs every 10 minutes and affects 22k rows:
UPDATE collections SET rating= IFNULL(CAST(upvotes - downvotes AS SIGNED) * LN(upvotes + downvotes),0)
Sometimes it takes about 2 minutes and other times it takes well over an hour to complete. There is no locking in our maintenance script so chances are jobs like that are tripping over themselves.
Regarding this bug specifically, and Error 0, it seems to happen most often in the morning (5-8am) but that's not a rule. Also, it doesn't appear to be related to timing out specifically as a lot of these jobs are short. Some examples from last night:
--------------------------------------------------------------------------------
08:15:03 Updating average ratings...
PHP Notice: MySQL Error 0:
Query was: [
UPDATE addons AS a
INNER JOIN (
SELECT
versions.addon_id as addon_id,
AVG(rating) as avg_rating
FROM reviews
INNER JOIN versions ON reviews.version_id = versions.id
WHERE reviews.reply_to IS NULL
AND reviews.rating > 0
GROUP BY versions.addon_id
) AS c ON (a.id = c.addon_id)
SET a.averagerating = ROUND(c.avg_rating, 2)
] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
08:16:01 Updating bayesian ratings...
08:16:19 Affected rows: 1
08:16:19 Time: 75.707126140594
08:16:19 Exiting ...
--------------------------------------------------------------------------------
This died in under 75 seconds. There was no read query here, it's just a single query to the master. This job also died at 7:15 and 7:35, but it ran fine (couple dozen times) the rest of the night.
--------------------------------------------------------------------------------
04:35:04 Retrieving ADU counts from `update_counts` ...
PHP Notice: MySQL Error 0:
Query was: [
UPDATE `addons` SET `average_daily_users`='1643.8061' WHERE `id`='4'
] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
PHP Notice: MySQL Error 0:
Query was: [
UPDATE `addons` SET `average_daily_users`='439.8699' WHERE `id`='5'
] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
PHP Notice: MySQL Error 0:
Query was: [
UPDATE `addons` SET `average_daily_users`='5198.8214' WHERE `id`='6'
] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
PHP Notice: MySQL Error 0:
Query was: [
UPDATE `addons` SET `average_daily_users`='234207.2500' WHERE `id`='7'
] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
<...many more pages...>
PHP Notice: MySQL Error 0:
Query was: [
UPDATE `addons` SET `average_daily_users`='1.0000' WHERE `id`='14634'
] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
PHP Notice: MySQL Error 0:
Query was: [UPDATE `config` SET `value`='0' WHERE `key`='stats_updating'] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
05:18:02 Affected rows: 10799
05:18:02 Time: 2580.116437912
05:18:02 Exiting ...
--------------------------------------------------------------------------------
This one points to the jobs tripping over themselves. This job runs every 30 minutes and every time it took under 30 minutes (sometimes as low as 4 minutes) it worked fine. Every time it went over, we got pages of these errors. This is a job that retrieves 10k rows, loops over them, and does an UPDATE on each.
--------------------------------------------------------------------------------
05:05:02 Starting review total updates...
PHP Notice: MySQL Error 0:
Query was: [
UPDATE addons AS a
INNER JOIN (
SELECT
versions.addon_id as addon_id,
COUNT(*) as count
FROM reviews
INNER JOIN versions ON reviews.version_id = versions.id
WHERE reviews.reply_to IS NULL
AND reviews.rating > 0
GROUP BY versions.addon_id
) AS c ON (a.id = c.addon_id)
SET a.totalreviews = c.count
] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
05:05:54 Affected rows: 1
05:05:54 Time: 52.831606864929
05:05:54 Exiting ...
--------------------------------------------------------------------------------
This job runs every 20 minutes, it's a single UPDATE statement written to the master. The rest of the night it took 20-40 seconds to run. At 5:05am it failed in 52 seconds with Error 0. Definitely not tripping over itself - perhaps tripping over another job?
--------------------------------------------------------------------------------
07:10:04 Retrieving all add-on ids...
07:10:04 Retrieving seven-day counts from `download_counts` ...
07:10:05 Updating seven day counts in `main` ...
PHP Notice: MySQL Error 0:
Query was: [
UPDATE `addons` SET `weeklydownloads`='86' WHERE `id`='151'
] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
07:19:24 Affected rows: 10034
07:19:24 Time: 560.40586519241
07:19:24 Exiting ...
--------------------------------------------------------------------------------
Again, this ran every 20 minutes all night until 7ish and then failed in 2 consecutive jobs. It took 560 seconds here but other times that night it took longer and ran fine.
--------------------------------------------------------------------------------
06:13:01 Starting global stats update...
06:13:01 Updating addon_total_downloads...
06:15:57 Updating addon_downloads_new...
06:15:57 Updating addon_count_public...
06:15:57 Updating addon_count_pending...
06:15:57 Updating addon_count_experimental...
06:15:57 Updating addon_count_nominated...
06:15:57 Updating addon_count_new...
06:15:58 Updating version_count_new...
06:15:58 Updating user_count_total...
PHP Notice: MySQL Error 0:
Query was: [REPLACE INTO global_stats (name, count, date) VALUES ('user_count_total', (SELECT COUNT(*) FROM users), '2009-10-02')] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
06:16:23 Updating user_count_new...
06:16:33 Updating review_count_total...
06:16:34 Updating review_count_new...
06:16:34 Updating collection_count_total...
06:16:34 Updating collection_count_new...
06:16:34 Updating collection_count_private...
06:16:34 Updating collection_count_public...
06:16:34 Updating collection_count_autopublishers...
06:16:34 Updating collection_count_editorspicks...
06:16:34 Updating collection_count_normal...
06:16:34 Updating collection_addon_downloads...
06:16:43 Updating collector_total_downloads...
06:16:44 Updating addon_total_updatepings...
06:16:44 Updating collector_updatepings...
06:16:44 Affected rows: 23
06:16:44 Time: 222.846118927
06:16:44 Exiting ...
--------------------------------------------------------------------------------
This job does a REPLACE with a nested SELECT for each of the values up there ^^. You can see this failed within a few seconds. That particular query was doing a "SELECT count(*) FROM users"; users is one of our biggest tables and may have been in use by another cron job.
--------------------------------------------------------------------------------
00:42:01 Removing user accounts that haven't been confirmed for two weeks...
PHP Notice: MySQL Error 0:
Query was: [
DELETE users
FROM users
LEFT JOIN addons_users on users.id = addons_users.user_id
WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK)
AND confirmationcode != ''
AND addons_users.user_id IS NULL
] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
00:43:12 Affected rows: 1
00:43:12 Time: 70.957583904266
00:43:12 Exiting ...
--------------------------------------------------------------------------------
This job only runs once per day but seems to fail every time. It only takes 70 seconds to fail.
So, it seems like we have some jobs tripping over their older selves, and perhaps we have jobs tripping over different jobs that are using the same tables (not sure why that would be causing failures). I think a good first step right now would be to add a locking flag to our maintenance script to keep jobs from tripping over themselves. Hopefully that will reduce the crazy differences in query execution time. What do you think?
> Doesn't look like munin is monitoring connection counts at all. That'd be a
> good thing to have record of, so then we can see if there's any correlation
> between these errors and the DB server hitting MaxClients or something.
> Although it seems like running out of connections would generate a real error
> code and not 0.
What did you find out here - are we hitting the max_connections?
Comment 20•15 years ago
> What did you find out here - are we hitting the max_connections?
oremj, can that be trended in munin?
Reporter
Comment 21•15 years ago
15 cron jobs failed in about the past hour. Any updates here?
Comment 22•15 years ago
(In reply to comment #21)
> 15 cron jobs failed in about the past hour. Any updates here?

I wonder if it's correlated with the new sphinx indexer?
Reporter
Comment 23•15 years ago
Any updates? Haven't heard anything for 10 days.
Comment 24•15 years ago
We looked at this when the Percona guy was here this morning, and it does turn out that the cron jobs are backing up all over themselves. There were 43 copies of share_count_totals and 39 copies of collections_ratings running concurrently.
Comment 25•15 years ago
A large number should have failed this morning, because we manually killed them off.
Assignee
Comment 26•15 years ago
The sphinx indexer was hitting the database hard for about an hour when it was deployed. Since then, it hasn't contributed much to load. I/O wait spikes during or around cron jobs, and often leaves MySQL threads piling up. To fix this, we'll need to reduce queries to the master or increase capacity somehow (faster hardware, sharding, clustering). Memory might help a little; stale pages do get swapped out:

             total       used       free     shared    buffers     cached
Mem:          3953       3917         35          0        138        723
-/+ buffers/cache:       3055        897
Swap:         1983         50       1933
Comment 27•15 years ago
I've got bug 522112 filed to request additional RAM for the AMO DB servers, based on the recommendation from Percona this morning.
Comment 28•15 years ago
bug 522073 is tracking locks on cron jobs.
Comment 29•15 years ago
(In reply to comment #28)
> bug 522073 is tracking locks on cron jobs.

Should we keep this bug open until that one is resolved? Or is this bug mostly resolved?
Reporter
Comment 30•15 years ago
We are still getting many error 0s. We don't actually know if cron locks will fix this, so I don't think we should close this. We've moved a bunch of the cron jobs to python, and they are using (cron) locks, and they are failing just as often with (mysql) lock issues. There are a couple of other open IT bugs about this, including adjusting the timing in hopes of having fewer conflicts (bug 526553) and potentially being out of disk space (bug 526414).
Updated•15 years ago
Whiteboard: [blocked on locks]
Comment 31•15 years ago
Update: we haven't seen any cron errors since the changes in bug 529413 were applied.
Reporter
Comment 32•15 years ago
Since we don't know what causes the error 0s, it's tough to call this fixed. On the other hand, there is no LOOKSLIKEITSWORKINGNOW, so I guess FIXED is as good as it gets. Either way, I'm happy to not have cron email every 10 minutes. :)
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 33•15 years ago
(In reply to comment #32)
> Since we don't know what causes the error 0s it's tough to call this fixed.
> On the other hand, there is no LOOKSLIKEITSWORKINGNOW, so I guess FIXED is
> as good as it gets.

LOOKSLIKEITSWORKINGNOW is called WORKSFORME. :)
Resolution: FIXED → WORKSFORME
Updated•11 years ago
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard