Bugzilla

Reporter

Comment 1

•

15 years ago

We're averaging 9 emails per day.  A few have 1 error in them and 5 (the ADU jobs) have hundreds per mail.

Reporter

Comment 2

•

15 years ago

No response for 11 days.  Bumping to major.

Severity: normal → major

Comment 3

•

15 years ago

Wil, could you forward a copy of the errors or attach a copy?

Reporter

Comment 4

•

15 years ago

They are all available at https://mail.mozilla.org/private/amo-developers/

Examples:

PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='1680.2473'  WHERE `id`='4'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123


PHP Notice:  MySQL Error 0: 
Query was: [
            DELETE users
            FROM users
            LEFT JOIN addons_users on users.id = addons_users.user_id
            WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK)
            AND confirmationcode != ''
            AND addons_users.user_id IS NULL
        ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123

Comment 5

•

15 years ago

Which database do the cron jobs actually hit?  Do the hit a slave or the master?

Comment 6

•

15 years ago

The amo master has a wait timeout of 820 seconds.  staging master (where preview points) is set for 120 seconds.

Off to check the slaves next.

Comment 7

•

15 years ago

Hmm, the slaves are all set at 820 across the board, too, so I guess that wouldn't matter which one you hit if it was really a timeout.

On the other hand, the slave connections go through the load balancer...  We did have an issue with the Netscaler a year or two ago where it would randomly drop connections in mysql pools...  thought that was long gone though.

Reporter

Comment 8

•

15 years ago

We're only getting timeouts on the master. (All reads go to the slaves, writes to the master)

We're doing some really long winded queries and were doing some that are a gazillion 1 row updates in a loop.  We can make them more efficient if we have to, but I'd like to figure out what the problem is.

Reporter

Comment 9

•

15 years ago

Any updates on this?  It's been a few weeks and we're getting a lot of broken queries every day

Comment 10

•

15 years ago

I think we're all focusing on comment #4 but Wil said that was just an example.  Looking at the archives,

PHP Notice:  MySQL Error 1451: Cannot delete or update a parent row: a foreign key constraint fails (`addons_reskin/api_auth_tokens`, CONSTRAINT `api_auth_tokens_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`))
Query was: [
            DELETE users
            FROM users
            LEFT JOIN addons_users on users.id = addons_users.user_id
            WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK)
            AND confirmationcode != ''
            AND addons_users.user_id IS NULL
        ] in /data/amo/www/addons.mozilla.org-preview/bin/database.class.php on line 123




PHP Notice:  MySQL Error 0: 
Query was: [
            DELETE users
            FROM users
            LEFT JOIN addons_users on users.id = addons_users.user_id
            WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK)
            AND confirmationcode != ''
            AND addons_users.user_id IS NULL
        ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123

PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='1660.2775'  WHERE `id`='4'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123

Reporter

Comment 11

•

15 years ago

For the record, we can handle the foreign key problems - that's one of the reasons we have the logging setup in the first place.  It's the Error 0 problems I'm concerned about.  Thanks.

Assignee

Updated

•

15 years ago

Assignee: justdave → thardcastle

Assignee

Comment 12

•

15 years ago

Error 0 is pointing to connection issues. There's a large spike in network traffic on the master when some of the error 0 queries failed. But other than that, there's not an obvious load problem. Queries do appear to work fine (fast, no errors) if they can get a connection.

Comment 13

•

15 years ago

the error 0 ones - did those actually fail? How would I check?

Reporter

Comment 14

•

15 years ago

(In reply to comment #13)
> the error 0 ones - did those actually fail? How would I check?

I'm pretty sure they are failing.  Taking the example from comment #10, I think this should be returning zero:

SELECT count(id) FROM users
LEFT JOIN addons_users on users.id = addons_users.user_id
WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK)
AND confirmationcode != ''
AND addons_users.user_id IS NULL

but it returns almost 30000.  Change it to INTERVAL 3 WEEK if you want to be extra sure and it's still over 25k.

Comment 15

•

15 years ago

What is the next step here?  justdave/chizu?

Comment 16

•

15 years ago

Doesn't look like munin is monitoring connection counts at all.  That'd be a good thing to have record of, so then we can see if there's any correlation between these errors and the DB server hitting MaxClients or something.  Although it seems like running out of connections would generate a real error code and not 0.

Comment 17

•

15 years ago

Comments from Mark Smith - 

Couple thoughts.

The error 0 is a connection issue, but it sounds like you've got that
already (read the bug).  One general purpose comment before diving in:

The connection itself can change the timeout.  I.e., if you set your
connection timeout on the server to 820 seconds, but the MySQL client
connection (from PHP) uses "SET wait_timeout = 5" or something low,
then the timeout will be 5 seconds.

This means, you SHOULD set the connection timeout to be LOW on the
server, and then tell the clients [cron jobs, AMO, etc] to set a HIGH
wait_timeout if they need it.  This makes sure that things like web
servers that don't need long running connections don't sit and eat up
the connection limit, and still lets cron jobs do useful things
without timing out.

Now, as to this particular error...

I'd put my money on this error being caused by connections sitting
open while log processing is being done.  I'm pretty sure that the AMO
log processing cron job works something like this (from memory):

1. Connect to the database
2. Look at directory of log files
3. Process one log file
4. Save results to database
5. Repeat from #2 until done
6. Disconnect from the database

The problem is if step #3 takes more than your current timeout value,
the connection goes away.  So, most of the time, if processing a log
file is pretty fast, it's no issue.  Then on really big files, it will
error out.

There are ways of dealing with this.  You can move step 1 to be part
of step 4 (so it connects to the database for every log file), which
is probably what I'd do.

They can verify this behavior by having their stats system print out
logs of every query it executes, which database it executes it on, and
what time it executes at.  I.e., something like,

[2009-09-30 20:37:02 slave] SELECT ...
[2009-09-30 20:37:03 master] UPDATE ...

Put that in the database layer that actually issues the query.  (I
recall AMO has a nice DB abstraction to do the memcache storage, which
should make this fairly trivial to implement.)

The other thing that could be happening is if they do (instead of log
processing) lots of SELECTs on the *slave*, but don't actually update
the master very much.  Basically, it comes down to the fact that if
they ever take more than the timeout value of time between using a
connection, it gets disconnected.  And that's pretty easy to do.

Reporter

Comment 18

•

15 years ago

> This means, you SHOULD set the connection timeout to be LOW on the
> server, and then tell the clients [cron jobs, AMO, etc] to set a HIGH
> wait_timeout if they need it.  This makes sure that things like web
> servers that don't need long running connections don't sit and eat up
> the connection limit, and still lets cron jobs do useful things
> without timing out.

We're not changing this on the client.  If we think it will help, I'm happy to do it though.

> I'd put my money on this error being caused by connections sitting
> open while log processing is being done.  I'm pretty sure that the AMO
> log processing cron job works something like this (from memory):

AMO doesn't do log processing like that any more, but it does do some long running jobs.  Several of the form:

1) SELECT a gazillion rows, iterate over each one doing an INSERT
2) Do a long running UPDATE that affects a gazillion rows

It looks like we're having failures of both types to me.

> They can verify this behavior by having their stats system print out
> logs of every query it executes, which database it executes it on, and
> what time it executes at.

We have some amount of debugging that we can flip on, which I'll do now.

> The other thing that could be happening is if they do (instead of log
> processing) lots of SELECTs on the *slave*, but don't actually update
> the master very much.  Basically, it comes down to the fact that if
> they ever take more than the timeout value of time between using a
> connection, it gets disconnected.  And that's pretty easy to do.

This is an interesting idea, but not something that I think would happen.  I'll turn on debugging though and hopefully that can help us figure out timings.  For the record, all SELECTs go to the slaves and any updating goes to the master.

Reporter

Updated

•

15 years ago

Depends on: 519909

Reporter

Comment 19

•

15 years ago

Alright, we had debugging turned on for a day and a half.  There are over 1300 cron jobs per day that run, very few actually fail.  Performance of each cron job can vary wildly.  This cron job runs every 10 minutes and affects 22k rows:

UPDATE collections SET rating= IFNULL(CAST(upvotes - downvotes AS SIGNED) * LN(upvotes + downvotes),0)

Sometimes it takes about 2 minutes and other times it takes well over an hour to complete.  There is no locking in our maintenance script so chances are jobs like that are tripping over themselves.

Regarding this bug specifically, and Error 0, it seems to happen most often in the morning (5-8am) but that's not a rule.  Also, it doesn't appear to be related to timing out specifically as a lot of these jobs are short.  Some examples from last night:

--------------------------------------------------------------------------------
08:15:03 Updating average ratings...
PHP Notice:  MySQL Error 0: 
Query was: [
            UPDATE addons AS a
            INNER JOIN (
                SELECT
                    versions.addon_id as addon_id,
                    AVG(rating) as avg_rating
                FROM reviews
                INNER JOIN versions ON reviews.version_id = versions.id
                WHERE reviews.reply_to IS NULL
                    AND reviews.rating > 0
                GROUP BY versions.addon_id
            ) AS c ON (a.id = c.addon_id)
            SET a.averagerating = ROUND(c.avg_rating, 2)
        ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
08:16:01 Updating bayesian ratings...
08:16:19 Affected rows: 1
08:16:19 Time: 75.707126140594
08:16:19 Exiting ...
--------------------------------------------------------------------------------
This died in under 75 seconds.  There was no read query here, it's just a single query to the master.


--------------------------------------------------------------------------------
08:15:03 Updating average ratings...
PHP Notice:  MySQL Error 0: 
Query was: [
            UPDATE addons AS a
            INNER JOIN (
                SELECT
                    versions.addon_id as addon_id,
                    AVG(rating) as avg_rating
                FROM reviews
                INNER JOIN versions ON reviews.version_id = versions.id
                WHERE reviews.reply_to IS NULL
                    AND reviews.rating > 0
                GROUP BY versions.addon_id
            ) AS c ON (a.id = c.addon_id)
            SET a.averagerating = ROUND(c.avg_rating, 2)
        ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
08:16:01 Updating bayesian ratings...
08:16:19 Affected rows: 1
08:16:19 Time: 75.707126140594
08:16:19 Exiting ...
--------------------------------------------------------------------------------
This died in under 75 seconds.  There was no read query here, it's just a single query to the master.  This job also died at 7:15 and 7:35, but it ran fine (couple dozen times) the rest of the night.


--------------------------------------------------------------------------------
04:35:04 Retrieving ADU counts from `update_counts` ...
PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='1643.8061'  WHERE `id`='4'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='439.8699'  WHERE `id`='5'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='5198.8214'  WHERE `id`='6'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='234207.2500'  WHERE `id`='7'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123

<...many more pages...>

PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `average_daily_users`='1.0000'  WHERE `id`='14634'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
PHP Notice:  MySQL Error 0: 
Query was: [UPDATE `config` SET `value`='0' WHERE `key`='stats_updating'] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
05:18:02 Affected rows: 10799
05:18:02 Time: 2580.116437912
05:18:02 Exiting ...
--------------------------------------------------------------------------------
This one points to the jobs tripping over themselves.  This job runs every 30 minutes and every time it took under 30 minutes (sometimes as low as 4 minutes) it worked fine.  Every time it went over, we got pages of these errors.  This is a job that retrieves 10k rows, loops over them, and does an UPDATE on each.


--------------------------------------------------------------------------------
05:05:02 Starting review total updates...
PHP Notice:  MySQL Error 0: 
Query was: [
            UPDATE addons AS a
            INNER JOIN (
                SELECT
                    versions.addon_id as addon_id,
                    COUNT(*) as count
                FROM reviews
                INNER JOIN versions ON reviews.version_id = versions.id
                WHERE reviews.reply_to IS NULL
                    AND reviews.rating > 0
                GROUP BY versions.addon_id
            ) AS c ON (a.id = c.addon_id)
            SET a.totalreviews = c.count
        ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
05:05:54 Affected rows: 1
05:05:54 Time: 52.831606864929
05:05:54 Exiting ...
--------------------------------------------------------------------------------
This job runs every 20 minutes, it's a single UPDATE statement written to the master.  The rest of the night it took 20-40 seconds to run.  At 5:05am it failed in 52 seconds with Error 0.  Definitely not tripping over itself - perhaps tripping over another job?


--------------------------------------------------------------------------------
07:10:04 Retrieving all add-on ids...
07:10:04 Retrieving seven-day counts from `download_counts` ...
07:10:05 Updating seven day counts in `main` ...
PHP Notice:  MySQL Error 0: 
Query was: [
                    UPDATE `addons` SET `weeklydownloads`='86' WHERE `id`='151'
                ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
07:19:24 Affected rows: 10034
07:19:24 Time: 560.40586519241
07:19:24 Exiting ...
--------------------------------------------------------------------------------
Again, this ran every 20 minutes all night until 7ish and then failed in 2 consecutive jobs.  It took 560 seconds here but other times that night it took longer and ran fine.


--------------------------------------------------------------------------------
06:13:01 Starting global stats update...
06:13:01 Updating addon_total_downloads...
06:15:57 Updating addon_downloads_new...
06:15:57 Updating addon_count_public...
06:15:57 Updating addon_count_pending...
06:15:57 Updating addon_count_experimental...
06:15:57 Updating addon_count_nominated...
06:15:57 Updating addon_count_new...
06:15:58 Updating version_count_new...
06:15:58 Updating user_count_total...
PHP Notice:  MySQL Error 0: 
Query was: [REPLACE INTO global_stats (name, count, date) VALUES ('user_count_total', (SELECT COUNT(*) FROM users), '2009-10-02')] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
06:16:23 Updating user_count_new...
06:16:33 Updating review_count_total...
06:16:34 Updating review_count_new...
06:16:34 Updating collection_count_total...
06:16:34 Updating collection_count_new...
06:16:34 Updating collection_count_private...
06:16:34 Updating collection_count_public...
06:16:34 Updating collection_count_autopublishers...
06:16:34 Updating collection_count_editorspicks...
06:16:34 Updating collection_count_normal...
06:16:34 Updating collection_addon_downloads...
06:16:43 Updating collector_total_downloads...
06:16:44 Updating addon_total_updatepings...
06:16:44 Updating collector_updatepings...
06:16:44 Affected rows: 23
06:16:44 Time: 222.846118927
06:16:44 Exiting ...
--------------------------------------------------------------------------------
This job does a REPLACE with a nested SELECT for each of the values up there ^^.  You can see this failed within a few seconds.  That particular query was doing a "SELECT count(*) from users" which is one of our biggest table and may have been in use by another cron job.


--------------------------------------------------------------------------------
00:42:01 Removing user accounts that haven't been confirmed for two weeks...
PHP Notice:  MySQL Error 0: 
Query was: [
            DELETE users
            FROM users
            LEFT JOIN addons_users on users.id = addons_users.user_id
            WHERE created < DATE_SUB(CURDATE(), INTERVAL 2 WEEK)
            AND confirmationcode != ''
            AND addons_users.user_id IS NULL
        ] in /data/amo/www/addons.mozilla.org-remora/bin/database.class.php on line 123
00:43:12 Affected rows: 1
00:43:12 Time: 70.957583904266
00:43:12 Exiting ...
--------------------------------------------------------------------------------
This job only runs once per day but seems to fail every time.  It only takes 70 seconds to fail.


So, it seems like we have some jobs tripping over their older selves, and perhaps we have jobs tripping over different jobs that are using the same tables (not sure why that would be causing failures).  I think a good first step right now would be to add a locking flag to our maintenance script to keep jobs from tripping over themselves.  Hopefully that will reduce the crazy differences in query execution time.  What do you think?

> Doesn't look like munin is monitoring connection counts at all.  That'd be a
> good thing to have record of, so then we can see if there's any correlation
> between these errors and the DB server hitting MaxClients or something. 
> Although it seems like running out of connections would generate a real error
> code and not 0.

What did you find out here - are we hitting the max_connections?

Comment 20

•

15 years ago

> What did you find out here - are we hitting the max_connections?

oremj, can that be trended in munin?

Reporter

Comment 21

•

15 years ago

15 cron jobs failed in about the past hour.  Any updates here?

Jeff Balogh (:jbalogh)

Comment 22

•

15 years ago

(In reply to comment #21)
> 15 cron jobs failed in about the past hour.  Any updates here?

I wonder if it's correlated with the new sphinx indexer?

Reporter

Comment 23

•

15 years ago

Any updates?  Haven't heard anything for 10 days

Comment 24

•

15 years ago

We looked at this when the Percona guy was here this morning, and it does turn out that the cron jobs are backing up all over themselves.  There were 43 copies of share_count_totals running and 39 copies of collections_ratings, all running concurrently.

Comment 25

•

15 years ago

There should have been a large number fail this morning because we manually killed them off.

Assignee

Comment 26

•

15 years ago

The sphinx indexer was hitting the database hard for about an hour when it was deployed. Since then, it hasn't contributed much to load.

i/o wait spikes during or around cron jobs, and often leaves MySQL threads piling up. To fix this, we'll need to reduce queries to the master or increase capacity somehow (faster hardware, sharding, clustering). Memory might help a little, stale pages do get swapped out:

             total       used       free     shared    buffers     cached
Mem:          3953       3917         35          0        138        723
-/+ buffers/cache:       3055        897
Swap:         1983         50       1933

Comment 27

•

15 years ago

I've got bug 522112 filed to request additional RAM for the AMO DB servers, based on the recommendation from Percona this morning.

Jeff Balogh (:jbalogh)

Comment 28

•

15 years ago

bug 522073 is tracking locks on cron jobs.

Assignee

Updated

•

15 years ago

Depends on: 522112

Updated

•

15 years ago

Depends on: 522073

Comment 29

•

15 years ago

(In reply to comment #28)
> bug 522073 is tracking locks on cron jobs.

Should we keep this bug open until that one is resolved?  Or is this bug mostly resolved?

Reporter

Comment 30

•

15 years ago

We are still getting many error 0s.  We don't actually know if cron locks will fix this, so I don't think we should close this.

We've moved a bunch of the cron jobs to python and they are using (cron) locks and they are failing just as often with (mysql) lock issues.  There are a couple other open IT bugs about this including adjusting the timing in hopes of having less conflicts (bug 526553) and potentially being out of disk space (bug 526414).

Updated

•

15 years ago

Whiteboard: [blocked on locks]

Jeff Balogh (:jbalogh)

Comment 31

•

15 years ago

Update: we haven't seen any cron errors since the changes in bug 529413 were applied.