Closed Bug 1485089 Opened 7 years ago Closed 7 years ago

Intermittents Commenter experiencing OperationalError: (3024, 'Query execution was interrupted, maximum statement execution time exceeded')

Categories

(Tree Management :: Treeherder, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sclements, Assigned: sclements)

Details

Attachments

(1 file)

This has happened twice in the past week (and once the week before), according to an alert I've set up in Papertrail. The odd thing is that when I run it manually with the Heroku CLI (against the prod database), I often hit this same error, so I'm unable to run the daily commenter even manually for that day. However, if I run it locally with the db mirror I use for development, it runs successfully. So far this has only affected the daily comments, not the weekly comments.

Transaction trace: https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions#id=5b224f746865725472616e73616374696f6e2f43656c6572792f696e7465726d697474656e74732d636f6d6d656e746572222c22225d&tab-detail_556f290e-a510-11e8-b58d-0242ac110006_2104_3630=database_queries&app_trace_id=556f290e-a510-11e8-b58d-0242ac110006_2104_3630

Stack trace: https://rpm.newrelic.com/accounts/677903/applications/14179757/filterable_errors#/show/2c6881e6-a4d7-11e8-b58d-0242ac110006_0_4600/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&filters=%5B%7B%22key%22%3A%22error.class%22%2C%22value%22%3A%22django.db.utils%3AOperationalError%22%2C%22like%22%3Afalse%7D%5D
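For context, MySQL error 3024 is raised when a SELECT runs longer than the server-side max_execution_time limit (MySQL 5.7+). As a minimal sketch (assuming Django's default connection points at the database in question; run from a `./manage.py shell`), the limit can be inspected like this:

    # Minimal sketch, assuming Django's default connection targets the
    # database being investigated.
    from django.db import connection

    with connection.cursor() as cursor:
        # max_execution_time (milliseconds) caps SELECT statements in MySQL 5.7+;
        # a value of 0 means no server-enforced limit.
        cursor.execute("SHOW VARIABLES LIKE 'max_execution_time'")
        name, value = cursor.fetchone()
        print(name, value)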
Priority: -- → P1
Ed, do you know if the db mirror I use for development is up to date?
Flags: needinfo?(emorley)
Running `SHOW SLAVE STATUS` on the replica will show whether replication is still running (ie hasn't errored) and whether there is any lag (a few seconds to a minute is expected). If your username doesn't have permissions to run that command, temporarily log into the replica using the credentials stored under DATABASE_URL on the treeherder-prod Heroku app (but make sure to connect over TLS and to switch back to the other credentials afterwards).
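As a minimal sketch of that check (assuming a Django shell whose default connection is pointed at the replica over TLS, per the note above), the fields worth eyeballing are the two running flags, the lag, and the last error:

    # Minimal sketch, assuming the default Django connection targets the replica.
    from django.db import connection

    with connection.cursor() as cursor:
        cursor.execute("SHOW SLAVE STATUS")
        columns = [col[0] for col in cursor.description]
        status = dict(zip(columns, cursor.fetchone()))

    # Replication is healthy when both threads are running and lag stays small.
    for key in ("Slave_IO_Running", "Slave_SQL_Running",
                "Seconds_Behind_Master", "Last_SQL_Error"):
        print(key, status[key])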
Flags: needinfo?(emorley)
I ran that command on the replica and it looks to be in working order. I got the following (edited):

Slave_IO_State: Waiting for master to send event
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table: mysql.plugin,mysql.rds_monitor,mysql.rds_sysinfo,mysql.rds_replication_status,mysql.rds_history,innodb_memcache.config_options,innodb_memcache.cache_policies
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Skip_Counter: 0
Exec_Master_Log_Pos: 2296140
Relay_Log_Space: 25125710
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
I ran an `optimize table treeherder.bugscache;` on prod's RDS for bug 1484642; it may help here too.
This hasn't happened for weeks but unfortunately occurred again this morning (daily task).
It failed again this Tuesday (daily task) and again I couldn't run the task manually with the Heroku CLI. Since I set up the Papertrail alerts this has happened on:

8/18 - Saturday
8/21 - Tuesday
*skipped 3 weeks after Ed optimized bugscache (coincidence??)
9/11 - Tuesday
9/18 - Tuesday

The task runs on the same dyno as seta-analyze-failures and fetch-push-logs, but those run once a day and every 5 minutes, respectively. Looking in New Relic, I don't see a spike in non-web transaction memory usage at the time the task failed. I did make recent optimizations to the commenter's queries (bug 1491133), so it'll be interesting to see if that helps at all. Cam suggested it could be the RDS snapshot instance (running at GMT-7:00) overlapping with the commenter task (running at 7am UTC), so it might be worth changing the daily task to run an hour earlier.
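For illustration, the kind of Celery beat change being proposed would look roughly like this (task names and settings layout here are hypothetical, not the actual Treeherder configuration):

    # Minimal sketch of shifting a daily Celery beat task one hour earlier so it
    # no longer overlaps the nightly RDS snapshot window.
    from celery.schedules import crontab

    CELERY_BEAT_SCHEDULE = {
        'intermittents-commenter-daily': {
            'task': 'intermittents-commenter',      # hypothetical task name
            'schedule': crontab(minute=0, hour=6),  # was hour=7 (07:00 UTC)
        },
    }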
Attachment #9010459 - Flags: review?(emorley)
Comment on attachment 9010459 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/4046 Thank you :-)
Attachment #9010459 - Flags: review?(emorley) → review+
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/bb4aee78a3d3d704e9de79f15121bad68115a034 Bug 1485089 - Change commenter daily task Change daily task time to run one hour earlier since it appears that the tuesday daily task is running at the same time the RDS snapshot occurs and might be contributing to the celery task timeout exception.
Assignee: nobody → sclements
Status: NEW → ASSIGNED
Keeping this open for a few more weeks to see if the above-mentioned changes fix the problem.
I haven't seen this particular error reoccur in the past month, so closing it out.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Intermittent Failures View → Treeherder