Closed
Bug 1348029
Opened 7 years ago
Closed 7 years ago
Treeherder prod read-only replica suffering from table locks and replica lag
Categories
(Tree Management :: Treeherder: Infrastructure, enhancement, P1)
Tree Management
Treeherder: Infrastructure
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ekyle, Assigned: emorley)
Details
Attachments
(1 file)
206.08 KB, image/jpeg
The following query does not seem to work:

> select * from jobs limit 1

A `show processlist` gives a hint (columns removed for clarity):

> Id     db          Command  Time  State                            Info
> 72124  treeherder  Sleep    113
> 72129  treeherder  Sleep    50
> 72133  treeherder  Query    0     init                             show processlist
> 72135  treeherder  Query    34    Waiting for table metadata lock  SELECT * FROM `treeherder`.`job` LIMIT 1

Are there other clients locking the job table?
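For anyone reproducing this diagnosis: on MySQL versions before 5.7 (which this instance appears to run, per the `max_execution_time` note in comment 3) there is no `performance_schema.metadata_locks` table, but the sessions stuck behind a metadata lock, and any long-running sessions likely to be holding it, can be listed from `information_schema.PROCESSLIST`. A sketch (the 600s threshold is an arbitrary choice):

```sql
-- List sessions waiting on a table metadata lock, plus any long-running
-- sessions that might be holding it (an open transaction or a stalled
-- query on the same table keeps the lock until it finishes).
SELECT id, user, db, command, time, state, LEFT(info, 120) AS query
FROM information_schema.PROCESSLIST
WHERE state = 'Waiting for table metadata lock'
   OR time > 600
ORDER BY time DESC;
```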
Comment 1•7 years ago
Hmm, there was a redash query from :jmaher which had been running for 3009489 seconds (~35 days). I went ahead and killed it. It appears some kind of ALTER TABLE operation is also in progress, probably from a migration :jgraham did earlier this week. My guess is that the former operation blocked replication of the latter. Hopefully things will start behaving better soon now that I've killed the misbehaving query.
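For future incidents, a sketch of how such a runaway query can be found and killed from a MySQL shell (the id shown is the one later seen in the slow query log; on Amazon RDS the master user lacks SUPER, so the RDS-provided procedure is used instead of plain `KILL`):

```sql
-- Find queries that have been running for more than an hour.
SELECT id, user, time, LEFT(info, 120) AS query
FROM information_schema.PROCESSLIST
WHERE command = 'Query' AND time > 3600;

-- Terminate one by its processlist id. On a stock MySQL server:
KILL 20736;
-- On Amazon RDS, where the master user lacks the SUPER privilege:
CALL mysql.rds_kill(20736);
```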
Comment 2•7 years ago
Now that I think about it though, if the schema migration didn't happen correctly on the mirror when it was supposed to, we might be kind of hosed. We might need to reset its state to that of the master database. I'm not sure how to do that, so over to :emorley (when he has time). In the meantime, the read-only database should be usable, though I wouldn't 100% trust it.
Assignee: nobody → emorley
Assignee
Comment 3•7 years ago
Here are the CloudWatch graphs for the last 3 days. The replication lag starts increasing at 10:00 UTC on 14th March (as expected, since that's when the schema change occurred) and is still increasing even now. Its current value is 209000000 ms (58 hours). The AWS replication status reports that replication hasn't errored and is ongoing.

Currently this is the only query running:

> ALTER TABLE `job` ADD COLUMN `autoclassify_status` integer DEFAULT 0 NOT NULL

The `time` for that query is reported as 212609s, which would be worrying, were it not that the write IOPS graphs show the query has actually been sat stuck doing nothing, rather than taking that long to apply. Killing the stuck redash query in comment 1 appears to have worked.

Looking at the slow query log [1], I see jmaher's redash query:

> # Time: 170316 18:16:15
> # User@Host: redash[redash] @ SNIP  Id: 20736
> # Query_time: 3009530.989732  Lock_time: 0.000191  Rows_sent: 0  Rows_examined: 2358
> use treeherder;
> SET timestamp=1489688175;
> /* Username: jmaher@SNIP, Task ID: SNIP, Query ID: 2517, Queue: queries, Query Hash: SNIP */
> select distinct jn.text as fixed_by_revision, sig.build_system_type, jt.name as job_type_name, op.name as platform_option, sig.name as ref_data_name, bp.platform as platform from job_note jn, job j, job_type jt, reference_data_signatures sig, option_collection oc, `option` op, build_platform bp where jn.failure_classification_id=2 and jn.text!='' and jn.created>'2016-10-30' and jn.job_id=j.id and j.job_type_id=jt.id and j.signature_id=sig.id and sig.option_collection_hash=oc.option_collection_hash and oc.option_id=op.id and jt.name like 'test-%' or jt.name like 'desktop-test-%' and j.build_platform_id=bp.id;

The questions now are:
* How fast will replication catch up the 58 hours, and should we just reset the replica and start over? (I don't have enough access to do this; :fubar will need to if so.)
* Why did jmaher's query get stuck?
* Is there anything we can do to prevent this happening again in the future? (I'd guess not, given `max_execution_time` wasn't added until MySQL 5.7.8 [2].)

[1] https://console.aws.amazon.com/rds/home?region=us-east-1#dbinstance:id=treeherder-prod-ro;view=logs
[2] https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_max_execution_time
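For reference, on MySQL 5.7.8+ a server-wide cap on SELECT execution time can be set, and on 5.7.4+ individual queries can carry a timeout hint; either would have bounded the runaway redash query. A sketch (the timeout values are illustrative, not recommendations):

```sql
-- Server-wide (5.7.8+): abort any SELECT running longer than
-- one hour. The value is in milliseconds and applies to SELECTs only.
SET GLOBAL max_execution_time = 3600000;

-- Per-query (5.7.4+): an optimizer hint that overrides the global
-- value, here capping this one query at 60 seconds.
SELECT /*+ MAX_EXECUTION_TIME(60000) */ *
FROM `treeherder`.`job`
LIMIT 1;
```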
Reporter
Comment 4•7 years ago
I am guessing jmaher's query got stuck because
> jt.name like 'test-%' or jt.name like 'desktop-test-%'
was not in parentheses.
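That would explain it: in SQL, AND binds tighter than OR, so the unparenthesised predicate splits the WHERE clause into two branches, and the right-hand branch drops nearly all of the join conditions, forcing a huge cross-join-style scan. A sketch of the corrected clause, using the original query's aliases:

```sql
-- Original parsing (AND binds tighter than OR):
--   (<all join conditions> AND jt.name LIKE 'test-%')
--   OR (jt.name LIKE 'desktop-test-%' AND j.build_platform_id = bp.id)
-- Grouping the LIKEs restores the intended meaning:
WHERE jn.failure_classification_id = 2
  AND jn.text != ''
  AND jn.created > '2016-10-30'
  AND jn.job_id = j.id
  AND j.job_type_id = jt.id
  AND j.signature_id = sig.id
  AND sig.option_collection_hash = oc.option_collection_hash
  AND oc.option_id = op.id
  AND (jt.name LIKE 'test-%' OR jt.name LIKE 'desktop-test-%')
  AND j.build_platform_id = bp.id
```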
Assignee
Comment 5•7 years ago
Youch...
> explain select distinct jn.text as fixed_by_revision, sig.build_system_type, jt.name as job_type_name, op.name as platform_option, sig.name as ref_data_name, bp.platform as platform from job_note jn, job j, job_type jt, reference_data_signatures sig, option_collection oc, `option` op, build_platform bp where jn.failure_classification_id=2 and jn.text!='' and jn.created>'2016-10-30' and jn.job_id=j.id and j.job_type_id=jt.id and j.signature_id=sig.id and sig.option_collection_hash=oc.option_collection_hash and oc.option_id=op.id and jt.name like 'test-%' or jt.name like 'desktop-test-%' and j.build_platform_id=bp.id
******************** 1. row *********************
id: 1
select_type: SIMPLE
table: op
type: index
possible_keys: PRIMARY
key: option_name_78ffea107809a885_uniq
key_len: 302
ref:
rows: 21
Extra: Using index; Using temporary
******************** 2. row *********************
id: 1
select_type: SIMPLE
table: oc
type: index
possible_keys: uni_option_collection,idx_option
key: uni_option_collection
key_len: 126
ref:
rows: 23
Extra: Using index; Using join buffer (Block Nested Loop)
******************** 3. row *********************
id: 1
select_type: SIMPLE
table: bp
type: index
possible_keys: PRIMARY
key: idx_platform
key_len: 302
ref:
rows: 244
Extra: Using index; Using join buffer (Block Nested Loop)
******************** 4. row *********************
id: 1
select_type: SIMPLE
table: jt
type: range
possible_keys: PRIMARY,uni_name_symbol
key: uni_name_symbol
key_len: 302
ref:
rows: 13632
Extra: Using where; Using index; Using join buffer (Block Nested Loop)
******************** 5. row *********************
id: 1
select_type: SIMPLE
table: j
type: ALL
possible_keys: job_9aefd743,job_d1803079
key:
key_len:
ref:
rows: 25134319
Extra: Range checked for each record (index map: 0x30)
******************** 6. row *********************
id: 1
select_type: SIMPLE
table: sig
type: ALL
possible_keys: PRIMARY,idx_option_collection_hash
key:
key_len:
ref:
rows: 246434
Extra: Using where; Using join buffer (Block Nested Loop)
******************** 7. row *********************
id: 1
select_type: SIMPLE
table: jn
type: ALL
possible_keys: D0c96672d1abd821fcd7e1f1723a2fd8,job_note_d697ea38
key:
key_len:
ref:
rows: 277637
Extra: Using where; Using join buffer (Block Nested Loop)
7 rows in set
Reporter
Comment 6•7 years ago
I am also guessing that catching up 58 hours will take less time than a fresh replication. Assuming it is not caught up already.
Reporter
Comment 7•7 years ago
NM, it appears the database is down, presumably to fix this.
Assignee
Comment 8•7 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #6)
> I am also guessing that catching up 58 hours will take less time than a
> fresh replication. Assuming it is not caught up already.

Possibly not, since a fresh replica can use the last AWS checkpoint from the master (taken every 5 minutes), whereas replication has to catch up using a single thread (and the replica is an m4.xlarge vs prod's m4.2xlarge).

(In reply to Kyle Lahnakoski [:ekyle] from comment #7)
> NM, it appears the database is down, presumably to fix this.

I rebooted it (one of the few things I have access to do), since it appeared stuck even after the ALTER TABLE had finished.
Reporter
Comment 9•7 years ago
The DB is still locked up at
> SELECT * FROM `treeherder`.`job_detail` LIMIT 1
Assignee
Comment 10•7 years ago
Replica lag is now 48 hours (from 58). Not sure why the query in comment 9 is still unresponsive (I can repro). Kendall, I don't suppose you could just nuke treeherder-prod-ro and start over?
Flags: needinfo?(klibby)
Priority: -- → P1
Summary: TH RO DB Locked up? → Treeherder prod read-only replica suffering from table locks and replica lag
Comment 12•7 years ago
New treeherder-prod-ro has been created!
Assignee
Comment 13•7 years ago
Many thanks for deleting/recreating. However the replication lag is still increasing, and CPU usage is at <= 1%?

From the AWS event logs (times in UTC), the timeline was:

Mar 17 1:21 PM: (treeherder-prod-ro) DB instance deleted
Mar 17 1:25 PM: (treeherder-prod) Backing up DB instance
Mar 17 1:58 PM: (treeherder-prod) Finished DB Instance backup
Mar 17 2:06 PM: (treeherder-prod-ro) Restored from snapshot
Mar 17 2:06 PM: (treeherder-prod-ro) DB instance shutdown
Mar 17 2:06 PM: (treeherder-prod-ro) DB instance restarted
Mar 17 2:07 PM: (treeherder-prod-ro) Replication for the Read Replica resumed (which is what I would have expected)

Replication lag trend:

14:08: 2530s
14:15: 2703s
14:16: 2698s (so it did start going down)
14:17: 2721s (then a very steep increase)
15:14: 4619s

error/mysql-error-running.log doesn't report anything suspicious; the latest replication-related entry was:

> 2017-03-17 14:07:00 5460 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin-changelog.049248' at position 110144, relay log '/rdsdbdata/log/relaylog/relaylog.000019' position: 4

CPU usage is at <= 1%, write IOPS <200 (compared to prod's 500). Output of `SHOW FULL PROCESSLIST`, various replication status tables, and `SHOW ENGINE INNODB STATUS`: https://emorley.pastebin.mozilla.org/8982328

If it weren't for the low CPU usage and write IOPS, I wouldn't think anything of this (perhaps just a large transaction being caught up); however, it looks like replication just isn't working properly.
Assignee
Comment 14•7 years ago
Oh, I missed out the obvious `SHOW MASTER STATUS` and `SHOW SLAVE STATUS`: https://emorley.pastebin.mozilla.org/8982329

Notably...

Master:

> File: mysql-bin-changelog.049271
> Position: 5251973

Slave:

> Slave_IO_State: Waiting for master to send event
> Master_Log_File: mysql-bin-changelog.049271
> Read_Master_Log_Pos: 5409267
> Relay_Log_File: relaylog.000036
> Relay_Log_Pos: 742558
> Relay_Master_Log_File: mysql-bin-changelog.049256
> Slave_IO_Running: Yes
> Slave_SQL_Running: Yes
> Last_Errno: 0
> Last_Error:
> Skip_Counter: 0
> Exec_Master_Log_Pos: 742385
> Relay_Log_Space: 171054636
> Seconds_Behind_Master: 4695
> Last_IO_Error:
> Last_SQL_Error:
> SQL_Remaining_Delay:
> Slave_SQL_Running_State: System lock
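A note on reading that output: both replication threads report `Yes`, so nothing has errored; the problem is the gap between the download and apply coordinates, which shows events arriving fine but being replayed slowly (consistent with `Slave_SQL_Running_State: System lock` and the low write IOPS). A sketch of the comparison:

```sql
-- On the replica: compare the IO thread (download) coordinates against
-- the SQL thread (apply) coordinates in the same output.
--   IO thread:  Master_Log_File = mysql-bin-changelog.049271 (current
--               with the master), Read_Master_Log_Pos = 5409267
--   SQL thread: Relay_Master_Log_File = mysql-bin-changelog.049256
--               (15 binlog files behind), Exec_Master_Log_Pos = 742385
SHOW SLAVE STATUS\G

-- On the master, for the reference coordinates:
SHOW MASTER STATUS\G
```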
Assignee
Comment 15•7 years ago
Replication lag is now dropping fast, and is down to 3489s. Write IOPS are now gradually increasing; guessing it was perhaps just slow write performance straight after restoring? (I know for EBS at least, restoring from a snapshot carries an initial performance penalty.)
Assignee
Comment 16•7 years ago
Replication caught up ~10 mins ago. For future reference, the time from replica instantiation to being fully caught up was 3 hours 30 mins (still quicker than if we had waited for the 58 hours of lag to clear, even ignoring the broken table locks).
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED