Closed Bug 1348029 Opened 7 years ago Closed 7 years ago

Treeherder prod read-only replica suffering from table locks and replica lag

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ekyle, Assigned: emorley)

Details

Attachments

(1 file)

The following query does not seem to complete:

> select * from job limit 1

A `show processlist` gives a hint (columns removed for clarity):

> Id      db           Command  Time   State                              Info
> 72124   treeherder    Sleep    113        
> 72129   treeherder    Sleep    50        
> 72133   treeherder    Query    0     init                               show processlist
> 72135   treeherder    Query    34    Waiting for table metadata lock    SELECT * FROM `treeherder`.`job` LIMIT 1


Are there other clients locking the job table?
Hmm, there was a redash query from :jmaher which had been running for 3009489 seconds (~35 days). I went ahead and killed it.
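(For reference, a quick way to spot runaways like that; a minimal sketch only, with an arbitrary one-hour threshold:)

-- List client queries that have been running for more than an hour, longest first,
-- so runaway sessions like the redash one stand out.
SELECT ID, USER, HOST, DB, TIME, STATE, LEFT(INFO, 120) AS query_snippet
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Query' AND TIME > 3600
ORDER BY TIME DESC;

-- A specific session can then be terminated by Id; on RDS that would be
-- CALL mysql.rds_kill(<Id>); since a plain KILL of another user's thread needs SUPER.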

It appears as if some kind of ALTER TABLE operation is also in progress, probably from a migration :jgraham did earlier this week. My guess is that the former operation slowed down the replication of the latter. Hopefully things will start behaving better soon now that I've killed the misbehaving query.
Now that I think about it though, if the schema migration didn't happen correctly on the mirror when it was supposed to, we might be kind of hosed. We might need to reset its state to that of the master database. I'm not sure how to do that, so over to :emorley (when he has time).

In the meantime, the read-only database should be usable, though I wouldn't 100% trust it.
Assignee: nobody → emorley
Attached image CloudWatch graphs
Here are the CloudWatch graphs for the last 3 days.

The replication lag starts increasing at 10:00 UTC on 14th March (as expected, since that's when the schema change occurred) and is still increasing even now. Its current value is 209,000,000 ms (~58 hours).

AWS replication status reports replication hasn't errored, and is ongoing.

Currently this is the only query running:

ALTER TABLE `job` ADD COLUMN `autoclassify_status` integer DEFAULT 0 NOT NULL

The `time` for that query is reported as 212609s, which would be worrying, were it not that the write IOPS graphs show the query has actually been sitting stuck doing nothing, rather than genuinely taking that long to apply. Killing the stuck redash query in comment 1 appears to have worked.
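For reference, this looks like the classic metadata-lock queue; roughly (hypothetical sessions, the first query just a stand-in for the runaway redash one):

-- Session 1: a long-running SELECT touching `job` holds a shared metadata lock
-- on the table for the whole duration of the statement:
SELECT COUNT(*) FROM job CROSS JOIN job_note;  -- stand-in for the runaway query

-- Session 2: the replicated ALTER needs an exclusive metadata lock, so it queues
-- behind session 1, showing "Waiting for table metadata lock":
ALTER TABLE `job` ADD COLUMN `autoclassify_status` integer DEFAULT 0 NOT NULL;

-- Session 3: every later statement touching `job` then queues behind the pending
-- ALTER, which is why even a trivial read hangs:
SELECT * FROM `treeherder`.`job` LIMIT 1;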

Looking at the slow query log [1] I see jmaher's redash query:

# Time: 170316 18:16:15
# User@Host: redash[redash] @ SNIP Id: 20736
# Query_time: 3009530.989732 Lock_time: 0.000191 Rows_sent: 0 Rows_examined: 2358
use treeherder;
SET timestamp=1489688175;
/* Username: jmaher@SNIP, Task ID: SNIP, Query ID: 2517, Queue: queries, Query Hash: SNIP */ select distinct jn.text as fixed_by_revision, sig.build_system_type, jt.name as job_type_name, op.name as platform_option, sig.name as ref_data_name, bp.platform as platform from job_note jn, job j, job_type jt, reference_data_signatures sig, option_collection oc, `option` op, build_platform bp where jn.failure_classification_id=2 and jn.text!='' and jn.created>'2016-10-30' and jn.job_id=j.id and j.job_type_id=jt.id and j.signature_id=sig.id and sig.option_collection_hash=oc.option_collection_hash and oc.option_id=op.id and jt.name like 'test-%' or jt.name like 'desktop-test-%' and j.build_platform_id=bp.id;


The questions now are:
* How fast will the replication catch up the 58 hours, and should we just reset the replica and start over? (I don't have enough access to do this, :fubar will need to if so)
* Why did jmaher's query get stuck?
* Is there anything we can do to prevent this again in the future? (I'd guess not, given `max_execution_time` wasn't added until MySQL 5.7.8 [2]; see the sketch below the footnotes.)


[1] https://console.aws.amazon.com/rds/home?region=us-east-1#dbinstance:id=treeherder-prod-ro;view=logs
[2] https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_max_execution_time
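(For the record, the sort of cap we could add once we're on 5.7.8+; a sketch only, with made-up thresholds:)

-- Server-wide cap on read-only SELECT statements (value in milliseconds; 0 = no limit):
SET GLOBAL max_execution_time = 3600000;  -- e.g. give up after 1 hour

-- Or per query, via the optimizer hint:
SELECT /*+ MAX_EXECUTION_TIME(60000) */ COUNT(*) FROM job;  -- give up after 60s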
I am guessing jmaher's query got stuck because 

> jt.name like 'test-%' or jt.name like 'desktop-test-%'

was not wrapped in parentheses.
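With AND binding tighter than OR, that second branch effectively drops nearly all of the join conditions, so the query degenerates into a near-cross-join over `job`. Presumably the intended query was (my guess; parentheses added, everything else unchanged):

SELECT DISTINCT jn.text AS fixed_by_revision, sig.build_system_type,
       jt.name AS job_type_name, op.name AS platform_option,
       sig.name AS ref_data_name, bp.platform AS platform
FROM job_note jn, job j, job_type jt, reference_data_signatures sig,
     option_collection oc, `option` op, build_platform bp
WHERE jn.failure_classification_id = 2
  AND jn.text != ''
  AND jn.created > '2016-10-30'
  AND jn.job_id = j.id
  AND j.job_type_id = jt.id
  AND j.signature_id = sig.id
  AND sig.option_collection_hash = oc.option_collection_hash
  AND oc.option_id = op.id
  AND (jt.name LIKE 'test-%' OR jt.name LIKE 'desktop-test-%')
  AND j.build_platform_id = bp.id;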
Youch...

> explain select distinct jn.text as fixed_by_revision, sig.build_system_type, jt.name as job_type_name, op.name as platform_option, sig.name as ref_data_name, bp.platform as platform from job_note jn, job j, job_type jt, reference_data_signatures sig, option_collection oc, `option` op, build_platform bp where jn.failure_classification_id=2 and jn.text!='' and jn.created>'2016-10-30' and jn.job_id=j.id and j.job_type_id=jt.id and j.signature_id=sig.id and sig.option_collection_hash=oc.option_collection_hash and oc.option_id=op.id and jt.name like 'test-%' or jt.name like 'desktop-test-%' and j.build_platform_id=bp.id

******************** 1. row *********************
           id: 1
  select_type: SIMPLE
        table: op
         type: index
possible_keys: PRIMARY
          key: option_name_78ffea107809a885_uniq
      key_len: 302
          ref: 
         rows: 21
        Extra: Using index; Using temporary
******************** 2. row *********************
           id: 1
  select_type: SIMPLE
        table: oc
         type: index
possible_keys: uni_option_collection,idx_option
          key: uni_option_collection
      key_len: 126
          ref: 
         rows: 23
        Extra: Using index; Using join buffer (Block Nested Loop)
******************** 3. row *********************
           id: 1
  select_type: SIMPLE
        table: bp
         type: index
possible_keys: PRIMARY
          key: idx_platform
      key_len: 302
          ref: 
         rows: 244
        Extra: Using index; Using join buffer (Block Nested Loop)
******************** 4. row *********************
           id: 1
  select_type: SIMPLE
        table: jt
         type: range
possible_keys: PRIMARY,uni_name_symbol
          key: uni_name_symbol
      key_len: 302
          ref: 
         rows: 13632
        Extra: Using where; Using index; Using join buffer (Block Nested Loop)
******************** 5. row *********************
           id: 1
  select_type: SIMPLE
        table: j
         type: ALL
possible_keys: job_9aefd743,job_d1803079
          key: 
      key_len: 
          ref: 
         rows: 25134319
        Extra: Range checked for each record (index map: 0x30)
******************** 6. row *********************
           id: 1
  select_type: SIMPLE
        table: sig
         type: ALL
possible_keys: PRIMARY,idx_option_collection_hash
          key: 
      key_len: 
          ref: 
         rows: 246434
        Extra: Using where; Using join buffer (Block Nested Loop)
******************** 7. row *********************
           id: 1
  select_type: SIMPLE
        table: jn
         type: ALL
possible_keys: D0c96672d1abd821fcd7e1f1723a2fd8,job_note_d697ea38
          key: 
      key_len: 
          ref: 
         rows: 277637
        Extra: Using where; Using join buffer (Block Nested Loop)
7 rows in set
I am also guessing that catching up 58 hours will take less time than a fresh replication.  Assuming it is not caught up already.
NM, it appears the database is down, presumably to fix this.
(In reply to Kyle Lahnakoski [:ekyle] from comment #6)
> I am also guessing that catching up 58 hours will take less time than a
> fresh replication.  Assuming it is not caught up already.

Possibly not, since a fresh replica can use the last AWS checkpoint from the master (which are taken every 5 minutes), whereas replication has to catch up using a single thread (and the replica is an m4.xlarge vs prod's m4.2xlarge).

(In reply to Kyle Lahnakoski [:ekyle] from comment #7)
> NM, it appears the database is down, presumably to fix this.

I rebooted it (one of the few things I have access to do), since it appeared stuck even after the ALTER TABLE had finished.
The DB is still locked up at 

> SELECT * FROM `treeherder`.`job_detail` LIMIT 1
Replica lag is now 48 hours (from 58).

Not sure why the query in comment 9 is still unresponsive (I can repro).

Kendall, I don't suppose you could just nuke treeherder-prod-ro and start over?
Flags: needinfo?(klibby)
Priority: -- → P1
Summary: TH RO DB Locked up? → Treeherder prod read-only replica suffering from table locks and replica lag
Can do. I'll go kick it off.
Flags: needinfo?(klibby)
New treeherder-prod-ro has been created!
Many thanks for deleting/recreating.

However the replication lag is still increasing, and CPU usage is at <= 1%?

From the AWS event logs (times in UTC), the timeline was:

Mar 17 1:21 PM: (treeherder-prod-ro) DB instance deleted
Mar 17 1:25 PM: (treeherder-prod) Backing up DB instance
Mar 17 1:58 PM: (treeherder-prod) Finished DB Instance backup
Mar 17 2:06 PM: (treeherder-prod-ro) Restored from snapshot 
Mar 17 2:06 PM: (treeherder-prod-ro) DB instance shutdown
Mar 17 2:06 PM: (treeherder-prod-ro) DB instance restarted
Mar 17 2:07 PM: (treeherder-prod-ro) Replication for the Read Replica resumed

(which is what I would have expected)

Replication lag trend:
14:08: 2530s
14:15: 2703s
14:16: 2698s (so it did start going down)
14:17: 2721s
(then very steep increase)
15:14: 4619s

error/mysql-error-running.log doesn't report anything suspicious; the latest replication-related entry was:

2017-03-17 14:07:00 5460 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin-changelog.049248' at position 110144, relay log '/rdsdbdata/log/relaylog/relaylog.000019' position: 4

CPU usage is at <= 1%, write IOPS <200 (compared to prod's 500).

Output of `SHOW FULL PROCESSLIST` / various replication status tables and `SHOW ENGINE INNODB STATUS`:
https://emorley.pastebin.mozilla.org/8982328

If it weren't for the low CPU usage and write IOPS, I wouldn't think anything of this (perhaps just a large transaction being caught up); however, it looks like replication just isn't working properly.
Oh I missed out the obvious `SHOW MASTER STATUS` and `SHOW SLAVE STATUS`:
https://emorley.pastebin.mozilla.org/8982329

Notably...

Master:
                     File: mysql-bin-changelog.049271
                 Position: 5251973

Slave:
           Slave_IO_State: Waiting for master to send event
          Master_Log_File: mysql-bin-changelog.049271
      Read_Master_Log_Pos: 5409267
           Relay_Log_File: relaylog.000036
            Relay_Log_Pos: 742558
    Relay_Master_Log_File: mysql-bin-changelog.049256
         Slave_IO_Running: Yes
        Slave_SQL_Running: Yes
               Last_Errno: 0
               Last_Error: 
             Skip_Counter: 0
      Exec_Master_Log_Pos: 742385
          Relay_Log_Space: 171054636
    Seconds_Behind_Master: 4695
            Last_IO_Error: 
           Last_SQL_Error: 
      SQL_Remaining_Delay: 
  Slave_SQL_Running_State: System lock
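Reading that: the IO thread is current (Master_Log_File matches the master's current file, 049271), but the SQL thread is still applying mysql-bin-changelog.049256, fifteen binlog files behind, and is sitting in "System lock". Nothing exotic to do but watch it; for the record, the two things I'm polling:

SHOW SLAVE STATUS\G     -- watch the Relay_Master_Log_File / Seconds_Behind_Master trend
SHOW FULL PROCESSLIST;  -- confirm the SQL thread isn't blocked behind a client query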
Replication lag now dropping fast, and down to 3489s.

Write IOPS are now gradually increasing; guessing it was perhaps just slow write performance straight after restoring? (I know that for EBS at least, restoring from a snapshot has an initial performance penalty.)
Replication caught up ~10 mins ago.

For future reference, time from replica instantiation to when it was caught up: 3 hours 30 mins
(Still quicker than if we had waited for the 58 hours to catch up, even ignoring the broken table locks).
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED