Closed Bug 1348029 Opened 7 years ago Closed 7 years ago

Treeherder prod read-only replica suffering from table locks and replica lag

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ekyle, Assigned: emorley)

Details

Attachments

(1 file)

The following query does not seem to complete:

> select * from job limit 1

A `show processlist` gives a hint (columns removed for clarity):

> Id      db           Command  Time   State                              Info
> 72124   treeherder    Sleep    113        
> 72129   treeherder    Sleep    50        
> 72133   treeherder    Query    0     init                               show processlist
> 72135   treeherder    Query    34    Waiting for table metadata lock    SELECT * FROM `treeherder`.`job` LIMIT 1


Are there other clients locking the job table?
Hmm, there was a redash query from :jmaher which had been running for 3009489 seconds (~35 days). I went ahead and killed it.
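(For reference, a quick way to spot runaways like that; a minimal sketch only, with an arbitrary one-hour threshold:)

-- List client queries that have been running for more than an hour, longest first,
-- so runaway sessions like the redash one stand out.
SELECT ID, USER, HOST, DB, TIME, STATE, LEFT(INFO, 120) AS query_snippet
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Query' AND TIME > 3600
ORDER BY TIME DESC;

-- A specific session can then be terminated by Id; on RDS that would be
-- CALL mysql.rds_kill(<Id>); since a plain KILL of another user's thread needs SUPER.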

It appears as if some kind of ALTER TABLE operation is also in progress, probably from a migration :jgraham did earlier this week. My guess is that the former operation slowed down the replication of the latter. Hopefully things will start behaving better soon now that I've killed the misbehaving query.
Now that I think about it though, if the schema migration didn't happen correctly on the mirror when it was supposed to, we might be kind of hosed. We might need to reset its state to that of the master database. I'm not sure how to do that, so over to :emorley (when he has time).

In the meantime, the read-only database should be usable, though I wouldn't 100% trust it.
Assignee: nobody → emorley
Attached image CloudWatch graphs
Here are the CloudWatch graphs for the last 3 days.

The replication lag starts increasing at 10:00 UTC on 14th March (as expected, since that's when the schema change occurred) and is still increasing even now. Its current value is 209,000,000 ms (~58 hours).

AWS replication status reports replication hasn't errored, and is ongoing.

Currently this is the only query running:

ALTER TABLE `job` ADD COLUMN `autoclassify_status` integer DEFAULT 0 NOT NULL

The `time` for that query is reported as 212609s, which would be worrying, were it not that the write IOPS graphs show the query has actually been sitting stuck doing nothing, rather than genuinely taking that long to apply. Killing the stuck redash query in comment 1 appears to have worked.
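For reference, this looks like the classic metadata-lock queue; roughly (hypothetical sessions, the first query just a stand-in for the runaway redash one):

-- Session 1: a long-running SELECT touching `job` holds a shared metadata lock
-- on the table for the whole duration of the statement:
SELECT COUNT(*) FROM job CROSS JOIN job_note;  -- stand-in for the runaway query

-- Session 2: the replicated ALTER needs an exclusive metadata lock, so it queues
-- behind session 1, showing "Waiting for table metadata lock":
ALTER TABLE `job` ADD COLUMN `autoclassify_status` integer DEFAULT 0 NOT NULL;

-- Session 3: every later statement touching `job` then queues behind the pending
-- ALTER, which is why even a trivial read hangs:
SELECT * FROM `treeherder`.`job` LIMIT 1;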

Looking at the slow query log [1] I see jmaher's redash query:

# Time: 170316 18:16:15
# User@Host: redash[redash] @ SNIP Id: 20736
# Query_time: 3009530.989732 Lock_time: 0.000191 Rows_sent: 0 Rows_examined: 2358
use treeherder;
SET timestamp=1489688175;
/* Username: jmaher@SNIP, Task ID: SNIP, Query ID: 2517, Queue: queries, Query Hash: SNIP */ select distinct jn.text as fixed_by_revision, sig.build_system_type, jt.name as job_type_name, op.name as platform_option, sig.name as ref_data_name, bp.platform as platform from job_note jn, job j, job_type jt, reference_data_signatures sig, option_collection oc, `option` op, build_platform bp where jn.failure_classification_id=2 and jn.text!='' and jn.created>'2016-10-30' and jn.job_id=j.id and j.job_type_id=jt.id and j.signature_id=sig.id and sig.option_collection_hash=oc.option_collection_hash and oc.option_id=op.id and jt.name like 'test-%' or jt.name like 'desktop-test-%' and j.build_platform_id=bp.id;


The questions now are:
* How fast will the replication catch up the 58 hours, and should we just reset the replica and start over? (I don't have enough access to do this, :fubar will need to if so)
* Why did jmaher's query get stuck?
* Is there anything we can do to prevent this again in the future? (I'd guess not, given `max_execution_time` wasn't added until MySQL 5.7.8 [2]; see the sketch below the footnotes.)


[1] https://console.aws.amazon.com/rds/home?region=us-east-1#dbinstance:id=treeherder-prod-ro;view=logs
[2] https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_max_execution_time
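(For the record, the sort of cap we could add once we're on 5.7.8+; a sketch only, with made-up thresholds:)

-- Server-wide cap on read-only SELECT statements (value in milliseconds; 0 = no limit):
SET GLOBAL max_execution_time = 3600000;  -- e.g. give up after 1 hour

-- Or per query, via the optimizer hint:
SELECT /*+ MAX_EXECUTION_TIME(60000) */ COUNT(*) FROM job;  -- give up after 60s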
I am guessing jmaher's query got stuck because 

> jt.name like 'test-%' or jt.name like 'desktop-test-%'

was not wrapped in parentheses.
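With AND binding tighter than OR, that second branch effectively drops nearly all of the join conditions, so the query degenerates into a near-cross-join over `job`. Presumably the intended query was (my guess; parentheses added, everything else unchanged):

SELECT DISTINCT jn.text AS fixed_by_revision, sig.build_system_type,
       jt.name AS job_type_name, op.name AS platform_option,
       sig.name AS ref_data_name, bp.platform AS platform
FROM job_note jn, job j, job_type jt, reference_data_signatures sig,
     option_collection oc, `option` op, build_platform bp
WHERE jn.failure_classification_id = 2
  AND jn.text != ''
  AND jn.created > '2016-10-30'
  AND jn.job_id = j.id
  AND j.job_type_id = jt.id
  AND j.signature_id = sig.id
  AND sig.option_collection_hash = oc.option_collection_hash
  AND oc.option_id = op.id
  AND (jt.name LIKE 'test-%' OR jt.name LIKE 'desktop-test-%')
  AND j.build_platform_id = bp.id;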
Youch...

> explain select distinct jn.text as fixed_by_revision, sig.build_system_type, jt.name as job_type_name, op.name as platform_option, sig.name as ref_data_name, bp.platform as platform from job_note jn, job j, job_type jt, reference_data_signatures sig, option_collection oc, `option` op, build_platform bp where jn.failure_classification_id=2 and jn.text!='' and jn.created>'2016-10-30' and jn.job_id=j.id and j.job_type_id=jt.id and j.signature_id=sig.id and sig.option_collection_hash=oc.option_collection_hash and oc.option_id=op.id and jt.name like 'test-%' or jt.name like 'desktop-test-%' and j.build_platform_id=bp.id

******************** 1. row *********************
           id: 1
  select_type: SIMPLE
        table: op
         type: index
possible_keys: PRIMARY
          key: option_name_78ffea107809a885_uniq
      key_len: 302
          ref: 
         rows: 21
        Extra: Using index; Using temporary
******************** 2. row *********************
           id: 1
  select_type: SIMPLE
        table: oc
         type: index
possible_keys: uni_option_collection,idx_option
          key: uni_option_collection
      key_len: 126
          ref: 
         rows: 23
        Extra: Using index; Using join buffer (Block Nested Loop)
******************** 3. row *********************
           id: 1
  select_type: SIMPLE
        table: bp
         type: index
possible_keys: PRIMARY
          key: idx_platform
      key_len: 302
          ref: 
         rows: 244
        Extra: Using index; Using join buffer (Block Nested Loop)
******************** 4. row *********************
           id: 1
  select_type: SIMPLE
        table: jt
         type: range
possible_keys: PRIMARY,uni_name_symbol
          key: uni_name_symbol
      key_len: 302
          ref: 
         rows: 13632
        Extra: Using where; Using index; Using join buffer (Block Nested Loop)
******************** 5. row *********************
           id: 1
  select_type: SIMPLE
        table: j
         type: ALL
possible_keys: job_9aefd743,job_d1803079
          key: 
      key_len: 
          ref: 
         rows: 25134319
        Extra: Range checked for each record (index map: 0x30)
******************** 6. row *********************
           id: 1
  select_type: SIMPLE
        table: sig
         type: ALL
possible_keys: PRIMARY,idx_option_collection_hash
          key: 
      key_len: 
          ref: 
         rows: 246434
        Extra: Using where; Using join buffer (Block Nested Loop)
******************** 7. row *********************
           id: 1
  select_type: SIMPLE
        table: jn
         type: ALL
possible_keys: D0c96672d1abd821fcd7e1f1723a2fd8,job_note_d697ea38
          key: 
      key_len: 
          ref: 
         rows: 277637
        Extra: Using where; Using join buffer (Block Nested Loop)
7 rows in set
I am also guessing that catching up 58 hours will take less time than a fresh replication.  Assuming it is not caught up already.
NM, it appears the database is down, presumably to fix this.
(In reply to Kyle Lahnakoski [:ekyle] from comment #6)
> I am also guessing that catching up 58 hours will take less time than a
> fresh replication.  Assuming it is not caught up already.

Possibly not, since a fresh replica can use the last AWS checkpoint from the master (which are taken every 5 minutes), whereas replication has to catch up using a single thread (and the replica is an m4.xlarge vs prod's m4.2xlarge).

(In reply to Kyle Lahnakoski [:ekyle] from comment #7)
> NM, it appears the database is down, presumably to fix this.

I rebooted it (one of the few things I have access to do), since it appeared stuck even after the ALTER TABLE had finished.
The DB is still locked up at 

> SELECT * FROM `treeherder`.`job_detail` LIMIT 1
Replica lag is now 48 hours (from 58).

Not sure why the query in comment 9 is still unresponsive (I can repro).

Kendall, I don't suppose you could just nuke treeherder-prod-ro and start over?
Flags: needinfo?(klibby)
Priority: -- → P1
Summary: TH RO DB Locked up? → Treeherder prod read-only replica suffering from table locks and replica lag
Can do. I'll go kick it off.
Flags: needinfo?(klibby)
New treeherder-prod-ro has been created!
Many thanks for deleting/recreating.

However the replication lag is still increasing, and CPU usage is at <= 1%?

From the AWS event logs (times in UTC), the timeline was:

Mar 17 1:21 PM: (treeherder-prod-ro) DB instance deleted
Mar 17 1:25 PM: (treeherder-prod) Backing up DB instance
Mar 17 1:58 PM: (treeherder-prod) Finished DB Instance backup
Mar 17 2:06 PM: (treeherder-prod-ro) Restored from snapshot 
Mar 17 2:06 PM: (treeherder-prod-ro) DB instance shutdown
Mar 17 2:06 PM: (treeherder-prod-ro) DB instance restarted
Mar 17 2:07 PM: (treeherder-prod-ro) Replication for the Read Replica resumed

(which is what I would have expected)

Replication lag trend:
14:08: 2530s
14:15: 2703s
14:16: 2698s (so it did start going down)
14:17: 2721s
(then very steep increase)
15:14: 4619s

error/mysql-error-running.log doesn't report anything suspicious; the latest replication-related entry was:

2017-03-17 14:07:00 5460 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin-changelog.049248' at position 110144, relay log '/rdsdbdata/log/relaylog/relaylog.000019' position: 4

CPU usage is at <= 1%, write IOPS <200 (compared to prod's 500).

Output of `SHOW FULL PROCESSLIST` / various replication status tables and `SHOW ENGINE INNODB STATUS`:
https://emorley.pastebin.mozilla.org/8982328

If it weren't for the low CPU usage and write IOPS, I wouldn't think anything of this (perhaps just a large transaction being caught up); however, it looks like replication just isn't working properly.
Oh I missed out the obvious `SHOW MASTER STATUS` and `SHOW SLAVE STATUS`:
https://emorley.pastebin.mozilla.org/8982329

Notably...

Master:
                     File: mysql-bin-changelog.049271
                 Position: 5251973

Slave:
           Slave_IO_State: Waiting for master to send event
          Master_Log_File: mysql-bin-changelog.049271
      Read_Master_Log_Pos: 5409267
           Relay_Log_File: relaylog.000036
            Relay_Log_Pos: 742558
    Relay_Master_Log_File: mysql-bin-changelog.049256
         Slave_IO_Running: Yes
        Slave_SQL_Running: Yes
               Last_Errno: 0
               Last_Error: 
             Skip_Counter: 0
      Exec_Master_Log_Pos: 742385
          Relay_Log_Space: 171054636
    Seconds_Behind_Master: 4695
            Last_IO_Error: 
           Last_SQL_Error: 
      SQL_Remaining_Delay: 
  Slave_SQL_Running_State: System lock
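Reading that: the IO thread is current (Master_Log_File matches the master's current file, 049271), but the SQL thread is still applying mysql-bin-changelog.049256, fifteen binlog files behind, and is sitting in "System lock". Nothing exotic to do but watch it; for the record, the two things I'm polling:

SHOW SLAVE STATUS\G     -- watch the Relay_Master_Log_File / Seconds_Behind_Master trend
SHOW FULL PROCESSLIST;  -- confirm the SQL thread isn't blocked behind a client query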
Replication lag now dropping fast, and down to 3489s.

Write IOPS are now gradually increasing; guessing it was perhaps just slow write performance straight after restoring? (I know that for EBS at least, restoring from a snapshot has an initial performance penalty.)
Replication caught up ~10 mins ago.

For future reference, time from replica instantiation to when it was caught up: 3 hours 30 mins
(Still quicker than if we had waited for the 58 hours to catch up, even ignoring the broken table locks).
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED