Closed Bug 1597136 Opened 5 years ago Closed 5 years ago

[meta] Database slow downs causing Tree closures

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: aryx, Assigned: armenzg)

References

Details

Attachments

(24 files)

[screenshot] Increased Read/Write throughput since the 14th 5 years ago Armen [:armenzg] 234.54 KB, image/png		Details
Screenshot_2019-11-18 RDS · AWS Console.png 5 years ago Kendall Libby [:fubar] (he/him) 214.33 KB, image/png		Details
[screenshot] Comparison of stage VS production 5 years ago Armen [:armenzg] 613.39 KB, image/png		Details
[screenshot] 1 week non-web transactions (stage vs production) 5 years ago Armen [:armenzg] 839.82 KB, image/png		Details
[screenshot] 1 hour period where the DB was not taking connections 5 years ago Armen [:armenzg] 406.72 KB, image/png		Details
[screenshot] pegged CPU 5 years ago Armen [:armenzg] 422.35 KB, image/png	ekyle : feedback+ armenzg : feedback+ trink : feedback+	Details
[screenshot] Database load last 24 hours (ET times) 5 years ago Armen [:armenzg] 114.32 KB, image/png		Details
[screenshot] Current IOPS for production 5 years ago Armen [:armenzg] 149.52 KB, image/png		Details
[screenshot] Write Throughput Production Vs Stage (last week) 5 years ago Armen [:armenzg] 142.67 KB, image/png		Details
mysql-error-running.log 5 years ago Armen [:armenzg] 2.79 MB, application/octet-stream		Details
Activiity around 9am GMT 5 years ago Kyle Lahnakoski [:ekyle] 203.99 KB, image/png		Details
production_show_table_status.csv 5 years ago Armen [:armenzg] 6.24 KB, application/octet-stream		Details
mysql-slowquery(1).log 5 years ago Armen [:armenzg] 1.14 MB, application/octet-stream		Details
Lots of performance_datum requested.png 5 years ago Kyle Lahnakoski [:ekyle] 177.72 KB, image/png		Details
[screenshot] io/table/sql/handler 5 years ago Armen [:armenzg] 282.16 KB, image/png		Details
[screenshot] Load from "Other" host 5 years ago Armen [:armenzg] 230.26 KB, image/png		Details
Screenshot_2019-11-21 RDS · AWS Console.png 5 years ago Kendall Libby [:fubar] (he/him) 273.61 KB, image/png		Details
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/5669 5 years ago GitHub Bugzilla PR Linker 47 bytes, text/x-github-pull-request		Details \| Review
[screenshot] CloudWatch metrics for production and stage 5 years ago Armen [:armenzg] 458.27 KB, image/png		Details
prod_db_params.txt 5 years ago Armen [:armenzg] 4.00 KB, text/plain		Details
stage_db_params.txt 5 years ago Armen [:armenzg] 4.00 KB, text/plain		Details
dev_db_params.txt 5 years ago Armen [:armenzg] 4.00 KB, text/plain		Details
[screenshot] last 24 hours treeherder-prod metrics 5 years ago Armen [:armenzg] 404.22 KB, image/png		Details
[screenshot] Write Throughput before abnormal load and now 5 years ago Armen [:armenzg] 299.32 KB, image/png		Details

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Description

•

5 years ago

As discussed earlier, there are slowdowns in the interaction with treeherder, e.g. that failure lines load slowly.

Newrelic shows spikes in mysql response time for non-web transactions which started
https://rpm.newrelic.com/accounts/677903/applications/14179757

The big, longer spikes start at 5am UTC each day (first one on Nov 15th UTC).

Another spike can be found between 1pm and 2:30pm UTC each day.

Deploy before this started: https://github.com/mozilla/treeherder/compare/c725c8a41cf19ab5a8217ac9a10e3da1dd2dd54f...9289f542e8729e5574ef4a566ddc007a4c943085 - related to bug 1571369? These new tables are not shown on the read-only replica even after refreshing the tables list.

The 1pm to 2:30pm UTC spike happens when cycle_data runs whose performance has deteriorated since Friday (and only runs at that time).

The amount of data added (e.g. into the table job_detail is unchanged compared to the days in the previous week). Can a one-off task be run to check if performance is lower than normal at any given time of the day and cron jobs only making this noticeable?

Ionuț Goldan [:igoldan]

Updated

•

5 years ago

Updated

•

5 years ago

Summary: repeated slowdowns fetching data due spikes in mysql response time for non-web transactions → Database slow downs causing Tree closures

Parameter	Production	Stage	Dev
innodb_io_capacity_max	5000	5000	2500
innodb_lru_scan_depth	256	256	256
innodb_log_file_size	17179869184 (16GB)	17179869184 (16GB)	8589934592 (8GB)
innodb_read_io_threads	8	8	4
innodb_write_io_threads	8	8	4
innodb_buffer_pool_size	48318382080 (45GB)	11811160064 (11GB)	11811160064 (11GB)

Name	Storage type	EC2	Size	IOPS	Multi-AZ	Replica
production	Provisioned IOPS	db.m5.4xlarge	2,000GB	7,000	Yes	Yes
stage	General Purpose	db.m5.xlarge	1,000GB	3,000	No	No
dev	General Purpose	db.m5.xlarge	1,100GB	3,300	No	No

Model	Core Count	vCPU*	Mem (GiB)	Storage (GiB)	EBS Bandwidth (Mbps)	Network Perf (Gbps)
db.m5.xlarge	2	4	16	EBS-only	Up to 3,500	Up to 10
db.m5.4xlarge	8	16	64	EBS-only	3,500	Up to 10

Name	Allocated	Free space
production	2,000GB	1,300GB
stage	1,100GB	200GB