[Meta] - High database spikes affect UI response time
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P1)
Tracking
(Not tracked)
People
(Reporter: camd, Assigned: armenzg)
References
Details
Attachments
(1 file)
270.24 KB, text/plain
We have seen this a few times: the UI becomes unresponsive, and at those times several large logs are being parsed in a row. This seems to slow down the database, and then the UI takes a hit while trying to retrieve pushes/jobs.
I'm not sure what the solution is, but there must be a way to insulate the front-end from the log parsing.
Reporter
Comment 1•6 years ago
This happens only occasionally, and it usually passes within about 10 minutes, so it's hard to call it critical. But it's certainly annoying and frustrating, especially for the sheriffs.
Here are a few logs that got caught in this. We will keep the links to them for a while, so we should be able to track the jobs/pushes down from the URLs:
Comment 2•6 years ago
https://taskcluster-artifacts.net/IruPgTdnRmmqiqapqAvqww/0/public/logs/live_backing.log is an autoland push and a backout, so nothing crazy from Try: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=acd39065fd9fa5f7fe7a4530d5a991f64dcbc9a0&selectedJob=247557345
The other one is from inbound: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=4d7ef530fffba20a2b352cff9fa23438128105a7&selectedJob=247561821
Assignee
Comment 3•5 years ago
I saw a New Relic responsiveness email and looked into the last 6 hours with New Relic [1].
I zoomed in on the area with the spike.
I selected 'log-parser' under 'Transactions'.
I selected some points around the peak.
You can see that the MySQL performance_datum insert took over 80% of the processing time for a log-parser event, with 244 inserts costing 142 seconds [2].
You can see further details under "Database queries", and you can drill down even further if you see an icon to the left [3][4], which shows which piece of code executes the insert statements (assuming I understand what New Relic is trying to do).
After some investigation across various trees, loading 50+ pushes and filtering with 'talos', I found this push:
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&searchStr=talos&revision=cc08a11db7d94d70e81e59afd4d2cfd4b85aa153
The command used was |mach try --preset perf --rebuild 10|. That caused many more jobs than intended; the developer intended to obtain "Tp and raptor, but probably across platforms".
I have various thoughts crossing my mind as to what could be tried:
- Verify whether the slowdowns we've seen in the past follow a pattern like this
- Reduce the number of log parsing workers (currently 6) and/or increase the power of the dynos
  - This is to reduce the number of workers that might be trying to write to the DB at the same time
- Reevaluate the implications of using --concurrency 7 [5]
- Use structured logs to make log parsing less intensive on Treeherder
- Reevaluate the approach to how we ingest performance data
  - For example, the Talos tasks generate an artifact with the performance data instead of Treeherder parsing one line out of a full log; such an artifact is then ingested via a different Treeherder task
  - This would decouple the perf data ingestion from log parsing and make it easier to work on and test
- Add more New Relic instrumentation (a sketch follows this list)
  - I saw some tracing messages on some of the deepest drilldowns saying that I need more in-code logic to see further details
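As a sketch of that last item: the Python New Relic agent supports custom function traces and custom metrics. Something along these lines would give the drilldowns more in-code detail; the function and module names here are made up for illustration, not the actual Treeherder code paths.

import newrelic.agent


# Hypothetical log-parser helper; the decorator makes it show up as its own
# segment in the transaction trace instead of being folded into a generic
# database node.
@newrelic.agent.function_trace(name='store_performance_artifact', group='LogParser')
def store_performance_artifact(job, perf_rows):
    # ... existing performance ingestion logic would go here ...
    # A custom metric makes it easy to graph how many rows each parse inserts.
    newrelic.agent.record_custom_metric(
        'Custom/LogParser/performance_datum_rows', len(perf_rows))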
This bug could be just a curiosity if it did not have any impact on the users; however, I believe it does if we look at "treeherder.webapp.api.jobs:JobsViewSet.list" [6]. There's a huge impact on the "MySQL job select" operations. This means that for about 15 minutes there would be all sorts of slowdowns, since we can't get the data to the users in a reasonable manner.
I just started looking at New Relic today, so I hope I did not misunderstand all of this and waste your time!
[1] https://cl.ly/38b18d90c0ca
[2] https://cl.ly/d70e90bd044b
[3] INSERT INTO performance_datum
(repository_id
, signature_id
, value
, push_timestamp
, job_id
, push_id
, ds_job_id
, result_set_id
) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
[4] https://cl.ly/95f92c24f6a8
[5] https://github.com/mozilla/treeherder/blob/master/Procfile#L46
[6] https://cl.ly/bd9179b482f8
Assignee
Comment 4•5 years ago
camd recommended looking at https://docs.djangoproject.com/en/2.2/ref/models/querysets/#bulk-create
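A rough sketch of what that could look like, assuming the performance rows currently go through one INSERT per datum (as the 244 inserts in comment 3 suggest). The model and field names below are taken from the columns shown in [3] and may not match the actual Treeherder code exactly.

from treeherder.perf.models import PerformanceDatum  # assumed import path


def store_performance_data(repository, job, push, perf_series):
    # Build all rows in memory, then write them with a few batched INSERTs
    # instead of one round trip per datum.
    rows = [
        PerformanceDatum(
            repository=repository,
            signature=signature,
            value=value,
            push_timestamp=push.time,
            job=job,
            push=push,
        )
        for signature, value in perf_series
    ]
    # batch_size would need tuning against the production MySQL instance. Note
    # that bulk_create() skips save() and signals, so any duplicate handling
    # that relies on them would have to move elsewhere.
    PerformanceDatum.objects.bulk_create(rows, batch_size=100)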
Assignee
Comment 7•5 years ago
Copying from bug 1560657:
https://bugzilla.mozilla.org/show_bug.cgi?id=1560657#c0
There were a couple of Heroku 95th percentile emails. This bug is only analyzing the latest one.
This spike does not seem to be related to a high number of talos jobs completing:
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&fromchange=e053c95bb1b4a99407040a33be906b9465c5fb2b&searchStr=talos
https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&searchStr=talos&fromchange=99f1b315fa89db68ef3e7f8875bbd0a4b818423d
You can see the spike on MySQL job selects:
https://cl.ly/3ace8cb7afbb
You can see the JobsViewSet spiking:
https://cl.ly/65e6ba9e749a
Now, when I look closely at the log parser:
https://cl.ly/a4918e40c9e2
You will see that we first start having problems with MySQL line failures, and then it starts affecting other insert operations. The next screenshot shows only the MySQL-related operations, so you will notice it more clearly:
https://cl.ly/ded35e84aa41
I believe my proposal to use the replica for read operations is gaining weight.
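For reference, a minimal sketch of that replica idea using a Django database router, assuming a second 'read_only' alias in settings.DATABASES pointing at the RDS replica; the alias and module path are illustrative, not Treeherder's actual configuration.

class ReadReplicaRouter:
    """Route reads to the replica; keep writes (ingestion, log parsing) on the primary."""

    def db_for_read(self, model, **hints):
        return 'read_only'

    def db_for_write(self, model, **hints):
        return 'default'

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases point at the same schema.
        return True

    def allow_migrate(self, db, app_label, **hints):
        # Only run migrations against the primary.
        return db == 'default'

# settings.py (hypothetical module path):
# DATABASE_ROUTERS = ['treeherder.config.db_routers.ReadReplicaRouter']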
Assignee
Comment 8•5 years ago
For those with New Relic access I created a dashboard to track the performance issues:
https://rpm.newrelic.com/accounts/677903/custom_dashboards/pages/11944
I'm now considering the following options:
- Upgrade the DB instance from db.m4.2xlarge to something larger
  - This will hopefully get us out of the hole for a bit
  - I will need to measure the stage app before upgrading to a larger instance
- Test out the R/O (read-only replica) idea on stage
  - Get sheriffs to test that out and see if they can use it
- Investigate making fewer API calls from the UI
  - Or even timing out sooner
- Evaluate our caching strategies for fetching jobs (a rough sketch follows this list)
  - I'm sure we hit the DB more than we ought to
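As a rough sketch of the caching item above, Django's low-level cache API could front the per-push job queries. The function, import path, and key layout here are illustrative rather than the actual JobsViewSet code.

from django.core.cache import cache
from treeherder.model.models import Job  # assumed import path


def get_jobs_for_push(repo_name, push_id):
    cache_key = f'jobs:{repo_name}:{push_id}'
    jobs = cache.get(cache_key)
    if jobs is None:
        # One DB hit per push per TTL window instead of one per poll.
        jobs = list(Job.objects.filter(push_id=push_id).values())
        # Even a short TTL cuts repeated identical queries while sheriffs
        # poll; completed pushes rarely change.
        cache.set(cache_key, jobs, timeout=60)
    return jobs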
:camd, :sclements: If you could look into the last two items that would be ideal.
Copying comment from a sheriff:
https://bugzilla.mozilla.org/show_bug.cgi?id=1561128#c0
https://treeherder.mozilla.org was down for 2~3 minutes and after that was really slow for ~15 mins, from 3 secs to 15 to load a failure.
Load time in Firefox: https://irccloud.mozilla.com/file/5uVSoYj5/image.png
Assignee
Comment 9•5 years ago
This is a slow MySQL query from around the events of yesterday, in case any of you can make sense of it.
Assignee
Comment 10•5 years ago
I'm looking at the hardware specifications in here:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html#Concepts.DBInstanceClass.Summary
I'm going to upgrade treeherder-stage to a better instance to see how we fare.
Currently, only production uses db.m4.2xlarge, while the other DB instances are db.m4.xlarge. Hourly pricing:
db.m4.xlarge: $0.35 --> $0.342 (-2.3%)
db.m4.2xlarge: $0.70 --> $0.684 (-2.3%)
Instance Class | vCPU | ECU | Memory (GiB) | Max. Bandwidth (Mbps) | Network Performance |
---|---|---|---|---|---|
db.m4.xlarge | 4 | 13 | 16 | 750 | High |
db.m4.2xlarge | 8 | 25.5 | 32 | 1,000 | High |
db.m5.xlarge | 4 | 15 | 16 | 3,500 | Up to 10 Gigabit |
db.m5.2xlarge | 8 | 31 | 32 | 3,500 | Up to 10 Gigabit |
Upgrading from db.m4.2xlarge (production) to db.m5.2xlarge would give us:
- 22% higher processing power (ECU units)
- 3.5x the max bandwidth (1,000 --> 3,500 Mbps)
- an actual cap of 10 Gigabit network performance
- TODO: Evaluate the implications of this; would we need more?
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html:
- db.m5 – Latest-generation general-purpose instance classes that provide a balance of compute, memory, and network resources, and are a good choice for many applications. The db.m5 instance classes provide more computing capacity than the previous db.m4 instance classes.
- db.m4 – Current-generation general-purpose instance classes that provide more computing capacity than the previous db.m3 instance classes.
We might also consider changing the storage type from 'General Purpose SSD Storage' to 'Provisioned IOPS Storage'
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html#USER_PIOPS
Assignee
Comment 11•5 years ago
I've tried changing treeherder-stage's instance but I need to wait for IT to do it for me.
I've also made the Heroku workers the same as production's; I want a setup closer to the production data pipeline.
I've also added another performance dashboard, but for treeherder-stage, to better evaluate the impact of changing the instance type.
Comment 12•5 years ago
This is a lot of information - thanks for looking into it.
Were you able to get any improvements to the log parser by changing the query?
Regarding your point above re "Investigate making fewer API calls from the UI" - I've removed as many unnecessary API calls as possible during the Perfherder conversion, but there are queries (for all of TH) that need to be cleaned up and optimized, as outlined in bug 1407377. Some of these might only need minor changes, like adding 'select_related's. I think we should focus on the low-hanging fruit and see how much that helps before making bigger changes.
Assignee
Comment 13•5 years ago
I'm more interested in reducing the calls from the main landing page (e.g. autoland or try).
I agree with delaying certain changes until we see whether the others are effective.
Assignee
Comment 14•5 years ago
See bug 1564566 for details on options to read less.
Assignee
Comment 15•5 years ago
In the last few hours we have doubled the IOPS cap for the treeherder-prod database:
https://bugzilla.mozilla.org/show_bug.cgi?id=1567257
Instead of 3,000 IOPS we have 6,000 IOPS and I see close to 3,800 being used at this moment.
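For anyone who wants to double-check the headroom without New Relic, the standard AWS/RDS CloudWatch metrics can be queried with boto3; the region and instance identifier below are placeholders, not necessarily the real ones.

from datetime import datetime, timedelta
import boto3


def recent_rds_iops(instance_id, metric='WriteIOPS', minutes=60):
    # 'ReadIOPS' and 'WriteIOPS' are the standard per-instance RDS metrics.
    cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')  # placeholder region
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/RDS',
        MetricName=metric,
        Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': instance_id}],
        StartTime=datetime.utcnow() - timedelta(minutes=minutes),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Average', 'Maximum'],
    )
    return sorted(response['Datapoints'], key=lambda d: d['Timestamp'])

# e.g. recent_rds_iops('treeherder-prod')  # placeholder instance identifier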
This change plus the auto-scaling mechanism for web nodes might be all we need for now. Let's see.
Please ping me if you notice any UI slowdowns so I can look at NewRelic.
Thanks to @dividehex for all his support with DB changes.
Assignee
Comment 16•5 years ago
Summary:
- In comment 3 I spotted database issues by using New Relic
- In bug 1562882 we tried upgrading to M5 instances
- In bug 1567257 we doubled the storage (which doubles the IOPS cap) of prod and prod-ro
  - We did this instead of switching to Provisioned IOPS storage (cheaper)
The smaller outages we saw yesterday had to do with the auto-scaling setup needing some tweaking.
I believe this is fixed.
Assignee
Comment 17•5 years ago
I will leave it open for a month and verify that we're truly done.
Assignee
Comment 18•5 years ago
This seems done as far as I can tell.