[Meta] - High database spikes affect UI response time
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P1)
Tracking
(Not tracked)
People
(Reporter: camd, Assigned: armenzg)
References
Details
Attachments
(1 file)
270.24 KB, text/plain
We have seen this a few times: the UI becomes unresponsive, and at those times several large logs are being parsed in a row. This seems to slow down the database, and then the UI takes a hit while trying to retrieve pushes/jobs.
I'm not sure what the solution is, but there must be a way to insulate the front-end from the log parsing.
Reporter
Comment 1•6 years ago
This happens only occasionally, and it usually passes within about 10 minutes, so it's hard to call it critical. But it's certainly annoying and frustrating, especially for the sheriffs.
Here are a few logs that got caught in this. We will keep the links to them for a while, so we should be able to track the jobs/pushes down from the URLs:
Comment 2•6 years ago
https://taskcluster-artifacts.net/IruPgTdnRmmqiqapqAvqww/0/public/logs/live_backing.log is an autoland push and a backout, so nothing crazy from Try: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=acd39065fd9fa5f7fe7a4530d5a991f64dcbc9a0&selectedJob=247557345
The other one is from inbound: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=4d7ef530fffba20a2b352cff9fa23438128105a7&selectedJob=247561821
Assignee
Comment 3•5 years ago
I saw a New Relic responsiveness email and looked into the last 6 hours with New Relic [1].
I zoomed in on the area with the spike.
I selected 'log-parser' under 'Transactions'.
I selected some points around the peak.
You can see that the MySQL performance_datum insert took over 80% of the processing time for a log-parser event, with 244 inserts costing 142 seconds [2].
You can see further details under "Database queries", and you can drill down even further if you see an icon to the left [3][4], which shows which piece of code executes the insert statements (assuming I understand what New Relic is trying to do).
After some investigation across various trees, loading 50+ pushes and filtering with 'talos', I found this push:
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&searchStr=talos&revision=cc08a11db7d94d70e81e59afd4d2cfd4b85aa153
The command used was |mach try --preset perf --rebuild 10|. That caused many more jobs than intended; the developer intended to obtain "Tp and raptor, but probably across platforms".
I have various thoughts crossing my mind as to what could be tried:
- Verify whether the slowdowns we've seen in the past follow a pattern like this
- Reduce the number of log parsing workers (currently 6) and/or increase the power of the dynos
  - This is to reduce the number of workers that might be trying to write to the DB at the same time
- Reevaluate the implications of using --concurrency 7 [5]
- Use structured logs to make log parsing less intensive on Treeherder
- Reevaluate the approach to how we ingest performance data
  - For example, the Talos tasks generate an artifact with the performance data instead of Treeherder parsing one line out of a full log; such an artifact is then ingested via a different Treeherder task
  - This would decouple the perf data ingestion from log parsing and make it easier to work on and test
- Add more New Relic instrumentation (a sketch follows this list)
  - I saw some tracing messages on some of the deepest drilldowns saying that I need more in-code logic to see further details
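As a sketch of that last item: the Python New Relic agent supports custom function traces and custom metrics. Something along these lines would give the drilldowns more in-code detail; the function and module names here are made up for illustration, not the actual Treeherder code paths.

import newrelic.agent


# Hypothetical log-parser helper; the decorator makes it show up as its own
# segment in the transaction trace instead of being folded into a generic
# database node.
@newrelic.agent.function_trace(name='store_performance_artifact', group='LogParser')
def store_performance_artifact(job, perf_rows):
    # ... existing performance ingestion logic would go here ...
    # A custom metric makes it easy to graph how many rows each parse inserts.
    newrelic.agent.record_custom_metric(
        'Custom/LogParser/performance_datum_rows', len(perf_rows))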
This bug could be just a curiosity if it did not have any impact on the users; however, I believe it does if we look at "treeherder.webapp.api.jobs:JobsViewSet.list" [6]. There's a huge impact on the "MySQL job select" operations. This means that for about 15 minutes there would be all sorts of slowdowns, since we can't get the data to the users in a reasonable manner.
I just started looking at New Relic today, so I hope I did not misunderstand all of this and waste your time!
[1] https://cl.ly/38b18d90c0ca
[2] https://cl.ly/d70e90bd044b
[3] INSERT INTO performance_datum
(repository_id
, signature_id
, value
, push_timestamp
, job_id
, push_id
, ds_job_id
, result_set_id
) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
[4] https://cl.ly/95f92c24f6a8
[5] https://github.com/mozilla/treeherder/blob/master/Procfile#L46
[6] https://cl.ly/bd9179b482f8
Assignee
Comment 4•5 years ago
camd recommended looking at https://docs.djangoproject.com/en/2.2/ref/models/querysets/#bulk-create
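A rough sketch of what that could look like, assuming the performance rows currently go through one INSERT per datum (as the 244 inserts in comment 3 suggest). The model and field names below are taken from the columns shown in [3] and may not match the actual Treeherder code exactly.

from treeherder.perf.models import PerformanceDatum  # assumed import path


def store_performance_data(repository, job, push, perf_series):
    # Build all rows in memory, then write them with a few batched INSERTs
    # instead of one round trip per datum.
    rows = [
        PerformanceDatum(
            repository=repository,
            signature=signature,
            value=value,
            push_timestamp=push.time,
            job=job,
            push=push,
        )
        for signature, value in perf_series
    ]
    # batch_size would need tuning against the production MySQL instance. Note
    # that bulk_create() skips save() and signals, so any duplicate handling
    # that relies on them would have to move elsewhere.
    PerformanceDatum.objects.bulk_create(rows, batch_size=100)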
Assignee
Comment 7•5 years ago
Copying from bug 1560657:
https://bugzilla.mozilla.org/show_bug.cgi?id=1560657#c0
There were a couple of Heroku 95th percentile emails. This bug is only analyzing the latest one.
This spike does not seem to be related to a high number of talos jobs completing:
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&fromchange=e053c95bb1b4a99407040a33be906b9465c5fb2b&searchStr=talos
https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&searchStr=talos&fromchange=99f1b315fa89db68ef3e7f8875bbd0a4b818423d
You can see the spike on MySQL job selects:
https://cl.ly/3ace8cb7afbb
You can see the JobsViewSet spiking:
https://cl.ly/65e6ba9e749a
Now, when I look closely at the log parser:
https://cl.ly/a4918e40c9e2
You will see that we first start having problems with MySQL line failures, and then it starts affecting other insert operations. The next screenshot shows only the MySQL-related operations, so you will notice it more clearly:
https://cl.ly/ded35e84aa41
I believe my proposal to use the replica for read operations is gaining weight.
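For reference, a minimal sketch of that replica idea using a Django database router, assuming a second 'read_only' alias in settings.DATABASES pointing at the RDS replica; the alias and module path are illustrative, not Treeherder's actual configuration.

class ReadReplicaRouter:
    """Route reads to the replica; keep writes (ingestion, log parsing) on the primary."""

    def db_for_read(self, model, **hints):
        return 'read_only'

    def db_for_write(self, model, **hints):
        return 'default'

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases point at the same schema.
        return True

    def allow_migrate(self, db, app_label, **hints):
        # Only run migrations against the primary.
        return db == 'default'

# settings.py (hypothetical module path):
# DATABASE_ROUTERS = ['treeherder.config.db_routers.ReadReplicaRouter']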
Assignee
Comment 8•5 years ago
For those with New Relic access I created a dashboard to track the performance issues:
https://rpm.newrelic.com/accounts/677903/custom_dashboards/pages/11944
I'm now considering the following options:
- Upgrade the DB instance from db.m4.2xlarge to something larger
  - This will hopefully get us out of the hole for a bit
  - I will need to measure the stage app before upgrading to a larger instance
- Test out the R/O (read-only replica) idea on stage
  - Get sheriffs to test that out and see if they can use it
- Investigate making fewer API calls from the UI
  - Or even timing out sooner
- Evaluate our caching strategies for fetching jobs (a rough sketch follows this list)
  - I'm sure we hit the DB more than we ought to
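As a rough sketch of the caching item above, Django's low-level cache API could front the per-push job queries. The function, import path, and key layout here are illustrative rather than the actual JobsViewSet code.

from django.core.cache import cache
from treeherder.model.models import Job  # assumed import path


def get_jobs_for_push(repo_name, push_id):
    cache_key = f'jobs:{repo_name}:{push_id}'
    jobs = cache.get(cache_key)
    if jobs is None:
        # One DB hit per push per TTL window instead of one per poll.
        jobs = list(Job.objects.filter(push_id=push_id).values())
        # Even a short TTL cuts repeated identical queries while sheriffs
        # poll; completed pushes rarely change.
        cache.set(cache_key, jobs, timeout=60)
    return jobs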
:camd, :sclements: If you could look into the last two items that would be ideal.
Copying comment from a sheriff:
https://bugzilla.mozilla.org/show_bug.cgi?id=1561128#c0
https://treeherder.mozilla.org was down for 2~3 minutes and after that was really slow for ~15 mins, from 3 secs to 15 to load a failure.
Load time in Firefox: https://irccloud.mozilla.com/file/5uVSoYj5/image.png
Assignee
Comment 9•5 years ago
This is a slow MySQL query from around the events of yesterday, in case any of you can make sense of it.
Assignee
Comment 10•5 years ago
I'm looking at the hardware specifications in here:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html#Concepts.DBInstanceClass.Summary
I'm going to upgrade treeherder-stage to a better instance to see how we fare.
Currently, only production uses db.m4.2xlarge, while the other DB instances are db.m4.xlarge. Hourly pricing:
db.m4.xlarge: $0.35 --> $0.342 (-2.3%)
db.m4.2xlarge: $0.70 --> $0.684 (-2.3%)
Instance Class | vCPU | ECU | Memory (GiB) | Max. Bandwidth (Mbps) | Network Performance |
---|---|---|---|---|---|
db.m4.xlarge | 4 | 13 | 16 | 750 | High |
db.m4.2xlarge | 8 | 25.5 | 32 | 1,000 | High |
db.m5.xlarge | 4 | 15 | 16 | 3,500 | Up to 10 Gigabit |
db.m5.2xlarge | 8 | 31 | 32 | 3,500 | Up to 10 Gigabit |
Upgrading from db.m4.2xlarge (production) to db.m5.2xlarge would give us:
- 22% higher processing power (ECU units)
- 3.5x the max bandwidth (1,000 --> 3,500 Mbps)
- an actual cap of 10 Gigabit network performance
- TODO: Evaluate the implications of this; would we need more?
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html:
- db.m5 – Latest-generation general-purpose instance classes that provide a balance of compute, memory, and network resources, and are a good choice for many applications. The db.m5 instance classes provide more computing capacity than the previous db.m4 instance classes.
- db.m4 – Current-generation general-purpose instance classes that provide more computing capacity than the previous db.m3 instance classes.
We might also consider changing the storage type from 'General Purpose SSD Storage' to 'Provisioned IOPS Storage'
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html#USER_PIOPS
Assignee
Comment 11•5 years ago
I've tried changing treeherder-stage's instance but I need to wait for IT to do it for me.
I've also made the Heroku workers the same as production's; I want a setup closer to the production data pipeline.
I've also added another performance dashboard, but for treeherder-stage, to better evaluate the impact of changing the instance type.
Comment 12•5 years ago
This is a lot of information - thanks for looking into it.
Were you able to get any improvements to the log parser by changing the query?
Regarding your point above re "Investigate making fewer API calls from the UI" - I've removed as many unnecessary API calls as possible during the Perfherder conversion, but there are queries (for all of TH) that need to be cleaned up and optimized, as outlined in bug 1407377. Some of these might only need minor changes, like adding 'select_related's. I think we should focus on the low-hanging fruit and see how much that helps before making bigger changes.
Assignee
Comment 13•5 years ago
I'm more interested in reducing the calls from the main landing page (e.g. autoland or try).
I agree with delaying certain changes until we see whether the others are effective.
Assignee
Comment 14•5 years ago
See bug 1564566 for details on options to read less.
Assignee
Comment 15•5 years ago
In the last few hours we have doubled the IOPS cap for the treeherder-prod database:
https://bugzilla.mozilla.org/show_bug.cgi?id=1567257
Instead of 3,000 IOPS we have 6,000 IOPS and I see close to 3,800 being used at this moment.
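For anyone who wants to double-check the headroom without New Relic, the standard AWS/RDS CloudWatch metrics can be queried with boto3; the region and instance identifier below are placeholders, not necessarily the real ones.

from datetime import datetime, timedelta
import boto3


def recent_rds_iops(instance_id, metric='WriteIOPS', minutes=60):
    # 'ReadIOPS' and 'WriteIOPS' are the standard per-instance RDS metrics.
    cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')  # placeholder region
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/RDS',
        MetricName=metric,
        Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': instance_id}],
        StartTime=datetime.utcnow() - timedelta(minutes=minutes),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Average', 'Maximum'],
    )
    return sorted(response['Datapoints'], key=lambda d: d['Timestamp'])

# e.g. recent_rds_iops('treeherder-prod')  # placeholder instance identifier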
This change plus the auto-scaling mechanism for web nodes might be all we need for now. Let's see.
Please ping me if you notice any UI slowdowns so I can look at NewRelic.
Thanks to @dividehex for all his support with DB changes.
Assignee
Comment 16•5 years ago
Summary:
- In comment 3 I spotted database issues by using New Relic
- In bug 1562882 we tried upgrading to M5 instances
- In bug 1567257 we doubled the storage (which doubles the IOPS cap) of prod and prod-ro
  - We did this instead of switching to Provisioned IOPS storage (cheaper)
The smaller outages we saw yesterday had to do with the auto-scaling setup needing some tweaking.
I believe this is fixed.
Assignee
Comment 17•5 years ago
I will leave it open for a month and verify that we're truly done.
Assignee
Comment 18•5 years ago
This seems done as far as I can tell.