Closed Bug 1553199 Opened 6 years ago Closed 5 years ago

[Meta] - High database spikes affect UI response time

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: camd, Assigned: armenzg)

References

Details

Attachments

(1 file)

We have seen this a few times. The UI becomes unresponsive and at that time, there are several large logs being parsed in a row. This seems to slow down the database and then the UI takes a hit while trying to retrieve pushes/jobs.

I'm not sure what the solution is, but there must be a way to insulate the front-end from the log parsing.

This happens just occasionally, and it usually passes by within about 10 minutes. So it's hard to call this critical. But it's certainly annoying and frustrating for the sheriffs, especially.

Here are a few logs that got caught in this. We will keep the links to them for a while, so we should be able to track the jobs/pushes down from the URLs:

See Also: → 1455721

I saw a New Relic responsiveness email and looked into the last 6 hours with New Relic [1].
I zoomed in on the area with the spike.
I selected 'log-parser' under 'Transactions'.
I selected some points around the peak.
You can see that the MySQL performance_datum insert took over 80% of the processing time for a log-parser event, with 244 inserts costing 142 seconds [2].
You can see further details under "Database queries", and you can drill down even further via the icon to the left [3][4], which shows which piece of code executes the insert statements (assuming I understand what New Relic is trying to do).
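
If most of those 142 seconds are per-row insert overhead, batching the inserts could help. As a minimal sketch (not the actual Treeherder log-parser code): it assumes a Django model PerformanceDatum backing the performance_datum table from [3], and the import path and helper name are made up for illustration.

    # Illustrative sketch: batch performance rows into a few multi-row INSERTs
    # instead of 244 single-row INSERTs.
    from treeherder.perf.models import PerformanceDatum  # assumed import path

    def store_performance_rows(rows):
        """rows: iterable of dicts holding the column values from the INSERT in [3]."""
        objs = [
            PerformanceDatum(
                repository_id=row["repository_id"],
                signature_id=row["signature_id"],
                value=row["value"],
                push_timestamp=row["push_timestamp"],
                job_id=row["job_id"],
                push_id=row["push_id"],
            )
            for row in rows
        ]
        # One multi-row INSERT per 100 objects rather than one INSERT per row.
        PerformanceDatum.objects.bulk_create(objs, batch_size=100)

Whether bulk_create fits the existing ingestion logic (signals, duplicate handling, etc.) would need checking.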

After some investigation across various trees, loading 50+ pushes and filtering with 'talos', I found this push:
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&searchStr=talos&revision=cc08a11db7d94d70e81e59afd4d2cfd4b85aa153

The command used was |mach try --preset perf --rebuild 10|, which caused many more jobs than intended. The developer intended to obtain "Tp and raptor, but probably across platforms".

I have various thoughts crossing my mind as to what could be tried:

  • Verify whether the slowdowns we've seen in the past follow a pattern like this
  • Reduce the number of log parsing workers (currently 6) and/or increase the power of the dynos
    • This is to reduce the number of workers that might be trying to write to the DB at the same time
  • Reevaluate the implications of using --concurrency 7 [5]
  • Use structured logs to make log parsing less intensive on Treeherder
  • Reevaluate the approach on how we ingest performance data
    • For example, the Talos tasks generate an artifact with the performance data instead of us parsing one line out of a full log; such an artifact would be ingested via a different Treeherder task
    • This would decouple the perf data ingestion from log parsing and make it easier to work on and test
  • Add more New Relic instrumentation
    • Some of the deepest drilldowns showed tracing messages saying that more in-code instrumentation is needed to see further details (see the sketch below)
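
As a minimal sketch of that last bullet: the New Relic Python agent can wrap a specific function so it shows up as its own segment in transaction traces, and custom metrics can be charted on a dashboard. The function and metric names below are made up for illustration and are not actual Treeherder code.

    # Illustrative only: surface a specific step in New Relic traces and record
    # a custom metric that can be correlated with the spikes on the dashboards.
    import newrelic.agent

    @newrelic.agent.function_trace()  # appears as its own segment in the trace
    def ingest_performance_rows(rows):
        # The real insert logic would go here; the metric records how many rows
        # this invocation handled.
        newrelic.agent.record_custom_metric("Custom/LogParser/perf_rows", len(rows))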

This bug could be just a curiosity if it had no impact on users; however, I believe it does if we look at "/treeherder.webapp.api.jobs:JobsViewSet.list" [6]. There's a huge impact on the "MySQL job select" operations. This means that for about 15 minutes there are all sorts of slowdowns, since we can't get the data to users in a reasonable time.

I just started looking at NewRelic today so I hope I did not just misunderstand all of this and wasted your time!

[1] https://cl.ly/38b18d90c0ca
[2] https://cl.ly/d70e90bd044b
[3] INSERT INTO performance_datum (repository_id, signature_id, value, push_timestamp, job_id, push_id, ds_job_id, result_set_id) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
[4] https://cl.ly/95f92c24f6a8
[5] https://github.com/mozilla/treeherder/blob/master/Procfile#L46
[6] https://cl.ly/bd9179b482f8

Assignee: nobody → armenzg
Status: NEW → ASSIGNED
Summary: Log parsing multiple large logs in a row causes UI slowdown → Log parsing of logs that require MySQL inserts slows down the UI
See Also: → 1560657

Copying from bug 1560657:
https://bugzilla.mozilla.org/show_bug.cgi?id=1560657#c0

There were a couple of Heroku 95th percentile emails. This bug is only analyzing the latest one.

This spike does not seem to be related to a high number of talos jobs completing:
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&fromchange=e053c95bb1b4a99407040a33be906b9465c5fb2b&searchStr=talos
https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&searchStr=talos&fromchange=99f1b315fa89db68ef3e7f8875bbd0a4b818423d

You can see the spike in MySQL job selects:
https://cl.ly/3ace8cb7afbb

You can see the JobsViewSet spiking:
https://cl.ly/65e6ba9e749a

Now, looking closely at the log parser:
https://cl.ly/a4918e40c9e2
You will see that we first start having problems with MySQL line failures, and then it starts affecting other insert operations. The next screenshot shows only the MySQL-related operations, so you will notice it more clearly:
https://cl.ly/ded35e84aa41

I believe my proposal to use the replica for read operations is gaining traction.
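
For context, the read-replica idea can be wired into Django with a database router. This is only a sketch of the idea; the 'read_replica' alias and the module path are assumptions, not what Treeherder actually configures.

    # Illustrative sketch: route reads to the replica, writes to the primary.
    class ReadReplicaRouter:
        def db_for_read(self, model, **hints):
            return "read_replica"  # assumed alias in settings.DATABASES

        def db_for_write(self, model, **hints):
            return "default"

        def allow_relation(self, obj1, obj2, **hints):
            # Both aliases point at the same data, so relations are fine.
            return True

    # settings.py (illustrative module path):
    # DATABASE_ROUTERS = ["treeherder.config.db_routers.ReadReplicaRouter"]

Replication lag would have to be acceptable for the endpoints that read from the replica.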

Summary: Log parsing of logs that require MySQL inserts slows down the UI → High database spikes affect UI response time

For those with New Relic access I created a dashboard to track the performance issues:
https://rpm.newrelic.com/accounts/677903/custom_dashboards/pages/11944

I'm considering now the following options:

  • Upgrade the DB instance from db.m4.2xlarge to something larger
    • This will hopefully get us out of the hole for a bit
    • I will need to measure the stage app before upgrading to a larger instance
  • Test out the R/O idea on stage
    • Get sheriffs to test that out and see if they can use it
  • Investigate having fewer API calls from the UI
    • Or even time out sooner
  • Evaluate our caching strategies for fetching jobs (see the sketch below)
    • I'm sure we hit the DB more than we ought to
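
For the caching bullet, this is roughly the shape I have in mind. It is only a sketch: the cache key, timeout, and placeholder query are made up for illustration and are not Treeherder's actual code.

    # Illustrative sketch: serve repeated requests for the same push from the
    # Django cache for a short window instead of hitting MySQL on every UI poll.
    from django.core.cache import cache

    def fetch_jobs_for_push(push_id):
        # Placeholder for the real, expensive jobs query.
        return []

    def cached_jobs_for_push(push_id):
        key = f"push-jobs:{push_id}"
        return cache.get_or_set(key, lambda: fetch_jobs_for_push(push_id), timeout=60)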

:camd, :sclements: If you could look into the last two items that would be ideal.

Copying comment from a sheriff:
https://bugzilla.mozilla.org/show_bug.cgi?id=1561128#c0

https://treeherder.mozilla.org was down for 2-3 minutes, and after that it was really slow for ~15 minutes: loading a failure went from 3 seconds to 15.
Load time in Firefox: https://irccloud.mozilla.com/file/5uVSoYj5/image.png

Attached file mysql-slowquery.log.16

This is a slow MySQL query log from around yesterday's events, in case any of you can make sense of it.

Attachment #9073884 - Attachment mime type: application/octet-stream → text/plain

I'm looking at the hardware specifications in here:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html#Concepts.DBInstanceClass.Summary

I'm going to upgrade treeherder-stage to a better instance to see how we fare.

Currently, only production uses db.m4.2xlarge while the other DB instances are db.m4.xlarge. Hourly price when moving from m4 to the equivalent m5 instance:
db.m4.xlarge: $0.35 → $0.342 (-2.3%)
db.m4.2xlarge: $0.70 → $0.684 (-2.3%)

Instance class    vCPU  ECU   Memory (GiB)  Max. bandwidth (Mbps)  Network performance
db.m4.xlarge      4     13    16            750                    High
db.m4.2xlarge     8     25.5  32            1,000                  High
db.m5.xlarge      4     15    16            3,500                  Up to 10 Gigabit
db.m5.2xlarge     8     31    32            3,500                  Up to 10 Gigabit

Upgrading production from db.m4.2xlarge to db.m5.2xlarge would give us:

  • 22% higher processing power (25.5 → 31 ECU)
  • 3.5x the max bandwidth (1,000 → 3,500 Mbps)
  • an actual cap of 10 Gigabit network performance
    • TODO: Evaluate the implication of this; would we need more?

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html:

  • db.m5 – Latest-generation general-purpose instance classes that provide a balance of compute, memory, and network resources, and are a good choice for many applications. The db.m5 instance classes provide more computing capacity than the previous db.m4 instance classes.
  • db.m4 – Current-generation general-purpose instance classes that provide more computing capacity than the previous db.m3 instance classes.

We might also consider changing the storage type from 'General Purpose SSD Storage' to 'Provisioned IOPS Storage'
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html#USER_PIOPS

Depends on: 1561415

I've tried changing treeherder-stage's instance, but I need to wait for IT to do it for me.
I've also made the Heroku workers match production. I want a setup closer to production's data pipeline.

I've also added another performance dashboard, this one for treeherder-stage, to better evaluate the impact of changing the instance type.

This is a lot of information - thanks for looking into it.

Were you able to get any improvements to the log parser by changing the query?

Regarding your point above re "Investigate having fewer API calls from the UI": I've removed as many unnecessary API calls as possible during the Perfherder conversion, but there are queries (across all of Treeherder) that need to be cleaned up and optimized, as outlined in bug 1407377. Some of these might only need minor changes, like adding select_related calls. I think we should focus on the low-hanging fruit and see how much that helps before making bigger changes.
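
As a sketch of the select_related kind of change (the import path and field names are assumptions about the Treeherder models, not verified):

    # Illustrative sketch: without select_related, iterating the jobs issues one
    # extra query per row to fetch each related push/job type; with it, Django
    # fetches everything in a single JOINed query.
    from treeherder.model.models import Job  # assumed import path

    def jobs_with_related(push_id):
        return Job.objects.filter(push_id=push_id).select_related("push", "job_type")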

I'm more interested in reducing the calls from the main landing page (e.g. autoland or try).

I agree with delaying certain changes until we see whether the other ones are effective.

Depends on: 1562882

See bug 1564566 for details on options to read less.

Depends on: 1564584
Severity: normal → major
Component: Treeherder: Log Parsing & Classification → Treeherder: Infrastructure
Depends on: 1564860
Summary: High database spikes affect UI response time → meta - High database spikes affect UI response time
Depends on: 1562017
Depends on: 1565041
Depends on: 1565042
Depends on: 1566204
Priority: -- → P1
Depends on: 1567257

In the last few hours we have doubled the IOPS cap for the treeherder-prod database:
https://bugzilla.mozilla.org/show_bug.cgi?id=1567257
Instead of 3,000 IOPS we have 6,000 IOPS and I see close to 3,800 being used at this moment.
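
(For reference on why doubling the storage doubles the IOPS cap: General Purpose SSD (gp2) volumes get a baseline of 3 IOPS per GiB, so roughly 1,000 GiB gives 3,000 IOPS and 2,000 GiB gives 6,000 IOPS. The exact volume sizes here are inferred from those IOPS numbers rather than taken from the RDS console.)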

This change plus the auto-scaling mechanism for web nodes might be all we need for now. Let's see.

Please ping me if you notice any UI slowdowns so I can look at NewRelic.

Thanks to @dividehex for all his support with DB changes.

Summary:

  • In comment 3 I spotted database issues by using NewRelic
  • In bug 1562882 we tried upgrading to M5 instances
  • In bug 1567257 we doubled the storage (which doubles the IOPS cap) of prod and prod-ro
    • Rather than switching to Provisioned IOPS storage (doubling the gp2 storage was the cheaper option)

The smaller outages we saw yesterday had to do with the auto-scaling setup needing some tweaking.

I believe this is fixed.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Summary: meta - High database spikes affect UI response time → [Meta] - High database spikes affect UI response time

I will leave it open for a month and verify that we're truly done.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1570068

This seems done as far as I can tell.

Status: REOPENED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
No longer depends on: 1562017
No longer depends on: 1570068