Closed Bug 1597136 Opened 5 years ago Closed 5 years ago

[meta] Database slow downs causing Tree closures

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: armenzg)

References

Details

Attachments

(24 files)


As discussed earlier, there are slowdowns in interactions with Treeherder, e.g. failure lines loading slowly.

New Relic shows spikes in MySQL response time for non-web transactions, which started recently:
https://rpm.newrelic.com/accounts/677903/applications/14179757

The big, longer spikes start at 5am UTC each day (first one on Nov 15th UTC).

Another spike can be found between 1pm and 2:30pm UTC each day.

Deploy before this started: https://github.com/mozilla/treeherder/compare/c725c8a41cf19ab5a8217ac9a10e3da1dd2dd54f...9289f542e8729e5574ef4a566ddc007a4c943085 - related to bug 1571369? These new tables are not shown on the read-only replica even after refreshing the tables list.

The 1pm to 2:30pm UTC spike happens when cycle_data runs; its performance has deteriorated since Friday (and it only runs at that time).

The amount of data added (e.g. into the job_detail table) is unchanged compared to the same days in the previous week. Can a one-off task be run to check whether performance is lower than normal at any given time of the day, with the cron jobs only making it noticeable?
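
A one-off check along these lines could answer that. The sketch below is hypothetical and standalone (host, credentials, and the probe query are placeholders, not Treeherder configuration); it times a representative query every few minutes so results can be compared across the day:

# Hedged sketch: time a representative query at a fixed interval to see
# whether latency is degraded all day or only around the cron windows.
# Host, credentials and the probe query are placeholders.
import time

import MySQLdb

PROBE_SQL = "SELECT COUNT(*) FROM job WHERE submit_time > NOW() - INTERVAL 1 HOUR"

def probe_once(conn):
    start = time.monotonic()
    cur = conn.cursor()
    cur.execute(PROBE_SQL)
    cur.fetchall()
    cur.close()
    return time.monotonic() - start

def main():
    conn = MySQLdb.connect(host="replica.example.internal", user="th_readonly",
                           passwd="********", db="treeherder")
    while True:
        stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        print(f"{stamp} probe took {probe_once(conn):.3f}s")
        time.sleep(300)  # one sample every 5 minutes

if __name__ == "__main__":
    main()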

See Also: → 1596671
Summary: repeated slowdowns fetching data due spikes in mysql response time for non-web transactions → Database slow downs causing Tree closures
Assignee: nobody → armenzg
Severity: normal → critical
Priority: -- → P1

Queue depth, read/write throughput, and read IOPS have all increased since Thursday, Nov 14, around 0900 UTC. It looks like network traffic out has also increased a fair bit (~15 MB/s to ~25 MB/s).

Armen says that we've recently added a bunch of mobile stuff that may be the cause; he's tracking that down. In the meantime, he's tweaked things on the Heroku side to get the trees open.

The trees were re-opened a while ago. The backlog of tasks got cleared out by increasing the number of "store pulse data" workers, and I reduced the "log parser" workers from 5 to 4.

The screenshot shows the differences between the stage and production databases.
It is mainly for curiosity's sake, since the two instances differ in the following ways:

  • The production DB has 2 TB of storage to allow for 6,000 IOPS instead of 3,000 IOPS
  • The number of pulse data workers on stage was 7 rather than 10, which means less writing to the DB
  • The EC2 instance for stage is xlarge instead of 2xlarge
  • Production has user traffic, which causes more read IOPS
  • Production has auto-scaling of backend nodes instead of a single web node, further increasing read IOPS

Because of deployments we're considering this code range:
https://github.com/mozilla/treeherder/compare/c725c8a41cf19ab5a8217ac9a10e3da1dd2dd54f...9289f542e8729e5574ef4a566ddc007a4c943085

However, I don't understand how it could be code related. We should have seen something similar on stage.

We have ~4x more write throughput for production.

Code changes being the root cause seems less plausible.

Code deployment (Thursday 14:43 UTC) vs. increase in network throughput (Thursday ~4:00 UTC).

Depends on: 1597396

There are a few things we're going to try for this.
Sarah will be looking at bug 1407377 to see what we can gain.
Jake will be enabling Performance Insights for stage and production.
Let's turn recommendations from New Relic into bugs.

I also have an idea to prevent tree closures while outages occur: I'm going to create a treeherder-read-only Heroku app that uses the replica DB. That way sheriffs will have an alternative app while things are under water. Even if there are processing delays, the read-only app will not be unresponsive. The app will only have the backend/UI and will ignore the ingestion pipeline.
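
A minimal sketch of how that read-only app's settings might differ, assuming the replica's connection string is exposed through a hypothetical REPLICA_DATABASE_URL environment variable and that dj-database-url is available (both are assumptions; this is not Treeherder's actual settings code):

# Hedged sketch: point Django's default database at the replica and block
# writes at the router level. REPLICA_DATABASE_URL and the module path in
# DATABASE_ROUTERS are assumptions for this sketch.
import os

import dj_database_url

DATABASES = {
    "default": dj_database_url.parse(os.environ["REPLICA_DATABASE_URL"]),
}

class ReadOnlyRouter:
    def db_for_read(self, model, **hints):
        return "default"

    def db_for_write(self, model, **hints):
        # Surface an explicit error instead of attempting writes against the replica.
        raise RuntimeError("this deployment is read-only")

    def allow_migrate(self, db, app_label, **hints):
        return False

DATABASE_ROUTERS = ["readonly_settings.ReadOnlyRouter"]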

Depends on: 1597368

(In reply to Armen [:armenzg] from comment #6)

I also have an idea to prevent tree closures while outages occur: I'm going to create a treeherder-read-only Heroku app that uses the replica DB. [...]

We'll need to be sure they're aware they won't be able to classify failures while using that.

Trees got closed for this at 07:15 UTC. At the moment the symptoms differ from before: requests to e.g. https://treeherder.mozilla.org/api/project/try/push/?full=true&count=100&fromchange=fb70a214d19366120fe8787c4e804a177be06d81 fail quickly with an internal server error instead of just responding slowly. Heroku also seems to be crashing (I got the Heroku error page for Treeherder once; sometimes Treeherder reloads itself).

I have not seen the DB reject all connections before.

I'm also seeing errors like these:

2019-11-19T11:37:51.273586Z 1168705 [Note] Aborted connection 1168705 to db: 'treeherder' user: 'th_admin' host: 'ec2-18-212-31-149.compute-1.amazonaws.com' (Got an error reading communication packets)
2019-11-19T11:38:00.997609Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 11294ms. The settings might not be optimal. (flushed=1170 and evicted=6280, during the time.)
Attached image [screenshot] pegged CPU

How should I interpret the CPU running at 100% for an hour while there were no read/write operations?

Attachment #9109938 - Flags: feedback?(mtrinkala)
Attachment #9109938 - Flags: feedback?(klibby)
Attachment #9109938 - Flags: feedback?(klahnakoski)

We caught up on ingesting tasks in the last hour. Soon we should also be caught up on parsing logs.

Taking a step back, there are mainly two things happening:

  • Something is different on the production app or the production DB that is causing much higher write throughput (see screenshot).
  • Because of this, the DB is very fragile and the clean-up tasks can tip it over.

I'm open to suggestions as to what to try.
I contacted Amazon, but they said they would call me back.
Meanwhile, I'm trying to reach folks within Mozilla for advice as they wake up.

Depends on: 1597670

Armen, I've re-analyzed the suspect patch.

There's this new piece of code from treeherder/perf/models.py which increases DB writes:

def update_status(self, using=None):
    autodetermined_status = self.autodetermine_status()
    self.save(using=using)

Previously, the self.save(using=using) call was conditional. Now it isn't.
Still, I skimmed the app code looking for potential abuses of this function (for loops, Python tasks that are called frequently, etc.) but couldn't find anything to confirm this is indeed the culprit.
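
For illustration only, restoring a condition around the write might look roughly like this (a hedged sketch, not a proposed patch; the status field name is an assumption):

# Hedged sketch: only hit the DB when the autodetermined status differs
# from the stored one. Field/method names mirror the snippet above; the
# rest is assumed.
def update_status(self, using=None):
    autodetermined_status = self.autodetermine_status()
    if autodetermined_status != self.status:
        self.status = autodetermined_status
        self.save(using=using)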

(In reply to Ionuț Goldan [:igoldan], Performance Sheriff from comment #15)

Armen, I've re-analyzed the suspect patch. [...]

Thanks for checking. If this were a code-related issue, we would also see something similar happening on stage.

Here is a close-up of the activity during the 100% CPU problem. Notice the problems start soon after 9am GMT, which is when the perf cycle_data task is scheduled.

(In reply to Armen [:armenzg] from comment #16)

Thanks for checking. If this were a code-related issue, we would also see something similar happening on stage.

How does your stage database compare in size to your production database? Also, this production database is a 2xlarge of what instance class? R3? Also helpful would be the output of "SHOW TABLE STATUS;", the RDS parameter group, and the table schema.

Depends on: 1597699

(In reply to Bob Micheletto [:bobm] from comment #19)

(In reply to Armen [:armenzg] from comment #16)

Thanks for checking. If this were a code-related issue, we would also see something similar happening on stage.

How does your stage database compare in size to your production database?

Name        Allocated  Free space
production  2,000 GB   1,300 GB
stage       1,100 GB   200 GB

Also, this production database is a 2xlarge of what instance class?

Stage is db.m5.xlarge vs. production's db.m5.2xlarge.

Also helpful would be the output of "SHOW TABLE STATUS;", the RDS parameter group, and the table schema.

I will attach this information.
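
For the "SHOW TABLE STATUS;" part, a quick capture script could look like this (a hedged sketch using the mysqlclient driver; host and credentials are placeholders):

# Hedged sketch: dump SHOW TABLE STATUS for the treeherder schema as CSV so
# it can be attached to the bug. Connection details are placeholders.
import csv
import sys

import MySQLdb

conn = MySQLdb.connect(host="prod-db.example.internal", user="th_readonly",
                       passwd="********", db="treeherder")
cur = conn.cursor()
cur.execute("SHOW TABLE STATUS")
writer = csv.writer(sys.stdout)
writer.writerow([col[0] for col in cur.description])  # Name, Engine, Rows, ...
writer.writerows(cur.fetchall())
cur.close()
conn.close()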

Attached file mysql-slowquery(1).log
Comment on attachment 9109938 [details]
[screenshot] pegged CPU

Having bobm look at this since he has a lot of MySQL performance experience with Sync.
Attachment #9109938 - Flags: feedback?(mtrinkala) → feedback+
Depends on: 1597740

One of the problems is an overall increase in load on the TH database. I find it odd that the performance_datum requests are demanding more resources than the job selection. Maybe there is a new performance dashboard out there?

(In reply to Kyle Lahnakoski [:ekyle] from comment #24)

[...] I find it odd that the performance_datum requests are demanding more resources than the job selection. Maybe there is a new performance dashboard out there?

Did you notice any increase in usage of the https://health.graphics/ web app? Could this be a reason for the TH database load increase?

Flags: needinfo?(klahnakoski)

:igoldan, two weeks ago I got a query about using the health.graphics code, so maybe someone is using the code without the dashboard. I am looking into it. We probably have API endpoint hit counts on a dashboard somewhere to determine whether this load is third party or something internal.
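
Lacking such a dashboard, a rough answer could come from counting requests per endpoint and client straight from the router logs. The sketch below assumes Heroku-router-style lines containing path="..." and fwd="..." fields; that format is an assumption, not a description of the actual logging setup:

# Hedged sketch: tally API requests by path prefix and forwarded client IP
# from a saved log file, to see whether the performance_datum load is
# external or internal. The log format (key="value" pairs) is assumed.
import re
import sys
from collections import Counter

LINE_RE = re.compile(r'path="(?P<path>[^"?]+)[^"]*".*fwd="(?P<fwd>[^"]+)"')

by_prefix = Counter()
by_client = Counter()
with open(sys.argv[1]) as fh:
    for line in fh:
        m = LINE_RE.search(line)
        if not m:
            continue
        prefix = "/".join(m.group("path").split("/")[:4])  # e.g. /api/project/<repo>
        by_prefix[prefix] += 1
        by_client[m.group("fwd")] += 1

for prefix, count in by_prefix.most_common(10):
    print(f"{count:8d}  {prefix}")
print("top clients:", by_client.most_common(5))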

Flags: needinfo?(klahnakoski)
Attachment #9109938 - Flags: feedback?(klahnakoski) → feedback+
Regressions: 1598039
Depends on: 1597476
Depends on: 1598091
Summary: Database slow downs causing Tree closures → [meta] Database slow downs causing Tree closures
Depends on: 1598210
Depends on: 1598213
Blocks: 1571366

Trees closed again for this; some logs had not been parsed 20 minutes after the jobs finished. Failure suggestions had been loading slowly a few minutes before that.

This is new to me.

Any thoughts?

Flags: needinfo?(klahnakoski)

I'm thinking of rotating the th_admin password.

Message to AWS:

We upgraded from General Purpose storage to Provisioned IOPS and from db.m5.2xlarge to db.m5.4xlarge.
Nevertheless, we're currently under water.
I have noticed io/table/sql/handler in Performance Insights: https://bugzilla.mozilla.org/show_bug.cgi?id=1597136#c28
Also, there seems to be an "other" host taking almost 50% of the load. See screenshot.

What happened around 0630 EST / 1030 UTC?

Blocks: 1598333
Depends on: 1598339

As you mentioned, I can see that the given RDS instance is running with a db.m5.4xlarge / 2000 GiB / 7000 IOPS configuration.

I understand that you are observing the io/table/sql/handler wait event in Performance Insights. This is a table I/O wait event related to table access (irrespective of whether it is memory or disk bound). In order to diagnose this further, I would ask you to consider using the performance schema.

Apart from this, when we observe page_cleaner activity falling short very frequently, i.e. informational messages saying "InnoDB: page_cleaner: 1000ms intended loop took 12830ms. The settings might not be optimal.", it implies that buffer pool flushing/page cleaning is not able to keep up with the workload, i.e. the rate of changes happening.

In your case, since we can see these messages happening at high frequency (which means the rate of change is quite high and flushing is falling behind), I suggested configuring the "innodb_lru_scan_depth" parameter in my previous notes.

"The innodb_lru_scan_depth variable specifies, per buffer pool instance, how far down the buffer pool LRU list the page cleaner thread scans looking for dirty pages to flush. This is a background operation performed by a page cleaner thread once per second."

However, while modifying the "innodb_lru_scan_depth" parameter, it is necessary to test and tune the innodb_io_capacity and innodb_io_capacity_max parameters along with it to improve flushing.

Also, it might be beneficial to increase the size of the redo log files via the "innodb_log_file_size" parameter, though please beware that larger log files also make crash recovery slower in the event of a crash.

Importantly, I would encourage you to engage a DBA to help review SHOW ENGINE INNODB STATUS and other engine diagnostic outputs regularly. For your quick reference, I have attached a custom script which you may run periodically to record the outputs. However, please modify the script according to your environment.

Should you have any queries or concerns please do not hesitate to contact us. We will be more than happy to help you.

To see the file named 'DBscript.sh' included with this correspondence, please use the case link given below the signature.
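
Neither the specific performance_schema steps nor the attached DBscript.sh are pasted in this bug. As a rough stand-in for that kind of diagnostic collection, here's a hedged sketch (assuming performance_schema is enabled; connection details are placeholders, and this is not the AWS-provided script):

# Hedged sketch: record which tables drive table I/O waits (where the
# io/table/sql/handler event is accounted) and save the InnoDB status
# report for later DBA review.
import time

import MySQLdb

TABLE_WAITS_SQL = """
    SELECT OBJECT_SCHEMA, OBJECT_NAME, COUNT_STAR,
           SUM_TIMER_WAIT / 1e12 AS wait_seconds
      FROM performance_schema.table_io_waits_summary_by_table
     WHERE OBJECT_SCHEMA = 'treeherder'
  ORDER BY SUM_TIMER_WAIT DESC
     LIMIT 20
"""

conn = MySQLdb.connect(host="prod-db.example.internal", user="th_admin",
                       passwd="********", db="treeherder")
stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())

cur = conn.cursor()
cur.execute(TABLE_WAITS_SQL)
with open(f"table_io_waits_{stamp}.txt", "w") as fh:
    for schema, table, count, wait_seconds in cur.fetchall():
        fh.write(f"{schema}.{table}\t{count} ops\t{wait_seconds:.1f}s wait\n")

cur.execute("SHOW ENGINE INNODB STATUS")
_, _, status_text = cur.fetchone()  # single row: (Type, Name, Status)
with open(f"innodb_status_{stamp}.txt", "w") as fh:
    fh.write(status_text)

cur.close()
conn.close()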

Flags: needinfo?(klahnakoski)
Depends on: 1598433
Depends on: 1598450

We made some parameter group changes and rebooted the DBs based on AWS's recommendations. We also deleted a lot of old jobs (older than 60 days) to make the tables smaller.

We've caught up on tasks being processed, but we're 50k logs behind on parsing.
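
Deleting in batches keeps each transaction small; a hedged Django ORM sketch of that kind of clean-up is below (the Job model and its submit_time field are assumed to match Treeherder's models, and this is not the actual cycle_data implementation):

# Hedged sketch: remove jobs older than 60 days in small batches so no
# single transaction grows too large. Model/field names are assumptions.
from datetime import timedelta

from django.utils import timezone

from treeherder.model.models import Job  # assumed import path

CHUNK_SIZE = 5000

def delete_old_jobs(max_age_days=60):
    cutoff = timezone.now() - timedelta(days=max_age_days)
    while True:
        ids = list(
            Job.objects.filter(submit_time__lt=cutoff)
               .values_list("id", flat=True)[:CHUNK_SIZE]
        )
        if not ids:
            break
        Job.objects.filter(id__in=ids).delete()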

Blocks: 1598639
Attached file prod_db_params.txt
Attached file stage_db_params.txt
Attached file dev_db_params.txt
Blocks: 1598645

Current parameter differences (specifically the innodb_% ones) between the three environments:

Parameter                Production          Stage               Dev
innodb_io_capacity_max   5000                5000                2500
innodb_lru_scan_depth    256                 256                 256
innodb_log_file_size     17179869184 (16GB)  17179869184 (16GB)  8589934592 (8GB)
innodb_read_io_threads   8                   8                   4
innodb_write_io_threads  8                   8                   4
innodb_buffer_pool_size  48318382080 (45GB)  11811160064 (11GB)  11811160064 (11GB)

innodb_buffer_pool_size is one of the differences that was possibly introduced after we upgraded from 1 TB to 2 TB.

I believe innodb_log_file_size also changed when we increased the database size.
Dev was configured with 11811160064 (11,264MB) while production was configured with 134217728 (128MB).

You can see the treeherder-dev changes here and the treeherder-{prod,stage} changes here.

We need to bring treeherder-dev closer to treeherder-stage, since they have the same DB size, storage type, and EC2 instance type. I will file a bug.
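
For reference, a hedged boto3 sketch that would pull the same innodb_% values straight from the RDS parameter groups (the group names are placeholders, not the real ones):

# Hedged sketch: diff innodb_* parameters across the three RDS parameter groups.
import boto3

GROUPS = {
    "production": "treeherder-prod-params",
    "stage": "treeherder-stage-params",
    "dev": "treeherder-dev-params",
}

rds = boto3.client("rds")

def innodb_params(group_name):
    params = {}
    paginator = rds.get_paginator("describe_db_parameters")
    for page in paginator.paginate(DBParameterGroupName=group_name):
        for param in page["Parameters"]:
            if param["ParameterName"].startswith("innodb_"):
                # ParameterValue is absent when the engine default applies.
                params[param["ParameterName"]] = param.get("ParameterValue", "<default>")
    return params

values = {env: innodb_params(name) for env, name in GROUPS.items()}
all_names = sorted(set().union(*(v.keys() for v in values.values())))
for name in all_names:
    row = [values[env].get(name, "<default>") for env in GROUPS]
    if len(set(row)) > 1:  # only print parameters that differ between environments
        print(name, dict(zip(GROUPS, row)))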

Depends on: 1598657

For the curious, here are the major RDS and EC2 differences for the 3 databases:

Name        Storage type      EC2            Size      IOPS   Multi-AZ  Replica
production  Provisioned IOPS  db.m5.4xlarge  2,000 GB  7,000  Yes       Yes
stage       General Purpose   db.m5.xlarge   1,000 GB  3,000  No        No
dev         General Purpose   db.m5.xlarge   1,100 GB  3,300  No        No

Instance type differences:

Model          Core Count  vCPU*  Mem (GiB)  Storage   EBS Bandwidth (Mbps)  Network Perf (Gbps)
db.m5.xlarge   2           4      16         EBS-only  Up to 3,500           Up to 10
db.m5.4xlarge  8           16     64         EBS-only  3,500                 Up to 10

I've asked AWS to recommend which metric to track in order to detect sooner that we're in bad shape.
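
While waiting on their recommendation, one hedged option is a CloudWatch alarm on one of the RDS metrics that degraded before the closures, e.g. disk queue depth. A boto3 sketch follows; the instance identifier, threshold, and SNS topic are placeholders:

# Hedged sketch: alert when RDS disk queue depth stays elevated for
# 15 minutes. All identifiers and the threshold are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="treeherder-prod-db-disk-queue-depth",
    Namespace="AWS/RDS",
    MetricName="DiskQueueDepth",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "treeherder-prod"}],
    Statistic="Average",
    Period=300,              # 5-minute datapoints
    EvaluationPeriods=3,     # sustained for 15 minutes
    Threshold=20.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:treeherder-db-alerts"],
)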

I wanted to take a snapshot of the last 24 hours, now that the DB has been up with the latest changes and has gone through a day's worth of load.

At the top of the screenshot you can see the write throughput when we transitioned into the abnormal load of the 24th versus the load we got for most of today. See you Monday!

We're out of the woods; nevertheless, we need to review the DB parameter changes to understand them and make various improvements to prevent getting into this situation again.

For follow up work please follow bug 1599095.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Attachment #9109938 - Flags: feedback?(klibby) → feedback+
No longer blocks: 1598639, 1598645
No longer depends on: 1597368, 1598433
No longer depends on: 1598657