[meta] Database slowdowns causing Tree closures
Categories: Tree Management :: Treeherder: Infrastructure, defect, P1
Tracking: Not tracked
People: Reporter: aryx, Assigned: armenzg
Attachments (24 files)
As discussed earlier, there are slowdowns in the interaction with Treeherder, e.g. failure lines load slowly.
New Relic shows spikes in MySQL response time for non-web transactions:
https://rpm.newrelic.com/accounts/677903/applications/14179757
The big, longer spikes start at 5am UTC each day (the first one on Nov 15th UTC).
Another spike can be found between 1pm and 2:30pm UTC each day.
Deploy before this started: https://github.com/mozilla/treeherder/compare/c725c8a41cf19ab5a8217ac9a10e3da1dd2dd54f...9289f542e8729e5574ef4a566ddc007a4c943085 - related to bug 1571369? These new tables are not shown on the read-only replica even after refreshing the table list.
The 1pm to 2:30pm UTC spike happens when cycle_data runs; its performance has deteriorated since Friday (and it only runs at that time).
The amount of data added (e.g. into the job_detail table) is unchanged compared to the same days in the previous week. Can a one-off task be run to check whether performance is lower than normal at any given time of the day, with the cron jobs only making it noticeable?
Assignee
Comment 1 • 5 years ago

Comment 2 • 5 years ago
Queue depth, read/write throughput and read IOPS have all increased since Thursday, Nov 14, around 09:00 UTC. It looks like network traffic out has also increased a fair bit (~15MB/s to ~25MB/s).
Armen says that we've recently added a bunch of mobile stuff that may be the cause; he's tracking that down. In the meantime, he's tweaked things on the Heroku side to get the trees open.
Assignee
Comment 3 • 5 years ago
The trees were re-opened a bit ago. The backlog of tasks got cleared out by increasing the number of "store pulse data" workers, and I reduced the "log parser" workers from 5 to 4.
The screenshot shows the differences between the stage and production databases.
It is mainly for curiosity's sake, since there are the following differences between the two instances:
- The production DB has 2TB of storage to allow for 6,000 IOPS instead of 3,000 IOPS
- The number of pulse data workers on stage was 7 rather than 10, which makes for less writing to the DB
- The EC2 instance for stage is xlarge instead of 2xlarge
- Production has user traffic, which causes more Read IOPS
- Production has auto-scaling of backend nodes, thus increasing Read IOPS, instead of a single web node
Assignee
Comment 4 • 5 years ago
Because of deployments we're considering this code range:
https://github.com/mozilla/treeherder/compare/c725c8a41cf19ab5a8217ac9a10e3da1dd2dd54f...9289f542e8729e5574ef4a566ddc007a4c943085
However, I don't understand how it could be code-related; we should have seen something similar on stage.
We have ~4x more write throughput on production.
Assignee
Comment 5 • 5 years ago
Code changes being the root cause seems less plausible.
Code deployment (Thursday 14:43 UTC) vs. increase in network throughput (Thursday ~04:00 UTC).
Assignee
Comment 6 • 5 years ago
There are a few things we're going to try for this.
Sarah will be looking at bug 1407377 to see what we can gain.
Jake will be enabling Performance Insights for stage and production.
Let's turn the recommendations from New Relic into bugs.
I also have an idea to prevent tree closures while outages occur. I'm going to create a treeherder-read-only Heroku app that will use the replica DB. That way sheriffs will have an alternative app while things are under water. Even if there are processing delays, the read-only app will not be unresponsive. The app will only have the backend/UI and ignore the ingestion pipeline.
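As a rough illustration of the read-only idea, here is a minimal settings sketch that points a separate deployment at the replica. It assumes dj-database-url, which Heroku-style Django apps commonly use, and a hypothetical REPLICA_DATABASE_URL config var; it is not Treeherder's actual configuration.

```python
# settings fragment for a hypothetical read-only deployment; the env var name
# and the use of dj-database-url are assumptions, not Treeherder's real setup.
import dj_database_url

DATABASES = {
    # Point the only alias at the RDS read replica. Every ORM query goes to
    # the replica, and any accidental write is rejected by MySQL itself
    # because the replica is read-only.
    "default": dj_database_url.config(env="REPLICA_DATABASE_URL"),
}

# The ingestion pipeline (pulse listeners, log parsers) is simply not deployed
# in this app, so only the web/API dynos run against the replica.
```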
Comment 7 • 5 years ago
(In reply to Armen [:armenzg] from comment #6)
> I also have an idea to prevent tree closures while outages occur. I'm going to create a treeherder-read-only Heroku app that will use the replica DB. That way sheriffs will have an alternative app while things are under water. Even if there are processing delays, the read-only app will not be unresponsive. The app will only have the backend/UI and ignore the ingestion pipeline.

We'll need to be sure they're aware they won't be able to classify failures while using that.
Reporter
Comment 8 • 5 years ago
Trees got closed for this at 07:15 UTC. At the moment the reaction time is faster than normal, but requests to e.g. https://treeherder.mozilla.org/api/project/try/push/?full=true&count=100&fromchange=fb70a214d19366120fe8787c4e804a177be06d81 fail with an internal server error. Heroku also seems to be crashing (got the Heroku error page for Treeherder once; sometimes Treeherder reloads itself).
Assignee
Comment 9 • 5 years ago
I have not seen the DB reject all connections before.
Assignee
Comment 10 • 5 years ago
I'm also seeing errors like these:
2019-11-19T11:37:51.273586Z 1168705 [Note] Aborted connection 1168705 to db: 'treeherder' user: 'th_admin' host: 'ec2-18-212-31-149.compute-1.amazonaws.com' (Got an error reading communication packets)
2019-11-19T11:38:00.997609Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 11294ms. The settings might not be optimal. (flushed=1170 and evicted=6280, during the time.)
Assignee
Comment 11 • 5 years ago
How should I interpret the CPU running at 100% for an hour while there were no read/write operations?
Assignee
Comment 12 • 5 years ago

Assignee
Comment 13 • 5 years ago
We caught up on ingesting tasks in the last hour. We should soon catch up on parsing logs as well.
Assignee
Comment 14 • 5 years ago
Taking a step back, there are mainly two things happening:
- Something is different on the production app or the production DB that is causing much higher write throughput (see screenshot).
- Because of it, the DB is very fragile and the clean-up tasks can tip it over.
I'm open to suggestions as to what to try.
I contacted Amazon, but they said they would call me back.
Meanwhile I'm trying to reach folks within Mozilla as they wake up to ask for advice.
Comment 15 • 5 years ago
Armen, I've re-analyzed the suspect patch.
There's this new piece of code from treeherder/perf/models.py which increases DB writes:

    def update_status(self, using=None):
        autodetermined_status = self.autodetermine_status()
        self.save(using=using)

Previously, the self.save(using=using) was conditional. Now it isn't.
Still, I skimmed the app code, looking for potential abuses of this function (for-loops, Python tasks which are called frequently, etc.), but couldn't find one to confirm this is indeed the culprit.
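For context, a minimal sketch of what re-introducing the condition could look like, assuming the model stores the result in a status field; the guard and the field name are illustrative, not the actual Treeherder fix.

```python
def update_status(self, using=None):
    autodetermined_status = self.autodetermine_status()
    # Hypothetical guard: only hit the DB when the status actually changed,
    # so frequent callers of update_status() don't turn into write traffic.
    if autodetermined_status != self.status:
        self.status = autodetermined_status
        self.save(using=using)
```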
Assignee
Comment 16 • 5 years ago
(In reply to Ionuț Goldan [:igoldan], Performance Sheriff from comment #15)
> Armen, I've re-analyzed the suspect patch.
> There's this new piece of code from treeherder/perf/models.py which increases DB writes:
>
>     def update_status(self, using=None):
>         autodetermined_status = self.autodetermine_status()
>         self.save(using=using)
>
> Previously, the self.save(using=using) was conditional. Now it isn't.
> Still, I skimmed the app code, looking for potential abuses of this function (for-loops, Python tasks which are called frequently, etc.), but couldn't find one to confirm this is indeed the culprit.

Thanks for checking. If this were a code-related issue, we would also see something similar happening on stage.
Assignee
Comment 17 • 5 years ago

Comment 18 • 5 years ago
Here is a close-up of the activity during the 100% CPU problem. Notice the problems happen soon after 9am GMT, which is when the perf cycle_data task is scheduled.
Comment 19 • 5 years ago
(In reply to Armen [:armenzg] from comment #16)
> Thanks for checking. If this were a code-related issue, we would also see something similar happening on stage.

How does your stage database compare in size to your production database? Also, this production database is a 2xlarge of what instance class? R3? Also helpful would be the output of "SHOW TABLE STATUS;", the RDS parameter group, and the table schema.
Assignee
Comment 20 • 5 years ago
(In reply to Bob Micheletto [:bobm] from comment #19)
> (In reply to Armen [:armenzg] from comment #16)
> > Thanks for checking. If this were a code-related issue, we would also see something similar happening on stage.
>
> How does your stage database compare in size to your production database?

Name | Allocated | Free space |
---|---|---|
production | 2,000GB | 1,300GB |
stage | 1,100GB | 200GB |

> Also, this production database is a 2xlarge of what instance class?

db.m5.xlarge vs. db.m5.2xlarge

> Also helpful would be the output of "SHOW TABLE STATUS;", the RDS parameter group, and the table schema.

I will attach this information.
Assignee
Comment 21 • 5 years ago

Assignee
Comment 22 • 5 years ago

Comment 23 • 5 years ago

Comment 24 • 5 years ago
One of the problems is an overall increase in load on the TH database. I find it odd that the performance_datum requests are demanding more resources than the job selection. Maybe there is a new performance dashboard out there?
Comment 25 • 5 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #24)
> [...] I find it odd that the performance_datum requests are demanding more resources than the job selection. Maybe there is a new performance dashboard out there?

Did you notice any increase in usage of the https://health.graphics/ web app? Could this be a reason for the TH database load increase?
Comment 26 • 5 years ago
:igoldan, two weeks ago I got a query about using the health.graphics code, so maybe someone is using the code without the dashboard. I am looking into it. We probably have API endpoint hit counts on a dashboard somewhere to determine whether this load is third-party or something internal.
Reporter
Comment 27 • 5 years ago
Trees closed again for this; some logs were not parsed 20 minutes after they finished. Failure suggestions loaded more slowly a few minutes before that.
Assignee
Comment 29 • 5 years ago
I'm thinking of rotating the th_admin password.
Assignee
Comment 30 • 5 years ago
Message to AWS:
We upgraded from General Purpose storage to Provisioned IOPS and from db.m5.2xlarge to db.m5.4xlarge.
Nevertheless, we're right now under water.
I have noticed io/table/sql/handler in Performance Insights: https://bugzilla.mozilla.org/show_bug.cgi?id=1597136#c28
Also, there seems to be an "other" host taking almost 50% of the load. See screenshot.
Comment 31 • 5 years ago
What happened around 06:30 EST / 10:30 UTC?
Assignee
Comment 32 • 5 years ago
As you mentioned, I can see that the given RDS instance is running with a db.m5.4xlarge / 2000 GiB / 7000 IOPS configuration.
I understand that you are observing the io/table/sql/handler wait event in Performance Insights. I would like to mention that this is a table I/O wait event related to table access (irrespective of whether it is memory or disk bound). In order to diagnose this further, I would ask you to consider using the Performance Schema:
- https://dev.mysql.com/doc/refman/5.7/en/table-waits-summary-tables.html#table-io-waits-summary-by-table-table
- http://marcalff.blogspot.com/2011/06/performance-schema-overhead-tuning.html
Apart from this, when we observe page_cleaner activity falling short very frequently, i.e. informational messages saying "InnoDB: page_cleaner: 1000ms intended loop took 12830ms. The settings might not be optimal.", it implies that buffer pool flushing/page cleaning is not able to catch up with the workload, i.e. the rate of changes happening.
In your case, as we can see these messages happening with high frequency (which means the rate of changes is quite high and flushing is falling behind), I suggested configuring the "innodb_lru_scan_depth" parameter in my previous notes.
"The innodb_lru_scan_depth variable specifies, per buffer pool instance, how far down the buffer pool LRU list the page cleaner thread scans looking for dirty pages to flush. This is a background operation performed by a page cleaner thread once per second."
- https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool-flushing.html
- https://www.percona.com/doc/percona-server/5.7/performance/page_cleaner_tuning.html
However, while modifying the "innodb_lru_scan_depth" parameter, it is necessary to test and tune innodb_io_capacity along with innodb_io_capacity_max to improve flushing.
- https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_io_capacity_max
- https://dev.mysql.com/doc/refman/5.7/en/innodb-performance-multiple_io_threads.html
Also, it might be beneficial to increase the size of the redo log files via the "innodb_log_file_size" parameter, though please be aware that larger log files also make crash recovery slower in the event of a crash.
Importantly, I would encourage you to engage a DBA to help review SHOW ENGINE INNODB STATUS and other engine diagnostic outputs regularly. For your quick reference, I have attached a custom script which you may run periodically to record the outputs. However, please modify the script according to your environment.
Should you have any queries or concerns, please do not hesitate to contact us. We will be more than happy to help you.
To see the file named 'DBscript.sh' included with this correspondence, please use the case link given below the signature.
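For reference, a minimal sketch of the kind of table I/O query the first link above describes, pulling the busiest Treeherder tables from the Performance Schema. It assumes performance_schema is enabled and pymysql is available; the env var names are placeholders.

```python
# Hypothetical diagnostic query against performance_schema; not part of
# Treeherder. Timer columns are in picoseconds, hence the 1e12 divisor.
import os
import pymysql

conn = pymysql.connect(
    host=os.environ["TH_DB_HOST"],       # placeholder env vars
    user=os.environ["TH_DB_USER"],
    password=os.environ["TH_DB_PASSWORD"],
)
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT object_name,
               count_read,
               count_write,
               ROUND(sum_timer_wait / 1e12, 1) AS total_wait_seconds
        FROM performance_schema.table_io_waits_summary_by_table
        WHERE object_schema = 'treeherder'
        ORDER BY sum_timer_wait DESC
        LIMIT 10
        """
    )
    for table, reads, writes, wait_s in cur.fetchall():
        print(f"{table}: {reads} reads, {writes} writes, ~{wait_s}s of table I/O wait")
conn.close()
```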
Comment 33 • 5 years ago

Assignee
Comment 34 • 5 years ago
We did some parameter group changes and rebooted the DBs based on the recommendations from AWS. We also deleted a lot of older jobs (older than 60 days) to make the tables smaller.
We've caught up on tasks being processed, but we're 50k logs behind on parsing.
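The deletions above were done with Treeherder's own tooling; purely as an illustration of the general pattern, here is a minimal sketch of deleting old rows in small batches so a clean-up doesn't monopolize the database. The table and column names are made up, not the real schema.

```python
# Illustrative batched cleanup, NOT Treeherder's actual cycle_data code.
# Small chunks keep transactions short so flushing and replication keep up.
import os
import time
import pymysql

CHUNK_SIZE = 5000

conn = pymysql.connect(
    host=os.environ["TH_DB_HOST"],       # placeholder env vars
    user=os.environ["TH_DB_USER"],
    password=os.environ["TH_DB_PASSWORD"],
    database="treeherder",
    autocommit=True,
)
with conn.cursor() as cur:
    while True:
        # "job" and "submit_time" are illustrative names only.
        deleted = cur.execute(
            "DELETE FROM job WHERE submit_time < NOW() - INTERVAL 60 DAY LIMIT %s",
            (CHUNK_SIZE,),
        )
        if deleted == 0:
            break
        time.sleep(0.5)  # breathing room between chunks
conn.close()
```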
Assignee
Comment 35 • 5 years ago

Assignee
Comment 36 • 5 years ago

Assignee
Comment 37 • 5 years ago

Assignee
Comment 38 • 5 years ago
Current parameter differences (specifically the innodb_% ones) between the three environments:
Parameter | Production | Stage | Dev |
---|---|---|---|
innodb_io_capacity_max | 5000 | 5000 | 2500 |
innodb_lru_scan_depth | 256 | 256 | 256 |
innodb_log_file_size | 17179869184 (16GB) | 17179869184 (16GB) | 8589934592 (8GB) |
innodb_read_io_threads | 8 | 8 | 4 |
innodb_write_io_threads | 8 | 8 | 4 |
innodb_buffer_pool_size | 48318382080 (45GB) | 11811160064 (11GB) | 11811160064 (11GB) |
innodb_buffer_pool_size is one of the differences that was possibly introduced after we upgraded from 1 to 2TB.
I believe innodb_log_file_size also changed when we increased the database size.
Dev was configured with 11811160064 (11,264MB) while production was configured with 134217728 (128MB).
You can see the treeherder-dev changes here and the treeherder-{prod,stage} changes here.
We need to bring treeherder-dev closer to treeherder-stage since their DB size, storage type & EC2 instance type are the same. I will file a bug.
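A minimal sketch of how such a comparison could be automated with boto3; the parameter group names below are hypothetical placeholders, and this is not how the table above was produced.

```python
# Hypothetical helper to diff innodb_% parameters between two RDS parameter
# groups; the group names are placeholders, not the real Treeherder ones.
import boto3

rds = boto3.client("rds")

def innodb_params(group_name):
    params = {}
    paginator = rds.get_paginator("describe_db_parameters")
    for page in paginator.paginate(DBParameterGroupName=group_name):
        for p in page["Parameters"]:
            if p["ParameterName"].startswith("innodb_"):
                params[p["ParameterName"]] = p.get("ParameterValue")
    return params

prod = innodb_params("treeherder-prod-params")    # placeholder name
stage = innodb_params("treeherder-stage-params")  # placeholder name

for name in sorted(set(prod) | set(stage)):
    if prod.get(name) != stage.get(name):
        print(f"{name}: prod={prod.get(name)} stage={stage.get(name)}")
```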
Assignee
Comment 39 • 5 years ago
For the curious, here are the major RDS and EC2 differences between the 3 DBs:
Name | Storage type | EC2 | Size | IOPS | Multi-AZ | Replica |
---|---|---|---|---|---|---|
production | Provisioned IOPS | db.m5.4xlarge | 2,000GB | 7,000 | Yes | Yes |
stage | General Purpose | db.m5.xlarge | 1,000GB | 3,000 | No | No |
dev | General Purpose | db.m5.xlarge | 1,100GB | 3,300 | No | No |
Instance type differences:
Model | Core Count | vCPU* | Mem (GiB) | Storage (GiB) | EBS Bandwidth (Mbps) | Network Perf (Gbps) |
---|---|---|---|---|---|---|
db.m5.xlarge | 2 | 4 | 16 | EBS-only | Up to 3,500 | Up to 10 |
db.m5.4xlarge | 8 | 16 | 64 | EBS-only | 3,500 | Up to 10 |
Assignee
Comment 40 • 5 years ago
I've asked AWS to recommend which metric we should track in order to detect sooner that we're in bad shape.
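In the meantime, a minimal sketch of what an early-warning alarm could look like with boto3; the metric choice, threshold, instance identifier and SNS topic are assumptions for illustration, not AWS's recommendation.

```python
# Hypothetical CloudWatch alarm on RDS DiskQueueDepth as an early-warning
# signal; threshold, instance identifier and SNS topic are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="treeherder-prod-db-queue-depth",  # placeholder
    Namespace="AWS/RDS",
    MetricName="DiskQueueDepth",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "treeherder-prod"}],  # placeholder
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=20.0,  # assumed threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:treeherder-alerts"],  # placeholder
)
```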
Assignee
Comment 41 • 5 years ago
I wanted to take a snapshot of the last 24 hours since the DB was brought up with the latest changes and after having gone through a day's worth of load.
Assignee
Comment 42 • 5 years ago
At the top of the screenshot you can see the write throughput as we transitioned into the abnormal load of the 24th versus the load we got for most of today. See you Monday!
Assignee
Comment 43 • 5 years ago
We're out of the woods; nevertheless, we need to review the DB parameter changes to understand them and make various improvements to prevent getting into this situation again.
For follow-up work, please see bug 1599095.