Closed
Bug 1397716
Opened 7 years ago
Closed 7 years ago
treeherder-stage RDS instance using 100% CPU
Categories
(Tree Management :: Treeherder: Infrastructure, enhancement, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: emorley)
Starting 11-12 hours ago, the stage RDS instance's CPU usage has been hitting 100%.
There's nothing obvious showing up in the queries in client connections.
Comment 1 • 7 years ago (Assignee)
I ran `OPTIMIZE TABLE treeherder.reference_data_signatures` and rebooted the stage instance, to no avail.
However, looking at New Relic, I saw a drop in the number of memcached gets at the same time as the MySQL usage spiked.
New Relic also shows lots of:
[2017-09-07 12:30:36,493: ERROR/Worker-1] MemcachedError: error 47 from memcached_set: (0x3ad7870) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: localhost:11211 -> libmemcached/connect.cc:720
Looks like memcached stopped working for some reason.
Comment 2 • 7 years ago (Assignee)
It turns out Memcachier have forgotten to renew/rotate their TLS cert for the stunnel, for which I've filed bug 1397726.
I'll leave this open to see what we can do to improve monitoring and reduce load when memcached isn't up. (The `reference_data_signatures` queries are very slow.)
Comment 3 • 7 years ago (Assignee)
This is causing CloudAMQP alerts too:
Name: treeherder-stage
Server: REDACTED
Vhost: REDACTED
Queue: store_pulse_jobs
Current # messages: 55147
Alarm queue regexp: .*
Alarm threshold: 1000
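For anyone reading the alert fields above, here is a rough sketch of how such a rule could be evaluated. It assumes the alarm applies to any queue whose name matches the regexp and fires once the message count exceeds the threshold; the function name `alarm_triggered` is made up for illustration and is not CloudAMQP's API, and the exact comparison CloudAMQP uses may differ.

```python
import re


def alarm_triggered(queue_name: str, message_count: int,
                    queue_regexp: str = ".*", threshold: int = 1000) -> bool:
    """Hypothetical check mirroring the alert fields above.

    Assumption: the rule covers queues whose name matches `queue_regexp`
    and fires when `message_count` is strictly greater than `threshold`.
    """
    matches = re.search(queue_regexp, queue_name) is not None
    return matches and message_count > threshold


# The values from the alert: store_pulse_jobs has 55147 messages queued.
print(alarm_triggered("store_pulse_jobs", 55147))
```

With a regexp of `.*` every queue on the vhost is covered, so any backlog over 1000 messages would raise the alarm under this reading.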
Comment 4 • 7 years ago (Assignee)
This was resolved as of bug 1397726 comment 1. However, I'll leave this bug open until I've filed some followups to improve the situation when memcached isn't working (this query shouldn't be quite so severe).
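As an illustration of the kind of mitigation meant here, the sketch below shows one generic way to soften a cache outage: treat a failing remote cache as a miss, and keep a short-lived in-process copy so an expensive query is not re-run on every call. This is not Treeherder's code. `ResilientCache`, `CacheUnavailable`, and the `remote` object with `get`/`set` methods are all hypothetical names; a real client would raise its own error type.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheUnavailable(Exception):
    """Hypothetical error raised by the remote cache client during an outage."""


class ResilientCache:
    """Cache-aside helper with an in-process TTL fallback (illustrative only)."""

    def __init__(self, remote: Any, local_ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._remote = remote          # assumed to expose get(key) and set(key, value)
        self._local: Dict[str, Tuple[float, Any]] = {}
        self._local_ttl = local_ttl
        self._clock = clock

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        # 1. Try the remote cache; an outage is handled like a miss.
        remote_up = True
        try:
            value = self._remote.get(key)
            if value is not None:
                return value
        except CacheUnavailable:
            remote_up = False

        # 2. Remote is down: serve the local copy if it has not expired.
        if not remote_up:
            entry = self._local.get(key)
            if entry is not None and entry[0] > self._clock():
                return entry[1]

        # 3. Run the expensive computation (e.g. a slow query) and store it.
        value = compute()
        self._local[key] = (self._clock() + self._local_ttl, value)
        if remote_up:
            try:
                self._remote.set(key, value)
            except CacheUnavailable:
                pass  # best effort; the local copy still protects the database
        return value
```

The trade-off is staleness: while the remote cache is down, each process may serve data up to `local_ttl` seconds old, in exchange for running the slow query at most once per process per TTL window.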
Comment 5 • 7 years ago (Assignee)
get_signatures_for_project() is the method whose lack of caching caused the majority of the problem for this bug, and it's going away as part of bug 1387640. Hurray! \o/
Depends on: 1387640
Comment 6 • 7 years ago (Assignee)
All done here.
Enabling CPU alerts for RDS is bug 1306597; figuring out a less fragile memcached setup is bug 1300082 (or bug 1384518 for switching to Redis and another provider).
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED