Closed Bug 1397716 Opened 7 years ago Closed 7 years ago

treeherder-stage RDS instance using 100% CPU

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

Starting 11-12 hours ago, the stage RDS instance's CPU usage has been hitting 100%.

Nothing obvious shows up in the queries listed under client connections.
I did an `OPTIMIZE TABLE treeherder.reference_data_signatures` and rebooted the stage instance to no avail.

However, looking at New Relic, I saw a drop in the number of memcached gets at the same time the MySQL usage spiked.

Looking at New Relic I see lots of:

[2017-09-07 12:30:36,493: ERROR/Worker-1] MemcachedError: error 47 from memcached_set: (0x3ad7870) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY,  host: localhost:11211 -> libmemcached/connect.cc:720 

Looks like memcached stopped working for some reason.
Depends on: 1397726
It turns out Memcachier have forgotten to renew/rotate their TLS cert for the stunnel, for which I've filed bug 1397726.

I'll leave this open to see what we can do to improve monitoring and reduce load when memcached is down. (The `reference_data_signatures` queries are very slow.)
This is causing CloudAMQP alerts too:

Name 	treeherder-stage
Server 	REDACTED
Vhost 	REDACTED
Queue 	store_pulse_jobs
Current # messages 	55147
Alarm queue regexp 	.*
Alarm threshold 	1000
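
For readers unfamiliar with this kind of alert, the rule above can be read roughly as the check below. This is only a sketch: the function name is hypothetical, and it assumes the regexp is matched against the whole queue name and that the alarm fires when the message count is strictly above the threshold, which may not be exactly how CloudAMQP evaluates it.

```python
import re


def queue_alarm_fires(queue, messages, pattern, threshold):
    """Hypothetical reading of the alert rule, not CloudAMQP's implementation.

    Assumes a full match of the regexp against the queue name and a strict
    "greater than" comparison against the threshold.
    """
    return re.fullmatch(pattern, queue) is not None and messages > threshold


# Values taken from the alert above.
print(queue_alarm_fires("store_pulse_jobs", 55147, r".*", 1000))  # prints True
```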
This was resolved as of bug 1397726 comment 1; however, I'll leave this bug open until I've filed some followups to improve the situation when memcached isn't working (the query load shouldn't be quite so severe).
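
One possible shape for such a followup, as a minimal sketch and not Treeherder's actual code: wrap the remote cache so that when it is unreachable, results are kept briefly in process memory instead of re-running the slow query on every call. All names here (`FallbackCache`, `CacheUnavailable`, `DownRemote`, `slow_query`, the cache key) are hypothetical, and the remote client is assumed to expose `get`/`set` and raise an exception on failure.

```python
import time


class CacheUnavailable(Exception):
    """Hypothetical stand-in for a backend error such as MemcachedError."""


class FallbackCache:
    """Wrap a remote cache; use an in-process TTL cache while it is failing."""

    def __init__(self, remote, local_ttl=60, clock=time.monotonic):
        self.remote = remote          # assumed to expose get(key) / set(key, value)
        self.local_ttl = local_ttl    # seconds to keep values in process memory
        self.clock = clock
        self._local = {}              # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        try:
            value = self.remote.get(key)
        except CacheUnavailable:
            # Remote cache is down: serve from (or fill) the local cache.
            return self._local_get_or_compute(key, compute)
        if value is None:
            value = compute()
            try:
                self.remote.set(key, value)
            except CacheUnavailable:
                # The remote failed between get and set: keep the value locally.
                self._local[key] = (self.clock() + self.local_ttl, value)
        return value

    def _local_get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._local.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._local[key] = (now + self.local_ttl, value)
        return value


class DownRemote:
    """Simulates a cache server that rejects every request."""

    def get(self, key):
        raise CacheUnavailable("server has failed")

    def set(self, key, value):
        raise CacheUnavailable("server has failed")


calls = []


def slow_query():
    # Placeholder for an expensive database query.
    calls.append(1)
    return ["sig-a", "sig-b"]


cache = FallbackCache(DownRemote(), local_ttl=60)
cache.get_or_compute("signatures:some-project", slow_query)
cache.get_or_compute("signatures:some-project", slow_query)
print(len(calls))  # prints 1: the second call was served from process memory
```

The trade-off is that each worker process holds its own copy and may serve data up to `local_ttl` seconds stale, so this only suits values that change rarely.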
get_signatures_for_project() is the method whose lack of caching caused the majority of the problem for this bug, and it's going away as part of bug 1387640. Hurray! \o/
Depends on: 1387640
All done here. 

Enabling CPU alerts for RDS is bug 1306597, figuring out a less fragile memcached setup is bug 1300082 (or bug 1384518 for switching to Redis and another provider).
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED