80% of 30 GB used on the webheads; time to full is ~5 days. Usage was flat until about a week ago - not sure what changed. Log rotation stopped? See: https://rpm.newrelic.com/accounts/677903/servers/5313375/disks?tw[end]=1428334448&tw[start]=1425656048#id=603135097 I don't have ssh/vpn access set up on my tablet, so I'll take a closer look when I'm back home tomorrow (public holiday today) - but if someone else could take a look in the meantime, that would be great.
New relic alert link from email: https://rpm.newrelic.com/accounts/677903/incidents/14622548
httpd access logs have grown in size (and are not compressed on rotation):

125M  access_log-20150323
139M  access_log-20150324
148M  access_log-20150325
329M  access_log-20150326
368M  access_log-20150327
122M  access_log-20150328
83M   access_log-20150329
89M   access_log-20150330
217M  access_log-20150331
715M  access_log-20150401
812M  access_log-20150402
989M  access_log-20150403
1017M access_log-20150404
699M  access_log-20150405
679M  access_log-20150406

Something is hammering /api/project/try/jobs/?count=2000&result_set_id__in=&return_type=list:

treeherder1.webapp.scl3# grep -c '/api/project/try/jobs/?count=2000&result_set_id__in=&return_type=list' access_log-20150406
2542090
treeherder1.webapp.scl3# wc -l access_log-20150406
2753121 access_log-20150406

That's 92% of all requests. wtf. Coming from:

126.96.36.199 - 1
188.8.131.52 - 1430157 (corp-nat.p2p.sfo1.mozilla.com)
184.108.40.206 - 1
220.127.116.11 - 2
18.104.22.168 - 392137 (h-235-34.a149.priv.bahnhof.se)
22.214.171.124 - 719792

I have committed a change to logrotate to compress logs on rotation, and am compressing the old logs manually.
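A minimal sketch of how the per-IP breakdown above can be reproduced from an access log (assuming Apache's combined log format, where the first field is the client IP; the filename is just the example from above):

```shell
# Tally requests per client IP and show the top offenders.
# Field $1 is the client IP in Apache's combined log format.
awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' access_log-20150406 \
  | sort -rn \
  | head
```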
That would explain the jump in load on the API (bug 1150631) - I guess I should do some further analysis and ban IPs if needed. We should probably also set an API rate limit/threshold for reads (we have a limit already, but I'm presuming it's for submissions only - or else the threshold is too low).
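If banning does turn out to be necessary, one way to shortlist candidates is to filter the per-IP counts against a threshold (the threshold value and log path here are assumptions for illustration, not anything agreed in this bug):

```shell
# Print IPs with more than THRESHOLD requests in the given log,
# as candidates for review before any firewall ban.
THRESHOLD=100000
awk -v t="$THRESHOLD" \
  '{ count[$1]++ } END { for (ip in count) if (count[ip] > t) print ip }' \
  access_log-20150406
```

The output is a review list only; actually blocking (firewall rules, API throttling) would be a separate step.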
And thank you for sorting compression :-)
yeah, I think 17 queries/sec is a bit much. try gets used a lot, but that feels like it's off by at least two orders of magnitude. :-P
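For context, that figure is roughly what you get by spreading the top IP's daily count from comment 2 evenly over one day:

```shell
# 1,430,157 requests over 86,400 seconds (one day), integer division
echo $(( 1430157 / 86400 ))   # ~16 queries/sec, i.e. in the 17/sec ballpark
```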
camd, jgriffin, just saw this:

19:03 camd jgriffin: so, I think based on this, there's nothing immediate that needs to happen: https://rpm.newrelic.com/accounts/677903/incidents
19:03 camd all the recent alerts were not error rate changes, like on 3/31.
19:03 jgriffin I agree
19:03 camd just disk and cpu space issues that are hovering.

I may be misunderstanding, but neither the disk space alerts (this bug) nor the CPU issues in bug 1150631 are us just hovering near a threshold we've been at for a while; there was a distinct change in traffic (likely from something abusing our API), and the only reason the disk usage alert closed was because fubar kindly switched on log compression. See comment 2 onwards :-)
I'll file bugs shortly for the follow-ups to this.
Assignee: nobody → klibby
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED