Prod webheads filling disks as of 1st April (Alert open: Fullest disk > 80% on 'th-prod-web{1-3}')



Tree Management
Treeherder: Infrastructure
3 years ago


(Reporter: emorley, Assigned: fubar)



80% of 30 GB used on webheads.
Time to full: 5 days.

Usage was flat until about a week ago - not sure what changed. Log rotation stopped? See: [end]=1428334448&tw[start]=1425656048#id=603135097

Don't have ssh/vpn access set up on my tablet, will take a closer look when back at home tomorrow (public holiday today) - but if someone else could take a look in the meantime, that would be great.
The httpd access logs have grown in size (and are also not compressed on rotation):

125M	access_log-20150323
139M	access_log-20150324
148M	access_log-20150325
329M	access_log-20150326
368M	access_log-20150327
122M	access_log-20150328
83M	access_log-20150329
89M	access_log-20150330
217M	access_log-20150331
715M	access_log-20150401
812M	access_log-20150402
989M	access_log-20150403
1017M	access_log-20150404
699M	access_log-20150405
679M	access_log-20150406

Something is hammering /api/project/try/jobs/?count=2000&result_set_id__in=&return_type=list:

treeherder1.webapp.scl3# grep -c '/api/project/try/jobs/?count=2000&result_set_id__in=&return_type=list' access_log-20150406
treeherder1.webapp.scl3# wc -l access_log-20150406
2753121 access_log-20150406

that's 92%. wtf. Coming from (client IPs redacted): 1430157 requests from one client, 392137 from a second, and 719792 from a third.
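For anyone wanting to reproduce the per-client tally, a standard top-talkers pipeline over the access log does it (sample file and IPs below are illustrative; the real IPs in this bug are redacted):

```shell
# Build a tiny sample log; in a combined-format access log,
# field 1 is the client IP.
printf '10.0.0.1 - - [req]\n10.0.0.1 - - [req]\n10.0.0.2 - - [req]\n' \
    > /tmp/access_log_sample

# Count requests per client IP, busiest first.
awk '{print $1}' /tmp/access_log_sample | sort | uniq -c | sort -rn | head
```

On the real log, run the same pipeline against access_log-20150406 (optionally after a grep for the offending URL) to get the counts quoted above.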

Have committed a change to logrotate to compress logs, and am compressing the old logs manually.
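A sketch of the manual cleanup (directory and filenames are illustrative; the real logs live wherever httpd writes them on the webheads). Going forward, adding the `compress` directive (optionally with `delaycompress`) to the logrotate stanza makes rotation gzip the logs automatically:

```shell
# Demo directory with fake rotated logs standing in for the real ones.
mkdir -p /tmp/httpd_demo && cd /tmp/httpd_demo
touch access_log-20150401 access_log-20150402

# Compress already-rotated logs; -f so a rerun doesn't abort on
# existing .gz files. The live access_log is left alone.
for f in access_log-2015*; do
    gzip -9f "$f"
done
ls    # access_log-20150401.gz  access_log-20150402.gz
```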
That would explain the jump in load on the API (bug 1150631) - I guess I should do some further analysis and ban IPs if needed. We should probably also set an API limit/threshold for reads (we have a limit already, but I'm presuming it's for submissions only - or else the threshold is too low).
And thank you for sorting compression :-)
yeah, I think 17 queries/sec is a bit much. try gets used a lot, but that feels like it's off by at least two orders of magnitude. :-P
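Quick sanity check on that rate, assuming the 17/sec figure is the top client's 1430157 requests averaged over the full day of 20150406:

```shell
# 1430157 requests over 86400 seconds (one day).
awk 'BEGIN { printf "%.1f requests/sec\n", 1430157 / 86400 }'
```

That comes out to roughly 16.6/sec from a single client, sustained all day, which matches the "17 queries/sec" eyeball figure.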
camd, jgriffin, just saw this:

19:03	camd	jgriffin: so, I think based on this, there's nothing immediate that needs to happen:
19:03	camd	all the recent alerts were not error rate changes, like on 3/31.
19:03	jgriffin	I agree
19:03	camd	just disk and cpu space issues that are hovering.

I may be misunderstanding, but neither the disk space alerts (this bug) nor the CPU issues in bug 1150631 are cases of us just hovering near a threshold we've been at for a while; there was a distinct change in traffic (likely from something abusing our API), and the only reason the disk usage alert closed is that fubar kindly switched on log compression. See comment 2 onwards :-)
I'll file bugs shortly for the followups to this.
Assignee: nobody → klibby
Last Resolved: 3 years ago
Resolution: --- → FIXED