Nagios alerts in #buildduty:

nagios-releng
13:58:12 Wed 04:58:12 PDT  signing6.srv.releng.scl3.mozilla.com:disk - / is CRITICAL: DISK CRITICAL - free space: / 19504 MB (6% inode=99%): (http://m.mozilla.org/disk+-+/)
13:59:13 Wed 04:59:12 PDT  signing4.srv.releng.scl3.mozilla.com:disk - / is CRITICAL: DISK CRITICAL - free space: / 27696 MB (9% inode=99%): (http://m.mozilla.org/disk+-+/)
14:39:13 Wed 05:39:12 PDT  signing5.srv.releng.scl3.mozilla.com:disk - / is WARNING: DISK WARNING - free space: / 32889 MB (11% inode=99%): (http://m.mozilla.org/disk+-+/)
The usage is genuine: release build artefacts are taking up most of the space, probably because of the high number of releases we are having at the moment. For example, on signing6, /builds/signing/rel-key-signing-server is 186GB, and /builds/signing is 257GB of the total 260GB used.
It looks like the cleanup strategy is based on the age of the artefact rather than on available disk space: https://github.com/mozilla/build-tools/blob/master/lib/python/signing/server.py#L408 -> https://github.com/mozilla/build-tools/blob/master/lib/python/signing/server.py#L331
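To make the consequence concrete, here is a minimal sketch of what an age-based cleanup generally looks like. It is not the code in server.py; the function name `cleanup_by_age` and its parameters are hypothetical. The point is that nothing in such a loop looks at free space, so disk usage grows with the number of releases inside the age window.

```python
import os
import time


def cleanup_by_age(directory, max_file_age, now=None):
    """Remove files under `directory` older than `max_file_age` seconds.

    Hypothetical illustration only; returns the list of removed paths.
    """
    now = time.time() if now is None else now
    removed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                # Age is measured from the file's last modification time.
                if now - os.path.getmtime(path) > max_file_age:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                # The file vanished or could not be removed; skip it.
                continue
    return removed
```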
A short-term solution (to avoid filling up the disks completely) is to reduce server.max_file_age in the signing.ini file. Currently it is set to 12 hours:

<snip>
[server]
listen = 0.0.0.0
port = 9120
redis =
max_file_age = 43200 ; 12 hours
cleanup_interval = 600 ; 10 Minutes
daemonize = yes
</snip>

A longer-term solution might be to change the cleanup strategy to be based on available free disk space.
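As a rough sketch of the longer-term idea, the following deletes the oldest files first until a minimum fraction of the disk is free. This is an assumption about how such a strategy could look, not a patch against the signing server; `cleanup_by_free_space`, `free_fraction` and `min_free_fraction` are hypothetical names. `shutil.disk_usage` needs Python 3.3 or later.

```python
import os
import shutil


def free_fraction(path):
    """Return the free fraction of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def cleanup_by_free_space(directory, min_free_fraction, get_free=free_fraction):
    """Delete the oldest files under `directory` until enough space is free.

    Hypothetical illustration only; returns the list of removed paths.
    """
    # Collect (mtime, path) pairs so that sorting puts the oldest first.
    candidates = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                candidates.append((os.path.getmtime(path), path))
            except OSError:
                continue  # file vanished between listing and stat
    candidates.sort()

    removed = []
    for _mtime, path in candidates:
        # Re-check free space before each deletion and stop once satisfied.
        if get_free(directory) >= min_free_fraction:
            break
        try:
            os.remove(path)
        except OSError:
            continue
        removed.append(path)
    return removed
```

For example, `cleanup_by_free_space("/builds/signing", 0.15)` would try to keep at least 15% of the disk free; the threshold here is only an illustrative value. A real implementation would probably also keep a minimum age, so that files still needed by an in-progress release are not removed.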
Created attachment 8494480
bug1072274_puppet_v1.patch

Not sure if I'll need to restart the signing servers to pick up the change from the template once puppet lands it.
this is hitting the trees as well
(In reply to Carsten Book [:Tomcat] from comment #5)
> this is hitting the trees as well

And since we are getting more and more red results because of it, the integration trees have been closed.
Immediate issue fixed. Will raise a separate bug for long term solution (comment 3).