socorro1.db.sjc1.mozilla.com has run out of disk space a few times after falling behind the database master for continuous backup. Since both the archive logs and the database are on the same iSCSI share, this can leave the database unrecoverable and force a full resync from the master. This would be less likely with the archive logs on a separate partition, share, or quota'd directory.
After resolving the iSCSI reliability issues per the other bug, please do the following:
1) Add a separate 50GB partition, LUN, or other size-limited volume called "wal_archive".
2) Add a Nagios/Ganglia monitor on wal_archive that alerts whenever usage on that volume exceeds 30GB.
Note: the new wal_archive volume can be on local disk if space is available.
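For whoever picks up step 2, here is a rough sketch of what a Nagios-style check for the wal_archive volume could look like. It is not an existing plugin. The mount point /wal_archive, the 45 GiB critical threshold, and the function names are assumptions made for illustration; only the 30GB alert threshold comes from this bug. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
#!/usr/bin/env python3
"""Sketch of a Nagios-style check for space used on the wal_archive volume."""
import shutil
import sys

GIB = 1024 ** 3

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(used_bytes, warn_bytes=30 * GIB, crit_bytes=45 * GIB):
    """Map used bytes to a status code.

    warn_bytes is the 30GB threshold requested in this bug;
    crit_bytes (45 GiB) is an assumed value, not from the bug.
    """
    if used_bytes > crit_bytes:
        return CRITICAL
    if used_bytes > warn_bytes:
        return WARNING
    return OK


def check(path="/wal_archive"):
    """Print a one-line status for `path` and return the exit code.

    The default path is a hypothetical mount point for the new volume.
    """
    try:
        used = shutil.disk_usage(path).used
    except OSError as exc:
        print(f"WAL_ARCHIVE UNKNOWN - cannot stat {path}: {exc}")
        return UNKNOWN
    status = classify(used)
    print(f"WAL_ARCHIVE {LABELS[status]} - {used / GIB:.1f} GiB used on {path}")
    return status


if __name__ == "__main__":
    sys.exit(check(sys.argv[1] if len(sys.argv) > 1 else "/wal_archive"))
```

Measuring the volume rather than a directory only works if wal_archive really is its own partition or LUN, as requested in step 1; for a quota'd directory on a shared filesystem the check would need to sum file sizes instead.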
Assignee: server-ops → jdow
Whiteboard: pending stage site in PHX
replaydb has been permanently shut down. In PHX we'll have a different architecture, so we'll implement this as needed from the beginning.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard