Per bug 767559 comment 4, this bug tracks putting the monitoring in place. Filing since there was no response to bug 767559 comment 5. Pasting rbryce's comment here:

tl;dr mailman's archive process was blocked on i/o, causing further delays in mail handling.

The parent mailman qrunner process was hung up by 4 child archive processes that were apparently blocked on i/o. As subsequent qrunner threads were spawned to handle the mailman posts, they also processed mails that the parent qrunner thread had not yet removed from the queue. The main qrunner process was to blame for the delayed mail, since it could not remove messages from the queue in a timely manner; the children are to blame for the dupe emails.

This also caused *deferred* emails to queue longer than normal, further delaying some email. These consist almost entirely of bounce messages from spam. I restarted mailman and postfix, then manually purged the deferred mail queue (after manually verifying its contents). This cleared the hung qrunner thread and all seems to be back to normal.

The nagios check I have discussed with others will measure the age of mailman processes, to hopefully detect this problem going forward. Also, I am not sure what the disk usage was when this started, but I suspect insufficient disk space may have caused this problem or at least exacerbated the issue.
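For illustration only, a process-age check of the kind described above might look roughly like the sketch below. It is not the check that was deployed. The match string "ArchRunner" and the warning/critical thresholds are assumptions picked for the example, and the function names are hypothetical. It reads elapsed time from `ps -o etime=` and uses the standard Nagios plugin exit codes.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_etime(etime):
    """Convert the ps 'etime' format [[dd-]hh:]mm:ss into seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:  # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def oldest_age(ps_output, needle):
    """Return the largest elapsed time (seconds) among processes whose
    command line contains needle, or None if nothing matches."""
    ages = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)  # etime, then the full command line
        if len(fields) == 2 and needle in fields[1]:
            ages.append(parse_etime(fields[0]))
    return max(ages) if ages else None


def classify(age, warn, crit):
    """Map an age in seconds to a Nagios status code."""
    if age is None:
        return OK  # no matching process; liveness is a separate check
    if age >= crit:
        return CRITICAL
    if age >= warn:
        return WARNING
    return OK


if __name__ == "__main__":
    needle = "ArchRunner"   # assumption: string identifying the processes to watch
    warn, crit = 600, 1800  # assumption: thresholds in seconds
    try:
        out = subprocess.run(
            ["ps", "-e", "-o", "etime=", "-o", "args="],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        print("UNKNOWN - could not run ps: %s" % exc)
        sys.exit(UNKNOWN)
    age = oldest_age(out, needle)
    status = classify(age, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    print("%s - oldest matching process age: %s s" % (label, age))
    sys.exit(status)
```

Whether age is the right signal depends on which processes are matched: long-lived daemons will always look old, so a check like this only makes sense for processes that are expected to finish quickly.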
Assignee: server-ops-infra → server-ops
Component: Server Operations: Infrastructure → Server Operations
QA Contact: jdow → phong
These are already set up at https://ganglia-scl3.mozilla.org/ganglia/?c=zlb.ops.scl3
Assignee: server-ops → ashish
Argh, wrong bug. Please ignore #c1.
rbryce: where does this fit in your work queue?
This is implemented. We are checking the postfix queue as well as the mailman process.
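As a rough sketch of what a postfix queue check can look like (not the implementation referred to above), the following parses the summary line printed by `postqueue -p` and maps the message count to a Nagios status. The thresholds and function names are hypothetical, and the check counts all queued messages rather than only the deferred queue.

```python
import re
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_queued(postqueue_output):
    """Extract the number of queued messages from `postqueue -p` output.

    Returns 0 for an empty queue, or None if the output is not recognised.
    """
    text = postqueue_output.strip()
    if "Mail queue is empty" in text:
        return 0
    if not text:
        return None
    # The final line looks like: "-- 12 Kbytes in 3 Requests."
    match = re.search(r"in (\d+) Requests?\.", text.splitlines()[-1])
    return int(match.group(1)) if match else None


def classify(count, warn, crit):
    """Map a queue length to a Nagios status code."""
    if count is None:
        return UNKNOWN
    if count >= crit:
        return CRITICAL
    if count >= warn:
        return WARNING
    return OK


if __name__ == "__main__":
    warn, crit = 200, 1000  # assumption: thresholds chosen for illustration
    try:
        out = subprocess.run(
            ["postqueue", "-p"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        print("UNKNOWN - could not run postqueue: %s" % exc)
        sys.exit(UNKNOWN)
    count = count_queued(out)
    status = classify(count, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}[status]
    print("%s - %s messages in postfix queue" % (label, count))
    sys.exit(status)
```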
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard