Closed Bug 1063642 Opened 11 years ago Closed 11 years ago

Load on pulse-app1.dmz.phx1.mozilla.com is CRITICAL: CRITICAL - load average: 29.16, 23.89, 16.32

Categories

(mozilla.org Graveyard :: Server Operations: MOC, task)

Other
Other
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nagiosapi, Assigned: cliang)

References


Details

(Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/1214] [id=nagios1.private.phx1.mozilla.com:368575])

Automated alert report from nagios1.private.phx1.mozilla.com: Hostname: pulse-app1.dmz.phx1.mozilla.com Service: Load State: CRITICAL Output: CRITICAL - load average: 29.16, 23.89, 16.32 Runbook: http://m.allizom.org/Load
Automated alert recovery: Hostname: pulse-app1.dmz.phx1.mozilla.com Service: Load State: OK Output: OK - load average: 0.50, 13.72, 27.99
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
(Temporarily unresolving it so that the kanban will see it.)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [id=nagios1.private.phx1.mozilla.com:368575] → [kanban:https://kanbanize.com/ctrl_board/4/1214] [id=nagios1.private.phx1.mozilla.com:368575]
1) The pulse rabbit server had a lot of messages in the queue (42,993,853) and that number kept increasing, with several queues accumulating a LOT of messages and nothing consuming them. Among the ones deleted were:

   - changes
   - quickstart-XXXX
   - amq.gen-g01SmMHEbw8o6ESoCUmvtw (which had no consumers!)

I'm not 100% sure the large number of messages was causing the issue, but it certainly couldn't have helped. This brought up the question of why the rabbitmq monitoring had not triggered for this server. That question will be addressed in a separate bug.

2) It looked like we were accruing hg_shim processes, e.g.:

   root 31111 31106 0 11:34 ? 00:00:00 /bin/sh -c /usr/bin/python /data/www/pulse/pulseshims/hg_shim.py mozilla-central >> /data/workdir/hg-shim/mozilla-central-shim.log 2>&1
   root 31114 31111 9 11:34 ? 00:00:16 /usr/bin/python /data/www/pulse/pulseshims/hg_shim.py mozilla-central

mcote confirmed in IRC that these processes should not be running. I commented out the following lines in puppet and pushed out the change. Existing cron entries for these were deleted and all existing processes were killed.

   webapp::pulse::shim::hg {
       'comm-central': ;
       'mozilla-central': ;
       'users/jgriffin_mozilla.com/synkme': ;
   }

This seems to have started the system back onto the road to recovery.
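The queue check in (1) can be sketched as a filter over `rabbitmqctl list_queues name messages consumers` output: flag any queue that holds messages but has zero consumers. This is an illustrative sketch, not the exact commands run during the incident; the sample lines below use queue names from this bug, but the message/consumer counts for the second and third queues are made up for the example.

```shell
# Sample lines in the format produced by:
#   rabbitmqctl list_queues name messages consumers
# (queue name, message count, consumer count, whitespace-separated).
# The awk filter prints queues that have backed-up messages ($2 > 0)
# and no consumers attached ($3 == 0).
printf '%s\n' \
  'changes 42993853 0' \
  'quickstart-XXXX 120 1' \
  'amq.gen-g01SmMHEbw8o6ESoCUmvtw 500 0' |
awk '$2 > 0 && $3 == 0 { print $1 }'
```

Against the sample data this flags `changes` and `amq.gen-g01SmMHEbw8o6ESoCUmvtw`, matching the comment's observation that the amq.gen queue had no consumers. In a live check you would pipe the real `rabbitmqctl` output into the same filter instead of the `printf`.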
Assignee: nobody → cliang
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard