Bug 1063642 (Closed)
Opened 11 years ago
Closed 11 years ago
Load on pulse-app1.dmz.phx1.mozilla.com is CRITICAL: CRITICAL - load average: 29.16, 23.89, 16.32
Categories
(mozilla.org Graveyard :: Server Operations: MOC, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nagiosapi, Assigned: cliang)
Details
(Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/1214] [id=nagios1.private.phx1.mozilla.com:368575])
Automated alert report from nagios1.private.phx1.mozilla.com:
Hostname: pulse-app1.dmz.phx1.mozilla.com
Service: Load
State: CRITICAL
Output: CRITICAL - load average: 29.16, 23.89, 16.32
Runbook: http://m.allizom.org/Load
Reporter
Comment 1 • 11 years ago
Automated alert recovery:
Hostname: pulse-app1.dmz.phx1.mozilla.com
Service: Load
State: OK
Output: OK - load average: 0.50, 13.72, 27.99
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee
Comment 2 • 11 years ago
(Temporarily unresolving it so that the kanban will see it.)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [id=nagios1.private.phx1.mozilla.com:368575] → [kanban:https://kanbanize.com/ctrl_board/4/1214] [id=nagios1.private.phx1.mozilla.com:368575]
Assignee
Comment 3 • 11 years ago
1) The pulse RabbitMQ server had a large and growing backlog of messages (42,993,853), with several queues accumulating a LOT of messages and nothing consuming them. Among the queues deleted were:
- changes
- quickstart-XXXX
- amq.gen-g01SmMHEbw8o6ESoCUmvtw (which had no consumers!)
I'm not 100% sure the large number of messages was causing the issue, but it certainly couldn't have helped. This raised the question of why the RabbitMQ monitoring had not triggered for this server; that will be addressed in a separate bug.
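For reference, a backlog like this can be inspected and cleared with rabbitmqctl. A sketch of the kind of commands involved (the queue name below is one of those listed above; the default vhost "/" is an assumption, and `delete_queue` is only available on newer brokers, so older installs would use the management UI or API instead):

```shell
# Show the queues with the deepest backlogs, with consumer counts,
# so queues that have messages but no consumers stand out.
# (-p selects the vhost; "/" assumed here.)
rabbitmqctl list_queues -p / name messages consumers | sort -k2 -nr | head

# Delete a queue that is accumulating messages with no consumers.
# (delete_queue was added in RabbitMQ 3.7; hypothetical for this host.)
rabbitmqctl delete_queue changes
```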
2) It looked like we were accruing hg_shim processes, e.g.:
root 31111 31106 0 11:34 ? 00:00:00
/bin/sh -c /usr/bin/python /data/www/pulse/pulseshims/hg_shim.py mozilla-central >> /data/workdir/hg-shim/mozilla-central-shim.log 2>&1
root 31114 31111 9 11:34 ? 00:00:16
/usr/bin/python /data/www/pulse/pulseshims/hg_shim.py mozilla-central
mcote confirmed in IRC that these processes should not be running. I commented out the following lines in Puppet and pushed the change. Existing cron entries for the shims were deleted and all running shim processes were killed.
webapp::pulse::shim::hg {
'comm-central': ;
'mozilla-central': ;
'users/jgriffin_mozilla.com/synkme': ;
}
This seems to have started the system back onto the road to recovery.
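The manual cleanup described above (killing the shim processes and removing their cron entries) can be sketched roughly as follows. The cron.d path is an assumption; the actual file names depend on how the Puppet module lays out its cron resources:

```shell
# Kill both the /bin/sh wrapper and the python child for each shim;
# -f matches against the full command line, not just the process name.
pkill -f 'pulseshims/hg_shim.py' || true

# Remove whichever cron files invoke the shim.
# (/etc/cron.d is assumed; adjust to where Puppet writes its cron entries.)
grep -rl 'hg_shim.py' /etc/cron.d/ | xargs -r rm -v
```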
Assignee: nobody → cliang
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated • 10 years ago
Product: mozilla.org → mozilla.org Graveyard