Closed Bug 613252 Opened 14 years ago Closed 14 years ago

SUMO production is in trouble

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: All, Other
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jsocol, Assigned: phong)

Details

Seeing a Zeus-style "Service Unavailable" page at http://support.mozilla.com/. Also not seeing any metrics at http://nm-dash01.nms.mozilla.org/ganglia/?c=sumo&m=load_one&r=hour&s=descending&hc=4&mc=2, and the SUMO graphs on http://nm-dash01.nms.mozilla.org/ganglia/ cut off maybe 8 or 10 minutes ago.
It should be back now. I thought the site could handle taking the webheads down for a RAM upgrade, but it couldn't. We'll have to schedule a downtime next time around.
Assignee: server-ops → phong
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Phong - can you talk to mrz about taking stuff down next time?  He's supposed to announce stuff like that.  Thanks for trying either way.
We should have enough machines so you can take one host down without affecting the site.  This tells me we have N when we should have N+1.

Filed bug 613323 to track.
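The N versus N+1 point above can be made concrete with a small capacity check. This is only a sketch: the function names and the traffic figures are made up for illustration and are not SUMO's real numbers.

```python
import math


def hosts_needed(peak_load: float, per_host_capacity: float) -> int:
    """Return N, the minimum number of hosts that can carry peak load."""
    if per_host_capacity <= 0:
        raise ValueError("per_host_capacity must be positive")
    return math.ceil(peak_load / per_host_capacity)


def survives_one_host_down(total_hosts: int, peak_load: float,
                           per_host_capacity: float) -> bool:
    """True if the pool still covers peak load with one host removed (N+1)."""
    return total_hosts - 1 >= hosts_needed(peak_load, per_host_capacity)


if __name__ == "__main__":
    # Hypothetical figures: 900 req/s at peak, 300 req/s per webhead.
    print(hosts_needed(900, 300))               # N
    print(survives_one_host_down(3, 900, 300))  # pool sized at N
    print(survives_one_host_down(4, 900, 300))  # pool sized at N+1
```

With these assumed figures a pool of exactly N hosts has no headroom, so pulling one webhead for maintenance drops it below peak capacity; one extra host is what makes rolling maintenance safe.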
Matthew, would you please cc me on bug 613323? If possible, it would also be nice to move our RabbitMQ and celeryd instances off a web server so that traffic spikes won't cause a snowball effect with celery.
I opened that bug up.  I'll do whatever you guys think is best - I don't know the whole system arch as well as you or oremj/fox2mike.
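For illustration only, moving the broker off a webhead mostly comes down to pointing the Celery workers and the web app at a broker URL on a separate host. The hostname, user, password, and vhost below are hypothetical and not taken from the SUMO setup.

```python
# celeryconfig.py (sketch; every value here is a placeholder)

# Point Celery at RabbitMQ running on its own host rather than on a webhead,
# so a traffic spike on the web tier does not starve the broker.
# "mq1.example.internal" and the credentials are assumptions for this example.
broker_url = "amqp://sumo:change-me@mq1.example.internal:5672/sumo"

# Note: `broker_url` is the setting name in current Celery releases.
# Celery versions from around the time of this bug used separate
# BROKER_HOST / BROKER_PORT / BROKER_USER / BROKER_PASSWORD / BROKER_VHOST
# settings instead.
```

The celeryd workers would then run on that host or on another non-web host, reading the same setting.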
Product: mozilla.org → mozilla.org Graveyard