Closed Bug 859822 Opened 11 years ago Closed 11 years ago

[sumo][stage] celery issues

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Platform: All
OS: Other
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rrosario, Assigned: cturra)

Details

See:
https://errormill.mozilla.org/support/sumo-stage/

There are a bunch of AMQP related errors, for example:
https://errormill.mozilla.org/support/sumo-stage/group/15341/

[Errno 113] No route to host

This has happened at least 3 times this morning and the site stays broken until a deploy restarts everything.
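
For anyone reproducing this outside the app, here is a minimal sketch of a TCP reachability check against the broker. It assumes the AMQP broker listens on the default port 5672 and that the affected host is the one to probe; the function name check_broker is made up for illustration and is not part of celery or kitsune. On Linux, errno 113 is EHOSTUNREACH, which is what the "No route to host" message above corresponds to.

```python
import errno
import socket


def check_broker(host, port=5672, timeout=3.0):
    """Try a plain TCP connect to host:port and return (reachable, detail).

    Hypothetical helper for illustration only. Port 5672 is the AMQP
    default; the real broker host and port for this site are assumptions.
    """
    try:
        # create_connection resolves the name and attempts the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # EHOSTUNREACH is errno 113 on Linux: "No route to host".
        if exc.errno == errno.EHOSTUNREACH:
            return False, "no route to host"
        # Refused connections, timeouts and DNS failures land here.
        return False, str(exc)


if __name__ == "__main__":
    # Host name taken from this bug; whether it runs the broker is an assumption.
    print(check_broker("support-celery1.stage"))
```

If this reports "no route to host" from a web node, the problem is at the network or VM level rather than in the celery configuration, and a deploy will only help until the workers try to reconnect.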
Blocks: 859816
Every time I deploy, things work fine for a bit since everything restarts. But then the errors happen again and the site goes down.
i have spent a little time tracking this down and it turns out that support-celery1.stage is currently offline. the esx cluster is reporting the following error for it:

 support-celery1.stage
 Alert
 vSphere HA virtual machine failover failed
 vc1.private.phx1.mozilla.com
 4/9/2013 12:23:11 AM


i am working with our sre team to investigate and get it back online.
Assignee: server-ops-webops → cturra
last night one of the esx hosts failed, causing this host to go down. the other vms on that esx host were manually recovered, but this one was missed. the sre team is reviewing the monitoring of this node to prevent this from happening again.

as of now tho, the host is back online and functioning as expected.

*important note: since it's been offline for a bit, please be sure to do another chief deploy to get all your latest code onto this stage node.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
(ESX host failure was 859698, for internal cross-reference purposes)
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard