See: https://errormill.mozilla.org/support/sumo-stage/
There are a bunch of AMQP-related errors, for example: https://errormill.mozilla.org/support/sumo-stage/group/15341/
[Errno 113] No route to host
This has happened at least 3 times this morning, and the site stays broken until a deploy restarts everything.
Every time I deploy, things work fine for a bit since everything restarts. But then the errors happen again and the site goes down.
I have spent a little time tracking this down, and it turns out that support-celery1.stage is currently offline. The ESX cluster is reporting the following error for it:
support-celery1.stage | Alert | vSphere HA virtual machine failover failed | vc1.private.phx1.mozilla.com | 4/9/2013 12:23:11 AM
I am working with our SRE team to investigate and get it back online.
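For anyone hitting the same symptom, here is a minimal sketch of how the "[Errno 113] No route to host" condition could be confirmed from a web node. On Linux, errno 113 is EHOSTUNREACH. The helper name `probe_tcp` is hypothetical. Port 5672 is the default AMQP port, and treating support-celery1.stage as the host the AMQP client connects to is an assumption, not something confirmed in this bug.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connect to host:port and return (reachable, detail).

    detail is "connected" on success, otherwise the symbolic errno name
    (for example "EHOSTUNREACH" for errno 113 on Linux) or the exception
    type name when no usable errno is available (timeouts, DNS failures).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Map the numeric errno to its symbolic name when one is known.
        name = errno.errorcode.get(exc.errno) if exc.errno else None
        return False, name or type(exc).__name__
```

Usage, under the same assumptions: `probe_tcp("support-celery1.stage", 5672)` returns `(True, "connected")` if the port accepts connections, and `(False, <reason>)` otherwise.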
Assignee: server-ops-webops → cturra
Last night one of the ESX hosts blew up, causing this host to go down. Other VMs on that ESX host were manually recovered, but this one was missed. The SRE team is reviewing the monitoring of this node to keep this from happening again in the future. As of now, though, the host is back online and functioning as expected.
Important note: since it has been offline for a bit, please be sure to do another chief deploy to get all your latest code onto this stage node.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
(ESX host failure was 859698, for internal cross-reference purposes)
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations