[sumo][stage] celery issues

RESOLVED FIXED

Status

Infrastructure & Operations
WebOps: Other
--
major
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: rrosario, Assigned: cturra)

(Reporter)

Description

5 years ago
See:
https://errormill.mozilla.org/support/sumo-stage/

There are a bunch of AMQP-related errors, for example:
https://errormill.mozilla.org/support/sumo-stage/group/15341/

[Errno 113] No route to host

This has happened at least three times this morning, and each time the site stays broken until a deploy restarts everything.
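
For triage: on Linux, `[Errno 113]` is `EHOSTUNREACH`, which points at the broker host (or the route to it) being gone rather than the AMQP service refusing connections. A minimal sketch of a probe that separates those cases follows. The function names `classify_connect_error` and `probe` are hypothetical, and port 5672 is only the conventional AMQP default, not a value taken from the SUMO stage config.

```python
import errno
import socket


def classify_connect_error(exc):
    """Map an OSError raised by a TCP connect attempt to a coarse label."""
    # Check the exception types first: timeouts and DNS failures carry
    # no errno, or an unrelated one.
    if isinstance(exc, socket.timeout):
        return "timeout"
    if isinstance(exc, socket.gaierror):
        return "dns-failure"
    code = getattr(exc, "errno", None)
    if code == errno.EHOSTUNREACH:   # "No route to host" (113 on Linux)
        return "host-unreachable"
    if code == errno.ECONNREFUSED:   # host is up, nothing listening on the port
        return "service-down"
    return "other"


def probe(host, port=5672, timeout=3.0):
    """Try a plain TCP connect; return "ok" or a label for the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return classify_connect_error(exc)
```

A result of `host-unreachable` would suggest looking at the machine or network before restarting the workers, while `service-down` would suggest the broker process itself.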
(Reporter)

Updated

5 years ago
Blocks: 859816
(Reporter)

Comment 1

5 years ago
Every time I deploy, things work for a while because everything restarts, but then the errors recur and the site goes down again.
(Assignee)

Comment 2

5 years ago
i spent some time tracking this down, and it turns out that support-celery1.stage is currently offline. the ESX cluster is reporting the following alert for it:

 support-celery1.stage
 Alert
 vSphere HA virtual machine failover failed
 vc1.private.phx1.mozilla.com
 4/9/2013 12:23:11 AM


i am working with our SRE team to investigate and get it back online.
Assignee: server-ops-webops → cturra
(Assignee)

Comment 3

5 years ago
last night one of the ESX hosts failed, taking this VM down with it. the other VMs on that host were recovered manually, but this one was missed. the SRE team is reviewing the monitoring for this node to prevent this from happening again.

as of now, though, the host is back online and functioning as expected.

important note: since it has been offline for a while, please be sure to do another chief deploy to get all your latest code onto this stage node.
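
The gap described in this comment (one VM missed during manual recovery) is the kind of thing a sweep over an expected-host list can catch. The following is only a rough sketch, not the SRE team's actual monitoring: `tcp_reachable` and `find_unreachable` are hypothetical names, and probing port 22 is an assumption about what would be listening on these nodes.

```python
import socket


def tcp_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_unreachable(hosts, is_up=tcp_reachable):
    """Return the hosts, in input order, for which is_up(host) is false."""
    return [host for host in hosts if not is_up(host)]


# Example (the only hostname known from this bug; a real list would be longer):
#   find_unreachable(["support-celery1.stage"])
```

Passing the check as the `is_up` parameter keeps the sweep independent of how reachability is actually determined, so a ping or a hypervisor query could be swapped in.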
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED

Comment 4

5 years ago
(ESX host failure was bug 859698, for internal cross-reference purposes)
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations