Bug 1413553 (Closed) - Opened 7 years ago, Closed 5 years ago

Investigate why DB failovers require worker dyno restarts for tasks to resume

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

Details

A couple of times now (the most recent being bug 1413547), the DB has failed over successfully, but the celery queue backlogs for pulse job ingestion (amongst other things) have appeared to keep growing until the Heroku dynos were restarted.

If this is actually the case (vs just coincidental timing), then we should try to resolve this, so that unexpected DB failovers in the middle of the night don't require human intervention.

I imagine the best way to confirm the problem would be to:
1) Initiate a failover on stage (dev isn't multi-AZ, so wouldn't work)
2) Monitor the queue sizes on New Relic
3) Don't restart the dynos until a reasonable amount of time has passed (say 30 minutes), to see whether they self-recover once the celery task timeouts have expired
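For step 2, the queue sizes could also be checked directly against the broker rather than via New Relic. A minimal sketch, assuming a RabbitMQ-style broker reachable via a BROKER_URL environment variable; the queue names are hypothetical, not necessarily the ones Treeherder uses:

import os

from kombu import Connection

# Hypothetical queue names, for illustration only.
QUEUES = ["store_pulse_jobs", "log_parser"]

with Connection(os.environ["BROKER_URL"]) as conn:
    channel = conn.channel()
    for name in QUEUES:
        # A passive declare returns the current depth without creating
        # or modifying the queue.
        info = channel.queue_declare(queue=name, passive=True)
        print("{}: {} messages waiting".format(name, info.message_count))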

I'd imagine the outcome will be one of:
* the queues self-recover within a few minutes - i.e. the previous instances were just unlucky timing of the restart, so there's nothing more to do here.
* the queues self-recover after 10-20 minutes - i.e. the tasks have to hit the celery timeouts, which we could lower to improve the situation (see the sketch after this list).
* the queues never self-recover, in which case this might be a Heroku dyno DNS caching issue or similar.
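If the second outcome turns out to be the case, lowering the timeouts is a small config change. A minimal sketch, assuming Celery 4 style lowercase settings; the values are illustrative, not what Treeherder actually uses:

from celery import Celery

app = Celery("treeherder")

app.conf.update(
    # Raise SoftTimeLimitExceeded inside the task after 60s so it can
    # clean up, then hard-kill the worker child process at 90s.
    task_soft_time_limit=60,
    task_time_limit=90,
)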

We may also want to try updating to Celery 4 (bug 1337717), since this may be a bug they've already fixed.
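Another angle worth checking, sketched here under the assumption (not confirmed by this bug) that the workers are reusing Django DB connections that died during the failover: dropping stale connections before each task would force a reconnect instead of reusing a dead socket.

from celery.signals import task_prerun
from django.db import close_old_connections

@task_prerun.connect
def close_stale_db_connections(*args, **kwargs):
    # Drops connections that are unusable or past CONN_MAX_AGE; Django
    # opens a fresh one the next time the ORM is touched.
    close_old_connections()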
Assignee: nobody → emorley

I've rebooted prototype's RDS instance (confusingly called treeherder-dev) to try out the approach above.
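For reference, the reboot can also be triggered from boto3 rather than the AWS console. A minimal sketch, assuming a Multi-AZ instance (ForceFailover only applies there) and a hypothetical region:

import boto3

rds = boto3.client("rds", region_name="us-east-1")  # region is an assumption

# ForceFailover reboots onto the standby rather than in place.
rds.reboot_db_instance(
    DBInstanceIdentifier="treeherder-dev",
    ForceFailover=True,
)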

Status: NEW → ASSIGNED

The workers reconnected to the DB with no intervention required, so it looks like this has been fixed in the meantime (perhaps by the newer Celery/kombu from bug 1337717).

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED